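View on GitHub
You are welcome to go back and read Part 1 and Part 2.
Python3 Intermediate Player: Taobao/Tmall Product Search Crawler Automation Tool (Part 1)
Python3 Intermediate Player: Taobao/Tmall Product Search Crawler Automation Tool (Part 2)
This is the final part. After reading it you will see how much a crawler depends on using all kinds of tools flexibly.
Time for some devious thinking: let's start on the main function.
We are using Python 3. Read the code below first; after that I will show where to start debugging the HTTP traffic.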

    import json
    import os
    import time
    import urllib.parse
    import urllib.request

    from PIL import Image

    # begin, password, createjia, getHtml, listfiles, validateTitle, writeexcel
    # and timetochina are the helper functions from Parts 1 and 2.

    if __name__ == '__main__':
        begin()
        password()
        today = time.strftime('%Y%m%d', time.localtime())
        a = time.perf_counter()  # time.clock() was removed in Python 3.8
        keyword = input('Enter a keyword: ')
        sort = input('Sort order: 1 = sales, 2 = price low to high, 3 = price high to low, 4 = seller credit, 5 = comprehensive: ')
        try:
            pages = int(input('Number of pages to fetch (default 100): '))
            if pages > 100 or pages <= 0:
                print('The page count must be between 1 and 100')
                pages = 100
        except ValueError:
            pages = 100
        try:
            man = int(input('Pause between requests in seconds (default 4): '))
            if man <= 0:
                man = 4
        except ValueError:
            man = 4
        zp = input('Download images? 1 = yes, 2 = no: ')
        if sort == '1':
            sortss = '_sale'
        elif sort == '2':
            sortss = 'bid'
        elif sort == '3':
            sortss = '_bid'
        elif sort == '4':
            sortss = '_ratesum'
        elif sort == '5':
            sortss = ''
        else:
            sortss = '_sale'
        # The original format was '%Y%m%d%H%S' (hour + second); %M gives the
        # hour-and-minute stamp that the explanation further down describes.
        namess = time.strftime('%Y%m%d%H%M', time.localtime())
        root = '../data/' + today + '/' + namess + keyword
        roota = '../excel/' + today
        mulu = '../image/' + today + '/' + namess + keyword
        createjia(root)
        createjia(roota)
        for page in range(0, pages):
            time.sleep(man)
            print('Paused for ' + str(man) + ' seconds')
            if sortss == '':
                # Comprehensive ranking: no 'sort' parameter at all.
                postdata = {
                    'event_submit_do_new_search_auction': 1,
                    'search': '提交查询',  # button label, sent as-is
                    '_input_charset': 'utf-8',
                    'topSearch': 1, 'atype': 'b', 'searchfrom': 1,
                    'action': 'home:redirect_app_action',
                    'from': 1, 'q': keyword, 'sst': 1, 'n': 20,
                    'buying': 'buyitnow', 'm': 'api4h5',
                    'abtest': 16, 'wlsort': 16, 'style': 'list',
                    'closeModues': 'nav,selecthot,onesearch',
                    'page': page
                }
            else:
                postdata = {
                    'event_submit_do_new_search_auction': 1,
                    'search': '提交查询',  # button label, sent as-is
                    '_input_charset': 'utf-8',
                    'topSearch': 1, 'atype': 'b', 'searchfrom': 1,
                    'action': 'home:redirect_app_action',
                    'from': 1, 'q': keyword, 'sst': 1, 'n': 20,
                    'buying': 'buyitnow', 'm': 'api4h5',
                    'abtest': 16, 'wlsort': 16, 'style': 'list',
                    'closeModues': 'nav,selecthot,onesearch',
                    'sort': sortss,
                    'page': page
                }
            postdata = urllib.parse.urlencode(postdata)
            taobao = "http://s.m.taobao.com/search?" + postdata
            print(taobao)
            try:
                content1 = getHtml(taobao)
                # Save the raw response; 'with' closes the file for us.
                with open(root + '/' + str(page) + '.json', 'wb') as file:
                    file.write(content1)
            except Exception as e:
                if hasattr(e, 'code'):
                    print('Page not found or request took too long.')
                    print('Error code:', e.code)
                elif hasattr(e, 'reason'):
                    print('Cannot reach the host.')
                    print('Reason: ', e.reason)
                else:
                    print(e)

        # files = listfiles('201512171959', '.json')
        files = listfiles(root, '.json')
        total = []
        total.append(['Page', 'Shop', 'Title', 'Sale price', 'Ships from', 'Comments',
                      'Original price', 'Mobile discount', 'Units sold', 'Promotion type',
                      'Buyers', 'Coin discount', 'URL', 'Image URL', 'Image'])
        for filename in files:
            try:
                with open(filename, 'rb') as doc:
                    doccontent = doc.read().decode('utf-8', 'ignore')
                product = doccontent.replace(' ', '').replace('\n', '')
                product = json.loads(product)
                onefile = product['listItem']
            except Exception:
                print('Failed to parse ' + filename)
                continue
            for item in onefile:
                itemlist = [filename, item['nick'], item['title'], item['price'],
                            item['location'], item['commentCount']]
                itemlist.append(item['originalPrice'])
                itemlist.append(item['mobileDiscount'])
                itemlist.append(item['sold'])
                itemlist.append(item['zkType'])
                itemlist.append(item['act'])
                itemlist.append(item['coinLimit'])
                itemlist.append(item['auctionURL'])
                picpath = item['pic_path'].replace('60x60', '720x720')
                itemlist.append(picpath)
                # http://g.search2.alicdn.com/img/bao/uploaded/i4/i4/TB13O7bJVXXXXbJXpXXXXXXXXXX_%21%210-item_pic.jpg_180x180.jpg
                if zp == '1':
                    if os.path.exists(mulu):
                        pass
                    else:
                        createjia(mulu)
                    url = urllib.parse.quote(picpath).replace('%3A', ':')
                    urllib.request.urlcleanup()
                    try:
                        pic = urllib.request.urlopen(url)
                        picno = time.strftime('%H%M%S', time.localtime())
                        filenamep = mulu + '/' + picno + validateTitle(item['nick'] + '-' + item['title'])
                        filenamepp = filenamep + '.jpeg'
                        sfilename = filenamep + 's.jpeg'
                        filess = open(filenamepp, 'wb')
                        filess.write(pic.read())
                        filess.close()
                        img = Image.open(filenamepp)
                        w, h = img.size
                        size = w // 6, h // 6  # integer pixel sizes
                        img.thumbnail(size, Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10
                        img.save(sfilename, 'jpeg')
                        itemlist.append(sfilename)
                        print('Saved image: ' + sfilename)
                    except Exception as e:
                        if hasattr(e, 'code'):
                            print('Page not found or request took too long.')
                            print('Error code:', e.code)
                        elif hasattr(e, 'reason'):
                            print('Cannot reach the host.')
                            print('Reason: ', e.reason)
                        else:
                            print(e)
                        itemlist.append('')
                else:
                    itemlist.append('')
                # print(itemlist)
                total.append(itemlist)
        if len(total) > 1:
            writeexcel(roota + '/' + namess + keyword + 'TaobaoMobileProducts.xlsx', total)
        else:
            print('Nothing was fetched')
        b = time.perf_counter()
        print('Run time: ' + timetochina(b - a))
        input('You can close the window now')

OK, first open Firefox and go to
http://s.m.taobao.com/
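Press Shift+Ctrl+M to switch to mobile view, then turn on touch-event simulation.
It seems you can no longer search for items from a PC these days. No problem: press F12.
In the developer tools you can see the request parameters (the script sends them in the URL query string),
and you can see the JSON data that comes back.
Let's start dissecting from the middle of the code.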

        else:
            postdata = {
                'event_submit_do_new_search_auction': 1,
                'search': '提交查询',  # button label, sent as-is
                '_input_charset': 'utf-8',
                'topSearch': 1, 'atype': 'b', 'searchfrom': 1,
                'action': 'home:redirect_app_action',
                'from': 1, 'q': keyword, 'sst': 1, 'n': 20,
                'buying': 'buyitnow', 'm': 'api4h5',
                'abtest': 16, 'wlsort': 16, 'style': 'list',
                'closeModues': 'nav,selecthot,onesearch',
                'sort': sortss,
                'page': page
            }
        postdata = urllib.parse.urlencode(postdata)
        taobao = "http://s.m.taobao.com/search?" + postdata
        print(taobao)

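keyword is the search keyword, sort is the sort order, and page is the page number. By default nothing comes back beyond page 100, so keep an eye on that.
Filtering by price and by shipping location is also supported; capture the traffic and debug those yourselves.
urlencode is needed because the keyword may contain Chinese or otherwise illegal characters, which have to be escaped first.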
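To make the escaping concrete, here is a minimal sketch using a shortened, hypothetical parameter set (the real request carries many more fields). It only shows what urlencode does to non-ASCII text and spaces, and that the values survive a round trip.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical, shortened parameter set for illustration only.
params = {'q': '婚紗 禮服', 'sort': '_sale', 'page': 0}

# urlencode percent-encodes the UTF-8 bytes and turns spaces into '+'.
query = urlencode(params)
print(query)

# Decoding the query string gives the original values back (as strings).
decoded = parse_qs(query)
print(decoded['q'][0])
```

Every value comes back as a string after decoding, which is why page is '0' rather than 0.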
Printing it shows the full search URL (an example appears further down).
OK!! Now let's dissect main from the very beginning.
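
    if __name__ == '__main__':
        begin()
        password()
        today = time.strftime('%Y%m%d', time.localtime())
        a = time.perf_counter()  # time.clock() was removed in Python 3.8
        # ... the crawl itself runs here ...
        b = time.perf_counter()
        print('Run time: ' + timetochina(b - a))
        input('You can close the window now')
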
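begin prints the welcome message.
password verifies the user.
The timer is started before the crawl (time.clock() in the original script; it was removed in Python 3.8, where time.perf_counter() is the replacement) so that we can see how long the program ran.
today is today's date in year-month-day format; it is needed when creating folders.
The final input is there so the window does not close the moment the program finishes, before you have had time to read the run time.
For these functions, see the previous part.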
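The timing pattern can be sketched on its own like this. format_duration is a hypothetical stand-in for timetochina from the earlier part (its real output format may differ), and the short sleep stands in for the crawl.

```python
import time


def format_duration(seconds):
    """Format seconds as 'Hh Mm Ss' (hypothetical stand-in for timetochina)."""
    seconds = int(seconds)
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f'{hours}h {minutes}m {secs}s'


start = time.perf_counter()
time.sleep(0.01)  # the work being timed
elapsed = time.perf_counter() - start
print('Run time:', format_duration(elapsed))
```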
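Next come the search options:

    keyword = input('Enter a keyword: ')
    sort = input('Sort order: 1 = sales, 2 = price low to high, 3 = price high to low, 4 = seller credit, 5 = comprehensive: ')
    try:
        pages = int(input('Number of pages to fetch (default 100): '))
        if pages > 100 or pages <= 0:
            print('The page count must be between 1 and 100')
            pages = 100
    except ValueError:
        pages = 100
    try:
        man = int(input('Pause between requests in seconds (default 4): '))
        if man <= 0:
            man = 4
    except ValueError:
        man = 4
    zp = input('Download images? 1 = yes, 2 = no: ')
    if sort == '1':
        sortss = '_sale'
    elif sort == '2':
        sortss = 'bid'
    elif sort == '3':
        sortss = '_bid'
    elif sort == '4':
        sortss = '_ratesum'
    elif sort == '5':
        sortss = ''
    else:
        sortss = '_sale'

Line by line:

    try:
        pages = int(input('Number of pages to fetch (default 100): '))
        if pages > 100 or pages <= 0:
            print('The page count must be between 1 and 100')
            pages = 100
    except ValueError:
        pages = 100

If what you type for the page count is not a number, int() raises an exception and the count defaults to 100 pages.
If it is a number but greater than 100, zero or negative, it also defaults to 100 pages. Don't like it? Come fight me.

    try:
        man = int(input('Pause between requests in seconds (default 4): '))
        if man <= 0:
            man = 4
    except ValueError:
        man = 4

In the same way, the crawl needs a pause between requests. You can't crawl too fast or the anti-crawler measures will catch you! Four seconds is the default.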
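Both prompts follow the same "parse, range-check, fall back to a default" pattern, so they could share one helper. The sketch below is an illustration, not the author's code; parse_int and ask_int are hypothetical names.

```python
def parse_int(text, default, low=1, high=None):
    """Return text as an int within [low, high], or default if it is invalid."""
    try:
        value = int(text)
    except ValueError:
        return default
    if value < low or (high is not None and value > high):
        return default
    return value


def ask_int(prompt, default, low=1, high=None):
    """Prompt the user and apply the same fallback rules."""
    return parse_int(input(prompt), default, low, high)


print(parse_int('50', 100, 1, 100))   # a valid page count is kept
print(parse_int('abc', 100, 1, 100))  # not a number: falls back to the default
```

With this, the two prompts would become ask_int('Number of pages...', 100, 1, 100) and ask_int('Pause...', 4).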
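
    keyword = input('Enter a keyword: ')
    sort = input('Sort order: 1 = sales, 2 = price low to high, 3 = price high to low, 4 = seller credit, 5 = comprehensive: ')
    zp = input('Download images? 1 = yes, 2 = no: ')
    if sort == '1':
        sortss = '_sale'
    elif sort == '2':
        sortss = 'bid'
    elif sort == '3':
        sortss = '_bid'
    elif sort == '4':
        sortss = '_ratesum'
    elif sort == '5':
        sortss = ''
    else:
        sortss = '_sale'

The key point here is sort. These values were worked out while debugging: for the comprehensive ranking sortss is '', and any unrecognized input falls back to sorting by sales.
Download images or not? To grab or not to grab: your call!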
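The same mapping can be written as a lookup table, which keeps the menu choices and the default in one place. This is only an alternative sketch; SORT_CODES and sort_code are hypothetical names, while the codes themselves are the ones from the script.

```python
# Menu choice -> value of the 'sort' request parameter ('' means "omit it").
SORT_CODES = {'1': '_sale', '2': 'bid', '3': '_bid', '4': '_ratesum', '5': ''}


def sort_code(choice):
    """Return the sort code for a menu choice, defaulting to sales order."""
    return SORT_CODES.get(choice, '_sale')


print(repr(sort_code('2')), repr(sort_code('5')), repr(sort_code('anything else')))
```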
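Below is where the fetched data is stored:

    # The original format was '%Y%m%d%H%S' (hour + second); %M gives the
    # hour-and-minute stamp described below.
    namess = time.strftime('%Y%m%d%H%M', time.localtime())
    root = '../data/' + today + '/' + namess + keyword
    roota = '../excel/' + today
    mulu = '../image/' + today + '/' + namess + keyword
    createjia(root)
    createjia(roota)

Reading the code above against the notes below: today is today's date, and namess+keyword records at which hour and minute of the day which keyword was crawled.
root is where the raw data goes.
roota is where the Excel file goes.
mulu is where the images go.
createjia creates the folders; writing into a folder that does not exist would raise an error!!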
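createjia itself is defined in the earlier part. As a rough idea of what such a helper has to do, here is a minimal sketch built on os.makedirs; ensure_dir is a hypothetical name and the temporary base folder is only there to keep the example self-contained.

```python
import os
import tempfile


def ensure_dir(path):
    """Create path and any missing parent folders; do nothing if it exists."""
    os.makedirs(path, exist_ok=True)
    return path


base = tempfile.mkdtemp()  # throwaway folder for the demonstration
target = os.path.join(base, 'data', '20160105', '201601051545keyword')
ensure_dir(target)
ensure_dir(target)  # calling it again is harmless
print(os.path.isdir(target))
```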
Here comes the final boss, so watch closely:

    for page in range(0, pages):
        time.sleep(man)
        print('Paused for ' + str(man) + ' seconds')
        if sortss == '':
            postdata = {
                'event_submit_do_new_search_auction': 1,
                'search': '提交查询',  # button label, sent as-is
                '_input_charset': 'utf-8',
                'topSearch': 1, 'atype': 'b', 'searchfrom': 1,
                'action': 'home:redirect_app_action',
                'from': 1, 'q': keyword, 'sst': 1, 'n': 20,
                'buying': 'buyitnow', 'm': 'api4h5',
                'abtest': 16, 'wlsort': 16, 'style': 'list',
                'closeModues': 'nav,selecthot,onesearch',
                'page': page
            }
        else:
            postdata = {
                'event_submit_do_new_search_auction': 1,
                'search': '提交查询',  # button label, sent as-is
                '_input_charset': 'utf-8',
                'topSearch': 1, 'atype': 'b', 'searchfrom': 1,
                'action': 'home:redirect_app_action',
                'from': 1, 'q': keyword, 'sst': 1, 'n': 20,
                'buying': 'buyitnow', 'm': 'api4h5',
                'abtest': 16, 'wlsort': 16, 'style': 'list',
                'closeModues': 'nav,selecthot,onesearch',
                'sort': sortss,
                'page': page
            }
        postdata = urllib.parse.urlencode(postdata)
        taobao = "http://s.m.taobao.com/search?" + postdata
        print(taobao)
        try:
            content1 = getHtml(taobao)
            with open(root + '/' + str(page) + '.json', 'wb') as file:
                file.write(content1)
        except Exception as e:
            if hasattr(e, 'code'):
                print('Page not found or request took too long.')
                print('Error code:', e.code)
            elif hasattr(e, 'reason'):
                print('Cannot reach the host.')
                print('Reason: ', e.reason)
            else:
                print(e)

Sleep for a while first, then fetch. The loop runs from 0 up to pages - 1, where pages is the page count; the loop variable page is used when building the parameters.

    for page in range(0, pages):
        time.sleep(man)
        print('Paused for ' + str(man) + ' seconds')
        if sortss == '':
            postdata = {
                'event_submit_do_new_search_auction': 1,
                'search': '提交查询',  # button label, sent as-is
                '_input_charset': 'utf-8',
                'topSearch': 1, 'atype': 'b', 'searchfrom': 1,
                'action': 'home:redirect_app_action',
                'from': 1, 'q': keyword, 'sst': 1, 'n': 20,
                'buying': 'buyitnow', 'm': 'api4h5',
                'abtest': 16, 'wlsort': 16, 'style': 'list',
                'closeModues': 'nav,selecthot,onesearch',
                'page': page
            }

The comprehensive ranking differs from the other sort orders: it has no sort parameter at all, so an if and an else are used to tell the two cases apart.

        else:
            postdata = {
                'event_submit_do_new_search_auction': 1,
                'search': '提交查询',  # button label, sent as-is
                '_input_charset': 'utf-8',
                'topSearch': 1, 'atype': 'b', 'searchfrom': 1,
                'action': 'home:redirect_app_action',
                'from': 1, 'q': keyword, 'sst': 1, 'n': 20,
                'buying': 'buyitnow', 'm': 'api4h5',
                'abtest': 16, 'wlsort': 16, 'style': 'list',
                'closeModues': 'nav,selecthot,onesearch',
                'sort': sortss,
                'page': page
            }
        postdata = urllib.parse.urlencode(postdata)
        taobao = "http://s.m.taobao.com/search?" + postdata

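Since the two dictionaries differ only in the sort key, one way to avoid the duplication is to build the common part once and add sort only when it is non-empty. This is a sketch of that alternative, not the script's actual structure; build_search_url is a hypothetical name, and the parameters are the ones listed above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(keyword, page, sort_code=''):
    """Build the search URL; 'sort' is only added for a non-empty sort code."""
    params = {
        'event_submit_do_new_search_auction': 1,
        'search': '提交查询',
        '_input_charset': 'utf-8',
        'topSearch': 1, 'atype': 'b', 'searchfrom': 1,
        'action': 'home:redirect_app_action',
        'from': 1, 'q': keyword, 'sst': 1, 'n': 20,
        'buying': 'buyitnow', 'm': 'api4h5',
        'abtest': 16, 'wlsort': 16, 'style': 'list',
        'closeModues': 'nav,selecthot,onesearch',
        'page': page,
    }
    if sort_code:
        params['sort'] = sort_code
    return 'http://s.m.taobao.com/search?' + urlencode(params)


url = build_search_url('婚紗禮服', 0, 'bid')
print(url)
print(parse_qs(urlsplit(url).query)['sort'])
```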
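Now fetch this link and store what comes back under data, saved as JSON.

    try:
        content1 = getHtml(taobao)
        with open(root + '/' + str(page) + '.json', 'wb') as file:
            file.write(content1)
    except Exception as e:
        if hasattr(e, 'code'):
            print('Page not found or request took too long.')
            print('Error code:', e.code)
        elif hasattr(e, 'reason'):
            print('Cannot reach the host.')
            print('Reason: ', e.reason)
        else:
            print(e)

If an error occurs, check whether it has a code attribute. If it does, the server was reached but answered with something like 404 or 403.
If it only has a reason, the problem is your network: the server could not be reached.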
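In urllib these two cases correspond to urllib.error.HTTPError (which carries code) and its parent class urllib.error.URLError (which carries reason). Because HTTPError also has a reason, the code check has to come first, as it does in the script. A minimal sketch with explicit exception types follows; describe_error is a hypothetical name and the sample exceptions are constructed by hand rather than coming from a real request.

```python
import io
import urllib.error


def describe_error(exc):
    """Turn a fetch error into a short message, most specific type first."""
    if isinstance(exc, urllib.error.HTTPError):
        return 'HTTP error %d' % exc.code           # server answered, e.g. 404/403
    if isinstance(exc, urllib.error.URLError):
        return 'Cannot reach the host: %s' % exc.reason
    return str(exc)


# Hand-made sample errors, for illustration only.
not_found = urllib.error.HTTPError('http://example.com', 404, 'Not Found', None, io.BytesIO())
offline = urllib.error.URLError('timed out')
print(describe_error(not_found))
print(describe_error(offline))
```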
The saved data is too long to show here; it is the response to a request URL like this one:
http://s.m.taobao.com/search?q=1&abtest=16&search=%E6%8F%90%E4%BA%A4%E6%9F%A5%E8%AF%A2&topSearch=1&style=list&sst=1&atype=b&n=20&page=0&closeModues=nav%2Cselecthot%2Conesearch&_input_charset=utf-8&sort=bid&buying=buyitnow&searchfrom=1&from=1&m=api4h5&event_submit_do_new_search_auction=1&action=home%3Aredirect_app_action&wlsort=16
Later this data has to be parsed, split up and inserted into Excel.

    files = listfiles(root, '.json')
    total = []
    total.append(['Page', 'Shop', 'Title', 'Sale price', 'Ships from', 'Comments',
                  'Original price', 'Mobile discount', 'Units sold', 'Promotion type',
                  'Buyers', 'Coin discount', 'URL', 'Image URL', 'Image'])
    for filename in files:
        try:
            with open(filename, 'rb') as doc:
                doccontent = doc.read().decode('utf-8', 'ignore')
            product = doccontent.replace(' ', '').replace('\n', '')
            product = json.loads(product)
            onefile = product['listItem']
        except Exception:
            print('Failed to parse ' + filename)
            continue
        for item in onefile:
            itemlist = [filename, item['nick'], item['title'], item['price'],
                        item['location'], item['commentCount']]
            itemlist.append(item['originalPrice'])
            itemlist.append(item['mobileDiscount'])
            itemlist.append(item['sold'])
            itemlist.append(item['zkType'])
            itemlist.append(item['act'])
            itemlist.append(item['coinLimit'])
            itemlist.append(item['auctionURL'])
            picpath = item['pic_path'].replace('60x60', '720x720')
            itemlist.append(picpath)
            # http://g.search2.alicdn.com/img/bao/uploaded/i4/i4/TB13O7bJVXXXXbJXpXXXXXXXXXX_%21%210-item_pic.jpg_180x180.jpg
            if zp == '1':
                if os.path.exists(mulu):
                    pass
                else:
                    createjia(mulu)
                url = urllib.parse.quote(picpath).replace('%3A', ':')
                urllib.request.urlcleanup()
                try:
                    pic = urllib.request.urlopen(url)
                    picno = time.strftime('%H%M%S', time.localtime())
                    filenamep = mulu + '/' + picno + validateTitle(item['nick'] + '-' + item['title'])
                    filenamepp = filenamep + '.jpeg'
                    sfilename = filenamep + 's.jpeg'
                    filess = open(filenamepp, 'wb')
                    filess.write(pic.read())
                    filess.close()
                    img = Image.open(filenamepp)
                    w, h = img.size
                    size = w // 6, h // 6  # integer pixel sizes
                    img.thumbnail(size, Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10
                    img.save(sfilename, 'jpeg')
                    itemlist.append(sfilename)
                    print('Saved image: ' + sfilename)
                except Exception as e:
                    if hasattr(e, 'code'):
                        print('Page not found or request took too long.')
                        print('Error code:', e.code)
                    elif hasattr(e, 'reason'):
                        print('Cannot reach the host.')
                        print('Reason: ', e.reason)
                    else:
                        print(e)
                    itemlist.append('')
            else:
                itemlist.append('')
            # print(itemlist)
            total.append(itemlist)
    if len(total) > 1:
        writeexcel(roota + '/' + namess + keyword + 'TaobaoMobileProducts.xlsx', total)
    else:
        print('Nothing was fetched')

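Line by line.

    files = listfiles(root, '.json')
    total = []
    total.append(['Page', 'Shop', 'Title', 'Sale price', 'Ships from', 'Comments',
                  'Original price', 'Mobile discount', 'Units sold', 'Promotion type',
                  'Buyers', 'Coin discount', 'URL', 'Image URL', 'Image'])

root is the raw-data folder; listfiles finds every file in it whose extension is .json.
total holds the data for the spreadsheet, waiting to be turned into an Excel file.
The first row is of course the header: page, shop name, product title and so on...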
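listfiles comes from the earlier part. A helper of that kind could look roughly like the sketch below; list_files_with_suffix is a hypothetical name, and the temporary folder with three files only exists to make the example runnable.

```python
import os
import tempfile


def list_files_with_suffix(folder, suffix):
    """Return the full paths of the files in folder whose names end with suffix."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.endswith(suffix)
    )


folder = tempfile.mkdtemp()
for name in ('0.json', '1.json', 'notes.txt'):
    with open(os.path.join(folder, name), 'w') as handle:
        handle.write('{}')

found = list_files_with_suffix(folder, '.json')
print([os.path.basename(path) for path in found])
```

Note that sorted() orders names as text, so '10.json' would come before '2.json'; that does not matter here because the page file name is written into each row anyway.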

    for filename in files:
        try:
            with open(filename, 'rb') as doc:
                doccontent = doc.read().decode('utf-8', 'ignore')
            product = doccontent.replace(' ', '').replace('\n', '')
            product = json.loads(product)
            onefile = product['listItem']
        except Exception:
            print('Failed to parse ' + filename)
            continue

Now loop over the raw data files, opening each one with open() in binary mode. Why? Because otherwise the encoding of some of the data is a mess. So read it as bytes, then decode it as UTF-8 with the ignore argument, which skips over any bytes that fail to decode.

    with open(filename, 'rb') as doc:
        doccontent = doc.read().decode('utf-8', 'ignore')

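Here is a tiny demonstration of what the ignore argument buys you, using a hand-made byte string (valid UTF-8 followed by one invalid byte) rather than real crawl data.

```python
# Valid UTF-8 text followed by a byte that can never appear in UTF-8.
raw = '浙江 紹興'.encode('utf-8') + b'\xff'

try:
    raw.decode('utf-8')          # strict decoding (the default) fails
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

text = raw.decode('utf-8', 'ignore')  # the bad byte is silently dropped
print(strict_ok, text)
```

The trade-off is that damaged characters disappear without any warning, which is acceptable for a crawl like this but worth knowing about.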
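Then strip out spaces and newlines to tidy the text before parsing. (json.loads tolerates whitespace between tokens, so this is not strictly required, and it also removes spaces inside values such as titles.)

    product = doccontent.replace(' ', '').replace('\n', '')
    product = json.loads(product)
    onefile = product['listItem']

json.loads parses the text, after which the JSON data can be used like ordinary Python objects. 'listItem' holds the data we need; that is visible in the structure of the JSON response.
OK! So much code, so complicated....

    for item in onefile:
        itemlist = [filename, item['nick'], item['title'], item['price'],
                    item['location'], item['commentCount']]
        itemlist.append(item['originalPrice'])
        itemlist.append(item['mobileDiscount'])
        itemlist.append(item['sold'])
        itemlist.append(item['zkType'])
        itemlist.append(item['act'])
        itemlist.append(item['coinLimit'])
        itemlist.append(item['auctionURL'])
        picpath = item['pic_path'].replace('60x60', '720x720')
        itemlist.append(picpath)

Loop over every product and assemble its fields into the itemlist list. The JSON contains many more fields that are not used here.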
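As a compact illustration of the same idea, the sketch below turns a JSON string with a listItem array into rows. The sample text is a made-up, heavily shortened response, and rows_from_json and FIELDS are hypothetical names; unlike the script it uses .get(), so a missing field becomes an empty cell instead of raising KeyError.

```python
import json

# Made-up, heavily shortened response for illustration only.
sample = '{"listItem": [{"nick": "wangjingli327", "title": "t", "price": "298.00", "sold": "2"}]}'

FIELDS = ['nick', 'title', 'price', 'location', 'sold']


def rows_from_json(text, fields=FIELDS):
    """Return one row (list of field values) per entry in 'listItem'."""
    data = json.loads(text)
    return [[item.get(name, '') for name in fields]
            for item in data.get('listItem', [])]


rows = rows_from_json(sample)
print(rows)
```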
Each product entry in the JSON looks like this:
{ "pos": 0, "sold": "2", "userType": "0", "item_id": "521020560570", "nick": "wangjingli327", "userId": "85356923", "quantity": "", "shipping": "12.00", "ratesum": "", "isCod": "", "isprepay": "", "promotedService": "", "auctionTag": "", "clickUrl": "", "dsrScore": "", "zkType": "", "zkGroup": "", "autoPost": "", "commentCount": "2", "ordinaryPostFee": "", "distance": "", "zkRate": "", "zkTime": "", "promotions": "", "isInLimitPromotion": "", "pre_title_color": "", "pre_title": "", "h5Url": "", "isO2o": "", "recommendReason": "", "recommendColor": "", "recommendType": "", "location": "浙江 紹興", "price": "298.00", "priceColor": "#000000", "long_title": "", "isP4p": "false", "sellerLoc": "浙江 紹興", "fastPostFee": "12.00", "title": "性感透視奢華蕾絲深v領長袖公主新娘婚紗禮服2015冬季新款2518", "sameCount": "", "spuId": "", "similarCount": "", "priceWithRate": "", "pic_path": "http://g.search.alicdn.com/img/bao/uploaded/i4/i3/TB1c1E8IFXXXXbsXFXXXXXXXXXX_!!0-item_pic.jpg_60x60.jpg", "uprightImg": "", "priceWap": "298.00", "auctionFlag": "", "mobileDiscount": "", "coinInfo": "", "auctionURL": "http://a.m.taobao.com/i521020560570.htm?&abtest=16&sid=6334516654ffcd5000185f5604cc1d25", "type": "fixed", "isB2c": "0", "iconList": "xfbug", "uniqpid": "", "maxShopGift": "", "goodRate": "", "category": "162701", "newDsr": "", "desScore": "", "o2oShopId": "", "scoref": "", "sellerCount": "", "tagInfo": "", "area": "紹興", "realSales": "", "banditScore": "", "totalSold": "", "name": "性感透視奢華蕾絲深v領長袖公主新娘婚紗禮服2015冬季新款2518", "img2": "//gw1.alicdn.com/bao/uploaded/i3/TB1c1E8IFXXXXbsXFXXXXXXXXXX_!!0-item_pic.jpg", "iswebp": "", "url": "//a.m.taobao.com/i521020560570.htm?sid=6334516654ffcd5000185f5604cc1d25&rn=b360d891c5ca145a452cf547fa10ef24&abtest=16", "previewUrl": "//a.m.taobao.com/ajax/pre_view.do?sid=6334516654ffcd5000185f5604cc1d25&itemId=521020560570&abtest=16", "favoriteUrl": "//fav.m.taobao.com/favorite/to_collection.htm?sid=6334516654ffcd5000185f5604cc1d25&itemNumId=521020560570&abtest=16", "originalPrice": 
"298.00", "freight": "12.00", "act": "2", "itemNumId": "521020560570", "wwimUrl": "//im.m.taobao.com/ww/ad_ww_dialog.htm?item_num_id=521020560570&to_user=d2FuZ2ppbmdsaTMyNw%3D%3D", "isMobileEcard": "false", "auctionType": "b", "coinLimit": "100", "collect": "", "assess": "", "recommendGuy": "", "pricePerUnit": "", "collocation": "", "daySold": "", "inStock": "", "extendPid": "", "from": "", "recommendLabel": "" }
And of course the product image URL
http://g.search2.alicdn.com/img/bao/uploaded/i4/i4/TB13O7bJVXXXXbJXpXXXXXXXXXX_%21%210-item_pic.jpg_60x60.jpg
can be changed to
http://g.search2.alicdn.com/img/bao/uploaded/i4/i4/TB13O7bJVXXXXbJXpXXXXXXXXXX_%21%210-item_pic.jpg_180x180.jpg
http://g.search2.alicdn.com/img/bao/uploaded/i4/i4/TB13O7bJVXXXXbJXpXXXXXXXXXX_%21%210-item_pic.jpg_720x720.jpg
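The script does this with a plain str.replace('60x60', '720x720'). A slightly more defensive sketch only touches the size suffix at the very end of the URL; resize_image_url is a hypothetical name, and whether the image server answers for sizes other than the ones shown above is an assumption you would have to check.

```python
import re


def resize_image_url(url, size):
    """Swap the trailing _WxH.jpg size suffix for _{size}x{size}.jpg."""
    return re.sub(r'_\d+x\d+\.jpg$', '_%dx%d.jpg' % (size, size), url)


small = ('http://g.search2.alicdn.com/img/bao/uploaded/i4/i4/'
         'TB13O7bJVXXXXbJXpXXXXXXXXXX_%21%210-item_pic.jpg_60x60.jpg')
print(resize_image_url(small, 720))
```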
Ha, done!! The data is all parsed.
Now let's download the images:

    if zp == '1':
        if os.path.exists(mulu):
            pass
        else:
            createjia(mulu)
        url = urllib.parse.quote(picpath).replace('%3A', ':')
        urllib.request.urlcleanup()
        try:
            pic = urllib.request.urlopen(url)
            picno = time.strftime('%H%M%S', time.localtime())
            filenamep = mulu + '/' + picno + validateTitle(item['nick'] + '-' + item['title'])
            filenamepp = filenamep + '.jpeg'
            sfilename = filenamep + 's.jpeg'
            filess = open(filenamepp, 'wb')
            filess.write(pic.read())
            filess.close()
            img = Image.open(filenamepp)
            w, h = img.size
            size = w // 6, h // 6  # integer pixel sizes
            img.thumbnail(size, Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10
            img.save(sfilename, 'jpeg')
            itemlist.append(sfilename)
            print('Saved image: ' + sfilename)
        except Exception as e:
            if hasattr(e, 'code'):
                print('Page not found or request took too long.')
                print('Error code:', e.code)
            elif hasattr(e, 'reason'):
                print('Cannot reach the host.')
                print('Reason: ', e.reason)
            else:
                print(e)
            itemlist.append('')
    else:
        itemlist.append('')

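If images are wanted, the download code runs; otherwise append('') records that there is no image.

    if zp == '1':
        pass  # download the image
    else:
        itemlist.append('')
    # print(itemlist)
    total.append(itemlist)

Once the image is handled, total collects the product rows. Each itemlist is a single row; stacked together they form a matrix laid out exactly like the spreadsheet, ready to be written to Excel.
The image download code: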

    if zp == '1':
        if os.path.exists(mulu):
            pass
        else:
            createjia(mulu)
        url = urllib.parse.quote(picpath).replace('%3A', ':')
        urllib.request.urlcleanup()
        try:
            pic = urllib.request.urlopen(url)
            picno = time.strftime('%H%M%S', time.localtime())
            filenamep = mulu + '/' + picno + validateTitle(item['nick'] + '-' + item['title'])
            filenamepp = filenamep + '.jpeg'
            sfilename = filenamep + 's.jpeg'
            filess = open(filenamepp, 'wb')
            filess.write(pic.read())
            filess.close()
            img = Image.open(filenamepp)
            w, h = img.size
            size = w // 6, h // 6  # integer pixel sizes
            img.thumbnail(size, Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10
            img.save(sfilename, 'jpeg')
            itemlist.append(sfilename)
            print('Saved image: ' + sfilename)
        except Exception as e:
            if hasattr(e, 'code'):
                print('Page not found or request took too long.')
                print('Error code:', e.code)
            elif hasattr(e, 'reason'):
                print('Cannot reach the host.')
                print('Reason: ', e.reason)
            else:
                print(e)
            itemlist.append('')

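First create the folder that will hold the images:

    if os.path.exists(mulu):
        pass
    else:
        createjia(mulu)

Next the image URL has to be turned into a properly escaped URL. quote also escapes the : into %3A, so that needs to be changed back.
urlcleanup clears urllib's global state (temporary files left by urlretrieve and, in CPython's implementation, the globally installed opener). Fetching images from Taobao/Tmall needs no extra headers, session or anything like that, ha, and leaving this call out may cause problems.

    url = urllib.parse.quote(picpath).replace('%3A', ':')
    urllib.request.urlcleanup()

About the quote function: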
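By default urllib.parse.quote leaves only letters, digits, the characters _.-~ and / unescaped, which is why the colon after http gets turned into %3A. The sketch below shows the script's replace approach next to the equivalent call with the safe parameter; the URL is the pic_path from the sample product above with the size already swapped.

```python
from urllib.parse import quote

pic = ('http://g.search.alicdn.com/img/bao/uploaded/i4/i3/'
       'TB1c1E8IFXXXXbsXFXXXXXXXXXX_!!0-item_pic.jpg_720x720.jpg')

escaped = quote(pic)                  # default safe='/': ':' becomes %3A, '!' becomes %21
fixed = escaped.replace('%3A', ':')   # what the script does
direct = quote(pic, safe=':/')        # same result in one call

print(escaped)
print(fixed == direct)
```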
Finally, with the folder in place, name the image file and save the image. The key points are how the image is named and how the thumbnail is made.

    try:
        pic = urllib.request.urlopen(url)
        picno = time.strftime('%H%M%S', time.localtime())
        filenamep = mulu + '/' + picno + validateTitle(item['nick'] + '-' + item['title'])
        filenamepp = filenamep + '.jpeg'
        sfilename = filenamep + 's.jpeg'
        filess = open(filenamepp, 'wb')
        filess.write(pic.read())
        filess.close()
        img = Image.open(filenamepp)
        w, h = img.size
        size = w // 6, h // 6  # integer pixel sizes
        img.thumbnail(size, Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10
        img.save(sfilename, 'jpeg')
        itemlist.append(sfilename)
        print('Saved image: ' + sfilename)
    except Exception as e:
        if hasattr(e, 'code'):
            print('Page not found or request took too long.')
            print('Error code:', e.code)
        elif hasattr(e, 'reason'):
            print('Cannot reach the host.')
            print('Reason: ', e.reason)
        else:
            print(e)
        itemlist.append('')

Image naming:

    picno = time.strftime('%H%M%S', time.localtime())
    filenamep = mulu + '/' + picno + validateTitle(item['nick'] + '-' + item['title'])
    filenamepp = filenamep + '.jpeg'
    sfilename = filenamep + 's.jpeg'

The image path is roughly: crawl date (year, month, day) / crawl timestamp plus keyword / hour-minute-second plus shop name - product title, with the name escaped by validateTitle!
For example 20160105/201601051545婚紗禮服/153648wangjingli327-法國蕾絲釘珠深v領公主新娘甜美修身齊地婚紗禮服2015冬季新款.jpeg
and the thumbnail is 20160105/201601051545婚紗禮服/153648wangjingli327-法國蕾絲釘珠深v領公主新娘甜美修身齊地婚紗禮服2015冬季新款s.jpeg
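The naming scheme can be sketched as a small pure function. validateTitle is defined in the earlier part; safe_name below is only a hypothetical stand-in that replaces characters Windows does not allow in file names, and image_paths is likewise an illustrative name.

```python
import re
import time


def safe_name(title):
    """Replace characters that are not allowed in Windows file names
    (hypothetical stand-in for validateTitle)."""
    return re.sub(r'[\\/:*?"<>|\r\n]+', '_', title)


def image_paths(folder, nick, title, stamp=None):
    """Return (full-size path, thumbnail path) for one product image."""
    stamp = stamp or time.strftime('%H%M%S', time.localtime())
    base = folder + '/' + stamp + safe_name(nick + '-' + title)
    return base + '.jpeg', base + 's.jpeg'


print(image_paths('../image/20160105/201601051545kw', 'shop', 'a/b:c', '153648'))
```

Because the stamp only has one-second resolution, two images of the same product saved within the same second would share a name; the script accepts that risk.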
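The thumbnail exists so that it can be inserted into the Excel file; it is made with a library (the Image module from PIL/Pillow).
Size comparison:

    pic = urllib.request.urlopen(url)
    filess = open(filenamepp, 'wb')
    filess.write(pic.read())
    filess.close()
    img = Image.open(filenamepp)
    w, h = img.size
    size = w // 6, h // 6  # integer pixel sizes
    img.thumbnail(size, Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10
    img.save(sfilename, 'jpeg')
    itemlist.append(sfilename)
    print('Saved image: ' + sfilename)

First pic=urllib.request.urlopen(url) opens the URL to get the binary data, then filess=open(filenamepp,'wb') opens a new file, and filess.write(pic.read()) reads the data out and writes it in. That is one image saved.

    img = Image.open(filenamepp)
    w, h = img.size
    size = w // 6, h // 6  # integer pixel sizes
    img.thumbnail(size, Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10
    img.save(sfilename, 'jpeg')
    itemlist.append(sfilename)

Then Image opens the saved file and reads its width and height (w, h); thumbnail scales the picture down to one sixth of each dimension (it resizes, it does not crop), and the result is saved again!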
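Now look at the last few lines of the program:

    if len(total) > 1:
        writeexcel(roota + '/' + namess + keyword + 'TaobaoMobileProducts.xlsx', total)
    else:
        print('Nothing was fetched')

If total holds any data beyond the header row, it is written to Excel. Work out the Excel file naming yourselves. The result looks like this: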
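So what next? I want anyone to be able to run this directly, without installing a Python environment. All right, watch closely.
Install cx_Freeze (do that part yourselves!).
Create setup.py in the same directory as the source code
and put this in it: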

    import sys
    from cx_Freeze import setup, Executable

    base = None

    executables = [
        Executable('mtaobao.py', base=base)
    ]

    setup(
        name="mtaobao",
        version="1.0",
        description="sangjin",
        executables=executables
    )

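Then, in cmd, change to that folder and run the build (with cx_Freeze this is normally python setup.py build).
The generated files end up in a folder named exe.win32-3.4.
Move exe.win32-3.4 to the root directory and rename it to whatever you like; from here on it is called exe.
You could run the .exe file inside the exe folder directly, but let's create a batch script, run.bat, anyway.
Put this in run.bat: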

    cd exe
    mtaobao.exe

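After that, just run run.bat to start our tool!!
And that wraps up Python3 Intermediate Player: Taobao/Tmall Product Search Crawler Automation Tool. I'm exhausted!
Python3 Intermediate Player: Taobao/Tmall Product Search Crawler Automation Tool (Part 1)
Python3 Intermediate Player: Taobao/Tmall Product Search Crawler Automation Tool (Part 2)
If you can't wait, head over to GitHub: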
git clone https://github.com/hunterhug/taobaoscrapy.git
Stay tuned for new crawler articles!!! There are plenty more to come...