Python 3 for Intermediate Players: An Automated Taobao/Tmall Product Search Crawler (Part 3)

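View on GitHub

Feel free to revisit Part 1 and Part 2.

Python 3 for Intermediate Players: An Automated Taobao/Tmall Product Search Crawler (Part 1)

Python 3 for Intermediate Players: An Automated Taobao/Tmall Product Search Crawler (Part 2)

This is the final part. By the end you will see how much a crawler depends on flexibly combining all sorts of tools.

Time to get a little devious: let's start on the main function.

We are using Python 3. Read through the code below first; after that I will show where to start debugging the HTTP traffic.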

if __name__ == '__main__':
    begin()
    password()
    today=time.strftime('%Y%m%d', time.localtime())
    a=time.clock()
    keyword = input('請輸入關鍵字:')
    sort = input('按銷量優先請按1,按價格低到高抓取請按2,價格高到低按3,信用排序按4,綜合排序按5:')
    try:
        pages =int(input('須要抓取的頁數(默認100頁):'))
        if pages>100 or pages<=0:
            print('頁數應該在1-100之間')
            pages=100
    except:
        pages=100
    try:
        man=int(input('請設置抓取暫停時間:默認4秒(4):'))
        if man<=0:
            man=4
    except:
        man=4
    zp=input('抓取圖片按1,不抓取按2:')
    if sort == '1':
        sortss = '_sale'
    elif sort == '2':
        sortss = 'bid'
    elif sort=='3':
        sortss='_bid'
    elif sort=='4':
        sortss='_ratesum'
    elif sort=='5':
        sortss=''
    else:
        sortss = '_sale'
    namess=time.strftime('%Y%m%d%H%S', time.localtime())
    root = '../data/'+today+'/'+namess+keyword
    roota='../excel/'+today
    mulu='../image/'+today+'/'+namess+keyword
    createjia(root)
    createjia(roota)
    for page in range(0, pages):
        time.sleep(man)
        print('暫停'+str(man)+'秒')
        if sortss=='':
            postdata = {
                'event_submit_do_new_search_auction': 1,
                'search': '提交查詢',
                '_input_charset': 'utf-8',
                'topSearch': 1,
                'atype': 'b',
                'searchfrom': 1,
                'action': 'home:redirect_app_action',
                'from': 1,
                'q': keyword,
                'sst': 1,
                'n': 20,
                'buying': 'buyitnow',
                'm': 'api4h5',
                'abtest': 16,
                'wlsort': 16,
                'style': 'list',
                'closeModues': 'nav,selecthot,onesearch',
                'page': page
            }
        else:
            postdata = {
                'event_submit_do_new_search_auction': 1,
                'search': '提交查詢',
                '_input_charset': 'utf-8',
                'topSearch': 1,
                'atype': 'b',
                'searchfrom': 1,
                'action': 'home:redirect_app_action',
                'from': 1,
                'q': keyword,
                'sst': 1,
                'n': 20,
                'buying': 'buyitnow',
                'm': 'api4h5',
                'abtest': 16,
                'wlsort': 16,
                'style': 'list',
                'closeModues': 'nav,selecthot,onesearch',
                'sort': sortss,
                'page': page
            }
        postdata = urllib.parse.urlencode(postdata)
        taobao = "http://s.m.taobao.com/search?" + postdata
        print(taobao)
        try:
            content1 = getHtml(taobao)
            # Close the file right away so the data is flushed before it is read back later
            with open(root + '/' + str(page) + '.json', 'wb') as file:
                file.write(content1)
        except Exception as e:
                if hasattr(e, 'code'):
                    print('頁面不存在或時間太長.')
                    print('Error code:', e.code)
                elif hasattr(e, 'reason'):
                        print("沒法到達主機.")
                        print('Reason:  ', e.reason)
                else:
                    print(e)

    # files=listfiles('201512171959','.json')
    files = listfiles(root, '.json')
    total = []
    total.append(['頁數', '店名', '商品標題', '商品打折價', '發貨地址', '評論數', '原價', '手機折扣', '售出件數', '政策享受', '付款人數', '金幣折扣','URL地址','圖像URL','圖像'])
    for filename in files:
        try:
            doc = open(filename, 'rb')
            doccontent = doc.read().decode('utf-8', 'ignore')
            product = doccontent.replace(' ', '').replace('\n', '')
            product = json.loads(product)
            onefile = product['listItem']
        except:
            print('抓不到' + filename)
            continue
        for item in onefile:
            itemlist = [filename, item['nick'], item['title'], item['price'], item['location'], item['commentCount']]
            itemlist.append(item['originalPrice'])
            itemlist.append(item['mobileDiscount'])
            itemlist.append(item['sold'])
            itemlist.append(item['zkType'])
            itemlist.append(item['act'])
            itemlist.append(item['coinLimit'])
            itemlist.append(item['auctionURL'])
            picpath=item['pic_path'].replace('60x60','720x720')
            itemlist.append(picpath)
            #http://g.search2.alicdn.com/img/bao/uploaded/i4/i4/TB13O7bJVXXXXbJXpXXXXXXXXXX_%21%210-item_pic.jpg_180x180.jpg
            if zp=='1':
                if os.path.exists(mulu):
                    pass
                else:
                    createjia(mulu)
                url=urllib.parse.quote(picpath).replace('%3A',':')
                urllib.request.urlcleanup()
                try:
                    pic=urllib.request.urlopen(url)
                    picno=time.strftime('%H%M%S', time.localtime())
                    filenamep=mulu+'/'+picno+validateTitle(item['nick']+'-'+item['title'])
                    filenamepp=filenamep+'.jpeg'
                    sfilename=filenamep+'s.jpeg'
                    filess=open(filenamepp,'wb')
                    filess.write(pic.read())
                    filess.close()
                    img = Image.open(filenamepp)
                    w, h = img.size
                    size=w/6,h/6
                    img.thumbnail(size, Image.ANTIALIAS)
                    img.save(sfilename,'jpeg')
                    itemlist.append(sfilename)
                    print('抓到圖片:'+sfilename)
                except Exception as e:
                    if hasattr(e, 'code'):
                        print('頁面不存在或時間太長.')
                        print('Error code:', e.code)
                    elif hasattr(e, 'reason'):
                            print("沒法到達主機.")
                            print('Reason:  ', e.reason)
                    else:
                        print(e)
                    itemlist.append('')
            else:
                itemlist.append('')
            # print(itemlist)
            total.append(itemlist)
    if len(total) > 1:
        writeexcel(roota +'/'+namess+keyword+ '淘寶手機商品.xlsx', total)
    else:
        print('什麼都抓不到')
    b=time.clock()
    print('運行時間:'+timetochina(b-a))
    input('請關閉窗口')

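OK, first open Firefox and go to

http://s.m.taobao.com/

Press Ctrl+Shift+M to switch to the mobile (responsive) view, then turn on touch event simulation.

[Image]

It seems you can no longer search for items from a desktop PC these days. No matter: press F12.

Notice the request parameters shown below.

[Image]

And here is the JSON data that comes back.

[Image]

Let's start the analysis from the middle of the code.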

else:
        postdata = {
            'event_submit_do_new_search_auction': 1,
            'search': '提交查詢',
            '_input_charset': 'utf-8',
            'topSearch': 1,
            'atype': 'b',
            'searchfrom': 1,
            'action': 'home:redirect_app_action',
            'from': 1,
            'q': keyword,
            'sst': 1,
            'n': 20,
            'buying': 'buyitnow',
            'm': 'api4h5',
            'abtest': 16,
            'wlsort': 16,
            'style': 'list',
            'closeModues': 'nav,selecthot,onesearch',
            'sort': sortss,
            'page': page
        }
    postdata = urllib.parse.urlencode(postdata)
    taobao = "http://s.m.taobao.com/search?" + postdata
    print(taobao)

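keyword is the search keyword, sort is the sort order, and page is the page number. By default there seems to be nothing beyond page 100, so keep an eye on it.

The endpoint also supports filtering by price and by shipping location; capture the traffic and debug that yourself.

urlencode is needed because the keyword may contain Chinese or otherwise unsafe characters, which have to be escaped first.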
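To make the escaping concrete, here is a minimal sketch with a trimmed-down parameter set (the real request in this article sends more keys, and the keyword is only an example). It shows that urlencode escapes Chinese characters, spaces, colons and commas:

```python
from urllib.parse import urlencode

# Trimmed-down parameters for illustration only.
params = {
    'q': '婚紗 禮服',                          # keyword with Chinese characters and a space
    'action': 'home:redirect_app_action',      # ':' is escaped to %3A
    'closeModues': 'nav,selecthot,onesearch',  # ',' is escaped to %2C
    'page': 0,
}
query = urlencode(params)
url = 'http://s.m.taobao.com/search?' + query
print(url)
```

The values come back unchanged when the query string is decoded again, so the server sees the original keyword.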

The printed result looks like this:

[Image]

Good!! Now let's go through main from the top.

if __name__ == '__main__':
    begin()
    password()
    today=time.strftime('%Y%m%d', time.localtime())
    a=time.clock()
    b=time.clock()
    print('運行時間:'+timetochina(b-a))
    input('請關閉窗口')

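begin prints the welcome message.

password verifies the user.

clock() starts the timer so we can report how long the program ran. (time.clock() was deprecated in Python 3.3 and removed in Python 3.8; on current versions use time.perf_counter() instead.)

today is today's date in year-month-day format; it is used when creating folders.

The final input() keeps the window open after the run; otherwise it would close before you had time to read the elapsed time.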
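As a rough illustration of the timing idea on current Python versions, the sketch below uses time.perf_counter(). The helper format_elapsed is a hypothetical stand-in; it is not the author's timetochina, whose implementation lives in the earlier part.

```python
import time

def format_elapsed(seconds):
    """Format a duration in seconds as H:MM:SS (hypothetical helper)."""
    total = int(round(seconds))
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return '{}:{:02d}:{:02d}'.format(hours, minutes, secs)

start = time.perf_counter()
time.sleep(0.01)  # placeholder for the real crawling work
elapsed = time.perf_counter() - start
print('Elapsed:', format_elapsed(elapsed))
```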

See the previous part for these functions.

Next, the search options:

keyword = input('請輸入關鍵字:')
sort = input('按銷量優先請按1,按價格低到高抓取請按2,價格高到低按3,信用排序按4,綜合排序按5:')
try:
    pages =int(input('須要抓取的頁數(默認100頁):'))
    if pages>100 or pages<=0:
        print('頁數應該在1-100之間')
        pages=100
except:
    pages=100
try:
    man=int(input('請設置抓取暫停時間:默認4秒(4):'))
    if man<=0:
        man=4
except:
    man=4
zp=input('抓取圖片按1,不抓取按2:')
if sort == '1':
    sortss = '_sale'
elif sort == '2':
    sortss = 'bid'
elif sort=='3':
    sortss='_bid'
elif sort=='4':
    sortss='_ratesum'
elif sort=='5':
    sortss=''
else:
    sortss = '_sale'

Line by line:

try:
    pages =int(input('須要抓取的頁數(默認100頁):'))
    if pages>100 or pages<=0:
        print('頁數應該在1-100之間')
        pages=100
except:
    pages=100

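If the page count entered is not a number, int() raises an exception and we fall back to 100 pages.

If it is a number but greater than 100, zero, or negative, it also falls back to 100 pages. No arguing.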
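The same validation is used for both the page count and the pause time, so it can be sketched as one small helper. The name parse_bounded_int is hypothetical and not part of the original program; it also catches only ValueError rather than using a bare except.

```python
def parse_bounded_int(text, default, low, high):
    """Return int(text) if it lies in [low, high], otherwise the default."""
    try:
        value = int(text)
    except ValueError:
        # Not a number at all: fall back to the default.
        return default
    if value < low or value > high:
        return default
    return value

# Intended use (not executed here because input() would block):
#   pages = parse_bounded_int(input('Pages to fetch (default 100): '), 100, 1, 100)
print(parse_bounded_int('30', 100, 1, 100))
print(parse_bounded_int('abc', 100, 1, 100))
```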

try:
    man=int(input('請設置抓取暫停時間:默認4秒(4):'))
    if man<=0:
        man=4
except:
    man=4

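Likewise, the crawler has to pause between requests. Don't fetch too fast, or the anti-crawling measures will catch you! The default is four seconds.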

keyword = input('請輸入關鍵字:')
sort = input('按銷量優先請按1,按價格低到高抓取請按2,價格高到低按3,信用排序按4,綜合排序按5:')
zp=input('抓取圖片按1,不抓取按2:')
if sort == '1':
    sortss = '_sale'
elif sort == '2':
    sortss = 'bid'
elif sort=='3':
    sortss='_bid'
elif sort=='4':
    sortss='_ratesum'
elif sort=='5':
    sortss=''
else:
    sortss = '_sale'

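The key point above is sort, which was worked out while debugging: for the comprehensive ordering sortss is '', and any unrecognized choice defaults to sorting by sales.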
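The if/elif chain is a plain lookup, so it could also be written as a dictionary. This is only an alternative sketch; the codes are the ones used in the article, while the names SORT_CODES and sort_code are made up here.

```python
# Menu choice -> value of the 'sort' request parameter ('' means comprehensive ordering).
SORT_CODES = {
    '1': '_sale',     # by sales
    '2': 'bid',       # price, low to high
    '3': '_bid',      # price, high to low
    '4': '_ratesum',  # by seller credit
    '5': '',          # comprehensive ordering
}

def sort_code(choice):
    """Map the user's menu choice to a sort code, defaulting to sales."""
    return SORT_CODES.get(choice, '_sale')

print(repr(sort_code('2')), repr(sort_code('5')), repr(sort_code('x')))
```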

Whether to download images or not is entirely up to you!

Next, where the scraped data is stored:

namess=time.strftime('%Y%m%d%H%S', time.localtime())
root = '../data/'+today+'/'+namess+keyword
roota='../excel/'+today
mulu='../image/'+today+'/'+namess+keyword
createjia(root)
createjia(roota)

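Compare the code above with the notes below. today is today's date, and namess+keyword records when the run happened and which keyword was scraped. (As written, the format string '%Y%m%d%H%S' gives the hour followed by the seconds; use %M in place of %S if you want hour and minute.)

root holds the raw data.

roota holds the Excel files.

mulu holds the images.

createjia creates the folders; writing into a folder that does not exist raises an error!!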
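The implementation of createjia is in the earlier part and is not repeated here. As a hedged sketch of what such a helper can look like, os.makedirs with exist_ok=True creates all missing parent folders and does nothing if the folder is already there. The name ensure_dir is hypothetical.

```python
import os
import tempfile

def ensure_dir(path):
    """Create path and any missing parents; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path

# Demonstrate inside a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'data', '20160105', 'demo')
    ensure_dir(target)
    ensure_dir(target)  # a second call is harmless
    print(os.path.isdir(target))
```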

[Image]

Here comes the boss fight, so watch closely:

for page in range(0, pages):
        time.sleep(man)
        print('暫停'+str(man)+'秒')
        if sortss=='':
            postdata = {
                'event_submit_do_new_search_auction': 1,
                'search': '提交查詢',
                '_input_charset': 'utf-8',
                'topSearch': 1,
                'atype': 'b',
                'searchfrom': 1,
                'action': 'home:redirect_app_action',
                'from': 1,
                'q': keyword,
                'sst': 1,
                'n': 20,
                'buying': 'buyitnow',
                'm': 'api4h5',
                'abtest': 16,
                'wlsort': 16,
                'style': 'list',
                'closeModues': 'nav,selecthot,onesearch',
                'page': page
            }
        else:
            postdata = {
                'event_submit_do_new_search_auction': 1,
                'search': '提交查詢',
                '_input_charset': 'utf-8',
                'topSearch': 1,
                'atype': 'b',
                'searchfrom': 1,
                'action': 'home:redirect_app_action',
                'from': 1,
                'q': keyword,
                'sst': 1,
                'n': 20,
                'buying': 'buyitnow',
                'm': 'api4h5',
                'abtest': 16,
                'wlsort': 16,
                'style': 'list',
                'closeModues': 'nav,selecthot,onesearch',
                'sort': sortss,
                'page': page
            }
        postdata = urllib.parse.urlencode(postdata)
        taobao = "http://s.m.taobao.com/search?" + postdata
        print(taobao)
        try:
            content1 = getHtml(taobao)
            # Close the file right away so the data is flushed before it is read back later
            with open(root + '/' + str(page) + '.json', 'wb') as file:
                file.write(content1)
        except Exception as e:
                if hasattr(e, 'code'):
                    print('頁面不存在或時間太長.')
                    print('Error code:', e.code)
                elif hasattr(e, 'reason'):
                        print("沒法到達主機.")
                        print('Reason:  ', e.reason)
                else:
                    print(e)

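Sleep for a while first, then fetch. The loop runs page from 0 to pages-1, where pages is the number of pages; page is used when building the request parameters.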

for page in range(0, pages):
    time.sleep(man)
    print('暫停'+str(man)+'秒')
    if sortss=='':
        postdata = {
            'event_submit_do_new_search_auction': 1,
            'search': '提交查詢',
            '_input_charset': 'utf-8',
            'topSearch': 1,
            'atype': 'b',
            'searchfrom': 1,
            'action': 'home:redirect_app_action',
            'from': 1,
            'q': keyword,
            'sst': 1,
            'n': 20,
            'buying': 'buyitnow',
            'm': 'api4h5',
            'abtest': 16,
            'wlsort': 16,
            'style': 'list',
            'closeModues': 'nav,selecthot,onesearch',
            'page': page
        }

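The comprehensive ordering differs from the other orderings in that it has no sort request parameter, so an if/else is used to tell the two cases apart.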

else:
        postdata = {
            'event_submit_do_new_search_auction': 1,
            'search': '提交查詢',
            '_input_charset': 'utf-8',
            'topSearch': 1,
            'atype': 'b',
            'searchfrom': 1,
            'action': 'home:redirect_app_action',
            'from': 1,
            'q': keyword,
            'sst': 1,
            'n': 20,
            'buying': 'buyitnow',
            'm': 'api4h5',
            'abtest': 16,
            'wlsort': 16,
            'style': 'list',
            'closeModues': 'nav,selecthot,onesearch',
            'sort': sortss,
            'page': page
        }
    postdata = urllib.parse.urlencode(postdata)
    taobao = "http://s.m.taobao.com/search?" + postdata
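Since the two dictionaries differ only in the sort key, one way to avoid the duplication is to keep the shared parameters in one place and add sort only when it is non-empty. This is a sketch, not the author's code; BASE_PARAMS and build_search_url are names invented here, and the parameter values are copied from the article.

```python
from urllib.parse import urlencode

# Parameters shared by every request, copied from the article.
BASE_PARAMS = {
    'event_submit_do_new_search_auction': 1,
    'search': '提交查詢',
    '_input_charset': 'utf-8',
    'topSearch': 1,
    'atype': 'b',
    'searchfrom': 1,
    'action': 'home:redirect_app_action',
    'from': 1,
    'sst': 1,
    'n': 20,
    'buying': 'buyitnow',
    'm': 'api4h5',
    'abtest': 16,
    'wlsort': 16,
    'style': 'list',
    'closeModues': 'nav,selecthot,onesearch',
}

def build_search_url(keyword, page, sortss):
    """Build the search URL; 'sort' is sent only for non-comprehensive orderings."""
    params = dict(BASE_PARAMS, q=keyword, page=page)
    if sortss:
        params['sort'] = sortss
    return 'http://s.m.taobao.com/search?' + urlencode(params)

print(build_search_url('1', 0, 'bid'))
```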

Now fetch this link and save the result under data as a JSON file.

try:
        content1 = getHtml(taobao)
        # Close the file right away so the data is flushed before it is read back later
        with open(root + '/' + str(page) + '.json', 'wb') as file:
            file.write(content1)
    except Exception as e:
            if hasattr(e, 'code'):
                print('頁面不存在或時間太長.')
                print('Error code:', e.code)
            elif hasattr(e, 'reason'):
                    print("沒法到達主機.")
                    print('Reason:  ', e.reason)
            else:
                print(e)

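If an error occurs, check whether it has a code attribute. If it does, the server was reached but answered with an HTTP error such as 404 or 403.

If it has a reason attribute instead, the problem is on the network side and the server could not be reached.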
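The two attributes correspond to urllib's exception classes: HTTPError carries code, URLError carries reason. Because HTTPError is a subclass of URLError and has both attributes, it has to be checked first. A small sketch using the classes directly instead of hasattr (describe_fetch_error is a hypothetical helper):

```python
import urllib.error

def describe_fetch_error(exc):
    """Turn an exception raised while fetching a URL into a short message."""
    if isinstance(exc, urllib.error.HTTPError):
        # The server answered, but with an error status such as 403 or 404.
        return 'HTTP error {}'.format(exc.code)
    if isinstance(exc, urllib.error.URLError):
        # The server could not be reached at all (DNS failure, refused connection, ...).
        return 'Could not reach host: {}'.format(exc.reason)
    return 'Unexpected error: {}'.format(exc)

# Intended use:
#   try:
#       content1 = getHtml(taobao)
#   except Exception as e:
#       print(describe_fetch_error(e))
print(describe_fetch_error(urllib.error.URLError('timed out')))
```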

The saved data is whatever a URL like the following returns (too long to reproduce here):

http://s.m.taobao.com/search?q=1&abtest=16&search=%E6%8F%90%E4%BA%A4%E6%9F%A5%E8%AF%A2&topSearch=1&style=list&sst=1&atype=b&n=20&page=0&closeModues=nav%2Cselecthot%2Conesearch&_input_charset=utf-8&sort=bid&buying=buyitnow&searchfrom=1&from=1&m=api4h5&event_submit_do_new_search_auction=1&action=home%3Aredirect_app_action&wlsort=16

Later on we need to parse this data, split it up, and insert it into Excel.

files = listfiles(root, '.json')
total = []
total.append(['頁數', '店名', '商品標題', '商品打折價', '發貨地址', '評論數', '原價', '手機折扣', '售出件數', '政策享受', '付款人數', '金幣折扣','URL地址','圖像URL','圖像'])
for filename in files:
    try:
        doc = open(filename, 'rb')
        doccontent = doc.read().decode('utf-8', 'ignore')
        product = doccontent.replace(' ', '').replace('\n', '')
        product = json.loads(product)
        onefile = product['listItem']
    except:
        print('抓不到' + filename)
        continue
    for item in onefile:
        itemlist = [filename, item['nick'], item['title'], item['price'], item['location'], item['commentCount']]
        itemlist.append(item['originalPrice'])
        itemlist.append(item['mobileDiscount'])
        itemlist.append(item['sold'])
        itemlist.append(item['zkType'])
        itemlist.append(item['act'])
        itemlist.append(item['coinLimit'])
        itemlist.append(item['auctionURL'])
        picpath=item['pic_path'].replace('60x60','720x720')
        itemlist.append(picpath)
        #http://g.search2.alicdn.com/img/bao/uploaded/i4/i4/TB13O7bJVXXXXbJXpXXXXXXXXXX_%21%210-item_pic.jpg_180x180.jpg
        if zp=='1':
            if os.path.exists(mulu):
                pass
            else:
                createjia(mulu)
            url=urllib.parse.quote(picpath).replace('%3A',':')
            urllib.request.urlcleanup()
            try:
                pic=urllib.request.urlopen(url)
                picno=time.strftime('%H%M%S', time.localtime())
                filenamep=mulu+'/'+picno+validateTitle(item['nick']+'-'+item['title'])
                filenamepp=filenamep+'.jpeg'
                sfilename=filenamep+'s.jpeg'
                filess=open(filenamepp,'wb')
                filess.write(pic.read())
                filess.close()
                img = Image.open(filenamepp)
                w, h = img.size
                size=w/6,h/6
                img.thumbnail(size, Image.ANTIALIAS)
                img.save(sfilename,'jpeg')
                itemlist.append(sfilename)
                print('抓到圖片:'+sfilename)
            except Exception as e:
                if hasattr(e, 'code'):
                    print('頁面不存在或時間太長.')
                    print('Error code:', e.code)
                elif hasattr(e, 'reason'):
                        print("沒法到達主機.")
                        print('Reason:  ', e.reason)
                else:
                    print(e)
                itemlist.append('')
        else:
            itemlist.append('')
        # print(itemlist)
        total.append(itemlist)
if len(total) > 1:
    writeexcel(roota +'/'+namess+keyword+ '淘寶手機商品.xlsx', total)
else:
    print('什麼都抓不到')

Line by line.

files = listfiles(root, '.json')
total = []
total.append(['頁數', '店名', '商品標題', '商品打折價', '發貨地址', '評論數', '原價', '手機折扣', '售出件數', '政策享受', '付款人數', '金幣折扣','URL地址','圖像URL','圖像'])

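root is the directory that holds the raw data; listfiles finds every file in it with the .json extension.

total holds the rows for the Excel sheet that is about to be generated.

The first row is of course the header: page, shop name, product title, and so on.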

for filename in files:
    try:
        doc = open(filename, 'rb')
        doccontent = doc.read().decode('utf-8', 'ignore')
        product = doccontent.replace(' ', '').replace('\n', '')
        product = json.loads(product)
        onefile = product['listItem']
    except:
        print('抓不到' + filename)
        continue

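Loop over the raw data files, opening each one with open() in binary mode. Why? Because otherwise data with messy encodings causes trouble. So we read raw bytes, then decode them as UTF-8 with the ignore option, which skips any bytes that fail to decode.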

doc = open(filename, 'rb')
        doccontent = doc.read().decode('utf-8', 'ignore')

Then strip out spaces and newlines so the text looks more like clean JSON.

product = doccontent.replace(' ', '').replace('\n', '')
        product = json.loads(product)
        onefile = product['listItem']

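Load the data with json.loads, after which the JSON can be handled like an ordinary object. 'listItem' holds the data we need. Here is how the JSON is laid out:

[Image]

[Image]

Right! That is a lot of fields, and quite complicated...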
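A minimal sketch of the loading step, under two assumptions of mine: json.loads accepts whitespace between tokens, so the replace calls are left out here (they also delete spaces inside values such as the location), and a missing 'listItem' key is treated as an empty page. The function name load_items is hypothetical.

```python
import json

def load_items(raw_bytes):
    """Decode one saved response and return its list of product records."""
    text = raw_bytes.decode('utf-8', 'ignore')  # skip undecodable bytes
    try:
        data = json.loads(text)
    except ValueError:
        # Not valid JSON (json.JSONDecodeError is a subclass of ValueError).
        return []
    if not isinstance(data, dict):
        return []
    return data.get('listItem', [])

sample = '{"listItem": [{"nick": "wangjingli327", "location": "浙江 紹興"}]}\n'.encode('utf-8')
items = load_items(sample)
print(items[0]['nick'], items[0]['location'])
```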

for item in onefile:
        itemlist = [filename, item['nick'], item['title'], item['price'], item['location'], item['commentCount']]
        itemlist.append(item['originalPrice'])
        itemlist.append(item['mobileDiscount'])
        itemlist.append(item['sold'])
        itemlist.append(item['zkType'])
        itemlist.append(item['act'])
        itemlist.append(item['coinLimit'])
        itemlist.append(item['auctionURL'])
        picpath=item['pic_path'].replace('60x60','720x720')
        itemlist.append(picpath)

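Loop over every product and assemble its fields into the itemlist list. The JSON contains many more fields that are not used here.

Each product record in the JSON looks like this: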

{
        "pos": 0,
        "sold": "2",
        "userType": "0",
        "item_id": "521020560570",
        "nick": "wangjingli327",
        "userId": "85356923",
        "quantity": "",
        "shipping": "12.00",
        "ratesum": "",
        "isCod": "",
        "isprepay": "",
        "promotedService": "",
        "auctionTag": "",
        "clickUrl": "",
        "dsrScore": "",
        "zkType": "",
        "zkGroup": "",
        "autoPost": "",
        "commentCount": "2",
        "ordinaryPostFee": "",
        "distance": "",
        "zkRate": "",
        "zkTime": "",
        "promotions": "",
        "isInLimitPromotion": "",
        "pre_title_color": "",
        "pre_title": "",
        "h5Url": "",
        "isO2o": "",
        "recommendReason": "",
        "recommendColor": "",
        "recommendType": "",
        "location": "浙江 紹興",
        "price": "298.00",
        "priceColor": "#000000",
        "long_title": "",
        "isP4p": "false",
        "sellerLoc": "浙江 紹興",
        "fastPostFee": "12.00",
        "title": "性感透視奢華蕾絲深v領長袖公主新娘婚紗禮服2015冬季新款2518",
        "sameCount": "",
        "spuId": "",
        "similarCount": "",
        "priceWithRate": "",
        "pic_path": "http://g.search.alicdn.com/img/bao/uploaded/i4/i3/TB1c1E8IFXXXXbsXFXXXXXXXXXX_!!0-item_pic.jpg_60x60.jpg",
        "uprightImg": "",
        "priceWap": "298.00",
        "auctionFlag": "",
        "mobileDiscount": "",
        "coinInfo": "",
        "auctionURL": "http://a.m.taobao.com/i521020560570.htm?&abtest=16&sid=6334516654ffcd5000185f5604cc1d25",
        "type": "fixed",
        "isB2c": "0",
        "iconList": "xfbug",
        "uniqpid": "",
        "maxShopGift": "",
        "goodRate": "",
        "category": "162701",
        "newDsr": "",
        "desScore": "",
        "o2oShopId": "",
        "scoref": "",
        "sellerCount": "",
        "tagInfo": "",
        "area": "紹興",
        "realSales": "",
        "banditScore": "",
        "totalSold": "",
        "name": "性感透視奢華蕾絲深v領長袖公主新娘婚紗禮服2015冬季新款2518",
        "img2": "//gw1.alicdn.com/bao/uploaded/i3/TB1c1E8IFXXXXbsXFXXXXXXXXXX_!!0-item_pic.jpg",
        "iswebp": "",
        "url": "//a.m.taobao.com/i521020560570.htm?sid=6334516654ffcd5000185f5604cc1d25&rn=b360d891c5ca145a452cf547fa10ef24&abtest=16",
        "previewUrl": "//a.m.taobao.com/ajax/pre_view.do?sid=6334516654ffcd5000185f5604cc1d25&itemId=521020560570&abtest=16",
        "favoriteUrl": "//fav.m.taobao.com/favorite/to_collection.htm?sid=6334516654ffcd5000185f5604cc1d25&itemNumId=521020560570&abtest=16",
        "originalPrice": "298.00",
        "freight": "12.00",
        "act": "2",
        "itemNumId": "521020560570",
        "wwimUrl": "//im.m.taobao.com/ww/ad_ww_dialog.htm?item_num_id=521020560570&amp;to_user=d2FuZ2ppbmdsaTMyNw%3D%3D",
        "isMobileEcard": "false",
        "auctionType": "b",
        "coinLimit": "100",
        "collect": "",
        "assess": "",
        "recommendGuy": "",
        "pricePerUnit": "",
        "collocation": "",
        "daySold": "",
        "inStock": "",
        "extendPid": "",
        "from": "",
        "recommendLabel": ""
    }
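Given a record like the one above, the row assembly can be sketched with a list of field names. Using item.get with a default is my own choice here, so that a record lacking one of the keys yields an empty cell instead of a KeyError; the original code indexes the keys directly. FIELDS and item_to_row are hypothetical names.

```python
# Field order matches the columns appended in the article's loop.
FIELDS = [
    'nick', 'title', 'price', 'location', 'commentCount',
    'originalPrice', 'mobileDiscount', 'sold', 'zkType',
    'act', 'coinLimit', 'auctionURL',
]

def item_to_row(filename, item):
    """Build one spreadsheet row (without the image columns) from a product record."""
    row = [filename]
    row.extend(item.get(key, '') for key in FIELDS)
    return row

record = {'nick': 'wangjingli327', 'price': '298.00', 'sold': '2', 'coinLimit': '100'}
print(item_to_row('0.json', record))
```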

The product image URL, of course,

http://g.search2.alicdn.com/img/bao/uploaded/i4/i4/TB13O7bJVXXXXbJXpXXXXXXXXXX_%21%210-item_pic.jpg_60x60.jpg

can be changed to

http://g.search2.alicdn.com/img/bao/uploaded/i4/i4/TB13O7bJVXXXXbJXpXXXXXXXXXX_%21%210-item_pic.jpg_180x180.jpg
http://g.search2.alicdn.com/img/bao/uploaded/i4/i4/TB13O7bJVXXXXbJXpXXXXXXXXXX_%21%210-item_pic.jpg_720x720.jpg
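The article's code does this with a plain str.replace('60x60', '720x720'). As a slightly stricter sketch, the helper below (with_size, a made-up name) rewrites only a trailing _WxH.jpg suffix and leaves other URLs untouched:

```python
import re

def with_size(pic_url, size):
    """Swap the trailing _WxH.jpg size suffix of an image URL for _{size}x{size}.jpg."""
    return re.sub(r'_\d+x\d+\.jpg$', '_{0}x{0}.jpg'.format(size), pic_url)

small = ('http://g.search2.alicdn.com/img/bao/uploaded/i4/i4/'
         'TB13O7bJVXXXXbJXpXXXXXXXXXX_%21%210-item_pic.jpg_60x60.jpg')
print(with_size(small, 180))
print(with_size(small, 720))
```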

Haha, done!! All the data is parsed.

Now for the images:

if zp=='1':
            if os.path.exists(mulu):
                pass
            else:
                createjia(mulu)
            url=urllib.parse.quote(picpath).replace('%3A',':')
            urllib.request.urlcleanup()
            try:
                pic=urllib.request.urlopen(url)
                picno=time.strftime('%H%M%S', time.localtime())
                filenamep=mulu+'/'+picno+validateTitle(item['nick']+'-'+item['title'])
                filenamepp=filenamep+'.jpeg'
                sfilename=filenamep+'s.jpeg'
                filess=open(filenamepp,'wb')
                filess.write(pic.read())
                filess.close()
                img = Image.open(filenamepp)
                w, h = img.size
                size=w/6,h/6
                img.thumbnail(size, Image.ANTIALIAS)
                img.save(sfilename,'jpeg')
                itemlist.append(sfilename)
                print('抓到圖片:'+sfilename)
            except Exception as e:
                if hasattr(e, 'code'):
                    print('頁面不存在或時間太長.')
                    print('Error code:', e.code)
                elif hasattr(e, 'reason'):
                        print("沒法到達主機.")
                        print('Reason:  ', e.reason)
                else:
                    print(e)
                itemlist.append('')
        else:
            itemlist.append('')

If images are wanted, the download code runs; otherwise append('') marks the row as having no image.

if zp=='1':
            #抓圖
        else:
            itemlist.append('')
        # print(itemlist)
        total.append(itemlist)

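After the image step, total collects these product rows. Each one was a single row; put together they form a matrix laid out exactly like the Excel sheet, waiting to be written out.

The image download code: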

if zp=='1':
            if os.path.exists(mulu):
                pass
            else:
                createjia(mulu)
            url=urllib.parse.quote(picpath).replace('%3A',':')
            urllib.request.urlcleanup()
            try:
                pic=urllib.request.urlopen(url)
                picno=time.strftime('%H%M%S', time.localtime())
                filenamep=mulu+'/'+picno+validateTitle(item['nick']+'-'+item['title'])
                filenamepp=filenamep+'.jpeg'
                sfilename=filenamep+'s.jpeg'
                filess=open(filenamepp,'wb')
                filess.write(pic.read())
                filess.close()
                img = Image.open(filenamepp)
                w, h = img.size
                size=w/6,h/6
                img.thumbnail(size, Image.ANTIALIAS)
                img.save(sfilename,'jpeg')
                itemlist.append(sfilename)
                print('抓到圖片:'+sfilename)
            except Exception as e:
                if hasattr(e, 'code'):
                    print('頁面不存在或時間太長.')
                    print('Error code:', e.code)
                elif hasattr(e, 'reason'):
                        print("沒法到達主機.")
                        print('Reason:  ', e.reason)
                else:
                    print(e)
                itemlist.append('')

First create the folder that will hold the images:

if os.path.exists(mulu):
                pass
            else:
                createjia(mulu)

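Then the image URL has to be turned into a properly escaped URL. quote also escapes : to %3A, so that has to be changed back.

urlcleanup removes temporary files left by urlretrieve and resets the globally installed opener. Fetching Taobao/Tmall images needs no extra headers, session, or anything of the kind, haha, and leaving this call out may cause problems.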

url=urllib.parse.quote(picpath).replace('%3A',':')
            urllib.request.urlcleanup()
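An equivalent way to keep the colon intact is to tell quote which characters to leave alone through its safe argument (the default is '/'). A short sketch using the image URL from the sample record with the larger size suffix:

```python
from urllib.parse import quote

pic = ('http://g.search.alicdn.com/img/bao/uploaded/i4/i3/'
       'TB1c1E8IFXXXXbsXFXXXXXXXXXX_!!0-item_pic.jpg_720x720.jpg')

# Leave ':' and '/' untouched; everything else that is unsafe (here '!') gets escaped.
escaped = quote(pic, safe=':/')
print(escaped)
```

Note that quote would escape the % of an already escaped URL a second time, so it should only be applied to the raw value from the JSON.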

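How the quote function works:

[Image]

Finally, choose the image name and save the image. The key points are how the image is named and how the thumbnail is made.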

try:
                pic=urllib.request.urlopen(url)
                picno=time.strftime('%H%M%S', time.localtime())
                filenamep=mulu+'/'+picno+validateTitle(item['nick']+'-'+item['title'])
                filenamepp=filenamep+'.jpeg'
                sfilename=filenamep+'s.jpeg'
                filess=open(filenamepp,'wb')
                filess.write(pic.read())
                filess.close()
                img = Image.open(filenamepp)
                w, h = img.size
                size=w/6,h/6
                img.thumbnail(size, Image.ANTIALIAS)
                img.save(sfilename,'jpeg')
                itemlist.append(sfilename)
                print('抓到圖片:'+sfilename)
            except Exception as e:
                if hasattr(e, 'code'):
                    print('頁面不存在或時間太長.')
                    print('Error code:', e.code)
                elif hasattr(e, 'reason'):
                        print("沒法到達主機.")
                        print('Reason:  ', e.reason)
                else:
                    print(e)
                itemlist.append('')

Image naming:

[Image]

picno=time.strftime('%H%M%S', time.localtime())
                filenamep=mulu+'/'+picno+validateTitle(item['nick']+'-'+item['title'])
                filenamepp=filenamep+'.jpeg'
                sfilename=filenamep+'s.jpeg'

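The image path is roughly: run date (year, month, day) / run timestamp plus keyword / hour-minute-second plus shop name, a hyphen, and the product title. The title is sanitized first!

For example 20160105/201601051545婚紗禮服/153648wangjingli327-法國蕾絲釘珠深v領公主新娘甜美修身齊地婚紗禮服2015冬季新款.jpeg

and the thumbnail is 20160105/201601051545婚紗禮服/153648wangjingli327-法國蕾絲釘珠深v領公主新娘甜美修身齊地婚紗禮服2015冬季新款s.jpeg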
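The naming scheme can be sketched as follows. validateTitle comes from the earlier part and its exact rules are not shown here, so sanitize_filename below is only an assumed stand-in that replaces characters Windows does not allow in file names; image_paths is likewise a hypothetical helper.

```python
import os
import re
import time

def sanitize_filename(name):
    """Replace characters that are invalid in Windows file names (assumed rule set)."""
    return re.sub(r'[\\/:*?"<>|\r\n]+', '_', name).strip()

def image_paths(folder, nick, title, stamp=None):
    """Return (full-size path, thumbnail path) for one product image."""
    stamp = stamp or time.strftime('%H%M%S', time.localtime())
    base = os.path.join(folder, stamp + sanitize_filename(nick + '-' + title))
    return base + '.jpeg', base + 's.jpeg'

full, thumb = image_paths('image', 'wangjingli327', 'lace/v-neck: 2015', stamp='153648')
print(full)
print(thumb)
```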

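The thumbnail exists so that it can be inserted into Excel; it is made with an imaging library (the Image module from PIL/Pillow).

Size comparison:

[Image]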

pic=urllib.request.urlopen(url)
                filess=open(filenamepp,'wb')
                filess.write(pic.read())
                filess.close()
                img = Image.open(filenamepp)
                w, h = img.size
                size=w/6,h/6
                img.thumbnail(size, Image.ANTIALIAS)
                img.save(sfilename,'jpeg')
                itemlist.append(sfilename)
                print('抓到圖片:'+sfilename)

First pic=urllib.request.urlopen(url) opens the URL and obtains the binary data. Then filess=open(filenamepp,'wb') opens a new file, and filess.write(pic.read()) reads the data out and writes it in. That saves one image.
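The save step works on any file-like object, so it can be sketched without touching the network. In the sketch below an io.BytesIO stands in for the response returned by urlopen, and the with block closes the output file even if the copy fails. save_stream is a hypothetical helper name.

```python
import io
import os
import shutil
import tempfile

def save_stream(source, path):
    """Copy a binary file-like object (such as a urlopen response) to path."""
    with open(path, 'wb') as out:
        shutil.copyfileobj(source, out)

# Stand-in for pic = urllib.request.urlopen(url)
fake_response = io.BytesIO(b'\xff\xd8\xff\xe0 not a real JPEG, just demo bytes')

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'demo.jpeg')
    save_stream(fake_response, target)
    saved_size = os.path.getsize(target)
print(saved_size)
```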

img = Image.open(filenamepp)
                w, h = img.size
                size=w/6,h/6
                img.thumbnail(size, Image.ANTIALIAS)
                img.save(sfilename,'jpeg')
                itemlist.append(sfilename)

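Then Image opens the saved picture, reads its width and height as w, h, scales it down to one sixth of that size, and saves it again! (thumbnail shrinks the image rather than cropping it. Image.ANTIALIAS was removed in Pillow 10; Image.LANCZOS is the equivalent filter there.)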
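The size arithmetic on its own can be sketched without Pillow installed. The helper below (thumbnail_box, a made-up name) returns whole-pixel dimensions and never goes below one pixel, which is a small precaution of mine for very small source images; the resulting tuple is what would be passed as the size argument of img.thumbnail.

```python
def thumbnail_box(width, height, factor=6):
    """Return the (width, height) box for a thumbnail scaled down by factor."""
    return max(1, width // factor), max(1, height // factor)

print(thumbnail_box(720, 720))
print(thumbnail_box(800, 600))
```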

Now look at the last few lines of the program:

if len(total) > 1:
    writeexcel(roota +'/'+namess+keyword+ '淘寶手機商品.xlsx', total)
else:
    print('什麼都抓不到')

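If total contains data, it is written to Excel. Work out the Excel file naming for yourself. The result looks like this:

[Image]

[Image]

What next? I want anyone to be able to run the tool directly, without installing a Python environment. Right, watch closely.

Install cx_Freeze (do that part yourself!).

Create setup.py in the same directory as the source code

and put this in it: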

import sys
from cx_Freeze import setup, Executable

base = None

executables = [
    Executable('mtaobao.py', base=base)
]

setup (
name = "mtaobao",
version = "1.0",
description = "sangjin",
executables=executables
)

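Then open cmd, change to that folder, and run the build (with cx_Freeze the command is python setup.py build):

[Image]

The generated files:

[Image]

Move exe.win32-3.4 to the root directory and rename it to anything you like; from here on it is called exe.

[Image]

You could run the .exe file inside the exe folder directly, but let's create a batch script, run.bat, anyway.

run.bat contains: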

cd exe
mtaobao.exe

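Then just run run.bat to start our tool!!

And that wraps up Python 3 for Intermediate Players: An Automated Taobao/Tmall Product Search Crawler. What a slog!

Python 3 for Intermediate Players: An Automated Taobao/Tmall Product Search Crawler (Part 1)

Python 3 for Intermediate Players: An Automated Taobao/Tmall Product Search Crawler (Part 2)

If you can't wait, go straight to GitHub: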

git clone https://github.com/hunterhug/taobaoscrapy.git

Stay tuned for more crawler articles!!! There are plenty more to come...
