Scraping Taobao product information with Python and adding items to the shopping cart

Date: 2019-11-06


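First, the end result we are after: with Chrome logged in to Taobao, run the Python project, pass in the link of any Taobao product, choose the product attributes by hand, and have the script print the price and remaining stock. Then enter a purchase quantity and the item is added to the cart automatically.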

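Before scraping any Taobao links, some preparation is needed. My project uses Python 2.7; both the development and runtime environments are Windows 10, and the browser is 64-bit Chrome 59.0.3. Simulating a Taobao login involves a fairly complicated UA-code algorithm and a slider captcha, which is beyond my ability, so to keep things simple I log in to Taobao manually in the browser and let Python read the cookies the browser generated, then use them to fetch data that requires login.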

Getting the cookie

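One way to get the post-login cookies is to open the browser's developer tools and copy them by hand. That is tedious, though: many Taobao requests go to different domains, so the cookies sent to the server differ, and re-copying them for every new address would be far too crude. To do it a bit more properly, we read the cookies Chrome stores locally and assemble the header ourselves.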

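A quick search shows that Chrome stores its cookies as a SQLite database in the file %LOCALAPPDATA%\Google\Chrome\User Data\Default\Cookies, and that the cookie values are encrypted with the Windows data-protection API, so they have to be decrypted with CryptUnprotectData. Knowing that, we can move on.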

This uses the sqlite3 module, plus win32crypt (from pywin32) for the decryption.

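import os
import sqlite3
import urllib2

import win32crypt  # provided by pywin32


def get_cookie(url):
    """
    Collect the Chrome cookies that apply to the given URL.
    :param url: the URL about to be requested
    :return: (cookie string for the Cookie header, host name)
    """

    domain = urllib2.splithost(urllib2.splittype(url)[1])[0]
    # Match the host itself, its dotted form, and the parent domain.
    domain_list = ['.' + domain, domain]
    if len(domain.split('.')) > 2:
        dot_index = domain.find('.')
        domain_list.append(domain[dot_index:])
        domain_list.append(domain[dot_index + 1:])

    conn = None
    cookie_str = None
    try:
        conn = sqlite3.connect(r'%s\Google\Chrome\User Data\Default\Cookies' % os.getenv('LOCALAPPDATA'))
        cursor = conn.cursor()
        print '-' * 50
        # Bind the host names as parameters instead of formatting them into the SQL.
        sql = 'select host_key, name, value, encrypted_value, path from cookies where host_key in (%s)' % ','.join('?' * len(domain_list))
        row_list = cursor.execute(sql, domain_list).fetchall()
        print u'Found %d cookies' % len(row_list)
        print '-' * 50
        print u'%-20s\t%-5s\t%-5s\t%s' % (u'host', u'name', u'value', u'path')
        cookie_list = []
        for host_key, name, value, encrypted_value, path in row_list:
            # print_charset (presumably the console encoding) is defined elsewhere in the project.
            decrypted_value = win32crypt.CryptUnprotectData(encrypted_value, None, None, None, 0)[1].decode(print_charset) or value
            cookie_list.append(name + '=' + decrypted_value)
            print u'%-20s\t%-5s\t%-5s\t%s' % (host_key, name, decrypted_value, path)
        print '-' * 50
        cookie_str = '; '.join(cookie_list)
    except Exception:
        raise CookieException()  # project-specific exception, defined elsewhere
    finally:
        # Close only if the connection was opened. Returning from inside
        # finally would silently swallow the CookieException raised above.
        if conn is not None:
            conn.close()
    return cookie_str, domain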

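get_cookie starts by building domain_list because some Taobao requests share cookies across subdomains, so every cookie that applies to the URL has to be collected. With that cookie string we can then request data that requires login.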
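To make the host-matching idea concrete, here is a minimal sketch in Python 3 (the post's own code is Python 2.7). It runs against an in-memory SQLite table that stands in for Chrome's cookie database; the rows are made up, the values are stored in plain text, and the Windows-only decryption step is left out. The helper names `candidate_hosts` and `cookie_header` are hypothetical.

```python
import sqlite3
from urllib.parse import urlsplit


def candidate_hosts(url):
    """Host keys worth querying: the host, its dotted form, and the parent domain."""
    host = urlsplit(url).hostname
    hosts = ['.' + host, host]
    if host.count('.') >= 2:
        parent = host[host.find('.'):]  # e.g. ".taobao.com"
        hosts += [parent, parent[1:]]
    return hosts


def cookie_header(conn, url):
    """Build a Cookie header value from the rows whose host_key matches the URL."""
    hosts = candidate_hosts(url)
    placeholders = ','.join('?' * len(hosts))
    rows = conn.execute(
        'select name, value from cookies where host_key in (%s) order by name'
        % placeholders,
        hosts,
    ).fetchall()
    return '; '.join('%s=%s' % (name, value) for name, value in rows)


# In-memory stand-in for the cookie database (made-up rows).
conn = sqlite3.connect(':memory:')
conn.execute('create table cookies (host_key text, name text, value text)')
conn.executemany('insert into cookies values (?, ?, ?)', [
    ('.taobao.com', 't', 'abc'),
    ('item.taobao.com', 'x', '1'),
    ('.tmall.com', 'y', '2'),
])
print(cookie_header(conn, 'https://item.taobao.com/item.htm?id=1'))
```

The `.tmall.com` row is skipped because none of the candidate host keys match it, which is the behaviour domain_list is meant to give.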

Setting up a proxy

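Before requesting data there is usually one more thing to do: set up a proxy. A high-anonymity proxy is an effective way to keep Taobao's anti-scraping mechanism from banning the local IP. There are many sources of proxies and plenty of free ones online. They are not very stable, but since we are only experimenting that does not matter. A quick Baidu search turned up www.xicidaili.com, so we scrape that site's high-anonymity proxies and then request bing through each one to test whether it works.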

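def set_proxy():
    """
    Set up a proxy.
    """
    # Fetch high-anonymity proxies from xicidaili
    proxy_info_list = []  # scraped table cells (IP, port, lifetime, last checked)
    for page in range(1, 2):  # only the first page for now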
        request = urllib2.Request('http://www.xicidaili.com/nn/%d' % page)
        request.add_header('Accept-Encoding', 'gzip, deflate')
        request.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8')
        request.add_header('Accept-Language', 'zh-CN,zh;q=0.8,en;q=0.6')
        request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36')
        response = urllib2.urlopen(request, timeout=5)

        headers = response.info()
        content_type = headers.get('Content-Type')
        charset = 'utf-8'
        if content_type:
            # Fall back to utf-8 when the header carries no charset parameter.
            match = re.search(r"charset=([\w-]+)", content_type)
            if match:
                charset = match.group(1)
        if headers.get('Content-Encoding') == 'gzip':
            gz = gzip.GzipFile(fileobj=StringIO.StringIO(response.read()))
            content = gz.read().decode(charset)
            gz.close()
        else:
            content = response.read().decode(charset)
        response.close()
        print u'Fetched page %d' % page
        ip_page = re.findall(r'<td>(\d.*?)</td>', content)
        proxy_info_list.extend(ip_page)
        time.sleep(random.choice(range(1, 3)))

    # Print what was scraped
    print u'Proxy IP\tPort\tLifetime\tLast checked'
    for i in range(0, len(proxy_info_list), 4):
        print u'%s\t%s\t%s\t%s' % (proxy_info_list[i], proxy_info_list[i + 1], proxy_info_list[i + 2], proxy_info_list[i + 3])

    all_proxy_list = []  # proxies waiting to be validated
    # proxy_list = []  # usable proxies
    for i in range(0, len(proxy_info_list), 4):
        proxy_host = proxy_info_list[i] + ':' + proxy_info_list[i + 1]
        all_proxy_list.append(proxy_host)

    # Start validating

    # Single-threaded approach; test() is defined elsewhere in the project
    for i in range(len(all_proxy_list)):
        proxy_host = test(all_proxy_list[i])
        if proxy_host:
            break
    else:
        # TODO move on to the next page
        print u'No usable proxy found'
        return None

    # Multi-threaded approach
    # threads = []
    # # for i in range(len(all_proxy_list)):
    # for i in range(5):
    #     thread = threading.Thread(target=test, args=[all_proxy_list[i]])
    #     threads.append(thread)
    #     time.sleep(random.uniform(0, 1))
    #     thread.start()
    #
    # # Wait for all threads to finish
    # for t in threading.enumerate():
    #     if t is threading.currentThread():
    #         continue
    #     t.join()
    #
    # if not proxy_list:
    #     print u'No usable proxy found'
    #     # TODO move on to the next page
    #     sys.exit(0)
    print u'Using proxy: %s' % proxy_host
    # Note: this mapping only covers http:// requests; https:// requests are not routed through the proxy.
    urllib2.install_opener(urllib2.build_opener(urllib2.ProxyHandler({'http': proxy_host})))

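I originally wanted to validate the proxies with multiple threads and shut down the remaining threads as soon as one proxy passed, but I could not find a way to do that, so for now they are checked one by one in a loop. If you have other ideas, please share them in the comments.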
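One possible approach to the "stop at the first success" problem, sketched in Python 3 with `concurrent.futures` rather than the raw `threading` calls above. This is not the author's method; `first_working` is a hypothetical helper, and `check` stands in for the project's `test()` function. Threads that are already running cannot be killed, so the sketch only cancels checks that have not started yet and stops waiting for the rest; the check itself should therefore use a request timeout.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def first_working(candidates, check, max_workers=5):
    """Return the first candidate for which check(candidate) is truthy, else None."""
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = {pool.submit(check, c): c for c in candidates}
    try:
        for future in as_completed(futures):
            try:
                ok = future.result()
            except Exception:
                ok = False  # a failed check just means "not usable"
            if ok:
                return futures[future]
        return None
    finally:
        # Cancel checks that have not started; running ones finish in the background.
        for f in futures:
            f.cancel()
        pool.shutdown(wait=False)


# Stand-in check: pretend only one (made-up) proxy address works.
print(first_working(['1.1.1.1:80', '2.2.2.2:8080'], lambda p: p.endswith(':8080')))
```

Because the check function is passed in, the same helper works with a real network test or with a stub like the lambda here.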

Analyzing the page

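Next we need the price and stock for each sku of a product. Here is how I analyzed it; if you know a better way, leave a comment or message me. Open Chrome's developer tools, load a Taobao product page, switch to the Network tab, then right-click in the request list -> Save as HAR with content. This saves the text of all current requests and responses to a local file. Open it in an editor, search for keywords related to what you want, and work out the structure from there.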
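The keyword search can also be scripted instead of done in an editor. A HAR file is JSON with the requests under `log.entries`; the sketch below (Python 3) lists the URLs whose response body contains a keyword. The sample document is made up, and `find_in_har` is a hypothetical helper name.

```python
import json


def find_in_har(har, keyword):
    """Return the URLs of HAR entries whose response body contains keyword."""
    hits = []
    for entry in har['log']['entries']:
        body = entry['response'].get('content', {}).get('text') or ''
        if keyword in body:
            hits.append(entry['request']['url'])
    return hits


# Tiny made-up HAR document; a real one would be loaded with json.load() from the saved file.
sample_har = json.loads('''
{"log": {"entries": [
  {"request": {"url": "https://example.com/page"},
   "response": {"content": {"text": "skuMap : {}"}}},
  {"request": {"url": "https://example.com/logo.png"},
   "response": {"content": {}}}
]}}
''')
print(find_in_har(sample_har, 'skuMap'))
```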

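From this I inferred that each product attribute value has an id stored in its data-value attribute. Different attribute values combine into different "sku_key" strings, and the skuMap in the page maps each sku_key to its sku information.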

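Three steps are key here:

Extract the skuMap data from the page with a regular expression and convert it to a Python dict;
Try the orderings of the selected attribute values to find one that is an existing sku_key (done with the permutations function from the itertools module);
Map the Chinese attribute names to their attribute values, so that sku data can later be looked up from typed-in attributes.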
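The first two steps can be sketched as follows (Python 3). The page fragment and its ids are made up, the regular expression is a simplified version of the one used in this post, and the helper names are hypothetical.

```python
import json
import re
from itertools import permutations


def extract_sku_map(html):
    """Pull the skuMap object literal out of the page source and parse it as JSON."""
    match = re.search(r'skuMap\s*:\s*(\{.*\})', html)
    return json.loads(match.group(1)) if match else {}


def find_sku_key(codes, sku_map):
    """Try every ordering of the selected codes until one is a key of sku_map."""
    for combo in permutations(codes):
        key = ';' + ';'.join(combo) + ';'
        if key in sku_map:
            return key
    return None


# Made-up page fragment containing a single sku.
html = 'skuMap     : {";1627207:28320;20509:28314;":{"price":"59.00","stock":"2","skuId":"31"}}\n'
sku_map = extract_sku_map(html)
key = find_sku_key(['20509:28314', '1627207:28320'], sku_map)
print(key, sku_map[key]['price'])
```

Trying permutations is needed because the user may enter the attributes in a different order from the one the sku_key uses.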

Getting the product information

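def get_base_info(page_url):
    """
    Read the basic parameters from the product page.
    :param page_url: URL of the product page
    :return: dict with ct, nekot, item_id, prop_to_values and sku_dict
    """
    page_url = page_url.strip()
    # get_cookie returns (cookie string, host); only the cookie string is needed here
    cookie_str, _ = get_cookie(page_url)
    print u'page cookie: ', cookie_str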

    request = urllib2.Request(page_url)
    request.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8')
    request.add_header('Accept-Encoding', 'gzip, deflate')  # 'br' dropped: only gzip is decoded below
    request.add_header('Cookie', cookie_str)
    request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36')
    response = urllib2.urlopen(request)
    headers = response.info()
    content_type = headers.get('Content-Type')
    charset = 'utf-8'
    if content_type:
        # Fall back to utf-8 when the header carries no charset parameter.
        match = re.search(r"charset=([\w-]+)", content_type)
        if match:
            charset = match.group(1)
    if headers.get('Content-Encoding') == 'gzip':
        gz = gzip.GzipFile(fileobj=StringIO.StringIO(response.read()))
        content = gz.read().decode(charset)
        gz.close()
    else:
        content = response.read().decode(charset)
    with open(u'content.html', 'w') as page_file:
        page_file.write(content.encode(print_charset))  # dump the page for inspection
    # Extract item_id; skip the leading '?' so that id may also be the first parameter
    item_id = re.findall(r'(^|&)id=([^&]*)(&|$)', page_url[page_url.find('?') + 1:])[0][1]
    # Extract sku_dict
    sku_map = json.loads(re.findall(r'skuMap\s*:\s*(\{.*\})\n\r?', content)[0])
    sku_dict = {}
    for k, v in sku_map.items():
        sku_dict[k] = {
            'price': v['price'],  # non-promotional price
            'stock': v['stock'],  # stock
            'sku_id': v['skuId']  # skuId
        }
    ct = re.findall(r'"ct"\s*:\s*"(\w*)",', content)[0]
    timestamp = int(time.time())
    doc = pq(content)
    # ========== collect each property and its set of values: start ==========
    prop_to_values = {}
    for prop in doc('.J_Prop').items():
        values = []
        for v in prop.find('li').items():
            values.append({
                'name': v.children('a').text(),
                'code': v.attr('data-value')
            })
        prop_to_values[prop.find('.tb-property-type').text()] = values
    # ========== end ==========
    return {
        'ct': ct,
        'nekot': timestamp,
        'item_id': item_id,
        'prop_to_values': prop_to_values,
        'sku_dict': sku_dict
    }

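The function is called get_base_info because, after looking at several cases, I found that the price and stock in the page are not the figures Taobao finally displays. The real data comes from https://detailskip.taobao.com/service/getData/1/p1/item/detail/sib.htm, whose response includes stock and promotion information.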
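Fetching that endpoint needs the same gzip and charset handling that get_base_info already does inline, so the repeated block could be factored into one helper. The following is only a sketch in Python 3 (the post's code is Python 2.7), and `decode_body` is a hypothetical name. It falls back to a default charset when the Content-Type header carries no charset parameter.

```python
import gzip
import re


def decode_body(raw, content_type=None, content_encoding=None, default='utf-8'):
    """Decompress (gzip only) and decode a response body to text."""
    if content_encoding == 'gzip':
        raw = gzip.decompress(raw)
    match = re.search(r'charset=([\w-]+)', content_type or '')
    charset = match.group(1) if match else default
    return raw.decode(charset)


# Round-trip a GBK-encoded, gzip-compressed body.
body = gzip.compress('价格'.encode('gbk'))
print(decode_body(body, 'text/html; charset=gbk', 'gzip'))
```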

Getting the information for sellable items:

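def get_sale_info(sib_url):
    """
    Get the information for the skus that can be sold.
    :param sib_url: full sib.htm request URL
    :return: dict of sellable skus keyed by sku_key
    """
    cookie_str, host = get_cookie(sib_url)
    print u'sale cookie: ', cookie_str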

    request = urllib2.Request(sib_url)
    request.add_header('Accept', '*/*')
    request.add_header('Accept-Encoding', 'gzip, deflate')  # 'br' dropped: only gzip is decoded below
    request.add_header('Accept-Language', 'zh-CN,zh;q=0.8,en;q=0.6')
    request.add_header('Connection', 'keep-alive')
    request.add_header('Cookie', cookie_str)
    request.add_header('Host', host)
    request.add_header('Referer', 'https://item.taobao.com')
    request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36')
    response = urllib2.urlopen(request)

    headers = response.info()

    # Work out the charset, falling back to utf-8 when none is declared
    content_type = headers.get('Content-Type')
    charset = 'utf-8'
    if content_type:
        match = re.search(r"charset=([\w-]+)", content_type)
        if match:
            charset = match.group(1)

    if headers.get('Content-Encoding') == 'gzip':
        gz = gzip.GzipFile(fileobj=StringIO.StringIO(response.read()))
        content = gz.read().decode(charset)
        gz.close()
    else:
        content = response.read().decode(charset)
    json_obj = json.loads(content)
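    sellable_sku = json_obj['data']['dynStock']['sku']  # skus that can be sold
    promo_sku = json_obj['data']['promotion']['promoData']  # skus under promotion
    if promo_sku:
        for k, v in sellable_sku.items():
            promos = promo_sku.get(k)
            if not promos:
                continue  # no promotion for this sku; check_item falls back to the page price
            if len(promos) > 1:
                print u'Several promotional prices found; please confirm manually'
            price = min([float(x['price']) for x in promos])
            v['price'] = price
            # TODO amountRestriction: purchase limit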
    return sellable_sku

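PS: I only recently noticed that there is also a purchase-quantity limit. It does not affect adding to the cart later, so it stays as a TODO for now and will be handled in a later update.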
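Setting the purchase limit aside, the price-merging step of get_sale_info can be isolated and tried on sample data. The sketch below (Python 3) assumes the same JSON shape that the function reads from sib.htm; the sku keys and prices are made up, and `apply_promo_prices` is a hypothetical name.

```python
def apply_promo_prices(sellable_sku, promo_sku):
    """Overwrite each sellable sku's price with its lowest promotional price, if any."""
    for key, sku in sellable_sku.items():
        promos = promo_sku.get(key) or []
        if promos:
            sku['price'] = min(float(p['price']) for p in promos)
    return sellable_sku


# Made-up data: one sku with two promotional prices, one with none.
sellable = {';1:2;': {'stock': 5}, ';1:3;': {'stock': 0}}
promos = {';1:2;': [{'price': '49.00'}, {'price': '45.50'}]}
result = apply_promo_prices(sellable, promos)
print(result)
```

Skus without a promotion are left untouched, so the caller can still fall back to the price taken from the page.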

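Getting the unit price and stock from user input

Finally, read the address and attributes from the terminal and fetch the product information to see whether it all works.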

def check_item():
    page_url = raw_input(u'Enter the product URL: '.encode(print_charset)).decode(print_charset)
    base_info = get_base_info(page_url)
    item_id = base_info['item_id']
    modules = ['dynStock', 'qrcode', 'viewer', 'price', 'duty', 'xmpPromotion', 'delivery', 'activity',
               'fqg', 'zjys', 'couponActivity', 'soldQuantity', 'contract', 'tradeContract']
    ajax_url = 'https://detailskip.taobao.com/service/getData/1/p1/item/detail/sib.htm?itemId=%s&modules=%s' % (item_id, ','.join(modules))
    print 'request -> ', ajax_url
    sku_dict = get_sale_info(ajax_url)
    prop_to_values = base_info['prop_to_values']
    base_info_sku_dict = base_info['sku_dict']
    info = {
        'ct': base_info['ct'],
        'nekot': base_info['nekot'],
        'item_id': base_info['item_id'],
        'prop_to_values': prop_to_values,
    }
    for k, v in sku_dict.items():
        base_sku = base_info_sku_dict.get(k, {})  # tolerate skus missing from the page's skuMap
        v['sku_id'] = base_sku.get('sku_id')
        if 'price' not in v:
            v['price'] = base_sku.get('price')
    info['sku_dict'] = sku_dict
    with open(u'sku.json', 'w') as sku_file:
        sku_file.write(json.dumps(info, ensure_ascii=False).encode('utf-8'))
    code_list = []
    item_prop = raw_input(u'Enter a property name (e.g. 顏色分類; press Enter to finish): '.encode(print_charset)).decode(print_charset)
    while item_prop:
        if item_prop in prop_to_values:
            prop_values = prop_to_values[item_prop]
            sku_value = raw_input(u'Enter the property value (e.g. 可愛粉): '.encode(print_charset)).decode(print_charset)
            for v in prop_values:
                if v.get('name') == sku_value:
                    code_list.append(v.get('code'))
                    break
            else:
                print u'No such property value'
        else:
            print u'No such property'
        item_prop = raw_input(u'Enter a property name (e.g. 顏色分類; press Enter to finish): '.encode(print_charset)).decode(print_charset)

    sku_id = None
    price = None
    stock = None
    # Try every ordering of the chosen codes until one matches an existing sku_key
    for x in permutations(code_list):
        sku_key = ';' + ';'.join(map(str, x)) + ';'
        if sku_key in sku_dict:
            sku_id = sku_dict[sku_key]['sku_id']
            price = sku_dict[sku_key]['price']
            stock = sku_dict[sku_key]['stock']
            print u'%s\t%s' % (u'sku_key:', sku_key)
            print u'%s\t%s' % (u'sku_id:', sku_id)
            print u'%s\t%s' % (u'unit price:', price)
            print u'%s\t%s' % (u'stock:', stock)
            break
    else:
        print u'No such variant.'

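Adding to the cart

If the output above looks right, we can move on to the next step: adding to the cart. Add an item to the cart yourself in the browser and watch the requests. The request goes to cart.taobao.com/add_cart_item.htm, and the key parameters are: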

item_id: product ID (in the page's g_config)
outer_id: skuId (in skuMap)
outer_id_type: type of outer_id; 2 means outer_id carries a skuId
quantity: number of items to add to the cart
opp: unit price
_tb_token_: can be read from the cookie
ct: (in g_config.vdata.viewer)
deliveryCityCode: shipping city (from sib.htm; use the default or leave empty for now)
nekot: current timestamp
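As a rough illustration of how these parameters fit together, the sketch below (Python 3) reads `_tb_token_` out of a cookie string and assembles the request URL. All values are made up, the helper names are hypothetical, and whether Taobao still accepts this request is not something the sketch can confirm.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def tb_token_from_cookie(cookie_str):
    """Pick the _tb_token_ value out of a 'name=value; name=value' cookie string."""
    for pair in cookie_str.split('; '):
        name, _, value = pair.partition('=')
        if name == '_tb_token_':
            return value
    return None


def build_add_cart_url(item_id, sku_id, quantity, price, ct, nekot, tb_token):
    """Assemble the add_cart_item.htm URL from the key parameters."""
    params = {
        'item_id': item_id,
        'outer_id': sku_id,
        'outer_id_type': '2',    # outer_id carries a skuId
        'quantity': quantity,
        'opp': price,
        'nekot': nekot,
        'ct': ct,
        '_tb_token_': tb_token,
        'deliveryCityCode': '',  # left empty, as the post suggests
    }
    return 'https://cart.taobao.com/add_cart_item.htm?' + urlencode(params)


# All values below are made up.
token = tb_token_from_cookie('t=abc; _tb_token_=e3b7; cna=xyz')
url = build_add_cart_url('5500', '31', 2, '59.00', 'ctvalue', 1497900000, token)
print(parse_qs(urlsplit(url).query)['_tb_token_'])
```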

With everything we need in hand, the rest is straightforward:

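def add_item_to_cart(item_id, sku_id, price, ct, nekot, tb_token, stock):
    """
    Add an item to the cart.
    :param item_id: product ID
    :param sku_id: skuId of the chosen sku
    :param price: unit price
    :param ct: ct value taken from the page
    :param nekot: current timestamp
    :param tb_token: _tb_token_ value from the cookie
    :param stock: available stock of the chosen sku
    :return:
    """
    add_cart_url = 'https://cart.taobao.com/add_cart_item.htm?'
    cookie_str, host = get_cookie(add_cart_url)
    print u'cookie string: ', cookie_str

    quantity = raw_input(u'Enter the quantity to add to the cart: '.encode(print_charset))
    # Stop here on bad input instead of sending the request anyway
    if not quantity.isdigit():
        print u'Invalid input'
        return
    quantity = int(quantity)
    # stock may arrive as a string, so compare as integers
    if quantity > int(stock):
        print u'Exceeds the available stock', stock
        return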

    params = {
        'item_id': item_id,
        'outer_id': sku_id,
        'outer_id_type': '2',
        'quantity': quantity,
        'opp': price,
        'nekot': nekot,
        'ct': ct,
        '_tb_token_': tb_token,
        'deliveryCityCode': '',
        'frm': '',
        'buyer_from': '',
        'item_url_refer': '',
        'root_refer': '',
        'flushingPictureServiceId': '',
        'spm': '',
        'ybhpss': ''
    }
    request = urllib2.Request(add_cart_url + urllib.urlencode(params))
    request.add_header('Accept', '*/*')
    request.add_header('Accept-Encoding', 'gzip, deflate, br')
    request.add_header('Accept-Language', 'zh-CN,zh;q=0.8,en;q=0.6')
    request.add_header('Connection', 'keep-alive')
    request.add_header('Cookie', cookie_str)
    request.add_header('Host', host)
    request.add_header('Referer', 'https://item.taobao.com/item.htm')
    request.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36')
    response = urllib2.urlopen(request)
    headers = response.info()
    content_type = headers.get('Content-Type')
    charset = 'utf-8'
    if content_type:
        # Fall back to utf-8 when the header carries no charset parameter.
        match = re.search(r"charset=([\w-]+)", content_type)
        if match:
            charset = match.group(1)
    http_code = response.getcode()
    if http_code == 200:
        print u'Added to the cart'

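Note that an HTTP status of 200 does not necessarily mean the item was added: the server also returns 200 when you are not logged in. I will refine this after looking into it further.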

————————————————

That's all for now. The complete code is here →→ Python淘寶爬蟲程序 (Python Taobao crawler). Corrections are always welcome.

————————————————

The project was last updated on June 20, 2017; changes on Taobao's side may since have broken the code.