Python Web Crawler 2 - Scraping Sina Weibo User Images

Date: 2019-11-16

Tags: python, web crawler, Sina Weibo, user images. Category: Python


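This post was first published at www.litreily.top

In fact, the Sina Weibo user image crawler was the first crawler I wrote after starting to learn Python. I was lazy at the time, though, and only after crawling Lofter did I feel the need to write things up, which is how my first crawler post came about. Now that I have some free time, I'm going to fill in the Sina one as well.

Back to the point. Choosing to crawl Sina Weibo naturally comes from a need, which is also one of the main motivations for learning. That's right: pretty pictures. Most posts by Sina users contain images, and multi-image posts (group images) are more common than single-image ones.

To avoid infringing anyone's rights, this post uses my own Weibo account, litreily, to walk through the whole process. It has few images and they are low quality, but the crawling scheme works; to use it, just swap in another user ID.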

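Analyzing the Sina site

Getting the user ID

Before crawling, we need to know that every user has a user name, and each user name corresponds to a unique integer ID, much like a student number. Mine is 2657006573. There are two ways to get the ID from a user name:

Open the target user's home page; the string of digits in the browser's address bar is the user ID
Press Ctrl-U to view the source of the target user's page and search for "uid (note the leading double quote)

In fact, a crawler can obtain the uid automatically from a known user name, but I was new to Python at the time and didn't think it through, so the source code below takes the user ID as its input parameter.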

Parsing the image storage parameters

All of a user's images are stored under a path like this. Really, all of them!

https://weibo.cn/{uid}/profile?filter={filter_mode}&page={page_num}

# example
https://weibo.cn/2657006573/profile?filter=0&page=1
uid: 2657006573
filter_mode: 0
page_num: 1

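Note that it is weibo.cn, not weibo.com. As for how I found this page, honestly, I've forgotten...

The link contains three parameters: uid, filter_mode and page_num. uid is the user ID mentioned above, and page_num is simply the current page number, counting up from 1. So what is filter_mode?

No rush. Let's look at the page first.

As you can see, filter_mode is the filter condition, and there are three options:

filter=0 All posts (including text-only posts and reposts)
filter=1 Original posts (including text-only posts)
filter=2 Picture posts (must contain images; includes reposts)

I usually choose original posts, because I don't want the results to include images from reposts. Of course, choose whatever suits your needs.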
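
The URL format and the three filter values can be captured in a small helper. This is only a sketch: the function name build_profile_url and the FILTER_* constants are made up for illustration and do not appear in the original source.

```python
from urllib.parse import urlencode

# Filter values as listed above (constant names are illustrative).
FILTER_ALL = 0       # all posts, including text-only posts and reposts
FILTER_ORIGINAL = 1  # original posts only
FILTER_PICTURES = 2  # posts that contain pictures, including reposts


def build_profile_url(uid, filter_mode=FILTER_ORIGINAL, page_num=1):
    """Return the weibo.cn profile URL for one page of a user's posts."""
    query = urlencode({'filter': filter_mode, 'page': page_num})
    return 'https://weibo.cn/{}/profile?{}'.format(uid, query)


print(build_profile_url('2657006573', FILTER_ALL, 1))
```

Calling it with the example values reproduces the example URL shown earlier.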

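Parsing image links

Now that we know where the parameters come from, let's look at the page again. Doesn't it feel like an empty shell, with hardly a trace of CSS? No matter; Sina never intended to show this page to users directly. For a crawler, though, it is ideal. Why? For these reasons:

The images are complete, with nothing missing; it is effectively a browsable database
Little styling and simple pages, so it saves bandwidth and crawls quickly
Static pages with pagination; what you see is what you get
The source contains the first-image link and the group-image link of every post

A page like this is ideal for practice. Pay attention to the fourth point, though: what are first-image and group-image links? Simple. A post may contain several images, which form a group; the page shows only the first image of the post, the so-called first image, and the group-image link points to the page that holds all images of that group.

Because my own Weibo has no group images, I use Liu Yifei's Weibo here to show the link formats for single images and group images.

The upper post in the screenshot has only one image, and its original-image link is easy to get. Note that it is the original image: what the page shows is a thumbnail, but what we want to crawl is of course the original.

The lower post in the screenshot contains a group of images; the group-image link is visible in Chrome DevTools to the right of the image.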

https://weibo.cn/mblog/picAll/FCQefgeAr?rl=2

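Open the group-image link and you see the images of the group, as in the screenshot:

The page shows both thumbnail links and original-image links. Let's click an original-image link and see.

The URL of the page that opens differs from the one shown in the previous screenshot, but is very similar to the thumbnail link there. The two are:

Thumbnail: http://ww1.sinaimg.cn/thumb180/c260f7ably1fn4vd7ix0qj20rs1aj1kx.jpg
Original: http://wx1.sinaimg.cn/large/c260f7ably1fn4vd7ix0qj20rs1aj1kx.jpg

As you can see, the difference that matters is just thumb180 versus large (the host prefix also differs, ww1 versus wx1, but the code later in this post leaves the host unchanged). With this pattern in hand things get easier: once we know a thumbnail's URL, we can replace the first path segment after the host name with large, without fetching the original-image link and following one more redirect.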
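
As a minimal sketch of that substitution, the helper below swaps the first path segment for large using the standard library's URL parsing. The name to_large_url is hypothetical; the original source does the same job with a plain str.replace on the regex match.

```python
from urllib.parse import urlsplit, urlunsplit


def to_large_url(thumb_url):
    """Swap the first path segment (e.g. thumb180) for 'large'."""
    parts = urlsplit(thumb_url)
    segments = parts.path.split('/')   # ['', 'thumb180', '<file name>']
    segments[1] = 'large'
    return urlunsplit(parts._replace(path='/'.join(segments)))


thumb = 'http://ww1.sinaimg.cn/thumb180/c260f7ably1fn4vd7ix0qj20rs1aj1kx.jpg'
print(to_large_url(thumb))
```

Only the path changes; the host of the thumbnail URL is kept as it is.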

Moreover, after a few tries you find that group-image links and thumbnail links match these regular expressions:

# 1. group-image link
imglist_reg = r'href="(https://weibo.cn/mblog/picAll/.{9}\?rl=2)"'

# 2. thumbnail
img_reg = r'src="(http://w.{2}\.sinaimg.cn/(.{6,8})/.{32,33}.(jpg|gif))"'

That completes the analysis of Sina Weibo: the format of the image links and how to obtain them are now clear. Next we can design a crawling scheme.
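
To see what the two regular expressions capture, they can be run against a small piece of markup. The sample string below is hypothetical, written only to resemble the links quoted in this post; real weibo.cn markup may differ.

```python
import re

imglist_pattern = re.compile(r'href="(https://weibo.cn/mblog/picAll/.{9}\?rl=2)"')
img_pattern = re.compile(r'src="(http://w.{2}\.sinaimg.cn/(.{6,8})/.{32,33}.(jpg|gif))"')

# Hypothetical fragment built from the example links in this post.
sample = (
    '<a href="https://weibo.cn/mblog/picAll/FCQefgeAr?rl=2">group</a>'
    '<img src="http://ww1.sinaimg.cn/thumb180/'
    'c260f7ably1fn4vd7ix0qj20rs1aj1kx.jpg" alt="pic" />'
)

group_links = imglist_pattern.findall(sample)   # list of group-image URLs
images = img_pattern.findall(sample)            # list of (url, size, ext)

for url, size, ext in images:
    # The second group is the size segment that gets replaced by 'large'.
    print(size, ext, url.replace(size, 'large'))
```

Each match of the thumbnail pattern is a tuple of the full URL, the size segment and the file extension, which is why the download code indexes img[0], img[1] and img[2].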

Deciding on the crawling scheme

From the analysis, the following scheme follows readily:

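1. Start from the Weibo user name litreily
2. Open the target user's home page and read the uid from the URL: 2657006573
3. Get the cookies of your own logged-in Weibo session (the requests need them)
4. Crawl https://weibo.cn/2657006573/profile?filter=0&page={1,2,3,...} page by page
5. Parse the source of each page to get single-image links and group-image links:

   Single image: take the thumbnail link directly;
   Group images: fetch the group-image link and collect the thumbnail links of all images on that page

6. Convert each link from step 5 into an original-image link and download it locally
7. Repeat steps 4-6 until no images are left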

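Getting cookies

Several points in this scheme deserve attention. The first is obtaining cookies. I haven't yet learned how to obtain cookies automatically, so for now I log in to Weibo and copy them by hand from the request headers shown in the browser's Network panel.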
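
One way to keep the hand-copied cookie out of the script is to read it from the environment and build the headers from it. This is a sketch under assumptions: build_headers and the WEIBO_COOKIE variable name are invented for illustration, while the User-Agent string is the one used in the source code at the end of this post.

```python
import os

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/63.0.3239.132 Safari/537.36')


def build_headers(cookie):
    """Build request headers from a cookie string copied from the browser."""
    cookie = cookie.strip()
    if not cookie:
        raise ValueError('cookie string is empty; log in and copy it first')
    return {'User-Agent': USER_AGENT, 'Cookie': cookie}


# WEIBO_COOKIE is a hypothetical environment variable name.
cookie = os.environ.get('WEIBO_COOKIE', '')
if cookie:
    headers = build_headers(cookie)
    print(sorted(headers))
```

Failing early on an empty cookie makes a forgotten login easier to spot than a run that silently finds no images.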

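Downloading pages

Pages are downloaded with urllib, which ships with Python 3. I hadn't learned requests at the time, and I will probably rarely use urllib from now on.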

def _get_html(url, headers):
    try:
        req = urllib.request.Request(url, headers = headers)
        page = urllib.request.urlopen(req)
        html = page.read().decode('UTF-8')
    except Exception as e:
        print("get %s failed" % url)
        return None
    return html
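
A slightly more defensive variant of the same idea is sketched below. It is not the author's function: the name fetch_html and the timeout parameter are additions for illustration. It closes the response explicitly, bounds the wait, and reports the reason for a failure.

```python
import urllib.request


def fetch_html(url, headers, timeout=10):
    """Return the decoded page, or None if the request or decoding fails."""
    try:
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req, timeout=timeout) as page:
            return page.read().decode('utf-8')
    except (OSError, UnicodeDecodeError) as e:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print('get %s failed: %s' % (url, e))
        return None
```

Callers still have to check for None before parsing, just as with _get_html.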

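Getting the storage path

I wrote the code on Windows 10 but personally prefer bash, so the image storage path comes in the two formats below. The _get_path function detects the current operating system and picks the matching path.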

def _get_path(uid):
    path = {
        'Windows': 'D:/litreily/Pictures/python/sina/' + uid,
        'Linux': '/mnt/d/litreily/Pictures/python/sina/' + uid
    }.get(platform.system())

    if not os.path.isdir(path):
        os.makedirs(path)
    return path

Fortunately Windows accepts Linux-style forward slashes; otherwise replacing relative paths in the program would be quite a hassle.
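
Note that _get_path only knows 'Windows' and 'Linux'; on any other system the dictionary lookup returns None. A platform-independent alternative can be sketched with pathlib. The function name get_save_dir and the default location under the home directory are assumptions, not part of the original code.

```python
import tempfile
from pathlib import Path


def get_save_dir(uid, base=None):
    """Create (if needed) and return the directory for one user's images."""
    if base is None:
        # Assumed default: ~/Pictures/sina under the current user's home.
        base = Path.home() / 'Pictures' / 'sina'
    save_dir = Path(base) / uid
    save_dir.mkdir(parents=True, exist_ok=True)
    return save_dir


# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = get_save_dir('2657006573', base=tmp)
    print(demo_dir.is_dir())
```

pathlib joins path components with the separator of the running system, so no per-platform table is needed.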

Downloading images

Since I chose urllib, images are downloaded with urllib.request.urlretrieve.

# image url of one page is saved in imgurls
for img in imgurls:
    imgurl = img[0].replace(img[1], 'large')
    num_imgs += 1
    try:
        urllib.request.urlretrieve(imgurl, '{}/{}.{}'.format(path, num_imgs, img[2]))
        # display the raw url of images
        print('\t%d\t%s' % (num_imgs, imgurl))
    except Exception as e:
        print(str(e))
        print('\t%d\t%s failed' % (num_imgs, imgurl))
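
The naming logic of this loop can be separated from the network call so that it can be checked without downloading anything. The sketch below assumes a hypothetical helper plan_downloads; each returned pair could then be passed to urllib.request.urlretrieve.

```python
import os


def plan_downloads(imgurls, path, start=0):
    """Map regex matches to (original_url, local_file) pairs.

    Each item of imgurls is a (thumbnail_url, size_segment, extension)
    tuple, the shape produced by img_pattern.findall in this post.
    """
    jobs = []
    for offset, (thumb_url, size, ext) in enumerate(imgurls, start=1):
        # Replace only the '/size/' path segment, and only once.
        large_url = thumb_url.replace('/%s/' % size, '/large/', 1)
        filename = os.path.join(path, '%d.%s' % (start + offset, ext))
        jobs.append((large_url, filename))
    return jobs


matches = [('http://ww1.sinaimg.cn/thumb180/c260f7ably1fn4vd7ix0qj20rs1aj1kx.jpg',
            'thumb180', 'jpg')]
for url, filename in plan_downloads(matches, 'sina', start=3):
    print(filename, url)
```

Here start plays the role of num_imgs: files continue the numbering from the images already saved.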

Source code

See the source code for the remaining details.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
# author: litreily
# date: 2018.02.05
"""Capture pictures from sina-weibo with user_id."""

import re
import os
import platform

import urllib
import urllib.request

from bs4 import BeautifulSoup


def _get_path(uid):
    path = {
        'Windows': 'D:/litreily/Pictures/python/sina/' + uid,
        'Linux': '/mnt/d/litreily/Pictures/python/sina/' + uid
    }.get(platform.system())

    if not os.path.isdir(path):
        os.makedirs(path)
    return path


def _get_html(url, headers):
    try:
        req = urllib.request.Request(url, headers = headers)
        page = urllib.request.urlopen(req)
        html = page.read().decode('UTF-8')
    except Exception as e:
        print("get %s failed" % url)
        return None
    return html


def _capture_images(uid, headers, path):
    filter_mode = 1      # 0-all 1-original 2-pictures
    num_pages = 1
    num_blogs = 0
    num_imgs = 0

    # regular expression of imgList and img
    imglist_reg = r'href="(https://weibo.cn/mblog/picAll/.{9}\?rl=2)"'
    imglist_pattern = re.compile(imglist_reg)
    img_reg = r'src="(http://w.{2}\.sinaimg.cn/(.{6,8})/.{32,33}.(jpg|gif))"'
    img_pattern = re.compile(img_reg)
    
    print('start capture picture of uid:' + uid)
    while True:
        url = 'https://weibo.cn/%s/profile?filter=%s&page=%d' % (uid, filter_mode, num_pages)

        # 1. get html of each page url
        html = _get_html(url, headers)
        if html is None:
            # request failed; stop rather than pass None to BeautifulSoup
            break

        # 2. parse the html and find all the imgList Url of each page
        soup = BeautifulSoup(html, "html.parser")
        # <div class="c" id="M_G4gb5pY8t"><div>
        blogs = soup.body.find_all(attrs={'id':re.compile(r'^M_')}, recursive=False)
        num_blogs += len(blogs)

        imgurls = []        
        for blog in blogs:
            blog = str(blog)
            imglist_url = imglist_pattern.findall(blog)
            if not imglist_url:
                # 2.1 get img-url from blog that have only one pic
                imgurls += img_pattern.findall(blog)
            else:
                # 2.2 get img-urls from blog that have group pics
                imglist_html = _get_html(imglist_url[0], headers)
                if imglist_html:
                    # skip the group if its page could not be fetched
                    imgurls += img_pattern.findall(imglist_html)

        if not imgurls:
            print('capture complete!')
            print('captured pages:%d, blogs:%d, imgs:%d' % (num_pages, num_blogs, num_imgs))
            print('directory:' + path)
            break

        # 3. download all the imgs from each imgList
        print('PAGE %d with %d images' % (num_pages, len(imgurls)))
        for img in imgurls:
            imgurl = img[0].replace(img[1], 'large')
            num_imgs += 1
            try:
                urllib.request.urlretrieve(imgurl, '{}/{}.{}'.format(path, num_imgs, img[2]))
                # display the raw url of images
                print('\t%d\t%s' % (num_imgs, imgurl))
            except Exception as e:
                print(str(e))
                print('\t%d\t%s failed' % (num_imgs, imgurl))
        num_pages += 1
        print('')


def main():
    # uids = ['2657006573','2173752092','3261134763','2174219060']
    uid = '2657006573'
    path = _get_path(uid)

    # cookie is from the above url->network->request headers
    cookies = ''
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
            'Cookie': cookies}

    # capture imgs from sina
    _capture_images(uid, headers, path)


if __name__ == '__main__':
    main()
