[Python Web Scraping by Example] Part 5: [Detailed Walkthrough] From Scraping Weibo Comments (No Login) to Generating a Word Cloud

Date: 2020-02-01


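The novel coronavirus has recently drawn attention across the whole country. People have voiced very different opinions about this highly contagious virus, and I wanted to use data to see what most of them think.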

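Highlights:

(1) The link to a Weibo post's comment details is a JS script
(2) Getting the link to that JS script requires the post's mid parameter
(3) Getting the mid parameter requires visiting the Weibo home page
(4) Visiting the Weibo home page requires passing visitor authentication first
(5) The Weibo home page is built almost entirely from popups; all of the HTML is hidden in the arguments of the FM.view() function, and those arguments are in JSON format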

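Tools: Chrome

Python 3.6
requests library
json library (standard library)
lxml library
urllib library (standard library)
jieba library (word segmentation)
WordCloud library (word cloud generation)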


#### 1. Scraping Weibo comment data

We take the first pinned post on the official CCTV News (央視新聞) Weibo account as the example and scrape its comments.

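(1) Finding the comment page

Step 1: Find the comment page

First use Ctrl+Shift+C to select the comment tab and inspect its HTML. Its link turns out to be a JS script, so we try Fiddler to see whether we can capture the packet for that script and obtain its address.

We find a packet that looks like the JS response, decode its data, and confirm that it is the one we are looking for.

We can also determine the link that "View more" jumps to and search for it in the parsed JSON, which further confirms that this is the right packet. Next we need to work out where this JS packet comes from.

Step 2: Find the address of the JS packet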

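The address of the JS packet is "https://weibo.com/aj/v6/comme...". The link is long and carries many parameters, so, based on past experience, we try removing parameters and testing the request.
Testing shows that the ==mid== parameter alone is enough to get the packet.
So the effective address of the JS packet is "https://weibo.com/aj/v6/comme...".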
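The parameter trimming described above can be sketched with the standard library. This is only an illustration: the helper name `keep_params` is made up, and the extra query parameters in the sample URL are placeholders rather than the ones Fiddler actually showed.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def keep_params(url, wanted):
    """Return `url` with every query parameter dropped except those in `wanted`."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k in wanted]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))


# Hypothetical captured URL: `foo` and `__rnd` are placeholders for illustration.
captured = ("https://weibo.com/aj/v6/comment/small"
            "?ajwvr=6&mid=4465267293291962&foo=bar&__rnd=123")
trimmed = keep_params(captured, {"mid"})
print(trimmed)
```

Requesting the trimmed URL and comparing the response with the original one is how you confirm that a parameter really is optional.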

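The next job is to find the value of the mid parameter (presumably the post's unique ID). Searching Fiddler for "mid=4465267293291962" shows that on the CCTV News home page every post carries its own mid.

Select any post with Ctrl+Shift+C and you will see a "mid" attribute that holds the mid value.

Testing the XPath with XPath Helper works fine, so next we implement this part in Python. ==(Later testing showed that fetching the comment data only requires the post's mid, so the next few steps can be skipped; I have left them here to keep the record of the exploration complete.)==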
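To make the idea of "every post's div carries a mid attribute" concrete, here is a minimal sketch that collects those attributes from an HTML fragment. It deliberately uses the standard library's `html.parser` instead of lxml so that it runs without extra dependencies; the class name `MidCollector` and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class MidCollector(HTMLParser):
    """Collect the `mid` attribute of every <div> that has one."""

    def __init__(self):
        super().__init__()
        self.mids = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        for name, value in attrs:
            if name == "mid" and value:
                self.mids.append(value)


# Made-up fragment; real Weibo markup is far larger.
sample = ('<div class="feed" mid="4465267293291962"><p>post</p></div>'
          '<div>no mid here</div>')
collector = MidCollector()
collector.feed(sample)
print(collector.mids)
```

With lxml, the equivalent query is the `//div/@mid` XPath expression used in the article's code.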

Code to get the comment page:

import requests
from lxml import etree

# Certificate verification is disabled below, so silence the resulting warnings
requests.packages.urllib3.disable_warnings()

name="cctvxinwen"
home_url='https://weibo.com/'+name
headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36"
}

session=requests.session()
# First attempt: request the home page directly, without any visitor cookie
res=session.get(home_url,headers=headers,timeout=20,verify=False)
html=etree.HTML(res.content.decode('gb2312',"ignore"))
# Collect the mid attribute of every div
mid_list=html.xpath("//div/@mid")

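I expected to get the news home page easily, but without a cookie the request is automatically redirected to Weibo's login verification page.

Capturing the traffic again with Chrome's incognito mode and Fiddler shows Weibo's verification flow clearly. The most important packet is number 11. Of the others, packet 6 provides the parameter parsing, packet 9 provides the values of the ==s== and ==sp== parameters, and packet 8 provides the values of the parameters used in packet 9's request URL.

Our crawler only needs to start from packet 8. The data that packet 8 submits is a constant: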

cb=gen_callback&fp={"os":"1","browser":"Chrome80,0,3970,5","fonts":"undefined","screenInfo":"1920*1080*24","plugins":"Portable Document Format::internal-pdf-viewer::Chrome PDF Plugin|::mhjfbmdgcfjbbpaeojofohoefgiehjai::Chrome PDF Viewer|::internal-nacl-plugin::Native Client"}

Note that with the requests library you need to add
=='Content-Type': 'application/x-www-form-urlencoded'== to the request headers; otherwise the response is empty.
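Rather than pasting the percent-encoded body by hand, it can be rebuilt from the readable fingerprint shown above. The following is a sketch under that assumption; the function name `build_visitor_body` is made up, and `safe="*"` is chosen only so that `*` stays unescaped as it does in the captured body.

```python
import json
from urllib.parse import quote, urlencode

# The fingerprint fields are copied from the captured packet 8 data.
fingerprint = {
    "os": "1",
    "browser": "Chrome80,0,3970,5",
    "fonts": "undefined",
    "screenInfo": "1920*1080*24",
    "plugins": ("Portable Document Format::internal-pdf-viewer::Chrome PDF Plugin"
                "|::mhjfbmdgcfjbbpaeojofohoefgiehjai::Chrome PDF Viewer"
                "|::internal-nacl-plugin::Native Client"),
}


def build_visitor_body(fp):
    """Serialize `fp` as compact JSON and form-encode it together with the callback name."""
    fp_json = json.dumps(fp, separators=(",", ":"))
    # quote_via=quote encodes spaces as %20 instead of "+"
    return urlencode({"cb": "gen_callback", "fp": fp_json},
                     quote_via=quote, safe="*")


body = build_visitor_body(fingerprint)
form_headers = {"Content-Type": "application/x-www-form-urlencoded"}
print(body[:60])
```

The resulting string would be passed as `data=body` together with `form_headers` merged into the request headers.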

The link for packet 9 is:
"https://passport.weibo.com/vi..."; its _rand parameter can be omitted. Note that the tid parameter must be URL-encoded when it is added.
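A short sketch of why the encoding matters: characters such as `+` and `=` have a special meaning inside a query string, so a tid that contains them must be percent-encoded before it is appended. The helper name and the sample tid are invented; the query parameters match the ones used in the article's code.

```python
from urllib.parse import quote

VISITOR_URL = "https://passport.weibo.com/visitor/visitor"


def build_incarnate_url(tid):
    """Build the packet 9 URL with a percent-encoded tid."""
    return (VISITOR_URL + "?a=incarnate&t=" + quote(tid)
            + "&w=2&c=095&gc=&cb=cross_domain&from=weibo")


fake_tid = "abc+def="  # made-up value for illustration
print(build_incarnate_url(fake_tid))
```

Without `quote`, a `+` in the tid would be read by the server as a space and the visitor check would see a different value.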
Once packet 9 has been requested, the CCTV News Weibo home page can be fetched:

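Step 3: Get the comment page link
Something interesting happened here. When I debugged in the browser with XPath Helper, I could get the attribute values very easily, but as soon as I applied the same XPath in Python, the result was always empty. After some debugging, the cause turned out to be Weibo's =="EU privacy popup"==. All the data we need is hidden inside this popup: the whole page is rendered by calls to FM.view(), and the page's HTML is hidden in the JSON-formatted argument of FM.view().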
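The extraction of HTML from an FM.view() call can be sketched as follows. This is not the author's method (the article's code slices a fixed script index and fixed offsets); it is a slightly more defensive variant that scans every script for a payload whose `html` field contains a marker. The function name, the `ns` field, and the sample scripts are assumptions made for illustration.

```python
import json
import re

# Capture the JSON object passed to FM.view(...)
FM_VIEW = re.compile(r"FM\.view\((\{.*\})\)", re.S)


def extract_fm_html(script_texts, marker="mid="):
    """Return the `html` field of the first FM.view payload that contains `marker`."""
    for text in script_texts:
        match = FM_VIEW.search(text or "")
        if not match:
            continue
        try:
            payload = json.loads(match.group(1))
        except ValueError:
            continue  # not valid JSON, try the next script
        html = payload.get("html", "")
        if marker in html:
            return html
    return None


# Made-up script contents standing in for the <script> elements of the page.
scripts = [
    "var config = {};",
    'FM.view({"ns":"pl.header","html":"<div>header</div>"})',
    'FM.view({"ns":"pl.feed","html":"<div mid=\\"4465267293291962\\">post</div>"})',
]
feed_html = extract_fm_html(scripts)
print(feed_html)
```

The returned fragment can then be parsed as ordinary HTML to read the `mid` attributes.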

The code for the first part is:

import requests
import json
from urllib import parse
from lxml import etree

requests.packages.urllib3.disable_warnings()
session = requests.session()
session.verify = False
# requests ignores a `timeout` attribute set on the session, so pass it with every request
timeout = 20

name = "cctvxinwen"
home_url = 'https://weibo.com/' + name
url1 = "https://passport.weibo.com/visitor/genvisitor"
urljs='https://weibo.com/aj/v6/comment/small?mid='

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36"
}

data = "cb=gen_callback&fp=%7B%22os%22%3A%221%22%2C%22browser%22%3A%22Chrome80%2C0%2C3970%2C5%22%2C%22fonts%22%3A%22undefined%22%2C%22screenInfo%22%3A%221920*1080*24%22%2C%22plugins%22%3A%22Portable%20Document%20Format%3A%3Ainternal-pdf-viewer%3A%3AChrome%20PDF%20Plugin%7C%3A%3Amhjfbmdgcfjbbpaeojofohoefgiehjai%3A%3AChrome%20PDF%20Viewer%7C%3A%3Ainternal-nacl-plugin%3A%3ANative%20Client%22%7D"

# Get tid (packet 8). The response is JSONP, so the slice strips the
# 36-character callback prefix and the trailing ");" before parsing.
headers.update({'Content-Type': 'application/x-www-form-urlencoded'})
res = session.post(url=url1, headers=headers, data=data, timeout=timeout)
tid = json.loads(res.content.decode('utf-8')[36:-2])['data']['tid']
del headers['Content-Type']

# Get the visitor cookie (packet 9); tid must be URL-encoded
url2 = "https://passport.weibo.com/visitor/visitor?a=incarnate&t=" + parse.quote(
    tid) + "&w=2&c=095&gc=&cb=cross_domain&from=weibo"
session.get(url=url2, headers=headers, timeout=timeout)

# Visit the Weibo home page and parse it to find the comment page
res = session.get(url=home_url, headers=headers, timeout=timeout)
html = etree.HTML(res.content.decode('utf-8', "ignore"))
# The HTML that contains the mids is hidden inside this JSON:
# [8:-1] strips the surrounding "FM.view(" and ")"
mid_json = json.loads(html.xpath("//script")[38].text[8:-1])
mid_html=etree.HTML(mid_json['html'])
mids=mid_html.xpath("//div/@mid")

# Get the JS packet address of the first post
urljs=urljs+str(mids[0])
res=session.get(url=urljs, headers=headers, timeout=timeout)
js_data=json.loads(res.content)['data']
js_json_html=js_data['html']
print(js_json_html)
print("Current number of comments on this post: "+str(js_data['count']))

# Parse out the address of the comment page
js_html=etree.HTML(js_json_html)
url_remark=js_html.xpath("//a[@target='_blank']/@href")[-1]
url_remark="https:"+url_remark
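The `[36:-2]` slice in the code above depends on the exact length of the JSONP wrapper. A regex-based unwrap is a little more tolerant of changes to the callback name. The sketch below assumes the response has the shape `window.gen_callback && gen_callback({...});`, which is consistent with a 36-character prefix; the sample response and its field values are invented.

```python
import json
import re

# Everything up to the first "(" is the callback expression;
# the payload runs to the last ")" before an optional ";".
JSONP = re.compile(r"^[^(]*\((.*)\)\s*;?\s*$", re.S)


def unwrap_jsonp(text):
    """Return the parsed JSON payload of a JSONP response."""
    match = JSONP.match(text.strip())
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))


# Made-up response with the assumed wrapper shape.
sample = ('window.gen_callback && gen_callback('
          '{"retcode":20000000,"msg":"succ","data":{"tid":"abc+def="}});')
print(unwrap_jsonp(sample)["data"]["tid"])
```

For this sample, `json.loads(sample[36:-2])` gives the same result, so the two approaches agree whenever the wrapper has the assumed form.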

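(2) Fetching the comments

Once we have the comment page, the JSON packet containing the comment data is easy to find, as shown in the figure:

The link for this JSON packet is "https://weibo.com/aj/v6/comme..."
==The effective link for the JSON packet is:== "https://weibo.com/aj/v6/comme...". Once we have the JSON packet we can extract the comment data ==(this packet needs no parameters other than mid, so after obtaining mid we can jump straight to this step)==.
The code for scraping the comments is as follows: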

#### 2. GetWeiBoRemark.py

import requests
import json
from urllib import parse
from lxml import etree

# Shared setup
requests.packages.urllib3.disable_warnings()
session = requests.session()
session.verify = False
# Note: requests ignores a `timeout` attribute set on the session;
# a timeout only applies when it is passed to an individual request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3970.5 Safari/537.36"
}


def getweiboremark(name, index=1, num=-1):
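    # name: name of the Weibo home page (the path after weibo.com/)
    # index: position of the target post in the list of mids found on the page
    # num: number of comments to scrape; -1 means all comments
    # Returns the list of comments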
    home_url = 'https://weibo.com/' + name
    tid = get_tid()
    get_cookie(tid=tid)
    mids = get_mids(home_url=home_url)
    remark_data = get_remarkdata(num=num, mids=mids, index=index)
    return remark_data


def get_tid():
    # Get tid
    url1 = "https://passport.weibo.com/visitor/genvisitor"
    data = "cb=gen_callback&fp=%7B%22os%22%3A%221%22%2C%22browser%22%3A%22Chrome80%2C0%2C3970%2C5%22%2C%22fonts%22%3A%22undefined%22%2C%22screenInfo%22%3A%221920*1080*24%22%2C%22plugins%22%3A%22Portable%20Document%20Format%3A%3Ainternal-pdf-viewer%3A%3AChrome%20PDF%20Plugin%7C%3A%3Amhjfbmdgcfjbbpaeojofohoefgiehjai%3A%3AChrome%20PDF%20Viewer%7C%3A%3Ainternal-nacl-plugin%3A%3ANative%20Client%22%7D"

    # Use a copy so the shared headers are not modified
    post_headers = dict(headers)
    post_headers['Content-Type'] = 'application/x-www-form-urlencoded'
    res = session.post(url=url1, headers=post_headers, data=data, timeout=20)
    # The response is JSONP: strip the 36-character callback prefix and the trailing ");"
    tid = json.loads(res.content.decode('utf-8')[36:-2])['data']['tid']
    return tid


def get_cookie(tid, session=session):
    # Get the visitor cookie; tid must be URL-encoded
    url2 = "https://passport.weibo.com/visitor/visitor?a=incarnate&t=" + parse.quote(
        tid) + "&w=2&c=095&gc=&cb=cross_domain&from=weibo"
    session.get(url=url2, headers=headers, timeout=20)


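def get_mids(home_url, session=session, retries=5):
    # Visit the Weibo home page and parse out the mids
    # The visitor cookie must be obtained before calling this
    # Retries are bounded so a persistent failure cannot recurse forever
    for _ in range(retries):
        res = session.get(url=home_url, headers=headers, timeout=20)
        try:
            html = etree.HTML(res.content.decode('utf-8', "ignore"))
            # The HTML that contains the mids is hidden inside this JSON
            mid_json = json.loads(html.xpath("//script")[38].text[8:-1])
            mid_html = etree.HTML(mid_json['html'])
            mids = mid_html.xpath("//div/@mid")
            if mids:
                return mids
        except Exception:
            # Parsing failed for this response; try again
            continue
    raise RuntimeError("could not extract any mid from " + home_url)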


def get_remarkdata(num, mids, index=1):
    # Fetch the comment data
    url_remarkdata = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={mid}&page={page}&from=singleWeiBo'
    page = 1
    remark_data_new = []
    current_num = 0
    while True:
        print("-" * 50)
        url_remarkdata_new = url_remarkdata.format(mid=str(mids[index]), page=str(page))
        print("Collecting page " + str(page) + " of comments!")
        res = session.get(url=url_remarkdata_new, headers=headers, timeout=20)
        # Parse the response once and reuse it
        remark_json = json.loads(res.content.decode(encoding='utf-8'))['data']
        remark_html = etree.HTML(remark_json['html'])
        remark_data = remark_html.xpath("//div[@class='list_con']/div[1]/text()")
        remark_num = remark_json['count']
        if page == 1:
            print("This post has " + str(remark_num) + ' comments in total!')
            if num == -1 or num > remark_num:
                num = remark_num
        page = page + 1
        for i in remark_data:
            # Comment text nodes start with a full-width colon; skip everything else
            if i[0:1] == '：':
                remark_data_new.append(i[1:].strip())
        # Stop if this page added nothing, otherwise the loop would never end
        if len(remark_data_new) == current_num:
            break
        current_num = len(remark_data_new)
        print("Collected so far: " + str(current_num) + " comments, remaining: " + str(num - current_num) + "!")
        if num <= current_num:
            break
    return remark_data_new

def save_remarkdata(name,data):
    # The with block flushes and closes the file on exit
    with open(name,'w',encoding='utf-8') as fp:
        fp.write(data)
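The paging logic in `get_remarkdata` can be studied in isolation by replacing the HTTP call with a function argument. The sketch below is a simplified stand-in, not the module's code: `collect_comments`, `fetch_page`, and the fake data are invented, and the stop conditions (limit reached, or a page that yields nothing new) are the part being illustrated.

```python
def collect_comments(fetch_page, limit=-1, max_pages=1000):
    """Collect comments page by page.

    fetch_page(page) must return (text_nodes, total_count); it stands in
    for the real HTTP request and XPath extraction.
    """
    comments = []
    total = None
    for page in range(1, max_pages + 1):
        nodes, count = fetch_page(page)
        if total is None:
            # Cap the target at the count reported by the first page
            total = count if limit == -1 else min(limit, count)
        # Keep only nodes that start with a full-width colon, minus the colon
        fresh = [n[1:].strip() for n in nodes if n.startswith("：")]
        if not fresh:
            break  # an empty page means there is nothing left to read
        comments.extend(fresh)
        if len(comments) >= total:
            break
    return comments[:total]


def fake_fetch(page):
    # Made-up pages; the reported count (10) is larger than what is available
    pages = {1: ["：first", "nickname", "：second "], 2: ["：third"]}
    return pages.get(page, []), 10


print(collect_comments(fake_fetch))
print(collect_comments(fake_fetch, limit=2))
```

Stopping on an empty page keeps the loop from spinning if the reported count is ever larger than the number of comments the pages actually return.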

Results:

#### 3. Generating the word cloud

The code is as follows:

from wordcloud import WordCloud
import matplotlib.pyplot as plt
import jieba
import GetWeiBoRemark

def producewordcloud(data,mask=None):
    # font_path must point to a font with Chinese glyphs (msyh.ttc is Microsoft YaHei on Windows)
    word_cloud=WordCloud(background_color="white",font_path='msyh.ttc',mask=mask,max_words=200,max_font_size=100,width=1000,height=860).generate(' '.join(jieba.cut(data,cut_all=False)))
    plt.figure()
    plt.imshow(word_cloud, interpolation='bilinear')
    plt.axis("off")  # hide the axes
    plt.show()

if __name__ == '__main__':
    # cctvxinwen
    remark_data=GetWeiBoRemark.getweiboremark(name="cctvxinwen",index=2,num=100)
    # Join with newlines so the last word of one comment does not run into the first word of the next
    str_remark_data='\n'.join(str(i) for i in remark_data)
    GetWeiBoRemark.save_remarkdata(name='cctvxinwen.txt',data=str_remark_data)
    producewordcloud(str_remark_data)
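A word cloud is essentially a picture of word frequencies, so it can help to look at the counts directly. The sketch below counts already-segmented tokens with the standard library; the function name, the sample tokens, and the stopword set are invented for illustration. In the real pipeline the tokens would come from `jieba.cut`, and the resulting dictionary could be given to WordCloud's `generate_from_frequencies` instead of `generate`.

```python
from collections import Counter


def word_frequencies(tokens, stopwords=frozenset(), min_length=2):
    """Count tokens, skipping stopwords, punctuation-sized tokens, and blanks."""
    counts = Counter()
    for token in tokens:
        word = token.strip()
        if len(word) >= min_length and word not in stopwords:
            counts[word] += 1
    return dict(counts)


# Made-up tokens standing in for the output of a word segmenter.
tokens = ["武漢", "加油", "，", "中國", "加油", "的", " ", "我們"]
freqs = word_frequencies(tokens, stopwords={"我們"})
print(freqs)
```

Filtering single-character tokens removes most punctuation and particles, which otherwise tend to dominate the picture.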

==Target post==

==Below is the word cloud made from 3,000 scraped comments:==