Shocked by News of Sister Ma's Divorce, I Scraped Weibo Comments Overnight and Got Hooked

Date: 2019-11-06


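The entertainment world serves up one round of gossip after another, and this time it is Wen Zhang and Ma Yili's turn. Exactly why their marriage fell apart, I don't know, I don't dare ask, and I don't dare say. Their Weibo comment sections, however, have boiled over, so let's take a look together.

Analyzing the Weibo page

First, let's look at the Weibo page and figure out where the crawler should start.

Page analysis

Go straight to the comment page of Ma Yili's Weibo post: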

weibo.com/1196235387/…

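The page looks like this:

Next, open Chrome's developer tools (F12), switch to the Network tab, and refresh the page. A request like the following shows up:

Copy that request URL and try it in Postman, as shown:

What on earth is all this???

Sure enough, it isn't that simple. Some anti-scraping measure seems to be at work, so let's reach for the usual countermeasures, starting with adding headers.

Looking further at the request headers in the Network tab, one Cookie turns out to be very long. Copy it out and add it to the request.

Call the URL again from Postman.

Nice, it now returns normal data.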
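The same step can be sketched in Python. This is only a minimal sketch using the standard library: the `build_request` helper, the User-Agent value, and the placeholder cookie are assumptions for illustration, and the cookie has to be replaced with the one copied from your own browser session. The request is only constructed here, not sent.

```python
from urllib.request import Request


def build_request(url, cookie, user_agent="Mozilla/5.0"):
    """Build (but do not send) a GET request carrying browser-like headers."""
    headers = {
        "Cookie": cookie,          # the long value copied from the Network tab
        "User-Agent": user_agent,  # assumed value, not taken from the article
    }
    return Request(url, headers=headers)


req = build_request(
    "https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4399042567665659",
    cookie="SUB=placeholder",  # placeholder, not a working cookie
)
```

Without a valid cookie the endpoint may well return the same unreadable response seen in the first Postman attempt.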

URL analysis

Now let's look at the URL we have been using:

weibo.com/aj/v6/comme…

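It has four parameters in total: ajwvr, id, from, and __rnd.

1. Trim the URL

Removing the parameters one at a time, starting from the end, shows that comments are still returned with the last two gone. So we drop those two, and the URL becomes: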

weibo.com/aj/v6/comme…
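The trimming can also be done programmatically. Below is a small sketch with `urllib.parse`; the `drop_params` helper is hypothetical, and the `from` and `__rnd` values in the sample URL are placeholders, since the article does not show the real ones.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_params(url, names):
    """Return url with the query parameters listed in names removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Placeholder values for from and __rnd; only ajwvr and id come from the article's code.
full = ("https://weibo.com/aj/v6/comment/big"
        "?ajwvr=6&id=4399042567665659&from=placeholder&__rnd=0")
short = drop_params(full, {"from", "__rnd"})
```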

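2. Add a page parameter

Looking again at the data returned, there is also a page field, as follows:

The current value is "pagenum": 1. So how do we reach a different page? Try adding a page parameter to the URL, like this: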

weibo.com/aj/v6/comme…

Sure enough, it really does return page 2. Sweet, isn't it?
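As a rough sketch of this paging scheme, the helpers below build the URL for a given page and read `pagenum` back from a response body. The function names are hypothetical, the sample body is hand-written rather than real output, and placing `page` under `data` is an assumption based on the `data.html` field used in the crawler code.

```python
import json
from urllib.parse import urlencode

BASE = "https://weibo.com/aj/v6/comment/big"


def page_url(post_id, page):
    """Build the comment URL for one page of one post."""
    query = urlencode({"ajwvr": 6, "id": post_id, "page": page})
    return f"{BASE}?{query}"


def current_page(body):
    """Read pagenum from a JSON response body (assumed to sit under data.page)."""
    return json.loads(body)["data"]["page"]["pagenum"]


url = page_url("4399042567665659", 2)
sample = '{"data": {"html": "", "page": {"pagenum": 2}}}'  # hand-written sample
```

Comparing the requested page with the `pagenum` that comes back is a cheap way to confirm the parameter was honored.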

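That basically wraps up the page analysis. Next comes grabbing the data.

Fetching and saving the data

Fetching and saving the data is fairly routine, so let's go straight to the code.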

import requests
import json
from bs4 import BeautifulSoup
import pandas as pd
import time


Headers = {'Cookie': 'SINAGLOBAL=4979979695709.662.1540896279940; SUB=_2AkMrYbTuf8PxqwJRmPkVyG_nb45wwwHEieKdPUU1JRMxHRl-yT83qnI9tRB6AOGaAcavhZVIZBiCoxtgPDNVspj9jtju; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9W5d4hHnVEbZCn4G2L775Qe1; _s_tentry=-; Apache=1711120851984.973.1564019682028; ULV=1564019682040:7:2:1:1711120851984.973.1564019682028:1563525180101; login_sid_t=8e1b73050dedb94d4996a67f8d74e464; cross_origin_proto=SSL; Ugrow-G0=140ad66ad7317901fc818d7fd7743564; YF-V5-G0=95d69db6bf5dfdb71f82a9b7f3eb261a; WBStorage=edfd723f2928ec64|undefined; UOR=bbs.51testing.com,widget.weibo.com,www.baidu.com; wb_view_log=1366*7681; WBtopGlobal_register_version=307744aa77dd5677; YF-Page-G0=580fe01acc9791e17cca20c5fa377d00|1564363890|1564363890'}


def mayili(page):
    """Collect comment texts from the first `page` pages of Ma Yili's post."""
    comments = []
    for i in range(0, page):
        print("page: ", i)
        url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4399042567665659&page=%s' % int(i)
        req = requests.get(url, headers=Headers).text
        # The endpoint returns JSON; data.html holds the rendered comment list.
        html = json.loads(req)['data']['html']
        content = BeautifulSoup(html, "html.parser")
        comment_text = content.find_all('div', attrs={'class': 'WB_text'})
        for c in comment_text:
            # Each entry reads "user：comment". Split once on the full-width colon
            # so colons inside the comment survive, and skip entries without one.
            parts = c.text.split("：", 1)
            if len(parts) == 2:
                comments.append(parts[1])
        time.sleep(5)  # pause between pages to avoid being flagged

    return comments


def wenzhang(page):
    """Collect comment texts from the first `page` pages of Wen Zhang's post."""
    comments = []
    for i in range(0, page):
        print("page: ", i)
        url = 'https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4399042089738682&page=%s' % int(i)
        req = requests.get(url, headers=Headers).text
        # The endpoint returns JSON; data.html holds the rendered comment list.
        html = json.loads(req)['data']['html']
        content = BeautifulSoup(html, "html.parser")
        comment_text = content.find_all('div', attrs={'class': 'WB_text'})
        for c in comment_text:
            # Each entry reads "user：comment". Split once on the full-width colon
            # so colons inside the comment survive, and skip entries without one.
            parts = c.text.split("：", 1)
            if len(parts) == 2:
                comments.append(parts[1])
        time.sleep(5)  # pause between pages to avoid being flagged

    return comments


if __name__ == '__main__':
    print("start")
    ma_comment = mayili(1001)
    mayili_pd = pd.DataFrame(columns=['mayili_comment'], data=ma_comment)
    mayili_pd.to_csv('mayili.csv', encoding='utf-8')

    wen_comment = wenzhang(1001)
    wenzhang_pd = pd.DataFrame(columns=['wenzhang_comment'], data=wen_comment)
    wenzhang_pd.to_csv('wenzhang.csv', encoding='utf-8')

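There are more than 2,000 pages in total, and crawling all of them would take quite a while, so I set the limit to about 1,000, which should be enough (the call above passes 1001 because the loop starts at page 0). I also sleep for 5 seconds after each page, for two reasons: crawling too fast gets flagged by Weibo as abnormal traffic and banned, and the pause keeps the crawler from affecting their normal service. Even thieves have a code of honor, after all!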
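The throttling idea can be isolated from the Weibo specifics. The sketch below is an illustration only: `crawl_pages` and its parameters are hypothetical names, and the fetcher and the sleep function are passed in so the loop can be tried offline without touching the network.

```python
import time


def crawl_pages(fetch_page, pages, delay=5.0, sleep=time.sleep):
    """Call fetch_page(i) for each page, pausing `delay` seconds between calls."""
    results = []
    for i in range(pages):
        results.extend(fetch_page(i))
        if i < pages - 1:  # no need to wait after the last page
            sleep(delay)
    return results


# Offline demonstration: a stand-in fetcher and a "sleep" that only records waits.
waits = []
demo = crawl_pages(lambda i: [f"comment-{i}"], pages=3, delay=5.0, sleep=waits.append)
```

At 5 seconds per page, roughly 1,000 pages means about 83 minutes of waiting per post, before counting the time spent on the requests themselves.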

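Making the word clouds

Once the crawler has finished, let's take a quick look at the data.

Ma Yili's Weibo comments

Wen Zhang's Weibo comments

With the data in hand, let's turn it into word clouds and see where the various fans stand.

I won't say much about the content of the comments here. The more you say, the more you get wrong.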

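import jieba
import pandas as pd
from wordcloud import WordCloud, STOPWORDS

# Path to a font with Chinese glyphs. Placeholder value: the original snippet
# uses `font` without showing its definition.
font = 'simhei.ttf'


def wordcloud_m():
    # Column 0 is the index written by to_csv; column 1 holds the comments.
    df = pd.read_csv('mayili.csv', usecols=[1])
    # Remove whitespace inside each comment, then join all comments into one string.
    text = ' '.join(''.join(str(x).split()) for x in df['mayili_comment'].dropna())
    comment = jieba.cut(text, cut_all=False)  # precise-mode segmentation
    words = ' '.join(comment)
    wc = WordCloud(width=2000, height=1800, background_color='white', font_path=font,
                   stopwords=STOPWORDS, contour_width=3, contour_color='steelblue')
    wc.generate(words)
    wc.to_file('m.png')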

Word cloud of Ma Yili's comments

Word cloud of Wen Zhang's comments
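A word cloud is essentially a picture of word frequencies, so a plain ranking can be a useful cross-check. The sketch below counts tokens with `collections.Counter`. The `top_words` helper is hypothetical, and the token list is invented for illustration. In practice the tokens would come from the `jieba.cut` output.

```python
from collections import Counter


def top_words(tokens, stopwords=frozenset(), n=10):
    """Return the n most frequent tokens, ignoring blanks and stopwords."""
    counts = Counter(t for t in tokens if t.strip() and t not in stopwords)
    return counts.most_common(n)


# Invented tokens, not real comment data.
tokens = "support sister cheer support the cheer support".split()
ranking = top_words(tokens, stopwords={"the"}, n=2)
```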

Finally, I have uploaded all the code to GitHub. Help yourself if you need it: github.com/zhouwei713/…
