Learning Python Web Scraping: Scraping QQ Shuoshuo Posts and Turning Them into a Word Cloud, Full of Memories

Date: 2019-11-17


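I have been teaching myself Python for a while. I built a website on my own with Django and have used requests + BeautifulSoup to scrape a few simple sites. Over the weekend I dug in a bit further and decided to scrape the Shuoshuo (status posts) from my Qzone (QQ space), save the text to a txt file, and then read it back to generate a word cloud.
I have not logged in to QQ for a long time, and I have not touched Qzone posts in years. They are full of memories from my school days. I started reading, then I started laughing, and then... hahaha~~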
No picture, no proof.

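Back then I was still so young, dashing, and witty...
Back to business: this time I use selenium to simulate the login, BeautifulSoup4 to scrape the data, and wordcloud to generate the word cloud.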

Installing BeautifulSoup

pip install beautifulsoup4
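The official beautifulsoup4 documentation covers the details.
A parser is also needed. I chose html5lib: pip install html5lib
The table below lists the main parsers, with their advantages and disadvantages: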

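Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; lenient with malformed documents	Requires a C library
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires a C library
html5lib	BeautifulSoup(markup, "html5lib")	Best error tolerance; parses the document the way a browser does; produces HTML5	Slow; pure Python with no external extension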
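
As a small illustration of the table, the sketch below picks a parser name depending on what is installed. The helper name pick_parser and the preference order are assumptions made for this example, not something the BeautifulSoup documentation prescribes.

```python
import importlib.util


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser name, else the built-in "html.parser"."""
    for name in preferred:
        # find_spec returns None when the top-level package is not installed
        if importlib.util.find_spec(name) is not None:
            return name
    return "html.parser"


# Intended use (requires beautifulsoup4):
#     soup = BeautifulSoup(markup, pick_parser())
print(pick_parser())
```

Falling back to "html.parser" keeps the script usable on a machine where neither optional parser has been installed.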

Simulating the login with selenium

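selenium is used to simulate logging in to Qzone. Install it with pip install selenium
I use the Chrome browser, so webdriver.Chrome() obtains the Chrome driver.
You also need to download and install the driver that matches your browser; otherwise the script fails with the error "chromedriver executable needs to be in PATH". I am on a Mac and followed this article on downloading the driver: https://blog.csdn.net/zxy987872674/article/details/53082896
The same applies on Windows: download the matching driver, unzip it, and put the downloaded **.exe into the Python installation directory, for example D:\python . The Python installation directory must also be added to the system PATH environment variable.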
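
Before starting the browser, it can help to check whether the driver is really reachable through PATH. The following is a minimal sketch using only the standard library; the helper name find_driver is made up for this example.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable found on PATH, or None."""
    return shutil.which(name)


driver_path = find_driver()
if driver_path is None:
    print("chromedriver is not on PATH; download it and add its folder to PATH")
else:
    print("chromedriver found at", driver_path)
```

If this prints a path, webdriver.Chrome() should be able to locate the same executable.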

The QQ login page is http://i.qq.com. Use webdriver to open the Qzone login page:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://i.qq.com")

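After the page opens, right-click and choose Inspect to view the page elements. The account/password login form sits inside login_frame, so first switch to that frame with driver.switch_to.frame("login_frame"), then click the account/password login button, fill in the account and password, log in, and open the Shuoshuo page. The full code is as follows: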

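from selenium import webdriver
from selenium.webdriver.common.by import By

friend = ''  # QQ number of the space to visit; the friend's space must allow you access, or enter your own QQ number
user = ''  # your QQ number
pw = ''  # your QQ password

# Get the browser driver
driver = webdriver.Chrome()
# Maximize the browser window
driver.maximize_window()
# Open the QQ login page
driver.get("http://i.qq.com")

# Switch to the frame that contains the login form
driver.switch_to.frame("login_frame")

# Click the account/password login option
# (find_element(By.ID, ...) replaces find_element_by_id, which newer selenium releases removed)
driver.find_element(By.ID, "switcher_plogin").click()
# Type the QQ number into the account field
driver.find_element(By.ID, "u").send_keys(user)
# Type the password into the password field
driver.find_element(By.ID, "p").send_keys(pw)
# Click the login button
driver.find_element(By.ID, "login_button").click()
# Hand control back to the top-level page
driver.switch_to.default_content()
# Go to the Shuoshuo URL; change friend to any space you want to visit, here it is my own Qzone
driver.get("http://user.qzone.qq.com/" + friend + "/311")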
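
The login script keeps the QQ number and password in the source file. One possible alternative, sketched here, is to read them from environment variables. The variable names QQ_USER, QQ_PASSWORD and QQ_FRIEND and the helper load_credentials are hypothetical and chosen only for this example.

```python
import os


def load_credentials(env=os.environ):
    """Read the QQ account settings from environment variables.

    QQ_USER, QQ_PASSWORD and QQ_FRIEND are hypothetical names chosen for this sketch.
    """
    user = env.get("QQ_USER", "")
    pw = env.get("QQ_PASSWORD", "")
    # Fall back to your own space when no friend is given
    friend = env.get("QQ_FRIEND", "") or user
    if not user or not pw:
        raise ValueError("set QQ_USER and QQ_PASSWORD before running the script")
    return friend, user, pw
```

With this in place, the three assignments at the top of the login script could become friend, user, pw = load_credentials(), which keeps the password out of any file you might share.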

At this point the Shuoshuo page is open. Note that some spaces show a prompt dialog after opening, which has to be closed first by simulating a click.

Damn, I actually used to have a Yellow Diamond membership, how scary~~ and my Qzone avatar was so young and so mainstream...

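from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By

try:
    # Find the close button and dismiss the prompt dialog
    driver.find_element(By.ID, "dialog_button_111").click()
except WebDriverException:
    # No dialog appeared (or it could not be clicked), so carry on
    pass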

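Because the Shuoshuo content is loaded dynamically, the script has to scroll down automatically so that all the content loads, and then simulate a click on "next page" to load more. The concrete code is in the next section.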
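
As a variation on scrolling a fixed number of times, the sketch below keeps scrolling until the page height stops growing. It is not the method used in this post. It assumes that the height of the outer page can be read with document.body.scrollHeight, and it only relies on the driver's execute_script method; the function name scroll_to_bottom and its parameters are made up for this example.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=10):
    """Scroll until the page height stops growing; return the number of scrolls."""
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        rounds += 1
        # Give the page time to load more content
        time.sleep(pause)
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            # Nothing new was loaded, so the bottom has been reached
            break
        last_height = new_height
    return rounds
```

The max_rounds limit is there so that a page which keeps growing cannot trap the script in an endless loop.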

Scraping the Shuoshuo with BeautifulSoup

Press F12 to inspect the page: the Shuoshuo posts live in the <div> named feed_wrap, inside the <li> tags of its <ol>. The text of each post is in the <pre> tag inside <div class="bd">.

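import time

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

next_num = 0  # index in the id of the current "next page" button
while True:
    # Scroll down so the browser loads all the content of the page.
    # i runs from 0 to 4, so the page is scrolled in 5 steps.
    for i in range(0, 5):
        height = 20000 * i  # scroll a further 20000 * i pixels in this step
        strWord = "window.scrollBy(0," + str(height) + ")"
        driver.execute_script(strWord)
        time.sleep(2)

    # Switch to the frame that holds the Shuoshuo, otherwise the elements below cannot be found
    driver.switch_to.frame("app_canvas_frame")
    # Parse the page
    content = BeautifulSoup(driver.page_source, "html5lib")
    # Find the ol tag inside the "feed_wrap" div
    ol = content.find("div", class_="feed_wrap").ol
    # Collect the li tags with find_all
    lis = ol.find_all("li", class_="feed")

    # Append the posts to the file; mode 'a' keeps earlier content instead of truncating
    with open('qq_word.txt', 'a', encoding='utf-8') as f:
        for li in lis:
            bd = li.find("div", class_="bd")
            # The post text is in the pre tag; skip items that do not have one
            if bd is None or bd.pre is None:
                continue
            ss_content = bd.pre.get_text()
            f.write(ss_content + "\n")

    # On the last page the "next page" button no longer has an id, so stop
    if driver.page_source.find('pager_next_' + str(next_num)) == -1:
        break
    # Click "next page"; its id changes on every page, so it has to be tracked
    driver.find_element(By.ID, 'pager_next_' + str(next_num)).click()
    # Index for the next "next page" button
    next_num += 1
    # The next iteration starts by scrolling the page, so go back to the outer frame
    driver.switch_to.parent_frame()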


The QQ Shuoshuo have now been scraped and saved to the file qq_word.txt.
Next, generate the word cloud.
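
Because the scraper opens qq_word.txt in append mode, running it twice stores every post twice. A small cleanup step, sketched below, removes blank lines and repeats before the text is used. The helper name unique_posts is made up for this example, and it works line by line, so a post that spans several lines is treated as several entries.

```python
def unique_posts(lines):
    """Strip whitespace, drop blank lines and repeats, keep first-seen order."""
    seen = set()
    result = []
    for line in lines:
        post = line.strip()
        if post and post not in seen:
            seen.add(post)
            result.append(post)
    return result


# Intended use:
#     with open("qq_word.txt", encoding="utf-8") as f:
#         posts = unique_posts(f)
print(unique_posts(["first post\n", "\n", "second post\n", "first post\n"]))
```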

Word cloud

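The word cloud is generated with the wordcloud package: pip install wordcloud
jieba could also be used here for word segmentation. I did not use it, because I feel the Shuoshuo only have their flavor when read as whole sentences; that is personal taste. With jieba segmentation you would see the most frequent words in the posts instead.
Set a few wordcloud properties. Be sure to set font_path, otherwise Chinese characters are rendered as garbage.
One more reminder: if you use a virtual environment, do not run the script below inside it, otherwise you may get the error RuntimeError: Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends. If you are using (Ana)Conda please install python.app and replace the use of 'python' with 'pythonw'. See 'Working with Matplotlib on OSX' in the Matplotlib FAQ for more information. I ran into exactly this, and ran the script again after leaving the virtual environment with deactivate.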
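
For readers who do want the high-frequency words, the counting step can be sketched with the standard library alone. The function below takes a list of tokens that is already segmented. In practice the tokens would probably come from something like jieba.lcut(text), which requires installing jieba separately and is not part of this post's script. The helper name top_words and the min_len filter are assumptions made for this example.

```python
from collections import Counter


def top_words(tokens, n=10, min_len=2):
    """Count tokens of at least min_len characters and return the n most common."""
    words = [t.strip() for t in tokens]
    counts = Counter(w for w in words if len(w) >= min_len)
    return counts.most_common(n)


# The tokens here are typed by hand; a segmenter would normally produce them
print(top_words(["回憶", "的", "回憶", "上學", " "], n=2))
```

Filtering out single-character tokens removes most particles, so the words that remain are the ones that carry meaning.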

# coding:utf-8

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Generate the word cloud
def create_word_cloud(filename):
    # Read the file content
    with open("{}.txt".format(filename), encoding='utf-8') as f:
        text = f.read()
    # Configure the word cloud
    wc = WordCloud(
        # Background color
        background_color="white",
        # Maximum number of words to display
        max_words=2000,
        # Use a font installed on the computer: on Windows they are under C:\Windows\Fonts\,
        # on the Mac I chose /System/Library/Fonts/PingFang.ttc
        font_path='/System/Library/Fonts/PingFang.ttc',
        height=1200,
        width=2000,
        # Maximum font size
        max_font_size=100,
        # Seed for the random number generator, so the same colors are produced on every run
        random_state=30,
    )

    myword = wc.generate(text)  # generate the word cloud
    # Display the word cloud
    plt.imshow(myword)
    plt.axis("off")
    plt.show()
    wc.to_file('qq_word.png')  # save the word cloud to a file


if __name__ == '__main__':
    create_word_cloud('qq_word')

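
The font_path in the script is specific to macOS. A rough way to make it portable is to try a few candidate files and use the first one that exists. Apart from PingFang.ttc, the paths below are assumptions about typical installations and may need to be adjusted; the helper name pick_font is made up for this example.

```python
import os

# Candidate font files; apart from PingFang.ttc these paths are assumptions
CANDIDATE_FONTS = [
    "/System/Library/Fonts/PingFang.ttc",  # macOS, the path used in the script above
    r"C:\Windows\Fonts\msyh.ttc",  # Windows, Microsoft YaHei (assumed file name)
    "/usr/share/fonts/truetype/wqy/wqy-microhei.ttc",  # Linux, WenQuanYi (assumed)
]


def pick_font(candidates=CANDIDATE_FONTS):
    """Return the first candidate path that exists on this machine, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


print(pick_font())
```

The result could then be passed as font_path to WordCloud. If it is None, no candidate was found and a font that supports Chinese has to be supplied by hand.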