Scraping NetEase Cloud Music comments! Python crawler hands-on for beginners (6): getting started with selenium!

Date: 2019-12-28

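When people talk about web scraping, NetEase Cloud Music comments are often the first thing that comes to mind. Those comments hide a lot of treasure, so let's learn how to dig it up with python!

Since it is treasure, it is of course locked away with a key: the request is encrypted. Opening Chrome and inspecting the request Headers shows this.

The parameters look quite complicated, so we will not call this endpoint directly with requests.

This time we use selenium, a browser automation testing framework. With it we can simulate operating a browser by hand.

For this we need the chromedriver driver and the Chrome browser.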

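chromedriver can be downloaded from the Taobao mirror. Pick the version that matches your Chrome browser. The download address is:
http://npm.taobao.org/mirrors...

The whole project uses python3 and a few third-party libraries, listed below.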

import json  # standard library; used later to save the comments

from selenium import webdriver
import jieba
from wordcloud import WordCloud
from PIL import Image
import numpy as np

Then set up config.json:

{
  "id":"1336789644",
  "page": 200,
  "useCache": true,
  "font_path": "SimHei.ttf",
  "mask": "mask.png",
  "chromedriver": "chromedriver"
}
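
The article does not show how sound.py reads this file, so the following is only a minimal sketch. The key names and the names `CONFIG`, `SOUND_ID` and `PAGES` come from the article; the `load_config` helper, the `DEFAULTS` table and the validation rules are assumptions made for illustration.

```python
import json

# Fallback values mirror the sample config.json above; treating them as
# defaults is an assumption, not something the article specifies.
DEFAULTS = {
    "page": 200,
    "useCache": True,
    "font_path": "SimHei.ttf",
    "mask": "mask.png",
    "chromedriver": "chromedriver",
}


def load_config(path="config.json"):
    """Read config.json, fill in defaults, and check the basic types."""
    with open(path, encoding="utf-8") as f:
        config = {**DEFAULTS, **json.load(f)}
    if not str(config.get("id", "")).isdigit():
        raise ValueError("config 'id' must be a numeric song id")
    if not isinstance(config["page"], int) or config["page"] < 1:
        raise ValueError("config 'page' must be a positive integer")
    return config
```

With such a helper, the globals used later could be set up as `CONFIG = load_config()`, `SOUND_ID = CONFIG['id']` and `PAGES = CONFIG['page']`.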

Run sound.py and it generates the word cloud image.

It also saves all of the comment data.

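Now that we have seen how to use it, let's move on to the analysis!

Find the NetEase Cloud Music song URL, spot its pattern, and open it with webdriver!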

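# Selenium 3 call; Selenium 4 takes webdriver.Chrome(service=Service(path))
# with Service imported from selenium.webdriver.chrome.service.
driver = webdriver.Chrome(CONFIG['chromedriver'])
driver.get(f'https://music.163.com/#/song?id={SOUND_ID}')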
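
The pattern in that URL is that the song id sits after the `#`, in the fragment, not in the ordinary query string. As a rough sketch, the two hypothetical helpers below (`song_url` and `song_id_from_url` are names invented here, not part of the article's code) build the URL from an id and recover the id from a URL copied out of the address bar.

```python
from urllib.parse import parse_qs, urlsplit


def song_url(sound_id):
    """Build the song page URL in the same form the article uses."""
    return f"https://music.163.com/#/song?id={sound_id}"


def song_id_from_url(url):
    """Extract the id from a song URL copied from the browser address bar."""
    parts = urlsplit(url)
    # "/song?id=..." sits after the "#", so it arrives in the fragment
    # rather than in the normal query string; parse the fragment again.
    inner = urlsplit(parts.fragment) if parts.fragment else parts
    ids = parse_qs(inner.query).get("id")
    if not ids:
        raise ValueError(f"no song id found in {url!r}")
    return ids[0]
```

The value returned by `song_id_from_url` is what would go into the `"id"` field of config.json.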

Next, switch the driver into the frame that holds the comment box.

driver.switch_to.frame('g_iframe')

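Why do this? Because XPath queries run against the outer document cannot reach elements inside a frame, and the comment data lives in exactly this iframe.

Select one of the comments and inspect its markup: every comment sits under the same class name.

Write the corresponding XPath to get the list of all comments.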
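
If the frame has not finished loading when the switch is attempted, the switch can fail. One possible way around this, sketched here with selenium's explicit-wait helpers, is to wait until the frame is available before switching. The function names `enter_comment_frame` and `leave_comment_frame` and the 10-second timeout are assumptions for illustration; the article itself only calls `driver.switch_to.frame('g_iframe')`.

```python
def enter_comment_frame(driver, timeout=10):
    """Wait until the g_iframe frame is available, then switch into it."""
    # Imported inside the function so this helper can be defined even where
    # selenium is not installed; selenium is still required to call it.
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Passing the same string the article uses; selenium resolves it the
    # same way driver.switch_to.frame('g_iframe') does.
    WebDriverWait(driver, timeout).until(
        EC.frame_to_be_available_and_switch_to_it("g_iframe")
    )


def leave_comment_frame(driver):
    """Return to the top-level document after working inside the frame."""
    driver.switch_to.default_content()
```

After `enter_comment_frame(driver)`, XPath lookups run inside the iframe; `leave_comment_frame(driver)` goes back to the outer page.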

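# The find_element(s)_by_* helpers were removed in Selenium 4.3; there, use driver.find_elements(By.XPATH, ...).
element_list = driver.find_elements_by_xpath('//div[@class="cnt f-brk"]')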
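
The article does not show how `element_list` becomes the `comments_list` that is saved later. A minimal sketch, assuming each element's visible text is the comment we want, could look like this (`comment_texts` is a hypothetical helper name):

```python
def comment_texts(elements):
    """Return the stripped, non-empty text of each comment element."""
    texts = []
    for element in elements:
        # WebElement.text holds the element's visible text.
        text = (element.text or "").strip()
        if text:
            texts.append(text)
    return texts
```

It would be used as `comments_list = comment_texts(element_list)`, or with `extend` when collecting several pages.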

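Select the next-page button and inspect its markup: its class name starts with a fixed prefix.

Write the corresponding XPath to get the next-page button, and simulate a click on it when needed.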

next_button = driver.find_element_by_xpath('//a[starts-with(@class,"zbtn znxt js-n-")]')
driver.execute_script('arguments[0].click();', next_button)
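
Putting the two XPath expressions together, the page-by-page loop might be sketched as below. This is not the article's own loop, which is not shown. `scrape_pages`, the fixed `delay` between pages and the early exit when no button is found are all assumptions. The driver is only required to offer `find_elements(by, value)` and `execute_script`, which the selenium driver does.

```python
import time

COMMENT_XPATH = '//div[@class="cnt f-brk"]'
NEXT_XPATH = '//a[starts-with(@class,"zbtn znxt js-n-")]'
BY_XPATH = "xpath"  # same string value as selenium's By.XPATH


def scrape_pages(driver, pages, delay=1.0):
    """Collect comment text from up to `pages` pages of the comment list."""
    comments = []
    for page in range(pages):
        for element in driver.find_elements(BY_XPATH, COMMENT_XPATH):
            comments.append(element.text)
        if page == pages - 1:
            break  # last requested page: no need to click again
        buttons = driver.find_elements(BY_XPATH, NEXT_XPATH)
        if not buttons:
            break  # no next-page button found, stop early
        # Click through JavaScript, as in the article.
        driver.execute_script('arguments[0].click();', buttons[0])
        # Crude fixed pause so the next page can render (an assumption).
        time.sleep(delay)
    return comments
```

With the config shown earlier this would be called as `comments_list = scrape_pages(driver, PAGES)`.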

With the data analysis done, it is time to produce the results.

Save the comment list as JSON.

# ensure_ascii=False writes Chinese text as-is, so open the file as UTF-8
with open(filePath, 'w', encoding='utf-8') as f:
    json.dump(comments_list, f, ensure_ascii=False, indent=4)
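
The `useCache` flag in config.json suggests that a saved JSON file can be reused instead of scraping again, although the article does not show that logic. One way it might work, as a sketch under that assumption (`load_or_scrape` and its parameters are hypothetical names):

```python
import json
import os


def load_or_scrape(file_path, use_cache, scrape):
    """Return cached comments if allowed and present, else scrape and save."""
    if use_cache and os.path.exists(file_path):
        with open(file_path, encoding="utf-8") as f:
            return json.load(f)
    comments = scrape()  # any zero-argument callable returning a list
    # Create the target directory if needed before writing the file.
    os.makedirs(os.path.dirname(file_path) or ".", exist_ok=True)
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(comments, f, ensure_ascii=False, indent=4)
    return comments
```

Here `scrape` would be whatever function drives the browser, so the browser only starts when no usable cache exists.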

Use jieba for word segmentation and wordcloud to generate the word cloud image.

# Build the word cloud, shaped by the mask image
image_mask = np.array(Image.open(CONFIG['mask']))
wordlist = jieba.cut(';'.join(comments_list))
wordcloud = WordCloud(font_path=CONFIG['font_path'], background_color='white', mask=image_mask, scale=1.5).generate(' '.join(wordlist))
# Save the image (the ./result directory must already exist)
wordcloud.to_file(f'./result/{SOUND_ID}-{PAGES}.png')
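
Before rendering the image, it can help to check which words are most frequent in the segmented text. The sketch below is an optional sanity check and not part of the article's code. `top_words` and its `min_length` filter, which drops single characters and separators such as `;`, are assumptions.

```python
from collections import Counter


def top_words(tokens, n=20, min_length=2):
    """Count tokens of at least min_length characters; return the n most common."""
    counts = Counter(
        token.strip() for token in tokens if len(token.strip()) >= min_length
    )
    return counts.most_common(n)
```

It accepts any iterable of strings, so it could be fed directly with `jieba.cut(';'.join(comments_list))`.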

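Those are all the steps for scraping NetEase Cloud Music comments with selenium!

This article is for personal study and exchange only. Please do not use it for any other purpose!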
