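The previous post covered scraping all comments from QQ Music; this one covers scraping its featured ("hot") comments. Browsing the songs on the QQ Music hot chart, I found that few songs have many hot comments, and even the full comment counts look meager next to NetEase Cloud Music. NetEase Cloud really does deserve its reputation as the gathering place for artsy young people. Still, a low comment count is no reason to skip the crawl: charge straight in, no fear.
Through this crawl I learned about MongoDB, a non-relational, document-oriented database. The layout of this public account has also become nicer, especially the footer, which now has recommended reading and a scan-to-follow QR code. After borrowing ideas from other accounts' layouts, I made the footer image with my half-baked PPT skills. Give it a like if you enjoy it!
As far as I understand so far, MongoDB's advantages are: no fixed schema is required (documents can be nested), there is no need to worry about whether data types match, and the data is more flexible to work with. As before, everything was done on a Mac: I installed MongoDB and its visualization tool Robo 3T, a free, lightweight GUI that is simple and easy to pick up.
Create the database and collection, and insert data.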
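import pymongo

# Connect to the local MongoDB server; the database and collection
# are created lazily on the first insert.
client = pymongo.MongoClient(host='localhost', port=27017)
db = client.QQ_Music
collection = db.comments

comments = {
    "nike": "꧁༺詩光༻꧂",
    "comment": "釋迦摩尼說 :不管你碰見誰, 他都是你生命中該出現的人 ,絕非偶然。",
    "praisenum": "7817",
    "comment_id": "song_7072290_1772758010_1486708168",
    "time": "2017-02-10 14:29:28"
}

# insert_one replaces the deprecated Collection.insert (removed in PyMongo 4).
result = collection.insert_one(comments)
print(result.inserted_id)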
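Analyzing the QQ Music page for the song 平凡之路, I kept clicking "load more" and found which parameters in the request URL change: pagenum and jsoncallback. From the previous crawl we know that jsoncallback has no effect on the request, so this time is even simpler: only the page number needs to change. lasthotcommentid is the ID of the last hot comment on the first page, and it stays essentially unchanged over short periods.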
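The "only the page number changes" observation can be sketched as a small URL builder. This is a minimal sketch, not the article's own code: the names `BASE_URL`, `FIXED_PARAMS`, and `build_url` are hypothetical, the parameter values are copied from the request URL used in this article's crawler, and whether the endpoint still accepts them today is not verified.

```python
from urllib.parse import urlencode

# Endpoint and parameter values copied from the request URL in this article.
BASE_URL = "https://c.y.qq.com/base/fcgi-bin/fcg_global_comment_h5.fcg"

# Everything except pagenum stays fixed between requests.
FIXED_PARAMS = {
    "g_tk": "5381",
    "jsonpCallback": "jsoncallback05763744516059277",
    "loginUin": "0",
    "hostUin": "0",
    "format": "jsonp",
    "inCharset": "utf8",
    "outCharset": "GB2312",
    "notice": "0",
    "platform": "yqq",
    "needNewCode": "0",
    "cid": "205360772",
    "reqtype": "2",
    "biztype": "1",
    "topid": "7072290",
    "cmd": "6",
    "needmusiccrit": "0",
    "pagesize": "10",
    "lasthotcommentid": "song_7072290_2856798698_1489491834",
    "callback": "jsoncallback05763744516059277",
    "domain": "qq.com",
    "ct": "24",
    "cv": "101010",
}


def build_url(pagenum):
    """Return the request URL for one page of hot comments (hypothetical helper)."""
    params = dict(FIXED_PARAMS, pagenum=pagenum)  # pagenum is appended last
    return BASE_URL + "?" + urlencode(params)


print(build_url(0))
```

Keeping the fixed parameters in a dict makes it obvious which single value varies, and avoids hand-editing one very long string.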
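平凡之路 has 624 hot comments in total. Each request returns 10 entries, except the first page, which returns 15. In the end, though, only 595 were stored. Some entries are follow-up replies with no original comment, so they are dropped, and some comments have been deleted and carry no comment content at all.
The crawling code is as follows: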
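The page count implied by those numbers can be checked with a little arithmetic. The sketch below assumes the sizes stated above (15 entries on the first page, 10 on each later page); `pages_needed` is a hypothetical helper name.

```python
import math


def pages_needed(total, first_page_size=15, page_size=10):
    """Number of requests needed to cover `total` comments."""
    if total <= 0:
        return 0
    if total <= first_page_size:
        return 1
    # One request for the first page, then enough full-size pages for the rest.
    return 1 + math.ceil((total - first_page_size) / page_size)


print(pages_needed(624))  # 62
```

With 624 comments this gives 62 requests, so the `range(63)` used in the crawl loop covers every page with one spare.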
import re
import json
import time

import pymongo
import requests
from pymongo.errors import PyMongoError

client = pymongo.MongoClient(host='localhost', port=27017)
db = client.QQ_Music
collection = db.comments


def get_html(url, headers):
    """Fetch one page; return None if the request fails."""
    try:
        response = requests.get(url=url, headers=headers, timeout=10)
        response.raise_for_status()
        response.encoding = 'utf-8'
    except requests.RequestException:
        print("connect failed")
        return None
    return response


def parse_html(html):
    # Strip the JSONP wrapper: the 29-character callback name plus "(" in
    # front, and the 3 trailing characters that close the wrapper.
    content = json.loads(html[30:-3])
    for item in content['comment']['commentlist']:
        # Skip entries without a root comment (follow-up replies, deleted comments).
        if item.get("rootcommentcontent"):
            data = {}  # fresh dict per comment so yielded items never share state
            data["nike"] = item.get("nick")
            # Replace both literal "\n" sequences and real newlines with spaces.
            data["comment"] = re.sub(r"\\n", " ", item.get("rootcommentcontent"))
            data["comment"] = re.sub(r"\n", " ", data["comment"])
            # The comment text doubles as _id, so repeated comments are rejected on insert.
            data["_id"] = data["comment"]
            data["praisenum"] = item.get("praisenum")
            data["commentid"] = item.get("commentid")
            data["time"] = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(int(item.get("time"))))
            yield data


def to_mongodb(data):
    try:
        # insert_one replaces the deprecated Collection.insert (removed in PyMongo 4).
        collection.insert_one(data)
        print("Insert the data successfully", data)
    except PyMongoError:
        # For example a duplicate _id (the same comment text seen twice); skip it.
        pass


def main():
    url_template = 'https://c.y.qq.com/base/fcgi-bin/fcg_global_comment_h5.fcg?g_tk=5381&jsonpCallback=jsoncallback05763744516059277&loginUin=0&hostUin=0&format=jsonp&inCharset=utf8&outCharset=GB2312&notice=0&platform=yqq&needNewCode=0&cid=205360772&reqtype=2&biztype=1&topid=7072290&cmd=6&needmusiccrit=0&pagenum=%s&pagesize=10&lasthotcommentid=song_7072290_2856798698_1489491834&callback=jsoncallback05763744516059277&domain=qq.com&ct=24&cv=101010'
    headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
    for i in range(63):
        time.sleep(2)  # pause between requests to stay polite
        response = get_html(url_template % i, headers)
        if response is None:
            continue
        for item in parse_html(response.text):
            to_mongodb(item)


if __name__ == '__main__':
    main()
    print("Finish The Work")
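The crawler strips the JSONP wrapper with the fixed slice `html[30:-3]`, which only works while the callback name keeps the same length. A regex-based alternative is sketched below. It is an assumption-laden sketch rather than the article's method: `jsonp_to_dict` is a hypothetical helper, and the sample payload is made up to mimic the shape the crawler expects.

```python
import json
import re

# callbackName( ...payload... ) with an optional trailing semicolon and whitespace.
JSONP_RE = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def jsonp_to_dict(text):
    """Parse a JSONP response body into a Python object (hypothetical helper)."""
    match = JSONP_RE.match(text)
    if match is None:
        raise ValueError("response is not in JSONP form")
    return json.loads(match.group(1))


# Made-up sample shaped like the response the crawler parses.
sample = 'jsoncallback05763744516059277({"comment": {"commentlist": []}})\n'
print(jsonp_to_dict(sample))
```

Because the pattern keys on the parentheses rather than on character positions, it keeps working if the callback name or the trailing characters change.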
In the end, the comment data was retrieved successfully.
Read the comment data from MongoDB and generate a word cloud.
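from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import pandas as pd
import pymongo
import jieba
import re

client = pymongo.MongoClient('localhost', 27017)
db = client.QQ_Music
table = db.comments

# Load every stored document and keep only the comment text.
data = pd.DataFrame(list(table.find()))
data = data[['comment']]

# Characters to strip before segmentation: ASCII letters, digits, and punctuation.
r = '[a-zA-Z0-9’!"#$%&\'()*+,-./:;<=>?@,。?★、…【】《》?「」‘’![\\]^_`{|}~]+'
text = ''
for line in data['comment']:
    line = re.sub(r, '', line)
    # The trailing space keeps the last word of one comment from fusing
    # with the first word of the next.
    text += ' '.join(jieba.cut(line, cut_all=False)) + ' '

# The mask image determines the shape and, later, the colors of the cloud.
background_image = plt.imread('luck.jpg')
wc = WordCloud(
    background_color='white',
    mask=background_image,
    font_path='msyh.ttf',  # must point to a font file that contains Chinese glyphs
    max_words=2000,
    stopwords=STOPWORDS,
    max_font_size=130,
    random_state=30
)
wc.generate_from_text(text)

# Recolor the words using the colors of the mask image.
img_colors = ImageColorGenerator(background_image)
wc.recolor(color_func=img_colors)
plt.imshow(wc)
plt.axis('off')
wc.to_file("幸運.jpg")
print("Word cloud generated")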
[Image: 幸運.jpg, the generated word cloud]
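The cleaning step that runs before segmentation can be tried without MongoDB or jieba. The sketch below uses only the standard library and makes several assumptions: `NOISE` is a simplified stand-in for the article's longer pattern, `clean` and `count_tokens` are hypothetical helpers, and the `tokenize` argument is where `jieba.cut` would go (the demo splits into single characters instead).

```python
import re
from collections import Counter

# Simplified stand-in for the article's pattern: ASCII letters, digits,
# whitespace, and common ASCII and CJK punctuation.
NOISE = re.compile(r"[a-zA-Z0-9\s!\"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~,。?!、…【】《》「」‘’★]+")


def clean(comment):
    """Remove letters, digits, and punctuation, keeping the Chinese text."""
    return NOISE.sub("", comment)


def count_tokens(comments, tokenize):
    """Count tokens across all comments; `tokenize` maps a string to tokens."""
    counts = Counter()
    for comment in comments:
        counts.update(tokenize(clean(comment)))
    return counts


print(clean("平凡之路, so good! 666"))
print(count_tokens(["路路", "路!"], list))
```

Counting tokens this way gives a quick look at which words are likely to dominate the cloud, since word clouds size words by how often they occur.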
··· END ···