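The video scraped this time is av75993929, which topped the Bilibili ranking on November 21. It is about the anime Valvrave the Liberator (革命機), an "ultra magical-realist" anime (funny face). If you are curious, go and experience this magical masterpiece for yourself.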
import jieba
import pandas as pd
import requests
from pyquery import PyQuery as pq

# Fetching danmaku (bullet comments) by date requires being logged in, so send
# the cookie from a logged-in session. Without it, only the current day's
# 1,000 danmaku can be fetched.
headers = {
    "cookie": "<paste your cookie here>"
}

word = []


def getInfo(date):
    response = requests.get(
        "https://api.bilibili.com/x/v2/dm/history?type=1&oid=129995312&date=2019-11-" + str(date),
        headers=headers,
    )
    # Work around garbled Chinese text
    response.encoding = response.apparent_encoding
    doc = pq(response.content)
    # Get all <d> tags; each one holds a single danmaku
    result = doc("d")
    for line in result:
        word.append(line.text)


# Save the danmaku to a CSV file
def savaFile():
    sr = pd.Series(word)
    sr.to_csv("評革命機B站彈幕.csv", encoding='utf-8', index=False)


# Segment the danmaku text with jieba
def seperate():
    data = pd.read_csv("評革命機B站彈幕.csv", encoding='utf-8')
    # Load a custom dictionary, since Bilibili users love their memes
    jieba.load_userdict('dict.txt')
    # Skip empty rows and coerce to str so the concatenation cannot fail
    strs = "".join(str(i[0]) for i in data.dropna().values)
    l = jieba.cut(strs, cut_all=True)
    res = '/'.join(l)
    # Save the segmented text to a file
    with open("word.txt", 'w', encoding='utf-8') as f:
        f.write(res)


# Count how often each word appears
def analyse():
    def dropNa(s):
        return s and s.strip()

    with open("word.txt", encoding='utf-8') as f:
        data = f.read().split('/')
    newdata = []
    for i in data:
        # Drop noise: laughter filler and single characters
        if '哈' in i or len(i) == 1 or '嘿' in i:
            continue
        newdata.append(i)
    # Drop empty strings
    data = list(filter(dropNa, newdata))
    df = pd.Series(data)
    # Count frequencies and write them to a file
    df.value_counts().to_csv("彈幕TOP.csv")


for i in range(18, 22):
    getInfo(i)
savaFile()
seperate()
analyse()
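To try the filtering and counting step on its own, without a cookie or network access, here is a minimal sketch using only the standard library. The helper name `count_tokens` and the sample string are made up for illustration; it only approximates what `analyse()` does (it strips each token before checking its length) and uses `collections.Counter` in place of pandas.

```python
from collections import Counter


def count_tokens(segmented, sep="/"):
    """Count tokens in a separator-joined string (hypothetical helper).

    Roughly mirrors the filters in analyse(): drops empty strings,
    single characters, and laughter filler containing 哈 or 嘿.
    """
    counts = Counter()
    for token in segmented.split(sep):
        token = token.strip()
        if len(token) <= 1 or "哈" in token or "嘿" in token:
            continue
        counts[token] += 1
    return counts


# Made-up sample in the same "/"-joined form that seperate() writes to word.txt
sample = "革命/革命機/哈哈哈/的/ /早稻田/革命機"
counts = count_tokens(sample)
for token, n in counts.most_common():
    print(token, n)
```

For a few days of danmaku either approach is fast enough; pandas mainly earns its place in the original script by writing the sorted result straight to CSV.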
Okouchi-sensei truly lives up to being a graduate of Waseda University's School of Human Sciences.
Suddenly these danmaku have exactly that flavor.
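To find out why the danmaku turned out this way, see "This anime was once mocked for being unrealistic, but six years later reality slapped everyone in the face! [Valvrave the Liberator]".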