Scraping an Idol's (Li Yifeng's) Weibo Comments with Python (Source Code Included)


Today's target: Weibo


We'll use Li Yifeng's Weibo as the example:

https://weibo.com/liyifeng2007?is_all=1

Then open the comments page and look in the XHR tab of the developer tools for the real request URL:

https://weibo.com/aj/v6/comment/big?ajwvr=6&id=4353796790279702&from=singleWeiBo

Clearly the content is loaded dynamically, so it can be scraped with the methods I've written about before; I won't repeat them all here. The crucial piece is that string of digits in the URL, so if we can extract it from the first page we're halfway there. This calls for the re regex module. I'm not great at regex, but no matter, we should still be able to pull it off:


import requests
import re

target = 'https://weibo.com/liyifeng2007?is_all=1'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
    'cookie': 'SUB=_2AkMowDDgf8NxqwJRmPoSyWnqao53ywzEieKenME7JRMxHRl-yT9kqnEjtRB6A0AeDzsLF_aeZGlWOMf4mEl-MBZZXqc_; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWcpq860VQlJcIRRNP9pzqS; SINAGLOBAL=1033839351635.7524.1604108279474; login_sid_t=c071efc77911ceace152df2be5986e09; cross_origin_proto=SSL; WBStorage=8daec78e6a891122|undefined; _s_tentry=-; Apache=8275565331127.246.1604195643561; ULV=1604195643568:3:1:1:8275565331127.246.1604195643561:1604122447982; wb_view_log=1920*10801; UOR=,,editor.csdn.net'
}
html = requests.get(target, headers=headers).text
for each in re.findall('<a name=(.*?)date=', html):
    # The first token is the post ID; the escaped date string becomes the filename
    real_id = each.split(" ")[0]
    filename = each.split("\\")[-2].replace('"', "").replace(":", ".")
    print(real_id, filename)
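To see what that regex actually captures, here is a minimal sketch run against a made-up HTML fragment (the fragment is hypothetical; the real profile page markup is more complex, but the non-greedy match between `<a name=` and `date=` works the same way):

```python
import re

# Hypothetical fragment mimicking the anchor tag this regex targets
html = '<a name=4353796790279702 href="/liyifeng2007" date="1604195643000">'
matches = re.findall('<a name=(.*?)date=', html)
# The captured group is '4353796790279702 href="/liyifeng2007" ';
# splitting on the first space isolates the numeric post ID
real_id = matches[0].split(" ")[0]
print(real_id)  # 4353796790279702
```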


The output looks like this:


The first value is the ID we need; the one after it is the time the post was published, which we'll use as the filename for storing the comment data.

Then we substitute the ID into the second URL:

comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&from=singleWeiBo'

Note that this variant fetches comments sorted by popularity; if you want the latest replies instead, you need this one:

comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&page=1'
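To avoid repeating the URL string, the two variants above can be wrapped in a small helper (hypothetical, not in the original code):

```python
def build_comment_url(real_id, page=None):
    """Build the comment endpoint URL for a Weibo post ID.

    With no page: comments sorted by popularity (from=singleWeiBo).
    With a page number: latest replies, paginated.
    """
    base = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}'
    if page is None:
        return base + '&from=singleWeiBo'
    return base + f'&page={page}'

print(build_comment_url('4353796790279702'))
print(build_comment_url('4353796790279702', page=1))
```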

With that in hand the rest is easy: the response is JSON data, so just paste it into a JSON viewer to inspect it and locate the fields we need. Here's the code:

comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&page=1'
res = requests.get(comment_url, headers=headers).json()["data"]["html"]
# Extract the commenters and comment text
conmment = re.findall('ucardconf="type=1">(.*?)</div>', res)
for each in conmment:
    # Strip the emoji/image tags embedded in the content
    each = re.sub('<.*?>', '', each)
    print(each)
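The tag-stripping step is worth seeing in isolation. Weibo renders emoji as `<img>` tags inside the comment HTML, and `re.sub('<.*?>', '', ...)` removes every tag non-greedily, leaving only the plain text (the sample string below is made up for illustration):

```python
import re

# Hypothetical comment HTML similar to what the endpoint returns:
# emoji are <img> tags mixed into the text
raw = '小峯峯加油<img alt="[心]" src="face.png">期待新劇<i class="icon"></i>'
clean = re.sub('<.*?>', '', raw)
print(clean)  # 小峯峯加油期待新劇
```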

I'm not very good with re, so bear with me; the main thing is getting the data into our hands, which is what matters most, haha…

Compare for yourself:

The emoji have been stripped out; comments that were emoji-only with no text will show just the commenter's name, which is expected. Everything else matches exactly.
Now that we have the data, let's save it locally. Full code:

# -*- coding: utf-8 -*-
"""Created on 2020-11-18
@author: 李運辰"""
# https://weibo.com/liyifeng2007?is_all=1
import requests
import re
import os

url = 'https://s.weibo.com/?topnav=1&wvr=6'
target = 'https://weibo.com/liyifeng2007?is_all=1'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
    'cookie': 'SUB=_2AkMowDDgf8NxqwJRmPoSyWnqao53ywzEieKenME7JRMxHRl-yT9kqnEjtRB6A0AeDzsLF_aeZGlWOMf4mEl-MBZZXqc_; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9WWcpq860VQlJcIRRNP9pzqS; SINAGLOBAL=1033839351635.7524.1604108279474; login_sid_t=c071efc77911ceace152df2be5986e09; cross_origin_proto=SSL; WBStorage=8daec78e6a891122|undefined; _s_tentry=-; Apache=8275565331127.246.1604195643561; ULV=1604195643568:3:1:1:8275565331127.246.1604195643561:1604122447982; wb_view_log=1920*10801; UOR=,,editor.csdn.net'
}
html = requests.get(target, headers=headers).text
os.makedirs("./images", exist_ok=True)  # make sure the output folder exists
for each in re.findall('<a name=(.*?)date=', html):
    real_id = each.split(" ")[0]
    filename = each.split("\\")[-2].replace('"', "").replace(":", ".")
    # print(real_id, filename)
    for page in range(1, 11):
        comment_url = f'https://weibo.com/aj/v6/comment/big?ajwvr=6&id={real_id}&page={page}'
        res = requests.get(comment_url, headers=headers).json()["data"]["html"]
        # Extract the commenters and comment text
        conmment = re.findall('ucardconf="type=1">(.*?)</div>', res)
        # conmment = re.findall('</i></a>(.*?) </div>', res)
        for each in conmment:
            # Strip the emoji/image tags embedded in the content
            each = re.sub('<.*?>', '', each)
            print(each)
            f_name = "./images/" + filename
            with open(f_name + "_李運辰.txt", "a", encoding="utf-8") as f:
                f.write(each)
                f.write("\n")
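The storage step at the end of the script is a plain append-to-file pattern: each cleaned comment becomes one line in a UTF-8 text file inside `./images/`. A minimal, self-contained sketch of just that step (the folder and filename here are made up for illustration):

```python
import os

# The script assumes ./images exists; create it explicitly to be safe
os.makedirs("./images", exist_ok=True)

comments = ["第一條評論", "第二條評論"]  # stand-ins for the cleaned comment strings
out_path = "./images/2020-11-18_comments.txt"
with open(out_path, "a", encoding="utf-8") as f:
    for c in comments:
        f.write(c + "\n")  # one comment per line
```

Opening in `"a"` (append) mode means re-running the scraper adds to the file rather than overwriting earlier pages.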

This was just a test, so I only scraped about ten pages:

After downloading, you can compare for yourself:



Done!!!


- END -


Welcome to follow the official account Python爬蟲數據分析挖掘 to catch the latest articles as they come out

Recording the bits and pieces of learning Python;

Reply 【開源源碼】 to get more open-source project source code for free;

This article was shared from the WeChat official account Python爬蟲數據分析挖掘 (zyzx3344).
