Python Web Scraping, Dynamic Pages: Scraping Blog Comment Data by Finding the Real Data URL with the Browser's Inspect Tool

Date: 2019-12-01

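Many mainstream websites use JavaScript to render their content. Unlike the static pages scraped earlier, much of the content on such a page never appears in the HTML source. Instead, the HTML holds a piece of JavaScript, and the data finally displayed is fetched from the server by that JavaScript and loaded into the page. The techniques for scraping static pages may therefore not work. For these pages we need one of two dynamic-page scraping techniques: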

1. Use the browser's Inspect tool to find the real data URL;

2. Use selenium to simulate a browser.

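We cover the first method here.

The example scrapes the comments on the personal blog of the author of the book 《Python 網絡爬蟲：從入門到實踐》 (Python Web Scraping: From Beginner to Practice). URL: http://www.santostang.com/2017/03/02/hello-world/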

1) "Capturing the traffic": find the real data URL

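Right-click the page and choose "Inspect", open the "Network" tab, and select the "JS" filter. Refresh the page, then select the list?callback.... JS request that comes back during the refresh, and open the Headers tab on the right (see the screenshot in the original post).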

There, the Request URL is the real data URL.

Scrolling down in the same panel reveals the User-Agent.
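To make the structure of the Request URL easier to read, the following minimal sketch splits the URL used later in this article into its host, path, and query parameters with the standard library. It only parses the string and sends no request; the variable names are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

# The Request URL copied from the Headers tab (the same one used in the code below).
link = ("https://api-zero.livere.com/v1/comments/list"
        "?callback=jQuery112405600294326674093_1523687034324"
        "&limit=10&offset=2&repSeq=3871836"
        "&requestPath=%2Fv1%2Fcomments%2Flist"
        "&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154"
        "&_=1523687034329")

parts = urlsplit(link)
# parse_qs maps each parameter name to a list of values; keep the first of each.
params = {name: values[0] for name, values in parse_qs(parts.query).items()}

print(parts.netloc)
print(parts.path)
for name, value in params.items():
    print(f"{name} = {value}")
```

Seen this way, callback names the JavaScript function that wraps the response, while limit and offset look like paging controls.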

2) The code:

import requests
import json

# User-Agent copied from the browser's request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
# Request URL copied from the Headers tab
link = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112405600294326674093_1523687034324&limit=10&offset=2&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1523687034329"
r = requests.get(link, headers=headers)
# Get the response body as a string
json_string = r.text
# Keep only the JSON object: drop everything before the first '{' and the trailing ');'
json_string = json_string[json_string.find('{'):-2]
json_data = json.loads(json_string)
comment_list = json_data['results']['parents']
for eachone in comment_list:
    message = eachone['content']
    print(message)

Output:

如今死在了4.2節上，頁面評論是有的，可是XHR裏沒有東西啊，這是什麼狀況？有解決的大神嗎？
爲什麼靜態網頁抓取不了？
奇怪了，我按照書上的方法來操做，XHR也是空的啊
XHR沒有顯示任何東西啊。奇怪。
找到緣由了
caps["marionette"] = True
做者能夠解釋一下這句話是幹什麼的嗎
我用的是 pycham IDE，按照做者的寫法寫的，怎麼不行
對火狐版本有要求嗎
4.3.1 打開Hello World,代碼用的做者的，火狐地址我也設置了，爲啥運行沒反應
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = False
binary = FirefoxBinary(r'C:\Program Files\Mozilla Firefox\firefox.exe')
#把上述地址改爲你電腦中Firefox程序的地址
driver = webdriver.Firefox(firefox_binary=binary, capabilities=caps)
driver.get("http://www.santostang.com/2017/03/02/hello-world/")
我是番茄
爲何刷新沒有XHR數據，評論明明加載出來了

Code walkthrough:

1) The built-in help for json_string.find() reads:

Docstring:
S.find(sub[, start[, end]]) -> int

Return the lowest index in S where substring sub is found,
such that sub is contained within S[start:end].  Optional
arguments start and end are interpreted as in slice notation.

Return -1 on failure.
Type:      method_descriptor

So json_string.find('{') returns the index of the first '{' in json_string.
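To see the two possible return values, here is a minimal sketch on a short made-up string (not real server output):

```python
# A shortened, made-up JSONP-style response used only for illustration.
sample = 'callback_name({"resultCode": 200});'

start = sample.find('{')      # index of the first '{'
missing = sample.find('[')    # '[' does not occur, so find returns -1

print(start)
print(missing)
print(sample[start:])         # everything from the first '{' onward
```

Note that find returns -1 rather than raising an error when the substring is absent, so a slice that starts at that index would quietly begin at the last character instead.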

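2) Adding print(json_string) right after json_string = r.text, before the slicing line, prints the following (the output is long, so only the beginning and the end are shown; the key parts were highlighted in red in the original post):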

/**/ typeof jQuery112405600294326674093_1523687034324 === 'function' && jQuery112405600294326674093_1523687034324({"results":{"parents":[{"replySeq":33365104,"name":"骨犬","memberId":"B9E06FBF9013D49CADBB5B623E8226C8","memberIcon":"http://q.qlogo.cn/qqapp/101256433/B9E06FBF9013D49CADBB5B623E8226C8/100","memberUrl":"https://qq.com/","memberDomain":"qq","good":0,"bad":0,"police":0,"parentSeq":33365104,"directSeq":0,"shortUrl":null,"title":"Hello world! - 數據科學@唐鬆
Santos","site":"http://www.santostang.com/2017/03/02/hello-world/","email":null,"ipAddress":"27.210.192.241","isMobile":"0","agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.108 Safari/537.36 2345Explorer/8.8.3.16721","septSns":null,"targetService":null,"targetUserName":null,"info1":null,"info2":null,"info3":null,"image1":null,"image2":null,"image3":null,"link1":null,"link2":null,"link3":null,"isSecret":0,"isModified":0,"confirm":0,"subCount":1,"regdate":"2018-01-01T06:27:50.000Z","deletedDate":null,"file1":null,"file2":null,"file3":null,"additionalSeq":0,"content":"如今死在了4.2節上，頁面評論是有的，可是XHR裏沒有東西啊，這是什麼狀況？有解決的大神嗎？"
 。。。。。。。。。 tent":"個人也是提示火狐版本不匹配，你解決了嗎","quotationSeq":null,"quotationContent":null,"consumerSeq":1020,"livereSeq":28583,"repSeq":3871836,"memberGroupSeq":26828779,"memberSeq":27312353,"status":0,"repGroupSeq":0,"adminSeq":25413747,"deleteReason":null,"sticker":0,"version":null}],"quotations":[]},"resultCode":200,"resultMessage":"Okay, livere"});

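The output above shows why the line json_string = json_string[json_string.find('{'):-2] matters.

If the slice does not start at json_string.find('{'), the string still begins with the /**/ typeof ... && jQuery...( prefix and is not valid JSON, so json.loads cannot parse it. If the slice does not stop two characters before the end, the string keeps the trailing ); and is again not valid JSON.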
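The slicing step can be sketched on a made-up response with the same shape as the output above. The helper name strip_jsonp is hypothetical, and its use of rfind('}') for the end of the payload is a variation on the article's -2, not the article's own code.

```python
import json

def strip_jsonp(text):
    """Return the JSON object inside a JSONP response such as name({...});"""
    start = text.find('{')    # first '{' opens the JSON object
    end = text.rfind('}')     # last '}' closes it
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in response")
    return text[start:end + 1]

# Made-up sample shaped like the real response (not actual server data).
sample = (
    "/**/ typeof cb === 'function' && "
    'cb({"results": {"parents": [{"content": "hello"}]}, "resultCode": 200});'
)

payload = strip_jsonp(sample)
data = json.loads(payload)
print(payload == sample[sample.find('{'):-2])   # same result as the article's slice
print(data["resultCode"])
```

On a response that ends exactly with ); both approaches give the same string. The rfind variant is only a little more tolerant, for example of a trailing newline after the semicolon.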

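3) The string keys inside the square brackets in comment_list=json_data['results']['parents'] and message=eachone['content'] can be read off the output in 2). In the valid JSON left after slicing, the comment list sits under "results" and then "parents", so two bracket lookups reach it level by level. Since we want the comments, and each comment's text is stored under the "content" key, ['content'] retrieves it.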
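The level-by-level lookup can be tried on a small made-up payload that copies only the nesting of the real response (the names and comment texts here are invented):

```python
import json

# Made-up payload with the same nesting as the response: results -> parents -> content.
json_string = (
    '{"results": {"parents": ['
    '{"name": "a", "content": "first comment"}, '
    '{"name": "b", "content": "second comment"}'
    '], "quotations": []}, "resultCode": 200}'
)

json_data = json.loads(json_string)               # dict
comment_list = json_data['results']['parents']    # list of dicts, one per comment
messages = [eachone['content'] for eachone in comment_list]

for message in messages:
    print(message)
```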

By observation, the offset parameter in the real data URL is the page number.
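As an alternative to joining URL fragments by hand, the per-page URLs can be built from a parameter dictionary with the standard library. This is a minimal sketch: BASE and build_link are hypothetical names, the parameter values are copied from the Request URL above, and treating offset as the page number follows the observation just made. No request is sent.

```python
from urllib.parse import urlencode

# Endpoint taken from the Request URL, without the query string.
BASE = "https://api-zero.livere.com/v1/comments/list"

def build_link(page):
    """Build the comment-list URL for one page (offset is taken as the page number)."""
    params = {
        "callback": "jQuery112405600294326674093_1523687034324",
        "limit": 10,
        "offset": page,
        "repSeq": 3871836,
        "requestPath": "/v1/comments/list",   # urlencode turns '/' into %2F
        "consumerSeq": 1020,
        "livereSeq": 28583,
        "smartloginSeq": 5154,
        "_": 1523687034329,
    }
    return BASE + "?" + urlencode(params)

links = [build_link(page) for page in range(1, 4)]
for link in links:
    print(link)
```

With requests, the same dictionary could instead be passed as requests.get(BASE, params=params, headers=headers), which leaves the encoding to the library.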

Scraping the comments on all pages:

import requests
import json

def single_page_comment(link):
    """Fetch one page of comments and print the text of each comment."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
    r = requests.get(link, headers=headers)
    # Get the response body as a string
    json_string = r.text
    # Keep only the JSON object: drop everything before the first '{' and the trailing ');'
    json_string = json_string[json_string.find('{'):-2]
    json_data = json.loads(json_string)
    comment_list = json_data['results']['parents']
    for eachone in comment_list:
        message = eachone['content']
        print(message)

# The URL is split around the offset value so the page number can be inserted
for page in range(1, 4):
    link1 = "https://api-zero.livere.com/v1/comments/list?callback=jQuery112405600294326674093_1523687034324&limit=10&offset="
    link2 = "&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1523687034329"
    page_str = str(page)
    link = link1 + page_str + link2
    print(link)
    single_page_comment(link)

Output:

https://api-zero.livere.com/v1/comments/list?callback=jQuery112405600294326674093_1523687034324&limit=10&offset=1&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1523687034329
在JS 裏面也找不到https://api.gentie.163.com/products/ 哪位大神幫忙解答下。謝謝。
在JS 裏面也找不到https://api.gentie.163.com/products/ 哪位大神幫忙解答下。謝謝。
在JS 裏面也找不到https://api.gentie.163.com/products/ 哪位大神幫忙解答下。謝謝。
測試
爲何我用代碼打開的文章只有兩條評論，原本是有46條的，有大神知道怎麼回事嗎？
菜鳥一隻，求學習羣
lalala1
我來試一試 :smiley:
我來試一試 :smiley:
應該點JS，而後看裏面的Preview或者Response，裏面響應的是Ajax的內容，而後若是去爬網站的評論的話，點開js那個請求後點Headers -->在General裏面拷貝 RequestURL 就能夠了 :grinning:
https://api-zero.livere.com/v1/comments/list?callback=jQuery112405600294326674093_1523687034324&limit=10&offset=2&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1523687034329
如今死在了4.2節上，頁面評論是有的，可是XHR裏沒有東西啊，這是什麼狀況？有解決的大神嗎？
爲什麼靜態網頁抓取不了？
奇怪了，我按照書上的方法來操做，XHR也是空的啊
XHR沒有顯示任何東西啊。奇怪。
找到緣由了
caps["marionette"] = True
做者能夠解釋一下這句話是幹什麼的嗎
我用的是 pycham IDE，按照做者的寫法寫的，怎麼不行
對火狐版本有要求嗎
4.3.1 打開Hello World,代碼用的做者的，火狐地址我也設置了，爲啥運行沒反應
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = False
binary = FirefoxBinary(r'C:\Program Files\Mozilla Firefox\firefox.exe')
#把上述地址改爲你電腦中Firefox程序的地址
driver = webdriver.Firefox(firefox_binary=binary, capabilities=caps)
driver.get("http://www.santostang.com/2017/03/02/hello-world/")
我是番茄
爲何刷新沒有XHR數據，評論明明加載出來了
https://api-zero.livere.com/v1/comments/list?callback=jQuery112405600294326674093_1523687034324&limit=10&offset=3&repSeq=3871836&requestPath=%2Fv1%2Fcomments%2Flist&consumerSeq=1020&livereSeq=28583&smartloginSeq=5154&_=1523687034329
爲何刷新沒有XHR數據，評論明明加載出來了
爲何刷新沒有XHR數據，評論明明加載出來了
第21條測試評論
第20條測試評論
第19條測試評論
第18條測試評論
第17條測試評論
第16條測試評論
第15條測試評論
第14條測試評論