To be clear up front: the title is a bit of clickbait. Not literally all data can be crawled, only the data you can reach by ordinary means through a browser.
Also, there is nothing magic about this code; the core technology it relies on is open source:
GoogleChrome/puppeteer
It uses puppeteer's page.on('event', callback) API:
page.on('response', intercept_response)
Whenever the browser receives a response, this callback is invoked, and that is what gives it the potential to crawl any data.
As for why I'm not using Node: simply because I'm not fluent in Node 😓
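Before the full abstraction, here is a minimal sketch of what hooking page.on('response', ...) looks like in pyppeteer; the url and the xhr/fetch filter below are placeholders for illustration only:
import asyncio

from pyppeteer import launch
from pyppeteer.network_manager import Response


async def sketch():
    browser = await launch({'headless': False})
    page = await browser.newPage()

    async def on_response(res: Response):
        # print the body of every xhr/fetch response the page produces
        if res.request.resourceType in ('xhr', 'fetch'):
            print(res.url)
            print(await res.text())

    page.on('response', on_response)        # register the response listener
    await page.goto('https://example.com')  # placeholder start url
    await asyncio.sleep(10)                 # give the page time to fire its requests
    await browser.close()

# asyncio.run(sketch())
The function below wraps this pattern into a reusable crawl_response helper.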
import asyncio

from pyppeteer import launch
from pyppeteer.network_manager import Response
from pyppeteer.page import Page


async def crawl_response(
        start_url: str,
        actions: list,
        response_match_callback: callable = None,
        response_handle_callback: callable = None,
):
    """
    A highly abstracted function that can perform almost any puppeteer-based
    spider task, ignoring the existence of any js encryption.

    :param start_url: init page url
    :param actions: a list of actions to perform on a Page object;
        each action should accept exactly one argument, for example:
            async def click_by_xpath(page: Page):
                await asyncio.sleep(3)
                elemHandlers = await page.xpath(xpath)
                elemHandler = elemHandlers[0]
                await elemHandler.click()
    :param response_match_callback: decides whether to act on a response;
        may be a plain or an `async` function accepting exactly one argument,
        for example:
            1. match every response:
                lambda res: True
            2. match responses with 'api' in the url:
                lambda res: "api" in res.url
            3. match all xhr and fetch responses:
                def response_match_callback(res: Response):
                    return res.request.resourceType in ['xhr', 'fetch']
    :param response_handle_callback: handles the responses that matched;
        should be an `async` function accepting exactly one argument,
        for example:
            1. simply print the response text:
                async def response_handle_callback(res: Response):
                    print(await res.text())
            2. save the response to the filesystem:
                async def response_handle_callback(res: Response):
                    text = await res.text()
                    with open('example.json', 'w', encoding='utf-8') as f:
                        f.write(text)
    :return:
    """

    async def intercept_response(res: Response):
        if response_match_callback and response_handle_callback:
            match = response_match_callback(res)
            # the match callback may be plain or async, so only await a coroutine
            if asyncio.iscoroutine(match):
                match = await match
            if match:
                await response_handle_callback(res)

    browser = await launch({
        'headless': False,
        'devtools': False,
        'args': [
            '-start-maximized',
            '--disable-extensions',
            '--hide-scrollbars',
            '--disable-bundled-ppapi-flash',
            '--mute-audio',
            '--no-sandbox',
            '--disable-setuid-sandbox',
            '--disable-gpu',
            '--disable-infobars',
            '--enable-automation',
        ],
        'dumpio': True,
    })
    try:
        page = await browser.newPage()
        await page.goto(start_url)
        page.on('response', intercept_response)
        for task_action in actions:
            await task_action(page)
    finally:
        await browser.close()
A traditional web crawler: Developer Tools → Network → analyze the request → if there are encrypted parameters, set js breakpoints → simulate the request in a scripting language such as Python (a minimal sketch of that last step follows the comparison below).
A webdriver-based crawler: define callbacks that filter and handle responses → visit the start url → perform a series of actions on the page (clicking, navigating, scrolling and so on).
Advantages of the traditional approach:
Disadvantages of the traditional approach:
Advantages of the webdriver approach:
Disadvantages of the webdriver approach:
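For contrast, the "simulate the request" step of the traditional route typically ends up looking something like the requests call below. The parameter names are made up for illustration, and the signed field is exactly the part that has to be reverse-engineered from the site's javascript:
import requests

# hypothetical request parameters -- reproducing the signed field is the hard,
# site-specific part of the traditional approach
params = {
    "category": "news_hot",                                 # placeholder
    "_signature": "<output of the js signing algorithm>",   # placeholder
}
resp = requests.get("https://www.toutiao.com/api/pc/feed/", params=params, timeout=10)
print(resp.status_code, resp.text[:200])
The puppeteer approach skips this step entirely, because the page's own javascript computes the signatures for you.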
async def crawl_response(
        start_url: str,
        actions: list,
        response_match_callback: callable = None,
        response_handle_callback: callable = None,
):
    pass
It takes four parameters: start_url (the initial page to open), actions (a list of actions to perform on the Page), response_match_callback (decides which responses to handle), and response_handle_callback (what to do with a matched response).
You'll notice that all of the browsing we do in everyday life can be summarized as: visit a starting page, then perform a series of actions.
Meanwhile, puppeteer is able to capture all of the data that flows through the page during that process.
Put those two facts together, and this function becomes a universal method.
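As a quick sketch of that idea (reusing the crawl_response defined above; the url, selector and filter string here are placeholders), any browsing session can be written down as a start url plus a list of per-page actions:
# assumes the imports and the crawl_response function from the code above
def wait(secs: int):
    async def _wait(page: Page):
        await asyncio.sleep(secs)
    return _wait


def click(selector: str):
    async def _click(page: Page):
        await page.click(selector)
    return _click


async def demo():
    async def dump(res: Response):
        print(res.url)

    await crawl_response(
        start_url="https://example.com",                    # placeholder start page
        actions=[wait(3), click("a.some-link"), wait(5)],   # hypothetical selector
        response_match_callback=lambda res: "api" in res.url,
        response_handle_callback=dump,
    )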
Alright, enough talk; there is nothing particularly deep left to explain, so let's move on to practice.
The first example is the Toutiao (toutiao.com) feed. If you open developer tools and analyze the network requests, it is not hard to find that the feed endpoint is:
https://www.toutiao.com/api/pc/feed/
And you will quickly notice that the request parameters contain encrypted fields.
Let's handle it with the function described above. The start url is simply:
https://www.toutiao.com/
That's right, it's that simple: just two steps, a start url plus a list of actions.
Here is the complete code:
async def main():
    # crawl the https://www.toutiao.com/ feed stream
    start_url = "https://www.toutiao.com/"

    def click_by_xpath(xpath: str):
        async def _click(page: Page):
            await asyncio.sleep(3)
            elemHandlers = await page.xpath(xpath)
            elemHandler = elemHandlers[0]
            await elemHandler.click()
        return _click

    actions = []
    tabs = ["推薦", "熱點", "科技", "娛樂", "遊戲", "體育", "財經", "搞笑"]
    for tabname in tabs:
        xpath = "//span[contains(text(),'{}')]".format(tabname)
        action = click_by_xpath(xpath)
        actions.append(action)

    async def is_feed_api(res: Response):
        return "api/pc/feed" in res.url

    async def print_response_text(res: Response):
        text = await res.text()
        print(text)

    await crawl_response(start_url, actions, is_feed_api, print_response_text)


if __name__ == '__main__':
    asyncio.run(main())
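If printing isn't enough, the handler can just as well persist each matched response, along the lines of the second example in the docstring; the timestamp-based file naming below is an arbitrary choice for illustration:
import time

# an alternative handler: dump each matched feed response to its own file
# (assumes the Response import from the code above)
async def save_response_text(res: Response):
    text = await res.text()
    filename = "feed_{}.json".format(int(time.time() * 1000))
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(text)

# then pass it in place of print_response_text:
# await crawl_response(start_url, actions, is_feed_api, save_response_text)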
The second example is Pinduoduo. Let's analyze this page:
https://mobile.yangkeduo.com/catgoods.html?refer_page_name=index&opt_id=1274&opt_name=T恤&opt_type=2
The complete code:
import time


async def crawl_pdd():
    start_url = "https://mobile.yangkeduo.com/catgoods.html?refer_page_name=index&opt_id=1274&opt_name=T%E6%81%A4&opt_type=2"
    actions = []

    def scroll_down(amount: int, secs: int):
        async def _scroll_down(page: Page):
            start = time.time()
            while True:
                # scroll the page down by `amount` pixels
                await page.evaluate("window.scrollBy(0, {});".format(amount))
                await asyncio.sleep(2)
                if time.time() - start >= secs:
                    break
        return _scroll_down

    # keep scrolling down for 60 s
    actions.append(scroll_down(500, 60))

    async def print_response_text(res: Response):
        text = await res.text()
        print(text)

    await crawl_response(start_url, actions,
                         lambda res: 'subfenlei_gyl_label' in res.url,
                         print_response_text)


if __name__ == '__main__':
    asyncio.run(crawl_pdd())
⚠️ This is only a small demo; it's perfectly normal if it doesn't run as-is, and some details you will have to work out for yourself.
With this article, I hope to have given you a couple of takeaways.
You are also welcome to follow the examples and apply this to other sites.
PS: if you have good ideas, or simply want to make a friend, feel free to add my personal WeChat and let's chat.