Copy this code, and Mom no longer has to worry that I can't scrape the data

Date: 2019-11-07

A caveat first: the title is a bit clickbait. Not all data can be scraped this way, only data that can be reached by normal means through a browser.

Also, there is nothing magic about this code. The core technologies it uses are all open source:

GoogleChrome/puppeteer

miyakogi/pyppeteer

It uses puppeteer's page.on('event', callback) API:

page.on('response', intercept_response)

Whenever the browser receives a response, this callback is invoked. That is what gives the approach the potential to capture all of the data.
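To make the callback mechanism concrete, here is a toy sketch of the register-then-dispatch pattern behind `page.on`. `ToyEmitter` and `ToyResponse` are hypothetical stand-ins written for this illustration, not pyppeteer classes; the real library fires these events from its connection to the browser.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ToyResponse:
    """Hypothetical stand-in for a browser response (not pyppeteer's Response)."""
    url: str
    body: str

    async def text(self) -> str:
        return self.body


class ToyEmitter:
    """Minimal event emitter: on() registers a callback, emit() awaits each one."""

    def __init__(self):
        self._handlers = {}

    def on(self, event: str, callback) -> None:
        self._handlers.setdefault(event, []).append(callback)

    async def emit(self, event: str, payload) -> None:
        for callback in self._handlers.get(event, []):
            await callback(payload)


async def demo() -> list:
    seen = []

    async def intercept_response(res: ToyResponse):
        # Runs once for every 'response' event that is emitted.
        seen.append((res.url, await res.text()))

    page = ToyEmitter()
    page.on('response', intercept_response)
    await page.emit('response', ToyResponse('https://example.com/api/a', '{"n": 1}'))
    await page.emit('response', ToyResponse('https://example.com/api/b', '{"n": 2}'))
    return seen


if __name__ == '__main__':
    print(asyncio.run(demo()))
```

The point is only the shape: the callback is registered once and then sees every response, whichever request produced it.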

0x01 Code first

As for why I didn't use node: purely because I'm not fluent in node 😓

import asyncio
from pyppeteer import launch
from pyppeteer.network_manager import Response
from pyppeteer.page import Page

async def crawl_response(
        start_url: str,
        actions: list,
        response_match_callback: callable = None,
        response_handle_callback: callable = None,
):
    """
    A highly abstracted function that can perform almost any puppeteer-based
    spider task, regardless of any JS encryption on the site.

    :param start_url: initial page URL
    :param actions: a list of actions to perform on a Page object. Each action
        should be an `async` function that accepts exactly one argument.
        For example:

            async def click_by_xpath(page: Page):
                await asyncio.sleep(3)
                elemHandlers = await page.xpath(xpath)
                elemHandler = elemHandlers[0]
                await elemHandler.click()

    :param response_match_callback: a callback that decides whether to act on
        a response. It should be an `async` function that accepts exactly one
        argument. For example:

        1. match all responses

            async def response_match_callback(res: Response):
                return True

        2. match responses with 'api' in their URL

            async def response_match_callback(res: Response):
                return "api" in res.url

        3. match all xhr and fetch responses

            async def response_match_callback(res: Response):
                resourceType = res.request.resourceType
                return resourceType in ['xhr', 'fetch']

    :param response_handle_callback: called for each response accepted by
        response_match_callback. It should be an `async` function that accepts
        exactly one argument. For example:

        1. simply print the response text

            async def response_handle_callback(res: Response):
                text = await res.text()
                print(text)

        2. save the response to the filesystem

            async def response_handle_callback(res: Response):
                text = await res.text()
                with open('example.json', 'w', encoding='utf-8') as f:
                    f.write(text)

    :return: None
    """

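    async def intercept_response(res: Response):
        # Nothing to do unless both callbacks were supplied.
        if not (response_match_callback and response_handle_callback):
            return
        # Accept plain functions and lambdas as well as `async` functions.
        match = response_match_callback(res)
        if asyncio.iscoroutine(match):
            match = await match
        if match:
            result = response_handle_callback(res)
            if asyncio.iscoroutine(result):
                await result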

    browser = await launch({
        'headless': False,
        'devtools': False,
        'args': [
            '--start-maximized',
            '--disable-extensions',
            '--hide-scrollbars',
            '--disable-bundled-ppapi-flash',
            '--mute-audio',
            '--no-sandbox',
            '--disable-setuid-sandbox',
            '--disable-gpu',
            '--disable-infobars',
            '--enable-automation',
        ],
        'dumpio': True,
    })

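    try:
        page = await browser.newPage()
        # Register the listener before navigating so that responses from the
        # initial page load are captured as well.
        page.on('response', intercept_response)
        await page.goto(start_url)
        # Run the user-supplied actions one after another on the same page.
        for task_action in actions:
            await task_action(page)
    finally:
        # Always close the browser, even if an action raised.
        await browser.close()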


If your reading device makes the code hard to read, see the image below:

[Image: screenshot of the code above]

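0x02 Comparing the two crawler approaches

Mindset

Traditional web crawler: Developer Tools → Network → analyze the requests → if there are encrypted parameters, set breakpoints in the JS → replay the requests with a scripting language such as Python.

Webdriver-based crawler: define callbacks that filter and handle responses → visit the start URL → perform a series of operations on the page (click, navigate, scroll, etc.)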

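Pros and cons

Traditional web crawler

Pros:

Once the encrypted parameters are cracked, scraping is more efficient.
Task scheduling and code organization are simpler and more flexible.

Cons:

The up-front analysis is very time-consuming, and there is no guarantee the encrypted parameters can be cracked.
If the encryption scheme changes one day, it has to be cracked all over again.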

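Webdriver-based crawler

Pros:

JS encryption can be ignored entirely: any data an ordinary user can reach in the browser can be captured without reverse engineering. Even if the site changes its JS encryption, nothing breaks.
Easy to write: you only need to define a series of page operations.
All data received over the course of the requests can be saved in one go, which is convenient later on.

Cons:

It is certainly less efficient.
When a headed browser is required, deployment becomes a hassle.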

0x03 A closer look at the code above

async def crawl_response(
        start_url: str,
        actions: list,
        response_match_callback: callable = None,
        response_handle_callback: callable = None,
):
    pass

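It takes four parameters:

start_url: the initial URL to visit
actions: a series of operations to perform on the page
response_match_callback: a function that takes a response and decides whether to act on it
response_handle_callback: a function that does the actual processing of a response, such as saving it to a database.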
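As one way to picture the two callbacks, here is a minimal sketch that matches on the URL and stores matched bodies in SQLite. `FakeResponse`, `is_api_response`, `make_sqlite_saver` and the `responses` table are names invented for this sketch; the stand-in only mimics the `.url` attribute and `.text()` coroutine that the callbacks rely on.

```python
import asyncio
import sqlite3
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in exposing only .url and .text()."""
    url: str
    body: str

    async def text(self) -> str:
        return self.body


async def is_api_response(res) -> bool:
    # Match callback: keep only responses whose URL contains "api".
    return "api" in res.url


def make_sqlite_saver(conn: sqlite3.Connection):
    """Build a handle callback that stores each matched response in SQLite."""
    conn.execute("CREATE TABLE IF NOT EXISTS responses (url TEXT, body TEXT)")

    async def save_response(res) -> None:
        body = await res.text()
        conn.execute(
            "INSERT INTO responses (url, body) VALUES (?, ?)", (res.url, body)
        )
        conn.commit()

    return save_response


async def demo() -> list:
    conn = sqlite3.connect(":memory:")
    save_response = make_sqlite_saver(conn)
    incoming = [
        FakeResponse("https://example.com/api/feed", '{"items": []}'),
        FakeResponse("https://example.com/logo.png", "binary"),
    ]
    # Same match-then-handle order that intercept_response applies.
    for res in incoming:
        if await is_api_response(res):
            await save_response(res)
    return conn.execute("SELECT url, body FROM responses").fetchall()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With the real function, the same pair would be passed as the third and fourth arguments: `crawl_response(start_url, actions, is_api_response, make_sqlite_saver(conn))`.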

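You will find that everything we do in a browser in daily life can be summed up as: visit a start page, then perform a series of operations.

And puppeteer is able to save all the data received along the way.

Put these two points together and this function becomes a general-purpose method.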
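Since every action is just an async function of the page, actions compose naturally. The sketch below shows a few small action factories; `pause`, `repeat`, `run_js` and `RecordingPage` are hypothetical helpers made up for this illustration, and the stand-in page only records the scripts it is asked to evaluate.

```python
import asyncio


def pause(secs: float):
    """Action factory: wait for `secs` seconds, e.g. to let requests finish."""
    async def _pause(page):
        await asyncio.sleep(secs)
    return _pause


def repeat(action, times: int):
    """Action factory: run another action `times` times in a row."""
    async def _repeat(page):
        for _ in range(times):
            await action(page)
    return _repeat


def run_js(script: str):
    """Action factory: evaluate a JavaScript snippet in the page."""
    async def _run_js(page):
        await page.evaluate(script)
    return _run_js


class RecordingPage:
    """Hypothetical stand-in for a Page that only records evaluate() calls."""

    def __init__(self):
        self.scripts = []

    async def evaluate(self, script: str):
        self.scripts.append(script)


async def demo() -> list:
    page = RecordingPage()
    actions = [
        run_js("window.scrollTo(0, 0);"),
        repeat(run_js("window.scrollBy(0, 500);"), 3),
        pause(0),
    ]
    # Same loop shape as in crawl_response: one action after another.
    for action in actions:
        await action(page)
    return page.scripts


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Each factory returns a one-argument coroutine function, which is exactly the shape the `actions` list expects.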

Right, enough talk. There is nothing deeper to explain, so let's move on to practice.

0x04 Practice 1: scraping the Toutiao home page feed

Toutiao

If you open Developer Tools and inspect the network requests, it is not hard to find that the feed endpoint is:

https://www.toutiao.com/api/pc/feed/

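It is also immediately obvious that the request parameters include encrypted fields.

Let's analyze it in terms of the function above:

First, the start URL is https://www.toutiao.com/
Second, clicking the tabs in the sidebar triggers the network requests that return the feed data we want.

That's right, it's that simple: just two steps.

Here is the complete code: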

async def main():
    # Scrape the feed data from https://www.toutiao.com/
    start_url = "https://www.toutiao.com/"

    def click_by_xpath(xpath: str):
        async def _click(page: Page):
            await asyncio.sleep(3)
            elemHandlers = await page.xpath(xpath)
            elemHandler = elemHandlers[0]
            await elemHandler.click()

        return _click

    actions = []
    # Tab labels as shown on the site, which uses Simplified Chinese
    tabs = ["推荐", "热点", "科技", "娱乐", "游戏", "体育", "财经", "搞笑"]
    for tabname in tabs:
        xpath = "//span[contains(text(),'{}')]".format(tabname)
        action = click_by_xpath(xpath)
        actions.append(action)

    async def is_feed_api(res: Response):
        return "api/pc/feed" in res.url

    async def print_response_text(res: Response):
        text = await res.text()
        print(text)

    await crawl_response(start_url, actions, is_feed_api, print_response_text)


if __name__ == '__main__':
    asyncio.run(main())
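One fragile spot in the code above is `elemHandlers[0]`, which raises an IndexError when the XPath matches nothing (for example, if a tab label changes). Below is a guarded variant as a minimal sketch; `click_by_xpath_if_present`, `FakeElement` and `FakePage` are hypothetical names, and the fakes exist only so the sketch runs without a browser.

```python
import asyncio


def click_by_xpath_if_present(xpath: str, wait_secs: float = 3):
    """Action factory: click the first XPath match, or skip if there is none."""
    async def _click(page) -> bool:
        await asyncio.sleep(wait_secs)  # give the page time to render
        handles = await page.xpath(xpath)
        if not handles:
            return False  # nothing matched; let the remaining actions continue
        await handles[0].click()
        return True
    return _click


class FakeElement:
    """Hypothetical stand-in for an element handle; records whether it was clicked."""

    def __init__(self):
        self.clicked = False

    async def click(self):
        self.clicked = True


class FakePage:
    """Hypothetical stand-in for a Page with a fixed XPath-to-elements mapping."""

    def __init__(self, elements_by_xpath: dict):
        self._elements = elements_by_xpath

    async def xpath(self, expression: str) -> list:
        return self._elements.get(expression, [])


async def demo() -> tuple:
    tab = FakeElement()
    page = FakePage({"//span[contains(text(),'tech')]": [tab]})
    hit = await click_by_xpath_if_present("//span[contains(text(),'tech')]", 0)(page)
    miss = await click_by_xpath_if_present("//span[contains(text(),'none')]", 0)(page)
    return hit, miss, tab.clicked


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The returned boolean is ignored by `crawl_response`, so the guarded action can be dropped into the `actions` list unchanged.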


0x05 Practice 2: scraping T-shirt product data from Pinduoduo

Pinduoduo mall

Analysis:

Start URL: https://mobile.yangkeduo.com/catgoods.html?refer_page_name=index&opt_id=1274&opt_name=T恤&opt_type=2
Keep scrolling down

Full code:

import time  # needed by scroll_down; in a real script, keep it with the other imports


async def crawl_pdd():
    start_url = "https://mobile.yangkeduo.com/catgoods.html?refer_page_name=index&opt_id=1274&opt_name=T%E6%81%A4&opt_type=2"

    actions = []

    def scroll_down(amount: int, secs: int):
        async def _scroll_down(page: Page):
            start = time.time()
            while True:
                # scrollBy(x, y): pass the amount as y to scroll vertically
                await page.evaluate("window.scrollBy(0, {});".format(amount))
                await asyncio.sleep(2)
                if time.time() - start >= secs:
                    break

        return _scroll_down

    # Keep scrolling down for 60 s
    actions.append(scroll_down(500, 60))

    async def is_target_api(res: Response):
        # Async, like the other callbacks, because crawl_response awaits it
        return 'subfenlei_gyl_label' in res.url

    async def print_response_text(res: Response):
        text = await res.text()
        print(text)

    await crawl_response(start_url, actions, is_target_api, print_response_text)


if __name__ == '__main__':
    asyncio.run(crawl_pdd())
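Scrolling for a fixed number of seconds either wastes time or stops too early. A common alternative, sketched here under assumptions, is to stop once the page height no longer grows. `scroll_until_stable` and `FakePage` are hypothetical names; the stand-in page replays a scripted list of heights so the sketch runs without a browser. On a slow network a single round without growth may be a false stop, so treat the stop rule as a starting point.

```python
import asyncio


def scroll_until_stable(step: int, pause_secs: float, max_rounds: int):
    """Action factory: scroll down until the page height stops growing."""
    async def _scroll(page) -> int:
        last_height = await page.evaluate("document.body.scrollHeight")
        for rounds in range(1, max_rounds + 1):
            await page.evaluate("window.scrollBy(0, {});".format(step))
            await asyncio.sleep(pause_secs)  # wait for lazy-loaded content
            height = await page.evaluate("document.body.scrollHeight")
            if height == last_height:
                return rounds  # nothing new was loaded in this round
            last_height = height
        return max_rounds
    return _scroll


class FakePage:
    """Hypothetical stand-in: replays scripted heights, then repeats the last one."""

    def __init__(self, heights):
        self._heights = list(heights)
        self.scroll_calls = 0

    async def evaluate(self, script: str):
        if "scrollBy" in script:
            self.scroll_calls += 1
            return None
        # Height query: hand out the next scripted height.
        if len(self._heights) > 1:
            return self._heights.pop(0)
        return self._heights[0]


async def demo() -> tuple:
    page = FakePage([1000, 2000, 3000, 3000])
    rounds = await scroll_until_stable(500, 0, 10)(page)
    return rounds, page.scroll_calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

It would replace the timed action in the list, for example `actions.append(scroll_until_stable(500, 2, 30))`.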

⚠️ This is only a small demo. It is perfectly normal if it does not run as-is; some details you will have to work out yourself.

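0x06 Summary

With this article I hope to leave you with two takeaways:

A new scraping technique: when cracking the JS gets you nowhere, give this approach a try.
Learn to abstract, turning a complex problem into a simple one. This function essentially abstracts "visiting a page with a browser" into two steps: visit a start page + perform a series of operations on it. Almost all browsing behavior can be summed up in these two steps (if you know of any that cannot, please let me know), which shows how much this interface abstracts.

You are also welcome to follow the examples and apply the function to other sites.

PS: If you have any good ideas, or just want to make a friend, feel free to add me on WeChat and chat.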
