Advanced pyppeteer Techniques

Date: 2019-12-01

Tags: pyppeteer, advanced techniques

These are notes on some slightly more advanced usages that I gradually discovered while working with pyppeteer.

1. Basic interceptor usage

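An interceptor applies to a single Page, that is, one tab in the browser. Every time you create a Page you have to add the interceptors again. Interceptors are implemented by attaching callback functions to various events.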

For the list of events, see: pyppeteer.page.Page.Events

Commonly used interceptors:

- request: fired when a network request is sent
- response: fired when a network response is received
- dialog: fired when the page opens a dialog
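
Because handlers have to be registered again on every new Page, it can help to wrap page creation in a small helper. The following is only a minimal sketch: `new_page_with_handlers` and its `handlers` mapping are hypothetical names, not part of pyppeteer, and the helper relies only on the `newPage`, `setRequestInterception` and `on` calls that appear elsewhere in this post.

```python
async def new_page_with_handlers(browser, handlers, intercept_requests=False):
    """Open a tab and attach every callback in `handlers` to it.

    `handlers` maps an event name ("request", "response", "dialog", ...)
    to a callback. Hypothetical helper, not a pyppeteer API.
    """
    page = await browser.newPage()
    # "request" callbacks may only continue/abort/respond once interception is on.
    if intercept_requests:
        await page.setRequestInterception(True)
    for event, callback in handlers.items():
        page.on(event, callback)
    return page
```

A call might look like `page = await new_page_with_handlers(browser, {"dialog": handle_dialog})`, so that every tab opened through the helper gets the same set of callbacks.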

Modifying a request with the request interceptor:

# coding:utf8
import asyncio
from pyppeteer import launch

from pyppeteer.network_manager import Request


launch_args = {
    "headless": False,
    "args": [
        "--start-maximized",
        "--no-sandbox",
        "--disable-infobars",
        "--ignore-certificate-errors",
        "--log-level=3",
        "--enable-extensions",
        "--window-size=1920,1080",
        "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
    ],
}


async def modify_url(request: Request):
    if request.url == "https://www.baidu.com/":
        await request.continue_({"url": "https://www.baidu.com/s?wd=ip&ie=utf-8"})
    else:
        await request.continue_()


async def interception_test():
    # Launch the browser
    browser = await launch(**launch_args)
    # Open a new tab
    page = await browser.newPage()
    # Set the page navigation timeout
    page.setDefaultNavigationTimeout(10 * 1000)
    # Set the viewport size
    await page.setViewport({"width": 1920, "height": 1040})

    # Enable request interception
    await page.setRequestInterception(True)

    # Register the interceptor
    # 1. Rewrite the request URL
    page.on("request", modify_url)
    await page.goto("https://www.baidu.com")

    await asyncio.sleep(10)

    # Close the browser
    await page.close()
    await browser.close()


if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(interception_test())
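
When more than one URL needs rewriting, the if/else in `modify_url` can be replaced by a lookup table. This is a rough sketch under assumptions: `URL_REWRITES` and `rewrite_url` are hypothetical names, and the handler uses only `request.url` and `request.continue_()` as shown above.

```python
# Hypothetical rewrite table: original URL -> replacement URL.
URL_REWRITES = {
    "https://www.baidu.com/": "https://www.baidu.com/s?wd=ip&ie=utf-8",
}


async def rewrite_url(request):
    """Continue the request, swapping in a replacement URL if one is configured."""
    target = URL_REWRITES.get(request.url)
    if target is None:
        # No rule for this URL: let the request through unchanged.
        await request.continue_()
    else:
        await request.continue_({"url": target})
```

It is registered the same way as before, with `page.on("request", rewrite_url)` after interception has been enabled.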

Reading the response to a particular request with the response interceptor:

import re

from pyppeteer.network_manager import Response


async def get_content(response: Response):
    """
        # Note: page.setRequestInterception(True) is not needed here
        page.on("response", get_content)
    :param response:
    :return:
    """
    if response.url == "https://www.baidu.com/":
        # response.text() returns str, so the pattern must be str as well
        content = await response.text()
        title = re.search(r"<title>(.*?)</title>", content)
        if title:
            print(title.group(1))
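
The title extraction can also be pulled out into a plain function, which makes it easy to try without a browser. This is a small sketch; `extract_title` is a hypothetical helper name, and the pattern is slightly more lenient than the one above (it ignores case and allows attributes and line breaks).

```python
import re
from typing import Optional

# Case-insensitive; DOTALL lets the title span several lines.
_TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html: str) -> Optional[str]:
    """Return the text of the first <title> element, or None if there is none."""
    match = _TITLE_RE.search(html)
    return match.group(1).strip() if match else None
```

Inside `get_content`, it would be called as `extract_title(await response.text())`.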

Dismissing every dialog on the page:

from pyppeteer.dialog import Dialog


async def handle_dialog(dialog: Dialog):
    """
        page.on("dialog", handle_dialog)
    :param dialog:
    :return:
    """
    await dialog.dismiss()

2. Switching proxies with an interceptor

Normally, a proxy is added to the browser through a launch argument:

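--proxy-server=http://ip:port

Chrome does not accept credentials embedded in this flag. For a proxy that requires authentication, pass the username and password to page.authenticate() instead.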

For example:

launch_args = {
    "headless": False,
    "args": [
        "--proxy-server=http://localhost:1080",
        "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
    ],
}

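The drawback of this approach is obvious: the proxy can only be set when the browser starts. Switching proxies means restarting the browser, which is far too expensive, so it is worth looking for another way.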

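The idea is simple:

- The request interceptor can modify request properties and return a custom response body.
- Send the network request with a third-party library that has the proxy set, then wrap the response and return it to the browser.

The code: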

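import asyncio

import aiohttp
from pyppeteer.network_manager import Request

aiohttp_session = aiohttp.ClientSession(loop=asyncio.get_event_loop())

proxy = "http://127.0.0.1:1080"


async def use_proxy_base(request: Request):
    """
        # Enable request interception
        await page.setRequestInterception(True)
        page.on("request", use_proxy_base)
    :param request:
    :return:
    """
    # Build the outgoing request and attach the proxy
    req = {
        "headers": request.headers,
        "data": request.postData,
        "proxy": proxy,  # read from a global, so it can be switched at any time
        "timeout": aiohttp.ClientTimeout(total=5),
        "ssl": False,
    }
    try:
        # Fetch the response with the third-party library
        async with aiohttp_session.request(
            method=request.method, url=request.url, **req
        ) as response:
            body = await response.read()
    except Exception:
        await request.abort()
        return

    # aiohttp has already decoded the body, so drop the headers that describe
    # the upstream encoding and length
    skip = ("content-encoding", "content-length", "transfer-encoding")
    headers = {k: v for k, v in response.headers.items() if k.lower() not in skip}

    # Hand the data back to the browser
    resp = {"body": body, "headers": headers, "status": response.status}
    await request.respond(resp)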
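
Since the handler reads the proxy from a global on every request, switching proxies only requires changing that value. One way to organize this is a small rotator object. The sketch below is an assumption-laden illustration: `ProxyRotator` is a hypothetical class and the second proxy address is a placeholder.

```python
import itertools


class ProxyRotator:
    """Cycle through a fixed list of proxy URLs (hypothetical helper)."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(list(proxies))
        self.current = next(self._cycle)

    def rotate(self):
        """Advance to the next proxy and return it."""
        self.current = next(self._cycle)
        return self.current


# Placeholder addresses, for illustration only.
rotator = ProxyRotator(["http://127.0.0.1:1080", "http://127.0.0.1:1081"])
```

The interceptor would then pass `"proxy": rotator.current` in place of the global `proxy`, and any other part of the program could call `rotator.rotate()`, for example after a failed request, without restarting the browser.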

Or add a cache on top to save some bandwidth:

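# Cache for static resources
static_cache = {}


async def use_proxy_and_cache(request: Request):
    """
        # Enable request interception
        await page.setRequestInterception(True)
        page.on("request", use_proxy_and_cache)
    :param request:
    :return:
    """
    if request.url not in static_cache:
        # Build the outgoing request and attach the proxy
        req = {
            "headers": request.headers,
            "data": request.postData,
            "proxy": proxy,  # read from a global, so it can be switched at any time
            "timeout": aiohttp.ClientTimeout(total=5),
            "ssl": False,
        }
        try:
            # Fetch the response with the third-party library
            async with aiohttp_session.request(
                method=request.method, url=request.url, **req
            ) as response:
                body = await response.read()
        except Exception:
            await request.abort()
            return

        # aiohttp has already decoded the body, so drop the headers that
        # describe the upstream encoding and length
        skip = ("content-encoding", "content-length", "transfer-encoding")
        headers = {k: v for k, v in response.headers.items() if k.lower() not in skip}

        # Data to hand back to the browser
        resp = {"body": body, "headers": headers, "status": response.status}
        # Check the content type and cache the response if it is a static file
        content_type = response.headers.get("Content-Type")
        if content_type and ("javascript" in content_type or "/css" in content_type):
            static_cache[request.url] = resp
    else:
        resp = static_cache[request.url]

    await request.respond(resp)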
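
A plain dict cache grows for as long as the browser keeps visiting new pages. If that becomes a problem, a size-bounded cache is one option. The following is a sketch only: `BoundedCache`, `is_static` and the 256-entry default are assumptions made for illustration, and the content-type check mirrors the one used above.

```python
from collections import OrderedDict


def is_static(content_type):
    """Same rule as above: treat JavaScript and CSS responses as cacheable."""
    return bool(content_type) and (
        "javascript" in content_type or "/css" in content_type
    )


class BoundedCache:
    """URL -> response dict, evicting the least recently used entry when full."""

    def __init__(self, max_entries=256):
        self._data = OrderedDict()
        self._max = max_entries

    def get(self, url):
        if url not in self._data:
            return None
        self._data.move_to_end(url)  # mark as recently used
        return self._data[url]

    def put(self, url, resp):
        self._data[url] = resp
        self._data.move_to_end(url)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # drop the oldest entry
```

In the interceptor, `static_cache[request.url] = resp` would become `cache.put(request.url, resp)` and the membership test would become `cache.get(request.url)`.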

3. Countering anti-scraping measures

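When we use pyppeteer to drive a browser for scraping, the point is to disguise ourselves so that the target site takes us for a real person. Even so, a few annoying details give us away. For example, if you use the configuration above to simulate logging in to Taobao, you will find that the login never succeeds, because the browser's navigator.webdriver property exposes you. In a normal browser this property is undefined, but under pyppeteer or selenium it is set to true by default.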

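There are two ways to remove this property.

The simple one first: pyppeteer adds one argument to its launch arguments by default: --enable-automation

To remove it, patch the default arguments before importing launch: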

from pyppeteer import launcher
# hook: remove the flag so that webdriver detection is not triggered
launcher.AUTOMATION_ARGS.remove("--enable-automation")
from pyppeteer import launch
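
To check whether the change took effect, the property can be read back from the page. This is a minimal sketch: `webdriver_flag` is a hypothetical helper name, and it relies only on `page.evaluate`.

```python
async def webdriver_flag(page):
    """Return navigator.webdriver as seen by scripts running in `page`."""
    return await page.evaluate("() => navigator.webdriver")
```

After navigating to any page, a truthy result from `await webdriver_flag(page)` suggests that the site can still detect the automation.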

A slightly more involved approach is to use an interceptor to inject JS code.

For the JS code, see:

　　https://github.com/dytttf/little_spider/blob/master/pyppeteer/pass_webdriver.js

Interceptor code:

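async def pass_webdriver(request: Request):
    """
        # Enable request interception
        await page.setRequestInterception(True)
        page.on("request", pass_webdriver)
    :param request:
    :return:
    """
    # Build the outgoing request and attach the proxy
    req = {
        "headers": request.headers,
        "data": request.postData,
        "proxy": proxy,  # read from a global, so it can be switched at any time
        "timeout": aiohttp.ClientTimeout(total=5),
        "ssl": False,
    }
    try:
        # Fetch the response with the third-party library
        async with aiohttp_session.request(
            method=request.method, url=request.url, **req
        ) as response:
            body = await response.read()
    except Exception:
        await request.abort()
        return

    if request.url == "https://www.baidu.com/":
        with open("pass_webdriver.js") as f:
            js = f.read()
        # Insert the JS at the top of the HTML source to patch the navigator properties
        body = body.replace(b"<title>", b"<script>%s</script><title>" % js.encode())

    # aiohttp has already decoded the body, so drop the headers that describe
    # the upstream encoding and length
    skip = ("content-encoding", "content-length", "transfer-encoding")
    headers = {k: v for k, v in response.headers.items() if k.lower() not in skip}

    # Hand the data back to the browser
    resp = {"body": body, "headers": headers, "status": response.status}
    await request.respond(resp)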
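
The injection step can be separated from the network code so that it can be tried on its own. The following is a sketch under assumptions: `inject_script` is a hypothetical helper, and unlike the replace call above it only touches the first `<title>` tag and leaves the body untouched when there is none.

```python
def inject_script(body: bytes, js: str) -> bytes:
    """Insert an inline <script> just before the first <title> tag."""
    marker = b"<title>"
    if marker not in body:
        # Nothing to anchor on: return the page unchanged.
        return body
    script = b"<script>" + js.encode("utf-8") + b"</script>"
    return body.replace(marker, script + marker, 1)
```

In `pass_webdriver`, the line that rewrites `body` would then read `body = inject_script(body, js)`.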