Profiling an Async Crawler's Performance with PyCharm

Busy day today, so this will be a quick one.

The code below comes from the project mentioned in this video; the GitHub link is: github.com/mikeckenned…

The first version is below: an ordinary for-loop crawler. Original source

import requests
import bs4
from colorama import Fore


def main():
    get_title_range()
    print("Done.")


def get_html(episode_number: int) -> str:
    print(Fore.YELLOW + f"Getting HTML for episode {episode_number}", flush=True)

    url = f'https://talkpython.fm/{episode_number}'
    resp = requests.get(url)
    resp.raise_for_status()

    return resp.text


def get_title(html: str, episode_number: int) -> str:
    print(Fore.CYAN + f"Getting TITLE for episode {episode_number}", flush=True)
    soup = bs4.BeautifulSoup(html, 'html.parser')
    header = soup.select_one('h1')
    if not header:
        return "MISSING"

    return header.text.strip()


def get_title_range():
    # Please keep this range pretty small to not DDoS my site. ;)
    for n in range(185, 200):
        html = get_html(n)
        title = get_title(html, n)
        print(Fore.WHITE + f"Title found: {title}", flush=True)


if __name__ == '__main__':
    main()

This code took 37 s to run. Now let's use PyCharm's profiler to see exactly which parts are eating the time.

Click Profile (file name).

以後獲取到獲得一個詳細的函數調用關係、耗時圖:api

You can see that the get_html method accounts for 96.7% of the total time; IO takes up about 97% of the program's run. While fetching the HTML, the program just sits there waiting. If, instead of idly waiting for IO to complete, it could do something useful in the meantime, we would save a great deal of time.
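The same kind of measurement can be reproduced outside PyCharm with the standard-library cProfile and pstats modules. A minimal sketch (slow_io here is a hypothetical stand-in for get_html, using time.sleep instead of a real network call):

```python
import cProfile
import io
import pstats
import time


def slow_io():
    # Stand-in for get_html: blocks the way a network call would
    time.sleep(0.1)


def crawl():
    for _ in range(5):
        slow_io()


profiler = cProfile.Profile()
profiler.enable()
crawl()
profiler.disable()

# Print cumulative timings, mirroring PyCharm's call-graph view
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

In the printed table, slow_io dominates the cumulative time, just as get_html does in the PyCharm graph.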

Let's do a quick calculation: if we switch to asynchronous fetching with asyncio, how much time can we save?

get_html takes 36.8 s in total across 15 calls, so fetching the HTML for one link actually takes 36.8 s / 15 ≈ 2.45 s. **If the fetches were fully concurrent, getting all 15 links would still take about 2.45 s.** Add get_title's 0.6 s, and we estimate the improved program should finish in roughly 3 s, about a 12× speedup.
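The back-of-the-envelope numbers above can be checked directly:

```python
total_io = 36.8    # seconds spent in get_html across the whole run
calls = 15         # number of episodes fetched
parse_time = 0.6   # seconds spent in get_title

per_request = total_io / calls           # ~2.45 s per fetch
# Fully concurrent fetches overlap, so total fetch time ~ one request
estimated_async = per_request + parse_time
speedup = (total_io + parse_time) / estimated_async

print(f"per request: {per_request:.2f}s")           # ~2.45s
print(f"estimated async total: {estimated_async:.2f}s")  # ~3.05s
print(f"speedup: {speedup:.1f}x")                   # ~12.2x
```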

Now let's look at the improved code. Original source

import asyncio
from asyncio import AbstractEventLoop

import aiohttp
import bs4
from colorama import Fore


def main():
    # Create loop
    loop = asyncio.get_event_loop()
    loop.run_until_complete(get_title_range(loop))
    print("Done.")


async def get_html(episode_number: int) -> str:
    print(Fore.YELLOW + f"Getting HTML for episode {episode_number}", flush=True)

    # Make this async with aiohttp's ClientSession
    url = f'https://talkpython.fm/{episode_number}'
    # resp = await requests.get(url)
    # resp.raise_for_status()

    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            resp.raise_for_status()

            html = await resp.text()
            return html


def get_title(html: str, episode_number: int) -> str:
    print(Fore.CYAN + f"Getting TITLE for episode {episode_number}", flush=True)
    soup = bs4.BeautifulSoup(html, 'html.parser')
    header = soup.select_one('h1')
    if not header:
        return "MISSING"

    return header.text.strip()


async def get_title_range(loop: AbstractEventLoop):
    # Please keep this range pretty small to not DDoS my site. ;)
    tasks = []
    for n in range(190, 200):
        tasks.append((loop.create_task(get_html(n)), n))

    for task, n in tasks:
        html = await task
        title = get_title(html, n)
        print(Fore.WHITE + f"Title found: {title}", flush=True)


if __name__ == '__main__':
    main()

The same steps generate the profile graph:

The run now takes about 3.8 s, which roughly matches our estimate.
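One refinement worth noting: the code above opens a new ClientSession for every request, while aiohttp's documentation recommends sharing a single session so connections can be pooled, and asyncio.gather makes the fan-out more idiomatic than the manual task list. A sketch of the fetch-and-parse loop rewritten that way (an assumption on my part, not part of the original code; the URL pattern is the one from the article):

```python
import asyncio

import aiohttp
import bs4


async def fetch_title(session: aiohttp.ClientSession, episode_number: int) -> str:
    # Reuse the shared session instead of creating one per request
    url = f'https://talkpython.fm/{episode_number}'
    async with session.get(url) as resp:
        resp.raise_for_status()
        html = await resp.text()
    header = bs4.BeautifulSoup(html, 'html.parser').select_one('h1')
    return header.text.strip() if header else "MISSING"


async def get_title_range() -> list:
    # One session for all requests; gather runs the fetches concurrently
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_title(session, n) for n in range(190, 200)]
        return await asyncio.gather(*tasks)


# Usage: titles = asyncio.run(get_title_range())
```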

My WeChat public account: 全棧不存在的
