An asynchronous Python web scraping framework based on asyncio

Date: 2019-12-08

aspider

A web scraping micro-framework based on asyncio.

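aspider is a lightweight asynchronous scraping framework built on asyncio. It aims to make single-page scrapers quicker and easier to write, and it uses asynchronous I/O to speed up crawling by cutting the time spent waiting on I/O.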
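To make the I/O claim concrete, here is a minimal sketch that uses only the standard library. `fake_fetch` is a hypothetical stand-in for a network request (it just sleeps), and the `example.com` URLs are placeholders. It is not part of aspider. It only shows that waits overlap when coroutines run concurrently.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.2) -> str:
    # Hypothetical stand-in for a network request: sleeping simulates I/O wait.
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def crawl(urls):
    # gather() schedules all coroutines concurrently, so their waits overlap.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
start = time.perf_counter()
pages = asyncio.run(crawl(urls))
elapsed = time.perf_counter() - start
print(f"fetched {len(pages)} pages in {elapsed:.2f}s")
```

Fetching the three pages one after another would take about 0.6 seconds of simulated waiting. Run concurrently, the total stays close to a single 0.2 second wait.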

Introduction

pip install aspider

Item

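For a single page, implementing the Item class defined by the framework is enough to extract the target data. The snippet below covers the fetching half, using Request to download the page: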

import asyncio

from aspider import Request

request = Request("https://news.ycombinator.com/")
response = asyncio.get_event_loop().run_until_complete(request.fetch())

# Output
# [2018-07-25 11:23:42,620]-Request-INFO  <GET: https://news.ycombinator.com/>
# <Response url[text]: https://news.ycombinator.com/ status:200 metadata:{}>
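The example above drives the coroutine with `asyncio.get_event_loop().run_until_complete(...)`. On Python 3.7 and later, `asyncio.run(...)` is the more common way to do the same thing from synchronous code. The sketch below shows that form with a hypothetical `fetch_stub` coroutine standing in for `request.fetch()`. Whether aspider's own `fetch()` behaves identically under `asyncio.run` has not been checked here.

```python
import asyncio


async def fetch_stub() -> str:
    # Hypothetical stand-in for `request.fetch()`; no network access involved.
    await asyncio.sleep(0)
    return "<Response status:200>"


# asyncio.run creates a fresh event loop, runs the coroutine, then closes the loop.
response = asyncio.run(fetch_stub())
print(response)
```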

Spider

When there are many target pages and a deeper crawl is needed, Spider comes in handy:

import aiofiles

from aspider import AttrField, TextField, Item, Spider


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        return value


class HackerNewsSpider(Spider):
    start_urls = ['https://news.ycombinator.com/', 'https://news.ycombinator.com/news?p=2']

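    async def parse(self, res):
        # Build the items from the HTML of the fetched page.
        items = await HackerNewsItem.get_items(html=res.body)
        # Open the output file once per page and append each title on its own line.
        async with aiofiles.open('./hacker_news.txt', 'a') as f:
            for item in items:
                await f.write(item.title + '\n')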


if __name__ == '__main__':
    HackerNewsSpider.start()
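The `clean_title` hook in `HackerNewsItem` returns the value unchanged. As a rough illustration of what such a per-field hook can be used for, here is a library-free sketch. `MiniItem`, `StoryItem`, and `build` are hypothetical names invented for this example, and this is not how aspider is implemented internally. It only shows the general pattern of looking up an optional `clean_<field>` coroutine and applying it to each extracted value.

```python
import asyncio


class MiniItem:
    """Hypothetical stand-in for an item class; not aspider's implementation."""

    def __init__(self, **raw_fields):
        self.raw_fields = raw_fields

    async def build(self):
        # For every raw field, apply an optional `clean_<name>` coroutine if defined.
        for name, value in self.raw_fields.items():
            cleaner = getattr(self, f"clean_{name}", None)
            if cleaner is not None:
                value = await cleaner(value)
            setattr(self, name, value)
        return self


class StoryItem(MiniItem):
    async def clean_title(self, value):
        # Collapse runs of whitespace instead of returning the value as-is.
        return " ".join(value.split())


item = asyncio.run(
    StoryItem(title="  Show HN:   aspider \n", url="https://example.com/").build()
)
print(item.title)  # Show HN: aspider
print(item.url)    # https://example.com/
```

Fields without a matching hook, such as `url` here, pass through untouched.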

JavaScript loading support

The Request class also handles pages whose content only appears after JavaScript has run. The following example fetches such a page:

import asyncio
from aspider import Request

request = Request("https://www.jianshu.com/", load_js=True)
response = asyncio.get_event_loop().run_until_complete(request.fetch())
print(response.body)

If you like it, give it a try. The project is on GitHub: aspider