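A web scraping micro-framework based on asyncio.
aspider is a lightweight asynchronous crawler framework built on asyncio. Its goal is to make single-page crawlers easier and quicker to write, and to make crawling faster by using asynchronous I/O to cut the time spent waiting on I/O.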
pip install aspider
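The speed-up from asynchronous I/O can be illustrated without any network access. The sketch below uses only the standard library and does not use aspider; `fake_fetch`, `fetch_sequential`, `fetch_concurrent`, `timed`, and the example.com URLs are hypothetical names made up for this illustration, and `asyncio.sleep` stands in for a real request.

```python
import asyncio
import time


async def fake_fetch(url: str, delay: float = 0.2) -> str:
    """Hypothetical stand-in for a network request: waits without blocking the event loop."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def fetch_sequential(urls):
    # Each simulated request starts only after the previous one has finished.
    return [await fake_fetch(url) for url in urls]


async def fetch_concurrent(urls):
    # All simulated requests wait at the same time; results keep the input order.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


def timed(coro_func, urls):
    """Run an async function to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    pages = asyncio.run(coro_func(urls))
    return pages, time.perf_counter() - start


# Placeholder URLs; nothing is actually downloaded.
URLS = [f"https://example.com/page/{i}" for i in range(5)]

if __name__ == "__main__":
    _, sequential_time = timed(fetch_sequential, URLS)
    _, concurrent_time = timed(fetch_concurrent, URLS)
    print(f"sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

With five simulated requests, the sequential version waits for each delay in turn, while the concurrent version overlaps the waits, so its total time stays close to a single delay.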
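For a single page, implementing the Item class defined by the framework is enough to scrape the target data. Fetching the page itself takes a single Request: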
import asyncio

from aspider import Request

request = Request("https://news.ycombinator.com/")
response = asyncio.get_event_loop().run_until_complete(request.fetch())

# Output
# [2018-07-25 11:23:42,620]-Request-INFO <GET: https://news.ycombinator.com/>
# <Response url[text]: https://news.ycombinator.com/ status:200 metadata:{}>
When a page has many targets and deeper crawling is needed, Spider comes in handy:
import aiofiles

from aspider import AttrField, TextField, Item, Spider


class HackerNewsItem(Item):
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        return value


class HackerNewsSpider(Spider):
    start_urls = ['https://news.ycombinator.com/', 'https://news.ycombinator.com/news?p=2']

    async def parse(self, res):
        items = await HackerNewsItem.get_items(html=res.body)
        for item in items:
            async with aiofiles.open('./hacker_news.txt', 'a') as f:
                await f.write(item.title + '\n')


if __name__ == '__main__':
    HackerNewsSpider.start()
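To make the field definitions above more concrete, here is a rough standard-library sketch of the kind of extraction they describe: for each `tr.athing` row, take the text and the `href` of the `a.storylink` link. This is not how aspider is implemented. `StoryLinkParser`, `extract_items`, and `SAMPLE_HTML` are hypothetical names invented for this illustration, and the sample markup is a simplified assumption about the page structure.

```python
from html.parser import HTMLParser

# Simplified, made-up markup that mimics the selectors used in the Item above.
SAMPLE_HTML = """
<table>
  <tr class="athing"><td><a class="storylink" href="https://example.com/a">First story</a></td></tr>
  <tr class="athing"><td><a class="storylink" href="https://example.com/b">Second story</a></td></tr>
  <tr class="spacer"><td><a href="https://example.com/ignored">Not a story</a></td></tr>
</table>
"""


class StoryLinkParser(HTMLParser):
    """Collects title and url for every <a class="storylink"> inside <tr class="athing">."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_row = False   # True while inside a <tr class="athing">
        self._current = None   # Item being built while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "tr":
            self._in_row = "athing" in classes
        elif tag == "a" and self._in_row and "storylink" in classes:
            self._current = {"title": "", "url": attrs.get("href")}

    def handle_data(self, data):
        # Accumulate link text only while inside a matching <a>.
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.items.append(self._current)
            self._current = None
        elif tag == "tr":
            self._in_row = False


def extract_items(html):
    """Return a list of {'title': ..., 'url': ...} dicts found in the HTML."""
    parser = StoryLinkParser()
    parser.feed(html)
    return parser.items


if __name__ == "__main__":
    for item in extract_items(SAMPLE_HTML):
        print(item["title"], item["url"])
```

Declaring the fields on an Item, as in the aspider example, replaces this kind of hand-written parsing with a few selector definitions.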
JavaScript loading support
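The Request class also works well for pages that rely on JavaScript and returns their content. The following example scrapes a page that can only be scraped after its JavaScript has loaded: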
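import asyncio

from aspider import Request

# load_js=True asks Request to load the page's JavaScript before returning
request = Request("https://www.jianshu.com/", load_js=True)
response = asyncio.get_event_loop().run_until_complete(request.fetch())
print(response.body)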
If you like it, give it a try. The project's GitHub repository: aspider