Using aiohttp to build powerful asynchronous crawlers

Date: 2020-07-06

aiohttp is an asynchronous library that provides HTTP client/server programming for Python, built on asyncio (Python's standard library for asynchronous programming).

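For crawling we mainly use the client side to issue requests. aiohttp is usually combined with asyncio to write asynchronous crawlers; you can think of aiohttp as an asynchronous counterpart to the requests library.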

Here is the simplest example of using aiohttp:

import aiohttp
import asyncio

async def main():
    # Create a session object from ClientSession
    async with aiohttp.ClientSession() as session:
        # session.get returns the response (a ClientResponse object)
        async with session.get("https://baidu.com") as response:
            print(response.status)
            # text() is a coroutine method, so it must be called and awaited
            print(await response.text())

# asyncio.run creates the event loop, runs main(), and closes the loop
asyncio.run(main())

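Note that because this is an asynchronous library, the asynchronous code paths must use async/await syntax throughout.
The session object's HTTP methods (get, post, reading JSON data, and so on) work much like their counterparts in requests.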
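
The payoff for a crawler comes from running several requests at once. The sketch below is a minimal illustration, not a definitive implementation: it starts a throwaway local server with aiohttp's TestServer so that it runs without network access, and the handler, the fetch helper, and the page names are all made up for the example. In a real crawler the URLs would point at the target site.

```python
import asyncio

import aiohttp
from aiohttp import web
from aiohttp.test_utils import TestServer


async def handle(request: web.Request) -> web.Response:
    # Hypothetical handler: echoes the page name taken from the URL path.
    return web.Response(text=f"page {request.match_info['name']}")


async def fetch(session: aiohttp.ClientSession, url) -> str:
    # Every I/O step is awaited, so other requests can progress meanwhile.
    async with session.get(url) as response:
        return await response.text()


async def crawl() -> list:
    app = web.Application()
    app.router.add_get("/{name}", handle)
    # Throwaway local server so the sketch runs offline.
    async with TestServer(app) as server:
        urls = [server.make_url(f"/{name}") for name in ("a", "b", "c")]
        async with aiohttp.ClientSession() as session:
            # gather() runs the fetches concurrently and keeps input order.
            return await asyncio.gather(*(fetch(session, u) for u in urls))


pages = asyncio.run(crawl())
print(pages)
```

One session is shared by all requests here, which lets aiohttp reuse connections instead of opening a new one per URL.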

What follows is a running list of notes.

Passing parameters

async with session.get(url, params=params) as response:  # params is a dict of query parameters
    ...
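
To see what the params argument does to the request, here is a small sketch. It assumes a made-up /search endpoint served by a local TestServer (so nothing leaves the machine) that echoes the query parameters back as JSON; the parameter names q and page are illustrative only.

```python
import asyncio

import aiohttp
from aiohttp import web
from aiohttp.test_utils import TestServer


async def echo_query(request: web.Request) -> web.Response:
    # Hypothetical endpoint: returns the query parameters it received.
    return web.json_response(dict(request.query))


async def main() -> tuple:
    app = web.Application()
    app.router.add_get("/search", echo_query)
    async with TestServer(app) as server:
        async with aiohttp.ClientSession() as session:
            params = {"q": "aiohttp", "page": "2"}
            url = server.make_url("/search")
            async with session.get(url, params=params) as response:
                # response.url is the final URL, query string included.
                return response.url.query_string, await response.json()


query_string, received = asyncio.run(main())
print(query_string)
print(received)
```

The dict is turned into the query string of the final URL, so there is no need to build the string by hand.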

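Note that aiohttp normalizes the URL before sending the request. The host part is IDNA-encoded, and the path and query are requoted. If the server needs the exact representation and the URL must not be requoted, normalization should be disabled. To disable it, build the URL with encoded=True: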

from yarl import URL  # aiohttp uses yarl's URL class
await session.get(URL('http://example.com/%30', encoded=True))
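
The effect of encoded=True can be checked without sending any request, since the normalization happens in yarl's URL class. A minimal sketch, using the same example URL as above:

```python
from yarl import URL

raw = "http://example.com/%30"

# Default behaviour: the path is requoted, so the escape "%30" is rewritten.
normalized = URL(raw)

# encoded=True tells yarl the string is already encoded and must be kept as is.
verbatim = URL(raw, encoded=True)

print(normalized)
print(verbatim)
```

With encoded=True the caller is responsible for passing a correctly encoded string, because yarl no longer fixes it up.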

Decoding text

await resp.text(encoding='utf-8')

Reading the response body

await response.read()   # raw bytes
await response.text()   # decoded str
await response.json()   # parsed JSON

Streaming the response content

await response.content.read(1024)  # response.content is a StreamReader; this reads up to 1024 bytes
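
For large responses, the body can be consumed piece by piece instead of being loaded into memory in one go. The following is a sketch under stated assumptions: the /big endpoint and its 10,000-byte payload are invented for the example and served by a local TestServer, and the chunks are collected in a list only so they can be inspected. A real downloader would typically write each chunk to a file.

```python
import asyncio

import aiohttp
from aiohttp import web
from aiohttp.test_utils import TestServer

PAYLOAD = b"x" * 10_000  # made-up body standing in for a large download


async def big_body(request: web.Request) -> web.Response:
    # Hypothetical endpoint that returns the whole payload.
    return web.Response(body=PAYLOAD)


async def download() -> list:
    app = web.Application()
    app.router.add_get("/big", big_body)
    async with TestServer(app) as server:
        async with aiohttp.ClientSession() as session:
            async with session.get(server.make_url("/big")) as response:
                chunks = []
                # iter_chunked yields pieces of at most 1024 bytes each.
                async for chunk in response.content.iter_chunked(1024):
                    chunks.append(chunk)
                return chunks


chunks = asyncio.run(download())
print(len(chunks), sum(len(c) for c in chunks))
```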

Returning the result

import aiohttp
import asyncio

async def main():
    # A coroutine can return a value without any semaphore.
    # The semaphore below is only a concurrency cap, and it has an effect
    # only when the same instance is shared by several tasks.
    async with asyncio.Semaphore(5):
        async with aiohttp.ClientSession() as session:
            async with session.get("https://baidu.com") as html:
                response = await html.text(encoding='utf-8')
                return response

# asyncio.run returns whatever main() returned
result = asyncio.run(main())
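
A semaphore limits concurrency only when one instance is shared by all the tasks; returning a value from a coroutine does not depend on it. The sketch below illustrates this with the standard library alone. fake_fetch is a hypothetical stand-in for a real session.get call, the example.com URLs are placeholders, and the stats dict exists only to record how many tasks were inside the semaphore at the same time.

```python
import asyncio


async def fake_fetch(url, semaphore, stats):
    # Hypothetical stand-in for an aiohttp request.
    async with semaphore:
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0.01)  # stands in for the network round trip
        stats["running"] -= 1
        return f"body of {url}"


async def main():
    semaphore = asyncio.Semaphore(5)  # one semaphore shared by all tasks
    stats = {"running": 0, "peak": 0}
    urls = [f"https://example.com/{i}" for i in range(20)]
    # gather() collects the return value of every coroutine, in input order.
    results = await asyncio.gather(
        *(fake_fetch(u, semaphore, stats) for u in urls)
    )
    return results, stats["peak"]


results, peak = asyncio.run(main())
print(len(results), peak)
```

All 20 results come back through gather, while the shared semaphore keeps the number of simultaneous fetches at five or fewer.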
