Web Crawler: Single Thread + Multi-Task Asynchronous Coroutines

Date: 2019-11-11

Tags: crawler, single thread, multi-task, asynchronous. Category: web crawler


Key points:

1. The requests module does not support asynchronous operation; wherever asynchronous requests are needed, use the aiohttp module instead.

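2. Define a coroutine function and create a coroutine task from it: the coroutine is wrapped in a Task and scheduled for execution, and the Task object is returned.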
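Point 1 can be illustrated without any network access. The sketch below is only a simulation: time.sleep stands in for a blocking requests call and asyncio.sleep stands in for an awaited aiohttp call. The names DELAY, blocking_fetch, async_fetch and timed are made up for this illustration.

```python
import asyncio
import time

DELAY = 0.1  # simulated network latency in seconds (arbitrary value)


async def blocking_fetch(i):
    # time.sleep stands in for requests.get: it blocks the whole thread,
    # so no other coroutine can run until it returns.
    time.sleep(DELAY)
    return i


async def async_fetch(i):
    # asyncio.sleep stands in for an awaited aiohttp request: awaiting it
    # hands control back to the event loop so other tasks can progress.
    await asyncio.sleep(DELAY)
    return i


async def timed(worker, n=3):
    # Run n workers "concurrently" and measure the total wall-clock time.
    start = time.perf_counter()
    results = await asyncio.gather(*(worker(i) for i in range(n)))
    return results, time.perf_counter() - start


blocking_results, blocking_elapsed = asyncio.run(timed(blocking_fetch))
async_results, async_elapsed = asyncio.run(timed(async_fetch))
print('blocking: %.2fs, async: %.2fs' % (blocking_elapsed, async_elapsed))
```

The blocking version should take roughly three times DELAY because the calls run one after another, while the awaited version should finish in about one DELAY.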

Get the current event loop and start it.

async def func(arg):
    ...

task = asyncio.ensure_future(func(arg))

loop = asyncio.get_event_loop()

loop.run_until_complete(asyncio.wait(task_list))
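The outline above can be turned into a small runnable sketch. This is one possible variant, not the only way to write it: asyncio.run is swapped in for the get_event_loop / run_until_complete pair, since calling asyncio.get_event_loop() with no running loop is deprecated on recent Python versions. The coroutine body (a short sleep followed by doubling the argument) and the main helper are placeholders invented for the example.

```python
import asyncio


async def func(arg):
    # Placeholder for real I/O work, e.g. an HTTP request.
    await asyncio.sleep(0.01)
    return arg * 2


async def main(args):
    # Inside a running loop, ensure_future wraps each coroutine in a Task
    # and schedules it for execution right away.
    task_list = [asyncio.ensure_future(func(a)) for a in args]
    # asyncio.wait suspends until every task in the list has finished.
    await asyncio.wait(task_list)
    # Read results from task_list to keep the original order.
    return [t.result() for t in task_list]


results = asyncio.run(main([1, 2, 3]))
print(results)  # [2, 4, 6]
```

Results are read from task_list rather than from the sets that asyncio.wait returns, because those sets are unordered.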

import asyncio
import re
from random import randint

import aiohttp
import requests
from lxml import etree
target_url = 'https://www.pearvideo.com'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}

# Fetch the home page synchronously and collect the detail-page links.
response = requests.get(url=target_url, headers=headers)
tree = etree.HTML(response.text)
link_list = tree.xpath("//*[@id = 'actRecommendCont'][1]//a[@class = 'actcont-detail actplay']/@href")

# Visit each detail page and pull the video source URL out of the inline script.
videoUrl = []
reg = r'var contId.*?srcUrl="(.*?)"'
for link in link_list:
    detail = 'https://www.pearvideo.com/' + link
    detail_html = requests.get(url=detail, headers=headers).text
    matches = re.findall(reg, detail_html, re.S)
    if matches:  # skip pages where the pattern is not found
        videoUrl.append(matches[0])
       
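async def getVideoDate(url):
    # A random number serves as the file name; two downloads may draw the same one.
    fn = randint(1, 999)
    print('Starting download of video %s' % fn)
    async with aiohttp.ClientSession() as s:
        async with s.get(url=url, headers=headers) as response:
            # response.read() returns the body as bytes (see the aiohttp docs)
            data = await response.read()
            # 'wb' rather than 'ab', so a rerun does not append to an old file
            with open('./%s.mp4' % fn, 'wb') as f:
                f.write(data)
                print('Video %s downloaded' % fn)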
# Wrap each download coroutine in a Task, then run them all on the event loop.
task_list = []
for url in videoUrl:
    task = asyncio.ensure_future(getVideoDate(url))
    task_list.append(task)
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(task_list))
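The regular expression used in the link-collection loop can be tried offline against a small sample. The fragment below is hypothetical: sample_html and the example.com URL are made up, and the real pearvideo.com markup may differ or may have changed since this was written.

```python
import re

# Hypothetical page fragment, shaped like the inline script the pattern expects.
sample_html = '''
<script>
var contId="1000001",liveStatusUrl="liveStatus.jsp",
    srcUrl="https://example.com/videos/demo.mp4",vdoUrl=srcUrl;
</script>
'''

reg = r'var contId.*?srcUrl="(.*?)"'

# re.S lets '.' match newlines, so the lazy '.*?' can span several lines
# between 'var contId' and the first 'srcUrl="'.
matches = re.findall(reg, sample_html, re.S)
video_url = matches[0] if matches else None
print(video_url)  # https://example.com/videos/demo.mp4

# On a page without the script, findall returns an empty list, so indexing
# [0] directly would raise IndexError.
no_match = re.findall(reg, '<html></html>', re.S)
print(no_match)  # []
```

Checking for an empty result before indexing keeps one unexpected page from stopping the whole crawl.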