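The get method is blocking: the caller waits until the response comes back.
Approaches to asynchronous crawling:
- Multithreading / multiprocessing (not recommended)
Pros: each blocking operation can be given its own thread or process, so it runs asynchronously.
Cons: threads or processes cannot be spawned without limit.
- Thread pool / process pool (use in moderation)
Pros: reduces how often the system creates and destroys threads or processes, which lowers system overhead.
Cons: the number of threads or processes in a pool has an upper limit (when the number of blocking operations far exceeds the pool size, the speedup is not significant).
Principle: use a pool for operations that are both blocking and time-consuming.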
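The trade-off between the two approaches can be sketched as follows. This is a minimal illustration, not a benchmark: `blocking_task`, `run_with_threads`, `run_with_pool` and `timed` are hypothetical helpers, and `time.sleep(0.2)` stands in for a blocking network request. `Pool` here is the thread-based pool from `multiprocessing.dummy`, whose basic usage is covered right after this.

```python
import threading
import time
from multiprocessing.dummy import Pool  # thread-based pool


def blocking_task(n):
    """Stand-in for a blocking call such as an HTTP request (hypothetical)."""
    time.sleep(0.2)
    return n * 2


def run_with_threads(items):
    """Start one thread per task: simple, but the thread count grows with the input."""
    results = [None] * len(items)

    def worker(index, item):
        results[index] = blocking_task(item)

    threads = [
        threading.Thread(target=worker, args=(i, item))
        for i, item in enumerate(items)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def run_with_pool(items, size):
    """Reuse a fixed number of threads: tasks beyond `size` wait for a free worker."""
    with Pool(size) as pool:
        return pool.map(blocking_task, items)


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.time()
    result = func(*args)
    return result, time.time() - start


if __name__ == "__main__":
    items = list(range(8))
    _, thread_seconds = timed(run_with_threads, items)
    _, pool_seconds = timed(run_with_pool, items, 4)
    print("one thread per task: %.2fs" % thread_seconds)
    print("pool of 4 threads:   %.2fs" % pool_seconds)
```

With 8 tasks and a pool of 4, the pool needs roughly two rounds of waiting, while one thread per task needs roughly one. The pool is slower here but caps resource usage, which is what matters once the task count runs into the hundreds or thousands.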
Basic usage of a thread pool:
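    from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API
    import time

    start_time = time.time()

    def f1(name):
        print("%s is running" % name)
        time.sleep(2)  # simulate a blocking operation
        print("%s running done" % name)

    name_list = ['a', 'b', 'c', 'd']
    # Instantiate the thread pool with 4 worker threads
    pool = Pool(4)
    # pool.map(func, iterable) blocks until every call has finished
    pool.map(f1, name_list)
    # Prints about 2 seconds instead of 8, because the four calls overlap
    print(time.time() - start_time)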
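One detail the basic example does not show is that `pool.map` also collects return values, and returns them in input order regardless of which call finishes first. A small sketch, assuming a hypothetical `slow_upper` function whose delays are chosen so that later items finish earlier:

```python
import time
from multiprocessing.dummy import Pool

# Hypothetical delays: 'a' is the slowest call, 'd' the fastest.
DELAYS = {'a': 0.3, 'b': 0.2, 'c': 0.1, 'd': 0.0}
finish_order = []  # filled in by the worker threads as they complete


def slow_upper(name):
    """Sleep for a per-item delay, record completion, return the upper-cased name."""
    time.sleep(DELAYS[name])
    finish_order.append(name)
    return name.upper()


name_list = ['a', 'b', 'c', 'd']
# The with-block releases the worker threads once map() has returned.
with Pool(4) as pool:
    results = pool.map(slow_upper, name_list)

print(results)       # ['A', 'B', 'C', 'D'], same order as name_list
print(finish_order)  # completion order, which depends on timing
```

This is why a crawler can pass a list of URLs to `pool.map` and pair each result with its URL by position.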
Thread pool case study:
- Pear Video (梨視頻), Life category: download the most popular videos
    from multiprocessing.dummy import Pool
    import requests
    import re
    from lxml import etree

    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Mobile Safari/537.36'
    }
    url = 'https://www.pearvideo.com/category_5'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//li[@class="categoryem"]')
    urls = []  # name and URL of every video
    for li in li_list:
        detail_url = 'https://www.pearvideo.com/' + li.xpath('./div/a/@href')[0]
        name = li.xpath(".//div[@class='vervideo-title']/text()")[0] + ".mp4"
        res = requests.get(url=detail_url, headers=headers).text
        # The video address is loaded dynamically inside a <script> tag,
        # which XPath cannot match, so extract it with a regex instead
        ex = 'srcUrl="(.*?)",vdoUrl'
        video_url = re.findall(ex, res)[0]
        dic = {
            'name': name,
            'url': video_url
        }
        urls.append(dic)

    def f1(dic):
        # Downloading is the blocking, time-consuming step, so it runs in the pool
        print(dic['name'], "downloading")
        video_content = requests.get(url=dic['url'], headers=headers).content
        # Persist the video to disk
        with open(dic['name'], 'wb') as f:
            f.write(video_content)
        print(dic['name'], "downloaded")

    pool = Pool(5)
    pool.map(f1, urls)
    pool.close()
    pool.join()
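The regex step in the case study can be tried in isolation, without touching the network. The sketch below runs the same pattern against a made-up page fragment: `sample_page`, its URL and the helper `extract_video_url` are all hypothetical, and the real page markup may differ or may have changed since these notes were written.

```python
import re

# Hypothetical fragment imitating an inline script that holds the video address.
sample_page = '''
<script type="text/javascript">
    var contId="1234567",srcUrl="https://video.example.com/mp4/sample.mp4",vdoUrl=srcUrl;
</script>
'''

# Same pattern as in the case study, compiled once for reuse.
SRC_URL_PATTERN = re.compile(r'srcUrl="(.*?)",vdoUrl')


def extract_video_url(page_text):
    """Return the first srcUrl value found in the page, or None if there is none."""
    match = SRC_URL_PATTERN.search(page_text)
    return match.group(1) if match else None


print(extract_video_url(sample_page))
print(extract_video_url("<html></html>"))
```

Returning `None` instead of indexing with `[0]` lets the caller skip a page whose markup does not match, rather than stopping the whole crawl with an `IndexError`.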