Python: scraping Maoyan Movie Pro (貓眼電影專業版) asynchronously with asyncio + aiohttp

 

asyncio is a powerful asynchronous concurrency library that was added to the standard library in Python 3.4. It handles high-concurrency workloads in Python very well; to get started, see the official documentation.
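For a first taste of the model, here is a minimal sketch (Python 3.7+ syntax; the coroutine names are purely illustrative):

import asyncio

async def say(word, delay):
    await asyncio.sleep(delay)  # non-blocking sleep: other coroutines keep running
    print(word)

async def main():
    # Both coroutines run concurrently, so this takes about 1 second, not 2
    await asyncio.gather(say('hello', 1), say('world', 1))

asyncio.run(main())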

Concurrent requests can dramatically improve a crawler's performance, but requests blocks on every page and cannot run concurrently, so we need a more powerful library: aiohttp. Its usage is similar to requests, and you can think of it as an asynchronous version of requests. Below we get familiar with both by scraping Maoyan Movie Pro.

1. Analysis

Inspecting the page source reveals that Maoyan Pro is a dynamic page whose data is delivered from the backend. Open the F12 developer tools, refresh the page, and filter by XHR: the first entry is the movie data sent from the backend, which gives us the endpoint https://box.maoyan.com/promovie/api/box/second.json?beginDate=<date>
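As a quick sanity check, a minimal sketch that queries the endpoint for one day and prints the first movie name (the date 20190410 is just an example value; the data['data']['list'] layout matches what the parser below expects):

import requests

# beginDate takes a date in YYYYMMDD form
url = 'https://box.maoyan.com/promovie/api/box/second.json?beginDate=20190410'
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
data = r.json()
# The movie records live under data -> list
print(data['data']['list'][0]['movieName'])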

 

 


2. Asynchronous scraping

Create 20 tasks to concurrently fetch 20 days of movie data, write it to a CSV file, and measure how long it takes:

import asyncio
import aiohttp
import time
import csv
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/67.0.3396.99 Safari/537.36'}


# Coroutine: fetch one page without blocking
async def get_one_page(url):
    try:
        conn = aiohttp.TCPConnector(verify_ssl=False)  # avoid SSL certificate errors
        async with aiohttp.ClientSession(connector=conn) as session:  # create a session
            async with session.get(url, headers=headers) as r:
                # Return the movie data parsed into a dict
                return await r.json()
    except Exception as e:
        print('Request error: ' + str(e))
        return {}


# Parser: extract each record and write it to the CSV file
def parse_one_page(movie_dict, writer):
    try:
        movie_list = movie_dict['data']['list']
        for movie in movie_list:
            movie_name = movie['movieName']
            release = movie['releaseInfo']
            sum_box = movie['sumBoxInfo']
            box_info = movie['boxInfo']
            box_rate = movie['boxRate']
            show_info = movie['showInfo']
            show_rate = movie['showRate']
            avg_show_view = movie['avgShowView']
            avg_seat_view = movie['avgSeatView']
            writer.writerow([movie_name, release, sum_box, box_info, box_rate,
                             show_info, show_rate, avg_show_view, avg_seat_view])
        return 'Write succeeded'
    except Exception as e:
        return 'Parse error: ' + str(e)


# Concurrent crawl
async def main():
    # The 20 URLs to visit (dates 20190410 through 20190429)
    urls = ['https://box.maoyan.com/promovie/api/box/second.json?beginDate=201904{}{}'.format(i, j)
            for i in range(1, 3) for j in range(10)]
    # Task list
    tasks = [get_one_page(url) for url in urls]
    # Run the tasks concurrently and collect each task's return value
    results = await asyncio.gather(*tasks)

    # Process each result (newline='' keeps csv from writing blank rows on Windows)
    with open('pro_info.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        for result in results:
            print(parse_one_page(result, writer))


if __name__ == "__main__":

    start = time.time()

    # asyncio.run(main())  # Python 3.7+
    # Pre-3.7 style:
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    loop.close()

    print(time.time() - start)

  

 

3. Comparison with synchronous scraping

import requests
import csv
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/67.0.3396.99 Safari/537.36'}


def get_one_page(url):
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.json()
    except Exception as e:
        print('Request error: ' + str(e))
        return {}


def parse_one_page(movie_dict, writer):
    try:
        movie_list = movie_dict['data']['list']
        for movie in movie_list:
            movie_name = movie['movieName']
            release = movie['releaseInfo']
            sum_box = movie['sumBoxInfo']
            box_info = movie['boxInfo']
            box_rate = movie['boxRate']
            show_info = movie['showInfo']
            show_rate = movie['showRate']
            avg_show_view = movie['avgShowView']
            avg_seat_view = movie['avgSeatView']
            writer.writerow([movie_name, release, sum_box, box_info, box_rate,
                             show_info, show_rate, avg_show_view, avg_seat_view])
        print('Write succeeded')
    except Exception as e:
        print('Parse error: ' + str(e))


def main():
    # The 20 URLs to visit
    urls = ['https://box.maoyan.com/promovie/api/box/second.json?beginDate=201903{}{}'.format(i, j)
            for i in range(1, 3) for j in range(10)]
    with open('out/pro_info.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        for url in urls:
            # Handle one URL at a time
            movie_dict = get_one_page(url)
            parse_one_page(movie_dict, writer)


if __name__ == '__main__':
    a = time.time()
    main()
    print(time.time() - a)
 

As you can see, asynchronous scraping with asyncio + aiohttp is considerably faster than simple synchronous scraping with requests, and the gap becomes very noticeable when scraping a large number of pages.
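One refinement worth noting: the async version above opens a fresh ClientSession for every URL, and asyncio.gather fires all requests at once. When scraping many more pages, it is usually better to share one session and cap the number of in-flight requests. Below is a minimal sketch of that idea; the fetch helper, the limit of 5, and the inlined URL list are illustrative choices, not part of the original code (Python 3.7+):

import asyncio
import aiohttp

headers = {'User-Agent': 'Mozilla/5.0'}
urls = ['https://box.maoyan.com/promovie/api/box/second.json?beginDate=201904{:02d}'.format(d)
        for d in range(10, 30)]  # the same 20 dates as above

async def fetch(session, sem, url):
    # Wait for a free slot, so at most 5 requests are in flight at once
    async with sem:
        async with session.get(url, headers=headers) as r:
            return await r.json()

async def main():
    sem = asyncio.Semaphore(5)  # concurrency cap; 5 is an arbitrary example value
    # One shared session reuses connections instead of opening a new one per URL
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, sem, url) for url in urls]
        return await asyncio.gather(*tasks)

results = asyncio.run(main())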
