Author: Zhang Yafei
Graduate student at Shanxi Medical University
Python offers three approaches to concurrent programming: multithreading, multiprocessing, and asynchronous I/O. Concurrency can improve a program's throughput and responsiveness; the downsides are that concurrent programs are harder to develop and debug, and they are not always friendly to other programs running on the same machine.
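For the first two approaches, here is a minimal sketch (not part of the original post) using the standard concurrent.futures module, which exposes the same pool interface for both threads and processes:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_task(n):
    # CPU-bound work: best suited to multiple processes (bypasses the GIL)
    return sum(i * i for i in range(n))

def main():
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Threads suit I/O-bound work; map runs the calls concurrently
        print(list(pool.map(cpu_task, [10_000] * 4)))
    with ProcessPoolExecutor(max_workers=4) as pool:
        # Processes suit CPU-bound work
        print(list(pool.map(cpu_task, [10_000] * 4)))

if __name__ == '__main__':
    main()

The third approach, asynchronous I/O, is what the rest of this post covers.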
A Future object can be used to retrieve the result of a task once it completes. Python 3 supports asynchronous processing through the asyncio module and the async and await syntax, introduced in Python 3.5 to define asynchronously executed functions and create asynchronous contexts; in Python 3.7 async and await officially became reserved keywords. A third-party library named aiohttp provides an asynchronous HTTP client and server; it works together with the asyncio module and supports Future objects. The code below asynchronously fetches pages from five URLs and extracts each site's title with a named capture group in a regular expression.
# -*- coding: utf-8 -*-
"""
Datetime: 2019/6/13
Author: Zhang Yafei
Description: async + await + aiohttp asynchronous programming example
"""
import asyncio
import re

import aiohttp

# Named capture group <title> pulls the page title out of the HTML
PATTERN = re.compile(r'<title>(?P<title>.*)</title>')


async def fetch_page(session, url):
    """Fetch one page and return its text asynchronously."""
    async with session.get(url, ssl=False) as resp:
        return await resp.text()


async def show_title(url):
    """Download a page and print the title extracted by the named group."""
    async with aiohttp.ClientSession() as session:
        html = await fetch_page(session, url)
        print(PATTERN.search(html).group('title'))


async def main():
    urls = ('https://www.python.org/',
            'https://git-scm.com/',
            'https://www.jd.com/',
            'https://www.taobao.com/',
            'https://www.douban.com/')
    # asyncio.wait() no longer accepts bare coroutines (Python 3.11+),
    # so gather the coroutines instead
    await asyncio.gather(*(show_title(url) for url in urls))


if __name__ == '__main__':
    # asyncio.run() creates, runs and closes the event loop (Python 3.7+)
    asyncio.run(main())
asyncio is a good choice when a program does not need true concurrency or parallelism but instead relies heavily on asynchronous processing and callbacks. It is also worth considering when a program spends much of its time waiting or sleeping; it is well suited to writing web application servers without hard real-time data-processing requirements.
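A minimal sketch (not from the original post) of why waiting-heavy workloads benefit: three one-second waits run concurrently and complete in roughly one second total, not three.

import asyncio
import time

async def wait_and_return(n):
    await asyncio.sleep(1)  # stands in for network or disk latency
    return n

async def main():
    start = time.perf_counter()
    # gather schedules all three coroutines on the same event loop
    results = await asyncio.gather(*(wait_and_return(i) for i in range(3)))
    print(results, f'{time.perf_counter() - start:.2f}s')

asyncio.run(main())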
Directory structure
- manage.py: project entry point
- engine.py: the project engine (a sketch of its Request interface follows this list)
- settings.py: project configuration
- spiders folder: where the spider classes are written
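engine.py itself is not listed in this post. Judging only from how the spider below uses it, its Request class looks roughly like the following hypothetical sketch; the attribute names are inferred from the spider code, not confirmed against the engine source:

# Hypothetical reconstruction of engine.py's Request, inferred from usage
class Request:
    def __init__(self, url, callback, meta=None):
        self.url = url            # address to download
        self.callback = callback  # coroutine invoked with the response
        self.meta = meta or {}    # extra data carried through to the callback

Presumably the real engine downloads each Request with aiohttp, attaches the originating Request to the response (which is why a callback can read response.request.meta), and schedules any Request objects a callback returns. The project's settings.py: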
import os

DIR_PATH = os.path.abspath(os.path.dirname(__file__))

# Dotted path of the spider class to run (module path + class name)
Spider_Name = 'spiders.xiaohua.XiaohuaSpider'

# Global request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}

TO_FILE = 'xiaohua.csv'

# Folder for saving downloaded images
IMAGE_DIR = 'images'
if not os.path.exists(IMAGE_DIR):
    os.mkdir(IMAGE_DIR)
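manage.py is not listed either. A plausible sketch, assuming it resolves the dotted Spider_Name string from settings.py with importlib; the hand-off to the engine is hypothetical and shown only as a comment:

# Hypothetical sketch of manage.py: turn the dotted path in
# settings.Spider_Name into a class object
import importlib

from settings import Spider_Name


def load_spider(dotted_path):
    module_path, class_name = dotted_path.rsplit('.', 1)
    module = importlib.import_module(module_path)  # e.g. spiders.xiaohua
    return getattr(module, class_name)             # e.g. XiaohuaSpider


if __name__ == '__main__':
    spider_cls = load_spider(Spider_Name)
    # the real manage.py would hand spider_cls to the engine, e.g.
    # engine.run(spider_cls) -- that entry point is assumed, not shown here
    print(f'Loaded spider: {spider_cls.__name__}')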
- Example: writing a spider for the xiaohua site
# -*- coding: utf-8 -*-
"""
Datetime: 2019/6/11
Author: Zhang Yafei
Description: spider definition
"""
import os
import re
from urllib.parse import urljoin

import pandas as pd

from engine import Request
from settings import IMAGE_DIR, TO_FILE


class XiaohuaSpider(object):
    """ Custom spider class """
    # 1. Define the list of start URLs
    start_urls = [f'http://www.xiaohuar.com/list-1-{i}.html' for i in range(4)]

    def filter_downloaded_urls(self):
        """ 2. Add filtering rules for already-downloaded URLs """
        # self.start_urls = self.start_urls
        pass

    def start_request(self):
        """ 3. Put the start requests into the request queue (a set) and send them """
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

    async def parse(self, response):
        """ 4. Parse the data out of each response """
        html = await response.text(encoding='gbk')
        reg = re.compile('<img width="210".*alt="(.*?)".*src="(.*?)" />')
        results = re.findall(reg, html)
        item_list = []
        request_list = []
        for name, src in results:
            # Make relative image paths absolute
            img_url = src if src.startswith('http') else urljoin('http://www.xiaohuar.com', src)
            item_list.append({'name': name, 'img_url': img_url})
            request_list.append(Request(url=img_url, callback=self.download_img, meta={'name': name}))
        # 4.1 Store the parsed data
        await self.store_data(data=item_list, url=response.url)
        # 4.2 Return follow-up requests with their callbacks for the engine to schedule
        return request_list

    @staticmethod
    async def store_data(data, url):
        """ 5. Data storage: append to the CSV, writing the header only once """
        df = pd.DataFrame(data=data)
        if os.path.exists(TO_FILE):
            df.to_csv(TO_FILE, index=False, mode='a', header=False, encoding='utf_8_sig')
        else:
            df.to_csv(TO_FILE, index=False, encoding='utf_8_sig')
        print(f'{url}\tdata saved')

    @staticmethod
    async def download_img(response):
        """ Second-level download: save each image under the name carried in meta """
        name = response.request.meta.get('name')
        with open(os.path.join(IMAGE_DIR, f'{name}.jpg'), mode='wb') as f:
            f.write(await response.read())
        print(f'{name}\tdownloaded')
To run the project:
cd AsyncSpider
python manage.py
(Screenshots in the original post: the console run output, the downloaded images, and the resulting CSV file.)
Gitee repo: https://gitee.com/zhangyafeii/AsyncSpider