Custom Asynchronous Crawler Framework - AsyncSpider

Date: 2019-12-12


Author: Zhang Yafei
Graduate student at Shanxi Medical University

1. Concurrent programming

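Python offers three approaches to concurrent programming: multithreading, multiprocessing, and asynchronous I/O. Concurrency can improve a program's execution efficiency and its responsiveness to users. The cost is that concurrent programs are harder to develop and debug, and they can be unfriendly to other programs running on the same machine.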

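Multithreading: Python provides the Thread class, supported by Lock, Condition, Event, Semaphore, and Barrier. CPython has a GIL (global interpreter lock) that prevents multiple threads from executing Python bytecode at the same time. The lock is required because CPython's memory management is not thread-safe, and because of it multithreaded code cannot exploit multiple CPU cores.
Multiprocessing: multiple processes sidestep the GIL. The main class is Process, and the supporting classes mirror those in the threading module. Processes can share data through pipes, sockets, and so on, and the multiprocessing module provides a Queue class, built on pipes and locks, that several processes can share. The official documentation includes an example of multiprocessing with a process pool.
Asynchronous processing: a scheduler picks tasks from its task queue and runs them in an interleaved fashion. There is no guarantee about the order in which tasks run, because that order depends on whether one task in the queue is willing to yield CPU time to another. Asynchronous tasks are usually implemented through cooperative multitasking, and because execution time and order are uncertain, results have to be obtained through callbacks or future objects. Python 3 supports asynchronous processing through the asyncio module and the async and await keywords (formally reserved keywords since Python 3.7).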
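To make the last point concrete, here is a minimal sketch of cooperative scheduling with asyncio. The names `worker` and `demo` and the delays are illustrative assumptions, not part of the original article. It shows two tasks interleaving at their `await` points, and the two ways of collecting results mentioned above: a done-callback and awaiting the future objects directly.

```python
import asyncio


async def worker(name, delay, log):
    """Pretend to do I/O; each `await` hands control back to the event loop."""
    log.append(f'{name} start')
    await asyncio.sleep(delay)
    log.append(f'{name} done')
    return name.upper()


async def demo():
    log, callback_results = [], []
    # Wrapping a coroutine in a Task schedules it on the running event loop.
    slow = asyncio.create_task(worker('slow', 0.2, log))
    fast = asyncio.create_task(worker('fast', 0.05, log))
    # Callback style: the callback receives the finished future.
    fast.add_done_callback(lambda fut: callback_results.append(fut.result()))
    # Future style: await the tasks and read their results directly.
    results = await asyncio.gather(slow, fast)
    return log, callback_results, results


log, callback_results, results = asyncio.run(demo())
print(log)
print(callback_results, results)
```

The log shows that `fast` finishes before `slow` even though `slow` was started first, while `gather` still returns results in the order the tasks were passed in.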

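aiohttp is a third-party library that provides an asynchronous HTTP client and server. It works together with the asyncio module and supports Future objects. Python 3.5 introduced async and await for defining asynchronous functions and creating asynchronous contexts, and Python 3.7 made them reserved keywords. The code below fetches pages from five URLs asynchronously and extracts each site's title with a named capture group in a regular expression.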

# -*- coding: utf-8 -*-

"""
Datetime: 2019/6/13
Author: Zhang Yafei
Description: async + await + aiohttp asynchronous programming example
"""
import asyncio
import re

import aiohttp

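# Non-greedy match so only the first <title>...</title> pair is captured;
# DOTALL lets the title span several lines, IGNORECASE accepts <TITLE>
PATTERN = re.compile(r'<title>(?P<title>.*?)</title>', re.IGNORECASE | re.DOTALL)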


async def fetch_page(session, url):
    async with session.get(url, ssl=False) as resp:
        return await resp.text()


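async def show_title(url):
    async with aiohttp.ClientSession() as session:
        html = await fetch_page(session, url)
        match = PATTERN.search(html)
        # Some pages have no <title>; avoid calling .group() on None
        print(match.group('title') if match else f'{url}: no title found')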


async def main():
    urls = ('https://www.python.org/',
            'https://git-scm.com/',
            'https://www.jd.com/',
            'https://www.taobao.com/',
            'https://www.douban.com/')
    # gather() runs the coroutines concurrently; passing bare coroutines
    # to asyncio.wait() is no longer allowed as of Python 3.11
    await asyncio.gather(*(show_title(url) for url in urls))


if __name__ == '__main__':
    # asyncio.run() creates the event loop and closes it when main() returns
    asyncio.run(main())
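The named capture group can be tried without any network access. The sketch below is an illustration only: `extract_title` and the sample HTML string are assumptions introduced here, and the pattern is a non-greedy variant of the one in the listing.

```python
import re

# (?P<title>...) names the group so it can be read back with .group('title')
TITLE_PATTERN = re.compile(r'<title>(?P<title>.*?)</title>',
                           re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text of the first <title> element, or None if absent."""
    match = TITLE_PATTERN.search(html)
    return match.group('title').strip() if match else None


# Hypothetical page content used only for this demonstration
sample = '<html><head><title> Welcome to Python.org </title></head></html>'
print(extract_title(sample))
print(extract_title('<p>no title here</p>'))
```

Returning None instead of raising keeps one malformed page from aborting the whole batch of downloads.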

Asynchronous I/O versus multiprocessing.

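asyncio is a good choice when a program does not need true concurrency or parallelism and relies more on asynchronous processing and callbacks. It is also worth considering when a program spends much of its time waiting or sleeping, and it is well suited to writing web application servers that have no real-time data processing requirements.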
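The benefit for wait-heavy programs can be seen in a small timing sketch. Here `asyncio.sleep` stands in for network or disk latency, and the names `wait_for_io` and `run_all` and the delays are assumptions chosen for illustration.

```python
import asyncio
import time


async def wait_for_io(seconds):
    """Stand-in for an I/O call that spends its time waiting."""
    await asyncio.sleep(seconds)
    return seconds


async def run_all(delays):
    # All waits overlap, so the total is close to the longest single wait.
    return await asyncio.gather(*(wait_for_io(d) for d in delays))


delays = [0.1, 0.2, 0.3]
started = time.perf_counter()
finished = asyncio.run(run_all(delays))
elapsed = time.perf_counter() - started
print(f'{sum(delays):.1f}s of waiting finished in {elapsed:.2f}s')
```

Run sequentially, the same waits would take the sum of the delays. With a single thread and an event loop they overlap, which is why asyncio helps I/O-bound work but not CPU-bound work.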

2. Custom asynchronous crawler framework - AsyncSpider

Directory structure

manage.py: project entry point

engine.py: project engine

settings.py: project settings

spiders folder: spider implementations
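The article does not include engine.py, although the spider further down imports `Request` from it. The following is a minimal sketch of how such an engine could work, under stated assumptions, and is not the author's implementation. The `Engine` class, its `fetch` and `concurrency` parameters, the `seen` set, and the fake pages are all hypothetical. A real engine would presumably fetch with aiohttp and attach the request to the response.

```python
import asyncio
import inspect


class Request:
    """Hypothetical request object: a URL, a callback, optional metadata."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


class Engine:
    """Hypothetical engine: workers pull requests from a queue and run callbacks."""

    def __init__(self, fetch, concurrency=5):
        self.fetch = fetch              # coroutine function: Request -> response
        self.concurrency = concurrency
        self.seen = set()               # URLs already scheduled (deduplication)

    def _enqueue(self, queue, request):
        if request.url not in self.seen:
            self.seen.add(request.url)
            queue.put_nowait(request)

    async def _worker(self, queue):
        while True:
            request = await queue.get()
            try:
                response = await self.fetch(request)
                result = request.callback(response)
                if inspect.isawaitable(result):
                    result = await result
                # A callback may return follow-up requests (second-level crawl).
                for follow_up in result or []:
                    self._enqueue(queue, follow_up)
            except Exception as exc:
                print(f'{request.url} failed: {exc!r}')
            finally:
                queue.task_done()

    async def crawl(self, start_requests):
        queue = asyncio.Queue()
        for request in start_requests:
            self._enqueue(queue, request)
        workers = [asyncio.create_task(self._worker(queue))
                   for _ in range(self.concurrency)]
        await queue.join()              # returns once every request is processed
        for worker in workers:
            worker.cancel()
        await asyncio.gather(*workers, return_exceptions=True)


# Demonstration with a fake fetcher, so no network access is needed.
pages = {'page-1': ['img-a', 'img-b'], 'page-2': ['img-b', 'img-c']}
visited = []


async def fake_fetch(request):
    await asyncio.sleep(0.01)           # stands in for network latency
    return request.url


async def download(response):
    visited.append(response)


async def parse(response):
    visited.append(response)
    return [Request(url=img, callback=download) for img in pages[response]]


engine = Engine(fetch=fake_fetch, concurrency=2)
asyncio.run(engine.crawl(Request(url=u, callback=parse) for u in pages))
print(sorted(visited))
```

Follow-up requests are put on the queue before the parent request is marked done, so `queue.join()` cannot return while second-level downloads are still pending.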

Settings

import os

DIR_PATH = os.path.abspath(os.path.dirname(__file__))

# Dotted path to the spider class to run
Spider_Name = 'spiders.xiaohua.XiaohuaSpider'

# Global request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}

TO_FILE = 'xiaohua.csv'

# Directory for downloaded images, if images are to be saved
IMAGE_DIR = 'images'

# exist_ok avoids a race between the existence check and the creation
os.makedirs(IMAGE_DIR, exist_ok=True)
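The `Spider_Name` setting is a dotted path, which suggests that the entry point resolves it to a class at run time. manage.py is not shown in the article, so the following is only a sketch of one common way to do that with importlib. The helper name `load_object` is an assumption.

```python
import importlib


def load_object(dotted_path):
    """Import 'package.module.Name' and return the object called Name."""
    module_path, _, attr_name = dotted_path.rpartition('.')
    if not module_path:
        raise ValueError(f'{dotted_path!r} is not a dotted path')
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


# Demonstrated with a standard-library class so the snippet runs anywhere;
# in the project the argument would be settings.Spider_Name.
ordered_dict_cls = load_object('collections.OrderedDict')
print(ordered_dict_cls)
```

Keeping the spider as a string in settings means a different spider can be run by editing one line, with no change to the engine or the entry point.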

Writing the spider

Example: crawling the xiaohua site

# -*- coding: utf-8 -*-

"""
Datetime: 2019/6/11
Author: Zhang Yafei
Description: crawler Spider
"""
import os
import re
from urllib.parse import urljoin

from engine import Request
from settings import TO_FILE
import pandas as pd


class XiaohuaSpider(object):
    """ Custom Spider class """
    # 1. Define the list of start URLs
    start_urls = [f'http://www.xiaohuar.com/list-1-{i}.html' for i in range(4)]

    def filter_downloaded_urls(self):
        """ 2. Add filtering rules (placeholder: nothing is filtered yet) """
        # self.start_urls = self.start_urls
        pass

    def start_request(self):
        """ 3. Put requests on the request queue (a set) so they get sent """
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

    async def parse(self, response):
        """ 4. Receive the response and parse the data """
        html = await response.text(encoding='gbk')
        # Captures (alt text, image src) for each 210px-wide thumbnail
        reg = re.compile(r'<img width="210".*alt="(.*?)".*src="(.*?)" />')
        results = reg.findall(html)
        item_list = []
        request_list = []
        for name, src in results:
            # Relative image paths are resolved against the site root
            img_url = src if src.startswith('http') else urljoin('http://www.xiaohuar.com', src)
            item_list.append({'name': name, 'img_url': img_url})
            request_list.append(Request(url=img_url, callback=self.download_img, meta={'name': name}))
        # 4.1 Store the parsed data
        await self.store_data(data=item_list, url=response.url)
        # 4.2 Return the follow-up requests with their callbacks
        return request_list

    @staticmethod
    async def store_data(data, url):
        """ 5. Store the data """
        df = pd.DataFrame(data=data)
        if os.path.exists(TO_FILE):
            # File already exists: append rows without repeating the header
            df.to_csv(TO_FILE, index=False, mode='a', header=False, encoding='utf_8_sig')
        else:
            df.to_csv(TO_FILE, index=False, encoding='utf_8_sig')
        print(f'{url}\tdata saved')

    @staticmethod
    async def download_img(response):
        """ Second-level download: save the image for one item """
        name = response.request.meta.get('name')
        # 'images' matches IMAGE_DIR in settings.py
        with open(f'images/{name}.jpg', mode='wb') as f:
            f.write(await response.read())
        print(f'{name}\tdownloaded')
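The parsing step can be checked offline. The sketch below applies the same regular expression and `urljoin` logic as `parse` to a hand-written HTML fragment. The fragment, the names in it, and the helper `extract_items` are assumptions for illustration and do not reproduce the real site's markup.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'http://www.xiaohuar.com'
IMG_PATTERN = re.compile(r'<img width="210".*alt="(.*?)".*src="(.*?)" />')

# Hypothetical markup: one thumbnail per line, one relative and one absolute src
sample_html = '''
<div class="img"><a href="/p-1.html"><img width="210" alt="Alice" src="/d/file/alice.jpg" /></a></div>
<div class="img"><a href="/p-2.html"><img width="210" alt="Bob" src="http://img.example.com/bob.jpg" /></a></div>
'''


def extract_items(html):
    """Return one dict per thumbnail, with relative URLs made absolute."""
    items = []
    for name, src in IMG_PATTERN.findall(html):
        img_url = src if src.startswith('http') else urljoin(BASE_URL, src)
        items.append({'name': name, 'img_url': img_url})
    return items


items = extract_items(sample_html)
for item in items:
    print(item)
```

Because `.` does not match a newline by default, the greedy `.*` parts of the pattern stay within a single line. The approach therefore relies on each `<img>` tag sitting on its own line, and an HTML parser would be more robust if the markup changes.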