A First Look at Scrapy's Architecture

Date: 2019-11-07


1. Introduction

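This article gives a brief walkthrough of Scrapy's architecture. Yes, gsExtractor, GooSeeker's open-source general-purpose extractor, is meant to be integrated into the Scrapy architecture, and what we value most is Scrapy's event-driven, extensible design. Besides Scrapy, this round of study also covers ScrapingHub, Import.io and others, with the aim of bringing in advanced ideas and techniques.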

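Please note that this article does not set out to restate the original documentation. Its purpose is to find reference points for the direction of open-source Python crawlers, benchmarked against nine years of experience developing web crawlers, so it contains a good deal of the author's subjective commentary. If you want to read the official Scrapy text, see the Architecture page on the Scrapy website.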

2. Scrapy Architecture Diagram

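Spiders are content extractors written for specific target websites; they are the part of a general-purpose crawler framework that needs the most customization. When you create a crawler project with Scrapy, a Spider skeleton is generated. You only need to fill in code that follows its execution model, and it fits into Scrapy's overall data flow. The goal of the GooSeeker open-source web crawler is to save programmers more than half of their time, and the key is to speed up defining and testing Spiders. For the solution, see "Generate a Web Content Extractor in 1 Minute", which lets the whole Scrapy crawler system be customized quickly.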
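To make the "skeleton" idea concrete, here is a minimal sketch in plain Python. It is not Scrapy's code: `ToyRequest`, `ToyResponse`, `ToySpider` and the example URL are hypothetical names chosen for illustration. In real Scrapy the corresponding skeleton subclasses `scrapy.Spider` and fills in `name`, `start_urls` and `parse`.

```python
from dataclasses import dataclass
from typing import Iterator, Union


@dataclass(frozen=True)
class ToyRequest:
    """Hypothetical stand-in for a crawl request: just a URL."""
    url: str


@dataclass(frozen=True)
class ToyResponse:
    """Hypothetical stand-in for a downloaded page."""
    url: str
    body: str


class ToySpider:
    """Site-specific part of the crawler: the piece a developer fills in."""

    name = "example"
    start_urls = ["http://example.com/page/1"]  # assumed seed URL

    def start_requests(self) -> Iterator[ToyRequest]:
        # Turn each seed URL into a request object for the framework.
        for url in self.start_urls:
            yield ToyRequest(url)

    def parse(self, response: ToyResponse) -> Iterator[Union[dict, ToyRequest]]:
        # Extraction logic goes here; this toy version records the page length.
        yield {"url": response.url, "length": len(response.body)}


spider = ToySpider()
requests = list(spider.start_requests())
items = list(spider.parse(ToyResponse(requests[0].url, "<html>hello</html>")))
print(requests)
print(items)
```

Everything outside `ToySpider` would be supplied by the framework; only the class body is site-specific.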

3. Data Flow in Scrapy

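The data flow in Scrapy is controlled by the execution engine. The quoted passages below are taken from the Scrapy website; I have added commentary based on my own guesses, to point the way for further development of the GooSeeker open-source crawler: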

The Engine gets the first URLs to crawl from the Spider and schedules
them in the Scheduler, as Requests.

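Who prepares the URLs? It appears the Spider does so itself, so we can guess that the Scrapy architecture proper (excluding the Spider) mainly handles event scheduling and is not concerned with storing URLs. This looks similar to the crawler compass in the GooSeeker member center, where a batch of URLs is prepared for a target site and placed in the compass, ready for crawl scheduling. The next goal of this open-source project is therefore to manage URLs in a centralized scheduling store.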

The Engine asks the Scheduler for the next URLs to crawl.

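This step is rather hard to understand on its own; you need to read some other documentation to grasp it fully. Continuing from step 1: after the Engine takes the URLs from the Spider, it wraps each one in a Request and hands it to the event loop, where the Scheduler picks it up for scheduling. For now, think of this as queuing Requests. The Engine then asks the Scheduler for the next page addresses to download.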
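The "queue of Requests" reading can be sketched as below. This is a toy model under that assumption only, not Scrapy's Scheduler, which is more elaborate; the class `ToyScheduler` and the example URLs are hypothetical, and requests are reduced to plain URL strings.

```python
from collections import deque
from typing import Optional


class ToyScheduler:
    """Hypothetical FIFO queue of pending request URLs."""

    def __init__(self) -> None:
        self._queue = deque()

    def enqueue_request(self, url: str) -> None:
        # The engine hands over each new request for later processing.
        self._queue.append(url)

    def next_request(self) -> Optional[str]:
        # The engine asks for the next request; None means nothing is pending.
        return self._queue.popleft() if self._queue else None


scheduler = ToyScheduler()
for seed in ["http://example.com/a", "http://example.com/b"]:
    scheduler.enqueue_request(seed)

order = []
while (url := scheduler.next_request()) is not None:
    order.append(url)
print(order)
```

The two-method interface is the point here: the engine only ever pushes requests in and pulls the next one out, so the queuing policy can change without touching the engine.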

The Scheduler returns the next URLs to crawl to the Engine and the
Engine sends them to the Downloader, passing through the Downloader
Middleware (request direction).

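The Engine requests a task from the Scheduler and hands it to the Downloader. Between the Downloader and the Engine sits the Downloader Middleware, an essential feature for a development framework: developers can add custom extensions here.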

Once the page finishes downloading the Downloader generates a Response
(with that page) and sends it to the Engine, passing through the
Downloader Middleware (response direction).

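Once the download completes, a Response is generated and passed to the Engine through the Downloader Middleware. Note that Response, like Request earlier, is capitalized. Although I have not yet read the other Scrapy documentation, I guess these are event objects internal to the Scrapy framework, which also suggests an asynchronous, event-driven engine. For a high-performance, low-overhead engine, that is a must.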
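The two passes through the Downloader Middleware (request direction, then response direction) can be modelled roughly as follows. This is a simplified sketch, not Scrapy's implementation: `ToyMiddleware`, `fake_download` and `download_through` are hypothetical, requests and responses are plain dicts, and no network access takes place. The sketch assumes the response travels back through the same middlewares in reverse order.

```python
from typing import List


class ToyMiddleware:
    """Hypothetical middleware that tags traffic as it passes through."""

    def __init__(self, label: str) -> None:
        self.label = label

    def process_request(self, request: dict) -> dict:
        request["trace"].append(f"{self.label}:request")
        return request

    def process_response(self, response: dict) -> dict:
        response["trace"].append(f"{self.label}:response")
        return response


def fake_download(request: dict) -> dict:
    # Stand-in for the Downloader: no network access, just builds a response.
    return {"url": request["url"], "body": "<html></html>", "trace": request["trace"]}


def download_through(middlewares: List[ToyMiddleware], request: dict) -> dict:
    # Request direction: engine -> downloader, middlewares in listed order.
    for mw in middlewares:
        request = mw.process_request(request)
    response = fake_download(request)
    # Response direction: downloader -> engine, middlewares in reverse order.
    for mw in reversed(middlewares):
        response = mw.process_response(response)
    return response


chain = [ToyMiddleware("user-agent"), ToyMiddleware("retry")]
response = download_through(chain, {"url": "http://example.com/", "trace": []})
print(response["trace"])
```

The printed trace shows the nesting: the first middleware to see the request is the last to see the response.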

The Engine receives the Response from the Downloader and sends it to
the Spider for processing, passing through the Spider Middleware
(input direction).

Another middleware appears, giving developers plenty of room to work with.

The Spider processes the Response and returns scraped items and new
Requests (to follow) to the Engine.

Each Spider crawls pages one after another: when it finishes one, it constructs another Request event, and the crawl of the next page begins.

The Engine passes scraped items and new Requests returned by a spider
through Spider Middleware (output direction), and then sends processed
items to Item Pipelines and processed Requests to the Scheduler.

The Engine does the event dispatching.
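Putting the steps together, the dispatching role of the Engine can be sketched as a single loop. This is a synchronous toy under stated assumptions, not how Scrapy's asynchronous engine is written: `FAKE_SITE`, `parse` and `run_engine` are hypothetical, the "download" is a dictionary lookup, middlewares are left out, and follow-up requests are marked with a `("request", url)` tuple purely for illustration.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page text, links found on the page).
FAKE_SITE = {
    "http://example.com/1": ("page one", ["http://example.com/2"]),
    "http://example.com/2": ("page two", []),
}


def parse(url, body, links):
    # Spider step: yield one item, then a follow-up request for each link.
    yield {"url": url, "words": len(body.split())}
    for link in links:
        yield ("request", link)


def run_engine(start_urls):
    scheduler = deque(start_urls)   # pending requests
    seen = set(start_urls)          # avoid re-queuing the same URL
    pipeline = []                   # stand-in for the Item Pipelines
    while scheduler:
        url = scheduler.popleft()               # ask the scheduler
        body, links = FAKE_SITE[url]            # "download" the page
        for output in parse(url, body, links):  # hand the response to the spider
            if isinstance(output, dict):
                pipeline.append(output)         # items go to the pipeline
            else:
                _, next_url = output
                if next_url not in seen:
                    seen.add(next_url)
                    scheduler.append(next_url)  # requests go back to the scheduler
    return pipeline


items = run_engine(["http://example.com/1"])
print(items)
```

The loop ends when the scheduler is empty, which in this toy happens once every reachable page has been parsed.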