Introduction to the Scrapy Framework

1. Scrapy architecture diagram

Scrapy Engine: handles the communication, signalling, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.

Scheduler: receives the Requests sent over by the Engine, arranges and enqueues them in a set order, and hands them back when the Engine asks for them.

Downloader: downloads all Requests sent by the Scrapy Engine and returns the Responses it obtains to the Engine, which passes them on to the Spider for processing.

Spider: processes all Responses, parses and extracts data from them to fill the fields the Item needs, and submits any URLs that should be followed back to the Engine so that they re-enter the Scheduler.

Item Pipeline: the place where Items produced by the Spider are post-processed (detailed analysis, filtering, storage, and so on); see the sketch after this list.

Downloader Middlewares: can be thought of as a component for customising and extending the download functionality.

Spider Middlewares: a component for customising and extending the communication between the Engine and the Spider (for example, the Responses going into the Spider and the Requests coming out of it).
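
To make these roles concrete, here is a minimal sketch of an Item and a pipeline, assuming a project named mySpider; the field name title and the cleaning logic are illustrative, not taken from the original text:

# mySpider/items.py
import scrapy

class MyspiderItem(scrapy.Item):
    # every field declared here is a slot the Spider can fill
    title = scrapy.Field()

# mySpider/pipelines.py
class MyspiderPipeline:
    def process_item(self, item, spider):
        # post-process every Item the Spider yields (clean, filter, store)
        item['title'] = item.get('title', '').strip()
        return item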

2. Scrapy execution flow diagram

3. Execution order

(1) The Spider's yield sends a Request to the ENGINE.
(2) The ENGINE passes the Request on to the SCHEDULER without any processing.
(3) The SCHEDULER (the URL scheduler) produces a Request and hands it to the ENGINE.
(4) The ENGINE takes the Request and sends it to the DOWNLOADER, filtering it layer by layer through the MIDDLEWARE.
(5) Once the DOWNLOADER has fetched the response data from the web, it is again filtered layer by layer through the MIDDLEWARE and sent to the ENGINE.
(6) After the ENGINE receives the response data it returns it to the SPIDERS; the Spider's parse() method processes the response and parses out items or requests.
(7) The parsed items or requests are sent to the ENGINE.
(8) The ENGINE receives the items or requests, sends the items to the ITEM PIPELINES and the requests to the SCHEDULER (a minimal spider walking this cycle is sketched below).
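
A minimal spider that goes through this cycle might look like the following sketch; the domain, selectors, and yielded dict are placeholders chosen for illustration:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com/']  # (1) turned into Requests and handed to the ENGINE

    def parse(self, response):
        # (6) the ENGINE returns the downloaded response to this method
        for href in response.css('a::attr(href)').getall():
            # (7)-(8) follow-up Requests go back to the ENGINE and on to the SCHEDULER
            yield response.follow(href, callback=self.parse)
        # parsed items are routed by the ENGINE to the ITEM PIPELINES
        yield {'url': response.url}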

4. Configuration file (mySpider/settings.py)
BOT_NAME = 'mySpider'  # the project name
SPIDER_MODULES = ['mySpider.spiders']  # where the spider modules live
NEWSPIDER_MODULE = 'mySpider.spiders'  # where newly generated spiders are placed

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'mySpider (+http://www.yourdomain.com)'

# Obey robots.txt rules

ROBOTSTXT_OBEY = False  # whether to obey robots.txt; we do not, so set it to False or simply comment the line out

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32  # the spider's concurrency; the default is 16

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3  # download delay; the default is 0, values such as 2 or 1.5 are also fine
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16  # concurrent requests per domain
#CONCURRENT_REQUESTS_PER_IP = 16  # concurrent requests per IP

# Disable cookies (enabled by default)

COOKIES_ENABLED = False  # whether to enable cookies; enabled by default, turn it off so the target site cannot identify us as easily

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False  # disable the Telnet console (enabled by default)

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html

# Spider middleware; usually not needed
#SPIDER_MIDDLEWARES = {
#    'mySpider.middlewares.MyspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

# Downloader middleware; useful for downloads later on. The number is the priority: the smaller it is, the higher the priority
#DOWNLOADER_MIDDLEWARES = {
#    'mySpider.middlewares.MyspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# Item pipelines; used frequently later on to process the downloaded data
#ITEM_PIPELINES = {
#    'mySpider.pipelines.MyspiderPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5  # the initial download delay
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60  # the maximum download delay to be set in case of high latencies
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
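
All of the options above live in settings.py, but they can also be overridden for a single spider through the custom_settings class attribute; a small sketch with illustrative values:

import scrapy

class PoliteSpider(scrapy.Spider):
    name = 'polite'
    # per-spider overrides take precedence over the values in settings.py
    custom_settings = {
        'DOWNLOAD_DELAY': 1.5,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
        'COOKIES_ENABLED': False,
    }

    def parse(self, response):
        pass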

5. Commonly used Scrapy commands:
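
The ones used most often are summarised below; scrapy -h prints the full list:

scrapy startproject <name>          # create a new project
scrapy genspider <name> <domain>    # generate a spider skeleton inside a project
scrapy crawl <spider>               # run a spider by its name attribute
scrapy list                         # list the spiders in the current project
scrapy shell <url>                  # interactive shell for trying out selectors
scrapy version                      # print the Scrapy version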

6. Creating a Scrapy project: scrapy startproject mySpider

A brief introduction to the role of each of the main files (the full layout is sketched after this list):

scrapy.cfg: the project's configuration file; do not delete it

mySpider/: the project's Python module; code is imported from here

mySpider/items.py: the project's item definition file (the target data)

mySpider/pipelines.py: the project's pipeline file

mySpider/settings.py: the project's settings file

mySpider/spiders/: the directory that stores the spider code
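
The generated layout is roughly the following (file names as created by recent Scrapy versions; middlewares.py may be absent in very old releases):

mySpider/
    scrapy.cfg
    mySpider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py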

7. Generating spider boilerplate with a command

scrapy genspider Baidu "baidu.com"
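
This creates mySpider/spiders/Baidu.py with a skeleton roughly like the one below; the exact content depends on the Scrapy version:

import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'Baidu'                      # the name used by "scrapy crawl Baidu"
    allowed_domains = ['baidu.com']     # requests outside this domain are filtered out
    start_urls = ['http://baidu.com/']  # the first URLs handed to the Scheduler

    def parse(self, response):
        # parse the response here and yield items or follow-up Requests
        pass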

8. Running the spider: scrapy crawl Baidu
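
To write the scraped items straight to a file, crawl accepts an output flag, for example:

scrapy crawl Baidu -o baidu.json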
