Introduction to the Scrapy Framework

1. Scrapy architecture diagram

Scrapy Engine: handles the communication, signalling, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.

Scheduler: receives the Requests sent over by the Engine, arranges and enqueues them in a set order, and hands them back when the Engine asks for them.

Downloader: downloads all Requests sent by the Scrapy Engine and returns the Responses it obtains to the Engine, which passes them on to the Spider for processing.

Spider: processes all Responses, parses and extracts data from them to fill the fields the Item needs, and submits any URLs that should be followed back to the Engine so that they re-enter the Scheduler.

Item Pipeline: the place where Items produced by the Spider are post-processed (detailed analysis, filtering, storage, and so on); see the sketch after this list.

Downloader Middlewares: can be thought of as a component for customising and extending the download functionality.

Spider Middlewares: a component for customising and extending the communication between the Engine and the Spider (for example, the Responses going into the Spider and the Requests coming out of it).
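
To make these roles concrete, here is a minimal sketch of an Item and a pipeline, assuming a project named mySpider; the field name title and the cleaning logic are illustrative, not taken from the original text:

# mySpider/items.py
import scrapy

class MyspiderItem(scrapy.Item):
    # every field declared here is a slot the Spider can fill
    title = scrapy.Field()

# mySpider/pipelines.py
class MyspiderPipeline:
    def process_item(self, item, spider):
        # post-process every Item the Spider yields (clean, filter, store)
        item['title'] = item.get('title', '').strip()
        return item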

2. Scrapy execution flow diagram

3. Execution order

(1) The Spider's yield sends a Request to the ENGINE.
(2) The ENGINE passes the Request on to the SCHEDULER without any processing.
(3) The SCHEDULER (the URL scheduler) produces a Request and hands it to the ENGINE.
(4) The ENGINE takes the Request and sends it to the DOWNLOADER, filtering it layer by layer through the MIDDLEWARE.
(5) Once the DOWNLOADER has fetched the response data from the web, it is again filtered layer by layer through the MIDDLEWARE and sent to the ENGINE.
(6) After the ENGINE receives the response data it returns it to the SPIDERS; the Spider's parse() method processes the response and parses out items or requests.
(7) The parsed items or requests are sent to the ENGINE.
(8) The ENGINE receives the items or requests, sends the items to the ITEM PIPELINES and the requests to the SCHEDULER (a minimal spider walking this cycle is sketched below).
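
A minimal spider that goes through this cycle might look like the following sketch; the domain, selectors, and yielded dict are placeholders chosen for illustration:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com/']  # (1) turned into Requests and handed to the ENGINE

    def parse(self, response):
        # (6) the ENGINE returns the downloaded response to this method
        for href in response.css('a::attr(href)').getall():
            # (7)-(8) follow-up Requests go back to the ENGINE and on to the SCHEDULER
            yield response.follow(href, callback=self.parse)
        # parsed items are routed by the ENGINE to the ITEM PIPELINES
        yield {'url': response.url}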

4. Configuration file (mySpider/settings.py)
BOT_NAME = 'mySpider'  # the project name
SPIDER_MODULES = ['mySpider.spiders']  # where the spider modules live
NEWSPIDER_MODULE = 'mySpider.spiders'  # where newly generated spiders are placed

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'mySpider (+http://www.yourdomain.com)'

# Obey robots.txt rules

ROBOTSTXT_OBEY = False  # whether to obey robots.txt; we do not, so set it to False or simply comment the line out

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32  # the spider's concurrency; the default is 16

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3  # download delay; the default is 0, values such as 2 or 1.5 are also fine
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16  # concurrent requests per domain
#CONCURRENT_REQUESTS_PER_IP = 16  # concurrent requests per IP

# Disable cookies (enabled by default)

COOKIES_ENABLED = False  # whether to enable cookies; enabled by default, turn it off so the target site cannot identify us as easily

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False  # disable the Telnet console (enabled by default)

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html

# Spider middleware; usually not needed
#SPIDER_MIDDLEWARES = {
#    'mySpider.middlewares.MyspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

# Downloader middleware; useful for downloads later on. The number is the priority: the smaller it is, the higher the priority
#DOWNLOADER_MIDDLEWARES = {
#    'mySpider.middlewares.MyspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# Item pipelines; used frequently later on to process the downloaded data
#ITEM_PIPELINES = {
#    'mySpider.pipelines.MyspiderPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5  # the initial download delay
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60  # the maximum download delay to be set in case of high latencies
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
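
All of the options above live in settings.py, but they can also be overridden for a single spider through the custom_settings class attribute; a small sketch with illustrative values:

import scrapy

class PoliteSpider(scrapy.Spider):
    name = 'polite'
    # per-spider overrides take precedence over the values in settings.py
    custom_settings = {
        'DOWNLOAD_DELAY': 1.5,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
        'COOKIES_ENABLED': False,
    }

    def parse(self, response):
        pass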

5. Commonly used Scrapy commands:
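
The ones used most often are summarised below; scrapy -h prints the full list:

scrapy startproject <name>          # create a new project
scrapy genspider <name> <domain>    # generate a spider skeleton inside a project
scrapy crawl <spider>               # run a spider by its name attribute
scrapy list                         # list the spiders in the current project
scrapy shell <url>                  # interactive shell for trying out selectors
scrapy version                      # print the Scrapy version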

6. Creating a Scrapy project: scrapy startproject mySpider

A brief introduction to the role of each of the main files (the full layout is sketched after this list):

scrapy.cfg: the project's configuration file; do not delete it

mySpider/: the project's Python module; code is imported from here

mySpider/items.py: the project's item definition file (the target data)

mySpider/pipelines.py: the project's pipeline file

mySpider/settings.py: the project's settings file

mySpider/spiders/: the directory that stores the spider code
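
The generated layout is roughly the following (file names as created by recent Scrapy versions; middlewares.py may be absent in very old releases):

mySpider/
    scrapy.cfg
    mySpider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py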

7. Generating spider boilerplate with a command

scrapy genspider Baidu "baidu.com"
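
This creates mySpider/spiders/Baidu.py with a skeleton roughly like the one below; the exact content depends on the Scrapy version:

import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'Baidu'                      # the name used by "scrapy crawl Baidu"
    allowed_domains = ['baidu.com']     # requests outside this domain are filtered out
    start_urls = ['http://baidu.com/']  # the first URLs handed to the Scheduler

    def parse(self, response):
        # parse the response here and yield items or follow-up Requests
        pass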

8. Running the spider: scrapy crawl Baidu
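
To write the scraped items straight to a file, crawl accepts an output flag, for example:

scrapy crawl Baidu -o baidu.json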
