The crawler itself is written; deploying and running it still involves a number of engineering issues.
CONCURRENT_ITEMS # maximum number of items processed in parallel (per response)
Default: 100
Maximum number of concurrent items (per response) to process in parallel in the Item Processor (also known as the Item Pipeline).
CONCURRENT_REQUESTS
Default: 16
The maximum number of concurrent (ie. simultaneous) requests that will be performed by the Scrapy downloader.
How this is implemented:
next_request contains a while loop that calls needs_backout; if len(self.active) exceeds the limit it just keeps looping idly, otherwise it proceeds to the next step.
The core is still engine.py, which asks the downloader, the slot, and so on whether they need to wait.
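Roughly, the back-off check amounts to something like the following (a hand-written sketch of the idea only, not Scrapy's actual source; the names just follow the description above):

class DownloaderSketch:
    def __init__(self, total_concurrency):
        self.active = set()                         # requests currently in flight
        self.total_concurrency = total_concurrency  # e.g. CONCURRENT_REQUESTS

    def needs_backout(self):
        # tell the engine to stop feeding new requests while the limit is reached
        return len(self.active) >= self.total_concurrency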
CONCURRENT_REQUESTS_PER_DOMAIN
Default: 8
The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single domain.
See also: AutoThrottle extension and its AUTOTHROTTLE_TARGET_CONCURRENCY option.
CONCURRENT_REQUESTS_PER_IP
Default: 0
The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single IP. If non-zero, the CONCURRENT_REQUESTS_PER_DOMAIN setting is ignored, and this one is used instead. In other words, concurrency limits will be applied per IP, not per domain.
This setting also affects DOWNLOAD_DELAY and AutoThrottle extension: if CONCURRENT_REQUESTS_PER_IP is non-zero, download delay is enforced per IP, not per domain.
DOWNLOAD_DELAY: delay between consecutive downloads
DEPTH_LIMIT
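Put together, a project's settings.py might tune these knobs like this (the values are only illustrative, not recommendations):

# settings.py -- concurrency-related settings, illustrative values
CONCURRENT_ITEMS = 100              # items processed in parallel per response
CONCURRENT_REQUESTS = 32            # global limit in the downloader
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain limit
CONCURRENT_REQUESTS_PER_IP = 0      # non-zero switches the limit (and DOWNLOAD_DELAY) to per-IP
DOWNLOAD_DELAY = 0.5                # seconds to wait between requests to the same site
DEPTH_LIMIT = 3                     # maximum crawl depth, 0 means unlimited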
First, Scrapy ships with a close-spider extension, so automatic shutdown conditions can be configured in the settings.
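For example, the built-in CloseSpider extension reads conditions like these from settings.py (the values are illustrative):

# settings.py -- automatic shutdown conditions for the CloseSpider extension
CLOSESPIDER_TIMEOUT = 3600      # close the spider after this many seconds
CLOSESPIDER_ITEMCOUNT = 10000   # ... after this many items have been scraped
CLOSESPIDER_PAGECOUNT = 50000   # ... after this many responses have been received
CLOSESPIDER_ERRORCOUNT = 10     # ... after this many errors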
The implementation lives under python3.6.4\Lib\site-packages\scrapy\extensions; the core is just these four lines:
crawler.signals.connect(self.error_count, signal=signals.spider_error)
crawler.signals.connect(self.page_count, signal=signals.response_received)
crawler.signals.connect(self.spider_opened, signal=signals.spider_opened)
crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)
Each signal is connected to a handler method; whenever the corresponding event occurs the signal fires, the handler is invoked, and the shutdown check runs.
Example handler:
def error_count(self, failure, response, spider):
    self.counter['errorcount'] += 1
    if self.counter['errorcount'] == self.close_on['errorcount']:
        self.crawler.engine.close_spider(spider, 'closespider_errorcount')
Of course, you can also do the check yourself and then either raise an exception or call the existing close method:
In a spider: raise CloseSpider('bandwidth_exceeded')
In other components (middlewares, pipelines, etc.):
crawler.engine.close_spider(spider, 'log message')
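A minimal sketch of both paths (the spider, the pipeline, and the numeric limits are made up for illustration):

from scrapy import Spider
from scrapy.exceptions import CloseSpider

class BandwidthSpider(Spider):
    # hypothetical spider: stops itself once it has downloaded too much
    name = 'bandwidth_spider'
    start_urls = ['http://example.com']   # placeholder
    bytes_seen = 0

    def parse(self, response):
        self.bytes_seen += len(response.body)
        if self.bytes_seen > 1_000_000_000:           # illustrative 1 GB budget
            raise CloseSpider('bandwidth_exceeded')

class QuotaPipeline:
    # hypothetical pipeline: closes the spider through the engine after an item quota
    def __init__(self, crawler):
        self.crawler = crawler
        self.seen = 0

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        self.seen += 1
        if self.seen > 100000:
            self.crawler.engine.close_spider(spider, 'quota_exceeded')
        return item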
Here the main resource to watch is memory.
MEMDEBUG_ENABLED
Default: False
Whether to enable memory debugging.
MEMDEBUG_NOTIFY
Default: []
When memory debugging is enabled a memory report will be sent to the specified addresses if this setting is not empty, otherwise the report will be written to the log.
Example: MEMDEBUG_NOTIFY = ['user@example.com']
MEMUSAGE_LIMIT_MB
Default: 0
Scope: scrapy.extensions.memusage
The maximum amount of memory to allow (in megabytes) before shutting down Scrapy (if MEMUSAGE_ENABLED is True). If zero, no check will be performed.
MEMUSAGE_CHECK_INTERVAL_SECONDS
New in version 1.1.
Default: 60.0
Scope: scrapy.extensions.memusage
MEMUSAGE_NOTIFY_MAIL
Default: False
Scope: scrapy.extensions.memusage
A list of emails to notify if the memory limit has been reached.
Example:
MEMUSAGE_NOTIFY_MAIL = ['user@example.com']
MEMUSAGE_WARNING_MB
Default: 0
Scope: scrapy.extensions.memusage
The maximum amount of memory to allow (in megabytes) before sending a warning email notifying about it. If zero, no warning will be produced.
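Putting the memory guards together, settings.py might look like this (the limits are illustrative):

# settings.py -- memory usage guard (scrapy.extensions.memusage)
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 2048                    # hard limit: shut down above 2 GB
MEMUSAGE_WARNING_MB = 1536                  # send a warning mail above 1.5 GB
MEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0      # how often to sample memory usage
MEMUSAGE_NOTIFY_MAIL = ['ops@example.com']  # recipients of warning/limit mails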
This needs to be discussed together with how a typical distributed crawler divides its crawl tasks.
Most crawler frameworks of any real scale use one master crawler plus several child crawlers: the master crawls the list pages, extracts the sub-page URLs from them, and produces a second-level request queue; the children read requests from that queue and do the actual scraping.
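As a sketch of that division of labour (assuming Redis as the shared second-level request queue; the queue key, spider names, and selectors are made up):

import json
import redis
import scrapy

QUEUE_KEY = 'detail_requests'   # hypothetical Redis list acting as the second-level queue
r = redis.Redis()

class ListSpider(scrapy.Spider):
    # master crawler: crawls list pages and pushes detail-page URLs onto the queue
    name = 'list_spider'
    start_urls = ['http://example.com/list?page=1']   # placeholder

    def parse(self, response):
        for href in response.css('a.item::attr(href)').extract():
            r.lpush(QUEUE_KEY, json.dumps({'url': response.urljoin(href)}))

class DetailSpider(scrapy.Spider):
    # child crawler: reads requests from the queue and scrapes the detail pages
    name = 'detail_spider'

    def start_requests(self):
        while True:
            popped = r.rpop(QUEUE_KEY)
            if popped is None:
                break   # queue drained; a real worker would block or poll instead
            yield scrapy.Request(json.loads(popped)['url'], callback=self.parse_detail)

    def parse_detail(self, response):
        yield {'url': response.url, 'title': response.css('title::text').extract_first()}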
Crawler monitoring information serves three main purposes:
Progress monitoring: used to judge how far the crawl has got;
Process monitoring: if the crawler process dies, it needs to be restarted;
Resource monitoring: if the crawler exceeds its resource budget, it needs a restart to free resources;
The cases that matter most are abnormal stops and getting banned, which require manual intervention.
Overall, log-based monitoring is the most convenient approach: write the logs to a database or use a SocketHandler,
then build a separate module to handle monitoring, display, and response. This keeps the crawler itself simple, and it also sidesteps the hassle of coordinating log writes from multiple processes.
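A minimal sketch of the SocketHandler route (the collector host and port are made up; a separate process would listen there, unpickle the records, and take care of storage and display):

import logging
import logging.handlers

# Attach a SocketHandler to the root logger so every crawler process streams its
# log records to one central collector instead of writing local files.
socket_handler = logging.handlers.SocketHandler('monitor.example.com', 9020)   # hypothetical collector
socket_handler.setLevel(logging.INFO)
logging.getLogger().addHandler(socket_handler)

Run this once at startup, e.g. from the project's __init__.py or a small extension, so Scrapy's own log records flow through it as well.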
from scrapy.mail import MailSender

mailer = MailSender(
    smtphost = "smtp.163.com",          # SMTP server that sends the mail
    mailfrom = "***********@163.com",   # sender address
    smtpuser = "***********@163.com",   # user name
    smtppass = "***********",           # the SMTP authorization code, not the login password!
    smtpport = 25                       # port
)
body = u"""body of the mail"""
subject = u'subject of the mail'
# If the content is too bare, some providers' anti-spam filters may well treat it as junk and refuse to deliver it.
mailer.send(to=["****@qq.com", "****@qq.com"], subject = subject.encode("utf-8"), body = body.encode("utf-8"))
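One natural way to wire this in is a small custom extension that mails the crawl stats when the spider closes (a sketch; this extension is not part of Scrapy, would need to be enabled via the EXTENSIONS setting, and MailSender.from_settings reads the MAIL_* settings rather than the constructor arguments shown above):

from scrapy import signals
from scrapy.mail import MailSender

class StatsMailExtension:
    # hypothetical extension: mail the final stats when the spider closes
    def __init__(self, crawler):
        self.crawler = crawler
        self.mailer = MailSender.from_settings(crawler.settings)

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        stats = self.crawler.stats.get_stats()
        body = 'spider %s closed (%s)\n%s' % (spider.name, reason, stats)
        self.mailer.send(to=['ops@example.com'], subject='crawl report', body=body)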
Performance problems can have several causes:
In a crawler, the place where the CPU most easily becomes the bottleneck is parsing.
Scaling up means making parsing more efficient: use a faster library or, where feasible, go straight to regular expressions; also, Scrapy parses in a single thread, so gevent is worth considering.
Scaling out means multiple processes: run several crawlers on the same server at once.
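A bare-bones way to scale out on a single machine is simply to launch several crawl processes side by side (assuming scrapy is on PATH and this is run from the project directory; the spider names are placeholders):

import subprocess

# one OS process per spider run; each gets its own reactor and its own concurrency budget
spiders = ['list_spider', 'detail_spider', 'detail_spider']
procs = [subprocess.Popen(['scrapy', 'crawl', name]) for name in spiders]
for p in procs:
    p.wait()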