The crawler itself is written; deploying and running it still involves a number of engineering issues.
CONCURRENT_ITEMS # maximum number of items processed in parallel (per response)
Default: 100
Maximum number of concurrent items (per response) to process in parallel in the Item Processor (also known as the Item Pipeline).
CONCURRENT_REQUESTS
Default: 16
The maximum number of concurrent (ie. simultaneous) requests that will be performed by the Scrapy downloader.
How this is implemented:
next_request contains a while loop that calls needs_backout; if len(self.active) exceeds the limit it just keeps looping idly, otherwise it proceeds to the next step.
The core is still engine.py, which asks the downloader, the slot, and so on whether they need to wait.
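Roughly, the back-off check amounts to something like the following (a hand-written sketch of the idea only, not Scrapy's actual source; the names just follow the description above):

class DownloaderSketch:
    def __init__(self, total_concurrency):
        self.active = set()                         # requests currently in flight
        self.total_concurrency = total_concurrency  # e.g. CONCURRENT_REQUESTS

    def needs_backout(self):
        # tell the engine to stop feeding new requests while the limit is reached
        return len(self.active) >= self.total_concurrency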
CONCURRENT_REQUESTS_PER_DOMAIN
Default: 8
The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single domain.
See also: AutoThrottle extension and its AUTOTHROTTLE_TARGET_CONCURRENCY option.
CONCURRENT_REQUESTS_PER_IP
Default: 0
The maximum number of concurrent (ie. simultaneous) requests that will be performed to any single IP. If non-zero, the CONCURRENT_REQUESTS_PER_DOMAIN setting is ignored, and this one is used instead. In other words, concurrency limits will be applied per IP, not per domain.
This setting also affects DOWNLOAD_DELAY and AutoThrottle extension: if CONCURRENT_REQUESTS_PER_IP is non-zero, download delay is enforced per IP, not per domain.
DOWNLOAD_DELAY: delay between consecutive downloads
DEPTH_LIMIT
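Put together, a project's settings.py might tune these knobs like this (the values are only illustrative, not recommendations):

# settings.py -- concurrency-related settings, illustrative values
CONCURRENT_ITEMS = 100              # items processed in parallel per response
CONCURRENT_REQUESTS = 32            # global limit in the downloader
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain limit
CONCURRENT_REQUESTS_PER_IP = 0      # non-zero switches the limit (and DOWNLOAD_DELAY) to per-IP
DOWNLOAD_DELAY = 0.5                # seconds to wait between requests to the same site
DEPTH_LIMIT = 3                     # maximum crawl depth, 0 means unlimited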
First, Scrapy ships with a close-spider extension, so automatic shutdown conditions can be configured in the settings.
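For example, the built-in CloseSpider extension reads conditions like these from settings.py (the values are illustrative):

# settings.py -- automatic shutdown conditions for the CloseSpider extension
CLOSESPIDER_TIMEOUT = 3600      # close the spider after this many seconds
CLOSESPIDER_ITEMCOUNT = 10000   # ... after this many items have been scraped
CLOSESPIDER_PAGECOUNT = 50000   # ... after this many responses have been received
CLOSESPIDER_ERRORCOUNT = 10     # ... after this many errors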
The implementation lives under python3.6.4\Lib\site-packages\scrapy\extensions; the core is just these four lines:
crawler.signals.connect(self.error_count, signal=signals.spider_error)
crawler.signals.connect(self.page_count, signal=signals.response_received)
crawler.signals.connect(self.spider_opened, signal=signals.spider_opened)
crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)
Each signal is connected to a handler method; whenever the corresponding event occurs the signal fires, the handler is invoked, and the shutdown check runs.
Example handler:
def error_count(self, failure, response, spider):
    self.counter['errorcount'] += 1
    if self.counter['errorcount'] == self.close_on['errorcount']:
        self.crawler.engine.close_spider(spider, 'closespider_errorcount')
Of course, you can also do the check yourself and then either raise an exception or call the existing close method:
In a spider: raise CloseSpider('bandwidth_exceeded')
In other components (middlewares, pipelines, etc.):
crawler.engine.close_spider(spider, 'log message')
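A minimal sketch of both paths (the spider, the pipeline, and the numeric limits are made up for illustration):

from scrapy import Spider
from scrapy.exceptions import CloseSpider

class BandwidthSpider(Spider):
    # hypothetical spider: stops itself once it has downloaded too much
    name = 'bandwidth_spider'
    start_urls = ['http://example.com']   # placeholder
    bytes_seen = 0

    def parse(self, response):
        self.bytes_seen += len(response.body)
        if self.bytes_seen > 1_000_000_000:           # illustrative 1 GB budget
            raise CloseSpider('bandwidth_exceeded')

class QuotaPipeline:
    # hypothetical pipeline: closes the spider through the engine after an item quota
    def __init__(self, crawler):
        self.crawler = crawler
        self.seen = 0

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        self.seen += 1
        if self.seen > 100000:
            self.crawler.engine.close_spider(spider, 'quota_exceeded')
        return item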
Here the main resource to watch is memory.
MEMDEBUG_ENABLED
Default: False
Whether to enable memory debugging.
MEMDEBUG_NOTIFY
Default: []
When memory debugging is enabled a memory report will be sent to the specified addresses if this setting is not empty, otherwise the report will be written to the log.
Example: MEMDEBUG_NOTIFY = ['user@example.com']
MEMUSAGE_LIMIT_MB
Default: 0
Scope: scrapy.extensions.memusage
The maximum amount of memory to allow (in megabytes) before shutting down Scrapy (if MEMUSAGE_ENABLED is True). If zero, no check will be performed.
MEMUSAGE_CHECK_INTERVAL_SECONDS
New in version 1.1.
Default: 60.0
Scope: scrapy.extensions.memusage
MEMUSAGE_NOTIFY_MAIL
Default: False
Scope: scrapy.extensions.memusage
A list of emails to notify if the memory limit has been reached.
Example:
MEMUSAGE_NOTIFY_MAIL = ['user@example.com']
MEMUSAGE_WARNING_MB
Default: 0
Scope: scrapy.extensions.memusage
The maximum amount of memory to allow (in megabytes) before sending a warning email notifying about it. If zero, no warning will be produced.
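Putting the memory guards together, settings.py might look like this (the limits are illustrative):

# settings.py -- memory usage guard (scrapy.extensions.memusage)
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 2048                    # hard limit: shut down above 2 GB
MEMUSAGE_WARNING_MB = 1536                  # send a warning mail above 1.5 GB
MEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0      # how often to sample memory usage
MEMUSAGE_NOTIFY_MAIL = ['ops@example.com']  # recipients of warning/limit mails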
This needs to be discussed together with how a typical distributed crawler divides its crawl tasks.
Most crawler frameworks of any real scale use one master crawler plus several child crawlers: the master crawls the list pages, extracts the sub-page URLs from them, and produces a second-level request queue; the children read requests from that queue and do the actual scraping.
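As a sketch of that division of labour (assuming Redis as the shared second-level request queue; the queue key, spider names, and selectors are made up):

import json
import redis
import scrapy

QUEUE_KEY = 'detail_requests'   # hypothetical Redis list acting as the second-level queue
r = redis.Redis()

class ListSpider(scrapy.Spider):
    # master crawler: crawls list pages and pushes detail-page URLs onto the queue
    name = 'list_spider'
    start_urls = ['http://example.com/list?page=1']   # placeholder

    def parse(self, response):
        for href in response.css('a.item::attr(href)').extract():
            r.lpush(QUEUE_KEY, json.dumps({'url': response.urljoin(href)}))

class DetailSpider(scrapy.Spider):
    # child crawler: reads requests from the queue and scrapes the detail pages
    name = 'detail_spider'

    def start_requests(self):
        while True:
            popped = r.rpop(QUEUE_KEY)
            if popped is None:
                break   # queue drained; a real worker would block or poll instead
            yield scrapy.Request(json.loads(popped)['url'], callback=self.parse_detail)

    def parse_detail(self, response):
        yield {'url': response.url, 'title': response.css('title::text').extract_first()}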
Crawler monitoring information serves three main purposes:
Progress monitoring: used to judge how far the crawl has got;
Process monitoring: if the crawler process dies, it needs to be restarted;
Resource monitoring: if the crawler exceeds its resource budget, it needs a restart to free resources;
The cases that matter most are abnormal stops and getting banned, which require manual intervention.
Overall, log-based monitoring is the most convenient approach: write the logs to a database or use a SocketHandler,
then build a separate module to handle monitoring, display, and response. This keeps the crawler itself simple, and it also sidesteps the hassle of coordinating log writes from multiple processes.
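A minimal sketch of the SocketHandler route (the collector host and port are made up; a separate process would listen there, unpickle the records, and take care of storage and display):

import logging
import logging.handlers

# Attach a SocketHandler to the root logger so every crawler process streams its
# log records to one central collector instead of writing local files.
socket_handler = logging.handlers.SocketHandler('monitor.example.com', 9020)   # hypothetical collector
socket_handler.setLevel(logging.INFO)
logging.getLogger().addHandler(socket_handler)

Run this once at startup, e.g. from the project's __init__.py or a small extension, so Scrapy's own log records flow through it as well.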
from scrapy.mail import MailSender

mailer = MailSender(
    smtphost = "smtp.163.com",          # SMTP server that sends the mail
    mailfrom = "***********@163.com",   # sender address
    smtpuser = "***********@163.com",   # user name
    smtppass = "***********",           # the SMTP authorization code, not the login password!
    smtpport = 25                       # port
)
body = u"""body of the mail"""
subject = u'subject of the mail'
# If the content is too bare, some providers' anti-spam filters may well treat it as junk and refuse to deliver it.
mailer.send(to=["****@qq.com", "****@qq.com"], subject = subject.encode("utf-8"), body = body.encode("utf-8"))
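One natural way to wire this in is a small custom extension that mails the crawl stats when the spider closes (a sketch; this extension is not part of Scrapy, would need to be enabled via the EXTENSIONS setting, and MailSender.from_settings reads the MAIL_* settings rather than the constructor arguments shown above):

from scrapy import signals
from scrapy.mail import MailSender

class StatsMailExtension:
    # hypothetical extension: mail the final stats when the spider closes
    def __init__(self, crawler):
        self.crawler = crawler
        self.mailer = MailSender.from_settings(crawler.settings)

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        stats = self.crawler.stats.get_stats()
        body = 'spider %s closed (%s)\n%s' % (spider.name, reason, stats)
        self.mailer.send(to=['ops@example.com'], subject='crawl report', body=body)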
Performance problems can have several causes:
In a crawler, the place where the CPU most easily becomes the bottleneck is parsing.
Scaling up means making parsing more efficient: use a faster library or, where feasible, go straight to regular expressions; also, Scrapy parses in a single thread, so gevent is worth considering.
Scaling out means multiple processes: run several crawlers on the same server at once.
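A bare-bones way to scale out on a single machine is simply to launch several crawl processes side by side (assuming scrapy is on PATH and this is run from the project directory; the spider names are placeholders):

import subprocess

# one OS process per spider run; each gets its own reactor and its own concurrency budget
spiders = ['list_spider', 'detail_spider', 'detail_spider']
procs = [subprocess.Popen(['scrapy', 'crawl', name]) for name in spiders]
for p in procs:
    p.wait()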