Jobs: pausing and resuming crawls [1]
Sometimes, for big sites, it’s desirable to pause crawls and be able to resume them later.
Scrapy supports this functionality out of the box by providing the following facilities:
- a scheduler that persists scheduled requests on disk
- a duplicates filter that persists visited requests on disk
- an extension that keeps some spider state (key/value pairs) persistent between batches
Job directory
To enable persistence support you just need to define a job directory through the JOBDIR setting. This directory will be used for storing all required data to keep the state of a single job (i.e. a spider run). It’s important to note that this directory must not be shared by different spiders, or even by different jobs/runs of the same spider, as it’s meant to be used for storing the state of a single job.
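For instance, a minimal sketch of enabling this project-wide in settings.py (the path below is just an illustrative placeholder; the examples further down set it per run on the command line instead):

# settings.py -- persist the scheduler queue, duplicates filter and spider state on disk
# 'crawls/somespider-1' is an example path; use a separate directory for every job/run
JOBDIR = 'crawls/somespider-1'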
How to use it
To start a spider with persistence support enabled, run it like this:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Then, you can stop the spider safely at any time (by pressing Ctrl-C or sending a signal), and resume it later by issuing the same command:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Keeping persistent state between batches
Sometimes you’ll want to keep some persistent spider state between pause/resume batches. You can use the spider.state attribute for that, which should be a dict. There’s a built-in extension that takes care of serializing, storing and loading that attribute from the job directory when the spider starts and stops.
Here’s an example of a callback that uses the spider state (other spider code is omitted for brevity):
def parse_item(self, response):
    # parse item here
    self.state['items_count'] = self.state.get('items_count', 0) + 1
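For context, here is a minimal, hypothetical sketch of the spider around such a callback; the spider name, URL and link selector are placeholders, not part of the original example:

import scrapy

class SomeSpider(scrapy.Spider):
    name = 'somespider'
    start_urls = ['http://www.example.com']  # placeholder URL

    def parse(self, response):
        # follow links found on the page (placeholder selector); the scheduled
        # requests themselves are persisted on disk when JOBDIR is set
        for href in response.css('a::attr(href)').extract():
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        # parse item here; self.state is saved/restored by the built-in
        # spider-state extension between pause/resume batches
        self.state['items_count'] = self.state.get('items_count', 0) + 1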
Persistence gotchas
There are a few things to keep in mind if you want to be able to use the Scrapy persistence support:
Cookies expiration
Cookies may expire. So, if you don’t resume your spider quickly, the scheduled requests may no longer work. This won’t be an issue if your spider doesn’t rely on cookies.
Request serialization
Requests must be serializable by the pickle module in order for persistence to work, so you should make sure that your requests are serializable.
The most common issue here is using lambda functions as request callbacks, since lambdas can’t be persisted.
So, for example, this won’t work:
def some_callback(self, response):
    somearg = 'test'
    return scrapy.Request('http://www.example.com',
                          callback=lambda r: self.other_callback(r, somearg))

def other_callback(self, response, somearg):
    print("the argument passed is: %s" % somearg)
But this will:
def some_callback(self, response):
    somearg = 'test'
    return scrapy.Request('http://www.example.com',
                          callback=self.other_callback,
                          meta={'somearg': somearg})

def other_callback(self, response):
    somearg = response.meta['somearg']
    print("the argument passed is: %s" % somearg)
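As an aside (not in the original text): if you are on Scrapy 1.7 or later, callback arguments can also be passed through cb_kwargs, which is serialized with the rest of the request; a minimal sketch, assuming such a version:

# inside a scrapy.Spider subclass, Scrapy 1.7+ only
def some_callback(self, response):
    somearg = 'test'
    # cb_kwargs entries are passed to the callback as keyword arguments
    return scrapy.Request('http://www.example.com',
                          callback=self.other_callback,
                          cb_kwargs={'somearg': somearg})

def other_callback(self, response, somearg):
    print("the argument passed is: %s" % somearg)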
If you wish to log the requests that couldn’t be serialized, you can set the SCHEDULER_DEBUG setting to True in the project’s settings file. It is False by default.
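A minimal sketch of the corresponding entry in settings.py:

# settings.py -- log requests that could not be serialized to the disk queues
SCHEDULER_DEBUG = True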
Notes:
Saving intermediate state while running a spider:
- Option 1: set JOBDIR = 'path' in the settings.py file.
- Option 2: specify it in the spider file itself via custom_settings (a fuller sketch follows the snippet below):
custom_settings = { "JOBDIR": "path" }
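For example, a minimal sketch of a spider carrying its own JOBDIR (the class name, spider name and path are illustrative placeholders):

import scrapy

class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    # per-spider override of the project settings; use a distinct
    # directory for each spider and for each separately started run
    custom_settings = {
        "JOBDIR": "job_info/001",
    }

    def parse(self, response):
        pass  # crawling logic omitted for brevity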
On Windows or Linux, pressing Ctrl+C once sends the process an interrupt signal; pressing Ctrl+C a second time force-kills it.
On Linux you can do the same with signals: a normal termination, e.g. kill <pid> (or pkill -f main.py to match the script by name), is caught by Scrapy, which then performs its shutdown handling such as saving the crawl state. A forced kill with SIGKILL, e.g. kill -9 <pid> (or pkill -9 -f main.py), cannot be caught, so the operating system terminates the process immediately and no further processing takes place.
Example:
scrapy crawl jobbole -s JOBDIR=job_info/001
- -s is short for “set”: it sets a Scrapy setting from the command line (here, JOBDIR).
- Different spiders need different job directories, and runs of the same spider started at different times also need different directories.
- After Ctrl-C the pause state is saved to job_info/001; to resume from where it stopped, run the same command again:
scrapy crawl jobbole -s JOBDIR=job_info/001
The crawl will then continue with whatever was left unfinished.
References:
- 第六章 慕課網學習-scrapy的暫停與重啓: https://blog.csdn.net/shaququ/article/details/77587941
- python爬蟲進階之scrapy的暫停與重啓: https://blog.csdn.net/m0_37338590/article/details/81332540
- 三十二 Python分佈式爬蟲打造搜索引擎Scrapy精講—scrapy的暫停與重啓: https://www.cnblogs.com/meng-wei-zhi/p/8182788.html
- [1] Scrapy official documentation, “Jobs: pausing and resuming crawls”: https://doc.scrapy.org/en/latest/topics/jobs.html (accessed 2019-3-8)