Stopping Scrapy-Redis from Idling After the Crawl Is Finished


1. Background

In a scrapy-redis distributed crawl, multiple crawler hosts share a single crawl queue. When the queue contains requests, each crawler pops one and crawls it; when the queue contains no requests, the crawler simply waits, producing output like the following:

    E:\Miniconda\python.exe E:/PyCharmCode/redisClawerSlaver/redisClawerSlaver/spiders/main.py
    2017-12-12 15:54:18 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
    2017-12-12 15:54:18 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
    2017-12-12 15:54:18 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.logstats.LogStats']
    2017-12-12 15:54:18 [myspider_redis] INFO: Reading start URLs from redis key 'myspider:start_urls' (batch size: 110, encoding: utf-8)
    2017-12-12 15:54:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'redisClawerSlaver.middlewares.ProxiesMiddleware',
     'redisClawerSlaver.middlewares.HeadersMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2017-12-12 15:54:18 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2017-12-12 15:54:18 [scrapy.middleware] INFO: Enabled item pipelines:
    ['redisClawerSlaver.pipelines.ExamplePipeline',
     'scrapy_redis.pipelines.RedisPipeline']
    2017-12-12 15:54:18 [scrapy.core.engine] INFO: Spider opened
    2017-12-12 15:54:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2017-12-12 15:55:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2017-12-12 15:56:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
  • But what if every request has already been crawled? The crawler has no way of knowing this: it cannot tell the difference between "finished" and a temporary empty window in the queue, so it stays in the waiting state shown above. This is what we call idling (空跑).
  • Is there a way for the crawler to detect this situation and shut itself down?

2. Environment

  • OS: Windows 7
  • scrapy-redis
  • redis 3.0.5
  • python 3.6.1

3. Solutions

  • As the background shows, under the scrapy-redis model "the crawl is finished" is a fuzzy notion. While the crawl runs, the crawl queue changes constantly: as requests are consumed, new requests flow back in. If consumption outpaces refilling, the queue goes through empty windows (periods during which it holds no requests); if refilling keeps up, no empty window appears. So "finished" can only be defined approximately; there is no exact criterion.
  • The two approaches below are therefore both approximations.

3.1. Using Scrapy's CloseSpider extension

    # CloseSpider extension
    class scrapy.contrib.closespider.CloseSpider

    Closes a spider automatically when certain conditions are met, using a
    specific close reason for each condition.

    The conditions under which the spider is closed can be configured through
    the following settings:

    CLOSESPIDER_TIMEOUT
    CLOSESPIDER_ITEMCOUNT
    CLOSESPIDER_PAGECOUNT
    CLOSESPIDER_ERRORCOUNT
  • CLOSESPIDER_TIMEOUT
    CLOSESPIDER_TIMEOUT
    Default: 0

    An integer number of seconds. If the spider is still running after that many
    seconds, it is closed automatically with the reason closespider_timeout.
    If set to 0 (or left unset), the spider is never closed because of a timeout.
  • CLOSESPIDER_ITEMCOUNT
    CLOSESPIDER_ITEMCOUNT
    Default: 0

    An integer number of items. If the spider scrapes more than that many items
    and they are passed through the item pipeline, it is closed automatically
    with the reason closespider_itemcount.
  • CLOSESPIDER_PAGECOUNT
    CLOSESPIDER_PAGECOUNT
    New in version 0.11.

    Default: 0

    An integer specifying the maximum number of responses to crawl. If the spider
    crawls more than that, it is closed automatically with the reason
    closespider_pagecount. If set to 0 (or left unset), the spider is never
    closed because of the number of crawled responses.
  • CLOSESPIDER_ERRORCOUNT
    CLOSESPIDER_ERRORCOUNT
    New in version 0.11.

    Default: 0

    An integer specifying the maximum number of errors the spider can tolerate.
    If the spider generates more errors than that, it is closed with the reason
    closespider_errorcount. If set to 0 (or left unset), the spider is never
    closed because of the number of errors.
  • Example: open settings.py and add the following setting:
    # If the crawl is still running after 23.5 hours, close the spider automatically
    CLOSESPIDER_TIMEOUT = 84600
  • Important: if the spider has not worked through all requests within the time limit and is stopped forcibly, some requests are still left in the crawl queue. Before the next run, remember to flush the crawl queue on the master, for example as in the sketch below.
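  • A minimal sketch of flushing the leftover scrapy-redis keys on the master with redis-py. The host, port, and the spider name myspider are placeholders; the key names assume the default scrapy-redis patterns '%(spider)s:requests' and '%(spider)s:dupefilter', so adjust them to your own settings.

    # flush_queue.py - run on the master before restarting a crawl
    import redis

    # Connection parameters are placeholders; point them at your master Redis.
    r = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)

    # Default scrapy-redis key names for a spider called "myspider".
    r.delete('myspider:requests')    # leftover crawl queue
    r.delete('myspider:dupefilter')  # fingerprints of already-seen requests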

3.2. Modifying the scrapy-redis source

    # ----- Things to keep in mind when modifying the scrapy-redis source: -----
    # 1. Keep a backup of the original code.
    # 2. When the project is moved to another machine, the modified scrapy-redis
    #    source has to be moved along with it. It normally lives under
    #    \Lib\site-packages\scrapy_redis\
  • Think about what characterises a finished crawl: the crawl queue is empty, and no request can be fetched from it any more. So the place to intervene is where requests are fetched from the queue and scheduled. Looking through the scrapy-redis source, we find two candidate spots:

3.2.1. Details

    # .\Lib\site-packages\scrapy_redis\scheduler.py
    def next_request(self):
        block_pop_timeout = self.idle_before_close
        # Pop a request from the crawl queue.
        # block_pop_timeout comes from idle_before_close and is passed to the
        # queue's pop(); as noted in 3.2.2, it only acts as a blocking timeout
        # for FifoQueue/LifoQueue.
        request = self.queue.pop(block_pop_timeout)
        if request and self.stats:
            self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
        return request
  1.  
    # .\Lib\site-packages\scrapy_redis\spiders.py
  2.  
    def next_requests(self):
  3.  
    "" "Returns a request to be scheduled or none." ""
  4.  
    use_set = self.settings.getbool( 'REDIS_START_URLS_AS_SET', defaults.START_URLS_AS_SET)
  5.  
    fetch_one = self.server.spop if use_set else self.server.lpop
  6.  
    # XXX: Do we need to use a timeout here?
  7.  
    found = 0
  8.  
    # TODO: Use redis pipeline execution.
  9.  
    while found < self. redis_batch_size:
  10.  
    data = fetch_one( self.redis_key)
  11.  
    if not data:
  12.  
    # 表明爬取隊列爲空。可是多是永久爲空,也多是暫時爲空
  13.  
    # Queue empty.
  14.  
    break
  15.  
    req = self.make_request_from_data(data)
  16.  
    if req:
  17.  
    yield req
  18.  
    found += 1
  19.  
    else:
  20.  
    self.logger.debug( "Request not made from data: %r", data)
  21.  
     
  22.  
    if found:
  23.  
    self.logger.debug( "Read %s requests from '%s'", found, self.redis_key)
  • As the comments indicate, these are the only two places in the source where we can intervene. But during a crawl the queue may well be empty only temporarily at any given moment. The usual way to decide that it is really empty is to set a time window: if the queue stays empty for the whole window, we can reasonably conclude that the crawl has finished. Hence the following change:
    # .\Lib\site-packages\scrapy_redis\scheduler.py

    # Original code
    def next_request(self):
        block_pop_timeout = self.idle_before_close
        request = self.queue.pop(block_pop_timeout)
        if request and self.stats:
            self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
        return request

    # Modified code
    # (assumes "import datetime" has been added at the top of scheduler.py)
    def __init__(self, server,
                 persist=False,
                 flush_on_start=False,
                 queue_key=defaults.SCHEDULER_QUEUE_KEY,
                 queue_cls=defaults.SCHEDULER_QUEUE_CLASS,
                 dupefilter_key=defaults.SCHEDULER_DUPEFILTER_KEY,
                 dupefilter_cls=defaults.SCHEDULER_DUPEFILTER_CLASS,
                 idle_before_close=0,
                 serializer=None):
        # ......
        # Add a counter of consecutive empty pops
        self.lostGetRequest = 0

    def next_request(self):
        block_pop_timeout = self.idle_before_close
        request = self.queue.pop(block_pop_timeout)
        if request and self.stats:
            # Reset the counter as soon as a request is obtained
            self.lostGetRequest = 0
            self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
        if request is None:
            self.lostGetRequest += 1
            print(f"request is None, lostGetRequest = {self.lostGetRequest}, time = {datetime.datetime.now()}")
            # 100 consecutive empty pops take roughly 8 minutes
            if self.lostGetRequest > 200:
                print(f"request is None, close spider.")
                # Shut the spider down
                self.spider.crawler.engine.close_spider(self.spider, 'queue is empty')
        return request
  • Related log output:
    2017-12-14 16:18:06 [scrapy.middleware] INFO: Enabled item pipelines:
    ['redisClawerSlaver.pipelines.beforeRedisPipeline',
     'redisClawerSlaver.pipelines.amazonRedisPipeline',
     'scrapy_redis.pipelines.RedisPipeline']
    2017-12-14 16:18:06 [scrapy.core.engine] INFO: Spider opened
    2017-12-14 16:18:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    request is None, lostGetRequest = 1, time = 2017-12-14 16:18:06.370400
    request is None, lostGetRequest = 2, time = 2017-12-14 16:18:11.363400
    request is None, lostGetRequest = 3, time = 2017-12-14 16:18:16.363400
    request is None, lostGetRequest = 4, time = 2017-12-14 16:18:21.362400
    request is None, lostGetRequest = 5, time = 2017-12-14 16:18:26.363400
    request is None, lostGetRequest = 6, time = 2017-12-14 16:18:31.362400
    request is None, lostGetRequest = 7, time = 2017-12-14 16:18:36.363400
    request is None, lostGetRequest = 8, time = 2017-12-14 16:18:41.362400
    request is None, lostGetRequest = 9, time = 2017-12-14 16:18:46.363400
    request is None, lostGetRequest = 10, time = 2017-12-14 16:18:51.362400
    2017-12-14 16:18:56 [scrapy.core.engine] INFO: Closing spider (queue is empty)
    request is None, lostGetRequest = 11, time = 2017-12-14 16:18:56.363400
    request is None, close spider.
    Login result: loginRes = (235, b'Authentication successful')
    Login succeeded, code = 235
    mail has been send successfully. message:Content-Type: text/plain; charset="utf-8"
    MIME-Version: 1.0
    Content-Transfer-Encoding: base64
    From: 548516910@qq.com
    To: 548516910@qq.com
    Subject: =?utf-8?b?54is6Jmr57uT5p2f54q25oCB5rGH5oql77yabmFtZSA9IHJlZGlzQ2xhd2VyU2xhdmVyLCByZWFzb24gPSBxdWV1ZSBpcyBlbXB0eSwgZmluaXNoZWRUaW1lID0gMjAxNy0xMi0xNCAxNjoxODo1Ni4zNjQ0MDA=?=

    57uG6IqC77yacmVhc29uID0gcXVldWUgaXMgZW1wdHksIHN1Y2Nlc3NzISBhdDoyMDE3LTEyLTE0
    IDE2OjE4OjU2LjM2NDQwMA==

    2017-12-14 16:18:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'finish_reason': 'queue is empty',
     'finish_time': datetime.datetime(2017, 12, 14, 8, 18, 56, 364400),
     'log_count/INFO': 8,
     'start_time': datetime.datetime(2017, 12, 14, 8, 18, 6, 362400)}
    2017-12-14 16:18:56 [scrapy.core.engine] INFO: Spider closed (queue is empty)
    Unhandled Error
    Traceback (most recent call last):
      File "E:\Miniconda\lib\site-packages\scrapy\commands\runspider.py", line 89, in run
        self.crawler_process.start()
      File "E:\Miniconda\lib\site-packages\scrapy\crawler.py", line 285, in start
        reactor.run(installSignalHandlers=False)  # blocking call
      File "E:\Miniconda\lib\site-packages\twisted\internet\base.py", line 1243, in run
        self.mainLoop()
      File "E:\Miniconda\lib\site-packages\twisted\internet\base.py", line 1252, in mainLoop
        self.runUntilCurrent()
    --- <exception caught here> ---
      File "E:\Miniconda\lib\site-packages\twisted\internet\base.py", line 878, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "E:\Miniconda\lib\site-packages\scrapy\utils\reactor.py", line 41, in __call__
        return self._func(*self._a, **self._kw)
      File "E:\Miniconda\lib\site-packages\scrapy\core\engine.py", line 137, in _next_request
        if self.spider_is_idle(spider) and slot.close_if_idle:
      File "E:\Miniconda\lib\site-packages\scrapy\core\engine.py", line 189, in spider_is_idle
        if self.slot.start_requests is not None:
    builtins.AttributeError: 'NoneType' object has no attribute 'start_requests'

    2017-12-14 16:18:56 [twisted] CRITICAL: Unhandled Error
    Traceback (most recent call last):
      File "E:\Miniconda\lib\site-packages\scrapy\commands\runspider.py", line 89, in run
        self.crawler_process.start()
      File "E:\Miniconda\lib\site-packages\scrapy\crawler.py", line 285, in start
        reactor.run(installSignalHandlers=False)  # blocking call
      File "E:\Miniconda\lib\site-packages\twisted\internet\base.py", line 1243, in run
        self.mainLoop()
      File "E:\Miniconda\lib\site-packages\twisted\internet\base.py", line 1252, in mainLoop
        self.runUntilCurrent()
    --- <exception caught here> ---
      File "E:\Miniconda\lib\site-packages\twisted\internet\base.py", line 878, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "E:\Miniconda\lib\site-packages\scrapy\utils\reactor.py", line 41, in __call__
        return self._func(*self._a, **self._kw)
      File "E:\Miniconda\lib\site-packages\scrapy\core\engine.py", line 137, in _next_request
        if self.spider_is_idle(spider) and slot.close_if_idle:
      File "E:\Miniconda\lib\site-packages\scrapy\core\engine.py", line 189, in spider_is_idle
        if self.slot.start_requests is not None:
    builtins.AttributeError: 'NoneType' object has no attribute 'start_requests'


    Process finished with exit code 0
  • One problem remains: as the log shows, when the spider is shut down via engine.close_spider(spider, 'reason'), a few errors are sometimes raised before it actually closes. This is probably because Scrapy keeps several crawl tasks in flight concurrently; once one of them has closed the spider, the others can no longer find it and raise errors. A small mitigation is sketched below.
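  • A minimal sketch (not from the original post) of one way to reduce the noise: guard the patched next_request so that engine.close_spider is only requested once per run. The attribute name _close_requested is hypothetical, and this only stops the scheduler from repeatedly asking for a shutdown; the unhandled error inside the engine's own loop may still appear once.

    # Sketch only: same patched next_request as above, with a one-shot guard.
    def next_request(self):
        block_pop_timeout = self.idle_before_close
        request = self.queue.pop(block_pop_timeout)
        if request and self.stats:
            self.lostGetRequest = 0
            self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
        if request is None:
            self.lostGetRequest += 1
            # Only ask the engine to close the spider once.
            if self.lostGetRequest > 200 and not getattr(self, '_close_requested', False):
                self._close_requested = True
                self.spider.crawler.engine.close_spider(self.spider, 'queue is empty')
        return request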

3.2.2. Caveats

The overall scheduling flow is as follows:

  • scheduler.py (screenshot in the original post, omitted here)

  • queue.py (screenshot in the original post, omitted here; see the excerpt below)
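  • The screenshots showed the pop() implementations in scrapy_redis/queue.py, which is what the next point relies on. A paraphrased excerpt is reproduced below; it is written from memory of the scrapy-redis version used here, so check the copy in your own site-packages. The point to notice: FifoQueue (and LifoQueue, which is analogous) honours the timeout argument with a blocking pop, while PriorityQueue ignores it.

    # .\Lib\site-packages\scrapy_redis\queue.py (paraphrased excerpt)
    class FifoQueue(Base):
        def pop(self, timeout=0):
            if timeout > 0:
                # Blocking pop: waits up to `timeout` seconds for a request.
                data = self.server.brpop(self.key, timeout)
                if isinstance(data, tuple):
                    data = data[1]
            else:
                data = self.server.rpop(self.key)
            if data:
                return self._decode_request(data)

    class PriorityQueue(Base):
        def pop(self, timeout=0):
            # The timeout argument is not supported here: the sorted-set pop is
            # done atomically via MULTI/EXEC and never blocks.
            pipe = self.server.pipeline()
            pipe.multi()
            pipe.zrange(self.key, 0, 0).zremrangebyrank(self.key, 0, 0)
            results, count = pipe.execute()
            if results:
                return self._decode_request(results[0])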

  • So PriorityQueue behaves differently from the other two queue classes, FifoQueue and LifoQueue, and this needs particular attention: if you want the timeout parameter to have any effect, the crawl queue configured in settings must be FifoQueue or LifoQueue.
    # Queue class used to order the crawl queue.
    # Default: priority ordering (Scrapy's default), implemented on a Redis
    # sorted set; neither FIFO nor LIFO.
    # 'SCHEDULER_QUEUE_CLASS': 'scrapy_redis.queue.SpiderPriorityQueue',
    # Optional: first-in, first-out (FIFO)
    'SCHEDULER_QUEUE_CLASS': 'scrapy_redis.queue.SpiderQueue',
    # Optional: last-in, first-out (LIFO)
    # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderStack'

Source: https://blog.csdn.net/mr_hui_/article/details/81455387