Scrapy extension in practice: closing the spider when idle

Scrapy extension in practice

 

1. Idle shutdown

Use an extension together with the spider_idle signal to close the spider.

Enable the extension in settings.py:

EXTENSIONS = {
    #'scrapy.extensions.telnet.TelnetConsole': None,
    'extention_my.RedisSpiderSmartIdleClosedExensions': 300,
}

 

Additional configuration parameters in conf.py:

MYEXT_ENABLED = True
IDLE_NUMBER = 5
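
A comment in the extension code relates the number of idle slices to wall-clock time (360 slices for 30 minutes), which implies roughly 5 seconds per slice. The following is a minimal sketch of that arithmetic, assuming a fixed 5-second interval between spider_idle triggers. The names `approx_idle_seconds` and `tick_seconds` are hypothetical and not part of Scrapy.

```python
def approx_idle_seconds(idle_number, tick_seconds=5.0):
    """Rough time from the first idle trigger until the shutdown condition is met.

    Assumes spider_idle fires about every `tick_seconds` while the spider is idle.
    """
    return idle_number * tick_seconds


# Value used in conf.py above, in seconds
print(approx_idle_seconds(5))
# A limit of 360 slices, expressed in minutes
print(approx_idle_seconds(360) / 60)
```

The real interval depends on how often the engine reports the spider as idle, so treat the result as an estimate when choosing IDLE_NUMBER.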

 

The extension class:

extention_my.py

#coding:utf-8

"""
----------------------------------------
description:

author: sss

date:
----------------------------------------
change:
   
----------------------------------------

"""
__author__ = 'sss'

import time
from scrapy import signals
from scrapy.exceptions import NotConfigured


from utils.mylogger import mylogger

logger_c = mylogger(__name__)
logger_m = logger_c.logger

class RedisSpiderSmartIdleClosedExensions(object):

    def __init__(self, idle_number, crawler):
        self.crawler = crawler
        self.idle_number = idle_number
        self.idle_list = []
        self.idle_count = 0

    @classmethod
    def from_crawler(cls, crawler):
        # First check whether the extension should be enabled;
        # if not, raise NotConfigured so that it is not loaded
        from conf import MYEXT_ENABLED
        if not MYEXT_ENABLED:
            raise NotConfigured

        # Read the number of idle time slices from the config
        # (the intended default is 360 slices, about 30 minutes)
        from conf import IDLE_NUMBER as idle_number

        # Instantiate the extension object
        ext = cls(idle_number, crawler)

        # Connect the extension object to the signals,
        # e.g. bind signals.spider_idle to the spider_idle() method
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)

        # Return the extension object
        return ext

    def spider_opened(self, spider):
        logger_m.info("opened spider %s redis spider Idle, Continuous idle limit: %d", spider.name, self.idle_number)

    def spider_closed(self, spider):
        logger_m.info("closed spider %s, idle count %d , Continuous idle count %d",
                    spider.name, self.idle_count, len(self.idle_list))

    def spider_idle(self, spider):
        self.idle_count += 1  # total number of idle triggers
        self.idle_list.append(time.time())  # record a timestamp each time spider_idle fires
        idle_list_len = len(self.idle_list)  # number of consecutive triggers so far
        print(self.idle_number, self.idle_count, self.idle_list)

        # If the gap between this trigger and the previous one is more than 6 seconds,
        # the spider was busy in between (redis still had keys), so restart the count
        if idle_list_len > 2 and self.idle_list[-1] - self.idle_list[-2] > 6:
            self.idle_list = [self.idle_list[-1]]

        elif idle_list_len > self.idle_number:
            # Close the spider once the consecutive trigger count exceeds the configured limit
            logger_m.info('\n continued idle number exceed {} Times'
                          '\n meet the idle shutdown conditions, will close the reptile operation'
                          '\n idle start time: {},  close spider time: {}'.format(self.idle_number,
                                                                                  self.idle_list[0], self.idle_list[-1]))
            # Perform the close-spider operation
            self.crawler.engine.close_spider(spider, 'closespider_pagecount')

There is not much else to it; the main point is the design of the condition that decides whether to close the spider.
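
The close condition can be tried out without running a crawl. The following is a minimal sketch that mirrors the bookkeeping in spider_idle() using hand-written timestamps instead of time.time(). `IdleTracker`, `record`, `first_close_time` and `max_gap` are hypothetical names introduced for illustration; only the comparison logic is taken from the extension above.

```python
class IdleTracker:
    """Stand-in for the extension's idle bookkeeping (illustrative only)."""

    def __init__(self, idle_number, max_gap=6.0):
        self.idle_number = idle_number  # consecutive idle triggers allowed
        self.max_gap = max_gap          # larger gaps mean the spider was busy
        self.idle_list = []

    def record(self, timestamp):
        """Record one idle trigger; return True when the spider should close."""
        self.idle_list.append(timestamp)
        count = len(self.idle_list)
        if count > 2 and self.idle_list[-1] - self.idle_list[-2] > self.max_gap:
            # The idle run was interrupted: keep only the latest trigger
            self.idle_list = [self.idle_list[-1]]
            return False
        return count > self.idle_number


def first_close_time(timestamps, idle_number):
    """Return the timestamp at which the close condition is first met, or None."""
    tracker = IdleTracker(idle_number)
    for ts in timestamps:
        if tracker.record(ts):
            return ts
    return None


# Idle triggers every 5 seconds without interruption
steady = [0, 5, 10, 15, 20, 25, 30]
# Same, but the spider was busy between t=10 and t=30
interrupted = [0, 5, 10, 30, 35, 40, 45, 50, 55]

print(first_close_time(steady, idle_number=5))
print(first_close_time(interrupted, idle_number=5))
```

Keeping the decision in a small object like this makes it easy to check how a change to the gap threshold or to IDLE_NUMBER affects the shutdown time before wiring it into the extension.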
