Distributed Crawlers

Distributed deployment with Redis

Can the Scrapy framework implement distributed crawling on its own?

No, for two reasons:
  First: each machine running Scrapy has its own scheduler, so the URLs in the start_urls list cannot be divided among the machines (multiple machines cannot share one scheduler).
  Second: data scraped on different machines cannot be funneled through one pipeline for unified persistent storage (multiple machines cannot share one pipeline).
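The shared-scheduler idea can be illustrated with a toy in-memory model (plain Python objects standing in for what scrapy-redis builds on top of Redis): one queue plus one dedup set shared by every machine is enough to partition the URL list without duplicated work.

```python
from collections import deque

# Toy model: one queue and one dedup set shared by every "machine" stand in
# for the Redis-backed scheduler and fingerprint set that scrapy-redis provides.
shared_queue = deque(['u1', 'u2', 'u3', 'u2'])   # 'u2' is duplicated on purpose
seen = set()

def take_one(machine):
    """One scheduling step: a machine asks the shared scheduler for a URL."""
    while shared_queue:
        url = shared_queue.popleft()
        if url not in seen:          # shared request dedup
            seen.add(url)
            return (machine, url)
    return None

assignments = []
for machine in ['A', 'B', 'A', 'B']:
    task = take_one(machine)
    if task:
        assignments.append(task)

# The URL list is partitioned across machines and the duplicate is dropped.
print(assignments)  # → [('A', 'u1'), ('B', 'u2'), ('A', 'u3')]
```

With a per-machine scheduler, by contrast, every machine would hold its own copy of the queue and crawl all four entries itself.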

Distributed crawling with the scrapy-redis component

The scrapy-redis component packages a scheduler and a pipeline that can be shared by multiple machines; we can use them directly to implement distributed scraping.

Setup workflow

Setup workflow:
        - Create the project
        - Create the spider file
        - Modify the spider file:
            - Import: from scrapy_redis.spiders import RedisCrawlSpider
            - Change the spider class's parent class to RedisCrawlSpider
            - Delete allowed_domains and start_urls, and add a new attribute redis_key (the name of the scheduler queue)
            - Parse the data, pack the parsed data into an item, and submit it to the pipeline
        - Edit the settings file:
            - Specify the pipeline:
                ITEM_PIPELINES = {
                    'scrapy_redis.pipelines.RedisPipeline': 400
                }
            - Specify the scheduler:
                # Add a dedup-container class: request fingerprints are stored in a Redis set, which makes request dedup persistent
                DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
                # Use the scheduler that ships with scrapy-redis
                SCHEDULER = "scrapy_redis.scheduler.Scheduler"
                # Whether the scheduler persists: when the crawl ends, should the request queue and the dedup fingerprint set in Redis be kept? True means persist (do not clear the data); False means clear it
                SCHEDULER_PERSIST = True
            - Point at the actual Redis instance:
                REDIS_HOST = 'ip address of the Redis server'
                REDIS_PORT = 6379
                REDIS_ENCODING = 'utf-8'
                REDIS_PARAMS = {'password': '123456'}
        - Start the Redis server (passing its config file: redis-server ./redis.windows.conf) and a client:
            - First make two changes in the Redis config file:
                - #bind 127.0.0.1
                - protected-mode no
            - Then start both
        - Run the spider: scrapy runspider xxx.py
        - Push a start URL into the scheduler queue (from the Redis client): lpush xxx www.xxx.com
            - where xxx is the value of the redis_key attribute

Implementation options:

1. The RedisSpider class from the component
2. The RedisCrawlSpider class from the component

Distributed implementation steps: the workflow is the same for both classes.

Install the scrapy-redis component: pip install scrapy-redis

Edit the Redis config file:

- Comment out the line bind 127.0.0.1 so that other IPs can access Redis
- Change protected-mode from yes to no so that other IPs can operate on Redis

Modify the relevant code in the spider file:

- Change the spider's parent class to RedisSpider or RedisCrawlSpider. Note: if the original spider was based on Spider, change the parent class to RedisSpider; if it was based on CrawlSpider, change it to RedisCrawlSpider.
- Comment out or delete the start_urls list and add a redis_key attribute whose value is the name of the scheduler queue in the scrapy-redis component.
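Put together, a spider file converted per the steps above might look like the sketch below (the class name, redis_key value, and link pattern are illustrative assumptions, not taken from a real project; it presumes scrapy and scrapy-redis are installed):

```python
# Illustrative sketch of a spider converted to scrapy-redis.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider


class FbsSpider(RedisCrawlSpider):       # parent class changed to RedisCrawlSpider
    name = 'fbs'
    # allowed_domains and start_urls are removed: URLs come from the shared queue
    redis_key = 'fbs_queue'              # name of the scheduler queue in Redis

    rules = (
        Rule(LinkExtractor(allow=r'page=\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # parse the response, fill an item, and yield it to the shared pipeline
        ...
```

Such a spider is then started with scrapy runspider and seeded from a Redis client with lpush fbs_queue followed by a start URL.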

In the settings file, enable the pipeline packaged in the scrapy-redis component:

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}

In the settings file, enable the scheduler packaged in the scrapy-redis component:

# Use the scrapy-redis dedup filter (request fingerprints stored in Redis)
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler that ships with scrapy-redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Allow pausing/resuming (keep the queue and fingerprints in Redis)
SCHEDULER_PERSIST = True

In the settings file, configure the spider's connection to Redis:

REDIS_HOST = 'ip address of the Redis server'
REDIS_PORT = 6379
REDIS_ENCODING = 'utf-8'
REDIS_PARAMS = {'password': '123456'}

Start the Redis server: redis-server <config file>
Start the Redis client: redis-cli
Run the spider file: scrapy runspider SpiderFile
Push a start URL into the scheduler queue (from the Redis client): lpush <redis_key value> <start URL>

Example 1

Spider file

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from moviePro.items import MovieproItem
class MovieSpider(CrawlSpider):
    conn = Redis(host='127.0.0.1',port=6379)
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/frim/index1.html']

    rules = (
        Rule(LinkExtractor(allow=r'/frim/index1-\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # extract the detail-page URL for every movie on the current page
        li_list = response.xpath('//div[@class="stui-pannel_bd"]/ul/li')
        for li in li_list:
            detail_url = 'https://www.4567tv.tv' + li.xpath('./div/a/@href').extract_first()
            # sadd returns 1 if this URL has not been requested yet, 0 if it has
            ex = self.conn.sadd('movie_detail_urls', detail_url)
            if ex == 1:
                print('New data to crawl...')
                yield scrapy.Request(url=detail_url, callback=self.parse_detail)
            else:
                print('No new data to crawl!')

    def parse_detail(self, response):
        name = response.xpath('/html/body/div[1]/div/div/div/div[2]/h1/text()').extract_first()
        m_type = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[1]/a[1]/text()').extract_first()
        item = MovieproItem()
        item['name'] = name
        item['m_type'] = m_type

        yield item
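The spider's dedup hinges on Redis SADD returning 1 for a new member and 0 for one already in the set. The same decision logic can be exercised without a Redis server by swapping in a minimal in-memory stand-in (a plain set substitutes for the real connection):

```python
class FakeRedisSet:
    """Minimal stand-in for Redis SADD: returns 1 for a new member, 0 otherwise."""
    def __init__(self):
        self._members = set()

    def sadd(self, key, value):      # the key is ignored in this toy version
        if value in self._members:
            return 0
        self._members.add(value)
        return 1

conn = FakeRedisSet()
decisions = []
for url in ['https://example.com/a', 'https://example.com/b', 'https://example.com/a']:
    if conn.sadd('movie_detail_urls', url) == 1:
        decisions.append(('crawl', url))     # new URL: schedule a request
    else:
        decisions.append(('skip', url))      # already requested: skip it
print(decisions)
```

The repeated URL is skipped on its second appearance, which is exactly what makes re-running the spider incremental.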

Items file

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MovieproItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    m_type = scrapy.Field()

Pipeline file

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


import json


class MovieproPipeline(object):
    def process_item(self, item, spider):
        conn = spider.conn          # reuse the Redis connection created on the spider
        dic = {
            'name': item['name'],
            'm_type': item['m_type']
        }
        # redis-py 3.x cannot LPUSH a raw dict, so serialize the record to JSON first
        conn.lpush('movie_data', json.dumps(dic, ensure_ascii=False))
        return item
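One caveat with this pipeline: redis-py 3.x raises DataError when LPUSHing a raw dict, so serializing each record to a JSON string first is one common fix. The serialization round trip itself needs nothing but the stdlib (the field values below are illustrative):

```python
import json

record = {'name': 'some movie', 'm_type': 'drama'}   # illustrative values
payload = json.dumps(record, ensure_ascii=False)      # what the pipeline would LPUSH
restored = json.loads(payload)                        # what a consumer reads back
assert restored == record
print(payload)
```

ensure_ascii=False keeps non-ASCII names readable in the stored string instead of escaping them.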

Settings file (settings.py)

# Scrapy settings for moviePro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'moviePro'

SPIDER_MODULES = ['moviePro.spiders']
NEWSPIDER_MODULE = 'moviePro.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'moviePro (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'moviePro.middlewares.MovieproSpiderMiddleware': 543,
#}
LOG_LEVEL = 'ERROR'
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'moviePro.middlewares.MovieproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'moviePro.pipelines.MovieproPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Example 2: an incremental crawler

Spider file

import scrapy
from qiubaiPro.items import QiubaiproItem
import hashlib
from redis import Redis
class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    conn = Redis(host='127.0.0.1',port=6379)
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            author = div.xpath('./div/a[2]/h2/text() | ./div/span[2]/h2/text()').extract_first()
            content = ''.join(div.xpath('./a/div/span//text()').extract())

            item = QiubaiproItem()
            item['author'] = author
            item['content'] = content

            # data fingerprint: a unique identifier for one scraped record
            data = author + content
            hash_key = hashlib.sha256(data.encode()).hexdigest()
            # sadd returns 1 if the fingerprint is new, 0 if it was already stored
            ex = self.conn.sadd('hash_keys', hash_key)
            if ex == 1:
                print('New data found...')
                yield item
            else:
                print('No new data!')
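The data fingerprint here is just a SHA-256 hash over author + content, and "is this record new?" is then one set-membership test. The idea can be exercised without Redis, with an in-memory set standing in for the hash_keys set (the author/content values are made up):

```python
import hashlib

seen_fingerprints = set()   # stands in for the Redis set 'hash_keys'

def is_new_record(author, content):
    """Return True only the first time this (author, content) pair is seen."""
    fingerprint = hashlib.sha256((author + content).encode()).hexdigest()
    if fingerprint in seen_fingerprints:
        return False
    seen_fingerprints.add(fingerprint)
    return True

assert is_new_record('alice', 'joke one') is True     # first sight: keep it
assert is_new_record('alice', 'joke one') is False    # identical record: drop it
assert is_new_record('bob', 'joke one') is True       # different author, new fingerprint
```

Hashing the whole record rather than the URL is what lets this spider detect updated content on pages it has already visited.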

Items file

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class QiubaiproItem(scrapy.Item):
    # define the fields for your item here like:
    author = scrapy.Field()
    content = scrapy.Field()

Settings file (settings.py)

# -*- coding: utf-8 -*-

# Scrapy settings for qiubaiPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'qiubaiPro'

SPIDER_MODULES = ['qiubaiPro.spiders']
NEWSPIDER_MODULE = 'qiubaiPro.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'

LOG_LEVEL = 'ERROR'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'qiubaiPro (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'qiubaiPro.middlewares.QiubaiproSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'qiubaiPro.middlewares.QiubaiproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'qiubaiPro.pipelines.QiubaiproPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'