Can the Scrapy framework implement distributed crawling on its own?
No, for two reasons:
- First: each machine that Scrapy is deployed on has its own scheduler, so the machines cannot divide up the URLs in the start_urls list among themselves. (Multiple machines cannot share one scheduler.)
- Second: the data crawled by multiple machines cannot be persisted through one unified pipeline. (Multiple machines cannot share one pipeline.)
The scrapy-redis component packages a scheduler and a pipeline that can be shared by multiple machines; we can use them directly to implement distributed crawling.
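The effect of a shared scheduler can be illustrated with a small standard-library sketch: several worker threads stand in for machines, all pulling from one shared queue, so every URL is assigned to exactly one "machine" (the worker count and URLs are made up for illustration):

```python
import queue
import threading

task_queue = queue.Queue()  # stands in for the shared scheduler queue in Redis
for i in range(6):
    task_queue.put(f'https://www.xxx.com/page/{i}')

crawled = []
lock = threading.Lock()

def worker(machine_id):
    # every "machine" pulls from the same queue, so no URL is handed out twice
    while True:
        try:
            url = task_queue.get_nowait()
        except queue.Empty:
            return
        with lock:
            crawled.append((machine_id, url))

threads = [threading.Thread(target=worker, args=(m,)) for m in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(crawled))  # 6: each of the 6 URLs was crawled exactly once
```

With one scheduler per machine (the plain-Scrapy situation), each machine would hold its own copy of the queue and all 6 URLs would be crawled by every machine.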
Setup flow:
- Create the project and a spider file
- Modify the spider file:
  - Import: from scrapy_redis.spiders import RedisCrawlSpider
  - Change the spider class's parent class to RedisCrawlSpider
  - Delete allowed_domains and start_urls, and add a new attribute redis_key (the name of the scheduler queue)
  - Parse the data, pack it into an item, and submit it to the pipeline
- Edit the settings file:
  - Specify the pipeline:
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 400
    }
  - Specify the scheduler:
    # Add a dedup container class that uses a Redis set to store request fingerprints,
    # making request dedup persistent
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # Use scrapy-redis's own scheduler
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # Whether the scheduler persists its state, i.e. when the crawl ends, whether the
    # request queue and the dedup fingerprint set in Redis are kept. True means persist
    # (do not clear the data); otherwise the data is cleared
    SCHEDULER_PERSIST = True
  - Specify the concrete Redis instance:
    REDIS_HOST = 'ip of the redis server'
    REDIS_PORT = 6379
    REDIS_ENCODING = 'utf-8'
    REDIS_PARAMS = {'password': '123456'}
- Start the Redis server (passing it the Redis config file: redis-server ./redis.windows.conf) and a client; first make the appropriate changes to the Redis config file:
  - #bind 127.0.0.1
  - protected-mode no
- Run the spider: scrapy runspider xxx.py
- Push a start URL into the scheduler queue (from the Redis client): lpush xxx www.xxx.com
  - here xxx is the value of the redis_key attribute
The component provides two spider classes:
1. The RedisSpider class
2. The RedisCrawlSpider class
Install the scrapy-redis component: pip install scrapy-redis
Changes to the Redis config file:
- Comment out this line: bind 127.0.0.1, so that other IPs can access Redis
- Change yes to no: protected-mode no, so that other IPs can operate on Redis
- Change the spider class's parent to RedisSpider or RedisCrawlSpider. Note: if the original spider file is based on Spider, the parent should become RedisSpider; if it is based on CrawlSpider, the parent should become RedisCrawlSpider.
- Comment out or delete the start_urls list, and add a redis_key attribute whose value is the name of the scheduler queue in the scrapy-redis component.
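Putting the modifications above together, the spider file ends up looking roughly like this (a sketch only: the class name, spider name, redis_key value, and link-extraction rule are made-up placeholders, and the parse body is elided):

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider

class FbsSpider(RedisCrawlSpider):
    name = 'fbs'
    # allowed_domains and start_urls are removed;
    # redis_key names the scheduler queue that start URLs get pushed into
    redis_key = 'fbsQueue'
    rules = (
        Rule(LinkExtractor(allow=r''), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # parse the data, pack it into an item, and submit it to the pipeline
        ...
```

The crawl then starts only once a URL is pushed into the queue, e.g. lpush fbsQueue www.xxx.com from the Redis client.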
In the settings file, enable the pipeline packaged in the scrapy-redis component:
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}
In the settings file, enable the scheduler packaged in the scrapy-redis component:
# Use scrapy-redis's dedup queue
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use scrapy-redis's own scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Whether pausing is allowed (the scheduler state persists in Redis)
SCHEDULER_PERSIST = True
In the settings file, configure the spider's connection to Redis:
REDIS_HOST = 'ip of the redis server'
REDIS_PORT = 6379
REDIS_ENCODING = 'utf-8'
REDIS_PARAMS = {'password': '123456'}
Start the Redis server: redis-server <config file>
Start a Redis client: redis-cli
Run the spider file: scrapy runspider SpiderFile
Push a start URL into the scheduler queue (from the Redis client): lpush <redis_key value> <start url>
Spider file
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from moviePro.items import MovieproItem


class MovieSpider(CrawlSpider):
    conn = Redis(host='127.0.0.1', port=6379)
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/frim/index1.html']
    rules = (
        Rule(LinkExtractor(allow=r'/frim/index1-\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Parse the movie detail-page URLs on the current page
        li_list = response.xpath('//div[@class="stui-pannel_bd"]/ul/li')
        for li in li_list:
            # Parse the detail-page URL
            detail_url = 'https://www.4567tv.tv' + li.xpath('./div/a/@href').extract_first()
            # ex == 1: this URL has not been requested yet; ex == 0: it has already been requested
            ex = self.conn.sadd('movie_detail_urls', detail_url)
            if ex == 1:
                print('New data to crawl......')
                yield scrapy.Request(url=detail_url, callback=self.parse_detail)
            else:
                print('No new data to crawl!')

    def parse_detail(self, response):
        name = response.xpath('/html/body/div[1]/div/div/div/div[2]/h1/text()').extract_first()
        m_type = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[1]/a[1]/text()').extract_first()
        item = MovieproItem()
        item['name'] = name
        item['m_type'] = m_type
        yield item
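The dedup trick above relies on Redis SADD returning 1 when the member is new and 0 when it already exists in the set. A minimal self-contained sketch of that logic, with an in-memory set standing in for Redis so it runs without a server (the URLs are made up):

```python
def sadd(seen, member):
    """Mimic Redis SADD for one member: return 1 if newly added, 0 if already present."""
    if member in seen:
        return 0
    seen.add(member)
    return 1

seen_urls = set()  # stands in for the Redis set 'movie_detail_urls'
results = []
for url in ['https://www.4567tv.tv/a', 'https://www.4567tv.tv/b', 'https://www.4567tv.tv/a']:
    if sadd(seen_urls, url) == 1:
        results.append(('request', url))  # URL not seen before: schedule a request
    else:
        results.append(('skip', url))     # URL already crawled: skip it

print(results)
```

Because the set lives in Redis rather than in the spider process, the record of crawled URLs survives restarts, which is what makes the incremental crawl work.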
items file
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class MovieproItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    m_type = scrapy.Field()
Pipeline file
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class MovieproPipeline(object):
    def process_item(self, item, spider):
        # Reuse the Redis connection created on the spider class
        conn = spider.conn
        dic = {
            'name': item['name'],
            'm_type': item['m_type']
        }
        # redis-py 3.x no longer accepts a dict directly, so serialize the record first
        conn.lpush('movie_data', json.dumps(dic))
        return item
Settings file
# Scrapy settings for moviePro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'moviePro'

SPIDER_MODULES = ['moviePro.spiders']
NEWSPIDER_MODULE = 'moviePro.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

LOG_LEVEL = 'ERROR'

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'moviePro.pipelines.MovieproPipeline': 300,
}
Spider file
import scrapy
from qiubaiPro.items import QiubaiproItem
import hashlib
from redis import Redis


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    conn = Redis(host='127.0.0.1', port=6379)
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            # Data fingerprint: a unique identifier for one crawled record
            author = div.xpath('./div/a[2]/h2/text() | ./div/span[2]/h2/text()').extract_first()
            content = div.xpath('./a/div/span//text()').extract()
            content = ''.join(content)

            item = QiubaiproItem()
            item['author'] = author
            item['content'] = content

            # Build the data fingerprint
            data = author + content
            hash_key = hashlib.sha256(data.encode()).hexdigest()
            ex = self.conn.sadd('hash_keys', hash_key)
            if ex == 1:
                print('New data found......')
                yield item
            else:
                print('No new data!')
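The fingerprint logic above can be exercised on its own: hash author + content, and treat a record as new only if its digest has not been seen before. The snippet below reproduces it with hashlib and a plain set standing in for the Redis set 'hash_keys' (the records are made up; the third duplicates the first):

```python
import hashlib

records = [
    {'author': 'A', 'content': 'joke one'},
    {'author': 'B', 'content': 'joke two'},
    {'author': 'A', 'content': 'joke one'},
]

hash_keys = set()  # stands in for the Redis set 'hash_keys'
new_items = []
for r in records:
    # Same fingerprint the spider builds: sha256 over author + content
    data = r['author'] + r['content']
    hash_key = hashlib.sha256(data.encode()).hexdigest()
    if hash_key not in hash_keys:  # the check the spider performs via SADD's return value
        hash_keys.add(hash_key)
        new_items.append(r)

print(len(new_items))  # 2: the duplicate record is filtered out
```

Fingerprinting on the data (rather than the URL) is what lets this spider detect updates on a page whose URL never changes.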
items file
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class QiubaiproItem(scrapy.Item):
    # define the fields for your item here like:
    author = scrapy.Field()
    content = scrapy.Field()
Settings file
# -*- coding: utf-8 -*-

# Scrapy settings for qiubaiPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'qiubaiPro'

SPIDER_MODULES = ['qiubaiPro.spiders']
NEWSPIDER_MODULE = 'qiubaiPro.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'

LOG_LEVEL = 'ERROR'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'qiubaiPro.pipelines.QiubaiproPipeline': 300,
}