CrawlSpider, Distributed Crawlers, and Incremental Crawlers

Date: 2019-12-10


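I. CrawlSpider

　　1. Last time we covered one way to crawl an entire site: recursive crawling with Scrapy's Spider class, where each Request object calls back into the parse method.

　　2. This time we look at a more convenient approach: automatic crawling with CrawlSpider, which is more concise and efficient.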

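　　Introduction to CrawlSpider

　　CrawlSpider is a subclass of Spider. Besides inheriting Spider's features, it adds more powerful features of its own, the most notable being the LinkExtractor (link extractor). Spider is the base class of all spiders and is designed only to crawl the pages listed in start_urls; when the job is to keep crawling the URLs extracted from pages already fetched, CrawlSpider is the better fit.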

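　　Using CrawlSpider

1. Create the Scrapy project: scrapy startproject projectName
2. Create the spider file: scrapy genspider -t crawl spiderName www.xxx.com
   -- Compared with the earlier command, this one adds "-t crawl", which means the generated spider file is based on the CrawlSpider class rather than on the Spider base class.

　　Let's look at the generated spider file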

 1 # -*- coding: utf-8 -*-
 2 import scrapy
 3 from scrapy.linkextractors import LinkExtractor
 4 from scrapy.spiders import CrawlSpider, Rule
 5 
 6 
 7 class SuperSpiderSpider(CrawlSpider):
 8     name = 'super_spider'
 9     allowed_domains = ['www.xxx.com']
10     start_urls = ['http://www.xxx.com/']
11 
12     rules = (
13         Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
14     )
15 
16     def parse_item(self, response):
17         item = {}
18         #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
19         #item['name'] = response.xpath('//div[@id="name"]').get()
20         #item['description'] = response.xpath('//div[@id="description"]').get()
21         return item

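　　-- Lines 2-4: import the modules CrawlSpider needs

　　-- Line 7: the spider is based on the CrawlSpider class

　　-- Lines 12-14: the link-extraction rules (regular expressions)

　　-- Line 16: the parse method

The biggest difference between CrawlSpider and Spider is the extra rules attribute, which defines the "extraction actions". rules can hold one or more Rule objects, and each Rule object wraps a LinkExtractor object.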

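　　LinkExtractor: link extractor

LinkExtractor(
    allow=r'Items/',       # URLs matching this regular expression are extracted; if empty, every link matches.
    deny=xxx,              # URLs matching this regular expression are not extracted.
    restrict_xpaths=xxx,   # Links are extracted only from the regions selected by this XPath expression.
    restrict_css=xxx,      # Links are extracted only from the regions selected by this CSS expression.
    deny_domains=xxx,      # Domains whose links are not extracted.
)
# Purpose: extract the links in a response that satisfy the rules.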
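The allow/deny semantics can be tried out without Scrapy. The sketch below is a simplified stand-in, not Scrapy's implementation: it assumes each pattern is applied as a regex search against the full URL, and the helper name `matches` and the sample URLs are made up for illustration.

```python
import re


def matches(url, allow=(), deny=()):
    """Return True if url passes the allow/deny patterns (simplified model)."""
    # An empty allow list accepts every URL, as the comment on allow says.
    allowed = not allow or any(re.search(p, url) for p in allow)
    # Any deny match rejects the URL, even if it was allowed.
    denied = any(re.search(p, url) for p in deny)
    return allowed and not denied


# Hypothetical URLs, in the style of the www.xxx.com placeholder used above.
urls = [
    "https://www.xxx.com/Items/1",
    "https://www.xxx.com/Items/private/2",
    "https://www.xxx.com/about",
]

kept = [u for u in urls if matches(u, allow=[r"Items/"], deny=[r"/private/"])]
print(kept)
```

Only the first URL survives: the second is allowed but also denied, and the third never matches the allow pattern.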

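　　Rule: rule parser. It takes the links pulled out by the link extractor and parses the content of the pages they point to according to the specified rule.

Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True)
-- Parameters:
    Parameter 1: the link extractor
    Parameter 2: the callback function the rule parser uses to parse the data
    Parameter 3: whether to keep applying the link extractor to the pages reached through the extracted links. When callback is None, parameter 3 defaults to True; otherwise it defaults to False.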
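The default for parameter 3 is easy to get wrong, so here it is written out as a tiny function. This is only a model of the documented behaviour; `effective_follow` is a hypothetical helper, not part of Scrapy.

```python
def effective_follow(callback=None, follow=None):
    """Resolve the follow flag: an explicit value wins, otherwise it depends on callback."""
    if follow is not None:
        return bool(follow)
    # No explicit value: follow links only when there is no callback.
    return callback is None


print(effective_follow(callback=None))                       # no callback, follow left unset
print(effective_follow(callback="parse_item"))               # callback given, follow left unset
print(effective_follow(callback="parse_item", follow=True))  # explicit follow=True
```

This is why the examples in this post pass follow=True explicitly: each Rule has a callback, so leaving follow unset would stop the crawl after the first level of links.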

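　　rules = ( ): holds the different rule parsers. Each Rule object represents one extraction rule.

　　Overall CrawlSpider crawl flow:

a) The spider first fetches the page content of each start URL

b) The link extractor extracts the links from the page content of step a according to its extraction rule

c) The rule parser parses the content of the pages behind those extracted links according to its parsing rule

d) The parsed data is packed into an item and submitted to the pipeline for persistent storage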
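Steps a) to d) can be walked through on a toy site with nothing but the standard library. Everything here is an assumption made for illustration: the `PAGES` dictionary plays the role of the network, `crawl` is a hypothetical function, and the "item" is just the URL plus the page size. As far as I understand CrawlSpider's default behaviour, the start page is only mined for links and is not passed to the callback, and the sketch mirrors that.

```python
import re
from collections import deque

# Hypothetical in-memory "site": URL -> HTML body.
PAGES = {
    "/pic/": '<a href="/pic/page/2">2</a> <a href="/about">about</a>',
    "/pic/page/2": '<a href="/pic/page/3">3</a> <a href="/pic/">1</a>',
    "/pic/page/3": '<a href="/pic/page/2">2</a>',
    "/about": "no links here",
}


def crawl(start_url, allow, follow=True):
    """Breadth-first crawl that mimics steps a) to d)."""
    seen = {start_url}
    queue = deque([start_url])
    items = []
    while queue:
        url = queue.popleft()
        body = PAGES[url]                                    # a) fetch the page
        is_start = url == start_url
        if not is_start:
            items.append({"url": url, "size": len(body)})    # c) parse, d) emit an item
        if is_start or follow:
            for link in re.findall(r'href="([^"]+)"', body): # b) extract links
                if re.search(allow, link) and link not in seen:
                    seen.add(link)
                    queue.append(link)
    return items


items = crawl("/pic/", allow=r"/pic/page/\d+")
print([item["url"] for item in items])
```

With follow=False the loop would stop after the links found on the start page, which is the difference parameter 3 of Rule makes.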

　　Without further ado, here is the code

# Crawl every page of the image section of Qiushibaike

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CrawldemoSpider(CrawlSpider):
    name = 'qiubai'
    #allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/pic/']

    # Link extractor: extracts the specified URLs from the page returned for the start URL
    link = LinkExtractor(allow=r'/pic/page/\d+\?')  # the s= query parameter is a random number
    link1 = LinkExtractor(allow=r'/pic/$')  # crawl the first page

    # The rules tuple holds the different rule parsers (each one wraps a parsing rule)
    rules = (
        # Rule parser: parses every page behind the extracted links with the given rule (callback)
        Rule(link, callback='parse_item', follow=True),
        Rule(link1, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response)

　　The above was just a warm-up; below is a complete workflow

　　Spider file

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from qiubaiBycrawl.items import QiubaibycrawlItem
import re


class QiubaitestSpider(CrawlSpider):
    name = 'qiubaiTest'
    # Start URL
    start_urls = ['http://www.qiushibaike.com/']

    # Define the link extractor and its extraction rule
    page_link = LinkExtractor(allow=r'/8hr/page/\d+/')

    rules = (
        # Define the rule parser; the callback specifies the parsing rule
        Rule(page_link, callback='parse_item', follow=True),
    )

    # Callback that implements the rule parser's parsing rule
    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')

        for div in div_list:
            # Create the item
            item = QiubaibycrawlItem()
            # Extract the joke's author with an XPath expression
            item['author'] = div.xpath('./div/a[2]/h2/text()').extract_first().strip('\n')
            # Extract the joke's content with an XPath expression
            item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first().strip('\n')

            yield item  # submit the item to the pipeline

　　items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QiubaibycrawlItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()  # author
    content = scrapy.Field()  # content

　　pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class QiubaibycrawlPipeline(object):

    def __init__(self):
        self.fp = None

    def open_spider(self, spider):
        print('Spider started')
        # Explicit encoding so the Chinese text is written the same way on every platform
        self.fp = open('./data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write the item submitted by the spider to the file for persistent storage
        self.fp.write(item['author'] + ':' + item['content'] + '\n')
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.fp.close()

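II. Distributed Crawlers

　　First, a question: can the Scrapy framework do distributed crawling on its own?

No, for two reasons.

　　First: each Scrapy instance deployed on a separate machine has its own scheduler, so the machines cannot divide up the URLs in the start_urls list. (The machines cannot share one scheduler.)

　　Second: the data crawled by the different machines cannot pass through one pipeline for unified persistent storage. (The machines cannot share one pipeline.)

　　Distributed crawling with the scrapy-redis component

scrapy-redis solves both problems. The component provides a scheduler and a pipeline that multiple machines can share, so we can use them directly to crawl data in a distributed way.
Implementation options:
1. The component's RedisSpider class
2. The component's RedisCrawlSpider class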
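To see why a shared scheduler matters, here is a minimal in-memory model of the idea. It is not scrapy-redis code: `SharedScheduler` is a hypothetical class standing in for the queue and dedup set that would live in Redis, and the two "workers" are just names in a dictionary taking turns.

```python
from collections import deque


class SharedScheduler:
    """In-memory stand-in for a request queue and dedup set shared through Redis."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def push(self, url):
        # Enqueue a URL only the first time any worker submits it.
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def pop(self):
        return self.queue.popleft() if self.queue else None


scheduler = SharedScheduler()
for url in ["/page/1", "/page/2", "/page/3", "/page/2", "/page/4"]:
    scheduler.push(url)

# Two hypothetical workers take turns pulling from the same scheduler.
handled = {"worker-a": [], "worker-b": []}
workers = list(handled)
turn = 0
url = scheduler.pop()
while url is not None:
    handled[workers[turn % 2]].append(url)
    turn += 1
    url = scheduler.pop()

print(handled)
```

Because both workers draw from one queue and one dedup set, every URL is handled exactly once and the duplicate "/page/2" is dropped. With a private scheduler per machine, each machine would crawl the full list on its own.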

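　　Distributed implementation steps

1. Install the scrapy-redis component: pip install scrapy-redis
2. Edit the Redis configuration file:
   - Comment out the line "bind 127.0.0.1" so that other IPs can access Redis
   - Change yes to no: "protected-mode no", so that other IPs can operate on Redis

　　Modify the relevant code in the spider file:

# Import the class first: from scrapy_redis.spiders import RedisCrawlSpider
- Change the spider's parent class to RedisSpider or RedisCrawlSpider.
  Note: if the original spider is based on Spider, change the parent class to RedisSpider; if it is based on CrawlSpider, change the parent class to RedisCrawlSpider.
- Comment out or delete the start_urls list and add a redis_key attribute whose value is the name of the scheduler queue in the scrapy-redis component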

　　In the settings file, enable the pipeline provided by the scrapy-redis component

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,  # can be copied and pasted as-is
}

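　　In the settings file, enable the scheduler provided by the scrapy-redis component

# Use the scrapy-redis dedup filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scrapy-redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Whether pausing is allowed (the request queue and dedup records are kept in Redis)
SCHEDULER_PERSIST = True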

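　　In the settings file, configure how the spider connects to Redis

REDIS_HOST = 'IP address of the Redis server'
REDIS_PORT = 6379

1. Start the Redis server: redis-server [config file]

2. Start the Redis client: redis-cli

3. Run the spider file: scrapy runspider SpiderFile

4. Push a start URL into the scheduler queue (from the Redis client): lpush [redis_key value] [start URL]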

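III. Incremental Crawlers

　　What is an incremental crawler?

Put simply: after you have crawled a site's data, the site publishes new data. You do not need to crawl everything again; you only need to fetch what is new. That is an incremental crawler.

　　How is incremental crawling done?

Method 1: before sending a request, check whether the URL has been crawled before
Method 2: after parsing the content, check whether that content has been crawled before
Method 3: when writing to the database, check whether the content already exists in the database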

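　　At the core of all three methods is deduplication

Method 1 suits sites where new pages keep appearing, such as new chapters of a novel or each day's latest news
Method 2 suits sites whose existing pages change their content
Method 3 deduplicates to the greatest extent

　　Deduplication methods

1. Store the URLs produced during a crawl in a Redis set. On the next crawl, check each URL about to be requested against the stored set: if it is already there, skip the request; otherwise send it.
2. Give the crawled page content a unique identifier and store that identifier in a Redis set. On the next crawl, before persisting the data, check whether its identifier already exists in the Redis set, then decide whether to persist it.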
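Both methods boil down to the same gate: try to add a value to a set and act only if it was not there yet. The sketch below models that gate in memory for the URL case. `SeenSet` and `new_urls` are hypothetical names; `SeenSet.add` only imitates the 1/0 result that Redis returns from SADD for a single member, and no Redis server is involved.

```python
class SeenSet:
    """In-memory stand-in for a Redis set; add() mirrors SADD's 1/0 result."""

    def __init__(self):
        self._members = set()

    def add(self, value):
        if value in self._members:
            return 0      # already recorded: skip
        self._members.add(value)
        return 1          # newly recorded: go ahead


def new_urls(candidates, seen):
    """Return only the URLs that have not been recorded before."""
    return [url for url in candidates if seen.add(url) == 1]


seen = SeenSet()
first_run = new_urls(["/movie/1", "/movie/2"], seen)
second_run = new_urls(["/movie/1", "/movie/2", "/movie/3"], seen)
print(first_run, second_run)
```

The second run yields only "/movie/3". In a real crawl the set has to outlive the process, which is why the cases below keep it in Redis rather than in memory.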

　　Case 1: crawl the detail data of every movie on the 4567tv site (deduplicating by URL)

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from incrementPro.items import IncrementproItem


class MovieSpider(CrawlSpider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.4567tv.tv/frim/index7-11.html']

    rules = (
        Rule(LinkExtractor(allow=r'/frim/index7-\d+\.html'), callback='parse_item', follow=True),
    )
    # Create the Redis connection object
    conn = Redis(host='127.0.0.1', port=6379)

    def parse_item(self, response):
        li_list = response.xpath('//li[@class="p1 m1"]')
        for li in li_list:
            # Build the URL of the detail page
            detail_url = 'http://www.4567tv.tv' + li.xpath('./a/@href').extract_first()
            # Add the detail-page URL to a Redis set; sadd returns 1 only if it was not there yet
            ex = self.conn.sadd('urls', detail_url)
            if ex == 1:
                print('This URL has not been crawled yet; fetching its data')
                yield scrapy.Request(url=detail_url, callback=self.parse_detail)
            else:
                print('No update yet; nothing new to crawl!')

    # Parse the movie name and genre from the detail page for persistent storage
    def parse_detail(self, response):
        item = IncrementproItem()
        item['name'] = response.xpath('//dt[@class="name"]/text()').extract_first()
        item['kind'] = response.xpath('//div[@class="ct-c"]/dl/dt[4]//text()').extract()
        item['kind'] = ''.join(item['kind'])
        yield item

　　pipelines.py

# -*- coding: utf-8 -*-

import json

from redis import Redis


class IncrementproPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'name': item['name'],
            'kind': item['kind']
        }
        print(dic)
        # Recent redis-py versions reject dict values, so serialize the dict to JSON first
        self.conn.lpush('movieData', json.dumps(dic, ensure_ascii=False))
        return item

　　Case 2: crawl the jokes and authors on Qiushibaike (deduplicating by a unique identifier of the content)

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from incrementByDataPro.items import IncrementbydataproItem
from redis import Redis
import hashlib


class QiubaiSpider(CrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    rules = (
        Rule(LinkExtractor(allow=r'/text/page/\d+/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'/text/$'), callback='parse_item', follow=True),
    )
    # Create the Redis connection object
    conn = Redis(host='127.0.0.1', port=6379)

    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')

        for div in div_list:
            item = IncrementbydataproItem()
            item['author'] = div.xpath('./div[1]/a[2]/h2/text() | ./div[1]/span[2]/h2/text()').extract_first()
            item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first()

            # Build a unique identifier from the parsed values for storage in Redis
            source = item['author'] + item['content']
            source_id = hashlib.sha256(source.encode()).hexdigest()
            # Store the identifier in the Redis set data_id; sadd returns 1 only if it is new
            ex = self.conn.sadd('data_id', source_id)

            if ex == 1:
                print('This record has not been crawled yet; crawling it...')
                yield item
            else:
                print('This record has already been crawled; skipping it!')
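One thing to watch in the spider above: extract_first() returns None when the XPath matches nothing, and concatenating None with a string raises a TypeError. Below is one possible way to build the identifier more defensively. It is a sketch, not the original author's method: the `fingerprint` helper, the whitespace stripping, and the separator character are all my own assumptions.

```python
import hashlib


def fingerprint(author, content):
    """Build a stable SHA-256 identifier from two fields that may be missing."""
    # Treat a missing field as empty and ignore surrounding whitespace.
    parts = [(value or "").strip() for value in (author, content)]
    # A separator keeps ("ab", "c") and ("a", "bc") from producing the same input.
    source = "\x1f".join(parts)
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


a = fingerprint("  Alice\n", "first joke")
b = fingerprint("Alice", "first joke")
c = fingerprint(None, "first joke")
print(a == b, a == c, len(a))
```

Stripping whitespace means the same joke is recognised even if the page's markup shifts slightly between crawls; whether that is desirable depends on how strict the deduplication should be.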

　　pipelines.py

# -*- coding: utf-8 -*-

import json

from redis import Redis


class IncrementbydataproPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content']
        }
        # print(dic)
        # Recent redis-py versions reject dict values, so serialize the dict to JSON first
        self.conn.lpush('qiubaiData', json.dumps(dic, ensure_ascii=False))
        return item