Understanding the Scrapy Framework

  Scrapy framework foundations: Twisted

Internally, Scrapy achieves crawler concurrency through an event-loop mechanism.
Before:

import requests

url_list = ['http://www.baidu.com', 'http://www.baidu.com', 'http://www.baidu.com']

for item in url_list:
    response = requests.get(item)
    print(response.text)
Before: executing multiple request tasks one after another

Now:

from twisted.web.client import getPage, defer
from twisted.internet import reactor

# Part 1: the agent starts taking on tasks
def callback(contents):
  print(contents)

deferred_list = []  # collects one Deferred per pending request
url_list = ['http://www.bing.com', 'https://segmentfault.com/','https://stackoverflow.com/' ]
for url in url_list:
  deferred = getPage(bytes(url, encoding='utf8'))  # fire off the request and get back a Deferred
  deferred.addCallback(callback)
  deferred_list.append(deferred)


# Part 2: once the agent has finished all tasks, stop
dlist = defer.DeferredList(deferred_list)

def all_done(arg):
  reactor.stop()

dlist.addBoth(all_done)

# Part 3: start the event loop and let the agent process
reactor.run()
twisted
What is twisted?
  • Officially: an asynchronous, non-blocking module built on an event loop.
  • In plain terms: a single thread can issue HTTP requests to multiple targets at the same time.
Non-blocking: no waiting; all requests are sent out together. When connecting to requests A, B, and C, I do not wait for one connection to return before starting the next; I send one and immediately send the next.
import socket
sk = socket.socket()
sk.setblocking(False)
sk.connect_ex(('1.1.1.1', 80))  # non-blocking connect returns immediately

import socket
sk = socket.socket()
sk.setblocking(False)
sk.connect_ex(('1.1.1.2', 80))

import socket
sk = socket.socket()
sk.setblocking(False)
sk.connect_ex(('1.1.1.3', 80))
Non-blocking sockets
Asynchronous: callbacks. As soon as I have fetched the A, B, and C that callback_A, callback_B, and callback_C are waiting for, I notify them proactively.
def callback(contents):
    print(contents)
callback
Event loop: I keep cycling over the three socket tasks (requests A, B, and C), checking each one's state: has the connection succeeded, and has the result come back.
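As a rough sketch of such a loop using only the standard library (this is not how Twisted implements it; the hosts are the same ones used in the twisted example above), the selectors module can watch several non-blocking sockets and report which ones are ready:

import selectors
import socket

sel = selectors.DefaultSelector()

# Fire off three non-blocking connects and register them with the selector.
for host in ('www.bing.com', 'segmentfault.com', 'stackoverflow.com'):
    sk = socket.socket()
    sk.setblocking(False)
    sk.connect_ex((host, 80))                 # returns immediately instead of blocking
    sel.register(sk, selectors.EVENT_WRITE, data=host)

pending = 3
while pending:                                # the "event loop": keep checking states
    for key, _ in sel.select(timeout=1):
        print(key.data, 'connected')          # notify, callback-style
        sel.unregister(key.fileobj)
        key.fileobj.close()
        pending -= 1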

How does it differ from requests?
requests is a Python module that can impersonate a browser to send HTTP requests:
    - it wraps sockets to send requests

twisted is an asynchronous, non-blocking network framework built on an event loop:
    - it wraps sockets to send requests
    - it completes concurrent requests in a single thread
    PS: three keywords
        - non-blocking: no waiting
        - asynchronous: callbacks
        - event loop: keep looping to check states
Differences between twisted and requests

  Scrapy

  Scrapy is an application framework written for crawling web sites and extracting structured data. It can be used in a whole range of programs, from data mining and information processing to storing historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy has many uses: data mining, monitoring, and automated testing among them.

Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows:

Scrapy mainly consists of the following components:

  • Engine (Scrapy)

    Handles the data flow of the whole system and triggers events (the core of the framework).

  • Scheduler

  Accepts requests sent over by the engine, pushes them onto a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs (the addresses, or links, of the pages to crawl): it decides which URL is crawled next and removes duplicate URLs.

  • Downloader

  Downloads page content and returns it to the spiders (the Scrapy downloader is built on top of twisted, an efficient asynchronous model).

  • Spiders

  Spiders do the main work: they extract the information you need, the so-called items, from specific pages. You can also extract links from them to let Scrapy continue crawling the next page.

  • Item Pipeline

  Responsible for processing the items the spiders extract from pages. Its main jobs are persisting items, validating them, and cleaning out unwanted data. Once a page has been parsed by a spider, the items are sent to the pipeline and processed through several stages in order.

  • Downloader Middlewares

  A framework hooked in between the Scrapy engine and the downloader; it mainly processes the requests and responses passing between them.

  • Spider Middlewares

  A framework hooked in between the Scrapy engine and the spiders; its main job is processing the spiders' response input and request output.

  • Scheduler Middlewares

  Middleware between the Scrapy engine and the scheduler, handling the requests and responses sent from the engine to the scheduler.

The Scrapy run flow is roughly as follows (a minimal sketch follows the list):

  1. The engine finds the spider to run and calls its start_requests method, obtaining an iterator.
  2. Iterating over it yields Request objects; each Request wraps the URL to visit and a callback. All the Request objects (tasks) are put into the scheduler's request queue, with duplicates removed.
  3. The downloader asks the engine for a task to download (a Request object); the engine asks the scheduler, which pops a Request off the queue and returns it to the engine, and the engine hands it to the downloader.
  4. When the download finishes, the downloader returns a Response object to the engine, which runs the callback.
  5. Back in the spider's callback, the spider parses the Response.
  6. yield Item(): if an Item is parsed out, it is handed to the item pipeline for further processing.
  7. yield Request(): if a link (URL) is parsed out, the URL is handed to the scheduler to await crawling.
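Putting those steps together, a minimal spider sketch (the CSS selectors are borrowed from the Chouti examples later in this article, so treat them as assumptions):

import scrapy
from scrapy import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    start_urls = ['https://dig.chouti.com/']   # turned into Requests by start_requests

    def parse(self, response):
        # Step 6: yield an item -> handed to the item pipelines
        for row in response.css('.content-list .item'):
            yield {'title': row.css('.show-content::text').extract_first()}
        # Step 7: yield a Request -> handed back to the scheduler
        for href in response.css('#dig_lcpage a::attr(href)').extract():
            yield Request(response.urljoin(href), callback=self.parse)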

   I. Basic commands and project structure 

 Basic commands

    Create a project:   scrapy startproject <project_name>
                        creates a project directory in the current directory (similar to Django)
    Create a spider:
        cd <project_name>
            scrapy genspider [-t template] <name> <domain>
            scrapy genspider -t basic oldboy oldboy.com
            scrapy genspider -t crawl weisuen sohu.com
        PS:
            list available templates:  scrapy genspider -l
            show a template:           scrapy genspider -d <template_name>
    scrapy list
            show the list of spiders in the project
    Run a spider:
          scrapy crawl <spider_name>
          scrapy crawl quotes
          scrapy runspider <spider_file>.py
          scrapy crawl lagou -s JOBDIR=job_info/001    # pause and resume
    Save output to a file:  scrapy crawl quotes -o quotes.json
    Test in the shell:      scrapy shell 'http://scrapy.org' --nolog

Project structure

 project_name/
   scrapy.cfg
   project_name/
       __init__.py
       items.py
       pipelines.py
       settings.py
       spiders/
           __init__.py
           spider1.py
           spider2.py
           spider3.py

File descriptions:

  • scrapy.cfg    the project's top-level configuration (the real crawler settings live in settings.py)
  • items.py      data models for the structured data, similar to Django's Model
  • pipelines.py  data-processing behavior, e.g. persisting the structured data
  • settings.py   configuration such as recursion depth, concurrency, download delay, and so on
  • spiders/      the spiders directory: create spider files here and write the crawling rules

Note: spider files are usually named after the target site's domain.

   II. Writing spiders

1. start_urls
How it works internally
The Scrapy engine pulls the start URLs from the spider:
    1. it calls start_requests and takes the return value
    2. v = iter(return value)
    3. req1 = v.__next__(), req2 = v.__next__(), req3 = v.__next__(), ...
    4. all the requests are put into the scheduler
 
class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}
    
    def start_requests(self):
        # Option 1: yield Request objects one by one (a generator)
        for url in self.start_urls:
            yield Request(url=url)
        # Option 2: return a list of Request objects
        # req_list = []
        # for url in self.start_urls:
        #     req_list.append(Request(url=url))
        # return req_list


- Customization: the start URLs can also be pulled from redis, for example.
Two ways to implement start_requests
2. The response:
# A Response object wraps everything related to the response:
    - response.text
    - response.encoding
    - response.body
    - response.meta['depth']      # crawl depth
    - response.request            # the Request this response came from; it wraps the URL that was visited and the callback to run once the download finished
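A small callback that touches these attributes, as a sketch:

def parse(self, response):
    print(response.url, response.status, response.encoding)
    print(len(response.body), 'bytes')               # raw bytes; response.text is the decoded str
    print('depth:', response.meta.get('depth', 0))   # metadata carried along with the request
    print('from request:', response.request.url, response.request.callback)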

   3. Selectors

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

html = """<!DOCTYPE html>
<html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <ul>
            <li class="item-"><a id='i1' href="link.html">first item</a></li>
            <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
            <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
        </ul>
        <div><a href="llink2.html">second item</a></div>
    </body>
</html>
"""
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
# hxs = Selector(response=response).xpath('//a')
# print(hxs)
# hxs = Selector(text=html).xpath('//a')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[2]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
# print(hxs)

# ul_list = Selector(response=response).xpath('//body/ul/li')
# for item in ul_list:
#     v = item.xpath('./a/span')
#     # or
#     # v = item.xpath('a/span')
#     # or
#     # v = item.xpath('*/a/span')
#     print(v)

response.css('...')                  returns a SelectorList (selector objects)
response.css('....').extract()       returns a list of strings
response.css('....').extract_first() returns the first element of that list
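For example, run against the HtmlResponse built above (a short sketch):

print(response.css('a::attr(href)').extract())        # ['link.html', 'llink.html', 'llink2.html', 'llink2.html']
print(response.css('a#i1::text').extract_first())     # 'first item'
print(response.css('ul li a[href*="link"]::attr(href)').extract())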

def parse_detail(self, response):
        # items = JobboleArticleItem()
        # title = response.xpath('//div[@class="entry-header"]/h1/text()')[0].extract()
        # create_date = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace('·','').strip()
        # praise_nums = int(response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract_first())
        # fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract_first()
        # try:
        #     if re.match('.*?(\d+).*', fav_nums).group(1):
        #         fav_nums = int(re.match('.*?(\d+).*', fav_nums).group(1))
        #     else:
        #         fav_nums = 0
        # except:
        #     fav_nums = 0
        # comment_nums = response.xpath('//a[contains(@href,"#article-comment")]/span/text()').extract()[0]
        # try:
        #     if re.match('.*?(\d+).*',comment_nums).group(1):
        #         comment_nums = int(re.match('.*?(\d+).*',comment_nums).group(1))
        #     else:
        #         comment_nums = 0
        # except:
        #     comment_nums = 0
        # contente = response.xpath('//div[@class="entry"]').extract()[0]
        # tag_list = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/a/text()').extract()
        # tag_list = [tag for tag in tag_list if not tag.strip().endswith('評論')]
        # tags = ",".join(tag_list)
        # items['title'] = title
        # try:
        #     create_date = datetime.datetime.strptime(create_date,'%Y/%m/%d').date()
        # except:
        #     create_date = datetime.datetime.now()
        # items['date'] = create_date
        # items['url'] = response.url
        # items['url_object_id'] = get_md5(response.url)
        # items['img_url'] = [img_url]
        # items['praise_nums'] = praise_nums
        # items['fav_nums'] = fav_nums
        # items['comment_nums'] = comment_nums
        # items['content'] = contente
        # items['tags'] = tags
XPath parsing of the jobbole article page
# title = response.css('.entry-header h1::text')[0].extract()
        # create_date = response.css('p.entry-meta-hide-on-mobile::text').extract()[0].strip().replace('·','').strip()
        # praise_nums = int(response.css(".vote-post-up h10::text").extract_first())
        # fav_nums = response.css(".bookmark-btn::text").extract_first()
        # if re.match('.*?(\d+).*', fav_nums).group(1):
        #     fav_nums = int(re.match('.*?(\d+).*', fav_nums).group(1))
        # else:
        #     fav_nums = 0
        # comment_nums = response.css('a[href="#article-comment"] span::text').extract()[0]
        # if re.match('.*?(\d+).*', comment_nums).group(1):
        #     comment_nums = int(re.match('.*?(\d+).*', comment_nums).group(1))
        # else:
        #     comment_nums = 0
        # content = response.css('.entry').extract()[0]
        # tag_list = response.css('p.entry-meta-hide-on-mobile a::text')
        # tag_list = [tag for tag in tag_list if not tag.strip().endswith('評論')]
        # tags = ",".join(tag_list)
        # XPath selector equivalents: /@href    /text()
CSS parsing of the jobbole article page
    def parse_detail(self, response):
        img_url = response.meta.get('img_url','')
        item_loader = ArticleItemLoader(item=JobboleArticleItem(), response=response)
        item_loader.add_css("title", ".entry-header h1::text")
        item_loader.add_value('url',response.url)
        item_loader.add_value('url_object_id', get_md5(response.url))
        item_loader.add_css('date', 'p.entry-meta-hide-on-mobile::text')
        item_loader.add_value("img_url", [img_url])
        item_loader.add_css("praise_nums", ".vote-post-up h10::text")
        item_loader.add_css("fav_nums", ".bookmark-btn::text")
        item_loader.add_css("comment_nums", "a[href='#article-comment'] span::text")
        item_loader.add_css("tags", "p.entry-meta-hide-on-mobile a::text")
        item_loader.add_css("content", "div.entry")
        items = item_loader.load_item()
        yield items
ItemLoader version

  4. Issuing further requests

  yield Request(url='xxxx',callback=self.parse)
  yield Request(url=parse.urljoin(response.url,post_url), meta={'img_url':img_url}, callback=self.parse_detail)

  5. Carrying cookies

Option 1: pass the cookies explicitly

cookie_dict = {}
cookie_jar = CookieJar()
cookie_jar.extract_cookies(response, response.request)

# Extract the cookies from the CookieJar object into a plain dict
for k, v in cookie_jar._cookies.items():
    for i, j in v.items():
        for m, n in j.items():
            cookie_dict[m] = n.value
Parsing cookies out of a response
        yield Request(
            url='https://dig.chouti.com/login',
            method='POST',
            body='phone=8615735177116&password=zyf123&oneMonth=1',
            headers={'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            # cookies=cookie_obj._cookies,
            cookies=self.cookies_dict,
            callback=self.check_login,
        )
Passing the cookies along with a request

Option 2: meta

yield Request(url=url, callback=self.login, meta={'cookiejar': True})

  6. Passing values to a callback: meta

def parse(self, response):
    yield scrapy.Request(url=parse.urljoin(response.url, post_url), meta={'img_url': img_url}, callback=self.parse_detail)

def parse_detail(self, response):
    img_url = response.meta.get('img_url', '')
from urllib.parse import urljoin

import scrapy
from scrapy import Request
from scrapy.http.cookies import CookieJar


class SpiderchoutiSpider(scrapy.Spider):
    name = 'choutilike'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    cookies_dict = {}

    def parse(self, response):
        # Pull the cookies out of the response headers; they are stored on the CookieJar object
        cookie_obj = CookieJar()
        cookie_obj.extract_cookies(response, response.request)

        # Extract the cookies from the CookieJar object into a plain dict
        for k, v in cookie_obj._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookies_dict[m] = n.value

        # self.cookies_dict = cookie_obj._cookies

        yield Request(
            url='https://dig.chouti.com/login',
            method='POST',
            body='phone=8615735177116&password=zyf123&oneMonth=1',
            headers={'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            # cookies=cookie_obj._cookies,
            cookies=self.cookies_dict,
            callback=self.check_login,
        )

    def check_login(self,response):
        # print(response.text)
        yield Request(url='https://dig.chouti.com/all/hot/recent/1',
                      cookies=self.cookies_dict,
                      callback=self.good,
                      )

    def good(self,response):
        id_list = response.css('div.part2::attr(share-linkid)').extract()
        for id in id_list:
            url = 'https://dig.chouti.com/link/vote?linksId={}'.format(id)
            yield Request(
                url=url,
                method='POST',
                cookies=self.cookies_dict,
                callback=self.show,
            )
        pages = response.css('#dig_lcpage a::attr(href)').extract()
        for page in pages:
            url = urljoin('https://dig.chouti.com/',page)
            yield Request(url=url,callback=self.good)

    def show(self,response):
        print(response.text)
chouti.py: log in to Chouti automatically and upvote posts

   III. Persistence 

   1. Order of work

  • a. Write the pipeline class first
  • b. Then write the Item class
import scrapy


class ChoutiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    href = scrapy.Field()
items.py
  • c. Configure settings
ITEM_PIPELINES = {
    # 'chouti.pipelines.XiaohuaImagesPipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'chouti.pipelines.ChoutiPipeline': 300,
    # 'chouti.pipelines.Chouti2Pipeline': 301,
}
ITEM_PIPELINES
  • d. In the spider: every time an Item is yielded, process_item is called once (see the sketch below).
yield the Item object
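A minimal spider-side sketch using the ChoutiItem defined above (the CSS selectors are the ones from the Chouti spider shown later, so treat them as assumptions):

import scrapy
from ..items import ChoutiItem


class SpiderchoutiSpider(scrapy.Spider):
    name = 'spiderchouti'
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        for new in response.css('.content-list .item'):
            title = new.css('.show-content::text').extract_first(default='').strip()
            href = new.css('.show-content::attr(href)').extract_first()
            # Each yield here results in exactly one process_item() call per enabled pipeline
            yield ChoutiItem(title=title, href=href)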

   2. Writing pipelines                     

How the framework executes a pipeline:
    1. Check whether the pipeline class (e.g. XdbPipeline) defines from_crawler:
        if yes:
            obj = XdbPipeline.from_crawler(....)
        if no:
            obj = XdbPipeline()
    2. obj.open_spider()

    3. obj.process_item() is called once per item: obj.process_item()/obj.process_item()/obj.process_item()/...

    4. obj.close_spider()
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem


class ChoutiPipeline(object):
    def __init__(self,conn_str):
        self.conn_str = conn_str

    @classmethod
    def from_crawler(cls,crawler):
        """
        Called at initialization time to create the pipeline object
        :param crawler:
        :return:
        """
        conn_str = crawler.settings.get('DB')
        return cls(conn_str)

    def open_spider(self,spider):
        """
        Called when the spider starts
        :param spider:
        :return:
        """
        self.conn = open(self.conn_str,'a',encoding='utf-8')

    def process_item(self, item, spider):
        if spider.name == 'spiderchouti':
            self.conn.write('{}\n{}\n'.format(item['title'],item['href']))
            # Hand the item on to the next pipeline
            return item
            # To drop the item so it is NOT passed to the next pipeline:
            # raise DropItem()

    def close_spider(self,spider):
        """
        Called when the spider closes
        :param spider:
        :return:
        """
        self.conn.close()
File-storage pipeline

Note: pipelines are shared by all spiders. If you need behavior specific to one spider, use the spider argument and handle it yourself.

JSON files

import codecs
import json

from scrapy.exporters import JsonItemExporter


class JsonExporterPipeline(object):
    # Use the JsonItemExporter provided by scrapy to export a JSON file
    def __init__(self):
        self.file = open('articleexpoter.json', 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()   # start exporting

    def close_spider(self, spider):
        self.exporter.finish_exporting()  # stop exporting
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item


class JsonWithEncodingPipeline(object):
    # Hand-rolled JSON export
    def __init__(self):
        self.file = codecs.open('article.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(lines)
        return item

    def close_spider(self, spider):
        self.file.close()
pipeline

Storing images

# -*- coding: utf-8 -*-
from urllib.parse import urljoin

import scrapy
from ..items import XiaohuaItem


class XiaohuaSpider(scrapy.Spider):
    name = 'xiaohua'
    allowed_domains = ['www.xiaohuar.com']
    start_urls = ['http://www.xiaohuar.com/list-1-{}.html'.format(i) for i in range(11)]

    def parse(self, response):
        items = response.css('.item_list .item')
        for item in items:
            url = item.css('.img img::attr(src)').extract()[0]
            url = urljoin('http://www.xiaohuar.com',url)
            title = item.css('.title span a::text').extract()[0]
            obj = XiaohuaItem(img_url=[url],title=title)
            yield obj
spider
class XiaohuaItem(scrapy.Item):
    img_url = scrapy.Field()
    title = scrapy.Field()
    img_path = scrapy.Field()
item
class XiaohuaImagesPipeline(ImagesPipeline):
    # Use the ImagesPipeline provided by scrapy to download images
    def item_completed(self, results, item, info):
        if "img_url" in item:
            for ok, value in results:
                print(ok, value)
                img_path = value['path']
                item['img_path'] = img_path
        return item

    def get_media_requests(self, item, info):  # schedule the image downloads
        if "img_url" in item:
            for img_url in item['img_url']:
                # meta carries the item and the index so file_path below can rename the file
                yield scrapy.Request(img_url, meta={'item': item, 'index': item['img_url'].index(img_url)})

    def file_path(self, request, response=None, info=None):
        item = request.meta['item']           # the item passed through meta above
        if "img_url" in item:
            index = request.meta['index']     # index of the image currently being downloaded

            # File name: item['title'] plus the original extension (jpg, png, ...)
            image_guid = item['title'] + '.' + request.url.split('/')[-1].split('.')[-1]
            # Download path inside the images store
            filename = u'full/{0}'.format(image_guid)
            return filename
pipeline
ITEM_PIPELINES = {
    # 'chouti.pipelines.XiaohuaImagesPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
ITEM_PIPELINES

MySQL database

import pymysql
from twisted.enterprise import adbapi


class MysqlPipeline(object):
    def __init__(self):
        self.conn = pymysql.connect('localhost', 'root', '0000', 'crawed', charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        insert_sql = """insert into article(title,url,create_date,fav_nums) values (%s,%s,%s,%s)"""
        self.cursor.execute(insert_sql, (item['title'], item['url'], item['date'], item['fav_nums']))
        self.conn.commit()
        return item


class MysqlTwistePipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DB'],
            user=settings['MYSQL_USER'],
            password=settings['MYSQL_PASSWORD'],
            charset='utf8',
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool('pymysql', **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # Use twisted's adbapi to make the MySQL insert asynchronous
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error)  # handle exceptions
        return item

    def handle_error(self, failure):
        # Handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        insert_sql, params = item.get_insert_sql()
        try:
            cursor.execute(insert_sql, params)
            print('insert succeeded')
        except Exception:
            print('insert failed')
pipeline
MYSQL_HOST = 'localhost'
MYSQL_USER = 'root'
MYSQL_PASSWORD = '0000'
MYSQL_DB = 'crawed'

SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
SQL_DATE_FORMAT = "%Y-%m-%d"
RANDOM_UA_TYPE = "random"
ES_HOST = "127.0.0.1"
settings

   IV. Deduplication rules

Scrapy's default deduplication rule:
from scrapy.dupefilter import RFPDupeFilter
from __future__ import print_function
import os
import logging

from scrapy.utils.job import job_dir
from scrapy.utils.request import request_fingerprint


class BaseDupeFilter(object):

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        return False

    def open(self):  # can return deferred
        pass

    def close(self, reason):  # can return a deferred
        pass

    def log(self, request, spider):  # log that a request has been filtered
        pass


class RFPDupeFilter(BaseDupeFilter):
    """Request Fingerprint duplicates filter"""

    def __init__(self, path=None, debug=False):
        self.file = None
        self.fingerprints = set()
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        if path:
            self.file = open(os.path.join(path, 'requests.seen'), 'a+')
            self.file.seek(0)
            self.fingerprints.update(x.rstrip() for x in self.file)

    @classmethod
    def from_settings(cls, settings):
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(job_dir(settings), debug)

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

    def request_fingerprint(self, request):
        return request_fingerprint(request)

    def close(self, reason):
        if self.file:
            self.file.close()

    def log(self, request, spider):
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request: %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False

        spider.crawler.stats.inc_value('dupefilter/filtered', spider=spider)
dupefilters
Custom deduplication rule
1. Write the filter class
# -*- coding: utf-8 -*-

"""
@Datetime: 2018/8/31
@Author: Zhang Yafei
"""
from scrapy.dupefilter import BaseDupeFilter
from scrapy.utils.request import request_fingerprint


class RepeatFilter(BaseDupeFilter):

    def __init__(self):
        self.visited_fd = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        fd = request_fingerprint(request=request)
        if fd in self.visited_fd:
            return True
        self.visited_fd.add(fd)

    def open(self):  # can return deferred
        print('open')
        pass

    def close(self, reason):  # can return a deferred
        print('close')
        pass

    def log(self, request, spider):  # log that a request has been filtered
        pass
dupeFilter.py

 2. Configuration

# Replace the default dedup filter with the custom one
# DUPEFILTER_CLASS = "chouti.duplication.RepeatFilter"
DUPEFILTER_CLASS = "chouti.dupeFilter.RepeatFilter"

   3. Using it from the spider

from urllib.parse import urljoin
from ..items import ChoutiItem
import scrapy
from scrapy.http import Request


class SpiderchoutiSpider(scrapy.Spider):
    name = 'spiderchouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        # Grab the headlines on the current page
        print(response.request.url)
        # news = response.css('.content-list .item')
        # for new in news:
        #     title = new.css('.show-content::text').extract()[0].strip()
        #     href = new.css('.show-content::attr(href)').extract()[0]
        #     item = ChoutiItem(title=title,href=href)
        #     yield item

        # Collect all the pagination links
        pages = response.css('#dig_lcpage a::attr(href)').extract()
        for page in pages:
            url = urljoin(self.start_urls[0],page)
            # Hand the new URL to the scheduler
            yield Request(url=url,callback=self.parse)
chouti.py

Note:

  • Implement the dedup logic correctly in request_seen
  • Deduplication only applies when dont_filter=False, the default (see the sketch below)
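A short sketch of the difference:

# Deduplicated (the default): the scheduler asks request_seen() and drops repeats
yield Request(url=url, callback=self.parse)

# Not deduplicated: dont_filter=True bypasses the dupefilter entirely
yield Request(url='https://dig.chouti.com/', callback=self.parse, dont_filter=True)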

   V. Middleware

  Downloader middleware

from scrapy.http import HtmlResponse
from scrapy.http import Request

class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        print('md1.process_request',request)
        # 1. Return a Response directly (short-circuits the downloader)
        # import requests
        # result = requests.get(request.url)
        # return HtmlResponse(url=request.url, status=200, headers=None, body=result.content)
        # 2. Return a new Request (it goes back to the scheduler)
        # return Request('https://dig.chouti.com/r/tec/hot/1')

        # 3. Raise an exception
        # from scrapy.exceptions import IgnoreRequest
        # raise IgnoreRequest

        # 4. Modify the request in place (the common case)
        # request.headers['user-agent'] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"

        pass

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        print('m1.process_response',request,response)
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
downloadmiddleware.py
DOWNLOADER_MIDDLEWARES = {
   #'xdb.middlewares.XdbDownloaderMiddleware': 543,
    # 'xdb.proxy.XdbProxyMiddleware':751,
    'xdb.md.Md1':666,
    'xdb.md.Md2':667,
}
Configuration
 Applications: - setting the user-agent 
      - setting proxies (see the sketch below) 
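For the user-agent case, a minimal sketch (the USER_AGENTS list is made up; register the class in DOWNLOADER_MIDDLEWARES just like Md1 above):

import random


class RandomUserAgentMiddleware(object):
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6)',
    ]

    def process_request(self, request, spider):
        # Rewrite the header before the request reaches the downloader
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None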

Spider middleware

class Sd1(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    # Executed only once, when the spider starts.
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r
spidermiddleware.py
SPIDER_MIDDLEWARES = {
   # 'xdb.middlewares.XdbSpiderMiddleware': 543,
    'xdb.sd.Sd1': 666,
    'xdb.sd.Sd2': 667,
}
Configuration
Applications (see the sketch below):
    - depth
    - priority
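As an illustration of the depth use case, a sketch of a spider middleware that drops requests past a threshold (the built-in DepthMiddleware already handles depth properly; this is purely illustrative, and MAX_DEPTH is a made-up name):

from scrapy.http import Request


class DepthCapMiddleware(object):
    MAX_DEPTH = 3  # illustrative threshold

    def process_spider_output(self, response, result, spider):
        for obj in result:
            if isinstance(obj, Request) and response.meta.get('depth', 0) >= self.MAX_DEPTH:
                continue   # drop requests that would go deeper than the cap
            yield obj      # items and shallow requests pass through unchanged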
class SpiderMiddleware(object):

    def process_spider_input(self, response, spider):
        """
        Runs after the download finishes, before the response is handed to parse
        :param response: 
        :param spider: 
        :return: 
        """
        pass

    def process_spider_output(self, response, result, spider):
        """
        Called with the results the spider returns once it has processed the response
        :param response:
        :param result:
        :param spider:
        :return: must return an iterable containing Request or Item objects
        """
        return result

    def process_spider_exception(self, response, exception, spider):
        """
        Called on exceptions
        :param response:
        :param exception:
        :param spider:
        :return: None to let later middleware handle the exception; or an iterable containing Response or Item objects, handed to the scheduler or the pipelines
        """
        return None


    def process_start_requests(self, start_requests, spider):
        """
        Called when the spider starts
        :param start_requests:
        :param spider:
        :return: an iterable containing Request objects
        """
        return start_requests
Spider middleware (skeleton)
class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        Called, for every request that needs downloading, by each downloader middleware's process_request
        :param request: 
        :param spider: 
        :return:  
            None: continue to later middleware and the downloader
            Response object: stop running process_request and start running process_response
            Request object: stop the middleware chain and put the Request back into the scheduler
            raise IgnoreRequest: stop running process_request and start running process_exception
        """
        pass


    def process_response(self, request, response, spider):
        """
        Called on the way back, once the response has been produced
        :param response:
        :param result:
        :param spider:
        :return: 
            Response object: passed on to the other middlewares' process_response
            Request object: stop the middleware chain; the request is rescheduled for download
            raise IgnoreRequest: Request.errback is called
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when the download handler or a process_request() (downloader middleware) raises an exception
        :param response:
        :param exception:
        :param spider:
        :return: 
            None: let later middleware handle the exception
            Response object: stop running further process_exception methods
            Request object: stop the middleware chain; the request is rescheduled for download
        """
        return None
Downloader middleware (skeleton)
Setting a proxy
When the spider starts, simply set the proxy in os.environ beforehand.
    class ChoutiSpider(scrapy.Spider):
        name = 'chouti'
        allowed_domains = ['chouti.com']
        start_urls = ['https://dig.chouti.com/']
        cookie_dict = {}

        def start_requests(self):
            import os
            os.environ['HTTPS_PROXY'] = "http://root:woshiniba@192.168.11.11:9999/"
            os.environ['HTTP_PROXY'] = 'http://19.11.2.32'
            for url in self.start_urls:
                yield Request(url=url,callback=self.parse)
meta:
    class ChoutiSpider(scrapy.Spider):
        name = 'chouti'
        allowed_domains = ['chouti.com']
        start_urls = ['https://dig.chouti.com/']
        cookie_dict = {}

        def start_requests(self):
            for url in self.start_urls:
                yield Request(url=url, callback=self.parse, meta={'proxy': 'http://root:woshiniba@192.168.11.11:9999/'})
Built-in approaches (two of them)
import base64
import random
from six.moves.urllib.parse import unquote
try:
    from urllib2 import _parse_proxy
except ImportError:
    from urllib.request import _parse_proxy
from six.moves.urllib.parse import urlunparse
from scrapy.utils.python import to_bytes

class XdbProxyMiddleware(object):

    def _basic_auth_header(self, username, password):
        user_pass = to_bytes(
            '%s:%s' % (unquote(username), unquote(password)),
            encoding='latin-1')
        return base64.b64encode(user_pass).strip()

    def process_request(self, request, spider):
        PROXIES = [
            "http://root:woshiniba@192.168.11.11:9999/",
            "http://root:woshiniba@192.168.11.12:9999/",
            "http://root:woshiniba@192.168.11.13:9999/",
            "http://root:woshiniba@192.168.11.14:9999/",
            "http://root:woshiniba@192.168.11.15:9999/",
            "http://root:woshiniba@192.168.11.16:9999/",
        ]
        url = random.choice(PROXIES)

        orig_type = ""
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user:
            creds = self._basic_auth_header(user, password)
        else:
            creds = None
        request.meta['proxy'] = proxy_url
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds

class DdbProxyMiddleware(object):
    def process_request(self, request, spider):
        PROXIES = [
            {'ip_port': '111.11.228.75:80', 'user_pass': ''},
            {'ip_port': '120.198.243.22:80', 'user_pass': ''},
            {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
            {'ip_port': '101.71.27.120:80', 'user_pass': ''},
            {'ip_port': '122.96.59.104:80', 'user_pass': ''},
            {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
        ]
        proxy = random.choice(PROXIES)
        if proxy['user_pass'] is not None:
            request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
            encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass']))
            request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass
        else:
            request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
Custom proxy middleware

  VI. Custom commands

  Running a single spider:  main.py

from scrapy.cmdline import execute
import sys
import os

sys.path.append(os.path.dirname(__file__))

# execute(['scrapy','crawl','spiderchouti','--nolog'])
# os.system('scrapy crawl spiderchouti')
# os.system('scrapy crawl xiaohua')
os.system('scrapy crawl choutilike --nolog')

   Running all spiders:

  1.  Create a directory (any name, e.g. commands) at the same level as spiders.
  2.  Inside it, create crawlall.py (the file name becomes the command name).
  3.  Add COMMANDS_MODULE = '<project_name>.<directory_name>' to settings.py.
  4.  Run the command from the project directory: scrapy crawlall
# -*- coding: utf-8 -*-

"""
@Datetime: 2018/9/1
@Author: Zhang Yafei
"""
from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        print(type(self.crawler_process))
        from scrapy.crawler import CrawlerProcess
        # 1. The CrawlerProcess constructor runs
        # 2. The CrawlerProcess object (which holds the settings) has .spiders
        # 2.1  a Crawler is created for each spider
        # 2.2  d = Crawler.crawl(...) is executed   # ************************ #
        #           d.addBoth(_done)
        # 2.3  CrawlerProcess._active = {d,}

        # 3. dd = defer.DeferredList(self._active)
        #    dd.addBoth(self._stop_reactor)  # self._stop_reactor ==> reactor.stop()

        #    reactor.run

        # Find all spider names
        spider_list = self.crawler_process.spiders.list()
        # spider_list = ['choutilike', 'xiaohua']  # or crawl just a chosen subset
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
crawlall.py

   VII. Signals

Signals are hooks the framework leaves open for you, so you can attach custom behavior at predefined points.
Built-in signals

# Engine started / stopped
engine_started = object()
engine_stopped = object()

# Spider opened
spider_opened = object()
# Spider has gone idle
spider_idle = object()
# Spider closed
spider_closed = object()
# Spider raised an error
spider_error = object()

# Request pushed into the scheduler
request_scheduled = object()
# Request dropped
request_dropped = object()
# Response received
response_received = object()
# Response finished downloading
response_downloaded = object()
# Item scraped
item_scraped = object()
# Item dropped
item_dropped = object()
 

Custom extensions

from scrapy import signals

class MyExtend(object):
    def __init__(self, crawler):
        self.crawler = crawler
        # Register our handlers on the chosen signals
        crawler.signals.connect(self.start, signals.engine_started)
        crawler.signals.connect(self.close, signals.engine_stopped)

    @classmethod
    def from_crawler(cls,crawler):
        return cls(crawler)

    def start(self):
        print('signals.engine_started start')

    def close(self):
        print('signals.engine_stopped close')
Extension, version 1
from scrapy import signals

class MyExtend(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        self = cls()

        crawler.signals.connect(self.x1, signal=signals.spider_opened)
        crawler.signals.connect(self.x2, signal=signals.spider_closed)

        return self

    def x1(self, spider):
        print('open')

    def x2(self, spider):
        print('close')
Extension, version 2

Configuration

EXTENSIONS = {
   # 'scrapy.extensions.telnet.TelnetConsole': None,
    'chouti.extensions.MyExtend':200,
}

  VIII. Configuration file

Scrapy's default settings file

"""
This module contains the default values for all settings used by Scrapy.

For more information about these settings you can read the settings
documentation in docs/topics/settings.rst

Scrapy developers, if you add a setting here remember to:

* add it in alphabetical order
* group similar settings without leaving blank lines
* add its documentation to the available settings documentation
  (docs/topics/settings.rst)

"""

import sys
from importlib import import_module
from os.path import join, abspath, dirname

import six

AJAXCRAWL_ENABLED = False

AUTOTHROTTLE_ENABLED = False
AUTOTHROTTLE_DEBUG = False
AUTOTHROTTLE_MAX_DELAY = 60.0
AUTOTHROTTLE_START_DELAY = 5.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

BOT_NAME = 'scrapybot'

CLOSESPIDER_TIMEOUT = 0
CLOSESPIDER_PAGECOUNT = 0
CLOSESPIDER_ITEMCOUNT = 0
CLOSESPIDER_ERRORCOUNT = 0

COMMANDS_MODULE = ''

COMPRESSION_ENABLED = True

CONCURRENT_ITEMS = 100

CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
CONCURRENT_REQUESTS_PER_IP = 0

COOKIES_ENABLED = True
COOKIES_DEBUG = False

DEFAULT_ITEM_CLASS = 'scrapy.item.Item'

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

DEPTH_LIMIT = 0
DEPTH_STATS = True
DEPTH_PRIORITY = 0

DNSCACHE_ENABLED = True
DNSCACHE_SIZE = 10000
DNS_TIMEOUT = 60

DOWNLOAD_DELAY = 0

DOWNLOAD_HANDLERS = {}
DOWNLOAD_HANDLERS_BASE = {
    'data': 'scrapy.core.downloader.handlers.datauri.DataURIDownloadHandler',
    'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
    'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
    'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
}

DOWNLOAD_TIMEOUT = 180      # 3mins

DOWNLOAD_MAXSIZE = 1024*1024*1024   # 1024m
DOWNLOAD_WARNSIZE = 32*1024*1024    # 32m

DOWNLOAD_FAIL_ON_DATALOSS = True

DOWNLOADER = 'scrapy.core.downloader.Downloader'

DOWNLOADER_HTTPCLIENTFACTORY = 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory'
DOWNLOADER_CLIENTCONTEXTFACTORY = 'scrapy.core.downloader.contextfactory.ScrapyClientContextFactory'
DOWNLOADER_CLIENT_TLS_METHOD = 'TLS' # Use highest TLS/SSL protocol version supported by the platform,
                                     # also allowing negotiation

DOWNLOADER_MIDDLEWARES = {}

DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    # Downloader side
}

DOWNLOADER_STATS = True

DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

EDITOR = 'vi'
if sys.platform == 'win32':
    EDITOR = '%s -m idlelib.idle'

EXTENSIONS = {}

EXTENSIONS_BASE = {
    'scrapy.extensions.corestats.CoreStats': 0,
    'scrapy.extensions.telnet.TelnetConsole': 0,
    'scrapy.extensions.memusage.MemoryUsage': 0,
    'scrapy.extensions.memdebug.MemoryDebugger': 0,
    'scrapy.extensions.closespider.CloseSpider': 0,
    'scrapy.extensions.feedexport.FeedExporter': 0,
    'scrapy.extensions.logstats.LogStats': 0,
    'scrapy.extensions.spiderstate.SpiderState': 0,
    'scrapy.extensions.throttle.AutoThrottle': 0,
}

FEED_TEMPDIR = None
FEED_URI = None
FEED_URI_PARAMS = None  # a function to extend uri arguments
FEED_FORMAT = 'jsonlines'
FEED_STORE_EMPTY = False
FEED_EXPORT_ENCODING = None
FEED_EXPORT_FIELDS = None
FEED_STORAGES = {}
FEED_STORAGES_BASE = {
    '': 'scrapy.extensions.feedexport.FileFeedStorage',
    'file': 'scrapy.extensions.feedexport.FileFeedStorage',
    'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}
FEED_EXPORTERS = {}
FEED_EXPORTERS_BASE = {
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}
FEED_EXPORT_INDENT = 0

FILES_STORE_S3_ACL = 'private'

FTP_USER = 'anonymous'
FTP_PASSWORD = 'guest'
FTP_PASSIVE_MODE = True

HTTPCACHE_ENABLED = False
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_MISSING = False
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_ALWAYS_STORE = False
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_IGNORE_SCHEMES = ['file']
HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS = []
HTTPCACHE_DBM_MODULE = 'anydbm' if six.PY2 else 'dbm'
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_GZIP = False

HTTPPROXY_ENABLED = True
HTTPPROXY_AUTH_ENCODING = 'latin-1'

IMAGES_STORE_S3_ACL = 'private'

ITEM_PROCESSOR = 'scrapy.pipelines.ItemPipelineManager'

ITEM_PIPELINES = {}
ITEM_PIPELINES_BASE = {}

LOG_ENABLED = True
LOG_ENCODING = 'utf-8'
LOG_FORMATTER = 'scrapy.logformatter.LogFormatter'
LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'
LOG_STDOUT = False
LOG_LEVEL = 'DEBUG'
LOG_FILE = None
LOG_SHORT_NAMES = False

SCHEDULER_DEBUG = False

LOGSTATS_INTERVAL = 60.0

MAIL_HOST = 'localhost'
MAIL_PORT = 25
MAIL_FROM = 'scrapy@localhost'
MAIL_PASS = None
MAIL_USER = None

MEMDEBUG_ENABLED = False        # enable memory debugging
MEMDEBUG_NOTIFY = []            # send memory debugging report by mail at engine shutdown

MEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 0
MEMUSAGE_NOTIFY_MAIL = []
MEMUSAGE_WARNING_MB = 0

METAREFRESH_ENABLED = True
METAREFRESH_MAXDELAY = 100

NEWSPIDER_MODULE = ''

RANDOMIZE_DOWNLOAD_DELAY = True

REACTOR_THREADPOOL_MAXSIZE = 10

REDIRECT_ENABLED = True
REDIRECT_MAX_TIMES = 20  # uses Firefox default setting
REDIRECT_PRIORITY_ADJUST = +2

REFERER_ENABLED = True
REFERRER_POLICY = 'scrapy.spidermiddlewares.referer.DefaultReferrerPolicy'

RETRY_ENABLED = True
RETRY_TIMES = 2  # initial response + 2 retries = 3 requests
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]
RETRY_PRIORITY_ADJUST = -1

ROBOTSTXT_OBEY = False

SCHEDULER = 'scrapy.core.scheduler.Scheduler'
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'
SCHEDULER_PRIORITY_QUEUE = 'queuelib.PriorityQueue'

SPIDER_LOADER_CLASS = 'scrapy.spiderloader.SpiderLoader'
SPIDER_LOADER_WARN_ONLY = False

SPIDER_MIDDLEWARES = {}

SPIDER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
    # Spider side
}

SPIDER_MODULES = []

STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'
STATS_DUMP = True

STATSMAILER_RCPTS = []

TEMPLATES_DIR = abspath(join(dirname(__file__), '..', 'templates'))

URLLENGTH_LIMIT = 2083

USER_AGENT = 'Scrapy/%s (+https://scrapy.org)' % import_module('scrapy').__version__

TELNETCONSOLE_ENABLED = 1
TELNETCONSOLE_PORT = [6023, 6073]
TELNETCONSOLE_HOST = '127.0.0.1'

SPIDER_CONTRACTS = {}
SPIDER_CONTRACTS_BASE = {
    'scrapy.contracts.default.UrlContract': 1,
    'scrapy.contracts.default.ReturnsContract': 2,
    'scrapy.contracts.default.ScrapesContract': 3,
}
Default settings file (scrapy source)

1. Depth and priority

- Depth 
    - starts at 0
    - on every yield, depth = the parent request's depth + 1
    setting: DEPTH_LIMIT caps the depth
- Priority 
    - a request's download priority -= depth * DEPTH_PRIORITY 
    setting: DEPTH_PRIORITY 
    def parse(self, response):
        # Print the current URL and its depth
        print(response.request.url, response.meta.get('depth', 0))
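The corresponding settings, as a sketch:

# settings.py
DEPTH_LIMIT = 3      # stop following links more than 3 levels deep (0 = unlimited)
DEPTH_PRIORITY = 1   # positive values lower the priority of deeper requests (breadth-first tendency)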

Walking through the settings file

# -*- coding: utf-8 -*-

# Scrapy settings for step8_king project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

# 1. Bot name
BOT_NAME = 'step8_king'

# 2. Paths to the spider modules
SPIDER_MODULES = ['step8_king.spiders']
NEWSPIDER_MODULE = 'step8_king.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# 3. Client User-Agent request header
# USER_AGENT = 'step8_king (+http://www.yourdomain.com)'

# Obey robots.txt rules
# 4. Whether to obey robots.txt
# ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# 5. Number of concurrent requests
# CONCURRENT_REQUESTS = 4

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# 6. Download delay, in seconds
# DOWNLOAD_DELAY = 2


# The download delay setting will honor only one of:
# 7. Concurrency per domain; the download delay is also applied per domain
# CONCURRENT_REQUESTS_PER_DOMAIN = 2
# Concurrency per IP; if set, CONCURRENT_REQUESTS_PER_DOMAIN is ignored and the delay is applied per IP
# CONCURRENT_REQUESTS_PER_IP = 3

# Disable cookies (enabled by default)
# 8. Whether cookies are enabled; cookies are handled through a cookiejar
# COOKIES_ENABLED = True
# COOKIES_DEBUG = True

# Disable Telnet Console (enabled by default)
# 9. Telnet lets you inspect the running crawler, operate it, etc...
#    connect with `telnet ip port` and issue commands
# engine.pause()    pause
# engine.unpause()  resume
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]


# 10. Default request headers
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }


# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# 11. Item pipelines that process the scraped items
# ITEM_PIPELINES = {
#    'step8_king.pipelines.JsonPipeline': 700,
#    'step8_king.pipelines.FilePipeline': 500,
# }



# 12. Custom extensions, invoked via signals
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     # 'step8_king.extensions.MyExtension': 500,
# }


# 13. Maximum crawl depth allowed; the current depth can be checked via meta; 0 means unlimited
# DEPTH_LIMIT = 3

# 14. Crawl order: 0 means depth-first (LIFO, the default); a positive value means breadth-first (FIFO)

# Last in, first out: depth-first
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'
# First in, first out: breadth-first

# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

# 15. Scheduler and its queues
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler


# 16. URL deduplication filter
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'


# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html

"""
17. AutoThrottle algorithm
    from scrapy.contrib.throttle import AutoThrottle
    How the automatic throttling is computed:
    1. take the minimum delay, DOWNLOAD_DELAY
    2. take the maximum delay, AUTOTHROTTLE_MAX_DELAY
    3. set the initial download delay, AUTOTHROTTLE_START_DELAY
    4. when a request finishes downloading, take its "latency": the time from starting the connection to receiving the response headers
    5. AUTOTHROTTLE_TARGET_CONCURRENCY feeds into the calculation:
    target_delay = latency / self.target_concurrency
    new_delay = (slot.delay + target_delay) / 2.0  # slot.delay is the previous delay
    new_delay = max(target_delay, new_delay)
    new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
    slot.delay = new_delay
"""

# Enable automatic throttling
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 10
# The average number of requests Scrapy should be sending in parallel to each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = True

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings


"""
18. HTTP caching
    Cache requests/responses that have already been sent so they can be reused later

    from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.extensions.httpcache import FilesystemCacheStorage
"""
# Whether the cache is enabled
# HTTPCACHE_ENABLED = True

# Cache policy: cache every request; later requests are served straight from the cache
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# Cache policy: cache according to HTTP response headers such as Cache-Control and Last-Modified
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

# Cache expiration time (seconds)
# HTTPCACHE_EXPIRATION_SECS = 0

# Cache directory
# HTTPCACHE_DIR = 'httpcache'

# HTTP status codes excluded from caching
# HTTPCACHE_IGNORE_HTTP_CODES = []

# Cache storage backend
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


"""
19. Proxies: need to be set in environment variables
    from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware

    Option 1: use the defaults
        os.environ
        {
            http_proxy:http://root:woshiniba@192.168.11.11:9999/
            https_proxy:http://192.168.11.11:9999/
        }
    Option 2: use a custom downloader middleware

    def to_bytes(text, encoding=None, errors='strict'):
        if isinstance(text, bytes):
            return text
        if not isinstance(text, six.string_types):
            raise TypeError('to_bytes must receive a unicode, str or bytes '
                            'object, got %s' % type(text).__name__)
        if encoding is None:
            encoding = 'utf-8'
        return text.encode(encoding, errors)

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            PROXIES = [
                {'ip_port': '111.11.228.75:80', 'user_pass': ''},
                {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
            ]
            proxy = random.choice(PROXIES)
            if proxy['user_pass'] is not None:
                request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
                encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass']))
                request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass
                print("**************ProxyMiddleware have pass************" + proxy['ip_port'])
            else:
                print("**************ProxyMiddleware no pass************" + proxy['ip_port'])
                request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])

    DOWNLOADER_MIDDLEWARES = {
       'step8_king.middlewares.ProxyMiddleware': 500,
    }

"""

"""
20. HTTPS access
    There are two cases when accessing HTTPS sites:
    1. The target site uses a trusted certificate (supported by default)
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"

    2. The target site uses a custom (self-signed) certificate
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"

        # https.py
        from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
        from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)

        class MySSLFactory(ScrapyClientContextFactory):
            def getCertificateOptions(self):
                from OpenSSL import crypto
                v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
                v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
                return CertificateOptions(
                    privateKey=v1,  # a pKey object
                    certificate=v2,  # an X509 object
                    verify=False,
                    method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                )
    Other notes:
        Related classes
            scrapy.core.downloader.handlers.http.HttpDownloadHandler
            scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
            scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
        Related settings
            DOWNLOADER_HTTPCLIENTFACTORY
            DOWNLOADER_CLIENTCONTEXTFACTORY

"""



"""
21. Spider middleware
    class SpiderMiddleware(object):

        def process_spider_input(self,response, spider):
            '''
            Runs after the download finishes, before the response is handed to parse
            :param response:
            :param spider:
            :return:
            '''
            pass

        def process_spider_output(self,response, result, spider):
            '''
            Called with the results the spider returns after processing the response
            :param response:
            :param result:
            :param spider:
            :return: must return an iterable containing Request or Item objects
            '''
            return result

        def process_spider_exception(self,response, exception, spider):
            '''
            Called on exceptions
            :param response:
            :param exception:
            :param spider:
            :return: None to let later middleware handle the exception; or an iterable containing Response or Item objects, handed to the scheduler or the pipelines
            '''
            return None


        def process_start_requests(self,start_requests, spider):
            '''
            Called when the spider starts
            :param start_requests:
            :param spider:
            :return: an iterable containing Request objects
            '''
            return start_requests

    Built-in spider middlewares:
        'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
        'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': 500,
        'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
        'scrapy.contrib.spidermiddleware.urllength.UrlLengthMiddleware': 800,
        'scrapy.contrib.spidermiddleware.depth.DepthMiddleware': 900,

"""
# from scrapy.contrib.spidermiddleware.referer import RefererMiddleware
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
   # 'step8_king.middlewares.SpiderMiddleware': 543,
}
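
As a concrete illustration of the hooks described in section 21, here is a minimal sketch of a spider middleware that uses process_spider_output to drop Requests pointing outside the spider's allowed_domains. The class name is an assumption for this example, not part of the project above; it would be registered in the SPIDER_MIDDLEWARES dict, e.g. with priority 543.

# middlewares.py (hypothetical example)
from urllib.parse import urlparse

from scrapy import Request


class DomainFilterSpiderMiddleware(object):
    """Drop Requests yielded by the spider whose host is not under spider.allowed_domains."""

    def process_spider_output(self, response, result, spider):
        allowed = list(getattr(spider, 'allowed_domains', []) or [])
        for item_or_request in result:
            if isinstance(item_or_request, Request) and allowed:
                host = urlparse(item_or_request.url).hostname or ''
                if not any(host == d or host.endswith('.' + d) for d in allowed):
                    spider.logger.debug('Dropping offsite request: %s', item_or_request.url)
                    continue
            yield item_or_request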


"""
22. Downloader middleware
    class DownMiddleware1(object):
        def process_request(self, request, spider):
            '''
            Called for every request that needs to be downloaded; the request passes through each downloader middleware's process_request
            :param request:
            :param spider:
            :return:
                None: continue to the remaining middlewares and download the request
                Response object: stop calling process_request and start calling process_response
                Request object: stop the middleware chain and hand the Request back to the scheduler
                raise IgnoreRequest: stop calling process_request and start calling process_exception
            '''
            pass



        def process_response(self, request, response, spider):
            '''
            Called with the response returned by the downloader, before it is handed to the spider
            :param request:
            :param response:
            :param spider:
            :return:
                Response object: passed on to the other middlewares' process_response
                Request object: stop the middleware chain; the request is rescheduled for download
                raise IgnoreRequest: Request.errback is called
            '''
            print('response1')
            return response

        def process_exception(self, request, exception, spider):
            '''
            Called when a download handler or a downloader middleware's process_request() raises an exception
            :param response:
            :param exception:
            :param spider:
            :return:
                None: pass the exception on to the remaining middlewares
                Response object: stop calling the remaining process_exception methods
                Request object: stop the middleware chain; the request is rescheduled for download
            '''
            return None


    Default downloader middlewares
    {
        'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
        'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
        'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
        'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
        'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
        'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
        'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
        'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
        'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
        'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
        'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
    }

"""
# from scrapy.contrib.downloadermiddleware.httpauth import HttpAuthMiddleware
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'step8_king.middlewares.DownMiddleware1': 100,
#    'step8_king.middlewares.DownMiddleware2': 500,
# }
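
As a concrete illustration of the process_request hook described in section 22, here is a minimal sketch of a downloader middleware that sets a random User-Agent header. The class name, the user-agent strings and the 'step8_king.middlewares' path are assumptions for this example.

# middlewares.py (hypothetical example)
import random


class RandomUserAgentMiddleware(object):
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    ]

    def process_request(self, request, spider):
        # returning None lets the request continue through the remaining middlewares to the downloader
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
        return None

# settings.py
DOWNLOADER_MIDDLEWARES = {
   'step8_king.middlewares.RandomUserAgentMiddleware': 543,   # hypothetical path
}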
#23 Logging
Scrapy provides logging support through Python's logging module.

Add the following two lines anywhere in settings.py:

LOG_FILE = "mySpider.log"
LOG_LEVEL = "INFO"

Scrapy supports five logging levels:

CRITICAL - critical errors
ERROR - regular errors
WARNING - warning messages
INFO - informational messages
DEBUG - debugging messages

Logging settings
The following settings in settings.py configure logging:

LOG_ENABLED   default: True. Enables logging.
LOG_ENCODING  default: 'utf-8'. Encoding used for logging.
LOG_FILE      default: None. File name for the log output, created in the current directory.
LOG_LEVEL     default: 'DEBUG'. Minimum level to log.
LOG_STDOUT    default: False. If True, all standard output (and errors) of the process is redirected to the log; for example, print("hello") will show up in the Scrapy log.
settings
Annotated settings file
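
With the settings above, a spider can write to the same log through its built-in logger; the spider name and URL below are purely illustrative:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://quotes.toscrape.com/']   # illustrative URL

    def parse(self, response):
        # written to mySpider.log because LOG_LEVEL = "INFO"
        self.logger.info('Parsed %s (status %s)', response.url, response.status)
        # filtered out at LOG_LEVEL = "INFO"
        self.logger.debug('Raw body length: %s', len(response.body))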

   9. scrapy-redis for distributed crawling

 1. Dedup rules in scrapy-redis: based on a redis set

  Option 1: fully custom

# assumed imports for this snippet: BaseDupeFilter and request_fingerprint ship with scrapy
from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint


class RedisFilter(BaseDupeFilter):

    def __init__(self):
        from redis import Redis, ConnectionPool
        pool = ConnectionPool(host='127.0.0.1', port=6379)
        self.conn = Redis(connection_pool=pool)

    def request_seen(self, request):
        """
        Check whether the current request has already been visited
        :param request:
        :return: True if it has been visited; False if it has not
        """
        fd = request_fingerprint(request=request)
        added = self.conn.sadd('visited_urls', fd)
        return added == 0
Fully custom
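
To activate the custom filter, point DUPEFILTER_CLASS at it in settings.py; the module path below is only an assumption about where RedisFilter is saved:

# settings.py
DUPEFILTER_CLASS = 'myproject.dupefilters.RedisFilter'   # hypothetical module path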

  Option 2: rely entirely on scrapy-redis

# ############### scrapy-redis connection ####################

REDIS_HOST = '127.0.0.1'                            # host
REDIS_PORT = 6379                                   # port
# REDIS_PARAMS  = {'password':'beta'}                 # redis connection params; default: REDIS_PARAMS = {'socket_timeout': 30,'socket_connect_timeout': 30,'retry_on_timeout': True,'encoding': REDIS_ENCODING,}
REDIS_ENCODING = "utf-8"                            # redis encoding; default: 'utf-8'
# REDIS_URL = 'redis://user:pass@hostname:9001'       # connection URL (takes precedence over the settings above)

# ############### scrapy-redis dedup ####################
DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'

# override the default dedup class
# DUPEFILTER_CLASS = "chouti.dupeFilter.RepeatFilter"
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

  Option 3: subclass scrapy-redis and customize

# assumed imports for this snippet: these names come from the scrapy_redis package
from scrapy_redis.dupefilter import RFPDupeFilter
from scrapy_redis.connection import get_redis_from_settings
from scrapy_redis import defaults


class RedisDupeFilter(RFPDupeFilter):
    @classmethod
    def from_settings(cls, settings):
        """Returns an instance from given settings.

        This uses by default the key ``dupefilter:<timestamp>``. When using the
        ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
        it needs to pass the spider name in the key.

        Parameters
        ----------
        settings : scrapy.settings.Settings

        Returns
        -------
        RFPDupeFilter
            A RFPDupeFilter instance.


        """
        server = get_redis_from_settings(settings)
        # XXX: This creates one-time key. needed to support to use this
        # class as standalone dupefilter with scrapy's default scheduler
        # if scrapy passes spider on open() method this wouldn't be needed
        # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
        key = defaults.DUPEFILTER_KEY % {'timestamp': 'test_scrapy_redis'}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)
Subclass scrapy-redis and customize

 2. Scheduler

  •  Configuration
Redis connection settings:
    REDIS_HOST = '127.0.0.1'                            # host
    REDIS_PORT = 6073                                   # port
    # REDIS_PARAMS  = {'password':'xxx'}                                  # redis connection params; default: REDIS_PARAMS = {'socket_timeout': 30,'socket_connect_timeout': 30,'retry_on_timeout': True,'encoding': REDIS_ENCODING,}
    REDIS_ENCODING = "utf-8"                            # redis encoding; default: 'utf-8'

Dedup settings:
    DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'
    DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

Scheduler settings:
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"

    DEPTH_PRIORITY = 1  # breadth-first
    # DEPTH_PRIORITY = -1 # depth-first
    SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'  # priority queue by default; alternatives: PriorityQueue (sorted set), FifoQueue (list), LifoQueue (list)

    # breadth-first
    # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'  # FifoQueue (list): pop in insertion order
    # depth-first
    # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'  # LifoQueue (list): pop newest first
    SCHEDULER_QUEUE_KEY = '%(spider)s:requests'  # redis key under which the scheduler stores requests

    SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"  # serializer for the data stored in redis; pickle by default

    SCHEDULER_PERSIST = False  # keep the scheduler queue and dedup records on close? True=keep, False=flush
    SCHEDULER_FLUSH_ON_START = True  # flush the scheduler queue and dedup records on start? True=flush, False=keep
    # SCHEDULER_IDLE_BEFORE_CLOSE = 10  # how long to wait at most when popping from an empty scheduler queue


    SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'  # redis key under which the dedup fingerprints are stored

    # DUPEFILTER_CLASS takes precedence; SCHEDULER_DUPEFILTER_CLASS is used only if it is not set
    SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class implementing the dedup rule
  • Execution flow
1. scrapy crawl chouti --nolog

2. Find the SCHEDULER = "scrapy_redis.scheduler.Scheduler" setting and instantiate the scheduler object
    - call Scheduler.from_crawler
    - call Scheduler.from_settings
        - read the settings:
            SCHEDULER_PERSIST             # keep the scheduler queue and dedup records on close? True=keep, False=flush
            SCHEDULER_FLUSH_ON_START     # flush the scheduler queue and dedup records on start? True=flush, False=keep
            SCHEDULER_IDLE_BEFORE_CLOSE  # how long to wait at most when popping from an empty scheduler queue
        - read the settings:
            SCHEDULER_QUEUE_KEY             # %(spider)s:requests
            SCHEDULER_QUEUE_CLASS         # scrapy_redis.queue.FifoQueue
            SCHEDULER_DUPEFILTER_KEY     # '%(spider)s:dupefilter'
            DUPEFILTER_CLASS             # 'scrapy_redis.dupefilter.RFPDupeFilter'
            SCHEDULER_SERIALIZER         # "scrapy_redis.picklecompat"

        - read the settings:
            REDIS_HOST = '140.143.227.206'                            # host
            REDIS_PORT = 8888                                   # port
            REDIS_PARAMS  = {'password':'beta'}                                  # redis connection params; default: REDIS_PARAMS = {'socket_timeout': 30,'socket_connect_timeout': 30,'retry_on_timeout': True,'encoding': REDIS_ENCODING,}
            REDIS_ENCODING = "utf-8"
    - instantiate the Scheduler object
    
3. The spider starts processing its start URLs
    - calls scheduler.enqueue_request()
        def enqueue_request(self, request):
            # does the request need filtering?
            # is it already in the dedup records? (i.e. already visited; if not, it gets added to the records)
            if not request.dont_filter and self.df.request_seen(request):
                # needs filtering and has already been visited: return False
                self.df.log(request, self.spider)
                # already visited, do not visit again
                return False

            if self.stats:
                self.stats.inc_value('scheduler/enqueued/redis', spider=self.spider)
            # print('not visited yet, adding to the scheduler', request)
            self.queue.push(request)
            return True

4. The downloader asks the scheduler for tasks to download

    - calls scheduler.next_request()
        def next_request(self):
            block_pop_timeout = self.idle_before_close
            request = self.queue.pop(block_pop_timeout)
            if request and self.stats:
                self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
            return request
Scheduler execution flow
  • Item persistence
Persistence: when the spider yields an Item object, RedisPipeline stores it in redis (see the settings sketch below)

    a. When persisting items to redis, specify the key and the serializer

        REDIS_ITEMS_KEY = '%(spider)s:items'
        REDIS_ITEMS_SERIALIZER = 'json.dumps'

    b. Item data is stored in a redis list
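
A minimal sketch of enabling it: scrapy-redis ships scrapy_redis.pipelines.RedisPipeline, which only needs to be registered in ITEM_PIPELINES (the priority 300 is an arbitrary choice):

# settings.py
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,   # pushes each yielded item onto the %(spider)s:items list
}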
  • Start URLs
"""
Start URLs

    a. When fetching start URLs, read from a set or from a list? True: set; False: list
        REDIS_START_URLS_AS_SET = False    # if True, start URLs are fetched with self.server.spop; if False, with self.server.lpop
    b. The spider reads its start URLs from this redis key
        REDIS_START_URLS_KEY = '%(name)s:start_urls'

"""
# If True, it uses redis' ``spop`` operation. This could be useful if you
# want to avoid duplicates in your start urls list. In this cases, urls must
# be added via ``sadd`` command or you will get a type error from redis.
# REDIS_START_URLS_AS_SET = False
 
# Default start urls key for RedisSpider and RedisCrawlSpider.
# REDIS_START_URLS_KEY = '%(name)s:start_urls'
settings
from scrapy_redis.spiders import RedisSpider

class SpiderchoutiSpider(RedisSpider):
    name = 'spiderchouti'
    allowed_domains = ['dig.chouti.com']
    # no start_urls needed; they are read from redis
The spider class inherits from RedisSpider
from redis import Redis, ConnectionPool

pool = ConnectionPool(host='127.0.0.1', port=6379)
conn = Redis(connection_pool=pool)

conn.lpush('spiderchouti:start_urls','https://dig.chouti.com/')
Push the start URLs onto the <spider_name>:start_urls list in redis
  • A few notes
1. What is depth-first? What is breadth-first?
        Think of a tree: depth-first finishes every node of one subtree before moving on to the next subtree; breadth-first finishes one whole level before moving on to the next level.
    2. How does scrapy implement depth-first and breadth-first? (see the sketch after this block)
    Based on a stack or a queue:
        first in, first out: breadth-first
        last in, first out: depth-first
     Based on a sorted set:
        priority queue:
            DEPTH_PRIORITY = 1  # breadth-first
            DEPTH_PRIORITY = -1 # depth-first

    3. How do the scheduler, the queue and the dupefilter relate in scrapy?

        The scheduler decides which request gets added or handed out next.
        The queue stores the requests: first in, first out gives breadth-first, last in, first out gives depth-first, and a priority queue orders them by priority.
        The dupefilter keeps the record of visited requests, i.e. the dedup rule.
Notes
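
A minimal, self-contained sketch of the queue-versus-stack idea on a toy link graph (plain Python, not scrapy code): popping from the front of the queue (FIFO) visits pages level by level, popping from the back (LIFO) follows one branch to the end first, and the seen set plays the dupefilter's role.

from collections import deque

# Toy link graph: page -> pages it links to (illustrative data, not real URLs)
graph = {
    'A': ['B', 'C'],
    'B': ['D', 'E'],
    'C': ['F'],
    'D': [], 'E': [], 'F': [],
}

def crawl(start, lifo=False):
    """FIFO queue gives breadth-first order; LIFO (stack) gives depth-first order."""
    pending, seen, order = deque([start]), {start}, []
    while pending:
        page = pending.pop() if lifo else pending.popleft()
        order.append(page)
        for link in graph[page]:
            if link not in seen:        # dupefilter role: skip already-seen pages
                seen.add(link)
                pending.append(link)
    return order

print(crawl('A'))             # ['A', 'B', 'C', 'D', 'E', 'F']  breadth-first
print(crawl('A', lifo=True))  # ['A', 'C', 'F', 'B', 'E', 'D']  depth-first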

   10. TinyScrapy

from twisted.internet import reactor   # event loop (terminates once every socket has been removed)
from twisted.web.client import getPage # socket object (removed from the event loop automatically once the download finishes)
from twisted.internet import defer     # defer.Deferred: a special "socket" that sends no request and must be removed manually
from queue import Queue


class Request(object):
    """
    用於封裝用戶請求相關信息,供用戶編寫spider時發送請求所用
    """
    def __init__(self,url,callback):
        self.url = url
        self.callback = callback


class HttpResponse(object):
    """
    經過響應請求返回的數據和穿入的request對象封裝成一個response對象
    目的是爲了將請求返回的數據不只包括返回的content數據,使其擁有更多的屬性,好比請求頭,請求url,請求的cookies等等
    更方便的被回調函數所解析有用的數據
    """
    def __init__(self,content,request):
        self.content = content
        self.request = request
        self.url = request.url
        self.text = str(content,encoding='utf-8')


class Scheduler(object):
    """
    任務調度器:
    1.初始化一個隊列
    2.next_request方法:讀取隊列中的下一個請求
    3.enqueue_request方法:將請求加入到隊列
    4.size方法:返回當前隊列請求的數量
    5.open方法:無任何操做,返回一個空值,用於引擎中用裝飾器裝飾的open_spider方法返回一個yield對象
    """
    def __init__(self):
        self.q = Queue()

    def open(self):
        pass

    def next_request(self):
        try:
            req = self.q.get(block=False)
        except Exception as e:
            req = None
        return req

    def enqueue_request(self,req):
        self.q.put(req)

    def size(self):
        return self.q.qsize()


class ExecutionEngine(object):
    """
    引擎:全部的調度
    1.經過open_spider方法將start_requests中的每個請求加入到scdeuler中的隊列當中,
    2.處理每個請求響應以後的回調函數(get_response_callback)和執行下一次請求的調度(_next_request)
    """
    def __init__(self):
        self._close = None
        self.scheduler = None
        self.max = 5
        self.crawlling = []

    def get_response_callback(self,content,request):
        self.crawlling.remove(request)
        response = HttpResponse(content,request)
        result = request.callback(response)
        import types
        if isinstance(result,types.GeneratorType):
            for req in result:
                self.scheduler.enqueue_request(req)

    def _next_request(self):
        """
        1.對spider對象的請求進行調度
        2.設置事件循環終止條件:調度器隊列中請求的個數爲0,正在執行的請求數爲0
        3.設置最大併發數,根據正在執行的請求數量知足最大併發數條件對sceduler隊列中的請求進行調度執行
        4.包括對請求進行下載,以及對返回的數據執行callback函數
        5.開始執行事件循環的下一次請求的調度
        """
        if self.scheduler.size() == 0 and len(self.crawlling) == 0:
            self._close.callback(None)
            return
        """設置最大併發數爲5"""
        while len(self.crawlling) < self.max:
            req = self.scheduler.next_request()
            if not req:
                return
            self.crawlling.append(req)
            d = getPage(req.url.encode('utf-8'))
            d.addCallback(self.get_response_callback,req)
            d.addCallback(lambda _:reactor.callLater(0,self._next_request))

    @defer.inlineCallbacks
    def open_spider(self,start_requests):
        """
        1.建立一個調度器對象
        2.將start_requests中的每個請求加入到scheduler隊列中去
        3.而後開始事件循環執行下一次請求的調度
        注:每個@defer.inlineCallbacks裝飾的函數都必須yield一個對象,即便爲None
        """
        self.scheduler = Scheduler()
        yield self.scheduler.open()
        while True:
            try:
                req = next(start_requests)
            except StopIteration as e:
                break
            self.scheduler.enqueue_request(req)
        reactor.callLater(0,self._next_request)

    @defer.inlineCallbacks
    def start(self):
        """不發送任何請求,須要手動中止,目的是爲了夯住循環"""
        self._close = defer.Deferred()
        yield self._close


class Crawler(object):
    """
    1.用戶封裝調度器以及引擎
    2.經過傳入的spider對象的路徑建立spider對象
    3.建立引擎去打開spider對象,對spider中的每個請求加入到調度器中的隊列中去,經過引擎對請求去進行調度
    """
    def _create_engine(self):
        return ExecutionEngine()

    def _create_spider(self,spider_cls_path):
        """

        :param spider_cls_path:  spider.chouti.ChoutiSpider
        :return:
        """
        module_path,cls_name = spider_cls_path.rsplit('.',maxsplit=1)
        import importlib
        m = importlib.import_module(module_path)
        cls = getattr(m,cls_name)
        return cls()

    @defer.inlineCallbacks
    def crawl(self,spider_cls_path):
        engine = self._create_engine()
        spider = self._create_spider(spider_cls_path)
        start_requests = iter(spider.start_requests())
        yield engine.open_spider(start_requests) # push every start request onto the scheduler queue and let the engine drive their execution
        yield engine.start() # create a Deferred that keeps the event loop alive until it is stopped manually


class CrawlerProcess(object):
    """
    1.建立一個Crawler對象
    2.將傳入的每個spider對象的路徑傳入Crawler.crawl方法
    3.並將返回的socket對象加入到集合中
    4.開始事件循環
    """
    def __init__(self):
        self._active = set()

    def crawl(self,spider_cls_path):
        """
        :param spider_cls_path:
        :return:
        """
        crawler = Crawler()
        d = crawler.crawl(spider_cls_path)
        self._active.add(d)

    def start(self):
        dd = defer.DeferredList(self._active)
        dd.addBoth(lambda _:reactor.stop())

        reactor.run()


class Command(object):
    """
    1.建立開始運行的命令
    2.將每個spider對象的路徑傳入到crawl_process.crawl方法中去
    3.crawl_process.crawl方法建立一個Crawler對象,經過調用Crawler.crawl方法建立一個引擎和spider對象
    4.經過引擎的open_spider方法建立一個scheduler對象,將每個spider對象加入到schedler隊列中去,而且經過自身的_next_request方法對下一次請求進行調度
    5.
    """
    def run(self):
        crawl_process = CrawlerProcess()
        spider_cls_path_list = ['spider.chouti.ChoutiSpider','spider.cnblogs.CnblogsSpider',]
        for spider_cls_path in spider_cls_path_list:
            crawl_process.crawl(spider_cls_path)
        crawl_process.start()


if __name__ == '__main__':
    cmd = Command()
    cmd.run()
tinyscrapy
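
The Command above expects spider classes such as spider.chouti.ChoutiSpider. The module below is an assumed sketch of what such a spider could look like against the Request/HttpResponse API defined above; the import assumes the TinyScrapy code is saved as engine.py.

# spider/chouti.py (hypothetical sketch)
from engine import Request   # assumes the TinyScrapy code above lives in engine.py


class ChoutiSpider(object):
    name = 'chouti'

    def start_requests(self):
        # the engine iterates these and pushes them onto the scheduler queue
        yield Request(url='https://dig.chouti.com/', callback=self.parse)

    def parse(self, response):
        # response is the HttpResponse defined above: .text, .url, .request
        print(response.url, len(response.text))
        # yielding further Request objects here would enqueue follow-up pages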