Scrapy Study Notes

1. Scrapy Overall Architecture

 Scrapy uses the Twisted asynchronous networking framework to handle requests. The overall architecture is as follows:

 

Scrapy Engine: coordinates the data flow between all components of the framework; it is the core of the framework.

Scheduler: accepts requests from the engine, pushes them onto a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs: it decides which URL to crawl next and removes duplicates. (Effectively the set of URLs still to be crawled.)

Downloader: downloads the page content for a given URL and passes it to the spiders for processing.

Spiders: process the downloaded page content and extract the needed information. They can extract data Items and pass them to the Item Pipeline for saving, or extract URLs and hand them to the Scheduler's URL queue.

Item Pipeline: receives Items from the spiders and persists them, e.g. by writing them to a file or a database.

Scheduler Middleware: handles interaction between the engine and the scheduler.

Spider Middleware: handles interaction between the engine and the spiders.

Downloader Middleware: handles interaction between the engine and the downloader.

A complete crawl cycle can be summarized as follows:

  1. The Spiders hand the URLs to be requested (requests) through the Scrapy Engine to the Scheduler.
  2. After the Scheduler sorts and enqueues them, the requests pass through the Scrapy Engine and the Downloader Middlewares (optional; mainly User-Agent and proxy handling) to the Downloader.

  3. The Downloader sends the requests to the internet and receives the responses, which pass back through the Scrapy Engine and the Spider Middlewares (optional) to the Spiders.

  4. The Spiders process each response, extract data and pass it through the Scrapy Engine to the Item Pipeline for saving (locally or in a database), and extract new URLs which go back through the Scrapy Engine to the Scheduler for the next cycle. The program stops when no URL requests remain. A minimal spider sketch of this loop follows below.
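The loop can be made concrete with a minimal spider sketch (not part of the original examples; the site and XPath expressions are placeholders only):

# -*- coding: utf-8 -*-
# Minimal illustration of the request/response loop; site and selectors are placeholders.
import scrapy

class LoopDemoSpider(scrapy.Spider):
    name = "loop_demo"
    start_urls = ['http://example.com/']

    def parse(self, response):
        # extracted data is sent toward the Item Pipeline via the engine
        yield {'title': response.xpath('//title/text()').extract_first()}
        # newly discovered URLs go back to the Scheduler via the engine
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)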

2. Common Commands

Official docs: https://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/commands.html

 

  1 scrapy startproject project_name : create a new crawler project in the current directory

  2 scrapy genspider [-t template] <spider_name> <domain> : create a spider from a template (run this inside the project directory)

  (available templates: basic, crawl, csvfeed, xmlfeed; basic is the default, i.e. scrapy genspider -t basic)

    scrapy genspider -l : list all available templates

    scrapy genspider -d template_name : dump the contents of the given template

  3 scrapy list : list all spiders in the project

  4 scrapy crawl spider_name : run a single spider

          scrapy crawl spider_name --nolog : run without printing log output

3. Project Structure

The generated project directory looks like this:

 

scrapy.cfg : the project's main configuration entry point. (The actual crawler settings live in settings.py.)

items.py: defines the data models used to structure scraped data, similar to Django's Model

pipelines.py: data-processing behavior, typically persisting the structured data

settings.py: configuration file, e.g. recursion depth, concurrency, download delay, etc.

spiders/: directory containing all spiders created in the project (e.g. cnblog.py)

The generated cnblog.py looks like this:

# -*- coding: utf-8 -*-
import scrapy
class CnblogSpider(scrapy.Spider):
    name = "cnblog"     #爬蟲應用名稱 
    allowed_domains = ["cnblogs.com"]    #限制爬蟲域名,其餘域名不爬取
    start_urls = (
        'http://www.cnblogs.com/',    # 爬蟲起始url
    )
    def parse(self, response):
        pass                          # 訪問起始URL並獲取結果後的回調函數, response爲下載器返回的結果,response.text即網頁文本

If the Windows console output is garbled with UnicodeEncodeError: 'gbk' codec can't encode character u'\xbb' (the Windows console uses GBK, while the downloaded page text is a Unicode string), fix it as follows:

Python 3: add the following before your code
import sys, io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')  # gb18030 covers all GB-series encodings and avoids the few characters GBK cannot encode

Python 2: encode the output explicitly.
Python 2 has no sys.stdout.buffer, so encode whatever you print:
  print response.text.encode('gb18030')

4. Selectors
Official docs: https://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/selectors.html

Constructing a selector
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
# via the Selector class
response = HtmlResponse(url='http://example.com', body=html_body)
Selector(response=response).xpath()  
# via the response's selector attribute, or the xpath()/css() shortcuts
response.selector.xpath()
response.xpath()
response.css()
Meaning of common selection expressions:
https://www.jianshu.com/p/2391950137a4
https://blog.csdn.net/manongpengzai/article/details/77109600

*  matches any element node

@*  matches any attribute node

node() matches a node of any kind

text() matches text nodes

extract() returns the matched content as a list of strings (extract_first() returns the first match)

string() returns the string value of a node, i.e. the concatenated text of the node and its descendants

 
# hxs = Selector(response=response).xpath('//a')      # select all a elements in the document
# print(hxs)
# hxs = Selector(response=response).xpath('//a[2]')  # select a elements that are in position 2 within their parent
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id]') # select a elements that have an id attribute
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id="i1"]')  # select a elements with id="i1"
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')  # href attribute contains "link"
# print(hxs)
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')  # href attribute starts with "link"
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]')   # regex: the id attribute must match "i\d+"
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract() # text of the matched a elements
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract() # href attribute of the matched a elements
# print(hxs)
# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()  # walk down level by level from the root
# print(hxs)
# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()   # return only the first match
# print(hxs)
 
# ul_list = Selector(response=response).xpath('//body/ul/li')
# for item in ul_list:
#     v = item.xpath('./a/span')
#     # or
#     # v = item.xpath('a/span')
#     # or
#     # v = item.xpath('*/a/span')
#     print(v)

5. Example Projects

1. Crawl article titles from the cnblogs.com homepage, with automatic pagination

import scrapy
from scrapy.http.request import Request

class CnblogSpider(scrapy.Spider):
    name = "cnblog"
    allowed_domains = ["cnblogs.com"]
    start_urls = (
        'https://www.cnblogs.com/',
    )

    has_request_set={}

    def parse(self, response):

        #print response.text.encode("gb18030")
        #print dir(response)
        page_title = response.xpath('//div[@class="post_item"]//h3/a/text()').extract_first()
        print response.url, page_title
        pager_list=response.xpath('//div[@class="pager"]/a/@href').extract()
        for item in pager_list:
            url = 'https://www.cnblogs.com/%s'%item
            import hashlib
            hash = hashlib.md5()
            hash.update(url)
            key = hash.hexdigest()  # hash the url so duplicates are easy to compare and are not revisited
            if key in self.has_request_set:
                print u"already downloaded"  # printing a unicode literal avoids garbled console output
            else:
                self.has_request_set[key]=url
                yield Request(url=url,method='GET')
            # Request() without callback= defaults to self.parse for the returned response, i.e. a recursive call
            # set DEPTH_LIMIT=1 in settings.py to limit the recursion depth
Crawling cnblogs article titles

2. Log in to the chouti.com hot list with cookies and upvote posts in bulk

import scrapy
from scrapy.http.cookies import CookieJar
from scrapy.http.request import Request

# Before running this bulk-upvote spider, set DEPTH_LIMIT = 4 in settings; otherwise the recursion goes too deep and hammers the site!

class ChoutiSpider(scrapy.Spider):
    name = "chouti"
    allowed_domains = ["chouti.com"]
    start_urls = (
        'https://dig.chouti.com/',
    )
    cookies_dict={}
    has_request_set={}

    # Visit the homepage and capture the cookies
    def parse(self, response):
        cookie_jar = CookieJar()
        cookie_jar.extract_cookies(response, response.request)
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m,n in j.items():
                    self.cookies_dict[m]=n.value    # n is a Cookie instance
                    #print n.value,type(n)
        #print self.cookies_dict


        # Log in with the cookies so that the server authorizes them
        url = "https://dig.chouti.com/login"
        yield Request(
            url=url,
            method='POST',
            headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            # Content-Type must be set or the POSTed form data will not be parsed correctly
            body='oneMonth=1&password=19930624&phone=8618626429847',
            cookies=self.cookies_dict,
            callback=self.check_login
        )

    # Use the authorized cookies for subsequent requests
    def check_login(self,response):
        yield Request(
            url="https://dig.chouti.com/",
            method='GET',
            cookies=self.cookies_dict,
            callback=self.do_favor
        )

    # Upvote in bulk
    def do_favor(self,response):
        linkid_list = response.xpath('//div[@share-linkid]/@share-linkid').extract()
        #print linkid
        user = response.xpath('//span[@id="userProNick"]/text()').extract()
        #print user
        for id in linkid_list:
            url = "https://dig.chouti.com/link/vote?linksId=%s"%id
            yield Request(
            url=url,
            method='POST',
            cookies=self.cookies_dict,
            callback=self.show_favor
        )

        # Grab the pagination links and follow them automatically
        pager_list = response.xpath('//div[@id="dig_lcpage"]/ul/li/a/@href').extract()
        #print pager_list
        for page in pager_list:
            page_url = "https://dig.chouti.com%s"%page
            import hashlib
            hash = hashlib.md5()
            hash.update(page_url)
            key = hash.hexdigest()
            if key in self.has_request_set.keys():
                pass
            else:
                self.has_request_set[key]=page_url
                #print page_url
                yield Request(
                    url=page_url,
                    method='GET',
                    cookies=self.cookies_dict,
                    callback=self.do_favor    # recursive call, so every page gets upvoted
                )
    # Print the response to each upvote request ("recommendation successful")
    def show_favor(self,response):
        print response.text
Cookie login and bulk upvoting

 

6. Structuring and Persisting Data

The data in the examples above can be handled directly in parse(), but for proper formatting and persistence it is better to structure the data with Items and hand them to a pipeline.

items

Items official docs: https://doc.scrapy.org/en/latest/topics/items.html

Defining an Item is similar to defining a Django model: each Item object has a number of fields, behaves much like a dict, and can be converted to and from a dict.
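The Product class used in the snippets below is assumed to be declared roughly like this (a sketch with fields inferred from the usage examples):

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    last_updated = scrapy.Field()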

Creating Item
>>> product = Product(name='Desktop PC', price=1000)
>>> print product
Product(name='Desktop PC', price=1000)

Getting Field
>>> product['name']
Desktop PC
>>> product.get('name')
Desktop PC

Setting Field
>>> product['last_updated'] = 'today'
>>> product['last_updated']
today

Creating dicts from items:
>>> dict(product) # create a dict from all populated values
{'price': 1000, 'name': 'Desktop PC'}

Creating items from dicts
>>> Product({'name': 'Laptop PC', 'price': 1500})
Product(price=1500, name='Laptop PC')
Item

pipeline

Pipeline official docs: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

 When a spider yields an item (yield item), the item is passed to the process_item() method of each pipeline class registered in settings; the pipelines run one after another according to the weights configured there. (If process_item() does not return the item, the item is dropped and is not passed to the next pipeline's process_item().) Besides process_item(), a pipeline can implement the other methods shown below:

from scrapy.exceptions import DropItem

class CustomPipeline(object):
    def __init__(self,v):
        self.value = v

    def process_item(self, item, spider):
        # process the item and persist it

        # returning the item lets later pipelines keep processing it
        return item

        # to drop the item so that later pipelines never see it:
        # raise DropItem()


    @classmethod
    def from_crawler(cls, crawler):
        """
        Called at initialization time to create the pipeline instance
        :param crawler: 
        :return: 
        """
        val = crawler.settings.getint('MMMM')
        return cls(val)

    def open_spider(self,spider):
        """
        Called when the spider is opened
        :param spider: 
        :return: 
        """
        print('000000')

    def close_spider(self,spider):
        """
        Called when the spider is closed
        :param spider: 
        :return: 
        """
        print('111111')
Custom pipeline

Crawl Lianjia housing listings and save them:

# -*- coding: utf-8 -*-
import scrapy
from ..items import LianjiaItem
from scrapy.http.request import Request
import json

class LianjiaSpider(scrapy.Spider):
    name = "lianjia"
    allowed_domains = ["lianjia.com"]
    start_urls = (
        'http://wh.lianjia.com/ershoufang/',
    )
    has_request_set={}
    def parse(self, response):
        sell_list = response.xpath('//ul[@class="sellListContent"]/li')

        #print sell_list
        for item in sell_list:
            img_src = item.xpath('./a/img[@class="lj-lazy"]/@data-original').extract_first()   # use data-original, not src; the src attribute only holds a placeholder image
            house_name =item.xpath('.//div[@class="houseInfo"]/a/text()').extract_first()
            house_desc = item.xpath('.//div[@class="houseInfo"]/text()').extract_first()
            total_price = item.xpath('.//div[@class="totalPrice"]/span/text()').extract_first()
            unit_price = item.xpath('.//div[@class="unitPrice"]/span/text()').extract_first()
            house_item = LianjiaItem(img_src=img_src,house_name=house_name,
                                     house_desc=house_desc,total_price=total_price,unit_price=unit_price)
            yield house_item

        # the returned page does not expose individual page links, only the total page count
        pager_data = response.xpath('//div[@comp-module="page"]/@page-data').extract()
        #print pager_data
        total_page = json.loads(pager_data[0])["totalPage"]

        #for i in range(2,total_page)
        for i in range(2,4):   # only crawl pages 2 and 3
            page_url="https://wh.lianjia.com/ershoufang/pg%s/"%i
            yield Request(url=page_url,callback=self.parse)
lianjia.py
import scrapy

class LianjiaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    img_src = scrapy.Field()
    house_name = scrapy.Field()
    house_desc = scrapy.Field()
    total_price= scrapy.Field()
    unit_price = scrapy.Field()
items.py
import json
import requests
import os
class LianjiaPipeline(object):
    def __init__(self):
        self.file=open('lianjia.txt','a')  # open (or create) the file in the current directory in append mode

    def process_item(self, item, spider):
        if item['house_name']:
            data = json.dumps(dict(item), ensure_ascii=False).encode("utf8") + "\n"
            self.file.write(data)
        return item

    def close_spider(self, spider):
        self.file.close()  # close the file once when the spider finishes, not after the first item
class ImgPipeline(object):
    def __init__(self):
        if not os.path.exists('images'):  # create the images directory if it does not exist
            os.mkdir('images')

    def process_item(self,item, spider):
        response = requests.get(item['img_src'], stream=True) # stream=True streams the download to disk instead of loading it all into memory first
        file_name=u'%s_%s萬.jpg'%(item['house_name'],item['total_price'])
        with open(os.path.join('images',file_name),'wb') as f:
            f.write(response.content)
        return item
pipelines.py
ITEM_PIPELINES = {
   'mySpider.pipelines.LianjiaPipeline': 100,
    'mySpider.pipelines.ImgPipeline': 200,
}
# values range from 0 to 1000; the lower the number, the higher the priority, and its process_item() runs earlier
settings.py

 

7. Middleware

Spider Middleware sits between the engine and the spiders. Write a custom spider-middleware class implementing the relevant methods and register it in settings. The smaller the number, the closer to the engine and the earlier its process_spider_input() runs; the larger the number, the closer to the spider and the earlier its process_spider_output() runs. Set a middleware to None to disable it.

Official docs: https://scrapy.readthedocs.io/en/latest/topics/spider-middleware.html

https://zhuanlan.zhihu.com/p/42498126

class SpiderMiddleware(object):

    def process_spider_input(self,response, spider):
        """
        The response coming from the engine is processed here first, then handed to the spider
        :param response: 
        :param spider: 
        :return: 
        """
        pass

    def process_spider_output(self,response, result, spider):
        """
        Called with the results returned by the spider (the results are processed here before being passed on to the engine)
        :param response:
        :param result:
        :param spider:
        :return: must return an iterable containing Request or Item objects
        """
        return result

    def process_spider_exception(self,response, exception, spider):
        """
        Called when an exception is raised
        :param response:
        :param exception:
        :param spider:
        :return: None to let later middleware keep handling the exception; an iterable containing Response or Item objects to hand to the scheduler or pipeline
        """
        return None


    def process_start_requests(self,start_requests, spider):
        """
        Called when the spider starts
        :param start_requests:
        :param spider:
        :return: an iterable containing Request objects
        """
        return start_requests
Spider middleware definition
SPIDER_MIDDLEWARES = {
   'mySpider.middlewares.MyCustomSpiderMiddleware': 543,
}

# merged with the middlewares in SPIDER_MIDDLEWARES_BASE and executed in order of their weights
'''
SPIDER_MIDDLEWARES_BASE=
{
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
}
'''
Spider middleware settings

Downloader Middleware sits between the engine and the downloader; it also needs to be defined and registered. The smaller the number, the closer to the engine; the larger, the closer to the downloader. Lower-numbered middlewares run process_request() earlier; higher-numbered ones run process_response() earlier. To disable a middleware, simply set its value to None.

DOWNLOADER_MIDDLEWARES = {
   'mySpider.middlewares.MyCustomDownloaderMiddleware': 543,
}

# merged with DOWNLOADER_MIDDLEWARES_BASE and executed in order of their weights
'''
DOWNLOADER_MIDDLEWARES_BASE=
{
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}
'''
Downloader middleware settings
class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        Each request coming from the engine passes through every downloader middleware's process_request
        :param request: 
        :param spider: 
        :return:  
            None: continue with the remaining middleware and the downloader
            Response object: stop calling process_request and start calling process_response
            Request object: stop the middleware chain and send the Request back to the scheduler
            raise IgnoreRequest: stop calling process_request and start calling process_exception
        """
        pass

    def process_response(self, request, response, spider):
        """
        Each response returned by the downloader passes through every downloader middleware's process_response
        :param response:
        :param result:
        :param spider:
        :return: 
            Response object: handed on to the other middlewares' process_response
            Request object: stop the middleware chain; the request is rescheduled for download
            raise IgnoreRequest: Request.errback is called
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when the download handler or a downloader middleware's process_request() raises an exception
        :param response:
        :param exception:
        :param spider:
        :return: 
            None: let later middleware keep handling the exception
            Response object: stop calling the remaining process_exception methods
            Request object: stop the middleware chain; the request is rescheduled for download
        """
        return None
    @classmethod
    def from_crawler(cls, crawler):
        # build the middleware instance from the crawler (e.g. read settings)
        return cls()
Downloader middleware definition

8. Custom Commands

Official docs: https://doc.scrapy.org/en/latest/topics/commands.html?highlight=COMMANDS_MODULE

Add COMMANDS_MODULE = 'project_name.directory_name' to settings.py.
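A minimal sketch of a custom command (the module path and command name here are hypothetical; the commands directory needs an __init__.py so it can be imported, and the command name is taken from the module's file name):

# myproject/commands/hello.py
from scrapy.commands import ScrapyCommand

class Command(ScrapyCommand):
    requires_project = True

    def short_desc(self):
        return 'a minimal custom command'

    def run(self, args, opts):
        # invoked with: scrapy hello
        print('hello from a custom command')

With COMMANDS_MODULE = 'myproject.commands' in settings.py, scrapy hello becomes available alongside the built-in commands.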

9. Signals

Official docs: https://scrapy.readthedocs.io/en/latest/topics/signals.html

Scrapy defines many signals that fire when specific events happen; you can attach your own handler functions to them.

from scrapy import signals

class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        val = crawler.settings.getint('MMMM')
        ext = cls(val)

        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)

        return ext

    def spider_opened(self, spider):
        print('open')

    def spider_closed(self, spider):
        print('close')
Example 1
from scrapy import signals
from scrapy import Spider


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]


    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider


    def spider_closed(self, spider):
        spider.logger.info('Spider closed: %s', spider.name)


    def parse(self, response):
        pass
Example 2

10. URL Deduplication

Official docs: https://doc.scrapy.org/en/latest/topics/settings.html?highlight=DUPEFILTER_CLASS

DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'  # the default class for filtering duplicate requests
DUPEFILTER_DEBUG = False  # RFPDupeFilter defaults to False and only logs the first duplicate request; set True to log them all
Request(dont_filter=True)  # this Request's url is exempt from deduplication
A custom dedup filter:
class RepeatUrl:
    def __init__(self):
        self.visited_url = set()

    @classmethod
    def from_settings(cls, settings):
        """
        Called at initialization time
        :param settings: 
        :return: 
        """
        return cls()

    def request_seen(self, request):
        """
        Check whether the current request has already been seen
        :param request: 
        :return: True if it has been visited before; False otherwise
        """
        if request.url in self.visited_url:
            return True
        self.visited_url.add(request.url)
        return False

    def open(self):
        """
        Called when crawling starts
        :return: 
        """
        print('open replication')

    def close(self, reason):
        """
        Called when crawling ends
        :param reason: 
        :return: 
        """
        print('close replication')

    def log(self, request, spider):
        """
        Log the duplicate request
        :param request: 
        :param spider: 
        :return: 
        """
        print('repeat', request.url)

Custom dedup filter class
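To enable the custom filter, point DUPEFILTER_CLASS at it in settings.py (the module path below is hypothetical):

# settings.py
DUPEFILTER_CLASS = 'mySpider.duplication.RepeatUrl'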

11. What the Settings Mean

# 1. Bot name
BOT_NAME = 'step8_king'

# 2. Spider module paths
SPIDER_MODULES = ['step8_king.spiders']
NEWSPIDER_MODULE = 'step8_king.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# 3. Client User-Agent header
# USER_AGENT = 'step8_king (+http://www.yourdomain.com)'

# Obey robots.txt rules
# 4. Whether to obey robots.txt
# ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# 5. Maximum concurrent requests
# CONCURRENT_REQUESTS = 4

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# 6. Download delay in seconds
# DOWNLOAD_DELAY = 2


# The download delay setting will honor only one of:
# 7. Concurrent requests per domain; the download delay is also applied per domain
# CONCURRENT_REQUESTS_PER_DOMAIN = 2
# Concurrent requests per IP; if set, CONCURRENT_REQUESTS_PER_DOMAIN is ignored and the download delay is applied per IP
# CONCURRENT_REQUESTS_PER_IP = 3

# Disable cookies (enabled by default)
# 8. Whether cookies are enabled (handled via a cookiejar)
# COOKIES_ENABLED = True
# COOKIES_DEBUG = True

# Disable Telnet Console (enabled by default)
# 9. The Telnet console lets you inspect and control the running crawler
#    connect with telnet ip port, then issue commands
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]


# 10. Default request headers
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }


# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# 11. Item pipelines that process scraped items
# ITEM_PIPELINES = {
#    'step8_king.pipelines.JsonPipeline': 700,
#    'step8_king.pipelines.FilePipeline': 500,
# }



# 12. Custom extensions, invoked via signals
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     # 'step8_king.extensions.MyExtension': 500,
# }


# 13. Maximum crawl depth; the current depth is available via meta; 0 means no limit
# DEPTH_LIMIT = 3

# 14. Crawl order: 0 means depth-first (LIFO, the default); 1 means breadth-first (FIFO)

# last in, first out: depth-first
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
# first in, first out: breadth-first

# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

# 15. Scheduler queues
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler


# 16. URL deduplication
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'


# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html

"""
17. AutoThrottle algorithm
    from scrapy.contrib.throttle import AutoThrottle
    AutoThrottle settings:
    1. minimum delay: DOWNLOAD_DELAY
    2. maximum delay: AUTOTHROTTLE_MAX_DELAY
    3. initial download delay: AUTOTHROTTLE_START_DELAY
    4. when a request finishes downloading, take its latency, i.e. the time from opening the connection to receiving the response headers
    5. the target concurrency AUTOTHROTTLE_TARGET_CONCURRENCY is then used in the calculation:
    target_delay = latency / self.target_concurrency
    new_delay = (slot.delay + target_delay) / 2.0 # slot.delay is the previous delay
    new_delay = max(target_delay, new_delay)
    new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
    slot.delay = new_delay
"""

# enable auto-throttling
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# maximum download delay
# AUTOTHROTTLE_MAX_DELAY = 10
# The average number of requests Scrapy should be sending in parallel to each remote server
# average number of requests sent in parallel to each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:
# whether to show throttling stats for every response
# AUTOTHROTTLE_DEBUG = True

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings


"""
18. HTTP caching
    Caches requests/responses that have already been fetched so they can be reused later
    
    from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.extensions.httpcache import FilesystemCacheStorage
"""
# whether to enable HTTP caching
# HTTPCACHE_ENABLED = True

# cache policy: cache every request; later requests are served directly from the cache
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# cache policy: cache according to HTTP response headers such as Cache-Control and Last-Modified
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

# cache expiration time
# HTTPCACHE_EXPIRATION_SECS = 0

# cache directory
# HTTPCACHE_DIR = 'httpcache'

# HTTP status codes that are never cached
# HTTPCACHE_IGNORE_HTTP_CODES = []

# cache storage backend
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


"""
19. Proxies, configured via environment variables
    from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware
    
    Option 1: use the default middleware and set environment variables
        os.environ
        {
            http_proxy:http://root:woshiniba@192.168.11.11:9999/
            https_proxy:http://192.168.11.11:9999/
        }
    Option 2: use a custom downloader middleware
    
    def to_bytes(text, encoding=None, errors='strict'):
        if isinstance(text, bytes):
            return text
        if not isinstance(text, six.string_types):
            raise TypeError('to_bytes must receive a unicode, str or bytes '
                            'object, got %s' % type(text).__name__)
        if encoding is None:
            encoding = 'utf-8'
        return text.encode(encoding, errors)
        
    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            PROXIES = [
                {'ip_port': '111.11.228.75:80', 'user_pass': ''},
                {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
            ]
            proxy = random.choice(PROXIES)
            if proxy['user_pass'] is not None:
                request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
                encoded_user_pass = base64.encodestring(to_bytes(proxy['user_pass']))
                request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass)
                print "**************ProxyMiddleware have pass************" + proxy['ip_port']
            else:
                print "**************ProxyMiddleware no pass************" + proxy['ip_port']
                request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
    
    DOWNLOADER_MIDDLEWARES = {
       'step8_king.middlewares.ProxyMiddleware': 500,
    }
    
"""

"""
20. HTTPS access
    There are two cases for HTTPS:
    1. the target site uses a trusted certificate (supported by default)
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"
        
    2. the target site uses a custom (e.g. self-signed) certificate
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"
        
        # https.py
        from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
        from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)
        
        class MySSLFactory(ScrapyClientContextFactory):
            def getCertificateOptions(self):
                from OpenSSL import crypto
                v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
                v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
                return CertificateOptions(
                    privateKey=v1,  # PKey object
                    certificate=v2,  # X509 object
                    verify=False,
                    method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                )
    Other:
        Related classes
            scrapy.core.downloader.handlers.http.HttpDownloadHandler
            scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
            scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
        Related settings
            DOWNLOADER_HTTPCLIENTFACTORY
            DOWNLOADER_CLIENTCONTEXTFACTORY

"""



"""
21. Spider middleware
    class SpiderMiddleware(object):

        def process_spider_input(self,response, spider):
            '''
            Called after the download completes, before the response is handed to parse
            :param response: 
            :param spider: 
            :return: 
            '''
            pass
    
        def process_spider_output(self,response, result, spider):
            '''
            Called with the results returned by the spider
            :param response:
            :param result:
            :param spider:
            :return: must return an iterable containing Request or Item objects
            '''
            return result
    
        def process_spider_exception(self,response, exception, spider):
            '''
            Called when an exception is raised
            :param response:
            :param exception:
            :param spider:
            :return: None to let later middleware keep handling the exception; an iterable containing Response or Item objects to hand to the scheduler or pipeline
            '''
            return None
    
    
        def process_start_requests(self,start_requests, spider):
            '''
            Called when the spider starts
            :param start_requests:
            :param spider:
            :return: an iterable containing Request objects
            '''
            return start_requests
    
    Built-in spider middlewares:
        'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
        'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': 500,
        'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
        'scrapy.contrib.spidermiddleware.urllength.UrlLengthMiddleware': 800,
        'scrapy.contrib.spidermiddleware.depth.DepthMiddleware': 900,

"""
# from scrapy.contrib.spidermiddleware.referer import RefererMiddleware
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
   # 'step8_king.middlewares.SpiderMiddleware': 543,
}


"""
22. Downloader middleware
    class DownMiddleware1(object):
        def process_request(self, request, spider):
            '''
            Each request that needs to be downloaded passes through every downloader middleware's process_request
            :param request:
            :param spider:
            :return:
                None: continue with the remaining middleware and the downloader
                Response object: stop calling process_request and start calling process_response
                Request object: stop the middleware chain and send the Request back to the scheduler
                raise IgnoreRequest: stop calling process_request and start calling process_exception
            '''
            pass
    
    
    
        def process_response(self, request, response, spider):
            '''
            Called with each response returned by the downloader
            :param response:
            :param result:
            :param spider:
            :return:
                Response object: handed on to the other middlewares' process_response
                Request object: stop the middleware chain; the request is rescheduled for download
                raise IgnoreRequest: Request.errback is called
            '''
            print('response1')
            return response
    
        def process_exception(self, request, exception, spider):
            '''
            Called when the download handler or a downloader middleware's process_request() raises an exception
            :param response:
            :param exception:
            :param spider:
            :return:
                None: let later middleware keep handling the exception
                Response object: stop calling the remaining process_exception methods
                Request object: stop the middleware chain; the request is rescheduled for download
            '''
            return None

    
    Default downloader middlewares
    {
        'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
        'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
        'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
        'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
        'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
        'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
        'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
        'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
        'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
        'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
        'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
    }

"""
# from scrapy.contrib.downloadermiddleware.httpauth import HttpAuthMiddleware
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'step8_king.middlewares.DownMiddleware1': 100,
#    'step8_king.middlewares.DownMiddleware2': 500,
# }
settings.py

12. A Simple Home-made Scrapy Framework

Prerequisites: Twisted's reactor, defer, DeferredList, inlineCallbacks and getPage; see https://www.cnblogs.com/silence-cho/p/9898984.html

Project layout and code:

 

#coding:utf-8

from twisted.web.client import defer,getPage
from twisted.internet import reactor

from Queue import Queue

class Request(object):

    def __init__(self,url,callback):
        self.url = url
        self.callback = callback

class HttpResponse(object):

    def __init__(self,content,request):
        self.response = content
        self.request = request
    @property
    def text(self):
        return self.response

class Scheduler(object):
    def __init__(self):
        self.q = Queue()

    def open(self):
        pass

    def enqueue_request(self,req):
        self.q.put(req)

    def next_request(self):
        try:
            req = self.q.get(block=False)
        except Exception as e:
            req = None
        return req

    def size(self):
        return self.q.qsize()

class ExecutionEngine(object):
    def __init__(self):
        self._close = None
        self.scheduler = None
        self.max = 5
        self.crawling = []
    def get_response_callback(self,content,request):
        print request.url
        # print self.crawling
        self.crawling.remove(request)
        # print self.crawling
        response = HttpResponse(content,request)
        result = request.callback(response)
        import types
        if isinstance(result,types.GeneratorType):
            for req in result:
                self.scheduler.enqueue_request(req)

    def _next_request(self):

        if self.scheduler.size()==0 and len(self.crawling)==0:
            self._close.callback(None)
            return

        while len(self.crawling) < self.max:
            req = self.scheduler.next_request()
            if not req:
                return
            #print req.url
            self.crawling.append(req)
            #print self.crawling
            d = getPage(req.url.encode('utf-8'))
            d.addCallback(self.get_response_callback,req)
            d.addCallback(lambda _:reactor.callLater(0,self._next_request))

    @defer.inlineCallbacks
    def open_spider(self,start_requests):
        self.scheduler = Scheduler()
        yield self.scheduler.open()
        while True:
            try:
                req = next(start_requests)
                self.scheduler.enqueue_request(req)
            except StopIteration as e:
                break
        reactor.callLater(0, self._next_request)

    @defer.inlineCallbacks
    def start(self):
        self._close = defer.Deferred()
        yield self._close

class Crawler(object):
    def __init__(self,spider_cls_path):
        self.spider_cls_path = spider_cls_path

    def _create_engine(self):
        return ExecutionEngine()

    def _create_spider(self):
        module_path, cls_name = self.spider_cls_path.rsplit('.',1)
        import importlib
        module = importlib.import_module(module_path)
        cls = getattr(module,cls_name)
        #print cls,'----'
        return cls()

    @defer.inlineCallbacks
    def crawl(self):
        spider = self._create_spider()
        start_requests = iter(spider.start_request())
        engine = self._create_engine()
        yield engine.open_spider(start_requests)
        yield engine.start()

class CrawlProcess(object):
    def __init__(self):
        self.active = set()

    def crawl(self,spider_cls_path):
        crawler =Crawler(spider_cls_path)
        d=crawler.crawl()
        self.active.add(d)


    def start(self):
        dd=defer.DeferredList(self.active)
        dd.addBoth(lambda _:reactor.stop())
        reactor.run()


class Command(object):
    def run(self):
        spider_cls_paths=['spider.chouti.ChoutiSpider','spider.cnblogs.CnblogsSpider'] #'spider.cnblogs.CnblogsSpider'
        crawlProcess = CrawlProcess()
        for spider_cls_path in spider_cls_paths:
            crawlProcess.crawl(spider_cls_path)
        crawlProcess.start()

if __name__ == '__main__':
    c = Command()
    c.run()
engine.py
#coding:utf-8

from engine import Request

class CnblogsSpider(object):
    name = 'Cnblogs'

    def start_request(self):
        start_url = ['https://www.cnblogs.com/','https://www.baidu.com/' ] #'https://www.baidu.com/'
        for url in start_url:
            yield Request(url, self.parse)
    def parse(self, response):

        print response
        #print response.text
cnblogs.py
#coding:utf-8

from engine import Request
class ChoutiSpider(object):
    name = 'chouti'

    def start_request(self):
        start_url = ['https://dig.chouti.com/','https://www.baidu.com/']
        for url in start_url:
            yield Request(url, self.parse)

    def parse(self,response):
        #print response
        yield Request('https://www.sina.com.cn/',self.call)
        #print response.text

    def call(self,response):
        print 'crawled sina.com.cn'
chouti.py

 

Reference blog: http://www.cnblogs.com/wupeiqi/articles/6229292.html