The Scrapy Framework: Downloader Middleware

Introduction

  Middleware is a core concept in Scrapy. With middleware you can modify data before a crawler's request goes out or after the response comes back, which makes it possible to build crawlers that adapt to different situations.

The Chinese word for "middleware" (中間件) differs from the "man-in-the-middle" (中間人) covered in an earlier chapter by only one character, and what they do is indeed very similar: both can intercept data in transit, modify it, and pass it along. The difference is that middleware is a component the developer adds deliberately, while a man-in-the-middle is generally inserted maliciously. Middleware mainly serves development, whereas a man-in-the-middle is mostly used to steal or forge data, or even to attack.

Scrapy has two kinds of middleware: downloader middleware (Downloader Middleware) and spider middleware (Spider Middleware).

The official Scrapy documentation describes downloader middleware as follows.

The downloader middleware is a framework of hooks into Scrapy's request/response processing. It is a light, low-level system for globally altering Scrapy's requests and responses.

That description sounds rather convoluted; put in plainer terms, it covers things like switching proxy IPs, switching Cookies, switching the User-Agent, and automatic retries.

Without any middleware, the crawler's workflow looks like the figure below.

 

With middleware in place, the crawler's workflow looks like the figure below.

Proxy Middleware

  Switching proxy IPs is extremely common in crawler development; sometimes every single request needs a randomly chosen proxy IP.

A middleware is simply a Python class. As long as every request "passes through" this class before the crawler visits the site, the class can assign the request a new proxy IP, which makes dynamic proxy switching possible.

After creating a Scrapy project, the project folder contains a file named middlewares.py, whose contents look like this:

from scrapy import signals

class MeiziDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

The file Scrapy auto-generates is named middlewares.py; the trailing "s" marks the plural, meaning the file can hold many middlewares. Besides the downloader middleware shown above, the project template also generates a spider middleware; that type will be covered in the third article. For now, let's create a middleware that automatically switches the proxy IP.

Add the following code to middlewares.py:

class ProxyMiddleware(object):

    def process_request(self, request, spider):
        proxy = random.choice(settings['PROXIES'])
        request.meta['proxy'] = proxy

To change a request's proxy, add an entry to the request's meta with the key 'proxy' and the proxy address as the value (including the scheme, e.g. 'http://IP:port').
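
This meta key is not limited to middleware: a request can also carry its own proxy straight from the spider. A minimal sketch of that idea (the spider name is made up, and the proxy address is simply one entry from the list added to settings.py below):

import scrapy

class SingleProxySpider(scrapy.Spider):
    name = 'single_proxy_demo'  # hypothetical spider, for illustration only

    def start_requests(self):
        # meta['proxy'] is read by whichever proxy middleware is active (built-in or custom)
        yield scrapy.Request('http://httpbin.org/get',
                             meta={'proxy': 'http://1.85.116.218:8118'},
                             callback=self.parse)

    def parse(self, response):
        # httpbin echoes the IP it saw, so the proxy's IP should appear here
        self.logger.info(response.text)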

Because the ProxyMiddleware above uses random and settings, both need to be imported at the top of middlewares.py:

import random
from scrapy.conf import settings

Downloader middleware has a method named process_request(); the code in this method runs before every request the crawler sends.
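
One caveat: from scrapy.conf import settings only works on old Scrapy versions; scrapy.conf was later deprecated and removed. On current Scrapy, a common pattern is to receive the settings through from_crawler(). A sketch of the same ProxyMiddleware written that way (same behaviour, just a different way of reaching PROXIES):

import random

class ProxyMiddleware(object):

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes everything defined in settings.py
        return cls(crawler.settings.getlist('PROXIES'))

    def process_request(self, request, spider):
        # attach a randomly chosen proxy to every outgoing request
        request.meta['proxy'] = random.choice(self.proxies)
        return None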

Open settings.py and add a few proxy IPs first:

PROXIES = ['https://114.217.243.25:8118',
          'https://125.37.175.233:8118',
          'http://1.85.116.218:8118']

Note that proxies come in types: check whether a proxy is an HTTP proxy or an HTTPS proxy before using it. Using the wrong type will make the target site unreachable.
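
If the pool contains both kinds, one simple safeguard is to pick a proxy whose type matches the scheme of the URL being requested. A rough sketch, assuming two hypothetical settings HTTP_PROXIES and HTTPS_PROXIES:

import random

class SchemeAwareProxyMiddleware(object):

    def __init__(self, http_proxies, https_proxies):
        self.http_proxies = http_proxies
        self.https_proxies = https_proxies

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist('HTTP_PROXIES'),
                   crawler.settings.getlist('HTTPS_PROXIES'))

    def process_request(self, request, spider):
        # match the proxy type to the scheme of the request URL
        if request.url.startswith('https'):
            request.meta['proxy'] = random.choice(self.https_proxies)
        else:
            request.meta['proxy'] = random.choice(self.http_proxies)
        return None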

Activating the Middleware

Once the middleware is written, it has to be enabled in settings.py. Find the following commented-out block in settings.py:

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'AdvanceSpider.middlewares.MyCustomDownloaderMiddleware': 543,
#}

Uncomment it and modify it so that it references ProxyMiddleware:

DOWNLOADER_MIDDLEWARES = {
  'AdvanceSpider.middlewares.ProxyMiddleware': 543,
}

This is just a dict: each key is the dotted path to a middleware, and the number after it is that middleware's order. Because middlewares run in order, the order becomes crucial whenever a later middleware depends on an earlier one.

How should you choose the number? The simplest approach is to start at 543 and add one for each new middleware; this rarely causes any problems. If you want to do it more professionally, you need to know the order of Scrapy's built-in middlewares, listed below.

DOWNLOADER_MIDDLEWARES_BASE
{
    'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
    'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
    'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
    'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
    'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
    'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
    'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
}

The smaller the number, the earlier the middleware runs. Take Scrapy's first built-in middleware, RobotsTxtMiddleware, as an example: it first checks whether ROBOTSTXT_OBEY in settings.py is True or False. If it is True, meaning robots.txt must be respected, it checks whether the URL about to be visited is allowed; if it is not, the request is cancelled outright and none of the follow-up operations tied to that request take place. (Note that the scrapy.contrib.downloadermiddleware paths listed above come from older Scrapy releases; since Scrapy 1.0 these classes live under scrapy.downloadermiddlewares, and the old scrapy.contrib paths are deprecated.)
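
For crawlers that deliberately ignore robots.txt (and accept the consequences of doing so), this check is controlled by a single setting in settings.py:

# settings.py: False skips the robots.txt check performed by RobotsTxtMiddleware
ROBOTSTXT_OBEY = False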

Developer-defined middlewares are inserted among Scrapy's built-in ones according to their numbers. The crawler runs all the middlewares in order from 100 to 900, until every middleware has run or one of them cancels the request. (For process_response() the chain runs in the opposite direction, from larger numbers back to smaller ones.)

Scrapy actually ships with its own UA middleware (UserAgentMiddleware), proxy middleware (HttpProxyMiddleware), and retry middleware (RetryMiddleware). So, in principle, before developing your own versions of these three middlewares you should first disable Scrapy's built-in ones. To disable a built-in middleware, set its order to None in settings.py:

DOWNLOADER_MIDDLEWARES = {
  'AdvanceSpider.middlewares.ProxyMiddleware': 543,
  'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
  'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': None
}

With this configured, run the crawler; it will set a random proxy before every request. To see the proxy middleware in action, you can use the following practice page:

http://httpbin.org/get

This page returns the IP address it sees the request coming from.
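
The response is JSON; a trimmed, illustrative example of what it looks like (the field values here are made up):

{
    "args": {},
    "headers": {
        "Host": "httpbin.org",
        "User-Agent": "Scrapy/1.6.0 (+https://scrapy.org)"
    },
    "origin": "117.191.11.102",
    "url": "http://httpbin.org/get"
}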

Case demo:

Free proxies: http://www.goubanjia.com/

The spider:

import scrapy
import json

class MeinvSpider(scrapy.Spider):
    name = 'meinv'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
     
        str_info = response.body.decode()
        dic_info = json.loads(str_info)
        print(dic_info["origin"])
The middleware (in middlewares.py):

import random
from meizi.settings import PROXIES
class ProxyMiddleware(object):

    def process_request(self, request, spider):
        proxy = random.choice(PROXIES)
        request.meta['proxy'] = proxy

        return None
The settings (in settings.py):

# proxy pool
PROXIES = ['http://117.191.11.102:80',
          'http://117.191.11.107:80',
          'http://117.191.11.72:8080']

# enable the proxy middleware
DOWNLOADER_MIDDLEWARES = {
   'meizi.middlewares.ProxyMiddleware': 543,
}

Output: the spider prints the origin IP reported by httpbin.org (the proxy's IP when the proxy is working).

UA Middleware

Developing a UA middleware is almost the same as developing the proxy middleware: it randomly picks an entry from a UA list configured in settings.py and puts it into the request headers. The code is as follows:

class UAMiddleware(object):

    def process_request(self, request, spider):
        ua = random.choice(settings['USER_AGENT_LIST'])
        request.headers['User-Agent'] = ua

Unlike proxy IPs, UAs never expire, so once you have collected a few dozen of them you can keep using them indefinitely. Some common UAs:

USER_AGENT_LIST = [
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
  "Dalvik/1.6.0 (Linux; U; Android 4.2.1; 2013022 MIUI/JHACNBL30.0)",
  "Mozilla/5.0 (Linux; U; Android 4.4.2; zh-cn; HUAWEI MT7-TL00 Build/HuaweiMT7-TL00) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
  "AndroidDownloadManager",
  "Apache-HttpClient/UNAVAILABLE (java 1.4)",
  "Dalvik/1.6.0 (Linux; U; Android 4.3; SM-N7508V Build/JLS36C)",
  "Android50-AndroidPhone-8000-76-0-Statistics-wifi",
  "Dalvik/1.6.0 (Linux; U; Android 4.4.4; MI 3 MIUI/V7.2.1.0.KXCCNDA)",
  "Dalvik/1.6.0 (Linux; U; Android 4.4.2; Lenovo A3800-d Build/LenovoA3800-d)",
  "Lite 1.0 ( http://litesuits.com )",
  "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727)",
  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
  "Mozilla/5.0 (Linux; U; Android 4.1.1; zh-cn; HTC T528t Build/JRO03H) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30; 360browser(securitypay,securityinstalled); 360(android,uppayplugin); 360 Aphone Browser (2.0.4)",
]

After configuring the UA list, activate the middleware in the downloader-middleware section of settings.py, then use the practice page to verify that the UA really changes on every request. The practice page address is:

http://httpbin.org/get

Case demo:

The spider:

import scrapy
import json

class MeinvSpider(scrapy.Spider):
    name = 'meinv'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):

        str_info = response.body.decode()
        dic_info = json.loads(str_info)
        print(dic_info["headers"]['User-Agent'])
The middleware (in middlewares.py):

import random
from meizi.settings import USER_AGENT_LIST
class UAMiddleware(object):

    def process_request(self, request, spider):
        ua = random.choice(USER_AGENT_LIST)
        request.headers['User-Agent'] = ua

        return None
The settings (in settings.py):

USER_AGENT_LIST = [
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
  "Dalvik/1.6.0 (Linux; U; Android 4.2.1; 2013022 MIUI/JHACNBL30.0)",
  "Mozilla/5.0 (Linux; U; Android 4.4.2; zh-cn; HUAWEI MT7-TL00 Build/HuaweiMT7-TL00) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
  "AndroidDownloadManager",
  "Apache-HttpClient/UNAVAILABLE (java 1.4)",
  "Dalvik/1.6.0 (Linux; U; Android 4.3; SM-N7508V Build/JLS36C)",
  "Android50-AndroidPhone-8000-76-0-Statistics-wifi",
  "Dalvik/1.6.0 (Linux; U; Android 4.4.4; MI 3 MIUI/V7.2.1.0.KXCCNDA)",
  "Dalvik/1.6.0 (Linux; U; Android 4.4.2; Lenovo A3800-d Build/LenovoA3800-d)",
  "Lite 1.0 ( http://litesuits.com )",
  "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727)",
  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
  "Mozilla/5.0 (Linux; U; Android 4.1.1; zh-cn; HTC T528t Build/JRO03H) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30; 360browser(securitypay,securityinstalled); 360(android,uppayplugin); 360 Aphone Browser (2.0.4)",
]

DOWNLOADER_MIDDLEWARES = {
   'meizi.middlewares.ProxyMiddleware': None,
   'meizi.middlewares.UAMiddleware': 543,
}

Output: the spider prints the User-Agent header received by httpbin.org, which should change from request to request.

Cookies Middleware

For sites that require login, Cookies can be used to stay logged in. If you write a separate small program that uses Selenium to log into the site over and over with different accounts, you can collect many different Cookies. Since a Cookie is essentially just a piece of text, it can be stored in Redis. Then, whenever the Scrapy crawler is about to request a page, it can read a Cookie from Redis and hand it to the request, so the crawler stays logged in the whole time.
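
As a rough illustration of this idea only (not a finished implementation), here is a minimal sketch of such a middleware. It assumes the Cookies are stored as JSON strings in a Redis list whose name, 'cookies_pool', is made up here, and that the redis package is installed:

import json
import random

import redis

class CookiesMiddleware(object):

    def __init__(self):
        # local Redis instance; adjust host/port/db to your environment
        self.client = redis.StrictRedis(host='127.0.0.1', port=6379, db=0)

    def process_request(self, request, spider):
        # pick one cookie set at random from the hypothetical 'cookies_pool' list
        cookie_json = random.choice(self.client.lrange('cookies_pool', 0, -1))
        # each entry is assumed to be a JSON object such as {"session": "..."}
        request.cookies = json.loads(cookie_json)
        return None

Because Scrapy's own CookiesMiddleware sits at order 700, a middleware like this should be registered with a smaller number (for example 543) so that the cookies it sets are picked up when the request is actually sent.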

Take the following practice page as an example:

http://exercise.kingname.info/exercise_login_success

If you access it with Scrapy directly, what you get is the source code of the login page, as shown below.
