Scrapy Framework: Request Objects

Date: 2019-11-22


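Scrapy uses Request objects to crawl websites.

Request objects are generated by spiders and passed through the Scheduler to the Downloader. The Downloader executes each request and returns a Response to the spider that issued it.

[Figure: Scrapy architecture]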

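1. Request objects

class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])

A Request object represents an HTTP request. It is usually generated in a Spider and executed by the Downloader, which produces a Response.

Parameters:

url (string): the URL of the request.

callback (callable): the function called with the response of this request as its first argument. If no callback is specified, the spider's parse() method is used.

method (string): the HTTP method of the request; defaults to 'GET'. (If GET does not ring a bell, learn the urllib or requests module first.)

meta (dict): the initial values for the Request.meta attribute. If given, the dict passed in is shallow copied. (If shallow copies are unfamiliar, review them before going on.)

body (str or bytes): the request body, that is, the payload sent with the request (for example the data of a POST). A str is encoded to bytes using encoding; if no body is given, an empty body is stored.

headers (dict): the headers of the request.

cookies (dict or list): the request cookies, which can be given in two forms.

1. Using a dict: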

request_with_cookies = Request(url="http://www.example.com", cookies={'currency': 'USD', 'country': 'UY'})

2. Using a list of dicts:

request_with_cookies = Request(url="http://www.example.com",
                               cookies=[{'name': 'currency',
                                        'value': 'USD',
                                        'domain': 'example.com',
                                        'path': '/currency'}])

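The latter form lets you set the domain and path attributes of the cookie. This is only useful if the cookies are saved for later requests.

When a site returns cookies in a response, they are stored and sent again in future requests, which is how a regular browser behaves. To keep a request's cookies from being merged with the stored ones, set dont_merge_cookies to True in Request.meta: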

request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'},
                               meta={'dont_merge_cookies': True})

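encoding (string): the encoding of the request; defaults to 'utf-8'.

priority (int): the priority of the request; defaults to 0. The Scheduler processes requests with a higher priority value earlier.

dont_filter (boolean): whether the Scheduler should skip filtering this request. Set it to True to send an identical request more than once, since the Scheduler filters out duplicate requests by default. Use it with care!

errback (callable): the function called if an exception is raised while processing the request.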

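Attributes and methods:

url: a string containing the URL of the request.

method: a string representing the HTTP method of the request, for example 'GET' or 'POST'.

headers: the headers of the request.

body: the request body.

meta: a dict containing arbitrary metadata for the request. It is empty for a new Request and is usually populated by Scrapy components such as extensions and middlewares. The dict is shallow copied when the request is cloned with copy() or replace().

copy(): returns a copy of the current Request.

replace([url, method, headers, body, cookies, meta, encoding, dont_filter, callback, errback]): returns a Request with the same attributes, except for those given new values through the arguments.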

Passing data to callback functions

When the response of a request has been downloaded, the callback is called with the Response object as its first argument.

def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)


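In some cases you want to pass arguments between callbacks; Request.meta can be used for that. (It feels a little like a global variable, although it travels with a single request.)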

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item


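Using errback to catch exceptions in request processing

The errback of a request is called when an exception is raised while the request is being processed.

It receives a Twisted Failure instance as its first argument and can be used to track connection timeouts, DNS errors and so on.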

import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)


Special keys in Request.meta

Request.meta can contain arbitrary data, but some keys are given special meaning by Scrapy and its built-in extensions:

dont_redirect (dont is simply don't)
dont_retry
handle_httpstatus_list
handle_httpstatus_all
dont_merge_cookies (see cookies parameter of Request constructor)
cookiejar
dont_cache
redirect_urls
bindaddress
dont_obey_robotstxt
download_timeout (download timeout)
download_maxsize
download_latency (download latency)
proxy

2. Request subclasses

FormRequest object

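FormRequest extends the Request class with functionality for handling HTML form data.

class scrapy.http.FormRequest(url[, formdata, ...])

The FormRequest constructor adds a formdata parameter; the remaining parameters are the same as for the Request class and are not repeated here.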

Parameters:

formdata (dict or iterable of tuples): a dict, or an iterable of (key, value) tuples, containing the HTML form data. It is URL-encoded and assigned to the body of the request.

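In addition to the standard Request methods, FormRequest objects support the following class method:

classmethod from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])

Returns a new FormRequest object whose form field values are pre-populated with those found in the HTML <form> element of the given response.

By default, the method automatically simulates a click on any form control that looks clickable, such as <input type="submit">. This is convenient, but it can cause problems that are hard to debug. For example, the default from_response() behaviour may not be appropriate for forms that are filled in and/or submitted using JavaScript. Set dont_click to True to disable the click, or use the clickdata parameter to change which control is clicked.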

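Parameters:

response (Response object): the response containing the HTML form that is used to pre-populate the form fields.

formname (string): if given, the form whose name attribute is set to this value is used.

formid (string): if given, the form whose id attribute is set to this value is used.

formxpath (string): if given, the first form that matches the XPath is used.

formcss (string): if given, the first form that matches the CSS selector is used.

formnumber (integer): the index of the form to use when the response contains several forms. The first one, which is also the default, is 0.

formdata (dict): fields to override in the form data. If a field is already present in the <form> element of the response, its value is overridden by the one passed in this parameter.

clickdata (dict): attributes used to look up the control that is clicked. If it is not given, the form data is submitted simulating a click on the first clickable element.

dont_click (boolean): if True, the form data is submitted without clicking any element.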

Request usage examples

Using FormRequest to send data via HTTP POST

To simulate an HTML form POST in your spider and send a couple of key-value fields, return a FormRequest object from the spider:

return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]


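Using FormRequest.from_response() to simulate a user login

Websites commonly provide pre-populated form fields through <input type="hidden"> elements, such as session-related data or authentication tokens (on login pages). When scraping, you want these fields to be filled in automatically while you override only a few of them, such as the user name and password. FormRequest.from_response() does this: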

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check login succeed before going on
        if "authentication failed" in response.text:
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...