Advanced Usage of Requests in Python

Date: 2019-11-11


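# Advanced Usage

This document covers some of Requests' more advanced features.

## Session Objects

The Session object lets you persist certain parameters across requests. It also persists cookies across all requests made from the same Session instance.

A Session object has all the methods of the main Requests API.

Let's persist some cookies across requests: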

import requests

s = requests.Session()

s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get("http://httpbin.org/cookies")

print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'

Sessions can also be used to provide default data to the request methods. This is done by setting properties on a Session object:

s = requests.Session()
s.auth = ('user', 'pass')
s.headers.update({'x-test': 'true'})

# both 'x-test' and 'x-test2' are sent
s.get('http://httpbin.org/headers', headers={'x-test2': 'true'})

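Any dictionaries that you pass to a request method are merged with the session-level values that are set. The method-level parameters override session parameters.

Remove a Value From a Dict Parameter
Sometimes you'll want to omit session-level keys from a dict parameter. To do this, simply set that key's value to None in the method-level parameter, and it will be omitted automatically.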
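The merge rules can be checked without sending anything over the network. The following is only an illustrative sketch: it builds a request with Session.prepare_request() and inspects the merged headers. The `x-drop` header and the `overridden` value are made up for the demonstration.

```python
import requests

s = requests.Session()
s.headers.update({'x-test': 'true', 'x-drop': 'session-value'})

# Method-level headers: one new key, one override, one key set to None.
req = requests.Request(
    'GET',
    'http://httpbin.org/headers',
    headers={'x-test2': 'true', 'x-test': 'overridden', 'x-drop': None},
)

# prepare_request() merges session-level and method-level settings.
prepped = s.prepare_request(req)

print(prepped.headers['x-test'])    # the method-level value wins
print(prepped.headers['x-test2'])   # added by the method-level dict
print('x-drop' in prepped.headers)  # the None value removed the session key
```

Nothing is sent here, so this is a cheap way to see what a session would put on the wire.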

All values contained within a session are directly available to you. See the Session API documentation to learn more.

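## Request and Response Objects

Whenever you call requests.*(), you are doing two major things. First, you are constructing a Request object, which will be sent off to a server to request or query some resource. Second, a Response object is generated once Requests gets a response back from the server. The Response object contains all of the information returned by the server and also contains the Request object you created originally. Here is a simple request to get some very important information from Wikipedia's servers: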

>>> r = requests.get('http://en.wikipedia.org/wiki/Monty_Python')

If we want to access the headers the server sent back to us, we do this:

>>> r.headers
{'content-length': '56170', 'x-content-type-options': 'nosniff', 'x-cache':
'HIT from cp1006.eqiad.wmnet, MISS from cp1010.eqiad.wmnet', 'content-encoding':
'gzip', 'age': '3080', 'content-language': 'en', 'vary': 'Accept-Encoding,Cookie',
'server': 'Apache', 'last-modified': 'Wed, 13 Jun 2012 01:33:50 GMT',
'connection': 'close', 'cache-control': 'private, s-maxage=0, max-age=0,
must-revalidate', 'date': 'Thu, 14 Jun 2012 12:59:39 GMT', 'content-type':
'text/html; charset=UTF-8', 'x-cache-lookup': 'HIT from cp1006.eqiad.wmnet:3128,
MISS from cp1010.eqiad.wmnet:80'}

However, if we want to get the headers we sent to the server, we simply access the request, and then the request's headers:

>>> r.request.headers
{'Accept-Encoding': 'identity, deflate, compress, gzip',
'Accept': '*/*', 'User-Agent': 'python-requests/0.13.1'}

## Prepared Request

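Whenever you receive a Response object from an API call or a Session call, its request attribute is actually the PreparedRequest that was used. In some cases you may wish to do some extra work on the body or headers (or anything else) before sending a request. A simple recipe for this is the following: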

from requests import Request, Session

s = Session()
req = Request('GET', url,
    data=data,
    headers=headers
)
prepped = req.prepare()

# do something with prepped.body
# do something with prepped.headers

resp = s.send(prepped,
    stream=stream,
    verify=verify,
    proxies=proxies,
    cert=cert,
    timeout=timeout
)

print(resp.status_code)

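Since you are not doing anything special with the Request object, you prepare it immediately and modify the PreparedRequest object. You then send it along with the other parameters you would have passed to requests.* or Session.*.

However, the code above loses some of the advantages of having a Requests Session object. In particular, Session-level state such as cookies will not be applied to your request. To get a PreparedRequest with that state applied, replace the call to Request.prepare() with a call to Session.prepare_request(), like this: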

from requests import Request, Session

s = Session()
req = Request('GET', url,
    data=data,
    headers=headers
)

prepped = s.prepare_request(req)

# do something with prepped.body
# do something with prepped.headers

resp = s.send(prepped,
    stream=stream,
    verify=verify,
    proxies=proxies,
    cert=cert,
    timeout=timeout
)

print(resp.status_code)
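The difference between the two ways of preparing a request can be seen offline. The following is a minimal sketch, assuming a cookie set directly on the session's jar; the cookie name and value are borrowed from the earlier session example.

```python
from requests import Request, Session

s = Session()
s.cookies.set('sessioncookie', '123456789')  # session-level state

req = Request('GET', 'http://httpbin.org/cookies')

plain = req.prepare()              # knows nothing about the session
stateful = s.prepare_request(req)  # session cookies and headers applied

print(plain.headers.get('Cookie'))     # None
print(stateful.headers.get('Cookie'))  # sessioncookie=123456789
```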

## SSL Cert Verification

Requests can verify SSL certificates for HTTPS requests, just like a web browser. To check a host's SSL certificate, you can use the verify argument:

>>> requests.get('https://kennethreitz.com', verify=True)
requests.exceptions.SSLError: hostname 'kennethreitz.com' doesn't match either of '*.herokuapp.com', 'herokuapp.com'

I don't have SSL set up on this domain, so it fails. GitHub does, though:

>>> requests.get('https://github.com', verify=True)
<Response [200]>

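For private certificates, you can also pass the path to a CA_BUNDLE file to verify. You can also set the REQUESTS_CA_BUNDLE environment variable.

Requests can also skip verification of the SSL certificate if you set verify to False.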

>>> requests.get('https://kennethreitz.com', verify=False)
<Response [200]>

By default, verify is set to True. The verify option applies only to host certificates.
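How the environment variable interacts with the verify argument can be inspected without making a request. The sketch below is illustrative only. It asks a session which settings it would use through Session.merge_environment_settings(). The bundle path is a hypothetical placeholder that is never opened here.

```python
import os
import requests

# Hypothetical bundle path; Requests only opens the file when a request is sent.
os.environ['REQUESTS_CA_BUNDLE'] = '/etc/ssl/private-ca/bundle.pem'

s = requests.Session()

# No explicit verify value: the environment variable is picked up.
from_env = s.merge_environment_settings(
    'https://example.org', proxies={}, stream=None, verify=None, cert=None
)
print(from_env['verify'])

# An explicit verify=False still wins over the environment variable.
explicit = s.merge_environment_settings(
    'https://example.org', proxies={}, stream=None, verify=False, cert=None
)
print(explicit['verify'])
```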

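You can also specify a local cert to use as a client-side certificate, either as a single file (containing the private key and the certificate) or as a tuple of both files' paths: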

>>> requests.get('https://kennethreitz.com', cert=('/path/server.crt', '/path/key'))
<Response [200]>

If you specify a wrong path or an invalid cert:

>>> requests.get('https://kennethreitz.com', cert='/wrong_path/server.pem')
SSLError: [Errno 336265225] _ssl.c:347: error:140B0009:SSL routines:SSL_CTX_use_PrivateKey_file:PEM lib

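## Body Content Workflow

By default, when you make a request, the body of the response is downloaded immediately. You can override this behavior with the stream parameter and defer downloading the response body until you access the Response.content attribute: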

tarball_url = 'https://github.com/kennethreitz/requests/tarball/master'
r = requests.get(tarball_url, stream=True)

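At this point only the response headers have been downloaded and the connection remains open, which lets us make content retrieval conditional: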

if int(r.headers['content-length']) < TOO_LONG:
    content = r.content
    ...

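You can further control the workflow with the Response.iter_content and Response.iter_lines methods. Alternatively, you can read the undecoded body from the underlying urllib3 urllib3.response.HTTPResponse at Response.raw.

If you set stream to True when making a request, Requests cannot release the connection back to the pool unless you consume all the data or call Response.close. This can lead to inefficient use of connections. If you find yourself partially reading response bodies (or not reading them at all) while using stream=True, you should consider using contextlib.closing, as in this example: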

from contextlib import closing

with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
    # Do things with the response here.
    pass
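A common use of this workflow is saving a large body to disk in chunks. The sketch below is one possible shape, not a recipe from the original article. `save_stream` is a hypothetical helper, and a hand-built Response stands in for `requests.get(url, stream=True)` so that the code runs offline. It also relies on Response objects acting as context managers, which recent Requests releases support.

```python
import io
import os
import tempfile

import requests


def save_stream(response, path, chunk_size=8192):
    """Write a streamed response body to `path` chunk by chunk (hypothetical helper)."""
    written = 0
    with response:  # closes the response, and with it the connection
        with open(path, 'wb') as fh:
            for chunk in response.iter_content(chunk_size=chunk_size):
                fh.write(chunk)
                written += len(chunk)
    return written


# Stand-in for requests.get(url, stream=True): a Response built by hand.
fake = requests.Response()
fake.status_code = 200
fake.raw = io.BytesIO(b'x' * 20000)

target = os.path.join(tempfile.mkdtemp(), 'body.bin')
size = save_stream(fake, target)
print(size)
```

Because the body is written as it arrives, memory use stays bounded by the chunk size rather than by the size of the download.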

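## Keep-Alive (Persistent Connections)

Good news: thanks to urllib3, keep-alive is handled automatically within a session! Any requests that you make within a session will automatically reuse the appropriate connection!

Note that connections are only released back to the pool for reuse once all body data has been read, so be sure to either set stream to False or read the content property of the Response object.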

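## Streaming Uploads

Requests supports streaming uploads, which allow you to send large streams or files without reading them into memory first. To stream an upload, simply provide a file-like object for your body: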

with open('massive-body', 'rb') as f:
    requests.post('http://some.url/streamed', data=f)

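## Chunk-Encoded Requests

Requests also supports chunked transfer encoding for outgoing and incoming requests. To send a chunk-encoded request, simply provide a generator (or any iterator without a length) for your body: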

def gen():
    yield 'hi'
    yield 'there'

requests.post('http://some.url/chunked', data=gen())
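Which of the two upload modes Requests picks shows up in the prepared headers. The following sketch is illustrative and sends nothing. An in-memory BytesIO stands in for an open file, and the URL is the placeholder host used above.

```python
import io
import requests


def gen():
    yield b'hi'
    yield b'there'


url = 'http://some.url/upload'

# A file-like object has a discoverable size, so Requests sets Content-Length.
streamed = requests.Request('POST', url, data=io.BytesIO(b'hithere')).prepare()

# A generator has no length, so Requests falls back to chunked encoding.
chunked = requests.Request('POST', url, data=gen()).prepare()

print(streamed.headers.get('Content-Length'))
print(chunked.headers.get('Transfer-Encoding'))
```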

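## POST Multiple Multipart-Encoded Files

You can send multiple files in one request. For example, suppose you want to upload image files to an HTML form with a multiple-file field 'images':

<input type="file" name="images" multiple="true" required="true"/>

To do that, just set files to a list of tuples of (form_field_name, file_info):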

>>> url = 'http://httpbin.org/post'
>>> multiple_files = [('images', ('foo.png', open('foo.png', 'rb'), 'image/png')),
                      ('images', ('bar.png', open('bar.png', 'rb'), 'image/png'))]
>>> r = requests.post(url, files=multiple_files)
>>> r.text
{
  ...
  'files': {'images': 'data:image/png;base64,iVBORw ....'}
  'Content-Type': 'multipart/form-data; boundary=3131623adb2043caaeb5538cc7aa0b3a',
  ...
}
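The encoded request can also be examined before it is sent. This is a small sketch under the assumption that in-memory bytes are an acceptable stand-in for the opened image files; the byte strings are fake content, not real PNG data.

```python
import requests

# In-memory bytes stand in for open('foo.png', 'rb') so nothing is read from disk.
multiple_files = [
    ('images', ('foo.png', b'fake png bytes', 'image/png')),
    ('images', ('bar.png', b'more fake bytes', 'image/png')),
]

prepped = requests.Request(
    'POST', 'http://httpbin.org/post', files=multiple_files
).prepare()

print(prepped.headers['Content-Type'])       # multipart/form-data; boundary=...
print(b'filename="foo.png"' in prepped.body)
print(b'filename="bar.png"' in prepped.body)
```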

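## Event Hooks

Requests has a hook system that you can use to manipulate portions of the request process, or for signal event handling.

Available hooks:

response:

The response generated from a Request.

You can assign a hook function on a per-request basis by passing a {hook_name: callback_function} dictionary to the hooks request parameter: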

hooks=dict(response=print_url)

That callback_function will receive a chunk of data as its first argument.

def print_url(r, *args, **kwargs):
    print(r.url)

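If an error occurs while executing your callback, a warning is given.

If the callback function returns a value, it is assumed to replace the data that was passed in. If the function doesn't return anything, nothing else is affected.

Let's print some request method arguments at runtime: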

>>> requests.get('http://httpbin.org', hooks=dict(response=print_url))
http://httpbin.org
<Response [200]>
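The two return-value cases can be tried offline. The sketch below is an assumption-laden illustration: `OfflineAdapter` is a hypothetical transport adapter that answers from memory, `hooks.test` is a made-up host, and the two callbacks are invented for the demonstration. Passing a list of callbacks registers several hooks for the same event.

```python
import io
import requests
from requests.adapters import BaseAdapter


class OfflineAdapter(BaseAdapter):
    """Hypothetical adapter that answers from memory, so no network is needed."""

    def send(self, request, **kwargs):
        resp = requests.Response()
        resp.status_code = 200
        resp.url = request.url
        resp.request = request
        resp.raw = io.BytesIO(b'ok')
        return resp

    def close(self):
        pass


seen = []


def record_url(r, *args, **kwargs):
    # Returns None, so the response is passed on unchanged.
    seen.append(r.url)


def tag_response(r, *args, **kwargs):
    # Returns a value, so the caller receives what this hook returns.
    r.headers['X-Hooked'] = 'yes'
    return r


s = requests.Session()
s.mount('http://hooks.test/', OfflineAdapter())

r = s.get('http://hooks.test/a', hooks={'response': [record_url, tag_response]})
print(seen)
print(r.headers['X-Hooked'])
```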

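## Custom Authentication

Requests allows you to specify your own authentication mechanism.

Any callable that is passed as the auth argument to a request method will have the opportunity to modify the request before it is dispatched.

Authentication implementations are subclasses of requests.auth.AuthBase, and are easy to define.

Requests provides two common authentication scheme implementations in requests.auth: HTTPBasicAuth and HTTPDigestAuth.

Let's pretend that we have a web service that will only respond if the X-Pizza header is set to a password value. Unlikely, but just go with it.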

from requests.auth import AuthBase

class PizzaAuth(AuthBase):
    """Attaches HTTP Pizza Authentication to the given Request object."""
    def __init__(self, username):
        # setup any auth-related data here
        self.username = username

    def __call__(self, r):
        # modify and return the request
        r.headers['X-Pizza'] = self.username
        return r

Then, we can make a request using our PizzaAuth:

>>> requests.get('http://pizzabin.org/admin', auth=PizzaAuth('kenneth'))
<Response [200]>
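Because the auth callable runs while the request is being prepared, its effect can be checked before anything is sent. The following is a minimal sketch. `HeaderTokenAuth`, the `X-Token` header, and the token value are hypothetical. The second request uses the (user, password) tuple shorthand for HTTPBasicAuth for comparison.

```python
import requests
from requests.auth import AuthBase


class HeaderTokenAuth(AuthBase):
    """Hypothetical scheme: send a token in the X-Token header."""

    def __init__(self, token):
        self.token = token

    def __call__(self, r):
        r.headers['X-Token'] = self.token
        return r


url = 'http://pizzabin.org/admin'

# prepare() runs the auth callable, so its effect is visible before sending.
custom = requests.Request('GET', url, auth=HeaderTokenAuth('s3cret')).prepare()
basic = requests.Request('GET', url, auth=('user', 'pass')).prepare()

print(custom.headers['X-Token'])
print(basic.headers['Authorization'])  # a (user, pass) tuple means HTTPBasicAuth
```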

## Streaming Requests

With requests.Response.iter_lines() you can easily iterate over streaming APIs such as the Twitter Streaming API. Simply set stream to True and iterate over the response with iter_lines():

import json
import requests

r = requests.get('http://httpbin.org/stream/20', stream=True)

for line in r.iter_lines():

    # filter out keep-alive new lines
    if line:
        print(json.loads(line))
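iter_lines() yields bytes unless it is asked to decode. The sketch below shows one way to get text lines, assuming UTF-8 is a sensible fallback when the server declares no encoding. A hand-built Response stands in for the streamed request so that the code runs offline.

```python
import io
import json
import requests

# Stand-in for requests.get(url, stream=True): a hand-built Response whose
# raw body holds newline-delimited JSON with a blank keep-alive line.
r = requests.Response()
r.status_code = 200
r.raw = io.BytesIO(b'{"id": 0}\n\n{"id": 1}\n')

# Fall back to UTF-8 when the server did not declare an encoding.
if r.encoding is None:
    r.encoding = 'utf-8'

events = []
for line in r.iter_lines(decode_unicode=True):
    if line:  # skip keep-alive blank lines
        events.append(json.loads(line))

print(events)
```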

## Proxies

If you need to use a proxy, you can configure individual requests with the proxies argument to any request method:

import requests

proxies = {
  "http": "http://10.10.1.10:3128",
  "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)

You can also configure proxies by setting the environment variables HTTP_PROXY and HTTPS_PROXY.

$ export HTTP_PROXY="http://10.10.1.10:3128"
$ export HTTPS_PROXY="http://10.10.1.10:1080"
$ python

>>> import requests
>>> requests.get("http://example.org")

To use HTTP Basic Auth with your proxy, use the http://user:password@host/ syntax:

proxies = {
    "http": "http://user:pass@10.10.1.10:3128/",
}
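Proxies can also be set once on a session instead of on every call. The sketch below is illustrative. It uses requests.utils.select_proxy() to show which entry would apply to a given URL, and the host-specific key (`http://10.20.1.128`) and its proxy port are assumptions added for the example.

```python
import requests
from requests.utils import select_proxy

s = requests.Session()
s.proxies.update({
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
    # Keys of the form scheme://hostname apply to that host only (assumed host).
    'http://10.20.1.128': 'http://10.10.1.10:5323',
})

# select_proxy() reports which entry Requests would choose for a URL.
plain_http = select_proxy('http://example.org/', s.proxies)
plain_https = select_proxy('https://example.org/', s.proxies)
host_specific = select_proxy('http://10.20.1.128/path', s.proxies)

print(plain_http, plain_https, host_specific)
```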

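## Compliance

Requests aims to comply with all relevant specifications and RFCs wherever that compliance does not cause difficulties for users. This attention to the specifications can lead to some behavior that may seem unusual to those not familiar with them.

Encodings

When you receive a response, Requests makes a guess at the encoding to use for decoding the response when you access the Response.text attribute. Requests first checks for an encoding in the HTTP headers and, if none is present, uses charade to attempt to guess the encoding (later releases use chardet or charset_normalizer instead).

The only time Requests will not guess the encoding is when no explicit charset is present in the HTTP headers and the Content-Type header contains text.

In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content. (This can be read alongside the Response Content section of the previous Quickstart article.)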
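The header rule and the manual override can both be tried offline. This is a sketch rather than a description of every Requests version. It uses requests.utils.get_encoding_from_headers() for the header check and a hand-built Response as a stand-in for a server that sends UTF-8 bytes with a bare text/html Content-Type.

```python
import io
import requests
from requests.utils import get_encoding_from_headers

# An explicit charset wins; bare text/* falls back to ISO-8859-1;
# other types leave the encoding undetermined.
explicit = get_encoding_from_headers({'content-type': 'text/html; charset=utf-8'})
fallback = get_encoding_from_headers({'content-type': 'text/html'})
unknown = get_encoding_from_headers({'content-type': 'image/png'})
print(explicit, fallback, unknown)

# Hand-built Response: UTF-8 bytes decoded with the ISO-8859-1 fallback.
r = requests.Response()
r.status_code = 200
r.raw = io.BytesIO('café'.encode('utf-8'))
r.encoding = 'ISO-8859-1'
garbled = r.text

r.encoding = 'utf-8'  # override the fallback
fixed = r.text
print(garbled, fixed)
```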

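## HTTP Verbs (With Examples)

Requests provides access to almost the full range of HTTP verbs: GET, OPTIONS, HEAD, POST, PUT, PATCH and DELETE. The following provides detailed examples of using these verbs in Requests, using the GitHub API.

We will begin with the verb most commonly used: GET. HTTP GET is an idempotent method that returns a resource from a given URL. As a result, it is the verb you ought to use when attempting to retrieve data from a web location. An example usage would be getting information about a specific commit from GitHub. Suppose we wanted commit a050faf on Requests. We would get it like so: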

>>> import requests
>>> r = requests.get('https://api.github.com/repos/kennethreitz/requests/git/commits/a050faf084662f3a352dd1a941f2c7c9f886d4ad')

We should confirm that GitHub responded correctly. If it has, we want to work out what type of content it is. Do this like so:

>>> if r.status_code == requests.codes.ok:
...     print(r.headers['content-type'])
...
application/json; charset=utf-8

So, GitHub returns JSON. That's great: we can use the r.json method to parse it into Python objects.

>>> commit_data = r.json()
>>> print(list(commit_data.keys()))
['committer', 'author', 'url', 'tree', 'sha', 'parents', 'message']
>>> print(commit_data['committer'])
{'date': '2012-05-10T11:10:50-07:00', 'email': 'me@kennethreitz.com', 'name': 'Kenneth Reitz'}
>>> print(commit_data['message'])
makin' history

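So far, so simple. Well, let's investigate the GitHub API a little bit. We could look at the documentation, but we might have a little more fun if we use Requests instead. We can take advantage of the Requests OPTIONS verb to see which HTTP methods are supported on the url we just used.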

>>> verbs = requests.options(r.url)
>>> verbs.status_code
500

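Uh, what? That's unhelpful! It turns out that GitHub, like many API providers, doesn't actually implement the OPTIONS method. This is an annoying oversight, but it's OK, we can just use the boring documentation. If GitHub had correctly implemented OPTIONS, however, the server should return the allowed methods in the response headers, for example:

>>> verbs = requests.options('http://a-good-website.com/api/cats')
>>> print(verbs.headers['allow'])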
GET,HEAD,POST,OPTIONS

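Turning to the documentation, we see that the only other method allowed for commits is POST, which creates a new commit. As we're using the Requests repo, we should probably avoid making ham-handed POSTs to it. Instead, let's play with the Issues feature of GitHub.

This documentation was added in response to Issue #482. Given that this issue already exists, we will use it as an example. Let's start by getting it.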

>>> r = requests.get('https://api.github.com/repos/kennethreitz/requests/issues/482')
>>> r.status_code
200
>>> import json
>>> issue = json.loads(r.text)
>>> print(issue['title'])
Feature any http verb in docs
>>> print(issue['comments'])
3

Cool, we have three comments. Let's take a look at the last of them.

>>> r = requests.get(r.url + '/comments')
>>> r.status_code
200
>>> comments = r.json()
>>> print(list(comments[0].keys()))
['body', 'url', 'created_at', 'updated_at', 'user', 'id']
>>> print(comments[2]['body'])
Probably in the "advanced" section

Well, that seems like a silly place. Let's post a comment telling the poster that he's silly. Who is the poster, anyway?

>>> print(comments[2]['user']['login'])
kennethreitz

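OK, so let's tell this Kenneth guy that we think this example should go in the quickstart guide instead. According to the GitHub API documentation, the way to do this is to POST to the thread. Let's do it.

>>> body = json.dumps({"body": "Sounds great! I'll get right on it!"})
>>> url = "https://api.github.com/repos/kennethreitz/requests/issues/482/comments"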
>>> r = requests.post(url=url, data=body)
>>> r.status_code
404

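Huh, that's weird. We probably need to authenticate. That'll be a pain, right? Wrong. Requests makes it easy to use many forms of authentication, including the very common Basic Auth.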

>>> from requests.auth import HTTPBasicAuth
>>> auth = HTTPBasicAuth('fake@example.com', 'not_a_real_password')
>>> r = requests.post(url=url, data=body, auth=auth)
>>> r.status_code
201
>>> content = r.json()
>>> print(content['body'])
Sounds great! I'll get right on it!
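As an aside, the comment body in this POST was serialized by hand with json.dumps. Reasonably recent Requests releases also accept a json= argument that does the serialization for you. The following offline sketch compares the two prepared requests and sends nothing.

```python
import json
import requests

url = 'https://api.github.com/repos/kennethreitz/requests/issues/482/comments'
payload = {'body': "Sounds great! I'll get right on it!"}

# Serialized by hand, as in the session above.
manual = requests.Request('POST', url, data=json.dumps(payload)).prepare()

# Serialized by Requests, which also sets the Content-Type header.
automatic = requests.Request('POST', url, json=payload).prepare()

print(manual.headers.get('Content-Type'))     # None
print(automatic.headers.get('Content-Type'))  # application/json
```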

Brilliant. Oh, wait, no! I meant to add that it would take me a while, because I had to go feed my cat. If only I could edit this comment! Happily, GitHub allows us to use another HTTP verb, PATCH, to edit this comment. Let's do that.

>>> print(content["id"])
5804413
>>> body = json.dumps({"body": "Sounds great! I'll get right on it once I feed my cat."})
>>> url = "https://api.github.com/repos/kennethreitz/requests/issues/comments/5804413"
>>> r = requests.patch(url=url, data=body, auth=auth)
>>> r.status_code
200

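Excellent. Now, just to torture this Kenneth guy, I've decided to let him sweat and not tell him that I'm working on this. That means I want to delete this comment. GitHub lets us delete comments using the incredibly aptly named DELETE method. Let's get rid of it.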

>>> r = requests.delete(url=url, auth=auth)
>>> r.status_code
204
>>> r.headers['status']
'204 No Content'

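Excellent. All gone. The last thing I want to know is how much of my ratelimit I've used. Let's find out. GitHub sends that information in the response headers, so rather than download the whole page I'll send a HEAD request to get the headers.

>>> r = requests.head(url=url, auth=auth)
>>> print(r.headers)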
...
'x-ratelimit-remaining': '4995'
'x-ratelimit-limit': '5000'
...

Excellent. Time to write a Python program that abuses the GitHub API in all kinds of exciting ways, 4995 more times.

## Link Headers

Many HTTP APIs feature Link headers. They make APIs more self-describing and discoverable.

GitHub uses these for pagination in its API, for example:

>>> url = 'https://api.github.com/users/kennethreitz/repos?page=1&per_page=10'
>>> r = requests.head(url=url)
>>> r.headers['link']
'<https://api.github.com/users/kennethreitz/repos?page=2&per_page=10>; rel="next", <https://api.github.com/users/kennethreitz/repos?page=6&per_page=10>; rel="last"'

Requests will automatically parse these link headers and make them easily consumable:

>>> r.links["next"]
{'url': 'https://api.github.com/users/kennethreitz/repos?page=2&per_page=10', 'rel': 'next'}

>>> r.links["last"]
{'url': 'https://api.github.com/users/kennethreitz/repos?page=6&per_page=10', 'rel': 'last'}
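The parsed links are what a pagination loop would follow. The sketch below only covers the lookup step. `next_page_url` is a hypothetical helper, and a hand-built Response carrying the Link header shown above stands in for the result of `requests.head(url)`.

```python
import requests

# Hand-built Response carrying the Link header, so the sketch runs offline.
r = requests.Response()
r.status_code = 200
r.headers['link'] = (
    '<https://api.github.com/users/kennethreitz/repos?page=2&per_page=10>; rel="next", '
    '<https://api.github.com/users/kennethreitz/repos?page=6&per_page=10>; rel="last"'
)


def next_page_url(response):
    """Return the URL of the next page, or None on the last page (hypothetical helper)."""
    return response.links.get('next', {}).get('url')


print(next_page_url(r))
```

A loop would keep requesting `next_page_url(r)` until it returns None.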

## Transport Adapters

As of v1.0.0, Requests has moved to a modular internal design. Part of the reason this was done was to implement Transport Adapters, originally described here. Transport Adapters provide a mechanism to define interaction methods for an HTTP service. In particular, they allow you to apply per-service configuration.

Requests ships with a single Transport Adapter, the HTTPAdapter. This adapter provides the default Requests interaction with HTTP and HTTPS using the powerful urllib3 library. Whenever a Requests Session is initialized, one of these is attached to the Session object for HTTP, and one for HTTPS.

Requests enables users to create and use their own Transport Adapters that provide specific functionality. Once created, a Transport Adapter can be mounted to a Session object, along with an indication of which web services it should apply to.

>>> s = requests.Session()
>>> s.mount('http://www.github.com', MyAdapter())

The mount call registers a specific instance of a Transport Adapter to a prefix. Once mounted, any HTTP request made using that session whose URL starts with the given prefix will use the given Transport Adapter.

Many of the details of implementing a Transport Adapter are beyond the scope of this documentation, but take a look at the next example for a simple SSL use case. For more than that, you might look at subclassing requests.adapters.BaseAdapter.

Example: Specific SSL Version

The Requests team has made a specific choice to use whatever SSL version is default in the underlying library (urllib3). Normally this is fine, but from time to time, you might find yourself needing to connect to a service-endpoint that uses a version that isn’t compatible with the default.

You can use Transport Adapters for this by taking most of the existing implementation of HTTPAdapter, and adding a parameter ssl_version that gets passed-through to urllib3. We’ll make a TA that instructs the library to use SSLv3:

import ssl

from requests.adapters import HTTPAdapter
from requests.packages.urllib3.poolmanager import PoolManager


class Ssl3HttpAdapter(HTTPAdapter):
    """"Transport adapter" that allows us to use SSLv3."""

    def init_poolmanager(self, connections, maxsize, block=False):
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       ssl_version=ssl.PROTOCOL_SSLv3)
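To see prefix-based mounting at work without touching SSL or the network, a tiny adapter built on requests.adapters.BaseAdapter is enough. The following is a sketch under stated assumptions: `CannedAdapter` and the `canned.test` host are hypothetical, and a real adapter would need to handle streaming, timeouts and errors.

```python
import io
import requests
from requests.adapters import BaseAdapter


class CannedAdapter(BaseAdapter):
    """Hypothetical adapter that returns a fixed body instead of using the network."""

    def __init__(self, body):
        super().__init__()
        self.body = body

    def send(self, request, stream=False, timeout=None, verify=True,
             cert=None, proxies=None):
        resp = requests.Response()
        resp.status_code = 200
        resp.url = request.url
        resp.request = request
        resp.raw = io.BytesIO(self.body)
        return resp

    def close(self):
        pass


s = requests.Session()
s.mount('http://canned.test/', CannedAdapter(b'served from memory'))

r = s.get('http://canned.test/anything')
print(r.content)

# URLs outside the prefix still resolve to the default HTTPAdapter.
default_name = type(s.get_adapter('http://example.org/')).__name__
print(default_name)
```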

## Blocking Or Non-Blocking?

With the default Transport Adapter in place, Requests does not provide any kind of non-blocking IO. The Response.content property will block until the entire response has been downloaded. If you require more granularity, the streaming features of the library (see Streaming Requests) allow you to retrieve smaller quantities of the response at a time. However, these calls will still block.

If you are concerned about the use of blocking IO, there are lots of projects out there that combine Requests with one of Python’s asynchronicity frameworks. Two excellent examples are grequests and requests-futures.

## Timeouts

Most requests to external servers should have a timeout attached, in case the server is not responding in a timely manner. Without a timeout, your code may hang for minutes or more.

The connect timeout is the number of seconds Requests will wait for your client to establish a connection to a remote machine (corresponding to the connect() call on the socket). It's a good practice to set connect timeouts to slightly larger than a multiple of 3, which is the default TCP packet retransmission window.

Once your client has connected to the server and sent the HTTP request, the read timeout is the number of seconds the client will wait for the server to send a response. (Specifically, it’s the number of seconds that the client will wait between bytes sent from the server. In 99.9% of cases, this is the time before the server sends the first byte).

If you specify a single value for the timeout, like this:

r = requests.get('https://github.com', timeout=5)

The timeout value will be applied to both the connect and the read timeouts. Specify a tuple if you would like to set the values separately:

r = requests.get('https://github.com', timeout=(3.05, 27))

If the remote server is very slow, you can tell Requests to wait forever for a response, by passing None as a timeout value and then retrieving a cup of coffee.

r = requests.get('https://github.com', timeout=None)
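When a timeout does fire, Requests raises an exception that the caller should handle. The sketch below is an illustration, not part of the original text. It assumes loopback sockets are available and uses a local listening socket that never answers as a stand-in for an unresponsive server. The 0.5-second read timeout is chosen only to keep the demonstration short.

```python
import socket
import requests

# Local stand-in for an unresponsive server: the socket listens, so the TCP
# connection is established, but nothing ever accepts it or answers.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('127.0.0.1', 0))
listener.listen(1)
url = 'http://127.0.0.1:%d/' % listener.getsockname()[1]

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local test

try:
    session.get(url, timeout=(3.05, 0.5))
    outcome = 'response received'
except requests.exceptions.ConnectTimeout:
    outcome = 'connect timeout'
except requests.exceptions.ReadTimeout:
    outcome = 'read timeout'
finally:
    listener.close()

print(outcome)
```

Both exception classes derive from requests.exceptions.Timeout, so a single `except requests.exceptions.Timeout` clause is enough when the distinction does not matter.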

## CA Certificates

By default Requests bundles a set of root CAs that it trusts, sourced from the Mozilla trust store. However, these are only updated once for each Requests version. This means that if you pin a Requests version your certificates can become extremely out of date.

From Requests version 2.4.0 onwards, Requests will attempt to use certificates from certifi if it is present on the system. This allows for users to update their trusted certificates without having to change the code that runs on their system.

For the sake of security we recommend upgrading certifi frequently!
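To check which bundle is in play on a given machine, certifi can be asked for its location. This is a small sketch. The printed paths depend on the installation, and some operating system packages patch these functions to point at the system trust store.

```python
import os

import certifi
import requests

# Path of the CA bundle that certifi provides.
bundle = certifi.where()
print(bundle)
print(os.path.exists(bundle))

# Bundle that Requests uses by default when verify=True.
print(requests.certs.where())
```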