Python requests: Timeouts and Retries Explained

Date: 2019-11-16

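Network requests will inevitably run into timeouts. With requests, if you do not set a timeout, your program may hang indefinitely.
Timeouts come in two kinds: connect timeouts and read timeouts.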

Connect timeout

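The connect timeout is the number of seconds requests will wait for your client to establish a connection to a port on the remote machine (this corresponds to the connect() call).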

import time
import requests

url = 'http://www.google.com.hk'

print(time.strftime('%Y-%m-%d %H:%M:%S'))
try:
    html = requests.get(url, timeout=5).text
    print('success')
except requests.exceptions.RequestException as e:
    print(e)

print(time.strftime('%Y-%m-%d %H:%M:%S'))

Because Google is blocked by the firewall, the connection cannot be established, and the error message reports a connect timeout.

2018-12-14 14:38:20
HTTPConnectionPool(host='www.google.com.hk', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x00000000047F80F0>, 'Connection to www.google.com.hk timed out. (connect timeout=5)'))
2018-12-14 14:38:25

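Even with no timeout set, the connection attempt eventually gave up (after about 21 seconds in my test). requests itself applies no default timeout, so this limit presumably comes from the operating system's TCP stack and will vary between platforms.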

Read timeout

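The read timeout is the number of seconds the client will wait for the server to send a response. (Specifically, it is the time the client will wait between bytes sent from the server. In 99.9% of cases, this is the time before the server sends the first byte.)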

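Put simply, the connect timeout is the maximum time allowed between starting the connection attempt and the connection being established, and the read timeout is the maximum time to wait, once connected, for the server to return its response.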
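In code, the two cases can be told apart by exception class rather than by reading the error message: requests raises `ConnectTimeout` for the first and `ReadTimeout` for the second, and both are subclasses of `requests.exceptions.Timeout`. A minimal sketch follows; the helper name `fetch_status` and its return labels are made up for illustration and are not part of requests.

```python
import requests
from requests.exceptions import ConnectTimeout, ReadTimeout, RequestException


def fetch_status(url, timeout=5):
    """Return a short label describing how the request ended.

    fetch_status is a hypothetical helper, not part of requests.
    """
    try:
        requests.get(url, timeout=timeout)
        return 'success'
    except ConnectTimeout:
        # The connection was not established within the connect timeout.
        return 'connect timeout'
    except ReadTimeout:
        # Connected, but the server sent no data within the read timeout.
        return 'read timeout'
    except RequestException as e:
        # Any other requests error (DNS failure, refused connection, bad URL).
        return 'other error: ' + type(e).__name__
```

The order of the `except` clauses matters: the two specific timeout classes must come before the general `RequestException`, which would otherwise catch them first.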

If you specify a single value for the timeout, like this:

r = requests.get('https://github.com', timeout=5)

that value is applied to both the connect and the read timeout. To set them separately, pass a tuple of (connect, read):

r = requests.get('https://github.com', timeout=(3.05, 27))

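Level 4 of the Heibanke crawler challenge happens to have an artificial 15-second response delay built into the site, which makes it an ideal demonstration.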

import time
import requests

url_login = 'http://www.heibanke.com/accounts/login/?next=/lesson/crawler_ex03/'

session = requests.Session()
session.get(url_login)

token = session.cookies['csrftoken']
session.post(url_login, data={'csrfmiddlewaretoken': token, 'username': 'xx', 'password': 'xx'})

print(time.strftime('%Y-%m-%d %H:%M:%S'))

url_pw = 'http://www.heibanke.com/lesson/crawler_ex03/pw_list/'
try:
    html = session.get(url_pw, timeout=(5, 10)).text
    print('success')
except requests.exceptions.RequestException as e:
    print(e)

print(time.strftime('%Y-%m-%d %H:%M:%S'))

This time the error message reports a read timeout.

2018-12-14 15:20:47
HTTPConnectionPool(host='www.heibanke.com', port=80): Read timed out. (read timeout=10)
2018-12-14 15:20:57

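There is no default read timeout. If you do not set one, the program will wait indefinitely. This is why our crawlers so often hang without reporting any error.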
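Because requests applies no timeout unless one is passed, one possible workaround is to give a session a fallback timeout through a custom transport adapter. The following is only a sketch under that assumption: `DefaultTimeoutAdapter` and its `timeout` argument are hypothetical names, not part of requests, and the (5, 10) default is an arbitrary choice.

```python
import requests
from requests.adapters import HTTPAdapter


class DefaultTimeoutAdapter(HTTPAdapter):
    """Transport adapter that supplies a timeout when the caller gives none."""

    def __init__(self, *args, timeout=(5, 10), **kwargs):
        # (connect timeout, read timeout) used as the fallback.
        self.default_timeout = timeout
        super().__init__(*args, **kwargs)

    def send(self, request, **kwargs):
        # The session passes timeout=None when the caller did not set one.
        if kwargs.get('timeout') is None:
            kwargs['timeout'] = self.default_timeout
        return super().send(request, **kwargs)


session = requests.Session()
session.mount('http://', DefaultTimeoutAdapter())
session.mount('https://', DefaultTimeoutAdapter())
```

With this in place, `session.get(url)` falls back to (5, 10), while an explicit `timeout=` on an individual call still takes precedence.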

Retrying on timeout

Usually we do not give up as soon as a request times out. Instead we set up a mechanism that tries up to three times.

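import requests

def gethtml(url):
    """Fetch url, trying up to 3 times; return None if every attempt fails."""
    for _ in range(3):
        try:
            return requests.get(url, timeout=5).text
        except requests.exceptions.RequestException:
            continue  # this attempt failed, try again
    return None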
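A variant of the loop above, sketched under a few assumptions: it pauses between attempts and re-raises the last error instead of silently returning None, so the caller can see why every attempt failed. The name `get_with_retries` and the `attempts`, `pause`, and `get` parameters are illustrative; `get` exists only so that the request function can be swapped out.

```python
import time
import requests


def get_with_retries(url, attempts=3, timeout=5, pause=1.0, get=requests.get):
    """Return the response body, trying up to `attempts` times.

    Hypothetical helper: re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(attempts):
        try:
            return get(url, timeout=timeout).text
        except requests.exceptions.RequestException as e:
            last_error = e
            if attempt < attempts - 1:
                time.sleep(pause)  # wait before the next attempt
    raise last_error
```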

In fact, requests already has this built in. (Although the code seems to get longer...)

import time
import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
s.mount('http://', HTTPAdapter(max_retries=3))
s.mount('https://', HTTPAdapter(max_retries=3))

print(time.strftime('%Y-%m-%d %H:%M:%S'))
try:
    r = s.get('http://www.google.com.hk', timeout=5)
    html = r.text
    print('success')
except requests.exceptions.RequestException as e:
    print(e)
print(time.strftime('%Y-%m-%d %H:%M:%S'))

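max_retries is the maximum number of retries. Three retries plus the initial request make four attempts in total, so the code above takes 20 seconds (4 × 5 s) to run rather than 15.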

2018-12-14 15:34:03
HTTPConnectionPool(host='www.google.com.hk', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x0000000013269630>, 'Connection to www.google.com.hk timed out. (connect timeout=5)'))
2018-12-14 15:34:23
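For finer control, `max_retries` also accepts a urllib3 `Retry` object instead of an integer, which makes it possible to add a backoff delay between attempts. A minimal sketch; the specific numbers (3 retries, a backoff factor of 0.5) are arbitrary values chosen for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=3,             # at most 3 retries after the initial attempt
    backoff_factor=0.5,  # base for the growing delay between retries
)

s = requests.Session()
s.mount('http://', HTTPAdapter(max_retries=retry))
s.mount('https://', HTTPAdapter(max_retries=retry))
```

Requests made through `s` still need their own `timeout=` argument. The retry policy only decides what happens after an attempt has failed.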