Using requests

Date: 2019-11-13

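The urllib library is inconvenient in several ways. For example, handling web authentication and cookies requires writing an Opener and a Handler. The requests library makes these operations much easier.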

I. Basic usage

1) A first example

The urlopen() method in urllib actually requests a page with GET; the corresponding method in requests is get().

import requests
r=requests.get('https://www.baidu.com/')
print(type(r))
print(r.status_code)
print(type(r.text))
print(r.text)
print(r.cookies)

requests also provides post(), put(), delete() and similar methods for POST, PUT, DELETE and other request types.

2) GET requests

import requests
r=requests.get('http://httpbin.org/get')
print(r.text)

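The URL requested is http://httpbin.org/get. If the client sends a GET request, this site returns the details of that request, including the request headers, the URL and the client IP.

To add parameters to a GET request, pass them through the params argument.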

import requests
data={
    'name':'dengwenxiong',
    'age':23
}
r=requests.get('http://httpbin.org/get',params=data)
print(r.text)
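
To see what params does to the URL without touching the network, the request can be built and prepared but not sent. This is only a sketch; Request(...).prepare() is used here purely for inspection and is not part of the original example.

```python
import requests

# Build the request without sending it, so no network access is needed.
data = {"name": "dengwenxiong", "age": 23}
prepared = requests.Request("GET", "http://httpbin.org/get", params=data).prepare()

# The dictionary is URL-encoded and appended to the URL as the query string.
print(prepared.url)
```

The printed URL carries the dictionary entries as name=value pairs joined by &, which is what httpbin echoes back in its args field.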

Note that the response body is a str, but here it happens to be in JSON format. To parse it directly into a dictionary, call the json() method.

import requests
r=requests.get("http://httpbin.org/get")
print(type(r.text))
print(r.json())
print(type(r.json()))
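
Conceptually, json() parses the body text much as the standard json module would. The sketch below uses a hard-coded string in place of r.text (an assumption for illustration, not real httpbin output), so it runs without a network connection.

```python
import json

# Hard-coded sample standing in for r.text (an assumption for illustration).
sample_text = '{"args": {"name": "dengwenxiong"}, "url": "http://httpbin.org/get"}'

# Roughly what r.json() does with the response body.
parsed = json.loads(sample_text)

print(type(sample_text).__name__)
print(type(parsed).__name__)
print(parsed["args"]["name"])
```

If the body is not valid JSON, parsing fails with an exception, so json() is only appropriate when the server is known to return JSON.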

A. Scraping a web page

Requesting an ordinary web page:

import requests
import re
headers={
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
r=requests.get("https://www.zhihu.com/explore",headers=headers)
pattern=re.compile('explore-feed.*?question_link.*?>(.*?)</a>',re.S)
title=re.findall(pattern,r.text)
print(title)

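The headers argument is added here and includes a User-Agent field, which identifies the browser. Without it, Zhihu refuses the request. A regular expression then extracts the question titles.

B. Fetching binary data

Images, audio and video are all binary data at heart, so fetching them means getting their raw bytes.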

import requests
r=requests.get('https://github.com/favicon.ico')
print(r.text)
print(r.content)

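The content fetched here is GitHub's site icon. Because an image is binary data, r.text decodes it into a str when printing, so the output is garbled; r.content is the same data as bytes.

To save the fetched image: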

import requests
r=requests.get("https://github.com/favicon.ico")
with open("favicon.ico",'wb') as f:
    f.write(r.content)

Audio and video can be fetched the same way.

3) POST requests

import requests
data={'name':'dengwenxiong',
      'age':23}
r=requests.post("http://httpbin.org/post",data=data)
print(r.text)

4) Responses

import requests
r=requests.get("http://www.jianshu.com")
print(type(r.status_code),r.status_code)
print(type(r.headers),r.headers)
print(type(r.cookies),r.cookies)
print(type(r.url),r.url)
print(type(r.history),r.history)

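Besides text and content for the body, the response exposes status_code for the status code, headers for the response headers, cookies for the cookies, url for the URL, and history for the request history.

There is also a built-in status code lookup object, requests.codes.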
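
As a small sketch of how requests.codes might be used, the helper below compares a status code against named constants instead of bare numbers. The function name describe is hypothetical and exists only for this illustration.

```python
import requests

def describe(status_code):
    """Return a short label for a status code using requests.codes."""
    if status_code == requests.codes.ok:
        return "success"
    if status_code == requests.codes.not_found:
        return "not found"
    return "other"

print(describe(200))
print(describe(404))
```

In real code the argument would typically be r.status_code from a response.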

II. Advanced usage

1) File upload

import requests
files={'file':open('favicon.ico','rb')}
r=requests.post("http://httpbin.org/post",files=files)
print(r.text)

In the response, the uploaded file appears in a separate files field.
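
To see how the files argument shapes the outgoing request, the request can be prepared without being sent. This is a sketch under stated assumptions: a few in-memory bytes stand in for the real favicon.ico, and the (filename, content) tuple form is used so that no file on disk is needed.

```python
import requests

# In-memory bytes stand in for the real favicon.ico (an assumption for this sketch).
fake_icon = b"\x00\x00\x01\x00"
files = {"file": ("favicon.ico", fake_icon)}

# Prepare only; nothing is sent over the network.
prepared = requests.Request("POST", "http://httpbin.org/post", files=files).prepare()

# The body is encoded as multipart/form-data with a generated boundary.
content_type = prepared.headers["Content-Type"]
print(content_type.split(";")[0])
print(b'filename="favicon.ico"' in prepared.body)
```

The multipart encoding is why the server can report the upload separately from ordinary form fields.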

2) Cookies

Code to read cookies:

import requests
r=requests.get("https://www.baidu.com")
print(r.cookies)
for key,value in r.cookies.items():
    print(key+'='+value)

The cookies attribute is read first; it is a RequestsCookieJar. Its items() method turns it into a list of tuples, and the loop prints the name and value of each cookie.
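
The same traversal can be tried offline by filling a RequestsCookieJar by hand. The cookie names and values below are made up for illustration; a real jar would come from r.cookies.

```python
from requests.cookies import RequestsCookieJar

# Populate a jar manually (values are hypothetical, not from a real response).
jar = RequestsCookieJar()
jar.set("number", "123456789", domain="httpbin.org", path="/")
jar.set("lang", "en", domain="httpbin.org", path="/")

# items() yields (name, value) tuples, as in the loop above.
pairs = jar.items()
for key, value in pairs:
    print(key + "=" + value)

# get_dict() gives a plain dictionary when that is more convenient.
as_dict = jar.get_dict()
print(as_dict)
```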

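3) Session persistence

Calling get() or post() directly does simulate a page request, but each call is effectively a separate session, as if two pages had been opened in two different browsers. The fix is to keep a single session, which is like opening a new tab in the same browser rather than starting a new browser. That is what the Session object is for: it maintains the session and handles cookies automatically.

An example: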

import requests
requests.get("http://httpbin.org/cookies/set/number/123456789")
r=requests.get("http://httpbin.org/cookies")
print(r.text)

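The first request goes to http://httpbin.org/cookies/set/number/123456789, which sets a cookie named number with the value 123456789. The second request goes to http://httpbin.org/cookies, which returns the cookies the client currently holds.

This does not work: the two calls share no state, so the second response does not contain the cookie. With a Session: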

import requests
s=requests.Session()
s.get("http://httpbin.org/cookies/set/number/123456789")
r=s.get("http://httpbin.org/cookies")
print(r.text)

This time the cookie is returned.
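
The difference can also be observed without a network connection by preparing requests instead of sending them. In this sketch the cookie is placed into the session's jar by hand, standing in for what the /cookies/set/ call would have done; that substitution is an assumption made for illustration.

```python
import requests

url = "http://httpbin.org/cookies"

# A standalone request carries no cookies from earlier calls.
standalone = requests.Request("GET", url).prepare()
print(standalone.headers.get("Cookie"))

# A Session keeps a cookie jar and attaches matching cookies to each request.
s = requests.Session()
s.cookies.set("number", "123456789", domain="httpbin.org", path="/")
with_session = s.prepare_request(requests.Request("GET", url))
print(with_session.headers.get("Cookie"))
```

The standalone request has no Cookie header at all, while the request prepared through the Session carries the stored cookie.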

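4) SSL certificate verification

requests also verifies certificates: when it sends an HTTPS request, it checks the server's SSL certificate. The verify parameter controls this check. If verify is omitted it defaults to True, and verification happens automatically.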

import requests
from requests.packages import urllib3
urllib3.disable_warnings()  # suppress the warning about unverified requests
response=requests.get("https://www.12306.cn",verify=False)  # do not verify the certificate
print(response.status_code)

# Alternatively, silence the warning by capturing warnings into the log
import requests
import logging
logging.captureWarnings(True)
response=requests.get("https://www.12306.cn",verify=False)
print(response.status_code)

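A local certificate can also be supplied as the client certificate, either as a single file (containing both the certificate and the key) or as a tuple of two file paths:

import requests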
response=requests.get("https://www.12306.cn",cert=("/path/server.crt","/path/key"))
print(response.status_code)

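5) Proxy settings

Some sites respond normally to a handful of test requests, but once large-scale crawling begins, the volume and frequency of requests may cause the site to show a CAPTCHA, redirect to a login page, or even ban the client IP outright, making the site unreachable for some time.

To avoid this, route requests through a proxy using the proxies parameter.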

import requests
proxies={
    "http":"http://10.10.1.10:3128",
    "https":"https://10.10.1.10:1080"
}
requests.get("https://www.taobao.com",proxies=proxies)

If the proxy requires HTTP Basic Auth, use the http://user:password@host:port syntax in the proxy URL.

import requests
proxies={
"http":"http://user:password@10.10.1.10:3128",
}
requests.get("https://www.taobao.com",proxies=proxies)

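Besides plain HTTP proxies, requests also supports SOCKS proxies. SOCKS support must be installed first (for example with pip install requests[socks]); after that, SOCKS proxies can be used.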

import requests
proxies={
    "http":"socks5://user:password@host:port",
    "https":"socks5://user:password@host:port"
}
requests.get("https://www.taobao.com",proxies=proxies)

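6) Timeout settings

When the local network is poor, or the server responds slowly or not at all, a request may wait a very long time for a response, or eventually fail without ever receiving one. To guard against this, set a timeout: if no response arrives within that time, an error is raised. This is done with the timeout parameter, which covers the time from sending the request to the server returning a response.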

import requests
response=requests.get('https://www.taobao.com',timeout=1)
print(response.status_code)

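In practice, a request has two phases: connect and read.

A single timeout value is applied to each of the two phases; to set them separately, pass a (connect, read) tuple:

import requests
requests.get("https://www.taobao.com",timeout=(5,30))

To wait indefinitely, set timeout to None, or simply omit it, since None is the default.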
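
When a timeout expires, requests raises an exception rather than returning a response. A minimal sketch of handling that case follows; the helper name fetch_status and its default timeout values are hypothetical and chosen only for illustration.

```python
import requests

def fetch_status(url, timeout=(5, 30)):
    """Return the HTTP status code, or None if the request timed out."""
    try:
        return requests.get(url, timeout=timeout).status_code
    except requests.exceptions.Timeout:
        # Timeout is the base class of both ConnectTimeout and ReadTimeout.
        return None
```

Calling fetch_status("https://www.taobao.com") would perform a real request; catching requests.exceptions.Timeout handles a timeout in either phase.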

7) Authentication

Some pages require authentication before they can be accessed. requests has built-in support for this.

import requests
from requests.auth import HTTPBasicAuth
r=requests.get("http://localhost:5000",auth=HTTPBasicAuth('username','password'))
print(r.status_code)

If a tuple is passed directly, requests uses the HTTPBasicAuth class by default.

So the code above can be shortened to:

import requests
r=requests.get("http://localhost:5000",auth=('username','password'))
print(r.status_code)
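
What the tuple form produces can be checked offline by preparing the request and looking at its Authorization header. This is a sketch for inspection only; no server needs to be running on localhost:5000 because nothing is sent.

```python
import base64
import requests

# Prepare only; the auth tuple is turned into a Basic Authorization header.
prepared = requests.Request(
    "GET", "http://localhost:5000", auth=("username", "password")
).prepare()

# HTTP Basic Auth is "Basic " followed by base64 of "username:password".
expected = "Basic " + base64.b64encode(b"username:password").decode("ascii")
print(prepared.headers["Authorization"] == expected)
```

Because the credentials are only base64-encoded, not encrypted, Basic Auth is normally used over HTTPS.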

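8) Prepared Request

When urllib was introduced earlier, a request could be represented as a data structure, with each parameter carried by a Request object. requests can do the same; the data structure is called a Prepared Request.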

from requests import Request,Session
url="http://httpbin.org/post"
data={
    'name':'dengwenxiong'
}
headers={
    'User-agent':'Mozilla/5.0(Macintosh;Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/53.0.2785.116 Safari/537.36'
}
s=Session()
req=Request("POST",url,data=data,headers=headers)
prepped=s.prepare_request(req)
r=s.send(prepped)
print(r.text)

Request is imported here and used to build a Request object from the url, data and headers arguments. The Session's prepare_request() method then converts it into a Prepared Request object, which is sent by calling send().
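
One benefit of the prepared form is that the request can be inspected, or adjusted, before it is sent. The sketch below stops short of calling send(), so it runs without network access; the X-Example header is a hypothetical name used only to show a late modification.

```python
from requests import Request, Session

s = Session()
req = Request("POST", "http://httpbin.org/post", data={"name": "dengwenxiong"})
prepped = s.prepare_request(req)

# Everything that would go on the wire is available for inspection.
print(prepped.method)
print(prepped.url)
print(prepped.headers["Content-Type"])
print(prepped.body)

# The prepared request can still be changed before s.send(prepped) is called.
prepped.headers["X-Example"] = "demo"  # hypothetical header, for illustration
```

Treating each request as an independent object like this makes it easier to queue or schedule requests before sending them.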
