requests is an HTTP library for Python, built on top of urllib and released under the Apache2 License. It is more convenient than urllib and saves us a great deal of work.
requests is the simplest and most user-friendly HTTP library implemented in Python, and it is the recommended library for crawlers. A default Python installation does not include the requests module; it must be installed separately via pip.
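For example, assuming pip is available on the PATH:

```shell
pip install requests
```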
import requests

requests.get("http://httpbin.org/get")
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.delete("http://httpbin.org/delete")
requests.head("http://httpbin.org/get")
requests.options("http://httpbin.org/get")
1 Basic requests
res = requests.get('https://www.jd.com/')
with open("jd.html", 'wb') as f:
    f.write(res.content)
2 Requests with parameters
The params argument supplies the key-value pairs that follow the '?' in the URL.
res = requests.get('https://list.tmall.com/search_product.html')
res = requests.get('https://list.tmall.com/search_product.htm', params={"q": "手機"})
with open('tao_bao.html', 'wb') as f:
    f.write(res.content)
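What params does is URL-encode the dict and append it after the '?'. The equivalent query string can be built with the standard library (the base URL is taken from the example above):

```python
from urllib.parse import urlencode

base = 'https://list.tmall.com/search_product.htm'
query = urlencode({'q': '手機'})  # percent-encodes the UTF-8 bytes of 手機
url = f'{base}?{query}'
print(url)  # https://list.tmall.com/search_product.htm?q=%E6%89%8B%E6%A9%9F
```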
3 Requests with headers
res = requests.get("https://dig.chouti.com/", headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
})
with open('chouti.html', 'wb') as f:
    f.write(res.content)
4 Requests with cookies
import uuid

res = requests.get("http://httpbin.org/cookies",
                   cookies={'sbid': str(uuid.uuid4()), 'a': '1'})
print(res.text)
5 Session objects
session = requests.Session()
session.post('/login/')  # placeholder URL: log in; the session keeps the cookies the server sets
session.get('/index/')   # placeholder URL: later requests through the session resend those cookies
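A Session persists cookies (and other settings such as headers) across requests. A minimal offline sketch of the cookie jar it carries, with a made-up cookie name:

```python
import requests

session = requests.Session()
# Simulate a cookie that a login response would have set; 'token' is hypothetical.
session.cookies.set('token', 'abc123')
# Every later request made through this session sends the cookie automatically.
print(session.cookies.get('token'))
```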
requests.post() is used exactly like requests.get(); the difference is that requests.post() takes an extra data parameter.
1 The data parameter
data holds the request body. The Content-Type defaults to application/x-www-form-urlencoded, and httpbin then echoes the body data under the 'form' key.
res = requests.post('http://httpbin.org/post',
                    params={'a': '10'}, data={'name': 'ethan'})
print(res.text)
2 Sending JSON data
With the json parameter, httpbin echoes the raw request body under the 'data' key instead.
res1 = requests.post('http://httpbin.org/post', data={'name': 'ethan'})
# no header specified: defaults to Content-Type: application/x-www-form-urlencoded
print(res1.json())
res2 = requests.post('http://httpbin.org/post', json={'age': '24'})
# json parameter: defaults to Content-Type: application/json
print(res2.json())
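The header difference can be inspected offline with requests' own Request/PreparedRequest machinery, without sending anything over the network:

```python
import requests

# Prepare (but do not send) a form-encoded request and a JSON request.
form_req = requests.Request('POST', 'http://httpbin.org/post',
                            data={'name': 'ethan'}).prepare()
json_req = requests.Request('POST', 'http://httpbin.org/post',
                            json={'age': '24'}).prepare()

print(form_req.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form_req.body)                     # name=ethan
print(json_req.headers['Content-Type'])  # application/json
```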
Some websites deploy anti-crawling measures. Many, for example, track how often a given IP accesses them within a time window; if the access frequency is too fast to look like a normal visitor, they may ban that IP. We therefore configure proxy servers and switch to a new proxy every so often: even if one IP gets banned, we can change IPs and keep crawling.
res = requests.get('http://httpbin.org/ip',
                   proxies={'http': 'http://110.83.40.27:9999'}).json()
print(res)
1 Common response attributes
import requests

response = requests.get('https://sh.lianjia.com/ershoufang/')
# response attributes
print(response.text)                 # body decoded as text
print(response.content)              # raw body bytes
print(response.status_code)          # e.g. 200
print(response.headers)              # response headers
print(response.cookies)              # cookie jar
print(response.cookies.get_dict())   # cookies as a dict
print(response.cookies.items())      # cookies as (name, value) pairs
print(response.url)                  # final URL after redirects
print(response.history)              # redirect chain
print(response.encoding)             # encoding used for .text
2 Encoding issues
When the response headers do not declare a charset, requests falls back to ISO-8859-1 as the encoding for .text.
res = requests.get('https://www.autohome.com.cn/beijing/')
# the page is gb2312-encoded; without setting gbk, the Chinese text is garbled

# approach 1: write the raw bytes
with open('autohome.html', 'wb') as f:
    f.write(res.content)
# approach 2: set the encoding, then write the decoded text
res.encoding = 'gbk'
with open("autohome.html", 'w', encoding='gbk') as f:
    f.write(res.text)
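The garbling comes from decoding GBK bytes with the wrong codec. A minimal stdlib illustration (the sample string is made up for the sketch):

```python
# '汽车之家' ("Autohome") encoded as GBK bytes, as such a server would send them.
raw = '汽车之家'.encode('gbk')

print(raw.decode('iso-8859-1'))  # wrong codec: mojibake
print(raw.decode('gbk'))         # right codec: 汽车之家
```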
3 Downloading binary files (images, video, audio)
res = requests.get('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1551350578249&di=23ff7cbf4b8b47fe212e67ba3aab3267&imgtype=0&src=http%3A%2F%2Fimg.hx2cars.com%2Fupload%2Fnewimg2%2FM03%2FA9%2F03%2FClo8xFklT1GAU059AAR1t2rZPz4517_small_800_600.jpg')
with open('c180.jpg', 'wb') as f:
    # f.write(res.content)  # or write everything at once
    for chunk in res.iter_content():  # write the body chunk by chunk
        f.write(chunk)
4 Parsing JSON data
res = requests.get('http://httpbin.org/get')
print(res.text)          # raw body
print(type(res.text))    # str
print(res.json())        # parsed body
print(type(res.json()))  # dict
5 Redirection and history
By default, requests handles all redirects automatically, except for requests.head. The response object's history attribute can be used to trace the redirects.
response.history is a list of the Response objects that were created in order to fulfill the request. The list is sorted from the oldest response to the most recent one.
res = requests.get('http://www.jd.com/')
print(res.history)      # [<Response [302]>]
print(res.text)
print(res.status_code)  # 200
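The same behavior can be reproduced offline against a throwaway local server (the paths and handler name below are made up for the sketch):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/old':
            self.send_response(302)               # redirect /old -> /new
            self.send_header('Location', '/new')
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'ok')

    def log_message(self, *args):                 # silence request logging
        pass

server = HTTPServer(('127.0.0.1', 0), RedirectHandler)
port = server.server_address[1]                   # ephemeral port picked by the OS
threading.Thread(target=server.serve_forever, daemon=True).start()

res = requests.get(f'http://127.0.0.1:{port}/old')
print(res.history)      # [<Response [302]>]
print(res.status_code)  # 200
server.shutdown()
```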
Redirect handling can be disabled with the allow_redirects parameter:
res = requests.get('http://www.jd.com/', allow_redirects=False)
print(res.history)      # []
print(res.status_code)  # 302