requests is an HTTP library for Python, built on top of urllib and released under the Apache2 License. It is more convenient than urllib and saves us a great deal of work.
requests is the simplest and most user-friendly HTTP library implemented in Python, and it is the recommended library for crawlers. A default Python installation does not include the requests module; it must be installed separately via pip.
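For example, assuming pip is available on the PATH:

```shell
pip install requests
```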
import requests

requests.get("http://httpbin.org/get")
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.delete("http://httpbin.org/delete")
requests.head("http://httpbin.org/get")
requests.options("http://httpbin.org/get")
1 Basic requests
res = requests.get('https://www.jd.com/')
with open("jd.html", 'wb') as f:
    f.write(res.content)
2 Requests with parameters
The params argument supplies the key-value pairs that follow the '?' in the URL.
res = requests.get('https://list.tmall.com/search_product.html')
res = requests.get('https://list.tmall.com/search_product.htm', params={"q": "手機"})
with open('tao_bao.html', 'wb') as f:
    f.write(res.content)
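What params does is URL-encode the dict and append it after the '?'. The equivalent query string can be built with the standard library (the base URL is taken from the example above):

```python
from urllib.parse import urlencode

base = 'https://list.tmall.com/search_product.htm'
query = urlencode({'q': '手機'})  # percent-encodes the UTF-8 bytes of 手機
url = f'{base}?{query}'
print(url)  # https://list.tmall.com/search_product.htm?q=%E6%89%8B%E6%A9%9F
```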
3 Requests with headers
res = requests.get("https://dig.chouti.com/", headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
})
with open('chouti.html', 'wb') as f:
    f.write(res.content)
4 Requests with cookies
import uuid

res = requests.get("http://httpbin.org/cookies",
                   cookies={'sbid': str(uuid.uuid4()), 'a': '1'})
print(res.text)
5 Session objects
session = requests.Session()
session.post('/login/')  # placeholder URL: log in; the session keeps the cookies the server sets
session.get('/index/')   # placeholder URL: later requests through the session resend those cookies
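A Session persists cookies (and other settings such as headers) across requests. A minimal offline sketch of the cookie jar it carries, with a made-up cookie name:

```python
import requests

session = requests.Session()
# Simulate a cookie that a login response would have set; 'token' is hypothetical.
session.cookies.set('token', 'abc123')
# Every later request made through this session sends the cookie automatically.
print(session.cookies.get('token'))
```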
requests.post() is used exactly like requests.get(); the difference is that requests.post() takes an extra data parameter.
1 The data parameter
data holds the request body. The Content-Type defaults to application/x-www-form-urlencoded, and httpbin then echoes the body data under the 'form' key.
res = requests.post('http://httpbin.org/post',
                    params={'a': '10'}, data={'name': 'ethan'})
print(res.text)
2 Sending JSON data
With the json parameter, httpbin echoes the raw request body under the 'data' key instead.
res1 = requests.post('http://httpbin.org/post', data={'name': 'ethan'})
# no header specified: defaults to Content-Type: application/x-www-form-urlencoded
print(res1.json())
res2 = requests.post('http://httpbin.org/post', json={'age': '24'})
# json parameter: defaults to Content-Type: application/json
print(res2.json())
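The header difference can be inspected offline with requests' own Request/PreparedRequest machinery, without sending anything over the network:

```python
import requests

# Prepare (but do not send) a form-encoded request and a JSON request.
form_req = requests.Request('POST', 'http://httpbin.org/post',
                            data={'name': 'ethan'}).prepare()
json_req = requests.Request('POST', 'http://httpbin.org/post',
                            json={'age': '24'}).prepare()

print(form_req.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form_req.body)                     # name=ethan
print(json_req.headers['Content-Type'])  # application/json
```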
Some websites deploy anti-crawling measures. Many, for example, track how often a given IP accesses them within a time window; if the access frequency is too fast to look like a normal visitor, they may ban that IP. We therefore configure proxy servers and switch to a new proxy every so often: even if one IP gets banned, we can change IPs and keep crawling.
res = requests.get('http://httpbin.org/ip',
                   proxies={'http': 'http://110.83.40.27:9999'}).json()
print(res)
1 Common response attributes
import requests

response = requests.get('https://sh.lianjia.com/ershoufang/')
# response attributes
print(response.text)                 # body decoded as text
print(response.content)              # raw body bytes
print(response.status_code)          # e.g. 200
print(response.headers)              # response headers
print(response.cookies)              # cookie jar
print(response.cookies.get_dict())   # cookies as a dict
print(response.cookies.items())      # cookies as (name, value) pairs
print(response.url)                  # final URL after redirects
print(response.history)              # redirect chain
print(response.encoding)             # encoding used for .text
2 Encoding issues
When the response headers do not declare a charset, requests falls back to ISO-8859-1 as the encoding for .text.
res = requests.get('https://www.autohome.com.cn/beijing/')
# the page is gb2312-encoded; without setting gbk, the Chinese text is garbled

# approach 1: write the raw bytes
with open('autohome.html', 'wb') as f:
    f.write(res.content)
# approach 2: set the encoding, then write the decoded text
res.encoding = 'gbk'
with open("autohome.html", 'w', encoding='gbk') as f:
    f.write(res.text)
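The garbling comes from decoding GBK bytes with the wrong codec. A minimal stdlib illustration (the sample string is made up for the sketch):

```python
# '汽车之家' ("Autohome") encoded as GBK bytes, as such a server would send them.
raw = '汽车之家'.encode('gbk')

print(raw.decode('iso-8859-1'))  # wrong codec: mojibake
print(raw.decode('gbk'))         # right codec: 汽车之家
```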
3 Downloading binary files (images, video, audio)
res = requests.get('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1551350578249&di=23ff7cbf4b8b47fe212e67ba3aab3267&imgtype=0&src=http%3A%2F%2Fimg.hx2cars.com%2Fupload%2Fnewimg2%2FM03%2FA9%2F03%2FClo8xFklT1GAU059AAR1t2rZPz4517_small_800_600.jpg')
with open('c180.jpg', 'wb') as f:
    # f.write(res.content)  # or write everything at once
    for chunk in res.iter_content():  # write the body chunk by chunk
        f.write(chunk)
4 Parsing JSON data
res = requests.get('http://httpbin.org/get')
print(res.text)          # raw body
print(type(res.text))    # str
print(res.json())        # parsed body
print(type(res.json()))  # dict
5 Redirection and history
By default, requests handles all redirects automatically, except for requests.head. The response object's history attribute can be used to trace the redirects.
response.history is a list of the Response objects that were created in order to fulfill the request. The list is sorted from the oldest response to the most recent one.
res = requests.get('http://www.jd.com/')
print(res.history)      # [<Response [302]>]
print(res.text)
print(res.status_code)  # 200
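The same behavior can be reproduced offline against a throwaway local server (the paths and handler name below are made up for the sketch):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/old':
            self.send_response(302)               # redirect /old -> /new
            self.send_header('Location', '/new')
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'ok')

    def log_message(self, *args):                 # silence request logging
        pass

server = HTTPServer(('127.0.0.1', 0), RedirectHandler)
port = server.server_address[1]                   # ephemeral port picked by the OS
threading.Thread(target=server.serve_forever, daemon=True).start()

res = requests.get(f'http://127.0.0.1:{port}/old')
print(res.history)      # [<Response [302]>]
print(res.status_code)  # 200
server.shutdown()
```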
Redirect handling can be disabled with the allow_redirects parameter:
res = requests.get('http://www.jd.com/', allow_redirects=False)
print(res.history)      # []
print(res.status_code)  # 302