Python 3 Web Scraping, Part 3: Basic Usage of the Requests Library

Date: 2019-11-11

Tags: python3, python, web scraping, requests, basic usage | Category: Python


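1. What is Requests

Requests is an HTTP library written in Python, built on top of urllib3 and released under the Apache 2.0 open-source license. It is far more convenient than urllib, saves a great deal of work, and fully covers HTTP testing needs. In short, it is a simple, easy-to-use HTTP library for Python.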

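2. Installing Requests

If you are a beginner, installing into a stock Python 3 with pip is recommended.

pip3 install requests

If you already know some Python (basic syntax is enough), installing through Anaconda is more convenient and sidesteps some version problems; after all, Python 2 and Python 3 are practically two different languages (only half joking).

conda install requests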

3. Common methods

First, a quick look at how convenient Requests is.

import requests

response = requests.get('http://www.baidu.com')

print(response.status_code)
print(response.text)
print(type(response.text))
print(response.cookies)

Running this code shows that response.text is already a str, so there is no need to call decode ourselves, and the cookies are available directly as an object.

import requests

requests.post('http://httpbin.org/post')
requests.put('http://httpbin.org/put')
requests.options('http://httpbin.org/get')

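As you can see, every kind of request is just as easy to make. httpbin.org is a site for testing HTTP requests. The commonly used methods are covered below.

Basic GET request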

import requests

response = requests.get('http://httpbin.org/get')
print(response.text)
'''
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.14.2"
  },
  "origin": "127.0.0.1",
  "url": "http://httpbin.org/get"
}
'''

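This is the simplest GET request. The response (the string inside ''' ''') is a JSON document, which looks like a dictionary. Note: no proxy was used. To guard against malicious IP attacks, the value of origin has been changed here; the real response contains the IP address the request came from.

GET request with parameters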

import requests

data = {
    'name': 'zhangsan',
    'age': 22
}
response = requests.get('http://httpbin.org/get', params=data)
print(response.text)
'''
{
  "args": {
    "age": "22",
    "name": "zhangsan"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.14.2"
  },
  "origin": "127.0.0.1",
  "url": "http://httpbin.org/get?name=zhangsan&age=22"
}
'''

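We can build a dictionary and pass it as the params argument to send parameters to the server. As the url field in the response shows, the effect is the same as requesting url?name=zhangsan&age=22.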
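The query string that params produces can be reproduced with the standard library, which makes the encoding visible without sending a request. The following is only a sketch of the equivalent encoding using urllib.parse.urlencode, not Requests' own implementation; the names base_url, query and full_url are illustrative.

```python
from urllib.parse import urlencode

# Same dictionary as in the example above
data = {
    'name': 'zhangsan',
    'age': 22,
}

base_url = 'http://httpbin.org/get'

# urlencode turns the dict into key=value pairs joined by "&";
# non-string values such as 22 are converted with str()
query = urlencode(data)
full_url = base_url + '?' + query

print(query)
print(full_url)
```

The resulting URL has the same shape as the url field that httpbin.org echoes back in the response above.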

Parsing JSON

import requests
import json

response = requests.get('http://httpbin.org/get')
print(response.json())  # equivalent to json.loads(response.text)
print(type(response.json()))

'''
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate',
'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.14.2'},
'origin': '113.128.88.6', 'url': 'http://httpbin.org/get'}
<class 'dict'>
'''

response.json() parses the JSON body of the response and returns it as a Python dictionary.
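Once the body is a dictionary, individual fields can be read with ordinary key lookups. The sketch below works offline: body_text is a shortened, hand-written copy of the httpbin.org response standing in for response.text, and json.loads plays the role of response.json().

```python
import json

# Shortened stand-in for response.text (not fetched from the network)
body_text = (
    '{"args": {}, '
    '"headers": {"Host": "httpbin.org"}, '
    '"url": "http://httpbin.org/get"}'
)

# json.loads gives the same kind of result as response.json()
parsed = json.loads(body_text)

print(type(parsed))
print(parsed['headers']['Host'])   # nested JSON objects become nested dicts
print(parsed['url'])
```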

Binary data

import requests

response = requests.get('http://github.com/favicon.ico')
with open(r'F:\favicon.ico', 'wb') as f:
    f.write(response.content)

The code above saves an image to the local disk.

>>> print(type(response.text))
<class 'str'>
>>> print(type(response.content))
<class 'bytes'>

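This shows the difference between text and content. content holds the raw binary data, so when saving a file it is the bytes in content that should be written.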
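The relationship between the two attributes can be illustrated without a network call: text is essentially content decoded with the response's encoding (Requests exposes it as response.encoding). In this sketch the variables content and text are hand-made stand-ins for the two attributes, and UTF-8 is an assumed encoding.

```python
# Stand-in for response.content: the raw bytes of a UTF-8 encoded page fragment
content = '<title>百度一下</title>'.encode('utf-8')

# Stand-in for response.text: the same bytes decoded into a str
text = content.decode('utf-8')

print(type(content), len(content))   # bytes; length counted in bytes
print(type(text), len(text))         # str; length counted in characters
```

The byte count is larger than the character count because each Chinese character takes three bytes in UTF-8, which is why images and other non-text payloads should be written from content rather than text.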

Adding headers

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36'
}

response = requests.get('http://www.zhihu.com/explore', headers=headers)

POST request

import requests

data = {'name': 'zhangsan', 'age': '22'}
response = requests.post('http://httpbin.org/post', data=data)
print(response.text)

'''
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "age": "22",
    "name": "zhangsan"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Content-Length": "20",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.14.2"
  },
  "json": null,
  "origin": "127.0.0.1",
  "url": "http://httpbin.org/post"
}
'''

As the form field shows, the data passed through the data argument is submitted as form data.
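A form submission with Content-Type application/x-www-form-urlencoded carries its fields in the request body, using the same key=value encoding as a query string. The sketch below reproduces that body with the standard library's urlencode; it approximates what ends up on the wire and is not Requests' internal code.

```python
from urllib.parse import urlencode

data = {'name': 'zhangsan', 'age': '22'}

# Encode the form fields the way an x-www-form-urlencoded body is written
body = urlencode(data)
body_length = len(body.encode('ascii'))

print(body)
print(body_length)
```

The length of 20 bytes matches the Content-Length header in the httpbin.org output above.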

File upload

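import requests

# Open the file in binary mode; the with block closes it after the upload
with open(r'F:\favicon.ico', 'rb') as f:
    files = {'file': f}
    response = requests.post('http://httpbin.org/post', files=files)

print(response.text)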

Run the code and the response shows, under the key file, the binary content of the local image.
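To see how the files argument is encoded, a request can be built and prepared without being sent. This is a sketch under assumptions: fake_icon is an in-memory placeholder rather than a real icon file, and the (filename, file object) tuple is used so that no file on disk is needed.

```python
import io
import requests

# In-memory stand-in for the local favicon (placeholder bytes, not a real icon)
fake_icon = io.BytesIO(b'\x00\x00\x01\x00')

# Build the request without sending it, so the encoded form can be inspected
request = requests.Request(
    'POST',
    'http://httpbin.org/post',
    files={'file': ('favicon.ico', fake_icon)},
)
prepared = request.prepare()

content_type = prepared.headers['Content-Type']
print(content_type)                                  # multipart/form-data plus a generated boundary
print(b'filename="favicon.ico"' in prepared.body)    # the file name travels in the body
```

Files are sent as multipart/form-data rather than as a URL-encoded form, which is why httpbin.org reports them under files instead of form.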

Getting and printing cookies

import requests

response = requests.get('https://www.baidu.com')
for key, value in response.cookies.items():
    print(key + ' = ' + value)

# BDORZ = 27315

This gives access to the name and value of each cookie.

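Session persistence

We fetch cookies in order to keep a session alive. The example below relies on a feature of the HTTP testing site: we first set a cookie through a URL, then ask the server which cookies it received.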

import requests

requests.get('http://httpbin.org/cookies/set/name/zhangsan')
response = requests.get('http://httpbin.org/cookies')
print(response.text)

'''
{
  "cookies": {}
}
'''

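The cookies come back empty. The two calls above are independent requests (think of them as coming from two different browsers), so the cookie set by the first request is not available to the second. That is why we need session persistence.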

import requests

s = requests.Session()
print(type(s))  # <class 'requests.sessions.Session'>

s.get('http://httpbin.org/cookies/set/name/zhangsan')
response = s.get('http://httpbin.org/cookies')
print(response.text)

'''
{
  "cookies": {
    "name": "zhangsan"
  }
}
'''

A Session object is all it takes to keep the session alive.
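The difference between a bare request and one made through a Session can be inspected offline by preparing requests instead of sending them. In this sketch the cookie is placed in the jar by hand with s.cookies.set, standing in for the cookie the server would normally set; the names bare and via_session are illustrative.

```python
import requests

url = 'http://httpbin.org/cookies'

# A bare request carries no cookies from earlier calls
bare = requests.Request('GET', url).prepare()

# A Session keeps a cookie jar and attaches it to every request it prepares
with requests.Session() as s:
    s.cookies.set('name', 'zhangsan')   # stands in for the cookie set by the server
    via_session = s.prepare_request(requests.Request('GET', url))

print(bare.headers.get('Cookie'))
print(via_session.headers.get('Cookie'))
```

Using the Session as a context manager also closes its underlying connections when the block ends.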

SSL certificate verification

import requests

response = requests.get('https://www.12306.cn')

'''
raise SSLError(e, request=request)
requests.exceptions.SSLError: ("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],
'''

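HTTPS verifies the identity of the site, so requesting a site whose SSL certificate cannot be verified raises an SSLError. To get around this, set the verify parameter.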

import requests
from requests.packages import urllib3
urllib3.disable_warnings()  # suppress the warning messages

response = requests.get('https://www.12306.cn', verify=False)

With verification turned off, the page is returned successfully.

Exception handling

import requests
from requests.exceptions import Timeout, HTTPError, RequestException

try:
    # A 0.01 s timeout is almost certain to expire
    response = requests.get('http://www.baidu.com', timeout=0.01)
    print(response.status_code)
except Timeout:  # covers both ConnectTimeout and ReadTimeout
    print('Timeout')
except HTTPError:
    print('http error')
except RequestException:
    print('Error')

# Timeout
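The order of the except clauses matters because the Requests exception classes inherit from one another: the most specific classes must be caught first and RequestException last. The sketch below checks that hierarchy offline. It also shows that HTTPError is raised only when raise_for_status() is called; fake_response is a Response built by hand purely for illustration and does not come from a real server.

```python
import requests
from requests.exceptions import (
    ConnectTimeout, HTTPError, ReadTimeout, RequestException, Timeout,
)

# Timeout is the common parent of both kinds of timeout
timeouts_share_parent = (
    issubclass(ReadTimeout, Timeout) and issubclass(ConnectTimeout, Timeout)
)

# Everything derives from RequestException, so that clause must come last
all_derive_from_base = (
    issubclass(Timeout, RequestException) and issubclass(HTTPError, RequestException)
)

# Hand-built response with an error status code (illustration only)
fake_response = requests.Response()
fake_response.status_code = 404

try:
    fake_response.raise_for_status()   # turns a 4xx/5xx status into HTTPError
    outcome = 'ok'
except HTTPError:
    outcome = 'http error'

print(timeouts_share_parent, all_derive_from_base, outcome)
```

In real code this means a plain requests.get call that returns a 404 will not reach the HTTPError clause unless response.raise_for_status() is called inside the try block.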