Basic usage of urllib and requests in Python 3

Date: 2019-11-18


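In Python, urllib is the standard library for requesting URLs. Python 2 split this functionality between urllib and urllib2; Python 3 merged them into a single package named urllib.

I. urllib.request

　　The request module is mainly responsible for building and sending network requests.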

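　　1) GET request (without parameters)

　　　　response = urllib.request.urlopen(url, data=None, [timeout, ]*)

　　　　The returned response is an http.client.HTTPResponse object.

　　　　Operations on the response:

　　　　　　a) response.info() returns the response headers as an http.client.HTTPMessage object.

　　　　　　b) response.getheaders() also returns the headers, as a list of (name, value) tuples.

　　　　　　c) The body can be read with read(), readline(), or readlines(). The data comes back as bytes, so it must be decoded with decode() to get a string.

　　　　　　d) response.getcode() returns the status code of the response.

　　　　　　e) response.geturl() returns the URL that was requested.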
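
The response operations listed above can be tried without touching the internet. The following is a minimal sketch, assuming a throwaway local server: HelloHandler, its response body, and the 127.0.0.1 address are illustrative and not part of the original article.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a two-line text body."""

    def do_GET(self):
        body = "hello\nworld\n".encode("utf8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 lets the OS pick a free port for the throwaway server.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

with urllib.request.urlopen(url, timeout=5) as response:
    status = response.getcode()                      # status code
    final_url = response.geturl()                    # URL that was fetched
    content_type = response.info()["Content-Type"]   # HTTPMessage lookup
    header_list = response.getheaders()              # list of (name, value)
    first_line = response.readline().decode("utf8")  # bytes -> str
    rest = response.read().decode("utf8")            # remaining body

server.shutdown()
server.server_close()
print(status, final_url, content_type, repr(first_line), repr(rest))
```

Using the response as a context manager closes the connection once the body has been read.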


　　2) GET request (with parameters)

　　　　This needs the urlencode function from urllib's parse module.

　　　　param = {"param1":"hello", "param2":"world"}

　　　　param = urllib.parse.urlencode(param)　　　　# result: param1=hello&param2=world

　　　　url = "?".join([url, param])　　# e.g. http://httpbin.org/ip?param1=hello&param2=world

　　　　response = urllib.request.urlopen(url)
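
As a quick check of the query-string step, here is a small sketch that builds the URL and parses it back again. The base_url value is an assumed example endpoint, and nothing is sent over the network.

```python
import urllib.parse

base_url = "http://httpbin.org/get"  # assumed example endpoint
params = {"param1": "hello", "param2": "world"}

# urlencode turns the dict into "key=value" pairs joined by "&".
query = urllib.parse.urlencode(params)
full_url = "?".join([base_url, query])
print(full_url)

# Going the other way: split the URL and parse the query string.
parsed = urllib.parse.urlparse(full_url)
decoded = urllib.parse.parse_qs(parsed.query)
print(decoded)
```

On Python 3.7 and later a dict keeps insertion order, so the parameters appear in the order they were written.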

　　3) POST request

　　　　urllib.request.urlopen() sends a GET request by default; when the data argument is not None, it sends a POST request instead.

　　　　The data passed in must be bytes.

　　　　If a timeout is set and the request takes longer than that, a timeout exception is raised.

　　　　param = {"param1":"hello", "param2":"world"}

　　　　param = urllib.parse.urlencode(param).encode("utf8") # the data must be bytes

　　　　response = urllib.request.urlopen(url, data=param, timeout=10)
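
The GET-versus-POST switch can be observed on a Request object before anything is sent. This is a minimal sketch; the URL is an assumed example and is never contacted.

```python
import urllib.parse
import urllib.request

url = "http://httpbin.org/post"  # assumed example endpoint, never contacted here

# urlencode returns str; encode() turns it into the bytes that urllib expects.
payload = urllib.parse.urlencode({"param1": "hello", "param2": "world"}).encode("utf8")

get_req = urllib.request.Request(url)                # no data -> GET
post_req = urllib.request.Request(url, data=payload)  # data given -> POST

print(get_req.get_method(), post_req.get_method())
print(post_req.data)
```

Passing either Request object to urllib.request.urlopen() would send it with the method shown.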

　　4) Adding headers

　　　　Requests sent through urllib carry a default User-Agent header of Python-urllib/version, which shows that the request came from urllib. For sites that check the User-Agent, we need to supply our own headers.

　　　　To set custom headers, use a urllib.request.Request object.

　　　　headers = {"user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"}

　　　　req = urllib.request.Request(url, headers=headers)

　　　　resp = urllib.request.urlopen(req)

　　　　A crawler that keeps hitting a site from the same IP with the same User-Agent may get blocked, so we can also keep a pool of User-Agent strings and rotate through them.

　　　　How it works: put the User-Agent strings of several browsers in a list, then pick one at random for each request.

　　　　import random

　　　　uapool = ["Chrome User-Agent", "IE User-Agent", "Firefox User-Agent", ...]

　　　　curua = random.choice(uapool)

　　　　headers = {"user-agent": curua}

　　　　req = urllib.request.Request(url, headers=headers)

　　　　resp = urllib.request.urlopen(req)
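
The User-Agent pool can be wrapped in a small helper. This is a sketch under stated assumptions: build_request and UA_POOL are hypothetical names, the Firefox string is only an illustrative value, and no request is actually sent.

```python
import random
import urllib.request

# Illustrative User-Agent strings; a real pool would hold many more.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0",
]


def build_request(url):
    """Return a Request whose User-Agent is picked at random from UA_POOL."""
    headers = {"User-Agent": random.choice(UA_POOL)}
    return urllib.request.Request(url, headers=headers)


req = build_request("http://httpbin.org/get")
# Request stores header names capitalized, hence "User-agent" here.
chosen = req.get_header("User-agent")
print(chosen)
```

Calling build_request() once per page gives each request its own randomly chosen header.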

　　5) Adding cookies

　　　　To send cookie information with requests, we need to build an opener.

　　　　This uses the cookiejar module from the http package.

　　　　from http import cookiejar

　　　　from urllib import request

　　　　a) Create a CookieJar object

　　　　　　cookie = cookiejar.CookieJar()

　　　　b) Create a cookie handler with HTTPCookieProcessor

　　　　　　cookies = request.HTTPCookieProcessor(cookie)

　　　　c) Create an opener with the cookie handler as its argument

　　　　　　opener = request.build_opener(cookies)

　　　　d) Use this opener to send requests

　　　　　　resp = opener.open(url)

　　　　e) The opener can also be installed globally, so that every later call to urllib.request.urlopen carries these cookies

　　　　　　request.install_opener(opener)

　　　　　　request.urlopen(url)
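
To see steps a) through d) working together, the sketch below runs them against a throwaway local server that sets a cookie and records the Cookie header it receives. CookieHandler, the cookie value, and the local address are assumptions made for the demonstration.

```python
import threading
import urllib.request
from http import cookiejar
from http.server import BaseHTTPRequestHandler, HTTPServer

seen_cookies = []  # Cookie header of each request, as seen by the server


class CookieHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: records the incoming cookie and sets one."""

    def do_GET(self):
        seen_cookies.append(self.headers.get("Cookie"))
        self.send_response(200)
        self.send_header("Set-Cookie", "session=abc123; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), CookieHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

jar = cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
opener.open(url, timeout=5).close()  # first response stores the cookie in jar
opener.open(url, timeout=5).close()  # second request sends it back

server.shutdown()
server.server_close()
stored = [(c.name, c.value) for c in jar]
print(seen_cookies, stored)
```

The first request arrives without a cookie; the second one carries the value the server set.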

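　　6) IP proxies

　　　　When crawling data, we often need to hide our real IP address, which is done with a proxy.

　　　　For proxy IPs you can use Xici (西刺; free, but many entries are dead), Daxiang Daili (大象代理; paid), and so on.

　　　　A proxy pool can be built from a fixed list of IP addresses or by fetching addresses from a URL API.

　　　　Fixed IPs:

　　　　　　from urllib import request
　　　　　　import random

　　　　　　ippools = ["36.80.114.127:8080","122.114.122.212:9999","186.226.178.32:53281"]

　　　　　　def ip(ippools):
　　　　　　    cur_ip = random.choice(ippools)
　　　　　　    # create the proxy handler
　　　　　　    proxy = request.ProxyHandler({"http":cur_ip})
　　　　　　    # build the opener
　　　　　　    opener = request.build_opener(proxy, request.HTTPHandler)
　　　　　　    # install it globally
　　　　　　    request.install_opener(opener)

　　　　　　for i in range(5):
　　　　　　    try:
　　　　　　        ip(ippools)
　　　　　　        cur_url = "http://www.baidu.com"
　　　　　　        resp = request.urlopen(cur_url).read().decode("utf8")
　　　　　　    except Exception as e:
　　　　　　        print(e)

　　　　Building the proxy pool from an API (using Daxiang Daili as the example):

　　　　　　def api():
　　　　　　    # replace ORDER_ID and COUNT with your order number and the number of IPs to fetch
　　　　　　    resp = request.urlopen("http://tvp.daxiangdaili.com/ip/?tid=ORDER_ID&num=COUNT&foreign=only")
　　　　　　    ippools = []
　　　　　　    for item in resp:
　　　　　　        # each line of the response is one ip:port entry
　　　　　　        ippools.append(item.decode("utf8").strip())
　　　　　　    return ippools

　　　　　　The rest works the same way as above.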
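
One variation on the proxy pool is to return an opener per request instead of installing it globally. The sketch below assumes a hypothetical helper named build_proxy_opener and reuses the placeholder addresses from the text, which are unlikely to be live; it only builds the opener and sends nothing.

```python
import random
import urllib.request

# Placeholder addresses copied from the text above.
IP_POOL = ["36.80.114.127:8080", "122.114.122.212:9999", "186.226.178.32:53281"]


def build_proxy_opener(pool):
    """Return (opener, address) for one randomly selected proxy."""
    address = random.choice(pool)
    handler = urllib.request.ProxyHandler({"http": "http://" + address})
    return urllib.request.build_opener(handler), address


opener, address = build_proxy_opener(IP_POOL)

# Inspect the opener to confirm which proxy it would use.
proxy_handlers = [
    h for h in opener.handlers if isinstance(h, urllib.request.ProxyHandler)
]
print(address, proxy_handlers[0].proxies)
```

A request would then go out with opener.open(url, timeout=10). Keeping the opener local avoids changing the behavior of every other urlopen() call in the program.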

　　7) Downloading data and saving it locally: urllib.request.urlretrieve()

　　　　We often need to fetch files, images, or audio and save them to local disk.

　　　　urllib.request.urlretrieve(url, filename)
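
urlretrieve returns the path it wrote to together with the response headers. The sketch below exercises it offline by "downloading" from a file:// URL inside a temporary directory; the file names are made up for the demonstration, and an http:// URL is used the same way.

```python
import pathlib
import tempfile
import urllib.request

with tempfile.TemporaryDirectory() as tmp:
    # Create a small source file to stand in for a remote resource.
    source = pathlib.Path(tmp) / "source.txt"
    source.write_bytes(b"sample data")
    target = pathlib.Path(tmp) / "downloaded.txt"

    # as_uri() produces a file:// URL that urlretrieve can open.
    saved_path, headers = urllib.request.urlretrieve(source.as_uri(), str(target))
    downloaded = target.read_bytes()

print(saved_path, downloaded)
```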

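　　8) urllib's parse module

　　　　In 2) above we used urllib.parse.urlencode() to encode our URL parameters.

　　　　a) urllib.parse.quote()

　　　　　　This is mostly used to encode special characters, for example when a URL carries a keyword query such as keyword='詩經'.

　　　　　　A URL may only contain ASCII characters, so special characters, Chinese text, and the like must be encoded before the request is sent.

　　　　　　To decode, use unquote.

　　　　b) urllib.parse.urlencode()

　　　　　　This is usually used when there are several parameters: it joins them together and encodes them, as in our earlier example.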
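
A short sketch of quote, unquote, and urlencode on the keyword from the text. The page parameter is an assumed extra field, added only to show several parameters being joined.

```python
from urllib.parse import quote, unquote, urlencode

keyword = "詩經"

encoded = quote(keyword)    # percent-encodes the UTF-8 bytes
decoded = unquote(encoded)  # back to the original text

# urlencode quotes each value and joins the pairs with "&".
query = urlencode({"keyword": keyword, "page": 1})

print(encoded)
print(decoded == keyword)
print(query)
```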

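　　9) urllib.error

　　　　urllib has two main exceptions, HTTPError and URLError; HTTPError is a subclass of URLError.

　　　　HTTPError has three attributes:

　　　　　　code: the HTTP status code

　　　　　　reason: the reason for the error

　　　　　　headers: the response headers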
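
Because HTTPError is a subclass of URLError, it has to be caught first or the URLError branch will swallow it. The sketch below assumes a hypothetical helper named try_open and triggers a URLError with an unsupported URL scheme, which fails before any network access.

```python
import urllib.error
import urllib.request


def try_open(url):
    """Open url and return a short description of what happened."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return "ok: %s" % resp.getcode()
    except urllib.error.HTTPError as e:  # must come before URLError
        return "http error: %s %s" % (e.code, e.reason)
    except urllib.error.URLError as e:
        return "url error: %s" % e.reason


# No handler exists for this made-up scheme, so urlopen raises URLError.
result = try_open("nosuchscheme://example.invalid/")
print(result)
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
```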

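II. requests

　　The requests module is a high-level wrapper over Python's built-in modules that makes network requests far more convenient; with requests you can easily do most of what a browser can do.

　　1) GET:

　　　　requests.get(url)　　# GET request without parameters

　　　　requests.get(url, params={"param1":"hello"})　　# GET request with parameters; requests appends them to the URL automatically

　　2) POST:

　　　　requests.post(url, data=json.dumps({"key":"value"}))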
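
How the body is encoded depends on what is passed to requests.post. The sketch below compares a dict passed as data, a JSON string passed as data, and the json argument, by preparing the requests without sending them. It assumes the third-party requests package is installed, and the URL is an example endpoint that is never contacted.

```python
import json

import requests

url = "http://httpbin.org/post"  # assumed example endpoint, never contacted here

# prepare() builds the exact request that would be sent, without sending it.
form_req = requests.Request("POST", url, data={"key": "value"}).prepare()
raw_req = requests.Request("POST", url, data=json.dumps({"key": "value"})).prepare()
json_req = requests.Request("POST", url, json={"key": "value"}).prepare()

for name, req in [("form", form_req), ("raw", raw_req), ("json", json_req)]:
    print(name, req.headers.get("Content-Type"), req.body)
```

A dict passed as data is form-encoded, a string passed as data is sent as-is with no Content-Type set, and the json argument serializes the object and sets Content-Type to application/json.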

　　3) Custom headers and cookies

　　　　header = {"content-type":"application/json", "user-agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"}

　　　　cookie = {"cookie":"cookieinfo"}

　　　　requests.post(url, headers=header, cookies=cookie)

　　　　requests.get(url, headers=header, cookies=cookie)
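
To check what the headers and cookies arguments put on the wire, the request can be prepared and inspected. This is a sketch assuming the requests package is installed; the User-Agent and cookie values are made up, and nothing is sent.

```python
import requests

url = "http://httpbin.org/get"              # assumed example endpoint
headers = {"user-agent": "my-crawler/0.1"}  # hypothetical User-Agent
cookies = {"sessionid": "abc123"}           # hypothetical cookie

prepared = requests.Request("GET", url, headers=headers, cookies=cookies).prepare()

# Header lookup is case-insensitive; cookies become a single Cookie header.
print(prepared.headers["User-Agent"])
print(prepared.headers["Cookie"])
```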

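　　4) Working with the response object:

　　　　requests.get/post returns a Response object that stores the server's response; its content can be read as follows.

　　　　resp = requests.get(url)

　　　　resp.url　　　　　　　　 # the URL of the request

　　　　resp.encoding　　　　 # the current encoding

　　　　resp.encoding='utf8' 　　# set the encoding

　　　　resp.text　　　　　　 # the body as a string, decoded with resp.encoding (guessed from the response headers unless set explicitly)

　　　　resp.content 　　　　　 # the body as bytes; gzip and deflate compression is decoded automatically

　　　　resp.json() 　　　　　　 # built-in JSON decoder; raises an exception if the body is not valid JSON

　　　　resp.headers　　　　　 # server response headers as a dict-like object with case-insensitive keys; resp.headers.get(key) returns None if the key is missing

　　　　resp.request.headers　　# the headers that were sent to the server

　　　　resp.status_code　　　 # the response status code

　　　　resp.raise_for_status()　 # raises an exception if the request failed (4xx or 5xx status)

　　　　resp.cookies　　　　　　 # the cookies contained in the response

　　　　resp.history　　　　　　 # the redirect history; pass allow_redirects=False in the request to disable redirects

　　　　resp.elapsed　　　　　　 # a timedelta with the time the response took

　　　　To see all available methods and attributes, use dir(resp).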

　　5) Session()

　　　　A session object persists certain parameters across requests. Most conveniently, it keeps cookies across all requests made from the same session.

　　　　s = requests.Session()

　　　　header={"user-agent":""}

　　　　s.headers.update(header)

　　　　s.auth = ("user", "password")

　　　　resp = s.get(url)

　　　　resp1 = s.post(url)
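
The effect of session-level settings can be seen by asking the session to prepare a request. This is a sketch assuming the requests package is installed; the User-Agent, credentials, and URL are made-up values, and nothing is sent.

```python
import requests

s = requests.Session()
s.headers.update({"user-agent": "my-crawler/0.1"})  # hypothetical User-Agent
s.auth = ("user", "password")                       # hypothetical credentials

# prepare_request merges the session's headers and auth into the request.
prepared = s.prepare_request(requests.Request("GET", "http://httpbin.org/get"))

print(prepared.headers["User-Agent"])
print(prepared.headers["Authorization"])
```

Every s.get() or s.post() call goes through the same merge, so the settings only need to be declared once.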

　　6) Proxies

　　　　proxies = {"http":"ip1", "https":"ip2"}

　　　　requests.get(url, proxies=proxies)

　　7) Uploading files

　　　　requests.post(url, files={"file": open(file, 'rb')})

　　　　Uploading a string as a file:

　　　　requests.post(url, files={"file":('test.txt', b'hello world')})　　# explicitly sets the file name to test.txt
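
The files argument produces a multipart/form-data body. The sketch below prepares such a request and prints what would be sent, assuming the requests package is installed; the URL is an example endpoint that is never contacted.

```python
import requests

url = "http://httpbin.org/post"  # assumed example endpoint, never contacted here
files = {"file": ("test.txt", b"hello world")}

prepared = requests.Request("POST", url, files=files).prepare()

# The Content-Type carries the multipart boundary; the body holds the file part.
print(prepared.headers["Content-Type"])
print(prepared.body.decode("utf8"))
```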

　　8) Authentication (HTTPBasicAuth)

　　　　from requests.auth import HTTPBasicAuth, HTTPDigestAuth

　　　　resp = requests.get(url, auth=HTTPBasicAuth("user", "password"))

　　　　Another very popular form of HTTP authentication is digest authentication.

　　　　requests.get(url, auth=HTTPDigestAuth("user", "password"))
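
Basic authentication only adds an Authorization header containing the base64-encoded credentials. The sketch below prepares a request and decodes that header again, assuming the requests package is installed; the URL and credentials are example values and nothing is sent.

```python
import base64

import requests
from requests.auth import HTTPBasicAuth

url = "http://httpbin.org/basic-auth/user/password"  # assumed example endpoint
prepared = requests.Request(
    "GET", url, auth=HTTPBasicAuth("user", "password")
).prepare()

# The header has the form "Basic <base64 of user:password>".
auth_header = prepared.headers["Authorization"]
scheme, token = auth_header.split(" ", 1)
decoded = base64.b64decode(token).decode("utf8")
print(scheme, decoded)
```

Since base64 is an encoding rather than encryption, Basic authentication is normally used over HTTPS.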

　　9) Simulated GitHub login demo

　　　　import requests
　　　　from bs4 import BeautifulSoup

　　　　# visit the login page to get the authenticity_token needed for login
　　　　url = "https://github.com/login"
　　　　resp = requests.get(url)

　　　　# parse the fetched HTML with BeautifulSoup and extract the authenticity_token value
　　　　s = BeautifulSoup(resp.text, "html.parser")
　　　　token = s.find("input", attrs={"name":"authenticity_token"}).get("value")

　　　　# get the cookies that must be sent with the login request
　　　　cookie = resp.cookies.get_dict()

　　　　# log in with the form parameters (username and password hold your own credentials)
　　　　login_url = "https://github.com/session"
　　　　login_data = {
　　　　    "commit":"Sign in",
　　　　    "utf8":"✓",
　　　　    "authenticity_token":token,
　　　　    "login":username,
　　　　    "password":password
　　　　}
　　　　resp2 = requests.post(login_url, data=login_data, cookies=cookie)

　　　　# get the cookies returned after login
　　　　logged_cookie = resp2.cookies.get_dict()

　　　　# with these cookies you can now request pages that require login
　　　　requests.get(url, cookies=logged_cookie)

　　　　# the fetched page can also be written to a local file
　　　　with open(file, "w", encoding='utf8') as f:
　　　　    f.write(resp2.text)

Reference: https://www.cnblogs.com/ranxf/p/7808537.html
