The urllib library is an HTTP request library built into Python.
Frankly, urllib is clumsy compared with requests, which the next article will cover; requests is itself built on top of urllib.
Still, as the most fundamental request library, its principles and usage are well worth understanding.
urllib.request — the request module (like typing a URL into a browser and hitting Enter)
urllib.error — the exception-handling module (catches errors raised while making requests)
urllib.parse — the URL parsing module
urllib.robotparser — the robots.txt parsing module, which determines which sites may be crawled and which may not; rarely used
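Since urllib.robotparser gets no further coverage below, here is a minimal offline sketch; the robots.txt rules and the example.com URLs are made up purely for illustration:

```python
import urllib.robotparser

# Feed hand-written robots.txt lines instead of fetching a real file
rp = urllib.robotparser.RobotFileParser()
rp.parse(['User-agent: *', 'Disallow: /private/'])

# Paths under /private/ are disallowed for every user agent; others are allowed
print(rp.can_fetch('*', 'http://example.com/private/page'))  # False
print(rp.can_fetch('*', 'http://example.com/index.html'))    # True
```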
In Python 2:

import urllib2
response = urllib2.urlopen('http://www.baidu.com')

In Python 3:

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
urllib.request.urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)
The first three parameters are explained below.
The url parameter:

from urllib import request

response = request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))
The data parameter:
Without a data argument, urlopen() sends a GET request; adding data turns it into a POST request.
import urllib.request
import urllib.parse

data1 = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
data2 = bytes(str({'word': 'hello'}), encoding='utf-8')
response1 = urllib.request.urlopen('http://httpbin.org/post', data=data1)
response2 = urllib.request.urlopen('http://httpbin.org/post', data=data2)
print(response1.read())
print(response2.read())
b'{"args":{},"data":"","files":{},"form":{"word":"hello"},"headers":{"Accept-Encoding":"identity","Connection":"close","Content-Length":"10","Content-Type":"application/x-www-form-urlencoded","Host":"httpbin.org","User-Agent":"Python-urllib/3.5"},"json":null,"origin":"113.71.243.133","url":"http://httpbin.org/post"}\n'
b'{"args":{},"data":"","files":{},"form":{"{\'word\': \'hello\'}":""},"headers":{"Accept-Encoding":"identity","Connection":"close","Content-Length":"17","Content-Type":"application/x-www-form-urlencoded","Host":"httpbin.org","User-Agent":"Python-urllib/3.5"},"json":null,"origin":"113.71.243.133","url":"http://httpbin.org/post"}\n'
The data argument must be of type bytes, so the bytes() function is used to encode it; and because the first argument to bytes() must be a str, urllib.parse.urlencode is used to convert the dict into a string.
The requests go to httpbin.org, a site for testing HTTP requests; http://httpbin.org/post echoes back the request and response details, which makes it convenient for testing POST requests.
Testing shows that converting the dict with str({'word': 'hello'}) also gets the request through, but the server then treats the entire stringified dict as a single form field name rather than a word=hello pair, as the second output above shows.
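The difference between the two encodings can be seen offline: urlencode produces a proper form-encoded string, while str() merely stringifies the dict.

```python
from urllib.parse import urlencode

payload = {'word': 'hello'}
print(urlencode(payload))  # word=hello -- what a form body expects
print(str(payload))        # {'word': 'hello'} -- sent as one opaque field name
```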
The timeout parameter:
Sets a timeout in seconds; if no response arrives within that time, an exception is raised.

import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://www.baidu.com', timeout=0.001)
    print(response.read())
except urllib.error.URLError as e:
    print('error:', e.reason)

With the timeout set to 0.001 seconds, no response can arrive in time, so the error branch runs.
Sending a request with urlopen() returns a response object, which is worth examining.
Response type:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(type(response))

Output:
<class 'http.client.HTTPResponse'>
Status code and response headers:
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.status)
print(response.getheaders())
# getheader('Server') retrieves a single named response header
print(response.getheader('Server'))
200
[('Bdpagetype', '1'), ('Bdqid', '0xf6ba47940001da56'), ('Cache-Control', 'private'), ('Content-Type', 'text/html'), ('Cxy_all', 'baidu+a77af89c048e9272d9feda1e4fd31907'), ('Date', 'Sat, 09 Jun 2018 04:04:51 GMT'), ('Expires', 'Sat, 09 Jun 2018 04:03:53 GMT'), ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('Server', 'BWS/1.1'), ('Set-Cookie', 'BAIDUID=B6DF381069A580546B5E16B4BE9FF3AE:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BIDUPSID=B6DF381069A580546B5E16B4BE9FF3AE; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'PSTM=1528517091; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BDSVRTM=0; path=/'), ('Set-Cookie', 'BD_HOME=0; path=/'), ('Set-Cookie', 'H_PS_PSSID=26600_1451_21078_18559_26350_26577_20927; path=/; domain=.baidu.com'), ('Vary', 'Accept-Encoding'), ('X-Ua-Compatible', 'IE=Edge,chrome=1'), ('Connection', 'close'), ('Transfer-Encoding', 'chunked')]
BWS/1.1
The read method:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
content = response.read()  # read() consumes the body, so call it once and keep the result
print(type(content))
print(content.decode('utf-8'))

response.read() returns data as bytes, so decode('utf-8') is needed to turn it into text.
urlopen() handles simple requests, but what if we need to send more complex ones? For those, urllib provides the Request object.
import urllib.request

# Declare a Request object, passing the url in as an argument
req = urllib.request.Request('http://www.baidu.com')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))

We declare a Request object with the url as an argument, then pass that object to the urlopen() function.
With a Request object we can build more complex requests, for example by adding headers.
# Use a Request object to send a POST request
import urllib.request
import urllib.parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
}
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))

This request specifies the method, url, headers, and body together, which keeps the logic clear.
Request objects also have an add_header() method, which is another way to add headers.
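A short offline sketch of add_header(); note that Request normalizes header names internally, so get_header() looks them up in capitalized form ('User-agent'):

```python
import urllib.request

req = urllib.request.Request('http://www.baidu.com')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)')

# Header names are stored capitalized, hence 'User-agent' here
print(req.get_header('User-agent'))
```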
Operations such as setting a proxy or handling cookies require handlers.
A handler is a helper tool for carrying out these operations.
For example, ProxyHandler (the handler for setting a proxy) lets you change your apparent IP address.
Cookies let you maintain a logged-in state.
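Handlers are wired together with build_opener(). Below is a sketch; the proxy address 127.0.0.1:8888 is a placeholder, so the opener is constructed but not actually used to fetch anything here:

```python
import http.cookiejar
import urllib.request

# Placeholder proxy address -- replace with a real proxy before opening URLs
proxy_handler = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:8888'})

# A CookieJar keeps cookies across requests made through this opener
cookie_jar = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)

opener = urllib.request.build_opener(proxy_handler, cookie_handler)
# opener.open('http://www.baidu.com')  # would route through the proxy and store cookies
```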
Three exceptions can be caught: URLError, HTTPError (a subclass of URLError), and ContentTooShortError.
URLError has a single attribute, reason.
HTTPError has three attributes: code, reason, and headers.
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://123.com')
except urllib.error.URLError as e:
    print(e.reason)
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://123.com')
except urllib.error.HTTPError as e:
    # Catch the subclass first, then fall back to the parent class
    print(e.reason, e.code, e.headers)
except urllib.error.URLError as e:
    print(e.reason)
else:
    print('Request succeeded!')
The urlparse function
Splits the given url into its component parts and assigns a value to each:
from urllib import parse

result = parse.urlparse('http://www.baidu.com')
print(type(result))
print(result)
Output:
<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='', params='', query='', fragment='')
As the output shows, the parts are: the protocol (scheme), the domain (netloc), the path, params, the query, and the fragment.
urlparse takes several parameters: url, scheme, and allow_fragments.
Passing scheme='http' supplies a default protocol type; if the url already contains a protocol, the scheme argument has no effect.
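This behaviour can be checked offline. Note that when the url has no '//', urlparse puts the host into path rather than netloc, since only text after '//' is treated as the network location:

```python
from urllib.parse import urlparse

# No scheme in the url: the scheme argument supplies the default
r1 = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(r1.scheme)  # https

# Scheme already present in the url: the argument is ignored
r2 = urlparse('http://www.baidu.com/index.html', scheme='https')
print(r2.scheme)  # http
```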
from urllib import parse

result = parse.urlparse('https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=python.org&oq=%25E6%2596%25B0%25E6%25B5%25AA&rsv_pq=d28ff08c000024df&rsv_t=d3d8kj5yW7d89rZNhlyrAw%2FRXjh8%2FrDWinUOKVobUbk6BVzP5U8UMplpW1w&rqlang=cn&rsv_enter=1&inputT=1213&rsv_sug3=69&rsv_sug1=63&rsv_sug7=100&bs=%E6%96%B0%E6%B5%AA')
print(type(result))
print(result)
Output:
<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.baidu.com', path='/s', params='', query='ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=python.org&oq=%25E6%2596%25B0%25E6%25B5%25AA&rsv_pq=d28ff08c000024df&rsv_t=d3d8kj5yW7d89rZNhlyrAw%2FRXjh8%2FrDWinUOKVobUbk6BVzP5U8UMplpW1w&rqlang=cn&rsv_enter=1&inputT=1213&rsv_sug3=69&rsv_sug1=63&rsv_sug7=100&bs=%E6%96%B0%E6%B5%AA', fragment='')
The urlunparse function
The inverse of urlparse: it assembles a url from its parts.
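urlunparse takes an iterable of exactly six parts (scheme, netloc, path, params, query, fragment), the same fields that urlparse produces:

```python
from urllib.parse import urlunparse

parts = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(parts))
# http://www.baidu.com/index.html;user?a=6#comment
```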
The urljoin function
Used to join urls together.
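urljoin resolves its second argument against the first; if the second url is already absolute, it wins outright (example.com below is just an illustrative host):

```python
from urllib.parse import urljoin

# Relative second argument: resolved against the base url
print(urljoin('http://www.baidu.com', 'FAQ.html'))
# http://www.baidu.com/FAQ.html

# Absolute second argument: the base url is ignored
print(urljoin('http://www.baidu.com/about.html', 'https://example.com/FAQ.html'))
# https://example.com/FAQ.html
```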
The urlencode function
Converts a dict into GET request parameters.
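A typical use is building a query string from a dict and appending it to a base url (the parameter names here are arbitrary examples):

```python
from urllib.parse import urlencode

params = {'name': 'Tom', 'age': 22}
base_url = 'http://www.baidu.com?'
print(base_url + urlencode(params))
# http://www.baidu.com?name=Tom&age=22
```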