This series is based on Python 3.4.
urllib is Python 3's standard library for network requests. It includes functions for requesting network data, handling cookies, changing request headers and user agents, following redirects, authentication, and more.
urllib vs urllib2: Python 2.x used urllib2; in Python 3 it was renamed to urllib and split into submodules: urllib.request, urllib.parse, urllib.error, urllib.robotparser. Although most function names stayed the same, when using the new urllib library you need to be aware of which functions were moved into which submodule.
HTTP version: HTTP/1.1, including the Connection: close header
Most commonly used function: urllib.request.urlopen()
Recommended open-source library of the same kind: requests
urllib: used to handle network requests and manipulate URLs. It has the following submodules:
urllib.request - opens and reads URLs
urllib.error - the exception classes raised by urllib.request
urllib.parse - parses URLs
urllib.robotparser - parses robots.txt files
```python
from urllib.request import urlopen

html = urlopen('https://www.baidu.com')
print(html.geturl(), html.info(), html.getcode(), sep='\n')
print(html.read().decode('UTF-8'))
```
```python
from urllib import request

with request.urlopen('https://api.douban.com/v2/book/2129650') as f:
    data = f.read()
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', data.decode('utf-8'))
```
```python
from urllib import request

req = request.Request('http://www.douban.com/')
# Set a request header
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
```

```python
import urllib.request
from urllib import parse

# URL-encode the parameters; the data argument must be bytes, which makes this a POST
data = parse.urlencode([('username', 'xby')]).encode('utf-8')
req = urllib.request.Request(url='https://www.baidu.com', data=data)
with urllib.request.urlopen(req) as f:
    print(f.read().decode('utf-8'))
```

```python
from urllib import request, parse

print('Login to weibo.cn...')
email = input('Email: ')
passwd = input('Password: ')
# URL-encode the form parameters
login_data = parse.urlencode([
    ('username', email),
    ('password', passwd),
    ('entry', 'mweibo'),
    ('client_id', ''),
    ('savestate', '1'),
    ('ec', ''),
    ('pagerefer', 'https://passport.weibo.cn/signin/welcome?entry=mweibo&r=http%3A%2F%2Fm.weibo.cn%2F')
])
req = request.Request('https://passport.weibo.cn/sso/login')
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3A%2F%2Fm.weibo.cn%2F')
with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
```

urllib.request

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
The url parameter can be a string or a urllib.request.Request object.
The data parameter must be bytes; it can be produced with urllib.parse.urlencode() followed by .encode(). If data is not provided the request is a GET, otherwise it is a POST.
The [timeout, ] parameter is in seconds.
The context parameter must be an ssl.SSLContext instance.
Return value: an object that can be used as a context manager, with these methods and attributes:
geturl() - the URL of the resource actually retrieved (after any redirects)
info() - metadata, such as the response headers
getcode() - the HTTP status code, e.g. 200
read() - the body content, as bytes
status - the HTTP status code
reason - the reason phrase, e.g. 'OK'
For HTTP(S) requests, the returned object is an http.client.HTTPResponse; commonly used methods are getheaders() and read().
For ftp and file requests, it returns a urllib.response.addinfourl object.
Possible exceptions: urllib.error.URLError and urllib.error.HTTPError.
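The return-value methods above can be exercised without network access by opening a file:// URL (as noted, file requests return a urllib.response.addinfourl object). The temporary file and its contents below are illustrative assumptions, not part of the original text:

```python
import pathlib
import tempfile
from urllib.request import urlopen

# Create a small local file to open via a file:// URL
tmp = tempfile.NamedTemporaryFile(delete=False, suffix='.txt')
tmp.write(b'hello urllib')
tmp.close()

url = pathlib.Path(tmp.name).as_uri()
with urlopen(url) as f:
    body = f.read()          # content as bytes
    final_url = f.geturl()   # the URL that was actually opened
    meta = f.info()          # header-like metadata (Content-type, Content-length, ...)
print(final_url)
print(body.decode())
```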
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
Through this object we can set the request data and add request headers, and we can also read URL information such as the scheme and host. A proxy can be set with Request.set_proxy(host, type).
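A minimal sketch of reading that URL information from a Request and then routing it through a proxy; proxy.example.com:3128 is a placeholder address, not a real proxy, and nothing is actually sent:

```python
from urllib import request

req = request.Request('http://www.example.com/path?q=1')
print(req.type)      # the scheme: 'http'
print(req.host)      # 'www.example.com'
print(req.selector)  # '/path?q=1'

# After set_proxy, the request is addressed to the proxy host instead
req.set_proxy('proxy.example.com:3128', 'http')
print(req.host)      # 'proxy.example.com:3128'
```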
class urllib.request.OpenerDirector, together with the related urllib.request.install_opener(opener) and urllib.request.build_opener([handler, ...])
Method: OpenerDirector.add_handler(handler). The handler object must subclass urllib.request.BaseHandler; common handlers include:
urllib.request.BaseHandler - the base class
urllib.request.HTTPDefaultErrorHandler
urllib.request.HTTPRedirectHandler
urllib.request.HTTPCookieProcessor
urllib.request.ProxyHandler
urllib.request.HTTPBasicAuthHandler
urllib.request.HTTPSHandler
Examples:
```python
import urllib.request

# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='klem',
                          passwd='kadidd!ehopper')
opener = urllib.request.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib.request.install_opener(opener)
urllib.request.urlopen('http://www.example.com/login.html')
```
```python
proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
# This time, rather than install the OpenerDirector, we use it directly:
opener.open('http://www.example.com/login.html')
```
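HTTPCookieProcessor from the handler list above can be combined with build_opener in the same way. A sketch, with www.example.com as a placeholder (no request is actually sent here):

```python
import http.cookiejar
import urllib.request

# An opener that stores cookies across requests
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
# opener.open('http://www.example.com/') would record any Set-Cookie headers
# in jar, and later requests through this opener would send them back.
print(len(jar))  # no cookies yet
```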
Possible exceptions: urllib.error.URLError and urllib.error.HTTPError.
exception urllib.error.URLError - has the following attribute: reason
exception urllib.error.HTTPError - a subclass of URLError, with the following attributes:
code
reason
headers
```python
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

req = Request("http://www.baidu.com/")
try:
    response = urlopen(req)
except HTTPError as e:
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print("good!")
    print(response.read().decode("utf8"))
```
The urllib.parse.urlparse function splits an ordinary URL into 6 parts and returns a ParseResult object; access its attributes to read the corresponding values.
It can also recombine already-split parts back into a URL, via urlunparse(parts). The 6 parts returned are: scheme, netloc (network location), path, params (parameters of the last path segment), query, and fragment.
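The six parts and the round trip through urlunparse can be seen with a small example (the URL is made up for illustration):

```python
from urllib.parse import urlparse, urlunparse

parts = urlparse('https://www.example.com/path;param?q=1#frag')
print(parts.scheme)    # 'https'
print(parts.netloc)    # 'www.example.com'
print(parts.path)      # '/path'
print(parts.params)    # 'param'
print(parts.query)     # 'q=1'
print(parts.fragment)  # 'frag'

# urlunparse reassembles the six parts back into the original URL
print(urlunparse(parts))
```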
urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None). Note: the query parameter is a sequence object (e.g. a list of key/value tuples) or a mapping.
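A short illustration of both forms of the query argument (the keys and values are made up):

```python
from urllib.parse import urlencode

# query as a sequence of (key, value) pairs; values are converted with str()
qs = urlencode([('q', 'python urllib'), ('page', 1)])
print(qs)  # 'q=python+urllib&page=1'

# with doseq=True, a sequence value becomes repeated keys
qs2 = urlencode({'tags': ['a', 'b']}, doseq=True)
print(qs2)  # 'tags=a&tags=b'
```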
urllib.request.urlretrieve(url, savefilepath) - downloads url and saves it to the local path savefilepath.
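A sketch of urlretrieve using a local file:// URL so it runs without network access; the temporary source file and destination path are illustrative assumptions:

```python
import pathlib
import tempfile
from urllib.request import urlretrieve

# Create a local "remote" file to download
src = tempfile.NamedTemporaryFile(delete=False, suffix='.txt')
src.write(b'downloaded content')
src.close()

dst = src.name + '.copy'
# urlretrieve returns the local filename and the response headers
path, headers = urlretrieve(pathlib.Path(src.name).as_uri(), dst)
print(path)  # the path the file was saved to
```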