Learning the Python Standard Library: urllib

Date: 2019-12-10

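This series is based on Python 3.4.
urllib is Python 3's standard library for network requests. It contains functions for requesting network data, handling cookies, changing request headers and the user agent, following redirects, authentication, and so on.
urllib vs. urllib2: Python 2.x used urllib2 (alongside urllib and urlparse). Python 3 merged these into a single urllib package, split into the submodules urllib.request, urllib.parse, urllib.error, and urllib.robotparser. Most function names are the same as before, but when using the new urllib you need to check which submodule each function has moved into.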

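HTTP version: urllib.request sends HTTP/1.1 requests and includes a Connection: close header.

Most commonly used function: urllib.request.urlopen()

Recommended open-source alternative: requests

urllib is used to handle network requests and to manipulate URLs. It has the following submodules: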

urllib.request: opens URLs and reads their content
urllib.error: contains the exception classes raised by urllib.request
urllib.parse: parses URLs
urllib.robotparser: parses robots.txt files
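
Of the four submodules, urllib.robotparser is the only one not shown in the examples of this article, so here is a minimal sketch of it. It assumes a made-up robots.txt, passed in as a list of lines so that nothing is fetched over the network. The rules and the example.com URLs are placeholders, not taken from a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied as lines instead of being fetched.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)  # parse the rules directly, without a network request

# Hypothetical URLs used only to exercise the rules above.
allowed = parser.can_fetch("*", "https://www.example.com/index.html")
blocked = parser.can_fetch("*", "https://www.example.com/private/data.html")
print(allowed, blocked)
```

For a real site you would normally call set_url() and read() instead of parse(), which downloads the robots.txt file first.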

Simple examples

from urllib.request import urlopen
html=urlopen('https://www.baidu.com')
print(html.geturl(),html.info(),html.getcode(),sep='\n')
print(html.read().decode('UTF-8'))

from urllib import request
with request.urlopen('https://api.douban.com/v2/book/2129650') as f:
    data = f.read()
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', data.decode('utf-8'))

from urllib import request
req = request.Request('http://www.douban.com/')  # set request headers below
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))

import urllib.request
from urllib import parse
# URL-encode the form parameters; POST data must be bytes, so encode the str
data = parse.urlencode([
    ('username', 'xby')]).encode('utf-8')
req = urllib.request.Request(url='https://www.baidu.com',
                     data=data)
with urllib.request.urlopen(req) as f:
    print(f.read().decode('utf-8'))

from urllib import request, parse
print('Login to weibo.cn...')
email = input('Email: ')
passwd = input('Password: ')
login_data = parse.urlencode([  # URL-encode the form parameters
    ('username', email),
    ('password', passwd),
    ('entry', 'mweibo'),
    ('client_id', ''),
    ('savestate', '1'),
    ('ec', ''),
    ('pagerefer', 'https://passport.weibo.cn/signin/welcome?entry=mweibo&r=http%3A%2F%2Fm.weibo.cn%2F')
])
req = request.Request('https://passport.weibo.cn/sso/login') 
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3A%2F%2Fm.weibo.cn%2F')
with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))

urllib.request
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

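url can be a string or a urllib.request.Request object.
data must be bytes. It can be produced with urllib.parse.urlencode(), whose str result then has to be encoded, for example with .encode('utf-8'). If data is not provided the request is a GET; otherwise it is a POST.
timeout is the timeout in seconds.
context must be an instance of ssl.SSLContext.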
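
The GET/POST rule above can be checked without sending anything. The following is a minimal sketch, assuming a placeholder endpoint and made-up form fields; it only builds the Request objects and the SSL context, and the actual urlopen() call is left commented out.

```python
import ssl
from urllib import parse, request

# Hypothetical endpoint and form fields, for illustration only.
URL = "https://www.example.com/login"
form = [("username", "xby"), ("lang", "zh TW")]

# urlencode() returns str; POST data has to be bytes.
body = parse.urlencode(form).encode("utf-8")

get_req = request.Request(URL)              # no data: GET
post_req = request.Request(URL, data=body)  # data present: POST

print(get_req.get_method(), post_req.get_method())
print(body)

# A default SSLContext that could be passed as context=... to urlopen().
ctx = ssl.create_default_context()

# Sending the request would look like this (not executed here):
# with request.urlopen(post_req, timeout=10, context=ctx) as resp:
#     print(resp.status)
```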

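Return value: an object that can be used as a context manager. It has the following methods and attributes:
geturl()
info() - metadata such as the headers
getcode() - the HTTP status code, e.g. 200
read() - the content, as bytes
status
reason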

For HTTP(S) requests, an http.client.HTTPResponse object is returned. Commonly used methods: getheaders(), read().
For ftp and file requests, a urllib.response.addinfourl object is returned.
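
The non-HTTP return type can be observed offline. This sketch uses a data: URL, which is an assumption on my part (the text above only lists ftp and file); urlopen() also returns an addinfourl object for it, and the payload is carried inline, so no network access is needed.

```python
from urllib.request import urlopen
from urllib.response import addinfourl

# A data: URL carries its payload inline; the payload is made up for this example.
DATA_URL = "data:text/plain;charset=utf-8,hello%20urllib"

with urlopen(DATA_URL) as resp:
    body = resp.read()                              # content as bytes
    final_url = resp.geturl()                       # URL that was opened
    content_type = resp.info().get_content_type()   # from the metadata
    is_addinfourl = isinstance(resp, addinfourl)

print(body.decode("utf-8"), content_type, is_addinfourl)
```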

Exceptions that may be raised: urllib.error.URLError, urllib.error.HTTPError

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
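With this object we can set the request data and add request headers, and also read information about the URL, such as the scheme and the host. A proxy can be set with Request.set_proxy(host, type).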
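
A short sketch of those Request features, built without sending anything. The URL, the User-Agent value, and the proxy address are all placeholders chosen for this example.

```python
from urllib.request import Request

# Hypothetical URL and header value, for illustration only.
req = Request(
    "http://www.example.com/search?q=urllib",
    headers={"User-Agent": "demo-client/0.1"},
)

scheme = req.type          # URL scheme
host = req.host            # host part of the URL
selector = req.selector    # path plus query string
method = req.get_method()  # GET, because no data was given
print(scheme, host, selector, method)

# Header names are stored capitalized ("User-agent"), so look them up that way.
ua = req.get_header("User-agent")

# Route the request through a (hypothetical) proxy; host now points at the proxy.
req.set_proxy("proxy.example.com:3128", "http")
proxied_host = req.host
print(ua, proxied_host)
```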

class urllib.request.OpenerDirector, together with the related urllib.request.install_opener(opener) and urllib.request.build_opener([handler, …])
Method: OpenerDirector.add_handler(handler). The handler must be an instance of a class that inherits from urllib.request.BaseHandler. Common handlers are:
urllib.request.BaseHandler - the base class
urllib.request.HTTPDefaultErrorHandler
urllib.request.HTTPRedirectHandler
urllib.request.HTTPCookieProcessor
urllib.request.ProxyHandler
urllib.request.HTTPBasicAuthHandler
urllib.request.HTTPSHandler
Examples:

import urllib.request
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='klem',
                          passwd='kadidd!ehopper')
opener = urllib.request.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib.request.install_opener(opener)
urllib.request.urlopen('http://www.example.com/login.html')

proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
 
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
# This time, rather than install the OpenerDirector, we use it directly:
opener.open('http://www.example.com/login.html')
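
The same pattern works for the other handlers in the list. As a minimal sketch, here is an opener with HTTPCookieProcessor, assuming an in-memory cookie jar and a placeholder User-Agent value. Only the opener is built; the request itself is left commented out.

```python
import http.cookiejar
import urllib.request

# An in-memory cookie jar; cookies set by responses would be stored here.
jar = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(jar)

# build_opener() combines the default handlers with the ones passed in.
opener = urllib.request.build_opener(cookie_handler)

# Headers sent with every request made through this opener
# (the User-Agent value is a placeholder).
opener.addheaders = [("User-Agent", "demo-client/0.1")]

print(type(opener).__name__, len(jar))

# Using the opener would look like this (not executed here):
# with opener.open("http://www.example.com/") as resp:
#     print(len(jar), "cookies stored")
```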

Exception handling

Exceptions that may be raised: urllib.error.URLError, urllib.error.HTTPError
exception urllib.error.URLError: has the attribute reason
exception urllib.error.HTTPError: a subclass of URLError, with the following attributes:
code
reason
headers

from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
req = Request("http://www.baidu.com/")
try:
    response = urlopen(req)
except HTTPError as e:
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print("good!")
    # read the body only when the request succeeded, so response is always defined
    print(response.read().decode("utf8"))
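
Because HTTPError is a subclass of URLError, the except HTTPError clause has to come first, otherwise the URLError clause would catch both. The sketch below illustrates this offline. The helper describe() is a hypothetical name, and the exceptions are constructed by hand with made-up values instead of coming from a real request.

```python
import io
from urllib.error import HTTPError, URLError


def describe(exc):
    """Return a short description, checking the subclass first."""
    try:
        raise exc
    except HTTPError as e:   # the server answered with an error status
        return "HTTP %d: %s" % (e.code, e.reason)
    except URLError as e:    # no usable response was obtained
        return "URL error: %s" % (e.reason,)


# Hand-built exceptions so the sketch runs offline; URL and messages are made up.
not_found = HTTPError("http://www.example.com/missing", 404, "Not Found",
                      {}, io.BytesIO())
unreachable = URLError("connection refused")

print(describe(not_found))
print(describe(unreachable))
print(issubclass(HTTPError, URLError))
```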

urllib.parse

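urllib.parse.urlparse parses a URL into 6 components and returns a ParseResult object; each value is available as an attribute.
The operation can be reversed: urlunparse(parts) combines the components back into a URL. The 6 components are: scheme, netloc (network location), path, params (parameters of the last path segment), query, and fragment.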
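
A minimal sketch of that round trip, assuming a made-up URL chosen so that all six components are non-empty. parse_qs() is not mentioned above; it is included only to show how the query component can be decoded further.

```python
from urllib.parse import parse_qs, urlparse, urlunparse

# Hypothetical URL chosen so that all six components are non-empty.
URL = "https://www.example.com/path;type=a?q=python&page=2#top"

parts = urlparse(URL)
print(parts.scheme, parts.netloc, parts.path)
print(parts.params, parts.query, parts.fragment)

# The query string can be decoded further into a dict of lists.
query = parse_qs(parts.query)
print(query)

# urlunparse() reassembles the six components into a URL string.
rebuilt = urlunparse(parts)
print(rebuilt)
```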

urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None). Note: query is a mapping object or a sequence of two-element tuples.
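
A short sketch of urlencode() with a sequence of pairs and with doseq=True. The field names and values are made up for illustration.

```python
from urllib.parse import urlencode

# Made-up form fields; a list of pairs keeps the output order explicit.
pairs = [("username", "xby"), ("note", "a b&c")]
encoded = urlencode(pairs)
print(encoded)

# With doseq=True, a sequence value expands into one key=value pair per item.
multi = urlencode([("tag", ["python", "urllib"])], doseq=True)
print(multi)

# POST bodies must be bytes, so encode the resulting str.
body = encoded.encode("ascii")
print(body)
```

Spaces become + and reserved characters such as & are percent-encoded, so the result is safe to use as a query string or as a form body.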