A Detailed Guide to Python's urllib Module

Date: 2019-12-01

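In Python 2, two libraries, urllib and urllib2, were used to send requests. In Python 3, urllib2 no longer exists; everything is unified under urllib. The official documentation is at https://docs.python.org/3/library/urllib.html.

urllib is Python's built-in HTTP request library, so no extra installation is needed. It contains the following four modules:

request: the most basic HTTP request module, used to simulate sending a request. Just as you would type a URL into a browser and press Enter, you pass the URL and any extra parameters to the library's functions to reproduce that process.
error: the exception-handling module. If a request fails, we can catch these exceptions and then retry or take other action so that the program does not terminate unexpectedly.
parse: a utility module that provides many URL-handling functions, such as splitting, parsing, and joining.
robotparser: mainly used to parse a site's robots.txt file and determine which pages may be crawled and which may not. It is used relatively rarely.

This article focuses on the first three modules.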

I. The request module

With urllib's request module, we can conveniently send a request and get the response.

1. The urlopen() function

Opens the specified URL.

(1) Syntax

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

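The url parameter can be a string or a Request object.
The data parameter is optional. It is typically a bytes object (or None) holding the data to send to the server, so a str must first be encoded, for example with bytes() or str.encode(). If data is provided, an HTTP request is sent as POST instead of GET.
The timeout parameter sets the timeout in seconds: if no response arrives within that time, an exception is raised. If it is omitted, the global default timeout is used. It works only for HTTP, HTTPS, and FTP requests.
The cafile and capath parameters specify trusted CA certificates for HTTPS requests: cafile points to a single file containing a bundle of CA certificates, and capath to a directory of hashed certificate files. Both have been deprecated since Python 3.6 and were removed in Python 3.13; use context instead.
The cadefault parameter is deprecated and ignored (it was also removed in Python 3.13); its default value is False.
The context parameter, if given, must be an ssl.SSLContext instance describing the SSL settings.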
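
The effect of the data parameter on the request method can be checked without touching the network. The following is a minimal sketch, not part of the original article: it relies on the fact that url may be a Request object and uses Request.get_method() to report the method that would be sent. The httpbin.org URLs are only placeholders and are never contacted here.

```python
import ssl
import urllib.parse
import urllib.request

# Encode a form dict into bytes, as the data parameter expects.
form = urllib.parse.urlencode({'name': 'helloworld'}).encode('utf-8')

# Request.get_method() reports the HTTP method that would be used.
get_req = urllib.request.Request('http://httpbin.org/get')
post_req = urllib.request.Request('http://httpbin.org/post', data=form)
print(get_req.get_method())   # GET
print(post_req.get_method())  # POST

# A default SSLContext that could be passed as urlopen(..., context=ctx).
ctx = ssl.create_default_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)  # True
```

Passing either Request object to urlopen() would then send the request with the method shown.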

(2) Examples

Basic usage

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

Passing data

import urllib.parse
import urllib.request
 
data = bytes(urllib.parse.urlencode({'name': 'helloworld'}), encoding='utf8')  # the data must be converted to a bytes object
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

Setting a timeout

import socket
import urllib.request
import urllib.error
 
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)  # the request times out after 0.1 seconds
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIMEOUT')

(3) Working with the response from urllib.request.urlopen()

First, use type() to print the type of the response:

import urllib.request
 
response = urllib.request.urlopen('https://www.python.org')
print(type(response))

Output:
<class 'http.client.HTTPResponse'>

As you can see, it is an HTTPResponse object. It mainly provides methods such as read(), readinto(), getheader(name), getheaders(), and fileno(), along with attributes such as msg, version, status, reason, debuglevel, and closed.

import urllib.request
 
response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

Output:
200
[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('x-xss-protection', '1; mode=block'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Via', '1.1 varnish'), ('Content-Length', '49094'), ('Accept-Ranges', 'bytes'), ('Date', 'Sun, 30 Sep 2018 03:03:43 GMT'), ('Via', '1.1 varnish'), ('Age', '1389'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2124-IAD, cache-hnd18721-HND'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '3, 881'), ('X-Timer', 'S1538276623.322806,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
nginx

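2. The Request class

Request is an abstraction of a URL request. urlopen() can issue the most basic requests, but its few simple parameters are not enough to build a complete request. If the request needs headers or other information, it can be built with the more powerful Request class.

(1) Constructor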

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

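The url parameter is the URL to request. It is required.
The data parameter, if given, must be of type bytes. If the data is a dict, encode it first with urlencode() from the urllib.parse module. This is the same as the data parameter of urlopen().
The headers parameter is a dict of request headers. Headers can be supplied directly through this parameter when constructing the request, or added afterwards by calling add_header() on the request instance.
The origin_req_host parameter is the host name or IP address of the original request initiated by the user.
The unverifiable parameter indicates whether the request is unverifiable; the default is False. An unverifiable request is one whose URL the user had no option to approve. For example, if we request an image embedded in an HTML document and the user had no chance to approve the automatic fetching of that image, unverifiable should be True.
The method parameter is a string indicating the HTTP method to use, such as GET, POST, or PUT.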
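
A Request object can be inspected before it is sent, which makes it easy to see how the constructor arguments end up in the request. The sketch below is illustrative only: the User-Agent string and the X-Token header are made-up values, and nothing is sent over the network.

```python
from urllib import parse, request

form = parse.urlencode({'name': 'Germey'}).encode('utf-8')
req = request.Request(
    url='http://httpbin.org/put',
    data=form,
    headers={'User-Agent': 'example-client/0.1'},  # hypothetical UA string
    method='PUT',
)
req.add_header('X-Token', 'abc')  # hypothetical header, added after construction

print(req.get_method())              # PUT (the explicit method wins over data)
print(req.full_url)                  # http://httpbin.org/put
print(req.data)                      # b'name=Germey'
print(req.get_header('User-agent'))  # header names are stored capitalized
print(req.has_header('X-token'))     # True
```

Note that Request stores header names with only the first letter capitalized, so lookups with get_header() and has_header() need that spelling.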

(2) Example

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
}
form = {  # named form rather than dict, to avoid shadowing the built-in
    'name': 'Germey'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
req.add_header('Host', 'httpbin.org')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Output:
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "Germey"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
  }, 
  "json": null, 
  "origin": "123.147.199.132", 
  "url": "http://httpbin.org/post"
}

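3. Advanced usage

Beyond the simple request construction shown above, we sometimes need more advanced features, such as login authentication, cookies, and proxies. These call for more powerful tools, and this is where the BaseHandler class in the urllib.request module comes in. It is the parent class of all other handlers and provides the most basic methods, such as default_open() and protocol_request().

The following are some of the classes that inherit from BaseHandler, together with a related helper class: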

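HTTPDefaultErrorHandler: handles HTTP error responses; errors are raised as HTTPError exceptions.
HTTPRedirectHandler: handles redirects.
HTTPCookieProcessor: handles cookies.
ProxyHandler: sets proxies. If no proxy dict is given, the proxies are read from environment variables such as http_proxy.
HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords. (It is a helper used by the authentication handlers rather than a BaseHandler subclass.)
HTTPBasicAuthHandler: manages authentication. If a link requires authentication when opened, this handler can take care of it.

For the other handler classes, see the official documentation: https://docs.python.org/3/library/urllib.request.html#urllib.request.BaseHandler.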
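
To see how handlers plug into an Opener, here is a small sketch that is not from the original article. NoRedirectHandler is a hypothetical subclass written for illustration, and opener.handlers is an attribute of the CPython implementation that is read here only for inspection. The sketch shows that build_opener() installs a set of default handlers, and that passing an instance of a subclass replaces the corresponding default.

```python
from urllib.request import HTTPRedirectHandler, build_opener


class NoRedirectHandler(HTTPRedirectHandler):
    """Hypothetical handler that declines to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means no follow-up request is built for the redirect.
        return None


def handler_names(opener):
    # opener.handlers is a CPython attribute, used here only for inspection.
    return [type(h).__name__ for h in opener.handlers]


default_names = handler_names(build_opener())
custom_names = handler_names(build_opener(NoRedirectHandler()))

print('HTTPRedirectHandler' in default_names)  # True
print('HTTPRedirectHandler' in custom_names)   # False
print('NoRedirectHandler' in custom_names)     # True
```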

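Examples:

(1) Login authentication

Some websites pop up a prompt as soon as they are opened, asking for a user name and password; the page can only be viewed after authentication succeeds. Such a site can be requested as follows: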

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError
 
username = 'username'
password = 'password'
url = 'http://localhost:5000/'
 
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)
 
try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

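Explanation:

First, an HTTPPasswordMgrWithDefaultRealm object is created and the user name and password are added to it with add_password(). That object is then passed to HTTPBasicAuthHandler, which gives us a handler that takes care of authentication.

Next, this handler is passed to build_opener() to build an Opener. Requests sent through this Opener supply the credentials when the server asks for them.

Finally, the Opener's open() method opens the link and completes the authentication. The result is the source of the page behind the authentication.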
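
How the password manager decides which credentials apply can be checked offline. This is a rough sketch under the same placeholder values as the example above (username, password, localhost:5000); 'Any Realm' and example.com are made-up inputs. Because the password was registered with a realm of None, it acts as the default for any realm under that URL.

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

manager = HTTPPasswordMgrWithDefaultRealm()
manager.add_password(None, 'http://localhost:5000/', 'username', 'password')

# A URL below the registered one matches, whatever realm the server announces.
match = manager.find_user_password('Any Realm', 'http://localhost:5000/private/page')
# A different host has no credentials registered.
miss = manager.find_user_password('Any Realm', 'http://example.com/')

print(match)  # ('username', 'password')
print(miss)   # (None, None)
```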

(2) Proxies

Crawlers often need to use a proxy. To add one, you can do the following:

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener
 
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

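Explanation:

Here we set up a proxy locally, running on port 9743.

ProxyHandler takes a dict whose keys are protocol names (such as http or https) and whose values are proxy URLs. Multiple proxies can be added.

Then this handler is passed to build_opener() to construct an Opener, which is used to send the request.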
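
When no dict is passed, urllib looks for proxy settings in the environment. The sketch below, which is an illustration rather than part of the original article, uses urllib.request.getproxies() to show that lookup. detect_http_proxy is a hypothetical helper that sets http_proxy temporarily and restores the environment afterwards; the proxy address is the same placeholder as above.

```python
import os
import urllib.request


def detect_http_proxy(value):
    """Temporarily set http_proxy and return what urllib detects for 'http'."""
    saved = os.environ.get('http_proxy')
    os.environ['http_proxy'] = value
    try:
        return urllib.request.getproxies().get('http')
    finally:
        # Restore the environment exactly as it was.
        if saved is None:
            del os.environ['http_proxy']
        else:
            os.environ['http_proxy'] = saved


print(detect_http_proxy('http://127.0.0.1:9743'))

# An empty dict turns off proxy auto-detection for this opener.
direct_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
```

Passing an empty dict to ProxyHandler, as in the last line, is the documented way to bypass any automatically detected proxy.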

(3) Cookies

a. Getting a website's cookies

import http.cookiejar, urllib.request
 
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

Output:
BAIDUID=ABC21AF733113C91DA9A89E3E7CD9C53:FG=1
BIDUPSID=ABC21AF733113C91DA9A89E3E7CD9C53
H_PS_PSSID=1444_21087
PSTM=1538278865
delPer=0
BDSVRTM=0
BD_HOME=0

Explanation:

First, we must create a CookieJar object.

Next, we use HTTPCookieProcessor to build a handler.

Finally, we build an Opener with build_opener() and call its open() method.

b. Saving cookies to a file in the Mozilla browser cookies format

import http.cookiejar, urllib.request

filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Here CookieJar is replaced with MozillaCookieJar. Running the program above generates a cookies.txt file.

c. Reading a cookie file in the Mozilla browser format

import http.cookiejar, urllib.request

cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

If everything works, the source of the Baidu page is printed.

d. Saving cookies to a file in the libwww-perl (LWP) format

import http.cookiejar, urllib.request

filename = 'cookies.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Here MozillaCookieJar is replaced with LWPCookieJar. Running the program above generates a cookies.txt file.

e. Reading a cookies file in the libwww-perl (LWP) format

import http.cookiejar, urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

If everything works, the source of the Baidu page is printed.
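
The save and load steps for both file formats can also be tried without a network connection. The following is a minimal sketch with assumptions labeled: make_cookie and round_trip are hypothetical helpers, the cookie (name session, domain .example.com, one-hour lifetime) is invented for the test, and the file lives in a temporary directory instead of the working directory.

```python
import os
import tempfile
import time
from http.cookiejar import Cookie, LWPCookieJar, MozillaCookieJar


def make_cookie(name, value):
    """Build a hypothetical cookie for .example.com that expires in one hour."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain='.example.com', domain_specified=True, domain_initial_dot=True,
        path='/', path_specified=True,
        secure=False, expires=int(time.time()) + 3600, discard=False,
        comment=None, comment_url=None, rest={},
    )


def round_trip(jar_class):
    """Save one cookie with jar_class, reload it, and return (name, value) pairs."""
    with tempfile.TemporaryDirectory() as tmp:
        filename = os.path.join(tmp, 'cookies.txt')
        jar = jar_class(filename)
        jar.set_cookie(make_cookie('session', 'abc123'))
        jar.save(ignore_discard=True, ignore_expires=True)

        loaded = jar_class()
        loaded.load(filename, ignore_discard=True, ignore_expires=True)
        return [(c.name, c.value) for c in loaded]


print(round_trip(MozillaCookieJar))
print(round_trip(LWPCookieJar))
```

Both calls should report the same single cookie; only the on-disk format differs between the two jar classes.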

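II. The error module

urllib's error module defines the exceptions raised by the request module. When something goes wrong, the request module raises an exception defined in the error module.

1. The URLError class

URLError comes from urllib's error module. It inherits from OSError and is the base class of the error module's exceptions, so exceptions raised by the request module can be handled by catching it. It has a reason attribute that gives the cause of the error.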

from urllib import request, error
try:
    response = request.urlopen('https://blog.csdn.net/1')
except error.URLError as e:
    print(e.reason)

Output:
Not Found

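2. The HTTPError class

It is a subclass of URLError dedicated to HTTP request errors, such as a failed authentication request. It has the following three attributes.

code: the HTTP status code, for example 404 for a page that does not exist or 500 for an internal server error.
reason: as in the parent class, the cause of the error.
headers: the HTTP response headers of the failed request.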

from urllib import request,error
try:
    response = request.urlopen('https://blog.csdn.net/1')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')

Output:
Not Found
404
Server: openresty
Date: Sun, 30 Sep 2018 06:13:52 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 12868
Connection: close
Vary: Accept-Encoding

Because URLError is the parent class of HTTPError, we can catch the subclass first and then the parent class. A better way to write the code above is therefore:

from urllib import request, error

try:
    response = request.urlopen('https://blog.csdn.net/1')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

Output:
Not Found
404
Server: openresty
Date: Sun, 30 Sep 2018 06:17:53 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 12868
Connection: close
Vary: Accept-Encoding
Set-Cookie: uuid_tt_dd=10_20732824360-1538288273861-180630; Expires=Thu, 01 Jan 2025 00:00:00 GMT; Path=/; Domain=.csdn.net;
Set-Cookie: dc_session_id=10_1538288273861.705662; Expires=Thu, 01 Jan 2025 00:00:00 GMT; Path=/; Domain=.csdn.net;
ETag: "5b0fa58c-3244"
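
The ordering of the except clauses can be verified without a live server by constructing the exceptions by hand. This is only a sketch: classify is a hypothetical helper, and the URL, status code, and messages are invented inputs.

```python
from urllib.error import HTTPError, URLError


def classify(exc):
    """Raise exc and report which except clause handles it."""
    try:
        raise exc
    except HTTPError as e:  # subclass first
        return 'HTTPError %d %s' % (e.code, e.reason)
    except URLError as e:   # parent class second
        return 'URLError %s' % e.reason


print(issubclass(HTTPError, URLError), issubclass(URLError, OSError))  # True True
print(classify(HTTPError('http://example.com/missing', 404, 'Not Found', None, None)))
print(classify(URLError('connection refused')))
```

If the two except clauses were swapped, the URLError clause would catch HTTPError as well, and the status code would never be reported.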

Sometimes the reason attribute is not a string but an object. For example:

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIMEOUT')
      
Output:
<class 'socket.timeout'>
TIMEOUT
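
The isinstance check on reason can be wrapped in a small helper. The sketch below is an illustration, with is_timeout as a hypothetical name. It also accepts a bare socket.timeout, on the assumption that a timeout which occurs while reading the response body may be raised directly rather than wrapped in URLError.

```python
import socket
from urllib.error import URLError


def is_timeout(exc):
    """Return True if exc is a timeout, whether bare or wrapped in URLError."""
    if isinstance(exc, socket.timeout):
        return True
    return isinstance(exc, URLError) and isinstance(exc.reason, socket.timeout)


print(is_timeout(URLError(socket.timeout('timed out'))))  # True
print(is_timeout(socket.timeout('timed out')))            # True
print(is_timeout(URLError('connection refused')))         # False
```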

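III. The parse module

The parse module defines a standard interface for handling URLs, for example extracting and combining the parts of a URL and converting links. It supports URLs of the following schemes: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, sip, sips, snews, svn, svn+ssh, telnet, and wais.

1. The urlparse() function

This function identifies a URL and splits it into its components.

(1) Syntax

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

urlstring: required; the URL to parse.
scheme: the default scheme (such as http or https). It is used only when the URL itself does not specify one.
allow_fragments: whether to recognize the fragment. If it is set to False, the fragment is not split off; it is parsed as part of the path, parameters, or query, and the fragment component is empty.

A basic URL generally consists of six components (scheme://netloc/path;parameters?query#fragment). Each component is a string, possibly empty. For example: http://www.baidu.com/index.html;user?id=5#comment

Besides these six components, the result also has the following additional read-only convenience attributes (see the table below):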

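Attribute	Index	Value	Value if not present
scheme	0	URL scheme	scheme parameter
netloc	1	Network location (host and port)	empty string
path	2	Hierarchical path	empty string
params	3	Parameters for the last path element	empty string
query	4	Query component	empty string
fragment	5	Fragment identifier	empty string
username		User name	None
password		Password	None
hostname		Host name (lower case)	None
port		Port number as an integer, if present	None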
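
The convenience attributes in the last four rows can be seen with a URL that carries credentials and a port. This is a small illustrative sketch; the user name, password, and host are made-up values.

```python
from urllib.parse import urlparse

result = urlparse('http://user:secret@WWW.Example.com:8080/index.html')
print(result.netloc)    # user:secret@WWW.Example.com:8080
print(result.username)  # user
print(result.password)  # secret
print(result.hostname)  # www.example.com (lower-cased)
print(result.port)      # 8080 (an int)

# When the parts are absent, the attributes are None.
bare = urlparse('http://www.example.com/index.html')
print(bare.username, bare.port)  # None None
```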

(2) Example

from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)
print(result.scheme, result[0])
print(result.netloc, result[1])
print(result.path, result[2])
print(result.params, result[3])
print(result.query, result[4])
print(result.fragment, result[5])

Output:
<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
http http
www.baidu.com www.baidu.com
/index.html /index.html
user user
id=5 id=5
comment comment

As you can see, the result is a ParseResult object with six parts: scheme, netloc, path, params, query, and fragment.
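
The scheme and allow_fragments parameters of urlparse() are easiest to understand by comparing results. The following sketch is an addition for illustration and reuses the article's sample URL.

```python
from urllib.parse import urlparse

# scheme is only a default: it applies when the URL has no scheme of its own.
r1 = urlparse('//www.baidu.com/index.html', scheme='https')
r2 = urlparse('http://www.baidu.com/index.html', scheme='https')
print(r1.scheme, r2.scheme)  # https http

# With allow_fragments=False the '#...' part stays attached to the query...
r3 = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(repr(r3.query), repr(r3.fragment))  # 'id=5#comment' ''

# ...or to the path when there is no query.
r4 = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(repr(r4.path), repr(r4.fragment))   # '/index.html#comment' ''
```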

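2. The urlunparse() function

Where there is urlparse(), there is also its inverse, urlunparse(). It accepts an iterable, which must have exactly six elements; otherwise it raises an error about too few or too many values.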

from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

Output:
http://www.baidu.com/index.html;user?a=6#comment
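
Two properties of urlunparse() are worth checking: it reverses urlparse(), and it rejects an iterable of the wrong length. The sketch below is illustrative; in CPython the length error surfaces as a ValueError, and the exact message is not shown here because it may vary between versions.

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?a=6#comment'
round_tripped = urlunparse(urlparse(url))
print(round_tripped == url)  # True

try:
    # Only five items: the fragment is missing.
    urlunparse(['http', 'www.baidu.com', 'index.html', 'user', 'a=6'])
    raised = False
except ValueError as e:
    raised = True
    print('ValueError:', e)
```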

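3. The urljoin() function

This function takes a base_url (base link) as its first argument and a new link as its second. It analyzes the scheme, netloc, and path of base_url, uses them to fill in whatever is missing from the new link, and returns the result.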

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))

Output:
http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/index.php
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2

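As you can see, base_url supplies three things: scheme, netloc, and path. If any of these is missing from the new link, it is filled in from base_url; if the new link has it, the new link's own value is used. The params, query, and fragment of base_url have no effect.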
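
In a crawler, urljoin() is most often used to resolve relative links found on a page. The following sketch adds a few such cases; the page URL and the link targets are invented for illustration.

```python
from urllib.parse import urljoin

page = 'http://www.baidu.com/a/b/c.html'

sibling = urljoin(page, 'd.html')                      # same directory
parent = urljoin(page, '../d.html')                    # one directory up
rooted = urljoin(page, '/e.html')                      # path from the site root
other = urljoin(page, '//cuiqingcai.com/FAQ.html')     # new host, scheme from base

print(sibling)  # http://www.baidu.com/a/b/d.html
print(parent)   # http://www.baidu.com/a/d.html
print(rooted)   # http://www.baidu.com/e.html
print(other)    # http://cuiqingcai.com/FAQ.html
```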

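4. The urlencode() function

This function serializes a dict into a query string, which is handy for building GET request parameters.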

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

Output:
http://www.baidu.com?name=germey&age=22
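
urlencode() also escapes special characters and can expand list values. The following is a short sketch with made-up parameter names; doseq=True is the option that turns a sequence value into repeated keys.

```python
from urllib.parse import urlencode

escaped = urlencode({'q': 'a b&c'})                 # space and '&' are escaped
chinese = urlencode({'wd': '中文'})                  # non-ASCII is UTF-8 percent-encoded
repeated = urlencode({'tag': ['x', 'y']}, doseq=True)
ordered = urlencode([('b', 2), ('a', 1)])           # a list of pairs keeps its order

print(escaped)   # q=a+b%26c
print(chinese)   # wd=%E4%B8%AD%E6%96%87
print(repeated)  # tag=x&tag=y
print(ordered)   # b=2&a=1
```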

5. The parse_qs() function

This function converts a string of GET request parameters back into a dict.

from urllib.parse import parse_qs

query = 'name=germey&age=22'
print(parse_qs(query))

Output:
{'name': ['germey'], 'age': ['22']}

6. The parse_qsl() function

This function converts a string of GET request parameters into a list of tuples.

from urllib.parse import parse_qsl

query = 'name=germey&age=22'
print(parse_qsl(query))

Output:
[('name', 'germey'), ('age', '22')]

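As you can see, the result is a list in which every element is a tuple: the first item of each tuple is the parameter name and the second is the parameter value.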
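
The difference between parse_qs() and parse_qsl() shows up with repeated keys and blank values. The sketch below is an addition for illustration, with invented query strings.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

query = 'tag=a&tag=b&name=germey'

as_dict = parse_qs(query)    # repeated keys are collected into one list
as_pairs = parse_qsl(query)  # order and repetition are preserved
print(as_dict)   # {'tag': ['a', 'b'], 'name': ['germey']}
print(as_pairs)  # [('tag', 'a'), ('tag', 'b'), ('name', 'germey')]

# Blank values are dropped unless keep_blank_values=True is passed.
dropped = parse_qs('a=&b=1')
kept = parse_qs('a=&b=1', keep_blank_values=True)
print(dropped)  # {'b': ['1']}
print(kept)     # {'a': [''], 'b': ['1']}

# parse_qsl() and urlencode() are inverses for simple queries.
print(urlencode(as_pairs) == query)  # True
```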

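7. The quote() function

This function converts content to URL-encoded (percent-encoded) form. When a URL carries Chinese parameters, it can sometimes lead to garbled text; this function converts the Chinese characters to URL encoding.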

from urllib.parse import quote

keyword = '美女'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)

Output:
https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3

8. The unquote() function

Where there is quote(), there is of course unquote(), which performs URL decoding.

from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3'
print(unquote(url))

Output:
https://www.baidu.com/s?wd=美女
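
A few variations on quote() and unquote() are sketched below for illustration. The safe parameter controls which characters are left unescaped (by default only '/'), and quote_plus() and unquote_plus() are the variants that treat a space as '+', as in form data. The sample strings are invented.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

path = quote('/a b/中文')        # '/' is kept by default
strict = quote('a/b', safe='')   # with safe='', '/' is escaped too
plus = quote_plus('a b&c')       # space becomes '+'

print(path)    # /a%20b/%E4%B8%AD%E6%96%87
print(strict)  # a%2Fb
print(plus)    # a+b%26c

print(unquote(path))       # /a b/中文
print(unquote_plus(plus))  # a b&c
```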

Reference: 靜覓 (Jingmi), Python3網絡爬蟲開發實戰 (Python 3 Web Crawler Development in Practice)
