The urllib module

Date: 2019-12-02

Tags: urllib, module

In Python 2, requests were sent with two libraries, urllib and urllib2. Python 3 has only urllib.

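First, note that urllib is Python's built-in HTTP request library, so no extra package needs to be installed. It consists mainly of the following four modules.

request: the most basic HTTP request module, used to simulate sending requests.
error: the exception handling module. If an error occurs, we can catch it so that the program does not terminate.
parse: a utility module that provides many URL-handling functions, such as splitting, parsing, and joining.
robotparser: mainly used to read a site's robots.txt file and decide whether the site may be crawled (rarely used).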
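Since robotparser gets no further coverage here, a minimal sketch of how it answers "may this URL be crawled?" is given below. The rules and the example.com URLs are made up for illustration and are fed in directly, so no network access is involved.

```python
from urllib import robotparser

# Hypothetical robots.txt rules, passed in as lines instead of being downloaded.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)  # parse() accepts an iterable of robots.txt lines

allowed_public = rp.can_fetch("*", "http://example.com/index.html")
allowed_private = rp.can_fetch("*", "http://example.com/private/data.html")
print(allowed_public, allowed_private)
```

For a real site one would normally call set_url() and read() instead of parse().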

I. Sending requests: the urllib.request module

With the urllib.request module we can send a request and receive the response.

1. urlopen()

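urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

url: the URL to open
data: the data to submit with POST
timeout: the timeout, in seconds, for accessing the site
context: must be an ssl.SSLContext instance; specifies the SSL settings
cafile and capath: the CA certificate file and the directory of CA certificates

Fetch a page directly with urlopen() from the urllib.request module. The data returned by read() is of type bytes and has to be decoded with decode() to obtain a str.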

from urllib import request

response = request.urlopen(r'http://python.org/')  # returns an http.client.HTTPResponse object
page = response.read()
page = page.decode('utf-8')

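The object returned by urlopen provides these methods:

read(), readline(), readlines(), fileno(), close(): operate on the HTTPResponse data
info(): returns an HTTPMessage object holding the headers sent back by the remote server
getcode(): returns the HTTP status code. For an HTTP request, 200 means the request succeeded and 404 means the URL was not found
geturl(): returns the URL of the resource that was retrieved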
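The methods listed above can be tried without touching a real site. The sketch below is only an illustration: HelloHandler is a throwaway local server invented for the demo, and the request goes through an opener with an empty ProxyHandler (instead of plain urlopen) so that system proxy settings cannot interfere. The response object is the same HTTPResponse type either way.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Tiny stand-in server (an assumption for this demo) returning fixed HTML."""

    def do_GET(self):
        body = "<h1>hello</h1>".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: let the OS pick one
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

# Bypass any system proxy so the request reaches the local server directly.
opener = request.build_opener(request.ProxyHandler({}))
with opener.open(url, timeout=5) as response:
    status = response.getcode()
    final_url = response.geturl()
    content_type = response.info().get("Content-Type")
    page = response.read().decode("utf-8")  # bytes -> str

server.shutdown()
server.server_close()
print(status, final_url, content_type, page)
```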

The Python 2 version:

import urllib2
# Pass the url to Request() to construct and return a Request object
request = urllib2.Request("http://www.baidu.com")
# Pass the Request object to urlopen() to send it to the server and receive the response
response = urllib2.urlopen(request)
html = response.read()
print html

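2. Using Request

(1) In our first example, the argument to urlopen() was just a URL. For anything more complex, such as adding HTTP headers, we have to create a Request instance and pass it to urlopen(); the URL to visit is then an argument of the Request instance.

`urllib.request.Request`(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

url: the URL to open
data: the data to submit with POST; must be of type bytes. A dict can be converted with parse.urlencode() followed by encode()
headers: a dict of request headers
origin_req_host: the host name or IP address of the requester
unverifiable: whether the request is unverifiable, that is, not one the user explicitly chose to make; defaults to False
method: the request method, such as GET or POST

Use Request() to wrap the request, then fetch the page with urlopen().

PS: in Python 2 this is `urllib2.Request`(url, data=None, headers={}, origin_req_host=None, unverifiable=False)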
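How the data and method parameters decide the HTTP verb can be checked without sending anything. A small sketch follows; example.com and the demo-client header value are placeholders, not taken from the article.

```python
from urllib import request

# Placeholder URL and header value; nothing is sent over the network.
url = "http://example.com/api"

plain = request.Request(url)
with_data = request.Request(url, data=b"kd=Python")
explicit = request.Request(
    url,
    data=b"kd=Python",
    headers={"User-Agent": "demo-client/0.1"},
    method="PUT",
)

print(plain.get_method())      # no data and no method given
print(with_data.get_method())  # data present
print(explicit.get_method())   # method= overrides the default
print(explicit.full_url, explicit.data, explicit.header_items())
```

The three get_method() calls report GET, POST and PUT respectively, which is why passing data alone is enough to turn a request into a POST.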

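Headers commonly set on the request:

- User-Agent: this header can carry the browser name and version, the operating system name and version, and the default language

- Referer: can be used to prevent hotlinking. Some sites only serve images when the request comes from http://***.com, and they check the Referer to decide

- Connection: states how the connection should be handled; keep-alive asks the server to keep the connection open for further requests.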

from urllib import request

url = r'http://www.lagou.com/zhaopin/Python/?labelWords=label'
headers = {
     'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                   r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
     'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
     'Connection': 'keep-alive'
 }

req = request.Request(url, headers=headers)
page = request.urlopen(req).read().decode('utf-8')
#page = page.decode('utf-8')
print(page)

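(2) Adding more header information

Adding specific headers to the HTTP request lets us construct a complete HTTP request message.

Call Request.add_header() to add or modify a specific header, and Request.get_header() to look at a header that is already set.

The first listing below adds a specific header; the second adds or modifies a randomly chosen User-Agent.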

import urllib.request

url = "http://www.itcast.cn"
# User-Agent of IE 9.0
header = {"User-Agent" : "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}
req = urllib.request.Request(url, headers = header)
# Request.add_header() adds or modifies a specific header
req.add_header("Connection", "keep-alive")
# Request.get_header() returns the value of a header
req.get_header(header_name="Connection")
response = urllib.request.urlopen(req)
print(response.code)     # the response status code
html = response.read()
print(html)

import urllib2
import random

url = "http://www.itcast.cn"
ua_list = [
    "Mozilla/5.0 (Windows NT 6.1; ) Apple.... ",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0)... ",
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X.... ",
    "Mozilla/5.0 (Macintosh; Intel Mac OS... "
]
user_agent = random.choice(ua_list)
request = urllib2.Request(url)
# Request.add_header() adds or modifies a specific header
request.add_header("User-Agent", user_agent)
# For get_header() the name has its first letter in upper case and the rest in lower case
request.get_header("User-agent")
response = urllib2.urlopen(request)
html = response.read()
print html
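The capitalization rule mentioned in the comment applies to Python 3 as well. The sketch below shows it on a Request that is never sent; the URL and the demo-client value are placeholders.

```python
from urllib import request

req = request.Request("http://example.com/")  # placeholder URL, never sent
req.add_header("User-Agent", "demo-client/0.1")
req.add_header("Connection", "keep-alive")

# Header names are stored capitalized: first letter upper case, the rest lower case.
stored = req.get_header("User-agent")
missed = req.get_header("User-Agent")       # lookup is an exact match, so this misses
fallback = req.get_header("Accept", "*/*")  # the second argument is a default value
print(stored, missed, fallback, req.has_header("Connection"))
```

Looking a header up under its usual spelling ("User-Agent") therefore returns None even though the header is set.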

3. POST data

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

The data parameter of urlopen() defaults to None. When data is not None, urlopen() submits the request with POST.

urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None)

urlencode() converts the data to be submitted into a URL-encoded query string.

POST data must be bytes or an iterable of bytes, not str, so the string has to be encoded with encode().

from urllib import request, parse

url = r'http://www.lagou.com/jobs/positionAjax.json?'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
    'Referer': r'http://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Connection': 'keep-alive'
  }
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python'
  }

data = parse.urlencode(data).encode('utf-8')
req = request.Request(url, headers=headers, data=data)
page = request.urlopen(req).read()
page = page.decode('utf-8')

4. GET

A GET request is normally used to fetch data from the server, as in the following exercise.

from urllib import request,parse

def tiebaSpider(url, beginPage, endPage):
    """
        Purpose: build the url of each page and send a request for it
        url: the first url to process
        beginPage: the first page to crawl
        endPage: the last page to crawl
    """
    for page in range(beginPage, endPage + 1):
        pn = (page - 1) * 50
        filename = "page_" + str(page) + ".html"
        # Build the full url; pn grows by 50 for each page
        fullurl = url + "&pn=" + str(pn)
        #print fullurl
        # Call loadPage() to send the request and get the HTML page
        html = loadPage(fullurl, filename)
        # Write the HTML page to a file on the local disk
        writeFile(html, filename)

def loadPage(url, filename):
    '''
        Purpose: send a request to url and return the server's response
        url: the url to crawl
        filename: the file name
    '''
    print("Downloading " + filename)
    headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}

    req = request.Request(url, headers = headers)
    response = request.urlopen(req)
    return(response.read())

def writeFile(html, filename):
    """
        Purpose: save the server's response to a file on the local disk
        html: the server's response
        filename: name of the local file
    """
    print("Saving " + filename)
    with open(filename, 'wb') as f:
        f.write(html)
    print("-" * 20)

# Simulated main function
if __name__ == "__main__":
    kw = input("Tieba forum to crawl: ")
    # Read the first and last page and convert them from str to int
    beginPage = int(input("First page: "))
    endPage = int(input("Last page: "))
    url = "http://tieba.baidu.com/f?"
    key = parse.urlencode({"kw" : kw})
    # Example of the combined url: http://tieba.baidu.com/f?kw=lol
    url = url + key
    tiebaSpider(url, beginPage, endPage)
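The URL arithmetic in the spider is easy to check in isolation. In the sketch below, build_page_urls and PAGE_SIZE are hypothetical names introduced for the illustration; the function only builds the URLs and sends nothing.

```python
from urllib import parse

BASE_URL = "http://tieba.baidu.com/f?"
PAGE_SIZE = 50  # the pn step used by the spider above


def build_page_urls(kw, begin_page, end_page):
    """Return one URL per page, with pn advancing by PAGE_SIZE each time."""
    key = parse.urlencode({"kw": kw})
    urls = []
    for page in range(begin_page, end_page + 1):
        pn = (page - 1) * PAGE_SIZE
        urls.append(BASE_URL + key + "&pn=" + str(pn))
    return urls


page_urls = build_page_urls("lol", 1, 3)
for u in page_urls:
    print(u)
```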

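5. Authentication

Some sites pop up a prompt asking for a user name and password when opened, and show the page only after authentication succeeds. To request such a page we need HTTPBasicAuthHandler and build_opener.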

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = ''
password_mgr = HTTPPasswordMgrWithDefaultRealm()  # create the password manager
password_mgr.add_password(None, url, username, password)  # register the user name and password
auth_handler = HTTPBasicAuthHandler(password_mgr)  # create a handler that performs the authentication
opener = build_opener(auth_handler)  # build an opener with build_opener
try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
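What the password manager does with the registered credentials can be seen without a server. A sketch follows, with a made-up site, user name and password: the handler asks the manager for credentials matching the URL it is about to open, and any URL under the registered prefix qualifies.

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

# Placeholder site and credentials, for illustration only.
protected_root = "http://example.com/secure/"

manager = HTTPPasswordMgrWithDefaultRealm()
manager.add_password(None, protected_root, "alice", "s3cret")

# realm=None at registration means "whatever realm the server names".
inside = manager.find_user_password("Some Realm", "http://example.com/secure/page.html")
outside = manager.find_user_password("Some Realm", "http://other.example/")
print(inside, outside)
```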

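6. Proxies

Crawling web pages sooner or later requires a proxy. To use one, use ProxyHandler. Its argument is a dict whose keys are protocol types (http, https, ftp and so on) and whose values are the proxy addresses; several proxies can be given.

Example 1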

from urllib.request import ProxyHandler,build_opener
from urllib.error import URLError

proxy_handler = ProxyHandler(  # create a proxy handler
    {
        'http':'http://IP:port',
        'https':'https://IP:port',
    }
)
opener = build_opener(proxy_handler)  # build an opener with build_opener
try:
    result = opener.open('https://www.baidu.com')
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

Example 2

import urllib2
# Build two proxy handlers, one with a proxy IP and one without
httpproxy_handler = urllib2.ProxyHandler({"http" : "124.88.67.81:80"})
nullproxy_handler = urllib2.ProxyHandler({})

proxySwitch = True  # a switch that turns the proxy on or off

# Use these proxy handlers with urllib2.build_opener() to create a custom opener
# The proxy mode depends on whether the switch is on
if proxySwitch:
    opener = urllib2.build_opener(httpproxy_handler)
else:
    opener = urllib2.build_opener(nullproxy_handler)

request = urllib2.Request("http://www.baidu.com/")

# 1. Written this way, only requests sent with opener.open() use the custom proxy; urlopen() does not.
response = opener.open(request)

# 2. Written this way, the opener is installed globally, and every later request,
#    whether sent with opener.open() or urlopen(), uses the custom proxy.
# urllib2.install_opener(opener)
# response = urllib2.urlopen(request)

print response.read()

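from urllib import request, parse

# get_page is an assumed wrapper name; the source fragment shows only the body.
def get_page(url):
    data = {
        'first': 'true',
        'pn': 1,
        'kd': 'Python'
    }
    proxy = request.ProxyHandler({'http': '5.22.195.215:80'})  # set the proxy
    opener = request.build_opener(proxy)  # build an opener that uses it
    request.install_opener(opener)  # install the opener globally
    data = parse.urlencode(data).encode('utf-8')
    page = opener.open(url, data).read()
    page = page.decode('utf-8')
    return page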
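The proxy switch from Example 2 can be written for Python 3 and inspected without sending a request. In this sketch, make_opener, proxy_settings and the 127.0.0.1:8080 address are hypothetical names and values chosen for the illustration.

```python
from urllib.request import ProxyHandler, build_opener


def make_opener(use_proxy, proxy_url="http://127.0.0.1:8080"):
    """Build an opener that routes http traffic via proxy_url or goes direct.

    proxy_url is a made-up placeholder address.
    """
    if use_proxy:
        handler = ProxyHandler({"http": proxy_url})
    else:
        handler = ProxyHandler({})  # empty dict: ignore system proxy settings
    return build_opener(handler)


def proxy_settings(opener):
    """Collect the proxy mapping of every ProxyHandler registered on the opener."""
    return [h.proxies for h in opener.handlers if isinstance(h, ProxyHandler)]


proxied = proxy_settings(make_opener(True))
direct = proxy_settings(make_opener(False))
print(proxied, direct)
```

Only requests sent through the returned opener use these settings; calling install_opener() on it would extend them to urlopen() as well.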

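II. Exception handling: urllib.error

When we send a request with urlopen or opener.open and it cannot be handled, for example because of a network problem, an error is raised.

1) URLError

The main causes of a URLError are:

- 1. No network connection

- 2. The connection to the server failed

- 3. The specified server could not be found

These exceptions can be caught with a try/except statement.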
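The second cause, a failed connection, can be reproduced without leaving the machine. The following is a minimal sketch, assuming a local port that nothing listens on; the probe socket and the proxy-bypassing opener are additions for the demo.

```python
import socket
from urllib import request, error

# Find a local port with nothing listening on it: bind, note the number, close.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
closed_port = probe.getsockname()[1]
probe.close()

opener = request.build_opener(request.ProxyHandler({}))  # skip system proxies
try:
    opener.open("http://127.0.0.1:%d/" % closed_port, timeout=5)
    outcome = "connected"
except error.URLError as e:
    # e.reason carries the underlying cause, here an error from the socket layer
    outcome = "URLError: %r" % (e.reason,)
print(outcome)
```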

# A host name that cannot be resolved raises URLError
from urllib import request,parse,error
url = 'http://www.1232435erfefre.com'
req = request.Request(url)
try:
    resp = request.urlopen(req)
except error.URLError as e:
    print(e)
# Output: <urlopen error [Errno 11004] getaddrinfo failed>

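2) HTTPError

HTTPError is a subclass of URLError. For every request we send, the server returns a response that contains a numeric status code.

When urlopen or opener.open cannot handle the response, an HTTPError carrying that status code is raised. The HTTP status code expresses the state of the response returned over HTTP.

Note that codes in the 100-299 range indicate success, and redirects in the 3xx range are normally followed automatically, so the error codes we actually see are in the 400-599 range.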

from urllib import request,parse,error

#url1 = 'http://www.1232435erfefre.com'
url2 = 'http://blog.baidu.com/itcast'

req = request.Request(url2)
try:
    resp = request.urlopen(req)
except error.HTTPError as e:
    print(e, e.code)

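3) Improved version

Because HTTPError is a subclass of URLError, the except clause for the parent class must come after the one for the subclass; otherwise the parent clause would catch HTTPError first. The code can therefore be changed as follows.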

from urllib import request,parse,error

#url1 = 'http://www.1232435erfefre.com'
url2 = 'http://blog.baidu.com/itcast'

req = request.Request(url2)
try:
    resp = request.urlopen(req)
except error.HTTPError as e:
    print(e.code)
    print(e)
except error.URLError as e:
    print(e.reason,e)
else:
    print('hehe')
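The ordering of the except clauses can be exercised offline. The sketch below is an illustration under assumptions: NotFoundHandler is a throwaway local server invented for the demo that answers every GET with 404, and fetch_status is a hypothetical helper name.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request, error


class NotFoundHandler(BaseHTTPRequestHandler):
    """Stand-in server (assumed for the demo) that answers every GET with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_status(opener, url):
    """Return a short label describing how the request ended."""
    try:
        with opener.open(url, timeout=5) as resp:
            return "ok %d" % resp.getcode()
    except error.HTTPError as e:  # subclass first
        return "http error %d" % e.code
    except error.URLError:        # parent class second
        return "url error"


server = HTTPServer(("127.0.0.1", 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
opener = request.build_opener(request.ProxyHandler({}))  # skip system proxies
label = fetch_status(opener, "http://127.0.0.1:%d/missing" % server.server_address[1])
server.shutdown()
server.server_close()

print(label)
print(issubclass(error.HTTPError, error.URLError))
```

Note that code is read as an attribute (e.code), not called as a method.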

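4) HTTP server response status codes

1xx: Informational
100 Continue
The server has received only part of the request. As long as the server has not rejected it, the client should go on to send the rest of the request.
101 Switching Protocols
The server switches protocols: it complies with the client's request to change to another protocol.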
 
 
 
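2xx: Success
200 OK
The request succeeded (followed by the response document for GET and POST requests).
201 Created
The request has been fulfilled and a new resource has been created.
202 Accepted
The request has been accepted for processing, but processing is not complete.
203 Non-authoritative Information
The document was returned normally, but some response headers may be incorrect because a copy of the document was used.
204 No Content
There is no new document; the browser should keep displaying the previous one. This code is useful when the user refreshes the page periodically and the servlet can tell that the user's document is recent enough.
205 Reset Content
There is no new document, but the browser should reset what it displays. Used to force the browser to clear form input.
206 Partial Content
The client sent a GET request with a Range header, and the server fulfilled it.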
 
 
 
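3xx: Redirection
300 Multiple Choices
Multiple choices: a list of links. The user can pick a link to reach the destination. At most five addresses are allowed.
301 Moved Permanently
The requested page has moved to a new URL.
302 Moved Temporarily
The requested page has moved temporarily to a new URL.
303 See Other
The requested page can be found under a different URL.
304 Not Modified
The document has not been modified. The client has a cached copy and made a conditional request (usually with an If-Modified-Since header, meaning the client only wants the document if it is newer than the given date). The server tells the client that the cached copy can still be used.
305 Use Proxy
The requested document should be retrieved through the proxy server named in the Location header.
306 Unused
This code was used in an earlier version. It is no longer used, but the code remains reserved.
307 Temporary Redirect
The requested page has moved temporarily to a new URL.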
 
 
 
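4xx: Client errors
400 Bad Request
The server could not understand the request.
401 Unauthorized
The requested page requires a user name and password.
401.1
Logon failed.
401.2
Logon failed due to server configuration.
401.3
Not authorized because of an ACL restriction on the resource.
401.4
Authorization failed by a filter.
401.5
Authorization failed by an ISAPI/CGI application.
401.7
Access denied by the URL authorization policy on the web server. This error code is specific to IIS 6.0.
402 Payment Required
This code is not yet in use.
403 Forbidden
Access to the requested page is forbidden.
403.1
Execute access forbidden.
403.2
Read access forbidden.
403.3
Write access forbidden.
403.4
SSL required.
403.5
SSL 128 required.
403.6
IP address rejected.
403.7
Client certificate required.
403.8
Site access denied.
403.9
Too many users.
403.10
Invalid configuration.
403.11
Password change.
403.12
Mapper denied access.
403.13
Client certificate revoked.
403.14
Directory listing denied.
403.15
Client access licenses exceeded.
403.16
Client certificate is untrusted or invalid.
403.17
Client certificate has expired or is not yet valid.
403.18
The requested URL cannot be executed in the current application pool. This error code is specific to IIS 6.0.
403.19
CGIs cannot be executed for the client in this application pool. This error code is specific to IIS 6.0.
403.20
Passport logon failed. This error code is specific to IIS 6.0.
404 Not Found
The server cannot find the requested page.
404.0
File or directory not found.
404.1
The web site cannot be accessed on the requested port.
404.2
The web service extension lockdown policy prevents this request.
404.3
The MIME map policy prevents this request.
405 Method Not Allowed
The method specified in the request is not allowed.
406 Not Acceptable
The response generated by the server cannot be accepted by the client.
407 Proxy Authentication Required
The user must first authenticate with the proxy server before the request can be processed.
408 Request Timeout
The request took longer than the server was prepared to wait.
409 Conflict
The request could not be completed because of a conflict.
410 Gone
The requested page is no longer available.
411 Length Required
"Content-Length" is not defined. The server will not accept the request without it.
412 Precondition Failed
A precondition in the request was evaluated as false by the server.
413 Request Entity Too Large
The server will not accept the request because the request entity is too large.
414 Request-url Too Long
The server will not accept the request because the URL is too long. This happens when a POST request is converted into a GET request with very long query information.
415 Unsupported Media Type
The server will not accept the request because the media type is not supported.
416 Requested Range Not Satisfiable
The server cannot satisfy the Range header specified in the request.
417 Expectation Failed
The expectation given in the Expect request header could not be met.
423 Locked
The resource being accessed is locked.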
 
 
 
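5xx: Server errors
500 Internal Server Error
The request was not completed. The server encountered an unexpected condition.
500.12
The application is busy restarting on the web server.
500.13
The web server is too busy.
500.15
Direct requests for Global.asa are not allowed.
500.16
UNC authorization credentials are incorrect. This error code is specific to IIS 6.0.
500.18
The URL authorization store cannot be opened. This error code is specific to IIS 6.0.
500.100
Internal ASP error.
501 Not Implemented
The request was not completed. The server does not support the requested functionality.
502 Bad Gateway
The request was not completed. The server received an invalid response from the upstream server.
502.1
CGI application timeout.
502.2
Error in CGI application.
503 Service Unavailable
The request was not completed. The server is temporarily overloaded or down.
504 Gateway Timeout
The gateway timed out.
505 HTTP Version Not Supported
The server does not support the HTTP protocol version specified in the request.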
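The standard codes in this table are also available from Python's http module, which saves hard-coding them. A short sketch follows; status_class is a hypothetical helper that maps a code to the five classes above by its first digit. The IIS sub-codes such as 403.14 are not part of HTTPStatus.

```python
from http import HTTPStatus


def status_class(code):
    """Map a status code to one of the five classes above, by its first digit."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "unknown")


for code in (200, 301, 404, 503):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```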

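III. Parsing URLs: the urllib.parse module

1.urlparse()

This function identifies and splits a URL into six parts: scheme (protocol), netloc (domain), path (access path), params (parameters), query (query conditions), and fragment (anchor).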

from urllib import parse

result = parse.urlparse('https://blog.csdn.net/pleasecallmewhy/article/details/8924889')
print(result)
# Result:
'''
ParseResult(
    scheme='https',
    netloc='blog.csdn.net',
    path='/pleasecallmewhy/article/details/8924889',
    params='',
    query='',
    fragment=''
)
'''
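The URL in the listing above leaves params, query and fragment empty. For a look at all six parts, here is a sketch with a made-up URL in which every component is present.

```python
from urllib import parse

# Placeholder URL chosen so that every component is non-empty.
result = parse.urlparse("http://www.baidu.com:8080/index.html;user?a=8#comment")

print(result.scheme, result.netloc, result.path)
print(result.params, result.query, result.fragment)
# The result is a named tuple, so indexing works; hostname and port are derived.
print(result[1], result.hostname, result.port)
```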

2.urlunparse()

Accepts an iterable, such as a list, a tuple, or a specific data structure, whose length must be 6. It does the opposite of urlparse().

from urllib import parse

dt = ['http','www.baidu.com','ondex.html','user','a=8','comment']
print(parse.urlunparse(dt))
# http://www.baidu.com/ondex.html;user?a=8#comment
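Because urlunparse() is the inverse of urlparse(), the two combine into a simple way to edit one part of a URL. A minimal sketch, using a placeholder URL:

```python
from urllib import parse

original = "http://www.baidu.com/index.html;user?a=8#comment"
parts = parse.urlparse(original)

rebuilt = parse.urlunparse(parts)  # round trip back to the same string
# ParseResult is a named tuple, so _replace() swaps individual fields.
changed = parse.urlunparse(parts._replace(query="a=9", fragment=""))
print(rebuilt)
print(changed)
```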

3.urlsplit()

This function is very similar to urlparse(), except that it no longer parses params separately and returns only 5 results; params is merged into path. For example:

from urllib import parse

result = parse.urlsplit('https://blog.csdn.net/pleasecallmewhy/article/details/8924889')
print(result)
# Result:
'''
SplitResult(
    scheme='https',
    netloc='blog.csdn.net',
    path='/pleasecallmewhy/article/details/8924889',
    query='',
    fragment=''
)
'''
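The difference between the two functions only shows up when the URL actually has a params part. A sketch with a placeholder URL that contains one:

```python
from urllib import parse

url = "http://www.baidu.com/index.html;user?a=8#comment"

parsed = parse.urlparse(url)
split = parse.urlsplit(url)

print(len(parsed), parsed.path, parsed.params)  # six fields, params separated out
print(len(split), split.path)                   # five fields, ';user' stays in path
```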

4.urlunsplit()

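This function is very similar to urlunparse(): it also joins the parts into a complete URL, and the argument passed in is an iterable. The only difference is that the length must be 5.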

from urllib import parse

dt = ['http','www.baidu.com','ondex.html','user','a=8',]
print(parse.urlunsplit(dt))
# http://www.baidu.com/ondex.html?user#a=8

5. urljoin()

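With the functions above we can already join and split URLs, provided the input has a specific length. Besides these, there is urljoin(). We pass a base_url as the first argument and the new link as the second. The function parses the scheme, netloc and path of base_url and uses them to fill in whatever the new link is missing; parts that the new link does contain are taken from the new link. It then returns the result.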

from urllib import parse

print(parse.urljoin('http://www.baidu.com/','http://www.baidu.com/ondex.html?user#a=8'))
# http://www.baidu.com/ondex.html?user#a=8
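The example above passes two absolute URLs, so nothing needs filling in. The sketch below, with a placeholder base URL, shows the more typical cases where the second argument is a relative link.

```python
from urllib import parse

base = "http://www.baidu.com/a/b.html"  # placeholder base URL

joined = {
    # relative to the directory of the base
    "c.html": parse.urljoin(base, "c.html"),
    # relative to the site root
    "/c.html": parse.urljoin(base, "/c.html"),
    # only the query is new
    "?x=1": parse.urljoin(base, "?x=1"),
    # scheme taken from the base, host from the link
    "//cdn.example/x.js": parse.urljoin(base, "//cdn.example/x.js"),
    # a complete URL replaces the base entirely
    "https://other.example/": parse.urljoin(base, "https://other.example/"),
}
for link, full in joined.items():
    print(link, "->", full)
```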

6. urlencode()

This is the most commonly used function. It serializes a dict into GET request parameters, which is very useful when building a GET request.

from urllib import parse

url = 'http://www.baidu.com/?'
data = {'name':'wl','age':23}
url_page = url + parse.urlencode(data)
print(url_page)
# http://www.baidu.com/?name=wl&age=23
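urlencode() also takes care of escaping and of repeated keys. A small sketch with made-up parameter names:

```python
from urllib import parse

spaced = parse.urlencode({"kw": "web spider", "page": 2})  # space becomes '+'
repeated = parse.urlencode({"tag": ["a", "b"]}, doseq=True)  # one pair per list item
escaped = parse.urlencode([("q", "a&b=c")])  # reserved characters are percent-encoded
print(spaced)
print(repeated)
print(escaped)
```

Without doseq=True the whole list would be converted to a single string value.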

7.parse_qs()

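Where there is serialization there is also deserialization: parse_qs() converts the data of a GET request back into a dict.

from urllib import parse

url = 'http://www.baidu.com/?name=wl&age=23'
# parse_qs() expects only the query string, so extract it first
print(parse.parse_qs(parse.urlparse(url).query))
# {'name': ['wl'], 'age': ['23']}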

8. parse_qsl()

Converts the parameters into a list of tuples.

from urllib import parse

url = 'http://www.baidu.com/?name=wl&age=23'
# parse_qsl() expects only the query string, so extract it first
print(parse.parse_qsl(parse.urlparse(url).query))
# [('name', 'wl'), ('age', '23')]
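The two functions differ mainly in how they present repeated keys, and both drop empty values unless told otherwise. A sketch with a made-up query string:

```python
from urllib import parse

query = "name=wl&tag=a&tag=b&empty="

as_dict = parse.parse_qs(query)    # values grouped into lists per key
as_pairs = parse.parse_qsl(query)  # one tuple per pair, order preserved
with_blanks = parse.parse_qs(query, keep_blank_values=True)
print(as_dict)
print(as_pairs)
print(with_blanks)
```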

9.quote()

This function converts content into URL-encoded form. A URL containing Chinese characters can end up garbled; quote() avoids that.

from urllib import parse

url = 'http://www.baidu.com/name=' + parse.quote('张三')
print(url)
# http://www.baidu.com/name=%E5%BC%A0%E4%B8%89

10.unquote()

This function restores URL-encoded content, which makes decoding easy.

from urllib import parse

url = 'http://www.baidu.com/name=%E5%BC%A0%E4%B8%89'
print(parse.unquote(url))
# http://www.baidu.com/name=张三
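quote() and unquote() are inverses, and quote() has a safe parameter that controls which characters are left alone. A short sketch; the sample strings are made up.

```python
from urllib import parse

name = "张三"
encoded = parse.quote(name)             # UTF-8 bytes, percent-encoded
round_trip = parse.unquote(encoded)

default_safe = parse.quote("/a b/")     # '/' is kept by default
nothing_safe = parse.quote("/a b/", safe="")
plus_form = parse.quote_plus("a b&c")   # form style: space becomes '+'
print(encoded, round_trip)
print(default_safe, nothing_safe, plus_form)
```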
