Python's HTTP handling modules: urllib and urllib2

Date: 2020-01-28

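Python 2 relies mainly on two modules to handle HTTP requests: urllib and urllib2.

The urllib module:

urllib.urlopen(url[, data[, proxies]]) opens a URL and returns a file-like object, which can then be used much like an ordinary file object.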

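The object returned by urlopen provides these methods:

read(), readline(), readlines(), fileno(), close(): these work exactly as they do on a file object.

info(): returns an httplib.HTTPMessage object holding the headers sent back by the remote server.

getcode(): returns the HTTP status code. For an HTTP request, 200 means the request completed successfully and 404 means the URL was not found.

geturl(): returns the URL that was actually retrieved, which can differ from the requested URL if a redirect was followed.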

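urllib.urlencode() joins key-value pairs into a query string, separating the pairs with &. urllib has no matching urldecode() function. Note: the argument must be a dictionary (or another mapping) or a sequence of two-element tuples.

For example: urllib.urlencode({'spam':1,'eggs':2,'bacon':0})

Result: eggs=2&bacon=0&spam=1 (with a dict, the order of the pairs is arbitrary in Python 2)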
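The two accepted input forms of urlencode can be compared in a minimal sketch. The fallback import is an addition made here so the snippet also runs on Python 3, where the function lives in urllib.parse; the parameter name q and its value are illustrative only.

```python
# Python 2 keeps urlencode in urllib; the fallback import (an addition for
# this sketch) lets the same code run on Python 3 as well.
try:
    from urllib import urlencode
except ImportError:
    from urllib.parse import urlencode

# A dict works, but Python 2 does not guarantee the order of the pairs.
from_dict = urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})

# A sequence of two-element tuples keeps the pairs in the order given.
from_pairs = urlencode([('spam', 1), ('eggs', 2), ('bacon', 0)])

# Values are percent-encoded, and spaces become '+'.
escaped = urlencode([('q', 'a b&c')])

print(from_pairs)   # spam=1&eggs=2&bacon=0
print(escaped)      # q=a+b%26c
```

Passing a list of tuples is the safer choice when the order of the parameters matters to the receiving server.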

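urllib.quote(url) and urllib.quote_plus(url) percent-encode a string so that it can be used safely inside a URL, printed, and accepted by web servers. quote leaves / unescaped by default; quote_plus also escapes / and turns spaces into +.

For example: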

print urllib.quote('http://www.baidu.com')

print urllib.quote_plus('http://www.baidu.com')

The results are, respectively:

http%3A//www.baidu.com

http%3A%2F%2Fwww.baidu.com

urllib.unquote(url) and urllib.unquote_plus(url) do the reverse of the functions above.
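A short round trip shows how the quote and unquote pairs relate. This is a sketch under assumptions: the sample string is made up, and the fallback import is added so the snippet also runs on Python 3, where these functions live in urllib.parse.

```python
# Python 2 import first; the Python 3 fallback is an addition for this sketch.
try:
    from urllib import quote, quote_plus, unquote, unquote_plus
except ImportError:
    from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = 'a b/c?d=e'   # hypothetical sample string

quoted = quote(text)            # '/' is left alone, the space becomes %20
quoted_plus = quote_plus(text)  # '/' is escaped too, the space becomes '+'

print(quoted)       # a%20b/c%3Fd%3De
print(quoted_plus)  # a+b%2Fc%3Fd%3De

# Each unquote function reverses its matching quote function.
assert unquote(quoted) == text
assert unquote_plus(quoted_plus) == text
```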

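The urllib2 module:

Requesting a URL directly:

urllib2.urlopen(url, data=None) fetches data by sending a request to the given URL.

Building a Request object and then sending it:

urllib2.Request(url, data=None, headers={}, origin_req_host=None) builds the request information; the returned req is a fully constructed request.

urllib2.urlopen(req) sends the request req that was just built and returns a file-like object, response, which holds everything the server returned.

response.read() reads the HTML in the response.

response.info() reads the additional response header information.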
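The build-then-send flow might look like the sketch below. The URL and the User-Agent value are placeholders, the helper name fetch is hypothetical, and the fallback import is an addition so the snippet also runs on Python 3 (where urllib2 became urllib.request). Building the Request does not touch the network; only calling fetch would.

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # Python 3 fallback, added for this sketch

url = 'http://www.example.com/'             # placeholder URL
headers = {'User-Agent': 'my-client/0.1'}   # hypothetical User-Agent value

# Constructing the Request only records the URL and headers.
req = urllib2.Request(url, headers=headers)

print(req.get_full_url())            # http://www.example.com/
print(req.get_method())              # GET, because no data was given
print(req.get_header('User-agent'))  # my-client/0.1 (keys are stored capitalized)


def fetch(request):
    """Send the request and return (status code, headers, body)."""
    response = urllib2.urlopen(request)
    try:
        return response.getcode(), response.info(), response.read()
    finally:
        response.close()
```

Calling fetch(req) would perform the actual HTTP request, so it is left uncalled here.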

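Main differences:

urllib2 can accept an instance of the Request class to set the headers of a URL request, whereas urllib.urlopen accepts only a URL. This means you cannot use urllib.urlopen to fake your User-Agent string (that is, to pose as a browser).
urllib provides the urlencode method for generating GET query strings, and urllib2 does not. This is why urllib is so often used together with urllib2.
The advantage of urllib2 is that urllib2.urlopen can take a Request object as its argument, which gives control over the header section of the HTTP request.
However, urllib.urlretrieve and the quote and unquote family of functions such as urllib.quote were not added to urllib2, so urllib is sometimes still needed alongside it.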
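The way the two modules complement each other can be sketched as follows: urlencode (from urllib) produces the encoded pairs, and urllib2.Request carries them either in the URL or in the body. The base URL and User-Agent value are placeholders, and the fallback imports are an addition so the snippet also runs on Python 3.

```python
try:
    from urllib import urlencode            # Python 2
    import urllib2
except ImportError:
    from urllib.parse import urlencode      # Python 3 fallback, added for this sketch
    import urllib.request as urllib2

base_url = 'http://www.example.com/search'  # placeholder URL
params = [('value1', 'tkk'), ('value2', 'abcd')]
headers = {'User-Agent': 'my-client/0.1'}   # hypothetical User-Agent value

query = urlencode(params)

# GET: append the encoded pairs to the URL.
get_req = urllib2.Request(base_url + '?' + query, headers=headers)

# POST: pass the encoded pairs as the request body instead.
# (Python 3 requires bytes here; encoding is harmless on Python 2.)
post_req = urllib2.Request(base_url, data=query.encode('ascii'), headers=headers)

print(get_req.get_full_url())   # http://www.example.com/search?value1=tkk&value2=abcd
print(get_req.get_method())     # GET
print(post_req.get_method())    # POST
```

Neither request is sent in this sketch; passing one to urllib2.urlopen would do that.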

Exception handling:

From the official documentation:

The following exceptions are raised as appropriate:

exception urllib2.URLError
The handlers raise this exception (or derived exceptions) when they run into a problem. It is a subclass of IOError.
- reason
- The reason for this error. It can be a message string or another exception instance (socket.error for remote URLs, OSError for local URLs).

exception urllib2.HTTPError
Though being an exception (a subclass of URLError), an HTTPError can also function as a non-exceptional file-like return value (the same thing that urlopen() returns). This is useful when handling exotic HTTP errors, such as requests for authentication.
- reason
- The reason for this error. It can be a message string or another exception instance.
- code
- An HTTP status code as defined in RFC 2616. This numeric value corresponds to a value found in the dictionary of codes as found in BaseHTTPServer.BaseHTTPRequestHandler.responses.

URLError:

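Has a single attribute, reason.

URLError is raised when there is no network connection (no route to the specified server) or when the server does not exist. In this case the exception carries a "reason" attribute, typically a tuple containing an error number and an error message.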

HTTPError:

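Has two attributes, code and reason.

Every HTTP response from the server carries a numeric "status code". Sometimes the status code indicates that the server could not fulfil the request. The default handlers deal with some of these responses for you (for example, if the response is a "redirect" that asks the client to fetch the document from a different address, urllib2 handles it). For those it cannot handle, urlopen raises an HTTPError. Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required).

Note: except HTTPError must come first; otherwise except URLError will also catch the HTTPError.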
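The ordering rule can be checked without any network access. In this sketch the helper names describe, not_found and unreachable are hypothetical, the exceptions are raised by hand as stand-ins for a failing urlopen call, and the fallback import is an addition so the snippet also runs on Python 3.

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # Python 3 fallback, added for this sketch


def describe(fetch):
    """Call fetch() and report the outcome as a short string."""
    try:
        fetch()
    except urllib2.HTTPError as e:      # subclass first, or URLError swallows it
        return 'HTTP error %d' % e.code
    except urllib2.URLError as e:
        return 'URL error: %s' % (e.reason,)
    return 'ok'


# Stand-ins that raise the exceptions directly, so no request is made.
def not_found():
    raise urllib2.HTTPError('http://www.example.com/missing', 404,
                            'Not Found', None, None)


def unreachable():
    raise urllib2.URLError('no route to host')


print(describe(not_found))      # HTTP error 404
print(describe(unreachable))    # URL error: no route to host
```

Because HTTPError is a subclass of URLError, swapping the two except clauses would make the first branch unreachable.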

Example:

import urllib

import urllib2

from sys import exit

murl = "http://zhpfbk.blog.51cto.com/"

UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2896.3 Safari/537.36"

### Set the parameters to send, as a dict

value = {'value1':'tkk','value2':'abcd'}

### URL-encode value

data = urllib.urlencode(value)

### Set an HTTP header, as a dict