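1. urllib2 can accept a Request object, which lets you set the headers for a URL request, whereas urllib accepts only a URL.
2. The urllib module provides the urlencode method, which is used to generate GET query strings; urllib2 has no such function.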
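As a minimal sketch of point 2, the snippet below builds a GET query string with urlencode. The parameter names are illustrative only, and the try/except import is an addition (not part of the original article) so that the sketch also runs on Python 3, where urlencode lives in urllib.parse.

```python
try:
    from urllib import urlencode          # Python 2, as used in this article
except ImportError:
    from urllib.parse import urlencode    # Python 3 location of the same function

# Illustrative endpoint and parameters; a list of pairs keeps the order fixed.
base_url = 'http://www.someserver.com/cgi-bin/register.cgi'
params = [('name', 'Michael Foord'), ('language', 'Python')]

# urlencode escapes every key and value and joins the pairs with '&'.
query = urlencode(params)

# For a GET request the encoded string is appended to the URL after '?'.
full_url = base_url + '?' + query
```

The resulting full_url can then be handed to urllib2.urlopen or wrapped in a Request object.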
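1) urllib2.urlopen(url[, data][, timeout])
3. urlopen is the most commonly used and simplest function in the urllib2 module. It opens the given URL; the url argument can be either a URL string or a Request object.
4. urlopen can also be given a Request object that you build first, which states explicitly which URL you want to fetch.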
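2) class urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])
The Request class is an abstraction of a URL request. Its five parameters are described below.
url: a string containing a valid URL.
data: a string specifying additional data to send to the server, or None if there is no data to send. The data must be encoded in a standard format and then passed to the Request object as the data argument. The encoding is done in the urllib module, not in urllib2.
headers: a dictionary. The headers can be passed in directly when the request is created, or added one at a time by calling add_header() with each key and value. The standard headers (Content-Length, Content-Type and Host) are added only when the Request object is passed to urlopen() or OpenerDirector.open().
origin_req_host: the request-host of the origin transaction, as defined by RFC 2965. It defaults to cookielib.request_host(self). This is the host name or IP address of the original request that was initiated by the user. For example, if the request is for an image in an HTML document, this should be the request-host of the request for the page containing the image.
unverifiable: indicates whether the request is unverifiable, also as defined by RFC 2965. It defaults to False. An unverifiable request is one whose URL the user did not have the option to approve. For example, if the request is for an image in an HTML document and the user had no option to approve the automatic fetching of the image, unverifiable should be True.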
import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}

# Encode the form values with urllib, then pass them to urllib2 as data.
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
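The effect of the data and headers arguments can be inspected without sending anything over the network. The following is a rough sketch under that assumption; the header values are illustrative, and the try/except import and the .encode('ascii') call are additions so that it also runs on Python 3, where urllib2 became urllib.request and data must be bytes.

```python
try:
    import urllib2                         # Python 2, as used in this article
    from urllib import urlencode
except ImportError:
    import urllib.request as urllib2       # Python 3 equivalent module
    from urllib.parse import urlencode

url = 'http://www.someserver.com/cgi-bin/register.cgi'
data = urlencode([('name', 'Michael Foord')]).encode('ascii')

get_req = urllib2.Request(url)                                   # no data
post_req = urllib2.Request(url, data, {'User-Agent': 'Mozilla/4.0'})
post_req.add_header('Accept', 'text/html')                       # added afterwards

get_method = get_req.get_method()      # 'GET' because data is None
post_method = post_req.get_method()    # 'POST' because data was supplied

# Header names are stored via str.capitalize(), so 'User-Agent' is
# looked up as 'User-agent'.
user_agent = post_req.get_header('User-agent')

# Host is not present yet: it is only added when the request is opened.
has_host = post_req.has_header('Host')
```

Note that supplying data is what switches the request from GET to POST; no separate method argument is involved here.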
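5. Calling urlopen on the requested URL returns a response object. The response is a file-like object, so you can call .read() on it.
Commonly used methods of the response object:
geturl(): returns the URL of the resource that was retrieved. This is the real URL, and it is typically used to determine whether a redirect was followed.
info(): returns the meta-information of the page, such as headers, as a dictionary-like object. It is a mimetools.Message instance (see an HTTP headers reference for details).
getcode(): returns the HTTP status code of the response.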
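The three methods above can be combined into a small helper. This is only a sketch: summarize_response is a hypothetical name, not part of urllib2, and it assumes the object passed in behaves like the response returned by urlopen.

```python
def summarize_response(response):
    """Collect the metadata exposed by geturl(), getcode() and info()."""
    return {
        # Differs from the requested URL if a redirect was followed.
        'final_url': response.geturl(),
        # HTTP status code of the response.
        'status': response.getcode(),
        # info() is dictionary-like, so headers can be read with get().
        'content_type': response.info().get('Content-Type'),
    }
```

With network access it could be called as summarize_response(urllib2.urlopen(req)), and final_url compared against the requested URL to detect a redirect.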
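When it cannot handle a response, urlopen raises URLError (as usual with Python APIs, built-in exceptions such as ValueError and TypeError may also be raised).
HTTPError is the subclass of URLError that is raised in the specific case of HTTP URLs.
URLError: raised by the handlers when they run into a problem, usually because there is no network connection (no route to the specified server) or because the specified server does not exist. It is a subclass of IOError. The exception has a 'reason' attribute, which contains an error code and a text error message.
HTTPError: a subclass of URLError. Every HTTP response from the server contains a "status code". Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers deal with some of these responses for you. For example, when the response is a redirect, so that the URL of the response differs from the URL you requested, urllib2 handles it automatically. For responses it cannot handle, urlopen raises HTTPError. Typical errors include '404' (page not found), '403' (request forbidden) and '401' (authentication required). The exception has two important attributes, reason and code.
If we want to handle both HTTPError and URLError, the except clause for HTTPError must come before the one for URLError, because HTTPError is a subclass of URLError; otherwise the URLError clause would also catch HTTPError. Example code: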
import urllib2

req = urllib2.Request('http://www.python.org/fish.html')
try:
    response = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    # The server answered, but with an error status.
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
    print 'Error reason: ', e.reason
except urllib2.URLError as e:
    # The server could not be reached at all.
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    # everything is fine
    response.read()
An alternative version with a single except clause:
import urllib2

req = urllib2.Request('http://www.python.org/fish.html')
try:
    response = urllib2.urlopen(req)
except urllib2.URLError as e:
    if hasattr(e, 'code'):
        # Only HTTPError has a code attribute, so test for it first.
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    elif hasattr(e, 'reason'):
        # Both HTTPError and URLError have a reason attribute.
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
else:
    # everything is fine
    response.read()
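The ordering rule can be checked offline by raising hand-built exceptions instead of contacting a server. This is a sketch only: classify is a hypothetical helper, the error values are made up, and the try/except import is an addition so that it also runs on Python 3, where urllib.request exposes the same exception names.

```python
import io

try:
    import urllib2                         # Python 2, as used in this article
except ImportError:
    import urllib.request as urllib2       # Python 3 equivalent module

def classify(exc):
    """Raise exc and report which except clause handled it."""
    try:
        raise exc
    except urllib2.HTTPError as e:         # subclass first
        return 'HTTPError %d' % e.code
    except urllib2.URLError as e:          # then the base class
        return 'URLError %s' % e.reason

# Exceptions constructed by hand; no request is sent.
http_err = urllib2.HTTPError('http://www.python.org/fish.html', 404,
                             'Not Found', {}, io.BytesIO(b''))
url_err = urllib2.URLError('no route to host')

is_subclass = issubclass(urllib2.HTTPError, urllib2.URLError)
http_has_both = hasattr(http_err, 'code') and hasattr(http_err, 'reason')
url_has_code = hasattr(url_err, 'code')
```

Because an HTTPError carries both code and reason while a plain URLError carries only reason, a check on reason alone cannot tell the two apart.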
# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
    100: ('Continue', 'Request received, please continue'),
    101: ('Switching Protocols',
          'Switching to new protocol; obey Upgrade header'),

    200: ('OK', 'Request fulfilled, document follows'),
    201: ('Created', 'Document created, URL follows'),
    202: ('Accepted',
          'Request accepted, processing continues off-line'),
    203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
    204: ('No Content', 'Request fulfilled, nothing follows'),
    205: ('Reset Content', 'Clear input form for further input.'),
    206: ('Partial Content', 'Partial content follows.'),

    300: ('Multiple Choices',
          'Object has several resources -- see URI list'),
    301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
    302: ('Found', 'Object moved temporarily -- see URI list'),
    303: ('See Other', 'Object moved -- see Method and URL list'),
    304: ('Not Modified',
          'Document has not changed since given time'),
    305: ('Use Proxy',
          'You must use proxy specified in Location to access this '
          'resource.'),
    307: ('Temporary Redirect',
          'Object moved temporarily -- see URI list'),

    400: ('Bad Request',
          'Bad request syntax or unsupported method'),
    401: ('Unauthorized',
          'No permission -- see authorization schemes'),
    402: ('Payment Required',
          'No payment -- see charging schemes'),
    403: ('Forbidden',
          'Request forbidden -- authorization will not help'),
    404: ('Not Found', 'Nothing matches the given URI'),
    405: ('Method Not Allowed',
          'Specified method is invalid for this server.'),
    406: ('Not Acceptable', 'URI not available in preferred format.'),
    407: ('Proxy Authentication Required', 'You must authenticate with '
          'this proxy before proceeding.'),
    408: ('Request Timeout', 'Request timed out; try again later.'),
    409: ('Conflict', 'Request conflict.'),
    410: ('Gone',
          'URI no longer exists and has been permanently removed.'),
    411: ('Length Required', 'Client must specify Content-Length.'),
    412: ('Precondition Failed', 'Precondition in headers is false.'),
    413: ('Request Entity Too Large', 'Entity is too large.'),
    414: ('Request-URI Too Long', 'URI is too long.'),
    415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
    416: ('Requested Range Not Satisfiable',
          'Cannot satisfy request range.'),
    417: ('Expectation Failed',
          'Expect condition could not be satisfied.'),

    500: ('Internal Server Error', 'Server got itself in trouble'),
    501: ('Not Implemented',
          'Server does not support this operation'),
    502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
    503: ('Service Unavailable',
          'The server cannot process the request due to a high load'),
    504: ('Gateway Timeout',
          'The gateway server did not receive a timely response'),
    505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
}
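A table of this shape is available from the standard library as BaseHTTPRequestHandler.responses, so it does not have to be copied by hand. The sketch below looks up the short message for a status code; describe_status is a hypothetical helper name, and the try/except import is an addition so that it also runs on Python 3, where the class lives in http.server.

```python
try:
    from BaseHTTPServer import BaseHTTPRequestHandler    # Python 2
except ImportError:
    from http.server import BaseHTTPRequestHandler       # Python 3

def describe_status(code):
    """Return the short message for an HTTP status code, or 'Unknown'."""
    # Each entry has the form (shortmessage, longmessage).
    short, _long = BaseHTTPRequestHandler.responses.get(code, ('Unknown', ''))
    return short
```

A lookup like this can be applied to the code attribute of a caught HTTPError to print a readable message.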