urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function, which can fetch URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations such as basic authentication, cookies, and proxies. These are provided by objects called handlers and openers.
urllib2 supports fetching URLs for many "URL schemes" (identified by the string before the ":" in the URL - for example "ftp" is the URL scheme of "ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP). This tutorial focuses on the most common case, HTTP.
For straightforward situations urlopen is very easy to use. But as soon as you encounter errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol. The most comprehensive and authoritative reference to HTTP is RFC 2616. This is a technical document and not intended to be easy to read. This HOWTO aims to illustrate using urllib2, with enough detail about HTTP to help you through. It is not intended to replace the urllib2 docs, but is supplementary to them.
The simplest way to use urllib2 is as follows:
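A minimal sketch of that simplest case, wrapped in a function so that nothing is fetched until you call it. The helper name fetch and the fallback import (so the sketch also runs on Python 3, where urllib2 became urllib.request) are assumptions added for illustration.

```python
try:
    import urllib2                      # Python 2, the module this HOWTO covers
except ImportError:
    import urllib.request as urllib2    # assumption: fallback for Python 3


def fetch(url):
    """Open the URL with urlopen and return the body of the response."""
    response = urllib2.urlopen(url)
    try:
        return response.read()
    finally:
        response.close()

# Usage (needs network access): html = fetch('http://python.org/')
```

Because urlopen handles several URL schemes, the same helper should also work for 'ftp:' and 'file:' URLs.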
Many uses of urllib2 will be that simple (note that instead of an 'http:' URL we could have used a URL starting with 'ftp:', 'file:', etc.). However, the purpose of this tutorial is to explain the more complicated cases, concentrating on HTTP.
HTTP is based on requests and responses - the client makes requests and servers send responses. urllib2 mirrors this with a Request object which represents the HTTP request you are making. In its simplest form you create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the URL requested. This response is a file-like object, which means you can, for example, call .read() on the response:
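The request/response pairing described above might look like the following sketch. The function name fetch_with_request is hypothetical, and the Python 3 fallback import is an assumption so the code runs on either version.

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # assumption: fallback for Python 3


def fetch_with_request(url):
    """Build a Request for the URL, open it, and read the response."""
    req = urllib2.Request(url)          # represents the request being made
    response = urllib2.urlopen(req)     # file-like response object
    try:
        return response.read()
    finally:
        response.close()

# Usage (needs network access): the_page = fetch_with_request('http://www.example.com/')
```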
Note that urllib2 makes use of the same Request interface to handle all URL schemes. For example, you can make an FTP request like so:
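As a small illustration, the sketch below builds an HTTP request and an FTP request through the same class. Both URLs are placeholders, nothing is sent until urlopen is called, and the Python 3 fallback import is an assumption.

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # assumption: fallback for Python 3

# The same Request class is used whatever the URL scheme is.
http_req = urllib2.Request('http://www.example.com/')
ftp_req = urllib2.Request('ftp://example.com/')

# Passing either object to urllib2.urlopen would perform the fetch:
# response = urllib2.urlopen(ftp_req)
```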
In the case of HTTP, there are two extra things that Request objects allow you to do: First, you can pass data to be sent to the server. Second, you can pass extra information ("metadata") about the data or about the request itself to the server - this information is sent as HTTP "headers". Let's look at each of these in turn.
Sometimes you want to send data to a URL (often the URL will refer to a CGI (Common Gateway Interface) script [1] or other web application). With HTTP, this is often done using what's known as a POST request. This is often what your browser does when you submit an HTML form that you filled in on the web. Not all POSTs have to come from forms: you can use a POST to transmit arbitrary data to your own application. In the common case of HTML forms, the data needs to be encoded in a standard way, and then passed to the Request object as the data argument. The encoding is done using a function from the urllib library, not from urllib2.
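A sketch of encoding form values and attaching them to a Request, assuming a hypothetical helper build_post_request and a placeholder CGI URL. The fallback imports and the encode step are assumptions so that the same code runs on Python 3, where the data must be bytes.

```python
try:
    import urllib2                              # Python 2
    from urllib import urlencode                # the encoder lives in urllib, not urllib2
except ImportError:
    import urllib.request as urllib2            # assumption: fallback for Python 3
    from urllib.parse import urlencode


def build_post_request(url, values):
    """Encode a dict of form values and attach it to a Request as the data argument."""
    data = urlencode(values)
    if not isinstance(data, bytes):             # only needed on Python 3
        data = data.encode('ascii')
    return urllib2.Request(url, data)           # supplying data makes this a POST


req = build_post_request('http://www.example.com/cgi-bin/register.cgi',
                         {'name': 'Somebody Here',
                          'location': 'Northampton',
                          'language': 'Python'})
# response = urllib2.urlopen(req) would send the POST (needs network access)
```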
Note that other encodings are sometimes required (e.g. for file upload from HTML forms - see HTML Specification, Form Submission for more details).
If you do not pass the data argument, urllib2 uses a GET request. One way in which GET and POST requests differ is that POST requests often have "side-effects": they change the state of the system in some way (for example by placing an order with the website for a hundredweight of tinned spam to be delivered to your door). Though the HTTP standard makes it clear that POSTs are intended to always cause side-effects, and GET requests never to cause side-effects, nothing prevents a GET request from having side-effects, nor a POST request from having no side-effects. Data can also be passed in an HTTP GET request by encoding it in the URL itself.
This is done as follows.
>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values  # the order of the fields may differ
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)
Notice that the full URL is created by adding a ? to the URL, followed by the encoded values.
We'll discuss here one particular HTTP header, to illustrate how to add headers to your HTTP request.
Some websites [2] dislike being browsed by programs, or send different versions to different browsers [3]. By default urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or just plain not work. The way a browser identifies itself is through the User-Agent header [4]. When you create a Request object you can pass a dictionary of headers in. The following example makes the same request as above, but identifies itself as a version of Internet Explorer [5].
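A sketch of passing a headers dictionary to Request. The helper name, the placeholder URL, and the exact User-Agent string are assumptions chosen for illustration, as is the fallback import for Python 3.

```python
try:
    import urllib2                              # Python 2
    from urllib import urlencode
except ImportError:
    import urllib.request as urllib2            # assumption: fallback for Python 3
    from urllib.parse import urlencode


def build_browser_like_request(url, values, user_agent):
    """Return a POST Request that carries a custom User-Agent header."""
    headers = {'User-Agent': user_agent}
    data = urlencode(values)
    if not isinstance(data, bytes):             # only needed on Python 3
        data = data.encode('ascii')
    return urllib2.Request(url, data, headers)


# Hypothetical User-Agent string imitating a version of Internet Explorer.
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
req = build_browser_like_request('http://www.example.com/cgi-bin/register.cgi',
                                 {'name': 'Somebody Here'}, user_agent)
# response = urllib2.urlopen(req) would send it (needs network access)
```

Request stores header names in capitalised form, so the header is looked up as 'User-agent' when reading it back with get_header.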
The response also has two useful methods. See the section on info and geturl, which comes after we have a look at what happens when things go wrong.
urlopen raises URLError when it cannot handle a response (though, as usual with Python APIs, built-in exceptions such as ValueError and TypeError may also be raised). HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.
Often, URLError is raised because there is no network connection (no route to the specified server), or the specified server doesn't exist. In this case, the exception raised will have a 'reason' attribute, which is a tuple containing an error code and a text error message. For example:
>>> import urllib2
>>> req = urllib2.Request('http://www.pretend_server.org')
>>> try:
...     urllib2.urlopen(req)
... except urllib2.URLError, e:
...     print e.reason
...
(4, 'getaddrinfo failed')
Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a "redirection" that requests the client fetch the document from a different URL, urllib2 will handle that for you). For those it can't handle, urlopen will raise an HTTPError. Typical errors include '404' (page not found), '403' (request forbidden), and '401' (authentication required).
See section 10 of RFC 2616 for a reference on all the HTTP error codes. The HTTPError instance raised will have an integer 'code' attribute, which corresponds to the error sent by the server.
Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary of response codes that shows all the response codes used by RFC 2616. The dictionary is reproduced here for convenience:
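Rather than scanning the whole dictionary, a single status code can be looked up as in this sketch. The helper name describe_status is hypothetical, and the fallback import is an assumption for Python 3, where the class lives in http.server.

```python
try:
    from BaseHTTPServer import BaseHTTPRequestHandler   # Python 2
except ImportError:
    from http.server import BaseHTTPRequestHandler      # assumption: fallback for Python 3


def describe_status(code):
    """Return the (short message, long message) pair for a status code, or None."""
    return BaseHTTPRequestHandler.responses.get(code)

# Usage: short, long = describe_status(404)
```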
When an error is raised the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response on the page returned. This means that as well as the code attribute, it also has read, geturl, and info methods.
>>> import urllib2
>>> req = urllib2.Request('http://www.python.org/fish.html')
>>> try:
...     urllib2.urlopen(req)
... except urllib2.HTTPError, e:
...     print e.code
...     print e.read()
...
404
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<?xml-stylesheet href="./css/ht2html.css" type="text/css"?>
<html><head><title>Error 404: File Not Found</title>
...... etc...
So if you want to be prepared for HTTPError or URLError there are two basic approaches. I prefer the second approach.
Note
The except HTTPError must come first, otherwise except URLError will also catch an HTTPError.
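The ordering rule in this note can be sketched as follows, with HTTPError caught before its parent class URLError. The function name try_fetch, the tuple return values, and the opener parameter (which allows the network call to be swapped out) are all assumptions for illustration, as are the fallback imports for Python 3.

```python
try:
    from urllib2 import Request, urlopen, URLError, HTTPError   # Python 2
except ImportError:                                             # assumption: fallback for Python 3
    from urllib.request import Request, urlopen
    from urllib.error import URLError, HTTPError


def try_fetch(url, opener=urlopen):
    """Fetch a URL and report which kind of failure, if any, occurred."""
    req = Request(url)
    try:
        response = opener(req)
    except HTTPError as e:          # must come first: HTTPError is a subclass of URLError
        return ('http_error', e.code)
    except URLError as e:           # we failed to reach a server at all
        return ('url_error', e.reason)
    else:
        return ('ok', response.read())

# Usage (needs network access): status, detail = try_fetch('http://www.example.com/')
```

The "except ... as e" spelling is used here because it is accepted by Python 2.6 and later as well as Python 3.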
Note
URLError is a subclass of the built-in exception IOError.
This means that you can avoid importing URLError and use:
from urllib2 import Request, urlopen
req = Request(someurl)
try:
    response = urlopen(req)
except IOError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    # everything is fine
    pass
Under rare circumstances urllib2 can raise socket.error.
The response returned by urlopen (or the HTTPError instance) has two useful methods, info and geturl.
geturl - this returns the real URL of the page fetched. This is useful because urlopen (or the opener object used) may have followed a redirect. The URL of the page fetched may not be the same as the URL requested.
info - this returns a dictionary-like object that describes the page fetched, particularly the headers sent by the server. It is currently an httplib.HTTPMessage instance.
Typical headers include 'Content-length', 'Content-type', and so on. See the Quick Reference to HTTP Headers for a useful listing of HTTP headers with brief explanations of their meaning and use.
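A short sketch of reading both pieces of information from a response that has already been opened. The helper name describe_response is hypothetical, and the fallback import is an assumption for Python 3.

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # assumption: fallback for Python 3


def describe_response(response):
    """Return the final URL and the headers of an already opened response."""
    real_url = response.geturl()        # may differ from the requested URL after a redirect
    headers = response.info()           # dictionary-like object holding the headers
    return real_url, headers

# Usage (needs network access):
# real_url, headers = describe_response(urllib2.urlopen('http://www.example.com/'))
# content_type = headers['Content-type']
```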
When you fetch a URL you use an opener (an instance of the perhaps confusingly-named urllib2.OpenerDirector). Normally we have been using the default opener - via urlopen - but you can create custom openers. Openers use handlers. All the "heavy lifting" is done by the handlers. Each handler knows how to open URLs for a particular URL scheme (http, ftp, etc.), or how to handle an aspect of URL opening, for example HTTP redirections or HTTP cookies.
You will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or to get an opener that does not handle redirections.
To create an opener, instantiate an OpenerDirector, and then call .add_handler(some_handler_instance) repeatedly.
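As a sketch of the manual route, the opener below is assembled handler by handler and deliberately leaves out HTTPRedirectHandler, so it would not follow redirects. The particular choice of handlers is an assumption for illustration, as is the fallback import for Python 3.

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # assumption: fallback for Python 3

# Start from an empty OpenerDirector and add only the handlers we want.
opener = urllib2.OpenerDirector()
for handler_class in (urllib2.HTTPHandler,               # opens http: URLs
                      urllib2.HTTPDefaultErrorHandler,   # turns error responses into HTTPError
                      urllib2.HTTPErrorProcessor):       # routes non-2xx responses to error handlers
    opener.add_handler(handler_class())

# opener.open('http://www.example.com/') would fetch without following redirects
```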
Alternatively, you can use build_opener, which is a convenience function for creating opener objects with a single function call. build_opener adds several handlers by default, but provides a quick way to add more and/or override the default handlers.
Other sorts of handlers you might want can handle proxies, authentication, and other common but slightly specialised situations.
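For comparison, a sketch using build_opener to add cookie handling on top of the defaults. The choice of HTTPCookieProcessor as the extra handler is an assumption for illustration, and the fallback import covers Python 3.

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # assumption: fallback for Python 3

# One call creates the opener, adds the default handlers, and adds ours as well.
cookie_handler = urllib2.HTTPCookieProcessor()
opener = urllib2.build_opener(cookie_handler)

# List the handlers that ended up installed, defaults included.
handler_names = sorted(h.__class__.__name__ for h in opener.handlers)
```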
install_opener can be used to make an opener object the (global) default opener. This means that calls to urlopen will use the opener you have installed.
Opener objects have an open method, which can be called directly to fetch URLs in the same way as the urlopen function: there's no need to call install_opener, except as a convenience.
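The two ways of using an opener can be contrasted in a small sketch. Both function names are hypothetical, and the fallback import is an assumption for Python 3.

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # assumption: fallback for Python 3

opener = urllib2.build_opener()


def fetch_directly(url):
    """Use the opener's own open method; the global default stays untouched."""
    return opener.open(url).read()


def fetch_via_default(url):
    """Install the opener as the global default, then use plain urlopen."""
    urllib2.install_opener(opener)
    return urllib2.urlopen(url).read()
```

Calling open directly keeps the choice of opener local to your code, whereas install_opener changes what every later urlopen call in the process uses.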
To illustrate creating and installing a handler we will use the HTTPBasicAuthHandler. For a more detailed discussion of this subject - including an explanation of how Basic Authentication works - see the Basic Authentication Tutorial.
When authentication is required, the server sends a header (as well as the 401 error code) requesting authentication. This specifies the authentication scheme and a 'realm'. The header looks like: Www-authenticate: SCHEME realm="REALM". For example:
Www-authenticate: Basic realm="cPanel Users"
The client should then retry the request with the appropriate name and password for the realm included as a header in the request. This is 'basic authentication'. In order to simplify this process we can create an instance of HTTPBasicAuthHandler and an opener to use this handler.
The HTTPBasicAuthHandler uses an object called a password manager to handle the mapping of URLs and realms to passwords and usernames. If you know what the realm is (from the authentication header sent by the server), then you can use a HTTPPasswordMgr. Frequently one doesn't care what the realm is. In that case, it is convenient to use HTTPPasswordMgrWithDefaultRealm. This allows you to specify a default username and password for a URL. This will be supplied in the absence of you providing an alternative combination for a specific realm. We indicate this by providing None as the realm argument to the add_password method.
The top-level URL is the first URL that requires authentication. URLs "deeper" than the URL you pass to .add_password() will also match.
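The pieces described above fit together roughly as in this sketch. The helper name build_auth_opener and the URL, username, and password are placeholders, and the fallback import is an assumption for Python 3.

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # assumption: fallback for Python 3


def build_auth_opener(top_level_url, username, password):
    """Return an opener that answers Basic Authentication challenges, plus its password manager."""
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    # None as the realm means: use these credentials whatever realm the server names.
    password_mgr.add_password(None, top_level_url, username, password)
    handler = urllib2.HTTPBasicAuthHandler(password_mgr)
    opener = urllib2.build_opener(handler)
    return opener, password_mgr


# Placeholder credentials and URL, for illustration only.
opener, password_mgr = build_auth_opener('http://example.com/foo/', 'joe', 'secret')
# opener.open('http://example.com/foo/page.html') would now authenticate when asked to;
# urllib2.install_opener(opener) would make plain urlopen calls do the same.
```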
Note
In the above example we only supplied our HTTPBasicAuthHandler to build_opener. By default openers have the handlers for normal situations - ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.
top_level_url is in fact either a full URL (including the 'http:' scheme component and the hostname and optionally the port number) e.g. "http://example.com/" or an "authority" (i.e. the hostname, optionally including the port number) e.g. "example.com" or "example.com:8080" (the latter example includes a port number). The authority, if present, must NOT contain the "userinfo" component - for example "joe@password:example.com" is not correct.
urllib2 will auto-detect your proxy settings and use those. This is through the ProxyHandler, which is part of the normal handler chain. Normally that's a good thing, but there are occasions when it may not be helpful [6]. One way to disable it is to set up our own ProxyHandler, with no proxies defined. This is done using similar steps to setting up a Basic Authentication handler:
>>> proxy_support = urllib2.ProxyHandler({})
>>> opener = urllib2.build_opener(proxy_support)
>>> urllib2.install_opener(opener)
Note
Currently urllib2 does not support fetching of https locations through a proxy. This can be a problem.
The Python support for fetching resources from the web is layered. urllib2 uses the httplib library, which in turn uses the socket library.
As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications which have to fetch web pages. By default the socket module has no timeout and can hang. Currently, the socket timeout is not exposed at the httplib or urllib2 levels. However, you can set the default timeout globally for all sockets using:
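A minimal sketch of setting the global default with socket.setdefaulttimeout. The ten-second value and the URL are arbitrary placeholders, and the fallback import is an assumption for Python 3.

```python
import socket

try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # assumption: fallback for Python 3

# Timeout in seconds, applied to every socket created from now on.
timeout = 10
socket.setdefaulttimeout(timeout)

# Any later call to urlopen uses the default timeout set in the socket module.
req = urllib2.Request('http://www.example.com/')
# response = urllib2.urlopen(req)  (needs network access)
```

Because the setting is global, it also affects sockets opened by other libraries in the same process.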