【python】urllib2

Date: 2019-11-13

urllib2.urlopen(url[, data][, timeout])

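Requests the URL and returns the response data. The url argument can be either a string or a Request object.

Without the data argument the request is a GET; when data is set, the request is a POST. The data must be in application/x-www-form-urlencoded format, so if the parameters are held in a dictionary, encode them with urllib.urlencode() first.

timeout sets how long, in seconds, a blocking operation such as the connection attempt may take. If it is not given, the global default timeout is used. The parameter only takes effect for HTTP, HTTPS and FTP connections.

This function returns a file-like object with three additional methods: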

geturl() — return the URL of the resource retrieved, commonly used to determine if a redirect was followed
info() — return the meta-information of the page, such as headers, in the form of an mimetools.Message instance (see Quick Reference to HTTP Headers)
getcode() — return the HTTP status code of the response
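
The GET/POST rule and the encoding requirement described above can be checked without touching the network. The following is a minimal sketch rather than a definitive recipe: the URL and form value are placeholders, and the `try`/`except` fallback to the Python 3 module names is an assumption added so the snippet also runs where `urllib2` is not available.

```python
import socket

try:  # Python 2, the version this article targets
    import urllib2
    from urllib import urlencode
except ImportError:  # assumption: fall back to the Python 3 equivalents
    import urllib.request as urllib2
    from urllib.parse import urlencode

url = 'http://www.server.com/cgi-bin/register.cgi'  # placeholder URL

# A dict must be urlencoded into application/x-www-form-urlencoded form.
values = {'name': 'Michael'}
data = urlencode(values).encode('ascii')  # bytes are required on Python 3

get_request = urllib2.Request(url)         # no data: GET
post_request = urllib2.Request(url, data)  # data present: POST

print(get_request.get_method())
print(post_request.get_method())

# When urlopen() gets no timeout argument, the global socket default applies.
socket.setdefaulttimeout(10.0)
print(socket.getdefaulttimeout())
```

Passing either request to urllib2.urlopen() would then send it; that step is left out here because it needs a reachable server.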

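class OpenerDirector

Manages a chain of handlers. Each handler implements a particular protocol or feature; many of them are described below.

urllib2.build_opener([handler, ...])

Returns an OpenerDirector instance. Anything that implements BaseHandler can be passed in as a handler. Python already ships with many built-in handlers, and you can replace them or add new ones.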

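The built-in handlers are:

ProxyHandler: handles proxying

UnknownHandler: raises URLError for protocols that no other handler supports

HTTPHandler: handles HTTP GET and POST requests

HTTPDefaultErrorHandler: the generic handler for HTTP errors; every error response is turned into an HTTPError exception

HTTPRedirectHandler: handles HTTP redirects; 301, 302 and 303 responses are followed, as are 307 responses to GET and HEAD requests

FTPHandler: handles FTP

FileHandler: handles local files

HTTPErrorProcessor: processes HTTP error responses (non-200 status codes) by passing them on to the error handlers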

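Besides the handlers above, urllib2 offers further handlers to choose from. Their names indicate what they do, so they are not explained in detail. They include, but are not limited to:

HTTPCookieProcessor: handles cookies
HTTPBasicAuthHandler: handles HTTP Basic authentication
ProxyBasicAuthHandler: handles Basic authentication with a proxy
HTTPDigestAuthHandler: handles HTTP Digest authentication
ProxyDigestAuthHandler: handles Digest authentication with a proxy
HTTPSHandler: handles HTTPS requests
CacheFTPHandler: an FTPHandler that additionally caches open FTP connections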


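How urllib2 uses openers:

urllib2.install_opener(opener)

Installs an OpenerDirector as the global default. After this call, the handlers you configured are used for all subsequent urlopen() calls.

class urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])

url and data have the same meaning as above. headers adds request header information; for the possible header fields see http://isilic.iteye.com/blog/1801072

origin_req_host should be the host of the server the original request was addressed to, and unverifiable indicates whether the request is unverifiable, that is, made without the user having had the chance to approve it.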

Basic usage:

1)

    import urllib2

    # Open the URL and print the first 100 bytes of the response body.
    f = urllib2.urlopen('http://www.python.org/')
    print f.read(100)

2)

    import urllib2

    # Supplying data turns the request into a POST.
    req = urllib2.Request(url='https://localhost/cgi-bin/test.cgi', data='Committed Data')
    f = urllib2.urlopen(req)
    print f.read()

3)

    import urllib
    import urllib2

    url = 'http://www.server.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {'name': 'Michael', 'language': 'Python'}
    headers = {'User-Agent': user_agent}
    # Encode the form fields as application/x-www-form-urlencoded.
    data = urllib.urlencode(values)
    # POST the encoded data with a custom User-Agent header.
    req = urllib2.Request(url, data, headers)
    f = urllib2.urlopen(req)
    print f.read()
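
A Request such as the one in example 3 can be inspected before it is sent, which helps when checking what will actually go over the wire. The sketch below assumes the same placeholder URL and header as example 3; the Python 3 import fallback is an assumption added for portability. One detail worth knowing: Request stores header names with `str.capitalize()`, so a header added as `User-Agent` is looked up as `User-agent`.

```python
try:  # Python 2
    import urllib2
    from urllib import urlencode
except ImportError:  # assumption: Python 3 equivalents
    import urllib.request as urllib2
    from urllib.parse import urlencode

url = 'http://www.server.com/cgi-bin/register.cgi'  # placeholder URL
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
data = urlencode({'name': 'Michael'}).encode('ascii')

req = urllib2.Request(url, data, {'User-Agent': user_agent})
# Headers can also be added after construction.
req.add_header('Accept', 'text/html')

print(req.get_full_url())
print(req.get_method())

# Header names are normalised by capitalize(): 'User-Agent' -> 'User-agent'.
print(req.has_header('User-agent'))
print(req.get_header('User-agent'))
print(sorted(req.header_items()))
```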

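Proxies are very widely used. A crawler running as a single client is easily blocked, and going through a proxy lowers that risk, so readers who need this should look closely at how urllib2 handles proxies: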

    import urllib2

    # Use a proxy on local port 80 to fetch content from Google.
    proxy_handler = urllib2.ProxyHandler({'http': '127.0.0.1:80'})
    opener = urllib2.build_opener(proxy_handler)
    urllib2.install_opener(opener)
    f = urllib2.urlopen('http://www.google.com')
    print f.read()

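Note that install_opener() makes proxy_handler the global ProxyHandler, which is not necessarily what we want. If different requests need different proxies, the global setting gets in the way, and the opener should be used directly instead: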

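    import urllib2

    url = 'http://www.google.com'  # same target as in the previous example
    proxy_handler = urllib2.ProxyHandler({'http': '127.0.0.1:80'})
    opener = urllib2.build_opener(proxy_handler)
    # Call the opener directly so the global opener stays untouched.
    f = opener.open(url)
    print f.read()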

Using multiple proxies:

    import urllib2

    proxyList = ('211.167.112.14:80',
                 '210.32.34.115:8080',
                 '115.47.8.39:80',
                 '211.151.181.41:80',
                 '219.239.26.23:80')

    for proxy in proxyList:
        # The key is the URL scheme that this proxy should be used for.
        proxies = {"http": proxy}
        proxy_handler = urllib2.ProxyHandler(proxies)
        # Build a separate opener per proxy instead of installing one globally.
        opener = urllib2.build_opener(proxy_handler)
        f = opener.open("http://www.cc98.org")
        print f.read()
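
In practice some proxies in such a list will be unreachable, so a crawler usually wants to move on to the next one instead of aborting. The following is one possible sketch of that idea, not a definitive implementation: `build_proxy_opener` and `fetch_via_proxies` are hypothetical helpers, the five-second timeout is an arbitrary assumption, and only `URLError` and socket-level errors are caught, so other failures would still propagate. The Python 3 import fallback is also an assumption.

```python
import socket

try:  # Python 2
    import urllib2
except ImportError:  # assumption: Python 3 equivalent
    import urllib.request as urllib2


def build_proxy_opener(proxy):
    """Return an opener that routes HTTP requests through `proxy` only."""
    return urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))


def fetch_via_proxies(url, proxies, timeout=5):
    """Try each proxy in turn; return (proxy, body) for the first success."""
    for proxy in proxies:
        opener = build_proxy_opener(proxy)
        try:
            response = opener.open(url, timeout=timeout)
            return proxy, response.read()
        except (urllib2.URLError, socket.error):
            continue  # unreachable or failing proxy: try the next one
    return None, None  # no proxy worked


# Offline check: each opener carries exactly the proxy it was built with.
proxy_list = ('211.167.112.14:80', '210.32.34.115:8080')
openers = [build_proxy_opener(p) for p in proxy_list]
configured = [
    [h.proxies for h in o.handlers if isinstance(h, urllib2.ProxyHandler)]
    for o in openers
]
print(configured)
```

A call such as `fetch_via_proxies("http://www.cc98.org", proxy_list)` would then return the first proxy that answered together with the page body, or `(None, None)` if none did.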

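Cookies are also handled automatically by a handler. Because HTTP is a stateless protocol, how does the server know whether the user behind the current request has already logged in? There are two ways:
1. Put a session ID explicitly in the URI.
2. Use a cookie. Roughly, after you log in to a site a cookie is kept locally, and as you continue browsing that site the browser sends the cookie along with each request.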

    import urllib2  
    import cookielib  
    cookies = cookielib.CookieJar()  
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies))  
    response = opener.open('http://www.google.com')  
    for cookie in cookies:  
        if cookie.name == 'cookie_spec':  
            print cookie.value

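Cookies are normally handled by using cookielib together with HTTPCookieProcessor, where HTTPCookieProcessor is the handler.

The cookielib module defines classes for handling HTTP cookies automatically, for use with sites that require cookie data. It contains CookieJar, FileCookieJar, CookiePolicy, DefaultCookiePolicy and Cookie, plus the FileCookieJar subclasses MozillaCookieJar and LWPCookieJar. A CookieJar object manages HTTP cookies: it adds cookies to outgoing HTTP requests and extracts cookies from HTTP responses. A FileCookieJar object additionally reads cookies from, and writes them to, a file. MozillaCookieJar creates a FileCookieJar compatible with the Mozilla browser's cookies.txt format, and LWPCookieJar creates one compatible with libwww-perl's Set-Cookie3 file format; cookie files saved by LWPCookieJar are easy for humans to read. FileCookieJar itself does not implement save(), whereas MozillaCookieJar and LWPCookieJar do, so use one of those two when cookies need to be saved.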
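
Saving and reloading cookies with LWPCookieJar can be tried without any network access by putting a cookie into the jar by hand. This is a minimal sketch under stated assumptions: the cookie name, value and domain are invented, `make_cookie` is a hypothetical helper, the file is written to a temporary directory, and the Python 3 import fallback is an assumption. In real use the jar would be filled by passing it to HTTPCookieProcessor.

```python
import os
import tempfile
import time

try:  # Python 2
    import cookielib
except ImportError:  # assumption: Python 3 equivalent
    import http.cookiejar as cookielib


def make_cookie(name, value, domain):
    """Build a persistent cookie by hand (normally a response sets it)."""
    return cookielib.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False,
        expires=int(time.time()) + 3600,  # valid for one hour
        discard=False,
        comment=None, comment_url=None, rest={})


cookie_file = os.path.join(tempfile.mkdtemp(), 'cookies.lwp')

# Save: LWPCookieJar implements save(), unlike plain FileCookieJar.
jar = cookielib.LWPCookieJar(cookie_file)
jar.set_cookie(make_cookie('session', 'abc123', 'example.com'))
jar.save()

# Load the file back into a fresh jar.
restored = cookielib.LWPCookieJar(cookie_file)
restored.load()
restored_values = dict((c.name, c.value) for c in restored)
print(restored_values)
```

Session cookies (those without an expiry) are skipped by save() unless `ignore_discard=True` is passed, which is why the example cookie is given an explicit expiry time.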

Using Basic HTTP Authentication:

    import urllib2

    # Register credentials for a given realm and URI.
    auth_handler = urllib2.HTTPBasicAuthHandler()
    auth_handler.add_password(realm='PDQ Application',
                              uri='https://mahler:8092/site-updates.py',
                              user='klem',
                              passwd='kadidd!ehopper')
    opener = urllib2.build_opener(auth_handler)
    # Install the opener globally so that urlopen() uses it.
    urllib2.install_opener(opener)
    f = urllib2.urlopen('http://www.server.com/login.html')
    print f.read()
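
Which credentials the handler would offer for a given URL can be checked offline through its password manager. The sketch below is one way to do that, not the only one: it swaps in HTTPPasswordMgrWithDefaultRealm so that the realm does not have to be known in advance, reuses the host and credentials from the example above, and adds a Python 3 import fallback as an assumption.

```python
try:  # Python 2
    import urllib2
except ImportError:  # assumption: Python 3 equivalent
    import urllib.request as urllib2

password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# realm=None means: use these credentials for any realm under this URI.
password_mgr.add_password(None, 'https://mahler:8092/', 'klem', 'kadidd!ehopper')

auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(auth_handler)

# A URL below the registered URI gets the credentials ...
matching = password_mgr.find_user_password(
    'PDQ Application', 'https://mahler:8092/site-updates.py')
# ... while a URL on another host gets none.
other = password_mgr.find_user_password(
    'PDQ Application', 'http://www.server.com/login.html')

print(matching)
print(other)
```

The second lookup shows that credentials are only offered for URLs under the registered URI, so a request to a different host is sent without them.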

References:

http://isilic.iteye.com/blog/1806403

http://www.devba.com/index.php/archives/4605.html
