Python Web Scraping: Using the Basic Library (urllib)

Date: 2019-11-26

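Using the urllib library

Python 2 shipped two built-in libraries for sending requests, urllib and urllib2. Python 3 has no urllib2; the two were merged into the single built-in urllib package.

API reference: https://docs.python.org/3/library/urllib.html

The package provides functions and classes for opening URLs (mostly over HTTP/1.1), with support for basic authentication, digest authentication, redirects, and cookies.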

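The package contains four modules:

request, for sending requests the way a browser would;

error, the exception module;

parse, which provides URL utilities such as splitting, encoding, parsing, and joining;

robotparser, which reads a site's robots.txt file to decide which pages may be crawled and which may not; it is rarely used.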

1. The request module

(1) Function: urlopen

    def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
                *, cafile=None, capath=None, cadefault=False, context=None):

urlopen opens a URL. For HTTP and HTTPS URLs the return value is an http.client.HTTPResponse object.

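Parameters:

data is optional. If it is supplied it must be a bytes object, so a string or an encoded dict has to be converted first, for example with bytes(). When data is not None, the request is sent as POST instead of GET.

timeout sets the timeout in seconds. If no response arrives within that time, an exception is raised; if it is not set, the global default timeout is used. It only applies to HTTP, HTTPS, and FTP requests.

Other parameters:

context must be an ssl.SSLContext instance and specifies the SSL settings. cafile and capath point to a CA certificate file and a directory of CA certificates respectively, and are used for HTTPS connections.

cadefault is ignored. Note that cafile, capath, and cadefault were deprecated in Python 3.6 and removed in Python 3.13; use context instead.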
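
A minimal offline sketch of two of the points above: supplying data switches the request to POST, and context expects an ssl.SSLContext. It builds Request objects (the class covered in the next subsection) only to inspect the method, so nothing is sent; the URL and form field are placeholders.

```python
import ssl
import urllib.parse
import urllib.request

# Placeholder endpoint; this sketch never opens a connection.
url = "http://httpbin.org/post"

# A dict must be form-encoded and then converted to bytes.
body = urllib.parse.urlencode({"word": "hello"}).encode("utf-8")

get_req = urllib.request.Request(url)              # no data
post_req = urllib.request.Request(url, data=body)  # data supplied

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST

# A default context verifies certificates against the system CA store.
# It could be passed on as urlopen(url, context=ctx).
ctx = ssl.create_default_context()
print(isinstance(ctx, ssl.SSLContext))  # True
```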

For example, send a test request to http://httpbin.org (a site for testing HTTP requests):

# urllib (sending a request)
# Note: Python 3 merged urllib2 and urllib into urllib. The request submodule
# is not loaded by "import urllib" alone; import urllib.request explicitly.
import urllib.parse
import urllib.request

# Form-encode the dict, then convert it to bytes for the request body
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
# This address echoes the request back; passing data makes it a POST
response = urllib.request.urlopen('http://httpbin.org/post', data=data, timeout=1)
print(response.read().decode('utf-8'))
print(type(response))                # an http.client.HTTPResponse object
print(response.status)               # status code
print(response.getheaders())         # response headers
print(response.getheader('Server'))  # value of the Server header

The urlopen call above sets a 1-second timeout; if the request times out, urllib.error.URLError: <urlopen error timed out> is raised.

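(2) Class: Request

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

urlopen can issue the most basic requests, but its few parameters are not enough to describe a complete request. To add headers and similar information, build the request with the Request class.

The request is still sent with urlopen, but the argument passed to urlopen is a Request object instead of a URL string.

As follows: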

# Building a request with Request

import urllib.request

request = urllib.request.Request('http://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

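Parameters:

url is required; the other parameters are optional.

data, if passed, must be bytes. A dict can be encoded with urlencode() from urllib.parse and then converted to bytes.

headers is a dict of request headers. Headers can be passed through the headers parameter when the request is constructed, or added one at a time with add_header().

The most common use of request headers is changing User-Agent to impersonate a browser; the default User-Agent is Python-urllib.

For example, to make the request look as if it came from Firefox, set

User-Agent

Mozilla/5.0 (Windows NT 6.1; W…) Gecko/20100101 Firefox/64.0

For example: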

# Building a request with Request

import urllib.parse
import urllib.request

headers = {
    # Note: I first copied "Mozilla/5.0 (Windows NT 6.1; W…) Gecko/20100101 Firefox/64.0"
    # from the request details in Firefox's debug panel. The "W…" part was not shown
    # in full, so the pasted value was incomplete and failed with:
    # UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026' in position ...
    "User-Agent": 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0',
    "Host": 'httpbin.org',
}
data = bytes(urllib.parse.urlencode({'word': 'Help'}), encoding='utf-8')
request = urllib.request.Request('http://httpbin.org/post', headers=headers, data=data)
# Headers can also be added with add_header()
request.add_header("Content-Type", "application/x-www-form-urlencoded")

response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
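
The headers of a Request can be checked before anything is sent. The sketch below is only an illustration: the User-Agent value is a made-up placeholder, and the httpbin.org URLs are never contacted.

```python
import urllib.parse
import urllib.request

data = urllib.parse.urlencode({"word": "Help"}).encode("utf-8")
req = urllib.request.Request(
    "http://httpbin.org/post",
    data=data,
    headers={"User-Agent": "Mozilla/5.0 (example)"},  # hypothetical UA string
)
req.add_header("Content-Type", "application/x-www-form-urlencoded")

# Request stores header names via str.capitalize(), so they are looked up
# as "User-agent" and "Content-type".
print(req.get_header("User-agent"))
print(req.has_header("Content-type"))
print(req.get_method())  # POST, because data was supplied

# method= overrides the default GET/POST choice.
put_req = urllib.request.Request("http://httpbin.org/put", data=data, method="PUT")
print(put_req.get_method())  # PUT
```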

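(3) Class: BaseHandler

Handlers are used for more advanced operations such as cookies and proxies. BaseHandler is their base class. Commonly used subclasses and related classes:

HTTPDefaultErrorHandler handles HTTP error responses by raising HTTPError.

HTTPRedirectHandler handles redirects.

HTTPCookieProcessor handles cookies.

ProxyHandler configures proxies. Its proxies argument defaults to None, in which case the proxy settings are read from the environment.

HTTPPasswordMgr manages passwords; it maintains a table of usernames and passwords. (It is a helper class used by the auth handlers, not a BaseHandler subclass.)

HTTPBasicAuthHandler handles authentication; if a connection requires authentication when it is opened, this handler takes care of it.

OpenerDirector, Opener for short: the urlopen used earlier is effectively a simple Opener provided by urllib.

The following builds an Opener from Handlers.

For example: after installing Tomcat, visiting its manager page at http://localhost:8080/manager/html requires authentication.

The code below passes the authentication and prints the page source: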

# Building an Opener from a Handler (a login check)

from urllib.request import HTTPBasicAuthHandler, HTTPPasswordMgrWithDefaultRealm, build_opener

username = "admin"
pwd = "admin"
url = "http://localhost:8080/manager/html"
# realm=None stores the credentials as the default for any realm at this URL
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, pwd)
authHandler = HTTPBasicAuthHandler(p)
opener = build_opener(authHandler)
result = opener.open(url)
html = result.read().decode('utf-8')
print(html)
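
Without a running Tomcat, the password manager can still be exercised on its own. This sketch assumes the same admin/admin credentials as above and only checks which URLs the stored entry would apply to; the realm name and the example.com URL are made up for illustration.

```python
from urllib.request import (
    HTTPBasicAuthHandler,
    HTTPPasswordMgrWithDefaultRealm,
    OpenerDirector,
    build_opener,
)

url = "http://localhost:8080/manager/html"
mgr = HTTPPasswordMgrWithDefaultRealm()
# realm=None registers the credentials as the default for any realm.
mgr.add_password(None, url, "admin", "admin")

# The lookup matches the registered URL and anything below it.
print(mgr.find_user_password("Some Realm", url + "/list"))          # ('admin', 'admin')
# Other hosts get no credentials.
print(mgr.find_user_password("Some Realm", "http://example.com/"))  # (None, None)

# build_opener returns an OpenerDirector with the handler installed.
opener = build_opener(HTTPBasicAuthHandler(mgr))
print(isinstance(opener, OpenerDirector))  # True
```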

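To set a proxy, use something like the following (this code has not been verified):

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

# Map each scheme to the proxy that should handle it
proxyhandler = ProxyHandler({
    'http': 'http://127.0.0.1:999',
    'https': 'https://127.0.0.1:888',
})
opener = build_opener(proxyhandler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)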

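(4) Getting the cookies after a request

Declare a CookieJar object, wrap it in an HTTPCookieProcessor to build a Handler, create an opener with build_opener(), and call its open() method.

To save the cookies to a file, use a MozillaCookieJar or LWPCookieJar object.

For example: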

# Getting a site's cookies

import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name + ":" + item.value)

# To write the cookies out as a text file, use MozillaCookieJar
filename = 'cookie.txt'
cookie1 = http.cookiejar.MozillaCookieJar(filename)
handler1 = urllib.request.HTTPCookieProcessor(cookie1)
opener1 = urllib.request.build_opener(handler1)
response1 = opener1.open("http://www.baidu.com")
cookie1.save(ignore_discard=True, ignore_expires=True)

# LWPCookieJar can also save cookies, but in a different format:
# the file is written in libwww-perl (LWP) format
filename2 = 'cookie2.txt'
cookie2 = http.cookiejar.LWPCookieJar(filename2)
handler2 = urllib.request.HTTPCookieProcessor(cookie2)
opener2 = urllib.request.build_opener(handler2)
response2 = opener2.open("http://www.baidu.com")
cookie2.save(ignore_discard=True, ignore_expires=True)

How can the data in a saved cookie file be used?

As follows:

# Load the saved cookies and send them with a new request to search Baidu

import http.cookiejar
import urllib.request

# cookie2.txt is the LWP-format file written by the previous example
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie2.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/baidu?word=Python')
print(response.read().decode('utf-8'))
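
The save and load steps can also be tried without any network access. In this rough sketch the cookie is built by hand (a real crawler would receive it from a response); make_cookie, the cookie name and value, and the temporary file location are all made up for illustration.

```python
import http.cookiejar
import os
import tempfile
import time

def make_cookie(name, value, domain="example.com"):
    """Build a Cookie by hand; hypothetical helper for this sketch only."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=int(time.time()) + 3600,
        discard=False, comment=None, comment_url=None, rest={},
    )

# Write to a throwaway directory so no existing file is touched.
path = os.path.join(tempfile.mkdtemp(), "cookie.txt")

jar = http.cookiejar.MozillaCookieJar(path)
jar.set_cookie(make_cookie("session", "abc123"))
jar.save(ignore_discard=True, ignore_expires=True)

# A fresh jar reads the same file back.
loaded = http.cookiejar.MozillaCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
restored = {c.name: c.value for c in loaded}
print(restored)  # {'session': 'abc123'}
```

The file must be loaded with the same jar class that wrote it, since MozillaCookieJar and LWPCookieJar use different file formats.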

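2. Exception handling

The error module of urllib defines the exceptions raised by the request module.

The URLError class:

URLError comes from the error module of urllib and inherits from OSError. It is the base class of the exceptions in the error module, so any exception raised by the request module can be caught by handling it.

It has one attribute, reason, which returns the cause of the error.

For example, requesting a page that does not exist on a site: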

# About the URLError exception

import urllib.request
from urllib.error import URLError

try:
    response = urllib.request.urlopen("https://i.cnblogs.com/a.html")
except URLError as e:
    print(e.reason)

The result is Not Found.
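
The class relationships can be checked without making a request. In the sketch below the exceptions are constructed by hand, and the URL and messages are placeholders.

```python
import io
from urllib.error import HTTPError, URLError

print(issubclass(URLError, OSError))    # True
print(issubclass(HTTPError, URLError))  # True

# Construct the exceptions directly instead of triggering them over the network.
url_err = URLError("connection refused")
print(url_err.reason)

http_err = HTTPError("http://example.com/a.html", 404, "Not Found", {}, io.BytesIO(b""))
print(http_err.code, http_err.reason)

# Catching the parent class also catches the subclass.
caught = None
try:
    raise http_err
except URLError as e:
    caught = type(e).__name__
print(caught)  # HTTPError
```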

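The HTTPError class:

It is a subclass of URLError dedicated to HTTP request errors, for example a failed authentication request. It has three attributes:

code: the HTTP status code, for example 404 or 500

reason: the cause of the error

headers: the response headers

For example: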

import urllib.request
from urllib.error import HTTPError

try:
    response = urllib.request.urlopen("https://i.cnblogs.com/a.html")
except HTTPError as e:
    print(e.reason, e.code, e.headers)

Since HTTPError is a subclass of URLError, we can catch the subclass error first and then the parent class error.

For example:

import urllib.request
from urllib.error import HTTPError, URLError

try:
    response = urllib.request.urlopen("https://i.cnblogs.com/a.html")
except HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except URLError as e:
    print(e.reason)
else:
    print("No exception")

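Sometimes reason is an object rather than a string, for example:

import urllib.request
from urllib.error import HTTPError, URLError

try:
    response = urllib.request.urlopen("https://www.baidu.com", timeout=0.01)
except HTTPError as e:
    print(type(e.reason))
except URLError as e:      # the request timed out and is caught as URLError
    print(type(e.reason))  # <class 'socket.timeout'>, an exception object
else:
    print("No exception")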

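Modifying the program above once more, isinstance can be used to check which kind of object reason is and print a specific message:

import socket
import urllib.request
from urllib.error import HTTPError, URLError

try:
    response = urllib.request.urlopen("https://www.baidu.com", timeout=0.01)
except HTTPError as e:
    print(type(e.reason))
except URLError as e:      # the request timed out and is caught as URLError
    print(type(e.reason))  # <class 'socket.timeout'>, an exception object
    if isinstance(e.reason, socket.timeout):
        print("Time Out")
else:
    print("No exception")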
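
The same checks can be folded into one helper. This is only a sketch: describe_error is a made-up name, and the sample exceptions are constructed by hand so that it runs without a network connection.

```python
import io
import socket
from urllib.error import HTTPError, URLError

def describe_error(exc):
    """Return a short message for an exception raised by urlopen (hypothetical helper)."""
    if isinstance(exc, HTTPError):  # check the subclass first
        return f"HTTP {exc.code}: {exc.reason}"
    if isinstance(exc, URLError):
        if isinstance(exc.reason, socket.timeout):
            return "Time Out"
        return f"URL error: {exc.reason}"
    return f"Unexpected: {exc!r}"

samples = [
    HTTPError("http://example.com/a.html", 404, "Not Found", {}, io.BytesIO(b"")),
    URLError(socket.timeout("timed out")),
    URLError("no route to host"),
]
messages = [describe_error(e) for e in samples]
for message in messages:
    print(message)
# HTTP 404: Not Found
# Time Out
# URL error: no route to host
```

In real code it would be called from an except URLError as e: branch around urlopen.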

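3. URL parsing

The parse module of urllib provides a standard interface for processing URLs, for example extracting the parts of a URL, combining them, and converting links. It supports URLs of the following schemes:

file, ftp, gopher, hdl, http/https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, sip, snews, svn, svn+ssh, telnet, and wais.

Commonly used functions:

① urlparse() identifies a URL and splits it into its components, returning a ParseResult (a named tuple).

For example (passing allow_fragments=False makes it ignore the fragment):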

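# About urlparse

from urllib.parse import urlparse

result = urlparse("http://www.baidu.com/index.html;user?id=5#comment")
print(type(result), result)
# <class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
# What precedes :// is the scheme (the protocol); what precedes the first / after it
# is the netloc (the domain); then comes the path; after the semicolon, params;
# after ?, the query; after #, the fragment, an anchor that jumps to a position in the page.
# That is, the URL is split according to scheme://netloc/path;params?query#fragment

# If the URL has no scheme, a default can be supplied through the scheme argument;
# it only takes effect when the URL itself does not specify a scheme.
# Without a leading // the host is treated as part of path, and netloc is empty.
result = urlparse("www.baidu.com/index.html;user?id=5#comment", scheme="http")
print(type(result), result)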
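
A short sketch of how the result can be used, including what allow_fragments=False does to the same example URL.

```python
from urllib.parse import urlparse

url = "http://www.baidu.com/index.html;user?id=5#comment"

r = urlparse(url)
# Fields are available by name or by index.
print(r.scheme, r[0])  # http http
print(r.netloc, r.path, r.params, r.query, r.fragment)

# With allow_fragments=False, '#comment' is not split off and stays in the query.
nf = urlparse(url, allow_fragments=False)
print(repr(nf.query), repr(nf.fragment))  # 'id=5#comment' ''

# A leading // is what marks the start of netloc when the scheme is missing.
no_scheme = urlparse("//www.baidu.com/index.html", scheme="https")
print(no_scheme.scheme, no_scheme.netloc)  # https www.baidu.com
```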

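② urlunparse() assembles the elements of a list into a complete URL.

Its argument must contain exactly six elements, otherwise an error is raised.

For example: data=['http','www.baidu.com','/index.html', 'user', 'id=5', 'comment']

urlunparse(data) returns http://www.baidu.com/index.html;user?id=5#comment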
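
The same example as runnable code, plus a quick look at what happens when the list is too short. This is a sketch against CPython's current behaviour.

```python
from urllib.parse import urlparse, urlunparse

data = ['http', 'www.baidu.com', '/index.html', 'user', 'id=5', 'comment']
url = urlunparse(data)
print(url)  # http://www.baidu.com/index.html;user?id=5#comment

# Splitting the result again gives back the same six components.
print(list(urlparse(url)) == data)  # True

# Fewer than six components cannot be unpacked and raise ValueError.
error = None
try:
    urlunparse(['http', 'www.baidu.com'])
except ValueError as exc:
    error = type(exc).__name__
print(error)
```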

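③ urlsplit() splits a URL into its parts; urlunsplit() does the reverse.

For example:

from urllib.parse import urlsplit

r = urlsplit("http://www.baidu.com/index.html;user?id=5#comment")  # note: params is not split out, it stays in path
print(r)
print(r[0], r[1], r[2], r[3], r[4])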
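
A small round-trip sketch showing that params stays inside path and that urlunsplit() takes five components.

```python
from urllib.parse import urlsplit, urlunsplit

url = "http://www.baidu.com/index.html;user?id=5#comment"
parts = urlsplit(url)
print(parts.path)  # /index.html;user

# Putting the five parts back together restores the original URL.
rebuilt = urlunsplit(parts)
print(rebuilt == url)  # True

# urlunsplit also accepts any five-element sequence.
joined = urlunsplit(['http', 'www.baidu.com', '/index.html', 'id=5', 'comment'])
print(joined)  # http://www.baidu.com/index.html?id=5#comment
```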

④ urljoin() resolves a URL against a base URL and returns the combined result.

For example:

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))

The result is:

http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
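
A few more cases, sketched with a made-up base URL, show how relative references are resolved: the second argument only replaces the parts of the base that it specifies.

```python
from urllib.parse import urljoin

base = "http://www.baidu.com/a/b.html"  # hypothetical base URL
cases = [
    "c.html",                           # relative to the base's directory
    "/FAQ.html",                        # absolute path on the same host
    "../c.html",                        # one directory up
    "?category=2",                      # only the query changes
    "https://cuiqingcai.com/FAQ.html",  # a full URL wins outright
]
results = {ref: urljoin(base, ref) for ref in cases}
for ref, joined in results.items():
    print(ref, "->", joined)
# c.html -> http://www.baidu.com/a/c.html
# /FAQ.html -> http://www.baidu.com/FAQ.html
# ../c.html -> http://www.baidu.com/c.html
# ?category=2 -> http://www.baidu.com/a/b.html?category=2
# https://cuiqingcai.com/FAQ.html -> https://cuiqingcai.com/FAQ.html
```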

⑤ urlencode() serializes a dict of parameters into a query string.

For example:

from urllib.parse import urlencode

params = {
    'name': 'JONES',
    'age': 30,
}
url = 'http://www.baidu.com?' + urlencode(params)
print(url)

The result is:

http://www.baidu.com?name=JONES&age=30
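
A few more inputs, as a sketch of how urlencode() handles spaces, non-ASCII text, and repeated keys. The parameter names here are made up.

```python
from urllib.parse import urlencode

# Spaces become '+', other unsafe characters are percent-encoded.
plain = urlencode({'q': 'a b', 'lang': 'zh-TW'})
print(plain)    # q=a+b&lang=zh-TW

# Non-ASCII text is encoded as UTF-8 first.
chinese = urlencode({'name': '張三'})
print(chinese)  # name=%E5%BC%B5%E4%B8%89

# doseq=True expands a list into repeated keys.
multi = urlencode({'tag': ['x', 'y']}, doseq=True)
print(multi)    # tag=x&tag=y

# A list of pairs keeps the order it was given in.
pairs = urlencode([('b', 2), ('a', 1)])
print(pairs)    # b=2&a=1
```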

⑥ parse_qs()

urlencode() is a kind of serialization, and parse_qs() is the corresponding deserialization.

For example:

from urllib.parse import parse_qs

query = 'name=germey&age=22'
print(parse_qs(query))

Result:

{'name': ['germey'], 'age': ['22']}
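
Values come back as lists because a key may appear more than once. The sketch below also uses parse_qsl(), which returns a list of pairs instead, and pulls the query out of a full URL with urlparse().

```python
from urllib.parse import parse_qs, parse_qsl, urlencode, urlparse

query = 'name=germey&age=22'

pairs = parse_qsl(query)
print(pairs)  # [('name', 'germey'), ('age', '22')]

# Encoding the pairs again restores the original string.
print(urlencode(pairs) == query)  # True

# Repeated keys are collected into one list.
repeated = parse_qs('tag=x&tag=y')
print(repeated)  # {'tag': ['x', 'y']}

# Parsing the query part of a full URL.
from_url = parse_qs(urlparse('http://www.baidu.com?name=JONES&age=30').query)
print(from_url)  # {'name': ['JONES'], 'age': ['30']}
```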

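⑦ quote()

This function converts content into URL-encoded (percent-encoded) form.

For example:

from urllib.parse import quote

keyword = '張三'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)

The result is:

https://www.baidu.com/s?wd=%E5%BC%B5%E4%B8%89

⑧ unquote() does the opposite of quote().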
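
A round-trip sketch of the two functions, with the safe parameter and quote_plus() added for comparison.

```python
from urllib.parse import quote, quote_plus, unquote

keyword = '張三'
encoded = quote(keyword)
print(encoded)                      # %E5%BC%B5%E4%B8%89
print(unquote(encoded) == keyword)  # True

# '/' is left alone by default; safe='' encodes it as well.
print(quote('a b/c'))           # a%20b/c
print(quote('a b/c', safe=''))  # a%20b%2Fc

# quote_plus() is meant for query strings: spaces become '+'.
print(quote_plus('a b/c'))      # a+b%2Fc
```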

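4. robotparser

This is the Python module for parsing robots.txt files. Robots is a protocol, also called the crawler protocol or robots protocol; its full name is the Robots Exclusion Protocol (the web crawler exclusion standard).

It tells crawlers and search engines which pages may be fetched and which may not. A website usually has a robots.txt file in its root directory that lists what must not be crawled.

For example, if the file contains

User-agent: *
Disallow: /
Allow: /public/

it means that all crawlers are only allowed to fetch the public directory.

Crawlers usually have names. For example, the spider of Baidu (search engines use spiders to crawl web pages) is named Baiduspider; other sites are not listed here.

For example, checking a site: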

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://example.com/robots.txt')
rp.read()  # download and parse robots.txt
url = 'http://example.com'
user_agent = 'BadCrawler'
f = rp.can_fetch(user_agent, url)  # may this user agent fetch the page?
print(f)
user_agent = 'GoodCrawler'
n = rp.can_fetch(user_agent, url)
print(n)
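
The parser can also be fed rules directly with parse(), which makes it easy to try out without a network connection. The rules below are made up for illustration. As far as I can tell, the standard-library parser returns the first rule that matches, in file order, which is why Allow is listed before Disallow here; other crawlers may resolve such conflicts differently.

```python
import urllib.robotparser

# Hypothetical robots.txt content, not taken from a real site.
rules = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Allow: /public/
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

good_public = rp.can_fetch('GoodCrawler', 'http://example.com/public/index.html')
good_private = rp.can_fetch('GoodCrawler', 'http://example.com/private/index.html')
bad_public = rp.can_fetch('BadCrawler', 'http://example.com/public/index.html')

print(good_public)   # True: falls under the * entry, Allow: /public/ matches first
print(good_private)  # False: only Disallow: / matches
print(bad_public)    # False: BadCrawler has its own entry that disallows everything
```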