As far as libraries go, I consider the essential knowledge for writing web crawlers to cover urllib, requests, re, BeautifulSoup, and concurrent.futures. Below I summarize how to use the urllib library, illustrated with crawler examples.
Official documentation: https://docs.python.org/3/library/urllib.html
urllib is Python's built-in HTTP request library. It contains the following modules:
(1) urllib.request: the request module
(2) urllib.error: the exception-handling module
(3) urllib.parse: the URL-parsing module
(4) urllib.robotparser: the robots.txt parsing module
All of the examples below use http://example.webscraping.com/ as the target site.
(1) Function prototype of urlopen()
def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None):
This function is straightforward and only performs simple requests; it does not support advanced features such as authentication, cookies, or other high-level HTTP functionality. To get those you must use the OpenerDirector object returned by build_opener(), which is introduced later.
(2) A simple example first
import urllib.request

url = "http://example.webscraping.com/"
response = urllib.request.urlopen(url, timeout=1)
print("http status:", response.status)
(3) The data parameter
The data parameter is used for POST requests, such as form submission; if data is omitted, the request is a GET.
import urllib.parse
import urllib.request

data = urllib.parse.urlencode({'name': '張三', 'password': '666777'}).encode('utf-8')
print(data)
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
First, a word about urlencode: it converts key-value pairs into the query-string format we want, returning a string like a=1&b=2. To decode, use unquote(), because urllib does not provide an urldecode.
data1 = urllib.parse.urlencode({'name': '張三', 'password': '666777'}).encode('utf-8')
print(data1)
data2 = urllib.parse.unquote(data1.decode('utf-8'))
print(data2)
(4) The timeout parameter
When the network is poor or the server misbehaves, we need to set a timeout; otherwise the program will wait forever.
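A minimal sketch of handling a timeout (the 0.01-second timeout is deliberately tiny just to trigger the error; the URL is the example site used above):

import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://example.webscraping.com/', timeout=0.01)
    print(response.status)
except urllib.error.URLError as e:
    # connection timeouts are wrapped in URLError, with reason set to socket.timeout
    if isinstance(e.reason, socket.timeout):
        print('request timed out:', e.reason)
    else:
        print('request failed:', e.reason)
except socket.timeout:
    # a timeout while waiting for or reading the response can also propagate directly
    print('read timed out')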
(5) The response object returned by urlopen
response = urllib.request.urlopen(url, timeout=1)
for key, value in response.__dict__.items():
    print(key, ":", value)
It returns an <class 'http.client.HTTPResponse'> object; we can use response.status to get the status code and response.read() to get the response body.
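Besides inspecting __dict__, the HTTPResponse object exposes the usual accessors; a small sketch (assuming the page is UTF-8 encoded):

import urllib.request

response = urllib.request.urlopen('http://example.webscraping.com/', timeout=1)
print(response.status, response.reason)     # status code and reason phrase
print(response.getheaders())                # all response headers as (name, value) tuples
print(response.getheader('Content-Type'))   # a single header value
html = response.read().decode('utf-8')      # response body decoded to text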
As mentioned above, urlopen does not support headers, cookies, or other advanced HTTP usage; the solution is to use the build_opener() function to define your own opener object.
(1) Function prototype
build_opener([handler1[, handler2[, ...]]])
The arguments are instances of special handler classes; the table below lists the available handlers (a sketch of combining a couple of them follows the table):
Handler | Description
CacheFTPHandler | FTP handler with persistent FTP connections
FileHandler | Opens local files
FTPHandler | Opens URLs via FTP
HTTPBasicAuthHandler | Handles basic HTTP authentication
HTTPCookieProcessor | Handles HTTP cookies
HTTPDefaultErrorHandler | Handles HTTP errors by raising an HTTPError exception
HTTPDigestAuthHandler | Handles HTTP digest authentication
HTTPHandler | Opens URLs via HTTP
HTTPRedirectHandler | Handles HTTP redirects
HTTPSHandler | Opens URLs via HTTPS
ProxyHandler | Redirects requests through a proxy
ProxyBasicAuthHandler | Handles basic proxy authentication
ProxyDigestAuthHandler | Handles digest proxy authentication
UnknownHandler | Catch-all handler for unknown URLs
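As a quick sketch of how a couple of these handlers can be combined, the snippet below passes an HTTPBasicAuthHandler together with a cookie processor to build_opener(); the username and password are placeholders, not real credentials for the example site:

import http.cookiejar
import urllib.request

# password manager holding the (hypothetical) credentials
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, 'http://example.webscraping.com/', 'user', 'passwd')

auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
cookie_handler = urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar())

opener = urllib.request.build_opener(auth_handler, cookie_handler)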
(2) Creating the opener object
Here we use setting cookies and adding a proxy server as examples.
Sometimes crawling a site requires carrying cookie information, so we need to set cookies; at the same time most sites track how many times a given IP visits over a period of time, and if it visits too often they will block it, in which case we need to crawl through a proxy server.
import http.cookiejar
import urllib.request

proxy = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
cjar = http.cookiejar.CookieJar()
# pass the proxy handler and the cookie processor to build_opener()
opener = urllib.request.build_opener(proxy, urllib.request.HTTPCookieProcessor(cjar))
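As a side note, if you want plain urllib.request.urlopen() calls to go through this opener as well, it can be installed as the process-wide default:

# make the opener above the default used by urllib.request.urlopen()
urllib.request.install_opener(opener)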
(3) Setting headers
headers are the request headers. To keep crawlers from overwhelming them, many sites require certain header fields before they will serve a request; the most common one is User-Agent.
Open the site, press F12 and switch to the Network tab, and we will see something like the following:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'example.webscraping.com'
}
Two ways to set headers are introduced here:
a. Via the urllib.request.Request object
request = urllib.request.Request(url, headers=headers)
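A complete minimal sketch of this approach, passing the Request object straight to urlopen (same User-Agent as above):

import urllib.request

url = 'http://example.webscraping.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0'}
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request, timeout=1)
print(response.status)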
b. Via the addheaders attribute of the OpenerDirector object
import http.cookiejar
import urllib.request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'example.webscraping.com'
}
cjar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPHandler,
                                     urllib.request.HTTPCookieProcessor(cjar))
# addheaders expects a list of (name, value) tuples
opener.addheaders = list(headers.items())
(4) The open() function of OpenerDirector
Function prototype:
def open(self, fullurl, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT):
It contains, in part, the following code:
if isinstance(fullurl, str):
    req = Request(fullurl, data)
else:
    req = fullurl
    if data is not None:
        req.data = data
This shows that fullurl can be either a URL string or a urllib.request.Request object.
Usage:
request = urllib.request.Request(url)
response = opener.open(request, timeout=1)
Putting it all together:
import http.cookiejar
import urllib.error
import urllib.request

def DownLoad(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Connection': 'keep-alive',
        'Host': 'example.webscraping.com'
    }
    proxy = urllib.request.ProxyHandler({
        'http': 'http://127.0.0.1:9743',
        'https': 'https://127.0.0.1:9743'
    })
    cjar = http.cookiejar.CookieJar()
    # the proxy handler above could be passed to build_opener() as well;
    # it is left out here so the request goes out directly
    opener = urllib.request.build_opener(urllib.request.HTTPHandler,
                                         urllib.request.HTTPCookieProcessor(cjar))
    opener.addheaders = list(headers.items())
    try:
        request = urllib.request.Request(url)
        response = opener.open(request, timeout=1)
        print(response.__dict__)
    except urllib.error.URLError as e:
        if hasattr(e, 'code'):
            print("HTTPError:", e.code)
        elif hasattr(e, 'reason'):
            print("URLError:", e.reason)
Often, when fetching pages programmatically, some pages fail with errors such as 404 or 500, so we need to catch the exceptions. The combined code above already showed urllib.error in use:
except urllib.error.URLError as e:
    if hasattr(e, 'code'):
        print("HTTPError:", e.code)
    elif hasattr(e, 'reason'):
        print("URLError:", e.reason)
HTTPError is a subclass of URLError.
URLError has only one attribute, reason, so when catching it we can only print the error message, as in the example above.
HTTPError has three attributes, code, reason and headers, so when catching it we can obtain all three pieces of information.
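Because HTTPError is the subclass, it should be caught before URLError; a minimal sketch (the /nonexistent path is made up just to provoke a 404):

import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://example.webscraping.com/nonexistent', timeout=1)
except urllib.error.HTTPError as e:    # subclass first
    print('HTTPError:', e.code, e.reason)
    print(e.headers)
except urllib.error.URLError as e:     # base class second
    print('URLError:', e.reason)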
urllib.parse.urlencode was introduced earlier; next come three more functions: urlparse, urlunparse and urljoin.
(1) urlparse
Function prototype:
def urlparse(url, scheme='', allow_fragments=True):
"""Parse a URL into 6 components: <scheme>://<netloc>/<path>;<params>?<query>#<fragment> Return a 6-tuple: (scheme, netloc, path, params, query, fragment). Note that we don't break the components up in smaller bits (e.g. netloc is a single string) and we don't expand % escapes."""
In other words, it splits the URL you pass in into its components: scheme, network location (host and port), path, params, query and fragment.
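A small sketch (the URL is made up for illustration):

from urllib.parse import urlparse

result = urlparse('http://example.webscraping.com/places/default/view;type=a?name=abc#frag')
print(result)
# ParseResult(scheme='http', netloc='example.webscraping.com',
#             path='/places/default/view', params='type=a',
#             query='name=abc', fragment='frag')
print(result.scheme, result.netloc, result.path)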
(2) urlunparse
The opposite of urlparse: it assembles the individual components back into a URL.
Function prototype:
def urlunparse(components):
from urllib.parse import urlunparse

print(urlunparse(('https', 'www.baidu.com', 'index.html', 'name', 'a=123', '')))
(3) urljoin
This function joins URLs: components present in the second argument take precedence over the corresponding components of the first (base) argument, and missing components are filled in from the base.
Function prototype:
def urljoin(base, url, allow_fragments=True):
    """Join a base URL and a possibly relative URL to form an
    absolute interpretation of the latter."""
Example:
from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://pythonsite.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://pythonsite.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://pythonsite.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://pythonsite.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))
The urllib.robotparser module is used to parse robots.txt content.
Example:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://example.webscraping.com/robots.txt')
rp.read()   # read() fetches and parses robots.txt; it returns None
url = 'http://example.webscraping.com'
user_agent = 'BadCrawler'
print(rp.can_fetch(user_agent, url))
user_agent = 'GoodCrawler'
print(rp.can_fetch(user_agent, url))
Output: