Python: Learning Urllib

Date: 2019-11-06


The Urllib library

Definition: Urllib is the module that Python provides for working with URLs.

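1. Differences between Python 2.x and Python 3.x: Python 2.x ships two modules, urllib and urllib2, while Python 3.x merges urllib2 into a single urllib package with submodules such as urllib.request, urllib.parse, and urllib.error.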

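Changes from Python 2.x to Python 3.x: most names moved into submodules. For example, urllib2.urlopen() became urllib.request.urlopen(), and urllib.urlencode() became urllib.parse.urlencode().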
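As a quick reference, the imports below sketch where some commonly used names live in Python 3. The selection of names is my own and is not exhaustive.

```python
# Python 3 locations of names that lived in urllib / urllib2 under Python 2.
from urllib.request import urlopen, Request, build_opener   # were urllib2.*
from urllib.parse import urlencode, quote, unquote          # were urllib.*
from urllib.error import URLError, HTTPError                # were urllib2.*
```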

1. Fetch the Baidu home page and save it locally

import urllib.request

# Method 1: read the response body and write it to a file
url = "http://www.baidu.com"
data = urllib.request.urlopen(url).read()
with open("D:/python/file/1.html", "wb") as handle_data:
    handle_data.write(data)

# Method 2: urllib.request.urlretrieve() saves the page to a local file directly.
# It returns a (filename, headers) tuple.
filename, headers = urllib.request.urlretrieve(url, filename="D:/python/file/1.html")

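Note the differences between read(), readlines(), and readline():

read(): reads the entire content of the page and returns it as a single object. In Python 3 this is a bytes object; call decode() on it to get a str.

readlines(): also reads the entire content, but returns it as a list with one item per line.

readline(): reads one line of the page per call.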
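The three methods can be compared offline with a small sketch. Here io.BytesIO is a stand-in I am assuming for the response object returned by urllib.request.urlopen(), since both offer the same read(), readline(), and readlines() methods; the HTML body is made up for illustration.

```python
import io

# Hypothetical response body; a real one would come from urlopen(url).
body = b"<html>\n<body>hello</body>\n</html>\n"

whole = io.BytesIO(body).read()            # entire body as one bytes object
first_line = io.BytesIO(body).readline()   # only the first line, newline included
all_lines = io.BytesIO(body).readlines()   # list with one bytes object per line

text = whole.decode("utf-8")               # decode explicitly to get a str
print(type(whole).__name__, type(text).__name__, len(all_lines))
```

Each call starts from a fresh object because reading consumes the stream; calling read() twice on the same response returns an empty result the second time.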

2. URLs that contain Chinese characters must be encoded before use, and can be decoded again afterwards

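import urllib.parse

# quote() and unquote() are documented in urllib.parse
encode = urllib.parse.quote("http://www.sina.com.cn")
decode = urllib.parse.unquote(encode)
print(encode)  # ":" is escaped as well; "/" is left alone by default
print(decode)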
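The example above quotes a plain ASCII URL, so the effect on Chinese text is not visible. The sketch below shows it, assuming a made-up keyword and a made-up path on the same host; the safe argument of quote() lists characters that should be left unescaped.

```python
import urllib.parse

keyword = "中文"  # hypothetical search term
encoded = urllib.parse.quote(keyword)       # UTF-8 bytes, percent-encoded
decoded = urllib.parse.unquote(encoded)
print(encoded)   # %E4%B8%AD%E6%96%87
print(decoded)

# Keep ":" and "/" unescaped so the URL structure survives.
# The path "/搜索" is an assumption used only for illustration.
url = urllib.parse.quote("http://www.sina.com.cn/搜索", safe=":/")
print(url)
```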

3. Simulating a browser when requesting a page

# Method 1: set the header through build_opener()
url = "http://www.baidu.com"
header = ('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36')
opener = urllib.request.build_opener()
opener.addheaders = [header]
data = opener.open(url).read()
print(data)
# Method 2: add the header through urllib.request.Request()
req = urllib.request.Request(url)
req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36')
data = urllib.request.urlopen(req).read()
print(data)
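A header can also be passed through the headers argument of Request, and the result can be inspected before anything is sent. This is a minimal sketch; the shortened User-Agent string is a placeholder of mine, not a value from the article.

```python
import urllib.request

# Hypothetical shortened User-Agent value, for illustration only.
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64)"

req = urllib.request.Request("http://www.baidu.com", headers={"User-Agent": USER_AGENT})

# Request stores header names capitalized, so the key is "User-agent".
print(req.get_header("User-agent"))
print(req.header_items())
```

Checking the request object this way helps catch a misspelled header name, which would otherwise be sent silently as an unknown header.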

4. Timeout settings

for i in range(1, 100):
    try:
        # timeout is in seconds; a slow response raises an exception
        response = urllib.request.urlopen("http://yum.iqianyue.com", timeout=1)
        data = response.read()
        print(len(data))
    except Exception as e:
        print('Exception occurred --->', e)
5. Using request methods: POST, GET, PUT, and so on

A POST request example

import urllib.parse
import urllib.request

url = "http://www.iqianyue.com/mypost/"

# 1. Build the POST parameters
postdata = {"name":"ceo@iqianyue.com","pass":"aA123456"}

# 2. Encode the data with urllib.parse.urlencode(), then convert it to UTF-8 bytes
encode_postdata = urllib.parse.urlencode(postdata).encode('utf-8')

# 3. Build the Request with the POST data
req = urllib.request.Request(url,encode_postdata)

# 4. Simulate a browser by adding a User-Agent header to the request
req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36')

# 5. Send the request with urllib.request.urlopen() and read the response
data = urllib.request.urlopen(req).read()

# 6. Save the data locally
os_handle = open("D:/python/file/5.html","wb")
os_handle.write(data)

# 7. Close the file
os_handle.close()

For GET requests, note that if the URL contains Chinese or other special characters, it must be percent-encoded before the request is sent.
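One way to build such a GET URL is to let urllib.parse.urlencode() encode the query parameters. This is a sketch: the search path and the "wd" parameter name are assumptions chosen for illustration, as is the query text.

```python
import urllib.parse

base_url = "http://www.baidu.com/s"        # assumed search endpoint
params = {"wd": "中文 python"}              # hypothetical query
query = urllib.parse.urlencode(params)     # percent-encodes values as UTF-8
full_url = base_url + "?" + query
print(full_url)
```

The resulting string can be passed to urllib.request.urlopen() or urllib.request.Request() like any other URL. Note that urlencode() turns spaces into "+", which is the usual form for query strings.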

6. Proxy settings

def user_proxy(proxy_address, url):

    # 1. Proxy settings (IP address and port, optionally a username and password).
    #    The dict keys select which schemes go through the proxy (http, https, ftp);
    #    https is listed too because the URL requested below uses https.
    proxy = urllib.request.ProxyHandler({'http': proxy_address, 'https': proxy_address})

    # 2. Build an opener with the proxy handler (HTTPHandler, HTTPSHandler, FTPHandler can be passed as well)
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)

    # 3. Simulate a browser
    user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36"
    opener.addheaders = [('User-Agent', user_agent)]

    # 4. Install the opener globally
    urllib.request.install_opener(opener)

    # 5. Send the request through the global opener
    data = urllib.request.urlopen(url).read()
    return data


self_url = "https://www.baidu.com"
proxy_ip = '144.48.4.214:8989'
data = user_proxy(proxy_ip, self_url)
print(len(data))
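install_opener() changes the behaviour of every later urlopen() call in the process. When that is not wanted, the opener can be kept local and used directly. The sketch below assumes a placeholder proxy address, and build_proxy_opener is a hypothetical helper name.

```python
import urllib.request


def build_proxy_opener(proxy_address):
    """Build an opener that routes HTTP and HTTPS traffic through one proxy."""
    proxy = urllib.request.ProxyHandler({"http": proxy_address, "https": proxy_address})
    # build_opener() adds the default HTTP and HTTPS handlers automatically
    return urllib.request.build_opener(proxy)


opener = build_proxy_opener("127.0.0.1:8080")  # placeholder address, not a real proxy
# opener.open(url) would send the request through the proxy without
# affecting urllib.request.urlopen() elsewhere in the program.
```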

7. DebugLog in practice

# 1. Create an HTTPHandler and an HTTPSHandler with debuglevel=1
http_hd = urllib.request.HTTPHandler(debuglevel=1)
https_hd = urllib.request.HTTPSHandler(debuglevel=1)

# 2. Pass them to build_opener() to enable debugging, then install the opener globally
opener = urllib.request.build_opener(http_hd, https_hd)
urllib.request.install_opener(opener)

# 3. Send the request through the global opener; the debug log is printed as it runs
data = urllib.request.urlopen("http://edu.51cto.com")

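8. Exception handling: there are two kinds, HTTPError and URLError

HTTPError is raised for HTTP status code errors, such as 403 or 404.

URLError is raised for errors in making the request itself, such as a host that does not exist or a server that cannot be reached.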

import urllib.error
import urllib.request

try:
    data = urllib.request.urlopen("https://www.jiangcxczxc.com").read()
    print(len(data))
except urllib.error.URLError as e:
    # code is only present when the exception is an HTTPError
    if hasattr(e, 'code'):
        print(e.code)
    if hasattr(e, 'reason'):
        print(e.reason)

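The difference: HTTPError only covers status code errors and cannot handle a URL that does not exist, a server failure, and so on. URLError covers all of these, because HTTPError is a subclass of URLError. That is why the code above catches URLError and uses hasattr() to check for the code attribute, which only HTTPError carries.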
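An alternative to the hasattr() checks is to test for the more specific class first. The sketch below is one possible arrangement; describe_error and fetch_length are hypothetical helper names, and the 5 second timeout is an arbitrary choice.

```python
import urllib.error
import urllib.request


def describe_error(exc):
    """Turn a urllib exception into a short message."""
    # HTTPError must be checked first because it is a subclass of URLError
    if isinstance(exc, urllib.error.HTTPError):
        return f"HTTP error {exc.code}: {exc.reason}"
    if isinstance(exc, urllib.error.URLError):
        return f"URL error: {exc.reason}"
    return f"unexpected error: {exc!r}"


def fetch_length(url):
    """Return the body length, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return len(response.read())
    except urllib.error.URLError as exc:  # also catches HTTPError
        print(describe_error(exc))
        return None
```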

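Summary: details to watch when using the urllib module.

Urllib is the module we use to work with URLs, and it comes up constantly when writing crawlers.
Generally, a standard URL only allows a subset of ASCII characters, such as letters, digits, and some symbols. A request made with a non-standard URL will fail. Chinese characters, and characters such as ":" and "&" when they appear inside the data of a URL, must be encoded first; the request is then sent with the encoded URL.
Crawlers often run into 403 errors, usually because the site has anti-crawler measures in place. Other approaches are then needed, such as simulating a browser, sending requests through a proxy, or keeping cookies.
Finally, exception handling: a crawler must catch exceptions so that one failure does not interrupt the whole run.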
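The summary mentions keeping cookies, which the sections above do not demonstrate. A minimal sketch, assuming the standard http.cookiejar module and a placeholder User-Agent value:

```python
import http.cookiejar
import urllib.request

# The jar collects cookies from responses and replays them on later requests
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
opener.addheaders = [("User-Agent", "Mozilla/5.0")]  # placeholder value

# opener.open(url) would store any Set-Cookie values in cookie_jar;
# the jar stays empty until a request is actually made.
print(len(cookie_jar))
```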