Using Python's urllib and urllib3 packages (reposted)

Date: 2019-11-30

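The urllib package

urllib is a standard-library package made up of several modules for working with URLs and HTTP requests:

urllib.request: sends HTTP requests
urllib.error: defines the exceptions raised while a request is being made
urllib.parse: parses and builds URLs
urllib.robotparser: parses robots.txt files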
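
urllib.robotparser is the only module in the list that the rest of this article does not demonstrate, so here is a minimal sketch of it. The robots.txt text, the crawler name "my-crawler" and the example.com URLs are made up for illustration, and the rules are parsed from a string rather than fetched from a site.

```python
from urllib import robotparser

# Hypothetical robots.txt content, parsed from memory instead of being downloaded.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(text):
    """Return a RobotFileParser loaded with the given robots.txt text."""
    parser = robotparser.RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)
allowed_index = rules.can_fetch("my-crawler", "https://example.com/index.html")
allowed_private = rules.can_fetch("my-crawler", "https://example.com/private/a")
print(allowed_index)    # True
print(allowed_private)  # False
```

Against a real site you would normally call set_url() and read() on the parser instead of parse().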

urllib.request

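The most heavily used module in urllib. It covers requests, responses, browser emulation, proxies, cookies and more.

1. Quick requests

The object returned by urlopen provides a few basic methods:

read: returns the response body as bytes
info: returns the headers sent by the server
getcode: returns the HTTP status code
geturl: returns the URL that was actually retrieved

request.urlopen(url, data=None, timeout=10)
# url: the address to open
# data: the data to submit with a POST request
# timeout: how long to wait for the site, in seconds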

from urllib import request
import ssl
# Works around "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed>" in some environments.
# Note that this turns off certificate verification for every HTTPS request made by the process.
ssl._create_default_https_context = ssl._create_unverified_context
url = 'https://www.jianshu.com'
# Returns something like <http.client.HTTPResponse object at 0x0000000002E34550>
response = request.urlopen(url, data=None, timeout=10)
# urlopen() returns the page as bytes, so decode() it to get a str.
page = response.read().decode('utf-8')
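
To try read, info, getcode and geturl without depending on an external site, the sketch below starts a throwaway HTTP server on localhost and requests it. HelloHandler and its response body are invented for this example. The request goes through an opener built with an empty ProxyHandler so that a proxy configured in the environment cannot intercept the local request; opener.open() returns the same kind of object as urlopen().

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that serves one small UTF-8 page."""

    def do_GET(self):
        body = "hello, 山西".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

# An empty ProxyHandler means "connect directly".
opener = request.build_opener(request.ProxyHandler({}))

# The with block closes the connection when it ends.
with opener.open(url, timeout=10) as response:
    status = response.getcode()
    final_url = response.geturl()
    # Take the charset from the Content-Type header instead of hard-coding it.
    charset = response.info().get_content_charset() or "utf-8"
    page = response.read().decode(charset)

server.shutdown()
server.server_close()
print(status, final_url, charset, page)
```

Reading the charset from the headers avoids a UnicodeDecodeError on pages that are not UTF-8, which a hard-coded decode('utf-8') would raise.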

2. Emulating PC and mobile browsers

This requires custom request headers. urlopen cannot take headers directly, so build a Request object first.

PC

import urllib.request

url = 'https://www.jianshu.com'
# Add a header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'
}
request = urllib.request.Request(url,headers=headers)
response = urllib.request.urlopen(request)
# urllib decides between GET and POST by checking whether the data argument was supplied
print(request.get_method())

>> Output
GET
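
The rule that data decides the method can be checked without sending anything, because a Request object is only a description of a request. A small sketch follows; the body b"key=value" is a placeholder, and the method argument shown last is the standard way to ask for a verb other than GET or POST.

```python
from urllib import request

url = "https://www.jianshu.com"

get_req = request.Request(url)
post_req = request.Request(url, data=b"key=value")
put_req = request.Request(url, data=b"key=value", method="PUT")

# No network traffic happens here; the objects are only inspected.
print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(put_req.get_method())   # PUT
```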

Mobile

from urllib import request

req = request.Request('http://www.douban.com/')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) '
                             'AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))

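3. Using cookies

Cookies are stored on the client to record the user's identity and keep a login session alive.

import http.cookiejar, urllib.request

# Same URL as in the earlier examples
url = 'https://www.jianshu.com'
# 1. Create a CookieJar object
cookie = http.cookiejar.CookieJar()
# Build a cookie handler with HTTPCookieProcessor
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build an opener object
opener = urllib.request.build_opener(handler)
# Install the opener globally
urllib.request.install_opener(opener)
data = urllib.request.urlopen(url)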


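# 2. Save cookies to a text file
import http.cookiejar, urllib.request
filename = "cookie.txt"
# Several file formats are available
## Format 1: Mozilla/Netscape cookies.txt
cookie = http.cookiejar.MozillaCookieJar(filename)
## Format 2: libwww-perl Set-Cookie3
cookie = http.cookiejar.LWPCookieJar(filename)
# After a request made through an opener built on this jar, cookie.save() writes the file

# Load with the class that matches the format used to save
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt',ignore_discard=True,ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)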
……
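
The save and load steps can be exercised end to end against a local test server. This is only a sketch: CookieHandler, the cookie session=abc123 and the temporary file location are assumptions made for the example, and the empty ProxyHandler is there so the request reaches the local server directly.

```python
import http.cookiejar
import os
import tempfile
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that sets one cookie on every response."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Set-Cookie", "session=abc123; Max-Age=3600; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), CookieHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

filename = os.path.join(tempfile.mkdtemp(), "cookie.txt")

# Collect cookies in a jar that knows how to write itself to disk.
jar = http.cookiejar.MozillaCookieJar(filename)
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({}),          # ignore any environment proxy
    urllib.request.HTTPCookieProcessor(jar),
)
opener.open(url, timeout=10).close()
jar.save(ignore_discard=True, ignore_expires=True)

# Later, or in another process: load the file with the same jar class.
loaded = http.cookiejar.MozillaCookieJar()
loaded.load(filename, ignore_discard=True, ignore_expires=True)
saved_cookies = {c.name: c.value for c in loaded}

server.shutdown()
server.server_close()
print(saved_cookies)  # {'session': 'abc123'}
```

ignore_discard=True keeps session cookies that would otherwise be dropped when saving or loading.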

4. Setting a proxy

When the site you want to scrape restricts access, you can route your requests through a proxy.

import urllib.request

url = 'http://httpbin.org/ip'
proxy = {'http':'39.134.108.89:8080','https':'39.134.108.89:8080'}
proxies = urllib.request.ProxyHandler(proxy) # create the proxy handler
opener = urllib.request.build_opener(proxies,urllib.request.HTTPHandler) # build an opener that uses it
urllib.request.install_opener(opener) # install it globally so that urlopen goes through this opener too
data = urllib.request.urlopen(url)
print(data.read().decode())

urllib.error

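urllib.error defines the exceptions raised by urllib.request. The two you will meet most often are the exception classes URLError and HTTPError. URLError is a subclass of OSError, and HTTPError is a subclass of URLError. Every HTTP response from a server carries a status code, and that code tells us whether the request succeeded.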
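
The class hierarchy can be confirmed directly, and it explains why the more specific class has to be tested first. In the sketch below, classify is a hypothetical helper and the exceptions are constructed by hand purely for illustration.

```python
import urllib.error

# The hierarchy described above, checked directly.
http_is_url = issubclass(urllib.error.HTTPError, urllib.error.URLError)
url_is_os = issubclass(urllib.error.URLError, OSError)
print(http_is_url, url_is_os)  # True True


def classify(exc):
    """Return a short label for an exception, most specific class first."""
    if isinstance(exc, urllib.error.HTTPError):
        return "http error %d" % exc.code
    if isinstance(exc, urllib.error.URLError):
        return "url error: %s" % exc.reason
    return "other"


# Hand-built exceptions, only to show which branch each one takes.
http_exc = urllib.error.HTTPError("http://example.com", 404, "Not Found", None, None)
url_exc = urllib.error.URLError("no route")
print(classify(http_exc))  # http error 404
print(classify(url_exc))   # url error: no route
```

If the URLError test came first, every HTTPError would match it and the status code would never be reported.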

URLError

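A URLError is usually raised when the network is unreachable, the server does not exist, and so on.

For example, requesting a URL that does not exist: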

import urllib.error
import urllib.request
request = urllib.request.Request('http://www.usahfkjashfj.com/')
try:
    urllib.request.urlopen(request).read()
except urllib.error.URLError as e:
    print(e.reason)
else:
    print('success')


>> Output (from Windows; the exact message depends on the platform)
[Errno 11004] getaddrinfo failed

HTTPError

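HTTPError is a subclass of URLError. Whenever you send a request with urlopen, the server replies with a response that includes a numeric status code. Some codes urllib handles for you: if the response is a redirect that points to another address, for instance, urllib follows it. For codes it cannot handle, urlopen raises an HTTPError that carries the status code.

The HTTP status code describes the outcome of the request as defined by the HTTP protocol.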

from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)

# Catch the subclass first
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

>> Output

Not Found

-------------
Not Found
404
Server: nginx/1.10.3 (Ubuntu)
Date: Thu, 08 Feb 2018 14:45:39 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Vary: Cookie
Expires: Wed, 11 Jan 1984 05:00:00 GMT
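
The same pattern can be reproduced offline. The sketch below assumes a local test server whose handler, NotFoundHandler, answers every request with 404; fetch_status is a hypothetical helper that catches HTTPError before URLError, and the empty ProxyHandler keeps the request from being routed through an environment proxy.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error, request


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, *args):
        pass  # keep the example output quiet


def fetch_status(opener, url):
    """Return (code, reason); code is None when no HTTP response arrived."""
    try:
        with opener.open(url, timeout=10) as response:
            return response.status, response.reason
    except error.HTTPError as e:   # subclass first
        return e.code, e.reason
    except error.URLError as e:    # DNS failure, refused connection, ...
        return None, str(e.reason)


server = HTTPServer(("127.0.0.1", 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/missing" % server.server_address[1]

opener = request.build_opener(request.ProxyHandler({}))
result = fetch_status(opener, url)

server.shutdown()
server.server_close()
print(result)  # (404, 'Not Found')
```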

urllib.parse

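urllib.parse.urljoin: joining URLs

Builds an absolute URL from a base URL and a second URL. The second URL should belong to the same site; if it is itself an absolute URL, its scheme and host replace those of the base.

from urllib import parse

print(parse.urljoin('https://www.jianshu.com/xyz','FAQ.html'))
print(parse.urljoin('http://www.baidu.com/about.html','http://www.baidu.com/FAQ.html'))

>> Output
https://www.jianshu.com/FAQ.html
http://www.baidu.com/FAQ.html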
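
A few more cases make the resolution rules easier to see, in particular the effect of a trailing slash on the base and of an absolute second argument. The paths /a/b and /x are invented for this sketch.

```python
from urllib import parse

base = "https://www.jianshu.com/a/b"

# A relative name replaces the last path segment of the base.
sibling = parse.urljoin(base, "FAQ.html")
# With a trailing slash the base is treated as a directory.
child = parse.urljoin(base + "/", "FAQ.html")
# A leading slash resolves against the site root.
rooted = parse.urljoin(base, "/FAQ.html")
# An absolute URL replaces scheme, host and path.
other_site = parse.urljoin(base, "http://www.baidu.com/FAQ.html")
# A scheme-relative URL keeps the scheme of the base but replaces the host.
same_scheme = parse.urljoin(base, "//www.baidu.com/x")

print(sibling)      # https://www.jianshu.com/a/FAQ.html
print(child)        # https://www.jianshu.com/a/b/FAQ.html
print(rooted)       # https://www.jianshu.com/FAQ.html
print(other_site)   # http://www.baidu.com/FAQ.html
print(same_scheme)  # https://www.baidu.com/x
```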

urllib.parse.urlencode: converting a dict to a query string

from urllib import request, parse
url = r'https://www.jianshu.com/collections/20f7f4031550/mark_viewed.json'
headers = {
    'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Referer': r'https://www.jianshu.com/c/20f7f4031550?utm_medium=index-collections&utm_source=desktop',
    'Connection': 'keep-alive'
}
data = {
    'uuid': '5a9a30b5-3259-4fa0-ab1f-be647dbeb08a',
}
# POST data must be bytes or an iterable of bytes, not str, so the result of urlencode() has to be encoded with encode()
data = parse.urlencode(data).encode('utf-8')
print(data)
req = request.Request(url, headers=headers, data=data)
page = request.urlopen(req).read()
page = page.decode('utf-8')
print(page)

>> Output
b'uuid=5a9a30b5-3259-4fa0-ab1f-be647dbeb08a'
{"message":"success"}
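
urlencode can also be tried on its own, without sending a request. The following sketch uses made-up parameters (q, page, tag) and shows list values and the reverse operation, parse_qs.

```python
from urllib import parse

# Spaces become '+', and non-string values are converted with str().
query = parse.urlencode({"q": "hello world", "page": 2})
print(query)  # q=hello+world&page=2

# doseq=True expands a list into one key=value pair per item.
tags = parse.urlencode({"tag": ["a", "b"]}, doseq=True)
print(tags)  # tag=a&tag=b

# parse_qs reverses the encoding; every value comes back as a list of str.
decoded = parse.parse_qs(query)
print(decoded)  # {'q': ['hello world'], 'page': ['2']}

# The bytes form is what Request(data=...) expects for a POST body.
body = query.encode("utf-8")
print(body)  # b'q=hello+world&page=2'
```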

urllib.parse.quote: URL-encoding

urllib.parse.unquote: URL-decoding

A URL may only contain ASCII characters, so non-ASCII text has to be percent-encoded from its bytes in some character encoding, for example:

http://so.biquge.la/cse/search?s=7138806708853866527&q=%CD%EA%C3%C0%CA%C0%BD%E7

from urllib import parse

x = parse.quote('山西', encoding='gb18030')  # encoding='GBK' gives the same result here
print(x)  # %C9%BD%CE%F7


city = parse.unquote('%E5%B1%B1%E8%A5%BF')  # encoding defaults to 'utf-8'
print(city)  # 山西
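
The sketch below puts the two encodings side by side and adds the safe argument and quote_plus, which often cause confusion. The string "a b/c" is just a sample value.

```python
from urllib import parse

text = "山西"

utf8_form = parse.quote(text)                 # UTF-8 is the default encoding
gbk_form = parse.quote(text, encoding="gbk")
print(utf8_form)  # %E5%B1%B1%E8%A5%BF
print(gbk_form)   # %C9%BD%CE%F7

# Decoding must use the same encoding that produced the bytes.
round_trip = parse.unquote(gbk_form, encoding="gbk")
print(round_trip)  # 山西

# '/' is left alone by default; safe='' encodes it too.
path_default = parse.quote("a b/c")
path_strict = parse.quote("a b/c", safe="")
# quote_plus targets form data: spaces become '+', and '/' is encoded.
form_value = parse.quote_plus("a b/c")
print(path_default)  # a%20b/c
print(path_strict)   # a%20b%2Fc
print(form_value)    # a+b%2Fc
```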

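The urllib3 package

urllib3 is a powerful, well-organized HTTP client library for Python, and much of the Python ecosystem already depends on it. It provides many important features that the standard library lacks:

1. Thread safety
2. Connection pooling
3. Client-side SSL/TLS verification
4. File uploads with multipart encoding
5. Helpers for retrying requests and handling HTTP redirects
6. Support for compressed encodings
7. Proxy support for HTTP and SOCKS

Installation:

urllib3 can be installed with pip:

$ pip install urllib3

Alternatively, download the latest source from GitHub and install it from the checkout:

$ git clone git://github.com/shazow/urllib3.git

$ python setup.py install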

Using urllib3:

GET requests

import urllib3
# Silence the warning "InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised."
urllib3.disable_warnings()
# A PoolManager instance issues the requests; it takes care of connection pooling and thread safety
http = urllib3.PoolManager()
# Create a request with the request() method:
r = http.request('GET', 'http://cuiqingcai.com/')
print(r.status) # 200
# Get the HTML source, decoded as UTF-8
print(r.data.decode())

GET requests with query parameters

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
# fields is URL-encoded and appended to the URL after a '?', so the URL itself should not end with one
r = http.request('GET',
                 'https://www.baidu.com/s',
                 fields={'wd': 'hello'},
                 headers=header)
print(r.status) # 200
print(r.data.decode())

POST requests

# request() also accepts additional information for the request, such as form fields and headers:
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
}
r = http.request('POST',
                 'http://httpbin.org/post',
                 fields={'hello':'world'},
                 headers=header)
print(r.data.decode())

# With POST and PUT, fields are sent in the request body. To send parameters in the query string instead, encode them yourself and append them to the URL:
import urllib.parse
encode_arg = urllib.parse.urlencode({'arg': '個人'})
print(encode_arg.encode())
r = http.request('POST',
                 'http://httpbin.org/post?'+encode_arg,
                 headers=header)
# httpbin returns non-ASCII characters as \uXXXX escapes; unicode_escape turns them back into text

print(r.data.decode('unicode_escape'))

Sending JSON data

# JSON: to send JSON, serialize the data yourself, pass the encoded bytes as the body argument, and set the Content-Type header:
import json
data={'attribute':'value'}
encode_data= json.dumps(data).encode()

r = http.request('POST',
                     'http://httpbin.org/post',
                     body=encode_data,
                     headers={'Content-Type':'application/json'}
                 )
print(r.data.decode('unicode_escape'))

Uploading files

# To upload a file with multipart/form-data encoding, pass it the same way as form data, giving the file as a tuple (file_name, file_data), optionally followed by a MIME type:
with open('1.txt','r+',encoding='UTF-8') as f:
    file_read = f.read()

r = http.request('POST',
                 'http://httpbin.org/post',
                 fields={'filefield':('1.txt', file_read, 'text/plain')
                         })
print(r.data.decode('unicode_escape'))

# Binary files: send the raw bytes as the body and set Content-Type to match
with open('websocket.jpg','rb') as f2:
    binary_read = f2.read()

r = http.request('POST',
                 'http://httpbin.org/post',
                 body=binary_read,
                 headers={'Content-Type': 'image/jpeg'})
#
# print(json.loads(r.data.decode('utf-8'))['data'] )
print(r.data.decode('utf-8'))

Using timeouts

# The timeout argument limits how long a request may take. For simple cases, set it to a float number of seconds:
r = http.request('POST',
                 'http://httpbin.org/post',timeout=3.0)

print(r.data.decode('utf-8'))

# To apply the same timeout to every request, set timeout on the PoolManager:
http = urllib3.PoolManager(timeout=3.0)
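
When one number is too coarse, urllib3 also accepts a Timeout object with separate limits for connecting and for reading. A minimal sketch, assuming urllib3 is installed; the values 2.0 and 5.0 are arbitrary.

```python
import urllib3

# Separate limits, in seconds, for opening the connection and for waiting on data.
timeout = urllib3.Timeout(connect=2.0, read=5.0)

# Building the PoolManager does not open any connection yet.
http = urllib3.PoolManager(timeout=timeout)

print(timeout.connect_timeout)  # 2.0
print(timeout.read_timeout)     # 5.0
```

The same object can be passed to a single call, for example http.request('GET', url, timeout=timeout), to override the pool-wide setting for that request only.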

Controlling retries and redirects

# The retries argument controls retrying. By default urllib3 retries a request 3 times and follows up to 3 redirects.
r = http.request('GET',
                 'http://httpbin.org/ip',retries=5)  # retry up to 5 times

print(r.data.decode('utf-8'))
## To turn off both retries and redirects, set retries to False:
r = http.request('GET',
                 'http://httpbin.org/redirect/1',retries=False,redirect=False)
print('d1',r.data.decode('utf-8'))
# To turn off redirects but keep retries, set redirect to False:
r = http.request('GET',
                 'http://httpbin.org/redirect/1',redirect=False)
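
For finer control than an integer or False, retries also accepts a urllib3.Retry object. The sketch below assumes urllib3 is installed and only builds the objects, so nothing is sent; the limits 5 and 2 are arbitrary.

```python
import urllib3

# Up to 5 retries overall, of which at most 2 may be redirects.
retry = urllib3.Retry(total=5, redirect=2)

# Every request made through this PoolManager inherits the policy.
http = urllib3.PoolManager(retries=retry)

print(retry.total, retry.redirect)  # 5 2
```

As with timeouts, a Retry object can also be passed to a single call, for example http.request('GET', url, retries=retry).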
