Web Scraping Basics (3): The urllib Module

Date: 2019-11-26


The urllib.request module

Import the request module from urllib and define the URL to fetch:

from urllib import request      # import the request module
# or: import urllib.request
url = "http://www.baidu.com/"   # the URL to request

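1. The urlopen() function

urlopen() opens a remote URL, sends a request to it, and returns the result as an HTTP response object. That object records the response headers and the response body of this HTTP exchange.

Signature: urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None). Note that cafile, capath and cadefault have been deprecated since Python 3.6 and were removed in Python 3.13; pass an ssl.SSLContext through context instead.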

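Using the url parameter

response = request.urlopen(url=url)

print(response)              # the response object, e.g. <http.client.HTTPResponse object at 0x10be801d0>
# Response metadata
print(response.headers)      # response headers
print(response.url)          # URL of the response
print(response.status)       # HTTP status code
# Response body: it is a stream, so it can only be read once
body = response.read()       # raw bytes
print(body)
print(body.decode("utf-8"))  # decode the bytes into a str
# Reading line by line needs a fresh response, because the one above is already consumed
response = request.urlopen(url=url)
print(response.readline())   # read one line
print(response.readline())   # read the next line
print(response.readlines())  # read the remaining lines as a list, one element per line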
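The read-once behaviour of the response body can be checked without touching the network. The sketch below is only an illustration: it assumes a data: URL (which urlopen() also accepts) as a stand-in for a real page, and the name demo_url is made up for this example.

```python
from urllib import request

# A data: URL stands in for a real page so the sketch runs offline (an assumption
# made for illustration; the article itself fetches http://www.baidu.com/).
demo_url = "data:text/plain;charset=utf-8,hello%20world"

with request.urlopen(demo_url) as response:
    first = response.read()    # consumes the whole body
    second = response.read()   # nothing is left, so this is empty

print(first.decode("utf-8"))
print(second)
```

If the body is needed more than once, keep the bytes returned by the first read() in a variable rather than calling read() again.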

Using the data parameter

The example above fetched Baidu with a GET request. To send a POST request with urllib, pass the data parameter: when data is supplied, urlopen() sends a POST; when it is omitted, it sends a GET.

import urllib.request
import urllib.parse  # the urllib.parse module is covered later in this article
# or: from urllib import request, parse

# urlencode() turns the dict into a form-encoded string, and bytes() converts it
# into the bytes object that the data parameter of urlopen() requires.
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')

response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
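How the presence of data switches the request method can be seen without sending anything. This is a small sketch using Request.get_method(); the variable names form, get_req and post_req are illustrative only.

```python
from urllib import parse, request

form = {"word": "hello"}
data = parse.urlencode(form).encode("utf-8")   # the POST body must be bytes

# No request is sent here; the objects are only inspected.
get_req = request.Request("http://httpbin.org/get")
post_req = request.Request("http://httpbin.org/post", data=data)

print(data)                   # b'word=hello'
print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```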

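Using the timeout parameter

When the network is poor or the server misbehaves, a request can be slow or hang. Set a timeout (in seconds) with the timeout parameter so the program does not wait for a result indefinitely.

import urllib.request

# A 1-second timeout is normally enough, so this prints the response body
response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())

# A 0.1-second timeout is usually too short: urlopen() raises
# urllib.error.URLError: <urlopen error timed out> (or socket.timeout),
# and the print below is never reached
response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
print(response.read())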
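In practice a timeout is usually caught rather than left to crash the program. The following is a minimal sketch, not a definitive pattern: the helper name fetch is hypothetical, and a data: URL is used in the demonstration call only so that it runs offline.

```python
import socket
from urllib import error, request


def fetch(url, timeout=1.0):
    """Return the response body, or None if the request fails or times out."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (error.URLError, socket.timeout) as exc:
        # A timeout while connecting is wrapped in URLError; a timeout while
        # waiting for or reading the response can surface as socket.timeout.
        print(f"request failed: {exc!r}")
        return None


print(fetch("data:text/plain,ok"))   # offline stand-in for a real URL
```

With a real site the call would look like fetch('http://httpbin.org/get', timeout=0.1), which returns None when the server does not answer in time.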

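2. The urlretrieve(url="xxx", filename="xxx") function

urlretrieve() opens the URL, sends the request, and saves the response body to the file named by filename.

res3 = request.urlretrieve(url=url, filename="./baidu.html")   # download url into baidu.html
print(res3)     # a tuple: (filename, response headers)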
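The tuple returned by urlretrieve() can be unpacked directly. The sketch below assumes a data: URL and a temporary directory so that it runs without network access; demo_url and target are names invented for the example.

```python
import os
import tempfile
from urllib import request

# Offline stand-in for a real page: the data: URL decodes to <h1>hello</h1>.
demo_url = "data:text/html;charset=utf-8,%3Ch1%3Ehello%3C%2Fh1%3E"
target = os.path.join(tempfile.mkdtemp(), "page.html")

# urlretrieve() returns (path of the saved file, response headers)
saved_path, headers = request.urlretrieve(demo_url, filename=target)

with open(saved_path, "rb") as fp:
    content = fp.read()

print(saved_path)
print(headers.get_content_type())
print(content)
```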

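3. The Request(url=url, data=data, method='POST') class

In web development, one URL often serves several different versions of the data or the page (for example, mobile and desktop). The backend decides which response to send based on the user agent of the client making the request, and it may refuse access when no user agent is present. Many sites also require certain request headers as a defense against crawlers that could overload them. The most common one is User-Agent, so a crawler often has to send browser-like request headers before it can access the target site.

Signature as used here: urllib.request.Request(url=url, headers=headers, data=data, method='POST')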

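Using the headers parameter

Add custom headers to the request so that it looks as if it came from a browser or another client.

# The header name must be spelled 'User-Agent', with the hyphen
req = request.Request(url=url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36'})

res = request.urlopen(req)  # send the request object that carries the headers
print(res.status)           # print the status code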

A POST request with custom headers

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'}
form = {'name': 'taotao'}   # form fields to send

data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

A second way to add headers to a POST request is add_header(). Its advantage is that you can define your own dictionary of headers and add them in a loop.

from urllib import request, parse

url = 'http://httpbin.org/post'
form = {'name': 'Germey'}   # form fields to send

data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
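The loop mentioned above might look like the sketch below. The extra_headers dictionary and its Accept-Language entry are assumptions added for illustration; no request is sent, the Request object is only inspected.

```python
from urllib import request

# Hypothetical header set; only User-Agent comes from the examples above.
extra_headers = {
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)",
    "Accept-Language": "en-US,en;q=0.8",
}

req = request.Request("http://httpbin.org/post", data=b"name=Germey", method="POST")
for name, value in extra_headers.items():
    req.add_header(name, value)

# Request stores header names in capitalized form, e.g. "User-agent".
print(req.get_header("User-agent"))
print(req.header_items())
```

Because the stored name is capitalized, look headers up as "User-agent" when calling get_header() or has_header().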

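4. urllib.request.ProxyHandler(): advanced usage with handlers and proxies

urllib.request.ProxyHandler() configures the proxy

urllib.request.build_opener(handler) creates an opener that carries the handler

opener.open(req) sends the request through the opener

Why use a proxy: a site may count how many requests an IP address makes within a period of time and block that address once there are too many. Routing requests through a proxy lets the crawler keep fetching data.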

import urllib.request, urllib.parse

url = "https://www.baidu.com/s?wd=ip"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36'}

req = urllib.request.Request(url=url, headers=headers)  # create a request object
handler = urllib.request.ProxyHandler(                  # create a handler that maps each scheme to a proxy
        {"http": '122.241.88.79:15872',
         "https": '122.241.88.79:15872'})

opener = urllib.request.build_opener(handler)  # create an opener that carries the handler
res = opener.open(req)                         # send the request through the opener

# save the page; it shows the IP address the site saw
with open("ip.html", 'wb') as fp:
    fp.write(res.read())
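Free proxies go offline often, so it can help to wrap the opener in error handling. This is a hedged sketch rather than a recommended implementation: the helper name fetch_via_proxy is hypothetical, and the proxy address is whatever the caller supplies.

```python
import socket
from urllib import error, request


def fetch_via_proxy(url, proxy, timeout=5.0):
    """Fetch url through the given proxy; return the body, or None on failure."""
    handler = request.ProxyHandler({"http": proxy, "https": proxy})
    opener = request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.read()
    except (error.URLError, socket.timeout) as exc:
        # A dead or slow proxy ends up here instead of crashing the crawler.
        print(f"proxy {proxy} failed: {exc!r}")
        return None
```

A crawler could call fetch_via_proxy(url, 'http://122.241.88.79:15872') and move on to the next proxy in its list whenever None comes back.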

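Session handling

Cookies hold the login information we commonly rely on, and some sites can only be crawled when the request carries cookies. The http.cookiejar module is used here to obtain and store them.

Sessions are handled with the handler + opener mechanism: import the cookie tool and initialize a CookieJar object, which then stores the cookie information as requests are processed.

from urllib import request, parse
from http import cookiejar

cookie = cookiejar.CookieJar()                   # initialize a cookie jar object
handler = request.HTTPCookieProcessor(cookie)    # create a handler that carries the cookie jar
opener = request.build_opener(handler)           # create an opener that carries the handler

response = opener.open('http://www.baidu.com')   # send the request through the opener
print(response.read().decode('utf-8'))           # once the request finishes, the opener's handler has stored the cookies in the CookieJar object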

Saving cookies to a file

Method 1: http.cookiejar.MozillaCookieJar()

import http.cookiejar, urllib.request

filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)   # Mozilla/Netscape cookies.txt format

handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')

# write the cookies to the file, including session and expired cookies
cookie.save(ignore_discard=True, ignore_expires=True)

Method 2: http.cookiejar.LWPCookieJar()

import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)   # libwww-perl Set-Cookie3 format

handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')

# write the cookies to the file, including session and expired cookies
cookie.save(ignore_discard=True, ignore_expires=True)

Reading cookies from a file: load them with the same class that was used to save them

Method 3: cookie.load()

import http.cookiejar, urllib.request

# cookie.txt must have been saved by LWPCookieJar (Method 2)
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)

handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
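The rule that a cookie file must be loaded with the class that saved it can be demonstrated offline. In the sketch below the cookie is built by hand purely for illustration (normally the server sets it); make_cookie, the cookie values, and the example.com domain are all assumptions.

```python
import os
import tempfile
from http.cookiejar import Cookie, LoadError, LWPCookieJar, MozillaCookieJar


def make_cookie(name, value, domain):
    """Build a session cookie by hand; a real crawler gets these from responses."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), "cookie.txt")

# Save in Mozilla format ...
saved = MozillaCookieJar(path)
saved.set_cookie(make_cookie("session", "abc123", "example.com"))
saved.save(ignore_discard=True, ignore_expires=True)

# ... and load with the same class: the cookie comes back.
loaded = MozillaCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
restored = {c.name: c.value for c in loaded}
print(restored)

# Loading the same file with the other class is rejected.
try:
    LWPCookieJar().load(path, ignore_discard=True, ignore_expires=True)
    mismatch_error = None
except LoadError as exc:
    mismatch_error = exc
print(mismatch_error)
```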

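The urllib.parse module

The URL parsing module

1. urlparse(): splitting a URL

The URL parsing functions focus on splitting a URL string into its components, or on combining URL components into a URL string.

When a URL is split, the scheme argument supplies the protocol only if the URL itself has none. If the URL already includes a scheme, the one passed as scheme= is ignored.

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

urlparse("www.baidu.com/index.html;user?id=5#comment", scheme="https")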

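from urllib.parse import urlparse

# Split the URL into its components; scheme="..." can supply a default protocol
result = urlparse("http://www.baidu.com/index.html;user?id=5#comment")
print(result)

Result:

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')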
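The effect of the scheme argument described in this section can be checked directly. This is a short sketch using the two URLs from the text; the variable names are illustrative.

```python
from urllib.parse import urlparse

# The URL has no scheme, so the default passed as scheme= is used.
# Without a leading "//", the host ends up in path rather than netloc.
without_scheme = urlparse("www.baidu.com/index.html;user?id=5#comment", scheme="https")
print(without_scheme)

# The URL already has a scheme, so scheme="https" is ignored.
with_scheme = urlparse("http://www.baidu.com/index.html;user?id=5#comment", scheme="https")
print(with_scheme.scheme)   # http
```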

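2. urlunparse(): assembling a URL

urlunparse() does the opposite of urlparse(): it assembles a URL from its six components.

from urllib.parse import urlunparse

# scheme, netloc, path, params, query, fragment
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=123', 'commit']
print(urlunparse(data))
# Result: http://www.baidu.com/index.html;user?a=123#commit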
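Since the two functions are inverses, a common use is to parse a URL, change one component, and put it back together. The sketch below assumes the example URL from the urlparse() section; _replace() is the standard namedtuple method that ParseResult inherits.

```python
from urllib.parse import urlparse, urlunparse

original = "http://www.baidu.com/index.html;user?id=5#comment"
parts = urlparse(original)

# Round trip: splitting and reassembling gives back the same URL.
rebuilt = urlunparse(parts)
print(rebuilt == original)   # True

# Change the query and drop the fragment, then reassemble.
edited = urlunparse(parts._replace(query="id=6", fragment=""))
print(edited)                # http://www.baidu.com/index.html;user?id=6
```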

3. urljoin(): joining URLs

When joining, the components of the second URL take precedence over those of the first (base) URL.

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
# Result: http://www.baidu.com/FAQ.html
print(urljoin('http://www.baidu.com', 'https://pythonsite.com/FAQ.html'))
# Result: https://pythonsite.com/FAQ.html
print(urljoin('http://www.baidu.com/about.html', 'https://pythonsite.com/FAQ.html'))
# Result: https://pythonsite.com/FAQ.html
print(urljoin('http://www.baidu.com', '?category=2#comment'))
# Result: http://www.baidu.com?category=2#comment
print(urljoin('www.baidu.com#comment', '?category=2'))
# Result: www.baidu.com?category=2

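4. urlencode()

This function converts a dictionary into URL query parameters and percent-encodes them. This matters because urllib does not accept Chinese characters in a URL; only ASCII characters are allowed.

from urllib import parse, request
url = "https://www.baidu.com/s?"

# write the parameters as a dictionary
dic = {"ie": "utf-8", "wd": "奔馳"}
# encode them with parse.urlencode()
params = parse.urlencode(dic)
# append the encoded parameters to the URL
url += params
print(request.urlopen(url=url))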
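To see what urlencode() actually produces, the encoded string can be inspected and decoded again with parse_qs(), without sending a request. A minimal sketch, reusing the dictionary from the example above:

```python
from urllib import parse

dic = {"ie": "utf-8", "wd": "奔馳"}

query = parse.urlencode(dic)    # non-ASCII characters become %XX escapes
print(query)
print(query.isascii())          # True: safe to append to a URL

# parse_qs() reverses the encoding; each value comes back as a list.
decoded = parse.parse_qs(query)
print(decoded)
```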

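The urllib.error module

When a program requests pages, some of them fail with errors such as 404 or 500, and these exceptions need to be caught.

urllib.error defines two main exceptions for this: URLError and HTTPError. HTTPError is a subclass of URLError.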

1. error.URLError

URLError has a single attribute, reason, so catching it only lets you print the error message.

from urllib import request, error

try:
    response = request.urlopen("http://pythonsite.com/1111.html")
except error.URLError as e:
    print(e.reason)

2. error.HTTPError

HTTPError has three attributes: code, reason and headers, so catching it gives access to all three. Because HTTPError is a subclass of URLError, it has to be caught first.

from urllib import request, error
try:
    response = request.urlopen("http://pythonsite.com/1111.html")
except error.HTTPError as e:
    print(e.code)
    print(e.reason)
    print(e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print("request succeeded")
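The subclass relationship and the order of the checks can be illustrated without a network call by constructing the exceptions by hand. This is only a sketch: the describe helper is hypothetical, and the exception objects are built manually instead of coming from a real failed request.

```python
import io
from urllib import error


def describe(exc):
    """Summarize an exception raised by urlopen()."""
    if isinstance(exc, error.HTTPError):    # check the subclass first
        return f"HTTP {exc.code}: {exc.reason}"
    if isinstance(exc, error.URLError):     # then the base class
        return f"URL error: {exc.reason}"
    return f"unexpected: {exc!r}"


# Hand-built exceptions standing in for real failures.
http_exc = error.HTTPError("http://pythonsite.com/1111.html", 404, "Not Found", None, io.BytesIO())
url_exc = error.URLError("connection refused")

print(issubclass(error.HTTPError, error.URLError))   # True
print(describe(http_exc))
print(describe(url_exc))
```

If the URLError check came first, it would also match every HTTPError, and the status code would never be reported.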