Python Web Scraping: The urllib Library

Date: 2021-08-13


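1. Introduction to the urllib library

urllib is Python's built-in request library and is enough for simple page fetching. In Python 2, requests were sent with two separate libraries, urllib and urllib2; Python 3 has only urllib. Since Python 3 is now the version in general use, learning urllib alone is sufficient. The Python source shows that the urllib package contains five modules: request, error, parse, robotparser, and response. The material I consulted rarely discusses robotparser and response, so this post covers only the other three.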

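2. The request module

As the name suggests, request is used to send requests, and its parameters let us imitate a browser. Note that request here is a submodule of urllib and should not be confused with the separate third-party library requests. Before writing this post I meant to read the module's source carefully, but it turned out to be more than 2,700 lines long, so I gave up.

The request module sends requests mainly through urlopen() and Request(), and it also provides a number of Handler classes. The code below demonstrates both; usage details are in the comments.

urlopen() demonstration: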

from urllib import request
from urllib import parse
from urllib import error
import socket

if __name__ == '__main__':
    '''
    def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
                *, cafile=None, capath=None, cadefault=False, context=None):
    Parameters:
    url: the address to request
    data: optional; if supplied, dictionary data must first be converted to bytes, and the request method changes from GET to POST
    timeout: optional; timeout in seconds; an exception is raised if the request takes longer
    The remaining parameters configure certificates and SSL; the defaults are fine
    '''
    # A simple request
    response_1 = request.urlopen(url="http://www.baidu.com")  # returns an HTTPResponse object
    print(response_1.read().decode("utf-8"))  # that completes a simple request
    print("Status code:", response_1.status)
    print("Response headers:", response_1.getheaders())
    print("---------------------------------- divider ----------------------------------")
    # A more complex request
    form = {"name": "Tom"}  # renamed from dict to avoid shadowing the built-in
    data = bytes(parse.urlencode(form), encoding="utf-8")
    try:
        response_2 = request.urlopen(url="http://www.httpbin.org/post", data=data, timeout=10)
    except error.URLError as e:
        if isinstance(e.reason, socket.timeout):
            print("The request timed out")
        else:
            print("The request failed:", e.reason)
    else:
        # Read the body only when the request succeeded
        print(response_2.read().decode('utf-8'))
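The switch from GET to POST described for the data parameter can be checked without sending anything. The sketch below is only an illustration: it builds Request objects (the class covered in the next part) and asks them which method they would use, and it reuses the httpbin URL from the example above purely as a label, since no network call is made.

```python
from urllib import parse, request

# Encode a dictionary into the bytes form that the data parameter expects
form = {"name": "Tom"}
data = parse.urlencode(form).encode("utf-8")
print(data)

# Without data the request defaults to GET; with data it defaults to POST
get_req = request.Request("http://www.httpbin.org/post")
post_req = request.Request("http://www.httpbin.org/post", data=data)
print(get_req.get_method())
print(post_req.get_method())
```

Nothing is sent here; get_method() only reports what urlopen() would use if the object were passed to it.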

Building a request with Request:

from urllib import request, parse

if __name__ == '__main__':
    """
    Request is a class; its initializer takes the values below and builds a more fully specified request
    def __init__(self, url,
                 data=None, headers={},
                 origin_req_host=None,
                 unverifiable=False,
                 method=None):
    url: the address to request
    data: optional; if supplied, dictionary data must first be converted to bytes
    headers: optional; a dictionary. Setting User-Agent lets the request pose as a browser, which helps against anti-scraping checks
    origin_req_host: optional; the host name or IP address of the original request
    unverifiable: optional; whether the request is unverifiable
    method: optional; the request method, such as GET, POST, or PUT
    """
    form = {"name": "Tom"}  # renamed from dict to avoid shadowing the built-in
    data = bytes(parse.urlencode(form), encoding="utf-8")
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"
    }  # pose as the Chrome browser
    req = request.Request(url="http://www.httpbin.org/post", data=data, headers=headers, method="POST")
    response = request.urlopen(req)
    print(response.read().decode("utf-8"))
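A Request object can be inspected before it is sent, which is a convenient way to confirm that the headers and method are what you intended. This is a minimal sketch under a few assumptions: the shortened User-Agent string and the Referer value are made up for illustration, and nothing is sent over the network.

```python
from urllib import request

# Illustrative header value, not a real browser string
headers = {"User-Agent": "Mozilla/5.0 (example)"}
req = request.Request(
    "http://www.httpbin.org/post",
    data=b"name=Tom",
    headers=headers,
    method="POST",
)

# Headers can also be added after construction (hypothetical Referer value)
req.add_header("Referer", "http://www.httpbin.org/")

print(req.full_url)
print(req.get_method())
# Request stores header names capitalized, so the key becomes "User-agent"
print(req.header_items())

# An explicit method takes precedence over the GET/POST default
head_req = request.Request("http://www.httpbin.org/post", method="HEAD")
print(head_req.get_method())
```

Because header names are normalized on the way in, look them up with get_header("User-agent") rather than the spelling used in the dictionary.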

3. The error module

The error module defines two exception classes used here: URLError and HTTPError. HTTPError is a subclass of URLError.

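from urllib import request, error

if __name__ == '__main__':
    try:
        # Try to open a website that does not exist
        # (the URL in the original post was lost; this reserved .invalid name is a placeholder)
        response_1 = request.urlopen("http://nonexistent.invalid")
    except error.URLError as e:
        print(e.reason)
    try:
        # A request that returns an error status
        response_2 = request.urlopen("http://www.baidu.com/aaa.html")
    except error.HTTPError as e:
        print(e.reason)
        # 404 means the page does not exist; 500 means a server error
        print(e.code)
        print(e.headers)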
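Because HTTPError is a subclass of URLError, the order in which the two are checked matters: test for HTTPError first, otherwise the URLError branch swallows it. The sketch below shows that ordering with exceptions constructed by hand so that it runs offline; describe() is a hypothetical helper, and the reason string given to URLError is invented for the example.

```python
import io
from email.message import Message
from urllib import error


def describe(exc):
    """Return a short description, checking the more specific class first."""
    if isinstance(exc, error.HTTPError):
        return f"HTTP error {exc.code}: {exc.reason}"
    if isinstance(exc, error.URLError):
        return f"URL error: {exc.reason}"
    return f"unexpected: {exc!r}"


# Hand-built exceptions, so no network access is needed
not_found = error.HTTPError(
    "http://www.baidu.com/aaa.html", 404, "Not Found", Message(), io.BytesIO()
)
unreachable = error.URLError("name resolution failed")

print(issubclass(error.HTTPError, error.URLError))
print(describe(not_found))
print(describe(unreachable))
```

The same ordering applies to except clauses: place except error.HTTPError before except error.URLError.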

4. The parse module

urlparse(): parses a URL string into its components

from urllib import parse

if __name__ == '__main__':
    url = "https://www.baidu.com/s;param1?ie=UTF-8&wd=python#錨點"
    result = parse.urlparse(url=url)
    print(result)
    # Output:
    # ParseResult(scheme='https', netloc='www.baidu.com', path='/s', params='param1', query='ie=UTF-8&wd=python', fragment='錨點')
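The ParseResult printed above is a named tuple, so each component can be read by name or by position. The sketch below also passes the query string to parse_qs(), a urllib.parse function that the original post does not mention, to turn it into a dictionary.

```python
from urllib import parse

url = "https://www.baidu.com/s;param1?ie=UTF-8&wd=python#錨點"
result = parse.urlparse(url)

# Fields are available by attribute name or by index
print(result.scheme)
print(result.netloc)
print(result[2])  # index 2 is the path

# parse_qs maps each query key to a list of its values
query = parse.parse_qs(result.query)
print(query)
```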

urlunparse(): the inverse of urlparse(). Pass it a list of six items in the same order as the fields of the urlparse() result.
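As a small illustration of urlunparse(), the sketch below assembles a URL from six parts and then checks that parsing and unparsing it gives the same string back. The part values mirror the example URL used in this section, with an ASCII fragment chosen only for readability.

```python
from urllib import parse

# Order: scheme, netloc, path, params, query, fragment
parts = ["https", "www.baidu.com", "/s", "param1", "ie=UTF-8&wd=python", "anchor"]
rebuilt = parse.urlunparse(parts)
print(rebuilt)

# Parsing the result and unparsing it again should round-trip
round_trip = parse.urlunparse(parse.urlparse(rebuilt))
print(round_trip == rebuilt)
```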

urlsplit() and urlunsplit(): essentially the same as the two functions above, except that params is not split out and stays part of path.

from urllib import parse

if __name__ == '__main__':
    url = "https://www.baidu.com/s;param1?ie=UTF-8&wd=python#錨點"
    result = parse.urlsplit(url=url)
    print(result)
    # Output:
    # SplitResult(scheme='https', netloc='www.baidu.com', path='/s;param1', query='ie=UTF-8&wd=python', fragment='錨點')
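To make the difference between the two pairs of functions concrete, the sketch below runs urlsplit() and urlparse() on the same URL and compares the number of fields and the path. It also uses urlunsplit() to put the pieces back together.

```python
from urllib import parse

url = "https://www.baidu.com/s;param1?ie=UTF-8&wd=python#錨點"
split_result = parse.urlsplit(url)
parse_result = parse.urlparse(url)

# urlsplit returns 5 fields, urlparse returns 6
print(len(split_result), len(parse_result))

# With urlsplit the params stay inside the path
print(split_result.path)
print(parse_result.path, parse_result.params)

# urlunsplit reverses urlsplit
print(parse.urlunsplit(split_result) == url)
```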

The other functions in the module play a similar role: they all parse or process URLs.
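The post does not name these other functions. As an assumption about which ones are likely to be useful, the sketch below shows urlencode(), quote(), unquote(), and urljoin(), all of which exist in urllib.parse. The relative path img/logo.png and the extra query key are hypothetical values chosen for the example.

```python
from urllib import parse

# urlencode builds a query string from a dictionary (spaces become "+")
query = parse.urlencode({"name": "Tom", "q": "a b"})
print(query)

# quote percent-encodes non-ASCII text; unquote reverses it
quoted = parse.quote("錨點")
restored = parse.unquote(quoted)
print(quoted, restored)

# urljoin resolves a relative link against a base URL
joined = parse.urljoin("https://www.baidu.com/s?wd=python", "img/logo.png")
print(joined)
```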
