Python Web Scraping from Scratch (Part 2): The urllib Library

Date: 2019-11-24

Tags: python, web scraping basics, urllib. Category: Python

Picking up from the previous post, we continue our web-scraping series. This time we look at the urllib library.

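1. What is urllib?

urllib is Python's built-in HTTP request library. It consists of four modules:

urllib.request      request module
urllib.error        exception handling module
urllib.parse        URL parsing module
urllib.robotparser  robots.txt parsing module

It ships with Python, so nothing extra needs to be installed.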

Note:

Python 2

import urllib2
response = urllib2.urlopen('http://baidu.com')

Python 3

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')

urllib is used somewhat differently in Python 2 and Python 3. The examples in this post use Python 3.

2. Methods and modules

1) request

Basic usage (a GET request):

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

Running this prints a long block of text, the HTML source of the page. This is the text from which we will later extract the data we need.

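Sending request parameters (a POST request):

import urllib.request
import urllib.parse

# urlencode() builds 'username=cainiao'; bytes() converts it for the request body
data = bytes(urllib.parse.urlencode({'username': 'cainiao'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())

httpbin.org echoes the request back, so the response shows that the username parameter we put in data was received.

Note that data must be of type bytes.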
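To see why the conversion is needed, here is a minimal offline sketch of the two steps that the bytes(...) call above combines. It uses str.encode() instead of bytes(), which is simply an equivalent spelling; the variable names are only for illustration.

```python
from urllib.parse import urlencode

form = {'username': 'cainiao'}

# Step 1: urlencode() turns the dict into a str
encoded = urlencode(form)

# Step 2: the str must be encoded to bytes before it is passed as data=
data = encoded.encode('utf-8')

print(encoded)       # username=cainiao
print(type(data))    # <class 'bytes'>
```

Passing the str from step 1 directly as data would raise a TypeError when the request is sent.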

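Setting a request timeout:

import urllib.request
response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())

If the server does not respond within the timeout (given in seconds), the code fails with a "timed out" error. We can catch it with the urllib.error module, as follows: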

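import urllib.request
import urllib.error

try:
    # A very short timeout, chosen to trigger the error
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
    print(response.read())
except urllib.error.URLError as e:
    # The error type is not checked here; inspect e.reason to tell a timeout from other failures
    print('The connection timed out!')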
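The error-type check mentioned in the comment could look roughly like the sketch below. describe_url_error is a hypothetical helper, not part of urllib, and the errors are constructed by hand so that the example runs without network access.

```python
import socket
import urllib.error

def describe_url_error(exc):
    """Return a short label for a URLError (hypothetical helper)."""
    # When the connection attempt times out, urlopen wraps the socket
    # timeout in a URLError and exposes it as exc.reason.
    if isinstance(exc.reason, socket.timeout):
        return 'timed out'
    return 'other error: {}'.format(exc.reason)

# Simulated errors, so the sketch runs offline
print(describe_url_error(urllib.error.URLError(socket.timeout('timed out'))))
print(describe_url_error(urllib.error.URLError('name resolution failed')))
```

As far as I know, a timeout that happens later, while waiting for or reading the response, may surface as a plain socket.timeout rather than a URLError, so catching both is the safer choice in a real crawler.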

That covers the simplest, most basic use of urlopen. Readers who want a deeper understanding can read its source code.

2) Response

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(type(response))
# The type is <class 'http.client.HTTPResponse'>

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(type(response))                 # response type
print(response.status)                # status code, as covered in the previous post
print(response.getheaders())          # all response headers
print(response.getheader('Server'))   # a single response header

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))  # response body

The response body is returned as bytes, so we call decode('utf-8') to convert it to a string.
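Not every page is encoded as UTF-8. A minimal sketch, assuming the server declares its charset in the Content-Type header: response.headers is an email.message.Message subclass, so get_content_charset() can be used to pick the codec. decode_body is a hypothetical helper, and a hand-built Message stands in for response.headers so the example runs offline.

```python
from email.message import Message

def decode_body(body, headers, default='utf-8'):
    """Decode bytes with the charset from Content-Type, if any (hypothetical helper)."""
    charset = headers.get_content_charset() or default
    return body.decode(charset)

# Stand-in for response.headers
headers = Message()
headers['Content-Type'] = 'text/html; charset=gbk'

text = decode_body('你好'.encode('gbk'), headers)
print(text)
```

With a real response, the call would be decode_body(response.read(), response.headers).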

A common way to write a POST request

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)',
    'Host': 'httpbin.org'
}
form = {
    'name': 'cxiaocai'
}
# The request body must be bytes
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

It can also be written like this, adding the header after the Request has been created:

from urllib import request, parse

url = 'http://httpbin.org/post'
form = {
    'name': 'cxiaocai'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
# add_header() takes the header name and value as two separate arguments
req.add_header('User-Agent', 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

With this, the most basic urllib requests are covered, and a large share of websites can already be scraped.
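A Request object can be inspected before anything is sent, which is handy when debugging headers or form data. The sketch below builds the same kind of request as above and only prints its fields; it makes no network call.

```python
from urllib import request, parse

data = parse.urlencode({'name': 'cxiaocai'}).encode('utf-8')
req = request.Request('http://httpbin.org/post', data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)')

# Nothing is sent until request.urlopen(req) is called
print(req.get_method())              # POST
print(req.full_url)                  # http://httpbin.org/post
print(req.data)                      # b'name=cxiaocai'
print(req.get_header('User-agent'))  # header names are stored capitalized
```

Note the lookup key 'User-agent': add_header() stores the name with only its first letter capitalized, so get_header('User-Agent') would return None.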

3. Proxy settings

We only touch on proxies briefly here; later posts will demonstrate them with a real crawler.

Proxy handler

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:1111',
    'https': 'https://127.0.0.1:2222'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.baidu.com')
print(response.read())  # I have no proxy available, so this code is untested.

Cookie settings

import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
# Print every cookie the server set
for item in cookie:
    print(item.name + "=" + item.value)

Some websites require a login, for example, which is why we need to handle cookies here.

We can also save the cookies to a text file so that they can be read back repeatedly.

import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
# Keep session cookies and expired cookies when saving
cookie.save(ignore_discard=True, ignore_expires=True)

After the code runs, a cookie.txt file is created in the current working directory (typically the project directory).

Another cookie storage format

import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True, ignore_expires=True)

Running this code also produces a text file, this time in the LWP (libwww-perl Set-Cookie3) format.

　
Next, let's read back the cookie file we just saved:

import http.cookiejar, urllib.request

cookie = http.cookiejar.LWPCookieJar()
# Load with the same jar class that wrote the file
cookie.load('cookie.txt', ignore_expires=True, ignore_discard=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
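The save and load steps can also be tried without touching the network. The following is a sketch under a few assumptions: make_cookie is a hypothetical helper, the cookie name, value, and domain are made up, and the file goes into a temporary directory instead of the project directory.

```python
import http.cookiejar
import os
import tempfile

def make_cookie(name, value, domain):
    """Build a session cookie by hand (hypothetical helper for this demo)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith('.'),
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')

# Save: ignore_discard=True keeps session cookies, which are skipped by default
jar = http.cookiejar.MozillaCookieJar(path)
jar.set_cookie(make_cookie('session', 'abc123', '.example.com'))
jar.save(ignore_discard=True, ignore_expires=True)

# Load into a fresh jar of the same class
loaded = http.cookiejar.MozillaCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
for item in loaded:
    print(item.name + '=' + item.value)
```

The jar class used for loading has to match the one used for saving: a file written by MozillaCookieJar cannot be read by LWPCookieJar, and vice versa.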

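4. Exception handling

A simple example: here we request a page that does not exist.

from urllib import request, error

try:
    response = request.urlopen('https://www.cnblogs.com/cxiaocai/articles/index123.html')
except error.URLError as e:
    print(e.reason)

We know this page does not exist, so the request raises an error. Catching the exception lets the program keep running, and we could also retry the request at this point.

See also the official documentation: https://docs.python.org/3/library/urllib.error.html#module-urllib.error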
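urllib.error also defines HTTPError, a subclass of URLError that carries the HTTP status code. A minimal sketch of telling the two apart follows; describe is a hypothetical helper, and the exceptions are built by hand (with a made-up example.com URL) so the code runs offline.

```python
import io
from email.message import Message
from urllib import error

def describe(exc):
    """Summarise a urllib exception (hypothetical helper)."""
    # HTTPError is a subclass of URLError, so it has to be tested first
    if isinstance(exc, error.HTTPError):
        return 'HTTP {}: {}'.format(exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        return 'failed to reach server: {}'.format(exc.reason)
    raise TypeError('not a urllib error')

# Simulated errors, so the sketch runs offline
not_found = error.HTTPError('http://example.com/missing', 404, 'Not Found',
                            Message(), io.BytesIO(b''))
unreachable = error.URLError('connection refused')

print(describe(not_found))     # HTTP 404: Not Found
print(describe(unreachable))   # failed to reach server: connection refused
```

In a try/except around urlopen, the same ordering applies: put except error.HTTPError before except error.URLError, otherwise the first clause catches both.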

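5. URL parsing

urlparse

urlparse, a function in urllib.parse, is used mainly to parse URLs. Let's start with a simple example:

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1')
print(type(result), result)

The output is a ParseResult with six fields: scheme, netloc, path, params, query and fragment.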
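To make the fields concrete, here is a small sketch that reads them by attribute. parse_qs does not appear in the original example; it is included here on the assumption that the next step is usually to turn the query string into a dict.

```python
from urllib.parse import urlparse, parse_qs

result = urlparse('https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1')

print(result.scheme)    # https
print(result.netloc)    # www.baidu.com
print(result.path)      # /s
print(result.query)     # ie=utf-8&f=8&rsv_bp=1&rsv_idx=1

# ParseResult is a named tuple, so index access works too
print(result[0] == result.scheme)   # True

# parse_qs maps each parameter name to a list of values
params = parse_qs(result.query)
print(params['ie'])     # ['utf-8']
```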

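In other words, urlparse splits a URL into its components.

We can also supply a default scheme, such as http or https, for URLs that do not include one:

from urllib.parse import urlparse

result = urlparse('www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1', scheme='https')
print(result)

In the output, scheme is set to 'https'.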
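Two details of the scheme argument are easy to miss, so here is a short sketch of them. The shortened URLs are only for illustration.

```python
from urllib.parse import urlparse

# No '//' in front of the host: everything before '?' is treated as the path
no_slashes = urlparse('www.baidu.com/s?ie=utf-8', scheme='https')
print(no_slashes.netloc == '', no_slashes.path)   # True www.baidu.com/s

# With a leading '//' the host is recognised as netloc
with_slashes = urlparse('//www.baidu.com/s?ie=utf-8', scheme='https')
print(with_slashes.netloc, with_slashes.path)     # www.baidu.com /s

# A scheme already present in the URL wins over the default
explicit = urlparse('http://www.baidu.com/s?ie=utf-8', scheme='https')
print(explicit.scheme)                            # http
```

So scheme= is only a fallback, and it does not make urlparse guess where the host name ends.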
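The result also contains a fragment field. If we want the fragment to stay attached to the query instead of being split off, we can pass allow_fragments=False:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#commont', allow_fragments=False)
print(result)

In the output, fragment is empty and query is 'id=5#commont'.

If the URL has no query, the fragment text stays attached to the path instead. For example, parsing 'http://www.baidu.com/index.html#commont' this way gives a path of '/index.html#commont'.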
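Both cases can be checked side by side with a short sketch. The second URL is a variation of the first with the params and query removed.

```python
from urllib.parse import urlparse

# With a query: the '#...' part stays at the end of the query
with_query = urlparse('http://www.baidu.com/index.html;user?id=5#commont',
                      allow_fragments=False)
print(with_query.query)             # id=5#commont
print(with_query.fragment == '')    # True

# Without a query: the '#...' part stays at the end of the path
no_query = urlparse('http://www.baidu.com/index.html#commont',
                    allow_fragments=False)
print(no_query.path)                # /index.html#commont
print(no_query.fragment == '')      # True
```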

　　
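Now we know how to split a URL. Going the other way, if we have the parts of a URL as a sequence, for example data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment'], we can assemble them with urlunparse, which for this data gives 'http://www.baidu.com/index.html;user?a=6#comment'.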
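A minimal sketch of that call, using the same six-element list. The round trip through urlparse at the end is an extra check added here, not something from the original post.

```python
from urllib.parse import urlparse, urlunparse

# The six parts, in order: scheme, netloc, path, params, query, fragment
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']

url = urlunparse(data)
print(url)   # http://www.baidu.com/index.html;user?a=6#comment

# Parsing the result and unparsing it again gives back the same URL
print(urlunparse(urlparse(url)) == url)   # True
```

urlunparse expects exactly six parts; a shorter or longer sequence raises an error.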


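There is also urljoin, which is mainly used to join URLs. It is called as urljoin(base, url).

The second URL takes precedence: any component it contains is kept, and any component it lacks is taken from the first (base) URL.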
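The rule is easier to see with a few cases. This is a small sketch; the page names and the example.com host are made up for illustration.

```python
from urllib.parse import urljoin

# The second URL has only a path: scheme and host come from the base
a = urljoin('http://www.baidu.com', 'FAQ.html')
print(a)   # http://www.baidu.com/FAQ.html

# The second URL is complete: it replaces the base entirely
b = urljoin('http://www.baidu.com/about.html', 'https://example.com/FAQ.html')
print(b)   # https://example.com/FAQ.html

# The second URL has only a query: the path is kept, the query is replaced
c = urljoin('http://www.baidu.com/a/b.html?x=1', '?y=2')
print(c)   # http://www.baidu.com/a/b.html?y=2
```

This is the function to reach for when a page contains relative links that need to be turned into absolute URLs.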
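If we have a dictionary of parameters and a URL and want to send a GET request (passing parameters in a GET request was covered in the previous post), we can serialize the dictionary with urlencode from urllib.parse and append the result to the URL.

Note that we have to add the '?' to the end of the URL ourselves.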
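A minimal sketch of that, assuming urlencode is the function meant here. The parameter names and values are made up, and nothing is actually requested.

```python
from urllib.parse import urlencode

params = {'name': 'cxiaocai', 'age': 18}

# The trailing '?' is written by hand; urlencode() does not add it
base_url = 'http://httpbin.org/get?'
url = base_url + urlencode(params)
print(url)   # http://httpbin.org/get?name=cxiaocai&age=18
```

The resulting string can then be passed to urllib.request.urlopen(url) as an ordinary GET request.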
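Finally, there is urllib.robotparser, which is mainly used to parse robots.txt files. The official documentation has some examples; since this module is not used often, I won't go into much detail here.

Interested readers can consult the official documentation: https://docs.python.org/3/library/urllib.robotparser.html#module-urllib.robotparser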
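For a quick impression of the module, here is a small sketch that feeds rules to the parser directly with parse(), so it runs without downloading anything. The rules and the example.com URLs are made up; in a real crawler the rules would usually be loaded with set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content
rules = [
    'User-agent: *',
    'Disallow: /private/',
]

rp = RobotFileParser()
rp.parse(rules)

# can_fetch(useragent, url) reports whether the rules allow the request
print(rp.can_fetch('*', 'http://example.com/index.html'))          # True
print(rp.can_fetch('*', 'http://example.com/private/data.html'))   # False
```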

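That covers the basic usage of urllib. You can now try writing a few crawlers of your own (parse the pages with regular expressions for now; simpler methods will come later).

To study urllib in more depth, see the official documentation, which also links to the source code: https://docs.python.org/3/library/urllib.html

Note: many readers copy my code directly and then find that it raises errors when pasted, because the extra blank lines have to be removed by hand. I don't recommend copying and pasting; later on we will put together a GitHub repository for everyone to use directly.

The next post will cover the Requests package, which I personally find easier to use than urllib. Stay tuned.

Thank you for reading. If anything here is incorrect, corrections are welcome. 🙏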