The Data Road - Python Web Scraping - urllib, requests, Regular Expressions, XPath, Beautiful Soup, pyquery

Date: 2019-11-08


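Part 1. Basic library: urllib

urllib is the HTTP request library built into Python. It contains four modules:

request: the most basic HTTP request module, used to simulate sending requests.
error: the exception-handling module. If a request fails, we can catch these exceptions and then retry or take other action so that the program does not terminate unexpectedly.
parse: a utility module that provides many URL-handling functions, such as splitting, parsing, and joining.
robotparser: mainly used to read a site's robots.txt file and decide which pages may be crawled and which may not. In practice it is used less often.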

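1. The urllib.request module

The main job of the request module is to construct HTTP requests; it can be used to simulate the way a browser issues a request.

The request module also handles authentication, redirection, browser cookies, and more.

- The urlopen() function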

 urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

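urlopen parameters:

url: the URL to request.
data: if omitted, a GET request is sent; if supplied, a POST request is sent.
timeout: the timeout in seconds. If no response arrives within this time, an exception is raised. If the parameter is omitted, the global default timeout is used. It applies to HTTP, HTTPS, and FTP requests.
context: must be an ssl.SSLContext instance; it specifies the SSL settings.
cafile: a CA certificate file.
capath: a directory of CA certificates. Both are useful when requesting HTTPS links. Note that cafile, capath, and cadefault were deprecated in Python 3.6 and removed in Python 3.13; use context instead.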
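
The data and timeout rules above can be sketched as follows. This is a minimal illustration, not the article's own code: fetch() is a hypothetical helper name, and the httpbin.org URLs are only placeholders. The GET/POST rule is checked offline through Request.get_method(), so nothing is sent over the network.

```python
import socket
from urllib import parse, request


def fetch(url, form=None, timeout=10.0):
    """Hypothetical helper: GET when form is None, POST otherwise."""
    data = parse.urlencode(form).encode("utf-8") if form is not None else None
    # urlopen raises urllib.error.URLError (or a socket timeout) on failure.
    with request.urlopen(url, data=data, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Offline check of the GET/POST rule: Request.get_method() reports the verb
# that would be sent, without opening a connection.
get_req = request.Request("http://httpbin.org/get")
post_req = request.Request(
    "http://httpbin.org/post",
    data=parse.urlencode({"word": "hello"}).encode("utf-8"),
)
print(get_req.get_method())   # GET
print(post_req.get_method())  # POST

# The global default used when timeout is omitted (None means "no timeout").
print(socket.getdefaulttimeout())
```

Passing an explicit timeout on every call, as fetch() does, is usually safer than relying on the global default, which is normally unset.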

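- The Request class

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Request parameters:

url: the URL to request. This is the only required parameter; all others are optional.
data: if supplied, it must be of type bytes. If it is a dictionary, encode it first with urlencode() from the urllib.parse module.
headers: a dictionary of request headers. Headers can be passed through the headers parameter when constructing the request, or added afterwards by calling add_header() on the request instance. The most common use is to change User-Agent so that the request looks like it comes from a browser.
origin_req_host: the host name or IP address of the requesting party.
unverifiable: indicates whether the request is unverifiable; the default is False. An unverifiable request is one whose URL the user had no option to approve. For example, if we request an image embedded in an HTML document and the user had no option to approve the automatic fetching of that image, unverifiable should be True.
method: a string naming the HTTP method to use, such as GET, POST, or PUT.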

from urllib import request, parse
 
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
form = {
    'name': 'Germey'
}
# data must be bytes: URL-encode the dictionary, then encode the string
data = parse.urlencode(form).encode('utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

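- Handlers

The BaseHandler class in urllib.request is the parent class of all other Handlers.

Common Handlers:

HTTPDefaultErrorHandler: handles HTTP error responses; errors are raised as HTTPError exceptions.
HTTPRedirectHandler: handles redirects.
HTTPCookieProcessor: handles cookies.
ProxyHandler: sets proxies. When no dictionary is passed, proxies are read from environment variables such as http_proxy.
HTTPPasswordMgr: manages passwords; it maintains a table of usernames and passwords.
HTTPBasicAuthHandler: manages authentication. If opening a link requires authentication, this Handler can supply it.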
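
As a hedged sketch of how a password manager and an authentication Handler combine into an Opener: the credentials and the localhost URL below are hypothetical placeholders, and the code only builds the opener without sending a request.

```python
from urllib.request import (
    HTTPBasicAuthHandler,
    HTTPPasswordMgrWithDefaultRealm,
    build_opener,
)

# Hypothetical credentials and URL, for illustration only.
username = "username"
password = "password"
url = "http://localhost:5000/"

# The password manager maps (realm, URL) to credentials; realm=None means "any realm".
password_mgr = HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, username, password)

# The handler answers 401 challenges using the stored credentials.
auth_handler = HTTPBasicAuthHandler(password_mgr)
opener = build_opener(auth_handler)

# opener.open(url) would now retry with an Authorization header after a 401.
found = password_mgr.find_user_password(None, url)
print(found)
```

The same build_opener() pattern applies to the other Handlers in the list: construct the Handler, pass it to build_opener(), and call open() on the result.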

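- Proxies

ProxyHandler takes a dictionary whose keys are protocol names (such as http or https) and whose values are proxy URLs; several proxies can be added.

Then build an Opener from this Handler with build_opener() and send the request through it.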

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener
 
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

- Cookies

# Fetch cookies from a page and print them one per line
import http.cookiejar, urllib.request
 
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

# Fetch cookies from a page and save them to a file
filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)  # or: http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

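Note: MozillaCookieJar is a subclass of CookieJar. Both LWPCookieJar and MozillaCookieJar can read and save cookies, but they use different file formats.

Call load() to read a local cookies file and recover its contents. The file must have been saved in the matching format; the example below expects a file written by LWPCookieJar.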

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

2. The urllib.error module

from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)
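
A slightly fuller sketch of the same idea, assuming we want to tell an HTTP error status apart from a failure to get any response. describe_failure() is a hypothetical helper name. HTTPError is a subclass of URLError, so it has to be caught first. The demonstration uses a file:// URL that does not exist so that it runs without network access.

```python
from urllib import error, request


def describe_failure(url, timeout=5.0):
    """Hypothetical helper: return a short description of how a request ended."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return f"ok {resp.status}"
    except error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        return f"http error {e.code}: {e.reason}"
    except error.URLError as e:
        # No usable response at all: DNS failure, refused connection, missing file, ...
        return f"url error: {e.reason}"


print(issubclass(error.HTTPError, error.URLError))
outcome = describe_failure("file:///no/such/file/for-this-example")
print(outcome)
```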

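3. The urllib.parse module

urlparse(): splits a URL into six parts: scheme, netloc, path, params, query, and fragment.
urlunparse(): builds a URL from those six parts.
urlsplit(): like urlparse(), but leaves params inside the path, so it returns five parts.
urlunsplit(): builds a URL from those five parts.
urljoin(): resolves a (possibly relative) URL against a base URL.
urlencode(): serializes a dictionary or a sequence of pairs into a query string.
parse_qs(): parses a query string into a dictionary of lists.
parse_qsl(): parses a query string into a list of (key, value) tuples.
quote(): percent-encodes a string for use in a URL.
unquote(): decodes a percent-encoded string.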
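
The functions listed above are pure string operations, so they can be tried without any network access. The following is a small sketch; the example.com URLs and the sample values are made up for illustration.

```python
from urllib.parse import (
    parse_qs, parse_qsl, quote, unquote,
    urlencode, urljoin, urlparse, urlunparse,
)

url = "http://www.example.com/index.html;user?id=5&lang=zh#comment"

# urlparse() splits the URL into its six components.
parts = urlparse(url)
print(parts.scheme, parts.netloc, parts.path, parts.params, parts.query, parts.fragment)

# urlunparse() is the inverse: six components in, URL out.
rebuilt = urlunparse(parts)

# urljoin() resolves a relative link against the page it was found on.
joined = urljoin("http://www.example.com/docs/index.html", "../about.html")

# urlencode() builds a query string; parse_qs()/parse_qsl() take it apart again.
query = urlencode({"name": "germey", "age": 22})
as_dict = parse_qs(query)
as_pairs = parse_qsl(query)

# quote()/unquote() percent-encode and decode non-ASCII text.
encoded = quote("壁紙")
decoded = unquote(encoded)
print(rebuilt, joined, query, as_dict, as_pairs, encoded, decoded, sep="\n")
```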

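4. The urllib.robotparser module

The Robots protocol, also called the crawler protocol or robot protocol, is formally the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be fetched and which may not. It is usually a text file named robots.txt placed in the root directory of a site, for example www.taobao.com/robots.txt.

The robotparser module provides the RobotFileParser class, which uses a site's robots.txt file to decide whether a crawler is allowed to fetch a given page.

urllib.robotparser.RobotFileParser(url='')

# set_url(): sets the URL of the robots.txt file.
# read(): fetches the robots.txt file and parses it.
# parse(): parses robots.txt content passed in as a list of lines.
# can_fetch(): takes two arguments, the User-agent and the URL to fetch, and returns whether that agent may fetch the URL.
# mtime(): returns the time robots.txt was last fetched and parsed.
# modified(): sets the time robots.txt was last fetched and parsed to the current time.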

from urllib.robotparser import RobotFileParser
 
rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', "http://www.jianshu.com/search?q=python&page=1&type=collections"))
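
The parse() method also makes it possible to test rules without downloading anything. A minimal sketch, assuming a made-up robots.txt and example.com URLs:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used instead of downloading a real one.
robots_txt = """\
User-agent: *
Disallow: /search
Allow: /p/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())  # parse() takes an iterable of lines

allowed_article = rp.can_fetch("*", "http://www.example.com/p/b67554025d7d")
allowed_search = rp.can_fetch("*", "http://www.example.com/search?q=python")
print(allowed_article, allowed_search)
```

With these rules the article URL is allowed and the search URL is not, because rules are applied in order and the first one whose path prefix matches decides.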

Part 2. Basic library: requests

1. GET and POST requests

The get(), post(), put(), and delete() functions send GET, POST, PUT, and DELETE requests respectively.

import requests

# GET request: params is encoded into the query string
data = {
    'name': 'germey',
    'age': 22
}
r = requests.get("http://httpbin.org/get", params=data)
print(r.text)

# POST request: data is sent as a form-encoded body
data = {'name': 'germey', 'age': '22'}
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)
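
To see what params and data actually turn into, the requests can be prepared without being sent. This is only an illustrative sketch using requests.Request and its prepare() method; it is not part of the original walkthrough, and the httpbin.org URLs are never contacted.

```python
import requests

# Build the requests without sending them, to inspect what would go on the wire.
get_req = requests.Request(
    "GET", "http://httpbin.org/get", params={"name": "germey", "age": 22}
).prepare()
post_req = requests.Request(
    "POST", "http://httpbin.org/post", data={"name": "germey", "age": "22"}
).prepare()

print(get_req.url)    # params end up in the query string
print(post_req.body)  # data becomes a form-encoded body
print(post_req.headers["Content-Type"])
```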

2. Responses

import requests
 
r = requests.get('http://www.jianshu.com')
print(type(r.status_code), r.status_code)    # status_code: the HTTP status code
print(type(r.headers), r.headers)    # headers: the response headers
print(type(r.cookies), r.cookies)    # cookies: the cookies set by the response
print(type(r.url), r.url)    # url: the final URL
print(type(r.history), r.history)    # history: the chain of redirect responses

3. File upload

import requests

# Open the file in binary mode and close it once the upload is done
with open('favicon.ico', 'rb') as f:
    r = requests.post("http://httpbin.org/post", files={'file': f})
print(r.text)

4. Cookies

# Get cookies
import requests
 
r = requests.get("https://www.baidu.com")
print(r.cookies)
for key, value in r.cookies.items():
    print(key + '=' + value)

5. Session persistence

import requests
 
s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)

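6. SSL certificate verification

requests also provides certificate verification. When it sends an HTTPS request it checks the server's SSL certificate, and the verify parameter controls whether this check is performed. If verify is omitted it defaults to True, so the certificate is verified automatically.

# Skip verification with verify=False and silence the resulting warning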
import requests
from requests.packages import urllib3
 
urllib3.disable_warnings()
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

# Alternatively, suppress the warning by capturing warnings into the logging system
import logging
import requests
logging.captureWarnings(True)
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

# Specify a local certificate as the client certificate: either a single file (containing both key and certificate) or a tuple of two file paths
import requests
 
response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)

7. Proxies

import requests
 
proxies = {
  "http": "http://10.10.1.10:3128",
  "https": "http://10.10.1.10:1080",
}
 
requests.get("https://www.taobao.com", proxies=proxies)

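8. Timeouts

import requests

# Raise an exception if no response arrives within 1 second
r = requests.get("https://www.taobao.com", timeout=1)
print(r.status_code)

# A request has two phases, connect and read; pass a (connect, read) tuple to set them separately
r = requests.get('https://www.taobao.com', timeout=(5, 30))

# Wait forever: timeout=None, which is also the default when timeout is omitted
r = requests.get('https://www.taobao.com', timeout=None)
r = requests.get('https://www.taobao.com')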

9. Authentication

# Use the authentication support built into requests
import requests
from requests.auth import HTTPBasicAuth

r = requests.get('http://localhost:5000', auth=HTTPBasicAuth('username', 'password'))
print(r.status_code)

# Pass a tuple instead; HTTPBasicAuth is used by default
import requests
 
r = requests.get('http://localhost:5000', auth=('username', 'password'))
print(r.status_code)

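Part 3. Regular expressions

1. Common matching rules

Pattern	Description
\w	Matches a letter, digit, or underscore
\W	Matches any character that is not a letter, digit, or underscore
\s	Matches any whitespace character; equivalent to [ \t\n\r\f\v]
\S	Matches any non-whitespace character
\d	Matches any digit; equivalent to [0-9]
\D	Matches any non-digit character
\A	Matches the start of the string
\Z	Matches the end of the string (in Perl-style engines, before a trailing newline if there is one; Python's re matches only at the very end)
\z	Matches the very end of the string, including after a trailing newline (a Perl-style anchor; not available in Python's re before 3.14)
\G	Matches the position where the previous match ended (not supported by Python's re)
\n	Matches a newline
\t	Matches a tab
^	Matches the start of the string (and of each line with re.M)
$	Matches the end of the string (and of each line with re.M)
.	Matches any character except a newline; with re.DOTALL it matches any character including a newline
[...]	A set of characters listed individually; for example, [amk] matches a, m, or k
[^...]	Any character not in the set; for example, [^abc] matches any character other than a, b, or c
*	Matches zero or more of the preceding expression
+	Matches one or more of the preceding expression
?	Matches zero or one of the preceding expression; placed after another quantifier (*?, +?, ??) it makes that quantifier non-greedy
{n}	Matches exactly n of the preceding expression
{n,m}	Matches n to m of the preceding expression, greedily
a|b	Matches a or b
( )	Matches the expression inside the parentheses and also marks a group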
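
The difference between greedy and non-greedy matching is easiest to see side by side. A minimal sketch with a made-up sample string:

```python
import re

content = "Hello 1234567 World_This is a Regex Demo"

# Greedy .* grabs as much as it can, leaving \d+ only the last digit.
greedy = re.match(r"^He.*(\d+).*Demo$", content)

# Non-greedy .*? grabs as little as it can, so \d+ gets the whole number.
lazy = re.match(r"^He.*?(\d+).*Demo$", content)

print(greedy.group(1))
print(lazy.group(1))
```

The greedy pattern captures only 7, while the non-greedy one captures 1234567, which is usually what is wanted when extracting a value from the middle of a string.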

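2. Flags

Flag	Description
re.I	Makes matching case-insensitive
re.L	Makes matching locale-aware
re.M	Multi-line matching; affects ^ and $
re.S	Makes . match every character, including a newline
re.U	Interprets characters according to the Unicode character set; affects \w, \W, \b, and \B (already the default for str patterns in Python 3)
re.X	Allows a more flexible layout, with whitespace and comments, so that the regular expression is easier to read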
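
A short sketch of how flags change a match, using a made-up HTML fragment. Flags are passed as the last argument and can be combined with |.

```python
import re

html = """<li>
<a href="/2.mp3">First Song</a>
</li>"""

pattern = r"<li>.*?<a.*?>(.*?)</a>"

# Without re.S, . cannot cross the newline after <li>, so nothing matches.
without_flag = re.search(pattern, html)

# With re.S, . also matches newlines and the link text is captured.
with_dotall = re.search(pattern, html, re.S)

print(without_flag)
print(with_dotall.group(1))

# re.I ignores case and re.M lets ^ match at the start of every line.
combined = re.findall(r"^first\b", "First\nfirst", re.I | re.M)
print(combined)
```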

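3. Common regex functions

match() tries to match the regular expression from the start of the string. Its first argument is the regular expression and its second is the string to match. On the result, group() returns the matched text and span() returns the range of the match.
search() scans the whole string and returns the first successful match.
findall() searches the whole string and returns everything that matches the regular expression.
sub() replaces every match with a replacement string; for example, it can remove all digits from a piece of text.
compile() compiles a regular expression string into a pattern object so that it can be reused in later matches.
split() splits the string wherever the regular expression matches and returns the pieces as a list.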
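
The six functions can be compared on small inputs. The sample strings below are made up for illustration.

```python
import re

content = "Hello 123 4567 World_This is a Regex Demo"

# match(): anchored at the start of the string.
m = re.match(r"^Hello\s(\d+)\s(\d+)", content)
print(m.group(), m.group(1), m.group(2), m.span())

# search(): first match anywhere in the string.
first_number = re.search(r"\d+", content).group()

# findall(): every non-overlapping match, as a list.
all_numbers = re.findall(r"\d+", content)

# sub(): replace every match; here, strip all digits.
no_digits = re.sub(r"\d+", "", "54aK54yr5oiR")

# compile(): build the pattern once, reuse it many times.
time_part = re.compile(r"\s\d{2}:\d{2}$")
dates = [time_part.sub("", s) for s in ("2016-12-15 12:00", "2016-12-17 12:55")]

# split(): cut the string wherever the pattern matches.
fields = re.split(r"[,;\s]+", "a, b;c  d")

print(first_number, all_numbers, no_digits, dates, fields)
```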

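Part 4. Parsing library: XPath

XPath stands for XML Path Language. It is a language for finding information in XML documents.

To parse a web page with XPath, first import the etree module from the lxml library, then declare a piece of HTML text and pass it to the HTML class. This produces an object that can be queried with XPath. The etree module also repairs malformed HTML automatically.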


from lxml import etree

# `text` is the sample HTML string defined in the basic-usage section below
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))

# Extract information with XPath rules
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//*')
print(result)

# Matching one of several attribute values with contains()
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)

# Matching on several attributes, combined with the and operator
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)

# Selecting nodes by position: pass an index in square brackets (XPath indexes start at 1)
html = etree.HTML(text)
result = html.xpath('//li[1]/a/text()')
print(result)
result = html.xpath('//li[last()]/a/text()')
print(result)
result = html.xpath('//li[position()<3]/a/text()')
print(result)
result = html.xpath('//li[last()-2]/a/text()')
print(result)

# Node axis selection: to be continued

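1. Common XPath rules

Expression	Description
nodename	Selects all child nodes of the named node
/	Selects direct children of the current node
//	Selects descendants of the current node
.	Selects the current node
..	Selects the parent of the current node
@	Selects attributes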
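
Each rule in the table can be tried on a small document. The snippet below is a sketch using a made-up two-item list; it assumes lxml is installed.

```python
from lxml import etree

# A small hypothetical document, just enough to exercise each rule.
text = """
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
    </ul>
</div>
"""
html = etree.HTML(text)

all_li = html.xpath("//li")            # // : descendants at any depth
direct_links = html.xpath("//li/a")    # /  : direct children only
hrefs = html.xpath("//li/a/@href")     # @  : attribute values
parent_class = html.xpath('//a[@href="link2.html"]/../@class')  # .. : parent node
texts = html.xpath("//li/a/text()")    # text() : text content of the node

print(len(all_li), len(direct_links), hrefs, parent_class, texts)
```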

2. Basic XPath usage

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

- All nodes, child nodes, parent nodes

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())

# Select all nodes
result = html.xpath('//*')
print(result)

result = html.xpath('//li')
print(result)
print(result[0])

# Select child nodes
result = html.xpath('//li/a')
print(result)

# Select the parent node
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)

- Attribute matching

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]')
print(result)

- Selecting text

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# text() returns only text directly inside <li>; the link text lives in <a>,
# so use '//li[@class="item-0"]/a/text()' to reach it
result = html.xpath('//li[@class="item-0"]/text()')
print(result)

# Matching one of several attribute values
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)

# Matching on several attributes
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)

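Part 5. Parsing library: Beautiful Soup

Beautiful Soup is a Python library for parsing HTML and XML that makes it easy to extract data from web pages. It automatically converts incoming documents to Unicode and outgoing documents to UTF-8.

1. Basic Beautiful Soup usage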

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

- Tag selectors: selecting elements

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(soup.head)
print(soup.p)    # if there are several p tags, only the first is returned

- Tag selectors: getting the name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)

- Tag selectors: getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])

- Child nodes and descendant nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)        # direct children, as a list

print(soup.p.children)        # direct children, as an iterator
for i, child in enumerate(soup.p.children):
    print(i, child)

print(soup.p.descendants)     # all descendants, as a generator
for i, child in enumerate(soup.p.descendants):
    print(i, child)

- Parent, ancestor, and sibling nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

print(soup.a.parent)                                # the parent node
print(list(enumerate(soup.a.parents)))              # all ancestor nodes

print(list(enumerate(soup.a.next_siblings)))        # all following siblings
print(list(enumerate(soup.a.previous_siblings)))    # all preceding siblings

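2. Parsers supported by Beautiful Soup

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents before Python 2.7.3 and Python 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires a C library to be installed
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only XML parser supported	Requires a C library to be installed
html5lib	BeautifulSoup(markup, "html5lib")	Best tolerance of malformed documents; parses the way a browser does; produces HTML5-format documents	Slow; requires an external Python package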

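3. Method selectors

find_all() searches the document by tag name, attributes, or text content:
find_all(name, attrs, recursive, text, **kwargs)
(In Beautiful Soup 4.4 and later the text argument is named string; text remains accepted for backward compatibility.)

import re

# Search by tag name
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))

# Search by attribute
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

# Search by text
print(soup.find_all(text=re.compile('link')))

find_all()                  # returns all matching elements
find()                      # returns the first matching element

find_parents()              # returns all ancestor nodes
find_parent()               # returns the direct parent node

find_next_siblings()        # returns all following siblings
find_next_sibling()         # returns the first following sibling

find_previous_siblings()    # returns all preceding siblings
find_previous_sibling()     # returns the first preceding sibling

find_all_next()             # returns all matching nodes after this node
find_next()                 # returns the first matching node after this node

find_all_previous()         # returns all matching nodes before this node
find_previous()             # returns the first matching node before this node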
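
A few of these methods in action, as a hedged sketch: the markup is made up, html.parser is chosen only so that no extra parser has to be installed, and the string argument assumes Beautiful Soup 4.4 or later.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = """
<div class="panel">
  <ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
  <ul class="list list-small" id="list-2">
    <li class="element">Hello, this is a link</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

uls = soup.find_all(name="ul")                      # by tag name
by_id = soup.find_all(attrs={"id": "list-1"})       # by attribute
by_text = soup.find_all(string=re.compile("link"))  # by text content

first_li = soup.find("li")                          # find(): first match only
parent_ul = first_li.find_parent("ul")              # nearest enclosing <ul>
next_li = first_li.find_next_sibling("li")          # next <li> at the same level

print(len(uls), by_id[0]["id"], by_text, parent_ul["id"], next_li.get_text())
```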

4. CSS selectors

Pass a CSS selector directly to select() to make the selection.

html= '''
<div class='panel'>
    <div class='panel-heading'>
        <h4>Hello</h4>
    </div>    
    <div class='panel-body'>
        <ul class='list' id='list-1'>
            <li class='element'>Foo</li>
            <li class='element'>Bar</li>
            <li class='element'>Jay</li>
        </ul>
        <ul class='list list-small' id='list-2'>
            <li class='element'>Foo</li>
            <li class='element'>Bar</li>
        </ul>
    </div>
</div>
'''

- Selecting elements

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))

- Selecting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

- Selecting text

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())

Part 6. Parsing library: pyquery

html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''

1. Initialization

# Initialize from a string
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('li'))

# Initialize from a URL
from pyquery import PyQuery as pq
doc = pq(url='https://cuiqingcai.com')
print(doc('title'))

# Initialize from a file
from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
print(doc('li'))

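2. CSS selectors

- Getting elements

from pyquery import PyQuery as pq
doc = pq(html)

# The selectors below assume the <ul> carries class="list" and sits inside
# an element with class="wrap"; the sample above omits both attributes

# Child elements
items = doc('.list')
lis = items.find('li')          # all descendants matching li

lis = items.children()          # direct children only
lis = items.children('.active')
print(lis)

# Parent elements
items = doc('.list')
container = items.parents()     # all ancestors
print(container)

parent = items.parents('.wrap')
print(parent)

# Sibling elements
li = doc('.list .item-0.active')
print(li.siblings())
print(li.siblings('.active'))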

- Getting attributes

from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.attr.href)
print(a.attr('href'))

- Getting text

from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.text())

- Getting HTML

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.html())
