Python Web Scraping for Beginners (23): Getting Started with the pyquery Parsing Library

from pyquery import PyQuery

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
'''

d = PyQuery(html)
print(d('p'))

The output is as follows:

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

The example above initializes PyQuery directly from a string. It also supports initialization by passing in a URL:

d_url = PyQuery(url='https://www.geekdigging.com/', encoding='UTF-8')
print(d_url('title'))

The output is as follows:

<title>極客挖掘機</title>

Written this way, PyQuery first requests the URL and then initializes itself with the HTML from the response, which is essentially the same as writing:

import requests

# Fetch the page ourselves, then hand the decoded text to PyQuery
r = requests.get('https://www.geekdigging.com/')
r.encoding = 'UTF-8'
d_requests = PyQuery(r.text)
print(d_requests('title'))

CSS Selectors

Let's first get a feel for CSS selectors, which are really simple and convenient to use:

d_css = PyQuery(html)
print(d_css('.story .sister'))
print(type(d_css('.story .sister')))

The output is as follows:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
<class 'pyquery.pyquery.PyQuery'>

The selector here means: first find the nodes whose class is story, then look among their descendants (not only their direct children) for nodes whose class is sister.

The last line of the output shows that the type is still pyquery.pyquery.PyQuery, which means we can keep querying this result.

Finding Nodes

Next, let's look at the commonly used lookup functions. The best thing about them is that they are used in the same way as their jQuery counterparts.

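find(): finds all descendant nodes of a node.
children(): finds only the direct child nodes.
parent(): finds the parent node.
parents(): finds the ancestor nodes.
siblings(): finds the sibling nodes.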

Here are a few simple examples:

# Find descendant nodes (find() searches all descendants, not only children)
items = d('body')
print('Descendants:', items.find('p'))
print(type(items.find('p')))

# Find the parent node
items = d('#link1')
print('Parent:', items.parent())
print(type(items.parent()))

# Find sibling nodes
items = d('#link1')
print('Siblings:', items.siblings())
print(type(items.siblings()))

The output is as follows:

Descendants: <p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

<class 'pyquery.pyquery.PyQuery'>
Parent: <p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<class 'pyquery.pyquery.PyQuery'>
Siblings: <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
<class 'pyquery.pyquery.PyQuery'>

Iteration

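As the examples above show, a pyquery selection that matches several nodes is still a single PyQuery object. Unlike Beautiful Soup, it does not hand back a plain list of node objects: iterating over it directly yields raw lxml elements. To keep working with each matched node as a PyQuery object, iterate over the result with the items() method: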

a = d('a')
for item in a.items():
    print(item)

The output is as follows:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.

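Calling items() returns a generator. Iterating over it yields the a nodes one by one, and each of them is itself a PyQuery object. Each a node can then use the methods described earlier to make further selections, such as querying its child nodes or looking up an ancestor, which is very flexible.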

Extracting Information

Once we have the nodes, the next step is to extract the information we need.

Extraction falls into two parts: getting a node's text and getting a node's attributes.

Getting Text

a_1 = d('#link1')
print(a_1.text())

The output is as follows:

Elsie

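To get the HTML inside a node, use the html() method. When the selection matches several nodes, as .story does here, html() returns the inner HTML of the first one only: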

a_2 = d('.story')
print(a_2.html())

The output is as follows:

Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.

Getting Attributes

Once we have a node, we can use attr() to get its attributes:

attr_1 = d('#link1')
print(attr_1.attr('href'))

The output is as follows:

http://example.com/elsie

Besides the attr() method, pyquery also provides an attr attribute, so the example above can also be written like this:

print(attr_1.attr.href)

The result is the same as in the example above.

Summary

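That covers all the parsing libraries we installed during the setup stage. To compare them: Beautiful Soup is friendly to beginners and can be picked up without much extra background, although parsing a complex DOM still calls for some grounding in CSS selectors. If you are fluent in XPath, using lxml directly is the most convenient option. And if, like me, you are already comfortable with jQuery and CSS selectors, pyquery is a very good choice.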

Next, I plan to share a few simple hands-on examples. Stay tuned!