A Handy Tool for Python Web Scraping: Beautiful Soup

Date: 2020-03-17

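Beautiful Soup is a Python library for extracting data from HTML and XML files. Working with an HTML page through it is about as convenient as manipulating the HTML DOM tree from JavaScript. An official Chinese translation of the documentation is also available.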

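1. Installation

1.1 Installing Beautiful Soup

Beautiful Soup 3 is no longer maintained, so Beautiful Soup 4 is recommended. It now lives in the bs4 package, so it must be imported from bs4. Install it as follows:

# Install with pip
pip install beautifulsoup4

# Install with easy_install
easy_install beautifulsoup4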

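1.2 Installing a parser

You will also want a parser: either lxml or html5lib will do. (Python's built-in html.parser can also be used without installing anything extra.)

# Install lxml
pip install lxml

# Install html5lib
pip install html5lib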

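1.3 Usage

Once Beautiful Soup is installed it can be imported and used. Pass a document to the BeautifulSoup constructor to get a document object. The document can be given as a string or as an open file handle.

# First import from bs4
from bs4 import BeautifulSoup

# Initialize with an HTML document and the name of a parser
soup = BeautifulSoup(open("index.html"), 'lxml')
content = '<html>data</html>'
soup = BeautifulSoup(content, 'lxml')

The document is first converted to Unicode, and HTML entities are converted to the corresponding Unicode characters.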
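
As a small illustration of the entity conversion described above, the sketch below parses a snippet that contains two HTML entities. The sample markup is made up for this example, and the built-in html.parser is used only so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup containing two HTML entities
markup = "<p>caf&eacute; &amp; bar</p>"
soup = BeautifulSoup(markup, "html.parser")

text = soup.p.string
print(text)        # the entities appear as ordinary Unicode characters
print(type(text))  # NavigableString, which is a subclass of str
```

When the tree is written back out as markup, characters such as & are escaped again, so the round trip stays valid HTML.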

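2. Objects in Beautiful Soup

Beautiful Soup converts a complex HTML document into a tree structure, similar to the DOM tree in a browser. Every node is a Python object, and all of them fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.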

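2.1 The Tag object

A Tag object corresponds to a tag in the original XML or HTML document, such as body, div, a, or span. A Tag has many methods and attributes, and its HTML attributes can be added, removed, changed, and read as if the tag were a dictionary.

2.1.1 The name attribute

The name attribute holds the tag's name and is read with .name. Changing a tag's name affects all HTML markup subsequently generated from the current Beautiful Soup object.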
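
A minimal sketch of renaming a tag, assuming a made-up one-tag document and the built-in html.parser; it shows that the new name appears in the markup generated afterwards.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="note">bold</b>', "html.parser")
tag = soup.b
original_name = tag.name  # the name as parsed

tag.name = "strong"       # rename the tag in place
renamed = str(soup)       # markup generated after the rename
print(original_name, "->", renamed)
```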

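2.1.2 Attributes

A tag may have any number of attributes. tag.attrs returns all of them as a dictionary, and they can be added, removed, changed, and read. Ways to read them:

tag.attrs: the dictionary of all attributes
tag.attrs['href']: the href entry of that dictionary
tag.get('href'): the href attribute, or None if the tag has no such attribute
tag['href']: the href attribute; raises KeyError if the tag has no such attribute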
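
The add, change, read, and delete operations mentioned above could look like the following sketch. The markup and attribute values are invented for illustration, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="index.html"></a>', "html.parser")
tag = soup.a

tag["id"] = "home"           # add a new attribute
tag["href"] = "about.html"   # change an existing attribute
href = tag["href"]           # read an attribute
missing = tag.get("title")   # .get() returns None instead of raising KeyError
after_add = str(tag)

del tag["id"]                # delete an attribute
after_delete = str(tag)

print(after_add)
print(after_delete)
```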

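2.1.3 Multi-valued attributes

Some HTML attributes, class being the typical one, can hold several values. For these attributes Beautiful Soup returns a list rather than a string. The multi-valued attributes are:

class
rel
rev
accept-charset
headers
accesskey

XML documents have no multi-valued attributes.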

content = '<a href="index.html" class="button button-blue" data="1 2 3"></a>'
soup = BeautifulSoup(content, 'lxml')
tag = soup.a  # get the a tag
tag.name  # tag name: 'a'
tag.attrs  # attribute dict: {'href': 'index.html', 'class': ['button', 'button-blue'], 'data': '1 2 3'}
tag.get('href')  # the href attribute: 'index.html'
tag['class']  # class is returned as a list: ['button', 'button-blue']
tag['data']  # data is returned as a plain string: '1 2 3'
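
To complement the example above, here is a short sketch of how a multi-valued attribute is written back out. The markup is hypothetical and html.parser is assumed. It shows that a list assigned to class is joined with spaces in the generated markup, while an id containing a space stays a single string.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
p = soup.p

classes = list(p["class"])      # copy of the class list
element_id = p["id"]            # id is not multi-valued, so this is one string

p["class"] = classes + ["new"]  # assign a longer list of class names
rendered = str(p)               # the list is joined with spaces on output
print(rendered)
```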

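2.2 The NavigableString object

Strings usually sit inside tags, and Beautiful Soup wraps the string inside a tag in the NavigableString class.

Use tag.string to get the string inside a tag; the result is a NavigableString
Use str(tag.string) to convert it to an ordinary Unicode string (unicode(tag.string) in Python 2)
The string inside a tag cannot be edited in place
tag.string.replace_with('content') replaces the string inside the tag
A string cannot contain anything, so it does not support the .contents or .string attributes or the find() method
To use a NavigableString outside Beautiful Soup, convert it with str() (unicode() in Python 2); otherwise it keeps a reference to the entire parse tree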
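
The points above can be sketched as follows, assuming Python 3, a made-up one-tag document, and html.parser. The sketch reads the string, converts it to a plain str, and then replaces it.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>old text</b>", "html.parser")

inner = soup.b.string
is_navigable = isinstance(inner, NavigableString)

plain = str(inner)  # ordinary str with no link back to the parse tree

soup.b.string.replace_with("new text")  # swap the string inside the tag
replaced = str(soup.b)
print(plain, "->", replaced)
```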

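2.3 The BeautifulSoup object

The BeautifulSoup object represents the whole document. Most of the time it can be treated as a Tag object.

Because the BeautifulSoup object does not correspond to a real HTML or XML tag, its .name attribute has the special value "[document]".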
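
A brief sketch of treating the BeautifulSoup object like a tag, using an invented two-paragraph document and html.parser (which, unlike lxml, does not wrap the fragment in html and body tags).

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

doc_name = soup.name                                 # the special document name
paragraphs = [p.string for p in soup.find_all("p")]  # searching works as on a tag
top_level = [child.name for child in soup.children]  # so does navigation
print(doc_name, paragraphs, top_level)
```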

2.4 The Comment object

A Comment object is a special kind of NavigableString that represents a comment in the document.

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'lxml')  # name the parser explicitly to avoid a warning
comment = soup.b.string
type(comment)  # <class 'bs4.element.Comment'>
print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>
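
Because Comment is its own class, comments can be told apart from ordinary text. The sketch below, on a made-up snippet parsed with html.parser, collects every comment and removes it from the tree. This is one possible approach, not the only one.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<p>visible<!-- hidden note --> text</p>"
soup = BeautifulSoup(markup, "html.parser")

# Collect first, then remove, so the tree is not modified while iterating
comments = [node for node in soup.descendants if isinstance(node, Comment)]
comment_texts = [str(c) for c in comments]

for c in comments:
    c.extract()  # detach the comment from the tree

cleaned = str(soup)
print(comment_texts, cleaned)
```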

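3. Navigating the tree

Navigating the tree lets you find specific content in a document.

3.1 Child nodes

A tag may contain several strings or other tags; these are the tag's children. Beautiful Soup provides many attributes for working with and iterating over children. Note that:

Strings have no children.
The BeautifulSoup object itself has children.

There are several simple ways to get children:

By tag name: gets the first tag with that name below the node (all descendants are searched, not only direct children): soup.div.p
The contents attribute: a list of all direct children: soup.div.contents
The children attribute: an iterator over the direct children: soup.div.children
The descendants attribute: iterates recursively over all of a tag's descendants: soup.div.descendants
The string attribute: for a tag whose only child is a string, gets that string: p.string
The strings attribute: iterates over the strings when there are several: div.strings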

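# The markup is kept on one line so that whitespace between tags
# does not show up as extra string children
div_html = '<div><p>uu</p><p>sa</p><p><a>ma</a>--></p></div>'
soup = BeautifulSoup(div_html, 'lxml')
div = soup.div  # get the div node
div.p  # <p>uu</p>

div.contents
# [<p>uu</p>, <p>sa</p>, <p><a>ma</a>--&gt;</p>]  (the > in the text is escaped on output)
div.contents[0]  # <p>uu</p>

for child in div.children:
    print(child)
# <p>uu</p>
# <p>sa</p>
# <p><a>ma</a>--&gt;</p>

for child in div.descendants:
    print(child)  # every tag and string below div, depth first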
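
The string and strings attributes from the list above are not exercised in the example, so here is a separate sketch on a made-up snippet parsed with html.parser. It also uses stripped_strings, a related Beautiful Soup generator that is not mentioned above and that skips whitespace-only strings and trims the rest.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>uu</p><p> sa </p></div>", "html.parser")
div = soup.div

single = div.p.string                 # the first p has exactly one string child
ambiguous = div.string                # div has two children, so this is None
all_strings = list(div.strings)       # every string below div, untouched
trimmed = list(div.stripped_strings)  # same strings with whitespace trimmed
print(single, ambiguous, all_strings, trimmed)
```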

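3.2 Parent nodes

Every tag and string has a parent, that is, every node is contained in some other node. The .parent attribute gives an element's parent (for example p.parent), and the .parents attribute iterates over all of its ancestors.

soup = BeautifulSoup(div_html, 'lxml')  # div_html as defined in 3.1
div = soup.div  # get the div node
sa = div.a.string  # the string of the first a node

sa.parent  # the a node
sa.parent.parent  # the p node that contains a
for parent in sa.parents:
    print(parent.name)
# a
# p
# div
# body       (lxml wraps the fragment in html and body)
# html
# [document]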

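3.3 Sibling nodes

Sibling nodes are nodes at the same level that share a parent. For example, the three p tags in div_html from 3.1 are siblings of one another. The following attributes access siblings:

next_sibling: the next sibling of the current node
previous_sibling: the previous sibling of the current node
next_siblings: all siblings after the current node
previous_siblings: all siblings before the current node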
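
A minimal sketch of the four sibling attributes, assuming a made-up div with three paragraphs written without whitespace between the tags (otherwise the whitespace strings would count as siblings too) and html.parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>uu</p><p>sa</p><p>ma</p></div>", "html.parser")
first = soup.div.p
last = soup.div.contents[-1]

next_one = first.next_sibling         # the second p
no_previous = first.previous_sibling  # None, the first p has nothing before it

following = [t.string for t in first.next_siblings]     # document order
preceding = [t.string for t in last.previous_siblings]  # nearest sibling first
print(next_one, no_previous, following, preceding)
```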

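4. Searching the tree

Searching is used in practically every scraper to locate specific nodes. Beautiful Soup defines many search methods whose parameters and usage are very similar. The discussion below focuses on find_all.

4.1 Filters

A filter is the matching rule that a search method applies, that is, the kind of value an argument may take. A filter can take any of the following forms:

String: find_all('div')
List: find_all(['div', 'span'])
Regular expression: find_all(re.compile('[a-z]{1,3}'))
True: matches every tag but no string nodes
Function: a callback used for matching; a node matches when the function returns True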

content = '<nav><a>a_1</a><a>a_2</a>string</nav>'
soup = BeautifulSoup(content, 'lxml')
nav_node = soup.nav
# <nav><a>a_1</a><a>a_2</a>string</nav>
nav_node.find_all(True)  # does not match the bare string
# [<a>a_1</a>, <a>a_2</a>]

def has_class_but_no_id(tag):  # define a matching function
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)  # tags that have a class attribute but no id attribute
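
To compare the five filter forms side by side, the sketch below runs each of them against the same made-up snippet, parsed with html.parser so that no html or body tags are added. The helper names() is hypothetical and only shortens the output.

```python
import re
from bs4 import BeautifulSoup

html = '<div id="a" class="box"><span class="x">s</span><b>b</b><p class="note">p</p></div>'
soup = BeautifulSoup(html, "html.parser")

def names(tags):
    """Hypothetical helper: reduce a result list to tag names."""
    return [t.name for t in tags]

def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")

by_string = names(soup.find_all("span"))                 # exact tag name
by_list = names(soup.find_all(["span", "b"]))            # any name in the list
by_regex = names(soup.find_all(re.compile("^b")))        # names starting with b
by_true = names(soup.find_all(True))                     # every tag, no strings
by_function = names(soup.find_all(has_class_but_no_id))  # callback decides
print(by_string, by_list, by_regex, by_true, by_function)
```

Regular expressions are applied with search semantics, so a pattern without an anchor such as ^ matches anywhere in the tag name.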

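4.2 The find_all() and find() methods

find_all returns a list of all matching nodes (an empty list if there are none), while find returns the first match directly (None if there is none). Their signatures are:

find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
find(name=None, attrs={}, recursive=True, text=None, **kwargs)

The parameters mean:

name: matches the tag name
attrs: a dictionary of attribute filters, for example find_all(attrs={'href': 'index.html'}); a keyword argument such as find_all(href='index.html') has the same effect
text: matches string content in the document (called string in Beautiful Soup 4.4.0 and later, where text remains as the older name)
limit: the maximum number of results to return
recursive: True by default; False searches direct children only

The name, text, and attribute values can each be any of the filter types described in 4.1. Also keep the following in mind:

attrs is a dictionary and can combine several attribute conditions
When filtering on the class attribute with a keyword argument, use class_, because class is a reserved word in Python
class is a multi-valued attribute, so a search is matched against each individual CSS class name
When matching the full class string exactly, the class names must be in the same order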

import re  # needed for re.compile below

# Find all div tags
soup.find_all('div')
# Find all nodes whose id attribute is link1 or link2
soup.find_all(id=['link1', 'link2'])
# Find all nodes whose class attribute contains button
soup.find_all(class_='button')
# Find all p tags whose text matches the given regular expression
soup.find_all('p', text=re.compile('game'))
# Find a tags that have the button class and an href attribute equal to link1
soup.find_all('a', {'class': 'button', 'href': 'link1'})
# Find at most one a tag, looking only at direct children
soup.find_all('a', limit=1, recursive=False)
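
The calls above assume an existing soup. The following self-contained sketch runs similar searches against an invented document so the effect of class_, attrs, limit, and recursive can be checked. It assumes html.parser and Beautiful Soup 4.4.0 or later, since it uses the string argument rather than text. The helper ids() is hypothetical.

```python
import re
from bs4 import BeautifulSoup

html = (
    '<div id="menu">'
    '<a class="button button-blue" href="link1" id="link1">one</a>'
    '<a class="button" href="link2" id="link2">two</a>'
    '<p>a game of chance</p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

def ids(tags):
    """Hypothetical helper: reduce a result list to id values."""
    return [t.get("id") for t in tags]

any_button = ids(soup.find_all(class_="button"))               # matches each class name
exact_order = ids(soup.find_all(class_="button button-blue"))  # full string, same order
wrong_order = ids(soup.find_all(class_="button-blue button"))  # order differs
combined = ids(soup.find_all("a", attrs={"class": "button", "href": "link1"}))
by_id_list = ids(soup.find_all(id=["link1", "link2"]))
limited = ids(soup.find_all("a", limit=1))                     # stop after one result
top_only = soup.find_all("a", recursive=False)                 # soup's only direct child is div
direct = ids(soup.div.find_all("a", recursive=False))          # direct children of div
game = soup.find_all("p", string=re.compile("game"))
nothing = soup.find("table")                                   # no match gives None
print(any_button, exact_order, wrong_order, combined, limited, direct)
```

Note that recursive=False is relative to the node the search is called on, which is why the same query finds nothing on soup but finds both links on soup.div.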