Python Web Scraping (3): The BeautifulSoup Library

Date: 2021-08-13

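BeautifulSoup is a Python library for extracting data from HTML or XML files. It turns an HTML or XML document into a navigable tree and provides ways to navigate, search, and modify that tree. It automatically converts incoming documents to Unicode and outgoing documents to UTF-8.

BeautifulSoup supports the HTML parser in the Python standard library as well as several third-party parsers. If you do not name a parser, it picks the best one installed (lxml first, then html5lib, then the standard library's html.parser). The standard library parser is comparatively slow, so if you parse large amounts of data or parse frequently, the faster and more capable lxml parser is recommended.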

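1 Installation

1) Installing BeautifulSoup

On Debian or Ubuntu you can install it through the system package manager: apt-get install python3-bs4 (the package is called python-bs4 for Python 2). If the system package manager is not an option, use pip install beautifulsoup4.

2) Installing third-party parsers

To use the third-party parsers lxml or html5lib, install them with apt-get install python3-lxml (or python3-html5lib), or with pip install lxml (or pip install html5lib).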

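The main parsers and their advantages and disadvantages:

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents.	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2.
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents.	Requires a C library.
lxml XML parser	BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")	Fast; the only supported XML parser.	Requires a C library.
html5lib	BeautifulSoup(markup, "html5lib")	Best tolerance; parses documents the way a browser does; produces valid HTML5.	Slow; requires an extra Python package (though no C extension).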
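One way to act on the table is to prefer lxml when it is installed and fall back to the standard library parser otherwise. The snippet below is only a sketch of that idea; the helper name pick_parser is hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def pick_parser():
    """Return "lxml" if it can be imported, else "html.parser" (hypothetical helper)."""
    try:
        import lxml  # noqa: F401  (only checking that the package is installed)
        return "lxml"
    except ImportError:
        return "html.parser"


parser = pick_parser()
soup = BeautifulSoup("<title>Demo</title>", parser)
print(parser, soup.title.string)
```

Naming the parser explicitly keeps results the same from one machine to another, because different parsers can build different trees from the same malformed markup.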

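2 Quick Start

Pass a document to the BeautifulSoup constructor to get a document object. You can pass either a string or an open file handle, as shown below.

1) From a string

Take the following HTML string as an example: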

html = '''<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Learning BeautifulSoup</title>
</head>
<body>Hello BeautifulSoup</body>
</html>'''

It is used like this:

from bs4 import BeautifulSoup

# Use the standard library parser
soup = BeautifulSoup(html, 'html.parser')
# Use the lxml parser
soup = BeautifulSoup(html, 'lxml')

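2) From a local file

Using the same HTML as above, save the string to a file named index.html and parse it like this:

# Use the standard library parser
with open('index.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')
# Use the lxml parser
with open('index.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')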
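To try the file-based form without creating index.html by hand, the sketch below writes a small document to a temporary directory first. The temporary path and the shortened markup are assumptions made for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body>Hello BeautifulSoup</body></html>"

# Write the markup to a temporary file that stands in for index.html.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)

    # Pass the open file handle to BeautifulSoup; the with-block closes it afterwards.
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

print(soup.title.string)
print(soup.body.get_text())
```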

2.1 Kinds of Objects

BeautifulSoup converts an HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

1) Tag objects

A Tag object corresponds to a tag in the original HTML or XML document, for example:

soup = BeautifulSoup('<title>Learning BeautifulSoup</title>', 'lxml')
tag = soup.title
tp = type(tag)
print(tag)
print(tp)
# Output:
# <title>Learning BeautifulSoup</title>
# <class 'bs4.element.Tag'>

A Tag has many methods and attributes. For now, look at the two used most often: name and attributes.

The name of a tag is available through .name, for example:

soup = BeautifulSoup('<title>Learning BeautifulSoup</title>', 'lxml')
tag = soup.title
print(tag.name)
# Output:
# title

The name of a tag can also be changed, for example:

tag.name = 'title1'
print(tag)
# Output:
# <title1>Learning BeautifulSoup</title1>

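A tag may have many attributes. Take the class attribute first. Attributes are accessed the same way as dictionary entries, for example:

soup = BeautifulSoup('<title class="tl">Learning BeautifulSoup</title>', 'lxml')
tag = soup.title
cls = tag['class']
print(cls)
# Output:
# ['tl']

All attributes are also available through .attrs, for example:

ats = tag.attrs
print(ats)
# Output:
# {'class': ['tl']}

The attributes of a tag can be added, modified, and deleted, for example:

# Add an id attribute
tag['id'] = '1'
# Modify the class attribute
tag['class'] = 'tl1'
# Delete the class attribute
del tag['class']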
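The following sketch puts the dictionary-style operations together in one run. It also uses .get() and .has_attr(), which avoid a KeyError when an attribute may be missing. The markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="link1" class="btn primary" href="/home">Home</a>', "html.parser")
tag = soup.a

classes_before = tag["class"]       # class is multi-valued, so this is a list
href = tag["href"]                  # single-valued attributes come back as strings
title_before = tag.get("title")     # .get() returns None for a missing attribute
had_id = tag.has_attr("id")

tag["title"] = "Go home"            # add
tag["class"] = ["btn"]              # modify
del tag["id"]                       # delete

print(classes_before, href, title_before, had_id)
print(tag.attrs)
```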

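2) NavigableString objects

The NavigableString class wraps the string content inside a tag. Use .string to get that content, for example:

text = tag.string

The replace_with() method replaces the existing string with other content, for example:

tag.string.replace_with('BeautifulSoup')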
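A short sketch of both operations on a fresh document follows. It assumes the html.parser backend and shows that a NavigableString is a str subclass, so it can be copied to a plain str before the tree is changed.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<title>Learning BeautifulSoup</title>", "html.parser")
tag = soup.title

text = tag.string
is_navigable = isinstance(text, NavigableString)
is_str = isinstance(text, str)      # NavigableString subclasses str

plain = str(text)                   # plain copy that no longer points into the tree

# Swap the string inside the tag for new content.
tag.string.replace_with("BeautifulSoup")
print(plain)
print(tag)
```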

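3) BeautifulSoup objects

A BeautifulSoup object represents the document as a whole. It is not a real HTML or XML tag, so it has no name or attributes of its own. Because looking at .name is sometimes convenient, the BeautifulSoup object carries a special .name whose value is [document], for example:

soup = BeautifulSoup('<title class="tl">Learning BeautifulSoup</title>', 'lxml')
print(soup.name)
# Output:
# [document]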

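4) Comment objects

A Comment object is a special type of NavigableString that is rendered with its own formatting. For comparison, start with an ordinary string printed through prettify():

soup = BeautifulSoup('<title class="tl">Hello BeautifulSoup</title>', 'html.parser')
pretty = soup.title.prettify()
print(pretty)
# Output:
# <title class="tl">
#  Hello BeautifulSoup
# </title>

In the examples so far, the string inside the tag was never a comment. Now replace the string with a comment and see what happens:

soup = BeautifulSoup('<title class="tl"><!--Hello BeautifulSoup--></title>', 'html.parser')
content = soup.title.string
print(content)
# Output:
# Hello BeautifulSoup

The result shows that the comment markers <!-- --> were removed automatically. Keep this in mind: .string returns a Comment object here, and its printed value looks just like ordinary text.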
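Because a comment prints like normal text, it can help to check the type before using the value. The sketch below tells the two apart with isinstance and filters comments out of a list of strings. The extra p element is added only for illustration, and string= is the keyword that newer Beautiful Soup 4 releases use for the text filter.

```python
from bs4 import BeautifulSoup, Comment

markup = '<title class="tl"><!--Hello BeautifulSoup--></title><p>Visible text</p>'
soup = BeautifulSoup(markup, "html.parser")

content = soup.title.string
is_comment = isinstance(content, Comment)
print(type(content), is_comment)

# Keep only real text nodes and skip comments.
visible = [s for s in soup.find_all(string=True) if not isinstance(s, Comment)]
print(visible)
```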

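2.2 Searching the Tree

BeautifulSoup defines many search methods. Let us go through them.

1) find_all()

find_all() searches all tag descendants of the current tag. Its signature is find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs). Newer Beautiful Soup 4 releases call the text parameter string; text is the older spelling. The parameters are described below.

The name parameter finds all tags whose name is name; string objects are ignored automatically, for example:

soup = BeautifulSoup('<title class="tl">Hello BeautifulSoup</title>', 'html.parser')
print(soup.find_all('title'))
# Output:
# [<title class="tl">Hello BeautifulSoup</title>]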

The attrs parameter takes a dictionary and finds tags that carry the given attributes, for example:

soup = BeautifulSoup('<title class="tl">Hello BeautifulSoup</title>', 'html.parser')
soup.find_all(attrs={"class": "tl"})

By default, find_all() examines every descendant of the current tag. Setting recursive=False restricts the search to the direct children of the tag, for example:

soup = BeautifulSoup('<html><head><title>Hello BeautifulSoup</title></head></html>', 'html.parser')
print(soup.find_all('title', recursive=False))
# Output:
# []

The text parameter searches the string content of the document. It accepts a string, a regular expression, a list, or True. The values are matched against string content, not tag names, for example:

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup('<head>myHead</head><title>BeautifulSoup</title>', 'html.parser')
# String
soup.find_all(text='BeautifulSoup')
# Regular expression
soup.find_all(text=re.compile('Soup'))
# List
soup.find_all(text=['myHead', 'BeautifulSoup'])
# True
soup.find_all(text=True)
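The same four filters can be written with the string keyword, which is the name newer Beautiful Soup 4 releases use for this parameter. The sketch below is a minimal example assuming such a release. The sample strings are chosen so that each filter matches something, and the last call shows that adding a tag name returns tags rather than strings.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup("<head>myHead</head><title>BeautifulSoup</title>", "html.parser")

exact = soup.find_all(string="BeautifulSoup")                    # exact match
by_regex = soup.find_all(string=re.compile("Soup"))              # regular expression
by_list = soup.find_all(string=["myHead", "BeautifulSoup"])      # any item in the list
everything = soup.find_all(string=True)                          # every string

# With a tag name as well, the result is the tags whose .string matches.
tags = soup.find_all("title", string="BeautifulSoup")
print(exact, by_regex, by_list, everything, tags)
```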

The limit parameter works like the LIMIT keyword in SQL and caps the number of results returned, for example:

soup = BeautifulSoup('<a id="link1" href="http://example.com/elsie">Elsie</a><a id="link2" href="http://example.com/elsie">Elsie</a>', 'html.parser')
soup.find_all('a', limit=1)

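Python code often uses two kinds of variable arguments, *args and **kwargs. *args collects a variable number of positional arguments into a tuple passed to the function; **kwargs collects keyword arguments, given as key-value pairs, into a dict passed to the function.

Passing several named arguments filters on several attributes of a tag at once, for example:

soup = BeautifulSoup('<a id="link1" href="http://example.com/elsie">Elsie</a><a id="link2" href="http://example.com/elsie">Elsie</a>', 'html.parser')
soup.find_all(href=re.compile("elsie"), id='link1')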

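Some tag attributes cannot be used as keyword arguments in a search, such as the data-* attributes in HTML5, for example:

soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
soup.find_all(data-foo='value')  # raises SyntaxError

As soon as I typed data-foo='value' in PyCharm it flagged a syntax error. I ignored the warning and ran the code anyway, and got SyntaxError: keyword can't be an expression (newer Python versions word this message differently). This confirms that data-* attributes cannot be used as keyword arguments in a search. Instead, pass a dictionary through the attrs parameter of find_all() to search for tags with such attributes, for example:

print(soup.find_all(attrs={'data-foo': 'value'}))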
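A runnable sketch of the attrs workaround follows, using a second div (added for illustration) that should not match. It also shows the same lookup written as a CSS attribute selector through .select(), covered in section 2.3.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div data-foo="value">foo!</div><div data-foo="other">bar</div>',
    "html.parser",
)

# A dictionary key may contain a hyphen, unlike a Python keyword argument.
matches = soup.find_all(attrs={"data-foo": "value"})

# The same lookup as a CSS attribute selector.
via_css = soup.select('div[data-foo="value"]')

print(matches)
print(via_css)
```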

2) find()

Its signature is find(name=None, attrs={}, recursive=True, text=None, **kwargs). Apart from the missing limit parameter, its parameters are the same as those of find_all(). The differences are in the return value: find_all() returns a list, while find() returns the first matching node; when nothing matches, find_all() returns an empty list and find() returns None. Here is an example:

soup = BeautifulSoup('<a id="link1" href="http://example.com/elsie">Elsie</a><a id="link2" href="http://example.com/elsie">Elsie</a>', 'html.parser')
print(soup.find_all('a', limit=1))
print(soup.find('a'))
# Output:
# [<a href="http://example.com/elsie" id="link1">Elsie</a>]
# <a href="http://example.com/elsie" id="link1">Elsie</a>

The example also shows that find() returns the first node it finds.
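Since find() returns None when nothing matches, chaining another call onto its result can raise AttributeError. The sketch below checks for None first; the table lookup is a made-up example of a tag that is absent from the document.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="link1" href="http://example.com/elsie">Elsie</a>', "html.parser")

link = soup.find("a")
missing = soup.find("table")      # no <table> in the document, so this is None

# Guard against None before touching attributes or calling further methods.
href = link["href"] if link is not None else None
rows = missing.find_all("tr") if missing is not None else []

print(href, missing, rows)
```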

3) find_parents() and find_parent()

find_all() and find() search the descendants of the current node; find_parents() and find_parent() search its ancestors instead.
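As a small sketch with made-up nested markup: find_parent() returns the nearest matching ancestor, and find_parents() returns all matching ancestors, nearest first.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="outer"><div id="inner"><p><a id="link1">Elsie</a></p></div></div>',
    "html.parser",
)
link = soup.find("a")

nearest_div = link.find_parent("div")                       # closest enclosing <div>
all_div_ids = [d["id"] for d in link.find_parents("div")]   # every enclosing <div>
no_table = link.find_parent("table")                        # None when no ancestor matches

print(nearest_div["id"], all_div_ids, no_table)
```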

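4) find_next_siblings() and find_next_sibling()

These two methods iterate over the sibling tags that come after the current tag, using the .next_siblings attribute. find_next_siblings() returns all following siblings that match, while find_next_sibling() returns only the first following sibling that matches.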

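5) find_previous_siblings() and find_previous_sibling()

These two methods iterate over the sibling tags that come before the current tag, using the .previous_siblings attribute. find_previous_siblings() returns all preceding siblings that match, while find_previous_sibling() returns the first preceding sibling that matches.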
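The four sibling methods can be compared on one list. The markup below is invented for illustration. Note that the "previous" variants walk backwards, so the nearest sibling comes first.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<ul><li id="a">A</li><li id="b">B</li><li id="c">C</li><li id="d">D</li></ul>',
    "html.parser",
)
second = soup.find("li", id="b")
last = soup.find("li", id="d")

next_ids = [li["id"] for li in second.find_next_siblings("li")]
next_one = second.find_next_sibling("li")
prev_ids = [li["id"] for li in last.find_previous_siblings("li")]
prev_one = last.find_previous_sibling("li")

print(next_ids, next_one["id"], prev_ids, prev_one["id"])
```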

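6) find_all_next() and find_next()

These two methods iterate over the tags and strings that come after the current tag, using the .next_elements attribute. find_all_next() returns all nodes that match, while find_next() returns the first node that matches.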

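7) find_all_previous() and find_previous()

These two methods iterate over the tags and strings that come before the current node, using the .previous_elements attribute. find_all_previous() returns all nodes that match, while find_previous() returns the first node that matches.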
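Unlike the sibling methods, these follow document order and are not limited to one parent. In the sketch below (made-up markup), the search that starts at the h1 reaches a p outside its own div.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><h1>Title</h1><p id="p1">One</p></div><p id="p2">Two</p>',
    "html.parser",
)
heading = soup.find("h1")
last_p = soup.find("p", id="p2")

first_after = heading.find_next("p")
all_after_ids = [p["id"] for p in heading.find_all_next("p")]
first_before = last_p.find_previous("p")
# A list of names matches any of them; results come nearest first.
before_names = [t.name for t in last_p.find_all_previous(["h1", "p"])]

print(first_after["id"], all_after_ids, first_before["id"], before_names)
```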

2.3 CSS Selectors

BeautifulSoup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax; the return value is a list. For example:

soup = BeautifulSoup('<body><a id="link1" class="elsie">Elsie</a><a id="link2" class="elsie">Elsie</a></body>', 'html.parser')
print(soup.select('a'))
# Output:
# [<a class="elsie" id="link1">Elsie</a>, <a class="elsie" id="link2">Elsie</a>]

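Find tags level by level

soup.select('body a')

Find the direct children of a tag

soup.select('body > a')

Find by class name

soup.select('.elsie')
soup.select('[class~=elsie]')

Find by id

soup.select('#link1')

Use several selectors at once

soup.select('#link1,#link2')

Find by the presence of an attribute

soup.select('a[class]')

Find by attribute value

soup.select('a[class="elsie"]')

Find only the first matching element

soup.select_one('.elsie')

Find sibling tags

# All following siblings that match
soup.select('#link1 ~ .elsie')
# Only the sibling immediately after, if it matches
soup.select('#link1 + .elsie')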
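The difference between the two sibling combinators is easier to see with three links. The following sketch adds a third a element and different link texts, both chosen for illustration. It also shows that select_one() returns None when nothing matches.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<body><a id="link1" class="elsie">Elsie</a>'
    '<a id="link2" class="elsie">Lacie</a>'
    '<a id="link3" class="elsie">Tillie</a></body>',
    "html.parser",
)

following = [a["id"] for a in soup.select("#link1 ~ .elsie")]   # all later siblings
adjacent = [a["id"] for a in soup.select("#link1 + .elsie")]    # only the next sibling
first = soup.select_one(".elsie")                                # first match or None
nothing = soup.select_one("#missing")

print(following, adjacent, first["id"], nothing)
```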
