BeautifulSoup: A Quick Introduction to a Handy Web Page Parser

Date: 2019-12-05

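We have already covered plenty of crawler examples and techniques. Most of those earlier articles, however, focused on how to fetch the content of a web page. Today we look at the next step: once you have downloaded the content, how do you extract the specific information you need?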

A fetched web page is usually just a str object. The most direct way to find information in it is the string find method combined with slicing:

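s = '<p>Price: 15.7 yuan</p>'
prefix = 'Price: '
start = s.find(prefix)   # index where the label begins
end = s.find(' yuan')    # index where the unit begins
print(s[start + len(prefix):end])
# 15.7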

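This copes with a few very simple cases, but as soon as things get slightly more complicated, writing code this way becomes exhausting. A more general approach is to use regular expressions: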

import re
s = '<p>Price: 15.7 yuan</p>'
# Raw string so that \d reaches the regex engine unchanged
r = re.search(r'[\d.]+', s)
print(r.group())
# 15.7

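Regular expressions are the jack-of-all-trades of text parsing and can deal with almost any situation. Unfortunately, they take real effort to learn. We started with one problem, extracting data from a web page; after reaching for regular expressions, we now have two problems.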
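To give a sense of how quickly patterns get denser, here is a minimal sketch that pulls several prices out of a slightly longer snippet. The HTML string and the names in it are invented for illustration; the pattern only works as long as the markup looks exactly like this.

```python
import re

# Hypothetical snippet, invented for illustration only.
html = '<li>Tea: <b>15.7</b> yuan</li><li>Coffee: <b>32</b> yuan</li>'

# Capture the item name and the number inside <b>...</b>.
pattern = re.compile(r'<li>(\w+): <b>([\d.]+)</b>')
prices = {name: float(value) for name, value in pattern.findall(html)}
print(prices)
```

If the page adds an attribute to the li or b tags, the pattern silently stops matching, which is the kind of fragility a structure-aware parser avoids.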

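An HTML document is itself structured text that follows certain rules, and that structure can be used to simplify extraction. This is why libraries such as lxml, pyquery, and BeautifulSoup exist, and they are what we normally use to extract information from web pages. Of these, lxml parses very efficiently and supports XPath (a rule-based syntax for locating information in HTML); pyquery takes its name from jQuery (the well-known front-end JS library) and lets you parse pages with jQuery-like syntax. The one we will talk about today, though, is the remaining one: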

BeautifulSoup

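BeautifulSoup (abbreviated as bs below) literally means "beautiful soup". The odd name comes from Alice's Adventures in Wonderland, which is also why the official site carries strange illustrations and uses passages from Alice as test text.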

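In my view, the biggest strength of bs is that it is simple and easy to use. Unlike regular expressions and XPath, it does not require you to memorize a lot of special syntax, even though those approaches can be more efficient and direct. For most Python users, ease of use matters more than raw efficiency. That is the main reason I use bs myself and recommend it.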

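Next I will introduce the basic bs methods so that you can start using it as soon as you finish reading. With the "bookmark it and never read it" crowd in mind, here is a "too long; didn't read" summary first: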

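It ships with Anaconda and can also be installed with pip
You specify a parser; different parsers differ in performance and fault tolerance, so their results may differ too
Basic workflow: build a bs object from text -> search for information with find/find_all or other methods -> output or save
You can search iteratively, for example by first locating a section and then searching further inside it
While developing, pay attention to the return type of each method; when something fails, read the error message and add more debug output
The official documentation is friendly, has a Chinese version, and is recommended reading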
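The workflow in the summary can be sketched as follows. This is a minimal illustration, not a full scraper: the page fragment, the ids, and the class names are all invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, invented for illustration.
page = """
<div id="books">
  <div class="book"><h2>Alice</h2><span class="price">15.7</span></div>
  <div class="book"><h2>Looking-Glass</h2><span class="price">12</span></div>
</div>
<div id="ads"><span class="price">0</span></div>
"""

# 1. Build the soup from text, naming the parser explicitly.
soup = BeautifulSoup(page, 'html.parser')

# 2. Locate one section first, then keep searching inside it.
section = soup.find(id='books')
records = []
for book in section.find_all('div', class_='book'):
    records.append({
        'title': book.find('h2').get_text(),
        'price': float(book.find(class_='price').get_text()),
    })

# 3. Output (or save) the result.
print(records)
```

Searching inside `section` rather than the whole soup keeps the price in the unrelated "ads" block out of the result.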

Installation

Installing with pip is recommended (for pip itself, see the earlier article "Crossin: How to Install Third-Party Python Modules"):

pip install beautifulsoup4

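Note that the package name is beautifulsoup4. Without the 4 you get the old version, bs3, which exists only for compatibility and is no longer recommended. Whenever we say bs here, we mean bs4.

bs4 also comes with Anaconda (introduced in the earlier article "Crossin: Python Data Science Environment: Getting to Know Anaconda").

When you use bs, you need to specify a parser: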

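html.parser - built into Python, but not very fault-tolerant; on pages with sloppy markup it may lose part of the content
lxml - fast parsing; requires a separate install
xml - also provided by the lxml library; supports XML documents
html5lib - the best fault tolerance, but somewhat slower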

Both lxml and html5lib need to be installed separately, although if you use Anaconda they are already included.
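To see how parsers can disagree, the sketch below feeds the same deliberately malformed snippet to each one and records what comes back. It assumes only that bs4 is installed; lxml and html5lib are optional, and the code skips any parser that is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = '<a></p>'  # deliberately invalid markup

results = {}
for parser in ('html.parser', 'lxml', 'html5lib'):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed.
        results[parser] = None

for parser, output in results.items():
    print(parser, '->', output if output is not None else 'not installed')
```

Because each parser repairs broken markup in its own way, naming the parser explicitly keeps results consistent from one machine to another.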

Quick start

We will use the document from the official site as our example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

To get started with bs, create a BeautifulSoup object from the text. Specifying the parser explicitly is recommended:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

Access a structural element and its attributes:

soup.title  # the title element
# <title>The Dormouse's story</title>

soup.p  # the first p element
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']  # the class attribute of the p element
# ['title']

soup.p.b  # the b element inside the p element
# <b>The Dormouse's story</b>

soup.p.parent.name  # the tag name of the p element's parent
# 'body'
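Attribute access has a few variations worth knowing. The sketch below uses a small snippet modeled on the example document (the second class value "intro" is added here for illustration) to show dictionary-style access, the safer `.get()`, and the full attribute dict.

```python
from bs4 import BeautifulSoup

snippet = '<p class="story intro"><a href="http://example.com/elsie" id="link1">Elsie</a></p>'
soup = BeautifulSoup(snippet, 'html.parser')

link = soup.a
print(link['href'])       # dictionary-style access; raises KeyError if missing
print(link.get('title'))  # .get() returns None for a missing attribute
print(link.attrs)         # all attributes as a dict
print(soup.p['class'])    # class is multi-valued, so it comes back as a list
print(link.string)        # the single text child of the tag
```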

Not everything can be reached simply by walking the structure, so in practice you usually search with the find and find_all methods:

soup.find_all('a')  # all a elements
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id='link3')  # the element whose id is link3
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

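find and find_all accept several search conditions at once, for example find('a', id='link3', class_='sister')
find returns a bs4.element.Tag object, which can itself be searched further. If several elements match, find returns only the first one; if none match, it returns None.
find_all returns a list of bs4.element.Tag objects. Whether it finds several matches or none at all, the result is always a list.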
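The difference in return types is a common source of errors, so here is a small sketch of handling both. The snippet is a shortened stand-in for the example document, and the id "link9" and class "brother" are deliberately absent.

```python
from bs4 import BeautifulSoup

snippet = """
<a href="/elsie" class="sister" id="link1">Elsie</a>
<a href="/lacie" class="sister" id="link2">Lacie</a>
"""
soup = BeautifulSoup(snippet, 'html.parser')

# Stacked conditions: tag name, id, and class must all match.
hit = soup.find('a', id='link2', class_='sister')
print(hit['href'])

# find returns None when nothing matches, so check before using the result.
missing = soup.find('a', id='link9')
href = missing['href'] if missing is not None else None
print(href)

# find_all always returns a list, so an empty result is safe to loop over.
brothers = soup.find_all('a', class_='brother')
print(len(brothers))
```

Skipping the None check is what produces the familiar "'NoneType' object is not subscriptable" error.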

Output:

x = soup.find(class_='story')
x.get_text()  # text content only, with the tags stripped
# 'Once upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.'
x.prettify()  # the full markup of the element, indented
# '<p class="story">\n Once upon a time there were three little sisters; and their names were\n <a class="sister" href="http://example.com/elsie" id="link1">\n  Elsie\n </a>\n ,\n <a class="sister" href="http://example.com/lacie" id="link2">\n  Lacie\n </a>\n and\n <a class="sister" href="http://example.com/tillie" id="link3">\n  Tillie\n </a>\n ;\nand they lived at the bottom of a well.\n</p>\n'
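For saving results, the raw get_text output often needs tidying. As a sketch, get_text accepts a separator and a strip flag, and each link's text can be paired with its href. The snippet below is a simplified, invented variant of the story paragraph.

```python
from bs4 import BeautifulSoup

snippet = '<p class="story">Names: <a href="/elsie">Elsie</a> <a href="/lacie">Lacie</a></p>'
x = BeautifulSoup(snippet, 'html.parser').find(class_='story')

# separator joins the text pieces; strip=True trims each piece and drops empty ones.
text = x.get_text(' ', strip=True)

# Pair each link's text with its href, ready to be saved.
links = [(a.get_text(), a['href']) for a in x.find_all('a')]
print(text)
print(links)
```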

If you have front-end experience and know CSS selectors well, bs offers a matching method for you too:

soup.select('html head title')
# [<title>The Dormouse's story</title>]
soup.select('p > #link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
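A couple more selector forms, sketched on a shortened copy of the example document. select returns every match, while select_one returns only the first match or None; the id "link9" is deliberately absent.

```python
from bs4 import BeautifulSoup

snippet = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(snippet, 'html.parser')

# select returns a list of every match for a CSS selector.
names = [a.get_text() for a in soup.select('p.story > a.sister')]

# select_one returns the first match, or None if there is none.
second = soup.select_one('a#link2')
nothing = soup.select_one('a#link9')
print(names, second['href'], nothing)
```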

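That is a minimal introduction to BeautifulSoup, and by now you should have a first idea of what bs can do. If you plan to use it in real development, I suggest reading its official documentation as well. It is clearly written and has a Chinese version, and reading just the first short part is enough to put it to work in your code. For finer details, look up the specific methods and parameters as you need them.

Chinese documentation:

Beautiful Soup 4.2.0 documentation (www.crummy.com)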

Feel free to search for and follow: Crossin的編程教室 (Crossin's Programming Classroom)