BeautifulSoup: A Quick-Start Introduction to a Handy Web Page Parser

Date: 2019-11-06

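We have already covered many crawler examples and techniques. Most of those earlier articles, however, focused on how to fetch the content of a web page. Today we look at the next step: once the content has been downloaded, how do you extract the specific information you need from it?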

A fetched web page is usually a str object. The most direct way to find information inside it is to use the string's find method together with slicing:

s = '<p>Price: 15.7 yuan</p>'
label = 'Price: '
start = s.find(label)
end = s.find(' yuan')
print(s[start + len(label):end])  # skip past the label itself
# 15.7

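This copes with the very simplest cases, but as soon as things get slightly more complicated, writing code this way becomes exhausting. A more general approach is to use regular expressions: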

import re
s = '<p>Price: 15.7 yuan</p>'
# Raw string so that \d reaches the regex engine unchanged.
r = re.search(r'[\d.]+', s)
print(r.group())
# 15.7

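Regular expressions are the cure-all of text parsing and can cope with almost any situation. Unfortunately, mastering them takes real learning effort. As the old joke goes: we had a web page extraction problem, we reached for regular expressions, and now we have two problems.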
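To illustrate why regular expressions generalize better than find plus slicing, here is a minimal sketch that pulls every price out of a fragment containing several of them. The sample string and the variable names are made up for this example.

```python
import re

# Hypothetical sample: several prices in one HTML fragment.
html = '<p>Price: 15.7 yuan</p><p>Price: 8 yuan</p><p>Price: 120.50 yuan</p>'

# Capture the number that sits between "Price: " and " yuan".
pattern = re.compile(r'Price: ([\d.]+) yuan')

# findall returns the captured group for every match.
prices = [float(m) for m in pattern.findall(html)]
print(prices)
# [15.7, 8.0, 120.5]
```

The pattern still has to be written and maintained by hand for every new page layout, which is the "second problem" mentioned above.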

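An HTML document is itself structured text that follows certain rules, and that structure can be used to simplify information extraction. This is why web extraction libraries such as lxml, pyquery and BeautifulSoup exist, and they are what we normally use to extract information from web pages. Among them, lxml parses very efficiently and supports XPath (a rule syntax for locating information in HTML); pyquery takes its name from jQuery (the well-known front-end JS library) and lets you parse pages with jQuery-like syntax. The one we are going to talk about today, though, is the remaining one: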

BeautifulSoup

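BeautifulSoup (abbreviated to bs below) literally means "beautiful soup". The odd name comes from Alice's Adventures in Wonderland, which is also why the official site carries strange illustrations and uses passages from Alice as its test text.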

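In my view the biggest strength of bs is that it is simple and easy to use. Unlike regular expressions and XPath, it does not require you to memorize a lot of special syntax, even though those tools can be more efficient and more direct. For most Python users, ease of use matters more than raw speed. That is the main reason I use bs myself and recommend it.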

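Next come a few basic bs methods so that you can start using it as soon as you finish reading. For the "bookmark it and never read it" crowd, here is a "too long; didn't read" summary first: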

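It ships with Anaconda and can also be installed with pip
Different parsers differ in performance and fault tolerance, so the results may differ too
Basic workflow: initialize a bs object from text -> locate information with find/find_all or other methods -> output or save
Searches can be chained: first locate a block of content, then search further within it
During development, pay attention to the return type of each method; when something fails, read the error message and add more output
The official documentation is friendly and also available in Chinese; recommended reading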
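The workflow in the summary can be sketched in a few lines. This is only an illustration: the HTML fragment, the `books` id and the variable names are hypothetical and not taken from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, used only for this sketch.
html = """
<ul id="books">
  <li><a href="/book/1">Alice</a></li>
  <li><a href="/book/2">Looking-Glass</a></li>
</ul>
"""

# 1. Initialize a bs object from text.
soup = BeautifulSoup(html, 'html.parser')

# 2. Locate a block first, then search further within it.
book_list = soup.find('ul', id='books')
links = book_list.find_all('a')

# 3. Output (or save) the extracted data.
titles = [(a.get_text(), a['href']) for a in links]
print(titles)
# [('Alice', '/book/1'), ('Looking-Glass', '/book/2')]
```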

Installation

Installing with pip is recommended (for pip itself, see the earlier article "Crossin: How to Install Third-Party Python Modules"):

pip install beautifulsoup4

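Note that the package name is beautifulsoup4. Without the 4 you get the old version, bs3, which exists only for compatibility and is no longer recommended. Whenever we say bs here, we mean bs4.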

bs4 can also be obtained simply by installing Anaconda (see the earlier article "Crossin: Python Data Science Environment: Get to Know Anaconda").
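Whichever way it was installed, a quick check that the right package is importable might look like this. Note that the import name is bs4, not beautifulsoup4.

```python
import bs4

# The pip package is "beautifulsoup4", but the module is imported as "bs4".
version = bs4.__version__
print(version)
```

If the import fails, the package is not installed in the interpreter you are running.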

When using bs you need to specify a "parser":

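html.parser - bundled with Python, but not very fault-tolerant; it may lose part of the content of pages whose markup is sloppy
lxml - fast parsing; must be installed separately
xml - also provided by the lxml library; supports XML documents
html5lib - the best fault tolerance, but somewhat slower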

Both lxml and html5lib need to be installed separately, but if you use Anaconda they come pre-installed.
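To see how the choice of parser can change the result, the sketch below feeds the same invalid fragment to each HTML parser and skips any parser that is not installed. The fragment `<a></p>` is the one used in the official documentation; the loop and the `results` dictionary are just one way to organize the comparison.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = '<a></p>'  # invalid markup: a stray closing tag

results = {}
for parser in ('html.parser', 'lxml', 'html5lib'):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # The parser library is not installed in this environment.
        results[parser] = None

for parser, output in results.items():
    print(parser, '->', output)
```

With html.parser the stray `</p>` is simply dropped and the result is `<a></a>`. According to the official documentation, lxml and html5lib additionally wrap the fragment in html and body elements, and html5lib keeps an empty p element inside the a element.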

Quick start

We will use the document from the official site as our example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

To initialize bs, create a BeautifulSoup object from text. Specifying the parser explicitly is recommended:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

Getting a structural element and its attributes:

soup.title  # the title element
# <title>The Dormouse's story</title>

soup.p  # the first p element
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']  # the class attribute of that p element
# ['title']

soup.p.b  # the b element inside that p element
# <b>The Dormouse's story</b>

soup.p.parent.name  # tag name of the p element's parent
# body
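Attribute access deserves a closer look, because the different styles behave differently when an attribute is missing. The following is a small sketch on a single link from the sample document; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.a

href = link['href']          # dictionary-style access; raises KeyError if absent
missing = link.get('title')  # .get() returns None instead of raising
attrs = link.attrs           # all attributes as a dict
text = link.string           # the tag's single text child

print(href, missing, attrs['class'], text)
# http://example.com/elsie None ['sister'] Elsie
```

Note that class comes back as a list, because an element may carry several classes.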

Not all information can be reached simply by walking the structure; usually you search with the find and find_all methods:

soup.find_all('a')  # all a elements
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id='link3')  # the element whose id is link3
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

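find and find_all accept several search conditions at once, for example find('a', id='link3', class_='sister')
find returns a bs4.element.Tag object, which can itself be searched further. If several elements match, find returns only the first; if none match, it returns None.
find_all returns a list of bs4.element.Tag objects (strictly a bs4.element.ResultSet, which is a list subclass). Whether it finds several matches or none, the result is always a list.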
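The different return types are a common source of errors, so here is a minimal sketch of chaining a search and guarding against None. The HTML fragment and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

html = '<div id="main"><p class="story">One <a href="/a">A</a></p></div>'
soup = BeautifulSoup(html, 'html.parser')

# find returns a Tag, so a second search can start from it.
story = soup.find('p', class_='story')
first_link = story.find('a')

# find returns None when nothing matches; check before using the result.
missing = soup.find('table')
caption = missing.get_text() if missing is not None else ''

# find_all always returns a list-like result, possibly empty.
tables = soup.find_all('table')

print(first_link['href'], repr(caption), len(tables))
# /a '' 0
```

Calling a method directly on the result of a failed find raises AttributeError on NoneType, which is usually the first error newcomers run into.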

Output:

x = soup.find(class_='story')
x.get_text()  # text content only, without tags
# 'Once upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.'
x.prettify()  # the element's full markup, pretty-printed
# '<p class="story">\n Once upon a time there were three little sisters; and their names were\n <a class="sister" href="http://example.com/elsie" id="link1">\n Elsie\n </a>\n ,\n <a class="sister" href="http://example.com/lacie" id="link2">\n Lacie\n </a>\n and\n <a class="sister" href="http://example.com/tillie" id="link3">\n Tillie\n </a>\n ;\nand they lived at the bottom of a well.\n</p>\n'
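The raw get_text output keeps the line breaks of the source markup. When cleaner text is needed, get_text accepts a separator and a strip flag, and stripped_strings yields the individual pieces. The sketch below uses a shortened, made-up fragment.

```python
from bs4 import BeautifulSoup

html = '<p class="story">Names:\n<a>Elsie</a>,\n<a>Lacie</a></p>'
p = BeautifulSoup(html, 'html.parser').p

raw_text = p.get_text()                   # keeps the original whitespace
clean_text = p.get_text(' ', strip=True)  # strip each piece, join with a space
pieces = list(p.stripped_strings)         # the stripped pieces themselves
markup = str(p)                           # markup without pretty-printing

print(repr(raw_text))
# 'Names:\nElsie,\nLacie'
print(clean_text)
# Names: Elsie , Lacie
print(pieces)
# ['Names:', 'Elsie', ',', 'Lacie']
```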

If you have front-end development experience and know CSS selectors well, bs provides matching methods for you too:

soup.select('html head title')
# [<title>The Dormouse's story</title>]
soup.select('p > #link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
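A few more selector forms, sketched on a shortened version of the sample document. select returns a list, while select_one returns the first match or None, mirroring find_all and find.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, 'html.parser')

# Tag plus class, with a descendant combinator.
names = [a.get_text() for a in soup.select('p.story a.sister')]

# First match only.
first = soup.select_one('a.sister')

# Attribute selector: href ends with "lacie".
by_attr = soup.select('a[href$="lacie"]')

# No match: select_one returns None.
nothing = soup.select_one('table')

print(names, first['id'], by_attr[0]['id'], nothing)
# ['Elsie', 'Lacie'] link1 link2 None
```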

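That is a minimal quick-start introduction to BeautifulSoup, and you should now have a first idea of what bs can do. If you are going to use it in real development, it is worth reading the official documentation as well. It is clearly written and also available in Chinese, and reading just the first small part is enough to put bs to work in your code. Further details can be looked up as you go, by searching for the specific method or parameter you need.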

Chinese documentation:

Beautiful Soup 4.2.0 Documentation (www.crummy.com)

Search for and follow: Crossin的編程教室 (Crossin's Programming Classroom)