[Python] Using BeautifulSoup

Date: 2019-12-12


    from bs4 import BeautifulSoup

    # The HTML below is reused as the example throughout. It is an excerpt from
    # "Alice's Adventures in Wonderland" (called the "Alice" document from here on):
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    # Parsing the markup yields a BeautifulSoup object, which can print the
    # document as a conventionally indented tree via prettify().
    # 'html.parser' names the parser (not a compiler) used to read html_doc.
    soup = BeautifulSoup(html_doc, 'html.parser')

    # The statements below pull specific pieces out of the document.
    # print(soup.prettify())
    print(soup.title)             # the <title> tag
    print(soup.title.string)      # the text inside <title>
    print(soup.find_all('a'))     # every <a> tag, as a list
    print(soup.a.string)          # the text of the first <a> tag
    print(soup.find(id='link2'))  # the tag whose id is "link2"
    print(soup.find_all('p'))     # every <p> tag
    print(soup.get_text())        # all text in the document
    for c in soup.find_all('a'):
        print(c.get('href'))      # the URL of each link

Here are a few simple ways to navigate the parsed data structure:

    soup.title
    # <title>The Dormouse's story</title>

    soup.title.name
    # 'title'

    soup.title.string
    # "The Dormouse's story"

    soup.title.parent.name
    # 'head'

    soup.p
    # <p class="title"><b>The Dormouse's story</b></p>

    soup.p['class']
    # ['title']

    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.find(id="link3")
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
 

Extracting the URLs of all <a> tags in the document:

    for link in soup.find_all('a'):
        print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie
 

Extracting all the text from the document:

    print(soup.get_text())
    # The Dormouse's story
    #
    # The Dormouse's story
    #
    # Once upon a time there were three little sisters; and their names were
    # Elsie,
    # Lacie and
    # Tillie;
    # and they lived at the bottom of a well.
    #
    # ...
 

Is this what you need? If so, read on: there is more to come.

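Installing Beautiful Soup

If you are on a recent version of Debian or Ubuntu, you can install Beautiful Soup with the system package manager:

$ apt-get install python-bs4

Beautiful Soup 4 is published through PyPI, so if you cannot use the system package manager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3.

$ easy_install beautifulsoup4

$ pip install beautifulsoup4

(PyPI also has a package named BeautifulSoup, but that is probably not what you want. It is the Beautiful Soup 3 release. Many projects still use BS3, so the BeautifulSoup package remains available, but new code should install beautifulsoup4.)

If you have neither easy_install nor pip, you can download the BS4 source and install it with setup.py.

$ python setup.py install

If all of the above fail, the Beautiful Soup license allows you to bundle the BS4 code with your project, so it can be used without installing anything.

The author develops Beautiful Soup under Python 2.7 and Python 3.2. In principle, it should work with all current Python versions.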

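Problems after installation

Beautiful Soup is packaged as Python 2 code. When you install it under Python 3, it is automatically converted to Python 3 code. If you skip the install step, the code is not converted.

If you get the ImportError "No module named HTMLParser", you are running the Python 2 version of the code under Python 3.

If you get the ImportError "No module named html.parser", you are running the Python 3 version of the code under Python 2.

In both cases, the best fix is to reinstall beautifulsoup4.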
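When diagnosing the import errors above, it can help to see which interpreter is running and which copy of bs4 it picks up. The following is a minimal sketch, not part of the original article; it only uses the `__version__` and `__file__` attributes that the bs4 package exposes.

```python
import sys

try:
    import bs4
except ImportError as exc:
    # Either bs4 is not installed for this interpreter, or the wrong
    # Python 2/3 flavour of the code is on the import path.
    bs4 = None
    print(f"bs4 is not importable from {sys.executable}: {exc}")
else:
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    print(f"bs4 {bs4.__version__} loaded from {bs4.__file__}")
```

If the reported path points at an unexpected location (for example, a bundled copy inside a project), reinstalling into the active environment is usually the simplest fix.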

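If you get the SyntaxError "Invalid syntax" on the line ROOT_TAG_NAME = u'[document]', you need to convert the BS4 code from Python 2 to Python 3. You can do this by reinstalling BS4:

$ python3 setup.py install

or by running the Python conversion script on the bs4 directory:

$ 2to3-3.2 -w bs4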

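Installing a parser

Beautiful Soup supports the HTML parser in the Python standard library, and it also supports several third-party parsers. One of them is lxml. Depending on your setup, you can install lxml with one of these commands:

$ apt-get install python-lxml

$ easy_install lxml

$ pip install lxml

Another alternative is html5lib, a pure-Python parser that parses HTML the way a web browser does. Depending on your setup, you can install html5lib with one of these commands:

$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib

This table lists the main parsers with their advantages and disadvantages:

Parser	Typical usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; decent speed; lenient	Not very lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Very fast; lenient	Requires a C library
lxml XML parser	`BeautifulSoup(markup, ["lxml", "xml"])` `BeautifulSoup(markup, "xml")`	Very fast; the only supported XML parser	Requires a C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Most lenient; parses pages the way a browser does; produces valid HTML5	Very slow; a separate Python package must be installed (no C extension needed)

lxml is recommended because it is faster. On Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2, you must install lxml or html5lib, because the HTML parser built into those versions of the standard library is not reliable enough.

Note: if a document is not well-formed HTML or XML, different parsers may return different results. See the section on differences between parsers for details.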
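To see the parser differences on a concrete input, one option is to feed the same broken markup to every parser that happens to be installed. This is a sketch under assumptions: the helper name `parse_with_available` and the sample markup are made up for illustration, and only `html.parser` is guaranteed to be present.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately invalid markup: <a> is never closed and </p> was never opened.
BROKEN = "<a></p>"

def parse_with_available(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: serialized tree} for each parser that is installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # lxml and html5lib are optional third-party packages; skip if absent.
            continue
    return results

results = parse_with_available(BROKEN)
for name, html in results.items():
    print(f"{name:12} -> {html}")
```

Naming the parser explicitly, as done here, keeps the result from depending on which optional packages are installed on a given machine.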

How to use it

To parse a document, pass it to the BeautifulSoup constructor. You can pass in a string or an open file handle:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(open("index.html"))

    soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

    BeautifulSoup("Sacr&eacute; bleu!")
    # <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document with the best available parser. If you name a parser explicitly, that parser is used instead (see Parsing XML).
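The entity conversion can be checked with the standard-library parser named explicitly. This is a small sketch, assuming only that bs4 is installed; note that `html.parser` does not wrap a fragment in <html> and <body>, so the serialized output differs from the lxml-style output shown above.

```python
from bs4 import BeautifulSoup

# Name the parser so the result does not depend on what else is installed.
soup = BeautifulSoup("Sacr&eacute; bleu!", "html.parser")

text = soup.get_text()   # the &eacute; entity has become a real character
print(text)
print(repr(str(soup)))   # html.parser adds no <html>/<body> wrapper
```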

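Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. Every node is a Python object, and all of them fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

Tag

A Tag object corresponds to an XML or HTML tag in the original document:

    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
    tag = soup.b
    type(tag)
    # <class 'bs4.element.Tag'>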
   

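Tags have many methods and attributes, covered in detail in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and its attributes.

Name

Every tag has a name, accessible as .name:

    tag.name
    # 'b'

If you change a tag's name, the change is reflected in any HTML markup generated from the current Beautiful Soup object:

    tag.name = "blockquote"
    tag
    # <blockquote class="boldest">Extremely bold</blockquote>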
    

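Attributes

A tag may have any number of attributes. The tag <b class="boldest"> has an attribute "class" whose value is "boldest". You access a tag's attributes by treating the tag like a dictionary. Because class is a multi-valued attribute (see below), its value comes back as a list:

    tag['class']
    # ['boldest']

You can also access the whole attribute dictionary directly as .attrs:

    tag.attrs
    # {'class': ['boldest']}

You can add, remove, and modify a tag's attributes. Again, this is done by treating the tag as a dictionary:

    tag['class'] = 'verybold'
    tag['id'] = 1
    tag
    # <blockquote class="verybold" id="1">Extremely bold</blockquote>

    del tag['class']
    del tag['id']
    tag
    # <blockquote>Extremely bold</blockquote>

    tag['class']
    # KeyError: 'class'
    print(tag.get('class'))
    # None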
    

Multi-valued attributes

HTML 4 defines a few attributes that can have multiple values. HTML5 removes a couple of them but defines a few more. The most common multi-valued attribute is class (a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup returns the value of a multi-valued attribute as a list:

    css_soup = BeautifulSoup('<p class="body strikeout"></p>')
    css_soup.p['class']
    # ["body", "strikeout"]

    css_soup = BeautifulSoup('<p class="body"></p>')
    css_soup.p['class']
    # ["body"]

If an attribute looks like it has more than one value, but it is not defined as a multi-valued attribute in any version of the HTML standard, Beautiful Soup returns it as a single string:

    id_soup = BeautifulSoup('<p id="my id"></p>')
    id_soup.p['id']
    # 'my id'

When you turn a tag back into a string, multiple attribute values are consolidated into one:

    rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
    rel_soup.a['rel']
    # ['index']
    rel_soup.a['rel'] = ['index', 'contents']
    print(rel_soup.p)
    # <p>Back to the <a rel="index contents">homepage</a></p>

If you parse a document as XML, there are no multi-valued attributes:

    xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
    xml_soup.p['class']
    # 'body strikeout'
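Because some attributes come back as lists and others as strings, code that handles arbitrary attributes sometimes wants a uniform shape. The sketch below is an illustration, not from the original article; it relies on `Tag.get_attribute_list()`, which is available in Beautiful Soup 4.6.0 and later.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
p = soup.p

classes = p["class"]                     # multi-valued attribute: a list
raw_id = p["id"]                         # single-valued attribute: a plain string
id_as_list = p.get_attribute_list("id")  # always a list, whatever the attribute
missing = p.get("title")                 # absent attribute: None instead of KeyError

print(classes, repr(raw_id), id_as_list, missing)
```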
     

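NavigableString

A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to wrap these bits of text:

    tag.string
    # 'Extremely bold'
    type(tag.string)
    # <class 'bs4.element.NavigableString'>

A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a plain Unicode string with str() (unicode() on Python 2):

    unicode_string = str(tag.string)
    unicode_string
    # 'Extremely bold'
    type(unicode_string)
    # <class 'str'>

You cannot edit a string in place, but you can replace one string with another using replace_with():

    tag.string.replace_with("No longer bold")
    tag
    # <blockquote>No longer bold</blockquote>

NavigableString supports most, but not all, of the features described in Navigating the tree and Searching the tree. In particular, since a string cannot contain anything (the way a tag may contain a string or another tag), strings do not support the .contents or .string attributes, or the find() method.

If you want to use a NavigableString outside of Beautiful Soup, call str() on it to turn it into a plain Unicode string. Otherwise the string carries a reference to the entire Beautiful Soup parse tree even after you are done with Beautiful Soup, which wastes memory.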
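The difference between a NavigableString and the plain string obtained from it can be seen directly. This is a minimal Python 3 sketch (hence `str()` rather than `unicode()`); the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")

node = soup.b.string   # a NavigableString, still attached to the parse tree
plain = str(node)      # a detached, ordinary Python string

print(type(node).__name__, "->", node.parent.name)  # the node knows its parent tag
print(type(plain).__name__, "->", plain)
```

Keeping only `plain` around (for example in a long-lived cache) lets the parse tree be garbage-collected once the soup itself goes out of scope.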

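BeautifulSoup

The BeautifulSoup object represents the parsed document as a whole. For most purposes you can treat it as a Tag object: it supports most of the methods described in Navigating the tree and Searching the tree.

Because the BeautifulSoup object does not correspond to an actual HTML or XML tag, it has no name and no attributes. It is sometimes useful to look at its .name, though, so it has been given the special .name "[document]":

    soup.name
    # '[document]'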
   

Comments and other special strings

Tag, NavigableString, and BeautifulSoup cover almost everything you will see in an HTML or XML file, but there are a few leftover bits. The one you are most likely to need to care about is the comment:

    markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
    soup = BeautifulSoup(markup)
    comment = soup.b.string
    type(comment)
    # <class 'bs4.element.Comment'>

The Comment object is just a special type of NavigableString:

    comment
    # 'Hey, buddy. Want to buy a used parser?'

But when it appears as part of an HTML document, a Comment is displayed with special formatting:

    print(soup.b.prettify())
    # <b>
    #  <!--Hey, buddy. Want to buy a used parser?-->
    # </b>

Beautiful Soup also defines classes for anything else that might show up in an XML document: CData, ProcessingInstruction, Declaration, and Doctype. Like Comment, these classes are subclasses of NavigableString that add a little extra behavior to the string. Here is an example that replaces the comment with a CDATA block:

    from bs4 import CData
    cdata = CData("A CDATA block")
    comment.replace_with(cdata)

    print(soup.b.prettify())
    # <b>
    #  <![CDATA[A CDATA block]]>
    # </b>
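A common follow-up is collecting or stripping every comment in a document. The sketch below is one way to do it, not something the original article shows: it passes a function as the string filter, using the `string` argument name (Beautiful Soup 4.4.0 and later; older releases call it `text`). The sample markup is made up.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<div><!-- nav start --><p>Hello<!-- inline note --></p></div>"
soup = BeautifulSoup(markup, "html.parser")

# A function filter on strings: keep only nodes that are Comment instances.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
texts = [str(c).strip() for c in comments]

# extract() removes each comment from the tree and returns it.
for c in comments:
    c.extract()
cleaned = str(soup)

print(texts)
print(cleaned)
```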
   

Navigating the tree

Here is the "Alice" document again:

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc)

This example is used to show how to move from one part of a document to another.

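Child nodes

A tag may contain strings and other tags. These are the tag's children. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.

Note: strings in Beautiful Soup do not support these attributes, because a string cannot have children.

Navigating by tag name

The simplest way to navigate the tree is to name the tag you want. To get the <head> tag, use soup.head:

    soup.head
    # <head><title>The Dormouse's story</title></head>

    soup.title
    # <title>The Dormouse's story</title>

You can repeat this trick to zoom in on a part of the tree. This code gets the first <b> tag beneath the <body> tag:

    soup.body.b
    # <b>The Dormouse's story</b>

Using a tag name as an attribute gives you only the first tag by that name:

    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

To get all the <a> tags, or anything more complicated than the first tag with a given name, use one of the methods described in Searching the tree, such as find_all():

    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]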
    

.contents and .children

A tag's children are available as a list called .contents:

    head_tag = soup.head
    head_tag
    # <head><title>The Dormouse's story</title></head>

    head_tag.contents
    # [<title>The Dormouse's story</title>]

    title_tag = head_tag.contents[0]
    title_tag
    # <title>The Dormouse's story</title>
    title_tag.contents
    # ["The Dormouse's story"]

The BeautifulSoup object itself has children. In this case, the <html> tag is the child of the BeautifulSoup object:

    len(soup.contents)
    # 1
    soup.contents[0].name
    # 'html'

A string does not have .contents, because it cannot contain anything:

    text = title_tag.contents[0]
    text.contents
    # AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting the children as a list, you can iterate over them with the .children generator:

    for child in title_tag.children:
        print(child)
    # The Dormouse's story
    

.descendants

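The .contents and .children attributes only consider a tag's direct children. For instance, the <head> tag has a single direct child, the <title> tag:

    head_tag.contents
    # [<title>The Dormouse's story</title>]

But the <title> tag itself has a child: the string "The Dormouse's story". That string is therefore also a descendant of the <head> tag. The .descendants attribute lets you iterate recursively over all of a tag's descendants [5]:

    for child in head_tag.descendants:
        print(child)
    # <title>The Dormouse's story</title>
    # The Dormouse's story

In this example, the <head> tag has only one child but two descendants: the <title> tag and the <title> tag's child. The BeautifulSoup object has only one direct child (the <html> tag), but it has many descendants:

    len(list(soup.children))
    # 1
    len(list(soup.descendants))
    # 25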
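The children/descendants distinction can be verified on a small fragment. This sketch separates the descendants of <head> into tags and strings; the variable names are illustrative and the fragment is a cut-down version of the "Alice" document.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head = soup.head

children = list(head.children)        # direct children only
descendants = list(head.descendants)  # children, grandchildren, and so on

# Descendants mix Tag objects and strings, so split them by type.
tag_names = [d.name for d in descendants if isinstance(d, Tag)]
strings = [str(d) for d in descendants if not isinstance(d, Tag)]

print(len(children), len(descendants), tag_names, strings)
```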
    

.string

If a tag has only one child, and that child is a NavigableString, the child is available as .string:

    title_tag.string
    # "The Dormouse's story"

If a tag's only child is another tag, and that tag has a .string, the parent tag is considered to have the same .string as its child:

    head_tag.contents
    # [<title>The Dormouse's story</title>]

    head_tag.string
    # "The Dormouse's story"

If a tag contains more than one thing, it is not clear what .string should refer to, so .string is None:

    print(soup.html.string)
    # None
    

.strings and .stripped_strings

If a tag contains more than one string [2], you can iterate over them with .strings:

    for string in soup.strings:
        print(repr(string))
    # "The Dormouse's story"
    # '\n\n'
    # "The Dormouse's story"
    # '\n\n'
    # 'Once upon a time there were three little sisters; and their names were\n'
    # 'Elsie'
    # ',\n'
    # 'Lacie'
    # ' and\n'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '\n\n'
    # '...'
    # '\n'

These strings tend to contain a lot of extra whitespace and blank lines, which you can remove by using .stripped_strings instead:

    for string in soup.stripped_strings:
        print(repr(string))
    # "The Dormouse's story"
    # "The Dormouse's story"
    # 'Once upon a time there were three little sisters; and their names were'
    # 'Elsie'
    # ','
    # 'Lacie'
    # 'and'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '...'

Strings consisting entirely of whitespace are skipped, and whitespace at the beginning and end of each string is removed.
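A typical use of .stripped_strings is rebuilding readable text from markup that is full of stray whitespace. The following sketch compares it with `get_text()` called with a separator and `strip=True`, which gives the same result here; the sample markup is made up for the example.

```python
from bs4 import BeautifulSoup

html = "<p> Hello <b> world </b>\n</p>"
soup = BeautifulSoup(html, "html.parser")

raw = list(soup.strings)             # includes padding and the bare newline
clean = list(soup.stripped_strings)  # whitespace-only strings are skipped

sentence = " ".join(clean)
same = soup.get_text(" ", strip=True)  # separator plus strip does the same job

print(raw)
print(clean)
print(sentence)
```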

Parent nodes

Continuing up the document tree, every tag and every string has a parent: the tag that contains it.

.parent

You can access an element's parent with the .parent attribute. In the "Alice" document, the <head> tag is the parent of the <title> tag:

    title_tag = soup.title
    title_tag
    # <title>The Dormouse's story</title>
    title_tag.parent
    # <head><title>The Dormouse's story</title></head>

The title string itself has a parent, the <title> tag that contains it:

    title_tag.string.parent
    # <title>The Dormouse's story</title>

The parent of a top-level tag like <html> is the BeautifulSoup object itself:

    html_tag = soup.html
    type(html_tag.parent)
    # <class 'bs4.BeautifulSoup'>

And the .parent of a BeautifulSoup object is None:

    print(soup.parent)
    # None
    

.parents

You can iterate over all of an element's ancestors with .parents. This example uses .parents to travel from an <a> tag up to the root of the document. The iteration stops at the BeautifulSoup object, so None is never yielded:

    link = soup.a
    link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    for parent in link.parents:
        print(parent.name)
    # p
    # body
    # html
    # [document]
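One practical use of .parents is printing where a node sits in the document. The helper below is a hypothetical convenience function (`ancestor_path` is not part of Beautiful Soup); it simply reverses the ancestor names so the root comes first.

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a id='link1'>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

def ancestor_path(node):
    """Return the ancestor names from the document root down to node's parent."""
    names = [parent.name for parent in node.parents]  # nearest ancestor first
    return " > ".join(reversed(names))

path = ancestor_path(soup.a)
print(path)
```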
    

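Sibling nodes

Consider a simple document like this:

    sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>")
    print(sibling_soup.prettify())
    # <html>
    #  <body>
    #   <a>
    #    <b>
    #     text1
    #    </b>
    #    <c>
    #     text2
    #    </c>
    #   </a>
    #  </body>
    # </html>

The <b> tag and the <c> tag are at the same level: they are both direct children of the same tag, so they are called siblings. When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.

.next_sibling and .previous_sibling

You can use .next_sibling and .previous_sibling to move between elements on the same level of the tree:

    sibling_soup.b.next_sibling
    # <c>text2</c>

    sibling_soup.c.previous_sibling
    # <b>text1</b>

The <b> tag has a .next_sibling but no .previous_sibling, because nothing comes before the <b> tag on the same level of the tree. For the same reason, the <c> tag has a .previous_sibling but no .next_sibling:

    print(sibling_soup.b.previous_sibling)
    # None
    print(sibling_soup.c.next_sibling)
    # None

The strings "text1" and "text2" are not siblings, because they do not have the same parent:

    sibling_soup.b.string
    # 'text1'

    print(sibling_soup.b.string.next_sibling)
    # None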
    

In real documents, the .next_sibling or .previous_sibling of a tag is usually a string containing text or whitespace. Going back to the "Alice" document:

    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

You might expect the .next_sibling of the first <a> tag to be the second <a> tag. It is not: it is a string, the comma and newline that separate the first <a> tag from the second:

    link = soup.a
    link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    link.next_sibling
    # ',\n'

The second <a> tag is the .next_sibling of that comma:

    link.next_sibling.next_sibling
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
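When the goal is the next tag rather than the next node, chaining .next_sibling twice is fragile, because it depends on exactly one string sitting in between. The sketch below contrasts it with `find_next_sibling()`, which skips strings and filters by tag name; the two-link fragment is a shortened version of the "Alice" document.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

raw_next = first.next_sibling            # the string between the two tags
next_tag = first.find_next_sibling("a")  # the next <a>, however many strings intervene

print(repr(raw_next))
print(next_tag["id"])
```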
    

.next_siblings and .previous_siblings

You can iterate over a node's siblings with .next_siblings or .previous_siblings:

    for sibling in soup.a.next_siblings:
        print(repr(sibling))
    # ',\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # ' and\n'
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    # ';\nand they lived at the bottom of a well.'

    for sibling in soup.find(id="link3").previous_siblings:
        print(repr(sibling))
    # ' and\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # ',\n'
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    # 'Once upon a time there were three little sisters; and their names were\n'
    

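Going back and forth

Take a look at the beginning of the "Alice" document:

    <html><head><title>The Dormouse's story</title></head>
    <p class="title"><b>The Dormouse's story</b></p>

An HTML parser takes this string of characters and turns it into a series of events: "open an <html> tag", "open a <head> tag", "open a <title> tag", "add a string", "close the <title> tag", "open a <p> tag", and so on. Beautiful Soup offers tools for reconstructing this initial parse of the document.

.next_element and .previous_element

The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards. It may be the same as .next_sibling, but it is usually quite different.

Here is the final <a> tag in the "Alice" document. Its .next_sibling is a string: the rest of the sentence that was interrupted by the start of the <a> tag [2]:

    last_a_tag = soup.find("a", id="link3")
    last_a_tag
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

    last_a_tag.next_sibling
    # ';\nand they lived at the bottom of a well.'

But the .next_element of that <a> tag, the thing that was parsed immediately after the <a> tag, is not the rest of the sentence. It is the string "Tillie":

    last_a_tag.next_element
    # 'Tillie'

That is because in the original markup, the word "Tillie" appears before the semicolon. The parser encountered the <a> tag, then the word "Tillie", then the closing </a> tag, then the semicolon and the rest of the sentence. The semicolon is on the same level as the <a> tag, but the word "Tillie" was parsed first.

The .previous_element attribute is the exact opposite of .next_element. It points to whatever was parsed immediately before the current element:

    last_a_tag.previous_element
    # ' and\n'
    last_a_tag.previous_element.next_element
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>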
    

.next_elements and .previous_elements

With the .next_elements and .previous_elements iterators you can move forward or backward through the document in the order it was parsed:

    for element in last_a_tag.next_elements:
        print(repr(element))
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '\n\n'
    # <p class="story">...</p>
    # '...'
    # '\n'
    

Searching the tree

Beautiful Soup defines many methods for searching the parse tree. This section concentrates on two: find() and find_all(). The other methods take similar arguments and work in a similar way, so they are covered more briefly.

Once again, the "Alice" document serves as the example:

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc)

By passing a filter to a method like find_all(), you can zoom in on the parts of the document you are interested in.

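Kinds of filters

Before looking at find_all() in detail, here are the kinds of filters you can pass to it [3]. These filters show up again and again throughout the search API. You can filter on a tag's name, on its attributes, on the text of a string, or on some combination of these.

A string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup performs a match against that exact string. This code finds all the <b> tags in the document:

    soup.find_all('b')
    # [<b>The Dormouse's story</b>]

If you pass in a byte string, Beautiful Soup assumes it is encoded as UTF-8. You can avoid decoding problems by passing in a Unicode string instead.

A regular expression

If you pass in a regular expression object, Beautiful Soup filters against it using its match() method. This code finds all the tags whose names start with the letter "b"; in this case, the <body> tag and the <b> tag:

    import re
    for tag in soup.find_all(re.compile("^b")):
        print(tag.name)
    # body
    # b

This code finds all the tags whose names contain the letter "t":

    for tag in soup.find_all(re.compile("t")):
        print(tag.name)
    # html
    # title

A list

If you pass in a list, Beautiful Soup returns anything that matches any item in the list. This code finds all the <a> tags and all the <b> tags:

    soup.find_all(["a", "b"])
    # [<b>The Dormouse's story</b>,
    #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

True

The value True matches everything it can. This code finds all the tags in the document, but none of the text strings:

    for tag in soup.find_all(True):
        print(tag.name)
    # html
    # head
    # title
    # body
    # p
    # b
    # p
    # a
    # a
    # a
    # p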
    

A function

If none of the other filters fits, you can define a function that takes an element as its only argument [4]. The function should return True if the element matches, and False otherwise.

This function returns True if a tag defines the "class" attribute but does not define the "id" attribute:

    def has_class_but_no_id(tag):
        return tag.has_attr('class') and not tag.has_attr('id')

Pass this function to find_all() and you get all the <p> tags:

    soup.find_all(has_class_but_no_id)
    # [<p class="title"><b>The Dormouse's story</b></p>,
    #  <p class="story">Once upon a time there were...</p>,
    #  <p class="story">...</p>]

The result contains the <p> tags but not the <a> tags, because the <a> tags define both "class" and "id". It does not contain <html> or <head>, because those tags do not define "class".

This function finds all the tags that are surrounded by string objects:

    from bs4 import NavigableString

    def surrounded_by_strings(tag):
        return (isinstance(tag.next_element, NavigableString)
                and isinstance(tag.previous_element, NavigableString))

    for tag in soup.find_all(surrounded_by_strings):
        print(tag.name)
    # p
    # a
    # a
    # a
    # p
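Function filters are handy when the condition depends on an attribute's value rather than its mere presence. As an illustration (the `is_external_link` helper, the `other.test` host, and the sample markup are all invented for this sketch), the following selects only the links that point away from example.com.

```python
from bs4 import BeautifulSoup

html = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="https://other.test/page" id="ext">Elsewhere</a>
<a id="nohref">No link</a>
"""
soup = BeautifulSoup(html, "html.parser")

def is_external_link(tag):
    """True for <a> tags that have an href not pointing at example.com."""
    if tag.name != "a" or not tag.has_attr("href"):
        return False
    return "example.com" not in tag["href"]

external = soup.find_all(is_external_link)
ids = [tag["id"] for tag in external]
print(ids)
```

The function is called once per tag, so it should guard against missing attributes (as `has_attr` does here) instead of indexing blindly.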
    

Now let's look at the search methods in detail.

find_all()

find_all( name , attrs , recursive , text , limit , **kwargs )

The find_all() method looks through a tag's descendants and retrieves all of those that match your filters. Here are some examples:

    soup.find_all("title")
    # [<title>The Dormouse's story</title>]

    soup.find_all("p", "title")
    # [<p class="title"><b>The Dormouse's story</b></p>]

    soup.find_all("a")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.find_all(id="link2")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

    import re
    soup.find(text=re.compile("sisters"))
    # 'Once upon a time there were three little sisters; and their names were\n'

Some of these should look familiar, but others are new. What does it mean to pass in a value for text, or for id? Why does find_all("p", "title") find a <p> tag with the CSS class "title"? Let's look at the arguments to find_all().

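The name argument

Pass in a value for name and Beautiful Soup only considers tags with that name. Text strings are ignored automatically.

This is the simplest usage:

    soup.find_all("title")
    # [<title>The Dormouse's story</title>]

To repeat: the value of name can be any kind of filter, that is, a string, a regular expression, a list, a function, or True.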

Keyword arguments

Any keyword argument that is not one of the built-in argument names is treated as a filter on a tag attribute of that name. If you pass in a value for an argument called id, Beautiful Soup filters against each tag's "id" attribute:

    soup.find_all(id='link2')
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass in a value for href, Beautiful Soup filters against each tag's "href" attribute:

    soup.find_all(href=re.compile("elsie"))
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

You can filter an attribute using a string, a regular expression, a list, or True.

This code finds all tags that have an id attribute, regardless of its value:

    soup.find_all(id=True)
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

You can filter on multiple attributes at once by passing in more than one keyword argument:

    soup.find_all(href=re.compile("elsie"), id='link1')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Some attributes have names that cannot be used as keyword arguments, such as the data-* attributes in HTML5:

    data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
    data_soup.find_all(data-foo="value")
    # SyntaxError: keyword can't be an expression

You can still search on these attributes by putting them into a dictionary and passing it to find_all() as the attrs argument:

    data_soup.find_all(attrs={"data-foo": "value"})
    # [<div data-foo="value">foo!</div>]
    

Searching by CSS class

Searching for tags with a certain CSS class is very useful, but the name of the CSS attribute, class, is a reserved word in Python. Using class as a keyword argument causes a syntax error. As of Beautiful Soup 4.1.1, you can search by CSS class using the keyword argument class_:

    soup.find_all("a", class_="sister")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Like any keyword argument, class_ accepts different kinds of filter: a string, a regular expression, a function, or True:

    soup.find_all(class_=re.compile("itl"))
    # [<p class="title"><b>The Dormouse's story</b></p>]

    def has_six_characters(css_class):
        return css_class is not None and len(css_class) == 6

    soup.find_all(class_=has_six_characters)
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Remember that class is a multi-valued attribute. When you search for a tag by CSS class, you are matching against any one of its CSS classes:

    css_soup = BeautifulSoup('<p class="body strikeout"></p>')
    css_soup.find_all("p", class_="strikeout")
    # [<p class="body strikeout"></p>]

    css_soup.find_all("p", class_="body")
    # [<p class="body strikeout"></p>]

You can also search for the exact string value of the class attribute:

    css_soup.find_all("p", class_="body strikeout")
    # [<p class="body strikeout"></p>]

When matching the exact value of class, the search finds nothing if the class names are given in a different order than in the document:

    css_soup.find_all("p", class_="strikeout body")
    # []

A class search can also be written with the attrs dictionary:

    soup.find_all("a", attrs={"class": "sister"})
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    

The `text` argument

With text you can search for strings instead of tags. (Since Beautiful Soup 4.4.0 the same argument is also available under the name string.) As with name and the keyword arguments, text accepts a string, a regular expression, a list, a function, or True. Here are some examples:

    soup.find_all(text="Elsie")
    # ['Elsie']

    soup.find_all(text=["Tillie", "Elsie", "Lacie"])
    # ['Elsie', 'Lacie', 'Tillie']

    soup.find_all(text=re.compile("Dormouse"))
    # ["The Dormouse's story", "The Dormouse's story"]

    def is_the_only_string_within_a_tag(s):
        """Return True if this string is the only child of its parent tag."""
        return (s == s.parent.string)

    soup.find_all(text=is_the_only_string_within_a_tag)
    # ["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']

Although text is for finding strings, you can combine it with arguments that find tags. Beautiful Soup then finds all tags whose .string matches your value for text. This code finds the <a> tags whose .string is "Elsie":

    soup.find_all("a", text="Elsie")
    # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
    

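The `limit` argument

find_all() returns every result that matches your filters, which can take a while if the document is large. If you do not need all the results, pass in a number for limit. This works like the LIMIT keyword in SQL: Beautiful Soup stops gathering results once it has found that many.

There are three links in the "Alice" document that match, but this code only returns the first two:

    soup.find_all("a", limit=2)
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]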
    

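The `recursive` argument

When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. If you only want it to consider the tag's direct children, pass in recursive=False.

Here is part of the document:

    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
    ...

The <title> tag is beneath the <html> tag, but not directly beneath it, because the <head> tag is in the way. The search results with and without recursive=False therefore differ:

    soup.html.find_all("title")
    # [<title>The Dormouse's story</title>]

    soup.html.find_all("title", recursive=False)
    # []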
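The effect of recursive=False is easiest to see when the same tag name appears at two depths. This sketch uses a made-up fragment in which one <p> is a direct child of <div> and another is nested inside a <section>.

```python
from bs4 import BeautifulSoup

html = "<div><p>direct</p><section><p>nested</p></section></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Default: every <p> anywhere below <div>.
all_p = [p.string for p in div.find_all("p")]

# recursive=False: only <p> tags that are direct children of <div>.
direct_p = [p.string for p in div.find_all("p", recursive=False)]

print(all_p)
print(direct_p)
```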
    

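Calling a tag is like calling `find_all()`

Because find_all() is the most commonly used method in the Beautiful Soup search API, there is a shortcut for it. If you treat the BeautifulSoup object or a Tag object as though it were a function, the result is the same as calling find_all() on that object. These two lines of code are equivalent:

    soup.find_all("a")
    soup("a")

These two lines are also equivalent:

    soup.title.find_all(text=True)
    soup.title(text=True)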
   

find()

find( name , attrs , recursive , text , **kwargs )

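find_all() returns every tag in the document that matches, but sometimes you only want one result. If you know a document has only one <body> tag, scanning the whole document for more is wasted effort. Rather than passing limit=1 to find_all(), use find(). These two lines of code are nearly equivalent:

    soup.find_all('title', limit=1)
    # [<title>The Dormouse's story</title>]

    soup.find('title')
    # <title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing the single result, while find() returns the result itself.

If find_all() finds nothing, it returns an empty list. If find() finds nothing, it returns None:

    print(soup.find("nosuchtag"))
    # None

soup.head.title is the shortcut described under navigating by tag name. That shortcut works by repeatedly calling find():

    soup.head.title
    # <title>The Dormouse's story</title>

    soup.find("head").find("title")
    # <title>The Dormouse's story</title>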
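Because find() returns None when nothing matches, chained calls such as `soup.find("a").get("href")` raise AttributeError on documents that lack the tag. One defensive pattern, sketched here with hypothetical helper names (`title_text`, `first_link_href`), is to check for None before going further.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

def title_text(doc, default=""):
    """Return the document title, or a default when there is no <title>."""
    title = doc.find("title")
    return title.get_text() if title is not None else default

def first_link_href(doc):
    """Return the href of the first <a> tag, or None when there is no link."""
    link = doc.find("a")
    return link.get("href") if link is not None else None

print(title_text(soup))
print(first_link_href(soup))  # this document has no <a> tag
```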
   

find_parents() and find_parent()

find_parents( name , attrs , recursive , text , **kwargs )

find_parent( name , attrs , recursive , text , **kwargs )

A lot of space has gone to find_all() and find(). Beautiful Soup defines ten other methods for searching the tree. Five of them take the same arguments as find_all(), and the other five take the same arguments as find(). The only difference is which part of the tree they search.

Remember that find_all() and find() work their way down the tree, looking at a node's children, grandchildren, and so on. find_parents() and find_parent() do the opposite: they work their way up the tree, looking at a node's ancestors, and they accept the same filters as the ordinary search methods. Let's start from a leaf node of the document:

    a_string = soup.find(text="Lacie")
    a_string
    # 'Lacie'

    a_string.find_parents("a")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

    a_string.find_parent("p")
    # <p class="story">Once upon a time there were three little sisters; and their names were
    #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    #  and they lived at the bottom of a well.</p>

    a_string.find_parents("p", class_="title")
    # []

One of the <a> tags is the direct parent of this leaf node, so the search finds it. One of the <p> tags is an indirect ancestor of the leaf node, so the search finds that as well. The <p> tag with the CSS class "title" is not an ancestor of the leaf node, so find_parents() cannot find it.

find_parent() and find_parents() may remind you of the .parent and .parents attributes. The connection is very close: these search methods work by iterating over .parents and checking each ancestor against the filter.

find_next_siblings() and find_next_sibling()

find_next_siblings( name , attrs , recursive , text , **kwargs )

find_next_sibling( name , attrs , recursive , text , **kwargs )

These two methods use .next_siblings to iterate over the siblings that come after the current tag in the tree [5]. find_next_siblings() returns all the following siblings that match, and find_next_sibling() returns only the first one:

    first_link = soup.a
    first_link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    first_link.find_next_siblings("a")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    first_story_paragraph = soup.find("p", "story")
    first_story_paragraph.find_next_sibling("p")
    # <p class="story">...</p>
   

find_previous_siblings() and find_previous_sibling()

find_previous_siblings( name , attrs , recursive , text , **kwargs )

find_previous_sibling( name , attrs , recursive , text , **kwargs )

These two methods use .previous_siblings to iterate over the siblings that come before the current tag in the tree [5]. find_previous_siblings() returns all the preceding siblings that match, and find_previous_sibling() returns only the first one:

    last_link = soup.find("a", id="link3")
    last_link
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

    last_link.find_previous_siblings("a")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

    first_story_paragraph = soup.find("p", "story")
    first_story_paragraph.find_previous_sibling("p")
    # <p class="title"><b>The Dormouse's story</b></p>
   

find_all_next() and find_next()

find_all_next( name , attrs , recursive , text , **kwargs )

find_next( name , attrs , recursive , text , **kwargs )

These two methods use .next_elements to iterate over the tags and strings that come after the current tag in the document [5]. find_all_next() returns all matches, and find_next() returns only the first one:

    first_link = soup.a
    first_link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    first_link.find_all_next(text=True)
    # ['Elsie', ',\n', 'Lacie', ' and\n', 'Tillie',
    #  ';\nand they lived at the bottom of a well.', '\n\n', '...', '\n']

    first_link.find_next("p")
    # <p class="story">...</p>

In the first example, the string "Elsie" shows up even though it is contained within the <a> tag we started from. In the second example, the last <p> tag in the document shows up even though it is not in the same part of the tree as the <a> tag we started from. For these methods, all that matters is that an element matches the filter and appears later in the document than the starting element.

find_all_previous() and find_previous()

find_all_previous( name , attrs , recursive , text , **kwargs )

find_previous( name , attrs , recursive , text , **kwargs )

These two methods use .previous_elements to iterate over the tags and strings that come before the current node in the document [5]. find_all_previous() returns all matches, and find_previous() returns only the first one:

    first_link = soup.a
    first_link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    first_link.find_all_previous("p")
    # [<p class="story">Once upon a time there were three little sisters; ...</p>,
    #  <p class="title"><b>The Dormouse's story</b></p>]

    first_link.find_previous("title")
    # <title>The Dormouse's story</title>

find_all_previous("p") found the first paragraph in the document (the one with class="title"), but it also found the second paragraph, the <p> tag that contains the <a> tag we started from. This should not be surprising: the code looks for all <p> tags that appear earlier in the document than the starting <a> tag, and a <p> tag that contains the <a> tag must have been opened before it.

CSS selectors

Beautiful Soup supports most CSS selectors [6]. Pass a selector string to the .select() method of a Tag or of the BeautifulSoup object to find tags using CSS selector syntax:

    soup.select("title")
    # [<title>The Dormouse's story</title>]

    soup.select("p:nth-of-type(3)")
    # [<p class="story">...</p>]

Find tags beneath other tags, at any depth:

    soup.select("body a")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select("html head title")
    # [<title>The Dormouse's story</title>]

Find tags directly beneath other tags [6]:

    soup.select("head > title")
    # [<title>The Dormouse's story</title>]

    soup.select("p > a")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select("p > a:nth-of-type(2)")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

    soup.select("p > #link1")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

    soup.select("body > a")
    # []

Find the siblings of tags:

    soup.select("#link1 ~ .sister")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select("#link1 + .sister")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by CSS class:

    soup.select(".sister")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select("[class~=sister]")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by id:

    soup.select("#link1")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

    soup.select("a#link2")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Test for the existence of an attribute:

    soup.select('a[href]')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value:

    soup.select('a[href="http://example.com/elsie"]')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

    soup.select('a[href^="http://example.com/"]')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select('a[href$="tillie"]')
    # [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.select('a[href*=".com/el"]')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
   

經過語言設置來查找:

 
    multilingual_markup = """  <p lang="en">Hello</p>  <p lang="en-us">Howdy, y'all</p>  <p lang="en-gb">Pip-pip, old fruit</p>  <p lang="fr">Bonjour mes amis</p> """ multilingual_soup = BeautifulSoup(multilingual_markup) multilingual_soup.select('p[lang|=en]') # [<p lang="en">Hello</p>, # <p lang="en-us">Howdy, y'all</p>, # <p lang="en-gb">Pip-pip, old fruit</p>]  
   

對於熟悉CSS選擇器語法的人來講這是個很是方便的方法.Beautiful Soup也支持CSS選擇器API,若是你僅僅須要CSS選擇器的功能,那麼直接使用 lxml 也能夠,並且速度更快,支持更多的CSS選擇器語法,但Beautiful Soup整合了CSS選擇器的語法和自身方便使用API.
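A minimal sketch of that combination follows: a selector locates the tags, and the ordinary Tag API reads their attributes and text. It assumes a trimmed copy of the sample document, html.parser as the parser, and a Beautiful Soup release recent enough to provide select_one(); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# The selector picks the tags; dictionary-style access reads the attribute.
sister_links = [a["href"] for a in soup.select("p.story > a.sister")]

# select_one() returns the first match (or None) instead of a list.
second_sister = soup.select_one("#link2").get_text()

# Note the colon: "p:nth-of-type(3)" is the third <p> among its siblings.
third_paragraph = soup.select("p:nth-of-type(3)")[0].get_text()

print(sister_links)
print(second_sister, third_paragraph)
```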

Modifying the tree

Beautiful Soup's main strength is searching the parse tree, but it can also modify the tree conveniently.

Changing tag names and attributes

This was covered in the Attributes section, but it bears repeating. You can rename a tag, change the values of its attributes, add new attributes, and delete attributes:

    soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
    tag = soup.b

    tag.name = "blockquote"
    tag['class'] = 'verybold'
    tag['id'] = 1
    tag
    # <blockquote class="verybold" id="1">Extremely bold</blockquote>

    del tag['class']
    del tag['id']
    tag
    # <blockquote>Extremely bold</blockquote>

Modifying .string

Assigning to a tag's .string attribute replaces the tag's contents with the string you give:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup)

    tag = soup.a
    tag.string = "New link text."
    tag
    # <a href="http://example.com/">New link text.</a>

Note: if the tag contains other tags, assigning to its .string attribute destroys all of the existing contents, child tags included.

append()

Tag.append() adds content to a tag. It works just like .append() on a Python list:

    soup = BeautifulSoup("<a>Foo</a>")
    soup.a.append("Bar")

    soup
    # <html><head></head><body><a>FooBar</a></body></html>
    soup.a.contents
    # [u'Foo', u'Bar']

BeautifulSoup.new_string() and .new_tag()

To add a string to the document, pass a Python string to append(), or call the factory method BeautifulSoup.new_string():

    soup = BeautifulSoup("<b></b>")
    tag = soup.b
    tag.append("Hello")
    new_string = soup.new_string(" there")
    tag.append(new_string)
    tag
    # <b>Hello there</b>
    tag.contents
    # [u'Hello', u' there']

To create a comment, or any other subclass of NavigableString, pass that class as the second argument to new_string():

    from bs4 import Comment
    new_comment = soup.new_string("Nice to see you.", Comment)
    tag.append(new_comment)
    tag
    # <b>Hello there<!--Nice to see you.--></b>
    tag.contents
    # [u'Hello', u' there', u'Nice to see you.']

(This feature was added in Beautiful Soup 4.2.1.)

The best way to create a whole new tag is to call the factory method BeautifulSoup.new_tag():

    soup = BeautifulSoup("<b></b>")
    original_tag = soup.b

    new_tag = soup.new_tag("a", href="http://www.example.com")
    original_tag.append(new_tag)
    original_tag
    # <b><a href="http://www.example.com"></a></b>

    new_tag.string = "Link text."
    original_tag
    # <b><a href="http://www.example.com">Link text.</a></b>

Only the first argument, the tag name, is required; the other arguments are optional.

insert()

Tag.insert() is similar to Tag.append(), except that the new element does not necessarily go at the end of its parent's .contents. It is inserted at the numeric position you specify, just like .insert() on a Python list:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup)
    tag = soup.a

    tag.insert(1, "but did not endorse ")
    tag
    # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
    tag.contents
    # [u'I linked to ', u'but did not endorse ', <i>example.com</i>]

insert_before() and insert_after()

insert_before() inserts content immediately before the current tag or string:

    soup = BeautifulSoup("<b>stop</b>")
    tag = soup.new_tag("i")
    tag.string = "Don't"
    soup.b.string.insert_before(tag)
    soup.b
    # <b><i>Don't</i>stop</b>

insert_after() inserts content immediately after the current tag or string:

    soup.b.i.insert_after(soup.new_string(" ever "))
    soup.b
    # <b><i>Don't</i> ever stop</b>
    soup.b.contents
    # [<i>Don't</i>, u' ever ', u'stop']

clear()

Tag.clear() removes the contents of a tag:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup)
    tag = soup.a

    tag.clear()
    tag
    # <a href="http://example.com/"></a>

extract()

PageElement.extract() removes a tag or string from the tree and returns it:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup)
    a_tag = soup.a

    i_tag = soup.i.extract()

    a_tag
    # <a href="http://example.com/">I linked to</a>

    i_tag
    # <i>example.com</i>

    print(i_tag.parent)
    # None

At this point there are effectively two parse trees: one rooted at the BeautifulSoup object used to parse the document, and one rooted at the tag that was extracted. You can go on to call extract on a child of the extracted element:

    my_string = i_tag.string.extract()
    my_string
    # u'example.com'

    print(my_string.parent)
    # None
    i_tag
    # <i></i>
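Because extract() hands back a detached element, that element can be attached again somewhere else. The sketch below illustrates this under the assumption that html.parser is used; the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

# extract() detaches the <i> tag and returns it as the root of its own small tree.
i_tag = soup.i.extract()
detached_parent = i_tag.parent
remaining = str(soup.a)

# The detached tag can be re-attached anywhere, here at the end of the link.
soup.a.append(i_tag)
restored = str(soup.a)

print(remaining)
print(restored)
```

If the removed element is not needed again, decompose() is the alternative that discards it.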
   

decompose()

Tag.decompose() removes a tag from the tree, then completely destroys it and its contents:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup)
    a_tag = soup.a

    soup.i.decompose()

    a_tag
    # <a href="http://example.com/">I linked to</a>

replace_with()

PageElement.replace_with() removes a tag or string from the tree and replaces it with the tag or string of your choice:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup)
    a_tag = soup.a

    new_tag = soup.new_tag("b")
    new_tag.string = "example.net"
    a_tag.i.replace_with(new_tag)

    a_tag
    # <a href="http://example.com/">I linked to <b>example.net</b></a>

replace_with() returns the tag or string that was replaced, so that you can examine it or add it back to another part of the tree.

wrap()

PageElement.wrap() wraps an element in the tag you specify [8] and returns the new wrapper:

    soup = BeautifulSoup("<p>I wish I was bold.</p>")
    soup.p.string.wrap(soup.new_tag("b"))
    # <b>I wish I was bold.</b>

    soup.p.wrap(soup.new_tag("div"))
    # <div><p><b>I wish I was bold.</b></p></div>

This method was added in Beautiful Soup 4.0.5.

unwrap()

Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever is inside that tag, which makes it useful for stripping out markup:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup)
    a_tag = soup.a

    a_tag.i.unwrap()
    a_tag
    # <a href="http://example.com/">I linked to example.com</a>

Like replace_with(), unwrap() returns the tag that was removed.
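The two methods undo each other, which the following sketch shows as a round trip. It assumes html.parser, so no <html> or <body> wrapper is added around the fragment.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")

# wrap() encloses the string in a new <b> tag and returns that wrapper.
wrapper = soup.p.string.wrap(soup.new_tag("b"))
wrapped = str(soup.p)

# unwrap() removes the <b> tag but keeps its contents in place;
# the (now empty) tag that was removed is returned.
removed = soup.p.b.unwrap()
unwrapped = str(soup.p)

print(wrapped)
print(unwrapped)
```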

Output

Pretty-printing

The prettify() method turns a Beautiful Soup parse tree into a nicely formatted Unicode string, with each XML/HTML tag on its own line:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup)
    soup.prettify()
    # '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'

    print(soup.prettify())
    # <html>
    #  <head>
    #  </head>
    #  <body>
    #   <a href="http://example.com/">
    #    I linked to
    #    <i>
    #     example.com
    #    </i>
    #   </a>
    #  </body>
    # </html>

You can call prettify() on the top-level BeautifulSoup object or on any of its Tag objects:

    print(soup.a.prettify())
    # <a href="http://example.com/">
    #  I linked to
    #  <i>
    #   example.com
    #  </i>
    # </a>

Non-pretty printing

If you just want a string and do not care about formatting, call str() on a BeautifulSoup object or on a Tag within it (under Python 2, unicode() gives the Unicode form):

    str(soup)
    # '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'

    str(soup.a)
    # '<a href="http://example.com/">I linked to <i>example.com</i></a>'

Under Python 3, str() returns a Unicode string. Under Python 2 it returned a byte string encoded in UTF-8; see the Encodings section for the other options.

You can also call encode() to get a byte string, and decode() to get Unicode.

Output formatters

When it parses a document, Beautiful Soup converts HTML entities such as "&ldquo;" into Unicode characters:

    soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
    str(soup)
    # '<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'

If you then convert the document to a byte string, the Unicode characters are encoded as UTF-8. You do not get the HTML entities back:

    soup.encode("utf8")
    # b'<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'

get_text()

If you only want the text part of a document or tag, call get_text(). It returns all the text in the tag, including the text of its descendants, as a single Unicode string:

    markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
    soup = BeautifulSoup(markup)

    soup.get_text()
    # u'\nI linked to example.com\n'
    soup.i.get_text()
    # u'example.com'

You can specify a string to be used to join the bits of text together:

    soup.get_text("|")
    # u'\nI linked to |example.com|\n'

You can also strip whitespace from the beginning and end of each bit of text:

    soup.get_text("|", strip=True)
    # u'I linked to|example.com'

Alternatively, use the .stripped_strings generator and process the pieces of text yourself:

    [text for text in soup.stripped_strings]
    # [u'I linked to', u'example.com']
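The three ways of collecting text can be compared side by side. This is a small sketch that assumes html.parser; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# Plain concatenation of every string in the tree.
raw_text = soup.get_text()

# Join the pieces with a separator and strip each piece first;
# pieces that are empty after stripping are dropped.
joined_text = soup.get_text("|", strip=True)

# The same stripped pieces as a list, for custom processing.
pieces = list(soup.stripped_strings)

print(repr(raw_text))
print(joined_text)
print(pieces)
```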
   

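Specifying the parser to use

If you just need to parse some HTML, you can pass the markup to the BeautifulSoup constructor and it will probably be fine: Beautiful Soup picks a parser for you. You can also pass an argument that names the parser to use.

The first argument to the BeautifulSoup constructor is the markup to parse, either a string or an open filehandle. The second argument says how it should be parsed. If the second argument is omitted, Beautiful Soup picks the best parser among the libraries installed on the system, in this order of preference: lxml, html5lib, the Python standard library. Two things change that order:

The type of markup you want to parse. Currently supported are "html", "xml", and "html5".

The name of the parser library you want to use. Currently supported are "lxml", "html5lib", and "html.parser".

The section Installing a parser explains which parsers are available and how to install them.

If the parser you asked for is not installed, the original documentation says that Beautiful Soup ignores the request and picks another one; recent releases raise bs4.FeatureNotFound instead. Currently lxml is the only parser that supports XML, so without lxml installed you cannot get an XML parse, and explicitly asking for "lxml" will not work either.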
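One way to keep the choice of parser explicit while still coping with machines that lack lxml or html5lib is to try a list of preferred parsers in order. The helper below is a hypothetical sketch, not part of Beautiful Soup: make_soup and its preferred parameter are names invented for this example, and it assumes a release in which a missing parser raises bs4.FeatureNotFound.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Returns the soup together with the name of the parser actually used,
    so callers can log it or include it in bug reports.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser library is not installed; try the next one
    raise RuntimeError("none of the requested parsers is installed")


soup, used_parser = make_soup("<a><b /></a>")
print(used_parser)
print(soup.a)
```

Since html.parser ships with Python, the default list always finds something to use.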

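Differences between parsers

Beautiful Soup presents the same interface to a number of different parsers, but the parsers themselves differ. Different parsers can create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers. Here is a short document parsed as HTML:

    BeautifulSoup("<a><b /></a>")
    # <html><head></head><body><a><b></b></a></body></html>

Because an empty <b /> tag is not valid HTML, the parser turns it into a <b></b> tag pair.

Here is the same document parsed as XML (this requires lxml). Note that the empty <b /> tag is left alone, and that the document is given an XML declaration instead of being wrapped in an <html> tag:

    BeautifulSoup("<a><b /></a>", "xml")
    # <?xml version="1.0" encoding="utf-8"?>
    # <a><b/></a>

There are also differences between the HTML parsers. If you give Beautiful Soup a perfectly formed HTML document, these differences do not matter: one parser is faster than another, but they all return a correct tree.

If the document is not perfectly formed, however, different parsers can give different results. Here is a short, invalid document parsed with lxml. Note that the dangling </p> tag is simply ignored:

    BeautifulSoup("<a></p>", "lxml")
    # <html><body><a></a></body></html>

Here is the same document parsed with html5lib:

    BeautifulSoup("<a></p>", "html5lib")
    # <html><head></head><body><a><p></p></a></body></html>

Instead of ignoring the dangling </p> tag, html5lib pairs it with an opening <p> tag. It also adds a <head> tag to the document.

Here is the same document parsed with Python's built-in HTML parser:

    BeautifulSoup("<a></p>", "html.parser")
    # <a></a>

Like lxml [7], the built-in parser ignores the closing </p> tag. Unlike html5lib, it makes no attempt to create a well-formed document or to wrap the fragment in a <body> tag. Unlike lxml, it does not even add an <html> tag.

Since the document "<a></p>" is invalid, none of these techniques is the single "correct" way to handle it. html5lib uses techniques that are part of the HTML5 standard, so it has the best claim to being "correct", but all three results are legitimate.

Differences between parsers can affect your script. If you distribute code that uses BeautifulSoup, state which parser it was written for, to spare others unnecessary trouble.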
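To see which of these results a given machine produces, the invalid fragment can be run through every parser that happens to be installed. This is a rough sketch: it assumes that a missing parser raises bs4.FeatureNotFound, and the results dictionary is just a name chosen for the example.

```python
from bs4 import BeautifulSoup, FeatureNotFound

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup("<a></p>", parser))
    except FeatureNotFound:
        results[parser] = None  # parser library not installed here

for parser, tree in results.items():
    print(parser, "->", tree)
```

Only the html.parser entry is guaranteed to be filled in, because that parser is part of the standard library.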

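Encodings

Any HTML or XML document is written in a specific encoding such as ASCII or UTF-8. Once it has been loaded into Beautiful Soup, however, the document is Unicode:

    markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"
    soup = BeautifulSoup(markup)
    soup.h1
    # <h1>Sacré bleu!</h1>
    soup.h1.string
    # u'Sacr\xe9 bleu!'

This is not magic (though it would be nice if it were). Beautiful Soup uses a sub-library called Unicode, Dammit to detect a document's encoding and convert it to Unicode. The autodetected encoding is available as the .original_encoding attribute of the BeautifulSoup object:

    soup.original_encoding
    # 'utf-8'

Unicode, Dammit guesses correctly most of the time, but it is sometimes wrong. Sometimes it guesses correctly only after a byte-by-byte search of the whole document, which is slow. If you know the document's encoding ahead of time, you can avoid both mistakes and delays by passing it to the BeautifulSoup constructor as the from_encoding argument.

Here is a document written in ISO-8859-8. The document is so short that Beautiful Soup misidentifies it as ISO-8859-7:

    markup = b"<h1>\xed\xe5\xec\xf9</h1>"
    soup = BeautifulSoup(markup)
    soup.h1
    # <h1>νεμω</h1>
    soup.original_encoding
    # 'ISO-8859-7'

Passing the correct from_encoding fixes this:

    soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
    soup.h1
    # <h1>םולש</h1>
    soup.original_encoding
    # 'iso8859-8'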
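The effect of from_encoding can be verified without relying on how the characters are displayed, by comparing the parsed string with the same bytes decoded by Python itself. The sketch assumes html.parser; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

raw = b"\xed\xe5\xec\xf9"
markup = b"<h1>" + raw + b"</h1>"

# Tell Beautiful Soup the encoding up front instead of letting it guess.
soup = BeautifulSoup(markup, "html.parser", from_encoding="iso-8859-8")
heading = soup.h1.string

# Reference value: the same bytes decoded directly with Python's codec.
expected = raw.decode("iso-8859-8")

print(heading == expected)
print(soup.original_encoding)
```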

In rare cases (usually when a UTF-8 document contains text written in a completely different encoding), the only way to get Unicode may be to replace some characters with the special Unicode character "REPLACEMENT CHARACTER" (U+FFFD, �) [9]. If Beautiful Soup had to do this while working out the encoding, it sets the .contains_replacement_characters attribute to True on the UnicodeDammit or BeautifulSoup object. This tells you that the Unicode representation is not an exact copy of the original and that some data was lost. If a document contains � but .contains_replacement_characters is False, the � was in the document originally and does not stand in for a failed conversion.

Output encoding

When you write out a document from Beautiful Soup, you get a UTF-8 document, whatever encoding the input was in. Here is a document written in Latin-1:

    markup = b'''
     <html>
      <head>
       <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
      </head>
      <body>
       <p>Sacr\xe9 bleu!</p>
      </body>
     </html>
    '''

    soup = BeautifulSoup(markup)
    print(soup.prettify())
    # <html>
    #  <head>
    #   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
    #  </head>
    #  <body>
    #   <p>
    #    Sacré bleu!
    #   </p>
    #  </body>
    # </html>

Note that the <meta> tag has been rewritten to reflect the fact that the document is now in UTF-8.

If you do not want UTF-8, pass an encoding to prettify():

    print(soup.prettify("latin-1"))
    # <html>
    #  <head>
    #   <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
    # ...

You can also call encode() on the BeautifulSoup object or on any element in the soup, just as you would on a Python string:

    soup.p.encode("latin-1")
    # b'<p>Sacr\xe9 bleu!</p>'

    soup.p.encode("utf-8")
    # b'<p>Sacr\xc3\xa9 bleu!</p>'

Characters that cannot be represented in the chosen encoding are converted into numeric XML character references. Here is a document that includes the Unicode character SNOWMAN:

    markup = u"<b>\N{SNOWMAN}</b>"
    snowman_soup = BeautifulSoup(markup)
    tag = snowman_soup.b

The SNOWMAN character can be part of a UTF-8 document (it looks like ☃), but it has no representation in ISO-Latin-1 or ASCII, so for those encodings it is converted to "&#9731;":

    print(tag.encode("utf-8"))
    # b'<b>\xe2\x98\x83</b>'

    print(tag.encode("latin-1"))
    # b'<b>&#9731;</b>'

    print(tag.encode("ascii"))
    # b'<b>&#9731;</b>'
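The character reference is lossless: parsing the ASCII output again gives back the original character. A small sketch, assuming html.parser and Python 3 byte strings:

```python
from bs4 import BeautifulSoup

snowman_soup = BeautifulSoup("<b>\N{SNOWMAN}</b>", "html.parser")
tag = snowman_soup.b

# UTF-8 can represent the character directly.
as_utf8 = tag.encode("utf-8")

# ASCII cannot, so the character becomes a numeric character reference.
as_ascii = tag.encode("ascii")

# Parsing the ASCII bytes again turns the reference back into the character.
round_trip = BeautifulSoup(as_ascii, "html.parser").b.string

print(as_utf8)
print(as_ascii)
print(round_trip == "\N{SNOWMAN}")
```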
   

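Unicode, Dammit

You can use Unicode, Dammit without using Beautiful Soup. It is useful whenever you have data in an unknown encoding and just want it to become Unicode:

    from bs4 import UnicodeDammit
    dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
    print(dammit.unicode_markup)
    # Sacré bleu!
    dammit.original_encoding
    # 'utf-8'

Unicode, Dammit's guesses become much more accurate if the chardet or cchardet library is installed. The more data you give it, the more accurately it guesses. If you have your own suspicions about what the encoding might be, pass them in as a list and they will be tried first:

    dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
    print(dammit.unicode_markup)
    # Sacré bleu!
    dammit.original_encoding
    # 'latin-1'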
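Under Python 3 the input to UnicodeDammit should be a byte string, since text that is already Unicode has nothing left to detect. The sketch below builds the bytes explicitly so the expected result is unambiguous; the variable names are illustrative.

```python
from bs4 import UnicodeDammit

original_text = "Sacr\u00e9 bleu!"
data = original_text.encode("latin-1")  # b"Sacr\xe9 bleu!"

# The suggested encodings are tried before any guessing takes place.
dammit = UnicodeDammit(data, ["latin-1", "iso-8859-1"])
text = dammit.unicode_markup

print(text)
print(dammit.original_encoding)
```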
   

Unicode, Dammit has two special features that Beautiful Soup itself does not use.

Smart quotes

You can use Unicode, Dammit to convert Microsoft smart quotes [10] into HTML or XML entities:

    markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
    # u'<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'

    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
    # u'<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'

You can also convert smart quotes to ASCII quotes:

    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
    # u'<p>I just "love" Microsoft Word\'s smart quotes</p>'

This is a handy feature, but Beautiful Soup does not use it. By default Beautiful Soup converts smart quotes to Unicode characters, like everything else:

    UnicodeDammit(markup, ["windows-1252"]).unicode_markup
    # u'<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>'

Inconsistent encodings

Sometimes a document is mostly in UTF-8 but also contains Windows-1252 characters such as Microsoft smart quotes [10]. This tends to happen on websites that aggregate data from several sources. UnicodeDammit.detwingle() turns such a document into pure UTF-8. Here is a simple example:

    snowmen = (u"\N{SNOWMAN}" * 3)
    quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
    doc = snowmen.encode("utf8") + quote.encode("windows_1252")

This document is a mess. The snowmen are in UTF-8 and the quotes are in Windows-1252, so you can display the snowmen or the quotes, but not both:

    print(doc)
    # ☃☃☃�I like snowmen!�

    print(doc.decode("windows-1252"))
    # â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”

Decoding the document as UTF-8 raises a UnicodeDecodeError, and decoding it as Windows-1252 gives gibberish. Fortunately, UnicodeDammit.detwingle() converts the string to pure UTF-8, so that it can be decoded to Unicode and the snowmen and quotation marks displayed together:

    new_doc = UnicodeDammit.detwingle(doc)
    print(new_doc.decode("utf8"))
    # ☃☃☃“I like snowmen!”

UnicodeDammit.detwingle() only knows how to handle Windows-1252 embedded in UTF-8 (and, presumably, the reverse), but this is the most common case.

Call UnicodeDammit.detwingle() on your data before passing it to BeautifulSoup or to the UnicodeDammit constructor. Beautiful Soup assumes that a document has a single encoding; if you parse a UTF-8 document that contains Windows-1252 text, the result is gibberish such as â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”.
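The following sketch runs the whole sequence: build the mixed-encoding bytes, show that a strict UTF-8 decode fails, then repair them with detwingle(). The helper variable utf8_decodes is a name invented for this example.

```python
from bs4 import UnicodeDammit

snowmen = "\N{SNOWMAN}" * 3
quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"

# UTF-8 snowmen followed by Windows-1252 smart quotes: two encodings in one byte string.
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

# A strict UTF-8 decode fails on the Windows-1252 bytes.
try:
    doc.decode("utf8")
    utf8_decodes = True
except UnicodeDecodeError:
    utf8_decodes = False

# detwingle() re-encodes the Windows-1252 bytes so the whole string is UTF-8.
fixed = UnicodeDammit.detwingle(doc).decode("utf8")

print(utf8_decodes)
print(fixed)
```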

UnicodeDammit.detwingle() was added in Beautiful Soup 4.1.0.

Parsing only part of a document

Parsing an entire document just to find its <a> tags wastes time and memory. It is much faster to ignore everything that is not an <a> tag from the start. The SoupStrainer class lets you choose which parts of an incoming document are parsed, so that only the parts it describes end up in the tree. Create a SoupStrainer and pass it to the BeautifulSoup constructor as the parse_only argument.

SoupStrainer

SoupStrainer takes the same arguments as a typical search method: name , attrs , text and **kwargs . Here are three SoupStrainer objects:

    from bs4 import SoupStrainer

    only_a_tags = SoupStrainer("a")

    only_tags_with_id_link2 = SoupStrainer(id="link2")

    def is_short_string(string):
        return len(string) < 10

    only_short_strings = SoupStrainer(text=is_short_string)

Returning to the "Alice" document, here is what it looks like when parsed with each of the three SoupStrainer objects:

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>

    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
    # <a class="sister" href="http://example.com/elsie" id="link1">
    #  Elsie
    # </a>
    # <a class="sister" href="http://example.com/lacie" id="link2">
    #  Lacie
    # </a>
    # <a class="sister" href="http://example.com/tillie" id="link3">
    #  Tillie
    # </a>

    print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
    # <a class="sister" href="http://example.com/lacie" id="link2">
    #  Lacie
    # </a>

    print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
    # Elsie
    # ,
    # Lacie
    # and
    # Tillie
    # ...
    #

You can also pass a SoupStrainer to any of the methods covered in Searching the tree. This is probably not a common use, but it is worth mentioning:

    soup = BeautifulSoup(html_doc)
    soup.find_all(only_short_strings)
    # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
    #  u'\n\n', u'...', u'\n']
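A typical use of parse_only is link extraction, where nothing but the <a> tags is ever needed. The sketch below assumes a trimmed copy of the sample document and html.parser; the variable names are illustrative.

```python
from bs4 import BeautifulSoup, SoupStrainer

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>
"""

# Only <a> tags (and their contents) are turned into tree nodes.
only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags)

hrefs = [a["href"] for a in soup.find_all("a")]

# Everything else, including the <p> tags, never made it into the tree.
paragraph = soup.find("p")

print(hrefs)
print(paragraph)
```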
   

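Troubleshooting

diagnose()

If you want to know what Beautiful Soup does with a document, pass the document to the diagnose() function (new in Beautiful Soup 4.2.0). Beautiful Soup prints a report showing how the different parsers handle the document and which parser the current setup would use:

    from bs4.diagnose import diagnose
    with open("bad.html") as fp:
        data = fp.read()
    diagnose(data)

    # Diagnostic running on Beautiful Soup 4.2.0
    # Python version 2.7.3 (default, Aug 1 2012, 05:16:07)
    # I noticed that html5lib is not installed. Installing it may help.
    # Found lxml version 2.3.2.0
    #
    # Trying to parse your data with html.parser
    # Here's what html.parser did with the document:
    # ...

The output of diagnose() may show you where the problem lies. If not, you can paste the output when asking someone else for help.

Errors when parsing a document

There are two kinds of parse errors. One is a crash: you feed a document to Beautiful Soup and it raises an exception, usually HTMLParser.HTMLParseError . The other is unexpected behavior: the parse tree looks very different from the document it was built from.

Almost none of these problems turn out to be problems with Beautiful Soup. This is not because Beautiful Soup is an amazingly well-written piece of software; it is because Beautiful Soup contains no parsing code of its own. The errors come from the underlying parser. If one parser does not handle a document well, the best solution is to try a different parser. See Installing a parser for details.

The most common parse errors are HTMLParser.HTMLParseError: malformed start tag and HTMLParser.HTMLParseError: bad end tag . Both are generated by Python's built-in parser, and the solution is to install lxml or html5lib.

The most common kind of unexpected behavior is that a tag you can plainly see in the document cannot be found: find_all() returns [] or find() returns None . This is another common problem with Python's built-in parser, which sometimes skips tags it does not understand. Again, the solution is to install lxml or html5lib.

Version mismatch problems

SyntaxError: Invalid syntax (on the line ROOT_TAG_NAME = u'[document]' ): caused by running the Python 2 version of Beautiful Soup under Python 3 without converting the code.
ImportError: No module named HTMLParser : caused by running the Python 2 version of Beautiful Soup under Python 3.
ImportError: No module named html.parser : caused by running the Python 3 version of Beautiful Soup under Python 2.
ImportError: No module named BeautifulSoup : caused by running Beautiful Soup 3 code on a system that does not have BS3 installed, or by writing Beautiful Soup 4 code without knowing that the package name has changed to bs4 .
ImportError: No module named bs4 : caused by running Beautiful Soup 4 code on a system that does not have BS4 installed.

Parsing XML

By default Beautiful Soup parses documents as HTML. To parse a document as XML, pass "xml" as the second argument to the BeautifulSoup constructor:

    soup = BeautifulSoup(markup, "xml")

You will also need lxml installed.

Other parser problems

If the same code gives different results in different environments, the two environments are probably using different parsers. For example, one has lxml installed and the other has only html5lib. See Differences between parsers for why this matters, and fix the problem by naming a specific parser in the BeautifulSoup constructor.

Because HTML tags and attributes are case-insensitive, all three HTML parsers convert tag and attribute names to lowercase. For example, the markup <TAG></TAG> is converted to <tag></tag> . To preserve mixed-case or uppercase tags and attributes, parse the document as XML.

Miscellaneous

UnicodeEncodeError: 'charmap' codec can't encode character u'\xfoo' in position bar (or just about any other UnicodeEncodeError ): this is not a Beautiful Soup problem. It shows up in two main situations. In the first, the console you are printing to cannot display some Unicode characters (see the Python wiki for help). In the second, you are writing to a file that does not support some of the Unicode characters; in that case, convert the string to UTF-8 with u.encode("utf8") .
KeyError: [attr] : caused by accessing tag['attr'] when the tag does not define that attribute. The most common cases are KeyError: 'href' and KeyError: 'class' . If you are not sure the attribute is defined, use tag.get('attr') , just as you would with a Python dictionary.
AttributeError: 'ResultSet' object has no attribute 'foo' : this usually happens because the result of find_all() was treated as a single tag or string. It is actually a list (a ResultSet object) of tags and strings. Iterate over the list and look at the .foo of each item, or use find() if you only want one result.
AttributeError: 'NoneType' object has no attribute 'foo' : this usually happens because you called find() and then immediately tried to access the .foo attribute of the result, but find() found nothing and returned None . You need to work out why find() returned None .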
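The last three errors all have the same defensive remedy: check before dereferencing. The sketch below shows one way to do that; the markup is invented for the example and html.parser is an assumed choice.

```python
from bs4 import BeautifulSoup

markup = '<p><a id="top">no link</a> <a href="http://example.com/">link</a></p>'
soup = BeautifulSoup(markup, "html.parser")

# tag['href'] raises KeyError when the attribute is missing ...
try:
    soup.a["href"]
    key_error_raised = False
except KeyError:
    key_error_raised = True

# ... while tag.get('href') returns None instead.
hrefs = [a.get("href") for a in soup.find_all("a")]

# find() returns None when nothing matches, so test before using the result.
table = soup.find("table")
table_text = table.get_text() if table is not None else ""

print(key_error_raised)
print(hrefs)
print(repr(table_text))
```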

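Improving performance

Beautiful Soup will never be as fast as the parsers it sits on top of. If response time is critical, or if computer time is worth more than programmer time, you should use lxml directly.

That said, there are ways to speed up Beautiful Soup. The first is to use lxml as the underlying parser: Beautiful Soup parses documents significantly faster with lxml than with html5lib or Python's built-in parser.

Installing the cchardet library speeds up encoding detection.

Parsing only part of a document will not save much parsing time, but it can save a lot of memory and makes searching the document much faster.

Beautiful Soup 3

Beautiful Soup 3 is the previous release series and is no longer maintained. It is packaged by the major Linux distributions:

$ apt-get install python-beautifulsoup

It is also published on PyPI as BeautifulSoup :

$ easy_install BeautifulSoup

$ pip install BeautifulSoup

You can also install it from the Beautiful Soup 3.2.0 source tarball.

The Beautiful Soup 3 documentation is available online, including a Chinese translation. Read it, then come back to this document to see what has changed in Beautiful Soup 4.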

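Porting code to BS4

Most code written against Beautiful Soup 3 works against Beautiful Soup 4 with one simple change: the way the BeautifulSoup class is imported. Change this:

    from BeautifulSoup import BeautifulSoup

to this:

    from bs4 import BeautifulSoup

If you get the ImportError "No module named BeautifulSoup", you are probably trying to run Beautiful Soup 3 code in an environment that only has Beautiful Soup 4 installed.
If you get the ImportError "No module named bs4", you are probably trying to run Beautiful Soup 4 code in an environment that only has Beautiful Soup 3 installed.

Although BS4 is mostly backwards-compatible with BS3, most BS3 methods have been deprecated and renamed to comply with PEP 8. Many methods were renamed, and a few of the changes break backwards compatibility.

The sections below cover what you need to know to move from BS3 to BS4.

Parsers

Beautiful Soup 3 used Python's SGMLParser , a module that has been removed in Python 3. Beautiful Soup 4 uses html.parser by default, and you can plug in lxml or html5lib instead. See Installing a parser.

Because html.parser is not the same parser as SGMLParser , the two can produce different results for malformed documents. On older Python versions html.parser could also raise an exception on such documents, which is one reason to install lxml or html5lib. Sometimes the tree produced by html.parser differs from the one SGMLParser produced; if that happens, you need to update your BS3-era code to handle the new tree.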

Method names

renderContents -> encode_contents
replaceWith -> replace_with
replaceWithChildren -> unwrap
findAll -> find_all
findAllNext -> find_all_next
findAllPrevious -> find_all_previous
findNext -> find_next
findNextSibling -> find_next_sibling
findNextSiblings -> find_next_siblings
findParent -> find_parent
findParents -> find_parents
findPrevious -> find_previous
findPreviousSibling -> find_previous_sibling
findPreviousSiblings -> find_previous_siblings
nextSibling -> next_sibling
previousSibling -> previous_sibling

Some arguments to the Beautiful Soup constructor were renamed for the same reason:

BeautifulSoup(parseOnlyThese=...) -> BeautifulSoup(parse_only=...)
BeautifulSoup(fromEncoding=...) -> BeautifulSoup(from_encoding=...)

One method was renamed for compatibility with Python 3:

Tag.has_key() -> Tag.has_attr()

One attribute was renamed to use more accurate terminology:

Tag.isSelfClosing -> Tag.is_empty_element

Three attributes were renamed to avoid clashing with words that have special meaning in Python. These changes are not backwards compatible: BS3 code that uses these attributes will not run under BS4 until it is changed.

UnicodeDammit.unicode -> UnicodeDammit.unicode_markup
Tag.next -> Tag.next_element
Tag.previous -> Tag.previous_element

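Generators

The following generators were given PEP 8-compliant names and turned into properties:

childGenerator() -> children
nextGenerator() -> next_elements
nextSiblingGenerator() -> next_siblings
previousGenerator() -> previous_elements
previousSiblingGenerator() -> previous_siblings
recursiveChildGenerator() -> descendants
parentGenerator() -> parents

So when porting to BS4, replace this:

    for parent in tag.parentGenerator():
        ...

with this:

    for parent in tag.parents:
        ...

(The old code still works.)

Some BS3 generators yielded None after they were done and then stopped. That was a bug. The new generators no longer yield None .

BS4 adds two new generators, .strings and .stripped_strings . .strings yields NavigableString objects, and .stripped_strings yields Python strings with leading and trailing whitespace removed.

XML

BS4 no longer has a BeautifulStoneSoup class for parsing XML. To parse XML, pass "xml" as the second argument to the BeautifulSoup constructor. For the same reason, the BeautifulSoup constructor no longer recognizes the isHTML argument.

Beautiful Soup's handling of empty-element XML tags has been improved. Previously, you had to say explicitly which tags were empty-element tags when parsing XML. The selfClosingTags constructor argument is no longer recognized. Instead, Beautiful Soup treats any empty tag as an empty-element tag; if you add a child to an empty-element tag, it stops being one.

Entities

Incoming HTML or XML entities are always converted into the corresponding Unicode characters. Beautiful Soup 3 had a number of overlapping ways of dealing with entities, which have been removed. The BeautifulSoup constructor no longer accepts the smartQuotesTo or convertEntities arguments. (Unicode, Dammit still has smart_quotes_to , but its default is now to turn smart quotes into Unicode.) The constants HTML_ENTITIES , XML_ENTITIES and XHTML_ENTITIES have been removed, because they configured a feature that no longer exists.

If you want Unicode characters turned back into HTML entities on output, rather than written as UTF-8 characters, use an output formatter.

Miscellaneous

Tag.string now operates recursively. If tag A contains a single tag B and nothing else, then A.string is the same as B.string .

Multi-valued attributes such as class have lists of strings as their values, not single strings. This may affect the way you search by CSS class.

If you pass both the text argument and the name argument to one of the find* methods, Beautiful Soup searches for tags with the given name whose Tag.string matches the text value. It does not return the strings themselves. Previously, Beautiful Soup ignored the name argument and searched only on text .

The BeautifulSoup constructor no longer recognizes the markupMassage argument. It is now the parser's responsibility to handle markup correctly.

The rarely used alternate parser classes such as ICantBelieveItsBeautifulSoup and BeautifulSOAP have been removed. It is now entirely up to the parser how ambiguous markup is interpreted.

The prettify() method now returns a Unicode string, not a byte string.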


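[1]	The Beautiful Soup Google discussion group is not very active, perhaps because the library is already fairly mature, but the author is still happy to help solve your problems.

[2]	(1, 2) The document is parsed into a tree, so the next step of parsing should be a child of the current node.

[3]	A filter can only be used as an argument when searching the document; "argument type" might be a more fitting name. The original text uses `filter`, so it is translated accordingly.

[4]	An element argument is a tag node in the HTML document; it cannot be a text node.

[5]	(1, 2, 3, 4, 5) Using pre-order traversal.

[6]	(1, 2) CSS selectors are a separate document search syntax; see http://www.w3school.com.cn/css/css_selector_type.asp

[7]	The original says html5lib; the translator believes this is a typo in the original document.

[8]	wrap means to package or enclose, but here the wrapping is not applied from the outside: the inner content of the current tag is wrapped in a new tag. The new tag that wraps the original content stays inside the tag on which wrap() was called.

[9]	Beautiful Soup automatically replaces characters it cannot convert with a special character (usually �). If you need a document with mixed encodings to be converted completely correctly, you have to preprocess it by hand and unify the encoding first.

[10]	(1, 2) Smart quotes are common in Microsoft Word: within a paragraph, each quotation mark is automatically converted into a left or a right quotation mark according to the order in which it appears.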
