Original by Leonard Richardson (leonardr@segfault.org)
Translation by Richie Yan (richieyan@gmail.com)
###If parts of the translation are inaccurate or hard to follow, just look at the examples.###
Click here for the English original.
Beautiful Soup is an HTML/XML parser for Python, written in Python, that does a very good job with bad markup and builds a parse tree. It provides simple, commonly-used operations for navigating, searching, and modifying the parse tree, and it can save you a lot of programming time. For Ruby, use Rubyful Soup.
This document explains the major features of Beautiful Soup 3.0, with examples. It shows you what the library is good for, how it works, how to make it do what you want, and what to do when it violates your expectations.
findAll(name, attrs, recursive, text, limit, **kwargs)
find(name, attrs, recursive, text, **kwargs)
Where did first go?
findNextSiblings(name, attrs, text, limit, **kwargs) and findNextSibling(name, attrs, text, **kwargs)
findPreviousSiblings(name, attrs, text, limit, **kwargs) and findPreviousSibling(name, attrs, text, **kwargs)
findAllNext(name, attrs, text, limit, **kwargs) and findNext(name, attrs, text, **kwargs)
findAllPrevious(name, attrs, text, limit, **kwargs) and findPrevious(name, attrs, text, **kwargs)
SoupStrainer
extract
Improving memory usage
Get Beautiful Soup here. The changelog describes the differences between version 3.0 and earlier versions.
Import the Beautiful Soup library into your program:
from BeautifulSoup import BeautifulSoup          # For processing HTML
from BeautifulSoup import BeautifulStoneSoup     # For processing XML
import BeautifulSoup                             # To get everything
The following code demonstrates the basic features of Beautiful Soup. You can copy and paste it into a Python file and run it yourself.
from BeautifulSoup import BeautifulSoup
import re

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
#<html>
# <head>
#  <title>
#   Page title
#  </title>
# </head>
# <body>
#  <p id="firstpara" align="center">
#   This is paragraph
#   <b>
#    one
#   </b>
#   .
#  </p>
#  <p id="secondpara" align="blah">
#   This is paragraph
#   <b>
#    two
#   </b>
#   .
#  </p>
# </body>
#</html>
Here are a few ways to navigate the soup:
soup.contents[0].name
#u'html'
soup.contents[0].contents[0].name
#u'head'
head = soup.contents[0].contents[0]
head.parent.name
#u'html'
head.next
#<title>Page title</title>
head.nextSibling.name
#u'body'
head.nextSibling.contents[0]
#<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
head.nextSibling.contents[0].nextSibling
#<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
Here are a couple of ways to search the soup for certain tags, or tags with certain properties:
titleTag = soup.html.head.title
titleTag
#<title>Page title</title>
titleTag.string
#u'Page title'
len(soup('p'))
#2
soup.findAll('p', align="center")
#[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]
soup.find('p', align="center")
#<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
soup('p', align="center")[0]['id']
#u'firstpara'
soup.find('p', align=re.compile('^b.*'))['id']
#u'secondpara'
soup.find('p').b.string
#u'one'
soup('p')[1].b.string
#u'two'
It's easy to modify the soup, too:
titleTag['id'] = 'theTitle'
titleTag.contents[0].replaceWith("New title")
soup.html.head
#<head><title id="theTitle">New title</title></head>

soup.p.extract()
print soup.prettify()
#<html>
# <head>
#  <title id="theTitle">
#   New title
#  </title>
# </head>
# <body>
#  <p id="secondpara" align="blah">
#   This is paragraph
#   <b>
#    two
#   </b>
#   .
#  </p>
# </body>
#</html>

soup.p.replaceWith(soup.b)
print soup.prettify()
#<html>
# <head>
#  <title id="theTitle">
#   New title
#  </title>
# </head>
# <body>
#  <b>
#   two
#  </b>
# </body>
#</html>

soup.body.insert(0, "This page used to have ")
soup.body.insert(2, " <p> tags!")
soup.body
#<body>This page used to have <b>two</b> <p> tags!</body>
Here's a real-world example. It fetches the ICC Commercial Crime Services weekly piracy report page, parses it with Beautiful Soup, and pulls out the piracy incidents:
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print
You construct a Beautiful Soup object out of a string (or file-like object) containing XML or HTML. It parses the document and builds a corresponding data structure in memory. If your document is well-formed, the parsed data structure looks just like your original document. But if something is wrong with the document, Beautiful Soup uses heuristics to figure out a reasonable structure for it.
Use the BeautifulSoup class to parse HTML documents. BeautifulSoup knows a number of facts about HTML, such as which tags can be nested and which are self-closing, and uses them when building the tree. Here it is in action:
from BeautifulSoup import BeautifulSoup
html = "<html><p>Para 1<p>Para 2<blockquote>Quote 1<blockquote>Quote 2"
soup = BeautifulSoup(html)
print soup.prettify()
#<html>
# <p>
#  Para 1
# </p>
# <p>
#  Para 2
#  <blockquote>
#   Quote 1
#   <blockquote>
#    Quote 2
#   </blockquote>
#  </blockquote>
# </p>
#</html>
Note that BeautifulSoup figured out sensible places to put the closing tags, even though the original document lacked them. That document wasn't valid HTML, but it wasn't too bad either. Here's a really horrible document. Among other problems, it has a <FORM> tag that starts outside a <TABLE> tag and ends inside it. (HTML like this was found on the site of a major web company.)
from BeautifulSoup import BeautifulSoup
html = """
<html>
<form>
 <table>
 <td><input name="input1">Row 1 cell 1
 <tr><td>Row 2 cell 1
 </form>
 <td>Row 2 cell 2<br>This</br> sure is a long cell
</body>
</html>"""
Beautiful Soup handles this document, too:
print BeautifulSoup(html).prettify()
#<html>
# <form>
#  <table>
#   <td>
#    <input name="input1" />
#    Row 1 cell 1
#   </td>
#   <tr>
#    <td>
#     Row 2 cell 1
#    </td>
#   </tr>
#  </table>
# </form>
# <td>
#  Row 2 cell 2
#  <br />
#  This
#  sure is a long cell
# </td>
#</html>
The last cell of the table ends up outside the <TABLE> tag; Beautiful Soup decided to close the <TABLE> tag when the <FORM> tag closed. The author of the original document probably intended the <FORM> tag to extend to the end of the table, but Beautiful Soup has no way of knowing that. Even in a bizarre case like this, Beautiful Soup parses the invalid document and lets you access all the data.
BeautifulSoup acts like a web browser: a heuristic class that does its best to divine the intent of an HTML document's author. But XML has no fixed tag set, so those heuristics don't apply. So BeautifulSoup is not very good at parsing XML.
Use the BeautifulStoneSoup class to parse XML documents. It's a general class, with no special knowledge of any XML dialect and very simple rules about tag nesting. Here's a demonstration:
from BeautifulSoup import BeautifulStoneSoup
xml = "<doc><tag1>Contents 1<tag2>Contents 2<tag1>Contents 3"
soup = BeautifulStoneSoup(xml)
print soup.prettify()
#<doc>
# <tag1>
#  Contents 1
#  <tag2>
#   Contents 2
#  </tag2>
# </tag1>
# <tag1>
#  Contents 3
# </tag1>
#</doc>
The main shortcoming of BeautifulStoneSoup is that it doesn't know about self-closing tags. HTML has a fixed set of self-closing tags, but with XML it depends on the corresponding DTD. You can tell BeautifulStoneSoup that certain tags are self-closing by passing their names into the constructor as the selfClosingTags argument:
from BeautifulSoup import BeautifulStoneSoup
xml = "<tag>Text 1<selfclosing>Text 2"
print BeautifulStoneSoup(xml).prettify()
#<tag>
# Text 1
# <selfclosing>
#  Text 2
# </selfclosing>
#</tag>
print BeautifulStoneSoup(xml, selfClosingTags=['selfclosing']).prettify()
#<tag>
# Text 1
# <selfclosing />
# Text 2
#</tag>
There are some other parser classes with different heuristics than these two, and you can also subclass and customize a parser to apply heuristics of your own.
By the time your document is parsed, it has been converted to Unicode. Beautiful Soup stores only Unicode strings.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("Hello")
soup.contents[0]
#u'Hello'
soup.originalEncoding
#'ascii'
Here's an example with a Japanese document encoded in UTF-8:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf")
soup.contents[0]
#u'\u3053\u308c\u306f'
soup.originalEncoding
#'utf-8'
str(soup)
#'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf'

#Note: this bit uses EUC-JP, so it only works if you have cjkcodecs
#installed, or are running Python 2.4.
soup.__str__('euc-jp')
#'\xa4\xb3\xa4\xec\xa4\xcf'
Beautiful Soup uses a class called UnicodeDammit to detect the encodings of documents you give it and convert them to Unicode. If you need this conversion for documents other than the ones parsed by Beautiful Soup, you can use UnicodeDammit directly. It's based on code from the Universal Feed Parser.
If you're running a Python version earlier than 2.4, download and install cjkcodecs and iconvcodec to make Python support more codecs, especially the CJK codecs. For better autodetection, you should also install the chardet library.
Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:

1. An encoding you pass in as the fromEncoding argument to the soup constructor.
2. An encoding discovered in the document itself, for instance in an http-equiv META tag. If Beautiful Soup finds an encoding of this kind within the document, it tries converting the document with that encoding. However, if you explicitly specified an encoding and that encoding actually worked, Beautiful Soup ignores any encoding it finds in the document.
3. An encoding sniffed by the chardet library, if you have it installed.

Beautiful Soup guesses right whenever it can, but for documents that declare no encoding and use a strange one, it will often fail. Then it falls back to Windows-1252, which is probably wrong. Here's an EUC-JP example where Beautiful Soup guesses the encoding wrong. (Again, because it uses EUC-JP, this example only works under Python 2.4 or with cjkcodecs installed.):
from BeautifulSoup import BeautifulSoup
euc_jp = '\xa4\xb3\xa4\xec\xa4\xcf'
soup = BeautifulSoup(euc_jp)
soup.originalEncoding
#'windows-1252'
str(soup)
#'\xc2\xa4\xc2\xb3\xc2\xa4\xc3\xac\xc2\xa4\xc3\x8f' # Wrong!
But if you specify the encoding with the fromEncoding argument, it parses the document correctly, and can convert it to UTF-8 or back to EUC-JP.
soup = BeautifulSoup(euc_jp, fromEncoding="euc-jp")
soup.originalEncoding
#'euc-jp'
str(soup)
#'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf' # Right!
soup.__str__('euc-jp') == euc_jp
#True
If you tell Beautiful Soup that a document is Windows-1252 (or a similar encoding like ISO-8859-1 or ISO-8859-2), Beautiful Soup finds and destroys the document's smart quotes and other Windows-specific characters. Rather than converting them to the corresponding Unicode characters, it turns them into HTML entities (BeautifulSoup) or XML entities (BeautifulStoneSoup).
However, you can pass smartQuotesTo=None into the soup constructor: then smart quotes are converted to Unicode like everything else. You can also pass "xml" or "html" as smartQuotesTo, to change the default behavior of BeautifulSoup and BeautifulStoneSoup.
from BeautifulSoup import BeautifulSoup, BeautifulStoneSoup
text = "Deploy the \x91SMART QUOTES\x92!"
str(BeautifulSoup(text))
#'Deploy the &lsquo;SMART QUOTES&rsquo;!'
str(BeautifulStoneSoup(text))
#'Deploy the &#x2018;SMART QUOTES&#x2019;!'
str(BeautifulSoup(text, smartQuotesTo="xml"))
#'Deploy the &#x2018;SMART QUOTES&#x2019;!'
BeautifulSoup(text, smartQuotesTo=None).contents[0]
#u'Deploy the \u2018SMART QUOTES\u2019!'
You can turn a Beautiful Soup document (or a subset of it) into a string with the str function, or with its prettify or renderContents methods. You can also use the unicode function to get the whole document as a Unicode string.
The prettify method adds strategic newlines and whitespace to make the structure of the document more obvious. It also strips out text nodes that contain only whitespace, which might change the meaning of an XML document. The str and unicode functions don't strip out those text nodes, and they don't add any whitespace either.
Here's an example:
from BeautifulSoup import BeautifulSoup
doc = "<html><h1>Heading</h1><p>Text"
soup = BeautifulSoup(doc)
str(soup)
#'<html><h1>Heading</h1><p>Text</p></html>'
soup.renderContents()
#'<html><h1>Heading</h1><p>Text</p></html>'
soup.__str__()
#'<html><h1>Heading</h1><p>Text</p></html>'
unicode(soup)
#u'<html><h1>Heading</h1><p>Text</p></html>'

soup.prettify()
#'<html>\n <h1>\n  Heading\n </h1>\n <p>\n  Text\n </p>\n</html>'
print soup.prettify()
#<html>
# <h1>
#  Heading
# </h1>
# <p>
#  Text
# </p>
#</html>
Note that str and renderContents give different results when used on a tag within the document:
heading = soup.h1
str(heading)
#'<h1>Heading</h1>'
heading.renderContents()
#'Heading'
When you call __str__, prettify, or renderContents, you can specify an output encoding. The default encoding (the one used by str) is UTF-8. Here's an example that parses an ISO-8859-1 string and then renders the same string in different encodings:
from BeautifulSoup import BeautifulSoup
doc = "Sacr\xe9 bleu!"
soup = BeautifulSoup(doc)
str(soup)
#'Sacr\xc3\xa9 bleu!' # UTF-8
soup.__str__("ISO-8859-1")
#'Sacr\xe9 bleu!'
soup.__str__("UTF-16")
#'\xff\xfeS\x00a\x00c\x00r\x00\xe9\x00 \x00b\x00l\x00e\x00u\x00!\x00'
soup.__str__("EUC-JP")
#'Sacr\x8f\xab\xb1 bleu!'
If the original document contains an encoding declaration, Beautiful Soup rewrites the declaration to mention the new encoding. That is, if you load an HTML document into BeautifulSoup and write it back out, not only is the HTML cleaned up, it's visibly transcoded to UTF-8.
Here's an HTML example:
from BeautifulSoup import BeautifulSoup
doc = """<html>
<meta http-equiv="Content-type" content="text/html; charset=ISO-Latin-1" >
Sacr\xe9 bleu!
</html>"""

print BeautifulSoup(doc).prettify()
#<html>
# <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
# Sacré bleu!
#</html>
Here's an XML example:
from BeautifulSoup import BeautifulStoneSoup
doc = """<?xml version="1.0" encoding="ISO-Latin-1">Sacr\xe9 bleu!"""
print BeautifulStoneSoup(doc).prettify()
#<?xml version='1.0' encoding='utf-8'>
#Sacré bleu!
So far we've only loaded documents and written them back out again. Now let's look at the more interesting part: the parse tree, the data structure Beautiful Soup builds as it parses a document.
A parser object (an instance of BeautifulSoup or BeautifulStoneSoup) is a deeply-nested, well-connected data structure that corresponds to the structure of an XML or HTML document. The parser object contains two other kinds of objects: Tag objects, which correspond to tags like <TITLE> and <B>; and NavigableString objects, which correspond to strings like "Page title" and "This is paragraph".
Some subclasses of NavigableString (CData, Comment, Declaration, and ProcessingInstruction) also correspond to specific XML constructs. They act just like NavigableStrings, except that when it's time to output them, they're wrapped in some extra data. Here's a document that includes a comment:
from BeautifulSoup import BeautifulSoup
import re
hello = "Hello! <!--I've got to be nice to get what I want.-->"
commentSoup = BeautifulSoup(hello)
comment = commentSoup.find(text=re.compile("nice"))
comment.__class__
#<class 'BeautifulSoup.Comment'>
comment
#u"I've got to be nice to get what I want."
comment.previousSibling
#u'Hello! '
str(comment)
#"<!--I've got to be nice to get what I want.-->"
print commentSoup
#Hello! <!--I've got to be nice to get what I want.-->
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
#<html>
# <head>
#  <title>
#   Page title
#  </title>
# </head>
# <body>
#  <p id="firstpara" align="center">
#   This is paragraph
#   <b>
#    one
#   </b>
#   .
#  </p>
#  <p id="secondpara" align="blah">
#   This is paragraph
#   <b>
#    two
#   </b>
#   .
#  </p>
# </body>
#</html>
The attributes of Tags
Tag and NavigableString objects have lots of useful members, most of which are covered in Navigating the Parse Tree and Searching the Parse Tree. For now, we'll look at one aspect of Tag objects: the attributes.
SGML tags have attributes: for instance, each of the <P> tags in the example HTML above has an "id" attribute and an "align" attribute. You can access a tag's attributes by treating the Tag object as though it were a dictionary:
firstPTag, secondPTag = soup.findAll('p')
firstPTag['id']
#u'firstpara'
secondPTag['id']
#u'secondpara'
NavigableString objects don't have attributes; only Tag objects have them.
Every Tag object has all of the members listed below (though the actual value of a member may be None). NavigableString objects have all of them except for contents and string.
parent
In the example above, the parent of the <HEAD> Tag is the <HTML> Tag. The parent of the <HTML> Tag is the BeautifulSoup parser object itself. The parent of the parser object is None. By following parent, you can move up the parse tree:
soup.head.parent.name
#u'html'
soup.head.parent.parent.__class__.__name__
#'BeautifulSoup'
soup.parent == None
#True
contents
With parent you move up the parse tree. With contents you move down it. contents is an ordered list of the Tag and NavigableString objects contained within a page element. Only the top-level parser object and Tag objects have contents. NavigableStrings are just strings and can't contain sub-elements, so they don't have contents.
In the example above, the contents of the first <P> Tag is a list containing a NavigableString ("This is paragraph "), a <B> Tag, and another NavigableString ("."). The contents of the <B> Tag is a list containing one NavigableString ("one").
pTag = soup.p
pTag.contents
#[u'This is paragraph ', <b>one</b>, u'.']
pTag.contents[1].contents
#[u'one']
pTag.contents[0].contents
#AttributeError: 'NavigableString' object has no attribute 'contents'
string
For your convenience, if a tag has only one child node, and that child node is a string, it's made available as tag.string as well as tag.contents[0]. In the example above, soup.b.string is a NavigableString representing the Unicode string "one". That's the string contained in the first <B> Tag of the parse tree.
soup.b.string
#u'one'
soup.b.contents[0]
#u'one'
But soup.p.string is None, because the first <P> Tag of the parse tree has more than one child. soup.head.string is also None, even though the <HEAD> Tag has only one child, because that child is a Tag (the <TITLE> Tag), not a NavigableString.
soup.p.string == None
#True
soup.head.string == None
#True
nextSibling and previousSibling
These members let you skip to the next or previous element on the same level of the parse tree. In the document above, the nextSibling of the <HEAD> Tag is the <BODY> Tag, because the <BODY> Tag is the next thing directly beneath the <HTML> Tag. The nextSibling of the <BODY> Tag is None, because nothing comes directly after it beneath the <HTML> Tag.
soup.head.nextSibling.name
#u'body'
soup.html.nextSibling == None
#True
Conversely, the previousSibling of the <BODY> Tag is the <HEAD> Tag, and the previousSibling of the <HEAD> Tag is None:
soup.body.previousSibling.name
#u'head'
soup.head.previousSibling == None
#True
Some more examples: the nextSibling of the first <P> Tag is the second <P> Tag. The previousSibling of the <B> Tag inside the second <P> Tag is the NavigableString "This is paragraph". The previousSibling of that NavigableString is None, not anything inside the first <P> Tag.
soup.p.nextSibling
#<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
secondBTag = soup.findAll('b')[1]
secondBTag.previousSibling
#u'This is paragraph'
secondBTag.previousSibling.previousSibling == None
#True
next and previous
These members let you move through the document elements in the order they were processed by the parser, rather than the order they appear in the tree. For instance, the next of the <HEAD> Tag is the <TITLE> Tag, not the <BODY> Tag. This is because, in the original document, the <TITLE> tag comes immediately after the <HEAD> tag.
soup.head.next
#<title>Page title</title>
soup.head.nextSibling.name
#u'body'
soup.head.previous.name
#u'html'
Where next and previous are concerned, a Tag's contents come before its nextSibling. You usually won't need these members, but sometimes they're the easiest way to get at information buried deep in the parse tree.
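For instance, in the example document, next descends into a tag's contents, while nextSibling skips past them:

soup.p.next
#u'This is paragraph '
soup.p.nextSibling
#<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>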
Iterating over a Tag
You can iterate over the contents of a tag by treating the Tag itself as a list. This is a useful shortcut. Similarly, to see how many child nodes a tag has, you can call len(tag) instead of len(tag.contents). In terms of the document above:
for i in soup.body:
    print i
#<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
#<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>

len(soup.body)
#2
len(soup.body.contents)
#2
It's easy to navigate the parse tree by acting as though the name of the tag you want is a member of the parser or Tag object. We've been doing it throughout the previous examples. In terms of the document above, soup.head gets the first <HEAD> tag in the document:
soup.head
#<head><title>Page title</title></head>
In general, calling mytag.foo gets the first child of mytag that happens to be a <FOO> tag. If there aren't any <FOO> tags beneath mytag, then mytag.foo returns None. You can use this to traverse the parse tree very quickly:
soup.head.title
#<title>Page title</title>
soup.body.p.b.string
#u'one'
You can also use this to quickly jump to a certain part of a parse tree. For instance, if you're worried that <TITLE> tags might show up in weird places outside of the <HEAD> tag, you can use soup.title to get an HTML document's title, instead of soup.head.title:
soup.title.string
#u'Page title'
soup.p jumps to the first <P> tag inside the document, wherever it is. soup.table.tr.td jumps to the first column of the first row of the first table in the document.
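A quick sketch of that last shortcut; the example document has no table, so this uses a small one made up here:

from BeautifulSoup import BeautifulSoup
tableSoup = BeautifulSoup('<table><tr><td>Cell 1</td><td>Cell 2</td></tr></table>')
tableSoup.table.tr.td
#<td>Cell 1</td>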
These members are actually aliases for the first method, covered in more detail below. I mention them here because the aliases make it very easy to zoom in on an interesting part of a well-known parse tree.
An alternate form of this idiom lets you access the first <FOO> tag as .fooTag instead of .foo. For instance, soup.table.tr.td could also be expressed as soup.tableTag.trTag.tdTag, or even soup.tableTag.tr.tdTag. This is useful if you like to be more explicit about what you're doing, or if you're parsing an XML document whose tag names conflict with the names of Beautiful Soup methods and members.
from BeautifulSoup import BeautifulStoneSoup
xml = '<person name="Bob"><parent rel="mother" name="Alice">'
xmlSoup = BeautifulStoneSoup(xml)
xmlSoup.person.parent       # A Beautiful Soup member
#<person name="Bob"><parent rel="mother" name="Alice"></parent></person>
xmlSoup.person.parentTag    # A tag name
#<parent rel="mother" name="Alice"></parent>
If the name of the tag you're looking for isn't a valid Python identifier (for instance, hyphenated-name), you'll need to use the find method instead.
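For example, a minimal sketch with a made-up tag name:

from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup('<doc><hyphenated-name>Value</hyphenated-name></doc>')
soup.find('hyphenated-name')   # attribute-style access can't express this name
#<hyphenated-name>Value</hyphenated-name>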
Beautiful Soup provides many methods that traverse the parse tree, gathering the Tags and NavigableStrings that match criteria you specify.
There are several ways to define matching criteria for Beautiful Soup's methods. We'll start with a thorough explanation of the most basic search method, findAll. As before, we'll demonstrate on the following document:
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
#<html>
# <head>
#  <title>
#   Page title
#  </title>
# </head>
# <body>
#  <p id="firstpara" align="center">
#   This is paragraph
#   <b>
#    one
#   </b>
#   .
#  </p>
#  <p id="secondpara" align="blah">
#   This is paragraph
#   <b>
#    two
#   </b>
#   .
#  </p>
# </body>
#</html>
One more note: the two methods covered here (findAll and find) work only on Tag objects and the top-level parser objects, not on NavigableString objects. The methods defined in Searching Within the Parse Tree work on NavigableString objects as well.
findAll(name, attrs, recursive, text, limit, **kwargs)
The findAll method traverses the tree, starting at the given point, and finds all the Tag and NavigableString objects that match the criteria you give. The signature of findAll is this:
findAll(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
These arguments show up over and over again throughout this document. The most important arguments are name and the keyword arguments (**kwargs).
The name argument restricts the result set to tags with matching names. There are several ways to restrict the name, and they too show up over and over again throughout this document.
The simplest usage is to just pass in a tag name. This code finds all the <B> Tags in the document:
soup.findAll('b')
#[<b>one</b>, <b>two</b>]
You can also pass in a regular expression. This code finds all the tags whose names start with B:
import re
tagsStartingWithB = soup.findAll(re.compile('^b'))
[tag.name for tag in tagsStartingWithB]
#[u'body', u'b', u'b']
You can pass in a list or a dictionary. These two calls find all the <TITLE> and all the <P> tags. They return the same results, but the second call is faster:
soup.findAll(['title', 'p'])
#[<title>Page title</title>,
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>,
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
soup.findAll({'title' : True, 'p' : True})
#[<title>Page title</title>,
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>,
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
You can pass in the special value True, which matches every tag with a name: that is, it matches every tag.
allTags = soup.findAll(True)
[tag.name for tag in allTags]
#[u'html', u'head', u'title', u'body', u'p', u'b', u'p', u'b']
This doesn't look very useful on its own, but True becomes quite useful once you also restrict attribute values.
You can pass in a callable object: one that takes a Tag object as its only argument and returns a boolean. Every Tag object that findAll comes across gets passed into this callable, and if the call returns True, that tag is considered a match.
This code finds the tags that have two, and only two, attributes:
soup.findAll(lambda tag: len(tag.attrs) == 2)
#[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>,
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
This code finds the tags whose names are a single character and which have no attributes:
soup.findAll(lambda tag: len(tag.name) == 1 and not tag.attrs)
#[<b>one</b>, <b>two</b>]
The keyword arguments impose restrictions on the attributes of a tag. This example finds all the tags that have an align attribute with a value of center:
soup.findAll(align="center") #[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]
As with the name argument, you can use different kinds of objects as the value of a keyword argument, to impose different kinds of restrictions on the corresponding attribute. You can pass in a string, as above, to restrict an attribute to a single value. You can also pass in a regular expression, a list, a hash, the special values True or None, or a callable object that takes the attribute value as its argument (note that the value may be None). Some examples:
soup.findAll(id=re.compile("para$")) #[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, # <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>] soup.findAll(align=["center", "blah"]) #[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, # <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>] soup.findAll(align=lambda(value): value and len(value) < 5) #[<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
The special values True and None are especially interesting. True matches tags that have any value at all for the given attribute, and None matches tags that have no value for it. Some examples:
soup.findAll(align=True)
#[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>,
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
[tag.name for tag in soup.findAll(align=None)]
#[u'html', u'head', u'title', u'body', u'b', u'b']
If you need to impose complex or interlocking restrictions on a tag's attributes, pass in a callable object for the tag, as above, and have it process the Tag object however you like.
You may have noticed a problem here. What if you have a document with a tag that defines an attribute called name? You can't use name as a keyword argument, because Beautiful Soup has already defined a name argument. Nor can you use a Python reserved word like for as a keyword argument.
Beautiful Soup provides a special argument called attrs that you can use in these situations. attrs is a dictionary that works just like the keyword arguments:
soup.findAll(id=re.compile("para$")) #[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, # <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>] soup.findAll(attrs={'id' : re.compile("para$")}) #[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, # <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
You can use attrs to restrict attributes whose names are Python reserved words, like class, for, and import; or attributes whose names are the same as the arguments to Beautiful Soup's search methods: name, recursive, limit, text, and attrs itself.
from BeautifulSoup import BeautifulStoneSoup
xml = '<person name="Bob"><parent rel="mother" name="Alice">'
xmlSoup = BeautifulStoneSoup(xml)
xmlSoup.findAll(name="Alice")
#[]
xmlSoup.findAll(attrs={"name" : "Alice"})
#[<parent rel="mother" name="Alice"></parent>]
The attrs argument is also handy if you deal with CSS classes a lot: class is not only a CSS attribute, it's a Python reserved word.
You could search by CSS class with soup.find("tagName", { "class" : "cssClass" }), but that's a lot of code for such a common operation. Instead, you can simply pass a string for attrs instead of a dictionary. The string will be treated as the CSS class value to match:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("""Bob's <b>Bold</b> Barbeque Sauce now available in
                        <b class="hickory">Hickory</b> and <b class="lime">Lime</a>""")
soup.find("b", { "class" : "lime" })
#<b class="lime">Lime</b>
soup.find("b", "hickory")
#<b class="hickory">Hickory</b>
text is an argument that lets you search for NavigableString objects instead of Tags. Its value can be a string, a regular expression, a list or dictionary, True or None, or a callable that takes a NavigableString object as its argument:
soup.findAll(text="one") #[u'one'] soup.findAll(text=u'one') #[u'one'] soup.findAll(text=["one", "two"]) #[u'one', u'two'] soup.findAll(text=re.compile("paragraph")) #[u'This is paragraph ', u'This is paragraph '] soup.findAll(text=True) #[u'Page title', u'This is paragraph ', u'one', u'.', u'This is paragraph ', # u'two', u'.'] soup.findAll(text=lambda(x): len(x) < 12) #[u'Page title', u'one', u'.', u'two', u'.']
If you use text, any values you give for name and the keyword arguments are ignored.
recursive is a boolean argument (defaulting to True) that tells Beautiful Soup whether to go all the way down the parse tree, or to look only at the immediate children of the Tag or parser object. Here's the difference:
[tag.name for tag in soup.html.findAll()]
#[u'head', u'title', u'body', u'p', u'b', u'p', u'b']
[tag.name for tag in soup.html.findAll(recursive=False)]
#[u'head', u'body']
When recursive is false, only the immediate children of the <HTML> tag are searched. If you know where to look, searching this way can save some time.
Setting the limit argument makes Beautiful Soup stop searching once it has found that many matches. If there are a thousand tables in your document but you only need the first four, passing 4 for limit can save you a lot of time. The default is no limit.
soup.findAll('p', limit=1)
#[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]
soup.findAll('p', limit=100)
#[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>,
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
Calling a tag is a shortcut for calling findAll. If you call the parser object or a Tag object as though it were a function, you can pass in all of findAll's arguments, and it works just as if you had called findAll. In terms of the document above:
soup(text=lambda(x): len(x) < 12)
#[u'Page title', u'one', u'.', u'two', u'.']
soup.body('p', limit=1)
#[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]
find(name, attrs, recursive, text, **kwargs)
Okay, now let's look at the other search methods. They all take more or less the same arguments as findAll.
The find method is the one most like findAll, except that instead of gathering all the matching objects, it returns only the first one it finds. It's like imposing a limit of 1 on the result list and extracting the single result. In terms of the document above:
soup.findAll('p', limit=1)
#[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]
soup.find('p')
#<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
soup.find('nosuchtag') == None
#True
In general, when you see a search method whose name is plural (like findAll and findNextSiblings), that method takes a limit argument and returns a list of results. When you see a search method whose name is not plural (like find and findNextSibling), you know the method has no limit argument and returns a single result.
Where did first go?
Earlier versions of Beautiful Soup had methods like first, fetch, and fetchPrevious. These methods are still around, but they're deprecated and may go away soon, because the names were confusing. The new names are uniformly meaningful: as mentioned above, a method name that's plural or includes All returns multiple objects; otherwise, it returns a single object.
The methods described above, findAll and find, start at some point in the parse tree and move downwards. They iterate over an object's contents, recursively, until they hit bottom.
This means you can't use these methods on NavigableString objects, because NavigableStrings have no contents: they're the leaves of the parse tree.
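For instance, a small sketch using the example document: asking a NavigableString for findAll just raises an error:

text = soup.find(text='one')
text.findAll('b')
#AttributeError: 'NavigableString' object has no attribute 'findAll'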
But moving downwards isn't the only way to traverse a parse tree. In Navigating the Parse Tree we covered members like parent and nextSibling. Each of those members has two corresponding methods: one analogous to findAll, and one analogous to find. And since NavigableString objects support these methods, you can use them on NavigableStrings just as well as on Tags.
Why is this useful? Because sometimes you can't use findAll or find to get from the Tag or NavigableString you have to the one you want. For instance, consider this HTML document:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('''<ul>
 <li>An unrelated list
</ul>

<h1>Heading</h1>
<p>This is <b>the list you want</b>:</p>
<ul><li>The data you want</ul>''')
There are a number of ways to navigate to the <LI> tag that contains the data you want. The most obvious is this:
soup('li', limit=2)[1]
#<li>The data you want</li>
Obviously, that's not a stable way to get at the <LI> tag you want. If you're only scraping this page once it doesn't matter, but if you're going to scrape it many times over a long period, it does. If the irrelevant list grows another <LI> tag, you'll get that tag instead of the one you want, and your script will break or give the wrong data.
soup('ul', limit=2)[1].li
#<li>The data you want</li>
That's a little better, because it survives changes to the irrelevant list. But if the document grows another irrelevant list at the top, you'll get the first <LI> tag of that list instead of the one you want. A more reliable way of referring to the <UL> tag you want would better reflect that tag's place in the structure of the document.
In terms of the HTML, the list you want is the <UL> tag that comes after the <H1> tag. The trouble is, that tag isn't inside the <H1> tag; it just comes after it. It's easy enough to get the <H1> tag, but there's no way to get from there to the <UL> tag with findAll or find, because those methods only search the <H1> tag's contents. You have to get to the <UL> tag through the next or nextSibling members:
s = soup.h1
while getattr(s, 'name', None) != 'ul':
    s = s.nextSibling
s.li
#<li>The data you want</li>
Or, if you think this might be more stable:
s = soup.find(text='Heading')
while getattr(s, 'name', None) != 'ul':
    s = s.next
s.li
#<li>The data you want</li>
But that's more trouble than you should have to go to. The methods introduced here are useful whenever you would otherwise write a tree-walking loop of your own: they traverse the tree in a particular direction and keep track of the Tag and NavigableString objects that match the criteria you give. Instead of the first loop in the code above, you can just write this:
soup.h1.findNextSibling('ul').li
#<li>The data you want</li>
And instead of the second loop, you can write this:
soup.find(text='Heading').findNext('ul').li
#<li>The data you want</li>
The loops are replaced with calls to findNextSibling and findNext. The rest of this section is a reference to methods of this kind. Again, for every navigation member there are two methods: one that returns a list, like findAll, and one that returns a single result, like find.
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
#<html>
# <head>
#  <title>
#   Page title
#  </title>
# </head>
# <body>
#  <p id="firstpara" align="center">
#   This is paragraph
#   <b>
#    one
#   </b>
#   .
#  </p>
#  <p id="secondpara" align="blah">
#   This is paragraph
#   <b>
#    two
#   </b>
#   .
#  </p>
# </body>
#</html>
findNextSiblings(name, attrs, text, limit, **kwargs) and findNextSibling(name, attrs, text, **kwargs)
These methods repeatedly follow the nextSibling member, gathering the Tag or NavigableText objects that match the criteria you give. In terms of the document above:
paraText = soup.find(text='This is paragraph ')
paraText.findNextSiblings('b')
#[<b>one</b>]
paraText.findNextSibling(text=lambda(text): len(text) == 1)
#u'.'
findPreviousSiblings(name, attrs, text, limit, **kwargs) and findPreviousSibling(name, attrs, text, **kwargs)
These methods repeatedly follow the previousSibling member, gathering the Tag or NavigableText objects that match the criteria you give. In terms of the document above:
paraText = soup.find(text='.')
paraText.findPreviousSiblings('b')
#[<b>one</b>]
paraText.findPreviousSibling(text=True)
#u'This is paragraph '
findAllNext(name, attrs, text, limit, **kwargs) and findNext(name, attrs, text, **kwargs)
These methods repeatedly follow the next member, gathering the Tag or NavigableText objects that match the criteria you give. In terms of the document above:
pTag = soup.find('p')
pTag.findAllNext(text=True)
#[u'This is paragraph ', u'one', u'.', u'This is paragraph ', u'two', u'.']
pTag.findNext('p')
#<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
pTag.findNext('b')
#<b>one</b>
findAllPrevious(name, attrs, text, limit, **kwargs) and findPrevious(name, attrs, text, **kwargs)
These methods repeatedly follow the previous member, gathering the Tag or NavigableText objects that match the criteria you give. In terms of the document above:
lastPTag = soup('p')[-1]
lastPTag.findAllPrevious(text=True)
#[u'.', u'one', u'This is paragraph ', u'Page title']
#Note the reverse order!
lastPTag.findPrevious('p')
#<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
lastPTag.findPrevious('b')
#<b>one</b>
findParents(name, attrs, limit, **kwargs) and findParent(name, attrs, **kwargs)
These methods repeatedly follow the parent member, gathering the Tag or NavigableText objects that match the criteria you give. They don't take a text argument, because a NavigableString can't be anything's parent. In terms of the document above:
bTag = soup.find('b')
[tag.name for tag in bTag.findParents()]
#[u'p', u'body', u'html', '[document]']
#NOTE: "u'[document]'" means that the parser object itself matched.
bTag.findParent('body').name
#u'body'
Now you know how to find things in the parse tree. But maybe you want to change it and write it back out. You can't just rip an element out of its parent's contents, because the rest of the document would keep references to the thing you removed. Beautiful Soup provides several methods that let you modify the parse tree while preserving its internal consistency.
You can modify the attribute values of Tag objects with dictionary assignment:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('<b id="2">Argh!</b>')
print soup
#<b id="2">Argh!</b>
b = soup.b

b['id'] = 10
print soup
#<b id="10">Argh!</b>

b['id'] = "ten"
print soup
#<b id="ten">Argh!</b>

b['id'] = 'one "million"'
print soup
#<b id='one "million"'>Argh!</b>
You can also delete attribute values, and add new ones:
del(b['id'])
print soup
#<b>Argh!</b>
b['class'] = "extra bold and brassy!"
print soup
#<b class="extra bold and brassy!">Argh!</b>
Once you have a reference to an element, you can use extract to rip it out of the tree. This code removes all the comments from a document:
from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
                        <a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
#1
#<a>2<b>3</b></a>
This code removes a whole subtree from a document:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("<a1></a1><a><b>Amazing content<c><d></a><a2></a2>")
soup.a1.nextSibling
#<a><b>Amazing content<c><d></d></c></b></a>
soup.a2.previousSibling
#<a><b>Amazing content<c><d></d></c></b></a>

subtree = soup.a
subtree.extract()

print soup
#<a1></a1><a2></a2>
soup.a1.nextSibling
#<a2></a2>
soup.a2.previousSibling
#<a1></a1>
The extract method turns one parse tree into two disjoint trees. The navigation members are changed so that it looks as though the trees had never been part of each other:
soup.a1.nextSibling
#<a2></a2>
soup.a2.previousSibling
#<a1></a1>
subtree.previousSibling == None
#True
subtree.parent == None
#True
The replaceWith method extracts one page element and replaces it with a different one. The new element can be a Tag (possibly with a whole parse tree beneath it) or a NavigableString. If you pass a plain old string into replaceWith, it gets turned into a NavigableString. The navigation members are changed as though the document had been parsed that way in the first place.
Here's a simple example:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("<b>Argh!</b>")
soup.find(text="Argh!").replaceWith("Hooray!")
print soup
#<b>Hooray!</b>

newText = soup.find(text="Hooray!")
newText.previous
#<b>Hooray!</b>
newText.previous.next
#u'Hooray!'
newText.parent
#<b>Hooray!</b>
soup.b.contents
#[u'Hooray!']
Here's a more complex example that shows one tag replacing another:
from BeautifulSoup import BeautifulSoup, Tag
soup = BeautifulSoup("<b>Argh!<a>Foo</a></b><i>Blah!</i>")
tag = Tag(soup, "newTag", [("id", 1)])
tag.insert(0, "Hooray!")
soup.a.replaceWith(tag)
print soup
#<b>Argh!<newTag id="1">Hooray!</newTag></b><i>Blah!</i>
You can even rip out an element from one part of the document and stick it in another part:
from BeautifulSoup import BeautifulSoup
text = "<html>There's <b>no</b> business like <b>show</b> business</html>"
soup = BeautifulSoup(text)
no, show = soup.findAll('b')
show.replaceWith(no)
print soup
#<html>There's  business like <b>no</b> business</html>
The Tag class and the parser classes support a method called insert. It works just like a Python list's insert method: it takes an index into the tag's contents member, and sticks a new element in that slot.
This was demonstrated in the previous section, when we replaced a tag in the document with a brand new tag. You can use insert to build up an entire parse tree from scratch:
from BeautifulSoup import BeautifulSoup, Tag, NavigableString
soup = BeautifulSoup()
tag1 = Tag(soup, "mytag")
tag2 = Tag(soup, "myOtherTag")
tag3 = Tag(soup, "myThirdTag")
soup.insert(0, tag1)
tag1.insert(0, tag2)
tag1.insert(1, tag3)
print soup
#<mytag><myOtherTag></myOtherTag><myThirdTag></myThirdTag></mytag>

text = NavigableString("Hello!")
tag3.insert(0, text)
print soup
#<mytag><myOtherTag></myOtherTag><myThirdTag>Hello!</myThirdTag></mytag>
An element can occur in only one place in one parse tree. If you give insert an element that's already connected to a soup object, it gets disconnected (with extract) before it gets connected elsewhere. In this example, I try to insert my NavigableString into a second part of the soup, but it doesn't get inserted again. It gets moved:
tag2.insert(0, text)
print soup
#<mytag><myOtherTag>Hello!</myOtherTag><myThirdTag></myThirdTag></mytag>
This happens even if the element previously belonged to a completely different soup object. An element can only have one parent, one nextSibling, et cetera, so it can only be in one place at a time.
This section covers common problems people have with Beautiful Soup.
If you're getting errors that say: "'ascii' codec can't encode character 'x' in position y: ordinal not in range(128)", the problem is probably with your Python installation rather than with Beautiful Soup.
(Translator's note: if you already know a document's encoding, you can convert it to Unicode first, re-encode it as UTF-8, and then pass it to BeautifulSoup. For example, if the HTML content htm is encoded as GB2312:

htm = unicode(htm, 'gb2312', 'ignore').encode('utf-8', 'ignore')
soup = BeautifulSoup(htm)

If you don't know the encoding, you can use chardet to detect it first. chardet has to be installed separately; it's easy to find online.)
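A minimal sketch of that detection step, assuming the chardet package is installed and htm holds the raw bytes:

import chardet
detected = chardet.detect(htm)   # e.g. {'encoding': 'GB2312', 'confidence': 0.99}
htm = unicode(htm, detected['encoding'], 'ignore').encode('utf-8', 'ignore')
soup = BeautifulSoup(htm)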
Try printing out the non-ASCII characters without running them through Beautiful Soup, and you should have the same problem. For instance, try running code like this:
latin1word = 'Sacr\xe9 bleu!'
unicodeword = unicode(latin1word, 'latin-1')
print unicodeword
If this works but Beautiful Soup doesn't, there's probably a bug in Beautiful Soup. However, if this doesn't work, the problem's with your Python setup. Python is playing it safe and not sending non-ASCII characters to your terminal. There are two ways to override this behavior.
The easy way is to remap standard output to a converter that's not afraid to send ISO-Latin-1 or UTF-8 characters to the terminal.
import codecs
import sys
streamWriter = codecs.lookup('utf-8')[-1]
sys.stdout = streamWriter(sys.stdout)
codecs.lookup returns a number of bound methods and other objects related to a codec. The last one is a StreamWriter object capable of wrapping an output stream.
The hard way is to create a sitecustomize.py file in your Python installation which sets the default encoding to ISO-Latin-1 or to UTF-8. Then all your Python programs will use that encoding for standard output, without you having to do something for each program. In my installation, I have a /usr/lib/python/sitecustomize.py which looks like this:
import sys
sys.setdefaultencoding("utf-8")
For more information about Python's Unicode support, look at Unicode for Programmers or End to End Unicode Web Applications in Python. Recipes 1.20 and 1.21 in the Python Cookbook are also very helpful.
Remember, even if your terminal display is restricted to ASCII, you can still use Beautiful Soup to parse, process, and write documents in UTF-8 and other encodings. You just can't print certain strings with print.
Beautiful Soup can handle poorly-structured SGML, but sometimes it loses data when it gets stuff that's not SGML at all. This is not nearly as common as poorly-structured markup, but if you're building a web crawler or something you'll surely run into it.
The only solution is to sanitize the data ahead of time with a regular expression. Here are some examples that I and Beautiful Soup users have discovered:
Beautiful Soup treats ill-formed XML definitions as data. However, it loses well-formed XML definitions that don't actually exist:
from BeautifulSoup import BeautifulSoup BeautifulSoup("< ! FOO @=>") #< ! FOO @=> BeautifulSoup("<b><!FOO>!</b>") #<b>!</b>
If your document starts a declaration and never finishes it, Beautiful Soup assumes the rest of your document is part of the declaration. If the document ends in the middle of the declaration, Beautiful Soup ignores the declaration totally. A couple examples:
from BeautifulSoup import BeautifulSoup BeautifulSoup("foo<!bar") #foo soup = BeautifulSoup("<html>foo<!bar</html>") print soup.prettify() #<html> # foo<!bar</html> #</html>
There are a couple ways to fix this; one is detailed here.
Beautiful Soup also ignores an entity reference that's not finished by the end of the document:
BeautifulSoup("<foo>") #<foo
I've never seen this in real web pages, but it's probably out there somewhere.
A malformed comment will make Beautiful Soup ignore the rest of the document. This is covered as the example in Sanitizing Bad Data with Regexps.
The parse tree built by the BeautifulSoup class offends my senses! To get your markup parsed differently, check out the other built-in parser classes, or customize a parser of your own.
Beautiful Soup will never run as fast as ElementTree or a custom-built SGMLParser subclass. ElementTree is written in C, and SGMLParser lets you write your own mini-Beautiful Soup that only does what you want. The point of Beautiful Soup is to save programmer time, not processor time.
That said, you can speed up Beautiful Soup quite a lot by only parsing the parts of the document you need, and you can make unneeded objects get garbage-collected by using extract.
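A small sketch of both ideas (parseOnlyThese and SoupStrainer are covered in detail below):

from BeautifulSoup import BeautifulSoup, SoupStrainer
# Build Tag objects only for the <a> tags, skipping everything else:
soup = BeautifulSoup('<p>Intro <a href="/1">one</a> and <a href="/2">two</a>',
                     parseOnlyThese=SoupStrainer('a'))
print soup
#<a href="/1">one</a><a href="/2">two</a>
# Disconnect an element you're done with, so it can be garbage-collected:
soup.a.extract()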
That does it for the basic usage of Beautiful Soup. But HTML and XML are tricky, and in the real world they're even trickier. So Beautiful Soup keeps some extra tricks of its own up its sleeve.
The search methods described above are driven by generator methods. You can use these methods yourself: they're called nextGenerator, previousGenerator, nextSiblingGenerator, previousSiblingGenerator, and parentGenerator. Tag and parser objects also have childGenerator and recursiveChildGenerator available.
Here's a simple example that strips HTML tags out of a document, by iterating over the document and collecting all the strings:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("""<div>You <i>bet</i>
<a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>
rocks!</div>""")
''.join([e for e in soup.recursiveChildGenerator() if isinstance(e,unicode)])
#u'You bet\nBeautifulSoup\nrocks!'
Here's a more complex example that uses recursiveChildGenerator to iterate over the elements of a document, printing each one as it gets it:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("1<a>2<b>3")
g = soup.recursiveChildGenerator()
while True:
    try:
        print g.next()
    except StopIteration:
        break
#1
#<a>2<b>3</b></a>
#2
#<b>3</b>
#3
Beautiful Soup comes with three parser classes besides BeautifulSoup and BeautifulStoneSoup:
MinimalSoup is a subclass of BeautifulSoup. It knows most facts about HTML, like which tags are self-closing, the special behavior of the <SCRIPT> tag, and the possibility of an encoding mentioned in a <META> tag. But it has no nesting heuristics at all. So it doesn't know that <LI> tags go underneath <UL> tags and not the other way around. It's useful for parsing pathologically bad markup, and for subclassing.
ICantBelieveItsBeautifulSoup is also a subclass of BeautifulSoup. It has HTML heuristics that conform more closely to the HTML standard, but ignore how HTML is used in the real world. For instance, it's valid HTML to nest <B> tags, but in the real world a nested <B> tag almost always means that the author forgot to close the first <B> tag. If you run into someone who actually nests <B> tags, then you can use ICantBelieveItsBeautifulSoup.
BeautifulSOAP is a subclass of BeautifulStoneSoup. It's useful for parsing documents like SOAP messages, which use a subelement when they could just use an attribute of the parent element. Here's an example:
from BeautifulSoup import BeautifulStoneSoup, BeautifulSOAP
xml = "<doc><tag>subelement</tag></doc>"
print BeautifulStoneSoup(xml)
#<doc><tag>subelement</tag></doc>
print BeautifulSOAP(xml)
#<doc tag="subelement"><tag>subelement</tag></doc>
With BeautifulSOAP you can access the contents of the <TAG> tag without descending into the tag.
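For instance, building on the example above, the subelement's text is now also available as an attribute of its parent tag (a small sketch):

from BeautifulSoup import BeautifulSOAP
soup = BeautifulSOAP("<doc><tag>subelement</tag></doc>")
soup.doc['tag']
#u'subelement'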
When the built-in parser classes won't do the job, you need to customize. This usually means customizing the lists of nestable and self-closing tags. You can customize the list of self-closing tags by passing a selfClosingTags argument into the soup constructor. To customize the lists of nestable tags, though, you'll have to subclass.
The most useful classes to subclass are MinimalSoup (for HTML) and BeautifulStoneSoup (for XML). I'm going to show you how to override RESET_NESTING_TAGS and NESTABLE_TAGS in a subclass. This is the most complicated part of Beautiful Soup and I'm not going to explain it very well here, but I'll get something written and then I can improve it with feedback.
When Beautiful Soup is parsing a document, it keeps a stack of open tags. Whenever it sees a new start tag, it tosses that tag on top of the stack. But before it does, it might close some of the open tags and remove them from the stack. Which tags it closes depends on the qualities of the tag it just found, and the qualities of the tags in the stack.
The best way to explain it is through example. Let's say the stack looks like ['html', 'p', 'b'], and Beautiful Soup encounters a <P> tag. If it just tossed another 'p' onto the stack, this would imply that the second <P> tag is within the first <P> tag, not to mention the open <B> tag. But that's not the way <P> tags work. You can't stick a <P> tag inside another <P> tag. A <P> tag isn't "nestable" at all.
So when Beautiful Soup encounters a <P> tag, it closes and pops all the tags up to and including the previously encountered tag of the same type. This is the default behavior, and this is how BeautifulStoneSoup treats every tag. It's what you get when a tag is not mentioned in either NESTABLE_TAGS or RESET_NESTING_TAGS. It's also what you get when a tag shows up in RESET_NESTING_TAGS but has no entry in NESTABLE_TAGS, the way the <P> tag does.
from BeautifulSoup import BeautifulSoup
BeautifulSoup.RESET_NESTING_TAGS['p'] == None
#True
BeautifulSoup.NESTABLE_TAGS.has_key('p')
#False

print BeautifulSoup("<html><p>Para<b>one<p>Para two")
#<html><p>Para<b>one</b></p><p>Para two</p></html>
#                   ^---^--The second <p> tag made those two tags get closed
Let's say the stack looks like ['html', 'span', 'b'], and Beautiful Soup encounters a <SPAN> tag. Now, <SPAN> tags can contain other <SPAN> tags without limit, so there's no need to pop up to the previous <SPAN> tag when you encounter one. This is represented by mapping the tag name to an empty list in NESTABLE_TAGS. This kind of tag should not be mentioned in RESET_NESTING_TAGS: there are no circumstances when encountering a <SPAN> tag would cause any tags to be popped.
from BeautifulSoup import BeautifulSoup
BeautifulSoup.NESTABLE_TAGS['span']
#[]
BeautifulSoup.RESET_NESTING_TAGS.has_key('span')
#False

print BeautifulSoup("<html><span>Span<b>one<span>Span two")
#<html><span>Span<b>one<span>Span two</span></b></span></html>
Third example: suppose the stack looks like ['ol','li','ul']: that is, we've got an ordered list, the first element of which contains an unordered list. Now suppose Beautiful Soup encounters a <LI> tag. It shouldn't pop up to the first <LI> tag, because this new <LI> tag is part of the unordered sublist. It's okay for an <LI> tag to be inside another <LI> tag, so long as there's a <UL> or <OL> tag in the way.
from BeautifulSoup import BeautifulSoup print BeautifulSoup("<ol><li>1<ul><li>A").prettify() #<ol> # <li> # 1 # <ul> # <li> # A # </li> # </ul> # </li> #</ol>
But if there is no intervening <UL> or <OL>, then one <LI> tag can't be underneath another:
print BeautifulSoup("<ol><li>1<li>A").prettify() #<ol> # <li> # 1 # </li> # <li> # A # </li> #</ol>
We tell Beautiful Soup to treat <LI> tags this way by putting "li" in RESET_NESTING_TAGS, and by giving "li" a NESTABLE_TAGS entry showing the list of tags under which it can nest.
BeautifulSoup.RESET_NESTING_TAGS.has_key('li')
#True
BeautifulSoup.NESTABLE_TAGS['li']
#['ul', 'ol']
This is also how we handle the nesting of table tags:
BeautifulSoup.NESTABLE_TAGS['td']
#['tr']
BeautifulSoup.NESTABLE_TAGS['tr']
#['table', 'tbody', 'tfoot', 'thead']
BeautifulSoup.NESTABLE_TAGS['tbody']
#['table']
BeautifulSoup.NESTABLE_TAGS['thead']
#['table']
BeautifulSoup.NESTABLE_TAGS['tfoot']
#['table']
BeautifulSoup.NESTABLE_TAGS['table']
#[]
That is: <TD> tags can be nested within <TR> tags. <TR> tags can be nested within <TABLE>, <TBODY>, <TFOOT>, and <THEAD> tags. <TBODY>, <TFOOT>, and <THEAD> tags can be nested in <TABLE> tags, and <TABLE> tags can be nested in other <TABLE> tags. If you know about HTML tables, these rules should already make sense to you.
One more example. Say the stack looks like ['html', 'p', 'table'] and Beautiful Soup encounters a <P> tag.
At first glance, this looks just like the example where the stack is ['html', 'p', 'b'] and Beautiful Soup encounters a <P> tag. In that example, we closed the <B> and <P> tags, because you can't have one paragraph inside another.
Except... you can have a paragraph that contains a table, and then the table contains a paragraph. So the right thing to do is to not close any of these tags. Beautiful Soup does the right thing:
from BeautifulSoup import BeautifulSoup print BeautifulSoup("<p>Para 1<b><p>Para 2") #<p> # Para 1 # <b> # </b> #</p> #<p> # Para 2 #</p> print BeautifulSoup("<p>Para 1<table><p>Para 2").prettify() #<p> # Para 1 # <table> # <p> # Para 2 # </p> # </table> #</p>
What's the difference? The difference is that <TABLE> is in RESET_NESTING_TAGS and <B> is not. A tag that's in RESET_NESTING_TAGS doesn't get popped off the stack as easily as a tag that's not.
Okay, hopefully you get the idea. Here's the NESTABLE_TAGS for the BeautifulSoup class. Correlate this with what you know about HTML, and you should be able to create your own NESTABLE_TAGS for bizarre HTML documents that don't follow the normal rules, and for other XML dialects that have different nesting rules.
from BeautifulSoup import BeautifulSoup
nestKeys = BeautifulSoup.NESTABLE_TAGS.keys()
nestKeys.sort()
for key in nestKeys:
    print "%s: %s" % (key, BeautifulSoup.NESTABLE_TAGS[key])
#bdo: []
#blockquote: []
#center: []
#dd: ['dl']
#del: []
#div: []
#dl: []
#dt: ['dl']
#fieldset: []
#font: []
#ins: []
#li: ['ul', 'ol']
#object: []
#ol: []
#q: []
#span: []
#sub: []
#sup: []
#table: []
#tbody: ['table']
#td: ['tr']
#tfoot: ['table']
#th: ['tr']
#thead: ['table']
#tr: ['table', 'tbody', 'tfoot', 'thead']
#ul: []
And here's BeautifulSoup's RESET_NESTING_TAGS. Only the keys are important: RESET_NESTING_TAGS is actually a list, put into the form of a dictionary for quick random access.
from BeautifulSoup import BeautifulSoup
resetKeys = BeautifulSoup.RESET_NESTING_TAGS.keys()
resetKeys.sort()
resetKeys
#['address', 'blockquote', 'dd', 'del', 'div', 'dl', 'dt', 'fieldset',
# 'form', 'ins', 'li', 'noscript', 'ol', 'p', 'pre', 'table', 'tbody',
# 'td', 'tfoot', 'th', 'thead', 'tr', 'ul']
Since you're subclassing anyway, you might as well override SELF_CLOSING_TAGS while you're at it. It's a dictionary that maps self-closing tag names to any values at all (like RESET_NESTING_TAGS, it's actually a list in the form of a dictionary). Then you won't have to pass that list in to the constructor (as selfClosingTags) every time you instantiate your subclass.
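Here's a minimal sketch of such a subclass; the tag names are made up for illustration:

from BeautifulSoup import BeautifulStoneSoup

class MyDialectSoup(BeautifulStoneSoup):
    # A list in dictionary form: only the keys matter.
    SELF_CLOSING_TAGS = {'marker': None}
    # An <item> tag may sit inside another <item> only by way of a <list> tag.
    NESTABLE_TAGS = {'item': ['list']}
    RESET_NESTING_TAGS = {'item': None}

# Now <item> tags close one another, and <marker /> needs no end tag:
print MyDialectSoup("<list><item>One<item>Two<marker>").prettify()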
When you parse a document, you can convert HTML or XML entity references to the corresponding Unicode characters. This code converts the HTML entity "&eacute;" to the Unicode character LATIN SMALL LETTER E WITH ACUTE, and the numeric entity "&#101;" to the Unicode character LATIN SMALL LETTER E.
from BeautifulSoup import BeautifulStoneSoup
BeautifulStoneSoup("Sacr&eacute; bleu!",
                   convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
#u'Sacr\xe9 bleu!'
That's if you use HTML_ENTITIES (which is just the string "html"). If you use XML_ENTITIES (or the string "xml"), then only numeric entities and the five XML entities ("&quot;", "&apos;", "&gt;", "&lt;", and "&amp;") get converted. If you use ALL_ENTITIES (or the list ["xml", "html"]), then both kinds of entities will be converted. This last one is necessary because "&apos;" is an XML entity but not an HTML entity.
BeautifulStoneSoup("Sacré bleu!", convertEntities=BeautifulStoneSoup.XML_ENTITIES) #Sacré bleu! from BeautifulSoup import BeautifulStoneSoup BeautifulStoneSoup("Il a dit, <<Sacré bleu!>>", convertEntities=BeautifulStoneSoup.XML_ENTITIES) #Il a dit, <<Sacré bleu!>>
If you tell Beautiful Soup to convert XML or HTML entities into the corresponding Unicode characters, then Windows-1252 characters (like Microsoft smart quotes) also get transformed into Unicode characters. This happens even if you told Beautiful Soup to convert those characters to entities.
from BeautifulSoup import BeautifulStoneSoup
smartQuotesAndEntities = "Il a dit, \x8BSacr&eacute; bleu!\x9b"

BeautifulStoneSoup(smartQuotesAndEntities, smartQuotesTo="html").contents[0]
#u'Il a dit, &lsaquo;Sacr&eacute; bleu!&rsaquo;'

BeautifulStoneSoup(smartQuotesAndEntities, convertEntities="html",
                   smartQuotesTo="html").contents[0]
#u'Il a dit, \u2039Sacr\xe9 bleu!\u203a'

BeautifulStoneSoup(smartQuotesAndEntities, convertEntities="xml",
                   smartQuotesTo="xml").contents[0]
#u'Il a dit, \u2039Sacr&eacute; bleu!\u203a'
It doesn't make sense to create new HTML/XML entities while you're busy turning all the existing entities into Unicode characters.
Beautiful Soup does pretty well at handling bad markup when "bad markup" means tags in the wrong places. But sometimes the markup is just malformed, and the underlying parser can't handle it. So Beautiful Soup runs regular expressions against an input document before trying to parse it.
By default, Beautiful Soup uses regular expressions and replacement functions to do search-and-replace on input documents. It finds self-closing tags that look like <BR/>, and changes them to look like <BR />. It finds declarations that have extraneous whitespace, like <! --Comment-->, and removes the whitespace: <!--Comment-->.
If you have bad markup that needs fixing in some other way, you can pass your own list of (regular expression, replacement function) tuples into the soup constructor, as the markupMassage argument.
Let's take an example: a page that has a malformed comment. The underlying SGML parser can't cope with this, and ignores the comment and everything afterwards:
from BeautifulSoup import BeautifulSoup
badString = "Foo<!-This comment is malformed.-->Bar<br/>Baz"
BeautifulSoup(badString)
#Foo
Let's fix it up with a regular expression and a function:
import re
myMassage = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]
BeautifulSoup(badString, markupMassage=myMassage)
#Foo<!--This comment is malformed.-->Bar
Oops, we're still missing the <BR> tag. Our markupMassage overrides the parser's default massage, so the default search-and-replace functions don't get run. The parser makes it past the comment, but it dies at the malformed self-closing tag. Let's add our new massage function to the default list, so we run all the functions.
import copy
myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
myNewMassage.extend(myMassage)
BeautifulSoup(badString, markupMassage=myNewMassage)
# Foo<!--This comment is malformed.-->Bar<br />Baz
Now we've got it all.
If you know for a fact that your markup doesn't need any regular expressions run on it, you can get a faster startup time by passing in False for markupMassage.
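For example (a minimal sketch):

from BeautifulSoup import BeautifulSoup
# This markup is known to be well-formed, so skip the regex clean-up pass.
soup = BeautifulSoup("<p>Known-good markup</p>", markupMassage=False)
soup.p
# <p>Known-good markup</p>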
SoupStrainer
Recall that all the search methods take more or less the same arguments. Behind the scenes, your arguments to a search method get transformed into a SoupStrainer object. If you call one of the methods that returns a list (like findAll), the SoupStrainer object is made available as the source property of the resulting list.
from BeautifulSoup import BeautifulStoneSoup
xml = '<person name="Bob"><parent rel="mother" name="Alice">'
xmlSoup = BeautifulStoneSoup(xml)
results = xmlSoup.findAll(rel='mother')
results.source
# <BeautifulSoup.SoupStrainer instance at 0xb7e0158c>
str(results.source)
# "None|{'rel': 'mother'}"
The SoupStrainer constructor takes most of the same arguments as find: name, attrs, text, and **kwargs. You can pass in a SoupStrainer as the name argument to any search method:
import BeautifulSoup  # the module itself, for the SoupStrainer class below
xmlSoup.findAll(results.source) == results
# True
customStrainer = BeautifulSoup.SoupStrainer(rel='mother')
xmlSoup.findAll(customStrainer) == results
# True
Yeah, who cares, right? You can carry around a method call's arguments in many other ways. But another thing you can do with a SoupStrainer is pass it into the soup constructor to restrict the parts of the document that actually get parsed. That brings us to the next section:
Improving Performance by Parsing Only Part of the Document

Beautiful Soup turns every element of a document into a Python object and connects it to a bunch of other Python objects. If you only need a subset of the document, this is really slow. But you can pass in a SoupStrainer as the parseOnlyThese argument to the soup constructor. Beautiful Soup checks each element against the SoupStrainer, and only if it matches is the element turned into a Tag or NavigableString, and added to the tree.
If an element is added to the tree, then so are its children, even if they wouldn't have matched the SoupStrainer on their own. This lets you parse only the chunks of a document that contain the data you want.
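For instance (a minimal sketch with made-up markup), the <b> tag below doesn't match the strainer on its own, but it survives because its parent <p> does:

from BeautifulSoup import BeautifulSoup, SoupStrainer
onlyParagraphs = SoupStrainer('p')
BeautifulSoup('<html><p>Text with <b>markup</b>.</p><br/></html>',
              parseOnlyThese=onlyParagraphs)
# <p>Text with <b>markup</b>.</p>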
Here's a pretty varied document:
doc = '''Bob reports <a href="http://www.bob.com/">success</a>
with his plasma breeding <a
href="http://www.bob.com/plasma">experiments</a>. <i>Don't get any on
us, Bob!</i>
<br><br>Ever hear of annular fusion? The folks at <a
href="http://www.boogabooga.net/">BoogaBooga</a> sure seem obsessed
with it. Secret project, or <b>WEB MADNESS?</b> You decide!'''
Here are several different ways of parsing the document into soup, depending on which parts you want. All of these are faster and use less memory than parsing the whole document and then using the same SoupStrainer to pick out the parts you want.
from BeautifulSoup import BeautifulSoup, SoupStrainer
import re

links = SoupStrainer('a')
[tag for tag in BeautifulSoup(doc, parseOnlyThese=links)]
# [<a href="http://www.bob.com/">success</a>,
#  <a href="http://www.bob.com/plasma">experiments</a>,
#  <a href="http://www.boogabooga.net/">BoogaBooga</a>]

linksToBob = SoupStrainer('a', href=re.compile('bob.com/'))
[tag for tag in BeautifulSoup(doc, parseOnlyThese=linksToBob)]
# [<a href="http://www.bob.com/">success</a>,
#  <a href="http://www.bob.com/plasma">experiments</a>]

mentionsOfBob = SoupStrainer(text=re.compile("Bob"))
[text for text in BeautifulSoup(doc, parseOnlyThese=mentionsOfBob)]
# [u'Bob reports ', u"Don't get any on\nus, Bob!"]

allCaps = SoupStrainer(text=lambda(t):t.upper()==t)
[text for text in BeautifulSoup(doc, parseOnlyThese=allCaps)]
# [u'. ', u'\n', u'WEB MADNESS?']
There is one major difference between the SoupStrainer you pass into a search method and the one you pass into a soup constructor. Recall that the name argument can take a function whose argument is a Tag object. You can't do this for a SoupStrainer's name, because the SoupStrainer is used to decide whether or not a Tag object should be created in the first place. You can pass in a function for a SoupStrainer's name, but it can't take a Tag object: it can only take the tag name and a map of arguments.
shortWithNoAttrs = SoupStrainer(lambda name, attrs: \
                                len(name) == 1 and not attrs)
[tag for tag in BeautifulSoup(doc, parseOnlyThese=shortWithNoAttrs)]
# [<i>Don't get any on us, Bob!</i>,
#  <b>WEB MADNESS?</b>]
Improving Memory Usage with extract

When Beautiful Soup parses a document, it loads into memory a large, densely connected data structure. If you just need a string from that data structure, you might think that you can grab the string and leave the rest of it to be garbage collected. Not so. That string is a NavigableString object. It's got a parent member that points to a Tag object, which points to other Tag objects, and so on. So long as you hold on to any part of the tree, you're keeping the whole thing in memory.
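To make that concrete (a minimal sketch; the markup is made up):

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('<html><p>Just one string</p></html>')
aString = soup.p.contents[0]
aString.parent
# <p>Just one string</p>
# As long as aString is alive, every Tag reachable from it stays in memory too.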
The extract method breaks those connections. If you call extract on the string you need, it gets disconnected from the rest of the parse tree. The rest of the tree can then go out of scope and be garbage collected, while you use the string for something else. If you just need a small part of the tree, you can call extract on its top-level Tag and let the rest of the tree get garbage collected.
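Here's a minimal sketch of that (again with made-up markup):

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('<html><h1>Headline</h1><p>Lots of text...</p></html>')
headline = soup.h1.contents[0].extract()
headline
# u'Headline'
print headline.parent
# None -- the string no longer points into the tree, so the tree can be collected.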
This works the other way, too. If there's a big chunk of the document you don't need, you can call extract to rip it out of the tree, then abandon it to be garbage collected while retaining control of the (smaller) tree.
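For example (a minimal sketch; the id is made up):

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('<html><div id="ads">Enormous sidebar</div><p>Content</p></html>')
# Rip out the chunk we don't need and abandon it to the garbage collector.
soup.find('div', id="ads").extract()
soup
# <html><p>Content</p></html>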
If you find yourself destroying big chunks of the tree, you might have been able to save time by not parsing that part of the tree in the first place; see Improving Performance by Parsing Only Part of the Document, above.
Miscellaneous
Lots of real-world applications use Beautiful Soup. Here are the publicly visible applications that I know about:
I've found several other parsers for various languages that can handle bad markup, do tree traversal for you, or are otherwise more useful than your average parser.
That's it! Have fun! I wrote Beautiful Soup to save everybody time. Once you get used to it, you should be able to wrangle data out of poorly-designed websites in just a few minutes. Send me email if you have any comments, run into problems, or want me to know about your project that uses Beautiful Soup.
--Leonard
This document (source) is part of Crummy, the webspace of Leonard Richardson (contact information). It was last modified on Thursday, February 02 2012, 13:11:38 Nowhere Standard Time and last built on Saturday, July 01 2017, 10:00:01 Nowhere Standard Time.