Using the example from the official BeautifulSoup documentation:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
Calling soup.prettify() prints the DOM tree that bs4 built from the document:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
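The text does not show how the soup object itself is created, so here is a minimal sketch of that step. It assumes Python 3, the bs4 package, and the built-in "html.parser" backend (the original does not name a parser), and it uses a shortened copy of html_doc to stay brief.

```python
from bs4 import BeautifulSoup

# Shortened copy of the html_doc example (assumption: trimmed for brevity).
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
</body></html>"""

# Parse the markup; "html.parser" ships with Python, so no extra install is needed.
soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns the parse tree as an indented string, one node per line.
pretty = soup.prettify()
print(pretty)
```

Naming the parser explicitly keeps the result the same across machines, since bs4 otherwise picks whichever parser happens to be installed.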
1. A few simple ways to navigate the structured data:
>>> soup.title
<title>The Dormouse's story</title>
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.title.string
The Dormouse's story
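Attribute-style access such as soup.a only returns the first matching tag. As a hedged sketch of how one might reach the other links, the snippet below uses find_all() and find(), two bs4 methods the text above does not cover; the markup is a trimmed copy of the story paragraph and the "html.parser" backend is an assumption.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the story paragraph from html_doc (assumption: shortened).
html_doc = """<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all() returns every matching tag, not just the first one.
links = [a["href"] for a in soup.find_all("a")]
print(links)

# find() returns the first tag that matches the given attribute filter.
third = soup.find(id="link3")
print(third.string)
```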
2. Passing a document to bs4
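from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"), "html.parser")   # from an open file handle
soup = BeautifulSoup("<html>data</html>", "html.parser")  # from a string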
BeautifulSoup parses a complex HTML document into a DOM tree. Every node in that tree is one of four types of object: Tag, NavigableString, BeautifulSoup, and Comment.
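To make the four types concrete, the following sketch parses a tiny fragment that contains one of each and prints the class names. The fragment is made up for illustration, and the "html.parser" backend is an assumption.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical fragment: one tag holding a comment followed by plain text.
soup = BeautifulSoup("<b><!--a comment-->bold text</b>", "html.parser")

tag = soup.b
comment, text = tag.contents  # the two children of <b>, in document order

kinds = [type(node).__name__ for node in (soup, tag, comment, text)]
print(kinds)

# The BeautifulSoup object behaves like a Tag, and a Comment is a special
# kind of NavigableString, so these checks both hold.
print(isinstance(soup, Tag))
print(isinstance(comment, NavigableString))
```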
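2.1 Tags (Tag)
A Tag object corresponds to a tag in the HTML source. Its two most important features are its name and its attributes: tag.name is the name of the tag, and tag.attrs holds its attributes.
A tag can have any number of attributes. For example, the tag <b class="boldest"> has one attribute, class, whose value is 'boldest'.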
>>> soup.a.attrs
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
(The attributes are returned as a dictionary.)
Tag attributes are read exactly as you would read a dictionary.
>>> soup.a['href']
http://example.com/elsie
>>> soup.a['class']
['sister']
Tag attributes can also be added, removed, and modified:
>>> tag['class'] = 'verybold'
>>> del tag['class']
>>> print(tag.get('class'))
# None
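The snippet above never defines tag, so here is a self-contained sketch of the same operations. It assumes tag is the <b class="boldest"> element mentioned earlier and uses the "html.parser" backend; the id attribute is added purely for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

tag["class"] = "verybold"  # modify an existing attribute
tag["id"] = "1"            # add a new attribute (illustrative value)
modified = str(tag)
print(modified)

del tag["class"]           # remove attributes with del, as with a dict
del tag["id"]
print(tag)

# get() returns None for a missing attribute, whereas tag["class"] would raise KeyError.
print(tag.get("class"))
```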
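Some attributes can hold several values at once; these are called multi-valued attributes. The most common one is class, which is why soup.a['class'] above is returned as a list.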
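A short sketch of the difference between a multi-valued attribute and an ordinary one, using made-up markup and assuming the "html.parser" backend:

```python
from bs4 import BeautifulSoup

# class is multi-valued, so its value is split on whitespace into a list.
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
classes = css_soup.p["class"]
print(classes)

# id is not multi-valued, so a value containing a space stays one string.
id_soup = BeautifulSoup('<p id="my id"></p>', "html.parser")
id_value = id_soup.p["id"]
print(id_value)
```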
2.2 Navigable strings (NavigableString)
>>> tag.string
# u'Extremely bold'
>>> type(tag.string)
# <class 'bs4.element.NavigableString'>
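To convert a NavigableString into a plain Unicode string, call unicode() on it (this is Python 2; in Python 3 the equivalent is str()):
>>> unicode_string = unicode(tag.string)
>>> unicode_string
# u'Extremely bold'
>>> type(unicode_string)
# <type 'unicode'>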
The string inside a tag cannot be edited in place, but it can be replaced with another string using the replace_with() method:
>>> tag.string.replace_with("No longer bold")
>>> tag
# <blockquote>No longer bold</blockquote>
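As a runnable sketch of the replacement, assuming tag is a <blockquote> element (the output above implies this, but the setup is not shown) and the "html.parser" backend:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<blockquote>Extremely bold</blockquote>", "html.parser")
tag = soup.blockquote

# replace_with() swaps the string inside the tree and returns the string it removed.
old = tag.string.replace_with("No longer bold")

print(tag)
print(old)
```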
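Strings do not support the .contents or .string attributes, or the find() method.
If you want to use a NavigableString outside of Beautiful Soup, call unicode() on it (str() in Python 3) to turn it into a plain Unicode string. Otherwise the string keeps a reference to the entire Beautiful Soup parse tree even after you have finished with Beautiful Soup, which wastes memory.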
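The conversion described above might look like this on Python 3, where str() takes the place of unicode(). This is a sketch with made-up markup, assuming the "html.parser" backend.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")

nav = soup.b.string  # a NavigableString that still points back into the tree
plain = str(nav)     # a plain Python string with no link to the tree

print(type(nav).__name__, nav.parent.name)
print(type(plain).__name__)
```

Keeping only plain lets the parse tree be garbage-collected once soup goes out of scope.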
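Tag, NavigableString, and BeautifulSoup cover almost everything you will find in an HTML or XML document, but a few special objects remain. The one you are most likely to need to think about is the document's comments (Comment).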
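A hedged sketch of how comments show up in the tree, using made-up markup and the "html.parser" backend. The find_all(string=...) filter is one common way to collect comments, not something the text above prescribes.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")

# The only child of <b> is a comment, so .string returns a Comment object.
comment = soup.b.string
print(type(comment).__name__)
print(comment)

# Collect every comment in a document by filtering strings on their type.
page = BeautifulSoup("<p>text<!--one--></p><p><!--two-->more</p>", "html.parser")
comments = [str(c) for c in page.find_all(string=lambda s: isinstance(s, Comment))]
print(comments)
```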
More Beautiful Soup features will be covered in later updates.