BeautifulSoup Usage Summary

The example below is taken from the official BeautifulSoup documentation:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup.prettify() returns the tree that bs4 parsed as an indented string. Printing it gives:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
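
As a minimal sketch of how such output is produced, the snippet below parses a shortened version of the sample document and prints the prettified tree. The parser name "html.parser" (Python's built-in parser) and the variable name pretty are choices made here for illustration; the text above does not say which parser was used.

```python
from bs4 import BeautifulSoup

# A shortened version of the sample document, to keep the sketch small.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
</body></html>
"""

# Naming the parser explicitly keeps the result the same across machines.
soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns a string with one tag or text node per line.
pretty = soup.prettify()
print(pretty)
```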

1. A few simple ways to navigate the structured data:

>>> soup.title
<title>The Dormouse's story</title>
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.title.string
The Dormouse's story
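
The same lookups can be run as a script. This is a sketch on a trimmed copy of the sample document. It also calls find_all(), which the text above does not cover, to show that soup.a returns only the first matching tag.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

title_tag = soup.title           # the first <title> tag
title_text = soup.title.string   # the text inside that tag
first_link = soup.a              # only the FIRST <a> tag in the document
all_links = soup.find_all("a")   # every <a> tag, in document order

print(title_text)
print(first_link["id"])
print([a["href"] for a in all_links])
```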

2. Passing a document to bs4

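from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"), "html.parser")   # from an open file handle
soup = BeautifulSoup("<html>data</html>", "html.parser")  # from a string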
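
The two input styles can be compared side by side. In this sketch a temporary file stands in for index.html so that the example runs anywhere. The markup and the variable names are made up for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = "<html><body><p>data</p></body></html>"

# 1. Parse from a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

# 2. Parse from an open file handle (a temporary file plays index.html).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(markup)
    # The with-block closes the handle once parsing is done.
    with open(path, encoding="utf-8") as fp:
        soup_from_file = BeautifulSoup(fp, "html.parser")

print(soup_from_string.p.string)
print(soup_from_file.p.string)
```

Opening the file in a with-block, rather than passing a bare open() call, ensures the handle is closed after parsing.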

BeautifulSoup parses a complex HTML document into a DOM-like tree of Python objects. bs4 has four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
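
To make the four kinds concrete, the sketch below parses a one-line document that contains a tag, a string, and a comment, then prints the class name of each object. The markup is invented for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

markup = '<b class="boldest">Extremely bold</b><!--a note-->'
soup = BeautifulSoup(markup, "html.parser")

tag = soup.b                  # a Tag
text = soup.b.string          # a NavigableString
comment = soup.contents[1]    # a Comment (second top-level node)

# Print the class name of each object, starting with the soup itself.
kinds = [type(obj).__name__ for obj in (soup, tag, text, comment)]
print(kinds)
```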

2.1 Tag

A Tag object corresponds to a tag in the HTML document. Its two most important attributes are name and attrs: tag.name is the name of the tag, and tag.attrs holds the tag's attributes.

A tag can have any number of attributes. For example, the tag <b class="boldest"> has one attribute, class, whose value is 'boldest'.

>>> soup.a.attrs
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

(The attributes are returned as a dictionary.)

Tag attributes are read exactly as dictionary entries are.

>>> soup.a['href']
http://example.com/elsie
>>> soup.a['class']
['sister']

Tag attributes can also be added, removed, and modified:

>>> tag['class'] = 'verybold'
>>> del tag['class']
>>> print(tag.get('class'))
# None
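
The dictionary-style operations above can be combined into one runnable sketch. The attribute data-visited is hypothetical and exists only to show adding a new key. The sketch also shows the difference between indexing, which raises KeyError for an absent attribute, and get(), which returns None.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
link = BeautifulSoup(markup, "html.parser").a

# Read: indexing works as it does on a dict.
href = link["href"]
missing = link.get("title")      # get() returns None for an absent attribute
try:
    link["title"]                # plain indexing raises KeyError instead
    raised = False
except KeyError:
    raised = True

# Modify, add, then delete.
link["class"] = "verybold"       # overwrite an existing attribute
link["data-visited"] = "yes"     # hypothetical attribute, for illustration
del link["id"]                   # remove an attribute

print(link.attrs)
```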

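Some attributes can hold more than one value; these are called multi-valued attributes. The most common one is class, which is why bs4 returns its value as a list.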
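
A short sketch of the difference between a multi-valued attribute and an ordinary one. The markup and the class names are invented for illustration. Note that id is not multi-valued, so a value containing a space stays a single string.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")

classes = soup.p["class"]   # multi-valued: split on whitespace into a list
id_value = soup.p["id"]     # single-valued: kept as one string

# Assigning a list to a multi-valued attribute joins it with spaces on output.
soup.p["class"] = ["intro", "lead"]
rendered = str(soup.p)

print(classes, id_value)
print(rendered)
```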

2.2 NavigableString

>>> tag.string
# u'Extremely bold'
>>> type(tag.string)
# <class 'bs4.element.NavigableString'>

To convert a NavigableString to a plain Unicode string (the snippet is Python 2; on Python 3, use str() in place of unicode()):

>>> unicode_string = unicode(tag.string)
>>> unicode_string
# u'Extremely bold'
>>> type(unicode_string)
# <type 'unicode'>

The string inside a tag cannot be edited in place, but it can be replaced with another string by calling replace_with():

tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>

A NavigableString does not support the .contents or .string attributes or the find() method, because a string cannot contain anything.
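
The replace_with() call shown above can be tried in a self-contained sketch. The variable old_string is introduced here only to show what happens to the string that was replaced: it is detached from the tree.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<blockquote>Extremely bold</blockquote>", "html.parser")
tag = soup.blockquote

old_string = tag.string                     # keep a handle on the old string
tag.string.replace_with("No longer bold")   # swap it for a new one

print(tag)
print(old_string.parent)   # the old string no longer belongs to the tree
```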

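    To use a NavigableString outside of Beautiful Soup, call unicode() on it (str() on Python 3) to turn it into a plain Unicode string. Otherwise the string keeps a reference to the whole Beautiful Soup parse tree even after you have finished with Beautiful Soup, which wastes memory.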
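
A sketch of that conversion on Python 3, where str() plays the role of unicode(). The NavigableString still points at its parent tag, and through it at the rest of the tree, while the converted copy is an ordinary str with no such link. The variable names are chosen here for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

tag = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser").b

nav_string = tag.string     # NavigableString, still attached to the tree
plain = str(nav_string)     # plain str copy, independent of the tree

print(type(nav_string).__name__, nav_string.parent is tag)
print(type(plain).__name__, hasattr(plain, "parent"))
```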

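    Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but a few special objects remain. The one you are most likely to have to deal with is the comment, which bs4 represents as a Comment object.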
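
A minimal sketch of how a comment appears in the tree, using markup in the style of the official documentation. Comment is a subclass of NavigableString, so a comment can pass for ordinary text unless its type is checked.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")

comment = soup.b.string   # the only child of <b> is the comment

# An isinstance check separates comments from ordinary text.
is_comment = isinstance(comment, Comment)
is_string = isinstance(comment, NavigableString)

print(type(comment).__name__)
print(comment)            # the comment text, without the <!-- --> markers
```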

    More of Beautiful Soup's features will be covered in later updates.
