BeautifulSoup Usage Summary

Date: 2019-11-20

Tags: beautifulsoup, usage summary


The examples below use the sample document from the official BeautifulSoup documentation:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Calling soup.prettify() prints the DOM tree that bs4 built from the document:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
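
The source never shows how the soup object is created before prettify() is called. A minimal sketch follows, assuming the built-in "html.parser" backend (the original does not name a parser) and a shortened copy of the sample document:

```python
from bs4 import BeautifulSoup

# Shortened version of the sample document, for illustration only.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
</body></html>"""

# "html.parser" is an assumption; lxml or html5lib would also work if installed.
soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns a string with one tag or text node per line,
# indented by nesting depth.
pretty = soup.prettify()
print(pretty)
```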

1. A few simple ways to navigate the structured data:

>>> soup.title

<title>The Dormouse's story</title>

>>> soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

>>> soup.title.string

The Dormouse's story
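
The same navigation can be written as a standalone script. This is a sketch on a shortened document; find_all() is not mentioned in the text above and is included only to show how to go beyond the first match that soup.a returns:

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")  # parser choice is an assumption

title_text = soup.title.string   # text inside <title>
first_link = soup.a              # attribute access returns only the FIRST <a>
all_links = soup.find_all("a")   # every <a> in document order
hrefs = [a["href"] for a in all_links]

print(title_text)
print(first_link["id"])
print(hrefs)
```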

2. Passing a document to bs4

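from bs4 import BeautifulSoup

with open("index.html") as fp:  # from an open file handle
    soup = BeautifulSoup(fp, "html.parser")

soup = BeautifulSoup("<html>data</html>", "html.parser")  # from a string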
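
To check that both input styles produce the same tree, the sketch below writes a small document to a temporary file and parses it both ways. The file name, the markup, and the "html.parser" backend are assumptions made for the example:

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = "<html><body><p>data</p></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.html")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(markup)
    # Parse from an open file handle.
    with open(path, encoding="utf-8") as fp:
        soup_from_file = BeautifulSoup(fp, "html.parser")

# Parse from a string.
soup_from_string = BeautifulSoup(markup, "html.parser")

print(str(soup_from_file) == str(soup_from_string))
```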

BeautifulSoup parses a complex HTML document into a DOM tree. Every node in that tree is one of four kinds of object: Tag, NavigableString, BeautifulSoup, or Comment.
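
The four object kinds can be observed on a tiny document. This is an illustrative sketch; the markup is made up for the example and "html.parser" is an assumed backend:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<p><b>bold</b><!--a note--></p>", "html.parser")

b = soup.b                    # a Tag
text = b.string               # a NavigableString (the text inside <b>)
comment = soup.p.contents[1]  # a Comment (second child of <p>)

for node in (soup, b, text, comment):
    print(type(node).__name__)
```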

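2.1 Tag

A Tag object corresponds to a tag in the original HTML. Its two most important features are its name and its attributes: tag.name is the name of the tag, and tag.attrs holds its attributes.

A tag can carry many attributes. For example, the tag <b class="boldest"> has an attribute named class whose value is 'boldest'.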

>>> soup.a.attrs

{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

(The attributes are returned as a dictionary.)

Attributes on a tag are read and written exactly like dictionary entries.

>>> soup.a['href']

http://example.com/elsie

>>> soup.a['class']

['sister']
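
Because the attributes behave like a dictionary, a missing key raises KeyError under indexing, while get() returns None. A small sketch of both, using one link from the sample document and an assumed "html.parser" backend:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    "html.parser",
)
link = soup.a

name = link.name             # the tag name
attrs = link.attrs           # all attributes as a dict
href = link["href"]          # dictionary-style lookup
missing = link.get("title")  # no such attribute: get() returns None

try:
    link["title"]            # indexing a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True

print(name, attrs, href, missing, raised)
```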

Tag attributes can also be added, modified, and deleted:

>>> tag['class'] = 'verybold'
>>> del tag['class']
>>> print(tag.get('class'))

# None
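
The snippet above assumes an existing tag variable. A self-contained sketch of the same operations, taking tag to be the <b class="boldest"> element mentioned earlier (the id attribute is added purely for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

tag["class"] = "verybold"  # modify an existing attribute
tag["id"] = "1"            # add a new attribute (illustrative)
after_set = str(tag)

del tag["class"]           # delete attributes
del tag["id"]
after_del = str(tag)

print(after_set)
print(after_del)
print(tag.get("class"))
```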

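Some attributes can hold several values at once; these are called multi-valued attributes. class is one of them, which is why soup.a['class'] returns a list.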
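
A short sketch of how a multi-valued attribute differs from an ordinary one. The markup is invented for the example; class is split on whitespace into a list, while id is kept as a single string even when it contains a space:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
p = soup.p

classes = p["class"]  # multi-valued: parsed into a list
ident = p["id"]       # single-valued: left as one string

# Assigning a list to a multi-valued attribute joins it back with spaces on output.
p["class"] = ["intro", "lead"]
rendered = str(p)

print(classes, ident, rendered)
```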

2.2 NavigableString

The text inside a tag is wrapped in a NavigableString object:

>>> tag.string

# 'Extremely bold'

>>> type(tag.string)

# <class 'bs4.element.NavigableString'>

A NavigableString can be converted to a plain Unicode string with str() (unicode() in Python 2):

>>> unicode_string = str(tag.string)
>>> unicode_string

# 'Extremely bold'

>>> type(unicode_string)

# <class 'str'>

The string inside a tag cannot be edited in place, but it can be replaced with another string using the replace_with() method:

tag.string.replace_with("No longer bold")
tag

# <blockquote>No longer bold</blockquote>
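
A runnable version of the replacement, as a sketch on a <b> element rather than the <blockquote> shown in the output above (the original does not show where that tag came from):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

# replace_with() swaps the string node in the tree and returns the node it removed.
old = tag.string.replace_with("No longer bold")

print(tag)
print(old)
```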

A string cannot contain other nodes, so it does not support the .contents or .string attributes or the find() method.

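To use a NavigableString outside Beautiful Soup, convert it to an ordinary Unicode string with str() (unicode() in Python 2). Otherwise the object keeps a reference to the whole parse tree even after you have finished with Beautiful Soup, which wastes memory.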
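
The difference between the two objects can be seen by checking for the link back into the tree. A minimal sketch, with an assumed "html.parser" backend:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")

nav = soup.b.string  # NavigableString: still attached to the parse tree
plain = str(nav)     # plain str: an independent copy of the text

print(type(nav).__name__, nav.parent.name)  # the node knows its parent tag
print(type(plain).__name__, hasattr(plain, "parent"))
```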

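Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but a few special objects remain. The one most likely to cause trouble is the document's comments, which are represented by the Comment type.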
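
A brief sketch of how a comment shows up in the tree. The markup is made up for the example; Comment is a subclass of NavigableString, so .string returns it like ordinary text, but it is written back out with its <!-- --> delimiters:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup("<b><!--Hey, buddy. Want to buy a used parser?--></b>", "html.parser")
comment = soup.b.string

is_comment = isinstance(comment, Comment)
comment_text = str(comment)  # the text without the delimiters
rendered = str(soup.b)       # serialized output keeps the delimiters

print(is_comment, comment_text, rendered)
```

Checking isinstance(node, Comment) before using .string is one way to avoid treating a comment as visible page text.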

More on Beautiful Soup's capabilities will be added in a later update.
