Python: Using bs4

Date: 2019-11-16

Overview

bs4 (Beautiful Soup 4) is one of the most commonly used libraries for writing Python web crawlers. Its main job is parsing HTML markup.

1. Initialization

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html>A Html Text</html>", "html.parser")

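BeautifulSoup takes two arguments: the first is the HTML text to parse, and the second names the parser to use. For HTML, "html.parser" selects the parser from Python's standard library, so it needs no extra installation.

If an HTML or XML document is malformed, different parsers may return different results.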

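Parser	Usage	Advantages
Python standard library	BeautifulSoup(html, "html.parser")	(1) Built into Python; (2) moderate speed; (3) lenient with malformed documents
lxml HTML	BeautifulSoup(html, "lxml")	(1) Fast; (2) lenient with malformed documents
lxml XML	BeautifulSoup(html, ["lxml", "xml"]) or BeautifulSoup(html, "xml")	(1) Fast; (2) the only parser that supports XML
html5lib	BeautifulSoup(html, "html5lib")	(1) Most lenient; (2) parses the document the way a browser does; (3) produces HTML5-style documents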
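
As a small illustration of how a parser copes with malformed input, the sketch below feeds an invalid fragment to html.parser only, since that parser ships with Python. The fragment is a made-up example; running the same text through lxml or html5lib (if they are installed) is a quick way to see how the results differ.

```python
from bs4 import BeautifulSoup

# "<a></p>" is invalid: the </p> closes a tag that was never opened.
broken = "<a></p>"

soup = BeautifulSoup(broken, "html.parser")

# html.parser drops the stray </p> and adds no <html> or <body> wrapper.
print(soup)        # <a></a>
print(soup.p)      # None
print(soup.html)   # None
```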

Pretty-printing

soup.prettify()  # returns the document as an indented string; prettify is a method, so the parentheses are required
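
A minimal sketch of prettify() in use, with a made-up one-line document. The exact indentation of the output may vary between Beautiful Soup versions, so no expected text is shown.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hello</p></div>", "html.parser")

# Calling the method returns the formatted markup as a str.
pretty = soup.prettify()
print(pretty)

# Without parentheses you only get the method object, not the text.
print(callable(soup.prettify))   # True
```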

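2. Objects

Beautiful Soup converts a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

(1) Tag

A Tag object corresponds to a tag in the original XML or HTML document.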

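soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')

tag = soup.b

type(tag)
# <class 'bs4.element.Tag'>

Accessing a tag by name this way (soup.b) returns None if no such tag exists, and the first one if there are several.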

Name

Every tag has a name.

tag.name
# 'b'

Attributes

A tag's attributes are stored in a dictionary.

tag['class']
# ['boldest']

tag.attrs
# {'class': ['boldest']}

type(tag.attrs)
# <class 'dict'>

Multi-valued attributes

The most common multi-valued attribute is class. A multi-valued attribute is returned as a list.

soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')

print(soup.p['class'])  # ['body', 'strikeout']

print(soup.p.attrs)     # {'class': ['body', 'strikeout']}

If an attribute looks like it has several values but is not defined as multi-valued in any version of the HTML standard, Beautiful Soup returns it as a plain string.

soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
print(soup.p['id'])    # 'my id'

Text

The text attribute returns all the strings inside a tag, including those of its descendants, joined into a single string.
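
A short sketch of the text attribute on a made-up paragraph with a nested tag; get_text() is the method form of the same operation.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>bold</b> world</p>", "html.parser")

# The string inside <b> is joined with the strings around it.
print(soup.p.text)         # Hello bold world
print(soup.p.get_text())   # Hello bold world
```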

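Other methods

tag.has_attr('id') # returns whether the tag has an id attribute

The same check can be written as 'id' in tag.attrs because, as noted above, a tag's attributes are a dictionary. Incidentally, has_key is an old legacy API that was kept to support code written before Python 2.2; Python 3 removed it from dictionaries.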
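
The attribute checks can be sketched as follows, using a made-up link tag. tag.get() is included because it returns None (or a default) for a missing attribute, whereas tag['id'] would raise KeyError.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/home">Home</a>', "html.parser")
tag = soup.a

print(tag.has_attr("href"))    # True
print("id" in tag.attrs)       # False
print(tag.get("id"))           # None
print(tag.get("id", "n/a"))    # n/a
```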

(2) NavigableString

Text usually sits inside a tag, and Beautiful Soup wraps each such piece of text in the NavigableString class. A NavigableString cannot contain other tags.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')

s = soup.b.string

print(s)        # Extremely bold

print(type(s))  # <class 'bs4.element.NavigableString'>

(3) BeautifulSoup

The BeautifulSoup object represents the whole document. Most of the time it can be treated as a Tag object. It is not a real HTML or XML tag, however: it has no attributes, and its name is the special value "[document]".
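
A brief sketch of these properties on a made-up two-paragraph document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

print(soup.name)                 # [document]
print(soup.parent)               # None
# Search methods work on it just as they do on a Tag.
print(len(soup.find_all("p")))   # 2
```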

(4) Comment

A Comment represents a comment in the document. It is a special kind of NavigableString.

soup = BeautifulSoup("<b><!--This is a comment--></b>", "html.parser")

comment = soup.b.string

print(comment)          # This is a comment

print(type(comment))    # <class 'bs4.element.Comment'>

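3. Traversal

(1) Child nodes

The contents attribute

The contents attribute returns a list of all direct children, including NavigableString nodes. A line break between tags is itself a NavigableString and therefore counts as a child.

A NavigableString has no contents attribute because it cannot have children.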

soup = BeautifulSoup("""<div>
<span>test</span>
</div>
""", 'html.parser')

element = soup.div.contents

print(element)          # ['\n', <span>test</span>, '\n']

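The children attribute

The children attribute is essentially the same as contents, except that it returns an iterator over the direct children instead of a list.

The descendants attribute

The descendants attribute iterates over all of a tag's descendants: its children, their children, and so on.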
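
The difference between children and descendants can be sketched on a small made-up fragment. It is written on one line so that no whitespace strings appear among the nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    "<div><p>one <b>two</b></p><span>three</span></div>", "html.parser"
)
div = soup.div

# children: direct children only
print([child.name for child in div.children])   # ['p', 'span']

# descendants: every nested node, tags and strings, in document order
tags = [node.name for node in div.descendants if isinstance(node, Tag)]
strings = [str(node) for node in div.descendants if not isinstance(node, Tag)]
print(tags)      # ['p', 'b', 'span']
print(strings)   # ['one ', 'two', 'three']
```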

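The string attribute

If a tag has exactly one child and that child is a NavigableString, .string returns that child.

If a tag has exactly one child and that child is another tag, .string returns the same result as the child's .string.

If a tag has more than one child, it is ambiguous which one .string should refer to, so .string is None.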

soup = BeautifulSoup("""<div>
    <p><span><b>test</b></span></p>
</div>
""", 'html.parser')

element = soup.p.string

print(element)          # test

print(type(element))    # <class 'bs4.element.NavigableString'>

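Take special care here: HTML is usually written with line breaks and indentation for readability, but Beautiful Soup treats that whitespace as NavigableString children, which can give unexpected results. In the example above, element = soup.div.string would return None, because the div has three children: two whitespace strings and the <p> tag.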
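
The whitespace pitfall can be reproduced with a minimal made-up document. stripped_strings and get_text(strip=True) are shown as two possible ways around it.

```python
from bs4 import BeautifulSoup

html = """<div>
    <p>test</p>
</div>"""
soup = BeautifulSoup(html, "html.parser")

# The div has three children: whitespace, <p>, whitespace.
print(len(soup.div.contents))            # 3
print(soup.div.string)                   # None

# The <p> has a single string child, so .string works there.
print(soup.p.string)                     # test

# Whitespace-tolerant alternatives for the div.
print(list(soup.div.stripped_strings))   # ['test']
print(soup.div.get_text(strip=True))     # test
```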

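strings and stripped_strings

If a tag contains more than one string, the strings attribute iterates over all of them. To strip surrounding whitespace and skip whitespace-only strings, use stripped_strings instead.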

soup = BeautifulSoup("""<div>
    <p>      </p>
    <p>test 1</p>
    <p>test 2</p>
</div>
""", 'html.parser')

element = soup.div.stripped_strings

print(list(element))          # ['test 1', 'test 2']

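(2) Parent nodes

The parent attribute

The parent attribute returns the parent of an element (a Tag or NavigableString). The parent of the top-level tag is the BeautifulSoup object, and the parent of the BeautifulSoup object is None.

The parents attribute

The parents attribute walks up the tree and yields all of an element's ancestors, ending with the BeautifulSoup object.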
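
A small sketch of parent and parents on a made-up nested document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><div><p><b>x</b></p></div></body></html>", "html.parser"
)
b = soup.b

print(b.parent.name)                            # p
# parents walks upward and ends at the BeautifulSoup object.
print([ancestor.name for ancestor in b.parents])
# ['p', 'div', 'body', 'html', '[document]']

print(soup.html.parent is soup)                 # True
print(soup.parent)                              # None
```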

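(3) Sibling nodes

next_sibling and previous_sibling

next_sibling returns the next sibling and previous_sibling returns the previous one. Here is an example; note that line breaks and indentation between tags would show up as siblings too, which is why the tags below are written on one line.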

soup = BeautifulSoup("""<div>
    <p>test 1</p><b>test 2</b><h>test 3</h></div>
""", 'html.parser')

print(soup.b.next_sibling)      # <h>test 3</h>

print(soup.b.previous_sibling)  # <p>test 1</p>

print(soup.h.next_sibling)      # None

next_siblings and previous_siblings

next_siblings iterates over all following siblings.

previous_siblings iterates over all preceding siblings, nearest first.
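
Both iterators can be sketched on a made-up row of four sibling tags, again written without whitespace between them:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><p>1</p><b>2</b><i>3</i><u>4</u></div>", "html.parser"
)

# Forward from the first sibling.
print([t.name for t in soup.p.next_siblings])       # ['b', 'i', 'u']

# Backward from the last sibling: nearest first.
print([t.name for t in soup.u.previous_siblings])   # ['i', 'b', 'p']

# The last sibling has nothing after it.
print(list(soup.u.next_siblings))                   # []
```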

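(4) Going back and forth

Think of parsing HTML as a series of events, one per tag or string, in document order. Beautiful Soup offers attributes for replaying that sequence.

The next_element attribute points to whatever was parsed immediately afterwards (a Tag or NavigableString).

The previous_element attribute points to whatever was parsed immediately before.

There are also next_elements and previous_elements, which iterate in the same two directions; they are not covered in detail here.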
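
The sketch below contrasts next_element with next_sibling on a made-up fragment: the sibling of <p> is the <b> tag, but the next thing parsed after the opening <p> is the string inside it.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>test 1</p><b>test 2</b></div>", "html.parser")
p = soup.p

print(p.next_sibling)              # <b>test 2</b>
print(p.next_element)              # test 1
print(soup.b.previous_element)     # test 1

# next_elements keeps going in parse order until the document ends.
print([str(e) for e in p.next_elements])
# ['test 1', '<b>test 2</b>', 'test 2']
```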

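4. Searching

(1) Filters

Before introducing find_all(), it helps to look at the kinds of filters, since they are used throughout the search API. A filter can be applied to a tag's name, to its attributes, to a string, or to a combination of these.

The examples use the following HTML document: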

html = """
<div>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a></p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

Strings

Find all <b> tags:

soup.find_all('b')  # [<b>The Dormouse's story</b>]

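Regular expressions

Pass a compiled regular expression and find_all() returns the tags whose names match it. The example below finds every tag whose name starts with b.

import re

soup.find_all(re.compile("^b"))  # [<b>The Dormouse's story</b>]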

Lists

Pass a list and find_all() returns everything that matches any item in it. The example below finds all <a> and <b> tags.

soup.find_all(["a", "b"])

True

True matches any value. The code below finds every tag but returns no string nodes.

soup.find_all(True)

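Functions

If no other filter fits, you can define a function that takes an element as its only argument and returns True when the element matches. The example below returns all tags that have a class attribute but no id attribute.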

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')


print(soup.find_all(has_class_but_no_id))

Result:

[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a></p>]

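At first glance this looks wrong, because the <a> tags have id attributes. In fact the returned list has only two elements, both <p> tags; the <a> tags appear because they are children of the second <p>.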

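(2) find and find_all

These methods search the descendants of the current tag and return those that match the filter: find() returns the first match, or None if there is none, and find_all() returns a list of all matches.

Syntax: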

　　find(name=None, attrs={}, recursive=True, text=None, **kwargs)

　　find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

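Parameters:

name: finds tags whose name matches; string nodes are ignored. All the filter examples above pass the filter as name. The other parameters accept filters too.

attrs: searches by attribute name and value. Pass a dictionary whose keys are attribute names and whose values are attribute values.

recursive: whether to search all descendants. The default is True; with False only direct children are examined.

text: searches strings. Combined with a tag name, it finds tags whose .string matches the value, and it is often used with a regular expression. In other words, although the parameter is called text, what it searches is the string attribute. Since Beautiful Soup 4.4.0 the same parameter is also available under the name string.

limit: the maximum number of results to return.

kwargs: any keyword argument that is not one of the built-in parameter names is treated as a filter on the tag attribute of that name. Note that class is a reserved word in Python, so searching by the class attribute must be written class_.

Some attribute names cannot be used as keyword arguments, for example the HTML5 data-* attributes.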

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')

print(data_soup.find_all(data-foo="value"))

# SyntaxError: keyword can't be an expression

It can, however, be passed through the attrs parameter:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')

print(data_soup.find_all(attrs={"data-foo": "value"}))

# [<div data-foo="value">foo!</div>]
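
The parameters described above can be tried together in one sketch. The HTML fragment and its ids are made up for illustration, and the last call uses the string keyword, which assumes Beautiful Soup 4.4.0 or later (older versions would use text instead).

```python
import re

from bs4 import BeautifulSoup

html = """
<div id="box">
  <p class="story">
    <a href="/elsie" class="sister" id="link1">Elsie</a>
    <a href="/lacie" class="sister" id="link2">Lacie</a>
  </p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, or None when nothing matches.
print(soup.find("a")["id"])                                   # link1
print(soup.find("table"))                                     # None

# limit caps the number of results.
print(len(soup.find_all("a", limit=1)))                       # 1

# attrs dictionary and keyword argument are two ways to filter by attribute.
print(soup.find_all("a", attrs={"id": "link2"})[0]["href"])   # /lacie
print(soup.find_all(id="link2")[0].string)                    # Lacie
print(len(soup.find_all("a", class_="sister")))               # 2

# recursive=False looks at direct children only; the <a> tags are grandchildren.
print(soup.div.find_all("a", recursive=False))                # []

# Without a tag name, a string filter returns the matching strings themselves.
print(soup.find_all(string=re.compile("^La")))                # ['Lacie']
```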

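When searching with class_, matching any single CSS class is enough. If you pass several class names in one string, the string must match the class attribute exactly: same names, same order, nothing skipped. In the example below, the first three calls find the element and the last two do not.

css_soup = BeautifulSoup('<p class="body bold strikeout"></p>', 'html.parser')

print(css_soup.find_all("p", class_="strikeout"))

print(css_soup.find_all("p", class_="body"))

print(css_soup.find_all("p", class_="body bold strikeout"))

# [<p class="body bold strikeout"></p>]

print(css_soup.find_all("p", class_="body strikeout"))

print(css_soup.find_all("p", class_="strikeout body"))

# []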

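(3) Calling a tag like find_all()

find_all() is probably the most frequently used search method in Beautiful Soup, so a shortcut is provided for it. A BeautifulSoup object or a Tag object can be called like a function, and the result is the same as calling find_all() on that object. The following two lines are equivalent: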

soup.find_all('b')

soup('b')

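(4) Other search methods

find_parents()            returns all matching ancestors

find_parent()             returns the nearest matching ancestor (the direct parent if no filter is given)

find_next_siblings()      returns all matching siblings after this node

find_next_sibling()       returns the first matching sibling after this node

find_previous_siblings()  returns all matching siblings before this node

find_previous_sibling()   returns the first matching sibling before this node

find_all_next()           returns all matching nodes that come after this node in the document

find_next()               returns the first matching node after this node

find_all_previous()       returns all matching nodes that come before this node in the document

find_previous()           returns the first matching node before this node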
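
A few of these methods, sketched on a made-up list of three paragraphs; the ids p1 to p3 are placeholders chosen for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><p id="p1">one</p><p id="p2">two</p><p id="p3">three</p></div>',
    "html.parser",
)
p1 = soup.find(id="p1")
p2 = soup.find(id="p2")
p3 = soup.find(id="p3")

print(p2.find_parent("div").name)                    # div
print(p2.find_next_sibling("p")["id"])               # p3
print(p2.find_previous_sibling("p")["id"])           # p1

# find_all_next / find_all_previous follow document order, not just siblings.
print([t["id"] for t in p1.find_all_next("p")])      # ['p2', 'p3']
print([t["id"] for t in p3.find_all_previous("p")])  # ['p2', 'p1']

# find_previous returns only the first match going backwards.
print(p3.find_previous("p")["id"])                   # p2
```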

5. CSS selectors

Beautiful Soup supports most CSS selectors through select() and select_one(). The code below demonstrates them.

from bs4 import BeautifulSoup

 
html = """
<html>
<head><title>標題</title></head>
<body>
 <p class="title" name="dromouse"><b>標題</b></p>
 <div name="divlink">
  <p>
   <a href="http://example.com/1" class="sister" id="link1">連接1</a>
   <a href="http://example.com/2" class="sister" id="link2">連接2</a>
   <a href="http://example.com/3" class="sister" id="link3">連接3</a>
  </p>
 </div>
 <p></p>
 <div name='dv2'></div>
</body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')

# Select by tag name
print(soup.select('title'))             # [<title>標題</title>]

# Select by nested tag path
print(soup.select("html head title"))   # [<title>標題</title>]

# Select by class
print(soup.select('.sister'))
# [<a class="sister" href="http://example.com/1" id="link1">連接1</a>,
# <a class="sister" href="http://example.com/2" id="link2">連接2</a>,
# <a class="sister" href="http://example.com/3" id="link3">連接3</a>]


# Select by id
print(soup.select('#link1, #link2'))
# [<a class="sister" href="http://example.com/1" id="link1">連接1</a>,
# <a class="sister" href="http://example.com/2" id="link2">連接2</a>]


# Combined selectors
print(soup.select('p #link1'))    # [<a class="sister" href="http://example.com/1" id="link1">連接1</a>]


# Select direct children
print(soup.select("head > title"))    # [<title>標題</title>]

print(soup.select("p > #link1"))    # [<a class="sister" href="http://example.com/1" id="link1">連接1</a>]

print(soup.select("p > a:nth-of-type(2)"))    # [<a class="sister" href="http://example.com/2" id="link2">連接2</a>]
# nth-of-type is a CSS pseudo-class

 

# Select following siblings (searches forward only)
print(soup.select("#link1 ~ .sister"))
# [<a class="sister" href="http://example.com/2" id="link2">連接2</a>,
# <a class="sister" href="http://example.com/3" id="link3">連接3</a>]

print(soup.select("#link1 + .sister"))
# [<a class="sister" href="http://example.com/2" id="link2">連接2</a>]

 

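# Select by attribute value
print(soup.select('a[href="http://example.com/1"]'))

# ^= matches values that start with the given text
print(soup.select('a[href^="http://example.com/"]'))

# *= matches values that contain the given text
print(soup.select('a[href*=".com/"]'))

# Select tags that have the given attribute
print(soup.select('[name]'))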

 

# Select the first matching element
print(soup.select_one(".sister"))
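
In practice select() is often combined with attribute access to pull values out of the matched tags. The sketch below uses a made-up menu and html.parser so that it runs without lxml; it also shows what the two methods return when nothing matches.

```python
from bs4 import BeautifulSoup

html = """
<ul id="menu">
  <li><a class="item" href="/a">A</a></li>
  <li><a class="item" href="/b">B</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect the href of every matching link.
links = [a["href"] for a in soup.select("#menu a.item")]
print(links)                                  # ['/a', '/b']

# select_one returns the first match only.
print(soup.select_one("a.item").get_text())   # A

# No match: select_one gives None, select gives an empty list.
print(soup.select_one("table"))               # None
print(soup.select("table"))                   # []
```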