BeautifulSoup supports most CSS selectors. Pass a selector string to the .select() method of a tag or soup object; the matching elements are returned as a list.
tag.select("string")
BeautifulSoup.select("string")
Sample document:
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title" name="dromouse">
<b>The Dormouse's story</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="mysis" href="http://example.com/elsie" id="link1">
<b>the first b tag</b>
Elsie
</a>,
<a class="mysis" href="http://example.com/lacie" id="link2" myname="kong">
Lacie
</a>and
<a class="mysis" href="http://example.com/tillie" id="link3">
Tillie
</a>;and they lived at the bottom of a well.
</p>
<p class="story">
myStory
<a>the end a tag</a>
</p>
<a>the p tag sibling</a>
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
1. Selecting by tag
# select all title tags
soup.select("title")
# select the third p tag; equivalent to soup.select("p")[2]
soup.select("p:nth-of-type(3)")
# select all a tags under body
soup.select("body a")
# select only the a tags that are direct children of body
soup.select("body > a")
# select all later siblings of id=link1 that have class mysis
soup.select("#link1 ~ .mysis")
# select the immediately following sibling of id=link1 with class mysis
soup.select("#link1 + .mysis")
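The combinators above are easiest to verify by running them; a minimal runnable sketch against a trimmed-down version of the sample document (using the stdlib html.parser instead of lxml, so no extra parser dependency is assumed):

```python
from bs4 import BeautifulSoup

html = """
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a class="mysis" id="link1">Elsie</a>
<a class="mysis" id="link2">Lacie</a>
<a class="mysis" id="link3">Tillie</a>
</p>
<p class="story">myStory</p>
<a>the p tag sibling</a>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("body a")))           # 4: every a anywhere under body
print(len(soup.select("body > a")))         # 1: only the direct child
print(len(soup.select("#link1 ~ .mysis")))  # 2: all later siblings
print(len(soup.select("#link1 + .mysis")))  # 1: only the immediate sibling
print(soup.select("p:nth-of-type(3)")[0].get_text(strip=True))  # myStory
```

Note how `~` and `+` differ only in whether every later sibling matches or just the adjacent one.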
2. Selecting by class name
# select a tags whose class attribute is mysis
soup.select("a.mysis")
3. Selecting by id
# select a tags whose id attribute is link1
soup.select("a#link1")
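Both forms prefix the tag name to the `.class` or `#id` part; a small self-contained sketch:

```python
from bs4 import BeautifulSoup

doc = '<a class="mysis" id="link1">Elsie</a><a class="mysis" id="link2">Lacie</a>'
soup = BeautifulSoup(doc, "html.parser")

print(len(soup.select("a.mysis")))           # 2: both a tags carry class="mysis"
print(len(soup.select("a#link1")))           # 1: ids are unique per document
print(soup.select("a#link1")[0].get_text())  # Elsie
```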
4. Selecting by [attribute] (this also works for class)
# select a tags that have a myname attribute
soup.select("a[myname]")
# select a tags whose href equals http://example.com/lacie
soup.select("a[href='http://example.com/lacie']")
# select a tags whose href starts with http
soup.select('a[href^="http"]')
# select a tags whose href ends with lacie
soup.select('a[href$="lacie"]')
# select a tags whose href contains .com
soup.select('a[href*=".com"]')
# remove tags from the tree; afterwards the soup no longer contains any script tags
[s.extract() for s in soup('script')]
# to remove several tag types at once
[s.extract() for s in soup(['script', 'iframe'])]
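extract() mutates the tree in place, which is worth seeing end to end; a minimal sketch showing that the script tags really disappear from the soup:

```python
from bs4 import BeautifulSoup

doc = "<body><script>var x = 1;</script><p>text</p><script>var y = 2;</script></body>"
soup = BeautifulSoup(doc, "html.parser")

# soup("script") is shorthand for soup.find_all("script")
removed = [s.extract() for s in soup("script")]

print(len(removed))           # 2: both tags were pulled out of the tree
print(soup.select("script"))  # []: the soup no longer contains them
print(soup.get_text())        # text
```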
5. Getting text and attributes
from bs4 import BeautifulSoup

html_doc = """<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
and they lived at the bottom of a well.
<p class="story">...</p>
</body>
"""

# select() returns a list
soup = BeautifulSoup(html_doc, 'html.parser')
s = soup.select('p.story')
s[0].get_text()                 # text of the p node and all its descendants
s[0].get_text("|")              # join the text pieces with a separator
s[0].get_text("|", strip=True)  # also strip leading/trailing whitespace from each piece
print(s[0].get("class"))        # the p node's class attribute as a list (every attribute other than class is returned as a string)
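The separator and strip arguments are easiest to see side by side on a smaller document; a short sketch:

```python
from bs4 import BeautifulSoup

doc = '<p class="story">Once upon <a id="link1">Elsie</a> and <a id="link2">Lacie</a>.</p>'
soup = BeautifulSoup(doc, "html.parser")
p = soup.select("p.story")[0]

print(p.get_text("|", strip=True))  # Once upon|Elsie|and|Lacie|.
print(p.get("class"))               # ['story']: class comes back as a list
print(p.a.get("id"))                # link1: other attributes come back as strings
```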
6. The UnicodeDammit.detwingle() method can only decode Windows-1252 content embedded inside a UTF-8 document:
new_doc = UnicodeDammit.detwingle(doc)
print(new_doc.decode("utf8"))
# ☃☃☃“I like snowmen!”
Always run the document through UnicodeDammit.detwingle() before creating a BeautifulSoup or UnicodeDammit object, so the encoding is interpreted correctly. If you try to parse a UTF-8 document with embedded Windows-1252 content directly, you get mojibake instead of “I like snowmen!”.
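The full round trip, adapted from the Beautiful Soup documentation: build a byte string that mixes UTF-8 and Windows-1252, detwingle it, then decode the result as plain UTF-8:

```python
from bs4 import UnicodeDammit

snowmen = "\N{SNOWMAN}" * 3
quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"

# UTF-8 snowmen followed by Windows-1252 curly quotes:
# neither codec can decode the whole byte string on its own
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

new_doc = UnicodeDammit.detwingle(doc)  # re-encodes the Windows-1252 parts as UTF-8
print(new_doc.decode("utf8"))           # ☃☃☃“I like snowmen!”
```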
7. Other examples:
from bs4 import BeautifulSoup

html_doc = """<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
and they lived at the bottom of a well.
<p class="story">...</p>
</body>
"""

# select() returns a list
soup = BeautifulSoup(html_doc, 'html.parser')
soup.select('title')             # title tags
soup.select("p:nth-of-type(3)")  # the third p node
soup.select('body a')            # all a descendants of body
soup.select('p > a')             # a nodes that are direct children of a p node
soup.select('p > #link1')        # direct children of a p node with id=link1
soup.select('#link1 ~ .sister')  # all later siblings of id=link1 with class=sister
soup.select('#link1 + .sister')  # the first later sibling of id=link1 with class=sister
soup.select('.sister')           # all nodes with class=sister
soup.select('[class="sister"]')  # the same, written as an attribute selector
soup.select("#link1")            # the node with id=link1
soup.select("a#link1")           # a nodes with id=link1
soup.select('a[href]')           # a nodes that have an href attribute
soup.select('a[href="http://example.com/elsie"]')  # a nodes with exactly this href
soup.select('a[href^="http://example.com/"]')      # href starts with the given value
soup.select('a[href$="tillie"]')                   # href ends with the given value
soup.select('a[href*=".com/el"]')                  # href contains the given substring (a substring match, not a regex)