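Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document.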
Read the official documentation often: https://beautifulsoup.readthedocs.io/zh_CN/latest/
    from bs4 import BeautifulSoup

    html = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    '''

    # Parse the document (lxml is the parser used later in this post)
    # and print it with one tag per line.
    soup = BeautifulSoup(html, 'lxml')
    print(soup.prettify())
output:

    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        Elsie
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ; and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>
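Using bs is much like using a dictionary, with the . operator to step from the document into a tag. For the sample above, a tag, its text, and a class attribute print as:

    <title>The Dormouse's story</title>
    The Dormouse's story
    ['title']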
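The outputs above come from attribute-style access on the parsed document. The sketch below shows calls that produce them. The exact input lines are an assumption, since the post lists only the outputs. The sample document is shortened, and Python's built-in html.parser is used instead of lxml so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document used in this post.
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">...</p>
</body></html>
'''

# 'html.parser' ships with Python; the post itself uses 'lxml'.
soup = BeautifulSoup(html, 'html.parser')

title_tag = soup.title           # first <title> tag in the document
title_text = soup.title.string   # the text inside that tag
p_classes = soup.p['class']      # attributes are read like dictionary keys

print(title_tag)    # <title>The Dormouse's story</title>
print(title_text)   # The Dormouse's story
print(p_classes)    # ['title']
```

The class attribute comes back as a list because an HTML tag may carry several classes at once.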
Attribute access in bs can be chained to reach the tag content you need:
    <p class="title">
     <b>
      The Dormouse's story
     </b>
    </p>

input:

    print(soup.p.b.string)

output:

    The Dormouse's story
find_all( name , attrs , recursive , string , **kwargs )
name: search by tag name (the most commonly used option)
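A common pattern is to take each element out of the returned list and process it:

    alist = soup.find_all('a')
    for a in alist:
        function(a)  # function stands for whatever processing you need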
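As a concrete version of this pattern, the following sketch collects the text and href of every link in the sample paragraph. The helper describe() is a hypothetical stand-in for function(a), and html.parser is assumed so that the example runs with bs4 alone.

```python
from bs4 import BeautifulSoup

html = '''
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
'''

soup = BeautifulSoup(html, 'html.parser')


def describe(tag):
    """Hypothetical stand-in for function(a): return (link text, href)."""
    return tag.get_text(), tag.get('href')


# find_all('a') returns every <a> tag in document order.
links = [describe(a) for a in soup.find_all('a')]

for text, href in links:
    print(text, href)
```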
    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(attrs={'id': 'list-1'}))
    print(soup.find_all(attrs={'name': 'elements'}))

Each of the two print statements above outputs:

    [<ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]
You can also search by a tag attribute passed as a keyword argument; here each tag's id attribute is checked:
soup.find_all(id = 'list-2')
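The keyword form and the attrs dictionary are two ways to express the same attribute filter. The sketch below compares them on a shortened version of the sample markup (html.parser is assumed). It also shows why an HTML attribute that is itself called name has to go through attrs: the first positional argument of find_all is already the tag name.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample markup from this post.
html = '''
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Bar</li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')

# Keyword argument and attrs dictionary select the same tags.
by_keyword = soup.find_all(id='list-2')
by_attrs = soup.find_all(attrs={'id': 'list-2'})
same_result = by_keyword == by_attrs

# The HTML attribute called "name" must be passed through attrs ...
by_name_attr = soup.find_all(attrs={'name': 'elements'})
# ... because a plain string argument is read as a tag name,
# and there are no <elements> tags in this document.
as_tag_name = soup.find_all('elements')

print(same_result)
print([tag['id'] for tag in by_name_attr])
print(as_tag_name)
```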
class is a reserved word in Python, so it cannot be used directly as a keyword argument. Handle it like this:
soup.find_all(attrs={"class": "element"})
Alternatively, use the class_ keyword argument to search on the class attribute:
soup.find_all("div",class_ = "panel-body")
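Both ways of filtering on class can be checked side by side. This is a minimal sketch on a shortened copy of the sample markup, again assuming html.parser. It also shows that a class filter matches any one of a tag's classes, so "list" also finds the tag whose class is "list list-small".

```python
from bs4 import BeautifulSoup

# Shortened version of the sample markup from this post.
html = '''
<div class="panel-body">
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Dictionary form: the key "class" is an ordinary string, so no clash
# with the Python keyword.
with_dict = soup.find_all(attrs={'class': 'element'})

# Keyword form: note the trailing underscore in class_.
with_keyword = soup.find_all('li', class_='element')

# A single class name matches tags that carry several classes.
lists = soup.find_all('ul', class_='list')

print([tag.get_text() for tag in with_dict])
print([tag['id'] for tag in lists])
```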
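Getting a tag's CSS selector quickly in Google Chrome

Pick the element with the element picker -> find the corresponding line of markup -> right-click -> Copy

Choose whichever copy option you need; "Copy selector" gives the element's CSS path.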
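A copied selector can then be handed to Beautiful Soup's select() or select_one(). The post does not show this step, so the sketch below is an assumption about how the selector would be used. The selector string is a hypothetical example in the style Chrome produces, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Sample markup from this post (panel body only).
html = '''
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Hypothetical selector, shaped like what "Copy selector" returns
# for the second item of the second list.
copied_selector = '#list-2 > li:nth-child(2)'

# select_one() returns the first match, or None when nothing matches.
second_item = soup.select_one(copied_selector)

# select() returns every match as a list.
first_list_items = soup.select('#list-1 > li.element')

print(second_item.get_text())
print([li.get_text() for li in first_list_items])
```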
If find_all() finds no matching data, it returns an empty list,
while find() returns None.
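This difference matters when the result is used right away: looping over an empty list is harmless, but calling a method on None raises AttributeError. A minimal sketch of the usual guard, using a one-line document and html.parser as assumptions:

```python
from bs4 import BeautifulSoup

html = '<div class="panel"><h4>Hello</h4></div>'
soup = BeautifulSoup(html, 'html.parser')

# No <table> exists in this document.
missing_all = soup.find_all('table')   # empty list
missing_one = soup.find('table')       # None

# Safe: the loop body simply never runs.
for table in missing_all:
    print(table)

# Check for None before touching the result of find().
heading = soup.find('h4')
heading_text = heading.get_text() if heading is not None else ''
table_text = missing_one.get_text() if missing_one is not None else ''

print(repr(heading_text))
print(repr(table_text))
```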