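BeautifulSoup is a flexible and convenient library for parsing web pages. It is efficient and supports several parsers.
With it you can extract information from a web page without writing regular expressions.
Installation: pip3 install beautifulsoup4
Usage in detail:
Parsers supported by BeautifulSoup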
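Parser | Usage | Advantages | Disadvantages |
Python standard library | BeautifulSoup(markup,"html.parser") | Built into Python, moderate speed, tolerant of malformed documents | Less tolerant in versions before Python 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(markup,"lxml") | Fast, tolerant of malformed documents | Requires a C library |
lxml XML parser | BeautifulSoup(markup,"xml") | Fast, the only parser that supports XML | Requires a C library |
html5lib | BeautifulSoup(markup,"html5lib") | Most tolerant, parses the document the way a browser does, produces HTML5 | Slow; pure Python with no C extension, but installed separately |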
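As a minimal sketch of choosing a parser from the table above, the snippet below tries lxml first and falls back to the built-in html.parser when lxml is not installed. The helper name make_soup and the sample markup are assumptions made for illustration; FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if available, otherwise with html.parser.

    make_soup is a hypothetical helper, not part of bs4.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; fall back to the standard-library parser.
        return BeautifulSoup(markup, "html.parser")


# The markup is deliberately incomplete: <p> and <html> are never closed.
soup = make_soup("<html><head><title>Demo</title></head><body><p>Hi</body>")
print(soup.title.string)
print(soup.p.string)
```

Catching the exception once in a helper keeps the rest of a script independent of which parser happens to be installed.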
    from bs4 import BeautifulSoup

    # An incomplete piece of HTML
    html = '''
    <html><head><title>The Demouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Domouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters,and their name were
    <a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
    <a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
    <a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
    and they lived the bottom of a wall</p>
    <p clas="stuy">..</p>
    '''
    soup = BeautifulSoup(html, 'lxml')
    # The parser completes the markup (error tolerance); prettify() prints it indented
    print(soup.prettify())
    # Select the title tag and print its content
    print(soup.title.string)

Output:

    <html>
     <head>
      <title>
       The Demouse's story
      </title>
     </head>
     <body>
      <p class="title" name="dromouse">
       <b>
        The Domouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters,and their name were
       <a class="sister" href="http://examlpe.com/elele" ld="link1">
        <!--Elsle-->
       </a>
       <a class="sister" href="http://examlpe.com/lacie" ld="link2">
        <!--Elsle-->
       </a>
       <a class="sister" href="http://examlpe.com/title" ld="link3">
        <title>
        </title>
       </a>
       and they lived the bottom of a wall
      </p>
      <p clas="stuy">
       ..
      </p>
     </body>
    </html>
    The Demouse's story
Tag selectors: in the example above, soup.title.string selects the title tag.
Selecting elements:

    from bs4 import BeautifulSoup

    # An incomplete piece of HTML
    html = '''
    <html><head><title>The Demouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Domouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters,and their name were
    <a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
    <a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
    <a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
    and they lived the bottom of a wall</p>
    <p clas="stuy">..</p>
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.title)
    print(type(soup.title))
    print(soup.head)
    print(soup.p)

Output:

    <title>The Demouse's story</title>
    <class 'bs4.element.Tag'>
    <head><title>The Demouse's story</title></head>
    <p class="title" name="dromouse"><b>The Domouse's story</b></p>

When several tags match, only the first one is returned.
Getting the name:

    from bs4 import BeautifulSoup

    # An incomplete piece of HTML
    html = '''
    <html><head><title>The Demouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Domouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters,and their name were
    <a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
    <a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
    <a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
    and they lived the bottom of a wall</p>
    <p clas="stuy">..</p>
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.title.name)

Output: title
Getting attributes:

    from bs4 import BeautifulSoup

    # An incomplete piece of HTML
    html = '''
    <html><head><title>The Demouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Domouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters,and their name were
    <a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
    <a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
    <a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
    and they lived the bottom of a wall</p>
    <p clas="stuy">..</p>
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.attrs['name'])
    print(soup.p['name'])

Note: both soup.p.attrs['name'] and soup.p['name'] work for getting an attribute.
Also note that the attribute name goes in square brackets, not parentheses.
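The two access styles above can be compared side by side. The following is a small sketch with made-up markup; it also shows Tag.get(), which returns None for a missing attribute instead of raising KeyError, and that class comes back as a list because it is a multi-valued attribute.

```python
from bs4 import BeautifulSoup

# Sample markup assumed for illustration.
html = '<p class="title" name="dromouse"><b>The story</b></p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
name_by_index = p["name"]        # raises KeyError if the attribute is missing
name_by_attrs = p.attrs["name"]  # attrs is a dict of all attributes
missing = p.get("id")            # get() returns None instead of raising
classes = p["class"]             # class is multi-valued, so this is a list

print(name_by_index, name_by_attrs, missing, classes)
```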
Getting content:
As the examples show, use the string attribute: soup.title.string returns the tag's text content.
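One detail worth illustrating is that string only returns text when a tag has a single child. The sketch below, using assumed sample markup, contrasts it with get_text(), which joins every string inside the tag.

```python
from bs4 import BeautifulSoup

# Sample markup assumed for illustration.
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

single = soup.b.string         # <b> has exactly one child, a string
mixed = soup.p.string          # <p> has two children, so .string is None
full_text = soup.p.get_text()  # concatenates every string inside <p>

print(repr(single), repr(mixed), repr(full_text))
```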
Nested selection:
Selections can be chained, for example: print(soup.head.title.string)
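Child and descendant nodes:
For example, print(soup.p.contents): contents returns all direct children of the p tag as a list.
You can also use children. Unlike contents, children is an iterator over the direct children, so you need a loop to get at its items, for example:

    print(soup.p.children)
    for i, child in enumerate(soup.p.children):
        print(i, child)

There is also the descendants attribute, which yields all descendant nodes and is likewise an iterator:

    print(soup.p.descendants)
    for i, child in enumerate(soup.p.descendants):
        print(i, child)

Note: expressions such as soup.p, used here for children and descendants and below for parents and ancestors, select the first matching p tag, so all of these nodes are relative to that first match.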
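To make the difference between children and descendants concrete, here is a short sketch on assumed sample markup: children stops at the first level, while descendants walks into nested tags as well.

```python
from bs4 import BeautifulSoup

# Sample markup assumed for illustration.
soup = BeautifulSoup("<p>One <b>two <i>three</i></b> four</p>", "html.parser")

children = list(soup.p.children)        # direct children only
descendants = list(soup.p.descendants)  # children, grandchildren, and so on

print(len(children), len(descendants))
print(soup.p.contents == children)      # contents is the same data as a list
```

Here the direct children are the string "One ", the b tag, and the string " four"; the descendants additionally include "two ", the i tag, and "three".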
Parent and ancestor nodes:
parent attribute: gets the direct parent node
parents attribute: iterates over all ancestor nodes
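A brief sketch of the two attributes on assumed sample markup: parent is a single node, while parents walks upward until it reaches the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

# Sample markup assumed for illustration.
html = "<html><body><p><b>text</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

direct_parent = soup.b.parent.name
ancestors = [tag.name for tag in soup.b.parents]

print(direct_parent)
print(ancestors)
```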
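Sibling nodes:
next_siblings attribute: iterates over the siblings that follow the node
previous_siblings attribute: iterates over the siblings that precede the node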
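The sibling attributes can be sketched as follows. The markup is an assumption and is written without whitespace between the tags on purpose: in real pages the newlines between tags are text nodes and show up as siblings too.

```python
from bs4 import BeautifulSoup

# Sample markup assumed for illustration; no whitespace between the <a> tags.
html = '<p><a id="link1">A</a><a id="link2">B</a><a id="link3">C</a></p>'
soup = BeautifulSoup(html, "html.parser")

middle = soup.find("a", id="link2")
after = [tag["id"] for tag in middle.next_siblings]
before = [tag["id"] for tag in middle.previous_siblings]

print(before, after)
```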
--------------------------------------------------------------------------------------------------------------------
The tag selectors described above are fast, but they cannot cover every need when parsing an HTML document.
The find_all method:
find_all(name,attrs,recursive,text,**kwargs)
Searches the document by tag name, attributes, or content.
Searching by name:
    from bs4 import BeautifulSoup

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>hello</h4>
    </div>
    <div class="panel-body">
    <url class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">jay</li>
    </url>
    <url class="list list-small" id="list-2">
    <li lass="element">Foo</li>
    <li lass="element">Bar</li>
    </url>
    </div>
    </div>
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all('url'))

Output:

    [<url class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">jay</li>
    </url>, <url class="list list-small" id="list-2">
    <li lass="element">Foo</li>
    <li lass="element">Bar</li>
    </url>]
As you can see, the result is a list. You can loop over it and search within each item, for example:

    from bs4 import BeautifulSoup

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>hello</h4>
    </div>
    <div class="panel-body">
    <url class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">jay</li>
    </url>
    <url class="list list-small" id="list-2">
    <li lass="element">Foo</li>
    <li lass="element">Bar</li>
    </url>
    </div>
    </div>
    '''
    soup = BeautifulSoup(html, 'lxml')
    for url in soup.find_all('url'):
        print(url.find_all('li'))

Output:

    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">jay</li>]
    [<li lass="element">Foo</li>, <li lass="element">Bar</li>]
Searching by attrs:
attrs takes a dictionary, for example:

    from bs4 import BeautifulSoup

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>hello</h4>
    </div>
    <div class="panel-body">
    <url class="list" id="list-1" name='elements'>
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">jay</li>
    </url>
    <url class="list list-small" id="list-2">
    <li lass="element">Foo</li>
    <li lass="element">Bar</li>
    </url>
    </div>
    </div>
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(attrs={'id': 'list-1'}))  # soup.find_all(id='list-1') also works
    print(soup.find_all(attrs={'name': 'elements'}))

Output:

    [<url class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">jay</li>
    </url>]
    [<url class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">jay</li>
    </url>]

Note: you can search with a keyword argument such as soup.find_all(id='list-1'), but for the class attribute you must write class_='value'. class is a reserved word in Python, so it cannot be used as a keyword argument name.
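The class_ form can be sketched like this. The markup below is an assumption (it uses ordinary ul tags). It also shows that a tag carrying several classes matches a search for any one of them, and that the attrs dictionary form gives the same result.

```python
from bs4 import BeautifulSoup

# Sample markup assumed for illustration.
html = '''
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
'''
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all(class_="list")          # keyword form, trailing underscore
by_attrs = soup.find_all(attrs={"class": "list"})  # dictionary form, no underscore
small_only = soup.find_all("ul", class_="list-small")

print([tag["id"] for tag in by_keyword])
print([tag["id"] for tag in small_only])
```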
Searching by text:

    from bs4 import BeautifulSoup

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>hello</h4>
    </div>
    <div class="panel-body">
    <url class="list" id="list-1" name='elements'>
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">jay</li>
    </url>
    <url class="list list-small" id="list-2">
    <li lass="element">Foo</li>
    <li lass="element">Bar</li>
    </url>
    </div>
    </div>
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(text='Foo'))

Output:

    ['Foo', 'Foo']
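Text search also accepts a compiled regular expression. The sketch below uses assumed markup and the string argument, which is the name Beautiful Soup 4.4.0 and later use for the same argument as text. Combining a tag name with string returns the matching tags rather than the strings.

```python
import re

from bs4 import BeautifulSoup

# Sample markup assumed for illustration.
html = "<ul><li>Foo</li><li>Bar</li><li>Baz</li><li>Foo</li></ul>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Foo")                # strings equal to "Foo"
pattern = soup.find_all(string=re.compile("^Ba"))  # strings starting with "Ba"
tags = soup.find_all("li", string="Foo")           # <li> tags whose text is "Foo"

print(exact, pattern, tags)
```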
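The find method is used exactly like find_all. The difference is that find_all returns every matching element as a list, while find returns a single element: the first match.
find(name,attrs,recursive,text,**kwargs)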
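The difference matters most when nothing matches. As a small sketch on assumed markup: find returns None, find_all returns an empty list, and find_all with limit=1 is the list-valued equivalent of find.

```python
from bs4 import BeautifulSoup

# Sample markup assumed for illustration.
soup = BeautifulSoup("<ul><li>Foo</li><li>Bar</li></ul>", "html.parser")

first = soup.find("li")                 # a single Tag
everything = soup.find_all("li")        # a list of Tags
no_tag = soup.find("table")             # None when nothing matches
no_tags = soup.find_all("table")        # empty list when nothing matches
limited = soup.find_all("li", limit=1)  # at most one result, still a list

print(first.string, len(everything), no_tag, len(no_tags), len(limited))
```

Because find can return None, check the result before chaining further attribute access on it.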
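find_parents(): returns all matching ancestor nodes
find_parent(): returns the nearest matching ancestor
find_next_siblings(): returns all matching siblings after the node
find_next_sibling(): returns the first matching sibling after the node
find_previous_siblings(): returns all matching siblings before the node
find_previous_sibling(): returns the first matching sibling before the node
find_all_next(): returns all matching nodes after the node
find_next(): returns the first matching node after the node
find_all_previous(): returns all matching nodes before the node
find_previous(): returns the first matching node before the node
These functions take the same arguments as find and find_all; they differ only in which part of the tree they search.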
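A few members of this family can be sketched together. The markup is an assumption made for illustration; each call starts from the middle li and searches in a different direction.

```python
from bs4 import BeautifulSoup

# Sample markup assumed for illustration.
html = '''
<div id="box">
<ul id="list-1">
<li>Foo</li>
<li>Bar</li>
<li>Jay</li>
</ul>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")

parent_id = bar.find_parent("ul")["id"]                          # nearest <ul> ancestor
ancestor_ids = [tag["id"] for tag in bar.find_parents(id=True)]  # ancestors that have an id
next_li = bar.find_next_sibling("li").string                     # first <li> sibling after
prev_li = bar.find_previous_sibling("li").string                 # first <li> sibling before
following = [tag.string for tag in bar.find_all_next("li")]      # every later <li>

print(parent_id, ancestor_ids, next_li, prev_li, following)
```

Passing a tag name to the sibling methods skips the whitespace text nodes that sit between the li tags.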
CSS selectors: pass a CSS selector string directly to select() to make the selection.

    from bs4 import BeautifulSoup

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>hello</h4>
    </div>
    <div class="panel-body">
    <url class="list" id="list-1" name='elements'>
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">jay</li>
    </url>
    <url class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </url>
    </div>
    </div>
    '''
    soup = BeautifulSoup(html, 'lxml')
    # To select by class, prefix the name with a dot: .panel .panel-heading
    print(soup.select('.panel .panel-heading'))
    # Select by tag name directly
    print(soup.select('url li'))
    # To select by id, prefix the id with #
    print(soup.select('#list-2 .element'))

Output:

    [<div class="panel-heading">
    <h4>hello</h4>
    </div>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]
Nested selection, level by level:

    from bs4 import BeautifulSoup

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>hello</h4>
    </div>
    <div class="panel-body">
    <url class="list" id="list-1" name='elements'>
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">jay</li>
    </url>
    <url class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </url>
    </div>
    </div>
    '''
    soup = BeautifulSoup(html, 'lxml')
    for url in soup.select('url'):
        print(url.select('li'))

Output:

    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">jay</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]
Getting attributes:

    from bs4 import BeautifulSoup

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>hello</h4>
    </div>
    <div class="panel-body">
    <url class="list" id="list-1" name='elements'>
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">jay</li>
    </url>
    <url class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </url>
    </div>
    </div>
    '''
    soup = BeautifulSoup(html, 'lxml')
    for url in soup.select('url'):
        print(url['id'])  # print(url.attrs['id']) also works

Output:

    list-1
    list-2
Getting content:

    from bs4 import BeautifulSoup

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>hello</h4>
    </div>
    <div class="panel-body">
    <url class="list" id="list-1" name='elements'>
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">jay</li>
    </url>
    <url class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </url>
    </div>
    </div>
    '''
    soup = BeautifulSoup(html, 'lxml')
    for l in soup.select('li'):
        print(l.get_text())

Output:

    Foo
    Bar
    jay
    Foo
    Bar
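As a further sketch of CSS selection, the snippet below uses assumed markup with ordinary ul tags. It shows select_one(), which returns the first match or None in the same way that find relates to find_all, along with the child combinator > and a selector that requires two classes on the same tag.

```python
from bs4 import BeautifulSoup

# Sample markup assumed for illustration.
html = '''
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Baz</li>
</ul>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

first_match = soup.select_one(".element")        # first match only
direct_children = soup.select("#list-1 > li")    # <li> directly under #list-1
small_lists = soup.select("ul.list.list-small")  # <ul> carrying both classes
nothing = soup.select_one("table")               # None when nothing matches

print(first_match.get_text())
print([li.get_text() for li in direct_children])
print([ul["id"] for ul in small_lists])
```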
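Prefer the lxml parser, and fall back to html.parser when necessary.
Tag selection offers weak filtering but is fast.
Use find() and find_all() to match a single result or multiple results.
If you are comfortable with CSS selectors, use select().
Remember the common ways of getting attribute values and text.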