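Last time, when introducing regular expressions, I shared a hands-on crawler exercise: scraping all the books, links, authors, and publication dates from the Douban homepage. In that exercise we parsed the page source with regular expressions, and those expressions were fairly complex overall. That brings us to today's topic, BeautifulSoup: a flexible and convenient HTML parsing library that is efficient and supports several parsers. With BeautifulSoup you can extract information from a web page without writing any regular expressions.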
pip install beautifulsoup4
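Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; good document fault tolerance | Poor document fault tolerance in versions before Python 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; good document fault tolerance; commonly used | Requires the lxml C library |
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires the lxml C library |
html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses the document the way a browser does; produces HTML5-format documents | Slow (though it does not depend on external C extensions) |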
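As a rough illustration of choosing a parser, the sketch below tries lxml first and falls back to the built-in html.parser when lxml is not installed. The helper name make_soup and the sample markup are assumptions made for this example; they are not part of bs4 or of the original exercise.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    make_soup is a hypothetical helper for this sketch, not a bs4 function.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            continue
    raise RuntimeError("none of the requested parsers is available")


# An intentionally unclosed fragment (made-up sample data).
broken = "<p class='intro'>Hello, <b>world"
soup, parser_used = make_soup(broken)

print(parser_used)        # "lxml" if it is installed, otherwise "html.parser"
print(soup.b.string)      # the unclosed <b> is still parsed as a tag
print(soup.p.get_text())  # all text inside the <p>
```

Listing html.parser last means the call always succeeds on a plain Python install, at the cost of slightly different repair behaviour between parsers.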
Below is an incomplete piece of HTML: neither the body tag nor the html tag is closed.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
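Next, parse the html above with the lxml parser:
from bs4 import BeautifulSoup  # import the library
soup = BeautifulSoup(html, 'lxml')  # create the BeautifulSoup object and choose the parser
print(soup.prettify())  # pretty-print; missing tags are completed as part of fault-tolerant parsing
print(soup.title.string)  # print the text inside the title tag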
Below are the result after fault-tolerant parsing has completed the missing tags, and the title text that was retrieved. You can see that the html and body tags have both been closed:
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title" name="dromouse">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
The Dormouse's story
####(1) Selecting elements
We keep using the html defined above.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
The result is:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
Notice that only one p tag is printed, even though the HTML contains three. This is how selecting by tag name works: when several tags match, only the first one is returned.
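To make the "first match only" behaviour concrete, here is a small sketch that compares soup.p with find() and find_all(). The markup is a shortened, made-up version of the sample above, and html.parser is used only so that the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the sample document (assumed for this sketch).
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time...</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.p                       # attribute access returns the first <p> only
all_paragraphs = soup.find_all("p")  # find_all returns every <p>

print(first is soup.find("p"))                   # True: soup.p is shorthand for find("p")
print(len(all_paragraphs))                       # 3
print([p["class"] for p in all_paragraphs])      # [['title'], ['story'], ['story']]
```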
####(2) Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
Output:
dromouse
dromouse
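A few details of attribute access are easy to miss, so here is a minimal sketch. It assumes a made-up p tag with two classes; the point is that class comes back as a list, and that get() avoids a KeyError when an attribute is absent.

```python
from bs4 import BeautifulSoup

# Made-up tag for this sketch: two classes and a name attribute.
html = '<p class="title big" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p.attrs)             # {'class': ['title', 'big'], 'name': 'dromouse'}
print(p["class"])          # ['title', 'big'] (class is multi-valued, so it is a list)
print(p.get("id"))         # None (p["id"] would raise KeyError instead)
print(p.get("id", "n/a"))  # n/a
```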
####(3) Getting content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)
Output:
The Dormouse's story
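One caveat worth sketching: .string only works when a tag has a single child. On a tag that mixes text and other tags it returns None, and get_text() or stripped_strings is the usual alternative. The paragraph below is a made-up example.

```python
from bs4 import BeautifulSoup

# Made-up paragraph that mixes text and a nested tag.
html = '<p class="story">Once upon a time <a href="http://example.com/lacie">Lacie</a> lived.</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p.string)                  # None: <p> has more than one child
print(p.a.string)                # Lacie
print(p.get_text())              # Once upon a time Lacie lived.
print(list(p.stripped_strings))  # ['Once upon a time', 'Lacie', 'lived.']
```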
####(4) Nested selection
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)
Output:
The Dormouse's story
####(5) Child nodes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)
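The output is a list. Note that the listing below comes from a variant of the sample in which the story paragraph is the first p and the first link wraps <span>Elsie</span>; with the html defined above, soup.p is the title paragraph, so soup.p.contents is just [<b>The Dormouse's story</b>].
['\n Once upon a time there were three little sisters; and their names were\n ',
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>,
'\n',
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
' \n and\n ',
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
'\n and they lived at the bottom of a well.\n ']
Another way to get the children is the children attribute, which returns an iterator instead of a list: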
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
Output:
<list_iterator object at 0x1064f7dd8>
0
Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4
and

5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6
and they lived at the bottom of a well.
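contents and children only cover direct children. For everything nested underneath a tag, bs4 also has a descendants attribute; the sketch below contrasts the two on a shortened, made-up paragraph.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, made-up paragraph: the first link wraps a <span>.
html = '<p class="story">Names: <a id="link1"><span>Elsie</span></a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

children = list(p.children)        # direct children only
descendants = list(p.descendants)  # children, grandchildren, and so on

print(len(children))     # 4: text, <a>, text, <a>
print(len(descendants))  # 7: the 4 above plus <span>, 'Elsie', 'Lacie'

# Keep only Tag objects to skip the text nodes.
tag_names = [node.name for node in descendants if isinstance(node, Tag)]
print(tag_names)         # ['a', 'span', 'a']
```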
####(6) Getting the parent node
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)
The program prints the p tag, which is the parent of the a tag:
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
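Similar navigation attributes include parents (all ancestors) and next_siblings and previous_siblings (sibling nodes).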
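As a rough sketch of navigating to other relatives of a node, the snippet below walks the ancestors and siblings of a link using bs4's parents, next_siblings and previous_siblings attributes. The markup is a shortened, made-up version of the sample, and html.parser is used only so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, made-up version of the sample document.
html = (
    '<html><body><p class="story">Once '
    '<a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a>'
    '</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a
last_link = soup.find("a", id="link3")

# parents yields every ancestor, ending with the BeautifulSoup object itself.
ancestor_names = [parent.name for parent in first_link.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']

# Sibling iterators also yield text nodes, so keep only Tag objects.
later_ids = [s["id"] for s in first_link.next_siblings if isinstance(s, Tag)]
earlier_ids = [s["id"] for s in last_link.previous_siblings if isinstance(s, Tag)]
print(later_ids)    # ['link2', 'link3']
print(earlier_ids)  # ['link2', 'link1'] (nearest sibling first)
```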
Everything above is selection by tag name. It is fast, but it cannot cover everything we need when parsing HTML, so BeautifulSoup also provides some other methods.
find_all(name, attrs, recursive, text, **kwargs) searches the document by tag name, attributes, or text content. All the examples below use this test HTML:
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
(1) Searching by tag name, i.e. name
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
This prints all the ul tags:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>
The search can be nested further:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
    # You can go one step further and read an attribute of an li: ul.find_all('li')[0]['class']
(2) Searching by attribute
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))  # class is a Python keyword, so bs4 uses class_
Output:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
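Keyword arguments such as id= and class_= cover the common cases. For attribute names that are not valid Python identifiers, find_all also accepts an attrs dictionary. The sketch below assumes a trimmed copy of the test HTML with a hypothetical data-kind attribute added purely for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the test HTML; data-kind is a hypothetical attribute
# added only for this sketch.
html = '''
<ul class="list" id="list-1" data-kind="main">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
</ul>
<ul class="list list-small" id="list-2">
  <li class="element">Foo</li>
</ul>
'''
soup = BeautifulSoup(html, "html.parser")

# A single class name matches any tag that has it among its classes.
by_class = soup.find_all(class_="list")
# attrs={} handles names such as data-kind that cannot be keyword arguments.
by_attrs = soup.find_all(attrs={"data-kind": "main"})
# A tag name and an attribute filter can be combined.
small = soup.find_all("ul", class_="list-small")

print([ul["id"] for ul in by_class])  # ['list-1', 'list-2']
print([ul["id"] for ul in by_attrs])  # ['list-1']
print([ul["id"] for ul in small])     # ['list-2']
```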
(3) Searching by text content, i.e. text
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))
Output:
['Foo', 'Foo']
What comes back is text, not tags, so this is of limited use for locating elements; it is mostly used for matching content.
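If you do want the tags, two approaches are sketched below: step from each matched string to its parent, or combine a tag name with the text filter. The sketch uses string=, the name that newer bs4 releases use for the text argument, together with a regular expression; the list markup is made up.

```python
import re

from bs4 import BeautifulSoup

# Made-up list for this sketch.
html = '<ul><li class="element">Foo</li><li class="element">Bar</li><li class="element">Food</li></ul>'
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Foo")                 # exact text match
prefixed = soup.find_all(string=re.compile("^Foo")) # regex match on the text

# Option 1: walk from each matched string up to its enclosing tag.
parents = [s.parent.name for s in prefixed]
# Option 2: give a tag name too, so find_all returns the tags themselves.
li_tags = soup.find_all("li", string=re.compile("^Foo"))

print(exact)                            # ['Foo']
print(prefixed)                         # ['Foo', 'Food']
print(parents)                          # ['li', 'li']
print([li.get_text() for li in li_tags])  # ['Foo', 'Food']
```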
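find(name, attrs, recursive, text, **kwargs) works like find_all, except that it returns only a single element (the first match).
find_parents() and find_parent(): find_parents() returns all ancestor nodes; find_parent() returns the direct parent.
find_next_siblings() and find_next_sibling(): find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
find_previous_siblings() and find_previous_sibling(): find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
find_all_next() and find_next(): find_all_next() returns all matching nodes after the current node; find_next() returns the first matching one.
find_all_previous() and find_previous(): find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching one.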
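A short sketch of several of these methods on the test HTML may help; it only illustrates the calls listed above. html.parser is used so that it runs without lxml.

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")   # first match
missing = soup.find("table") # no match: find returns None

parent_id = first_li.find_parent("ul")["id"]
ancestor_classes = [div["class"] for div in first_li.find_parents("div")]
next_sibling_text = first_li.find_next_sibling("li").string
sibling_texts = [li.string for li in first_li.find_next_siblings("li")]
# find_all_next is not limited to siblings: it also reaches the second list.
following_count = len(first_li.find_all_next("li"))

print(missing)            # None
print(parent_id)          # list-1
print(ancestor_classes)   # [['panel-body'], ['panel']]
print(next_sibling_text)  # Bar
print(sibling_texts)      # ['Bar', 'Jay']
print(following_count)    # 4
```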
Passing a CSS selector directly to select() is all it takes to make a selection:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# Select elements with class panel-heading inside elements with class panel; class selectors start with '.'
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))  # tag selector: li tags inside ul tags
print(soup.select('#list-2 .element'))  # '#' selects by id: elements with class element inside the element whose id is list-2
print(type(soup.select('ul')[0]))
Output:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
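A few more selector forms are sketched below: select_one() for a single match, the > child combinator, and an attribute selector. This assumes a reasonably recent bs4, where CSS selection is handled by the bundled soupsieve package; the markup is a trimmed copy of the test HTML.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the test HTML.
html = '''
<div class="panel-body">
  <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
  </ul>
  <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

first_item = soup.select_one("#list-1 li")        # first match, or None
small_items = soup.select("ul.list-small > li")   # direct children only
by_attribute = soup.select('ul[id="list-1"] > li')  # attribute selector
nothing = soup.select_one("table")

print(first_item.get_text())                    # Foo
print([li.get_text() for li in small_items])    # ['Foo', 'Bar']
print(len(by_attribute))                        # 3
print(nothing)                                  # None
```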
select() calls can also be nested, although this is unnecessary, since separating selectors with a space already expresses nesting:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])  # or print(ul.attrs['id'])
Getting content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
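Putting attribute access and text extraction together, one possible way to turn the test HTML into plain data is sketched below. The variable name items_by_list is made up for this example.

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel-body">
  <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
  </ul>
  <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# Map each list's id to the stripped text of its items.
items_by_list = {
    ul["id"]: [li.get_text(strip=True) for li in ul.select("li")]
    for ul in soup.select("ul")
}

print(items_by_list)  # {'list-1': ['Foo', 'Bar', 'Jay'], 'list-2': ['Foo', 'Bar']}
```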
For more on using BeautifulSoup, see the official documentation.