A Detailed Guide to the BeautifulSoup Parsing Library

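BeautifulSoup is a flexible, convenient library for parsing web pages. It is efficient and supports several parsers.

With it you can extract information from a web page without writing regular expressions.

Installation: pip3 install beautifulsoup4

Usage in detail:

Parsers supported by BeautifulSoup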

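Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Less tolerant of malformed documents in Python versions before 2.7.3 or 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires an external C library
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only supported XML parser | Requires an external C library
html5lib | BeautifulSoup(markup, "html5lib") | Most tolerant; parses the document the way a browser does; produces valid HTML5 | Slow; requires installing the html5lib package (pure Python, no C extension)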
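
Since lxml and html5lib are optional installs, it can help to fall back to the built-in parser when they are missing. The following is a minimal sketch under that assumption: make_soup and its preferred argument are names invented for this illustration and are not part of BeautifulSoup, while FeatureNotFound is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: build a soup with the first installed parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next name in the list.
            continue
    raise RuntimeError("none of the requested parsers is installed")


# The markup is deliberately left unclosed; either parser repairs it.
soup = make_soup("<p>Hello<b>world")
print(soup.b.string)
```

Different parsers may repair the same broken markup in different ways, so results on badly malformed pages can vary depending on which parser ends up being used.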

Basic usage:

import bs4
from bs4 import BeautifulSoup

# An incomplete HTML snippet
html = '''
<html><head><title>The Demouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Domouse's story</b></p>
<p class="story">Once upon a time there were three little sisters,and their name were
<a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
<a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
<a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
and they lived the bottom of a wall</p>
<p clas="stuy">..</p>
'''

soup = BeautifulSoup(html,'lxml')

# prettify() prints the repaired tree: the parser fills in the missing closing tags
print(soup.prettify())

# Select the title tag and print its text
print(soup.title.string)
Output:
<html>
 <head>
  <title>
   The Demouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Domouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters,and their name were
   <a class="sister" href="http://examlpe.com/elele" ld="link1">
    <!--Elsle-->
   </a>
   <a class="sister" href="http://examlpe.com/lacie" ld="link2">
    <!--Elsle-->
   </a>
   <a class="sister" href="http://examlpe.com/title" ld="link3">
    <title>
    </title>
   </a>
   and they lived the bottom of a wall
  </p>
  <p clas="stuy">
   ..
  </p>
 </body>
</html>
The Demouse's story

Tag selectors

In the example above, soup.title.string selects the title tag.

Selecting elements:

import bs4
from bs4 import BeautifulSoup

# An incomplete HTML snippet
html = '''
<html><head><title>The Demouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Domouse's story</b></p>
<p class="story">Once upon a time there were three little sisters,and their name were
<a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
<a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
<a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
and they lived the bottom of a wall</p>
<p clas="stuy">..</p>
'''

soup = BeautifulSoup(html,'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
Output:
<title>The Demouse's story</title>
<class 'bs4.element.Tag'>
<head><title>The Demouse's story</title></head>
<p class="title" name="dromouse"><b>The Domouse's story</b></p>
# Only the first matching tag is returned

Getting the tag name:

import bs4
from bs4 import BeautifulSoup

# An incomplete HTML snippet
html = '''
<html><head><title>The Demouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Domouse's story</b></p>
<p class="story">Once upon a time there were three little sisters,and their name were
<a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
<a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
<a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
and they lived the bottom of a wall</p>
<p clas="stuy">..</p>
'''

soup = BeautifulSoup(html,'lxml')
print(soup.title.name)
Output: title

Getting attributes:

import bs4
from bs4 import BeautifulSoup

# An incomplete HTML snippet
html = '''
<html><head><title>The Demouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Domouse's story</b></p>
<p class="story">Once upon a time there were three little sisters,and their name were
<a href="http://examlpe.com/elele" class="sister" ld="link1"><!--Elsle--></a>
<a href="http://examlpe.com/lacie" class="sister" ld="link2"><!--Elsle--></a>
<a href="http://examlpe.com/title" class="sister" ld="link3"><title></a>
and they lived the bottom of a wall</p>
<p clas="stuy">..</p>
'''

soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
# Both forms work for reading an attribute: soup.p.attrs['name'] and soup.p['name']
# Note the square brackets: attributes are looked up like dictionary keys
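
Two details of attribute access are easy to miss, so here is a small sketch of them. The markup is a shortened, made-up variant of the sample above, and html.parser is used only so the snippet runs without extra installs. A multi-valued attribute such as class comes back as a list, and get() returns None for a missing attribute where square brackets would raise KeyError.

```python
from bs4 import BeautifulSoup

html = '<p class="title big" name="dromouse"><b>The story</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

name_value = p["name"]    # plain string: "dromouse"
class_value = p["class"]  # class is multi-valued, so this is a list
missing = p.get("id")     # no id attribute: get() returns None

print(name_value, class_value, missing)
print(p.attrs)            # all attributes as a dictionary

try:
    p["id"]               # square brackets raise KeyError when the attribute is absent
except KeyError:
    print("p has no id attribute")
```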

Getting the content:

As the examples show, use the string attribute, for example soup.title.string, to get a tag's text.
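
The string attribute only yields text when a tag has a single child. As a hedged illustration, with markup invented for this example and html.parser chosen to avoid extra installs, the sketch below contrasts it with get_text(), which concatenates all the text inside a tag.

```python
from bs4 import BeautifulSoup

html = "<div><p>Hello <b>world</b></p><h4>hello</h4></div>"
soup = BeautifulSoup(html, "html.parser")

single = soup.h4.string    # one text child, so the text is returned
nested = soup.b.string     # same here
mixed = soup.p.string      # p has two children (text and <b>), so this is None
full = soup.p.get_text()   # get_text() joins every piece of text inside p

print(single, nested, mixed, full)
```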

Nested selection:

For example: print(soup.head.title.string)

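Children and descendants:

For example, print(soup.p.contents): contents returns all direct children of the p tag as a list.

children can be used as well. Unlike contents, children is an iterator over the direct children, so you need a loop to read its items, for example:

print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

There is also the descendants attribute, which yields every descendant node (children, grandchildren, and so on). It is an iterator too:

print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)

Note: in the child, descendant, parent, and ancestor examples, soup.p refers to the first p tag that matches, so all of these nodes are relative to that first matching tag.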
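
To make the difference between direct children and all descendants concrete, here is a minimal sketch. The one-line markup is made up for the example, and html.parser is used so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

html = "<p>One <b>two <i>three</i></b> four</p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

direct = p.contents               # list of direct children
via_children = list(p.children)   # the same nodes, read from the iterator
all_nodes = list(p.descendants)   # every nested node, in document order

print(len(direct), len(all_nodes))
for i, node in enumerate(all_nodes):
    print(i, repr(node))
```

The p tag has three direct children (a text node, the b tag, and another text node) but six descendants, because the text and the i tag nested inside b are counted as well.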

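Parents and ancestors:

parent attribute: returns the direct parent node

parents attribute: iterates over all ancestor nodes

Siblings:

next_siblings attribute: iterates over the siblings that follow the node

previous_siblings attribute: iterates over the siblings that precede the node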
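
The attributes above can be sketched as follows. The markup and the ids a, b, and c are made up for this example; it is written without whitespace between tags so that the siblings are all tags, not whitespace text nodes.

```python
from bs4 import BeautifulSoup

html = '<div><p id="a">A</p><p id="b">B</p><p id="c">C</p></div>'
soup = BeautifulSoup(html, "html.parser")
b = soup.find(id="b")

parent_name = b.parent.name                    # the direct parent: "div"
ancestors = [node.name for node in b.parents]  # starts at the parent and walks upward
following = [tag["id"] for tag in b.next_siblings]
preceding = [tag["id"] for tag in b.previous_siblings]

print(parent_name, ancestors, following, preceding)
```

With indented real-world HTML, the sibling iterators also yield the whitespace between tags as text nodes, so loops over them usually need to skip anything that is not a tag.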

--------------------------------------------------------------------------------------------------------------------

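Standard selectors

The tag selectors described above are fast, but they are too limited for most HTML parsing tasks.

The find_all method:

find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or text content. In Beautiful Soup 4.4.0 and later the text argument is called string; the examples here use the older name.

Searching by name: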

import bs4
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>hello</h4>
    </div>
    <div class="panel-body">
        <url class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">jay</li>
        </url>
        <url class="list list-small" id="list-2">
            <li lass="element">Foo</li>
            <li lass="element">Bar</li>
        </url>
    </div>
    </div>
'''

soup = BeautifulSoup(html,'lxml')
print(soup.find_all('url'))
Output:
[<url class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</url>, <url class="list list-small" id="list-2">
<li lass="element">Foo</li>
<li lass="element">Bar</li>
</url>]

The result is a list, so you can loop over it and search inside each element, for example:

import bs4
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>hello</h4>
    </div>
    <div class="panel-body">
        <url class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">jay</li>
        </url>
        <url class="list list-small" id="list-2">
            <li lass="element">Foo</li>
            <li lass="element">Bar</li>
        </url>
    </div>
    </div>
'''

soup = BeautifulSoup(html,'lxml')
for url in soup.find_all('url'):
    print(url.find_all('li'))

Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">jay</li>]
[<li lass="element">Foo</li>, <li lass="element">Bar</li>]  

Searching by attrs:

The attrs argument takes a dictionary, for example:

import bs4
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>hello</h4>
    </div>
    <div class="panel-body">
        <url class="list" id="list-1" name='elements'>
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">jay</li>
        </url>
        <url class="list list-small" id="list-2">
            <li lass="element">Foo</li>
            <li lass="element">Bar</li>
        </url>
    </div>
    </div>
'''

soup = BeautifulSoup(html,'lxml')

print(soup.find_all(attrs={'id':'list-1'}))  # soup.find_all(id='list-1') works as well
print(soup.find_all(attrs={'name':'elements'}))
Output:
[<url class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</url>]
[<url class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</url>]

Note: keyword arguments such as soup.find_all(id='list-1') work for most attributes, but for the class attribute you must write class_='value', because class is a reserved keyword in Python.
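
A short sketch of the class_ form, under a couple of stated assumptions: the markup is a trimmed-down variant of the sample above, written with ul tags where the sample uses url, and html.parser is used to avoid extra installs. It also shows that a class search matches any one of a tag's classes.

```python
from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
'''
soup = BeautifulSoup(html, "html.parser")

# class_ matches a tag if any one of its classes equals the value
lists = [tag["id"] for tag in soup.find_all(class_="list")]
small = [tag["id"] for tag in soup.find_all("ul", class_="list-small")]

# The dictionary form avoids the keyword clash without the trailing underscore
same = [tag["id"] for tag in soup.find_all(attrs={"class": "list-small"})]

print(lists, small, same)
```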

Searching by text:

import bs4
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>hello</h4>
    </div>
    <div class="panel-body">
        <url class="list" id="list-1" name='elements'>
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">jay</li>
        </url>
        <url class="list list-small" id="list-2">
            <li lass="element">Foo</li>
            <li lass="element">Bar</li>
        </url>
    </div>
    </div>
'''

soup = BeautifulSoup(html,'lxml')

print(soup.find_all(text='Foo'))
Output:
['Foo', 'Foo'] 
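
Text searches also accept a compiled regular expression, and they can be combined with a tag name to get tags back instead of strings. The sketch below assumes Beautiful Soup 4.4.0 or later, where the argument is named string, and uses made-up markup with html.parser.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<ul><li class="element">Foo</li>'
    '<li class="element">Bar</li>'
    '<li class="element">jay</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Foo")                      # matching text nodes
starts_with_b = soup.find_all(string=re.compile(r"^B"))  # regex match on the text
tags = soup.find_all("li", string="Foo")                 # tags whose text is "Foo"

print(exact, starts_with_b, tags)
```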

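The find method is used exactly like find_all. The difference is that find_all returns every match as a list, while find returns a single element, the first match (or None when nothing matches).

find(name, attrs, recursive, text, **kwargs)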
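
Because find returns None when nothing matches, chained access on the result can fail with AttributeError. A minimal sketch of guarding against that, using made-up markup and html.parser:

```python
from bs4 import BeautifulSoup

html = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")       # the first matching tag
missing = soup.find("table")  # nothing matches, so this is None

# Check for None before using the result
table_text = missing.get_text() if missing is not None else ""

print(first.get_text(), repr(table_text))
```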

find_parents()

find_parent()

find_next_siblings()

find_next_sibling()

find_previous_siblings()

find_previous_sibling()

find_all_next()

find_next()

find_all_previous()

find_previous()

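These methods take the same arguments as find and find_all. They differ in which part of the tree they search: the parent variants search the ancestors, the sibling variants search the siblings that follow or precede the node, and the next and previous variants search everything that comes after or before the node in the document. The plural and _all forms return every match; the singular forms return only the first.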
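
As a rough illustration of how these methods search in different directions from the same starting node, here is a sketch with made-up markup and ids, using html.parser:

```python
from bs4 import BeautifulSoup

html = (
    '<ul><li id="a">A</li><li id="b">B</li><li id="c">C</li></ul>'
    "<p>after</p>"
)
soup = BeautifulSoup(html, "html.parser")
b = soup.find(id="b")

enclosing = b.find_parent("ul").name             # nearest matching ancestor
next_id = b.find_next_sibling("li")["id"]        # first following sibling
prev_ids = [t["id"] for t in b.find_previous_siblings("li")]
later_p = b.find_next("p").get_text()            # first <p> later in the document
later_li = [t["id"] for t in b.find_all_next("li")]

print(enclosing, next_id, prev_ids, later_p, later_li)
```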

CSS selectors

Pass a CSS selector directly to select() to make a selection.

import bs4
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>hello</h4>
    </div>
    <div class="panel-body">
        <url class="list" id="list-1" name='elements'>
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">jay</li>
        </url>
        <url class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </url>
    </div>
    </div>
'''

soup = BeautifulSoup(html,'lxml')

# To select by class, prefix the name with a dot: .panel .panel-heading
print(soup.select('.panel .panel-heading'))
# Select by tag name directly
print(soup.select('url li'))
# To select by id, prefix it with #
print(soup.select('#list-2 .element'))
Output:
[<div class="panel-heading">
<h4>hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
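
A few more selector forms, sketched under stated assumptions: the markup is a shortened variant of the sample, written with ul tags where the sample uses url, and html.parser is used to avoid extra installs. select_one() returns the first match or None, much like find() does.

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel-body">
  <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
  <ul class="list list-small" id="list-2">
    <li class="element">Baz</li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

first_item = soup.select_one("#list-1 .element")   # first match only
direct_items = soup.select("ul.list-small > li")   # child combinator
by_attr = soup.select('ul[id="list-1"] li')        # attribute selector

print(first_item.get_text())
print([li.get_text() for li in direct_items])
print(len(by_attr))
```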

Nested selection, level by level:

import bs4
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>hello</h4>
    </div>
    <div class="panel-body">
        <url class="list" id="list-1" name='elements'>
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">jay</li>
        </url>
        <url class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </url>
    </div>
    </div>
'''

soup = BeautifulSoup(html,'lxml')

for url in soup.select('url'):
    print(url.select('li'))
Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

Getting attributes:

import bs4
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>hello</h4>
    </div>
    <div class="panel-body">
        <url class="list" id="list-1" name='elements'>
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">jay</li>
        </url>
        <url class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </url>
    </div>
    </div>
'''

soup = BeautifulSoup(html,'lxml')

for url in soup.select('url'):
    print(url['id'])
    # print(url.attrs['id']) works as well
Output:
list-1
list-2

Getting the content:

import bs4
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>hello</h4>
    </div>
    <div class="panel-body">
        <url class="list" id="list-1" name='elements'>
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">jay</li>
        </url>
        <url class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </url>
    </div>
    </div>
'''

soup = BeautifulSoup(html,'lxml')

for l in soup.select('li'):
    print(l.get_text())
Output:
Foo
Bar
jay
Foo
Bar

  

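Summary:

Prefer the lxml parser; fall back to html.parser when necessary.

Tag selectors offer weak filtering but are fast.

Use find() to match a single result and find_all() to match multiple results.

If you are comfortable with CSS selectors, use select().

Remember the common ways of getting attributes and text values.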
