python-BeautifulSoup庫詳解

 

快速使用

經過下面的一個例子,對bs4有個簡單的瞭解,以及看一下它的強大之處:html

 

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.prettify()) #補全HTML
print('-------------------------')
print(soup.title)
print('-------------------------')
print(soup.title.name)
print('-------------------------')
print(soup.title.string) #獲得標籤裏面的內容
print('-------------------------')
print(soup.title.parent.name)
print('-------------------------')
print(soup.p)
print('-------------------------')
print(soup.p["class"])
print('-------------------------')
print(soup.a)
print('-------------------------')
print(soup.find_all('a'))
print('-------------------------')
print(soup.find(id='link3'))
print('-------------------------')

 

結果以下:前端

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
-------------------------
<title>The Dormouse's story</title>
-------------------------
title
-------------------------
The Dormouse's story
-------------------------
head
-------------------------
<p class="title"><b>The Dormouse's story</b></p>
-------------------------
['title']
-------------------------
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
-------------------------
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
-------------------------
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
-------------------------

 

基本使用

標籤選擇器html5

在快速使用中咱們添加以下代碼:
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)python

經過這種soup.標籤名 咱們就能夠得到這個標籤的內容
這裏有個問題須要注意,經過這種方式獲取標籤,若是文檔中有多個這樣的標籤,返回的結果是第一個標籤的內容,如上面咱們經過soup.p獲取p標籤,而文檔中有多個p標籤,可是隻返回了第一個p標籤內容spa

獲取名稱code

當咱們經過soup.title.name的時候就能夠得到該title標籤的名稱,即titleorm

獲取屬性xml

print(soup.p.attrs['name'])
print(soup.p['name'])
上面兩種方式均可以獲取p標籤的name屬性值htm

獲取內容對象

print(soup.p.string)
結果就能夠獲取第一個p標籤的內容:
The Dormouse's story

嵌套選擇

咱們直接能夠經過下面嵌套的方式獲取

print(soup.head.title.string)

 

標籤選擇器

選擇元素

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.title)
print('-------------------------')
print(type(soup.title))
print('-------------------------')
print(soup.head)
print('-------------------------')
print(soup.p)

 

結果:

<title>The Dormouse's story</title>
-------------------------
<class 'bs4.element.Tag'>
-------------------------
<head><title>The Dormouse's story</title></head>
-------------------------
<p class="title"><b>The Dormouse's story</b></p>

HTML中有多個 p 標籤,可是最終 值輸出一個,能夠得出當有多個是隻返回第一個結果

 

獲取名稱

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.title.name)

 

結果:

title

 title.name 獲取最外層的標題

 

獲取屬性

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])

 

結果:

dromouse
dromouse

 

獲取內容

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.p.string)

 

結果:

The Dormouse's story

 

嵌套選擇

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.head.title.string)

 

結果:

The Dormouse's story
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html,'lxml')
print(soup.p.b.string)

 

結果:

The Dormouse's story

 

使用BeautifulSoup解析這段代碼,可以獲得一個 BeautifulSoup 的對象,並能按照標準的縮進格式的結構輸出。

同時咱們經過下面代碼能夠分別獲取全部的連接,以及文字內容:

for link in soup.find_all('a'):
    print(link.get('href'))

print(soup.get_text())

 

 

解析器

Beautiful Soup支持Python標準庫中的HTML解析器,還支持一些第三方的解析器,若是咱們不安裝它,則 Python 會使用 Python默認的解析器,lxml 解析器更增強大,速度更快,推薦安裝。

下面是常看法析器:

 

推薦使用lxml做爲解析器,由於效率更高. 在Python2.7.3以前的版本和Python3中3.2.2以前的版本,必須安裝lxml或html5lib, 由於那些Python版本的標準庫中內置的HTML解析方法不夠穩定.

 

子節點和子孫節點

contents的使用
經過下面例子演示:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)

 

 

['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']

結果是將p標籤下的全部子標籤存入到了一個列表中

列表中會存入以下元素

 

children的使用

經過下面的方式也能夠獲取p標籤下的全部子節點內容和經過contents獲取的結果是同樣的,可是不一樣的地方是soup.p.children是一個迭代對象,而不是列表,只能經過循環的方式獲取素有的信息

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
print(soup.p.children)
for child in soup.p.children:
    print(child)

 

結果:

<list_iterator object at 0x000001AD777B49B0>

            Once upon a time there were three little sisters; and their names were
            
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>


<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

            and
            
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

            and they lived at the bottom of a well.
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
print(soup.p.children)
for i,child in enumerate(soup.p.children): #enumerate 枚舉
    print(i,child)

 

結果:

<list_iterator object at 0x000001AD778605F8>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4 
            and
            
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.

經過contents以及children都是獲取子節點,若是想要獲取子孫節點能夠經過descendants
print(soup.descendants)同時這種獲取的結果也是一個迭代器

 

獲取子孫節點

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i,child)

 

結果:

<generator object Tag.descendants at 0x000001AD777B3A98>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9 
            and
            
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
            and they lived at the bottom of a well.

 

父節點和祖先節點

經過soup.a.parent就能夠獲取父節點的信息

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)

 

結果:

<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>

 

經過list(enumerate(soup.a.parents))能夠獲取祖先節點,這個方法返回的結果是一個列表,會分別將a標籤的父節點的信息存放到列表中,以及父節點的父節點也放到列表中,而且最後還會講整個文檔放到列表中,全部列表的最後一個元素以及倒數第二個元素都是存的整個文檔的信息

祖先節點:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.parents)))

 

 

結果:

[(0, <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>), (1, <body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body>), (2, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body></html>), (3, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body></html>)]

兄弟節點

soup.a.next_siblings 獲取後面的兄弟節點
soup.a.previous_siblings 獲取前面的兄弟節點
soup.a.next_sibling 獲取下一個兄弟標籤
souo.a.previous_sinbling 獲取上一個兄弟標籤

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))

 

結果:

[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), 
(2, '\n and\n '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>),
(4, '\n and they lived at the bottom of a well.\n ')] [(0, '\n Once upon a time there were three little sisters; and their names were\n ')]

標準選擇器

find_all

find_all(name,attrs,recursive,text,**kwargs)
能夠根據標籤名,屬性,內容查找文檔

name的用法

 

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print('--------------------')
print(type(soup.find_all('ul')[0]))

 

 結果返回的是一個列表的方式

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
-----------------------
<class 'bs4.element.Tag'>

同時咱們是能夠針對結果再次find_all,從而獲取全部的li標籤信息,層層嵌套的方法

 

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

 

 結果:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

 

attrs

例子以下:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

 結果:

 

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

 

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))

 

結果:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>,
<li class="element">Foo</li>, <li class="element">Bar</li>]

attrs能夠傳入字典的方式來查找標籤,可是這裏有個特殊的就是class,由於class在python中是特殊的字段,因此若是想要查找class相關的能夠更改attrs={'class_':'element'}或者soup.find_all('',{"class":"element}),特殊的標籤屬性能夠不寫attrs,例如id

text

例子以下:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))

 結果返回的是查到的全部的text='Foo'的文本

['Foo', 'Foo']

find

find(name,attrs,recursive,text,**kwargs)
find返回的匹配結果的第一個元素

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))

 

結果:

<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<class 'bs4.element.Tag'>
None

其餘一些相似的用法:


find_parents()返回全部祖先節點,find_parent()返回直接父節點。
find_next_siblings()返回後面全部兄弟節點,find_next_sibling()返回後面第一個兄弟節點。
find_previous_siblings()返回前面全部兄弟節點,find_previous_sibling()返回前面第一個兄弟節點。
find_all_next()返回節點後全部符合條件的節點, find_next()返回第一個符合條件的節點
find_all_previous()返回節點後全部符合條件的節點, find_previous()返回第一個符合條件的節點

CSS選擇器

經過select()直接傳入CSS選擇器就能夠完成選擇
熟悉前端的人對CSS可能更加了解,其實用法也是同樣的

標籤1,標籤2 找到全部的標籤1和標籤2
標籤1 標籤2 找到標籤1內部的全部的標籤2
[attr] 能夠經過這種方法找到具備某個屬性的全部標籤
[atrr=value] 例子[target=_blank]表示查找全部target=_blank的標籤

 

前面加 .表示class #表示id ,選擇標籤不須要加

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

 結果:

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, 
<li class="element">Foo</li>, <li class="element">Bar</li>] [<li class="element">Foo</li>, <li class="element">Bar</li>] <class 'bs4.element.Tag'>

 

獲取內容

經過get_text()就能夠獲取文本內容

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())

 結果:

Foo
Bar
Jay
Foo
Bar

 

 獲取屬性
或者屬性的時候能夠經過[屬性名]或者attrs[屬性名]

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul)
    print('---------------')
    print(ul['id'])
    print('---------------')
    print(ul.attrs['id'])
    print('***************')

 

結果:

<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
---------------
list-1
---------------
list-1
***************
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
---------------
list-2
---------------
list-2
***************

 

總結

推薦使用lxml解析庫,必要時使用html.parser
標籤選擇篩選功能弱可是速度快
建議使用find()、find_all() 查詢匹配單個結果或者多個結果
若是對CSS選擇器熟悉建議使用select()
記住經常使用的獲取屬性和文本值的方法

 

 原文:http://www.javashuo.com/article/p-cuegazqw-p.html

相關文章
相關標籤/搜索