Python Crawler from Scratch (4): The BeautifulSoup Library

 What is BeautifulSoup?

  BeautifulSoup is a web-page parsing library. It is more flexible and convenient than picking data out of raw urllib or Requests responses by hand, it parses efficiently, and it supports multiple parsers.

With it you can extract information from web pages easily, without writing regular expressions.

Installing BeautifulSoup: simply run pip3 install beautifulsoup4. The 4 refers to the current major version, Beautiful Soup 4.
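
If you want to confirm the installation worked, a quick sketch like this prints the installed version (assuming the package was installed into the interpreter you run it with):

# sanity check: import the package and print its version
import bs4
print(bs4.__version__)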

How to use BeautifulSoup:

  Parsers:

  • Python standard library, BeautifulSoup(markup, "html.parser"): built into Python, moderate speed, good tolerance of malformed documents. Weakness: poor tolerance in Python versions before 2.7.3 and 3.2.2.
  • lxml HTML parser, BeautifulSoup(markup, "lxml"): very fast, good tolerance of malformed documents. Weakness: requires the lxml C extension to be installed.
  • lxml XML parser, BeautifulSoup(markup, "xml"): very fast, and the only parser that supports XML. Weakness: requires the lxml C extension to be installed.
  • html5lib, BeautifulSoup(markup, "html5lib"): the best tolerance of malformed documents, parses pages the way a browser does, and produces valid HTML5. Weakness: slow, and it depends on an external Python package.
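
Switching parsers only changes the second argument passed to BeautifulSoup; here is a minimal sketch (html.parser needs nothing extra, while the lxml and html5lib lines assume those packages are installed):

from bs4 import BeautifulSoup

markup = "<p class='title'><b>The Dormouse's story</b>"

# the second argument picks the parser; each repairs the broken markup slightly differently
print(BeautifulSoup(markup, "html.parser").prettify())
print(BeautifulSoup(markup, "lxml").prettify())
print(BeautifulSoup(markup, "html5lib").prettify())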

  Basic usage:

 

html = """
<html><head><title>The Dormouse's story</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href ="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href ="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href ="http://example.com/title" class="sister" id="link3">Title</a>; and they lived at the boottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())
print(soup.title.string)

 

As you can see, the html variable passed in here is a fragment of HTML, and the fragment is incomplete: several closing tags are missing. Let's look at the output below.

Looking at the result, BeautifulSoup automatically completed the markup for us and pretty-printed the HTML.
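
A small sketch of what that auto-completion means in practice, reusing the soup object built above with the lxml parser:

# the parser repairs the incomplete markup: the prettified output now
# ends with the closing </body> and </html> tags that the snippet lacked
fixed = soup.prettify()
print(fixed.rstrip().endswith("</html>"))   # True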

  Tag selectors:

  Selecting elements

html = """
<html><head><title>The Dormouse's story</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and thier names were
<a href ="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href ="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href ="http://example.com/title" class="sister" id="link3">Title</a>; and they lived at the boottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)

 

Let's look at the output first.

As you can see, .title extracts the entire <title> tag, and .head works the same way. There are several <p> tags, however, and this shortcut only returns the first one.
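
In other words, the shortcut attribute is just a convenient form of find(); a one-line sketch, reusing the soup from above:

# soup.p only ever returns the first <p>; it is equivalent to soup.find('p')
print(soup.p == soup.find('p'))   # True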

  Getting the tag name:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

html = """
<html><head><title>The Dormouse's story</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and thier names were
<a href ="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href ="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href ="http://example.com/title" class="sister" id="link3">Title</a>; and they lived at the boottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title.name)

 

 The output is title. The .name attribute returns the name of the tag itself (not the value of its name attribute).
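
A tiny sketch of that difference, reusing the soup from above:

print(soup.p.name)         # 'p' -- the name of the tag itself
print(soup.p.get('name'))  # 'dromouse' -- the value of its name attribute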

  Getting attributes:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

html = """
<html><head><title>The Dormouse's story</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and thier names were
<a href ="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href ="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href ="http://example.com/title" class="sister" id="link3">Title</a>; and they lived at the boottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])

 

 If you run this you will see that both lines print dromouse; either form retrieves the value of the name attribute, but again only from the first matching tag.
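
One detail worth knowing: indexing with ['...'] raises KeyError when the attribute is missing, while .get() quietly returns None. A short sketch, reusing the soup from above:

print(soup.p.get('name'))      # 'dromouse'
print(soup.p.get('missing'))   # None -- no exception for an absent attribute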

  Getting the text content:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

html = """
<html><head><title>The Dormouse's story</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and thier names were
<a href ="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href ="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href ="http://example.com/title" class="sister" id="link3">Title</a>; and they lived at the boottom of a well.</p>
<p class="story" name="pname">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.string)

 

 Accessing a tag's .string attribute returns the text inside it. Here the output is the text of the first <p> tag: The Dormouse's story.
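
.string only works cleanly when a tag has a single piece of text inside it; for mixed content it returns None, and get_text() is the better choice. A sketch, reusing the soup from above:

story = soup.find('p', class_='story')   # the second <p>, which mixes text and <a> tags
print(story.string)                      # None -- more than one child carries text
print(story.get_text())                  # the whole sentence, links' text included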

  Nested selection:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

html = """
<html><head><title>The Dormouse's story</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and thier names were
<a href ="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href ="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href ="http://example.com/title" class="sister" id="link3">Title</a>; and they lived at the boottom of a well.</p>
<p class="story" name="pname">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.head.title.string)

The output is again The Dormouse's story.

We can keep chaining child tags like this to drill down and select a tag's content.
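
The chain can be as deep as the markup allows; a one-line sketch reusing the soup from above:

print(soup.body.p.b.string)   # body -> first <p> -> <b> -> its text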

  Getting child nodes and descendant nodes:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

html = """
<html><head><title>The Dormouse's story</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and thier names were
<a href ="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href ="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href ="http://example.com/title" class="sister" id="link3">Title</a>; and they lived at the boottom of a well.</p>
<p class="story" name="pname">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.body.contents)

The result is a list, and the newline characters between tags take up positions in it too. Let's look at the output.
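
If you only care about the tag children, a small sketch (reusing the soup from above) filters out the bare newline strings:

from bs4 import Tag

# keep only the Tag objects, dropping the '\n' strings between them
tag_children = [child for child in soup.body.contents if isinstance(child, Tag)]
print(tag_children)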

 Another way to get the child nodes is the .children attribute:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

html = """
<html><head><title>The Dormouse's story</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and thier names were
<a href ="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href ="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href ="http://example.com/title" class="sister" id="link3">Title</a>; and they lived at the boottom of a well.</p>
<p class="story" name="pname">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(type(soup.body.children))
for i,child in enumerate(soup.body.children):
    print(i,child)

 

The output is shown below.

.children returns an iterator rather than a list, so we loop over it, here using enumerate to number each child.
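
Because .children is an iterator, it cannot be indexed directly; wrap it in list() if you need that. A sketch reusing the soup from above:

children = list(soup.body.children)   # materialise the iterator
print(len(children))
print(children[1])                    # now indexing works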

The .descendants attribute returns all descendant nodes (children, grandchildren, and so on):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

html = """
<html><head><title>The Dormouse's story</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and thier names were
<a href ="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href ="http://example.com/lacle" class="sister" id="link2">Lacie</a> and
<a href ="http://example.com/title" class="sister" id="link3">Title</a>; and they lived at the boottom of a well.</p>
<p class="story" name="pname">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(type(soup.body.descendants))
for i,child in enumerate(soup.body.descendants):
    print(i,child)

 Let's look at the output: every nested tag, and the strings inside it, gets printed, because .descendants walks the entire subtree.
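
A quick way to see the difference, reusing the soup from above: .children yields only the direct children, while .descendants yields many more nodes.

print(len(list(soup.body.children)))      # direct children only
print(len(list(soup.body.descendants)))   # every nested tag and string as well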

Getting the parent node and ancestor nodes:

 

#!/usr/bin/env python
# -*- coding: utf-8 -*-

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story" name="pname">Once upo a time were three little sister;and theru name were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
            and 
            <a href="http://example.com/elsie" class="sister" id="link3">Title</a>
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(type(soup.a.parent))
print(soup.a.parent)

Let's look at the output: .parent returns the direct parent of the tag, which here is the <p> that contains the first <a>, and its type is Tag.

Getting the ancestor nodes:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story" name="pname">Once upo a time were three little sister;and theru name were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
            and 
            <a href="http://example.com/elsie" class="sister" id="link3">Title</a>
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup as bs4
soup = bs4(html,'lxml')
print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))

Let's look at the result: .parents is a generator that yields every ancestor in turn, from the immediate parent all the way up to the document object itself.
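
A compact sketch, reusing the soup from above, that prints just the name of each ancestor:

for parent in soup.a.parents:
    print(parent.name)   # p, body, html, [document]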

Sibling nodes:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story" name="pname">Once upo a time were three little sister;and theru name were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
            and 
            <a href="http://example.com/elsie" class="sister" id="link3">Title</a>
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup as bs4
soup = bs4(html,'lxml')
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))

Let's look at the result.

.next_siblings returns the siblings that come after the tag and .previous_siblings returns the ones that come before it; both are generators, and the bare newline strings between tags show up in them as well.
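
There are also singular forms, .next_sibling and .previous_sibling, which return a single node; that node is often a bare string rather than a tag, so find_next_sibling() with a filter is usually more useful. A sketch reusing the soup from above:

print(repr(soup.a.next_sibling))        # usually a whitespace string, not a tag
print(soup.a.find_next_sibling('a'))    # the next <a> sibling, skipping the strings
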
That covers the simplest way of extracting content: tag selectors. They are fast, but far too limited for most real parsing needs, so next let's look at the standard selectors.
  Standard selectors:
find_all(name, attrs, recursive, text, **kwargs) searches the document by tag name, attributes, or text content.
Let's go through the concrete usage.
Searching by name:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story" name="pname">Once upo a time were three little sister;and theru name were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
            and 
            <a href="http://example.com/elsie" class="sister" id="link3">Title</a>
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup as bs4
soup = bs4(html,'lxml')
print(soup.find_all('p'))
print(soup.find_all('p')[0])

Let's look at the result.

find_all returns a list of tags, and indexing it gives us each individual tag. Lookups can also be nested, as in the sketch below.
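
A nested lookup just means calling find_all() again on each result; a short sketch reusing the soup from above:

for p in soup.find_all('p'):
    print(p.find_all('a'))   # the <a> tags inside each <p>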

Searching with attrs:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story" name="pname">Once upo a time were three little sister;and theru name were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
            and 
            <a href="http://example.com/elsie" class="sister" id="link3">Title</a>
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup as bs4
soup = bs4(html,'lxml')
print(soup.find_all(attrs={'id':'link3'}))
print(soup.find_all(attrs={'class':'sister'}))
for i in soup.find_all(attrs={'class':'sister'}):
    print(i)

 

The output lists the tags whose attributes match the dictionary we passed in.

We can also write the same queries with keyword arguments:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story" name="pname">Once upo a time were three little sister;and theru name were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
            and 
            <a href="http://example.com/elsie" class="sister" id="link3">Title</a>
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup as bs4
soup = bs4(html,'lxml')
print(soup.find_all(id='link3'))
print(soup.find_all(class_='sister'))
for i in soup.find_all(class_='sister'):
    print(i)

For common attributes such as id and class we can pass them directly as keyword arguments; note the trailing underscore in class_, which is needed because class is a reserved word in Python. Together with attrs, this makes attribute-based lookups very convenient.
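
Keyword filters can also be combined with a tag name and with each other; a sketch reusing the soup from above:

# tag name plus two attribute filters in one call
print(soup.find_all('a', class_='sister', id='link2'))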

Searching by text:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story" name="pname">Once upo a time were three little sister;and theru name were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
            and 
            <a href="http://example.com/elsie" class="sister" id="link3">Title</a>
            <a href="http://example.com/elsie" class="body" id="link4">Title</a>
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup as bs4
soup = bs4(html,'lxml')
print(soup.find_all(text='Title'))

The result is a list of the matching strings rather than tags: ['Title', 'Title']. (In newer versions of BeautifulSoup the same argument can also be written as string=.)

find(name, attrs, recursive, text, **kwargs) also searches the document by tag name, attributes, or text content and takes exactly the same arguments as find_all; the difference is that find returns a single tag (the first match), while find_all returns all matching tags.
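
A quick contrast sketch, reusing the soup from above: find() hands back one tag or None, while find_all() always hands back a list, possibly empty.

print(soup.find('a')['id'])       # 'link1' -- the first <a>
print(len(soup.find_all('a')))    # 4
print(soup.find('table'))         # None -- no match
print(soup.find_all('table'))     # []   -- no match
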
There are quite a few related methods as well:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story" name="pname">Once upo a time were three little sister;and theru name were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
            and 
            <a href="http://example.com/elsie" class="sister" id="link3">Title</a>
            <a href="http://example.com/elsie" class="body" id="link4">Title</a>
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup as bs4
soup = bs4(html,'lxml')
# (the searches below run on the first <a> tag, soup.a, so that the
#  parent/sibling variants have something meaningful to look through)
# find(): like find_all(), but returns only the first matching tag
print(soup.find(class_='sister'))
# find_parents() and find_parent(): search a tag's ancestors
print(soup.a.find_parents('p'))  # all matching ancestor nodes
print(soup.a.find_parent('p'))   # the nearest matching ancestor
# find_next_siblings() and find_next_sibling(): siblings after the tag
print(soup.a.find_next_siblings(class_='sister'))  # all later matching siblings
print(soup.a.find_next_sibling(class_='sister'))   # the first later matching sibling
# find_previous_siblings() and find_previous_sibling(): siblings before the tag
print(soup.a.find_previous_siblings(class_='sister'))  # [] here -- nothing matches before the first <a>
print(soup.a.find_previous_sibling(class_='sister'))   # None, for the same reason
# find_all_next() and find_next(): everything after the tag in the document
print(soup.a.find_all_next(class_='sister'))  # all later matching nodes
print(soup.a.find_next(class_='sister'))      # the first later matching node
# find_all_previous() and find_previous(): everything before the tag in the document
print(soup.a.find_all_previous('p'))  # all earlier matching nodes
print(soup.a.find_previous('p'))      # the first earlier matching node

I won't go through each of them one by one here.

CSS selectors: pass a CSS selector string directly to select() to make a selection.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story" name="pname">Once upo a time were three little sister;and theru name were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
            and 
            <a href="http://example.com/elsie" class="sister" id="link3">Title</a>
            <a href="http://example.com/elsie" class="body" id="link4">Title</a>
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup as bs4
soup = bs4(html,'lxml')
print(soup.select('.body'))
print(soup.select('a'))
print(soup.select('#link3'))

The select() method works much like jQuery selectors: put a "." in front of a class name, a "#" in front of an id, and a bare name selects by tag. Let's look at the result.
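
select() also understands compound selectors, and select_one() returns just the first match; a sketch reusing the soup from above (the exact CSS features available depend on your BeautifulSoup version):

print(soup.select('p a#link1'))    # descendant selector plus an id
print(soup.select('a[href]'))      # attribute selector
print(soup.select_one('.sister'))  # only the first element with class sister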

Getting the text:

Call get_text() on a tag (including tags returned by select()) to get the text inside it.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story" name="pname">Once upo a time were three little sister;and theru name were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/elsie" class="sister" id="link2">Lacie</a>
            and 
            <a href="http://example.com/elsie" class="sister" id="link3">Title</a>
            <a href="http://example.com/elsie" class="body" id="link4">Title</a>
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup as bs4
soup = bs4(html,'lxml')
for i in soup.select('.sister'):
    print(i.get_text())

Running this prints the text inside each of the links that have the class sister.
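
The tags returned by select() are ordinary Tag objects, so their attributes can be read the same way as before; a final sketch reusing the soup from above:

for a in soup.select('a'):
    print(a.get('href'), a.get_text())   # attribute value plus the text inside the link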

 

Summary:

  • Use the lxml parser by default, and fall back to html.parser when necessary
  • Tag selectors offer only weak filtering, but they are fast
  • Use find() and find_all() to match a single result or multiple results
  • If you are comfortable with CSS selectors, use select()
  • Remember the common ways of getting attribute values and text

 

Code: https://gitee.com/dwyui/BeautifulSoup.git

In the next post I'll cover pyQuery. Stay tuned.

Thanks for reading. If anything here is incorrect, corrections are very welcome. 🙏