Parsing libraries for web scraping: BeautifulSoup

Date: 2019-11-07

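A parsing library lets us define rules for pulling the content we want out of a page while scraping. Common choices include regular expressions through the re module, beautifulsoup, and pyquery. Regular expressions can match anything we want to scrape, but they are tedious to write, so here we use beautifulsoup.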

beautifulsoup

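Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save hours or even days of work. Beautiful Soup 3 is no longer being developed, and the official site recommends Beautiful Soup 4 for current projects.

Installation: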

pip install beautifulsoup4

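Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. One of them is lxml, which we recommend for everyday use. Another option is html5lib, a pure-Python parser that parses pages the same way a browser does.

pip install lxml
pip install html5lib

The table below lists the main parsers with their strengths and weaknesses. The official documentation recommends lxml because it is faster. On Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2, installing lxml or html5lib is essential, because the HTML parser built into those versions of the standard library is not robust enough.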

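Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; decent speed; lenient with malformed markup	Not very lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Very fast; lenient with malformed markup	Requires an external C library
lxml XML parser	`BeautifulSoup(markup, ["lxml", "xml"])` `BeautifulSoup(markup, "xml")`	Very fast; the only supported XML parser	Requires an external C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Most lenient; parses pages the way a browser does; produces valid HTML5	Very slow; requires an external Python package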

Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
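Since the parsers in the table may or may not be installed on a given machine, one way to stay portable is to try them in order of preference. The sketch below is only an illustration: make_soup is a hypothetical helper, not part of Beautiful Soup, and it relies on BeautifulSoup raising FeatureNotFound when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for parser in parsers:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>unclosed paragraph")
print(used)               # name of the parser that was actually used
print(soup.p.get_text())  # the paragraph text is recovered by any of the three
```

html.parser ships with Python, so the loop should always find something to fall back on.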

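Basic usage

Error tolerance: Beautiful Soup can still parse HTML that is incomplete. When the markup contains tags that were never closed, the missing closing tags are filled in, and the result is a BeautifulSoup object that can be printed with standard indentation.

An example: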

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup=BeautifulSoup(html_doc,'lxml') # tolerant of broken markup; the second argument names the parser, here lxml
res=soup.prettify() # indent the output to show the structure
print(res)
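To see the repair more directly, here is a minimal sketch with a deliberately broken fragment. It assumes the built-in html.parser rather than lxml so that it runs without extra packages; other parsers may wrap the result differently (for example by adding html and body tags).

```python
from bs4 import BeautifulSoup

broken = "<p>Hello<b>world"  # neither tag is closed
soup = BeautifulSoup(broken, "html.parser")

repaired = str(soup)
print(repaired)         # the closing </b> and </p> tags have been added
print(soup.prettify())  # the same tree, one node per line with indentation
```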


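Navigating the tree

Navigating the tree means selecting directly by tag name. It is fast, but when several tags share the same name only the first one is returned.

1. Usage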

from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')

print(soup.p) # when several tags share a name, only the first is returned

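2. Get the tag name ====> soup.p.name
3. Get the tag attributes ====> soup.p.attrs
4. Get the tag content ====> soup.p.string # returns the text when p has exactly one child, otherwise None
5. Nested selection ====> soup.body.a.string
6. Children and descendants ====> soup.p.contents soup.p.descendants
7. Parent and ancestors ====> soup.a.parent soup.a.parents

8. Siblings ====>

soup.a.next_sibling # the next sibling
soup.a.previous_sibling # the previous sibling
list(soup.a.next_siblings) # all following siblings; the attribute is a generator
list(soup.a.previous_siblings) # all preceding siblings; the attribute is a generator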
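The behaviour of .string in item 4 is easy to trip over, so here is a small sketch that contrasts it with .stripped_strings and get_text(). The markup and the ids one and two are made up for the illustration, and html.parser is used only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

html = "<div><p id='one'>only text</p><p id='two'>first <b>second</b> third</p></div>"
soup = BeautifulSoup(html, "html.parser")

one = soup.find(id="one")  # a tag with a single text child
two = soup.find(id="two")  # a tag with three children: text, <b>, text

print(one.string)                  # only text
print(two.string)                  # None, because there is more than one child
print(list(two.stripped_strings))  # ['first', 'second', 'third']
print(two.get_text())              # first second third
```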

A detailed example:

# Navigating the tree: select directly by tag name. Fast, but only the first of several same-named tags is returned.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my p" class="title"><b id="bbb" class="boldest">The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# 1. Usage
from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')
# soup=BeautifulSoup(open('a.html'),'lxml') # parse an HTML file instead

print(soup.p) #<p class="title" id="my p"><b class="boldest" id="bbb">The Dormouse's story</b></p>
# only the first tag is returned even when several share the name


# 2. Get the tag name
print(soup.p.name)#p

# 3. Get the tag attributes
print(soup.p.attrs)#{'id': 'my p', 'class': ['title']}

# 4. Get the tag content
print(soup.p.string) #The Dormouse's story     returned when p has exactly one child, otherwise None
print(soup.p.strings) # a generator over every string inside p
print(soup.p.text) # all the text inside p
for line in soup.stripped_strings: # every string in the document, with surrounding whitespace stripped
    print(line)
"""
    The Dormouse's story
    The Dormouse's story
    Once upon a time there were three little sisters; and their names were
    Elsie
    ,
    Lacie
    and
    Tillie
    ;
    and they lived at the bottom of a well.
    ...
"""


'''
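If a tag contains more than one child node, .string cannot tell which child it should refer to, so its value is None. If the tag has exactly one child, .string returns that child's text. With the structure below, soup.p.string is None, but soup.p.strings still yields all of the text.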
<p id='list-1'>
    哈哈哈哈
    <a class='sss'>
        <span>
            <h1>aaaa</h1>
        </span>
    </a>
    <b>bbbbb</b>
</p>
'''

# 5. Nested selection
print(soup.head.title.string)#The Dormouse's story
print(soup.body.a.string)#Elsie


# 6. Children and descendants
print(soup.p.contents) # [<b class="boldest" id="bbb">The Dormouse's story</b>]    a list of p's direct children
print(soup.p.children) # an iterator over p's direct children

for i,child in enumerate(soup.p.children):
    print(i,child)
"""
    0 <b class="boldest" id="bbb">The Dormouse's story</b>
"""

print(soup.p.descendants) # a generator over every descendant of p, text nodes included
for i,child in enumerate(soup.p.descendants):
    print(i,child)
"""
    0 <b class="boldest" id="bbb">The Dormouse's story</b>
    1 The Dormouse's story
"""

# 7. Parent and ancestors
print(soup.a.parent) # the parent of the first a tag
"""
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
"""
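print(soup.a.parents) # a generator over all ancestors of the a tag: its parent, the parent's parent, and so on


# 8. Siblings

print(soup.a.next_sibling) # the next sibling
print(soup.a.previous_sibling) # the previous sibling

print(list(soup.a.next_siblings)) # all following siblings; the attribute is a generator, so wrap it in list()
print(list(soup.a.previous_siblings)) # all preceding siblings, also a generator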
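One thing the sibling attributes above do not make obvious is that the next sibling of a tag is often a text node such as a comma or a newline, not the next tag. The sketch below uses a shortened version of the sisters markup with html.parser (both are choices made for this illustration) and shows find_next_sibling and find_next_siblings as a way to skip the text nodes.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a> and\n<a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate sibling is the text between the two links, not a tag.
print(repr(first.next_sibling))  # ',\n'

# find_next_sibling skips ahead to the next sibling that is a matching tag.
next_link = first.find_next_sibling("a")
print(next_link["id"])  # link2

# find_next_siblings collects every following sibling that matches.
later_ids = [tag["id"] for tag in first.find_next_siblings("a")]
print(later_ids)  # ['link2', 'link3']
```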


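Searching the tree

The main tools for searching the tree are filters, find/find_all, and CSS selectors. Pay attention to the difference between find and find_all.

Filters are fairly limited in what they can narrow down, but they are fast
find and find_all are the methods used most often
Front-end developers who are fluent in CSS may prefer CSS selectors

The five kinds of filters

A filter is a way of describing the content we want the scraper to pick out. There are five kinds: strings, regular expressions, lists, True, and custom functions. Each is described below.

Suppose we got the following HTML from a website: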

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my p" class="title"><b id="bbb" class="boldest">The Dormouse's story</b>
</p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

1. String filter

A string filter matches on the tag name.

from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')
# find_all returns every match as a list; it is covered in detail later

print(soup.find_all('b'))#[<b class="boldest" id="bbb">The Dormouse's story</b>]

print(soup.find_all('a'))
"""
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
"""

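2. Regular expressions

A compiled regular expression can be used wherever a filter is accepted; just import the re module.

from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')

import re
print(soup.find_all(re.compile('^b'))) # tags whose name starts with b: body and b. Each result is the whole tag, contents included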

"""
[<body>
<p class="title" id="my p"><b class="boldest" id="bbb">The Dormouse's story</b>
</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>, 
<b class="boldest" id="bbb">The Dormouse's story</b>]
"""

3. List filter

A list filter passes a list of strings instead of a single string. Every tag that matches any string in the list is returned.

from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')

print(soup.find_all(['a','b'])) # every a tag and b tag in the document
"""
[<b class="boldest" id="bbb">The Dormouse's story</b>,
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
"""

4. True filter

True is a very broad filter: it matches anything that is present at all, whatever its value.

from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')

print(soup.find_all(name=True)) # any tag at all
print(soup.find_all(attrs={"id":True})) # every tag that has an id attribute
print(soup.find_all(name='p',attrs={"id":True})) # every p tag that has an id attribute

# print the name of every tag
for tag in soup.find_all(True):
    print(tag.name)

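5. Custom function

When none of the other filters fits, we can write a function and use it as the filter. The function must take exactly one argument, the tag, and return True for tags that should match.

Custom functions are not needed often, but they are worth knowing for the special cases where nothing else works.

from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')
# custom function: match every p tag that has a class attribute but no id attribute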
def has_class_but_no_id(tag):
    res = (tag.name == 'p' and tag.has_attr("class") and not tag.has_attr('id'))
    return res

print(soup.find_all(has_class_but_no_id))
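To compare the five filters side by side, the sketch below runs each one against the same small document and prints only the matching tag names. The markup, the names helper, and the class_without_id function are invented for this illustration, and html.parser is assumed so that nothing beyond bs4 is required.

```python
import re
from bs4 import BeautifulSoup

html = """
<html><head><title>Demo</title></head>
<body>
<p id="intro" class="title"><b>Bold</b></p>
<p class="story"><a href="/one" id="link1" class="sister">One</a>
<a href="/two" id="link2" class="sister">Two</a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")


def names(tags):
    """Hypothetical helper: reduce a result list to tag names."""
    return [tag.name for tag in tags]


def class_without_id(tag):
    """Custom filter: tags that have a class attribute but no id."""
    return tag.has_attr("class") and not tag.has_attr("id")


by_string = names(soup.find_all("b"))                  # string filter
by_regex = names(soup.find_all(re.compile("^b")))      # regular expression
by_list = names(soup.find_all(["a", "b"]))             # list filter
by_true = names(soup.find_all(True))                   # True filter
by_function = names(soup.find_all(class_without_id))   # custom function

print(by_string)    # ['b']
print(by_regex)     # ['body', 'b']
print(by_list)      # ['b', 'a', 'a']
print(by_true)      # ['html', 'head', 'title', 'body', 'p', 'b', 'p', 'a', 'a']
print(by_function)  # ['p']
```

Results always come back in document order, which is why the list filter returns b before the two a tags.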

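find and find_all

find() and find_all() take the same arguments; they differ in how far they search and in what they return.

===> find() returns the first element in the document that matches. It returns None when nothing matches.

===> find_all() returns every matching element as a list. It returns an empty list when nothing matches.

find( name , attrs={} , recursive=True , text=None , **kwargs )
find_all( name , attrs={} , recursive=True , text=None , limit=None , **kwargs )
# find_all takes one more parameter than find: limit, covered below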
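The different return values matter in practice: calling a method on the None that find() returns raises AttributeError, while looping over the empty list from find_all() is harmless. A minimal sketch, in which first_href is a hypothetical helper and html.parser is assumed:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><a href='/one'>One</a></div>", "html.parser")

missing = soup.find("table")          # nothing matches
all_missing = soup.find_all("table")  # nothing matches either

print(missing)      # None
print(all_missing)  # []


def first_href(document, default=None):
    """Hypothetical helper: href of the first link, or a default."""
    link = document.find("a")
    if link is None:  # guard before touching the result of find()
        return default
    return link.get("href", default)


print(first_href(soup))  # /one
```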

The parameters are described one by one below.

1. The name parameter

name is the tag name. Its value can be any of the five filters described above.

from bs4 import BeautifulSoup
import re
soup=BeautifulSoup(html_doc,'lxml')
# name accepts any kind of filter: a string, a regular expression, a list, a function, or True
print(soup.find_all(name=re.compile('^t')))#[<title>The Dormouse's story</title>]
print(soup.find(name=re.compile('^t')))#<title>The Dormouse's story</title>

2. The attrs parameter

attrs filters on tag attributes. Its values can also be any kind of filter.

from bs4 import BeautifulSoup

soup=BeautifulSoup(html_doc,'lxml')
print(soup.find_all('p',attrs={'class':'story'})) # a list of every p tag whose class includes story
print(soup.find('p',attrs={'class':'story'})) # the first p tag that matches

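3. The recursive parameter

recursive defaults to True, which means a search from a tag examines all of its descendants. To search only the direct children and skip deeper levels, set it to False.

from bs4 import BeautifulSoup

soup=BeautifulSoup(html_doc,'lxml')

print(soup.html.find_all('a')) # a list of all three a tags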
print(soup.html.find('a'))#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.html.find_all('a',recursive=False))#[]
print(soup.html.find('a',recursive=False))#None
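The example above returns nothing with recursive=False because html has no a tags as direct children. A case where both calls find something may make the difference clearer. The markup below is invented for the illustration and parsed with html.parser.

```python
from bs4 import BeautifulSoup

html = '<div id="root"><p>direct child</p><section><p>nested deeper</p></section></div>'
soup = BeautifulSoup(html, "html.parser")
root = soup.find(id="root")

all_p = root.find_all("p")                      # searches every descendant
direct_p = root.find_all("p", recursive=False)  # searches direct children only

print([p.get_text() for p in all_p])     # ['direct child', 'nested deeper']
print([p.get_text() for p in direct_p])  # ['direct child']
```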

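4. The text parameter

text searches by text content. It is rarely used on its own; it is usually combined with name or attrs to narrow the search further. Beautiful Soup 4.4.0 and later also accept this argument under the name string.

from bs4 import BeautifulSoup

soup=BeautifulSoup(html_doc,'lxml')
# find the text Elsie; on its own this is not very useful, so it is usually combined with the two parameters above
print(soup.find_all(text='Elsie'))#['Elsie']
print(soup.find(text='Elsie'))#'Elsie'
# find the a tag whose text is Elsie
print(soup.find_all('a',text='Elsie'))#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
print(soup.find('a',text='Elsie'))#<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>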
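The same searches can be written with the string argument, which the Beautiful Soup documentation introduced in version 4.4.0 as the newer name for text. This sketch assumes such a version is installed and uses html.parser with a shortened copy of the sisters markup. It also shows that the value can be a regular expression.

```python
import re
from bs4 import BeautifulSoup

html = """
<p class="story">Once upon a time there were three little sisters:
<a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# Without a tag name, the result is the matching text nodes themselves.
texts = soup.find_all(string="Elsie")
print(texts)  # ['Elsie']

# With a tag name, the result is the tags whose text matches.
l_links = soup.find_all("a", string=re.compile("^L"))
print([tag["id"] for tag in l_links])  # ['link2']
```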

5. **kwargs

Search conditions given as keyword arguments: the key is name or an attribute name, and the value is a filter. A value can be a string, a regular expression, a list, True, or a function that receives the attribute value.

from bs4 import BeautifulSoup
import re
soup=BeautifulSoup(html_doc,'lxml')
print(soup.find_all(id=re.compile('my'))) # [<p class="title" id="my p">...</p>]
print(soup.find_all(href=re.compile('lacie'),id=re.compile(r'\d')))#[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
print(soup.find_all(id=True)) # tags that have an id attribute

### Note: class is a Python keyword, so search by CSS class with class_. class_=value accepts any of the five filters
print(soup.find_all('a',class_='sister')) # a tags with the class sister
print(soup.find_all('a',class_='sister ssss')) # a tags whose class attribute is exactly "sister ssss"; a different order does not match
print(soup.find_all(class_=re.compile('^sis'))) # every tag with a class that starts with sis

Note: some attribute names cannot be written as keyword arguments, but they can still be searched through attrs. HTML5 data-* attributes are one example.

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>','lxml')
#print(data_soup.find_all(data-foo="value")) # SyntaxError: keyword can't be an expression
# The attrs parameter of find_all() takes a dictionary, which handles such attribute names:
print(data_soup.find_all(attrs={"data-foo": "value"}))# [<div data-foo="value">foo!</div>]
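The class_ rules are easiest to see on a tag that carries two classes. The sketch below uses invented markup with html.parser. It also passes a function as an attribute value; the function receives the attribute value, or None when the tag lacks the attribute.

```python
from bs4 import BeautifulSoup

html = (
    '<a class="sister big" href="/a.pdf" data-kind="doc">A</a>'
    '<a class="sister" href="/b.html">B</a>'
)
soup = BeautifulSoup(html, "html.parser")

# A single class name matches any tag that has that class among others.
one_class = soup.find_all("a", class_="sister")

# A multi-class string is compared with the attribute value as written.
exact_order = soup.find_all("a", class_="sister big")
wrong_order = soup.find_all("a", class_="big sister")

# A function used as an attribute filter gets the attribute value, or None.
pdf_links = soup.find_all(href=lambda value: value is not None and value.endswith(".pdf"))

# Attribute names that are not valid Python identifiers go through attrs.
doc_links = soup.find_all(attrs={"data-kind": "doc"})

print(len(one_class), len(exact_order), len(wrong_order))  # 2 1 0
print([tag["href"] for tag in pdf_links])                  # ['/a.pdf']
print([tag.get_text() for tag in doc_links])               # ['A']
```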

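6. The limit parameter

limit caps the number of results. When a document is very large and we do not need every match, a full search is needlessly slow. If we only want the first 3 a tags from a document that contains 200, limit stops the search once that many results are found, much like LIMIT in SQL.

find_all() has a limit parameter and find() does not because find() only ever returns the first result, so there is nothing to limit.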

from bs4 import BeautifulSoup
import re
soup=BeautifulSoup(html_doc,'lxml')

print(soup.find_all('a',limit=3))
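The 200-link scenario described above can be reproduced with generated markup. This is a small sketch under the assumption that html.parser is good enough for the demonstration; the ids link0 to link199 are made up.

```python
from bs4 import BeautifulSoup

# Build a document with 200 links.
html = "".join(f'<a id="link{i}">{i}</a>' for i in range(200))
soup = BeautifulSoup(html, "html.parser")

first_three = soup.find_all("a", limit=3)
print([tag["id"] for tag in first_three])  # ['link0', 'link1', 'link2']

# find() gives the same node as the only element of find_all(limit=1).
same_node = soup.find("a") is soup.find_all("a", limit=1)[0]
print(same_node)  # True
```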

Shortcuts:

find() and find_all() are by far the most used methods in Beautiful Soup, so each has a shorthand.

from bs4 import BeautifulSoup
import re
soup=BeautifulSoup(html_doc,'lxml')

soup.find_all("a")
soup("a") # shorthand for find_all

soup.find("head").find("title")# <title>The Dormouse's story</title>
soup.head.title # shorthand for find

soup.title.find_all(text=True) # shorthand for find only
soup.title(text=True) # shorthand for both find and find_all

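CSS selectors

CSS selectors locate tags the same way CSS rules do. The essentials are .class and #id.

from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')

# 1. CSS selectors
print(soup.p.select('.sister')) # [] because the first p contains no .sister elements
print(soup.select('.sister')) # all three a tags with the class sister

print(soup.select('#link1')) # the tag with id="link1"
print(soup.select('p.story #link1')) # id="link1" inside a p with the class story

print(soup.select('p.story a.sister')) # one selector can combine several conditions

print(soup.select('p.story')[0].select('.sister')) # select() calls can be chained, but a single combined selector is usually enough

# 2. Get attributes
print(soup.select('#link1')[0].attrs)

# 3. Get text
print(soup.select('#link1')[0].get_text())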
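As a compact recap, the sketch below runs a few selectors against a shortened copy of the sisters markup. It assumes html.parser and a Beautiful Soup install that includes its usual CSS selector dependency (soupsieve). select_one is used where only the first match is wanted; it returns None when nothing matches.

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

sisters = soup.select("p.story a.sister")           # descendant selector with classes
second = soup.select_one("#link2")                  # first match by id
by_suffix = soup.select('a[href$="tillie"]')        # attribute ends-with selector
nothing = soup.select_one("#missing")               # no match

print([tag["id"] for tag in sisters])  # ['link1', 'link2', 'link3']
print(second.get_text())               # Lacie
print(by_suffix[0]["id"])              # link3
print(nothing)                         # None
```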