The BeautifulSoup Module

Date: 2019-12-10


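1. Introduction

Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the parse tree, and it can save hours or even days of work. Beautiful Soup 3 is no longer being developed; the official site recommends Beautiful Soup 4 for current projects.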

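1.1 Point pip at a mirror inside China

- Set the pip index to a domestic mirror such as Aliyun, Douban, or NetEase.
   - Windows
    (1) Open File Explorer and click the address bar.
    (2) Type %appdata% in the address bar and press Enter.
    (3) Create a new folder named pip there.
    (4) Inside the pip folder, create a file named pip.ini with the following content:
        [global]
        timeout = 6000
        index-url = https://mirrors.aliyun.com/pypi/simple/
        trusted-host = mirrors.aliyun.com
   - Linux
    (1) cd ~
    (2) mkdir ~/.pip
    (3) vi ~/.pip/pip.conf
    (4) Enter the same content as in the Windows pip.ini.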
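
The manual steps above can also be scripted. The following is only a sketch: write_pip_config is a hypothetical helper name, not part of pip, and it simply writes the same [global] section shown above to whatever path you pass in.

```python
import configparser
from pathlib import Path


def write_pip_config(config_path):
    """Write a pip config file that points at the Aliyun mirror (hypothetical helper)."""
    config = configparser.ConfigParser()
    config["global"] = {
        "timeout": "6000",
        "index-url": "https://mirrors.aliyun.com/pypi/simple/",
        "trusted-host": "mirrors.aliyun.com",
    }
    config_path = Path(config_path)
    # Create the pip folder if it does not exist yet.
    config_path.parent.mkdir(parents=True, exist_ok=True)
    with config_path.open("w", encoding="utf-8") as fh:
        config.write(fh)
    return config_path


# Linux path from the steps above; uncomment to actually write the file:
# write_pip_config(Path.home() / ".pip" / "pip.conf")
```

The call is left commented out so that running the snippet does not overwrite an existing configuration.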

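1.2 Install Beautiful Soup

# Install Beautiful Soup
pip install beautifulsoup4

# Install a parser
Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. One of them is lxml. Depending on your operating system, install lxml with one of the following (on Debian/Ubuntu with Python 3 the package is named python3-lxml):
$ apt-get install python-lxml
$ easy_install lxml
$ pip install lxml

Another option is html5lib, a pure-Python parser that parses HTML the way a web browser does. Install it with one of the following (python3-html5lib on Debian/Ubuntu with Python 3):
$ apt-get install python-html5lib
$ easy_install html5lib
$ pip install html5lib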

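1.3 Comparison of the main parsers

The table below lists the main parsers with their advantages and disadvantages. The official site recommends lxml because it is faster. On Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2, installing lxml or html5lib is essential, because the HTML parser built into the standard library of those versions is not reliable enough.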

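Parser	Typical usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; decent speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient with malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Very fast; the only parser that supports XML	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses pages the way a browser does; produces valid HTML5	Very slow; requires installing an extra (pure-Python) package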
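
To see how the parsers in the table differ in practice, the sketch below feeds the same broken fragment to each of them. It assumes only that bs4 is installed; lxml and html5lib are optional, and a missing parser is reported instead of raising. The results dictionary is just an illustrative name.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"   # a stray closing tag with no matching opening tag
results = {}

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output if output is not None else "not installed")

# html.parser drops the stray </p> and leaves the fragment unwrapped:
print(results["html.parser"])   # <a></a>
```

The other two parsers repair the fragment differently, for example by wrapping it in html and body tags, which is why the same selector can behave differently depending on the parser you choose.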

For details, see the Chinese documentation: Beautiful Soup 4.2.0 documentation

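2. Basic usage

Fault tolerance: a parser's leniency is its ability to recognize and recover from incomplete HTML.

Parsing incomplete HTML with BeautifulSoup still yields a BeautifulSoup object, which can be printed with standard indentation.

Core idea: convert the HTML document into a BeautifulSoup object, then call that object's attributes and methods to locate the content you want.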

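2.1 Workflow

1. Import the class: from bs4 import BeautifulSoup
2. Create a BeautifulSoup object:
    If the HTML comes from a local file:
        BeautifulSoup(open('local.html'), 'lxml')
    If the HTML comes from the network:
        BeautifulSoup(page_text, 'lxml')  # page_text is the page data returned by the request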
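
The two cases in the workflow can be sketched as follows. The file name demo.html and the page_text string are made up for illustration, and html.parser is used so the snippet runs without lxml; in a real crawler page_text would come from an HTTP response.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

page_text = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Case 1: the HTML is already in memory (for example, a response body).
soup_from_text = BeautifulSoup(page_text, "html.parser")

# Case 2: the HTML lives in a local file; pass the open file handle.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.html"
    path.write_text(page_text, encoding="utf-8")
    with path.open(encoding="utf-8") as fh:
        soup_from_file = BeautifulSoup(fh, "html.parser")

print(soup_from_text.title.string)  # Demo
print(soup_from_file.p.string)      # Hello
```

Opening the file in a with block closes the handle as soon as the document has been parsed.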

2.2 Code example

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup=BeautifulSoup(html_doc,'lxml')  # lxml tolerates the unclosed tags
res=soup.prettify()  # indent the tree for a structured display
print(res)

The output fills in the missing HTML, ending with:

 </body>
</html>

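3. Traversing the document tree

Traversing the tree means selecting elements directly by tag name. It is fast, but if several tags share the same name, only the first one is returned.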

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
asdf
    <div class="title">
        <b>The Dormouse's story總共</b>
        <h1>f</h1>
    </div>
<div class="story">Once upon a time there were three little sisters; and their names were
    <a  class="sister0" id="link1">Els<span>f</span>ie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</div>
ad<br/>sf
<p class="story">...</p>
</body>
</html>
"""
# 1. Setup
from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc, features='lxml')  # lxml tolerates malformed HTML

# 2. name: look up a tag by its name
# tag = soup.a
# name = tag.name  # get the tag name
# print(name)      # prints: a
# tag.name = 'span'  # rename the tag
# print(soup)
"""The output shows that the first <a> tag has become a <span> tag:
<span class="sister0" id="link1">Els<span>f</span>ie</span>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
"""

# 3. attrs: get a tag's attributes
# tag = soup.a
# attrs = tag.attrs    # dictionary of the tag's attributes
# print(attrs)
"""
{'class': ['sister0'], 'id': 'link1'}
"""
# tag.attrs = {'ik':123}      # replace all attributes with a new dictionary
# tag.attrs['id'] = 'iiiii'   # add (or overwrite) a single attribute
# print(soup)
"""
<a id="iiiii" ik="123">Els<span>f</span>ie</a>,
"""

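# 4. string/strings/text/get_text: get a tag's text content
# print(soup.p.string)    # the text if p has exactly one child, otherwise None
# print(soup.p.strings)   # a generator over every piece of text inside p
# print(soup.p.text)      # all the text inside p, joined into one string
# for line in soup.div.stripped_strings:   # same as strings, with whitespace stripped
#     print(line)
"""
If a tag contains more than one child, it is unclear which child .string should refer to, so .string is None.
If there is exactly one child, .string returns that child's text. For example, when a <p> mixes text with <a> tags, soup.p.string returns None,
but soup.p.strings and soup.p.get_text() still reach all of the text.
"""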

# 5. children: the direct children
# body = soup.find('body')
# v = body.children  # an iterator over body's direct children
# for i, child in enumerate(v):   # 9 children in this document
#     print(i, child)

# 6. descendants: children, grandchildren, and so on
# body = soup.find('body')  # every node nested anywhere under body is included
# v = body.descendants
# for i, child in enumerate(v):     # 30 descendants in this document
#     print(i, child)

# 7. Nested selection
# print(soup.head.title.string)    # The Dormouse's story
# print(soup.body.a.string)        # None

# 8. parent/parents: the parent node / all ancestor nodes
# parent = soup.a.parent      # the parent of the first <a> tag
# print(parent)    # <div class="story">...</div>
# parents = soup.a.parents    # all ancestors of the first <a> tag
# print(parents)   # <generator object parents at 0x10403b200>

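# 9. next_sibling/previous_sibling and next_siblings/previous_siblings: sibling nodes
n_s = soup.a.next_sibling       # the next sibling
print(n_s)
p_s = soup.a.previous_sibling   # the previous sibling
print(p_s)

n_generator = list(soup.a.next_siblings)      # all following siblings (next_siblings is a generator)
print(n_generator)
p_generator = list(soup.a.previous_siblings)  # all preceding siblings (previous_siblings is a generator)
print(p_generator)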
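
As a compact recap of the navigation attributes above, here is a minimal sketch on a shortened version of the sample markup. The html string and the variable names are illustrative only, and html.parser is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<div><a id="link1">Els<span>f</span>ie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a>;</div>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.a                    # tag-name access returns only the first <a>
print(first.string)               # None, because the tag has three children
print(list(first.strings))        # ['Els', 'f', 'ie']
print(first.get_text())           # Elsfie

# Siblings include text nodes, so keep only Tag objects when you want elements.
later_ids = [node["id"] for node in first.next_siblings if isinstance(node, Tag)]
print(later_ids)                  # ['link2', 'link3']
```

The isinstance check matters because the commas and the word "and" between the links are sibling nodes too.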

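4. Searching the document tree

BeautifulSoup defines many search methods. This section focuses on two of them, find() and find_all(); the others take similar arguments and are used in much the same way.

4.1 The five kinds of filters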

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
asdf
    <div class="title">
        <b>The Dormouse's story總共</b>
        <h1>f</h1>
    </div>
<div class="story">Once upon a time there were three little sisters; and their names were
    <a  class="sister0" id="link1">Els<span>f</span>ie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</div>
ad<br/>sf
<p class="story">...</p>
</body>
</html>
"""

import re
from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc, features='lxml')  # lxml tolerates malformed HTML


# The five kinds of filters: string, regular expression, list, True, function
# 1. String: a tag name
# print(soup.find_all('b'))
"""
[<b>The Dormouse's story總共</b>]
"""

# 2. Regular expression: matched against the tag name
# print(soup.find_all(re.compile('^b$')))  # tags named exactly "b"; '^b' alone would also match <body> and <br>
"""
[<b>The Dormouse's story總共</b>]
"""

# 3. List: Beautiful Soup returns anything that matches any item in the list.
# The line below finds every <a> tag and every <b> tag in the document:
# print(soup.find_all(['a', 'b']))
"""
[<b>The Dormouse's story總共</b>, <a class="sister0" id="link1">Els<span>f</span>ie</a>......
"""

# 4. True: matches everything it can. The code below finds every tag, but no text nodes
# print(soup.find_all(True))
# for tag in soup.find_all(True):
#     print(tag.name)   # html, head, title, and so on

# 5. Function: if no other filter fits, define a function that takes one element as its only argument.
# It should return True if the element matches and False otherwise
def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")

print(soup.find_all(has_class_but_no_id))
"""
[<div class="title">...</div>, <div class="story">...</div>, <p class="story">...</p>]
"""

4.2 find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

# find_all()
# 1. name: accepts any kind of filter: a string, a regular expression, a list, a function, or True.
# print(soup.find_all(name=re.compile('^t')))
"""
[<title>The Dormouse's story</title>]
"""

# 2. Keyword arguments: key=value, where value can be a filter: a string, a regular expression, a list, or True.
print(soup.find_all(id=re.compile('my')))
print(soup.find_all(href=re.compile('lacie'),id=re.compile(r'\d')))  # every keyword must match; to filter by class use class_
print(soup.find_all(id=True))   # tags that have an id attribute

# Some attribute names cannot be used as keyword arguments, such as the HTML5 data-* attributes:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>','lxml')
# data_soup.find_all(data-foo="value")  # raises SyntaxError: keyword can't be an expression
# Such attributes can still be searched by passing a dictionary as the attrs argument of find_all():
print(data_soup.find_all(attrs={"data-foo": "value"}))
# [<div data-foo="value">foo!</div>]

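# 3. Search by class name: the keyword is class_, and its value can be any of the filters
print(soup.find_all('a',class_='sister'))        # <a> tags with the class sister
print(soup.find_all('a',class_='sister ssss'))   # <a> tags whose class attribute is exactly "sister ssss"; a different order does not match
print(soup.find_all(class_=re.compile('^sis')))  # any tag with a class that starts with "sis"

# 4. attrs
print(soup.find_all('p',attrs={'class':'story'}))

# 5. text: the value can be a string, a list, True, or a regular expression
# (since Beautiful Soup 4.4.0 the same argument is also available under the name string)
print(soup.find_all(text='Elsie'))
print(soup.find_all('a',text='Elsie'))

# 6. limit: searching a very large tree can be slow. If you do not need every result, use limit to cap the number returned.
# It works like LIMIT in SQL: the search stops as soon as limit results have been found
print(soup.find_all('a',limit=2))   # the first two matching <a> tags

# 7. recursive: find_all() on a tag examines all of that tag's descendants.
# To search only the tag's direct children, pass recursive=False.
print(soup.html.find_all('a'))
print(soup.html.find_all('a',recursive=False))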

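'''
Calling a tag is like calling find_all()
find_all() is by far the most commonly used search method in Beautiful Soup, so the library provides a shortcut for it. A BeautifulSoup object or a Tag object can be called like a function, and the result is the same as calling find_all() on that object. These two lines are equivalent:
soup.find_all("a")
soup("a")
These two lines are also equivalent:
soup.title.find_all(text=True)
soup.title(text=True)
'''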
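
Two of the matching rules above are easy to get wrong: how a regular expression is anchored against the tag name, and how class_ treats a multi-class attribute. The sketch below demonstrates both on a small made-up fragment; the ids helper is hypothetical and exists only to keep the output short.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<body><b>bold</b><br/>'
    '<a class="sister big" id="x">Lacie</a>'
    '<a class="sister" id="y">Tillie</a></body>'
)
soup = BeautifulSoup(html, "html.parser")


def ids(tags):
    """Collect the id attribute of each tag in a result list."""
    return [tag["id"] for tag in tags]


# A regular expression is searched against the tag name, so anchoring matters.
exact_b = [tag.name for tag in soup.find_all(re.compile("^b$"))]
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]
print(exact_b)         # ['b']
print(starts_with_b)   # ['body', 'b', 'br']

# class_ matches any single class, or the exact class string in the same order.
print(ids(soup.find_all("a", class_="sister")))      # ['x', 'y']
print(ids(soup.find_all("a", class_="sister big")))  # ['x']
print(ids(soup.find_all("a", class_="big sister")))  # []

# A CSS selector matches both classes regardless of order.
print(ids(soup.select("a.big.sister")))              # ['x']

# limit stops the search after the first match.
print(ids(soup.find_all("a", limit=1)))              # ['x']
```

When the order of classes in the markup is not guaranteed, the select() form is usually the safer choice.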

4.3 find(self, name=None, attrs={}, recursive=True, text=None, **kwargs)

find_all() returns every tag that matches, but sometimes you only want one result.

For example, a document has only one <body> tag, so scanning the whole tree with find_all() is wasteful. Rather than calling find_all() with limit=1, call find() directly.

The following two calls are nearly equivalent:

print(soup.find_all("title", limit=1))
# [<title>The Dormouse's story</title>]
print(soup.find("title"))
# <title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing a single element, while find() returns the element itself (the first tag that matches).

When nothing matches, find_all() returns an empty list and find() returns None.

print(soup.find("nosuchtag"))
"""
None
"""

soup.head.title is shorthand for navigating by tag name. It works by calling find() repeatedly on the current tag:

print(soup.head.title)
# <title>The Dormouse's story</title>
print(soup.find("head").find("title"))
# <title>The Dormouse's story</title>
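
Because find() returns None when nothing matches, chaining an attribute access onto its result can raise AttributeError. One possible guard is sketched below; text_of is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title> Demo </title></head></html>", "html.parser")


def text_of(node, name, default=""):
    """Return the stripped text of the first <name> tag under node, or default if absent."""
    tag = node.find(name)
    return tag.get_text(strip=True) if tag is not None else default


print(text_of(soup, "title"))                    # Demo
print(text_of(soup, "nosuchtag", default="n/a"))  # n/a
```

The same check applies to shorthand such as soup.head.title, which also evaluates to None when a tag along the way is missing.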

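4.4 For everything else, see the official documentation

Beautiful Soup 4.2.0 documentation

4.5 CSS selectors

The module provides a select() method that accepts CSS selectors. See the official site for details: CSS selectors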

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
    <b>The Dormouse's story</b>
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
        <span>Elsie</span>
    </a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    <div class='panel-1'>
        <ul class='list' id='list-1'>
            <li class='element'>Foo</li>
            <li class='element'>Bar</li>
            <li class='element'>Jay</li>
        </ul>
        <ul class='list list-small' id='list-2'>
            <li class='element'><h1 class='yyyy'>Foo</h1></li>
            <li class='element xxx'>Bar</li>
            <li class='element'>Jay</li>
        </ul>
    </div>
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

# 1. CSS selectors
# print(soup.p.select('.sister'))
# print(soup.select('.sister span'))
"""
[<a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

[<span>Elsie</span>]
"""

# print(soup.select('#link1'))
# print(soup.select('#link1 span'))
"""
[<a class="sister" href="http://example.com/elsie" id="link1"><span>Elsie</span></a>]
[<span>Elsie</span>]
"""

# print(soup.select('#list-2 .element.xxx'))
"""
[<li class="element xxx">Bar</li>]
"""

# select() calls can be chained, but there is no need; a single selector does the job
# print(soup.select('#list-2')[0].select('.element'))
"""
[<li class="element"><h1 class="yyyy">Foo</h1></li>, <li class="element xxx">Bar</li>, <li class="element">Jay</li>]
"""

# 2. Get attributes
# print(soup.select('#list-2 h1')[0].attrs)
"""
{'class': ['yyyy']}
"""

# 3. Get text content
print(soup.select('#list-2 h1')[0].get_text())
"""
Foo
"""

Notes:

(1) The common selectors are tag selectors, class selectors, id selectors, and hierarchical (descendant and child) selectors.

(2) select() always returns a list; use an index to pick out a specific element.

(3) Use the class_ argument to search for tags with a given CSS class.

Searching for tags by CSS class is very useful, but class is a reserved word in Python, so using class as a keyword argument is a syntax error. As of Beautiful Soup 4.1.1, you can search by CSS class with the class_ argument:

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The class_ argument also accepts the different kinds of filters: a string, a regular expression, a function, or True:

soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
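
Since select() always returns a list, indexing an empty result raises IndexError. The sketch below contrasts select() with select_one(), which returns the first match or None. The html fragment is a trimmed, made-up version of the list in the sample document.

```python
from bs4 import BeautifulSoup

html = (
    '<ul id="list-2">'
    '<li class="element">Foo</li>'
    '<li class="element xxx">Bar</li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

# select() returns every match as a list, even when there is only one.
items = soup.select("#list-2 > li.element")
print([li.get_text() for li in items])   # ['Foo', 'Bar']

# select_one() returns the first match directly, or None if nothing matches.
first_xxx = soup.select_one("#list-2 .xxx")
print(first_xxx.get_text())              # Bar
print(soup.select_one("#missing"))       # None
print(soup.select("#missing"))           # []
```

Checking for None or an empty list before indexing keeps a crawler from crashing when a page layout changes.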

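5. Modifying the document tree

Beautiful Soup's main strength is searching the document tree, but it also makes it easy to modify the tree.

See the official site for details: Modifying the tree.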
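
As a small taste of what modifying the tree looks like, the sketch below changes an attribute, replaces text, removes a tag, and appends a new one. The markup is made up for illustration; the official documentation covers many more operations.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="story">Hello <b>world</b><i>!</i></p>', "html.parser")
p = soup.p

p["class"] = "intro"      # change an attribute value
p.b.string = "there"      # replace the text inside <b>
p.i.decompose()           # remove the <i> tag and its contents

# Build a new tag and append it as the last child of <p>.
link = soup.new_tag("a", href="http://example.com/")
link.string = "link"
p.append(link)

print(soup)
# <p class="intro">Hello <b>there</b><a href="http://example.com/">link</a></p>
```

All of these changes happen in memory; call str(soup) or soup.prettify() to get the updated markup back out.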

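6. bs4 practice project

Goal: scrape the chapter titles and chapter text of the novel Romance of the Three Kingdoms from the Gushiwen site. (Link: Romance of the Three Kingdoms on Gushiwen)

6.1 Implementation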

import requests
from bs4 import BeautifulSoup


url = "http://www.shicimingju.com/book/sanguoyanyi.html"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
}


def get_content(url):
    """
    Fetch the chapter page at url and return the text of that chapter.
    :param url: URL of a chapter page
    :return: the chapter text
    """
    content_page = requests.get(url=url, headers=headers).text
    # Parse the chapter page
    soup = BeautifulSoup(content_page, 'lxml')
    # Use the class_ argument to find the tag with the given CSS class
    div = soup.find('div', class_='chapter_content')
    return div.text


page_text = requests.get(url=url, headers=headers).text

# Parse the table of contents
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select('.book-mulu > ul > li > a')  # child selectors locate the <a> tag inside each <li>
# print(a_list)
"""a_list holds a series of <a> Tag objects:
[<a href="/book/sanguoyanyi/1.html">第一回·宴桃園豪傑三結義  斬黃巾英雄首立功</a>,
...,
<a href="/book/sanguoyanyi/120.html">第一百二十回·薦杜預老將獻新謀  降孫皓三分歸一統</a>]
"""
# print(type(a_list[0]))
"""
<class 'bs4.element.Tag'>
# Note: a Tag object supports the same parsing attributes and methods, so it can be used to parse just its own part of the document
"""

# Persist the results; the with block closes the file even if a request fails
with open('./sanguo.txt', 'w', encoding='utf-8') as fp:
    for a in a_list:
        # Chapter title
        title = a.string
        # Chapter URL
        content_url = 'http://www.shicimingju.com' + a['href']
        # Chapter text
        content = get_content(content_url)
        # print(content)
        fp.write(title + ':' + content + "\n\n\n")
        print('Wrote one chapter')
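
The link-extraction step can be tried without touching the network. The sketch below runs the same selector against a made-up sample_toc string that only imitates the structure the crawler expects; chapter_links is a hypothetical helper, and urljoin from the standard library replaces the manual string concatenation.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.shicimingju.com/book/sanguoyanyi.html"

# Made-up markup that mimics the table-of-contents structure assumed by the selector.
sample_toc = """
<div class="book-mulu"><ul>
  <li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li>
  <li><a href="/book/sanguoyanyi/2.html">Chapter 2</a></li>
  <li><a>No link here</a></li>
</ul></div>
"""


def chapter_links(page_text, base_url=BASE_URL):
    """Return (title, absolute_url) pairs for every chapter link that has an href."""
    soup = BeautifulSoup(page_text, "html.parser")
    links = []
    for a in soup.select(".book-mulu > ul > li > a"):
        href = a.get("href")      # .get() returns None instead of raising KeyError
        if not href:
            continue              # skip anchors without a target
        links.append((a.get_text(strip=True), urljoin(base_url, href)))
    return links


for title, link in chapter_links(sample_toc):
    print(title, link)
```

Using a.get("href") and skipping empty values keeps one malformed entry from aborting the whole crawl.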