Web Scraping (5): Parsing Libraries (2), the Beautiful Soup Parsing Library

Date: 2019-12-10

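Parsing library: Beautiful Soup

1. Introduction to BeautifulSoup

Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document tree. Beautiful Soup 3 is no longer under development; the official site recommends using Beautiful Soup 4 in current projects and porting existing code to BS4.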

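2. Installing the modules

# Install Beautiful Soup
pip install beautifulsoup4

# Install the lxml parser
# Beautiful Soup supports the HTML parser in Python's standard library as well as
# several third-party parsers, one of which is lxml. With pip, it installs like this:
pip install lxml

The table below lists the main parsers with their advantages and disadvantages. The official site recommends lxml as the parser because it is faster. On Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2, you must install lxml or html5lib, because the HTML parser built into the standard library of those versions is not reliable enough.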

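Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Fast; tolerant of malformed documents	Requires a C library
lxml XML parser	`BeautifulSoup(markup, ["lxml", "xml"])` or `BeautifulSoup(markup, "xml")`	Fast; the only parser that supports XML	Requires a C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Best fault tolerance; parses the document the way a browser does; produces HTML5-format documents	Slow; must be installed separately (pure Python, no C extension needed)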
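
The parser choice in the table can be made at run time. The following is a minimal sketch, assuming you want lxml when it is installed and the standard-library parser otherwise; `pick_parser` and `make_soup` are hypothetical helper names, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup


def pick_parser():
    """Return "lxml" if it can be imported, otherwise the built-in "html.parser"."""
    try:
        import lxml  # noqa: F401  (imported only to check that it is available)
        return "lxml"
    except ImportError:
        return "html.parser"


def make_soup(markup):
    """Hypothetical helper: parse markup with the best parser available."""
    return BeautifulSoup(markup, pick_parser())


soup = make_soup("<p>hello</p>")
print(pick_parser())
print(soup.p.string)
```

Both parsers expose the same navigation API, so the rest of the code does not need to know which one was picked. The trees they build for malformed input can differ, though, so a project that depends on exact structure may prefer to pin one parser.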

3. Basic usage of Beautiful Soup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

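# Fault tolerance: when the HTML is incomplete (html_doc above never closes <body> or <html>),
# the parser can still recognize the problem and repair the document.
# Parsing the markup above yields a BeautifulSoup object, which can be printed
# as a tree with standard indentation.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')     # lxml tolerates malformed markup
# soup = BeautifulSoup(open('a.html', 'r', encoding='UTF-8'), 'lxml')
# Keep the prettified text in its own variable: prettify() adds whitespace around text,
# so overwriting html_doc would break later exact-text searches.
pretty_html = soup.prettify()    # fixes indentation and errors, returns structured output
print(pretty_html)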
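
To make the fault tolerance concrete, here is a small sketch that parses a deliberately broken fragment. It uses the standard-library `html.parser` so that it runs without lxml; the `broken` string is an invented example, not taken from the document above.

```python
from bs4 import BeautifulSoup

# Neither <p> nor <b> is closed in this fragment.
broken = "<p>unclosed <b>bold"

soup = BeautifulSoup(broken, "html.parser")

# Beautiful Soup closes the open tags when it builds the tree.
repaired = str(soup)
print(repaired)

# The repaired tree can be navigated like any well-formed document.
print(soup.b.string)
print(soup.p.get_text())
```

Different parsers repair the same input differently (lxml, for example, also wraps the fragment in `<html>` and `<body>`), so the repaired output should not be assumed identical across parsers.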

4. Finding elements with Beautiful Soup

I. Finding text and attributes: navigating level by level with "."

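# 1. Get the tag's name (important)
print(soup.p.name)

# 2. Get the tag's attributes as a dict (important)
print(soup.p.attrs)

# 3. Get the tag's content
print(soup.p.string)     # returns the text only when p has a single child; otherwise None
print(soup.p.strings)      # a generator over all text inside p; can be converted to a list
print(soup.p.text)       # all text inside p, joined together (important)
for line in soup.stripped_strings:    # every string in the document, whitespace stripped
    print(line)

# 4. Nested selection
print(soup.head.title.string)
print(soup.body.a.string)


# 5. Children and descendants
print(soup.p.contents)      # list of all direct children of p
print(soup.p.children)      # an iterator over the direct children of p
for i, child in enumerate(soup.p.children):
    print(i, child)

print(soup.p.descendants)   # a generator over all descendants of p (nested tags and strings)
for i, child in enumerate(soup.p.descendants):
    print(i, child)

# 6. Parent and ancestors
print(soup.a.parent)        # the parent of the first a tag
print(soup.a.parents)       # a generator over all ancestors: parent, grandparent, and so on


# 7. Siblings
print(soup.a.next_sibling)      # next sibling
print(soup.a.previous_sibling)      # previous sibling

print(list(soup.a.next_siblings))       # all following siblings (generator converted to a list)
print(soup.a.previous_siblings)         # all preceding siblings (generator object)

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')   # pass the variable, not the string 'html_doc'
pretty_html = soup.prettify()

print(soup.a.text)     # text inside the first a tag
print(soup.a.attrs['href'])    # href attribute of the a tag
print(soup.a.name)      # name of the a tag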
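
The difference between `.string`, `.stripped_strings` and `.get_text()` is easiest to see on a small fragment. The sketch below uses an invented `snippet` and the standard-library parser so it runs on its own.

```python
from bs4 import BeautifulSoup

snippet = '<p class="story">Once <a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

# p has several children (text and tags), so .string cannot pick one: it is None.
print(p.string)

# The first <a> has exactly one child, a string, so .string returns it.
print(soup.a.string)

# stripped_strings yields every piece of text under p with whitespace stripped.
pieces = list(p.stripped_strings)
print(pieces)

# get_text() joins all text under p into one string.
full_text = p.get_text()
print(full_text)

# Navigation: the parent of <a> is <p>; its next sibling is the text node " and ".
print(soup.a.parent.name)
print(repr(soup.a.next_sibling))
```

Note that siblings are often text nodes rather than tags, which is why `next_sibling` here is a string and not the second `<a>`.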

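II. Searching the tree: find() / find_all()

(1) The five kinds of filters

1. String    ----   ''
2. Regular expression   -----   re.compile()
3. List
4. True     # name=True matches every tag; attr=True matches tags that have that attribute
5. Function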

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

# find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

# 1. String
print(soup.find_all('b'))

# 2. Regular expression
# Use re.compile() to pass a regex
import re
print(soup.find_all(name=re.compile('^b')))   # tags whose name starts with b: body and b

# 3. List
# If a list is passed, Beautiful Soup returns anything that matches any item in the list.
# The code below finds all <a> tags and <b> tags in the document:
print(soup.find_all(['a', 'b']))

# 4. True
# True matches any value. find_all(True) returns every tag but no string nodes.
print(soup.find_all(attrs={'id': True}))    # all tags that have an id attribute
print(soup.find_all(True))
for tag in soup.find_all(True):
    print(tag.name)

# 5. Function
# If no other filter fits, define a function that takes a single element as its argument.
# It returns True if the element matches and False otherwise.
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))

# Anonymous function (lambda): here, tags that have both class and id
print(soup.find_all(lambda tag: tag.has_attr("class") and tag.has_attr("id")))
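
As a compact illustration of the five filter kinds, the following sketch runs each one against a small invented `snippet` and collects tag names so the results are easy to compare. It uses the standard-library parser, which adds no extra wrapper tags.

```python
import re

from bs4 import BeautifulSoup

snippet = (
    '<html><body><p class="title"><b>Title</b></p>'
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a></body></html>'
)
soup = BeautifulSoup(snippet, "html.parser")


def names(tags):
    """Return the tag names of a result list, in document order."""
    return [tag.name for tag in tags]


# 1. String: exact tag name.
by_string = names(soup.find_all("b"))

# 2. Regular expression: tag names starting with "b".
by_regex = names(soup.find_all(re.compile("^b")))

# 3. List: any of the listed names.
by_list = names(soup.find_all(["a", "b"]))

# 4. True: every tag (string nodes are never returned).
by_true = names(soup.find_all(True))


# 5. Function: any predicate that takes a tag and returns True or False.
def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


by_function = names(soup.find_all(has_class_but_no_id))

for label, result in [("string", by_string), ("regex", by_regex), ("list", by_list),
                      ("True", by_true), ("function", by_function)]:
    print(label, result)
```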

(2) Parameters of find_all

find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

# 1. name: the value can be any kind of filter: string, regular expression, list, function, or True.
print(soup.find_all(name=re.compile('^t')))

# 2. keyword: key=value form; the value can be a filter: string, regular expression, list, or True.
print(soup.find_all(id=re.compile('my')))
print(soup.find_all(href=re.compile('lacie'), id=re.compile(r'\d')))  # note: to filter on class, use class_
print(soup.find_all(id=True))    # tags that have an id attribute

# Some attribute names cannot be used as keyword arguments, such as the HTML5 data-* attributes:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
# data_soup.find_all(data-foo="value")  # raises SyntaxError: keyword can't be an expression
# Instead, pass a dict through the attrs parameter of find_all() to search for such attributes:
print(data_soup.find_all(attrs={"data-foo": "value"}))
# [<div data-foo="value">foo!</div>]

# 3. Searching by class: the keyword is class_; in class_=value, value can be any of the five filters
print(soup.find_all(name='a', class_='sister'))  # a tags whose classes include sister
print(soup.find_all('a', class_='sister ssss'))  # a tags whose class attribute is exactly "sister ssss"; a different order does not match
print(soup.find_all(class_=re.compile('^sis')))  # all tags with a class starting with "sis"

# 4. attrs
print(soup.find_all('p', attrs={'class': 'story'}))

# 5. text: the value can be a string, list, True, or regular expression
#    (since Beautiful Soup 4.4.0 this argument is also available under the name string)
print(soup.find_all(text='Elsie'))
print(soup.find_all('a', text='Elsie'))

# 6. limit: searching a large tree is slow. If you do not need every result, use limit to cap
#    the number returned. It works like LIMIT in SQL: the search stops once that many results are found.
print(soup.find_all('a', limit=2))

# 7. recursive: find_all() on a tag searches all of its descendants. To search only the
#    tag's direct children, pass recursive=False.
print(soup.html.find_all('a'))
print(soup.html.find_all('a', recursive=False))   # [] here: no a tag is a direct child of html
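
The class matching rules and the `attrs`, `limit` and `recursive` parameters can be checked on a small fragment. This is a sketch on an invented `snippet`, parsed with the standard-library parser; the ids and class names are made up for the example.

```python
from bs4 import BeautifulSoup

snippet = (
    '<div data-foo="value">'
    '<a class="sister big" id="link1">Elsie</a>'
    '<p><a class="sister" id="link2">Lacie</a></p>'
    '</div>'
)
soup = BeautifulSoup(snippet, "html.parser")


def ids(tags):
    """Return the id attribute of each tag in a result list."""
    return [tag["id"] for tag in tags]


# class_ with a single name matches any tag that has that class among its classes.
any_sister = ids(soup.find_all("a", class_="sister"))

# class_ with several names matches only the exact attribute string, in that order.
exact = ids(soup.find_all("a", class_="sister big"))
reordered = ids(soup.find_all("a", class_="big sister"))

# data-* attributes are not valid Python keywords, so they go through attrs.
data_tags = [tag.name for tag in soup.find_all(attrs={"data-foo": "value"})]

# limit stops the search after the given number of results.
limited = ids(soup.find_all("a", limit=1))

# recursive=False looks only at the direct children of the tag it is called on.
direct = ids(soup.div.find_all("a", recursive=False))

print(any_sister, exact, reordered, data_tags, limited, direct)
```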

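III. CSS selectors: select('#id')

# 1. CSS selectors
# select() returns a list.
# Note: several selectors below (span, #list-2, .element, h1) target elements that html_doc
# above does not contain; they assume a different document. Against html_doc they return [],
# and indexing an empty result with [0] raises IndexError.
print(soup.p.select('.sister'))
print(soup.select('.sister span'))
print(soup.select('#link1'))
print(soup.select('#link1 span'))
print(soup.select('#list-2 .element.xxx'))

print(soup.select('#list-2')[0].select('.element'))  # select() can be chained, but a single selector is enough

# 2. Get attributes
print(soup.select('#list-2 h1')[0].attrs)

# 3. Get content
print(soup.select('#list-2 h1')[0].get_text())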
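
For a version of the selector calls that runs end to end, the sketch below applies `select()` and `select_one()` to a small invented `snippet` modelled on the "sister" links used earlier. It assumes a Beautiful Soup version with the bundled CSS selector support and uses the standard-library parser.

```python
from bs4 import BeautifulSoup

snippet = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '</p>'
)
soup = BeautifulSoup(snippet, "html.parser")

# select() always returns a list, even for an id selector.
sisters = soup.select(".sister")
print(len(sisters))

# select_one() returns the first match, or None when nothing matches.
first = soup.select_one("#link1")
print(first.get_text())
print(first.attrs["href"])

# Combinators work as in CSS: direct children of p.story with a given id.
child = soup.select("p.story > a#link2")
print(child[0]["id"])

# A selector with no match gives an empty list, so guard before indexing.
missing = soup.select("#list-2 h1")
if missing:
    print(missing[0].get_text())
else:
    print("no match")
```

Checking the result before indexing avoids the IndexError that `select(...)[0]` raises when the selector matches nothing.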
