04 Beautiful Soup

Date: 2019-11-17



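Introduction

In short, Beautiful Soup is a Python library whose main purpose is extracting data from web pages. The official description reads:

'''
Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.
It is a toolkit that parses a document and hands you the data you want to extract; because it is so simple, a complete application takes very little code.
'''

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the document, and it can save hours or even days of work. If you are looking for the Beautiful Soup 3 documentation, note that Beautiful Soup 3 is no longer being developed; the official site recommends Beautiful Soup 4 for current projects.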

Installation

pip install beautifulsoup4

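Parsers

Beautiful Soup supports the HTML parser in Python's standard library (html.parser) as well as several third-party parsers. If no third-party parser is installed, the standard-library parser is used.

The lxml parser is more powerful and faster, so installing it is recommended.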

pip install lxml

Another option is html5lib, a pure-Python parser that parses pages the same way a web browser does. Install it with:

pip install html5lib
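
Since the parser is chosen by the second argument to BeautifulSoup, one way to cope with an optional parser is to fall back to the standard-library one. The following is a minimal sketch, not part of the original post: make_soup is a hypothetical helper name, and it assumes bs4 is installed while lxml may or may not be.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper used only for illustration.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello <b>world</b></p>")
print(soup.b.string)  # world
```

Naming the parser explicitly keeps results consistent across machines, because different parsers can build different trees from malformed HTML.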

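Parser comparison:

    - html.parser (standard library): no extra install, moderate speed
    - lxml: very fast, requires the third-party lxml package
    - html5lib: parses the way a browser does and is the most lenient, but is very slow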

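Using BeautifulSoup

Importing BS

1. Import: from bs4 import BeautifulSoup
2. Convert an HTML document into a BeautifulSoup object, then use the object's methods and attributes to find the nodes you need
    2.1 Local file: soup = BeautifulSoup(open('local file'), 'lxml')

    2.2 Network data: soup = BeautifulSoup('a str or bytes value', 'lxml')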
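
The two input styles above can be sketched as follows. This is an illustrative example under stated assumptions: the HTML string and the file name page.html are made up, the file is written to a temporary directory so the snippet is self-contained, and html.parser is used instead of lxml so no extra install is needed.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html = "<html><body><a href='/home'>Home</a></body></html>"

# "Network data": a str or bytes value is passed directly.
soup_from_str = BeautifulSoup(html, "html.parser")
soup_from_bytes = BeautifulSoup(html.encode("utf-8"), "html.parser")

# "Local file": an open file object is passed; BeautifulSoup reads it on construction.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_str.a["href"], soup_from_bytes.a["href"], soup_from_file.a["href"])
```

Opening the file in a with block and stating the encoding avoids leaking the file handle and depending on the platform's default encoding.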

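Attributes

<1> Look up by tag name
        - soup.a   finds only the first matching tag and returns it

<2> Get attributes
        - soup.a.attrs  returns a dict of all attributes and values of the a tag
        - soup.a.attrs['href']   gets the href attribute
        - soup.a['href']   shorthand for the same thing

<3> Get content
        - soup.a.string
        - soup.a.text
        - soup.a.get_text()    same as text
       [Note] If the tag has more than one child node (for example, text plus a nested tag), string returns None, while the other two still return the text content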
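
The difference between string and text is easiest to see side by side. A small sketch, assuming a made-up HTML fragment and the html.parser parser:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <a> holds only text, <p> holds text plus a nested tag.
html = '<div><a href="/x" title="demo">plain</a><p>outer <b>inner</b></p></div>'
soup = BeautifulSoup(html, "html.parser")

link = soup.a                 # first <a> tag in the document
print(link.attrs)             # {'href': '/x', 'title': 'demo'}
print(link["href"])           # /x
print(link.get("id"))         # None (get() avoids a KeyError for a missing attribute)

print(link.string)            # plain
print(soup.p.string)          # None, because <p> has two children
print(soup.p.text)            # outer inner
print(soup.p.get_text())      # outer inner
```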

<4> find: returns the first matching tag
        - soup.find('a')  the first <a> tag
        - soup.find('a', title="xxx")
        - soup.find('a', alt="xxx")
        - soup.find('a', class_="xxx")
        - soup.find('a', id="xxx")
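
The keyword filters above can be tried on a small fragment. This is a sketch with invented markup, parsed with html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical links used only for illustration.
html = """
<a id="first" class="nav" title="Home" href="/">Home</a>
<a id="second" class="nav external" title="Docs" href="/docs">Docs</a>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find("a")["id"])                     # first
print(soup.find("a", title="Docs")["id"])       # second
print(soup.find("a", class_="external")["id"])  # second (matches one of several classes)
print(soup.find("a", id="second")["href"])      # /docs
print(soup.find("a", title="Missing"))          # None when nothing matches
```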

<5> find_all: returns all matching tags
        - soup.find_all('a')
        - soup.find_all(['a','b']) finds all a and b tags
        - soup.find_all('a', limit=2)  returns at most the first two
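
A short sketch of the three find_all calls listed above, using an invented fragment and html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with three <a> tags and one <b> tag.
html = "<p><a>1</a><b>x</b><a>2</a><a>3</a></p>"
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")
print(len(all_links))                                    # 3

# A list matches any of the given tag names, in document order.
a_and_b = [tag.name for tag in soup.find_all(["a", "b"])]
print(a_and_b)                                           # ['a', 'b', 'a', 'a']

# limit stops the search after the given number of matches.
first_two = [tag.get_text() for tag in soup.find_all("a", limit=2)]
print(first_two)                                         # ['1', '2']
```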

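<6> Select content with CSS selectors
               select: soup.select('#feng')
        - Common selectors: tag selector (a), class selector (.), id selector (#), hierarchical selectors
            - Hierarchical selectors:
                div .dudu #lala .meme .xixi  descendants at any depth
                div > p > a > .lala          direct children only
        [Note] select always returns a list; use an index to get a specific element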
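
The difference between the descendant selector (a space) and the child selector (>) can be sketched like this. The markup is invented and reuses the names feng, dudu, and lala from the notes above; html.parser is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one .lala link directly under div > p, one nested deeper.
html = """
<div id="feng">
  <p class="dudu"><a class="lala" href="/direct">direct</a></p>
  <section><p><a class="lala" href="/nested">nested</a></p></section>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.select("#feng")
print(len(by_id))                                         # 1 (still a list)

# Space: .lala anywhere below a div.
descendants = [a["href"] for a in soup.select("div .lala")]
print(descendants)                                        # ['/direct', '/nested']

# >: each step must be a direct child of the previous one.
children = [a["href"] for a in soup.select("div > p > .lala")]
print(children)                                           # ['/direct']

# select returns a list, so index it to get a single tag.
first = soup.select("#feng a")[0]
print(first.get_text())                                   # direct
```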

Methods

doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my p" class="title"><b id="bbb" class="boldest">The Dormouse's story</b>
</p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Test data
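
The examples that follow use a soup object that the post never constructs explicitly. Presumably it is built from the test data; a minimal sketch of that step, assuming html.parser (the post recommends lxml, which works the same way here if installed):

```python
from bs4 import BeautifulSoup

doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my p" class="title"><b id="bbb" class="boldest">The Dormouse's story</b>
</p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Assumption: the later examples run against this object.
soup = BeautifulSoup(doc, "html.parser")

print(soup.title.string)        # The Dormouse's story
print(len(soup.find_all("a")))  # 3
```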

find_all()

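Finds all matching tags
Returns a list
find_all(name=None, attrs={}, recursive=True, text=None,limit=None, **kwargs)
(since Beautiful Soup 4.4.0 the text argument is also available under the name string)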

1 name

Five kinds of filter: a string, a regular expression, a list, True, and a function

import re

# String: the tag name
print(soup.find_all('b'))  # [<b class="boldest" id="bbb">The Dormouse's story</b>]

# Regular expression
print(soup.find_all(re.compile("^b")))  # tags whose name starts with b: body and b

# List: BeautifulSoup returns everything that matches any item in the list
print(soup.find_all(['a', 'b']))  # all <a> and <b> tags in the document

# True: matches any value
print(soup.find_all(True))  # every tag
for tag in soup.find_all(True):
    print(tag.name)             # html head title body p b p a a a p

# Function: if no other filter fits, define a function that takes a single tag argument
# and returns True when the tag should be included, False otherwise
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
print(soup.find_all(has_class_but_no_id))
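
To compare the five filter types at a glance, the sketch below prints only tag names. It uses a reduced, invented fragment rather than the full test data, assumes html.parser (which does not add html or head wrappers), and names is a hypothetical helper.

```python
import re

from bs4 import BeautifulSoup

html = ('<body><p class="title"><b id="bbb">Title</b></p>'
        '<a class="sister" id="link1">Elsie</a></body>')
soup = BeautifulSoup(html, "html.parser")


def names(tags):
    """Hypothetical helper: reduce a result list to its tag names."""
    return [tag.name for tag in tags]


def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


by_string = names(soup.find_all("b"))                  # ['b']
by_regex = names(soup.find_all(re.compile("^b")))      # ['body', 'b']
by_list = names(soup.find_all(["a", "b"]))             # ['b', 'a']
by_true = names(soup.find_all(True))                   # ['body', 'p', 'b', 'a']
by_function = names(soup.find_all(has_class_but_no_id))  # ['p']

print(by_string, by_regex, by_list, by_true, by_function)
```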

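2 Searching by class name and other keyword arguments

Because class is a reserved word in Python, the keyword is class_: class_=value, where value can be any of the five filter types

print(soup.find_all('a', class_='sister'))  # a tags whose class is sister
print(soup.find_all('a', id='link3'))  # the a tag whose id is link3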
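
A sketch of class_ with several filter types. The fragment and the ids helper are invented for illustration, and html.parser is an assumption.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment: one tag with two classes, one with one class, one with none.
html = ('<a class="sister big" id="link1">Elsie</a>'
        '<a class="brother" id="link2">Tom</a>'
        '<a id="link3">Anon</a>')
soup = BeautifulSoup(html, "html.parser")


def ids(tags):
    """Hypothetical helper: reduce a result list to the id attributes."""
    return [tag["id"] for tag in tags]


# A string matches any single class on a multi-class tag.
by_string = ids(soup.find_all("a", class_="sister"))                # ['link1']
# A regular expression is tested against each class.
by_regex = ids(soup.find_all("a", class_=re.compile("er$")))        # ['link1', 'link2']
# A list matches any of its items.
by_list = ids(soup.find_all("a", class_=["sister", "brother"]))     # ['link1', 'link2']
# True matches any tag that has a class attribute at all.
by_true = ids(soup.find_all("a", class_=True))                      # ['link1', 'link2']

print(by_string, by_regex, by_list, by_true)
```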

3 attrs

print(soup.find_all('p', attrs={'class': 'story'}))  # p tags whose class is story
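
The attrs dictionary is most useful for attribute names that cannot be written as Python keyword arguments, such as data-* attributes. A sketch with invented markup (data-kind is a made-up attribute) and html.parser:

```python
from bs4 import BeautifulSoup

html = ('<p class="story" data-kind="intro">A</p>'
        '<p class="story">B</p>'
        '<p class="title">C</p>')
soup = BeautifulSoup(html, "html.parser")

stories = soup.find_all("p", attrs={"class": "story"})
print(len(stories))                                   # 2

# data-kind contains a hyphen, so it cannot be passed as a keyword argument.
intro = soup.find_all("p", attrs={"data-kind": "intro"})
print(intro[0].get_text())                            # A

# Several keys must all match.
both = soup.find_all("p", attrs={"class": "story", "data-kind": "intro"})
print(len(both))                                      # 1
```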

4 text

The value can be a string, a list, True, or a regular expression

print(soup.find_all(text='Elsie'))  # ['Elsie']
print(soup.find_all('a', text='Elsie'))  # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
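
The same searches can be written with the string argument, the newer name for text (Beautiful Soup 4.4.0 and later). A sketch on an invented fragment, assuming html.parser; results are converted with str() so the printed lists are plain strings.

```python
import re

from bs4 import BeautifulSoup

html = ('<a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
        '<a id="link3">Tillie</a>')
soup = BeautifulSoup(html, "html.parser")

# Without a tag name, the matches are the text nodes themselves.
exact = [str(s) for s in soup.find_all(string="Elsie")]
print(exact)                                              # ['Elsie']

from_list = [str(s) for s in soup.find_all(string=["Tillie", "Elsie"])]
print(from_list)                                          # ['Elsie', 'Tillie'] (document order)

by_regex = [str(s) for s in soup.find_all(string=re.compile("ie$"))]
print(by_regex)                                           # ['Elsie', 'Lacie', 'Tillie']

# With a tag name, the matches are tags whose string equals the value.
tags = soup.find_all("a", string="Elsie")
print(tags[0]["id"])                                      # link1
```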

5 limit

Limits the number of results returned

print(soup.find_all('a', limit=2))

6 recursive

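Defaults to True, which searches all descendants of the current tag. To search only the tag's direct children, pass recursive=False

print(soup.html.find_all('a'))
# search direct children only
print(soup.html.find_all('a', recursive=False))  # [] for the test data: no <a> is a direct child of <html>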
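
The effect of recursive=False is clearer on a tiny document where the nesting is obvious. A sketch with invented markup and html.parser:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>T</title></head><body><a>link</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Default: every descendant of <html> is searched.
deep = soup.html.find_all("a")
print(len(deep))                                          # 1

# recursive=False: only direct children of <html> (head and body) are checked.
shallow = soup.html.find_all("a", recursive=False)
print(shallow)                                            # []

direct_children = [t.name for t in soup.html.find_all(True, recursive=False)]
print(direct_children)                                    # ['head', 'body']

# <a> is a direct child of <body>, so it is found from there.
from_body = [t.name for t in soup.body.find_all("a", recursive=False)]
print(from_body)                                          # ['a']
```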

find()

find() takes the same parameters as find_all(), apart from limit
soup.find('a') is equivalent to soup.a; it returns only the first matching tag
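
One practical difference is what each method returns when nothing matches: find() gives None, find_all() gives an empty list. A sketch on an invented fragment, assuming html.parser:

```python
from bs4 import BeautifulSoup

html = "<p><a>one</a><a>two</a></p>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")
print(first.get_text())                 # one
print(soup.a is first)                  # True: soup.a is shorthand for find('a')

missing_one = soup.find("table")
missing_all = soup.find_all("table")
print(missing_one)                      # None
print(missing_all)                      # []

# Guard against None before touching attributes or text.
text = missing_one.get_text() if missing_one is not None else ""
print(repr(text))                       # ''
```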

select (CSS selectors)

select() accepts CSS selectors

Returns a list

print(soup.select('.sister'))  # tags whose class is sister
print(soup.select("#link2"))  # the tag whose id is link2
print(soup.select('.c1 a'))  # a tags inside a tag whose class is c1
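
The test data has no c1 class, so the last selector returns an empty list there. The sketch below uses an invented fragment that does contain one, and also shows select_one(), which returns the first match or None instead of a list. html.parser is an assumption.

```python
from bs4 import BeautifulSoup

html = ('<div class="c1"><a class="sister" id="link1">Elsie</a>'
        '<a class="sister" id="link2">Lacie</a></div>'
        '<a id="out">Outside</a>')
soup = BeautifulSoup(html, "html.parser")

sisters = [a["id"] for a in soup.select(".sister")]
print(sisters)                                  # ['link1', 'link2']

by_id = soup.select("#link2")
print(by_id[0].get_text())                      # Lacie

inside_c1 = [a["id"] for a in soup.select(".c1 a")]
print(inside_c1)                                # ['link1', 'link2'] (the outside link is excluded)

# select_one returns a single tag (or None) rather than a list.
first_inside = soup.select_one(".c1 a")
print(first_inside["id"])                       # link1

print(soup.select(".missing"))                  # []
```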
