BeautifulSoup Usage Guide - 0x03_Searching the Parse Tree

Date: 2020-01-01


GitHub@ orca-j35: all notes are hosted in the python_notes repository.
Reposting in any form is welcome, but please credit the source.
Reference: https://www.crummy.com/softwa...

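Overview

BeautifulSoup defines many methods for searching the parse tree, but they are all very similar: most accept the same parameters as find_all() (name, attrs, string, limit, and **kwargs), though only find() and find_all() support the recursive parameter.

This note focuses on find() and find_all(); the other search methods work much like these two.

Three sisters

This section uses the "three sisters" document as its example: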

html_doc = """
<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>

    <p class="story">...</p>
"""
from pprint import pprint
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html_doc, 'html.parser')

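Filters

A filter selects target nodes in the parse tree and is passed as an argument to a search method.

Strings

A string can be used as a filter. BeautifulSoup uses the string to screen nodes and keeps those that match:

When a string filters tags, tags whose name equals the string are kept, and HTML text nodes are always filtered out.
When a string filters an HTML attribute, tags whose attribute value equals the string are kept, and HTML text nodes are always filtered out.
When a string filters HTML text, text nodes equal to the string are kept.

As with str, a bytes object can also be used as a filter; the difference is that BeautifulSoup assumes it is encoded as UTF-8.

Example: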

soup = BeautifulSoup(html_doc, 'html.parser')
# Find tag nodes named b
print([f"{type(i)}::{i.name}" for i in soup.find_all('b')])
print([f"{type(i)}::{i.name}" for i in soup.find_all(b'b')])
# Find tag nodes whose id is link1
print([f"{type(i)}::{i.name}" for i in soup.find_all(id='link1')])
# Find text nodes whose text is Elsie
print([f"{type(i)}::{i.name}" for i in soup.find_all(text='Elsie')])

Output:

["<class 'bs4.element.Tag'>::b"]
["<class 'bs4.element.Tag'>::b"]
["<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.NavigableString'>::None"]

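Regular expressions

A compiled regular expression object can be used as a filter. BeautifulSoup uses the object's search() method to screen nodes and keeps those that match:

When a regular expression filters tags, its search() method is applied to each tag name, and matching tags are kept. Because a text node's .name is None, HTML text nodes are always filtered out.
When a regular expression filters an HTML attribute, its search() method is applied to the value of that attribute, and matching tags are kept. Because text nodes have no HTML attributes, HTML text nodes are always filtered out.
When a regular expression filters HTML text, its search() method is applied to each text node, and matching text nodes are kept.

Example: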

import re

soup = BeautifulSoup(html_doc, 'html.parser')
# Find nodes whose name contains the letter b
print([f"{type(i)}::{i.name}" for i in soup.find_all(re.compile(r'b'))])
# Find tags whose class value starts with t
print(
    [f"{type(i)}::{i.name}" for i in soup.find_all(class_=re.compile(r'^t'))])
# Find text nodes whose text starts with E
print([f"{type(i)}::{i.name}" for i in soup.find_all(text=re.compile(r'^E'))])

Output:

["<class 'bs4.element.Tag'>::body", "<class 'bs4.element.Tag'>::b"]
["<class 'bs4.element.Tag'>::p"]
["<class 'bs4.element.NavigableString'>::None"]

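Lists

A list can be used as a filter. Its items may be:

Strings
Regular expression objects
Callables (see Functions)

BeautifulSoup uses the items in the list to screen nodes and keeps those that match:

When a list filters tags, a tag is kept if its name matches any item in the list, and HTML text nodes are always filtered out.
When a list filters an HTML attribute, a tag is kept if the attribute value matches any item in the list, and HTML text nodes are always filtered out.
When a list filters HTML text, a text node is kept if its text matches any item in the list.

Example: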

import re
def func(tag):
    return tag.get('id') == "link1"

soup = BeautifulSoup(html_doc, 'html.parser')
# Find tag nodes that match any item in the list
tag = soup.find_all(['title', re.compile('b$'), func])
pprint([f"{type(i)}::{i.name}" for i in tag])
pprint(
    [f"{type(i)}::{i.name}" for i in soup.find_all(text=["Elsie", "Tillie"])])

Output:

["<class 'bs4.element.Tag'>::title",
 "<class 'bs4.element.Tag'>::b",
 "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None"]

True

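The Boolean value True can be used as a filter:

When True filters tags, all tag nodes are kept and all HTML text nodes are filtered out.
When True filters an HTML attribute, all tags that have that attribute are kept and all HTML text nodes are filtered out.
When True filters HTML text, all text nodes are kept.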

soup = BeautifulSoup(html_doc, 'html.parser')
pprint([f"{type(i)}::{i.name}" for i in soup.find_all(True)])
pprint([f"{type(i)}::{i.name}" for i in soup.find_all(id=True)])
pprint([f"{type(i)}::{i.name}" for i in soup.find_all(text=True)])

Output:

["<class 'bs4.element.Tag'>::html",
 "<class 'bs4.element.Tag'>::head",
 "<class 'bs4.element.Tag'>::title",
 "<class 'bs4.element.Tag'>::body",
 "<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::b",
 "<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::p"]
["<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None"]

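Functions

A filter can be a function (or any callable):

When filtering tag nodes, the filter function takes a tag node as its argument. If the function returns True the tag is kept; otherwise it is discarded.

Example - select tags that have a class attribute but no id attribute: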

def has_class_but_no_id(tag):
    # Returns True if a tag defines the "class" attribute but doesn't define the "id" attribute
    return tag.has_attr('class') and not tag.has_attr('id')


soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find_all(has_class_but_no_id)
pprint([f"{type(i)}::{i.name}" for i in tag])

Output:

["<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::p"]

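When filtering on an HTML attribute, the filter function receives the attribute value, not the whole tag. If the tag does not have the target attribute, None is passed to the function; otherwise the actual value is passed. If the function returns True the tag is kept; otherwise it is discarded.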

def not_lacie(href):
    # Here’s a function that finds all a tags whose href attribute does not match a regular expression
    return href and not re.compile("lacie").search(href)


soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find_all(href=not_lacie)
for i in tag:
    print(f"{type(i)}::{i.name}::{i}")

Output:

<class 'bs4.element.Tag'>::a::<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<class 'bs4.element.Tag'>::a::<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
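To see which values actually reach an attribute filter, the sketch below records every argument the function is called with. The three-element document and the names `seen` and `has_elsie` are made up for illustration. The `is not None` guard is the important part, since tags without an href are expected to arrive as None.

```python
from bs4 import BeautifulSoup

# Hypothetical document: one tag with href, two tags without it.
html = ('<p>no link</p>'
        '<a href="http://example.com/elsie">Elsie</a>'
        '<a>anchor without href</a>')
soup = BeautifulSoup(html, "html.parser")

seen = []  # every value Beautiful Soup hands to the filter


def has_elsie(href):
    seen.append(href)
    # Guard against None before using string operations on the value.
    return href is not None and "elsie" in href


matches = soup.find_all(href=has_elsie)
print(matches)  # the tags that were kept
print(seen)     # the raw values the filter received
```

Printing `seen` shows exactly what the filter was given, which is a quick way to debug a filter that raises on unexpected input.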

When filtering HTML text, the filter function receives the text value, not the whole tag. If the function returns True the text node is kept; otherwise it is discarded.

def func(text):
    return text == "Lacie"

soup = BeautifulSoup(html_doc, 'html.parser')
print([f"{type(i)}::{i}" for i in soup.find_all(text=func)])

Output:

["<class 'bs4.element.NavigableString'>::Lacie"]

Filter functions can be made quite elaborate, for example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import NavigableString

def surrounded_by_strings(tag):
    # returns True if a tag is surrounded by string objects
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find_all(surrounded_by_strings)
pprint([f"{type(i)}::{i.name}" for i in tag])
# Note how whitespace affects the result

Output:

["<class 'bs4.element.Tag'>::body",
 "<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::p"]

find_all()🔨

🔨find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

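This method searches all descendants of the current tag, extracts every node that matches the given criteria, and returns them in a list.

name parameter

name is a filter on tag names; find_all() keeps the tags that match it. When name is used, HTML text nodes are filtered out automatically, because a text node's .name is None.

All five filter types described above can be used for name: string, regular expression, list, True, and function (callable).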

soup = BeautifulSoup(html_doc, 'html.parser')
print([f"{type(i)}::{i.name}" for i in soup.find_all('title')])
#> ["<class 'bs4.element.Tag'>::title"]

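**kwargs parameter

Keyword arguments that are not part of the function signature are treated as HTML attribute filters; find_all() keeps the tags whose attribute values match them. When such a keyword argument is used, HTML text nodes are filtered out automatically, because text nodes have no HTML attributes.

All five filter types described above can be used as the value of such a keyword argument: string, regular expression, list, True, and function (callable).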

soup = BeautifulSoup(html_doc, 'html.parser')
# Find the tag whose id is link2
print([f"{type(i)}::{i.name}" for i in soup.find_all(id='link2')])
# Find tags whose href ends with the letter 'e'
print([f"{type(i)}::{i.name}" for i in soup.find_all(href=re.compile(r"e$"))])
# Find tags that have an id attribute
print([f"{type(i)}::{i.name}" for i in soup.find_all(id=True)])
# Filter on several HTML attributes at once
print([
    f"{type(i)}::{i.name}"
    for i in soup.find_all(class_="sister", href=re.compile(r"tillie"))
])

Output:

["<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.Tag'>::a"]

string

The var-keyword argument string is equivalent to the text parameter:

soup = BeautifulSoup(html_doc, 'html.parser')
print([f"{type(i)}::{i}" for i in soup.find_all(string=re.compile("sisters"))])
#> ["<class 'bs4.element.NavigableString'>::Once upon a time there were three little sisters; and their names were\n        "]
print([f"{type(i)}::{i}" for i in soup.find_all(text=re.compile("sisters"))])
#> ["<class 'bs4.element.NavigableString'>::Once upon a time there were three little sisters; and their names were\n        "]

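string was added in Beautiful Soup 4.4.0; earlier versions only accept text.

Exceptions

Some HTML5 attribute names are not valid Python identifiers and cannot be used as keyword arguments. Use the attrs parameter to filter on such attributes: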

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
# data-foo is not a valid Python identifier, so this line does not even parse:
# data_soup.find_all(data-foo="value")
#> SyntaxError: keyword can't be an expression

print([
    f"{type(i)}::{i.name}"
    for i in data_soup.find_all(attrs={"data-foo": "value"})
])
#> ["<class 'bs4.element.Tag'>::div"]

Keyword arguments cannot filter on the HTML name attribute, because find_all() already uses name as a parameter name. To filter on the name attribute, use attrs instead.

soup = BeautifulSoup(html_doc, 'html.parser')
name_soup = BeautifulSoup('<input name="email"/>', 'html.parser')
print([f"{type(i)}::{i.name}" for i in name_soup.find_all(name="email")])
print([
    f"{type(i)}::{i.name}" for i in name_soup.find_all(attrs={"name": "email"})
])

Output:

[]
["<class 'bs4.element.Tag'>::input"]

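Searching by CSS class

class, the name of the CSS class attribute, is a reserved word in Python. Since BeautifulSoup 4.1.2 you can filter on the CSS class with the keyword argument class_. When it is used, HTML text nodes are filtered out automatically, because text nodes have no HTML attributes.

All five filter types described above can be used as the value of class_: string, regular expression, list, True, and function (callable).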

# Find a tags whose class is sister
soup = BeautifulSoup(html_doc, 'html.parser')
pprint([f"{type(i)}::{i.name}" for i in soup.find_all("a", class_="sister")])

# Find tags whose class contains the substring itl
pprint(
    [f"{type(i)}::{i.name}" for i in soup.find_all(class_=re.compile("itl"))])

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6
# Find tags whose class value is 6 characters long
pprint(
    [f"{type(i)}::{i.name}" for i in soup.find_all(class_=has_six_characters)])

pprint(
    [f"{type(i)}::{i.name}" for i in soup.find_all(class_=['title', "story"])])

Output:

["<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.Tag'>::p"]
["<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::p"]

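A CSS class attribute can hold several values. If class_ matches a single value, every tag that carries that CSS class is selected. If class_ is given a string with several values, it is compared against the exact class attribute string, so the same classes in a different order fail to match: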

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
print(css_soup.find_all(class_='body'))
#> [<p class="body strikeout"></p>]
print(css_soup.find_all(class_='strikeout'))
#> [<p class="body strikeout"></p>]

print(css_soup.find_all("p", class_="body strikeout"))
#> [<p class="body strikeout"></p>]
print(css_soup.find_all("p", class_="strikeout body"))
#> []

So when you need to find tags by several CSS classes at once, use a CSS selector to avoid failures caused by class order:

print(css_soup.select("p.strikeout.body"))
#> [<p class="body strikeout"></p>]

Before BeautifulSoup 4.1.2 the class_ argument is not available; use attrs for the search instead:

soup = BeautifulSoup(html_doc, 'html.parser')
pprint(
    [f"{type(i)}::{i.name}" for i in soup.find_all(attrs={"class": "sister"})])

pprint([f"{type(i)}::{i.name}" for i in soup.find_all(attrs="sister")])

Output:

["<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a"]

attrs parameter

Two kinds of argument can be passed to attrs:

A filter - .find_all() looks for tags whose CSS class matches the filter. All five filter types described above can be used.

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all("p", "title"))
#> [<p class="title"><b>The Dormouse's story</b></p>]

print([f"{type(i)}::{i.name}" for i in soup.find_all(attrs="sister")])
#> ["<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a"]

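A mapping - .find_all() treats each key-value pair as an HTML attribute name and value, and finds the tags with matching attributes. All five filter types described above can be used as values in the mapping.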

soup = BeautifulSoup(html_doc, 'html.parser')

pprint([
    f"{type(i)}::{i.name}" for i in soup.find_all(attrs={
        "class": "sister",
        "id": "link1",
    })
])
#> ["<class 'bs4.element.Tag'>::a"]

text/string parameter

The string argument is new in Beautiful Soup 4.4.0. In earlier versions it was called text.

text is a filter on text nodes; find_all() keeps the text nodes that match it. All five filter types described above can be passed as text.

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find_all(string="Elsie"))
print(soup.find_all(string=["Tillie", "Elsie", "Lacie"]))
print(soup.find_all(string=re.compile("Dormouse")))


def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)


print(soup.find_all(string=is_the_only_string_within_a_tag))

Output:

['Elsie']
['Elsie', 'Lacie', 'Tillie']
["The Dormouse's story", "The Dormouse's story"]
["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']

When searching for tags, text acts as an additional condition: find_all() selects the tags whose .string matches the text filter:

soup = BeautifulSoup(html_doc, 'html.parser')

print([f'{type(i)}::{i}' for i in soup.find_all("a", string="Elsie")])
#> ['<class \'bs4.element.Tag\'>::<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>']

limit parameter

By default find_all() returns every match. If you do not need them all, use limit to cap the number of results; BeautifulSoup stops searching once it has found limit matches.

soup = BeautifulSoup(html_doc, 'html.parser')
# There are three links in the "three sisters" document,
# but this code only finds the first two
print([f'{type(i)}::{i.name}' for i in soup.find_all("a", limit=2)])
#> ["<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a"]

recursive parameter

By default find_all() searches all descendants of the current tag. To avoid a recursive search, set recursive=False, which restricts the search to direct children:

soup = BeautifulSoup(html_doc, 'html.parser')

print([f'{type(i)}::{i.name}' for i in soup.find_all("title")])
#> ["<class 'bs4.element.Tag'>::title"]
print(
    [f'{type(i)}::{i.name}' for i in soup.find_all("title", recursive=False)])
#> []
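The search above starts from the BeautifulSoup object itself, whose direct child is <html>. As a complementary sketch (the one-line document here is a made-up, whitespace-free stand-in), the same non-recursive search can be started from different tags to see where <title> counts as a direct child:

```python
from bs4 import BeautifulSoup

# Hypothetical compact document without whitespace between tags.
soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p>text</p></body></html>", "html.parser")

# <title> is a grandchild of <html>, so a non-recursive search misses it.
from_html = soup.html.find_all("title", recursive=False)
# It is a direct child of <head>, so this search finds it.
from_head = soup.head.find_all("title", recursive=False)
# True with recursive=False lists only the direct child tags.
children = [t.name for t in soup.html.find_all(True, recursive=False)]

print(from_html, from_head, children)
```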

Calling a `Tag` object

find_all() is the most commonly used search method in BeautifulSoup, so the developers provide a shortcut: calling a Tag object calls its find_all() method. The source code is:

def __call__(self, *args, **kwargs):
    """Calling a tag like a function is the same as calling its
        find_all() method. Eg. tag('a') returns a list of all the A tags
        found within this tag."""
    return self.find_all(*args, **kwargs)

Example:

soup("a")  # equivalent to soup.find_all("a")
soup.title(string=True)  # equivalent to soup.title.find_all(string=True)

find()🔨

🔨find(name, attrs, recursive, string, **kwargs)

find() returns only the first matching object, or None if nothing matches. Navigating the parse tree by tag name actually uses find() under the hood.
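A minimal sketch of these three points, using a small made-up document: find() gives the same node as the first item of find_all(..., limit=1), a failed find() yields None while find_all() yields an empty list, and attribute-style navigation resolves to the same node as chained find() calls.

```python
from bs4 import BeautifulSoup

# Hypothetical document used only for this illustration.
soup = BeautifulSoup(
    '<html><head><title>Story</title></head><body>'
    '<a id="link1">Elsie</a><a id="link2">Lacie</a></body></html>',
    "html.parser")

first = soup.find("a")                     # a single Tag
same = soup.find_all("a", limit=1)[0]      # first item of a one-element list

missing_one = soup.find("table")           # no match: None
missing_all = soup.find_all("table")       # no match: empty list

by_name = soup.head.title                  # navigation by tag name
by_find = soup.find("head").find("title")  # explicit find() calls

print(first, missing_one, missing_all, by_name)
```

Because find() can return None, chaining such as soup.find("table").find("tr") needs a None check first.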

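Other search methods

To follow the methods below, cross-reference the "Navigating the parse tree" section of the note ﹝BeautifulSoup - 解析樹.md﹞ so that the structure of the parse tree is clear.
This section does not explain each method in detail; it mainly gives the signatures and links to the documentation.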

find_parents()&find_parent()🔨

🔨find_parents(name, attrs, string, limit, **kwargs)

🔨find_parent(name, attrs, string, **kwargs)

See: https://www.crummy.com/softwa...
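As a short sketch of how these two methods walk upward from a node, the example below starts from the text node "Lacie" in a condensed copy of the three sisters document (the condensed markup is an assumption made to keep the block short):

```python
from bs4 import BeautifulSoup

# Condensed copy of the "three sisters" document.
html_doc = """<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

a_string = soup.find(string="Lacie")

# All ancestors named "a": only the link that contains the string.
link_parents = a_string.find_parents("a")
# The nearest ancestor named "p".
paragraph = a_string.find_parent("p")
# No ancestor is a <p class="title">, so the result is empty.
title_parents = a_string.find_parents("p", class_="title")

print(link_parents, paragraph["class"], title_parents)
```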

find_next_siblings()&find_next_sibling()🔨

🔨find_next_siblings(name, attrs, string, limit, **kwargs)

🔨find_next_sibling(name, attrs, string, **kwargs)

See: https://www.crummy.com/softwa...
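A brief sketch of the forward sibling search, again on a condensed copy of the three sisters document (an assumption made for brevity). Only nodes that share the same parent and come later are considered:

```python
from bs4 import BeautifulSoup

# Condensed copy of the "three sisters" document.
html_doc = """<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a
# Later siblings of the first link that are <a> tags, in document order.
later_links = first_link.find_next_siblings("a")

first_story = soup.find("p", class_="story")
# The first later sibling that is a <p> tag.
next_paragraph = first_story.find_next_sibling("p")

print([t["id"] for t in later_links], next_paragraph)
```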

find_previous_siblings()&find_previous_sibling()🔨

🔨find_previous_siblings(name, attrs, string, limit, **kwargs)

🔨find_previous_sibling(name, attrs, string, **kwargs)

See: https://www.crummy.com/softwa...
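The backward sibling search can be sketched the same way, on a condensed copy of the three sisters document (an assumption made for brevity). Note that results are ordered from the nearest sibling outward:

```python
from bs4 import BeautifulSoup

# Condensed copy of the "three sisters" document.
html_doc = """<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

last_link = soup.find("a", id="link3")
# Earlier siblings that are <a> tags, nearest first.
earlier_links = last_link.find_previous_siblings("a")

first_story = soup.find("p", class_="story")
# The nearest earlier sibling that is a <p> tag.
previous_paragraph = first_story.find_previous_sibling("p")

print([t["id"] for t in earlier_links], previous_paragraph["class"])
```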

find_all_next()&find_next()🔨

🔨find_all_next(name, attrs, string, limit, **kwargs)

🔨find_next(name, attrs, string, **kwargs)

See: https://www.crummy.com/softwa...
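Unlike the sibling methods, these two follow document order, so they also reach the starting tag's own contents and nodes outside its parent. A small sketch on a condensed copy of the three sisters document (an assumption made for brevity):

```python
from bs4 import BeautifulSoup

# Condensed copy of the "three sisters" document.
html_doc = """<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a
# Every <a> tag that appears later in the document.
later_links = first_link.find_all_next("a")
# The very next text node is the link's own text.
next_string = first_link.find_next(string=True)
# The next <p> is not a sibling of the link; it follows the enclosing <p>.
next_paragraph = first_link.find_next("p")

print([t["id"] for t in later_links], next_string, next_paragraph)
```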

find_all_previous()&find_previous()🔨

🔨find_all_previous(name, attrs, string, limit, **kwargs)

🔨find_previous(name, attrs, string, **kwargs)

See: https://www.crummy.com/softwa...
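The backward counterparts walk document order in reverse. In the sketch below (condensed copy of the three sisters document, an assumption made for brevity), the paragraph that contains the link counts as "previous" because its opening tag comes before the link:

```python
from bs4 import BeautifulSoup

# Condensed copy of the "three sisters" document.
html_doc = """<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a
# Every <p> that starts before the link, nearest first.
earlier_paragraphs = first_link.find_all_previous("p")
# The nearest earlier <title> tag.
title = first_link.find_previous("title")

print([p["class"] for p in earlier_paragraphs], title)
```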

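CSS selectors

Since version 4.7.0, BeautifulSoup supports most CSS4 selectors through the SoupSieve project. If you install BeautifulSoup with pip, SoupSieve is installed automatically.

The official SoupSieve documentation describes the API and the CSS selectors currently supported. Besides the .select() covered in this section, the API also includes: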

.select_one()
.iselect()
.closest()
.match()
.filter()
.comments()
.icomments()
.escape()
.compile()
.purge()

In short, for full details on SoupSieve, refer to its official documentation.
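As a rough sketch of a few of the functions listed above, the example below calls select_one() through BeautifulSoup and match(), closest(), and filter() through the soupsieve module directly. It assumes soupsieve is importable (it is normally installed alongside BeautifulSoup 4.7.0+), and the condensed document is made up for brevity.

```python
import soupsieve as sv
from bs4 import BeautifulSoup

# Condensed, hypothetical variant of the "three sisters" document.
html_doc = """<html><body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p></body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# select_one(): the first match, or None when nothing matches.
first = soup.select_one("a.sister")
nothing = soup.select_one("table")

link = soup.find(id="link2")
# match(): does this tag match the selector?
is_sister = sv.match("a.sister", link)
# closest(): the nearest matching tag among the tag itself and its ancestors.
enclosing = sv.closest("p", link)
# filter(): keep only the tags from an iterable that match the selector.
kept = sv.filter("#link1, #link3", soup.find_all("a"))

print(first, nothing, is_sister, enclosing["class"], kept)
```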

When learning CSS, the "jQuery selector tester" is a good way to see the effect of different selectors; you can also cross-reference the note ﹝PyQuery.md﹞.

select()🔨

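.select() is available on both BeautifulSoup objects and Tag objects.

From 4.7.0 onward, .select() uses SoupSieve to extract every node that matches the CSS selector and returns them in a list.

Before 4.7.0, .select() is also available, but older versions support only the most common CSS selectors.

Element selectors: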

print(soup.select("title"))
#> [<title>The Dormouse's story</title>]

print(soup.select("p:nth-of-type(3)"))
#> [<p class="story">...</p>]

Nested (descendant) selectors:

print(soup.select("body a"))
#> [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#>  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#>  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select("html head title"))
#> [<title>The Dormouse's story</title>]
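A few more selector forms, sketched on a condensed copy of the three sisters document (an assumption made for brevity): class, id, attribute suffix, child combinator, and general sibling combinator.

```python
from bs4 import BeautifulSoup

# Condensed copy of the "three sisters" document.
html_doc = """<html><head><title>The Dormouse's story</title></head><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

by_class = soup.select("p.story")                # <p> tags with class "story"
by_id = soup.select("#link2")                    # the tag with id "link2"
by_attr = soup.select('a[href$="tillie"]')       # href ends with "tillie"
children = soup.select("p > a")                  # <a> directly inside a <p>
siblings = soup.select("#link1 ~ .sister")       # later siblings of #link1

for result in (by_class, by_id, by_attr, children, siblings):
    print(result)
```

Since select() always returns a list, an unmatched selector gives an empty list rather than None.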

For more examples, see: https://www.crummy.com/softwa...