BeautifulSoup: CSS selectors

BeautifulSoup supports most CSS selectors. The syntax is simple: pass a string argument to the .select() method of a tag or of the soup object, and the matches are returned as a list.

  tag.select("string")

  BeautifulSoup.select("string")

 

Sample source code:

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title" name="dromouse">
<b>The Dormouse's story</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="mysis" href="http://example.com/elsie" id="link1">
<b>the first b tag</b>
Elsie
</a>,
<a class="mysis" href="http://example.com/lacie" id="link2" myname="kong">
Lacie
</a>and
<a class="mysis" href="http://example.com/tillie" id="link3">
Tillie
</a>;and they lived at the bottom of a well.
</p>
<p class="story">
myStory
<a>the end a tag</a>
</p>
<a>the p tag sibling</a>
</body>
</html>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
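A quick check of the behaviour described above, run against the soup object just built (the counts refer to the sample HTML as written):

links = soup.select("a")
print(isinstance(links, list))     # True -- select() returns a list (a ResultSet, which subclasses list)
print(len(links))                  # 5 -- there are five a tags in the sample document
print(len(soup.body.select("a")))  # 5 -- select() can also be called on a single tag such as soup.body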

  1. Selecting by tag

# select all title tags
soup.select("title")
# select the third p tag; equivalent to soup.select("p")[2]
soup.select("p:nth-of-type(3)")
# select all a tags that are descendants of the body tag
soup.select("body a")
# select a tags that are direct children of the body tag
soup.select("body > a")
# select all following siblings of the node with id=link1 that have class mysis
soup.select("#link1 ~ .mysis")
# select the immediately following sibling of the node with id=link1 that has class mysis
soup.select("#link1 + .mysis")

  2. Searching by class name

# select a tags whose class attribute is mysis
soup.select("a.mysis")

  3. Searching by id

# select a tags whose id attribute is link1
soup.select("a#link1")

  4. Searching by attribute (this also works for the class attribute)

# select a tags that have a myname attribute
soup.select("a[myname]")
# select a tags whose href attribute equals http://example.com/lacie
soup.select("a[href='http://example.com/lacie']")
# select a tags whose href attribute starts with http
soup.select('a[href^="http"]')
# select a tags whose href attribute ends with lacie
soup.select('a[href$="lacie"]')
# select a tags whose href attribute contains .com
soup.select('a[href*=".com"]')
# remove a tag from the document; after this, soup no longer contains any script tags
[s.extract() for s in soup('script')]
# to remove several tag types at once, pass a list of tag names
[s.extract() for s in soup(['script', 'iframe'])]
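extract() removes the matched tags from the tree in place. A self-contained sketch (the HTML string here is made up purely for illustration):

from bs4 import BeautifulSoup

doc = BeautifulSoup("<body><script>var x = 1;</script><p>kept</p></body>", "html.parser")
[s.extract() for s in doc("script")]  # doc('script') is shorthand for doc.find_all('script')
print(doc("script"))                  # [] -- the script tag is gone
print(doc.p.get_text())               # kept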

  

  5. Getting text and attributes

html_doc = """<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    </p>
        and they lived at the bottom of a well.
    <p class="story">...</p>
</body>
"""
from bs4 import BeautifulSoup
'''
Results are returned as a list
'''
soup = BeautifulSoup(html_doc, 'html.parser')
s = soup.select('p.story')
s[0].get_text()  # text of the p node and all of its descendants
s[0].get_text("|")  # join the pieces of text with the given separator
s[0].get_text("|", strip=True)  # also strip whitespace from each piece of text
print(s[0].get("class"))  # the p node's class attribute as a list (attributes other than class come back as strings)
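The list-versus-string distinction can be checked directly, and .attrs exposes all of a tag's attributes as a dict (using the soup built from html_doc above):

a = soup.select('a#link2')[0]
print(a.get("class"))  # ['sister'] -- class is multi-valued, so it comes back as a list
print(a.get("id"))     # link2 -- ordinary attributes come back as strings
print(a.attrs)         # {'href': 'http://example.com/lacie', 'class': ['sister'], 'id': 'link2'}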

 

  6. The UnicodeDammit.detwingle() method can only decode Windows-1252 content embedded in a UTF-8 document:

from bs4 import UnicodeDammit

# doc is a byte string that mixes UTF-8 and Windows-1252 content
new_doc = UnicodeDammit.detwingle(doc)
print(new_doc.decode("utf8"))
# ☃☃☃“I like snowmen!”

Always call UnicodeDammit.detwingle() on the document before creating a BeautifulSoup or UnicodeDammit object from it, so that the encoding is handled correctly. If you try to parse a UTF-8 document with embedded Windows-1252 content directly, the mixed-encoding parts come out as mojibake.
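A complete, runnable version of the snippet above, modelled on the example in the Beautiful Soup documentation (the snowman and quote strings are that example's sample data):

from bs4 import UnicodeDammit

snowmen = "\N{SNOWMAN}" * 3
quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"

# a single byte string whose two halves use different encodings
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

new_doc = UnicodeDammit.detwingle(doc)
print(new_doc.decode("utf8"))  # ☃☃☃“I like snowmen!”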

  

  7. More examples:

html_doc = """<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    </p>
        and they lived at the bottom of a well.
    <p class="story">...</p>
</body>
"""
from bs4 import BeautifulSoup
'''
Results are returned as a list
'''
soup = BeautifulSoup(html_doc, 'html.parser')
soup.select('title')  # title tags
soup.select("p:nth-of-type(3)")  # the third p node
soup.select('body a')  # all a nodes that are descendants of body
soup.select('p > a')  # a nodes that are direct children of a p node
soup.select('p > #link1')  # direct children of p nodes that have id=link1
soup.select('#link1 ~ .sister')  # all following siblings of the id=link1 node that have class=sister
soup.select('#link1 + .sister')  # the immediately following sibling of the id=link1 node with class=sister
soup.select('.sister')  # all nodes with sister among their classes
soup.select('[class="sister"]')  # all nodes whose class attribute value is exactly "sister"
soup.select("#link1")  # the node with id=link1
soup.select("a#link1")  # a nodes with id=link1
soup.select('a[href]')  # all a nodes that have an href attribute
soup.select('a[href="http://example.com/elsie"]')  # a nodes with exactly this href value
soup.select('a[href^="http://example.com/"]')  # a nodes whose href starts with the given value
soup.select('a[href$="tillie"]')  # a nodes whose href ends with the given value
soup.select('a[href*=".com/el"]')  # a nodes whose href contains the given substring (*= is a substring match, not a regex)
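Putting it together, a typical pattern is to select nodes and then pull text and attributes out of each match:

for a in soup.select('p.story > a.sister'):
    print(a.get('id'), a['href'], a.get_text(strip=True))
# link1 http://example.com/elsie Elsie
# link2 http://example.com/lacie Lacie
# link3 http://example.com/tillie Tillie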