Python Scraping Toolkit: Searching the Document Tree with Beautiful Soup 4

Date: 2019-12-05


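The previous two posts covered Beautiful Soup 4's basic object types and how to traverse the document tree. This post covers searching the document tree.

Searching the document tree mainly relies on two methods: find() and find_all().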

find_all():

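find_all() searches the descendants of a node and returns every one that matches the given filter.

So which filters does it support?

Filter types:

String
Regular expression
List
True
Function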

String:

Find all <b> tags in the document:

soup.find_all('b')
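The snippets in this post use a soup object without showing how it is built. Below is a minimal, self-contained sketch of the string filter. The sample markup is modelled on the "three sisters" document from the official docs, the example.com URLs are placeholders, and the built-in "html.parser" backend is an assumption made here to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document (placeholder URLs), used only for illustration.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# A string filter matches tags whose name equals the string exactly.
bold_tags = soup.find_all("b")
print(bold_tags)  # [<b>The Dormouse's story</b>]
```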

Regular expression:

Find all tags whose names start with b:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
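As a runnable sketch of the regular-expression filter, the example below parses a small hypothetical snippet in which several tag names start with "b". It also shows what happens when the ^ anchor is dropped: the pattern is then matched anywhere in the tag name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup chosen so that several tag names start with "b".
markup = "<html><body><p><b>bold</b><br/><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# Anchored pattern: only tag names that begin with "b".
names_starting_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]
print(names_starting_with_b)  # ['body', 'b', 'br']

# Unanchored pattern: any tag name that contains a "t".
names_containing_t = [tag.name for tag in soup.find_all(re.compile("t"))]
print(names_containing_t)  # ['html']
```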

List:

Find all <a> tags and all <b> tags in the document:

soup.find_all(["a", "b"])
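A list filter matches a tag if its name matches any item in the list. The sketch below uses a hypothetical one-line snippet to show that tags not named in the list (here <i> and the enclosing <p>) are left out.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three different inline tags.
soup = BeautifulSoup("<p><a href='#'>link</a> <b>bold</b> <i>italic</i></p>", "html.parser")

# Results come back in document order, not in the order of the list.
names = [tag.name for tag in soup.find_all(["b", "a"])]
print(names)  # ['a', 'b']
```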

True:

True matches every tag, but it does not return string nodes:

for tag in soup.find_all(True):
    print(tag.name)
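To make the "tags only, no strings" behaviour concrete, here is a small sketch on a hypothetical snippet that mixes text and tags. The text nodes "Hello " and "world" exist in the tree but are not part of the result.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: one <p> containing text and a nested <b>.
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

matches = soup.find_all(True)
names = [m.name for m in matches]
print(names)  # ['p', 'b']

# Every result is a Tag; the string nodes were skipped.
only_tags = all(isinstance(m, Tag) for m in matches)
print(only_tags)  # True
```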

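Function:

You can define a function that takes an element as its only argument. The function should return True if the element matches and should be included in the results, and False otherwise.

Here is an example from the official documentation.

The following code finds all tags that are surrounded by string objects: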

from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)
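Another function filter from the official documentation is a check for tags that define a class attribute but no id attribute. The sketch below runs it against a hypothetical three-element snippet; only the first <p> qualifies, because the other two elements carry an id.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first <p> has a class without an id.
html = """
<p class="title">The Dormouse's story</p>
<p class="story" id="intro">Once upon a time...</p>
<a class="sister" id="link1">Elsie</a>
"""
soup = BeautifulSoup(html, "html.parser")

def has_class_but_no_id(tag):
    """Return True for tags that define class but not id."""
    return tag.has_attr("class") and not tag.has_attr("id")

matches = soup.find_all(has_class_but_no_id)
print(matches)  # [<p class="title">The Dormouse's story</p>]
```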

Signature of find_all():

find_all(name, attrs, recursive, text, limit, **kwargs)

The name argument:

The name argument restricts the search to tags whose name matches name; string objects are ignored automatically.

soup.find_all("p") finds all <p> tags.

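Keyword arguments:

soup.find_all(id='link2', class_='title') finds tags that satisfy both attribute conditions at once. The class attribute must be passed as class_, because class is a reserved word in Python.

Some attributes cannot be searched this way, for example the HTML5 data-* attributes, whose names are not valid Python keyword argument names. You can still search for them by passing a dictionary through the attrs argument: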

data_soup.find_all(attrs={"data-foo": "value"})
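The sketch below puts both styles side by side on a hypothetical snippet: keyword arguments for id and class_, and an attrs dictionary for a data-* attribute. The attribute values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links and one element carrying a data-* attribute.
html = """
<a id="link2" class="sister" href="#">Lacie</a>
<a id="link3" class="sister" href="#">Tillie</a>
<div data-foo="value">foo!</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword arguments are combined: a tag must satisfy all of them.
by_id_and_class = soup.find_all(id="link2", class_="sister")
print(by_id_and_class)

# soup.find_all(data-foo="value") would be a SyntaxError,
# so the attribute is passed through the attrs dictionary instead.
by_data_attr = soup.find_all(attrs={"data-foo": "value"})
print(by_data_attr)
```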

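The text argument:

The text argument searches the string content of the document. Like the name argument, it accepts a string, a regular expression, a list, or True. In Beautiful Soup 4.4.0 and later this argument is called string, and text is the older name for it.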

soup.find_all("a", text="Elsie")
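The following sketch shows the difference between searching for strings alone and combining a tag name with a string filter. It assumes Beautiful Soup 4.4.0 or later and therefore uses the string keyword, which is the newer name for the same argument; the markup is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three links.
html = '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
soup = BeautifulSoup(html, "html.parser")

# With only a string filter, the results are the matching text nodes.
matching_strings = [str(s) for s in soup.find_all(string="Elsie")]
print(matching_strings)  # ['Elsie']

# With a tag name as well, the results are tags whose .string matches.
matching_tags = soup.find_all("a", string="Elsie")
print(matching_tags)  # [<a id="link1">Elsie</a>]

# A list matches any of its items; results stay in document order.
either = [str(s) for s in soup.find_all(string=["Tillie", "Elsie"])]
print(either)  # ['Elsie', 'Tillie']
```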

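The limit argument:

find_all() returns every search result, so the search can be slow when the document tree is large. If you do not need all of the results, use the limit argument to cap how many are returned. It works like the LIMIT keyword in SQL: the search stops as soon as that many results have been found.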

soup.find_all("a", limit=2)
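As a quick illustration, the sketch below applies limit=2 to a hypothetical snippet with three links. Only the first two in document order come back.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three links.
html = '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")
first_two = soup.find_all("a", limit=2)

print(len(all_links))  # 3
print([a["id"] for a in first_two])  # ['link1', 'link2']
```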

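The recursive argument:

When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants: its children, its children's children, and so on. To search only the tag's direct children, pass recursive=False.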

soup.html.find_all("title", recursive=False)
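The effect is easiest to see on a tiny hypothetical document in which <title> is a child of <head> and therefore a grandchild of <html>. A sketch:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <title> sits under <head>, not directly under <html>.
html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Default search: all descendants of <html> are examined.
deep = soup.html.find_all("title")

# Direct children of <html> only: <head> and <body>, so nothing matches.
shallow = soup.html.find_all("title", recursive=False)

# <title> is a direct child of <head>, so this does match.
direct = soup.head.find_all("title", recursive=False)

print(len(deep), len(shallow), len(direct))  # 1 0 1
```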

find():

find( name , attrs , recursive , text , **kwargs )

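find_all() returns every tag in the document that matches, but sometimes you only want one result. If the document has only one <body> tag, for example, scanning for all of them with find_all() is wasteful. Rather than calling find_all() with limit=1, use find() directly. The following two lines are nearly equivalent: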

soup.find_all('title', limit=1)
soup.find('title')

The only difference is that find_all() returns a list containing a single element, while find() returns the element itself.

When nothing matches, find_all() returns an empty list and find() returns None.
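Because find() can return None, code that chains attribute access onto its result should check for a miss first. The sketch below compares the two methods on a hypothetical document, for both a hit and a miss.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a single <title> and no <table>.
html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Hit: find_all() wraps the tag in a list, find() returns the tag itself.
as_list = soup.find_all("title", limit=1)
single = soup.find("title")
print(as_list[0] is single)  # True

# Miss: an empty list versus None.
no_tables = soup.find_all("table")
no_table = soup.find("table")
print(len(no_tables), no_table)  # 0 None

# Guard against None before touching attributes of the result.
table_text = no_table.get_text() if no_table is not None else ""
```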

CSS selectors:

Beautiful Soup supports most CSS selectors through the select() method:

soup.select("body a")
soup.select("html head title")
soup.select("p > #link1")
soup.select(".sister")
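Here is a runnable sketch of the four selectors above. It assumes a recent Beautiful Soup 4 release where select() is available, and it uses a hypothetical document modelled on the official "three sisters" example, with placeholder URLs.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document (placeholder URLs), used only for illustration.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Descendant combinator: every <a> anywhere inside <body>.
links_in_body = soup.select("body a")

# A chain of descendant combinators.
titles = soup.select("html head title")

# Child combinator plus id selector: #link1 as a direct child of a <p>.
first_link = soup.select("p > #link1")

# Class selector.
sisters = soup.select(".sister")

print(len(links_in_body), len(titles), len(first_link), len(sisters))  # 3 1 1 3
```

select() always returns a list, so an unmatched selector gives an empty list rather than None.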

For more detailed usage, see the CSS selectors section of the official documentation.

Adapted from the official Beautiful Soup 4 documentation.
