Learn BeautifulSoup with Knowledge Seeker (知識追尋者): if you still can't learn it, I'll take the scolding without a word of protest

1 Preface

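Beautiful Soup is a Python library for extracting data from HTML or XML documents. Its extraction capabilities were strong enough that Knowledge Seeker gave up hunting for HTML nodes with regular expressions. Beautiful Soup can fetch a node directly by its HTML tag name, or through a search function, which greatly improves a programmer's development efficiency. If you finish this article and still can't use Beautiful Soup, nobody can save you. If you find Knowledge Seeker's articles interesting, please follow and like, thanks.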

2 Basic usage of Beautiful Soup

Beautiful Soup supports the following parsers:

Parser                    Usage example
Python standard library   BeautifulSoup(markup, "html.parser")
lxml HTML parser          BeautifulSoup(markup, "lxml")
lxml XML parser           BeautifulSoup(markup, "xml")
html5lib                  BeautifulSoup(markup, "html5lib")

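For this article, either the Python standard library parser or the lxml HTML parser will do. Keep in mind that, throughout the text below, "getting a tag" really means getting a Tag object.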
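As a rough sketch of how the parser choice could be made tolerant of a missing lxml installation, the snippet below tries a preferred parser first and falls back to the built-in html.parser. The helper name make_soup is made up for this illustration; FeatureNotFound is the exception Beautiful Soup raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: parse with the preferred parser, else html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard library one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.string)
```

Either parser gives the same result for well-formed markup like this; the differences show up mostly with broken HTML.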

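A brief summary of the attributes:

Attribute                   Meaning
soup.tag.name               Name of the tag
soup.tag.string             Text content of the tag
soup.tag                    The first matching tag
soup.tag.attrs              All attributes of the tag, as a dictionary
soup.tag.attrs['class']     Value of the tag's class attribute
soup.tag1.tag2              Child tag tag2 under tag1
soup.tag.contents           All direct children of the tag, as a list
soup.tag.children           Direct children, returned as an iterator
soup.tag.descendants        All descendants at any depth, returned as a generator
soup.tag.parent             Direct parent node
soup.tag.parents            Ancestor nodes, returned as a generator
soup.tag.next_sibling       The next sibling node
soup.tag.previous_sibling   The previous sibling node
soup.tag.next_siblings      All following sibling nodes, returned as a generator
soup.tag.previous_siblings  All preceding sibling nodes, returned as a generator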
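To make the table a little more concrete, here is a minimal sketch that exercises several of these attributes on a tiny document. The markup is invented for the illustration and deliberately contains no whitespace between tags, so the sibling lookup lands on a tag.

```python
from bs4 import BeautifulSoup

doc = '<ul id="menu"><li class="first">one</li><li>two</li></ul>'
soup = BeautifulSoup(doc, "html.parser")

first = soup.li                          # first <li> tag in the document
name = first.name                        # tag name
text = first.string                      # text of a tag with a single text child
attrs = first.attrs                      # dictionary of all attributes
parent_name = first.parent.name          # direct parent
next_text = first.next_sibling.string    # the following <li>
child_count = len(soup.ul.contents)      # direct children as a list

print(name, text, attrs, parent_name, next_text, child_count)
```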

2.1 Formatting HTML

  1. Create a BeautifulSoup instance, passing in the HTML and html.parser
  2. Call the prettify() method to format the HTML document
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

The output is shown below. Isn't it neat, with a clear structure? It also filled in the missing closing tags </form> and </div>.

<div class="filter-box d-flex align-items-center">
 <form action="" id="seeOriginal">
  <dl class="filter-sort-box d-flex align-items-center">
   <dt>
    排序:
   </dt>
   <dd>
    <a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">
     默認
    </a>
   </dd>
   <dd>
    <a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
     <svg aria-hidden="true" class="icon">
      <use xlink:href="#csdnc-rss">
      </use>
     </svg>
     RSS訂閱
    </a>
   </dd>
  </dl>
 </form>
</div>

2.2 Getting a tag node

  1. Calling soup.dt returns the first dt Tag object that matches;
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
# Prints the node <dt>排序:</dt>
print(soup.dt)

2.3 Getting a node's text

soup.dt.string returns the text contained in the dt tag;

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
# Prints the text content: 排序:
print(soup.dt.string)
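One thing worth knowing about .string is that it returns None when a tag has more than one child, in which case get_text() is the usual alternative. A small sketch, using made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

single = soup.b.string      # <b> has exactly one text child
mixed = soup.p.string       # <p> has two children, so .string is None
full = soup.p.get_text()    # concatenates every text node under <p>

print(single, mixed, full)
```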

2.4 Getting a node's name

soup.dt.name returns the name of the dt tag;

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
# Prints dt
print(soup.dt.name)

2.5 Getting a node object's type

After fetching a tag, passing it to the built-in type() function shows that its type is <class 'bs4.element.Tag'>

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
dt = soup.dt
# <class 'bs4.element.Tag'>
print(type(dt))
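For comparison, the soup object itself and the text inside a tag have their own types. The following sketch, on a one-line made-up document, collects the class names of all three kinds of object:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup("<dt>Sort:</dt>", "html.parser")

kinds = [
    type(soup).__name__,            # the whole document
    type(soup.dt).__name__,         # a tag
    type(soup.dt.string).__name__,  # the text inside the tag
]
print(kinds)

# Text nodes are not tags, which matters when iterating over children.
text_is_tag = isinstance(soup.dt.string, Tag)
text_is_string = isinstance(soup.dt.string, NavigableString)
```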

2.6 Getting all attributes

soup.a.attrs returns all attributes of the first a tag that matches;

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.a.attrs)

The output is every attribute of the first matching a tag

{'href': 'javascript:void(0);', 'data-report-query': '', 'class': ['btn-filter-sort', 'active'], 'target': '_self'}

2.7 Getting a specific attribute

soup.a.attrs['href'] returns the value of the href attribute of the first a tag that matches

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
# Prints javascript:void(0);
print(soup.a.attrs['href'])
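A short sketch of the common ways to read attributes may help here. It uses an invented a tag; the points to notice are that indexing the tag is shorthand for going through attrs, that get() avoids a KeyError for a missing attribute, and that class comes back as a list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="btn rss" href="/rss/list">RSS</a>', "html.parser")
link = soup.a

href = link["href"]          # shorthand for link.attrs["href"]
title = link.get("title")    # missing attribute: None instead of KeyError
classes = link["class"]      # class is multi-valued, so a list is returned

print(href, title, classes)
```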

2.8 Getting a child node

soup.form.dd returns the first dd tag under the form tag

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.form.dd)

Output

<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>

2.9 Getting all direct child nodes

soup.form.contents returns all direct children of form as a list, including whitespace text nodes;

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.form.contents)

Output:

['\n', <dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>]

2.10 Iterating over direct child nodes

soup.svg.children returns an iterator over the direct children of the svg tag;

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
for index, child in enumerate(soup.svg.children):
    print(index, child)

Output (items 0 and 2 are whitespace text nodes):

0 

1 <use xlink:href="#csdnc-rss"></use>
2
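Since the whitespace between tags shows up as children too, one possible way to keep only the tags is to filter on the Tag type. This is a sketch on simplified markup, not the only approach:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<dl>
  <dt>Sort:</dt>
  <dd>Default</dd>
</dl>"""
soup = BeautifulSoup(html, "html.parser")

all_children = list(soup.dl.children)
# Keep only Tag objects; whitespace text nodes are dropped.
tag_children = [child for child in soup.dl.children if isinstance(child, Tag)]
tag_names = [t.name for t in tag_children]

print(len(all_children), tag_names)
```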

2.11 Iterating over all descendant nodes

soup.dl.descendants yields every node under the dl tag at any depth, not just its direct children:

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
for index, child in enumerate(soup.dl.descendants):
    print(index, child)

Output:

0 

1 <dt>排序:</dt>
2 排序:
3 

4 <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>
5 <a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a>
6 默認
7 

8 <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
9 <a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
10 

11 <svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>
12 

13 <use xlink:href="#csdnc-rss"></use>
14 

15 RSS訂閱
16 

17
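The listing above mixes tags and text nodes. As a sketch of how the two could be separated, the snippet below walks the descendants of a compact, made-up dl and splits them by type:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup("<dl><dt>Sort:</dt><dd><a>Default</a></dd></dl>", "html.parser")

# Tags appear in document order, each followed by its own descendants.
tag_names = [n.name for n in soup.dl.descendants if isinstance(n, Tag)]
texts = [str(n) for n in soup.dl.descendants if isinstance(n, NavigableString)]

print(tag_names, texts)
```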

2.12 Getting the direct parent node

soup.a.parent returns the parent Tag object of the first a tag that matches;

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.a.parent)

Output:

<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>

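2.13 Iterating over ancestor nodes

soup.a.parents returns a generator over every ancestor of the first a tag that matches, from its direct parent up to the document itself;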

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
for node in soup.a.parents:
    if node is None:
        print(node)
    else:
        print(node.name)

Output:

dd
dl
form
div
[document]
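One small use of the ancestor chain is building a readable path to a node. The following is only a sketch on compact markup; dropping the last entry removes the [document] pseudo-name of the soup object itself.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><form><dl><dd><a>x</a></dd></dl></form></div>", "html.parser"
)

# Ancestors are yielded innermost first, ending with the document object.
ancestors = [node.name for node in soup.a.parents]
path = " > ".join(reversed(ancestors[:-1]))

print(ancestors)
print(path)
```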

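2.14 Getting sibling nodes

Sibling nodes have a catch: because of the line breaks and indentation between tags, next_sibling usually returns a whitespace text node rather than the next tag, so I won't go into much detail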

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.dt.next_sibling)

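The output is blank, because the sibling is the whitespace text between </dt> and <dd>. I won't write out the other sibling attributes; on markup formatted like this they mostly return whitespace or None;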
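If the goal is the next tag rather than the next node, the find_next_sibling() search method (listed in the next section) skips over text nodes. A minimal sketch on simplified markup:

```python
from bs4 import BeautifulSoup

html = """<dl>
<dt>Sort:</dt>
<dd>Default</dd>
</dl>"""
soup = BeautifulSoup(html, "html.parser")

raw = soup.dt.next_sibling               # the newline between </dt> and <dd>
tag = soup.dt.find_next_sibling("dd")    # skips text nodes, returns the <dd> tag

print(repr(raw), tag)
```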

3 Searching the document

Section 2 was really just laying the groundwork; this section is the important part;

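Function                                                                        Meaning
find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)  Find all matching nodes
find(name=None, attrs={}, recursive=True, text=None, **kwargs)                  Find the first matching node
find_parent(name=None, attrs={}, **kwargs)                                      Return the nearest matching ancestor of the current node
find_parents(name=None, attrs={}, **kwargs)                                     Return all matching ancestors of the current node
find_next_sibling(name=None, attrs={}, text=None, **kwargs)                     Return the first matching sibling after the current node
find_next_siblings(name=None, attrs={}, text=None, **kwargs)                    Return all matching siblings after the current node
find_previous_sibling(name=None, attrs={}, text=None, **kwargs)                 Return the first matching sibling before the current node
find_previous_siblings(name=None, attrs={}, text=None, **kwargs)                Return all matching siblings before the current node
find_next(name=None, attrs={}, text=None, **kwargs)                             Return the first matching node after the current node in the document
find_all_next(name=None, attrs={}, text=None, limit=None, **kwargs)             Return all matching nodes after the current node in the document
find_previous(name=None, attrs={}, text=None, **kwargs)                         Return the first matching node before the current node in the document
find_all_previous(name=None, attrs={}, text=None, limit=None, **kwargs)         Return all matching nodes before the current node in the document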
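  1. name: the tag name to match
  2. attrs: a dictionary of attribute values to match
  3. recursive: whether to search all descendants; the default is True. Set it to False to search direct children only
  4. limit: the maximum number of results to return
  5. **kwargs: attribute filters passed as keyword arguments, for example id='zszxz'
  6. text: a filter on the text content of nodes (newer Beautiful Soup releases call this argument string and keep text as the older name)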

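This section focuses on find_all. The find method takes the same filters but returns only the first match, so once you know one you can use the other;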
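The practical difference between the two shows up in their return values, especially when nothing matches. A quick sketch on a made-up list:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>a</li><li>b</li></ul>", "html.parser")

first_li = soup.find("li")            # a single Tag
all_li = soup.find_all("li")          # a list-like ResultSet of Tags
missing_one = soup.find("table")      # None when nothing matches
missing_all = soup.find_all("table")  # empty result when nothing matches

print(first_li, len(all_li), missing_one, len(missing_all))
```

Because find can return None, it is worth checking the result before chaining attribute access onto it.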

3.1 name parameter example

soup.find_all(name='dd') returns all dd Tag objects as a list;

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(name='dd'))

Output

[<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>, <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>]

Note: soup.find_all(name='dd') is equivalent to soup.find_all('dd');

3.2 attrs parameter example

soup.find_all(attrs={'id':'seeOriginal'}) returns all Tag objects whose id attribute equals seeOriginal

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(attrs={'id':'seeOriginal'}))

Output

[<form action="" id="seeOriginal">
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl></form>]

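3.3 recursive example

soup.find_all('dl', recursive=False) examines only the direct children of soup. The dl tag is nested inside div and form, so once recursive is set to False it is no longer found;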

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('dl',recursive=False))

The output is an empty list []
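To see that recursive=False depends on where the search starts, the sketch below runs the same query from three different starting points on a compact, made-up version of the markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><form><dl><dd>1</dd></dl></form></div>", "html.parser"
)

from_soup = soup.find_all("dl", recursive=False)       # only <div> is a direct child
from_div = soup.div.find_all("dl", recursive=False)    # only <form> is a direct child
from_form = soup.form.find_all("dl", recursive=False)  # <dl> is a direct child here

print(len(from_soup), len(from_div), len(from_form))
```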

3.4 limit example

soup.find_all('dd', limit=1) limits the result to a single item

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('dd',limit=1))

Output

[<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>]

3.5 kwargs example: attribute matching

soup.find_all(id='seeOriginal') searches by the id attribute directly

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(id='seeOriginal'))

Output

[<form action="" id="seeOriginal">
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl></form>]

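3.6 kwargs example: regular expression matching

soup.find_all(href=re.compile("java.*?")) returns tags whose href attribute matches the regular expression. Beautiful Soup applies the pattern with search(), so this matches any href that contains "java"; to require that the href starts with "java", anchor the pattern as "^java";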

# -*- coding: utf-8 -*-
import re

import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(href=re.compile("java.*?")))

Output

[<a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a>]
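The difference between an unanchored and an anchored pattern is easy to miss on the sample page, where only one href contains "java". The sketch below uses two invented links, one of which merely contains the word in its path:

```python
import re

from bs4 import BeautifulSoup

html = (
    '<a href="javascript:void(0);">A</a>'
    '<a href="https://example.com/java">B</a>'
)
soup = BeautifulSoup(html, "html.parser")

anywhere = soup.find_all(href=re.compile("java"))    # "java" anywhere in href
at_start = soup.find_all(href=re.compile("^java"))   # href must begin with "java"

print(len(anywhere), len(at_start))
```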

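3.7 Searching by CSS class

soup.find_all("a", class_="btn") finds a tags whose class attribute includes the class btn. Matching is done per class name, so the first a tag, whose classes are btn-filter-sort and active, is not returned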

# -*- coding: utf-8 -*-
import re

import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all("a", class_="btn"))

Output

[<a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>]
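When a tag has to carry several classes at once, a CSS selector passed to select() is one straightforward option. The sketch below, on trimmed-down versions of the two links, contrasts it with a single-class class_ search:

```python
from bs4 import BeautifulSoup

html = (
    '<a class="btn-filter-sort active">Default</a>'
    '<a class="btn btn-sm rss">RSS</a>'
)
soup = BeautifulSoup(html, "html.parser")

btn = soup.find_all("a", class_="btn")   # whole class name must equal "btn"
both = soup.select("a.btn.rss")          # tag must have both btn and rss

print([a.string for a in btn], [a.string for a in both])
```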

4 CSS selectors

Beautiful Soup also supports searching with CSS selectors directly through select(); some commonly used examples are listed below;

# -*- coding: utf-8 -*-
import re

import requests
from bs4 import BeautifulSoup

html = """
    <div class="filter-box d-flex align-items-center">
    <form action="" id=seeOriginal>
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
		<svg class="icon" aria-hidden="true">
			<use xlink:href="#csdnc-rss"></use>
		</svg>RSS訂閱</a>
    </dd>
  </dl>"""

# Initialize soup
soup = BeautifulSoup(html, 'html.parser')
# Select the dt tags nested under dl
lt = soup.select('dl dt')
print(lt)
dd = soup.select('dl dd')
print(dd[0])
# Search with an id selector (named by_id to avoid shadowing the built-in id)
by_id = soup.select('#seeOriginal')
print(by_id)
# Search with a class selector
cla = soup.select('.btn-filter-sort')
print(cla[0])

The output of each call is as follows

soup.select('dl dt')

[<dt>排序:</dt>]

soup.select('dl dd')[0]

<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>

soup.select('#seeOriginal')

[<form action="" id="seeOriginal">
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl></form>]

soup.select('.btn-filter-sort')[0]

<a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a>
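Beyond the descendant, id, and class selectors shown above, a few other selector forms tend to come up. The following sketch, on simplified markup, tries select_one(), the child combinator, and an attribute prefix selector:

```python
from bs4 import BeautifulSoup

html = """<dl>
<dt>Sort:</dt>
<dd><a href="javascript:void(0);">Default</a></dd>
<dd><a href="https://blog.csdn.net/youku1327/rss/list">RSS</a></dd>
</dl>"""
soup = BeautifulSoup(html, "html.parser")

first_dt = soup.select_one("dl > dt")            # first match, direct child only
https_links = soup.select('a[href^="https"]')    # href starting with "https"
nothing = soup.select_one(".missing")            # None when nothing matches

print(first_dt.string, [a.string for a in https_links], nothing)
```

select() always returns a list, while select_one() returns a single tag or None, mirroring the find_all and find pair.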

Original source: https://www.cnblogs.com/zszxz/p/12208673.html
