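Beautiful Soup is a Python library for extracting data from HTML and XML documents. Its extraction capabilities are strong enough that Knowledge Seeker (the author) gave up locating HTML nodes with regular expressions. Beautiful Soup can fetch a node directly by its HTML tag name or through search functions, which greatly improves a programmer's efficiency. If you still can't use Beautiful Soup after reading this article, nothing can save you. If you find Knowledge Seeker's articles interesting, a follow and a like would be appreciated, thanks.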
Beautiful Soup supports the following parsers:
Parser | Usage example |
---|---|
Python standard library | BeautifulSoup(markup, "html.parser") |
lxml HTML parser | BeautifulSoup(markup, "lxml") |
lxml XML parser | BeautifulSoup(markup, "xml") |
html5lib | BeautifulSoup(markup, "html5lib") |
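For this article, readers can use either the Python standard library parser or the lxml HTML parser. Keep in mind that "getting a tag" in what follows always means getting a Tag object.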
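A brief summary of the attributes:
Attribute | Meaning |
---|---|
soup.tag.name | Get the name of the tag |
soup.tag.string | Get the text content of the tag |
soup.tag | Get the first matching tag |
soup.tag.attrs | Get all attributes of the tag |
soup.tag.attrs['class'] | Get the value of the tag's class attribute |
soup.tag1.tag2 | Get the child tag tag2 under tag1 |
soup.tag.contents | Get all direct children of the tag as a list |
soup.tag.children | Get the direct children, returned as a generator |
soup.tag.descendants | Get all descendants, returned as a generator |
soup.tag.parent | Get the direct parent node |
soup.tag.parents | Get the ancestor nodes, returned as a generator |
soup.tag.next_sibling | Get the next sibling node |
soup.tag.previous_sibling | Get the previous sibling node |
soup.tag.next_siblings | Get all following sibling nodes, returned as a generator |
soup.tag.previous_siblings | Get all preceding sibling nodes, returned as a generator |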
The prettify() method formats an HTML document:

    # -*- coding: utf-8 -*-
    from bs4 import BeautifulSoup

    html = """
    <div class="filter-box d-flex align-items-center">
        <form action="" id=seeOriginal>
            <dl class="filter-sort-box d-flex align-items-center">
                <dt>排序:</dt>
                <dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">默認</a></dd>
                <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
                    <svg class="icon" aria-hidden="true">
                        <use xlink:href="#csdnc-rss"></use>
                    </svg>RSS訂閱</a>
                </dd>
            </dl>"""
    # Initialize soup
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())
The output is shown below. It is tidy, the structure is easy to see, and the missing closing tags </form> and </div> have been filled in:

    <div class="filter-box d-flex align-items-center">
     <form action="" id="seeOriginal">
      <dl class="filter-sort-box d-flex align-items-center">
       <dt>
        排序:
       </dt>
       <dd>
        <a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">
         默認
        </a>
       </dd>
       <dd>
        <a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
         <svg aria-hidden="true" class="icon">
          <use xlink:href="#csdnc-rss">
          </use>
         </svg>
         RSS訂閱
        </a>
       </dd>
      </dl>
     </form>
    </div>
soup.dt gets the first dt tag:

    # html and soup are the same as in the prettify() example
    # Prints the node: <dt>排序:</dt>
    print(soup.dt)

soup.dt.string gets the text contained in the dt tag:

    # html and soup are the same as in the prettify() example
    # Prints the text content: 排序:
    print(soup.dt.string)
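One detail worth knowing about .string: it only yields text when the tag has a single child to read it from. The sketch below illustrates this with a small made-up snippet (the `sample` markup and variable names are assumptions, not part of the article's example) and shows get_text() as the fallback for tags with mixed content.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the article
sample = '<p id="one">plain text</p><p id="two">mixed <b>bold</b> text</p>'
soup = BeautifulSoup(sample, "html.parser")

one = soup.find(id="one")
two = soup.find(id="two")

single = one.string         # the tag has exactly one string child
mixed = two.string          # several children, so .string is None
full_text = two.get_text()  # joins every string found under the tag

print(single)
print(mixed)
print(full_text)
```

When a tag may contain nested markup, get_text() is usually the safer choice.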
soup.dt.name gets the name of the dt tag:

    # html and soup are the same as in the prettify() example
    # Prints: dt
    print(soup.dt.name)
Calling type() on a tag obtained this way shows that its type is <class 'bs4.element.Tag'>:

    # html and soup are the same as in the prettify() example
    dt = soup.dt
    # Prints: <class 'bs4.element.Tag'>
    print(type(dt))
soup.a.attrs gets all attributes of the first a tag matched:

    # html and soup are the same as in the prettify() example
    print(soup.a.attrs)

The output is a dict holding every attribute of the first matching a tag:

    {'href': 'javascript:void(0);', 'data-report-query': '', 'class': ['btn-filter-sort', 'active'], 'target': '_self'}
soup.a.attrs['href'] gets the value of the href attribute of the first a tag matched:

    # html and soup are the same as in the prettify() example
    # Prints: javascript:void(0);
    print(soup.a.attrs['href'])
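A few related ways to read attributes may be useful here. This is a minimal sketch on a made-up tag (the `sample` markup and variable names are assumptions): indexing the tag directly is shorthand for going through attrs, class comes back as a list because it can hold several values, and get() avoids a KeyError when an attribute is absent.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the article
sample = '<a class="btn btn-sm" href="/rss" data-id="7">RSS</a>'
soup = BeautifulSoup(sample, "html.parser")
link = soup.a

href = link["href"]                   # same value as link.attrs["href"]
classes = link["class"]               # multi-valued attribute, returned as a list
missing = link.get("target")          # None instead of a KeyError
has_target = link.has_attr("target")  # explicit existence check

print(href, classes, missing, has_target)
```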
soup.form.dd gets the first dd tag under the form tag:

    # html and soup are the same as in the prettify() example
    print(soup.form.dd)

Output:

    <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>
soup.form.contents returns all direct children of form as a list:

    # html and soup are the same as in the prettify() example
    print(soup.form.contents)

Output:

    ['\n', <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    <svg aria-hidden="true" class="icon">
    <use xlink:href="#csdnc-rss"></use>
    </svg>RSS訂閱</a>
    </dd>
    </dl>]
soup.svg.children returns a generator over the direct children of the svg tag:

    # html and soup are the same as in the prettify() example
    for index, child in enumerate(soup.svg.children):
        print(index, child)

Output (entries 0 and 2 are the whitespace text nodes around the use tag):

    0
    1 <use xlink:href="#csdnc-rss"></use>
    2
soup.dl.descendants gets all descendants of the dl tag (not just its direct children), returned as a generator:

    # html and soup are the same as in the prettify() example
    for index, child in enumerate(soup.dl.descendants):
        print(index, child)

Output (the entries that look empty are whitespace text nodes):

    0
    1 <dt>排序:</dt>
    2 排序:
    3
    4 <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>
    5 <a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a>
    6 默認
    7
    8 <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list"> <svg aria-hidden="true" class="icon"> <use xlink:href="#csdnc-rss"></use> </svg>RSS訂閱</a> </dd>
    9 <a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list"> <svg aria-hidden="true" class="icon"> <use xlink:href="#csdnc-rss"></use> </svg>RSS訂閱</a>
    10
    11 <svg aria-hidden="true" class="icon"> <use xlink:href="#csdnc-rss"></use> </svg>
    12
    13 <use xlink:href="#csdnc-rss"></use>
    14
    15 RSS訂閱
    16
    17
soup.a.parent gets the parent tag object of the first a tag matched:

    # html and soup are the same as in the prettify() example
    print(soup.a.parent)

Output:

    <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>
soup.a.parents gets every parent of the first a tag matched, that is, its ancestor nodes, returned as a generator:

    # html and soup are the same as in the prettify() example
    for node in soup.a.parents:
        if node is None:
            print(node)
        else:
            print(node.name)

Output:

    dd
    dl
    form
    div
    [document]
Sibling attributes have a catch: the whitespace between two tags is itself a text node, so next_sibling often returns that whitespace rather than the next tag. They are not covered in depth here.

    # html and soup are the same as in the prettify() example
    print(soup.dt.next_sibling)

The output is blank, because the next sibling of dt is the whitespace before the first dd. The other sibling attributes are skipped here; on markup like this they mostly return whitespace or None.
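If the next tag is what you are after rather than the whitespace, the find_next_sibling() family or a type filter can skip the text nodes. The following is a small sketch on made-up markup (the `sample` string and variable names are assumptions, not the article's example):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup, not taken from the article
sample = """<dl>
<dt>Sort:</dt>
<dd>first</dd>
<dd>second</dd>
</dl>"""
soup = BeautifulSoup(sample, "html.parser")
dt = soup.dt

raw_next = dt.next_sibling            # the newline text node after </dt>
next_tag = dt.find_next_sibling()     # skips text nodes, returns the first <dd>
all_dd = dt.find_next_siblings("dd")  # every following <dd> sibling
# Alternative: filter the generator so only Tag objects remain
tag_siblings = [s for s in dt.next_siblings if isinstance(s, Tag)]

print(repr(raw_next))
print(next_tag)
print([dd.string for dd in all_dd])
```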
Section two only laid the groundwork; this section is the important one.
Function | Meaning |
---|---|
find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs) | Find all matching nodes |
find(name=None, attrs={}, recursive=True, text=None, **kwargs) | Find the first matching node |
find_parent(name=None, attrs={}, **kwargs) | Return the nearest matching parent of the current node |
find_parents(name=None, attrs={}, **kwargs) | Return the matching ancestors of the current node |
find_next_sibling(name=None, attrs={}, text=None, **kwargs) | Return the first matching sibling after the current node |
find_next_siblings(name=None, attrs={}, text=None, **kwargs) | Return all matching siblings after the current node |
find_previous_sibling(name=None, attrs={}, text=None, **kwargs) | Return the first matching sibling before the current node |
find_previous_siblings(name=None, attrs={}, text=None, **kwargs) | Return all matching siblings before the current node |
find_next(name=None, attrs={}, text=None, **kwargs) | Return the first matching node after the current node |
find_all_next(name=None, attrs={}, text=None, limit=None, **kwargs) | Return all matching nodes after the current node |
find_previous(name=None, attrs={}, text=None, **kwargs) | Return the first matching node before the current node |
find_all_previous(name=None, attrs={}, text=None, limit=None, **kwargs) | Return all matching nodes before the current node |
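To give a feel for the directional functions in the table, here is a brief sketch that starts from one tag and searches upward, forward, and backward. The `sample` markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the article
sample = "<div><form><dl><dt>Sort:</dt><dd><a href='#'>Default</a></dd></dl></form></div>"
soup = BeautifulSoup(sample, "html.parser")
link = soup.a

parent_form = link.find_parent("form")  # nearest ancestor named form
dl_parents = link.find_parents("dl")    # all ancestors named dl
next_link = soup.dt.find_next("a")      # first <a> later in the document
prev_dt = link.find_previous("dt")      # first <dt> earlier in the document

print(parent_form.name, len(dl_parents), next_link["href"], prev_dt.string)
```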
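This section concentrates on find_all. The find method takes the same arguments but returns only the first match, so once you know one you can use the other.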
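The difference in return values is easiest to see side by side. A minimal sketch, using a made-up list (the `sample` markup and variable names are assumptions):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the article
sample = "<ul><li>a</li><li>b</li></ul>"
soup = BeautifulSoup(sample, "html.parser")

first = soup.find("li")            # a single Tag
every = soup.find_all("li")        # a list-like result with both tags
none_found = soup.find("table")    # None when nothing matches
empty = soup.find_all("table")     # empty result when nothing matches

print(first, every, none_found, empty)
```

Because find can return None, check the result before chaining further attribute access on it.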
soup.find_all(name='dd') gets all dd tag objects and returns them as a list:

    # html and soup are the same as in the prettify() example
    print(soup.find_all(name='dd'))

Output:

    [<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>, <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    <svg aria-hidden="true" class="icon">
    <use xlink:href="#csdnc-rss"></use>
    </svg>RSS訂閱</a>
    </dd>]

Note: soup.find_all(name='dd') is equivalent to soup.find_all('dd').
soup.find_all(attrs={'id':'seeOriginal'}) gets all tag objects whose id attribute equals seeOriginal:

    # html and soup are the same as in the prettify() example
    print(soup.find_all(attrs={'id': 'seeOriginal'}))

Output:

    [<form action="" id="seeOriginal">
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    <svg aria-hidden="true" class="icon">
    <use xlink:href="#csdnc-rss"></use>
    </svg>RSS訂閱</a>
    </dd>
    </dl></form>]
soup.find_all('dl',recursive=False)
With recursive=False, only the direct children of soup are searched for a dl tag. Here dl is nested inside div and form, so nothing is found:

    # html and soup are the same as in the prettify() example
    print(soup.find_all('dl', recursive=False))

The output is an empty list: []
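For contrast, recursive=False does find the tag when the search starts from its direct parent. A small sketch on made-up markup (the `sample` string and variable names are assumptions):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the article
sample = "<div><form><dl><dt>Sort:</dt></dl></form></div>"
soup = BeautifulSoup(sample, "html.parser")

# Only <div> is a direct child of the document, so <dl> is not found here
top_level = soup.find_all("dl", recursive=False)
# Starting from <form>, <dl> is a direct child and is found
direct = soup.form.find_all("dl", recursive=False)
# <div> itself is a direct child of the document
top_div = soup.find_all("div", recursive=False)

print(top_level, direct, top_div[0].name)
```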
soup.find_all('dd',limit=1)
The limit argument caps the number of results, here at one:

    # html and soup are the same as in the prettify() example
    print(soup.find_all('dd', limit=1))

Output:

    [<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>]
soup.find_all(id='seeOriginal')
The id attribute can also be passed directly as a keyword argument:

    # html and soup are the same as in the prettify() example
    print(soup.find_all(id='seeOriginal'))

Output:

    [<form action="" id="seeOriginal">
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    <svg aria-hidden="true" class="icon">
    <use xlink:href="#csdnc-rss"></use>
    </svg>RSS訂閱</a>
    </dd>
    </dl></form>]
soup.find_all(href=re.compile("java.*?"))
Matches tags whose href attribute matches the regular expression. Beautiful Soup applies the pattern with search(), so java.*? matches any href that contains "java"; use ^java to require that the value starts with java.

    import re

    # html and soup are the same as in the prettify() example
    print(soup.find_all(href=re.compile("java.*?")))

Output:

    [<a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a>]
soup.find_all("a", class_="btn")
Finds a tags whose class attribute includes btn (the keyword is class_ because class is a reserved word in Python):

    # html and soup are the same as in the prettify() example
    print(soup.find_all("a", class_="btn"))

Output:

    [<a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    <svg aria-hidden="true" class="icon">
    <use xlink:href="#csdnc-rss"></use>
    </svg>RSS訂閱</a>]
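How class_ treats tags with several classes can be surprising, so here is a short sketch of the cases. The `sample` markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the article
sample = (
    '<a class="btn btn-sm rss" href="/rss">RSS</a>'
    '<a class="btn-filter-sort active" href="#">Default</a>'
)
soup = BeautifulSoup(sample, "html.parser")

# Matches any tag that has "btn" as one of its classes;
# "btn-filter-sort" is a different class and does not match
by_one = soup.find_all("a", class_="btn")
# The exact attribute string also matches
by_exact = soup.find_all("a", class_="btn btn-sm rss")
# The same classes in a different order do not match as a string
by_reordered = soup.find_all("a", class_="rss btn")
# A CSS selector matches several classes regardless of order
by_css = soup.select("a.rss.btn")

print(len(by_one), len(by_exact), len(by_reordered), len(by_css))
```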
Beautiful Soup also supports searching with CSS selectors directly. Some commonly used examples:

    # html and soup are the same as in the prettify() example
    # Select dt tags under a dl tag
    dt = soup.select('dl dt')
    print(dt)
    dd = soup.select('dl dd')
    print(dd[0])
    # Search with an id selector
    form = soup.select('#seeOriginal')
    print(form)
    # Search with a class selector
    cla = soup.select('.btn-filter-sort')
    print(cla[0])
The outputs, in order:

soup.select('dl dt')

    [<dt>排序:</dt>]

soup.select('dl dd')[0]

    <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>

soup.select('#seeOriginal')

    [<form action="" id="seeOriginal">
    <dl class="filter-sort-box d-flex align-items-center">
    <dt>排序:</dt>
    <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a></dd>
    <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
    <svg aria-hidden="true" class="icon">
    <use xlink:href="#csdnc-rss"></use>
    </svg>RSS訂閱</a>
    </dd>
    </dl></form>]

soup.select('.btn-filter-sort')[0]

    <a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">默認</a>
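A few more selector forms tend to come up in practice. The sketch below, on made-up markup (the `sample` string, the example.com URL, and the variable names are assumptions), shows select_one for a single result, the child combinator, and an attribute prefix selector.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the article
sample = """<form id="seeOriginal"><dl>
<dt>Sort:</dt>
<dd><a class="active" href="javascript:void(0);">Default</a></dd>
<dd><a class="btn rss" href="https://example.com/rss">RSS</a></dd>
</dl></form>"""
soup = BeautifulSoup(sample, "html.parser")

first_dd = soup.select_one("dl > dd")          # first direct dd child of dl
https_links = soup.select('a[href^="https"]')  # href starts with "https"
rss_text = soup.select_one("#seeOriginal a.rss").get_text()
no_match = soup.select_one("table")            # None when nothing matches

print(first_dd.a.string, https_links[0]["href"], rss_text, no_match)
```

select always returns a list, while select_one returns the first match or None, mirroring find_all and find.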