The bs4 Parsing Library

Date: 2019-11-30

Tags: bs4, parsing


beautifulsoup4

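bs4 (Beautiful Soup 4) is a flexible and convenient library for parsing web pages. It is efficient and supports several parsers, and it lets you extract data from a page's HTML without writing regular expressions.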
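As a minimal sketch of the parser choice: the second argument to BeautifulSoup names the parser to use. This example assumes the standard-library "html.parser" so that nothing beyond bs4 itself has to be installed; "lxml" can be passed instead when that package is available. The names html_doc and items are illustrative only.

```python
from bs4 import BeautifulSoup

# A small, well-formed document used only for this illustration.
html_doc = "<ul><li class='item'>one</li><li class='item'>two</li></ul>"

# The second argument selects the parser; "html.parser" ships with Python.
soup = BeautifulSoup(html_doc, "html.parser")

# Collect the text of every <li> without any regular expressions.
items = [li.get_text() for li in soup.find_all("li")]
print(items)
```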

The HTML to parse

from bs4 import BeautifulSoup

# The HTML to parse
html_str = """
<li data_group="server" class="content">
    <a href="/commands.html" class="index" name="a1">The first a tag
    <a href="/commands.html" class="index2" name="a2">The second a tag
    <a href="/commands/flushdb.html">
        <span class="first">
            This is the first span tag
            <span class="second">
                This is the second span tag, a child of the first
            </span>
        </span>
        <span class="third">This is the third span tag</span>
        <h3>This is an h3</h3>
    </a>
</li>
"""

1. Finding tags

# 1. find_all returns every li tag as a ResultSet
li_find_all = BeautifulSoup(html_str, "lxml").find_all("li")
print(type(li_find_all))  # <class 'bs4.element.ResultSet'>

# 2. find returns the first li tag as a Tag object
li_find = BeautifulSoup(html_str, "lxml").find("li")
print(type(li_find))  # <class 'bs4.element.Tag'>

# Narrow the search with attribute filters such as class or id
li = BeautifulSoup(html_str, "lxml").find_all("li", class_="content", data_group="server")
li1 = BeautifulSoup(html_str, "lxml").find_all("li", attrs={"class": "content", "data_group": "server"})
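The difference between find and find_all, and the two ways of filtering by attribute, can be sketched as follows. This is only an illustration under assumptions: the sample markup, the hyphenated data-group attribute, and the variable names are invented for the example, and "html.parser" is used instead of "lxml" to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for this sketch only.
sample = """
<ul>
  <li class="content" data-group="server"><a href="/a.html">A</a></li>
  <li class="content" data-group="client"><a href="/b.html">B</a></li>
  <li class="other" data-group="server"><a href="/c.html">C</a></li>
</ul>
"""
soup = BeautifulSoup(sample, "html.parser")

# class_ carries a trailing underscore because "class" is a Python keyword.
by_class = soup.find_all("li", class_="content")

# A hyphenated attribute name cannot be a keyword argument, so use attrs.
by_attrs = soup.find_all("li", attrs={"class": "content", "data-group": "server"})

# find returns None when nothing matches; find_all returns an empty ResultSet.
missing = soup.find("li", class_="absent")
none_found = soup.find_all("li", class_="absent")

print(len(by_class), len(by_attrs), missing, len(none_found))
```

Checking the result of find against None before using it avoids an AttributeError when the page does not contain the expected tag.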

2. Getting a tag's attributes and name

# Get the attributes and name of the a tag
a = BeautifulSoup(html_str, "lxml").find("a")
print(a.get("href"), a.name, type(a.get("href")))  # /commands.html a <class 'str'>
print(a.attrs, type(a.attrs), a.text, a.string, a.get_text(), type(a.string))
# {'href': '/commands.html', 'class': ['index'], 'name': 'a1'} <class 'dict'> The first a tag <class 'bs4.element.NavigableString'>
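A few details of attribute access are easy to trip over, so here is a small sketch. The markup and variable names are made up for the illustration, and "html.parser" is assumed as the parser. It shows that the tag name and an HTML attribute called name are different things, that class comes back as a list, and that .string is None once a tag has more than one child.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an <a> tag with two classes and a nested <b> tag.
markup = '<a href="/x.html" class="index big" name="a1">first <b>link</b></a>'
link = BeautifulSoup(markup, "html.parser").find("a")

tag_name = link.name          # the tag name itself
name_attr = link["name"]      # the HTML attribute called "name"
href = link["href"]           # indexing raises KeyError if the attribute is absent
title = link.get("title")     # get returns None for a missing attribute
classes = link["class"]       # class is multi-valued, so it is a list

only_string = link.string     # None, because <a> has two children here
text = link.get_text()        # concatenates all nested strings

print(tag_name, name_attr, href, title, classes, only_string, text)
```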

3. Working with child and descendant tags

# .children iterates over the direct children of li (use .descendants for every nested node)
li_find = BeautifulSoup(html_str, "lxml").find("li")
print(li_find.children)  # <list_iterator object at 0x00000132C0915320>
"""
for i in li_find.children:
    print(type(i), i)
"""
# Accessing a tag name as an attribute returns the first matching tag inside li
print(li_find.a, type(li_find.a))
# <a class="index" href="/commands.html" name="a1">The first a tag</a> <class 'bs4.element.Tag'>
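The distinction between direct children and all descendants can be sketched like this. The markup and names are illustrative, and "html.parser" is an assumption made to keep the example dependency-free.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: <span> is nested inside <a>, so it is a descendant
# of <li> but not a direct child.
soup = BeautifulSoup("<li><a><span>x</span></a><h3>y</h3></li>", "html.parser")
li = soup.find("li")

# .children yields only the direct children.
child_names = [c.name for c in li.children if isinstance(c, Tag)]

# .descendants walks the whole subtree in document order; strings are
# skipped here by keeping only Tag instances.
descendant_names = [d.name for d in li.descendants if isinstance(d, Tag)]

print(child_names, descendant_names)
```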

4. Working with sibling tags

# Work with the siblings of an a tag
a = BeautifulSoup(html_str, "lxml").find("a", class_="index2")
print(a.next_siblings, type(a.next_siblings))  # <generator object next_siblings at 0x000001B14AA712B0> <class 'generator'>
"""
for i in a.next_siblings:
    print(i, type(i), "\n")

1. <a class="index" href="/commands.html" name="a1">The first a tag </a> <class 'bs4.element.Tag'>
2. <a href="/commands/flushdb.html"> <span class="first"> This is the first span tag <span class="second"> This is the second span tag, a child of the first </span> </span> <span class="third">This is the third span tag</span> <h3>This is an h3</h3> </a> <class 'bs4.element.Tag'>
"""
# print("next--", a.last, type(a.next))
# next_sibling is the next node among the siblings; next_siblings yields all following siblings
# previous_sibling is the previous node; previous_siblings yields all preceding siblings
# Get the last of the following siblings:
a = list(a.next_siblings)[-1]
print("aaaaaa", a, type(a))
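One thing worth illustrating about sibling navigation: when the markup contains line breaks, next_sibling often returns a whitespace string rather than a tag. A rough sketch, assuming made-up markup and "html.parser", shows how find_next_sibling and find_next_siblings skip those strings by matching on a tag name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a newline between each pair of <a> tags.
sample = """<li>
<a id="one">1</a>
<a id="two">2</a>
<a id="three">3</a>
</li>"""
soup = BeautifulSoup(sample, "html.parser")
first = soup.find("a", id="one")

# The immediate next sibling is the newline between the tags, not a tag.
raw_next = first.next_sibling

# find_next_sibling matches only tags, so the whitespace is skipped.
next_tag = first.find_next_sibling("a")

# All following <a> siblings, and the last one among them.
following_ids = [t["id"] for t in first.find_next_siblings("a")]
last_tag = first.find_next_siblings("a")[-1]

print(repr(raw_next), next_tag["id"], following_ids, last_tag["id"])
```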

5. Working with parent tags

# 1. parent returns the parent tag together with everything it contains
span = BeautifulSoup(html_str, "lxml").find("span", class_="second")
print(span.parent, type(span.parent))

# 2. parents walks up the tree one ancestor at a time
"""
span = BeautifulSoup(html_str, "lxml").find("span", class_="second")
for i in span.parents:
    print(i)
"""

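To make the upward walk through parent tags concrete, here is a small sketch that prints only the ancestor names instead of each ancestor's full markup. The markup and variable names are invented for the example, and "html.parser" is assumed so that no html or body wrapper is added around the fragment.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: span -> a -> li -> ul.
markup = '<ul><li><a><span class="second">x</span></a></li></ul>'
soup = BeautifulSoup(markup, "html.parser")
span = soup.find("span", class_="second")

# .parent is the immediately enclosing tag.
parent_name = span.parent.name

# .parents yields every ancestor up to the BeautifulSoup object itself,
# whose name is the special value "[document]".
ancestor_names = [p.name for p in span.parents]

# find_parent jumps straight to the nearest ancestor with a given name.
nearest_li = span.find_parent("li")

print(parent_name, ancestor_names, nearest_li.name)
```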
6. Other useful tag methods

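# 1. prettify
# Puts each tag and string on its own line, indented by nesting depth, so the printed HTML is easy to read.
# It can also be called on a single tag.
a = BeautifulSoup(html_str, "lxml").find("a")
print(a.prettify())

# 2. contents
li_find = BeautifulSoup(html_str, "lxml").find("li")
print(li_find.contents, type(li_find.contents))
print(li_find.children, type(li_find.children))
"""
li_find.contents returns a list of the tag's direct children, including '\n' strings.
li_find.children returns an iterator over the same items as li_find.contents.
"""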
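The relationship between contents and children can be checked directly. The following is a sketch under assumptions: the markup and names are illustrative and "html.parser" is used as the parser. It also shows one way to keep only the child tags and drop the newline strings, by calling find_all with recursive=False.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with newlines between the child tags.
soup = BeautifulSoup("<li>\n<a>1</a>\n<a>2</a>\n</li>", "html.parser")
li = soup.find("li")

# .contents is a plain list of direct children, newline strings included.
contents = li.contents

# .children iterates over the same items, in the same order.
same_items = list(li.children) == contents

# find_all(True, recursive=False) keeps only the direct child tags.
tags_only = [t.name for t in li.find_all(True, recursive=False)]

print(len(contents), same_items, tags_only)
```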
