Beautiful Soup is a Python library for extracting data from HTML or XML documents. Put simply, it parses an HTML document into a tree of tags, after which the attributes of any given tag are easy to get at, and it is also a convenient basis for crawling and parsing the content of a whole site.
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers; if none of them is installed, Python falls back to its default parser. lxml is a Python parsing library that supports both HTML and XML, while the html5lib parser parses a page the way a browser does and produces a valid HTML5 document (a short example follows the install commands below).
pip install beautifulsoup4
pip install html5lib
pip install lxml
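To pick a parser, pass its name as the second argument when building the soup. The snippet below is only a minimal sketch of that choice; the markup variable is an invented scrap of sloppy HTML, not something from the article:

from bs4 import BeautifulSoup

# An invented fragment of broken HTML, just to show how each parser repairs it
markup = "<p>Hello<p>world"

# Python's built-in parser, no extra installation needed
print(BeautifulSoup(markup, "html.parser"))
# lxml: fast and tolerant of broken markup
print(BeautifulSoup(markup, "lxml"))
# html5lib: parses the way a browser does and builds a complete HTML5 document
print(BeautifulSoup(markup, "html5lib"))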
Suppose we now have a fragment of incomplete HTML, and we want to parse it with the Beautiful Soup module:
data = '''
<html><head><title>The Dormouse's story</title></he
<body>
<p class="title"><b id="title">The Dormouse's story</b></p>
<p class="story">Once upon a time there were three
<a href="http://example.com/elsie" class="sister" i
<a href="http://example.com/lacie" class="sister" i
<a href="http://example.com/tillie" class="sister"
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
Through the methods Beautiful Soup provides we can then get at the document's elements, attributes, links, text and so on, and Beautiful Soup can turn this incomplete HTML document into a well-formed one. For example, call print(soup.prettify()) and see what it outputs:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b id="title">
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three
   <a a="" and="" at="" bottom="" class="sister" href="http://example.com/elsie" i="" lived="" of="" the="" they="" well.="">
    <p class="story">
     ...
    </p>
   </a>
  </p>
 </body>
</html>
# Grab the first matching tag directly as an attribute
print('title = {}'.format(soup.title))
# Output: title = <title>The Dormouse's story</title>
print('a={}'.format(soup.a))

# The name of a tag
print('title_name = {}'.format(soup.title.name))
# Output: title_name = title
print('body_name = {}'.format(soup.body.name))
# Output: body_name = body

# The text inside a tag
print('title_string = {}'.format(soup.title.string))
# Output: title_string = The Dormouse's story
# The parent of a tag
print('title_parent = {}'.format(soup.title.parent))
# Output: title_parent = <head><title>The Dormouse's story</title></head>

print('p = {}'.format(soup.p))
# Output: p = <p class="title"><b id="title">The Dormouse's story</b></p>
# Tag attributes can be indexed like a dict
print('p_class = {}'.format(soup.p["class"]))
# Output: p_class = ['title']
print('a_class = {}'.format(soup.a["class"]))
# Output: a_class = ['sister']

# Get all the <a> tags
print('a = {}'.format(soup.find_all('a')))
# Get all the <p> tags
print('p = {}'.format(soup.find_all('p')))

# Find a tag by its id
print('a_link = {}'.format(soup.find(id='title')))
# Output: a_link = <b id="title">The Dormouse's story</b>
Beautiful Soup turns every node of the parsed document into a Python object, and these objects come in four kinds: Tag (a tag), NavigableString (the text inside a tag), BeautifulSoup (the root of the document tree) and Comment (the comment text inside a tag, a special kind of NavigableString); a quick sketch of the four types follows.
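The tiny doc string in this sketch is made up purely for illustration; it is not part of the article's example:

from bs4 import BeautifulSoup

doc = '<p class="title"><b id="title">Hi</b><!-- a comment --></p>'
s = BeautifulSoup(doc, 'lxml')

print(type(s))                # <class 'bs4.BeautifulSoup'>        -- the root object
print(type(s.b))              # <class 'bs4.element.Tag'>
print(type(s.b.string))       # <class 'bs4.element.NavigableString'>
print(type(s.p.contents[1]))  # <class 'bs4.element.Comment'>      -- the comment node after <b>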
The basic syntax:
soup.<tag>: gets a tag from the HTML;
soup.<tag>.name: gets the name of the tag;
soup.<tag>.attrs: gets all of the tag's attributes as a dict (see the sketch after this list);
soup.<tag>.string: gets the text content of the tag;
soup.<tag>.parent: gets the tag's parent tag;
prettify(): formats the Beautiful Soup parse tree and outputs it as Unicode, with each XML/HTML tag on its own line;
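attrs is the only accessor above that the earlier examples do not use, so here is a quick sketch against the same soup built from data; the dicts in the comments follow from the markup shown earlier:

print(soup.p.attrs)      # {'class': ['title']}
print(soup.b.attrs)      # {'id': 'title'}
print(soup.title.attrs)  # {} -- <title> carries no attributes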
contents: gets all child nodes as a list, so individual children can be taken by index;

soup = BeautifulSoup(data, "lxml")
# Returns a list of the <p> tag's children
print(soup.p.contents)
# Take the first child node
print(soup.p.contents[0])
children: returns a generator over the child nodes;

for tag in soup.p.children:
    print(tag)
soup.strings: yields the text of every node, whitespace included;

soup = BeautifulSoup(data, "lxml")
for content in soup.strings:
    print(repr(content))
soup.stripped_strings: yields the text of every node with surrounding whitespace stripped out;

soup = BeautifulSoup(data, "lxml")
for tag in soup.stripped_strings:
    print(repr(tag))
find_all(): searches the child nodes for every tag with the given name (several names can be searched at once), checks each one against the filter conditions, and returns a list;

import re

soup = BeautifulSoup(data, "lxml")
print(soup.find_all('a'))
print(soup.find_all(['a', 'p']))
print(soup.find_all(re.compile('^a')))
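The filters are not limited to tag names; as a sketch against the same soup, keyword arguments and the class_/attrs parameters filter on attribute values (class_ takes a trailing underscore because class is a Python keyword):

# Filter by attribute value instead of tag name
print(soup.find_all(id='title'))                      # [<b id="title">The Dormouse's story</b>]
print(soup.find_all(class_='sister'))                 # every tag whose class includes "sister"
print(soup.find_all('a', attrs={'class': 'sister'}))  # combine a tag name with an attribute filter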
find(): works much like find_all(), except that find_all() returns its results in a list while find() returns the first matching result directly;

soup = BeautifulSoup(data, "lxml")
print(soup.find('a'))