The Beautiful Soup Module

Date: 2019-12-05

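Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document. Beautiful Soup can save you hours or even days of work.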

Quick start: the following HTML serves as the example.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Parsing this document with BeautifulSoup yields a BeautifulSoup object, which can print the document as a conventionally indented structure:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')
print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

A few simple ways to navigate the structured data:

# Print the title tag
soup.title
<title>The Dormouse's story</title>
# Print the name of the title tag
soup.title.name
'title'
# Print the text inside the title tag
soup.title.string
"The Dormouse's story"
# Get a generator over the strings inside the title tag (iterate over it to read the text)
soup.title.strings
<generator object _all_strings at 0x0000025B5572A780>
# Print the name of the title tag's parent tag
soup.title.parent.name
'head'
# Print the first p tag
soup.p
<p class="title"><b>The Dormouse's story</b></p>
# Get the class attribute of the first p tag
soup.p['class']  # or soup.p.get('class')
['title']
# Print the first a tag
soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# Get all a tags; the result is a list-like ResultSet
soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
# Return the tag whose id is link3
soup.find(id='link3')
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
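
The difference between .string and .strings is easy to miss, so here is a minimal sketch. The HTML fragment is a made-up stand-in for the example document, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a p tag containing both text and child tags
html = '<p class="story">Once upon a time <a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# .string is None when a tag has more than one child
single = p.string

# .strings is a generator; wrap it in list() to see every text piece
pieces = list(p.strings)

# .stripped_strings yields the same pieces with surrounding whitespace removed
clean = list(p.stripped_strings)

print(single)   # None
print(pieces)   # ['Once upon a time ', 'Elsie', ' and ', 'Lacie']
print(clean)    # ['Once upon a time', 'Elsie', 'and', 'Lacie']
```

For a tag with a single text child, such as the a tags here, .string returns that text directly.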

Find the links of all <a> tags in the document:

for link in soup.find_all('a'):
    print(link.get('href'))
    
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
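
Real pages often contain a tags without an href attribute, in which case link.get('href') returns None. The sketch below, using a made-up fragment, shows one way to skip such tags by passing href=True to find_all().

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the middle a tag has no href attribute
html = (
    '<a href="http://example.com/elsie">Elsie</a>'
    '<a name="top">no link</a>'
    '<a href="http://example.com/lacie">Lacie</a>'
)
soup = BeautifulSoup(html, "html.parser")

# .get() returns None for the tag that lacks href
all_hrefs = [a.get("href") for a in soup.find_all("a")]

# href=True keeps only the tags that actually have the attribute
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(all_hrefs)  # ['http://example.com/elsie', None, 'http://example.com/lacie']
print(hrefs)      # ['http://example.com/elsie', 'http://example.com/lacie']
```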

Extract all the text content from the document:

print(soup.get_text())
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
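
get_text() also accepts a separator and a strip flag, which helps when the raw text carries a lot of stray whitespace. A small sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with padding spaces and newlines between tags
html = "<div>\n<p> The Dormouse's story </p>\n<p> Once upon a time </p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

# Default: every string is concatenated exactly as it appears in the document
raw = soup.get_text()

# separator joins the pieces; strip=True trims each piece and drops empty ones
compact = soup.get_text("\n", strip=True)

print(repr(raw))
print(repr(compact))  # "The Dormouse's story\nOnce upon a time"
```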

Get a tag's attributes:

soup.a.attrs
{'id': 'link1', 'class': ['sister'], 'href': 'http://example.com/elsie'}
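
Note that class comes back as a list while id and href are plain strings. The sketch below, on a made-up tag, illustrates this along with the difference between indexing and .get() for a missing attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with two class names
html = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
tag = BeautifulSoup(html, "html.parser").a

# class is a multi-valued attribute, so it is returned as a list
classes = tag["class"]

# id is single-valued, so it is returned as a string
tag_id = tag["id"]

# .get() returns None (or a supplied default) for a missing attribute
missing = tag.get("title")
fallback = tag.get("title", "n/a")

# Indexing a missing attribute raises KeyError instead
try:
    tag["title"]
    raised = False
except KeyError:
    raised = True

print(classes, tag_id, missing, fallback, raised)
# ['sister', 'big'] link1 None n/a True
```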

Using the find(), findAll(), and find_all() functions of the BeautifulSoup library

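Once the BeautifulSoup object is built, find() and findAll() let you filter a large HTML document down to exactly the parts you want, based on the tags' attributes. findAll() is the older spelling of find_all(); the two behave identically.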

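Both functions are flexible. You can search for tags by their id attribute, by their class attribute, by a dict of attributes (findAll() returns the matches as a list), by regular expression, and so on.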
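
The four search styles just listed can be sketched as follows. The fragment is a shortened copy of the example document, and the regular expression is only an illustration.

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# 1. By id attribute: find() returns the first match
by_id = soup.find(id="link1")

# 2. By class attribute: class is a Python keyword, so the argument is class_
by_class = soup.find_all("a", class_="sister")

# 3. By a dict of attributes: every key/value pair must match
by_dict = soup.find_all(attrs={"class": "sister", "id": "link2"})

# 4. By regular expression: here, hrefs ending in /elsie or /tillie
by_regex = soup.find_all(href=re.compile(r"/(elsie|tillie)$"))

print(by_id.get_text())                # Elsie
print(len(by_class))                   # 3
print([t["id"] for t in by_dict])      # ['link2']
print([t["id"] for t in by_regex])     # ['link1', 'link3']
```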

Basic usage:

Search for a tag by its id attribute:

t = soup.find(attrs={"id":"aa"})

Search for all a tags whose class attribute is sister:

t= soup.findAll('a',{'class':'sister'})

The find_all() method searches all descendant tags of the current tag and returns those that match the filter.

soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
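
Two details are worth knowing when using these calls: what happens when nothing matches, and how to cap the number of results. A minimal sketch on a made-up list:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with three list items and no table
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# limit stops the search after the given number of matches
first_two = soup.find_all("li", limit=2)

# find() returns None when nothing matches
missing_one = soup.find("table")

# find_all() returns an empty list-like result when nothing matches
missing_all = soup.find_all("table")

print([li.string for li in first_two])  # ['one', 'two']
print(missing_one)                      # None
print(len(missing_all))                 # 0
```

Because find() can return None, it is safer to check its result before calling further methods on it.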

Using BeautifulSoup

Once the requests library has fetched the page data, BeautifulSoup takes over.

An example:

#!/usr/bin/python
# coding: utf-8

import requests
from bs4 import BeautifulSoup

# Fetch the page; requests.get() returns a Response object
response = requests.get("http://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/?focus=book")

# Inspect the page source
# print(response.text)

# Create the BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
# print(soup.prettify())

# Find the div tag whose id attribute is "book"
book_div = soup.find('div', {'id': 'book'})
# print(book_div)

# An equivalent way to write it, giving the same result:
# book_div = soup.find(attrs={"id": "book"})
# print('book_div contents:', book_div)

# Collect every book's a tag via class="title"
book_a = book_div.findAll(attrs={"class": "title"})
print(book_a)

# Loop over every a tag in book_a; book.string is the text inside the a tag
for book in book_a:
    print(book.string)
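
The script above depends on the live page still having a div with id="book". As a sketch of the same extraction that runs offline and tolerates a missing div, the version below uses a made-up HTML string in place of the downloaded page; the markup and the extract_titles helper are assumptions for illustration, not the real structure of the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page
page = """
<div id="book">
  <dl><dd><a class="title" href="/book/1">Book One</a></dd></dl>
  <dl><dd><a class="title" href="/book/2">Book Two</a></dd></dl>
</div>
"""


def extract_titles(html):
    """Return the text of every class="title" tag inside div#book."""
    soup = BeautifulSoup(html, "html.parser")
    book_div = soup.find("div", {"id": "book"})
    # find() returns None when the div is absent, so guard before searching it
    if book_div is None:
        return []
    return [a.get_text(strip=True) for a in book_div.find_all(attrs={"class": "title"})]


titles = extract_titles(page)
print(titles)                         # ['Book One', 'Book Two']
print(extract_titles("<p>none</p>"))  # []
```

Keeping the parsing in a function that takes an HTML string makes it possible to test the extraction without any network access.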