Beautiful Soup is a Python library for extracting data from HTML or XML documents. Put simply, it parses an HTML document into a tree of tags, after which the attributes of any given tag are easy to get at, and it is also a convenient basis for crawling and parsing the content of a whole site.
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers; if none of them is installed, Python falls back to its default parser. lxml is a Python parsing library that supports both HTML and XML, while the html5lib parser parses a page the way a browser does and produces a valid HTML5 document (a short example follows the install commands below).
pip install beautifulsoup4
pip install html5lib
pip install lxml
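To pick a parser, pass its name as the second argument when building the soup. The snippet below is only a minimal sketch of that choice; the markup variable is an invented scrap of sloppy HTML, not something from the article:

from bs4 import BeautifulSoup

# An invented fragment of broken HTML, just to show how each parser repairs it
markup = "<p>Hello<p>world"

# Python's built-in parser, no extra installation needed
print(BeautifulSoup(markup, "html.parser"))
# lxml: fast and tolerant of broken markup
print(BeautifulSoup(markup, "lxml"))
# html5lib: parses the way a browser does and builds a complete HTML5 document
print(BeautifulSoup(markup, "html5lib"))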
Suppose we now have a fragment of incomplete HTML, and we want to parse it with the Beautiful Soup module:
data = '''
<html><head><title>The Dormouse's story</title></he
<body>
<p class="title"><b id="title">The Dormouse's story</b></p>
<p class="story">Once upon a time there were three
<a href="http://example.com/elsie" class="sister" i
<a href="http://example.com/lacie" class="sister" i
<a href="http://example.com/tillie" class="sister"
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
Through the methods Beautiful Soup provides we can then get at the document's elements, attributes, links, text and so on, and Beautiful Soup can turn this incomplete HTML document into a well-formed one. For example, call print(soup.prettify()) and see what it outputs:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b id="title">
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three
   <a a="" and="" at="" bottom="" class="sister" href="http://example.com/elsie" i="" lived="" of="" the="" they="" well.="">
    <p class="story">
     ...
    </p>
   </a>
  </p>
 </body>
</html>
# Grab the first matching tag directly as an attribute
print('title = {}'.format(soup.title))
# Output: title = <title>The Dormouse's story</title>
print('a={}'.format(soup.a))

# The name of a tag
print('title_name = {}'.format(soup.title.name))
# Output: title_name = title
print('body_name = {}'.format(soup.body.name))
# Output: body_name = body

# The text inside a tag
print('title_string = {}'.format(soup.title.string))
# Output: title_string = The Dormouse's story
# The parent of a tag
print('title_parent = {}'.format(soup.title.parent))
# Output: title_parent = <head><title>The Dormouse's story</title></head>

print('p = {}'.format(soup.p))
# Output: p = <p class="title"><b id="title">The Dormouse's story</b></p>
# Tag attributes can be indexed like a dict
print('p_class = {}'.format(soup.p["class"]))
# Output: p_class = ['title']
print('a_class = {}'.format(soup.a["class"]))
# Output: a_class = ['sister']

# Get all the <a> tags
print('a = {}'.format(soup.find_all('a')))
# Get all the <p> tags
print('p = {}'.format(soup.find_all('p')))

# Find a tag by its id
print('a_link = {}'.format(soup.find(id='title')))
# Output: a_link = <b id="title">The Dormouse's story</b>
Beautiful Soup turns every node of the parsed document into a Python object, and these objects come in four kinds: Tag (a tag), NavigableString (the text inside a tag), BeautifulSoup (the root of the document tree) and Comment (the comment text inside a tag, a special kind of NavigableString); a quick sketch of the four types follows.
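The tiny doc string in this sketch is made up purely for illustration; it is not part of the article's example:

from bs4 import BeautifulSoup

doc = '<p class="title"><b id="title">Hi</b><!-- a comment --></p>'
s = BeautifulSoup(doc, 'lxml')

print(type(s))                # <class 'bs4.BeautifulSoup'>        -- the root object
print(type(s.b))              # <class 'bs4.element.Tag'>
print(type(s.b.string))       # <class 'bs4.element.NavigableString'>
print(type(s.p.contents[1]))  # <class 'bs4.element.Comment'>      -- the comment node after <b>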
The basic syntax:
soup.<tag>: gets a tag from the HTML;
soup.<tag>.name: gets the name of the tag;
soup.<tag>.attrs: gets all of the tag's attributes as a dict (see the sketch after this list);
soup.<tag>.string: gets the text content of the tag;
soup.<tag>.parent: gets the tag's parent tag;
prettify(): formats the Beautiful Soup parse tree and outputs it as Unicode, with each XML/HTML tag on its own line;
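attrs is the only accessor above that the earlier examples do not use, so here is a quick sketch against the same soup built from data; the dicts in the comments follow from the markup shown earlier:

print(soup.p.attrs)      # {'class': ['title']}
print(soup.b.attrs)      # {'id': 'title'}
print(soup.title.attrs)  # {} -- <title> carries no attributes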
contents: gets all child nodes as a list, so individual children can be taken by index;

soup = BeautifulSoup(data, "lxml")
# Returns a list of the <p> tag's children
print(soup.p.contents)
# Take the first child node
print(soup.p.contents[0])
children: returns a generator over the child nodes;

for tag in soup.p.children:
    print(tag)
soup.strings: yields the text of every node, whitespace included;

soup = BeautifulSoup(data, "lxml")
for content in soup.strings:
    print(repr(content))
soup.stripped_strings: yields the text of every node with surrounding whitespace stripped out;

soup = BeautifulSoup(data, "lxml")
for tag in soup.stripped_strings:
    print(repr(tag))
find_all(): searches the child nodes for every tag with the given name (several names can be searched at once), checks each one against the filter conditions, and returns a list;

import re

soup = BeautifulSoup(data, "lxml")
print(soup.find_all('a'))
print(soup.find_all(['a', 'p']))
print(soup.find_all(re.compile('^a')))
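The filters are not limited to tag names; as a sketch against the same soup, keyword arguments and the class_/attrs parameters filter on attribute values (class_ takes a trailing underscore because class is a Python keyword):

# Filter by attribute value instead of tag name
print(soup.find_all(id='title'))                      # [<b id="title">The Dormouse's story</b>]
print(soup.find_all(class_='sister'))                 # every tag whose class includes "sister"
print(soup.find_all('a', attrs={'class': 'sister'}))  # combine a tag name with an attribute filter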
find(): works much like find_all(), except that find_all() returns its results in a list while find() returns the first matching result directly;

soup = BeautifulSoup(data, "lxml")
print(soup.find('a'))