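Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work.
Quick start: the following HTML is used as the example.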
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
Parsing this markup with BeautifulSoup produces a BeautifulSoup object, which can print the document as a tree with standard indentation:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
A few simple ways to navigate the structured data:
# The <title> tag itself
soup.title
# <title>The Dormouse's story</title>

# The name of the <title> tag
soup.title.name
# 'title'

# The text inside the <title> tag
soup.title.string
# "The Dormouse's story"

# .strings returns a generator over the tag's text fragments;
# what is displayed is the generator object, not the text itself
soup.title.strings
# <generator object _all_strings at 0x0000025B5572A780>

# The name of the <title> tag's parent
soup.title.parent.name
# 'head'

# The first <p> tag
soup.p
# <p class="title"><b>The Dormouse's story</b></p>

# The value of the first <p> tag's class attribute
soup.p['class']  # or soup.p.get('class')
# ['title']

# The first <a> tag
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# All <a> tags, returned as a list
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# The tag whose id is link3
soup.find(id='link3')
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
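The difference between .string and .strings is easy to miss, so here is a minimal sketch of how they behave on a tag with several children. The snippet variable is a shortened stand-in written for this illustration, not the full example document.

```python
from bs4 import BeautifulSoup

# A small stand-in document (an assumption for illustration only).
snippet = '<p class="story">Once <a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

# .string is None here because <p> has more than one child node.
single = p.string

# .strings is a generator; list() collects every text fragment in order.
fragments = list(p.strings)

# .stripped_strings trims each fragment and skips whitespace-only ones.
clean = list(p.stripped_strings)

print(single)
print(fragments)
print(clean)
```

When a tag may contain nested tags, iterating over .strings or calling get_text() is usually the safer choice than .string.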
Find the link of every <a> tag in the document:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
Get all of the text in the document:
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
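By default get_text() keeps the whitespace exactly as it appears in the markup. The sketch below, which uses a small made-up snippet rather than the example document, shows the separator and strip arguments that get_text() accepts for tidier output.

```python
from bs4 import BeautifulSoup

# Made-up markup with uneven whitespace (an assumption for illustration).
snippet = "<p>  Elsie  </p><p>Lacie</p><p> Tillie </p>"
soup = BeautifulSoup(snippet, "html.parser")

# Default behaviour: fragments are concatenated as-is, whitespace included.
raw = soup.get_text()

# strip=True trims each fragment; separator is placed between fragments.
joined = soup.get_text(separator="|", strip=True)

print(repr(raw))
print(joined)
```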
Get a tag's attributes:
soup.a.attrs
# {'id': 'link1', 'class': ['sister'], 'href': 'http://example.com/elsie'}
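As the output shows, class comes back as a list while id and href are plain strings, because class is a multi-valued attribute in HTML. A short sketch of reading attributes safely follows; the second class name "first" is added here only to make the list behaviour visible.

```python
from bs4 import BeautifulSoup

# The extra class "first" is an assumption added for illustration.
snippet = '<a href="http://example.com/elsie" class="sister first" id="link1">Elsie</a>'
soup = BeautifulSoup(snippet, "html.parser")
a = soup.a

classes = a["class"]                   # multi-valued attribute: a list
link_id = a["id"]                      # single-valued attribute: a string
missing = a.get("title")               # absent attribute: None, no KeyError
fallback = a.get("title", "untitled")  # absent attribute with a default
has_href = a.has_attr("href")          # membership check

print(classes, link_id, missing, fallback, has_href)
```

Indexing with a["title"] would raise KeyError for a missing attribute, so get() is the gentler option when the attribute is optional.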
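Using the find(), findAll(), and find_all() functions of BeautifulSoup
Once the BeautifulSoup object has been built, find() and findAll() let you filter a large HTML document down to the parts you want, based on the attributes of its tags. findAll() is the older name for find_all(); both refer to the same method.
These functions are flexible. You can search for tags by their id attribute, by their class attribute, by a dictionary of attributes, or by matching a regular expression, among other options. find() returns the first match, while findAll() and find_all() return a list of all matches.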
Basic usage:
Search for a tag by its id attribute:
t = soup.find(attrs={"id":"aa"})
Find all <a> tags whose class attribute is sister:
t = soup.findAll('a', {'class': 'sister'})
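The regular-expression and dictionary forms mentioned above can be sketched as follows. The snippet is a trimmed variant of the example document; the host other.example.org on the third link is an assumption introduced so that the regular expression has something to exclude.

```python
import re
from bs4 import BeautifulSoup

# Trimmed variant of the example; the third href is changed for illustration.
snippet = """
<body>
<p class="title"><b>The Dormouse's story</b></p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://other.example.org/tillie" class="sister" id="link3">Tillie</a>
</body>
"""
soup = BeautifulSoup(snippet, "html.parser")

# A compiled regex as the first argument is matched against tag names.
b_tags = [tag.name for tag in soup.find_all(re.compile("^b"))]

# A compiled regex as a keyword value is matched against that attribute.
example_links = soup.find_all("a", href=re.compile(r"example\.com"))
example_ids = [a["id"] for a in example_links]

# A dictionary passed as attrs must match on every key.
link2 = soup.find("a", attrs={"class": "sister", "id": "link2"})

print(b_tags)
print(example_ids)
print(link2.get_text())
```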
The find_all() method searches all descendant tags of the current tag and checks whether each one matches the filter.
soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
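Beyond tag names and id values, find_all() also accepts the class_ keyword, a limit on the number of results, and a function as the filter. The following is a rough sketch on a trimmed copy of the example markup; the helper name has_class_but_no_id is chosen for this illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the example markup.
snippet = """
<p class="title"><b>The Dormouse's story</b></p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(snippet, "html.parser")

# class_ (with the trailing underscore) avoids the Python keyword "class".
sisters = soup.find_all("a", class_="sister")

# limit stops the search after the given number of matches.
first_two = [a["id"] for a in soup.find_all("a", limit=2)]

def has_class_but_no_id(tag):
    """Filter function: keep tags that define class but not id."""
    return tag.has_attr("class") and not tag.has_attr("id")

no_id = [tag.name for tag in soup.find_all(has_class_but_no_id)]

print(len(sisters), first_two, no_id)
```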
Using BeautifulSoup
After fetching a page with the requests library, you can hand its content to BeautifulSoup.
An example:
#!/usr/bin/python
# coding: utf-8
import requests
from bs4 import BeautifulSoup

# Fetch the page
response = requests.get("http://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/?focus=book")
# print(response.text)

# Build the BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")
# print(soup.prettify())

# book_div: the <div> whose id attribute is "book"
book_div = soup.find('div', {'id': 'book'})
# print(book_div)

# Another way to write the same lookup, with the same result
# book_div = soup.find(attrs={"id": "book"})
# print('book_div:', book_div)

# Get all book <a> tags via class="title"
book_a = book_div.findAll(attrs={"class": "title"})
print(book_a)

# Loop over every <a> tag in book_a; book.string is the text inside the tag
for book in book_a:
    print(book.string)
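The script above assumes that the div with id "book" is always present; if the page layout changes, soup.find() returns None and the next call fails. A more defensive variant might look like the sketch below. The sample_html string is a hypothetical stand-in for response.text, and the real page structure may well differ; extract_titles is a helper name invented for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text; the real page may look different.
sample_html = """
<div id="book">
  <a class="title" href="/book/1">First Book</a>
  <a class="title" href="/book/2"><span>Second</span> Book</a>
</div>
"""

def extract_titles(html, container_id="book"):
    """Return the text of every class="title" element inside the container."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id=container_id)
    if container is None:
        # The container is missing, so there is nothing to extract.
        return []
    # get_text() also works when the tag holds nested tags,
    # a case in which .string would be None.
    return [
        tag.get_text(" ", strip=True)
        for tag in container.find_all(class_="title")
    ]

titles = extract_titles(sample_html)
missing = extract_titles("<html><body></body></html>")

print(titles)
print(missing)
```

Keeping the parsing in a function that takes a string makes it possible to try the logic on saved or hand-written HTML before pointing it at a live page.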