[Python Web Scraping Study Notes (3)] Beautiful Soup: A Summary of Key Points

Date: 2019-11-17


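1. Introduction to Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It provides idiomatic ways to navigate, search, and modify the parse tree, which can save a great deal of development time when writing a scraper.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only need to specify the original encoding.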
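As a minimal sketch of the encoding behaviour described above, the snippet below feeds Beautiful Soup a byte string that declares no encoding and names the encoding explicitly through the from_encoding argument. The sample bytes and variable names are made up for illustration, and the built-in "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical input: Latin-1 bytes with no charset declaration.
raw = "<h1>Sacré bleu!</h1>".encode("latin-1")

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.h1.string                 # decoded to a Unicode string
utf8_bytes = soup.h1.encode("utf-8")  # re-encoded as UTF-8 on output

print(text)
print(utf8_bytes)
```

Inside the tree everything is a Unicode string; bytes only appear again when the output is explicitly encoded.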

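Beautiful Soup is not itself a parser: it sits on top of Python parsers such as the built-in html.parser, lxml, and html5lib, so you can choose between different parsing strategies and trade speed for leniency.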
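To make the point about parsing strategies concrete, here is a small sketch that parses the same broken markup with each parser that happens to be installed. The markup and the results dictionary are illustrative only; lxml and html5lib are optional third-party packages, so the code skips them when Beautiful Soup reports that they are missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # deliberately invalid markup

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # Raised when the requested parser is not installed.
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output)
```

Different parsers may repair invalid markup differently, which is one reason to name the parser explicitly instead of relying on whichever one is found first.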

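2. Installing Beautiful Soup

Beautiful Soup can be installed quickly with pip. The current major version is Beautiful Soup 4, published on PyPI as beautifulsoup4.
$ pip install beautifulsoup4
After installation, import it from the bs4 package and it is ready to use.
from bs4 import BeautifulSoup

3. Creating a Beautiful Soup object

The following test document is used for the rest of this summary.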
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
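After the import, create a BeautifulSoup object as shown below. The first argument can be the HTML of a fetched page as a string, or an open file handle for a page saved locally, such as test.html. Passing a parser name as the second argument (here the built-in "html.parser") avoids a warning and keeps results consistent across machines.
soup = BeautifulSoup(html, "html.parser")
with open("test.html") as f:
    soup = BeautifulSoup(f, "html.parser")
Then print the object to check that it was created correctly. Note: under Python 2 it was sometimes necessary to append .encode('utf-8') for the soup object to display correctly; on Python 3 this is normally unnecessary.
print(soup.prettify())

4. The four kinds of Beautiful Soup objects

Beautiful Soup has four main kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.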

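4.1 Tag

The Tag object

A Tag corresponds to a tag in the HTML document together with the content it encloses. For example, the following is a Tag.
<title>The Dormouse's story</title>
The title Tag can be obtained like this; the comment line shows the result.
print(soup.title)
# <title>The Dormouse's story</title>
Note: an object of type 'bs4.element.Tag' supports further access with the dot syntax, that is, the same navigation and search operations as the soup object itself. Make sure an object really is a 'bs4.element.Tag' before operating on it further; the .find method introduced later, for example, is a search method of Tag objects.
print(type(soup.title))
# <class 'bs4.element.Tag'>

Tag attributes

.name

The .name attribute of a Tag gives the name of the tag itself.
print(soup.title.name)
# title

.attrs

The .attrs attribute of a Tag gives a dictionary of all of the tag's attributes.
print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}
A Tag supports the usual dictionary operations on its attributes, such as reading, modifying, and deleting.
print(soup.p['class'])      # read (method 1)
# ['title']
print(soup.p.get('class'))  # read (method 2)
# ['title']

soup.p['class'] = "newClass"  # modify
print(soup.p)
# <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

del soup.p['class']  # delete
print(soup.p)
# <p name="dromouse"><b>The Dormouse's story</b></p>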

4.2 NavigableString

The text inside a tag is available through the .string attribute, and it is an object of type 'bs4.element.NavigableString'.
print(soup.p.string)
# The Dormouse's story

print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>

4.3 BeautifulSoup

The BeautifulSoup object represents the document as a whole. Most of the time it can be treated as a special kind of Tag.
print(type(soup.name))
# <class 'str'>
print(soup.name)
# [document]
print(soup.attrs)
# {} (an empty dictionary)
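
4.4 Comment

The three types above cover almost everything in an HTML or XML document, but Comment is one more type to be aware of. Like CData, ProcessingInstruction, Declaration, and Doctype, it is a subclass of NavigableString. The following code gives a quick idea of how it behaves.
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"  # the tag's content is a comment
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>
comment
# 'Hey, buddy. Want to buy a used parser?'
print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>
Note: the content of the tag is actually a comment, but .string returns it with the comment markers removed, so it can easily be mistaken for ordinary text. Check the type before using the value or operating on it.
from bs4.element import Comment
if isinstance(soup.b.string, Comment):
    ...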
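The type check above can be put to work when collecting link text. The following is only a sketch under assumed inputs: the two-link markup and the visible list are invented for this example, and one of the links contains a comment instead of real text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: the first link holds only a comment.
markup = '<p><a id="link1"><!-- Elsie --></a> <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(markup, "html.parser")

visible = []
for a in soup.find_all("a"):
    text = a.string
    # Skip missing strings and comments; keep only real text.
    if text is not None and not isinstance(text, Comment):
        visible.append(str(text))

print(visible)  # ['Lacie']
```

Using isinstance rather than comparing type() directly also works for subclasses, which is why it is the more common idiom.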
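
5. Traversing the tree

5.1 Children and descendants

.contents

The .contents attribute of a Tag gives its direct children as a list.
print(soup.head.contents)
# [<title>The Dormouse's story</title>]
Since it is a list, an item can be fetched directly by index.
print(soup.head.contents[0])
# <title>The Dormouse's story</title>

.children

The .children attribute of a Tag gives an iterator over its direct children, which can be looped over to get each element.
for child in soup.body.children:
    print(child)

.descendants

Unlike .contents and .children, which only give direct children, .descendants iterates over all descendants, peeling back the tags layer by layer. It is likewise consumed by looping.
for child in soup.descendants:
    print(child)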
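The difference between direct children and all descendants is easiest to see on a tiny document. This sketch uses a one-line head element chosen for illustration; the variable names are not from the original notes.

```python
from bs4 import BeautifulSoup

markup = "<head><title>The Dormouse's story</title></head>"
soup = BeautifulSoup(markup, "html.parser")
head = soup.head

children = list(head.children)        # direct children only
descendants = list(head.descendants)  # children, grandchildren, ...

print(len(children))     # 1: the <title> tag
print(len(descendants))  # 2: the <title> tag and the string inside it
```

The string inside title is a child of title, so it shows up among the descendants of head but not among its children.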

5.2 Parents

.parent

The .parent attribute of a Tag gives its direct parent.
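A short sketch of .parent in use, on a cut-down document assumed for this example. It also shows the two edge cases: the parent of the top-level tag is the BeautifulSoup object, and the BeautifulSoup object itself has no parent.

```python
from bs4 import BeautifulSoup

markup = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(markup, "html.parser")
title = soup.title

tag_parent = title.parent.name            # parent of the <title> tag
string_parent = title.string.parent.name  # parent of the text inside it
top_parent = soup.html.parent             # parent of the top-level tag

print(tag_parent)          # head
print(string_parent)       # title
print(top_parent is soup)  # True
print(soup.parent)         # None
```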

.parents

The .parents attribute iterates over all of an element's ancestors, from the nearest one up to the document itself.
content = soup.head.title.string
for parent in content.parents:
    print(parent.name)
# title
# head
# html
# [document]
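
5.3 Siblings

.next_sibling and .next_siblings

.next_sibling gives the next node at the same level as the Tag, or None if there is none. .next_siblings iterates over all following siblings at that level.

.previous_sibling and .previous_siblings

.previous_sibling gives the previous node at the same level as the Tag, or None if there is none. .previous_siblings iterates over all preceding siblings at that level.

Note: whitespace and line breaks in an HTML document are also nodes, so a sibling (or a child) may be a whitespace-only or other string rather than a Tag. Always check the type, for example with type() or isinstance(), before going on to the next operation.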
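The whitespace pitfall described in the note can be reproduced with two links separated by a comma and a line break. The markup below is a shortened, assumed version of the test document; the sketch shows the raw sibling, a type-filtered list, and the find_next_sibling() shortcut that skips strings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

markup = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(markup, "html.parser")
first = soup.find(id="link1")

raw_sibling = first.next_sibling  # the string ',\n', not the next <a>
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
next_link = first.find_next_sibling("a")

print(repr(raw_sibling))
print([t["id"] for t in tag_siblings])
print(next_link["id"])
print(soup.find(id="link2").next_sibling)  # None: no sibling after the last link
```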

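5.4 Next and previous elements

.next_element and .next_elements

Unlike .next_sibling and .next_siblings, these are not limited to siblings: they follow the order in which the document was parsed, regardless of nesting level, and give the next node and all subsequent nodes respectively. .next_elements is consumed by looping.

.previous_element and .previous_elements

These two attributes likewise ignore nesting level and give the previous node and all preceding nodes. .previous_elements is consumed by looping.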
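To contrast element order with sibling order, the sketch below uses a made-up paragraph containing two adjacent tags. The next sibling of the first tag is the second tag, but its next element is the string it contains, because that string was parsed immediately after the opening tag.

```python
from bs4 import BeautifulSoup

markup = "<p><a>Elsie</a><b>Lacie</b></p>"
soup = BeautifulSoup(markup, "html.parser")
a, b = soup.a, soup.b

next_sib = str(a.next_sibling)        # the following tag at the same level
next_elem = str(a.next_element)       # whatever was parsed next
prev_elem = str(b.previous_element)   # whatever was parsed just before <b>
following = [str(e) for e in a.next_elements]

print(next_sib)   # <b>Lacie</b>
print(next_elem)  # Elsie
print(prev_elem)  # Elsie
print(following)  # ['Elsie', '<b>Lacie</b>', 'Lacie']
```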

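5.5 Node content

.string

If a tag contains no other tags, .string returns its text. If a tag contains exactly one child tag, .string returns the text of the innermost tag.
print(soup.head.string)
# The Dormouse's story
print(soup.title.string)
# The Dormouse's story
If a Tag has more than one child node, it is ambiguous which child .string should refer to, so the result is None.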
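The None case is worth seeing once, since it is a common source of errors. In this sketch (the paragraph markup is assumed for illustration) the p tag has several children, so .string gives None, while the text can still be collected from .strings or get_text().

```python
from bs4 import BeautifulSoup

markup = "<p>Once <a>Elsie</a> and <a>Lacie</a></p>"
soup = BeautifulSoup(markup, "html.parser")

whole = soup.p.string              # several children: ambiguous, so None
single = soup.a.string             # exactly one string child
joined = "".join(soup.p.strings)   # every string under <p>, in order

print(whole)    # None
print(single)   # Elsie
print(joined)   # Once Elsie and Lacie
```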

.strings and .stripped_strings

When a Tag has several child nodes, loop over .strings to get the text of all of them. The output looks like this (the exact whitespace-only strings depend on the line breaks in the source document).
for string in soup.strings:
    print(repr(string))
# "The Dormouse's story"
# '\n\n'
# "The Dormouse's story"
# '\n\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# '...'
# '\n'
.stripped_strings gives the same strings with leading and trailing whitespace removed and whitespace-only strings skipped.
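A side-by-side comparison of the two generators, as a sketch on a small paragraph padded with extra whitespace (the markup is invented for this example).

```python
from bs4 import BeautifulSoup

markup = "<p>\n  Once upon a time  \n<a>Elsie</a>\n\n<a>Lacie</a>\n</p>"
soup = BeautifulSoup(markup, "html.parser")

raw = list(soup.strings)             # includes whitespace-only strings
clean = list(soup.stripped_strings)  # trimmed, whitespace-only ones dropped

print(len(raw))  # 5
print(clean)     # ['Once upon a time', 'Elsie', 'Lacie']
```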

.get_text()

If you only want the text of a document or tag, use the .get_text() method. It returns all the text in the document or under a Tag as a single Unicode string.
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

soup.get_text()
# '\nI linked to example.com\n'
soup.i.get_text()
# 'example.com'
You can specify a string to be used to join the pieces of text together.
soup.get_text("|")
# '\nI linked to |example.com|\n'
You can also have whitespace stripped from the start and end of each piece of text with strip=True.
soup.get_text("|", strip=True)
# 'I linked to|example.com'
Alternatively, use .stripped_strings in a list comprehension to list the text pieces yourself.
[text for text in soup.stripped_strings]
# ['I linked to', 'example.com']

6. Searching the tree

6.1 find_all(name, attrs, recursive, string, limit, **kwargs)

This method searches all descendants of the current Tag and returns a list of those that match the filter conditions.

The name argument

1) A string

The simplest filter is a tag name as a string, which returns a list of all tags with that name.
print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

2) A regular expression

Pass a compiled regular expression to get the Tag objects whose names match it.
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

3) A list

Pass a list of strings to get every Tag whose name matches any item in the list.
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

4) True

True matches every tag in the document, but none of the text strings.
for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

5) A function

Pass a function that takes a tag and returns True or False; the tags for which it returns True are matched.
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title" name="dromouse"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

Keyword arguments

Pass one or more keyword arguments, and Beautiful Soup filters on the attribute of that name: a tag matches only if each named attribute has the given value.
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
Special case: class is a reserved word in Python, so to filter on the class attribute use the keyword class_ instead.
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In addition, some attribute names cannot be used as keyword arguments, such as the HTML5 data-* attributes, because they are not valid Python identifiers. Filter on them by passing a dictionary as attrs.
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "html.parser")
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]

The string argument

The string argument (named text before Beautiful Soup 4.4.0) searches the document for strings instead of tags. It accepts the same kinds of values as name: a string, a list, a regular expression, or True.
soup.find_all(string="Elsie")
# ['Elsie']

soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(string=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]

The limit argument

Use this argument to cap the number of results. The example document has three matching nodes, but only two are returned.
soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

The recursive argument

Setting this argument to False restricts the search to the direct children of the current Tag, which can save a lot of search time.
soup.html.find_all("title")
# [<title>The Dormouse's story</title>]
soup.html.find_all("title", recursive=False)
# []
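
6.2 find(name, attrs, recursive, string, **kwargs)

The only difference from find_all() is the return value: find_all() returns a list of results (with a single element when there is only one match), while find() returns the first match directly. When nothing matches, find_all() returns an empty list and find() returns None.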
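The relationship between the two methods can be checked directly. This is a sketch on an assumed three-link paragraph; the tag name "table" is used only because the markup contains no such tag.

```python
from bs4 import BeautifulSoup

markup = (
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(markup, "html.parser")

first = soup.find("a")                     # a single Tag
first_of_list = soup.find_all("a", limit=1)[0]
missing_one = soup.find("table")           # None when nothing matches
missing_all = soup.find_all("table")       # empty list when nothing matches

print(first["id"])             # link1
print(first is first_of_list)  # True
print(missing_one)             # None
print(missing_all)             # []
```

Because find() can return None, it is worth checking the result before chaining another attribute access onto it.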

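6.3 find_parents() and find_parent()

find_all() and find() search the descendants of the current node: its children, grandchildren, and so on. find_parents() and find_parent() search its ancestors instead, and accept the same filters as the ordinary tag searches.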
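A brief sketch of searching upwards, using a cut-down document assumed for this example. The results come back nearest ancestor first.

```python
from bs4 import BeautifulSoup

markup = (
    '<html><body><p class="story"><a id="link1">Elsie</a></p>'
    "</body></html>"
)
soup = BeautifulSoup(markup, "html.parser")
link = soup.a

nearest_p = link.find_parent("p")             # first matching ancestor
ancestors = link.find_parents(["p", "body"])  # all matching ancestors
no_div = link.find_parent("div")              # None: no such ancestor

print(nearest_p["class"])           # ['story']
print([t.name for t in ancestors])  # ['p', 'body']
print(no_div)                       # None
```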

6.4 find_next_siblings() and find_next_sibling()

These two methods use .next_siblings to iterate over the siblings that come after the current tag. find_next_siblings() returns all following siblings that match, and find_next_sibling() returns only the first one.

6.5 find_previous_siblings() and find_previous_sibling()

These two methods use .previous_siblings to iterate over the siblings that come before the current tag. find_previous_siblings() returns all preceding siblings that match, and find_previous_sibling() returns the first one, that is, the nearest.
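The sibling searches in 6.4 and 6.5 can be sketched together on a three-link paragraph (assumed markup, modelled on the test document). Note that the "previous" results are ordered from the nearest sibling outwards.

```python
from bs4 import BeautifulSoup

markup = (
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(markup, "html.parser")
first = soup.find(id="link1")
last = soup.find(id="link3")

after_first = [t["id"] for t in first.find_next_siblings("a")]
before_last = [t["id"] for t in last.find_previous_siblings("a")]

print(after_first)                             # ['link2', 'link3']
print(first.find_next_sibling("a")["id"])      # link2
print(before_last)                             # ['link2', 'link1']
print(last.find_previous_sibling("a")["id"])   # link2
```

Unlike the bare .next_sibling attribute, these methods skip the strings between the tags when a tag name is given as the filter.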

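6.6 find_all_next() and find_next()

These two methods use .next_elements to iterate over the tags and strings that come after the current tag in the document. find_all_next() returns all matching nodes, and find_next() returns the first one.

6.7 find_all_previous() and find_previous()

These two methods use .previous_elements to iterate over the tags and strings that come before the current node in the document. find_all_previous() returns all matching nodes, and find_previous() returns the first one.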
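Because these methods follow document order rather than the tree structure, they can reach nodes that are neither ancestors nor siblings. The sketch below uses an assumed miniature document to show this: from the first link, find_previous() reaches the title in the head.

```python
from bs4 import BeautifulSoup

markup = (
    "<html><head><title>Story</title></head><body><p>"
    '<a id="link1">Elsie</a><a id="link2">Lacie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(markup, "html.parser")
first = soup.find(id="link1")

later_links = [t["id"] for t in first.find_all_next("a")]
later_text = [str(s) for s in first.find_all_next(string=True)]
earlier = [t.name for t in first.find_all_previous(["title", "p"])]
title_text = first.find_previous("title").string

print(later_links)  # ['link2']
print(later_text)   # ['Elsie', 'Lacie']
print(earlier)      # ['p', 'title']
print(title_text)   # Story
```

The enclosing p counts as a "previous" element because its opening tag was parsed before the link.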

References:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#

When reposting, please credit the original:

http://www.cnblogs.com/wuwenyan/p/4773427.html