The BeautifulSoup Module

Date: 2019-12-15

Tags: beautifulsoup, module

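BeautifulSoup is a module that takes an HTML or XML string and parses it into a structured tree. You can then use the methods it provides to locate specific elements quickly, which makes searching HTML or XML documents simple.

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id13

Installation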

pip install beautifulsoup4

Usage example

from bs4 import BeautifulSoup
 
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
asdf
    <div class="title">
        <b>The Dormouse's story總共</b>
        <h1>f</h1>
    </div>
<div class="story">Once upon a time there were three little sisters; and their names were
    <a  class="sister0" id="link1">Els<span>f</span>ie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</div>
ad<br/>sf
<p class="story">...</p>
</body>
</html>
"""
 
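soup = BeautifulSoup(html_doc, features="lxml")  # first argument: the page content as a string; second: which parser to use
# Find the first <a> tag
tag1 = soup.find(name='a')
# Find all <a> tags
tag2 = soup.find_all(name='a')
# Find the tag with id="link2" (select() always returns a list)
tag3 = soup.select('#link2')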
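
The three lookups above return different kinds of objects. The following is a minimal sketch of that difference, assuming a shorter made-up document and the built-in html.parser so that nothing beyond beautifulsoup4 needs to be installed:

```python
from bs4 import BeautifulSoup

# Shortened sample markup (an assumption for illustration)
html = '<p><a id="link1">Elsie</a> <a id="link2" href="http://example.com/lacie">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")         # a single Tag, or None if nothing matches
every = soup.find_all("a")     # a list-like ResultSet of Tags
by_id = soup.select("#link2")  # a list, even for an id selector

print(first["id"])             # link1
print(len(every))              # 2
print(by_id[0].get_text())     # Lacie
print(soup.find("table"))      # None
```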

Usage

Pass a document to the BeautifulSoup constructor to get a document object. You can pass either a string or an open file handle.

from bs4 import BeautifulSoup

with open('index.html') as fp:  # the file is closed once parsing is done
    soup = BeautifulSoup(fp)
# or
soup = BeautifulSoup('<html> data </html>')

The document is first converted to Unicode, and HTML entities are converted to the corresponding Unicode characters:

BeautifulSoup("Sacr&eacute; bleu!")
<html><head></head><body>Sacré bleu!</body></html>

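Beautiful Soup then picks the most suitable parser available to parse the document. If you specify a parser explicitly, it uses that one instead.

BeautifulSoup supports the HTML parser in Python's standard library (html.parser), and it also supports third-party parsers such as lxml and html5lib.

Note: Python's built-in html.parser is usually enough, but lxml's two parsers (HTML and XML) are used more often because they are fast and tolerant of malformed markup.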
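
As a rough sketch of choosing a parser explicitly, the snippet below parses the same broken fragment with html.parser and then with lxml when it is installed. The fallback logic and the sample markup are assumptions for illustration, not something the library does for you in this form:

```python
from bs4 import BeautifulSoup

html = "<p>Hello<b>world"  # deliberately unclosed tags

# html.parser ships with Python, so this needs no extra install.
soup = BeautifulSoup(html, "html.parser")
print(soup.b.get_text())   # world

# lxml is optional; fall back to the built-in parser if it is missing.
try:
    import lxml  # noqa: F401
    parser = "lxml"
except ImportError:
    parser = "html.parser"

soup2 = BeautifulSoup(html, parser)
print(parser, soup2.b.get_text())
```

Naming the parser also keeps results reproducible across machines, since different parsers can build different trees from invalid markup.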

After parsing, Beautiful Soup turns a complex HTML document into a tree of Python objects. Every node is a Python object, and all objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

(1) Tag

A Tag corresponds to a tag in the HTML document.

Its two most important attributes are:

name: the tag's name
attrs: all of the tag's attributes, as a dict
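
A short sketch of reading name and attrs, assuming a single made-up <a> tag and the built-in html.parser:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/lacie" class="sister big" id="link2">Lacie</a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

print(tag.name)          # a
print(tag.attrs["id"])   # link2
print(tag["class"])      # ['sister', 'big'] (class is multi-valued, so it is a list)
print(tag.get("title"))  # None (get() avoids a KeyError for a missing attribute)
```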

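(2) NavigableString

Strings are usually contained inside tags. Beautiful Soup wraps the text inside a tag in the NavigableString class.

A string inside a tag cannot be edited in place, but it can be replaced with another string using replace_with().

To use a NavigableString outside Beautiful Soup, convert it to an ordinary Unicode string first by calling str() on it (unicode() in Python 2). Otherwise the string keeps a reference to the whole Beautiful Soup parse tree even after you have finished with Beautiful Soup, which wastes memory.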
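
The points above can be sketched as follows, assuming Python 3 and a one-tag sample document:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>old text</b>", "html.parser")
text = soup.b.string

print(isinstance(text, NavigableString))  # True

plain = str(text)              # plain Python str, detached from the parse tree
print(type(plain) is str)      # True

text.replace_with("new text")  # swap the string inside <b>
print(soup.b)                  # <b>new text</b>
```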

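(3) BeautifulSoup

The BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a Tag object: it supports most of the methods described for navigating and searching the tree.

Because the BeautifulSoup object does not correspond to a real HTML or XML tag, it has no name and no attributes. It is still sometimes useful to look at its .name, so it is given the special .name "[document]".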
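
A tiny sketch of this, assuming a one-paragraph sample document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hi</p>", "html.parser")

print(soup.name)       # [document]
print(soup.find("p"))  # <p>hi</p> (searching works just as it does on a Tag)
```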

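(4) Comment

Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but a few special objects remain. The one you are most likely to need to care about is the comment:

A Comment object is a special type of NavigableString that represents a comment in the document.

For example: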

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

When it appears as part of an HTML document, however, a Comment is output with special formatting:

 
print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>
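
Because Comment is a subclass of NavigableString, an isinstance check is one way to pick comments out of a document. A minimal sketch, assuming a made-up fragment:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<p>visible<!-- hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

# The filter function receives every string in the tree
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

print(comments)           # [' hidden note ']
print(soup.p.get_text())  # visible
```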
  

Common attributes and methods

1. name: get or set the tag's name

tag = soup.find('a')
name = tag.name  # get
print(name)
tag.name = 'span'  # set
print(soup)

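2. attrs: get or set the tag's attributes, as a dict

tag = soup.find('a')
attrs = tag.attrs  # get
print(attrs)
tag.attrs = {'ik': 123}  # set: replace all attributes
tag.attrs['id'] = 'iiiii'  # set: a single attribute
print(soup)
# Alternatively, read a single attribute with get()
soup.select('a')[0].get('href')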

3. children: iterate over the direct children (tags and text nodes)

body = soup.find('body')
v = body.children

4. descendants: iterate over all descendants, at every depth

body = soup.find('body')
v = body.descendants
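
To make the difference between children and descendants concrete, here is a small sketch on a made-up fragment. Both attributes are iterators, so they are consumed into lists here:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

children = [c.name for c in div.children]  # direct children only
descendants = [d.name for d in div.descendants if isinstance(d, Tag)]  # skip text nodes

print(children)     # ['p', 'p']
print(descendants)  # ['p', 'b', 'p']
```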

5. clear: remove everything inside the tag (the tag itself is kept)

tag = soup.find('body')
tag.clear()
print(soup)

6. decompose: remove the tag and everything inside it, recursively

body = soup.find('body')
body.decompose()
print(soup)

7. extract: remove the tag and everything inside it, and return the removed tag

body = soup.find('body')
v = body.extract()
print(soup)
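
A side-by-side sketch of clear, decompose, and extract on a made-up fragment, since the three are easy to confuse:

```python
from bs4 import BeautifulSoup

markup = "<div><p id='a'>A</p><p id='b'>B</p><p id='c'><i>C</i></p></div>"
soup = BeautifulSoup(markup, "html.parser")

soup.find(id="a").decompose()          # removed and destroyed
removed = soup.find(id="b").extract()  # removed, but returned for reuse
soup.find(id="c").clear()              # the tag stays, its contents go

print(removed)  # <p id="b">B</p>
print(soup)     # <div><p id="c"></p></div>
```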

8. decode: convert to a string, including the current tag; decode_contents: excluding the current tag

body = soup.find('body')
v = body.decode()
v = body.decode_contents()
print(v)

9. encode: convert to bytes, including the current tag; encode_contents: excluding the current tag

body = soup.find('body')
v = body.encode()
v = body.encode_contents()
print(v)
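
A brief sketch of what decode and encode return, assuming a made-up paragraph that contains one non-ASCII character:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>café <b>bar</b></p>", "html.parser")
p = soup.p

as_text = p.decode()            # str, including the <p> tag itself
inner = p.decode_contents()     # str, only what is inside <p>
as_bytes = p.encode("utf-8")    # bytes, including the <p> tag itself

print(as_text)   # <p>café <b>bar</b></p>
print(inner)     # café <b>bar</b>
print(as_bytes)  # b'<p>caf\xc3\xa9 <b>bar</b></p>'
```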

10. find: get the first matching tag

tag = soup.find('a')
print(tag)
# text= is the older name of the string= argument (string= was added in 4.4.0)
tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
print(tag)

11. find_all: get all matching tags

import re

tags = soup.find_all('a')
print(tags)

tags = soup.find_all('a', limit=1)
print(tags)

tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# tags = soup.find_all(name='a', class_='sister', recursive=True, text='Lacie')
print(tags)

####### lists: match any value in the list #######
v = soup.find_all(name=['a', 'div'])
print(v)

v = soup.find_all(class_=['sister0', 'sister'])
print(v)

v = soup.find_all(text=['Tillie'])
print(v, type(v[0]))

v = soup.find_all(id=['link1', 'link2'])
print(v)

v = soup.find_all(href=['link1', 'link2'])
print(v)

####### regular expressions #######
rep = re.compile('p')
rep = re.compile('^p')
v = soup.find_all(name=rep)
print(v)

rep = re.compile('sister.*')
v = soup.find_all(class_=rep)
print(v)

rep = re.compile('http://www.oldboy.com/static/.*')
v = soup.find_all(href=rep)
print(v)

####### filtering with a function #######
def func(tag):
    return tag.has_attr('class') and tag.has_attr('id')

v = soup.find_all(name=func)
print(v)

####### get: read a tag attribute #######
tag = soup.find('a')
v = tag.get('id')
print(v)
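
The filter styles listed above (list, regular expression, function) can be tried end to end on a small document. This is a sketch under assumptions: the markup is a trimmed-down variant of the earlier example, and has_class_no_href is a hypothetical helper name:

```python
import re
from bs4 import BeautifulSoup

html = """<div class="story">
<a class="sister0" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
<p class="story">...</p></div>"""
soup = BeautifulSoup(html, "html.parser")

# List: match any of several tag names
by_list = soup.find_all(["a", "p"])

# Regex: searched against the attribute value
by_regex = soup.find_all(href=re.compile(r"/til"))

# Function: called with each tag, keep those for which it returns True
def has_class_no_href(tag):
    return tag.has_attr("class") and not tag.has_attr("href")

by_func = soup.find_all(has_class_no_href)

print(len(by_list))                 # 4
print([t["id"] for t in by_regex])  # ['link3']
print([t.name for t in by_func])    # ['div', 'a', 'p']
```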

12. has_attr: check whether the tag has the given attribute

tag = soup.find('a')
v = tag.has_attr('id')
print(v)

13. get_text: get the text inside the tag

tag = soup.find('a')
v = tag.get_text()  # an optional first argument is the separator placed between text fragments
print(v)
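
A short sketch of the separator and strip arguments of get_text, assuming a made-up fragment with extra whitespace:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div> Hello <b>big</b> world </div>", "html.parser")
div = soup.div

raw = div.get_text()                   # fragments joined as they are
joined = div.get_text("|", strip=True) # fragments stripped, then joined with "|"

print(repr(raw))  # ' Hello big world '
print(joined)     # Hello|big|world
```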

14. index: position of a child within its parent tag

tag = soup.find('body')
v = tag.index(tag.find('div'))
print(v)

tag = soup.find('body')
for i, v in enumerate(tag):
    print(i, v)

15. is_empty_element: whether the tag is an empty-element (self-closing) tag

That is, whether it is one of the following tags: 'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'

tag = soup.find('br')
v = tag.is_empty_element
print(v)

16. Elements related to the current tag

# Forward
soup.next
soup.next_element
soup.next_elements
soup.next_sibling
soup.next_siblings

# Backward
tag.previous
tag.previous_element
tag.previous_elements
tag.previous_sibling
tag.previous_siblings

# Upward
tag.parent
tag.parents
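
The sibling and element attributes are easy to mix up: siblings stay on the same level of the tree, while next_element follows parse order and can step inside the tag. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a id='x'>one</a><b>two</b></p>", "html.parser")
a = soup.a

print(a.next_sibling)  # <b>two</b> (same level as <a>)
print(a.next_element)  # one (the text inside <a> comes next in parse order)
print(a.parent.name)   # p

ancestors = [p.name for p in a.parents]
print(ancestors)       # ['p', '[document]']
```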

17. Search among a tag's related elements

tag.find_next(...)
tag.find_all_next(...)
tag.find_next_sibling(...)
tag.find_next_siblings(...)

tag.find_previous(...)
tag.find_all_previous(...)
tag.find_previous_sibling(...)
tag.find_previous_siblings(...)

tag.find_parent(...)
tag.find_parents(...)

# Arguments are the same as for find_all

18. select, select_one: CSS selectors

soup.select("title")

soup.select("p:nth-of-type(3)")
 
soup.select("body a")
 
soup.select("html head title")
 
tag = soup.select("span,a")
 
soup.select("head > title")
 
soup.select("p > a")
 
soup.select("p > a:nth-of-type(2)")
 
soup.select("p > #link1")
 
soup.select("body > a")
 
soup.select("#link1 ~ .sister")
 
soup.select("#link1 + .sister")
 
soup.select(".sister")
 
soup.select("[class~=sister]")
 
soup.select("#link1")
 
soup.select("a#link2")
 
soup.select('a[href]')
 
soup.select('a[href="http://example.com/elsie"]')
 
soup.select('a[href^="http://example.com/"]')
 
soup.select('a[href$="tillie"]')
 
soup.select('a[href*=".com/el"]')
 
 
from bs4.element import Tag

# Note: _candidate_generator is a private argument of select() in older
# Beautiful Soup releases such as the 4.4.0 documented above. Releases that
# delegate CSS selection to Soup Sieve (4.7.0 and later) do not accept it.
def default_candidate_generator(tag):
    # Yield only descendant tags that carry an href attribute
    for child in tag.descendants:
        if not isinstance(child, Tag):
            continue
        if not child.has_attr('href'):
            continue
        yield child

tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator)
print(type(tags), tags)

# Same generator, but stop after the first match
tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator, limit=1)
print(type(tags), tags)
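
For a version that relies only on the public select and select_one calls, here is a small sketch. The markup is a trimmed-down assumption based on the earlier example, and the helper variable names are made up:

```python
from bs4 import BeautifulSoup

html = """<p class="story">
<a class="sister0" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a></p>"""
soup = BeautifulSoup(html, "html.parser")

one = soup.select_one("#link2")  # first match, or None
child = [a["id"] for a in soup.select("p > a.sister")]
suffix = [a["id"] for a in soup.select('a[href$="tillie"]')]
adjacent = [a["id"] for a in soup.select("#link1 + .sister")]

print(one.get_text())             # Lacie
print(child)                      # ['link2', 'link3']
print(suffix)                     # ['link3']
print(adjacent)                   # ['link2']
print(soup.select_one("table"))   # None
```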

19. The tag's contents

tag = soup.find('span')
print(tag.string)           # get
tag.string = 'new content'  # set
print(soup)

tag = soup.find('body')
print(tag.string)
tag.string = 'xxx'
print(soup)

tag = soup.find('body')
v = tag.stripped_strings  # generator over the text of all nested tags, with whitespace stripped
print(list(v))
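
One detail worth seeing in action is that .string is None when a tag has more than one child. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p> one </p><p> two </p></div>", "html.parser")

single = soup.p.string                        # <p> has exactly one string child
multiple = soup.div.string                    # <div> has two children
stripped = list(soup.div.stripped_strings)    # consume the generator

print(repr(single))  # ' one '
print(multiple)      # None
print(stripped)      # ['one', 'two']
```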

20. append: add a tag inside the current tag, after its last child

tag = soup.find('body')
tag.append(soup.find('a'))  # an existing tag is moved, not copied
print(soup)

from bs4.element import Tag
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = 'I am new here'
tag = soup.find('body')
tag.append(obj)
print(soup)

21. insert: insert a tag at a given position inside the current tag

from bs4.element import Tag
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = 'I am new here'
tag = soup.find('body')
tag.insert(2, obj)
print(soup)
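
The snippets above build tags by calling Tag(...) directly. The documented factory for new tags is soup.new_tag(), which ties the tag to the soup's parser settings. A sketch of append and insert using it, on a made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p>first</p></body>", "html.parser")

new = soup.new_tag("i", id="it")         # keyword arguments become attributes
new.string = "new"
soup.body.append(new)                    # goes after the last child
soup.body.insert(0, soup.new_tag("hr"))  # goes at position 0

print(soup)  # <body><hr/><p>first</p><i id="it">new</i></body>
```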

22. insert_after, insert_before: insert after or before the current tag

from bs4.element import Tag
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = 'I am new here'
tag = soup.find('body')
# tag.insert_before(obj)
tag.insert_after(obj)
print(soup)

23. replace_with: replace the current tag with the given tag

from bs4.element import Tag
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = 'I am new here'
tag = soup.find('div')
tag.replace_with(obj)
print(soup)

24. Set up relationships between tags

tag = soup.find('div')
a = soup.find('a')
tag.setup(previous_sibling=a)
print(tag.previous_sibling)

25. wrap: wrap the current tag in the given tag

from bs4.element import Tag
obj1 = Tag(name='div', attrs={'id': 'it'})
obj1.string = 'I am new here'

tag = soup.find('a')
v = tag.wrap(obj1)
print(soup)

tag = soup.find('a')
v = tag.wrap(soup.find('p'))
print(soup)

26. unwrap: remove the current tag but keep its contents

tag = soup.find('a')
v = tag.unwrap()
print(soup)
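
wrap and unwrap are inverse operations in spirit. A small sketch on a made-up paragraph, with the wrapper created through soup.new_tag():

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <a href='#'>world</a></p>", "html.parser")

soup.a.wrap(soup.new_tag("b"))  # <a> now sits inside a new <b>
wrapped = str(soup)

soup.a.unwrap()                 # <a> disappears, its text stays in place
unwrapped = str(soup)

print(wrapped)    # <p>Hello <b><a href="#">world</a></b></p>
print(unwrapped)  # <p>Hello <b>world</b></p>
```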

A small example

from bs4.element import Tag

# Walk the direct children of <body> and tell tags apart from text nodes
tags = soup.find("body").children
for tag in tags:
    if isinstance(tag, Tag):
        print(tag)
    else:
        print("text node")