The BeautifulSoup Module

Date: 2019-12-15

Tags: beautifulsoup, module

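1. The BeautifulSoup module takes an HTML or XML string and parses it into a structured object. You can then use the methods it provides to look up specific elements quickly, which makes finding elements in HTML or XML simple.
2. Install the BeautifulSoup module
pip3 install beautifulsoup4
3. Usage
Create the HTML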

html_doc ="""
            <html>
                <head>
                    <title>BeautifulSoup示例</title>
                </head>
            <body>
                <div>
                    <a href='http://www.dongdong.com'>東東<p>東東內容</p></a>
                </div>
                <a id='xixi'>西西</a>
                <div>
                    <p>南南內容</p>
                </div>
                <p>北北內容</p>
            </body>
            </html>
        """

Create the BeautifulSoup object

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")    # soup is the whole HTML document
print(soup.prettify())                                    # print the soup object's contents, pretty-printed

name: tag name
(1) Use the soup object to find the first a tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")    # soup is the whole HTML document
tag = soup.find('a')                                      # find the first a tag
print(tag)

Output:
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
(2) Get the tag's name from the a tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")    # soup is the whole HTML document
tag = soup.find('a')                                      # find the first a tag
name = tag.name                                           # get the a tag's name
print(name)

Output:
a
(3) Rename the a tag by assigning to its name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")    # soup is the whole HTML document
tag = soup.find('a')                                      # find the first a tag
name = tag.name                                           # get the a tag's name
tag.name = 'span'                                         # rename the a tag to span
print(tag)

Output:
<span href="http://www.dongdong.com">東東<p>東東內容</p></span>
attrs: tag attributes
(1) Use attrs to get the a tag's attribute dictionary

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
attrs = tag.attrs              # get the attributes
print(attrs)

Output:
{'href': 'http://www.dongdong.com'}
(2) Use attrs to replace the a tag's attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
attrs = tag.attrs                                   # get the attributes
tag.attrs = {'href':'http://www.nannan.com'}        # replace the whole attribute dictionary
print(tag)

Output:
<a href="http://www.nannan.com">東東<p>東東內容</p></a>
(3) Use attrs to add the attribute love="石頭" to the tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")      
tag = soup.find('a')                                      
tag.attrs['love'] = '石頭'
print(tag)

Output:
<a href="http://www.dongdong.com" love="石頭">東東<p>東東內容</p></a>
(4) Use attrs to delete the href attribute from the a tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
attrs = tag.attrs                                   # get the attributes
del tag.attrs['href']                               # delete the href attribute
print(tag)

Output:
<a>東東<p>東東內容</p></a>
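The examples above go through tag.attrs directly. As a minimal sketch, a tag can also be read like a dictionary, which helps when an attribute may be missing. The one-line document and the class values below are made up for illustration and are not part of html_doc.

```python
from bs4 import BeautifulSoup

# Stand-in document (hypothetical) so the snippet runs on its own
html = '<a href="http://www.dongdong.com" class="nav main">東東</a>'
soup = BeautifulSoup(html, features="html.parser")
tag = soup.find('a')

href = tag['href']                  # dictionary-style access; raises KeyError if absent
missing = tag.get('title')          # get() returns None when the attribute is absent
fallback = tag.get('title', 'n/a')  # or a default of your choice
has_href = tag.has_attr('href')     # membership test
classes = tag['class']              # class is multi-valued, so it comes back as a list

print(href, missing, fallback, has_href, classes)
```

Using get() avoids the KeyError that tag['title'] would raise here.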
Tags and content
(1) Use children to get all direct children of body

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tags = soup.find('body').children
print(list(tags))

Output:
['\n', <div>
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
</div>, '\n', <a id="xixi">西西</a>, '\n', <div>
<p>南南內容</p>
</div>, '\n', <p>北北內容</p>, '\n']
(2) Use children to get all direct children of body, then loop over them and check each child's type to tell tags apart from text

from bs4 import BeautifulSoup
from bs4.element import Tag
soup = BeautifulSoup(html_doc, features="html.parser")
tags = soup.find('body').children      # loop over the children and check the type of each one
for tag in tags:
    if isinstance(tag, Tag):           # a Tag instance is an element
        print('Tag:', tag, type(tag))
    else:                              # anything else is text
        print('Text....')

Output:
Text....
Tag: <div>
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
</div> <class 'bs4.element.Tag'>
Text....
Tag: <a id="xixi">西西</a> <class 'bs4.element.Tag'>
Text....
Tag: <div>
<p>南南內容</p>
</div> <class 'bs4.element.Tag'>
Text....
Tag: <p>北北內容</p> <class 'bs4.element.Tag'>
Text....
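If only the element children are wanted, the filtering can be written more compactly. The sketch below assumes a small stand-in document (not html_doc) and shows two routes: filtering children by type, and asking find_all for direct children only.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Stand-in document (hypothetical); the newlines become text nodes
html = '<body>\n<div>one</div>\n<a id="xixi">西西</a>\n</body>'
soup = BeautifulSoup(html, features="html.parser")
body = soup.find('body')

# Route 1: filter the children iterator by type
child_tags = [c for c in body.children if isinstance(c, Tag)]
child_text = [c for c in body.children if isinstance(c, NavigableString)]

# Route 2: True matches every tag; recursive=False limits the search to direct children
direct_tags = body.find_all(True, recursive=False)

print([t.name for t in child_tags])
print(len(child_text))
print([t.name for t in direct_tags])
```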
(3) Use descendants to get every descendant of body (walked recursively, one node at a time)

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tags = soup.find('body').descendants
print(list(tags))

Output:
['\n', <div>
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
</div>, '\n', <a href="http://www.dongdong.com">東東<p>東東內容</p></a>, '東東', <p>東東內容</p>, '東東內容', '\n', '\n', <a id="xixi">西西</a>, '西西', '\n', <div>
<p>南南內容</p>
</div>, '\n', <p>南南內容</p>, '南南內容', '\n', '\n', <p>北北內容</p>, '北北內容', '\n']
(4) Use clear to remove all children of the body tag (the tag itself is kept)

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('body')
tag.clear()
print(soup)

Output:
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body></body>
</html>
(5) decompose removes the tag together with all of its descendants

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')
body.decompose()
print(soup)

Output:
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
</html>
(6) extract removes the tag together with all of its descendants and returns the removed tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')
v = body.extract()
print(v)

Output:
<body>
<div>
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
</div>
<a id="xixi">西西</a>
<div>
<p>南南內容</p>
</div>
<p>北北內容</p>
</body>
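To put clear, decompose and extract side by side, here is a minimal sketch on a three-paragraph stand-in document (hypothetical, not html_doc). The point it tries to show is that an extracted tag is still usable afterwards, while decompose returns nothing.

```python
from bs4 import BeautifulSoup

# Stand-in document (hypothetical) with one paragraph per operation
html = '<div><p id="a">A</p><p id="b">B</p><p id="c">C</p></div>'
soup = BeautifulSoup(html, features="html.parser")

soup.find('p', id='a').clear()               # keeps <p id="a"> but empties it
result = soup.find('p', id='b').decompose()  # destroys the tag and returns None
removed = soup.find('p', id='c').extract()   # detaches the tag and returns it

after_removal = str(soup)
print(after_removal)
print(result)
print(removed)

# The extracted tag can be put back into the tree
soup.div.append(removed)
after_reinsert = str(soup)
print(after_reinsert)
```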
(7) decode converts the object to a string (including the current tag)

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')
v = body.decode()
print(v)

Output:
<body>
<div>
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
</div>
<a id="xixi">西西</a>
<div>
<p>南南內容</p>
</div>
<p>北北內容</p>
</body>
(8) decode_contents converts the object to a string (excluding the current tag)

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')
v = body.decode_contents()
print(v)

Output:
<div>
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
</div>
<a id="xixi">西西</a>
<div>
<p>南南內容</p>
</div>
<p>北北內容</p>
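A compact way to see the difference between the two: decode gives the outer HTML and decode_contents the inner HTML. The sketch below uses a one-line stand-in document (hypothetical) and also shows encode, which returns bytes instead of a string.

```python
from bs4 import BeautifulSoup

# Stand-in document (hypothetical)
html = '<div><p>南南內容</p></div>'
soup = BeautifulSoup(html, features="html.parser")
div = soup.find('div')

outer = div.decode()           # string including the <div> itself
inner = div.decode_contents()  # string of the children only
raw = div.encode('utf-8')      # same markup as outer, but as bytes

print(outer)
print(inner)
print(raw)
```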
(10) find returns the first matching tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('body').find('p',recursive=False)                            # recursive=False searches direct children only (the default is True)
print(tag)

Output:
<p>北北內容</p>
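The effect of recursive is easiest to see by running the same search both ways. A minimal sketch, assuming a stand-in document (hypothetical) in which one p is nested and one is a direct child of body:

```python
from bs4 import BeautifulSoup

# Stand-in document (hypothetical)
html = '<body><div><p>nested</p></div><p>direct</p></body>'
soup = BeautifulSoup(html, features="html.parser")
body = soup.find('body')

first_any = body.find('p')                      # default recursive=True: searches all descendants
first_direct = body.find('p', recursive=False)  # only looks at body's direct children

print(first_any)
print(first_direct)
```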
(11) get_text returns the text inside a tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
print(tag)
v = tag.get_text()
print(v)

Output:
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
東東東東內容
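In the output above the two pieces of text run together. As a sketch, get_text accepts a separator, and stripped_strings yields the pieces one by one. The separator string is an arbitrary choice for illustration.

```python
from bs4 import BeautifulSoup

# The same a tag as in the example, as a stand-alone document
html = '<a href="http://www.dongdong.com">東東<p>東東內容</p></a>'
soup = BeautifulSoup(html, features="html.parser")
tag = soup.find('a')

plain = tag.get_text()                     # pieces joined with no separator
separated = tag.get_text(separator=' | ')  # pieces joined with the given separator
pieces = list(tag.stripped_strings)        # each text piece, whitespace-stripped

print(plain)
print(separated)
print(pieces)
```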
(12) index returns the position of a tag among its parent's children

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('body')
v = tag.index(tag.find('div'))
print(v)

Output:
1
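(13) Use enumerate to list every child of a tag together with its index (text nodes count too)

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('body')
for i,v in enumerate(tag):
    print(i,v)

Output: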
0
1 <div>
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
</div>
2
3 <a id="xixi">西西</a>
4
5 <div>
<p>南南內容</p>
</div>
6
7 <p>北北內容</p>
8
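Because whitespace text nodes are counted, the index is often larger than expected. The sketch below, on a stand-in document (hypothetical), shows that index lines up with contents, and one possible way to get a position among tags only.

```python
from bs4 import BeautifulSoup

# Stand-in document (hypothetical); each newline is a text node
html = '<body>\n<div>first</div>\n<p>second</p>\n</body>'
soup = BeautifulSoup(html, features="html.parser")
body = soup.find('body')
p = body.find('p')

position = body.index(p)                  # counts text nodes as well
same_node = body.contents[position] is p  # index and contents agree

# Position among element children only: search direct child tags, then use list.index
tag_only_position = body.find_all(True, recursive=False).index(p)

print(position, same_node, tag_only_position)
```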
(14) append adds a tag at the end of the current tag's contents

from bs4 import BeautifulSoup
from bs4.element import Tag
soup = BeautifulSoup(html_doc, features="html.parser")
obj = Tag(name='i',attrs={'id': 'it'})
obj.string = '我是一個新來的'
tag = soup.find('body')
tag.append(obj)
print(soup)

Output:
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
</div>
<a id="xixi">西西</a>
<div>
<p>南南內容</p>
</div>
<p>北北內容</p>
<i id="it">我是一個新來的</i></body>
</html>
(15) insert places a tag at the given position inside the current tag

from bs4 import BeautifulSoup
from bs4.element import Tag
soup = BeautifulSoup(html_doc, features="html.parser")
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = '我是一個新來的'
tag = soup.find('body')
tag.insert(2, obj)
print(soup)

Output:
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
</div><i id="it">我是一個新來的</i>
<a id="xixi">西西</a>
<div>
<p>南南內容</p>
</div>
<p>北北內容</p>
</body>
</html>
(16) replace_with replaces the current tag with the given tag

from bs4 import BeautifulSoup
from bs4.element import Tag
soup = BeautifulSoup(html_doc, features="html.parser")
obj = Tag(name='i', attrs={'id': 'it'})
obj.string = '我是一個新來的'
tag = soup.find('div')                 # the first div in the document
tag.replace_with(obj)                  # swap that div for the new i tag
print(soup)
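The examples above build the new tag by instantiating Tag directly. As an alternative sketch, the soup object's new_tag method creates the tag for you. The stand-in document and the English text below are assumptions made for illustration; replace_with also hands back the element it replaced.

```python
from bs4 import BeautifulSoup

# Stand-in document (hypothetical)
html = '<body><div>old</div><p>keep</p></body>'
soup = BeautifulSoup(html, features="html.parser")

new_tag = soup.new_tag('i', id='it')   # keyword arguments become attributes
new_tag.string = 'I am new here'

old = soup.find('div').replace_with(new_tag)   # returns the replaced div

print(old)
print(soup)
```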