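1. The BeautifulSoup module takes an HTML or XML string and parses it into a tree of objects. You can then use the methods it provides to locate specific elements quickly, which makes searching HTML or XML straightforward.
2. Install the BeautifulSoup module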
pip3 install beautifulsoup4
3. Usage
Create the HTML
html_doc = """
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href='http://www.dongdong.com'>東東<p>東東內容</p></a>
</div>
<a id='xixi'>西西</a>
<div>
<p>南南內容</p>
</div>
<p>北北內容</p>
</body>
</html>
"""
Create the BeautifulSoup object
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, features="html.parser")  # soup represents the whole HTML document
print(soup.prettify())  # print the soup object as formatted, indented output
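name: the tag name
(1) Use the soup object to find the first a tag
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, features="html.parser")  # soup represents the whole HTML document
tag = soup.find('a')  # find() returns only the first matching a tag
print(tag)
Output: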
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
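Since find() stops at the first match, the second a tag (id='xixi') never shows up in the output above. As a minimal sketch, assuming a shortened copy of the sample document (the names first and all_links are only illustrative), find_all() collects every match instead:

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document, used only for this illustration.
html_doc = """
<body>
<div><a href='http://www.dongdong.com'>東東<p>東東內容</p></a></div>
<a id='xixi'>西西</a>
</body>
"""

soup = BeautifulSoup(html_doc, features="html.parser")

first = soup.find('a')          # a single Tag, or None when nothing matches
all_links = soup.find_all('a')  # a list-like result holding every match

print(first.get('href'))
print(len(all_links))
for link in all_links:
    print(link.name, link.attrs)
```

Use find() when one element is enough and find_all() when you need to loop over every occurrence.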
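(2) Get the name of the a tag from the tag object
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, features="html.parser")  # soup represents the whole HTML document
tag = soup.find('a')  # find the first a tag
name = tag.name  # get the name of the a tag
print(name)
Output: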
a
(3) Rename the a tag by assigning to its name
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, features="html.parser")  # soup represents the whole HTML document
tag = soup.find('a')  # find the first a tag
name = tag.name  # get the name of the a tag
tag.name = 'span'  # rename the a tag to span
print(tag)
Output:
<span href="http://www.dongdong.com">東東<p>東東內容</p></span>
attrs: tag attributes
(1) Use attrs to get the attribute dictionary of the a tag
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
attrs = tag.attrs  # get the attributes as a dictionary
print(attrs)
Output:
{'href': 'http://www.dongdong.com'}
(2) Use attrs to modify the attributes of the a tag
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
tag.attrs = {'href': 'http://www.nannan.com'}  # replace the whole attribute dictionary
print(tag)
Output:
<a href="http://www.nannan.com">東東<p>東東內容</p></a>
(3) Use attrs to add the attribute love="石頭" to the tag
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
tag.attrs['love'] = '石頭'  # add a new attribute
print(tag)
Output:
<a href="http://www.dongdong.com" love="石頭">東東<p>東東內容</p></a>
(4) Use attrs to delete the href attribute from the a tag
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
del tag.attrs['href']  # remove the href attribute
print(tag)
Output:
<a>東東<p>東東內容</p></a>
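The attrs examples above all go through the dictionary directly. A tag can also be indexed like a dictionary itself, which is handy when an attribute might be absent. The following is a small sketch on a one-tag document made up for this illustration; the 'title' attribute and the 'no title' fallback are assumptions, not part of the original sample:

```python
from bs4 import BeautifulSoup

# One-tag document, used only for this illustration.
html_doc = "<a href='http://www.dongdong.com'>東東</a>"
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')

href = tag['href']                       # raises KeyError if the attribute is missing
missing = tag.get('title')               # returns None instead of raising
fallback = tag.get('title', 'no title')  # returns the supplied default

tag['love'] = '石頭'                      # same effect as tag.attrs['love'] = ...
del tag['href']                          # same effect as del tag.attrs['href']
has_href = tag.has_attr('href')

print(tag)
```

Preferring get() over indexing avoids a KeyError when scraping pages whose tags do not always carry the attribute.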
Tags and content
(1) Use children to get all direct children of body
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, features="html.parser")
tags = soup.find('body').children  # an iterator over direct children only
print(list(tags))
Output:
['\n', <div>
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
</div>, '\n', <a id="xixi">西西</a>, '\n', <div>
<p>南南內容</p>
</div>, '\n', <p>北北內容</p>, '\n']
(2) Use children to get all direct children of body, then loop over them and use type(tag) to separate tags from text
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(html_doc, features="html.parser")
tags = soup.find('body').children

# Loop over the children and use type(tag) to tell tags from text
for tag in tags:
    if type(tag) == Tag:  # the child is a tag
        print('我是標籤:', tag, type(tag))  # the label means "I am a tag:"
    else:  # otherwise it is text
        print('文本....')  # the label means "text...."
Output:
文本....
我是標籤: <div>
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
</div> <class 'bs4.element.Tag'>
文本....
我是標籤: <a id="xixi">西西</a> <class 'bs4.element.Tag'>
文本....
我是標籤: <div>
<p>南南內容</p>
</div> <class 'bs4.element.Tag'>
文本....
我是標籤: <p>北北內容</p> <class 'bs4.element.Tag'>
文本....
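Comparing with type(tag) == Tag works for the sample, but isinstance is the more usual Python idiom and lets you name the text case explicitly. A minimal sketch, assuming a shortened body and the illustrative list names child_tags and text_nodes:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Shortened body, used only for this illustration.
html_doc = "<body>\n<div><p>南南內容</p></div>\n<p>北北內容</p>\n</body>"
soup = BeautifulSoup(html_doc, features="html.parser")

child_tags = []  # names of the children that are tags
text_nodes = []  # the children that are plain text (here: newlines)

for child in soup.find('body').children:
    if isinstance(child, Tag):
        child_tags.append(child.name)
    elif isinstance(child, NavigableString):
        text_nodes.append(str(child))

print(child_tags)
print(text_nodes)
```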
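(3) Use descendants to get every descendant of body (visited one by one, recursively)
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, features="html.parser")
tags = soup.find('body').descendants  # an iterator over children, grandchildren, and so on
print(list(tags))
Output: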
['\n', <div>
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
</div>, '\n', <a href="http://www.dongdong.com">東東<p>東東內容</p></a>, '東東', <p>東東內容</p>, '東東內容', '\n', '\n', <a id="xixi">西西</a>, '西西', '\n', <div>
<p>南南內容</p>
</div>, '\n', <p>南南內容</p>, '南南內容', '\n', '\n', <p>北北內容</p>, '北北內容', '\n']
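The descendants list mixes tags with text nodes, many of which are bare newlines. As a rough sketch (the variable names are illustrative), you can filter the tags out with isinstance, or ask for the non-blank text directly with stripped_strings:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Copy of the body of the sample document.
html_doc = """
<body>
<div>
<a href='http://www.dongdong.com'>東東<p>東東內容</p></a>
</div>
<a id='xixi'>西西</a>
<div>
<p>南南內容</p>
</div>
<p>北北內容</p>
</body>
"""

soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')

# Keep only the descendants that are tags, in document order.
tag_names = [node.name for node in body.descendants if isinstance(node, Tag)]

# stripped_strings yields each text node with surrounding whitespace removed
# and skips nodes that are whitespace only.
texts = list(body.stripped_strings)

print(tag_names)
print(texts)
```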
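(4) Use clear to remove all children of the body tag (the tag itself is kept)
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('body')
tag.clear()  # empty the tag but leave <body></body> in place
print(soup)
Output: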
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body></body>
</html>
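(5) decompose: recursively delete the tag together with everything inside it
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')
body.decompose()  # remove body and all of its descendants from the tree
print(soup)
Output: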
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
</html>
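(6) extract: recursively delete the tag together with everything inside it, and return the deleted tag
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')
v = body.extract()  # remove body from the tree and keep it in v
print(v)
Output: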
<body>
<div>
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
</div>
<a id="xixi">西西</a>
<div>
<p>南南內容</p>
</div>
<p>北北內容</p>
</body>
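The practical difference from decompose is that the extracted tag is still a usable object, so it can be put back somewhere else. A small sketch on a two-child body made up for this illustration (the id 'first' is an assumption):

```python
from bs4 import BeautifulSoup

# Small body, used only for this illustration.
html_doc = "<body><div id='first'><p>南南內容</p></div><p>北北內容</p></body>"
soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')

removed = body.find('div').extract()  # detached from the tree, but still intact
detached_parent = removed.parent      # None once the tag has been extracted
after_extract = str(body)

body.append(removed)                  # re-attach it at the end of body
after_append = str(body)

print(after_extract)
print(after_append)
```

In effect, extract followed by append moves an element, whereas decompose is for content you never need again.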
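(7) decode: convert the object to a string (including the current tag)
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')
v = body.decode()  # the string starts with <body> and ends with </body>
print(v)
Output: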
<body>
<div>
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
</div>
<a id="xixi">西西</a>
<div>
<p>南南內容</p>
</div>
<p>北北內容</p>
</body>
(8) decode_contents: convert the object to a string (excluding the current tag)
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')
v = body.decode_contents()  # only what is inside <body>...</body>
print(v)
Output:
<div>
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
</div>
<a id="xixi">西西</a>
<div>
<p>南南內容</p>
</div>
<p>北北內容</p>
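To see the two methods side by side, here is a minimal sketch on a compact body made up for this illustration. It also checks that str(tag) gives the same string as decode(), which is how these objects are usually turned into text:

```python
from bs4 import BeautifulSoup

# Compact body, used only for this illustration.
html_doc = "<body><div><p>南南內容</p></div><p>北北內容</p></body>"
soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')

outer = body.decode()           # markup including the <body> tag itself
inner = body.decode_contents()  # markup of the children only

print(outer)
print(inner)
print(outer == str(body))
```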
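(10) find: get the first matching tag
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, features="html.parser")
# recursive=False searches only the direct children of body;
# the default, recursive=True, searches all descendants
tag = soup.find('body').find('p', recursive=False)
print(tag)
Output: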
<p>北北內容</p>
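The result above is the last p in the document because it is the only p that is a direct child of body. A short sketch comparing both settings of recursive on a copy of the sample body (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

# Copy of the body of the sample document.
html_doc = """
<body>
<div>
<a href='http://www.dongdong.com'>東東<p>東東內容</p></a>
</div>
<a id='xixi'>西西</a>
<div>
<p>南南內容</p>
</div>
<p>北北內容</p>
</body>
"""

soup = BeautifulSoup(html_doc, features="html.parser")
body = soup.find('body')

deep_first = body.find('p')                      # default: search all descendants
shallow_first = body.find('p', recursive=False)  # direct children of body only

deep_count = len(body.find_all('p'))
shallow_count = len(body.find_all('p', recursive=False))

print(deep_first.get_text(), deep_count)
print(shallow_first.get_text(), shallow_count)
```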
(11) get_text: get the text inside a tag
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')
print(tag)
v = tag.get_text()  # concatenates the text of the tag and all of its descendants
print(v)
Output:
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
東東東東內容
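The two pieces of text run together because get_text joins them with an empty string by default. As a small sketch on the same a tag, the separator and strip arguments make the boundary visible:

```python
from bs4 import BeautifulSoup

# The first a tag of the sample document.
html_doc = "<a href='http://www.dongdong.com'>東東<p>東東內容</p></a>"
soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('a')

joined = tag.get_text()                             # pieces joined with ''
spaced = tag.get_text(separator=' ', strip=True)    # pieces stripped, joined with ' '
lines = tag.get_text(separator='\n').split('\n')    # one list entry per text node

print(joined)
print(spaced)
print(lines)
```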
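(12) index: get the index position of a tag within its parent tag
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('body')
v = tag.index(tag.find('div'))  # position of the first div among the children of body
print(v)
Output: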
1
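(13) Enumerate the children of a tag to see the index position of each one (text nodes, including newlines, are counted too)
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, features="html.parser")
tag = soup.find('body')
for i, v in enumerate(tag):  # iterating over a tag yields its direct children
    print(i, v)
Output: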
0
1 <div>
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
</div>
2
3 <a id="xixi">西西</a>
4
5 <div>
<p>南南內容</p>
</div>
6
7 <p>北北內容</p>
8
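(14) append: append a tag at the end of the current tag's contents
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(html_doc, features="html.parser")
obj = Tag(name='i', attrs={'id': 'it'})  # build a new <i id="it"> tag
obj.string = '我是一個新來的'  # set the text of the new tag
tag = soup.find('body')
tag.append(obj)  # the new tag becomes the last child of body
print(soup)
Output: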
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
</div>
<a id="xixi">西西</a>
<div>
<p>南南內容</p>
</div>
<p>北北內容</p>
<i id="it">我是一個新來的</i></body>
</html>
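The example above builds the new element by calling Tag directly. The BeautifulSoup documentation describes soup.new_tag() as the way to create tags, so the same append could be sketched like this on a one-paragraph body made up for the illustration:

```python
from bs4 import BeautifulSoup

# One-paragraph body, used only for this illustration.
html_doc = "<body><p>北北內容</p></body>"
soup = BeautifulSoup(html_doc, features="html.parser")

new_tag = soup.new_tag('i', id='it')  # keyword arguments become attributes
new_tag.string = '我是一個新來的'

soup.find('body').append(new_tag)     # the new tag becomes the last child of body
result = str(soup)

print(result)
```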
(15) insert: insert a tag at a given position inside the current tag
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(html_doc, features="html.parser")
obj = Tag(name='i', attrs={'id': 'it'})  # build a new <i id="it"> tag
obj.string = '我是一個新來的'
tag = soup.find('body')
tag.insert(2, obj)  # index 2 is the newline right after the first div
print(soup)
Output:
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<div>
<a href="http://www.dongdong.com">東東<p>東東內容</p></a>
</div><i id="it">我是一個新來的</i>
<a id="xixi">西西</a>
<div>
<p>南南內容</p>
</div>
<p>北北內容</p>
</body>
</html>
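insert needs a numeric index, and text nodes such as newlines count towards it. When you would rather position something relative to an existing element, insert_before and insert_after can be used instead. A rough sketch on a compact body, with the ids 'before' and 'after' invented for the illustration:

```python
from bs4 import BeautifulSoup

# Compact body, used only for this illustration.
html_doc = "<body><div></div><a id='xixi'>西西</a></body>"
soup = BeautifulSoup(html_doc, features="html.parser")
anchor = soup.find('a', id='xixi')

before = soup.new_tag('i', id='before')
before.string = 'before'
anchor.insert_before(before)  # placed immediately ahead of the a tag

after = soup.new_tag('i', id='after')
after.string = 'after'
anchor.insert_after(after)    # placed immediately behind the a tag

result = str(soup.find('body'))
print(result)
```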
(16) replace_with: replace the current tag with the given tag
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(html_doc, features="html.parser")
obj = Tag(name='i', attrs={'id': 'it'})  # build a new <i id="it"> tag
obj.string = '我是一個新來的'
tag = soup.find('div')  # the first div in the document
tag.replace_with(obj)  # swap the first div for the new tag
print(soup)
Output:
<html>
<head>
<title>BeautifulSoup示例</title>
</head>
<body>
<i id="it">我是一個新來的</i>
<a id="xixi">西西</a>
<div>
<p>南南內容</p>
</div>
<p>北北內容</p>
</body>
</html>
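replace_with also hands back the element that was replaced, so the old content is not lost unless you discard it. A minimal sketch on a two-div body made up for this illustration, using soup.new_tag() to build the replacement:

```python
from bs4 import BeautifulSoup

# Two-div body, used only for this illustration.
html_doc = "<body><div><p>東東內容</p></div><div><p>南南內容</p></div></body>"
soup = BeautifulSoup(html_doc, features="html.parser")

new_tag = soup.new_tag('i', id='it')
new_tag.string = '我是一個新來的'

old = soup.find('div').replace_with(new_tag)  # returns the replaced div
old_text = old.get_text()
result = str(soup)

print(old_text)
print(result)
```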