Removing script tags and comments from HTML with BeautifulSoup

Date: 2019-12-11

Tags: beautifulsoup, remove, html, script, comments; Category: HTML


## Example 1: Removing script tags

#! /usr/bin/env python
# -*- coding: utf-8 -*- 

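from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import, Python 2
# Sample document; note that it is not passed to the parser below.
html = '''
<script>a</script>
baba
<script>b</script>
<h1>hi, world</h1>
'''
# The soup is built from a shorter inline string, which is what the output reflects.
soup = BeautifulSoup('<script>a</script>baba<script>b</script><h1>')
# extract() detaches each <script> element, contents included, from the tree.
[s.extract() for s in soup('script')]
print soup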

Output:

baba<h1></h1>

The same approach removes any other tag along with its contents.
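As a rough illustration of removing other tags the same way, the sketch below assumes Python 3 and the `bs4` package (BeautifulSoup 4) rather than the BeautifulSoup 3 import used in the example. The helper name `strip_tags` and its default tag list are hypothetical, chosen only for this illustration.

```python
from bs4 import BeautifulSoup


def strip_tags(html, names=("script", "style")):
    """Return `html` with every tag listed in `names` removed, contents included.

    Hypothetical helper for illustration; not part of BeautifulSoup itself.
    """
    # "html.parser" is the parser bundled with Python, so no extra install is needed.
    soup = BeautifulSoup(html, "html.parser")
    # Passing a list to find_all() matches any of the listed tag names.
    for tag in soup.find_all(list(names)):
        tag.extract()  # same call as in the example: detach the element from the tree
    return str(soup)


cleaned = strip_tags("<script>a</script>baba<style>p{}</style><h1>hi, world</h1>")
print(cleaned)  # baba<h1>hi, world</h1>
```

Passing a different tuple, for example `names=("iframe",)`, should remove those elements instead.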

In the example above, you can also replace

[s.extract() for s in soup('script')]

with:

[s.extract() for s in soup.findAll('script')]
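For readers on BeautifulSoup 4, the same equivalence can be checked directly. This is a minimal sketch assuming the `bs4` package on Python 3, where the method is spelled `find_all`; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<script>a</script>baba<script>b</script><h1></h1>", "html.parser"
)

# Calling the soup object like a function is shorthand for find_all().
by_call = soup("script")
by_method = soup.find_all("script")

# Both searches should return the very same elements, in the same order.
same_elements = len(by_call) == len(by_method) and all(
    a is b for a, b in zip(by_call, by_method)
)
print(same_elements)
```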

## Example 2: Removing comments

#! /usr/bin/env python
# -*- coding: utf-8 -*- 

from BeautifulSoup import BeautifulSoup, Comment
data = """<div class="foo">
cat dog sheep goat
<!--
<p>test</p>
-->
</div>"""

soup = BeautifulSoup(data)

# Comments are text nodes of type Comment: filter text nodes by type, then detach each match.
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()

print soup.prettify()

Output:

<div class="foo">
 cat dog sheep goat
</div>
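The comment-removal example can be sketched for Python 3 as well. This version assumes the `bs4` package (BeautifulSoup 4), where the text filter is passed as `string=`, and uses the bundled `html.parser`. The exact whitespace of the result may differ from the BeautifulSoup 3 output shown above.

```python
from bs4 import BeautifulSoup, Comment

data = """<div class="foo">
cat dog sheep goat
<!--
<p>test</p>
-->
</div>"""

soup = BeautifulSoup(data, "html.parser")

# Comment is a subclass of NavigableString, so a string filter can select it by type.
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()  # detach the comment node from the tree

result = str(soup)
print(result)
```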
