## Example 1: Removing script tags
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Written for Python 2 with BeautifulSoup 3
    from BeautifulSoup import BeautifulSoup

    html = '<script>a</script>baba<script>b</script><h1>'
    soup = BeautifulSoup(html)

    # extract() detaches each <script> tag, together with its contents, from the tree
    [s.extract() for s in soup('script')]
    print soup
Output:

    baba<h1></h1>
The same approach removes any other tag along with its contents.
You can also replace

    [s.extract() for s in soup('script')]

with the equivalent explicit call:

    [s.extract() for s in soup.findAll('script')]
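For readers on Python 3, the idea of removing other tags the same way can be sketched as follows. This is a minimal sketch, assuming the `beautifulsoup4` package (imported as `bs4`) and its built-in `html.parser` backend rather than the BeautifulSoup 3 module used above. `strip_tags` is a hypothetical helper name chosen for illustration.

```python
from bs4 import BeautifulSoup


def strip_tags(html, names):
    """Return html with every tag listed in names removed, contents included.

    Hypothetical helper: assumes beautifulsoup4 is installed.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all() accepts a list of tag names and matches any of them
    for tag in soup.find_all(names):
        tag.extract()  # detach the tag and everything inside it
    return str(soup)


print(strip_tags("<script>a</script>baba<style>p{}</style><h1>hi</h1>",
                 ["script", "style"]))
# prints: baba<h1>hi</h1>
```

A plain `for` loop is used here instead of a list comprehension because the calls are made only for their side effect.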
## Example 2: Removing comments
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Written for Python 2 with BeautifulSoup 3
    from BeautifulSoup import BeautifulSoup, Comment

    data = """<div class="foo"> cat dog sheep goat <!-- <p>test</p> --> </div>"""
    soup = BeautifulSoup(data)

    # Comments are text nodes of type Comment: find each one and detach it
    for element in soup(text=lambda text: isinstance(text, Comment)):
        element.extract()

    print soup.prettify()
Output:

    <div class="foo"> cat dog sheep goat </div>
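The comment-removal technique carries over to Python 3 in much the same shape. The sketch below assumes the `beautifulsoup4` package (`bs4`, version 4.4 or later for the `string=` argument) with the `html.parser` backend. `strip_comments` is a hypothetical helper name, not part of the library.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html):
    """Return html with all HTML comments removed.

    Hypothetical helper: assumes beautifulsoup4 >= 4.4 is installed.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Comment is a subclass of NavigableString, so filter text nodes by type
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
    return str(soup)


print(strip_comments('<div class="foo">cat dog<!-- <p>test</p> -->sheep goat</div>'))
# prints: <div class="foo">cat dogsheep goat</div>
```

Note that the markup inside the comment (`<p>test</p>`) is never parsed as a tag, so it disappears together with the comment node.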