[Python] Modules and Techniques for Extracting the Main Content of Web Pages

Date: 2019-11-07


1. Main-content extraction: project address

https://github.com/buriy/python-readability

[Installation]

pip install readability-lxml

[Test]

python -m readability.readability -u http://www.douban.com/note/320982627/

[PATH dependency]

export PYTHONPATH=/usr/local/lib/python2.7/site-packages

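The package must run under Python 2.7, which is why PYTHONPATH has to point at the 2.7 site-packages directory. I still need to work out how to make Python 2.7 and Python 3.3 coexist.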
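One way to confirm which interpreter is actually running, and whether it can see the installed package, is a small check like the one below. This is only a sketch using the standard library; the helper names can_import and describe_interpreter are made up for illustration.

```python
import sys


def can_import(module_name):
    """Return True if this interpreter can import the named module."""
    try:
        __import__(module_name)
        return True
    except ImportError:
        return False


def describe_interpreter():
    """Summarise the running interpreter and its site-packages entries."""
    return {
        "version": "%d.%d" % (sys.version_info[0], sys.version_info[1]),
        "executable": sys.executable,
        # Entries of sys.path that point at a site-packages directory;
        # directories added through PYTHONPATH appear in sys.path as well.
        "site_packages": [p for p in sys.path if "site-packages" in p],
        "has_readability": can_import("readability"),
    }
```

Calling describe_interpreter() under each installed interpreter shows which one has the package available.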

=============================================================================

2. Official example

from readability.readability import Document
import urllib
html = urllib.urlopen(url).read()
readable_article = Document(html).summary()
readable_title = Document(html).short_title()
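The official example is written for Python 2 (urllib.urlopen) and builds the Document twice. Below is a minimal sketch of the same calls that should run on both Python 2.7 and Python 3, assuming readability-lxml is installed. The helper names fetch_html and extract_article are illustrative and not part of the library.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2.7


def fetch_html(url):
    """Download a page and return its raw bytes (illustrative helper)."""
    return urlopen(url).read()


def extract_article(html):
    """Return (short_title, article_html) for one page.

    The import happens inside the function so that this module can be
    loaded even before readability-lxml has been installed.
    """
    from readability.readability import Document  # from readability-lxml

    doc = Document(html)  # build the Document once and reuse it
    return doc.short_title(), doc.summary()
```

For a live page, extract_article(fetch_html(url)) with a URL of your choice returns the same pair; the article part still contains HTML tags.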

==============================================================================

3. Cleaning up HTML

Project address

https://github.com/aaronsw/html2text

[Installation]

pip install html2text

[Code]

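# -*- coding: utf-8 -*-
import html2text

# The sample input is a blog note about web page denoising, wrapped in HTML.
print(html2text.html2text(u'''<html><body><div><div class="note" id="link-report">
(1) Web page denoising
Web page denoising means removing text that is unrelated to what the page is meant to convey, such as advertisements and comments. Denoising of blog and news pages is already widely used; for example, the popular Evernote and Youdao Note apps rely on related techniques. My project also needed to denoise web pages and keep only the useful content, so I searched online for open-source denoising projects.
(2) Reference links
The main reference is the article "Web Page Main-Content Extraction Tools", which appears to be compiled from related posts on Sina Weibo. It lists project addresses for Java, C++, C#, Perl and Python. Since my project is written in Python, I shortlisted Decruft, Python readability, Python boilerpipe and Python Goose.
(3) Hands-on practice
Using Python readability:
from readability.readability import Document
import urllib
html = urllib.urlopen(url).read()
readable_article = Document(html).summary()
readable_title = Document(html).short_title()
The extracted readable_article is text that still carries HTML tags, so a clean-HTML step is still needed. Getting plain text requires further work.
"decruft is a fork of python-readability to make it faster. It also has some logic corrections and improvements along the way." (quoted from: <a rel="nofollow" href="http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/" target="_blank">http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/</a>)
decruft is a fork of Python readability that mainly improves its speed. The decruft source is hosted on Google. It has only a 0.1 release, dated September 2010, whereas Python-readability is still being updated and its core readability.py was updated 7 months ago. There is therefore no guarantee that decruft performs better than the current readability. I did not download decruft to test it; try it yourself if you are interested.
Python-boilerpipe: a Python wrapper for Boilerpipe that depends on jpype and chardet. When constructing an Extractor you can pick the extractor you need: DefaultExtractor, ArticleExtractor, ArticleSentencesExtractor, KeepEverythingExtractor, KeepEverythingWithMinKWordsExtractor, LargestContentExtractor, NumWordsRulesExtractor, CanolaExtractor. The project lets you choose the format of the extracted content: plain text, or text that keeps its HTML.
Python-Goose: after experimenting, I decided to use Goose. Its extraction results can be tried at <a rel="nofollow" href="http://jimplush.com/blog/goose" target="_blank">http://jimplush.com/blog/goose</a>. Goose can also return the meta description, and it ultimately yields the extracted plain text.
</div></div></body></html>'''))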
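The note quoted in the sample says that readable_article still carries HTML tags and needs a clean-HTML step. A minimal sketch of that step with html2text follows, assuming the package is installed. The function name html_to_text and its keep_links parameter are hypothetical, while HTML2Text, ignore_links, ignore_images, body_width and handle are the library's own names.

```python
def html_to_text(article_html, keep_links=False):
    """Convert an HTML fragment to Markdown-style plain text."""
    import html2text  # third-party: pip install html2text

    converter = html2text.HTML2Text()
    converter.ignore_links = not keep_links  # drop link targets by default
    converter.ignore_images = True           # skip <img> elements
    converter.body_width = 0                 # do not hard-wrap output lines
    return converter.handle(article_html).strip()
```

Passing the HTML returned by Document(html).summary() to html_to_text gives text without tags. Setting body_width to 0 is a design choice that leaves line wrapping to whatever displays the text.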