Installing and Using BeautifulSoup

Date: 2019-11-06


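BeautifulSoup is a great little tool.

Official site: http://www.crummy.com/software/BeautifulSoup/

Downloads: http://www.crummy.com/software/BeautifulSoup/bs4/download/4.1/ (the original post also attached the 4.1.2 source package).

The documentation at http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html is a Chinese translation, but it is dated: it covers version 3.0 and is not much use now.

I recommend http://www.crummy.com/software/BeautifulSoup/bs4/doc/ instead. It is the official English documentation, and it is more comfortable to read and a good deal clearer.

If you want to read web pages in Python with jQuery-style selection, without being tripped up by malformed markup and non-standard tags, BeautifulSoup is a good choice. According to the official site, BeautifulSoup is a screen-scraping library, meant to save you the effort of handling HTML tags and pulling information off the web. A few strengths make it especially powerful: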

1: Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application. Keywords: Pythonic style, simple methods.

2: Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding. Keywords: encoding conversion. Anyone who uses Python will agree that its encoding handling is tedious; BeautifulSoup simplifies this.

3: Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility. Keywords: compatible with other HTML parsers, so you can swap them freely.
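To make the second point concrete, here is a minimal sketch of the encoding round trip. The GBK-encoded sample page is made up for illustration, and the bundled html.parser is used so no extra parser library is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: bytes in a legacy encoding (GBK), as an older site might serve them.
raw = "<html><head><title>百度一下</title></head><body></body></html>".encode("gbk")

# from_encoding names the source encoding for cases where autodetection fails.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

title = soup.title.string          # text inside the tree is Unicode
utf8_bytes = soup.encode("utf-8")  # serialising the tree yields UTF-8 bytes

print(soup.original_encoding)      # the encoding Beautiful Soup settled on
print(title)
```

Without from_encoding, Beautiful Soup tries to detect the encoding itself; passing it explicitly is the "specify the original encoding" case from the quote.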

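After reading these features, some of you must be tempted, so let's start with installing BeautifulSoup:

Installation methods:

1: apt-get install python-bs4

2: easy_install beautifulsoup4

3: pip install beautifulsoup4

4: From source: python setup.py install

Pick whichever method suits your operating system. All of them install the package successfully; they differ only in the tool that does the installing. On my own system I used the fourth method, so here is a brief walk-through of it: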

Shell code

# download the 4.1.2 source tarball (-o writes to the named file)
curl -o beautifulsoup4-4.1.2.tar.gz http://www.crummy.com/software/BeautifulSoup/bs4/download/4.1/beautifulsoup4-4.1.2.tar.gz
tar zxvf beautifulsoup4-4.1.2.tar.gz
cd beautifulsoup4-4.1.2
python setup.py install

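OK, you will see the installer output, ending with a message that the install succeeded.

With the install done you surely can't wait to try it, so you open a Python prompt and happily type:

Python code

from BeautifulSoup import BeautifulSoup  # 3.x-style import; fails when only bs4 is installed

Sorry: ImportError. Why is there an import error when everything is installed? Going back to the official site and reading the notes again explains it: what was installed is BeautifulSoup 4.1, and this import is the 3.x form. Reopen the prompt and type: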

Python code

import bs4
from bs4 import BeautifulSoup

Hmm, no output at all. Congratulations: the BeautifulSoup package was imported successfully.
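If a script has to run on machines that may have either generation installed, the two import names can be tried in turn. This is only a sketch; the bs_major variable is a name made up here for illustration.

```python
try:
    from bs4 import BeautifulSoup            # Beautiful Soup 4.x lives in the bs4 package
    bs_major = 4
except ImportError:
    from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3.x used this module name
    bs_major = 3

print("Beautiful Soup major version:", bs_major)
```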

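Following the previous post, http://isilic.iteye.com/blog/1733560, let's try the dir command to see which methods BeautifulSoup provides:

Python code

dir(BeautifulSoup)

That prints one big pile of methods, which is a little overwhelming. Listing them one per line is much easier to read.

Python code

>>> for method in dir(BeautifulSoup):
...     print(method)
...

Look closely at the findXxx, nextXxx and previousXxx methods. They provide traversal, backtracking, searching and matching over an HTML page, which is already enough to pull information out of it.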
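As a small illustration of those navigation methods, the sketch below moves forwards, backwards and upwards from one tag. The HTML snippet is invented for the example.

```python
from bs4 import BeautifulSoup

# Invented sample markup: three sibling paragraphs inside one div.
html = """
<div id="nav">
  <p class="first">one</p>
  <p class="second">two</p>
  <p class="third">three</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
second = soup.find("p", class_="second")

next_p = second.find_next_sibling("p")      # following <p> on the same level
prev_p = second.find_previous_sibling("p")  # preceding <p> on the same level
parent = second.find_parent("div")          # nearest enclosing <div>
after = second.find_next("p")               # next <p> in document order

print(prev_p.string, next_p.string, parent["id"], after.string)
```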

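Let's take the Baidu home page as an example and try out what BeautifulSoup can do.

Python code

>>> from urllib.request import urlopen  # Python 3; Python 2 used urllib2.urlopen
>>> page = urlopen('http://www.baidu.com')
>>> soup = BeautifulSoup(page, 'html.parser')
>>> print(soup.title)
>>> soup.title.string

The result looks good. A hello-world tutorial really does put your mind at ease.

To go a step further, I want to find every link on the Baidu home page. That sounds hard, with all kinds of regular expressions and special handling. But wait, we are talking about BeautifulSoup here, so let's see how BeautifulSoup does it.

Python code

>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...

See the output? Simple, isn't it?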
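One detail worth a sketch: anchors without an href attribute raise KeyError under tag['href']. Passing href=True to find_all keeps only tags that carry the attribute. The markup below is a made-up stand-in for a real page.

```python
from bs4 import BeautifulSoup

# Made-up sample: one normal link, one anchor without href, one "#" link.
html = (
    '<a href="http://www.baidu.com/">home</a>'
    '<a name="top">anchor</a>'
    '<a href="#">hash</a>'
)
soup = BeautifulSoup(html, "html.parser")

# href=True matches only <a> tags that actually have an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```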

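For anyone used to jQuery and CSS, working this way is torture: you have to keep looping over whatever was selected. The output above also contains plenty of abnormal URLs such as #. To filter all of those out, the select syntax makes it easy.

Python code

>>> for link in soup.select('a[href^="http"]'):
...     print(link['href'])
...

Some will ask whether I couldn't simply test each URL myself. Of course I could; I only wanted to try the select syntax here. As for how the select syntax is defined, you can look that up yourselves. Strictly speaking, the select syntax deserves an article of its own.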
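A few more selector forms, as a rough sketch on invented markup. These are ordinary CSS selectors; recent Beautiful Soup releases support them fully, while very old 4.x releases handled only a subset.

```python
from bs4 import BeautifulSoup

# Invented sample loosely shaped like a navigation bar.
html = """
<div id="u">
  <a href="http://news.baidu.com" class="mnav">news</a>
  <a href="/duty" class="mnav">duty</a>
  <a href="#" class="lb">login</a>
</div>
<p><a href="https://example.com/x">x</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

absolute = soup.select('a[href^="http"]')   # href starts with "http"
nav_links = soup.select("div#u > a.mnav")   # direct children of #u with class mnav
duty = soup.select('a[href$="duty"]')       # href ends with "duty"

print([a["href"] for a in absolute])
print([a.string for a in nav_links])
```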

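Going further still: links such as / or /duty are meaningful, because they are absolute paths relative to the site, so how do we keep links that start with / from being filtered out? If a link is a full absolute address, how do we keep that from being filtered? What if the href holds javascript? If CSS and JS files matter too, how do we find their URLs? And beyond that, how do we work out the addresses of the ajax requests made inside the JS? All of these are requirements one could build on.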
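Some of these questions can be sketched with the standard library's urljoin on top of BeautifulSoup. Everything below is an assumption for illustration: the function names page_links and asset_urls, the sample markup, and BASE_URL standing in for the page address. Finding ajax endpoints inside JavaScript is not attempted.

```python
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

BASE_URL = "http://www.baidu.com/"  # assumed address of the page being parsed

# Invented sample markup covering the cases raised above.
html = """
<a href="#">top</a>
<a href="javascript:void(0)">menu</a>
<a href="/duty">duty</a>
<a href="http://news.baidu.com/">news</a>
<link rel="stylesheet" href="/static/main.css">
<script src="/static/app.js"></script>
"""

def page_links(soup, base_url):
    """Return absolute http(s) links, skipping fragments and javascript: hrefs."""
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"].strip()
        if not href or href.startswith("#") or href.lower().startswith("javascript:"):
            continue
        absolute = urljoin(base_url, href)  # resolves /duty against the base
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links

def asset_urls(soup, base_url):
    """Return (css_urls, js_urls) referenced by <link> and <script> tags."""
    css = [urljoin(base_url, tag["href"])
           for tag in soup.find_all("link", href=True)
           if "stylesheet" in (tag.get("rel") or [])]
    js = [urljoin(base_url, tag["src"]) for tag in soup.find_all("script", src=True)]
    return css, js

soup = BeautifulSoup(html, "html.parser")
links = page_links(soup, BASE_URL)
css_urls, js_urls = asset_urls(soup, BASE_URL)
print(links, css_urls, js_urls)
```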

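All right, I admit that this later URL filtering goes beyond what BeautifulSoup itself covers. Looking purely at the functionality, though, these are all things that have to be considered. It is enough to think through how each could be implemented, and if that prompts some further study, count it as a bonus from this post.

Now a quick pass over what BeautifulSoup offers:

Python code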

DEFAULT_BUILDER_FEATURES
FORMATTERS
ROOT_TAG_NAME
STRIP_ASCII_SPACES: the four names above are built-in attributes of BeautifulSoup
__call__
__class__
__contains__
__delattr__
__delitem__
__dict__
__doc__
__eq__
__format__
__getattr__
__getattribute__
__getitem__
__hash__
__init__
__iter__
__len__
__module__
__ne__
__new__
__nonzero__
__reduce__
__reduce_ex__
__repr__
__setattr__
__setitem__
__sizeof__
__str__
__subclasshook__
__unicode__
__weakref__
_all_strings
_attr_value_as_string
_attribute_checker
_feed
_find_all
_find_one
_lastRecursiveChild
_last_descendant
_popToTag: the names above are BeautifulSoup's built-in (special and internal) methods; using them takes a deeper knowledge of Python.
append: modifies the element tree
attribselect_re
childGenerator
children
clear: removes the contents of a tag
decode
decode_contents
decompose
descendants
encode
encode_contents
endData
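extract: an important method, discussed later
fetchNextSiblings: following siblings
fetchParents: set of parent elements
fetchPrevious: previous elements
fetchPreviousSiblings: previous siblings. The fetch* methods above search among the current element's parents and siblings.
find: like a search with limit set to 1; returns only the first match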
findAll
findAllNext
findAllPrevious
findChild
findChildren: the set of children
findNext: next element
findNextSibling: next sibling
findNextSiblings: all following siblings
findParent: parent element
findParents: the set of all parent elements
findPrevious
findPreviousSibling
findPreviousSiblings: the find* methods above search by traversing from the current element (its children, siblings and parents).
find_all_next
find_all_previous
find_next
find_next_sibling
find_next_siblings
find_parent
find_parents
find_previous
find_previous_sibling
find_previous_siblings: the underscore-style names are the bs4 spellings; prefer these
format_string
get
getText
get_text: returns the text inside a tag, without the tags or their attributes
handle_data
handle_endtag
handle_starttag
has_attr
has_key
index
insert
insert_after
insert_before: modifies the element tree
isSelfClosing
is_empty_element
new_string
new_tag
next
nextGenerator
nextSibling
nextSiblingGenerator
next_elements
next_siblings
object_was_parsed
parentGenerator
parents
parserClass
popTag
prettify: pretty-prints the HTML document
previous
previousGenerator
previousSibling
previousSiblingGenerator
previous_elements
previous_siblings
pushTag
recursiveChildGenerator
renderContents
replaceWith
replaceWithChildren
replace_with
replace_with_children: modifies element content in the element tree
reset
select: selection using jQuery/CSS-style selector syntax
setup
string
strings
stripped_strings
tag_name_re
text
unwrap
wrap

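Note that some methods in BeautifulSoup come in two spellings, one in camelCase and one with underscores, yet they mean the same thing. BeautifulSoup does this to stay compatible with 3.x code: the camelCase form is the 3.x spelling and the underscore form is the 4.x spelling. The latter, the underscore methods, is the recommended one.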
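A quick way to see that the two spellings are the same method is to call both on one document. This is a sketch only; the warnings filter is there because newer releases may emit a deprecation warning for the camelCase names.

```python
import warnings
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>a</p><p>b</p>", "html.parser")

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # legacy names may warn in newer releases
    legacy = soup.findAll("p")        # 3.x-style camelCase spelling

modern = soup.find_all("p")           # 4.x-style underscore spelling

print(list(legacy) == list(modern))
```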

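With these methods you have what you need to traverse, extract from, modify and tidy up a document. If you get to use BeautifulSoup in your own work, you will certainly come to understand it more deeply.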
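A short sketch of the modifying and formatting side, using new_tag, append, replace_with, get_text and prettify from the list above. The starting markup is invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hello <b>world</b></p></div>", "html.parser")

# Build a new <a> tag and attach it as the last child of the <div>.
link = soup.new_tag("a", href="http://www.baidu.com")
link.string = "Baidu"
soup.div.append(link)

# Swap the <b> element for a plain string.
soup.b.replace_with("everyone")

paragraph_text = soup.p.get_text()  # text only, no tags or attributes
markup = str(soup)                  # compact serialisation
print(soup.prettify())              # indented, one tag per line
```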

BeautifulSoup supports several parsers. By default it parses as HTML; there are also the xml parser, the lxml parser, the html5lib parser and html.parser. Each of them needs the corresponding parser library to be available.

HTML, the default when no parser is named (Beautiful Soup then picks the best HTML parser installed)

Python code

BeautifulSoup("<a></a>")
# <html><head></head><body><a></a></body></html>

xml parser

Python code

BeautifulSoup("<a></a>", "xml")
# <?xml version="1.0" encoding="utf-8"?>
# <a></a>

lxml parser

Python code

BeautifulSoup("<a>", "lxml")
# <html><body><a></a></body></html>

html5lib parser

Python code

BeautifulSoup("<a>", "html5lib")
# <html><head></head><body><a></a></body></html>

html.parser parser

Python code

BeautifulSoup("<a>", "html.parser")
# <a></a>

These examples show how the parsers differ.
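To compare the parsers on your own machine, the sketch below feeds the same fragment to each one and records which are missing. It assumes only that bs4 is installed; lxml and html5lib are separate packages, and a missing one is reported through bs4's FeatureNotFound exception.

```python
from bs4 import BeautifulSoup, FeatureNotFound

results = {}
for parser in ("html.parser", "lxml", "html5lib", "xml"):
    try:
        results[parser] = str(BeautifulSoup("<a>", parser))
    except FeatureNotFound:
        results[parser] = None  # the library behind this parser is not installed

for name, output in results.items():
    print(name, "->", output if output is not None else "not installed")
```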

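When BeautifulSoup parses a document, it loads the whole document into memory as one large, dense data structure. If all you want from it is a single string, keeping that much data in memory is poor value. Moreover, if you hold on to one Tag, that Tag points to other Tag objects, so every part of the tree has to be kept; in other words the whole tree stays in memory. The extract method breaks these links by detaching part of the tree. Once you have extracted the Tag you need, the rest of the tree can be dropped and reclaimed by the garbage collector. It works the other way round too: if there is a block of the document you don't care about, you can extract it from the tree and hand it to the garbage collector, which saves memory while still getting your job done.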
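Both uses of extract can be sketched on a small invented page: pulling out the one block you want, and throwing away a block you don't. decompose, which also appears in the method list, removes a tag and destroys it instead of returning it.

```python
from bs4 import BeautifulSoup

# Invented sample page with a script, a main block and an unwanted block.
html = (
    "<html><body><script>var x = 1;</script>"
    '<div id="main"><p>keep me</p></div>'
    '<div id="ads">ad</div></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Discard a block you don't care about; extract() returns the detached tag.
ads = soup.find("div", id="ads").extract()

# Remove the script outright; decompose() destroys the tag in place.
soup.script.decompose()

# Keep only the block you need; it no longer references the rest of the tree.
main = soup.find("div", id="main").extract()
del soup  # with no references left, the remaining tree can be garbage collected

print(main.p.string, ads.parent)
```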

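As Leonard, the author of BeautifulSoup, puts it, he wrote BeautifulSoup to save other people time and reduce their workload. Once you are used to BeautifulSoup, the content of many sites can be dealt with quickly. That is the spirit of open source: automate as much of the work as possible and reduce the workload. In a sense programmers ought to be fairly lazy, but it is exactly this laziness that pushes the software industry forward.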

When importing the module, following the original post (http://isilic.iteye.com/blog/1741918) never worked for me; trying import bs4 followed by from bs4 import BeautifulSoup did the trick.
