Learning Beautiful Soup, a web page parsing tool

Date: 2019-11-08


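Beautiful Soup is a Python library for parsing HTML and XML. It lets you parse a document the way you prefer, then search and modify the parse tree. It copes well with malformed markup and still produces a parse tree, and it provides simple, commonly used operations for navigating, searching, and modifying that tree.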

Installing Beautiful Soup

# Install version 3
apt-get install python-beautifulsoup
# Install version 4
apt-get install python-bs4 python-bs4-doc

Since this is practice, the exercises use the example from the documentation. The HTML document is as follows:

<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>

The Beautiful Soup module provides a BeautifulSoup class; constructing it returns a structured representation of the document.

Parse tree: the data structure Beautiful Soup produces after parsing a document.

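A parser object (an instance of BeautifulSoup or BeautifulStoneSoup, the latter being the XML parser class from version 3) is a deeply nested, carefully designed data structure that mirrors the structure of the XML or HTML document. It contains two kinds of objects: Tag objects, which represent tags such as <TITLE> and <B>, and NavigableString objects, which represent strings such as "Page title" and "This is paragraph".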

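t = BeautifulSoup(file, from_encoding="UTF-8") parses an HTML document and returns a handle t. You can specify the encoding of the input; internally the document is held as Unicode. Use str(t) to convert the Beautiful Soup document to a string, or use prettify(), which adds line breaks and spaces so the document is easier to read. If the original document contains an encoding declaration, Beautiful Soup rewrites it to the new encoding. In other words, when you load an HTML document into BeautifulSoup and output it again, not only has the HTML been cleaned up, it has also visibly been converted to UTF-8.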
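The encoding handling described above can be tried on a tiny document. This is a minimal sketch, assuming Beautiful Soup 4 on Python 3 and the standard-library "html.parser" backend; the Latin-1 input bytes are a made-up example, not part of the original exercise.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a small document encoded as Latin-1 bytes.
raw = "<html><head><title>caf\u00e9</title></head></html>".encode("latin-1")

# from_encoding tells Beautiful Soup how to decode the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

title_text = soup.title.string    # text is held as Unicode inside the tree
as_text = str(soup)               # the whole document as a string
as_utf8 = soup.encode("utf-8")    # the whole document as UTF-8 bytes
pretty = soup.prettify()          # indented output, easier to read

print(title_text)
print(pretty)
```

Whatever the input encoding was, the tree holds Unicode text, so the output encoding is chosen only when the document is written out again.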

Exercise 1: return a well-structured HTML document

#!/usr/bin/python
#coding=utf-8
from bs4 import BeautifulSoup
html_doc = '... the HTML content shown above ...'
soup = BeautifulSoup(html_doc)
print(soup.prettify())
 
Result:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

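The HTML that was previously run together now has line breaks and is easier to read. The returned soup handle can be used to access specific parts of the data structure.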

print(soup.title)
> <title>The Dormouse's story</title>

print(soup.title.name)
> title

print(soup.title.string)
> The Dormouse's story

print(soup.p)
> <p class="title"><b>The Dormouse's story</b></p>

print(soup.p["class"])
> ['title']

print(soup.find_all('a'))
> [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

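A Tag object corresponds to a tag in the XML or HTML document. Tags have attributes and methods, and both Tag and NavigableString objects have many useful members. Only Tag objects have attributes; NavigableString objects do not. Every tag has a name, which you can read through tag.name and also change. A tag may have several attributes: for example, in <b class="boldest"> the tag b has a class attribute whose value is boldest. You can access a tag's attributes dictionary-style, or through attrs. Some attributes, such as class and rel, can hold multiple values; Beautiful Soup presents the values of such attributes as a list.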

soup = BeautifulSoup(html_doc)
print(soup.a['href'])
> http://example.com/elsie

print(soup.a.attrs)
> {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

soup=BeautifulSoup('<p class="story dns">dns</p>')
print(soup.p['class'])
> ['story', 'dns']
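The examples above only read names and attributes. Changing them works the same way; the following is a small sketch, assuming Beautiful Soup 4 with the standard-library "html.parser" backend. The replacement name and attribute values are arbitrary illustrations.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

tag.name = "blockquote"      # rename the tag
tag["class"] = "verybold"    # overwrite an existing attribute
tag["id"] = "1"              # add a new attribute
print(tag)

del tag["id"]                # remove an attribute again
print(tag.get("id"))         # .get() gives None for a missing attribute
print(tag)
```

Using tag.get("id") rather than tag["id"] avoids a KeyError when the attribute may be absent.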

A NavigableString object holds the text inside a tag. Use tag.string to read it, and tag.string.replace_with('string...') to change it.

soup = BeautifulSoup(html_doc)
print(soup.title.string)
> The Dormouse's story
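Reading the string is shown above; replacing it can be sketched as follows, assuming Beautiful Soup 4 with the "html.parser" backend. The new title text is just a placeholder.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<title>The Dormouse's story</title>", "html.parser")

# Swap the NavigableString inside <title> for a new one.
soup.title.string.replace_with("A new title")
after_replace = str(soup.title)
print(after_replace)

# Assigning to .string replaces the tag's contents in one step.
soup.title.string = "Another title"
after_assign = str(soup.title)
print(after_assign)
```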

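The BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a Tag object, which means it supports most of the methods described in Navigating the tree and Searching the tree.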

Navigating the tree: moving around the parse tree

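A tag may contain other tags and strings; the contained tags are called child tags. Beautiful Soup offers several ways to reach them. Strings do not support these features, because a string cannot have children.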

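The simplest way to reach a tag in the parse tree is by name, as in soup.tag. You can chain names to reach a tag nested inside a larger one, for example soup.body.b for the b tag under body. Access by name returns only the first match; to find every occurrence of a tag in the document, use the find_all() method from Searching the tree.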

print(soup.head)
> <head><title>The Dormouse's story</title></head>
print(soup.head.title)
> <title>The Dormouse's story</title>
print(soup.find_all('a'))
> [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

A tag's children are stored in a list called contents; strings have no contents attribute. Besides indexing this list, you can iterate over the children with children.

soup = BeautifulSoup(html_doc)
html_tag=soup.html.body.contents[0]

print(html_tag)
> <p class="title"><b>The Dormouse's story</b></p>

for child in soup.html.body.children:
    print(child)
> <p class="title"><b>The Dormouse's story</b></p>
> <p class="story">Once upon a time there were three little sisters; and their names were
> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
> <p class="story">...</p>

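.contents and .children only consider a tag's direct children. For example, head has a single direct child, the title tag; the string inside title is a child of title, not of head. The .descendants attribute, by contrast, iterates recursively, so it also includes the string inside title.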
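The difference between direct children and descendants is easy to see on the head tag alone. A minimal sketch, assuming Beautiful Soup 4 with the "html.parser" backend:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
head = soup.head

direct = head.contents              # direct children only: the <title> tag
all_nodes = list(head.descendants)  # <title>, then the string inside it

print(len(direct))      # 1
print(len(all_nodes))   # 2
for node in all_nodes:
    print(repr(node))
```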

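If a tag has only one child and that child is a NavigableString, the child is available as .string. If a tag's only child is another tag, and that child has a .string, the parent is considered to have the same .string as its child.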

soup = BeautifulSoup(html_doc)
print(soup.title.string)
> The Dormouse's story

print(soup.head.string)
> The Dormouse's story

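If a tag contains more than one string, use .strings to iterate over all of them. Just as you can go from a parent to its children, you can go from a child back up to its parent. Every tag and every string has a parent, available as .parent; .parents iterates over all of an element's ancestors.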

soup = BeautifulSoup(html_doc)
print(soup.title.parent)
> <head><title>The Dormouse's story</title></head>
print(soup.title.parent.name)
> head
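The .strings and .parents iterators mentioned above are not exercised in the example, so here is a short sketch of both. It assumes Beautiful Soup 4 with the "html.parser" backend and uses a shortened, made-up version of the story paragraph.

```python
from bs4 import BeautifulSoup

html = (
    '<html><body><p class="story">Once '
    '<a id="link1">Elsie</a> and <a id="link2">Lacie</a></p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Every string inside <p>, in document order.
texts = list(soup.p.strings)
print(texts)          # ['Once ', 'Elsie', ' and ', 'Lacie']

# Walk upward from the first <a> to the top of the tree.
parent_names = [parent.name for parent in soup.a.parents]
print(parent_names)   # ['p', 'body', 'html', '[document]']
```

The last entry, '[document]', is the name of the BeautifulSoup object itself, which sits at the top of the tree.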

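In the HTML example at the start of this article, the second p tag contains three a tags at the same level. These three a tags are called siblings, and the .next_sibling and .previous_sibling attributes move forward or backward between elements at the same level.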

soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>")
print(soup.b.next_sibling)
> <c>text2</c>

print(soup.c.previous_sibling)
> <b>text1</b>

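The b tag has a next_sibling but no previous_sibling, and the c tag has a previous_sibling but no next_sibling. Note that the strings text1 and text2 are not siblings, because they do not share a parent. You can also use .next_siblings and .previous_siblings to iterate over all the siblings that come after or before a given tag.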
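Iterating over siblings can be sketched like this, assuming Beautiful Soup 4 with the "html.parser" backend. The extra d tag is added here only to make the iteration visible; it is not in the original example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<a><b>text1</b><c>text2</c><d>text3</d></a>", "html.parser"
)

# All siblings after <b>, nearest first.
after_b = [sib.name for sib in soup.b.next_siblings]
print(after_b)     # ['c', 'd']

# All siblings before <d>, nearest first.
before_d = [sib.name for sib in soup.d.previous_siblings]
print(before_d)    # ['c', 'b']

# <b> is the first child, so nothing precedes it.
print(soup.b.previous_sibling)        # None

# The string "text1" is alone inside <b>, so it has no sibling.
print(soup.b.string.next_sibling)     # None
```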

Searching the tree: searching the parse tree

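Beautiful Soup defines a number of convenient methods for searching the parse tree; use them to filter out the parts of the document you are interested in. The search methods accept several kinds of filter: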

string: the simplest filter. Pass a string to a search method and Beautiful Soup matches tags whose name is exactly that string. soup.find_all('b')

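regular expression: you can also pass a compiled regular expression; Beautiful Soup filters tag names with its search() method, so the pattern may match anywhere in the name. soup.find_all(re.compile("t"))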

list: you can also pass a list, which matches any one of its elements. soup.find_all(["a", "b"])

True: a special value that matches every tag.
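The four filter types listed above can be compared side by side. This is a small sketch, assuming Beautiful Soup 4 with the "html.parser" backend; the HTML snippet is a made-up miniature document.

```python
import re

from bs4 import BeautifulSoup

html = (
    "<html><head><title>T</title></head>"
    "<body><p><b>bold</b> <a href='#'>link</a></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# string: exact tag name
by_string = [t.name for t in soup.find_all("b")]
# regular expression: any tag whose name contains "t"
by_regex = [t.name for t in soup.find_all(re.compile("t"))]
# list: any of the listed tag names
by_list = [t.name for t in soup.find_all(["a", "b"])]
# True: every tag, but no strings
every = [t.name for t in soup.find_all(True)]

print(by_string)   # ['b']
print(by_regex)    # ['html', 'title']
print(by_list)     # ['b', 'a']
print(every)       # ['html', 'head', 'title', 'body', 'p', 'b', 'a']
```

Results come back in document order, which is why the list filter returns b before a here.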

Now let's look at one of the search methods, find_all(), which finds every tag in the document that matches the filter.

find_all(name, attrs, recursive, text, limit, **kwargs)

name: a value passed as name is treated as a tag name. It can be any of the filter types described above.

recursive: when matching under a tag, Beautiful Soup checks all of that tag's descendants recursively. To match only direct children, set recursive=False.

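text: a value passed as text is used to search strings instead of tags. Although text searches strings, it can be combined with a tag filter. Newer Beautiful Soup 4 releases call this argument string. soup.find_all("a", text="Elsie")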

limit: find_all() returns everything that matches the tag or text filter. If you only need the first few matches rather than all of them, cap the number with limit.

kwargs: filters on attribute values, passed as keyword arguments; several can be given at once. soup.find_all(href=re.compile("elsie"), id='link1')

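attrs: what if your document has a tag that defines a name attribute? You cannot use name as a keyword argument, because Beautiful Soup already uses name for its own parameter. You also cannot use a Python reserved word such as for as a keyword argument. For these cases Beautiful Soup provides the special attrs argument: a dictionary that works just like keyword arguments.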
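The find_all() arguments described above can be tried together on a cut-down version of the sample document. This sketch assumes Beautiful Soup 4.4.0 or newer (so the string filter is spelled string= rather than text=) and the "html.parser" backend; the input tag with a name attribute is an invented addition to demonstrate attrs.

```python
import re

from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
<input name="email"/>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# limit: stop after the first two matches.
first_two = soup.find_all("a", limit=2)

# kwargs: every keyword filters on the attribute of the same name.
by_kwargs = soup.find_all(href=re.compile("elsie"), id="link1")

# string (called text in older releases): match on the tag's string.
by_string = soup.find_all("a", string="Elsie")

# recursive=False: only direct children of <html> are examined,
# and <title> is a grandchild, so nothing is found.
direct_titles = soup.html.find_all("title", recursive=False)

# attrs: needed here because name= is already find_all's own parameter.
by_attrs = soup.find_all(attrs={"name": "email"})

print([a["id"] for a in first_two])   # ['link1', 'link2']
print(by_kwargs)
print(by_string)
print(direct_titles)                  # []
print(by_attrs)
```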

find(): returns the first tag that matches.

find_next_siblings(): uses .next_siblings to iterate over the remaining siblings and returns all that match, whereas find_next_sibling() returns only the first match.

find_all_next(): uses .next_elements to iterate over all tags and strings that come after the element and returns all that match, whereas find_next() returns only the first match.
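The methods in this last group differ mainly in where they look and in whether they return one result or a list. A minimal sketch, assuming Beautiful Soup 4 with the "html.parser" backend and a shortened, made-up version of the story paragraphs:

```python
from bs4 import BeautifulSoup

html = (
    '<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a>; the end.</p><p class="story">...</p>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")        # first match only
missing = soup.find("table")  # no match gives None, not an empty list

# Siblings after the first <a>: all matches vs. only the nearest one.
later_links = first.find_next_siblings("a")
next_link = first.find_next_sibling("a")

# Anything later in the document, not only siblings.
following_ps = first.find_all_next("p")
next_p = first.find_next("p")

print(first["id"])                       # link1
print(missing)                           # None
print([a["id"] for a in later_links])    # ['link2', 'link3']
print(next_link["id"])                   # link2
print(following_ps)                      # [<p class="story">...</p>]
```

The first p does not appear in following_ps because it is an ancestor of the starting a tag, not something that comes after it.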