Web Scraping 9: BeautifulSoup

Date: 2019-12-20


CSS selectors: BeautifulSoup4

Like lxml, Beautiful Soup is an HTML/XML parser; its main job is likewise parsing HTML/XML and extracting data from it.

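lxml only traverses the part of the document it needs, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the entire DOM tree, so its time and memory costs are considerably higher and its performance is lower than lxml's.
Beautiful Soup makes parsing HTML fairly simple and has a very user-friendly API. It supports CSS selectors and can use either the HTML parser in the Python standard library or lxml's HTML and XML parsers.
Beautiful Soup 3 is no longer being developed, so Beautiful Soup 4 is recommended for current projects. Install it with pip: pip install beautifulsoup4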

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0

Building the object:

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
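# Create the Beautiful Soup object, naming the parser explicitly
# (without it, Beautiful Soup picks a parser itself and emits a warning)
soup = BeautifulSoup(html, "lxml")
# Pass "html.parser" instead to use the parser from the Python standard library

# Create the object from a local HTML file
# soup = BeautifulSoup(open('xxx.html'), "lxml")

# Pretty-print the contents of the soup object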
print(soup.prettify())
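
As a minimal sketch of what the parser argument does, the snippet below feeds a short fragment with unclosed tags to the standard-library "html.parser", so it runs without lxml installed. The fragment and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative fragment: the <b> and <p> tags are never closed
fragment = '<p class="title"><b>The Dormouse\'s story'

# "html.parser" ships with Python; "lxml" would need the lxml package
soup = BeautifulSoup(fragment, "html.parser")

# The tree is complete even though the input was not
repaired = str(soup)
print(repaired)
```

Different parsers may repair broken markup in slightly different ways, which is one reason to name the parser explicitly.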

The four kinds of objects

Beautiful Soup turns a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds:

Tag

NavigableString

BeautifulSoup

Comment

1.Tag

Put simply, a Tag is an HTML tag together with the content it contains.

Using Beautiful Soup to get Tags:

soup = BeautifulSoup(html, "lxml")

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.head)   # prints only the first match
# <head><title>The Dormouse's story</title></head>

print(soup.a)     # prints only the first match
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

print(type(soup.a))  # print the type; the result is Tag
# <class 'bs4.element.Tag'>

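We can get these tags simply by writing soup followed by the tag name; the resulting objects are of type bs4.element.Tag. Note, however, that this only finds the first matching tag in the whole document. Querying for all matching tags is covered later.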

A Tag has two important attributes: name and attrs.

print(soup.name)
# [document]  # the soup object itself is special: its name is [document]

print(soup.head.name)
# head  # for any other tag, the value is the name of the tag itself

print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}
# This prints all attributes of the p tag; the result is a dictionary.

print(soup.p['class'])  # same as soup.p.get('class') when the attribute exists
# ['title']  # get() with the attribute name gives the same result

soup.p['class'] = "newClass"
print(soup.p)  # attributes and content can be modified
# <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

del soup.p['class']  # an attribute can also be deleted
print(soup.p)
# <p name="dromouse"><b>The Dormouse's story</b></p>
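
The difference between indexing and get() shows up when an attribute is absent. The sketch below uses a made-up one-line document to illustrate this; it also shows that tag.name is the tag's own name, while an HTML attribute that happens to be called name is read with tag['name'].

```python
from bs4 import BeautifulSoup

# Illustrative one-tag document
soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', "html.parser")
p = soup.p

tag_name = p.name            # the tag's own name
attr_name = p["name"]        # the HTML attribute called "name"
classes = p["class"]         # class is multi-valued, so this is a list
missing = p.get("id")        # get() returns None for an absent attribute
has_id = p.has_attr("id")    # explicit existence check

try:
    p["id"]                  # indexing an absent attribute raises KeyError
    raised = False
except KeyError:
    raised = True

print(tag_name, attr_name, classes, missing, has_id, raised)
```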

2.NavigableString

To get the text inside a tag, use .string:

print(soup.p.string)
# The Dormouse's story

print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>

3.BeautifulSoup

A BeautifulSoup object represents the content of a whole document. Most of the time it can be treated as a Tag object; it is a special kind of Tag:

print(type(soup.name))
# <class 'str'>

print(soup.name)
# [document]

print(soup.attrs)
# {}

4.Comment

A Comment object is a special kind of NavigableString; when it is output, the comment markers are not included.

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

print(soup.a.string)
# Elsie

print(type(soup.a.string))
# <class 'bs4.element.Comment'>

The content of this a tag is actually a comment, but when we output it with .string the comment markers have already been stripped.
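
Because .string hides the comment markers, code that wants only visible text may need to check the type first. The following is a small sketch of one way to do that; visible_string is a hypothetical helper, not part of Beautiful Soup, and the two-link document is made up.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Illustrative document: one link holds a comment, the other plain text
soup = BeautifulSoup(
    '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>', "html.parser"
)


def visible_string(tag):
    """Hypothetical helper: return tag.string unless it is a comment."""
    value = tag.string
    if isinstance(value, Comment):
        return None
    return value


first, second = soup.find_all("a")
print(visible_string(first), visible_string(second))
```

Comment is a subclass of NavigableString, so the isinstance check has to test for Comment specifically.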

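Traversing the document tree

1. Direct children: the .contents and .children attributes

.contents
A tag's .contents attribute returns the tag's direct children as a list.

print(soup.head.contents)
# [<title>The Dormouse's story</title>]

Because the result is a list, we can use a list index to get a particular element:

print(soup.head.contents[0])
# <title>The Dormouse's story</title>

.children
.children returns an iterator over the same direct children rather than a list, so loop over it to get the items:

print(soup.head.children)
# <list_iterator object at 0x7f71457f5710>  (the exact type shown depends on the Beautiful Soup version)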
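
As a short illustration of consuming .children, the sketch below loops over the iterator and compares it with .contents. The list markup is made up for the example.

```python
from bs4 import BeautifulSoup

# Illustrative document with two direct children under <ul>
soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul = soup.ul

# Iterate over the direct children and collect their tag names
child_names = [child.name for child in ul.children]

# Materializing the iterator gives the same items as .contents
as_list = list(ul.children)

print(child_names, len(as_list))
```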

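2. All descendants: the .descendants attribute

.contents and .children contain only a tag's direct children. .descendants walks recursively through all of a tag's descendants; as with .children, we need to iterate over it to get the contents.

for child in soup.descendants:
    print(child)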
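
To make the difference concrete, here is a small sketch that counts direct children and descendants of the same tag. Text nodes count as descendants too, so the example filters on Tag to keep only elements. The markup is a cut-down version of the head element used above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head = soup.head

direct = list(head.children)       # only the <title> tag
nested = list(head.descendants)    # the <title> tag and the string inside it

# Keep only element nodes, dropping text nodes
tags_only = [node for node in nested if isinstance(node, Tag)]

print(len(direct), len(nested), len(tags_only))
```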

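3. Node content: the .string attribute

If a tag has exactly one child and that child is a NavigableString, .string returns that child. If a tag has exactly one child and that child is another tag, .string returns the same result as the child's .string. If a tag has more than one child, .string is None.

Put simply: if a tag contains no other tags, .string returns its text. If a tag contains exactly one tag, .string returns the innermost text. For example:

print(soup.head.string)
# The Dormouse's story
print(soup.title.string)
# The Dormouse's story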
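
When a tag holds several children, .string has no single value to return and is None. The sketch below shows that case on a made-up paragraph, together with two alternatives, get_text() and .stripped_strings, that collect all of the text.

```python
from bs4 import BeautifulSoup

# Illustrative paragraph: text, a nested tag, then more text
soup = BeautifulSoup("<p>Once upon a time <b>three</b> sisters</p>", "html.parser")
p = soup.p

single = p.string                    # None: <p> has three children
inner = p.b.string                   # the <b> tag has exactly one string child
joined = p.get_text()                # all text concatenated
pieces = list(p.stripped_strings)    # each string, whitespace-trimmed

print(single, inner, joined, pieces)
```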

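Searching the document tree

1. find_all(name, attrs, recursive, text, **kwargs)
1 The name parameter
The name parameter finds all tags whose name is name; string nodes are ignored automatically.

A. Passing a string
The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for tags whose name matches that string exactly. The following example finds all <b> tags in the document: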

soup.find_all('b')
# returns a list of the matching tags
# [<b>The Dormouse's story</b>]

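B. Passing a regular expression
If you pass a regular expression, Beautiful Soup matches tag names against it with the pattern's search() method. The following example finds all tags whose names start with b, so both the <body> and <b> tags are found: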

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
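
The ^ anchor matters here, because Beautiful Soup tests a compiled pattern against tag names with search(), which can match anywhere in the name. The sketch below contrasts an anchored and an unanchored pattern on a made-up document.

```python
import re

from bs4 import BeautifulSoup

# Illustrative document whose tag names are body, b and table
soup = BeautifulSoup("<body><b>bold</b><table></table></body>", "html.parser")

# Anchored: only names that start with "b"
anchored = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored: any name that contains "b", so "table" matches as well
unanchored = [tag.name for tag in soup.find_all(re.compile("b"))]

print(anchored)
print(unanchored)
```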

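C. Passing a list
If you pass a list, Beautiful Soup returns everything that matches any element of the list. The following code finds all <a> and <b> tags in the document: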

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

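2 Keyword arguments
Any keyword argument that is not one of the named parameters is treated as a filter on the tag attribute of that name.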

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
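
Two attribute names cannot be written as plain keywords: class is a reserved word in Python, and names such as data-kind are not valid identifiers. The sketch below shows the usual workarounds, class_ and the attrs dictionary, on a made-up document; the data-kind attribute is invented for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative document; data-kind is a made-up attribute
html = (
    '<a class="sister" id="link1" data-kind="x">A</a>'
    '<a class="sister brother" id="link2">B</a>'
    '<a id="link3">C</a>'
)
soup = BeautifulSoup(html, "html.parser")

# class_ matches a tag if any one of its classes equals the value
by_class = [a["id"] for a in soup.find_all("a", class_="sister")]

# attrs takes a dictionary, which allows names that are not valid keywords
by_attrs = [a["id"] for a in soup.find_all(attrs={"data-kind": "x"})]

# find() returns the first match only, or None when nothing matches
first = soup.find(id="link3")
nothing = soup.find(id="link9")

print(by_class, by_attrs, first, nothing)
```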

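3 The text parameter
The text parameter searches the strings in the document instead of the tags. Like name, it accepts a string, a regular expression, or a list. Since Beautiful Soup 4.4.0 this argument is named string; text is the older name.

soup.find_all(text="Elsie")
# []  # in this document "Elsie" appears only inside a comment, whose text is " Elsie " with surrounding spaces

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Lacie', 'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]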
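
A string search returns the matching strings themselves, not the tags around them. As a rough sketch, the example below uses string, the current name of the text argument, and then follows .parent to get back to the enclosing tags. The document is made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# Illustrative document with the same text in two different tags
soup = BeautifulSoup("<p>Elsie</p><p>Lacie and Tillie</p><b>Elsie</b>", "html.parser")

# An exact string must equal the whole text node
exact = soup.find_all(string="Elsie")

# Each result is a NavigableString; .parent is the tag that contains it
parents = [s.parent.name for s in exact]

# A regular expression can match part of a text node
partial = soup.find_all(string=re.compile("Tillie"))

print(exact, parents, partial)
```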

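CSS selectors

This is another way of searching that reaches the same results as find_all by a different route.

In CSS, a tag name is written with no prefix, a class name is prefixed with . and an id is prefixed with #.

We can filter elements the same way here, using the soup.select() method, which returns a list.

Finding by tag name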

soup.select('title') 
#[<title>The Dormouse's story</title>]

soup.select('a')
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('b')
#[<b>The Dormouse's story</b>]

Finding by class name

soup.select('.sister')
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Finding by id

print(soup.select('#link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

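Combined selectors

Combined selectors work the same way as when writing a CSS file, where tag names, class names, and ids are combined. For example, to find the element with id link1 inside a p tag, separate the two with a space: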

soup.select('p #link1')
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
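
A space in a selector means "any descendant", while > restricts the match to direct children. The sketch below contrasts the two on a made-up document and also shows select_one(), which returns the first match instead of a list.

```python
from bs4 import BeautifulSoup

# Illustrative document: link1 is nested inside <p>, link2 sits directly in <div>
html = '<div id="box"><p><a id="link1">x</a></p><a id="link2">y</a></div>'
soup = BeautifulSoup(html, "html.parser")

# Descendant combinator (space): any <a> anywhere inside the <div>
descendants = [a["id"] for a in soup.select("div a")]

# Child combinator (>): only <a> tags that are direct children of the <div>
children = [a["id"] for a in soup.select("div > a")]

# select_one returns a single tag, or None when nothing matches
first = soup.select_one("p #link1")

print(descendants, children, first)
```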

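Attribute selectors

A selector can also test attributes, written in square brackets. The attribute test and the tag name refer to the same node, so there must be no space between them; otherwise nothing will match.

print(soup.select('a[class="sister"]'))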
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
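
Besides exact matches, standard CSS attribute selectors can test for the presence of an attribute or for a prefix or suffix of its value. The sketch below assumes a Beautiful Soup version whose select() supports these standard forms, and uses made-up URLs.

```python
from bs4 import BeautifulSoup

# Illustrative links; the second URL and the bare <a> are invented
html = (
    '<a href="http://example.com/elsie">E</a>'
    '<a href="https://other.org/tillie">T</a>'
    "<a>no href</a>"
)
soup = BeautifulSoup(html, "html.parser")

# [href]: the attribute is present, whatever its value
with_href = soup.select("a[href]")

# [href^=...]: the value starts with the given text
prefix = [a.string for a in soup.select('a[href^="http://example.com/"]')]

# [href$=...]: the value ends with the given text
suffix = [a.string for a in soup.select('a[href$="tillie"]')]

print(len(with_href), prefix, suffix)
```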

Getting the content

All of the select calls above return lists; iterate over the result and call get_text() on each element to get its text.

soup = BeautifulSoup(html, 'lxml')

soup.select('title')[0].get_text()
# or
for title in soup.select('title'):
    print(title.get_text())
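
In scraping work the text is often wanted together with an attribute such as href. The following is a minimal sketch of that pattern, pairing each link's text with its URL; the markup reuses two of the sister links from the example document.

```python
from bs4 import BeautifulSoup

html = (
    '<p><a class="sister" href="http://example.com/lacie"> Lacie </a>'
    '<a class="sister" href="http://example.com/tillie">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Pair each link's trimmed text with its href attribute
links = [(a.get_text(strip=True), a["href"]) for a in soup.select("a.sister")]

for text, url in links:
    print(text, url)
```

Passing strip=True to get_text() trims the whitespace around each piece of text, which is usually what you want before storing the result.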
