Python Crawler Notes 4: Using BeautifulSoup

Date: 2019-11-08

Introduction to BeautifulSoup

Like lxml, BeautifulSoup is an HTML/XML parser; its main job is likewise to parse HTML/XML and extract data from it.

Comparison of parsing tools

Tool	Speed	Difficulty
Regular expressions	Fastest	Hard
BeautifulSoup	Slow	Easiest
lxml	Fast	Easy

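The usual explanation is that lxml only traverses the document locally, whereas Beautiful Soup works on the HTML DOM: it loads the whole document and builds the complete DOM tree as Python objects. Its time and memory overhead is therefore considerably higher, and its performance is lower than lxml's.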

Installation
My environment is Python 3.6.5. On Windows, running pip from cmd is all that is needed.

pip3 install beautifulsoup4

Test
Import BeautifulSoup in a Python shell; if no error is reported, the installation succeeded.

>>> from bs4 import BeautifulSoup
>>>
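Beyond the bare import, a quick parse confirms that a parser is actually usable. The sketch below is only an illustration: make_soup is a hypothetical helper (not part of bs4) that tries the lxml parser used later in these notes and falls back to Python's built-in html.parser if lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the built-in parser."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<p>hello</p>')
print(soup.p.string)
```

If lxml is missing, pip3 install lxml adds it; the rest of these notes work with either parser.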

BeautifulSoup objects

BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds:

Tag
NavigableString
BeautifulSoup
Comment

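The BeautifulSoup object represents the document as a whole. Most of the time it can be treated as a Tag object; it is a special kind of Tag.
A Comment object is a special kind of NavigableString; when printed, its content does not include the comment delimiters.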
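The four kinds of object can be told apart with isinstance. The following is a minimal sketch on a made-up two-tag snippet; it uses html.parser only so that it runs without lxml, which is my choice rather than something these notes require.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Illustrative markup: one tag holding a comment, one holding plain text
soup = BeautifulSoup('<a id="link1"><!-- Elsie --></a><b>Lacie</b>', 'html.parser')

comment = soup.a.string  # the only child of <a> is a comment
text = soup.b.string     # the only child of <b> is ordinary text

print(type(soup).__name__)     # the document object itself
print(type(soup.a).__name__)   # a tag inside it
print(type(text).__name__)     # plain text node
print(type(comment).__name__)  # comment node
print(repr(str(comment)))      # comment text without the <!-- --> markers
```

Because a Comment is also a NavigableString, code that reads .string and wants real text only may need an explicit isinstance(node, Comment) check.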

Tag

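A Tag can be understood simply as one of the tags in an HTML document, for example:

<head><title>The Dormouse's story</title></head>
<ul><li class="item-0"><a href="link1.html">first item</a></li></ul>

In the HTML above, head, title, ul and li are all HTML tags (node names); a tag together with its content is a Tag.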

Getting Tags

# Import the module
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
# Initialize the BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, 'lxml')

# prettify() returns the content of soup as nicely indented text
print(soup.prettify())

# soup.title selects the title node
print(soup.title)
# <title>The Dormouse's story</title>

print(type(soup.title))
# <class 'bs4.element.Tag'>

print(soup.head)
# <head><title>The Dormouse's story</title></head>

print(soup.p)
# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

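Note: soup followed by a node name returns that node, and the returned object is of type bs4.element.Tag. However, only the first matching node in the document is returned. For example, the document above contains several p tags, but soup.p only finds the first one.

A Tag has two important attributes, name and attrs. Once a node is selected, name gives the node's name and attrs gives its attributes (returned as a dictionary).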

print(soup.name)
# [document]  # the soup object itself is special: its name is [document]
print(soup.head.name)
# head  # for any other tag, the value is the tag's own name

print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}
# This prints all attributes of the p tag; the result is a dictionary.

# The three forms below all read a value from that dictionary and
# give the same result for an attribute that exists
print(soup.p.get('class'))
# ['title']
print(soup.p['class'])
# ['title']
print(soup.p.attrs['class'])
# ['title']

# Attributes and content can also be modified
soup.p['class'] = "newClass"
print(soup.p)
# <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
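The three access forms only differ when the attribute is missing. The sketch below, on a made-up one-tag snippet parsed with html.parser (my choice, so it runs without lxml), shows that difference and also why class comes back as a list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', 'html.parser')
p = soup.p

classes = p['class']   # class is multi-valued, so a list is returned
name = p['name']       # single-valued attributes come back as a string
missing = p.get('id')  # .get() returns None for a missing attribute

try:
    p['id']            # indexing a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True

print(classes, name, missing, raised)
```

For attributes that may be absent, .get() is therefore the safer form.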

NavigableString

Getting a Tag gives you the whole node, but what if you only want the text inside it? Just use .string.

# Get the text inside the node
print(soup.p.string)
# The Dormouse's story

print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>
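One caveat worth knowing: .string only works when a tag has a single child (or a single chain of single children, as with the p and b tags above). With several children it returns None, and get_text() is the usual alternative. A minimal sketch on a made-up snippet, using html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><b>one</b> and <i>two</i></p>', 'html.parser')

single = soup.b.string        # <b> has exactly one text child
ambiguous = soup.p.string     # <p> has three children, so this is None
all_text = soup.p.get_text()  # concatenates every string under <p>

print(single, ambiguous, all_text)
```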

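Traversing the document tree

When selecting nodes, you can also pick one node first and then, taking it as the base, select its children, parent, descendants and so on. The commonly used ways of doing this are described below.

Direct children: the .contents and .children attributes

.contents

A tag's .contents attribute returns the tag's direct children as a list.
The example below takes the head node as the base; .contents selects head's child node title and returns it in a list.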

print(soup.head.contents)
# [<title>The Dormouse's story</title>]

Since the result is a list, a single element can be fetched by index.

print(soup.head.contents[0])
# <title>The Dormouse's story</title>

.children

Unlike .contents, the .children attribute does not return a list but an iterator, which can be consumed with a for loop.

print(soup.head.children)
# <list_iterator object at 0x0000017415655588>

for i in soup.head.children:
    print(i)
# <title>The Dormouse's story</title>

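All descendants: the .descendants attribute

The two attributes above only reach the direct children of the base node. To get all of a node's descendants, use the .descendants attribute, which returns a generator.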

print(soup.descendants)
# <generator object descendants at 0x0000028FFB17C4C0>
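To make the difference between the three attributes concrete, here is a small sketch on a made-up snippet (parsed with html.parser so it runs without lxml). Note that .descendants walks text nodes as well as tags, so tags are filtered with isinstance.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<div><p>Hello <b>world</b></p><span>!</span></div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.div

# Direct children only: <p> and <span>
direct = [child.name for child in div.children]

# Every node below <div>, text nodes included, in document order
all_nodes = list(div.descendants)
descendant_tags = [n.name for n in all_nodes if isinstance(n, Tag)]

print(len(div.contents), direct)
print(len(all_nodes), descendant_tags)
```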

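There are other attributes as well, for example for finding the parent or ancestor nodes, but they are not recorded here (I rarely use them).

Searching the document tree

BeautifulSoup provides a set of query methods (find_all, find, etc.). Call the appropriate method with the query arguments and you get the content you want; it works much like a search engine (Baidu/Google = query method, search terms = query arguments, returned pages = the content you want).
The most commonly used method, find_all, is described below.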

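The find_all method

Purpose: find all elements that match the conditions and return them as a list
API: find_all(name, attrs, recursive, text, **kwargs) (since Beautiful Soup 4.4.0 the text argument is also available under the name string)
1. name
The name argument looks up elements by node name.
A. Passing a string
The simplest filter is a string. Pass a string to a search method and BeautifulSoup looks for tags whose name matches that string exactly. The example below finds all <p> tags in the document.

print(soup.find_all('p'))
# Writing it with the keyword is usually clearer
print(soup.find_all(name='p'))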

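B. Passing a regular expression
If a compiled regular expression is passed, Beautiful Soup filters tag names using the pattern's search() method. The example below finds all tags whose name starts with p.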

import re
print(soup.find_all(re.compile('^p')))

C. Passing a list
If a list is passed, BeautifulSoup returns everything that matches any element of the list. The code below finds the head tag and the b tag in the HTML.

print(soup.find_all(['head','b']))
# [<head><title>The Dormouse's story</title></head>, <b>The Dormouse's story</b>]
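The three kinds of name filter can be compared side by side. The sketch below uses a made-up document containing both p and pre so that the regular expression has more than one hit, and html.parser so it runs without lxml; neither is from the original notes.

```python
import re

from bs4 import BeautifulSoup

html = ('<html><head><title>T</title></head>'
        '<body><p>a</p><pre>b</pre><b>c</b></body></html>')
soup = BeautifulSoup(html, 'html.parser')

# A string matches the tag name exactly
by_string = [t.name for t in soup.find_all('p')]

# '^p' anchors the pattern at the start of the name
by_prefix = [t.name for t in soup.find_all(re.compile('^p'))]

# Without an anchor, any name containing "t" matches
by_search = [t.name for t in soup.find_all(re.compile('t'))]

# A list matches any of its elements
by_list = [t.name for t in soup.find_all(['head', 'b'])]

print(by_string, by_prefix, by_search, by_list)
```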

2. attrs
The attrs argument of find_all looks up elements by their attributes.
The argument is passed as a dictionary. For example, to find the node with id=link1:

print(soup.find_all(attrs={'id':'link1'}))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

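Common attributes do not need to go through attrs; they can be passed directly as keyword arguments, for example id and class_ (class is a Python keyword, so a trailing underscore is used), as follows: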

print(soup.find_all(id='link1'))
print(soup.find_all(class_='sister'))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
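The attrs dictionary is still needed in some cases: an attribute called name collides with the first parameter of find_all, and names such as data-id are not valid Python identifiers. A minimal sketch on a made-up snippet (the data-id attribute is an assumption for illustration; html.parser is used so it runs without lxml):

```python
from bs4 import BeautifulSoup

html = ('<p class="title main" name="dromouse" data-id="7">x</p>'
        '<a class="sister" id="link1">y</a>')
soup = BeautifulSoup(html, 'html.parser')

# class_ matches if the value is one of the tag's classes
by_class = soup.find_all(class_='title')

# These attribute names cannot be written as keyword arguments
by_name_attr = soup.find_all(attrs={'name': 'dromouse'})
by_data_attr = soup.find_all(attrs={'data-id': '7'})

# id works either way
by_id = soup.find_all(id='link1')

print(by_class, by_name_attr, by_data_attr, by_id)
```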

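3. text
The text argument searches the string content of the document. Like name, it accepts a string, a regular expression or a list. The code below finds the strings that contain "story" and returns those strings rather than the tags around them.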

print(soup.find_all(text=re.compile('story')))
# ["The Dormouse's story", "The Dormouse's story"]
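Since a text search returns strings, getting back to the enclosing tag takes one more step. The sketch below shows two ways to do it on a made-up snippet. It uses the keyword string, the newer name for text in Beautiful Soup 4.4.0 and later, and html.parser so it runs without lxml; both are my choices for the example.

```python
import re

from bs4 import BeautifulSoup

html = "<p><b>The Dormouse's story</b></p><p>no match</p>"
soup = BeautifulSoup(html, 'html.parser')

# Searching by string alone returns NavigableString objects
strings = soup.find_all(string=re.compile('story'))

# .parent goes from a string to the tag that contains it
parents = [s.parent.name for s in strings]

# Combining a tag name with string returns the tags themselves
tags = soup.find_all('b', string=re.compile('story'))

print(strings, parents, tags)
```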

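The find method

Differences between find and find_all:
find_all: finds all elements that match the conditions and returns a list.
find: finds only the first match and returns a single element (a Tag when searching by tag name or attributes).
Its query arguments are much the same as those of find_all. Example:

print(soup.find(name='p')) # find the first p tag
print(soup.find(text=re.compile('story'))) # find the first string that contains "story"

Output: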

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
The Dormouse's story
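One practical difference is the "nothing found" case: find returns None while find_all returns an empty list, so the result of find should be checked before use. A minimal sketch on a made-up snippet (html.parser is used so it runs without lxml):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>one</p><p>two</p>', 'html.parser')

first = soup.find('p')                 # first match only
missing = soup.find('table')           # no match: None
none_found = soup.find_all('table')    # no match: empty list
limited = soup.find_all('p', limit=1)  # stop after one match, still a list

# Guard against None before touching attributes of the result
label = missing.get_text() if missing is not None else 'no table'

print(first.string, missing, len(none_found), len(limited), label)
```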

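That is about it for using BeautifulSoup. Personally I find that being comfortable with find_all covers most everyday use (=.=~)

References

Cui Qingcai (崔慶才), [Python3網絡爬蟲開發實戰]: 4.2 Using Beautiful Soup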
