[Web Scraping] The BeautifulSoup Library

Date: 2019-11-17

Beautiful Soup basics

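Beautiful Soup is a library for parsing XML and HTML. HTML and XML documents are built mostly from paired tags, so Beautiful Soup is in effect a library for parsing, traversing, and maintaining a "tag tree". As long as the input is tag-based markup, Beautiful Soup can parse it well.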

Importing Beautiful Soup

from bs4 import BeautifulSoup

import bs4

An HTML document, its tag tree, and a BeautifulSoup object can be treated as equivalent: HTML document == tag tree == BeautifulSoup object.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<html>data</html>', 'html.parser')
>>> soup1 = BeautifulSoup(open(r'D:\demo.html', encoding='utf-8'), 'html.parser')  # assumes the file is saved as UTF-8

Put simply, one BeautifulSoup object corresponds to the entire content of one HTML document. Both soup and soup1 above each correspond to one HTML document.
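As a minimal sketch of the two ways of building a BeautifulSoup object shown above, the block below parses the same markup once from a string and once from a file object. The markup and the temporary file are assumptions made for illustration; the temporary file merely stands in for a path such as D:\demo.html.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical markup used only for this example.
html = "<html><head><title>Demo</title></head><body><p>data</p></body></html>"

# 1) Parse directly from a string.
soup = BeautifulSoup(html, "html.parser")

# 2) Parse from an open file object. The `with` block closes the file,
#    and the explicit encoding avoids depending on the platform default.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    with open(path, encoding="utf-8") as f:
        soup1 = BeautifulSoup(f, "html.parser")

print(soup.title.string)   # Demo
print(soup1.title.string)  # Demo
```

Both objects describe the same document, so either route can be used depending on where the HTML comes from.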

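Basic elements of the BeautifulSoup class

Tag: the most basic unit of information, opened with <> and closed with </>; obtained via soup.<tag>
Name: the name of a tag; obtained via <tag>.name
Attributes: the attributes of a tag, as a dictionary; obtained via <tag>.attrs
NavigableString: the string between a pair of tags; obtained via <tag>.string
Comment: the comment part of a string inside a tag; a special subclass of NavigableString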
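The five element types listed above can be inspected directly. The following is a small sketch on made-up markup (the tag names and the class value are assumptions), showing which Python type each access returns.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup: one tag holding a comment, one holding plain text.
html = '<p class="intro"><b><!--a comment--></b><i>plain text</i></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                 # Tag
comment = soup.b.string      # Comment (a subclass of NavigableString)
text = soup.i.string         # NavigableString

print(isinstance(tag, Tag))              # True
print(tag.name)                          # p
print(tag.attrs)                         # {'class': ['intro']}
print(isinstance(comment, Comment))      # True
print(type(text) is NavigableString)     # True
```

Because a Comment is also a NavigableString, code that must distinguish comments from ordinary text should check for Comment explicitly. Note also that class is a multi-valued attribute, so its value comes back as a list.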

soup.<tag name> returns that tag. When several tags share the same name, only the first one is returned.

>>> soup=BeautifulSoup(demo,'html.parser')  
>>> soup.title  
<title>四大美女</title>  
>>> soup.a  
<a href="http://www.baidu.com" target="_blank"><img alt="" heigth="200" src="picture/1.png" title="貂蟬" width="150"/></a>
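To make the "first match only" behaviour concrete, here is a short sketch on hypothetical markup containing two a tags. It also shows what attribute-style access returns when no such tag exists.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <a> tags in the same paragraph.
html = '<p><a href="/first">one</a><a href="/second">two</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a          # attribute access returns only the first match
same = soup.find("a")   # find() is the explicit equivalent
missing = soup.table    # there is no <table>, so this is None

print(first["href"])    # /first
print(first is same)    # True
print(missing)          # None
```

To collect every matching tag rather than the first one, find_all() is the usual choice.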

soup.<tag>.name returns the name of the tag as a string.

>>> soup.a.name        # name of the a tag
'a'
>>> soup.a.parent.name # name of the a tag's parent
'p'
>>> soup.a.parent.parent.name # name of the a tag's grandparent
'hr'

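soup.<tag>.attrs returns the attributes of the tag as a dictionary.

>>> soup.a.attrs      # attributes of the a tag, returned as a dict
{'href': 'http://www.baidu.com', 'target': '_blank'}
>>> tag = soup.a.parent
>>> tag.name         # name of the a tag's parent
'p'
>>> tag.attrs        # attributes of the p tag, returned as a dict
{'align': 'center'}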

Because the result is a dictionary, the usual dictionary methods can be used to extract information from it.

>>> for i, j in soup.a.attrs.items(): # loop over the dict
    print(i, j)
href http://www.baidu.com
target _blank
>>> soup.a.attrs['href']             # value for a given key
'http://www.baidu.com'
>>> soup.a.attrs.keys()              # all keys of the dict
dict_keys(['href', 'target'])
>>> soup.a.attrs.values()            # all values of the dict
dict_values(['http://www.baidu.com', '_blank'])
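Besides going through .attrs, a Tag can itself be indexed like a dictionary. The sketch below uses a hypothetical one-line document and shows that shortcut together with a safe lookup for attributes that may be absent.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the <a> tag used above.
html = '<a href="http://www.baidu.com" target="_blank">link</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

href = link["href"]                   # same as link.attrs["href"]
missing = link.get("id")              # None instead of a KeyError
has_target = link.has_attr("target")  # True

print(href, missing, has_target)
```

Using .get() is usually the safer option when scraping real pages, since not every tag carries every attribute.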

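soup.<tag>.string returns the text between the tag's opening and closing tags as a string.

>>> soup.title.string   # text inside the title tag
'四大美女'
>>> soup.a.string       # text inside the a tag; there is none, so None is returned and nothing is printed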
>>> soup.a  
<a href="http://www.baidu.com" target="_blank"><img alt="" heigth="200" src="picture/1.png" title="貂蟬" width="150"/></a>
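The reason .string can come back empty is that it only yields a value when the tag has a single string to offer. A minimal sketch on assumed markup, contrasting .string with get_text():

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <h1> has one string child, <p> has mixed children.
html = "<div><h1>Title</h1><p>Hello <b>world</b></p></div>"
soup = BeautifulSoup(html, "html.parser")

single = soup.h1.string     # one string child, so it is returned
mixed = soup.p.string       # text plus a <b> tag: ambiguous, so None
text = soup.p.get_text()    # concatenates all text below the tag

print(single)  # Title
print(mixed)   # None
print(text)    # Hello world
```

When the goal is simply "all the visible text under this tag", get_text() is typically the more forgiving choice.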

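Traversing HTML with bs4

Basic HTML structure

Downward traversal of the tag tree

Attributes and descriptions

.contents: a list of child nodes; all direct children of <tag> are stored in the list
.children: an iterator over child nodes, similar to .contents, used to loop over direct children
.descendants: an iterator over all descendant nodes, used to loop over every node below the tag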

>>> soup.head  
<head>  
<meta charset="utf-8">  
<title>四大美女</title>  
</meta></head>  
>>> soup.head.contents                # direct children of the head tag
['\n', <meta charset="utf-8">
<title>四大美女</title>
</meta>]
>>> len(soup.body.contents)          # number of direct children of the body tag
3
>>> for i in soup.body.children:     # loop over the direct children of body
    print(i)
>>> for i in soup.body.descendants:  # loop over all children and deeper descendants of body
    print(i)
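As the '\n' entry in the output above suggests, child lists contain whitespace strings as well as tags. The following sketch, on assumed markup, shows one way to keep only the tags while walking downward.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup; the newlines become whitespace-only string nodes.
html = "<body>\n<p>one</p>\n<div><span>two</span></div>\n</body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .contents counts the three newline strings and the two tags.
print(len(body.contents))  # 5

# Direct children only, keeping tags and skipping string nodes.
child_tags = [c.name for c in body.children if isinstance(c, Tag)]
print(child_tags)          # ['p', 'div']

# Every level below <body>, again keeping only tags.
descendant_tags = [d.name for d in body.descendants if isinstance(d, Tag)]
print(descendant_tags)     # ['p', 'div', 'span']
```

Filtering with isinstance(..., Tag) is one possible approach; which nodes to keep depends on whether the text nodes matter for the task.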

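Upward traversal of the tag tree

.parent: the parent tag of the current node
.parents: an iterator over the node's ancestor tags, used to loop over all ancestors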
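A short sketch of upward traversal on hypothetical nested markup. One detail worth knowing is that the last ancestor yielded by .parents is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with an <a> nested four levels deep.
html = "<html><body><div><p><a href='#'>x</a></p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

link = soup.a
print(link.parent.name)  # p

# Walk from the nearest ancestor up to the document root.
ancestors = [p.name for p in link.parents]
print(ancestors)         # ['p', 'div', 'body', 'html', '[document]']

# The root object has no parent.
print(soup.parent)       # None
```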

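Sibling traversal of the tag tree

.next_sibling: the next sibling node in HTML document order
.previous_sibling: the previous sibling node in HTML document order
.next_siblings: an iterator over all following sibling nodes in HTML document order
.previous_siblings: an iterator over all preceding sibling nodes in HTML document order

Siblings are nodes that share the same parent, and they are not necessarily tags: text between tags, including whitespace, is a sibling node too.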
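The sibling attributes can be tried out on a small example. The markup below is an assumption chosen so that a text node sits between two tags, which shows that a sibling may be a string rather than a tag.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup: <b>, a bare string, and <i> share the parent <p>.
html = "<p><b>one</b>text<i>two</i></p>"
soup = BeautifulSoup(html, "html.parser")

nxt = soup.b.next_sibling
print(isinstance(nxt, NavigableString), str(nxt))  # True text

# Iterate forward from <b> and backward from <i>.
following = [str(s) for s in soup.b.next_siblings]
preceding = [str(s) for s in soup.i.previous_siblings]
print(following)  # ['text', '<i>two</i>']
print(preceding)  # ['text', '<b>one</b>']

# find_next_sibling() skips ahead to the next sibling that is a matching tag.
next_tag = soup.b.find_next_sibling("i")
print(next_tag)   # <i>two</i>
```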

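In summary: downward traversal uses .contents, .children, and .descendants; upward traversal uses .parent and .parents; sibling traversal uses .next_sibling, .previous_sibling, .next_siblings, and .previous_siblings.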

The prettify() method in bs4: displaying HTML in a more readable form
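A minimal sketch of prettify() on an assumed one-line document. It returns a string in which each tag and each piece of text sits on its own line, indented by nesting depth.

```python
from bs4 import BeautifulSoup

# Hypothetical markup written on a single line.
html = "<html><body><p>hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

pretty = soup.prettify()   # returns a str; the soup itself is unchanged
print(pretty)

# prettify() is also available on individual tags.
print(soup.p.prettify())
```

The output is meant for reading and debugging; the added whitespace means it is not byte-for-byte the original markup.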

Extracting information

Beautiful Soup provides the <>.find_all() method, which returns a list-like result holding the matches.

In detail:

<>.find_all(name, attrs, recursive, string,**kwargs)

<tag>(...) == <tag>.find_all(...)

soup(...) == soup.find_all(...)

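name: search string matched against tag names
attrs: search string matched against tag attribute values; a specific attribute can be named
recursive: whether to search all descendants; defaults to True
string: search string matched against the text between <>...</>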

>>> soup.find_all('p', target='_blank')     # find p tags whose target attribute is _blank
[]
>>> soup.find_all(id='su')          # find tags whose id attribute is su
[<input class="bg s_btn" id="su" type="submit" value="百度一下"/>]
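To illustrate each find_all() parameter described above, here is a sketch on a small made-up document. The ids, the class name, and the example.com URLs are assumptions chosen only for the demonstration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: one <a> directly under <div>, one nested inside <p>.
html = """<div>
<a id="link1" class="ext" href="http://example.com/a">Alpha</a>
<p><a id="link2" href="http://example.com/b">Beta</a></p>
<b>Bold</b>
</div>"""
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")                          # name: one tag name
a_or_b = soup.find_all(["a", "b"])                      # name: several tag names
by_id = soup.find_all(id="link2")                       # keyword attribute filter
by_attrs = soup.find_all("a", attrs={"class": "ext"})   # attrs dictionary
direct = soup.div.find_all("a", recursive=False)        # direct children only
by_string = soup.find_all(string=re.compile("^B"))      # match text, not tags
shorthand = soup("a")                                   # same as soup.find_all("a")

print(len(all_links), len(a_or_b))        # 2 3
print([t["id"] for t in direct])          # ['link1']
print([str(s) for s in by_string])        # ['Beta', 'Bold']
```

With recursive=False the nested link2 is skipped because it is a grandchild of the div, and the string search returns the matching text nodes themselves rather than the tags that contain them.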
