Python Web Scraping (6): The BeautifulSoup Library

Date: 2021-08-13


Concepts

Installation:

Run pip install beautifulsoup4 on the command line.

Parsers supported by BeautifulSoup
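The parser is chosen by the second argument passed to BeautifulSoup. The following is a minimal sketch, assuming only the beautifulsoup4 package is installed; it uses Python's built-in 'html.parser' so that no extra package is needed, and the markup string is made up for illustration.

```python
from bs4 import BeautifulSoup

markup = "<p>Hello <b>world</b></p>"  # illustrative markup, not from the article

# 'html.parser' ships with Python, so it works without extra installs.
soup = BeautifulSoup(markup, "html.parser")
bold_text = soup.b.string
print(bold_text)

# Other parser names accepted in the same position include 'lxml', 'xml'
# and 'html5lib'; those need the lxml or html5lib package to be installed.
```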

Basic usage

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormousae's story</title></head>
<body>
<p class="title" name="drimouse"><b>The Dormousae's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/title" class="sister" id="link3">Tillie</a>;
and they lived at the boottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')  # parse with the lxml parser
print(soup.prettify())              # indented view of the parsed document
print(soup.title.string)            # text inside the title tag

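The html string above is not a complete HTML document: the body and html tags are never closed. soup = BeautifulSoup(html, 'lxml') initializes the BeautifulSoup object, and the parser repairs the incomplete markup during this step. soup.prettify() outputs the parsed string with standard indentation, and soup.title.string prints the text content of the title node.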
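To see the repair of unclosed tags in isolation, here is a small sketch. It assumes the built-in 'html.parser' instead of lxml (to avoid an extra dependency), and the broken string is a made-up example.

```python
from bs4 import BeautifulSoup

broken = "<html><body><p>Hello"  # three tags opened, none closed

soup = BeautifulSoup(broken, "html.parser")
repaired = str(soup)  # serializing the tree writes out the closing tags

print(repaired)
print(soup.prettify())  # same document, one node per line with indentation
```

Different parsers repair markup in slightly different ways, so the exact output can vary when another parser is used.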

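Tag selectors

Selecting elements:

# html is the same as above
soup = BeautifulSoup(html, 'lxml')
print(soup.title)        # prints the title tag together with its content
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(soup.head)         # prints the head tag together with its content
print(soup.p)            # prints only the first p node and its content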

Getting the name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)
# prints the name of the node: title

Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs)          # {'class': ['title'], 'name': 'drimouse'}
print(soup.p.attrs['name'])  # drimouse
print(soup.p['name'])        # drimouse
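Two details of attribute access are easy to miss: class comes back as a list, and indexing a missing attribute raises KeyError. The sketch below illustrates both with a made-up tag, assuming the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

# Illustrative markup; the attribute values are invented for this example.
soup = BeautifulSoup('<p class="title big" name="intro">Hi</p>', "html.parser")
p = soup.p

classes = p["class"]    # class is multi-valued, so this is a list
name = p["name"]        # ordinary attributes come back as plain strings
missing = p.get("id")   # .get() returns None instead of raising KeyError

print(classes, name, missing)
```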

Getting content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)  # The Dormousae's story
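Note that .string only works when a tag has exactly one child. The following sketch, assuming the built-in 'html.parser' and made-up markup, contrasts it with get_text() and shows that a tag containing only a comment yields a Comment object.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<p>Hello <b>world</b></p><a><!--Elsie--></a>"
soup = BeautifulSoup(markup, "html.parser")

mixed = soup.p.string          # None: p has two children (text and <b>)
full_text = soup.p.get_text()  # concatenates all text below p

link_string = soup.a.string    # the only child of <a> is a comment
is_comment = isinstance(link_string, Comment)

print(mixed, repr(full_text), repr(link_string), is_comment)
```

Checking for Comment is one way to avoid treating commented-out text as visible page content.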

Nested selection:

print(soup.title.string)
print(soup.head.title.string)
print(soup.head.title)
print(type(soup.head.title))
print(type(soup.head.title.string))
# Output, in order:
# The Dormousae's story
# The Dormousae's story
# <title>The Dormousae's story</title>
# <class 'bs4.element.Tag'>
# <class 'bs4.element.NavigableString'>

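Relational selection:

Sometimes the target node cannot be selected in a single step. In that case, select some node first and then use it as a base to reach its children, parent, siblings, and so on.
(1) Children and descendants: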

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # list of direct children
# [<b>The Dormousae's story</b>]

Method 2:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)  # an iterator rather than a list
for i, child in enumerate(soup.p.children):
    print(i, child)

Output:
<list_iterator object at 0x000001BABACB9EF0>
0 <b>The Dormousae's story</b>

Descendant nodes:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)  # a generator over all descendant nodes
for i, child in enumerate(soup.p.descendants):
    print(i, child)
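The difference between children and descendants is easiest to see by counting nodes. This is a minimal sketch with made-up markup, assuming the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Hi <i>there</i></b></p>", "html.parser")

direct = soup.p.contents            # only the direct child: the <b> tag
nested = list(soup.p.descendants)   # <b>, 'Hi ', <i>, 'there' in document order

print(len(direct), len(nested))
for node in nested:
    print(repr(node))
```

Text nodes count as descendants too, which is why the second list has four entries rather than two.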

(2) Parent and ancestor nodes

soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)                    # the direct parent node
print(soup.a.parents)                   # returns an iterator (a generator)
print(list(enumerate(soup.a.parents)))  # all ancestor nodes
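Printing whole ancestor nodes produces a lot of output, so it can help to print only their names. A small sketch, assuming the built-in 'html.parser' and a made-up document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><a>x</a></p></body></html>", "html.parser")

# Walk upward from the first <a> tag and collect each ancestor's tag name.
ancestor_names = [tag.name for tag in soup.a.parents]
print(ancestor_names)
```

The last entry is '[document]', the name of the BeautifulSoup object itself, which sits at the top of the tree.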

(3) Sibling nodes:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))      # siblings after the first a tag
print(list(enumerate(soup.a.previous_siblings)))  # siblings before the first a tag

Output:
[(0, ',\n'), (1, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>), (2, ' and\n'), (3, <a href="http://example.com/title" class="sister" id="link3">Tillie</a>), (4, ';\nand they lived at the boottom of a well.')]

[(0, 'Once upon a time there were three little sisters;and their names were\n')]
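As the output shows, siblings include plain text nodes, not only tags. The sketch below, with made-up markup and the built-in 'html.parser', shows two ways to skip the text: find_next_sibling() with a tag name, or filtering by type.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

markup = '<p><a id="link1">A</a>, <a id="link2">B</a></p>'
soup = BeautifulSoup(markup, "html.parser")
first = soup.a

raw_next = first.next_sibling             # the text node ', '
next_link = first.find_next_sibling("a")  # skips text, returns the next <a>
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]

print(repr(raw_next), next_link["id"], len(tag_siblings))
```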

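Method selectors:

Everything above selects nodes through attributes. That approach is fast, but it becomes cumbersome and inflexible for more complex selections, so BeautifulSoup also provides the find_all() and find() methods.

find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or text content.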

html = '''
<div class="panel">
<div class="panel-heading"><h4>Hello</h4></div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))           # list of every ul tag
print(type(soup.find_all('ul')[0]))  # each item is a bs4.element.Tag
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))         # find_all can be called again on a Tag

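Output: a list containing both ul tags, then <class 'bs4.element.Tag'>, then the list of li tags inside each ul.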
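find_all() also accepts limit and recursive arguments, which the signature above only hints at. A minimal sketch, assuming the built-in 'html.parser' and a shortened version of the list markup:

```python
from bs4 import BeautifulSoup

markup = '<div><ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul></div>'
soup = BeautifulSoup(markup, "html.parser")

# limit stops the search after the given number of matches.
first_two = [li.get_text() for li in soup.find_all("li", limit=2)]

# recursive=False looks only at direct children of the tag it is called on;
# the li tags are grandchildren of div, so nothing matches here.
direct_lis = soup.div.find_all("li", recursive=False)

print(first_two, direct_lis)
```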

The attrs parameter:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))  # [] here: no tag in this html has a name attribute

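Common attributes such as id and class can also be passed directly as keyword arguments:

print(soup.find_all(id='list-1'))       # same result as attrs={'id': 'list-1'}
print(soup.find_all(class_='element'))  # class is a Python keyword, so the argument is spelled class_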
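When a tag carries several classes, a class search matches if any one of them equals the given value. The sketch below demonstrates this with a reduced version of the two ul tags, assuming the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

markup = (
    '<ul class="list" id="list-1"></ul>'
    '<ul class="list list-small" id="list-2"></ul>'
)
soup = BeautifulSoup(markup, "html.parser")

# 'list' is one of the classes on both tags, so both match.
ids_with_list = [ul["id"] for ul in soup.find_all(class_="list")]

# The attrs dictionary form behaves the same way for class.
ids_with_small = [ul["id"] for ul in soup.find_all(attrs={"class": "list-small"})]

print(ids_with_list, ids_with_small)
```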

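The text parameter

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))  # matches node text and returns the strings: ['Foo', 'Foo']
# Since Beautiful Soup 4.4.0 this argument is named string; text is the older spelling.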
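Text matching is not limited to exact strings; a compiled regular expression also works. This is a small sketch using the newer string argument (Beautiful Soup 4.4.0 or later), the built-in 'html.parser', and made-up list items.

```python
import re

from bs4 import BeautifulSoup

markup = "<ul><li>Foo</li><li>Bar</li><li>Baz</li></ul>"
soup = BeautifulSoup(markup, "html.parser")

# Return every text node whose content starts with 'B'.
b_strings = soup.find_all(string=re.compile(r"^B"))
b_texts = [str(s) for s in b_strings]

# Each result is a text node; .parent leads back to the enclosing tag.
parent_names = [s.parent.name for s in b_strings]

print(b_texts, parent_names)
```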

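The find method

find(name, attrs, recursive, text, **kwargs)
find returns the first matching element, while find_all returns a list of all matching elements.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))        # the first ul tag only
print(type(soup.find('ul')))  # <class 'bs4.element.Tag'>
print(soup.find('page'))      # None, because no page tag exists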
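Because find() returns None when nothing matches, chaining a call onto the result can raise AttributeError. One possible guard is sketched below; first_text is a hypothetical helper written for this example, not part of BeautifulSoup, and the parser is assumed to be the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup


def first_text(soup, name, default=""):
    """Return the text of the first tag called name, or default if absent."""
    tag = soup.find(name)
    return tag.get_text() if tag is not None else default


soup = BeautifulSoup("<ul><li>Foo</li></ul>", "html.parser")

found = first_text(soup, "li")       # the tag exists, so its text is returned
not_found = first_text(soup, "page") # no such tag, so the default is returned

print(repr(found), repr(not_found))
```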

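CSS selectors

Pass a CSS selector directly to select() to make the selection.
(1) Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])        # list-1, then list-2
    print(ul.attrs['id'])  # same value through the attrs dictionary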

(2) Getting content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())  # text of each li tag
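select() accepts id, class, and descendant selectors as well as bare tag names, and select_one() returns only the first match. The sketch below assumes the built-in 'html.parser' and a shortened version of the list markup; selector support comes from the soupsieve package that is installed together with recent beautifulsoup4 releases.

```python
from bs4 import BeautifulSoup

markup = (
    '<ul class="list" id="list-1">'
    '<li class="element">Foo</li><li class="element">Bar</li></ul>'
    '<ul class="list list-small" id="list-2">'
    '<li class="element">Jay</li></ul>'
)
soup = BeautifulSoup(markup, "html.parser")

# Descendant selector: .element nodes inside the tag with id list-2.
in_second = [li.get_text() for li in soup.select("#list-2 .element")]

# select_one returns the first match (or None) instead of a list.
first_li = soup.select_one("ul li")

# Tag plus class selector.
small_lists = soup.select("ul.list-small")

print(in_second, first_li.get_text(), len(small_lists))
```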

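Summary:

Prefer the lxml parser, and fall back to html.parser when necessary.
Tag selectors have weak filtering capabilities but are fast.
Use find() and find_all() to query for a single result or for multiple results.
If you are familiar with CSS selectors, use select().
Remember the common ways of getting attribute values and text.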
