Using BeautifulSoup

Date: 2019-11-12

Parsers

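Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in Python versions before 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires the external lxml package (a C extension)
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires the external lxml package (a C extension)
html5lib	BeautifulSoup(markup, "html5lib")	Best fault tolerance; parses documents the way a browser does; produces valid HTML5	Very slow; requires the external html5lib package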
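Because lxml and html5lib are optional installs, it can help to fall back to the built-in parser when they are missing. The following is a minimal sketch, not part of the original article: make_soup and its preferred argument are hypothetical names, and it relies on Beautiful Soup raising FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Try each parser name in order; return (soup, parser_name_used).

    make_soup is a hypothetical helper for illustration only.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # The parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Deliberately unclosed tags: both parsers repair them.
soup, used = make_soup("<p>Hello<b>world")
print(used)
print(soup.b.get_text())
```

Which parser name is printed depends on what is installed in the environment.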

Basic usage

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)

Tag selectors

Selecting elements

print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)

Getting the name

print(soup.title.name)

title

Getting attributes

print(soup.p.attrs['class'])
print(soup.p['class'])

['story']
['story']
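Attribute access has two details worth seeing in isolation: multi-valued attributes such as class come back as a list, and indexing a missing attribute raises KeyError while get() returns None. A small sketch on a made-up one-line document (the markup here is an assumption, not the article's sample):

```python
from bs4 import BeautifulSoup

# Hypothetical markup chosen to show single- and multi-valued attributes.
doc = '<p class="story intro" id="first">text</p>'
tag = BeautifulSoup(doc, "html.parser").p

classes = tag["class"]        # multi-valued attribute: a list of strings
tag_id = tag["id"]            # ordinary attribute: a plain string
missing = tag.get("name")     # absent attribute: None instead of KeyError

print(classes, tag_id, missing)
```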

Getting content

print(soup.title.string)

The Dormouse's story
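The string attribute only returns a value when a tag has a single child; for a tag with mixed content it is None, and get_text() is the usual alternative. A short sketch on an assumed snippet:

```python
from bs4 import BeautifulSoup

# Assumed markup: <p> holds a text node plus a <b> child.
soup = BeautifulSoup("<p>one <b>two</b></p>", "html.parser")

inner = soup.b.string       # single text child
outer = soup.p.string       # several children, so no single string
joined = soup.p.get_text()  # concatenates every text node under <p>

print(repr(inner), repr(outer), repr(joined))
```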

Nested selection

print(soup.head.title.string)

The Dormouse's story

Children and descendants

print(soup.p.contents)

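['\n Once upon a time there were three little sisters; and their names were\n ', <a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>, '\n', <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>, ' \n and\n ', <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']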

print(soup.p.children)  # an iterator, not a list
for i,child in enumerate(soup.p.children):
    print(i,child)
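The difference between children (direct children only) and descendants (everything nested below, depth first) is easiest to see on a tiny tree. A sketch with assumed markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed markup: <div> has two direct children, one of which nests a <b>.
soup = BeautifulSoup("<div><p>a<b>b</b></p><i>c</i></div>", "html.parser")
div = soup.div

# Direct children only.
child_names = [child.name for child in div.children]

# All nested nodes; keep only tags, skipping the text nodes.
descendant_tags = [node.name for node in div.descendants if isinstance(node, Tag)]

print(child_names)
print(descendant_tags)
```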

Parent and ancestor nodes

print(soup.a.parent)

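<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>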

print(list(enumerate(soup.a.parents)))
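Printing every ancestor dumps large chunks of the document, so listing only the tag names is often easier to read. A sketch on an assumed minimal document; note that the last ancestor is the BeautifulSoup object itself, whose name is "[document]":

```python
from bs4 import BeautifulSoup

# Assumed markup, parsed with html.parser so no wrapper tags are added.
soup = BeautifulSoup("<html><body><p><a>x</a></p></body></html>", "html.parser")

parent_name = soup.a.parent.name
ancestor_names = [node.name for node in soup.a.parents]

print(parent_name)
print(ancestor_names)
```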

Sibling nodes

print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))

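[(0, '\n'), (1, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>), (2, ' \n and\n '), (3, <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>), (4, '\n and they lived at the bottom of a well.\n ')]
[(0, '\n Once upon a time there were three little sisters; and their names were\n ')]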
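As the output shows, siblings include whitespace text nodes, so next_sibling is frequently a string rather than the next tag. A sketch, on assumed markup, of skipping the text node with find_next_sibling():

```python
from bs4 import BeautifulSoup

# Assumed markup: a newline text node sits between the two links.
soup = BeautifulSoup('<p><a id="a1">1</a>\n<a id="a2">2</a></p>', "html.parser")
first = soup.a

raw_next = first.next_sibling             # the newline text node
next_tag = first.find_next_sibling("a")   # the next <a> tag

print(repr(raw_next), next_tag["id"])
```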

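Standard selectors

find_all(name, attrs, recursive, text, **kwargs)
Searches the document by tag name, attributes, or text content. Since Beautiful Soup 4.4.0 the text argument is also available under the name string.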

name

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))

for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

attrs

print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))  # [] here: no tag in this HTML has name="elements"

print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))
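Searching by class matches any one of a tag's classes, which matters for tags like the second list with class="list list-small". A sketch on a cut-down version of the sample HTML (the shortened markup is an assumption):

```python
from bs4 import BeautifulSoup

# Shortened, assumed version of the two <ul> tags from the sample.
html = (
    '<ul class="list" id="list-1"></ul>'
    '<ul class="list list-small" id="list-2"></ul>'
)
soup = BeautifulSoup(html, "html.parser")

# class_ is used because "class" is a reserved word in Python.
all_lists = [ul["id"] for ul in soup.find_all(class_="list")]
small_lists = [ul["id"] for ul in soup.find_all(class_="list-small")]

# Several attributes can be combined in one attrs dict.
combined = [ul["id"] for ul in soup.find_all(attrs={"class": "list", "id": "list-2"})]

print(all_lists, small_lists, combined)
```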

text

print(soup.find_all(text='Foo'))
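An exact text match returns the matching strings, not their tags. The sketch below, on assumed markup, uses the string spelling of the argument (Beautiful Soup 4.4.0 or later) and shows a regular expression and a combined tag-plus-text search:

```python
import re

from bs4 import BeautifulSoup

# Assumed markup with one near-duplicate entry.
soup = BeautifulSoup("<ul><li>Foo</li><li>Bar</li><li>Food</li></ul>", "html.parser")

exact = soup.find_all(string="Foo")                  # exact match only
pattern = soup.find_all(string=re.compile("^Foo"))   # regex match
tags = soup.find_all("li", string="Bar")             # returns tags, not strings

print(exact, pattern, tags)
```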

find(name, attrs, recursive, text, **kwargs)
find() returns a single element (the first match, or None if nothing matches); find_all() returns all matching elements.

print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))  # None: the document has no <page> tag
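Since find() returns None when nothing matches, chaining an attribute access onto it can raise AttributeError. One possible guard is sketched below; first_text is a hypothetical helper name, and the markup is assumed:

```python
from bs4 import BeautifulSoup


def first_text(soup, name, default=None):
    """Return the stripped text of the first <name> tag, or default if absent.

    first_text is a hypothetical helper for illustration only.
    """
    tag = soup.find(name)
    if tag is None:
        return default
    return tag.get_text(strip=True)


soup = BeautifulSoup("<div><h4> Hello </h4></div>", "html.parser")

found = first_text(soup, "h4")
missing = first_text(soup, "page", default="n/a")

# find_all(..., limit=1) returns a list with at most one element.
limited = soup.find_all("div", limit=1)

print(found, missing, len(limited))
```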

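find_parents() and find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.

find_next_siblings() and find_next_sibling()
find_next_siblings() returns all following sibling nodes; find_next_sibling() returns the first following sibling node.

find_previous_siblings() and find_previous_sibling()
find_previous_siblings() returns all preceding sibling nodes; find_previous_sibling() returns the first preceding sibling node.

find_all_next() and find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node.

find_all_previous() and find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node.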
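The methods above can be tried side by side from a single starting node. This is a sketch on assumed markup, starting from the middle list item:

```python
from bs4 import BeautifulSoup

# Assumed markup: a three-item list inside a wrapper <div>.
html = '<div id="outer"><ul id="menu"><li>A</li><li>B</li><li>C</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")
b = soup.find("li", string="B")

parent_id = b.find_parent("ul")["id"]                  # nearest matching ancestor
ancestor_ids = [t["id"] for t in b.find_parents("div")]
next_text = b.find_next_sibling("li").get_text()       # first sibling after b
prev_text = b.find_previous_sibling("li").get_text()   # first sibling before b
after = [li.get_text() for li in b.find_all_next("li")]
before = [t.name for t in b.find_all_previous("ul")]

print(parent_id, ancestor_ids, next_text, prev_text, after, before)
```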

CSS selectors

Pass a CSS selector directly to select() to make a selection.

print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>

for ul in soup.select('ul'):
    print(ul.select('li'))

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
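Beyond select(), select_one() returns only the first match (or None), and the selector string can use child combinators and attribute selectors. A sketch on assumed markup; in recent Beautiful Soup releases CSS selection is provided by the soupsieve package, which is normally installed alongside bs4:

```python
from bs4 import BeautifulSoup

# Assumed markup, loosely modelled on the two lists in the sample.
html = """
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Baz</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("#list-1 li").get_text()                     # first match only
small = [li.get_text() for li in soup.select("ul.list-small > li")]  # child combinator
by_attr = [ul["id"] for ul in soup.select('ul[id^="list-"]')]        # attribute prefix
nothing = soup.select_one(".missing")                                # no match

print(first, small, by_attr, nothing)
```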

Getting attributes

for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

list-1
list-1
list-2
list-2

Getting text

for li in soup.select('li'):
    print(li.get_text())

Foo
Bar
Jay
Foo
Bar
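get_text() also accepts a separator and a strip flag, which helps when a tag mixes text with nested tags and stray whitespace. A sketch on an assumed list item:

```python
from bs4 import BeautifulSoup

# Assumed markup with padding whitespace and a nested tag.
li = BeautifulSoup("<li> Foo <b>Bar</b> </li>", "html.parser").li

raw = li.get_text()                    # text nodes joined as-is
stripped = li.get_text(strip=True)     # each piece stripped, joined with ""
spaced = li.get_text(" ", strip=True)  # each piece stripped, joined with " "

print(repr(raw), repr(stripped), repr(spaced))
```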

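Summary

lxml is the recommended parser; fall back to html.parser when necessary.
Tag selectors (soup.title, soup.p) offer weak filtering but are fast.
Use find() and find_all() to match a single result or multiple results.
If you are comfortable with CSS selectors, use select().
Remember how to get attribute values (tag['id'], tag.attrs['id']) and text (tag.string, tag.get_text()).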

Reference: https://cuiqingcai.com/5548.html
