Using BeautifulSoup

Parsers

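| Parser | Usage | Advantages | Disadvantages |
|---|---|---|---|
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance for malformed documents in Python versions before 2.7.3 / 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires a C library to be installed |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires a C library to be installed |
| html5lib | BeautifulSoup(markup, "html5lib") | Best error tolerance; parses documents the way a browser does; produces valid HTML5 | Slow; requires installing the html5lib package (pure Python, no C extension) |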
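
Since lxml is an optional install, a script can fall back to the built-in parser when lxml is missing. The following is a minimal sketch of that idea; the helper name make_soup and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when available, else fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; the standard-library parser needs no extras.
        return BeautifulSoup(markup, "html.parser")


# Deliberately unclosed tags: both parsers repair them.
soup = make_soup("<p>Hello<b>world")
print(soup.p.get_text())  # Helloworld
```

Different parsers can build slightly different trees from broken markup, so it is worth pinning one parser once a script depends on the exact structure.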

Basic usage

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)

Tag selectors

Selecting elements

print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)

Getting the tag name

print(soup.title.name)

title

Getting attributes

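# Note: the sample HTML above has no name attribute on its <p> tags, so these
# lookups raise KeyError as written. The output below assumes a first paragraph
# that carries name="dromouse".
print(soup.p.attrs['name'])
print(soup.p['name'])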

dromouse
dromouse
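
The two lookups above are equivalent, and both raise KeyError when the attribute is absent. A small sketch of the alternatives follows; the snippet is hypothetical (it includes a name attribute so the lookup succeeds) and html.parser is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not the sample document used elsewhere in this article.
snippet = '<p class="title big" name="dromouse">The Dormouse\'s story</p>'
soup = BeautifulSoup(snippet, "html.parser")

p = soup.p
name_value = p["name"]   # same as p.attrs["name"]; raises KeyError if absent
missing = p.get("id")    # .get() returns None instead of raising
classes = p["class"]     # class is multi-valued, so it comes back as a list

print(name_value, missing, classes)  # dromouse None ['title', 'big']
```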

Getting content

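# Note: with the sample HTML above this prints None, because the first <p> has
# several children. The output below assumes a first paragraph whose only
# content is the title text.
print(soup.p.string)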

The Dormouse's story
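
The .string attribute only yields a value when a tag has a single string inside it (directly or through a single nested tag); otherwise it is None. A short sketch of the difference, using a made-up snippet and html.parser as assumptions:

```python
from bs4 import BeautifulSoup

snippet = "<div><p>Only text</p><p>Hello <b>world</b></p></div>"
soup = BeautifulSoup(snippet, "html.parser")
first, second = soup.find_all("p")

single = first.string        # 'Only text': the tag holds exactly one string
mixed = second.string        # None: the tag holds a string plus a <b> tag
joined = second.get_text()   # 'Hello world': concatenates all nested strings

print(repr(single), repr(mixed), repr(joined))
```

When the structure of a tag is not known in advance, get_text() is the safer choice.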

Nested selection

print(soup.head.title.string)

The Dormouse's story

Children and descendants

print(soup.p.contents)

['\n Once upon a time there were three little sisters; and their names were\n ', <a ... id="link1"> <span>Elsie</span> </a>, '\n', <a ... id="link2">Lacie</a>, ' \n and\n ', <a ... id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']

print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(i,child)
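
The heading above also mentions descendants, which walk the whole subtree rather than only the direct children. A minimal sketch of the contrast, on a shortened made-up snippet and with html.parser as an assumption:

```python
from bs4 import BeautifulSoup, Tag

snippet = (
    '<p>Names: <a id="link1"><span>Elsie</span></a>'
    ' and <a id="link2">Lacie</a></p>'
)
soup = BeautifulSoup(snippet, "html.parser")

# .children yields direct children only (tags and strings alike).
child_tags = [c.name for c in soup.p.children if isinstance(c, Tag)]

# .descendants recurses, so the nested <span> shows up as well.
descendant_tags = [d.name for d in soup.p.descendants if isinstance(d, Tag)]

print(child_tags)       # ['a', 'a']
print(descendant_tags)  # ['a', 'span', 'a']
```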

Parent and ancestor nodes

print(soup.a.parent)

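<p class="story"> Once upon a time there were three little sisters; and their names were <a ... id="link1"> <span>Elsie</span> </a> <a ... id="link2">Lacie</a> and <a ... id="link3">Tillie</a> and they lived at the bottom of a well. </p>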

print(list(enumerate(soup.a.parents)))

Sibling nodes

print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))

[(0, '\n'), (1, <a ... id="link2">Lacie</a>), (2, ' \n and\n '), (3, <a ... id="link3">Tillie</a>), (4, '\n and they lived at the bottom of a well.\n ')]
[(0, '\n Once upon a time there were three little sisters; and their names were\n ')]
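
As the output shows, sibling iteration also yields the whitespace strings between tags. A small sketch of collecting only tag siblings, and of listing ancestor names, follows; the snippet is a shortened stand-in for the sample document and html.parser is an assumption.

```python
from bs4 import BeautifulSoup, Tag

snippet = (
    '<html><body><p>One <a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p></body></html>'
)
soup = BeautifulSoup(snippet, "html.parser")
first_link = soup.a

# .parents walks upward and ends at the BeautifulSoup object itself,
# whose name is the special value '[document]'.
ancestor_names = [parent.name for parent in first_link.parents]

# Keep only Tag siblings, skipping the text nodes in between.
later_links = [s["id"] for s in first_link.next_siblings if isinstance(s, Tag)]

print(ancestor_names)  # ['p', 'body', 'html', '[document]']
print(later_links)     # ['link2', 'link3']
```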

Standard selectors

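find_all(name, attrs, recursive, text, **kwargs)
Searches the document by tag name, attributes, or text content. In Beautiful Soup 4.4.0 and later the text argument is also available under the name string.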

name

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

attrs

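print(soup.find_all(attrs={'id': 'list-1'}))
# Returns [] for the HTML above, which has no name attributes.
print(soup.find_all(attrs={'name': 'elements'}))
# Common attributes can be passed as keyword arguments;
# class is a Python keyword, so use class_ instead.
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))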

text

print(soup.find_all(text='Foo'))
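
Text matching can also take a regular expression, be combined with a tag name, or be capped with limit. The sketch below assumes Beautiful Soup 4.4.0 or later, where the argument is named string; the snippet is a shortened copy of the list above and html.parser is an assumption.

```python
import re

from bs4 import BeautifulSoup

snippet = """
<ul id="list-1">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
  <li class="element">Jay</li>
</ul>
"""
soup = BeautifulSoup(snippet, "html.parser")

# With only string=..., the matching strings themselves are returned.
exact = soup.find_all(string="Foo")
pattern = soup.find_all(string=re.compile(r"^(Foo|Bar)$"))

# With a tag name as well, the tags whose .string matches are returned.
tags = soup.find_all("li", string="Bar")

# limit stops the search after the given number of matches.
first_two = [li.string for li in soup.find_all("li", limit=2)]

print(exact, pattern, tags, first_two)
```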

find(name, attrs, recursive, text, **kwargs)
find() returns the first matching element (or None if nothing matches); find_all() returns a list of all matching elements.

print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))
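
Because find() returns None when nothing matches, chaining an attribute access onto the result can raise AttributeError. A minimal sketch of a guard, using a made-up snippet and html.parser as assumptions:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Foo</li></ul>", "html.parser")

missing = soup.find("page")   # no <page> tag exists, so this is None
found = soup.find("li")

# Check for None before touching attributes or text.
missing_label = missing.get_text() if missing is not None else "(not found)"
found_label = found.get_text() if found is not None else "(not found)"

print(missing_label, found_label)  # (not found) Foo
```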

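find_parents() and find_parent()
find_parents() returns all matching ancestor nodes; find_parent() returns the closest matching ancestor (the direct parent when no filter is given).

find_next_siblings() and find_next_sibling()
find_next_siblings() returns all matching siblings after the node; find_next_sibling() returns the first matching sibling after it.

find_previous_siblings() and find_previous_sibling()
find_previous_siblings() returns all matching siblings before the node; find_previous_sibling() returns the first matching sibling before it.

find_all_next() and find_next()
find_all_next() returns all matching nodes after the node; find_next() returns the first matching node after it.

find_all_previous() and find_previous()
find_all_previous() returns all matching nodes before the node; find_previous() returns the first matching node before it.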
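
The relationship-based methods above all take the same filters as find_all(). A short sketch that starts from one list item and looks in each direction; the snippet is a reduced version of the panel HTML, html.parser is an assumption, and string= requires Beautiful Soup 4.4.0 or later.

```python
from bs4 import BeautifulSoup

snippet = """
<div class="panel-body">
  <ul id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
  </ul>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")
bar = soup.find("li", string="Bar")   # start from the middle item

parent_id = bar.find_parent("ul")["id"]
ancestors = [t.name for t in bar.find_parents(["ul", "div"])]  # nearest first
next_item = bar.find_next_sibling("li").string
previous_item = bar.find_previous_sibling("li").string
following = [t.string for t in bar.find_all_next("li")]
preceding = [t.string for t in bar.find_all_previous("li")]

print(parent_id, ancestors, next_item, previous_item, following, preceding)
```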

CSS selectors

Pass a CSS selector string directly to select() to make a selection.

print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>

for ul in soup.select('ul'):
    print(ul.select('li'))

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

list-1
list-1
list-2
list-2
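
select() always returns a list. When only the first match is needed, select_one() returns a single tag, or None if nothing matches. A brief sketch with a made-up snippet and html.parser as assumptions:

```python
from bs4 import BeautifulSoup

snippet = """
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Baz</li></ul>
"""
soup = BeautifulSoup(snippet, "html.parser")

first_item = soup.select_one("#list-1 li")        # first match only
small_items = soup.select("ul.list-small > li")   # child combinator
nothing = soup.select_one("#no-such-id")          # None when nothing matches

print(first_item.get_text(), [li.get_text() for li in small_items], nothing)
```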

Getting text content

for li in soup.select('li'):
    print(li.get_text())

Foo
Bar
Jay
Foo
Bar

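Summary

  • Prefer the lxml parser; fall back to html.parser when necessary.
  • Tag selectors (soup.title, soup.p) offer only weak filtering, but they are fast.
  • Use find() to match a single result and find_all() to match multiple results.
  • If you are comfortable with CSS selectors, use select().
  • Remember the common ways to get attribute values (tag['attr'], tag.attrs) and text (string, get_text()).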

Source: https://cuiqingcai.com/5548.html
