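Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires installing a C library |
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires installing a C library |
html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses documents the way a browser does; produces valid HTML5 | Slow; requires installing an external Python package |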
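Since lxml and html5lib are optional installs, one way to act on the table above is to try the parsers in order of preference and fall back to the built-in one. The following is a minimal sketch, not a recommended API: the helper name `make_soup` and the preference order are assumptions made for illustration. It relies on Beautiful Soup raising `FeatureNotFound` when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first usable parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


# Deliberately broken markup: the <b> and <p> tags are never closed.
soup, used = make_soup("<p>Hello<b>world")
print(used)           # name of the parser that was actually used
print(soup.b.string)  # the parser closed the <b> tag for us
```

Because "html.parser" ships with Python, the loop always ends with a working parser; only the speed and the exact repair strategy differ.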
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# prettify() re-indents the parsed tree and adds the missing closing tags
print(soup.prettify())
print(soup.title.string)
Selecting elements
print(soup.title)
print(type(soup.title))
print(soup.head)
# When several tags share a name, attribute access returns only the first one
print(soup.p)
Getting the name
print(soup.title.name)
title
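Getting attributes
# The output below assumes the first <p> carries a name attribute, e.g. <p class="title" name="dromouse">.
# The sample HTML above has none, so with it both lines raise KeyError; soup.p.get('name') returns None instead.
print(soup.p.attrs['name'])
print(soup.p['name'])
dromouse
dromouse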
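Getting content
# .string returns text only when the tag has a single child; for the multi-child <p class="story"> above it returns None.
# The output below assumes a first <p> that contains nothing but the title text.
print(soup.p.string)
The Dormouse's story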
Nested selection
print(soup.head.title.string)
The Dormouse's story
Children and descendants
# .contents returns the direct children as a list
print(soup.p.contents)
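['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']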
# .children yields the same direct children, but as an iterator rather than a list
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
Parent and ancestor nodes
# .parent is the direct parent; .parents iterates over all ancestors
print(soup.a.parent)
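<p class="story">
 Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
 and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
 and they lived at the bottom of a well.
 </p>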
print(list(enumerate(soup.a.parents)))
Sibling nodes
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))
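[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n and\n '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n and they lived at the bottom of a well.\n ')]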
[(0, '\n Once upon a time there were three little sisters; and their names were\n ')]
name
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# find_all() returns a list-like result of every matching Tag
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
# Each result is a Tag, so find_all() can be called on it again
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
attrs
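print(soup.find_all(attrs={'id': 'list-1'}))
# Prints [] for the HTML above, because no tag has name="elements"
print(soup.find_all(attrs={'name': 'elements'}))
# Common attributes can be passed as keyword arguments; class needs a trailing underscore because it is a Python keyword
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))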
text
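# Matches text nodes rather than tags; newer Beautiful Soup versions name this argument string
print(soup.find_all(text='Foo'))
# find() returns the first match only, or None when nothing matches
print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))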
find_parents() find_parent()
find_parents() returns all matching ancestor nodes; find_parent() returns the nearest one, which is the direct parent when no filter is given.
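To make the difference concrete, here is a small sketch on a made-up fragment. The markup and the `outer`/`inner` ids are assumptions chosen for illustration, and "html.parser" is used only so the example runs without extra installs.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two nested <div> elements around a list.
doc = '<div id="outer"><div id="inner"><ul><li>Foo</li></ul></div></div>'
soup = BeautifulSoup(doc, "html.parser")
li = soup.find("li")

# find_parent() stops at the nearest ancestor that matches the filter.
nearest_ul = li.find_parent("ul")
nearest_div = li.find_parent("div")
print(nearest_ul.name)     # ul
print(nearest_div["id"])   # inner

# find_parents() collects every matching ancestor, nearest first.
all_divs = [div["id"] for div in li.find_parents("div")]
print(all_divs)            # ['inner', 'outer']
```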
find_next_siblings() find_next_sibling()
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
find_previous_siblings() find_previous_sibling()
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
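The sibling methods in both directions can be tried on a short list. This is a sketch under assumptions: the markup and the ids `a`, `b`, `c` are invented for the example, and "html.parser" is used so that nothing beyond Beautiful Soup is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three sibling <li> elements with made-up ids.
doc = '<ul><li id="a">Foo</li><li id="b">Bar</li><li id="c">Jay</li></ul>'
soup = BeautifulSoup(doc, "html.parser")
first = soup.find("li", id="a")
middle = soup.find("li", id="b")
last = soup.find("li", id="c")

# Single-result variants: the closest sibling on each side.
next_id = middle.find_next_sibling("li")["id"]
prev_id = middle.find_previous_sibling("li")["id"]
print(next_id, prev_id)  # c a

# List variants: all siblings on one side, closest first.
after_first = [li["id"] for li in first.find_next_siblings("li")]
before_last = [li["id"] for li in last.find_previous_siblings("li")]
print(after_first)   # ['b', 'c']
print(before_last)   # ['b', 'a']
```

Note that the preceding siblings come back closest first, so the order is reversed relative to the document.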
find_all_next() find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.
find_all_previous() find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.
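Unlike the sibling methods, these four walk the whole document in parse order, so they can reach descendants and nodes outside the current parent. A minimal sketch on invented markup (the fragment below is an assumption, parsed with "html.parser" to avoid extra installs):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a heading, a list, and a trailing paragraph.
doc = '<div><h4>Hello</h4><ul><li>Foo</li><li>Bar</li></ul><p>End</p></div>'
soup = BeautifulSoup(doc, "html.parser")
ul = soup.find("ul")
p = soup.find("p")

# Looking forward from <ul>: its own <li> children come first, then <p>.
first_li = ul.find_next("li").string
following = [tag.name for tag in ul.find_all_next(["li", "p"])]
print(first_li)    # Foo
print(following)   # ['li', 'li', 'p']

# Looking backward: results are ordered from nearest to farthest.
heading = ul.find_previous("h4").string
earlier_items = [li.string for li in p.find_all_previous("li")]
print(heading)        # Hello
print(earlier_items)  # ['Bar', 'Foo']
```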
Pass a CSS selector directly to select() to make a selection.
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
# select() can also be called on each result to select within it
for ul in soup.select('ul'):
    print(ul.select('li'))
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
# Both forms read the same attribute, so each id is printed twice
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
list-1
list-1
list-2
list-2
Getting content
for li in soup.select('li'):
    print(li.get_text())
Foo
Bar
Jay
Foo
Bar