pip3 install beautifulsoup4
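Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup,'html.parser') | Built into Python, moderate speed, good tolerance of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(markup,'lxml') | Fast, good tolerance of malformed documents | Requires installing a C library |
lxml XML parser | BeautifulSoup(markup,'xml') | Fast, the only parser that supports XML | Requires installing a C library |
html5lib | BeautifulSoup(markup,'html5lib') | Best fault tolerance, parses documents the way a browser does, produces HTML5-format documents | Slow; a pure-Python package with no C extension, but it must be installed separately |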
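The table above suggests preferring lxml when it is installed and falling back to the built-in parser otherwise. A minimal sketch of that idea follows; the helper name make_soup and the sample markup are assumptions made for illustration, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Try each parser name in order and return (soup, parser_used).

    make_soup is a hypothetical helper, not a Beautiful Soup API.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # Raised when the requested parser library is not installed.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Deliberately unclosed tags; every parser in the table repairs them.
soup, used = make_soup("<p>Hello<b>world")
print(used)
print(soup.p.get_text())
```

Because html.parser ships with Python, the fallback always succeeds, so the RuntimeError branch is only reachable if a caller passes a custom list of parsers.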
html = """
<html dir="ltr" lang="en"><head><meta charset="utf-8"/>
<title>The Dormouse's story</title>
</head><body><p class="title" name="dormouse">
<b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie --></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p>
<p class="story"> ...story go on...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
# prettify() returns the parsed tree as an indented string
print(soup.prettify())
The parser automatically completes the missing closing tags:
<html dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dormouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well
  </p>
  <p class="story">
   ...story go on...
  </p>
 </body>
</html>
print(soup.title.string)
This prints the title of the HTML document:
The Dormouse's story
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
The output is as follows:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head>
<p class="title" name="dormouse"> <b>The Dormouse's story</b></p> # only the first p tag is returned
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
# .name is the name of the tag itself
print(soup.title.name)
title
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
Both ways of reading the value of the name attribute give the same result:
dormouse
dormouse
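Indexing a tag with an attribute that is absent raises KeyError, so a lookup with a default is often more convenient. The sketch below uses Tag.get for that; the one-line markup is made up for illustration, and html.parser is used only so the example does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

html = '<p class="title" name="dormouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

name = p.get("name")             # value of an attribute that exists
missing = p.get("id")            # None, because <p> has no id attribute
fallback = p.get("id", "no-id")  # explicit default instead of None
classes = p["class"]             # class is multi-valued, so this is a list

try:
    p["id"]                      # dictionary-style access on a missing attribute
    raised = False
except KeyError:
    raised = True

print(name, missing, fallback, classes, raised)
```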
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
# .string is the text inside the tag
print(soup.b.string)
The Dormouse's story
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
# Selections can be nested: head, then title, then its text
print(soup.head.title.string)
The Dormouse's story
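One detail worth knowing when reading text this way: .string only returns a value when the tag has a single child, and is None otherwise, while get_text() concatenates all nested text. A small sketch with made-up markup (html.parser is chosen here only to avoid depending on lxml):

```python
from bs4 import BeautifulSoup

html = "<p>Once upon a time <b>three</b> sisters</p>"
soup = BeautifulSoup(html, "html.parser")

inner = soup.b.string      # <b> has exactly one child, a string
outer = soup.p.string      # <p> has three children, so .string is None
full = soup.p.get_text()   # all text below <p>, joined together

print(inner)
print(outer)
print(full)
```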
html = '''<html dir="ltr" lang="en"><head><meta charset="utf-8"/>
<title>The Dormouse's story</title>
</head><body><p class="story">Once upon a time there were three little sisters;and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
 </p>
<p class="story"> ...story go on...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
# .contents returns the direct children as a list
print(soup.p.contents)
['Once upon a time there were three little sisters;and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 'and', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '; and they lived at the bottom of a well\n ']
children is an iterator:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(i,child)
<list_iterator object at 0x7fe986ba07f0>
0 Once upon a time there were three little sisters;and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>
2 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
3 and
4 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
5 ; and they lived at the bottom of a well
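As the output shows, the children include plain text nodes as well as tags. When only the tags are wanted, one possible approach is to filter by type; the markup below is a shortened, made-up version of the story paragraph, parsed with html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p>Once <a id="link1">Elsie</a> and <a id="link2">Lacie</a> lived</p>'
soup = BeautifulSoup(html, "html.parser")

all_children = list(soup.p.children)
# Keep only Tag objects, dropping the NavigableString text nodes.
tag_children = [child for child in all_children if isinstance(child, Tag)]
ids = [child["id"] for child in tag_children]

print(len(all_children))
print(ids)
```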
html = '''<html dir="ltr" lang="en"><head><meta charset="utf-8"/>
<title>The Dormouse's story</title>
</head><body><p class="story">Once upon a time there were three little sisters;and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
 </p>
<p class="story"> ...story go on...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
# .descendants walks children, grandchildren and so on
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i,child)
Grandchild nodes are printed as well:
<generator object descendants at 0x7fe986c11468>
0 Once upon a time there were three little sisters;and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>
2
3 <span>Elsie </span>
4 Elsie
5 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
6 Lacie
7 and
8 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
9 Tillie
10 ; and they lived at the bottom of a well
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
# .parent is the direct parent of the first a tag
print(soup.a.parent)
Output:
<p class="story">Once upon a time there were three little sisters;and their names were <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well </p>
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
# Iterating over the parent tag yields its direct children
print(list(enumerate(soup.a.parent)))
Output:
[(0, 'Once upon a time there were three little sisters;and their names were\n '), (1, <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>), (2, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (3, 'and'), (4, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (5, '; and they lived at the bottom of a well\n ')]
print(list(enumerate(soup.a.parents)))
This lists every ancestor; the last entry is the root of the parsed document:
[(0, <p class="story">Once upon a time there were three little sisters;and their names were <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well </p>), (1, <body><p class="story">Once upon a time there were three little sisters;and their names were <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well </p> <p class="story"> ...story go on...</p> </body>), (2, <html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well </p> <p class="story"> ...story go on...</p> </body></html>), (3, <html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well </p> <p class="story"> ...story go on...</p> </body></html>)]
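The full output above is hard to read because every ancestor is printed with all of its content. Printing only the tag names makes the chain easier to follow. The following is a sketch on a tiny made-up document, using html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk from the first <a> up to the root, collecting only tag names.
names = [parent.name for parent in soup.a.parents]
print(names)
```

The final name, [document], belongs to the BeautifulSoup object itself, which is why the html element appears to be listed twice in the full output.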
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
# Siblings that come after the first a tag
print(list(enumerate(soup.a.next_siblings)))
The output is as follows:
[(0, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (1, 'and'), (2, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (3, '; and they lived at the bottom of a well\n ')]
print(list(enumerate(soup.a.previous_siblings)))
[(0, 'Once upon a time there were three little sisters;and their names were\n ')]
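The sibling iterators return text nodes too. To jump straight to the next or previous tag of a given name, find_next_sibling, find_next_siblings and find_previous_sibling can be used instead. A minimal sketch, with made-up markup and html.parser chosen only to avoid an lxml dependency:

```python
from bs4 import BeautifulSoup

html = ('<p>Once <a id="link1">Elsie</a>, <a id="link2">Lacie</a> '
        'and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")

first = soup.a
next_link = first.find_next_sibling("a")       # skips the ", " text node
later_links = first.find_next_siblings("a")    # every following <a> sibling
previous_link = next_link.find_previous_sibling("a")

print(next_link["id"])
print([a["id"] for a in later_links])
print(previous_link["id"])
```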
html = """
<div class="panel">
  <div class="panel-heading">
    <h4>Helllo</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
# find_all() returns a list of every matching tag
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
The output is as follows:
[<ul class="list" id="list-1"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>, <ul class="list list-small" id="list-2"> <li class="element">Foo</li> <li class="element">Bar</li> </ul>]
<class 'bs4.element.Tag'>
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
# Each result is a Tag, so find_all() can be called on it again
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
The output is as follows:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
html = '''
<div class="panel">
  <div class="panel-heading">
    <h4>Helllo</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1" name="elements">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
# attrs takes a dict of attribute names and the values to match
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))
The output is as follows:
[<ul class="list" id="list-1" name="elements"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>]
[<ul class="list" id="list-1" name="elements"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>]
If you know the id or class, you can also search with keyword arguments:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(id='list-1'))
[<ul class="list" id="list-1" name="elements"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>]
print(soup.find_all(class_='element'))
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
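Because class is a multi-valued attribute, class_ matches a tag when any one of its classes equals the given value. The sketch below illustrates this on a reduced, made-up version of the two lists; html.parser is used only so the example does not need lxml.

```python
from bs4 import BeautifulSoup

html = ('<ul class="list" id="list-1"></ul>'
        '<ul class="list list-small" id="list-2"></ul>')
soup = BeautifulSoup(html, "html.parser")

# "list" is one of the classes on both <ul> tags, so both match.
with_list = [ul["id"] for ul in soup.find_all(class_="list")]
# Only the second <ul> carries the "list-small" class.
with_small = [ul["id"] for ul in soup.find_all(class_="list-small")]
# A CSS selector can require both classes at once.
with_both = [ul["id"] for ul in soup.select("ul.list.list-small")]
# The attribute value itself is returned as a list of classes.
classes = soup.find(id="list-2")["class"]

print(with_list, with_small, with_both, classes)
```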
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
# Search by text content; newer bs4 releases name this argument string instead of text
print(soup.find_all(text='Foo'))
['Foo', 'Foo']
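Text searches are not limited to exact matches. The sketch below assumes a bs4 version that accepts the string argument (4.4.0 or later) and passes a compiled regular expression; combined with a tag name, the search returns the tags rather than the strings. The markup is made up for illustration.

```python
import re

from bs4 import BeautifulSoup

html = "<ul><li>Foo</li><li>Bar</li><li>Baz</li></ul>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Foo")                   # exact text match
starts_ba = soup.find_all(string=re.compile(r"^Ba"))  # regex match on text
# With a tag name, the result is the <li> tags whose text matches.
li_tags = soup.find_all("li", string=re.compile(r"^Ba"))

print(exact)
print(starts_ba)
print(li_tags)
```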
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
# find() returns only the first match
print(soup.find('ul'))
<ul class="list" id="list-1" name="elements"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>
print(type(soup.find('ul')))
<class 'bs4.element.Tag'>
print(type(soup.find('page')))
If no matching tag exists, find() returns None:
<class 'NoneType'>
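Since find() can return None, chaining another call onto its result may raise AttributeError. One common way to guard against that is sketched below; first_text is a hypothetical helper written for this example, and the markup is made up.

```python
from bs4 import BeautifulSoup

html = '<div><ul id="list-1"><li> Foo </li></ul></div>'
soup = BeautifulSoup(html, "html.parser")


def first_text(tree, name):
    """Return the stripped text of the first <name> tag, or None if absent.

    first_text is a hypothetical helper, not part of Beautiful Soup.
    """
    tag = tree.find(name)
    if tag is None:
        return None
    return tag.get_text(strip=True)


found = first_text(soup, "li")
not_found = first_text(soup, "page")
print(found)
print(not_found)
```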
Passing a CSS selector directly to select() is enough to make a selection:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(soup.select('ul')[0])
The output is as follows:
[<div class="panel-heading"> <h4>Helllo</h4> </div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<ul class="list" id="list-1" name="elements"> <li class="element">Foo</li> <li class="element">Bar</li> <li class="element">Jay</li> </ul>
Iterating over the results:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
The output is as follows:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
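When only the first match is needed, select_one() plays the same role for CSS selectors that find() plays for find_all(), and it also returns None when nothing matches. A short sketch on made-up markup, which additionally shows the child combinator:

```python
from bs4 import BeautifulSoup

html = ('<ul id="list-1"><li class="element">Foo</li>'
        '<li class="element">Bar</li></ul>'
        '<ul id="list-2"><li class="element">Jay</li></ul>')
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("#list-2 .element")  # first match only
absent = soup.select_one("#list-3")          # no such id, so None
direct = [li.get_text() for li in soup.select("ul > li")]  # direct children

print(first.get_text())
print(absent)
print(direct)
```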
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
# Two equivalent ways to read an attribute of each selected tag
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
The output is as follows:
list-1
list-1
list-2
list-2
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
# get_text() returns the text content of each selected tag
for li in soup.select('li'):
    print(li.get_text())
Output:
Foo
Bar
Jay
Foo
Bar