Python Web Scraping Series: BeautifulSoup in Detail

Installation

pip3 install beautifulsoup4

Parsers

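Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup,'html.parser') | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser | BeautifulSoup(markup,'lxml') | Fast; tolerant of malformed documents | Requires a C library to be installed
lxml XML parser | BeautifulSoup(markup,'xml') | Fast; the only parser that supports XML | Requires a C library to be installed
html5lib | BeautifulSoup(markup,'html5lib') | Best tolerance of malformed documents; parses the document the way a browser does; produces HTML5-format documents | Slow; a separate pure-Python package (no C extension needed)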
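
Because lxml depends on a compiled library while html.parser ships with Python, one option is to try parsers in order of preference. The sketch below is only an illustration: make_soup is a hypothetical helper name (not part of Beautiful Soup), and it relies on bs4 raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: build a soup with the first parser that is installed."""
    for parser_name in preferred:
        try:
            return BeautifulSoup(markup, parser_name)
        except FeatureNotFound:
            # This parser is not installed; try the next one in the list.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)
```

Passing preferred=("html.parser",) forces the standard-library parser, which can be handy when output must not depend on which optional packages happen to be present.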

Basic usage

html = """ 
 <html dir="ltr" lang="en"><head><meta charset="utf-8"/>  <title>The Dormouse's story</title> </head><body><p class="title" name="dormouse"> <b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters;and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">   <!-- Elsie --></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well
   </p>  <p class="story">   ...story go on...</p>
 """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())

The parser automatically completes the missing closing tags:

<html dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dormouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;    and they lived at the bottom of a well
  </p>
  <p class="story">
   ...story go on...
  </p>
 </body>
</html>

print(soup.title.string)
This prints the title of the HTML document:

The Dormouse's story

Tag selectors

Selecting elements

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)

The output is as follows:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head>
<p class="title" name="dormouse"> <b>The Dormouse's story</b></p> # only the first p tag is returned

Getting the tag name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title.name)

title

Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])

Both ways of reading the attribute return the same value:

dormouse
dormouse
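
The two access styles differ once an attribute is missing, and multi-valued attributes such as class come back as a list. The following is a minimal sketch on a made-up one-line snippet; it uses html.parser only so that it runs without lxml, which is an assumption of this example rather than a recommendation.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration; note the two class names.
snippet = '<p class="title big" name="dormouse">The Dormouse\'s story</p>'
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

name_value = p["name"]   # index access raises KeyError if the attribute is missing
missing = p.get("id")    # .get() returns None instead of raising
classes = p["class"]     # class is multi-valued, so a list is returned

print(name_value, missing, classes)
```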

Getting the text content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.b.string)

The Dormouse's story
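
One caveat with .string is that it only works when a tag has exactly one child; otherwise it returns None, and get_text() is usually the safer choice. The snippet below is a small sketch with a made-up fragment (parsed with html.parser as an assumption, so that it runs without lxml).

```python
from bs4 import BeautifulSoup

# Made-up fragment: <p> holds a <b>, a space and an <i>, i.e. three children.
snippet = "<p class=\"title\"><b>The Dormouse's story</b> <i>(excerpt)</i></p>"
soup = BeautifulSoup(snippet, "html.parser")

single = soup.b.string        # <b> has exactly one child, so its string is returned
multi = soup.p.string         # <p> has several children, so .string is None
joined = soup.p.get_text()    # concatenates every string inside <p>

print(repr(single), repr(multi), repr(joined))
```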

Nested selection

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.head.title.string)

The Dormouse's story

Child nodes and descendant nodes

html = '''<html dir="ltr" lang="en"><head><meta charset="utf-8"/>  <title>The Dormouse\'s story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were\n   <a class="sister" href="http://example.com/elsie" id="link1">   <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well\n  </p>  <p class="story">   ...story go on...</p>
 '''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)

contents returns the direct children as a list:

['Once upon a time there were three little sisters;and their names were\n   ', <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 'and', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ';    and they lived at the bottom of a well\n  ']

children is an iterator:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(i,child)

<list_iterator object at 0x7fe986ba07f0>
0 Once upon a time there were three little sisters;and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>
2 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
3 and
4 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
5 ; and they lived at the bottom of a well

html = '''<html dir="ltr" lang="en"><head><meta charset="utf-8"/>  <title>The Dormouse\'s story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were\n   <a class="sister" href="http://example.com/elsie" id="link1">   <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well\n  </p>  <p class="story">   ...story go on...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i,child)

Grandchild nodes are printed as well:

<generator object descendants at 0x7fe986c11468>
0 Once upon a time there were three little sisters;and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>
2
3 <span>Elsie </span>
4 Elsie
5 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
6 Lacie
7 and
8 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
9 Tillie
10 ; and they lived at the bottom of a well
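
The difference between direct children and all descendants is easier to see on a smaller fragment. This sketch uses a made-up snippet and html.parser (an assumption, chosen so it runs without lxml); it keeps only Tag objects so that plain text nodes are skipped.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up fragment: the first <a> wraps a <span>, the second holds plain text.
snippet = '<p>one <a href="#1"><span>Elsie</span></a> two <a href="#2">Lacie</a></p>'
soup = BeautifulSoup(snippet, "html.parser")

# children: only nodes directly under <p>
direct_tags = [node.name for node in soup.p.children if isinstance(node, Tag)]

# descendants: walks the whole subtree in document order, so <span> appears too
all_tags = [node.name for node in soup.p.descendants if isinstance(node, Tag)]

print(direct_tags)
print(all_tags)
```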

Parent and ancestor nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)

Output:

<p class="story">Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well
  </p>

Iterating over the parent yields its direct children:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.parent)))

Output:

[(0, 'Once upon a time there were three little sisters;and their names were\n   '), (1, <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>), (2, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (3, 'and'), (4, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (5, ';    and they lived at the bottom of a well\n  ')]

print(list(enumerate(soup.a.parents)))

This lists every ancestor; the last item is the root node of the parsed document:

[(0, <p class="story">Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well
  </p>), (1, <body><p class="story">Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well
  </p> <p class="story">   ...story go on...</p>
</body>), (2, <html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well
  </p> <p class="story">   ...story go on...</p>
</body></html>), (3, <html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well
  </p> <p class="story">   ...story go on...</p>
</body></html>)]
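
Printing whole ancestors is verbose, so it can help to look at their names only. The sketch below uses a made-up one-line document and html.parser (an assumption, so that it runs without lxml). It also shows why the last two entries above look identical: the final ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Made-up minimal document for illustration.
snippet = '<html><body><p><a href="#">link</a></p></body></html>'
soup = BeautifulSoup(snippet, "html.parser")

# Walk upwards from the <a> tag and record each ancestor's tag name.
ancestor_names = [ancestor.name for ancestor in soup.a.parents]

print(ancestor_names)
```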

Sibling nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.next_siblings)))

Output:

[(0, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (1, 'and'), (2, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (3, ';    and they lived at the bottom of a well\n  ')]

print(list(enumerate(soup.a.previous_siblings)))

[(0, 'Once upon a time there were three little sisters;and their names were\n   ')]
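
As the output shows, siblings include text nodes. When only sibling tags are wanted, find_next_sibling() and find_next_siblings() accept a tag name and skip the text in between. This is a small sketch on a made-up fragment, parsed with html.parser as an assumption so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up fragment with text between the links.
snippet = (
    '<p>intro <a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
    ' and <a id="link3">Tillie</a>.</p>'
)
soup = BeautifulSoup(snippet, "html.parser")
first_link = soup.a

raw_next = first_link.next_sibling                 # the text node ", "
next_link = first_link.find_next_sibling("a")      # the next <a> tag
later_ids = [a["id"] for a in first_link.find_next_siblings("a")]

print(repr(raw_next), next_link["id"], later_ids)
```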

Standard selectors

find_all(name,attrs,recursive,text,**kwargs)
Searches the document by tag name, attributes, or text content.

name

html = """
 <div class="panel">
   <div class="panel-heading">
     <h4>Helllo</h4>
   </div>
   <div class="panel-body">
     <ul class="list" id="list-1">
       <li class="element">Foo</li>
       <li class="element">Bar</li>
       <li class="element">Jay</li>
     </ul>
     <ul class="list list-small" id="list-2">
       <li class="element">Foo</li>
       <li class="element">Bar</li>
     </ul>
   </div>
 </div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))

The output is as follows:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]

<class 'bs4.element.Tag'>

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul'):
     print(ul.find_all('li'))

The output is as follows:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
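
Besides a single tag name, the name argument of find_all also accepts a list of names, and limit caps how many matches are returned. The sketch below uses a condensed, made-up version of the panel markup and html.parser (an assumption, so that it runs without lxml).

```python
from bs4 import BeautifulSoup

# Condensed stand-in for the panel markup used in this section.
snippet = (
    '<div class="panel"><div class="panel-heading"><h4>Hello</h4></div>'
    '<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
    '<ul id="list-2"><li>Foo</li><li>Bar</li></ul></div>'
)
soup = BeautifulSoup(snippet, "html.parser")

# A list of names matches any of them, in document order.
names = [tag.name for tag in soup.find_all(["h4", "ul"])]

# limit stops the search after the given number of matches.
first_two = [li.get_text() for li in soup.find_all("li", limit=2)]

print(names)
print(first_two)
```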

attrs

html = '''
 <div class="panel">\n  <div class="panel-heading">\n    <h4>Helllo</h4>\n  </div>\n  <div class="panel-body">\n    <ul class="list" id="list-1" name=elements>\n      <li class="element">Foo</li>\n      <li class="element">Bar</li>\n      <li class="element">Jay</li>\n    </ul>\n    <ul class="list list-small" id="list-2">\n      <li class="element">Foo</li>\n      <li class="element">Bar</li>\n    </ul>\n  </div>\n</div>
 '''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))

The output is as follows:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

If you know the id or class, you can also search with the following keyword arguments:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(id='list-1'))
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

print(soup.find_all(class_='element'))

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
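
Two details are worth a quick experiment: class_ matches any single class of a multi-class element, and attribute names that are not valid Python identifiers (such as data-* attributes) have to go through attrs. The snippet is a minimal sketch; the data-kind attribute is invented for illustration, and html.parser is used only so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup; data-kind is a hypothetical attribute added for this example.
snippet = (
    '<ul class="list" id="list-1" data-kind="main"></ul>'
    '<ul class="list list-small" id="list-2"></ul>'
)
soup = BeautifulSoup(snippet, "html.parser")

# class_ matches if ANY of the element's classes equals the value.
with_list = [ul["id"] for ul in soup.find_all(class_="list")]
with_small = [ul["id"] for ul in soup.find_all(class_="list-small")]

# "data-kind" cannot be written as a keyword argument, so use attrs.
main_lists = [ul["id"] for ul in soup.find_all(attrs={"data-kind": "main"})]

print(with_list, with_small, main_lists)
```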

text

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text='Foo'))

['Foo', 'Foo']
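
Since Beautiful Soup 4.4.0 the same filter is also available under the name string, with text kept as the older spelling. The filter accepts a compiled regular expression as well as an exact string. The following is a sketch on a made-up list, parsed with html.parser as an assumption so that it runs without lxml.

```python
import re

from bs4 import BeautifulSoup

# Made-up list for illustration.
soup = BeautifulSoup("<ul><li>Foo</li><li>Bar</li><li>Baz</li></ul>", "html.parser")

# Exact match: returns the matching text nodes, not the tags.
exact = [str(s) for s in soup.find_all(string="Foo")]

# Regex match: every text node that starts with "Ba".
by_pattern = [str(s) for s in soup.find_all(string=re.compile(r"^Ba"))]

# Combined with a tag name, the matching tags themselves are returned.
tags = [li.name for li in soup.find_all("li", string=re.compile(r"^Ba"))]

print(exact, by_pattern, tags)
```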

find(name,attrs,recursive,text,**kwargs)
find returns a single element (the first match), while find_all returns all matching elements.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find('ul'))
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>

print(type(soup.find('ul')))

<class 'bs4.element.Tag'>

print(type(soup.find('page')))

If no matching element exists, find returns None:

<class 'NoneType'>
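
Because find can return None, chaining an attribute access directly onto it raises AttributeError when nothing matches. A common pattern is to check the result first. The sketch below wraps that check in a hypothetical helper named first_text (not part of Beautiful Soup) and uses html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup


def first_text(soup, name, default=""):
    """Hypothetical helper: text of the first <name> tag, or default if absent."""
    tag = soup.find(name)
    if tag is None:
        return default
    # strip=True trims surrounding whitespace from the extracted text.
    return tag.get_text(strip=True)


soup = BeautifulSoup("<ul><li> Foo </li></ul>", "html.parser")
print(first_text(soup, "li"))
print(first_text(soup, "page", default="(missing)"))
```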

CSS selectors

Pass a CSS selector directly to select() to make a selection.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(soup.select('ul')[0])

The output is as follows:

[<div class="panel-heading"> <h4>Helllo</h4> </div>]

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>

Iterating over the results:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
     print(ul.select('li'))

The output is as follows:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
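
select() always returns a list; when only the first match is needed, select_one() returns that element or None. The sketch below also tries the child combinator (>) and a compound selector. It uses a condensed, made-up version of the list markup and html.parser (an assumption, so that it runs without lxml).

```python
from bs4 import BeautifulSoup

# Condensed stand-in for the two lists used in this section.
snippet = (
    '<ul class="list" id="list-1"><li class="element">Foo</li>'
    '<li class="element">Bar</li><li class="element">Jay</li></ul>'
    '<ul class="list list-small" id="list-2"><li class="element">Foo</li>'
    '<li class="element">Bar</li></ul>'
)
soup = BeautifulSoup(snippet, "html.parser")

first_item = soup.select_one("#list-1 li")     # first match only
no_table = soup.select_one("table")            # None when nothing matches
small_items = [li.get_text() for li in soup.select("ul.list-small > li")]

print(first_item.get_text(), no_table, small_items)
```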

Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

The output is as follows:
list-1
list-1
list-2
list-2

Getting the text content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    print(li.get_text())

Output:
Foo
Bar
Jay
Foo
Bar

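Summary:

  • The lxml parser is recommended; fall back to html.parser when necessary
  • Tag selectors offer weak filtering but are fast
  • Use find() and find_all() to match a single result or multiple results
  • If you are familiar with CSS selectors, use select()
  • Remember the common ways of getting attribute and text values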