Python Web Scraping Series: BeautifulSoup in Detail

Date: 2019-11-21

Installation

pip3 install beautifulsoup4

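Parsers

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup,'html.parser')	Built into Python; moderate speed; good tolerance of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup,'lxml')	Fast; good tolerance of malformed documents	Requires the C library to be installed
lxml XML parser	BeautifulSoup(markup,'xml')	Fast; the only parser that supports XML	Requires the C library to be installed
html5lib	BeautifulSoup(markup,'html5lib')	Best fault tolerance; parses documents the way a browser does; produces HTML5-format documents	Slow; requires installing the separate html5lib package (pure Python, no C extension)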
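Since lxml is an optional install, it can help to fall back to the built-in parser when it is missing. The following is a minimal sketch, not part of the original article: make_soup is a hypothetical helper name, and it relies on bs4 raising FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


# Deliberately broken markup: neither <p> nor <b> is closed.
soup = make_soup("<p>Hello <b>world")
bold_text = soup.b.string
print(bold_text)
```

Both parsers repair the unclosed tags, so the bold text can be read either way; only the surrounding wrapper elements differ between parsers.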

Basic usage

html = """ 
 <html dir="ltr" lang="en"><head><meta charset="utf-8"/>  <title>The Dormouse's story</title> </head><body><p class="title" name="dormouse"> <b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters;and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">   <!-- Elsie --></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well
   </p>  <p class="story">   ...story go on...</p>
 """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())

BeautifulSoup completes the markup automatically, adding the missing closing tags:

<html dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dormouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;    and they lived at the bottom of a well
  </p>
  <p class="story">
   ...story go on...
  </p>
 </body>
</html>

print(soup.title.string)
This prints the title of the HTML document:

The Dormouse's story

Tag selectors

Selecting elements

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)

The output is as follows:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head>
<p class="title" name="dormouse"> <b>The Dormouse's story</b></p> # only the first p tag is returned

Getting the tag name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title.name)

title

Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])

Both ways of reading the attribute return the same value:

dormouse
dormouse
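Two details of attribute access are easy to trip over: indexing raises KeyError for a missing attribute, and class is multi-valued. The sketch below illustrates both on a made-up snippet (the markup and variable names are assumptions, and html.parser is used so it runs without lxml).

```python
from bs4 import BeautifulSoup

snippet = '<p class="title intro" name="dormouse">Hi</p>'
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

name_value = p["name"]   # raises KeyError if the attribute is absent
missing = p.get("id")    # returns None instead of raising
classes = p["class"]     # multi-valued attribute: a list of class names
print(name_value, missing, classes)
```

Using get() is the safer choice when scraping pages where an attribute may not always be present.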

Getting content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.b.string)

The Dormouse's story
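Note that .string only works when a tag has a single text child (or a single child tag that does). A small sketch on a made-up snippet, assuming html.parser, shows what happens when the tag mixes text and child tags:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Once <b>upon</b> a time</p>", "html.parser")

single = soup.b.string         # <b> has exactly one text child
mixed = soup.p.string          # <p> has several children, so this is None
full_text = soup.p.get_text()  # concatenates all nested text
print(single, mixed, full_text)
```

When in doubt about the structure of a tag, get_text() is the more forgiving option.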

Nested selection

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.head.title.string)

The Dormouse's story

Child and descendant nodes

html = '''<html dir="ltr" lang="en"><head><meta charset="utf-8"/>  <title>The Dormouse\'s story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were\n   <a class="sister" href="http://example.com/elsie" id="link1">   <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well\n  </p>  <p class="story">   ...story go on...</p>
 '''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)

['Once upon a time there were three little sisters;and their names were\n   ', <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 'and', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ';    and they lived at the bottom of a well\n  ']

children is an iterator:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(i,child)

<list_iterator object at 0x7fe986ba07f0>
0 Once upon a time there were three little sisters;and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>
2 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
3 and
4 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
5 ; and they lived at the bottom of a well

html = '''<html dir="ltr" lang="en"><head><meta charset="utf-8"/>  <title>The Dormouse\'s story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were\n   <a class="sister" href="http://example.com/elsie" id="link1">   <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well\n  </p>  <p class="story">   ...story go on...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i,child)

Descendant nodes (grandchildren and deeper) are printed as well:

<generator object descendants at 0x7fe986c11468>
0 Once upon a time there were three little sisters;and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>
2
3 <span>Elsie </span>
4 Elsie
5 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
6 Lacie
7 and
8 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
9 Tillie
10 ; and they lived at the bottom of a well
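To make the difference between children and descendants concrete, here is a compact sketch on a shortened, made-up version of the paragraph (html.parser is assumed). It also shows one way to skip text nodes by keeping only Tag instances.

```python
from bs4 import BeautifulSoup, Tag

snippet = "<p>intro<a><span>Elsie</span></a><a>Lacie</a></p>"
soup = BeautifulSoup(snippet, "html.parser")

direct = list(soup.p.children)         # same nodes as soup.p.contents
everything = list(soup.p.descendants)  # depth-first, includes text nodes
tags_only = [node.name for node in soup.p.descendants if isinstance(node, Tag)]

print(len(direct), len(everything), tags_only)
```

The paragraph has three direct children but six descendants, because the span and every text node inside the links are counted too.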

Parent and ancestor nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)

Result:

<p class="story">Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well
  </p>

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.parent)))

Result:

[(0, 'Once upon a time there were three little sisters;and their names were\n   '), (1, <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>), (2, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (3, 'and'), (4, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (5, ';    and they lived at the bottom of a well\n  ')]

print(list(enumerate(soup.a.parents)))
This lists every ancestor; the last entry is the root of the parsed document:

[(0, <p class="story">Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well
  </p>), (1, <body><p class="story">Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well
  </p> <p class="story">   ...story go on...</p>
</body>), (2, <html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well
  </p> <p class="story">   ...story go on...</p>
</body></html>), (3, <html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;    and they lived at the bottom of a well
  </p> <p class="story">   ...story go on...</p>
</body></html>)]
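The last two entries in the output above look identical because the final ancestor is the BeautifulSoup object itself, which renders the same as the html tag. Printing the names instead makes the chain easier to read; the snippet below is a made-up minimal document parsed with html.parser.

```python
from bs4 import BeautifulSoup

snippet = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(snippet, "html.parser")

# Walk upward from the first <a> and collect each ancestor's tag name.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```

The root object reports its name as '[document]', which is a convenient marker for the top of the tree.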

Sibling nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.next_siblings)))

Result:

[(0, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (1, 'and'), (2, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (3, '; and they lived at the bottom of a well\n ')]

print(list(enumerate(soup.a.previous_siblings)))

[(0, 'Once upon a time there were three little sisters;and their names were\n ')]
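As the output shows, siblings include plain text nodes. If only the next tag is wanted, find_next_sibling can skip over the text. This is a small sketch on a made-up snippet, assuming html.parser:

```python
from bs4 import BeautifulSoup

snippet = "<p><a id='link1'>Elsie</a> and <a id='link2'>Lacie</a></p>"
soup = BeautifulSoup(snippet, "html.parser")
first = soup.a

raw_next = first.next_sibling             # the text node ' and '
next_link = first.find_next_sibling("a")  # skips text, returns the next <a>
print(repr(raw_next), next_link["id"])
```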

Standard selectors

find_all(name,attrs,recursive,text,**kwargs)
Searches the document by tag name, attributes, or text content.

name

html = """
 <div class="panel">
   <div class="panel-heading">
     <h4>Helllo</h4>
   </div>
   <div class="panel-body">
     <ul class="list" id="list-1">
       <li class="element">Foo</li>
       <li class="element">Bar</li>
       <li class="element">Jay</li>
     </ul>
     <ul class="list list-small" id="list-2">
       <li class="element">Foo</li>
       <li class="element">Bar</li>
     </ul>
   </div>
 </div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))

Result:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul'):
     print(ul.find_all('li'))

Result:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
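The name argument is not limited to a single string. As a rough sketch (the snippet is a trimmed, made-up version of the panel markup, parsed with html.parser), it can also be a list or a compiled regular expression, and the limit and recursive arguments narrow the search further:

```python
import re
from bs4 import BeautifulSoup

snippet = """
<div><h4>Hello</h4>
<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>
<ul id="list-2"><li>Foo</li><li>Bar</li></ul></div>
"""
soup = BeautifulSoup(snippet, "html.parser")

by_list = [t.name for t in soup.find_all(["h4", "ul"])]       # any of several names
by_regex = [t.name for t in soup.find_all(re.compile("^u"))]  # names starting with "u"
first_two = soup.find_all("li", limit=2)                      # stop after two matches
top_level = soup.div.find_all("li", recursive=False)          # direct children only
print(by_list, by_regex, len(first_two), top_level)
```

The last call returns an empty list because the li tags are grandchildren of the div, not direct children.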

attrs

html = '''
 <div class="panel">\n  <div class="panel-heading">\n    <h4>Helllo</h4>\n  </div>\n  <div class="panel-body">\n    <ul class="list" id="list-1" name=elements>\n      <li class="element">Foo</li>\n      <li class="element">Bar</li>\n      <li class="element">Jay</li>\n    </ul>\n    <ul class="list list-small" id="list-2">\n      <li class="element">Foo</li>\n      <li class="element">Bar</li>\n    </ul>\n  </div>\n</div>
 '''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))

Result:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

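If you know the id or class, you can also search with keyword arguments (class_ takes a trailing underscore because class is a reserved word in Python):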

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(id='list-1'))

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

print(soup.find_all(class_='element'))

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
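One point worth illustrating is how class matching treats elements with several classes: a search for one class matches any tag that carries it, whatever else is in the attribute. The sketch below uses a reduced, made-up version of the two lists and html.parser.

```python
from bs4 import BeautifulSoup

snippet = """
<ul class="list" id="list-1"></ul>
<ul class="list list-small" id="list-2"></ul>
"""
soup = BeautifulSoup(snippet, "html.parser")

any_list = [ul["id"] for ul in soup.find_all("ul", class_="list")]
small_only = [ul["id"] for ul in soup.find_all("ul", class_="list-small")]
by_attrs = [ul["id"] for ul in soup.find_all("ul", attrs={"class": "list"})]
print(any_list, small_only, by_attrs)
```

Passing the class through attrs behaves the same way as the class_ keyword.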

text

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text='Foo'))

['Foo', 'Foo']
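Since Beautiful Soup 4.4.0 the same argument is also available under the name string, with text being the older spelling. A short sketch, assuming a made-up list and html.parser, shows exact matching, regex matching, and combining the text filter with a tag name:

```python
import re
from bs4 import BeautifulSoup

snippet = "<ul><li>Foo</li><li>Bar</li><li>Baz</li></ul>"
soup = BeautifulSoup(snippet, "html.parser")

exact = soup.find_all(string="Foo")                   # text nodes equal to "Foo"
pattern = soup.find_all(string=re.compile("^Ba"))     # text nodes matching a regex
tags = soup.find_all("li", string=re.compile("^Ba"))  # <li> tags whose .string matches
print(exact, pattern, [t.name for t in tags])
```

Without a tag name the result is a list of text nodes; with one, the result is a list of tags.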

find(name,attrs,recursive,text,**kwargs)
find returns the first matching element, while find_all returns a list of all matching elements.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find('ul'))

<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>

print(type(soup.find('ul')))

<class 'bs4.element.Tag'>

If the element does not exist, find returns None:

print(type(soup.find('page')))

<class 'NoneType'>
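Because find can return None, chaining an attribute access directly onto it may raise AttributeError. One possible guard is sketched below; text_of is a hypothetical helper name introduced only for illustration, and the snippet is made up.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul id='list-1'><li>Foo</li></ul>", "html.parser")


def text_of(soup, name, default=""):
    """Return the text of the first <name> tag, or default if there is none."""
    tag = soup.find(name)
    return tag.get_text() if tag is not None else default


found = text_of(soup, "li")
absent = text_of(soup, "page", default="(missing)")
print(found, absent)
```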

CSS selectors

Pass a CSS selector directly to select() to make the selection.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(soup.select('ul')[0])

Result:

[<div class="panel-heading"> <h4>Helllo</h4> </div>]

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
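Indexing select(...)[0] raises IndexError when nothing matches. As an alternative sketch, select_one returns the first match or None, and the usual CSS combinators and attribute selectors are available as well. The snippet is a trimmed, made-up version of the panel markup, parsed with html.parser.

```python
from bs4 import BeautifulSoup

snippet = """
<div class="panel-body">
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Baz</li></ul>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")

first_ul = soup.select_one("ul")                 # first match or None
small_items = soup.select("ul.list-small > li")  # child combinator
by_attr = soup.select('ul[id="list-1"] li')      # attribute selector
nothing = soup.select_one("table")               # no match
print(first_ul["id"], [li.get_text() for li in small_items], len(by_attr), nothing)
```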

Iterating over the results:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
     print(ul.select('li'))

Result:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

Result:
list-1
list-1
list-2
list-2

Getting content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    print(li.get_text())

Result:
Foo
Bar
Jay
Foo
Bar
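Real pages usually carry stray whitespace around text. get_text accepts separator and strip arguments for that case, and stripped_strings yields the same cleaned pieces one by one. The following sketch uses a made-up list item and html.parser.

```python
from bs4 import BeautifulSoup

snippet = "<li class='element'>  Foo <b> Bar </b>\n</li>"
soup = BeautifulSoup(snippet, "html.parser")
li = soup.li

raw = li.get_text()                             # whitespace kept as-is
clean = li.get_text(separator=" ", strip=True)  # strip each piece, join with a space
pieces = list(li.stripped_strings)              # the same pieces as a list
print(repr(raw), clean, pieces)
```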

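Summary:

Prefer the lxml parser, and fall back to html.parser when necessary.
Tag selectors (such as soup.p) offer weak filtering but are fast.
Use find() and find_all() to match a single result or multiple results.
If you are familiar with CSS selectors, use select().
Remember the common ways to get attribute values and text.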
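Putting these points together, a typical extraction step might look like the sketch below. It is only an illustration: the markup is a simplified variant of the story paragraph used earlier, html.parser is chosen so no extra install is needed, and the dictionary keys are arbitrary.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">Once upon a time there were three little sisters:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Select every sister link and collect its attributes and cleaned text.
links = [
    {"id": a.get("id"), "href": a.get("href"), "text": a.get_text(strip=True)}
    for a in soup.select("p.story a.sister")
]
for link in links:
    print(link)
```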
