[Python Web Scraping] The BeautifulSoup Parsing Library

Date: 2019-11-30

Parsing HTML or XML with BeautifulSoup

Getting to Know Beautiful Soup

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

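Beautiful Soup is a Python library for extracting data from HTML or XML text. It parses HTML and XML into a tree structure from which the relevant information can be pulled out.

Beautiful Soup is a flexible, convenient web page parsing library. It is efficient, supports several parsers (introduced below), and lets you extract information from web pages without writing regular expressions.

Installation

Beautiful Soup 3 is no longer being developed; Beautiful Soup 4 is recommended for current projects. To install it: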

pip install beautifulsoup4

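The Four Parsers Supported by Beautiful Soup

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; decent speed; tolerant of malformed documents	Less tolerant of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Very fast; tolerant of malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, "xml")	Very fast; the only XML parser supported	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most tolerant; parses pages the way a web browser does; produces valid HTML5	Very slow; requires an external Python package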

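If you only need to parse an HTML document, creating a BeautifulSoup object from the document is enough: Beautiful Soup picks a parser automatically. You can also pass an argument to choose the parser explicitly. The first argument to BeautifulSoup is the document to parse, given as a string or an open file handle; the second argument specifies how to parse it. If the second argument is omitted, Beautiful Soup chooses a parser based on the libraries installed on the system, in this order of priority: lxml, html5lib, then the parser built into the Python standard library. Because the result then depends on the environment, naming the parser explicitly is the safer habit.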
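
The priority order described above can also be spelled out in code. The following is a minimal sketch, assuming a hypothetical helper named make_soup (it is not part of Beautiful Soup) that tries each parser in turn and reports which one was actually used:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Same priority order Beautiful Soup follows when no parser is named.
PARSER_PRIORITY = ("lxml", "html5lib", "html.parser")


def make_soup(markup, parsers=PARSER_PRIORITY):
    """Hypothetical helper: return (soup, parser_name) for the first installed parser."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser library is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>Hello</p>")
print(used)           # depends on which parser libraries are installed
print(soup.p.string)  # Hello
```

Since html.parser ships with Python, the loop should always find at least one parser; the RuntimeError is only a safeguard.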

Install the parser libraries:

pip install html5lib
pip install lxml

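Basic Elements of the BeautifulSoup Class

Basic Usage

Fault tolerance: a parser's tolerance is its ability to recognize and repair the problem when the HTML is incomplete, for example when tags are left unclosed.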
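
As a small illustration of fault tolerance, here is a sketch that feeds a deliberately broken fragment (an assumption made up for this example) to the built-in html.parser. Neither tag is closed, yet the resulting tree can be navigated normally:

```python
from bs4 import BeautifulSoup

# Neither <b> nor <p> is closed in this fragment.
broken = "<p>Unclosed <b>bold text"

soup = BeautifulSoup(broken, "html.parser")

# The parser closes the open tags at the end of the input,
# so both elements exist in the tree.
print(soup.b.string)      # bold text
print(soup.p.get_text())  # Unclosed bold text
```

Other parsers may repair the same input differently, for instance by wrapping it in html and body tags.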

Parsing the HTML below with BeautifulSoup yields a BeautifulSoup object, which can output the document with standard indentation:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())  # fix up the indentation and print the document in structured form
print(soup.title.string)

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

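Tag Selectors

Selecting a tag element (when several match, the first one is returned)

Getting the tag name, the tag itself, its content, and its attributes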

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The is pppp</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')

print(soup.title)      # get the tag itself: <title>The Dormouse's story</title>
print(soup.title.name)      # get the tag name

print(soup.title.text)      # get the tag's text content
print(soup.p.text)
print(soup.p.string)

dic = soup.p.attrs     # all attributes of the first <p> tag, as a dict
print(dic)
print(dic["name"])
print(soup.p.attrs["class"])    # a specific attribute; class is multi-valued, so a list is returned
print(soup.p["class"])          # shorthand for the line above

Output:

<title>The Dormouse's story</title>
title
The Dormouse's story
The is pppp
The is pppp
{'class': ['title'], 'name': 'dromouse'}
dromouse
['title']
['title']
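
In the example above, .text and .string print the same value, which hides a difference that matters in practice. The sketch below uses a made-up fragment to show it: .string only returns a value when the tag has a single string inside it, while get_text() (the method behind .text) joins all nested strings. It also shows Tag.get(), which returns None for a missing attribute instead of raising KeyError:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>World</b></p>", "html.parser")

print(soup.b.string)      # World  (exactly one child, and it is a string)
print(soup.p.string)      # None   (two children: a string and a <b> tag)
print(soup.p.get_text())  # Hello World
print(soup.p.get("id"))   # None   (missing attribute, no KeyError)
```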

Nested Tag Selection

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<div class="title" name="dromouse"><b class='bb bcls xiong'>The Dormouse's story</b></div>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')

print(soup.div.b['class'])      # nested tag selection

print(soup.p.stripped_strings)      # a generator object, not a list
print(list(soup.p.stripped_strings))
print(soup.p.text)

Output:

['bb', 'bcls', 'xiong']
<generator object stripped_strings at 0x000002471D323830>
['Once upon a time there were three little sisters; and their names were', ',', 'Lacie', 'and', 'Tillie', ';\nand they lived at the bottom of a well.']
Once upon a time there were three little sisters; and their names were
,
Lacie and
Tillie;
and they lived at the bottom of a well.

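Node Operations

Children and Descendants

A tag's children include not only tag nodes but also string nodes; the whitespace between tags shows up as strings such as '\n'.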
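
Because whitespace strings are mixed in with the tags, code that walks .children usually has to filter by node type. A minimal sketch on a made-up list fragment:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup("<ul>\n<li>Foo</li>\n<li>Bar</li>\n</ul>", "html.parser")

children = list(soup.ul.children)
print(len(children))  # 5: '\n', <li>, '\n', <li>, '\n'

# Keep only the tag nodes and drop the whitespace strings.
tags = [c for c in children if isinstance(c, Tag)]
print([t.string for t in tags])  # ['Foo', 'Bar']

# The remaining children are string nodes.
strings = [c for c in children if isinstance(c, NavigableString)]
print(strings)  # ['\n', '\n', '\n']
```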

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')

print(soup.p.contents)      # list of child nodes: all direct children of the first <p>

print("======================================================================>")
print(soup.p.children)      # iterator over the child nodes, e.g. <list_iterator object at 0x...>
for i, child in enumerate(soup.p.children):
    print(i, str(child).strip())        # each child is a bs4.element object (Tag or NavigableString)

print("======================================================================>")
print(soup.p.descendants)       # generator over all descendant nodes
for i, child in enumerate(soup.p.descendants):
    print(i, child)

Output:

['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']
======================================================================>
<list_iterator object at 0x000001C2E2AB6EF0>
0 Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 
3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4 and
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 and they lived at the bottom of a well.
======================================================================>
<generator object descendants at 0x000001C2E2AA3830>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9 
            and
            
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
            and they lived at the bottom of a well.

Parent and Ancestor Nodes

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')

print(soup.a.parent)

print("========================================================================>")
print(soup.a.parents)   # ancestor nodes, returned as an iterable (a generator)
for item in soup.a.parents:
    print(item)

Output:

<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
========================================================================>
<generator object parents at 0x000001A078752830>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body>
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body></html>
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body></html>

Sibling Nodes

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')

# next_sibling and previous_sibling are single nodes, so print them directly;
# enumerating a string node would iterate over its characters.
print(repr(soup.a.next_sibling))                    # the next sibling node
print(list(enumerate(soup.a.next_siblings)))        # all following sibling nodes
print(repr(soup.a.previous_sibling))                # the previous sibling node
print(list(enumerate(soup.a.previous_siblings)))    # all preceding sibling nodes

Output:

'\n'
[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, '\n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
'\n            Once upon a time there were three little sisters; and their names were\n            '
[(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]

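Standard Selectors

HTML content search methods provided by the bs4 library

<>.find_all(name, attrs, recursive, text, **kwargs)  # returns a list holding the search results
name: a string matched against tag names
attrs: a string matched against tag attribute values; can be tied to a specific attribute
recursive: whether to search all descendants; defaults to True
text: a string matched against text content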
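
The four parameters can be seen side by side in a short sketch. The HTML fragment is made up for illustration, and the text search uses the string keyword, which is the newer name for the text argument in Beautiful Soup 4.4 and later:

```python
import re

from bs4 import BeautifulSoup

html = '<div id="box"><p class="a">one</p><section><p class="b">two</p></section></div>'
soup = BeautifulSoup(html, "html.parser")
box = soup.find(id="box")

# name: match by tag name; by default all descendants are searched.
all_p = box.find_all("p")
print(len(all_p))  # 2

# recursive=False: look at direct children only.
direct_p = box.find_all("p", recursive=False)
print(len(direct_p))  # 1

# attrs: match by attribute value.
class_b = box.find_all("p", attrs={"class": "b"})
print(class_b)  # [<p class="b">two</p>]

# string: match text nodes; a compiled regular expression is also accepted.
starts_with_t = soup.find_all(string=re.compile("^t"))
print(starts_with_t)  # ['two']
```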

The find methods in detail:

find_all( name , attrs , recursive , text , **kwargs )

Searches the document by tag name, attributes, or text content.

name

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

The attrs argument

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list2 list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

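print(soup.find_all(attrs={'id': 'list-1'}))        # recommended form
print(soup.find_all(id="list-1"))         # keyword-argument form (**kwargs); same result as the line above

print(soup.find_all(attrs={'class': 'list-small'}))
print(soup.find_all(class_="list2"))      # class is a Python keyword, so the keyword argument is class_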

Output:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list2 list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
[<ul class="list2 list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]

text

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))

['Foo', 'Foo']

find( name , attrs , recursive , text , **kwargs )

find returns a single element (the first match, or None if nothing matches); find_all returns all matching elements.

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))

<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<class 'bs4.element.Tag'>
None
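
Since find can return None, chaining an attribute access onto a failed lookup raises AttributeError. A minimal sketch of the usual guard, on a made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Foo</li><li>Bar</li></ul>", "html.parser")

first = soup.find("li")           # a single Tag: the first match
every = soup.find_all("li")       # a list-like result holding every match
missing = soup.find("table")      # None: nothing matched
no_tables = soup.find_all("table")  # an empty list, never None

# Guard before using the result of find().
label = missing.get_text() if missing is not None else "(not found)"

print(first.get_text())                # Foo
print([li.get_text() for li in every])  # ['Foo', 'Bar']
print(label)                           # (not found)
print(len(no_tables))                  # 0
```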

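find_parents() and find_parent()

find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.

find_next_siblings() and find_next_sibling()

find_next_siblings() returns all following sibling nodes; find_next_sibling() returns the first following sibling node.

find_previous_siblings() and find_previous_sibling()

find_previous_siblings() returns all preceding sibling nodes; find_previous_sibling() returns the first preceding sibling node.

find_all_next() and find_next()

find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.

find_all_previous() and find_previous()

find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.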
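
The navigation variants listed above can be tried on a small tree. This is a sketch on a made-up fragment, starting from the middle list item:

```python
from bs4 import BeautifulSoup

html = '<div id="outer"><ul><li id="a">A</li><li id="b">B</li><li id="c">C</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")
a = soup.find(id="a")
b = soup.find(id="b")

# Upwards: direct parent, then ancestors filtered by tag name (nearest first).
parent_name = b.find_parent().name
ancestor_names = [p.name for p in b.find_parents(["ul", "div"])]
print(parent_name, ancestor_names)  # ul ['ul', 'div']

# Sideways: siblings after and before the current node.
next_id = b.find_next_sibling("li")["id"]
prev_id = b.find_previous_sibling("li")["id"]
after_a = [li["id"] for li in a.find_next_siblings("li")]
print(next_id, prev_id, after_a)  # c a ['b', 'c']

# Document order: matching nodes after or before the current node.
following = b.find_next("li")["id"]
preceding = [li["id"] for li in b.find_all_previous("li")]
print(following, preceding)  # c ['a']
```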

CSS Selectors

Pass a CSS selector directly to select() to make the selection.

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-heading">
        <h4>World</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

Output:

[<div class="panel-heading">
<h4>Hello</h4>
</div>, <div class="panel-heading">
<h4>World</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

Getting Attributes

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

list-1
list-1
list-2
list-2

Getting Content

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())

Foo
Bar
Jay
Foo
Bar
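
Beyond the descendant selectors used above, select() accepts other common CSS forms, and select_one() returns only the first match (or None). A brief sketch on a shortened, made-up version of the list markup:

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one: the first match, or None when nothing matches.
first_li = soup.select_one("#list-1 li")
no_table = soup.select_one("table")
print(first_li.get_text(), no_table)  # Foo None

# Tag plus class, child combinator, and attribute-prefix selectors.
small_ids = [ul["id"] for ul in soup.select("ul.list-small")]
child_items = [li.get_text() for li in soup.select("ul > li")]
prefixed_ids = [ul["id"] for ul in soup.select('ul[id^="list-"]')]
print(small_ids)     # ['list-2']
print(child_items)   # ['Foo', 'Bar', 'Jay']
print(prefixed_ids)  # ['list-1', 'list-2']
```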

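Summary:

The lxml parser is recommended; fall back to html.parser when necessary.
Tag selection (for example soup.title) offers only weak filtering, but it is fast.
Use find() and find_all() to match a single result or multiple results.
If you are comfortable with CSS selectors, use select().

Example: A Scraper for Chinese University Rankings

Step 1: fetch the ranking page from the web: getHTMLText()
Step 2: extract the information from the page into a suitable data structure: fillUnivList()
Step 3: display and print the results from that data structure: printUnivList()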

import requests
from bs4 import BeautifulSoup
import bs4
 
 
def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:  # network failure, timeout, or bad HTTP status
        return "error"
 
 
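def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # no ranking table found (for example, the request failed)
        return
    for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):  # skip non-tag nodes such as whitespace strings
            tds = tr('td')  # calling a tag is shorthand for tr.find_all('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])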
 
 
# Fixing the alignment of Chinese text:
# pad with the full-width (CJK) space character, chr(12288)
def printUnivList(ulist, num):
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名", "學校名稱", "總分", chr(12288)))
    for u in ulist[:num]:  # slicing avoids an IndexError when fewer than num rows were scraped
        print(tplt.format(u[0], u[1], u[2], chr(12288)))
 
 
def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)
 
 
if __name__ == '__main__':
    main()
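
The chr(12288) trick used in printUnivList can be checked in isolation. chr(12288) is U+3000, the full-width ideographic space, which occupies the same width as a Chinese character in most monospaced fonts. The sketch below (the sample name is only an illustration) pads the same string once with ASCII spaces and once with full-width spaces:

```python
FULLWIDTH_SPACE = chr(12288)  # U+3000, IDEOGRAPHIC SPACE

name = "清華大學"

# Default fill: ASCII spaces, which are narrower than CJK characters,
# so columns drift when names have different lengths.
ascii_padded = "{0:^10}".format(name)

# Nested field {1} supplies the fill character for the centred field.
fullwidth_padded = "{0:{1}^10}".format(name, FULLWIDTH_SPACE)

print(repr(ascii_padded))
print(repr(fullwidth_padded))
print(len(ascii_padded), len(fullwidth_padded))  # 10 10
```

Both results contain ten characters, but only the second keeps a constant display width when mixed with other Chinese names.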

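Visualizing the scraped data with pyecharts. Note that the code below uses the pyecharts 0.x interface (from pyecharts import Bar); pyecharts 1.0 and later moved the chart classes to pyecharts.charts and changed the API.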

import requests
import bs4
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3472.3 Safari/537.36'}

def getHtmlText(url):
    try:
        ret = requests.get(url , headers=header , timeout=30)
        ret.encoding =  "utf8"
        ret.raise_for_status()
        return ret.text
    except requests.RequestException:  # network failure, timeout, or bad HTTP status
        return None

def fillUnivList(ulist,html):
    soup = BeautifulSoup(html,"lxml")
    for tr in soup.tbody.children:
        if isinstance(tr, bs4.element.Tag):     # only process nodes that are bs4.element.Tag instances
            tds = tr("td")
            # print(tds)
            ulist.append([tds[0].string,tds[1].string,tds[2].string,tds[3].string])


# Fixing the alignment of Chinese text:
# pad with the full-width (CJK) space character, chr(12288)
def printUnivList(ulist, num):
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名", "學校名稱", "總分", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[3], chr(12288)))

# visualize the data with pyecharts
def showData(ulist,num):
    from pyecharts import Bar
    attrs = []
    vals = []
    for i in range(num):
        attrs.append(ulist[i][1])
        vals.append(ulist[i][3])
    bar = Bar("2019中國大學排行榜")
    bar.add(
        "中國大學排行榜",
        attrs,
        vals,
        is_datazoom_show=True,
        datazoom_type="both",
        datazoom_range=[0, 10],
        xaxis_rotate=30,
        xaxis_label_textsize=8,
        is_label_show=True,
    )
    bar.render("2019中國大學排行榜4.html")

def showData_funnel(ulist,num):
    from pyecharts import Funnel
    attrs = []
    vals = []
    for i in range(num):
        attrs.append(ulist[i][1])
        vals.append(ulist[i][3])
    funnel = Funnel(width=1000,height=800)
    funnel.add(
        "大學排行榜",
        attrs,
        vals,
        is_label_show=True,
        label_pos="inside",
        label_text_color="#fff",
    )
    funnel.render("2019中國大學排行榜4.html")

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html'
    html = getHtmlText(url)
    fillUnivList(uinfo, html)
    print(uinfo)
    # showData(uinfo,100)
    showData_funnel(uinfo,20)
    # printUnivList(uinfo, 30)


if __name__ == '__main__':
    main()

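Supplement 1:

Using Python's built-in isinstance function

Syntax: isinstance(object, type)

Purpose: checks whether an object is of a known type.

The first argument (object) is the object to check; the second argument (type) is a type name (int, ...) or a tuple of type names (for example, (int, list, float) is a tuple). The return value is a Boolean (True or False).

It returns True if the object is an instance of the second argument's type (or of a subclass of it). If the second argument is a tuple, it returns True when the object's type matches any one of the types in the tuple.

Here are two examples:

Example 1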

>>> a = 4
>>> isinstance (a,int)
True
>>> isinstance (a,str)
False
>>> isinstance (a,(str,int,list))
True

Example 2

>>> a = "b"
>>> isinstance(a,str)
True
>>> isinstance(a,int)
False
>>> isinstance(a,(int,list,float))
False
>>> isinstance(a,(int,list,float,str))
True
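
One detail the two examples do not show is that isinstance also accepts instances of subclasses, which is why the scraper above can use it to tell Tag nodes apart from other nodes. A short sketch with two hypothetical classes, Animal and Dog:

```python
class Animal:
    """Hypothetical base class for illustration."""


class Dog(Animal):
    """Hypothetical subclass of Animal."""


d = Dog()

print(isinstance(d, Dog))      # True
print(isinstance(d, Animal))   # True: Dog is a subclass of Animal
print(type(d) is Animal)       # False: an exact type check ignores inheritance
print(isinstance(True, int))   # True: bool is a subclass of int
```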

Supplement 2:

Response.raise_for_status()

If a request comes back with an error status (a 4XX client error or a 5XX server error response), we can raise an exception with Response.raise_for_status():

>>> bad_r = requests.get('http://httpbin.org/status/404')
>>> bad_r.status_code
404

>>> bad_r.raise_for_status()
Traceback (most recent call last):
  File "requests/models.py", line 832, in raise_for_status
    raise http_error
requests.exceptions.HTTPError: 404 Client Error
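
The same behaviour can be explored without network access. The sketch below builds Response objects by hand purely for illustration (real code gets them from requests.get), and the helper name status_is_ok is hypothetical:

```python
import requests


def status_is_ok(status_code):
    """Hypothetical helper: True if raise_for_status() accepts this status code."""
    resp = requests.Response()      # hand-built response, no request is sent
    resp.status_code = status_code
    try:
        resp.raise_for_status()     # raises HTTPError for 4XX and 5XX codes
    except requests.HTTPError:
        return False
    return True


print(status_is_ok(200))  # True
print(status_is_ok(404))  # False
print(status_is_ok(500))  # False
```

This is the pattern getHTMLText relies on: a failing status code is turned into an exception that the surrounding try block can handle.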

However, since the status_code of r in our example is 200, calling raise_for_status() gives: