python爬蟲知識點總結（六）BeautifulSoup庫詳解

時間 2020-05-21

標籤 python 爬蟲知識總結 beautifulsoup 詳解欄目 Python 简体版

原文原文鏈接

官方學習文檔：https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/html

1、什麼時BeautifulSoup？html5

答：靈活又方便的網頁解析庫，處理搞笑，支持多種解析器。正則表達式

　　利用它不用編寫正則表達式便可方便地實現網頁信息的提取。瀏覽器

2、安裝學習

pip3 install bewautifulsoup4

3、用法講解spa

解析器	使用方法	優點	劣勢
Py't'hon標準庫	BeautifulSoup(markup,"html.parser")	Python的內置標準庫、執行速度適中、文檔容錯額能力強	Python2.7 or 3.2。2 前的版本中文容錯額能力差
lxml HTML解析器	BeautifulSoup(markup,"lxml")	速度快、文檔容錯能力強	須要安裝C語言庫
lxml XML解析器	BeautifulSoup(markup,"xml")	速度快、惟一支持XML的解析器	須要安裝C語言庫
html5lib	BeautifulSoup(markup,"html5lib")	最好的容錯性、以瀏覽器的方式解析文檔、生成HTML5格式的文檔	速度慢、不依賴外部擴展

4、基本使用code

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())
print(soup.title.string)

5、標籤選擇器orm

lxml解析庫xml

一、選擇元素htm

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title)
print(soup.title.string)
print(type(soup.title))
print(soup.href)
print(soup.p)

二、獲取名稱

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title.name)

三、獲取屬性

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])

四、獲取內容

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.string)

五、嵌套選擇

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.head.title.string)

六、子節點和子孫節點

.contents能夠獲取標籤的子節點

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>

<p class="story">
Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>
and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)# .contents能夠獲取標籤的子節點

.children是一個迭代器,以換行符分隔,獲取全部的子節點

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>

<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.children) # .children是一個迭代器,以換行符分隔,獲取全部的子節點
for i,child in enumerate(soup.p.children):
    print(i,child)

.descendants,以換行符分隔，獲取全部的子孫節點

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>

<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants) # .descendants,以換行符分隔，獲取全部的子孫節點
for i,child in enumerate(soup.p.descendants):
    print(i,child)

七、父節點和祖先節點

.parent,獲取父節點

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>

<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.a.parent) # .parent,獲取父節點

.parents,獲取祖先節點

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>

<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.parents))) # .parents,獲取祖先節點

　
八、兄弟節點

.next_siblings,獲取後面的兄弟節點

.previous_siblings,獲取後面的兄弟節點

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>

<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsle" class="sister" id="link1"><!--Elsle--></a>,
<a href="http://example.com/elsle" class="sister" id="link1">Lacle</a>and
<a href="http://example.com/elsle" class="sister" id="link1">Title</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.next_siblings))) # .next_siblings,獲取後面的兄弟節點
print(list(enumerate(soup.a.previous_siblings))) # .previous_siblings,獲取後面的兄弟節點

標籤選擇器

一、find_all(name,attrs,recursive,text,kwargs）**

可根據標籤名、屬性、內容查找文檔

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

attrs

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

text

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text='Foo')) # text方法適用於文本匹配，不適用於標籤查找

二、find(name.attrs,recursive,text,**kwargs)

find返回單個元素，find_all返回全部元素

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find('ul')) 
print(type(soup.find('ul')))
print(soup.find('page'))

三、其餘

find_parents()和 find_parent

find_parents()返回全部祖先節點，find_parent()返回直接父節點

find_next_siblings()和 find_next_siblings()

find_next_siblings()返回後面全部兄弟結點， find_next_siblings()返回後面第一個兄弟結點

find_previous_siblings()和find_previous_sibling()

find_previous_siblings()返回前面全部修兄弟節點，find_previous_sibling()返回前面第一個兄弟節點

find_all_next()和find_next()

find_all_next()返回節點後面全部符合條件的結點，find_next()返回第一個符合條件的結點

find_all_previous()和find_previous()

find_all_previous()返回結點前面全部符合條件的結點，find_previous()返回第一個符合條件的結點

CSS選擇器

經過select()直接傳入CSS選擇器便可完成選擇

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading')) # panel前面的.表明class屬性
print(soup.select('ul li')) #ul li表示ul屬性內的li屬性，嵌套選擇
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

一、獲取屬性

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

二、獲取內容

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    print(li.get_text())