Web Scraping (7): The BeautifulSoup Module

Date: 2019-12-18


1. Introduction to Beautiful Soup

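Beautiful Soup is a Python library for extracting data from HTML or XML files. You load the source to be parsed into a BeautifulSoup object, call that object's methods or attributes to locate the tags you want, and then read the text or attribute values of the located tags.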

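It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work.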

1.1 Installing bs4

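pip install beautifulsoup4 -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com

The -i option points pip at a PyPI mirror; leave it out to use the default index. Several examples below use the 'xml' and 'lxml' parsers, which additionally require the lxml package.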

1.2 Initialization

from bs4 import BeautifulSoup

soup = BeautifulSoup("<html>A Html Text</html>", "html.parser")

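There are two arguments: the first is the HTML text to parse, and the second selects the parser. For HTML, html.parser is the parser that ships with the Python standard library, so it needs no extra installation.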

If an HTML or XML document is malformed, different parsers may return different results.

Pretty-printed output:

soup.prettify()
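As a small illustration of the call above, the sketch below parses a one-line snippet and prints the pretty-printed form. The sample markup is made up for this example, and the exact indentation may vary between Beautiful Soup versions.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hello</p></div>", "html.parser")

# prettify() returns a string; it does not print anything by itself.
pretty = soup.prettify()
print(pretty)
```

Each tag and each string ends up on its own line, which makes nested markup much easier to inspect.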

1.3 Objects

Beautiful Soup turns a complex HTML document into a tree of Python objects. Every node is a Python object, and all of them fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

1.3.1 tag

A Tag object corresponds to a tag in the original XML or HTML document.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>','xml')
tag = soup.b
print(type(tag))# <class 'bs4.element.Tag'>

If no such tag exists, None is returned; if several exist, the first one is returned.

Every tag has a name.

tag = soup.b
print(tag.name)# b

A tag's attributes are stored in a dictionary.

tag = soup.b
print(tag['class'])# boldest
print(tag.attrs)# {'class': 'boldest'}
print(type(tag.attrs))# <class 'dict'>

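Multi-valued attributes: the most common one is class. The value of a multi-valued attribute is returned as a list. This applies only to HTML parsers; when a document is parsed as XML, no attribute is treated as multi-valued.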

soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')

print(soup.p['class'])  # ['body', 'strikeout']
print(soup.p.attrs)     # {'class': ['body', 'strikeout']}

If an attribute looks as if it has several values but is not defined as multi-valued in any version of the HTML standard, Beautiful Soup returns it as a single string.

soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
print(soup.p['id'])    # 'my id'

The text attribute returns all the strings inside a tag, concatenated into one string.
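A minimal sketch of the text attribute, using a made-up snippet in which the paragraph mixes plain strings with a nested tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello, <b>bold</b> world</p>", "html.parser")

# .text joins every string below the tag, including those inside nested tags.
text = soup.p.text
print(text)  # Hello, bold world
```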

1.3.2 NavigableString

Strings usually sit inside tags. Beautiful Soup wraps the string inside a tag in the NavigableString class. A string, however, cannot contain other tags.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'xml')

s = soup.b.string

print(s)        # Extremely bold

print(type(s))  # <class 'bs4.element.NavigableString'>

1.3.3 BeautifulSoup

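A BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a Tag object. It is not a real HTML or XML tag, however, so it has no attributes of its own, and its name attribute has the special value "[document]".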
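The following sketch checks these properties on a tiny, made-up document. It uses html.parser so that no extra tags are added around the paragraph.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>text</p>", "html.parser")

doc_name = soup.name
top_parent = soup.p.parent

print(doc_name)            # [document]
print(top_parent is soup)  # True
print(soup.parent)         # None
```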

1.3.4 Comment

A Comment generally represents a comment in the document.

soup = BeautifulSoup("<b><!--This is a comment--></b>",'xml')

comment = soup.b.string

print(comment)          # This is a comment

print(type(comment))    # <class 'bs4.element.Comment'>

1.4 Traversal

1.4.1 Child nodes

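The contents attribute returns a list of all child nodes, including NavigableString nodes. A line break between tags counts as a NavigableString node and therefore shows up as a child.

A NavigableString node has no contents attribute, because it has no children.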

soup = BeautifulSoup("""<div>
<span>test</span>
</div>
""",'xml')

element = soup.div.contents

print(element)          # ['\n', <span>test</span>, '\n']

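The children attribute is much like contents, except that it returns an iterator over the child nodes rather than a list.

The descendants attribute returns all descendants of a tag, at every depth.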
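To make the difference between children and descendants concrete, here is a short sketch on a made-up snippet written without whitespace between tags, so that no extra string nodes appear.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p><b>one</b></p><i>two</i></div>", "html.parser")
div = soup.div

# children: only the direct children of <div>
child_names = [node.name for node in div.children]

# descendants: every node below <div>, strings included
descendant_tags = [node.name for node in div.descendants if isinstance(node, Tag)]
descendant_count = len(list(div.descendants))

print(child_names)       # ['p', 'i']
print(descendant_tags)   # ['p', 'b', 'i']
print(descendant_count)  # 5 (three tags plus the strings 'one' and 'two')
```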

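If a tag has exactly one child and that child is a NavigableString, the tag's .string attribute returns that child.

If a tag has exactly one child and that child is another tag, .string returns the same result as the child's .string.

If a tag has several children, it is unclear which child .string should refer to, so .string is None.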

soup = BeautifulSoup("""<div>
    <p><span><b>test</b></span></p>
</div>
""",'xml')

element = soup.p.string

print(element)          # test

print(type(element))    # <class 'bs4.element.NavigableString'>

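Take special care here: HTML is usually laid out with line breaks and indentation for readability, but Beautiful Soup treats that whitespace as NavigableString children, which leads to unexpected results. In the example above, changing the line to element = soup.div.string gives None, because the div then has more than one child.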
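The pitfall can be reproduced with the sketch below. It reuses the markup from the example above but parses it with html.parser, and it shows get_text(strip=True) as one possible way to read the text anyway.

```python
from bs4 import BeautifulSoup

html = """<div>
    <p><span><b>test</b></span></p>
</div>"""
soup = BeautifulSoup(html, "html.parser")

# <div> has three children: whitespace, <p>, whitespace. So .string is None.
direct = soup.div.string

# get_text(strip=True) strips each string and drops the empty ones.
stripped = soup.div.get_text(strip=True)

print(direct)    # None
print(stripped)  # test
```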

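If a tag contains several strings, use the strings attribute to get them. To drop whitespace-only strings and strip surrounding whitespace from the rest, use stripped_strings instead.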

soup = BeautifulSoup("""<div>
    <p>      </p>
    <p>test 1</p>
    <p>test 2</p>
</div>
""", 'html.parser')

element = soup.div.stripped_strings

print(list(element))          # ['test 1', 'test 2']

1.4.2 Parent nodes

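The parent attribute returns the parent of an element (a Tag or NavigableString). The parent of the document's top-level node is the BeautifulSoup object, and the parent of the BeautifulSoup object is None.

The parents attribute walks up the tree and yields all ancestors of an element, including the BeautifulSoup object.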
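A brief sketch of both attributes on a made-up document. The names printed are those of the ancestors, from the nearest one up to the BeautifulSoup object.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>x</b></p></body></html>", "html.parser")
b = soup.b

parent_name = b.parent.name
ancestors = [node.name for node in b.parents]

print(parent_name)  # p
print(ancestors)    # ['p', 'body', 'html', '[document]']
print(soup.parent)  # None
```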

1.4.3 Sibling nodes

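next_sibling returns the following sibling and previous_sibling returns the preceding one. Here is an example; watch out for line breaks and indentation, which count as siblings too.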

soup = BeautifulSoup("""<div>
    <p>test 1</p><b>test 2</b><h>test 3</h></div>
""", 'html.parser')

print(soup.b.next_sibling)      # <h>test 3</h>

print(soup.b.previous_sibling)  # <p>test 1</p>

print(soup.h.next_sibling)      # None

next_siblings iterates over all siblings that follow the node.

previous_siblings iterates over all siblings that precede the node.
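The two iterators can be tried out as follows. The markup is a made-up example with no whitespace between tags, so every sibling is a tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>1</p><b>2</b><i>3</i><u>4</u></div>", "html.parser")

# Siblings after <b>, in document order
after_b = [node.name for node in soup.b.next_siblings]

# Siblings before <i>, nearest first
before_i = [node.name for node in soup.i.previous_siblings]

print(after_b)   # ['i', 'u']
print(before_i)  # ['b', 'p']
```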

1.4.4 Going back and forth

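Think of parsing HTML as a sequence of events in which elements are parsed one after another. Beautiful Soup provides attributes that let you replay that parsing order.

The next_element attribute points to the object (Tag or NavigableString) that was parsed immediately afterwards.

The previous_element attribute points to the object that was parsed immediately before.

There are also next_elements and previous_elements iterators, which are not covered in detail here.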
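The difference between next_sibling and next_element is easy to miss, so here is a small sketch on a made-up snippet. The <b> tag has no sibling, yet parsing continued after it.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b></p><i>two</i>", "html.parser")
b = soup.b

sibling = b.next_sibling            # <b> is the only child of <p>
first_next = b.next_element         # the string inside <b> was parsed next
second_next = first_next.next_element

print(sibling)      # None
print(first_next)   # one
print(second_next)  # <i>two</i>
```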

1.5 Searching

1.5.1 Filters

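Before introducing find_all(), here are the kinds of filters, which are used throughout the search API. A filter can be applied to a tag's name, to its attributes, to strings, or to any combination of these.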

html = """
<div>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a></p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

Find all <b> tags:

print(soup.find_all('b'))  # [<b>The Dormouse's story</b>]

Pass a regular expression and the search returns the tags whose names match it. The example below finds every tag whose name starts with b.

import re

print(soup.find_all(re.compile("^b")))  # [<b>The Dormouse's story</b>]

Pass a list and the search returns anything that matches any item in the list. The example below finds all <a> tags and <b> tags.

print(soup.find_all(["a", "b"]))  # [<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

True matches any value. The code below finds every tag in the document but returns no string nodes.

print(soup.find_all(True))
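On a smaller, made-up snippet the effect of True is easier to see. Only tags come back, in document order, and the strings "Hi " and "there" are skipped.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hi <b>there</b></p></div>", "html.parser")

all_tags = soup.find_all(True)
tag_names = [tag.name for tag in all_tags]

print(tag_names)  # ['div', 'p', 'b']
```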

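If no built-in filter fits, you can define a function that takes a single element as its argument and returns True when the element matches.

For example, to return all tags that have a class attribute but no id attribute: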

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))

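The printed result may look wrong at first, because it contains <a> tags that do have an id attribute. In fact the returned list holds only two elements, both <p> tags; the <a> tags show up only because they are children of the second <p> tag.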
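One way to confirm this is to print the names of the matched tags instead of their full markup. The sketch below uses a shortened copy of the sample document, trimmed here for brevity.

```python
from bs4 import BeautifulSoup

html = """
<div>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


matches = soup.find_all(has_class_but_no_id)
matched_names = [tag.name for tag in matches]

print(matched_names)  # ['p', 'p']
```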

1.5.2 find and find_all

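Both methods search the tag nodes beneath the current tag and test each one against the filter. find returns the first match (or None if there is none), and find_all returns a list of all matches.

Syntax: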

find(name=None, attrs={}, recursive=True, text=None, **kwargs)

find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

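Parameters:

name: finds tags whose name is name; string nodes are ignored automatically. All the arguments in the filter examples above were name arguments. The other parameters accept filters as well.

attrs: searches by attribute name and value. Pass a dictionary whose keys are attribute names and whose values are attribute values.

recursive: whether to search all descendants. The default is True; with False only direct children are examined.

text: searches by string. Combined with a tag name, it finds tags whose .string matches the given value; used on its own, it returns the matching strings themselves. It is usually combined with a regular expression. In other words, although the parameter is called text, what is actually compared is the string attribute. In Beautiful Soup 4.4.0 and later the parameter is named string, and text remains as the older name.

limit: the maximum number of results to return.

kwargs: any keyword argument that is not one of the built-in parameter names is treated as a filter on the tag attribute of that name. Because class is a reserved word in Python, write class_ to search by the class attribute.

Some attribute names cannot be used as keyword arguments, for example HTML5 data-* attributes, because they are not valid Python identifiers.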

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')

print(data_soup.find_all(data-foo="value"))

# SyntaxError: keyword can't be an expression

They can, however, be passed through the attrs parameter:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>','xml')

print(data_soup.find_all(attrs={"data-foo": "value"}))

# [<div data-foo="value">foo!</div>]
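The parameters described in this section can be exercised together as in the sketch below. The sample markup is a condensed version of the earlier one, and the code uses the newer string keyword in place of text, which assumes Beautiful Soup 4.4.0 or later.

```python
import re
from bs4 import BeautifulSoup

html = """
<div>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# kwargs: class_ stands in for the reserved word class
by_class = soup.find_all("a", class_="sister")

# attrs: a dictionary of attribute names and values
by_attrs = soup.find_all(attrs={"id": "link2"})

# limit: stop after two matches
limited = soup.find_all("a", limit=2)

# recursive=False: only direct children of <div>, so no <a> is found
direct_only = soup.div.find_all("a", recursive=False)

# string: compare against each tag's .string
by_string = soup.find_all("a", string=re.compile("^L"))

# find: the first match only
first = soup.find("a")

print(len(by_class))                  # 3
print([a["id"] for a in by_attrs])    # ['link2']
print([a["id"] for a in limited])     # ['link1', 'link2']
print(direct_only)                    # []
print([a["id"] for a in by_string])   # ['link2']
print(first["id"])                    # link1
```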

1.6 CSS selectors

Beautiful Soup supports most CSS selectors. The code below demonstrates them directly.

from bs4 import BeautifulSoup

html = """
<html>
<head><title>標題</title></head>
<body>
 <p class="title" name="dromouse"><b>標題</b></p>
 <div name="divlink">
  <p>
   <a href="http://example.com/1" class="sister" id="link1">連接1</a>
   <a href="http://example.com/2" class="sister" id="link2">連接2</a>
   <a href="http://example.com/3" class="sister" id="link3">連接3</a>
  </p>
 </div>
 <p></p>
 <div name='dv2'></div>
</body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')

# Find by tag name
print(soup.select('title'))  # [<title>標題</title>]

# Find by tag name, level by level
print(soup.select("html head title"))  # [<title>標題</title>]

# Find by class
print(soup.select('.sister'))
# [<a class="sister" href="http://example.com/1" id="link1">連接1</a>,
# <a class="sister" href="http://example.com/2" id="link2">連接2</a>,
# <a class="sister" href="http://example.com/3" id="link3">連接3</a>]


# Find by id
print(soup.select('#link1, #link2'))
# [<a class="sister" href="http://example.com/1" id="link1">連接1</a>,
# <a class="sister" href="http://example.com/2" id="link2">連接2</a>]


# Combined selectors
print(soup.select('p #link1'))# [<a class="sister" href="http://example.com/1" id="link1">連接1</a>]


# Find direct children
print(soup.select("head > title"))# [<title>標題</title>]

print(soup.select("p > #link1"))# [<a class="sister" href="http://example.com/1" id="link1">連接1</a>]

print(soup.select("p > a:nth-of-type(2)")) # [<a class="sister" href="http://example.com/2" id="link2">連接2</a>]
# nth-of-type is a CSS pseudo-class


# Find sibling nodes (searching forward)
print(soup.select("#link1 ~ .sister"))
# [<a class="sister" href="http://example.com/2" id="link2">連接2</a>,
# <a class="sister" href="http://example.com/3" id="link3">連接3</a>]

print(soup.select("#link1 + .sister"))
# [<a class="sister" href="http://example.com/2" id="link2">連接2</a>]


# Find by attribute value
print(soup.select('a[href="http://example.com/1"]'))

# ^= matches values that start with the given text
print(soup.select('a[href^="http://example.com/"]'))

# *= matches values that contain the given text
print(soup.select('a[href*=".com/"]'))

# Find tags that have the given attribute
print(soup.select('[name]'))

# Find the first matching element
print(soup.select_one(".sister"))
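In practice the selected tags are usually only a stepping stone to their attributes or text. The sketch below shows that step on a shortened, made-up version of the sample document, parsed with html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

html = """
<div name="divlink">
  <p>
    <a href="http://example.com/1" class="sister" id="link1">Link 1</a>
    <a href="http://example.com/2" class="sister" id="link2">Link 2</a>
  </p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns a list of tags, so attributes can be read in a comprehension
hrefs = [a["href"] for a in soup.select("div > p > a.sister")]

# select_one() returns a single tag, or None when nothing matches
first_text = soup.select_one("#link1").get_text()
missing = soup.select_one("#no-such-id")

print(hrefs)       # ['http://example.com/1', 'http://example.com/2']
print(first_text)  # Link 1
print(missing)     # None
```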

1.7 Example

Let's scrape the text of Romance of the Three Kingdoms. I'm even toying with the idea of scraping novels from fiction sites.

import requests
from bs4 import BeautifulSoup

url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
page_text = requests.get(url=url,headers=headers).text

soup = BeautifulSoup(page_text,'lxml')

a_list = soup.select('.book-mulu > ul > li > a')

# The with block closes the file even if a request fails part-way through
with open('sanguo.txt','w',encoding='utf-8') as fp:
    for a in a_list:
        title = a.string
        detail_url = 'http://www.shicimingju.com'+a['href']
        detail_page_text = requests.get(url=detail_url,headers=headers).text

        # Parse the chapter page separately from the index page
        detail_soup = BeautifulSoup(detail_page_text,'lxml')
        content = detail_soup.find('div',class_='chapter_content').text

        fp.write(title+'\n'+content)
        print(title,'downloaded')
print('over')
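The link-extraction step can be tried without touching the network. The sketch below is an assumption-laden illustration rather than the original script: extract_chapter_links is a hypothetical helper, the sample HTML and its href values are invented to mimic the .book-mulu structure that the selector above expects, and urljoin replaces plain string concatenation.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "http://www.shicimingju.com"


def extract_chapter_links(index_html, base_url=BASE_URL):
    """Return (title, absolute_url) pairs for each chapter link (hypothetical helper)."""
    soup = BeautifulSoup(index_html, "html.parser")
    links = []
    for a in soup.select(".book-mulu > ul > li > a"):
        href = a.get("href")
        if not href:
            continue  # skip anchors that carry no link
        links.append((a.get_text(strip=True), urljoin(base_url, href)))
    return links


# Invented sample markup; the real page may differ.
sample_index = """
<div class="book-mulu"><ul>
  <li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li>
  <li><a href="/book/sanguoyanyi/2.html">Chapter 2</a></li>
  <li><a>No link</a></li>
</ul></div>
"""

chapters = extract_chapter_links(sample_index)
for title, url in chapters:
    print(title, url)
```

Keeping the parsing in a function that takes an HTML string makes it possible to test the selector against saved pages before running the full download loop.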
