Python Web Scraping, 1.3: Parsers in the bs4 Library

Date: 2019-11-30

Original article: https://www.fkomm.cn/article/2018/7/20/18.html

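The bs4 library can quickly locate the elements we want because it first parses the whole HTML document, and different parsers give different results. This article goes through them one by one.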

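Choosing a bs4 parser

The end goal of a web crawler is to filter and select information from the web, and the parser is arguably the most important part of that. The quality of the parser determines the crawler's speed and efficiency. Besides the 'html.parser' parser used in the previous article, bs4 also supports a number of third-party parsers, which we compare below.

The bs4 documentation recommends the lxml parser because it is more efficient, so we will use lxml as well.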

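Installing the lxml parser:

As before, we install it with pip:

$ pip install lxml

Note: I use a Unix-like system, where pip is very convenient. Installing on Windows tends to run into one problem or another, so Windows users are advised to go to the official lxml site and download the installer package that matches their system version.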
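
If lxml cannot be installed on a given machine, one option is to fall back to the built-in 'html.parser'. The following is a minimal sketch and not part of the original article: the helper name `make_soup` is hypothetical, and it assumes that bs4 raises `FeatureNotFound` when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper used only for illustration.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed: use the one bundled with Python.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='title'><b>The Dormouse's story</b></p>")
print(soup.b.string)
```

Because the two parsers can build slightly different trees for broken markup, a fallback like this is best kept for well-formed pages.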

Parsing a web page with the lxml parser

We again use the Alice document from the previous article as the example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Let's try it:

import bs4

# First, parse the HTML file with lxml to make the "soup".
# The with-statement closes the file handle once parsing is done.
with open('Beautiful Soup 爬蟲/demo.html') as f:
    soup = bs4.BeautifulSoup(f, 'lxml')

# Print the result; it is a clearly indented tree structure.
# print(soup.prettify())

'''
OUT:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
'''

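How do we actually use it?

bs4 first converts the string or file handle it receives to Unicode, so scraping Chinese text does not lead to troublesome encoding problems. For less common encodings such as 'big5', however, we need to set the encoding by hand: soup = BeautifulSoup(markup, from_encoding="encoding name")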
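
As a small illustration of the manual setting, here is a sketch that builds Big5-encoded bytes in memory and parses them. The sample markup is made up for the demo, and 'html.parser' is used only so that the snippet has no extra dependency; 'lxml' could be passed instead.

```python
from bs4 import BeautifulSoup

# Bytes as they might arrive from a Big5-encoded page (constructed here for the demo).
raw = "<html><body><p>中文網頁</p></body></html>".encode("big5")

# from_encoding only applies to bytes input; bs4 ignores it for an already-decoded str.
soup = BeautifulSoup(raw, "html.parser", from_encoding="big5")

text = soup.p.string
print(text)
```

The point of the design is that decoding happens once, at parse time, so everything read from the soup afterwards is ordinary Unicode text.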

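Kinds of objects:

bs4 turns a complex HTML document into a complex tree in which every node is a Python object. All of these objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment. Let's go through them one by one:

Tag: essentially the same as a tag in HTML, and easy to get started with.

NavigableString: the string wrapped inside a tag.

BeautifulSoup: represents the entire content of a document. Most of the time it can be treated as a Tag object, and it supports the methods for navigating and searching the tree.

Comment: a special kind of NavigableString. When it appears in an HTML document it is output with special formatting, as with HTML comments.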
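
To see the four types side by side, the following sketch parses a tiny made-up fragment (not the Alice document) and checks the type of each node. It assumes only the classes that bs4 exports under these names.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

markup = "<p class='title'><b>Hello</b><!-- a note --></p>"
soup = BeautifulSoup(markup, "html.parser")

b_tag = soup.b                # a Tag
text = b_tag.string           # the NavigableString inside <b>
comment = soup.p.contents[1]  # the Comment node that follows <b>

for node in (soup, b_tag, text, comment):
    print(type(node).__name__)
```

Since Comment subclasses NavigableString, code that wants only visible text usually has to filter comments out explicitly with an isinstance check.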

The simplest way to search the tree is by the name of the tag you want:

soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>

You can also go deeper to reach a smaller tag. For example, to find the part wrapped in a b tag under body:

soup.body.b
# <b>The Dormouse's story</b>

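However, this approach only finds the first matching tag in document order.

What about getting all the tags?

For that we need the find_all() method, which returns a list: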

tag = soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
# Suppose we want the second of the a tags:
need = tag[1]
# Simple, isn't it?
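
The difference between attribute access and find_all() can be checked with a self-contained sketch. It reuses one paragraph of the Alice document and uses 'html.parser' only to avoid an extra dependency; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.a                          # attribute access: the first <a> only
links = soup.find_all("a")              # every <a>, in document order
second = links[1]                       # index into the result like a list
hrefs = [a["href"] for a in links]      # read an attribute from each tag
by_id = soup.find_all("a", id="link3")  # keyword arguments filter on attributes

print(first["id"], second.string, len(links))
print(hrefs)
```

find_all() always returns a list-like result, even for a single match, so by_id above still has to be indexed.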

A tag's .contents attribute returns the tag's children as a list:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
print(title_tag)
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]

In addition, a tag's .children generator lets you loop over its children:

for child in title_tag.children:
    print(child)
    # The Dormouse's story

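This only iterates over direct children. How do we iterate over all descendants? For example, the child of head is the title tag, and title has a child of its own: the string 'The Dormouse's story'. That string is also called a descendant of head.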

for child in head_tag.descendants:
    print(child)
    # <title>The Dormouse's story</title>
    # The Dormouse's story
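
The contrast between children and descendants can be made concrete by counting them. This is a minimal sketch on a trimmed copy of the document's head, again with 'html.parser' chosen only for portability.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head><body></body></html>",
    "html.parser",
)
head_tag = soup.head

children = list(head_tag.children)        # direct children only
descendants = list(head_tag.descendants)  # children, grandchildren, and so on

print(len(children))     # 1: the <title> tag
print(len(descendants))  # 2: the <title> tag and the string inside it
```

Both attributes are generators, so wrapping them in list() is only needed when a length or an index is required.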

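How do we get all the text under a tag?

If the tag has only one child and that child is a NavigableString, tag.string returns it directly.
If the tag has many children and descendants, each with its own string, we can iterate over all of them: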

for string in soup.strings:
    print(repr(string))
    # "The Dormouse's story"
    # '\n\n'
    # "The Dormouse's story"
    # '\n\n'
    # 'Once upon a time there were three little sisters; and their names were\n'
    # 'Elsie'
    # ',\n'
    # 'Lacie'
    # ' and\n'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '\n\n'
    # '...'
    # '\n'
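
The two cases can be compared directly. The sketch below uses a shortened, made-up paragraph and also shows .stripped_strings and get_text(), two bs4 conveniences that the article does not cover, for trimming whitespace and joining the pieces.

```python
from bs4 import BeautifulSoup

markup = (
    '<p class="story">Once upon a time '
    '<a id="link1">Elsie</a> and <a id="link2">Lacie</a>.</p>'
)
soup = BeautifulSoup(markup, "html.parser")

single = soup.a.string   # <a> has exactly one string child
multiple = soup.p.string # <p> has several children, so this is None

pieces = list(soup.p.stripped_strings)  # every string, whitespace trimmed
joined = soup.p.get_text()              # all strings concatenated

print(single, multiple)
print(pieces)
print(joined)
```

Checking .string for None before using it avoids a common error when a tag turns out to contain more than plain text.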

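That covers the basic use of the bs4 library for now. The remaining topics (parent nodes, sibling nodes, and moving backward and forward through the document) work much like finding elements from child nodes as shown above.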
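
For readers who want a first look at those remaining topics, here is a brief sketch on a two-link fragment made up for the purpose. It uses only standard bs4 attributes and methods: .parent, .next_sibling, find_next_sibling(), find_previous_sibling(), .next_element and .previous_element.

```python
from bs4 import BeautifulSoup

markup = '<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(markup, "html.parser")

first = soup.find("a", id="link1")
second = soup.find("a", id="link2")

parent_name = first.parent.name                # the enclosing tag
after_first = first.next_sibling               # the text node between the links
next_link = first.find_next_sibling("a")       # skip text, go to the next <a>
prev_link = second.find_previous_sibling("a")  # the same idea, backward
inside = first.next_element                    # forward in parse order: the string in <a>
before = first.previous_element                # backward in parse order: the <p> tag

print(parent_name, repr(after_first), next_link["id"], prev_link["id"])
print(inside, before.name)
```

Note that .next_sibling returns whatever node comes next, including plain text, which is why find_next_sibling("a") is the safer way to reach the next link.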