Web Scraping with BeautifulSoup and CSS

Date: 2019-11-11

Tags: web scraping, beautifulsoup, css. Category: web crawlers


1. Introduction to Beautiful Soup

2. Installing Beautiful Soup

You can install it with pip or easy_install; either of the following two commands works:

easy_install beautifulsoup4

pip install beautifulsoup4

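Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, Beautiful Soup falls back to Python's built-in parser. The lxml parser is more powerful and faster, so installing it is recommended.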

Python standard library: BeautifulSoup(markup, "html.parser")

lxml HTML parser: BeautifulSoup(markup, "lxml")
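As a rough illustration of choosing between the two parsers listed above, the sketch below tries lxml and falls back to the standard-library parser when lxml is not installed. The markup string and the variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
markup = "<p>Hello, <b>world</b></p>"

# The standard-library parser needs no extra installation.
soup = BeautifulSoup(markup, "html.parser")
print(soup.b.string)

# lxml is optional: use it when available, otherwise fall back.
try:
    import lxml  # noqa: F401  (imported only to check availability)
    parser = "lxml"
except ImportError:
    parser = "html.parser"

soup_fast = BeautifulSoup(markup, parser)
print(parser, soup_fast.b.string)
```

Naming the parser explicitly keeps the result the same across machines, since different parsers can build slightly different trees from malformed HTML.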

4. Creating a Beautiful Soup object

First, import the bs4 library: from bs4 import BeautifulSoup

We create a string that the later examples will use for demonstration:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

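Create the BeautifulSoup object: soup = BeautifulSoup(html)

You can also create the object from a local HTML file, for example: soup = BeautifulSoup(open('index.html'))

This line opens the local file index.html and uses it to create the soup object.

Now let's print the contents of the soup object with formatted output: print(soup.prettify())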
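The two ways of building a soup object described above can be sketched as follows. This is a minimal example under assumptions: the document string is invented, the parser is named explicitly as "html.parser", and the file is written to a temporary directory so the snippet does not depend on an existing index.html.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical document, used only for this illustration.
doc = ("<html><head><title>The Dormouse's story</title></head>"
       "<body><p>Hi</p></body></html>")

# 1) Build the soup from a string.
soup = BeautifulSoup(doc, "html.parser")

# 2) Build the soup from an open file handle.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.html")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(doc)
    # The file is read while the handle is still open.
    with open(path, encoding="utf-8") as fh:
        file_soup = BeautifulSoup(fh, "html.parser")

# prettify() returns the tree as an indented string.
print(soup.prettify())
print(file_soup.title.string)
```

Opening the file in a with block closes the handle as soon as the tree is built, which a bare open('index.html') does not guarantee.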

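5. The four kinds of objects

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: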

Tag
NavigableString
BeautifulSoup
Comment
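To make the four kinds concrete, here is a small sketch that parses a one-line fragment and inspects the type of each node. The fragment is made up for illustration, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical fragment: one tag holding a text node and a comment.
soup = BeautifulSoup('<p class="x">text<!-- note --></p>', "html.parser")

p = soup.p
children = list(p.children)  # the text node, then the comment node

print(type(soup).__name__)         # BeautifulSoup
print(type(p).__name__)            # Tag
print(type(children[0]).__name__)  # NavigableString
print(type(children[1]).__name__)  # Comment

# Comment is a subclass of NavigableString, so check for it explicitly
# when comments must be told apart from ordinary text.
print(isinstance(children[1], NavigableString))  # True
```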

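What is a Tag? Put simply, it is one of the tags in the HTML, for example: <title>The Dormouse's story</title>; <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

HTML tags such as title and a above, together with the content they enclose, are Tags. Let's see how Beautiful Soup makes it easy to get at Tags.

A Tag has two important attributes, name and attrs. Let's look at each in turn.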

print(soup.name)
print(soup.head.name)
# [document]
# head
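The text mentions attrs alongside name; the sketch below shows both on a single tag. The one-line markup is a hypothetical sample modeled on the article's links.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, used only for this illustration.
soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    "html.parser",
)
a = soup.a

print(soup.name)       # [document]
print(a.name)          # a
print(a.attrs)         # a dict of all attributes on the tag
print(a["id"])         # link1
print(a["class"])      # ['sister'] (class is multi-valued, so it is a list)
print(a.get("title"))  # None (get() avoids a KeyError for a missing attribute)
```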

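7. Searching the document tree

(1) find_all( name , attrs , recursive , text , **kwargs )

(2) find( name , attrs , recursive , text , **kwargs )

The only difference from find_all() is that find_all() returns a list, even when only one element matches, whereas find() returns the first matching result directly (or None when nothing matches).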
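A short sketch of that difference, using an invented two-link fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two links.
soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a><a id="link2">Lacie</a></p>',
    "html.parser",
)

all_links = soup.find_all("a")        # a list of every match
first_link = soup.find("a")           # only the first match, as a Tag

missing_all = soup.find_all("table")  # empty list when nothing matches
missing_one = soup.find("table")      # None when nothing matches

print(len(all_links), first_link["id"])
print(missing_all, missing_one)
```

Because find() can return None, code that chains attribute access on its result should check for None first.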

find_all( name , attrs , recursive , text , **kwargs )

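find_all() searches all tag descendants of the current tag and checks whether each one matches the filter conditions.

1) The name parameter

The name parameter finds all tags whose name is name; string objects (text nodes) are automatically ignored.

A. Passing a string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for tags whose name matches that string exactly. The following example finds all <b> tags in the document: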

soup.find_all('b')

# [<b>The Dormouse's story</b>]

print(soup.find_all('a'))

#[<a class="sister" href="http://example.com/elsie" id="link1"></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

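B. Passing a regular expression

If you pass a regular expression, Beautiful Soup filters tag names with the pattern's search() method. The following example finds every tag whose name starts with b, so both the <body> and <b> tags should be found: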

import re

for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

# body

# b
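To show why the pattern above is anchored with ^, the following sketch compares an anchored and an unanchored pattern on a tiny made-up document. It assumes the pattern is applied to tag names with search(), so an unanchored pattern matches anywhere in the name.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical document, used only for this illustration.
soup = BeautifulSoup(
    "<html><head><title>T</title></head><body><b>bold</b></body></html>",
    "html.parser",
)

# Anchored: only tag names that start with "b".
anchored = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored: any tag name that contains "t".
unanchored = [tag.name for tag in soup.find_all(re.compile("t"))]

print(anchored)    # ['body', 'b']
print(unanchored)  # ['html', 'title']
```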

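C. Passing a list

If you pass a list, Beautiful Soup returns anything that matches any item in the list. The following code finds all <a> tags and <b> tags in the document: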

soup.find_all(["a", "b"])

# [<b>The Dormouse's story</b>,

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

D. Passing True

True matches any value. The following code finds every tag but returns no string nodes:

for tag in soup.find_all(True):
    print(tag.name)

# html

# head

# title

# body

# p

# b

# p

# a
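The list and True filters can be tried side by side on a smaller fragment. The markup below is a hypothetical cut-down version of the sample document.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two paragraphs, one bold tag, one link.
soup = BeautifulSoup(
    '<p class="title"><b>Story</b></p><p><a href="#">Elsie</a> tail text</p>',
    "html.parser",
)

# A list matches tags whose name equals any item in the list.
a_or_b = [tag.name for tag in soup.find_all(["a", "b"])]

# True matches every tag; text nodes such as " tail text" are skipped.
every_tag = [tag.name for tag in soup.find_all(True)]

print(a_or_b)     # ['b', 'a']
print(every_tag)  # ['p', 'b', 'p', 'a']
```

Results come back in document order, not in the order of the names in the list.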

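E. Passing a function

If none of the other filters fits, you can define a function that takes a single element as its argument. It should return True if the element matches and should be included, and False otherwise.

The following function checks the current element and returns True if it has a class attribute but no id attribute:

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

Pass this function to find_all() and you get all the <p> tags: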

soup.find_all(has_class_but_no_id)

# [<p class="title"><b>The Dormouse's story</b></p>,
# <p class="story">Once upon a time there were...</p>,
# <p class="story">...</p>]
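A self-contained run of the function filter, on invented markup that includes one tag of each case (class only, class and id, id only, neither):

```python
from bs4 import BeautifulSoup

# Hypothetical markup covering each combination of class and id.
html = (
    '<p class="title"><b>Story</b></p>'
    '<p class="story">Once <a class="sister" id="link1">Elsie</a></p>'
    '<p id="only-id">no class here</p>'
)
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    """Return True for tags that carry a class attribute but no id."""
    return tag.has_attr("class") and not tag.has_attr("id")


matches = soup.find_all(has_class_but_no_id)
summary = [(tag.name, tag["class"]) for tag in matches]
print(summary)  # [('p', ['title']), ('p', ['story'])]
```

The <b> tag is skipped because it has no class, the <a> tag because it also has an id, and the last <p> because it has only an id.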

         
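2) Keyword arguments

Note: any named argument that is not one of the built-in search parameter names is treated as a filter on the tag attribute of that name. If you pass an argument named id, Beautiful Soup searches each tag's "id" attribute.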

 
soup.find_all(id='link2')

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass an href argument, Beautiful Soup searches each tag's "href" attribute.

 
soup.find_all(href=re.compile("elsie"))

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Using several named arguments filters on several tag attributes at once:

 
soup.find_all(href=re.compile("elsie"), id='link1')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Here we want to filter by class, but class is a Python keyword. What to do? Append an underscore:

soup.find_all("a", class_="sister")

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
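The keyword-argument filters above can be run together on a trimmed, hypothetical copy of the three links. The helper ids() exists only to keep the printed output short.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup modeled on the article's three links.
html = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")


def ids(tags):
    """Illustrative helper: reduce a result list to its id values."""
    return [tag["id"] for tag in tags]


by_id = ids(soup.find_all(id="link2"))
by_href = ids(soup.find_all(href=re.compile("elsie")))
by_both = ids(soup.find_all(href=re.compile("elsie"), id="link1"))
by_class = ids(soup.find_all("a", class_="sister"))

print(by_id)     # ['link2']
print(by_href)   # ['link1']
print(by_both)   # ['link1']
print(by_class)  # ['link1', 'link2', 'link3']
```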

Some tag attributes cannot be used as keyword arguments in a search, such as the data-* attributes in HTML5:

 
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')

data_soup.find_all(data-foo="value")

# SyntaxError: keyword can't be an expression

You can, however, search for tags with such attributes by passing a dictionary through the attrs parameter of find_all():

 
data_soup.find_all(attrs={"data-foo": "value"})

# [<div data-foo="value">foo!</div>]
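A runnable version of the attrs workaround, with a second, non-matching div added (an assumption for this sketch) to show that the value is actually compared:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two divs with different data-foo values.
data_soup = BeautifulSoup(
    '<div data-foo="value">foo!</div><div data-foo="other">bar</div>',
    "html.parser",
)

# A hyphenated name is not a valid Python keyword argument,
# so the filter goes into the attrs dictionary instead.
found = data_soup.find_all(attrs={"data-foo": "value"})

print(len(found), found[0].string)  # 1 foo!
```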

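3) The text parameter

The text parameter searches the string content of the document. Like the name parameter, text accepts a string, a regular expression, a list, or True. (Beautiful Soup 4.4.0 and later call this parameter string; text is the older name.)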

soup.find_all(text="Elsie")

# ['Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])

# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(text=re.compile("Dormouse"))

# ["The Dormouse's story", "The Dormouse's story"]
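The same three searches can be sketched against a self-contained document. Two assumptions apply: the markup is a hypothetical sample in which the first link contains the text Elsie, and the parameter is written as string, the name used by Beautiful Soup 4.4.0 and later for what this article calls text.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample in which every link has visible text.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# str() turns each NavigableString into a plain Python string.
exact = [str(s) for s in soup.find_all(string="Elsie")]
any_of = [str(s) for s in soup.find_all(string=["Tillie", "Elsie", "Lacie"])]
by_regex = [str(s) for s in soup.find_all(string=re.compile("Dormouse"))]

print(exact)     # ['Elsie']
print(any_of)    # ['Elsie', 'Lacie', 'Tillie']
print(by_regex)  # ["The Dormouse's story", "The Dormouse's story"]
```

The list filter returns matches in document order, which is why Elsie comes first even though Tillie is listed first.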

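4) The limit parameter

find_all() returns every search result, so a search over a large document tree can be slow. If you don't need all the results, use the limit parameter to cap how many are returned. It works like the LIMIT keyword in SQL: once the number of results reaches the limit, the search stops and returns what it has found.

Three tags in the document tree match the search, but only two are returned because we limited the count: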

 
soup.find_all("a", limit=2)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
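A minimal check of limit on a hypothetical three-link fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with three matching tags.
soup = BeautifulSoup(
    '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>',
    "html.parser",
)

unlimited = [tag["id"] for tag in soup.find_all("a")]
first_two = [tag["id"] for tag in soup.find_all("a", limit=2)]

print(unlimited)  # ['link1', 'link2', 'link3']
print(first_two)  # ['link1', 'link2']
```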

5) The recursive parameter

When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.

A simple document:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...

Search results with and without the recursive parameter:
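One way to see the effect is to search for the title tag from the html tag, with and without recursive=False. This sketch uses a compact copy of the simple document; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Compact version of the simple document: <title> is a grandchild of <html>.
soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

# Default: all descendants of <html> are searched, so <title> is found.
deep = soup.html.find_all("title")

# recursive=False: only direct children of <html> are checked.
# <title> sits one level deeper, inside <head>, so nothing matches.
shallow = soup.html.find_all("title", recursive=False)

print(deep)     # [<title>The Dormouse's story</title>]
print(shallow)  # []
```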