Web Scraping Series, Chapter 2: The BeautifulSoup and XPath Modules

Date: 2019-11-12

BeautifulSoup

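1. Introduction

In short, Beautiful Soup is a Python library whose main job is extracting data from web pages. The official description reads:

'''
Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree.
It is a toolkit that parses a document and hands you the data you want to extract. Because it is simple, a complete application takes very little code.
'''

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document, and it can save hours or even days of work. Beautiful Soup 3 is no longer being developed; the project recommends Beautiful Soup 4 for current work.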

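Installation

pip3 install beautifulsoup4

Parsers

Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. If no third-party parser is installed, Beautiful Soup falls back to Python's built-in html.parser. The lxml parser is more capable and faster, so installing it is recommended.

pip3 install lxml

Another option is html5lib, a pure-Python parser that parses pages the same way a web browser does. Install it with:

pip install html5lib

For a comparison of the parsers, see the official documentation.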
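As a rough illustration of choosing a parser, the sketch below prefers lxml and falls back to the standard library's html.parser when lxml is not installed. The helper name make_soup and the sample markup are hypothetical, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup


def make_soup(markup):
    """Parse markup with lxml when available, otherwise with html.parser."""
    try:
        import lxml  # noqa: F401  (imported only to check that it is installed)
        parser = "lxml"
    except ImportError:
        parser = "html.parser"
    return BeautifulSoup(markup, parser)


demo = make_soup("<p>Hello, <b>world</b></p>")
print(demo.b.get_text())
```

Naming the parser explicitly keeps results consistent across machines, since different parsers can build different trees from the same invalid markup.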

Basic usage

The HTML below is used as the running example throughout. It is an excerpt from Alice's Adventures in Wonderland (referred to below as the "Alice" document):

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Parsing this markup with BeautifulSoup gives you a BeautifulSoup object:

# Import BeautifulSoup
from bs4 import BeautifulSoup
# Two ways to build a BeautifulSoup object
soup=BeautifulSoup(html_doc,'lxml')  # from a string: the markup first, then the parser name
# soup=BeautifulSoup(open('a.html','r',encoding='utf8'),'lxml') # from an open file handle (a.html must exist)

Find the links of all <a> tags in the document:

for link in soup.find_all('a'):
    print(link.get('href'))

Extract all the text from the document:

print(soup.get_text())

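2. Tag objects

Put simply, a Tag is one of the individual tags in the HTML. A Tag object corresponds to a tag in the original XML or HTML document:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')  # name the parser to avoid a warning
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

Tag names

Using the soup object built from the Alice html_doc again: the simplest way to navigate the parse tree is to name the tag you want. To get the <head> tag, write soup.head: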

soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>

You can use this trick repeatedly to zoom in on part of the tree. This code gets the first <b> tag inside <body>:

soup.body.b
# <b>The Dormouse's story</b>

Accessing a tag as an attribute returns only the first tag with that name:

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

To get all the <a> tags, or anything more than the first tag with a given name, use one of the methods described in Searching the tree, such as find_all():

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Writing soup followed by a tag name is a quick way to reach a tag, but remember that it returns only the first matching tag in the whole document.

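The name and attributes of a Tag

Tags have many methods and attributes. The two most important are the name and the attributes.

Every tag has a name, available as .name:

tag.name
# 'b'

A tag's attributes are accessed the way dictionary keys are, and the whole dictionary is available as .attrs. Because class is a multi-valued attribute, Beautiful Soup 4 returns its value as a list:

tag['class']
# ['boldest']

tag.attrs
# {'class': ['boldest']}

A tag's attributes can be added, removed, or modified. Again, this works exactly as it does for a dictionary: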

tag['class'] = 'verybold'
tag['id'] = 1
tag
# <b class="verybold" id="1">Extremely bold</b>

del tag['class']
del tag['id']
tag
# <b>Extremely bold</b>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
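The dictionary-style access described above can be sketched on a small, hypothetical snippet (the markup and variable names are illustrative only). It also shows that class comes back as a list while id comes back as a plain string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; html.parser is the standard-library parser.
demo = BeautifulSoup('<p class="body strikeout" id="intro">text</p>', "html.parser")
p = demo.p

classes = p["class"]             # multi-valued attribute: ['body', 'strikeout']
ident = p["id"]                  # single-valued attribute: 'intro'
missing = p.get("title", "n/a")  # .get() avoids KeyError: 'n/a'

p["title"] = "greeting"          # add an attribute
del p["id"]                      # remove an attribute
print(classes, ident, missing, p.attrs)
```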

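3. Navigating the tree

'''
1. Usage
2. Getting a tag's name
3. Getting a tag's attributes
4. Getting a tag's contents
5. Nested selection
6. Children and descendants
7. Parent and ancestors
8. Siblings
'''

# Navigating the tree means selecting directly by tag name. It is fast, but when several tags share a name only the first one is returned.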
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my p" class="title"><b id="bbb" class="boldest">The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

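# 1. Usage
from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')
# soup=BeautifulSoup(open('a.html'),'lxml')

print(soup.p) # only the first <p> is returned when there are several
print(soup.a) # only the first <a> is returned when there are several

# 2. Getting a tag's name
print(soup.p.name)

# 3. Getting a tag's attributes
print(soup.p.attrs)

# 4. Getting a tag's contents
print(soup.p.string) # the text when p has exactly one child; otherwise None
print(soup.p.strings) # a generator over all the strings under p
print(soup.p.text) # all the text under p, concatenated
for line in soup.stripped_strings: # every string in the document, with surrounding whitespace removed
    print(line)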


'''
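If a tag contains more than one child, it is ambiguous which child .string should refer to, so .string is None. If a tag has exactly one child, .string is that child's text. In the structure below, soup.p.string is None, but soup.p.strings still yields all the text: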
<p id='list-1'>
    哈哈哈哈
    <a class='sss'>
        <span>
            <h1>aaaa</h1>
        </span>
    </a>
    <b>bbbbb</b>
</p>
'''
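A minimal sketch of the .string rule, using a simplified, hypothetical variant of the structure above and the standard-library parser (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<p>ha <a><span>aaaa</span></a><b>bbbbb</b></p>", "html.parser")

single = demo.a.string                  # <a> has one child chain ending in one string: 'aaaa'
ambiguous = demo.p.string               # <p> has several children: None
pieces = list(demo.p.stripped_strings)  # ['ha', 'aaaa', 'bbbbb']
print(single, ambiguous, pieces)
```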

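# 5. Nested selection
print(soup.head.title.string)
print(soup.body.a.string)


# 6. Children and descendants
print(soup.p.contents) # a list of p's direct children
print(soup.p.children) # an iterator over p's direct children

for i,child in enumerate(soup.p.children):
    print(i,child)

print(soup.p.descendants) # a generator over every descendant of p, tags and strings alike
for i,child in enumerate(soup.p.descendants):
    print(i,child)

# 7. Parent and ancestors
print(soup.a.parent) # the parent of the first <a>
print(soup.a.parents) # a generator over all of its ancestors: parent, grandparent, and so on


# 8. Siblings
print('=====>')
print(soup.a.next_sibling) # the next sibling
print(soup.a.previous_sibling) # the previous sibling

print(list(soup.a.next_siblings)) # all following siblings (a generator, materialized here with list())
print(soup.a.previous_siblings) # all preceding siblings (a generator object)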
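One detail worth a small sketch: the sibling of a tag is often a text node rather than another tag. The example below uses hypothetical two-link markup and shows find_next_sibling() as a way to skip to the next tag.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>', "html.parser"
)
first = demo.a

text_node = first.next_sibling               # the string ', ' between the two links
next_tag = first.find_next_sibling("a")      # skips text nodes, returns the second <a>
ancestors = [p.name for p in first.parents]  # ['p', '[document]']
print(repr(text_node), next_tag["id"], ancestors)
```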

4. Searching the tree

Beautiful Soup defines many search methods. This section focuses on two, find() and find_all(); the others take similar arguments and are used the same way.

1. The five kinds of filters

# Searching the tree
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my p" class="title"><b id="bbb" class="boldest">The Dormouse's story</b>
</p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""


from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')

# 1. Five kinds of filters: string, regular expression, list, True, function
# 1.1 String: a tag name
print(soup.find_all('b'))

# 1.2 Regular expression
import re
print(soup.find_all(re.compile('^b'))) # tags whose names start with "b": here <body> and <b>

# 1.3 List: Beautiful Soup returns anything that matches any item in the list. This finds all <a> and <b> tags:
print(soup.find_all(['a','b']))

# 1.4 True: matches every tag, but never a text string
print(soup.find_all(True))
for tag in soup.find_all(True):
    print(tag.name)

# 1.5 Function: if no other filter fits, define a function that takes a tag as its only argument and returns True when the tag matches, False otherwise
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))
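A function can also filter on a single attribute. In that case Beautiful Soup passes the attribute's value, not the tag, to the function. The sketch below assumes hypothetical markup and a hypothetical helper is_absolute; it guards against None because a tag may lack the attribute.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<a href="http://example.com/elsie">Elsie</a>'
    '<a href="/local">Local</a>'
    "<a>No link</a>",
    "html.parser",
)


def is_absolute(href):
    # href is the attribute value, or None when the tag has no href
    return href is not None and href.startswith("http")


absolute = demo.find_all("a", href=is_absolute)
print([a.get_text() for a in absolute])
```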

2. find_all()

# 2. find_all( name , attrs , recursive , text , limit , **kwargs )
# 2.1 name: the value can be any kind of filter: a string, a regular expression, a list, a function, or True.
print(soup.find_all(name=re.compile('^t')))

# 2.2 Keyword arguments: key=value, where value can be a string, a regular expression, a list, or True.
print(soup.find_all(id=re.compile('my')))
print(soup.find_all(href=re.compile('lacie'),id=re.compile(r'\d'))) # to filter on the class attribute, use class_ instead
print(soup.find_all(id=True)) # every tag that has an id attribute

# Some attribute names cannot be used as keyword arguments, such as HTML5 data-* attributes:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>','lxml')
# data_soup.find_all(data-foo="value") # SyntaxError: keyword can't be an expression
# Pass such attributes in a dictionary through the attrs argument instead:
print(data_soup.find_all(attrs={"data-foo": "value"}))
# [<div data-foo="value">foo!</div>]

# 2.3 Searching by CSS class: the keyword is class_ (class is a reserved word in Python); the value can be any of the filters above
print(soup.find_all('a',class_='sister')) # <a> tags whose class is sister
print(soup.find_all('a',class_='sister ssss')) # matches only the exact class string "sister ssss"; the same classes in another order do not match
print(soup.find_all(class_=re.compile('^sis'))) # every tag with a class that starts with "sis"

# 2.4 attrs
print(soup.find_all('p',attrs={'class':'story'}))

# 2.5 text: the value can be a string, a list, True, or a regular expression (Beautiful Soup 4.4.0 and later call this argument string)
print(soup.find_all(text='Elsie'))
print(soup.find_all('a',text='Elsie'))

# 2.6 limit: on a large tree a search can be slow. If you don't need every result, limit caps how many are returned. Like LIMIT in SQL, the search stops as soon as that many results have been found
print(soup.find_all('a',limit=2))

# 2.7 recursive: tag.find_all() examines all of the tag's descendants. To search only its direct children, pass recursive=False.
print(soup.html.find_all('a'))
print(soup.html.find_all('a',recursive=False))

'''
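Calling a tag is like calling find_all()
find_all() is by far the most used search method in Beautiful Soup, so it has a shortcut: a BeautifulSoup object or a Tag object can be called like a function, which is the same as calling find_all() on it. These two lines are equivalent:
soup.find_all("a")
soup("a")
These two lines are also equivalent: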
soup.title.find_all(text=True)
soup.title(text=True)
'''
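The class_ behaviour mentioned in 2.3 is easy to get wrong, so here is a small sketch on hypothetical markup. A single class name matches any tag carrying that class, a full class string matches only in the exact order written, and a CSS selector (via select(), which relies on the soupsieve package installed alongside Beautiful Soup 4.7+) matches regardless of order.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    '<a class="sister big" id="x">Elsie</a><a class="sister" id="y">Lacie</a>',
    "html.parser",
)

by_one_class = demo.find_all("a", class_="sister")       # both tags
exact_string = demo.find_all("a", class_="sister big")   # only id="x"
wrong_order = demo.find_all("a", class_="big sister")    # no match
any_order = demo.select("a.big.sister")                  # only id="x"
print(len(by_one_class), len(exact_string), len(wrong_order), len(any_order))
```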

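3. find()

# 3. find( name , attrs , recursive , text , **kwargs )
find_all() returns every matching tag, but sometimes you want only one result. If the document has a single <body> tag, scanning the whole tree with find_all() is wasteful, and rather than passing limit=1 every time it is simpler to call find(). These two lines are nearly equivalent: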

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]
soup.find('title')
# <title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing the single result, while find() returns the result itself.
When nothing matches, find_all() returns an empty list and find() returns None.
print(soup.find("nosuchtag"))
# None
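Because find() returns None on a miss, calling a method on the result directly can raise AttributeError. A minimal guard might look like the sketch below; the helper text_of and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    "<html><head><title>Story</title></head><body></body></html>", "html.parser"
)


def text_of(tree, name, default=""):
    """Return the text of the first tag called name, or default if there is none."""
    node = tree.find(name)
    return node.get_text() if node is not None else default


print(text_of(demo, "title"))
print(repr(text_of(demo, "nosuchtag")))
```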

soup.head.title is the tag-name shortcut described under "Tag names". It works by calling find() repeatedly:

soup.head.title
# <title>The Dormouse's story</title>
soup.find("head").find("title")
# <title>The Dormouse's story</title>

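4. Other methods

# See the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find-parents-find-parent

5. CSS selectors

In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used to pick elements here with soup.select(), which returns a list.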

(1) By tag name

print(soup.select("title"))  #[<title>The Dormouse's story</title>]
print(soup.select("b"))      #[<b class="boldest" id="bbb">The Dormouse's story</b>]

(2) By class name

print(soup.select(".sister")) 

'''
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

'''

(3) By id

print(soup.select("#link1"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

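(4) Combined lookup

Selectors combine the same way they do in a CSS file: tag names, class names, and ids can be mixed. For example, to find the element with id link2 inside a p tag, separate the two parts with a space:

print(soup.select("p #link2"))

#[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Direct child lookup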

print(soup.select("p > #link2"))
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

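(5) Attribute lookup

You can also filter on attributes by putting them in square brackets. The attribute test belongs to the same node as the tag name, so there must be no space between them, or nothing will match.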

print(soup.select("a[href='http://example.com/tillie']"))
#[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

select() always returns a list, so you can loop over the results and call get_text() on each one to get its contents:

for title in soup.select('a'):
    print(title.get_text())

'''
Elsie
Lacie
Tillie
'''
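A few more selector forms, sketched on a shortened, hypothetical copy of the Alice markup. select_one() returns the first match or None, and [href$="..."] is the standard CSS "ends with" attribute test.

```python
from bs4 import BeautifulSoup

markup = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    "</p>"
)
demo = BeautifulSoup(markup, "html.parser")

one = demo.select_one("p.story > a#link2")            # first match only
hrefs = [a["href"] for a in demo.select("a.sister")]  # attribute of every match
ends = demo.select('a[href$="elsie"]')                # href ends with "elsie"
nothing = demo.select_one("div.missing")              # None when nothing matches
print(one.get_text(), hrefs, len(ends), nothing)
```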

5. Modifying the tree

See the "Modifying the tree" section of the official documentation.

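XPath

Introduction to XPath

XPath plays a central role in Python web scraping. It can do much the same work as regular expressions (the re module), but for analyzing web pages it is usually the more convenient tool because it queries the document's structure rather than its raw text, which tends to push re into a supporting role.

About XPath

XPath stands for XML Path Language, a small query language.
Its main strengths are:

It can find information in XML documents
It also works on HTML
It navigates by elements and attributes

Using XPath from Python: XPath support is provided here by the lxml library, so install lxml first.

A minimal way to call XPath: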

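from lxml import etree

selector=etree.HTML(source) # source: the HTML text, parsed into a tree that XPath can query
selector.xpath(expression) # expression: an XPath string; the result is a list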
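The two calls above can be tried on a small, hypothetical snippet. text() selects text nodes and @href selects attribute values, so both queries return lists of strings.

```python
from lxml import etree

html_source = (
    "<html><body><div id='main'><p>Hello</p><a href='/x'>link</a></div></body></html>"
)
tree = etree.HTML(html_source)  # parse the HTML text into an element tree

texts = tree.xpath("//p/text()")   # text inside every <p>
hrefs = tree.xpath("//a/@href")    # href attribute of every <a>
divs = tree.xpath("//div")         # a list of element objects
print(texts, hrefs, [d.get("id") for d in divs])
```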

XPath syntax

Queries

html_doc = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>

<div class="d1">
    <div class="d2">
            <p class="story">
                <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
                <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
                <a href="http://example.com/tillie" id="link3">Tillie</a>
            </p>
    </div>
    <div>
        <p id="p1">ALex is dsb</p>
        <p id="p2">Egon too</p>
    </div>
</div>

<div class="d3">
    <a href="http://www.baidu.com">baidu</a>
    <p>百度</p>
</div>

</body>
</html>
"""

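from lxml import etree
selector=etree.HTML(html_doc) # parse the source into a tree that XPath can query


'''
1. Selecting nodes

Expression   Description                                        Example              Result
nodename     Selects child elements with this name              xpath('div')         The div children of the current node
/            Selects from the root node                         xpath('/div')        A div at the document root
//           Selects matching nodes anywhere in the document    xpath('//div')       All div nodes
.            Selects the current node                           xpath('./div')       The div children of the current node
..           Selects the parent of the current node             xpath('..')          The parent node
@            Selects attributes                                 xpath('//@class')    All class attributes

'''

ret=selector.xpath("//div") # every div in the document
ret=selector.xpath("/div") # [] here, because the root element is html, not div
ret=selector.xpath("./div") # [] here, because the current node is html and its children are head and body
ret=selector.xpath("//p[@id='p1']")
ret=selector.xpath("//div[@class='d1']/div/p[@class='story']")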


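'''
2. Predicates (positions are counted from 1)

Expression                                Result
xpath('/body/div[1]')                     The first div under body
xpath('/body/div[last()]')                The last div under body
xpath('/body/div[last()-1]')              The second-to-last div under body
xpath('/body/div[position()<3]')          The first two div nodes under body
xpath('/body/div[@class]')                The div nodes under body that have a class attribute
xpath('/body/div[@class="main"]')         The div nodes under body whose class attribute is main
xpath('/body/div[price>35.00]')           The div nodes under body whose price child element is greater than 35

'''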

ret=selector.xpath("//p[@class='story']//a[2]")
ret=selector.xpath("//p[@class='story']//a[last()]")
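The positional predicates from the table can be checked on a three-item list. The markup below is hypothetical; etree.HTML wraps the fragment in html and body elements, which is why the queries start with //.

```python
from lxml import etree

items = etree.HTML("<ul><li>a</li><li>b</li><li>c</li></ul>")

first = items.xpath("//li[1]/text()")                  # positions start at 1
last = items.xpath("//li[last()]/text()")
second_to_last = items.xpath("//li[last()-1]/text()")
first_two = items.xpath("//li[position()<3]/text()")
print(first, last, second_to_last, first_two)
```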


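'''
3. Wildcards
XPath wildcards select XML elements whose names are not known in advance

Expression            Result
xpath('/div/*')       All child elements of div
xpath('/div[@*]')     All div nodes that have at least one attribute


'''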

ret=selector.xpath("//p[@class='story']/*")
ret=selector.xpath("//p[@class='story']/a[@class]")

'''
4. Selecting several paths
The "|" operator combines several paths

Expression                Result
xpath('//div|//table')    All div and table nodes


'''

ret=selector.xpath("//p[@class='story']/a[@class]|//div[@class='d3']")
print(ret)

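'''


5. XPath axes
An axis defines a node set relative to the current node

Axis                    Expression                          Description
ancestor                xpath('./ancestor::*')              All ancestors of the current node (parent, grandparent, and so on)
ancestor-or-self        xpath('./ancestor-or-self::*')      All ancestors of the current node, plus the node itself
attribute               xpath('./attribute::*')             All attributes of the current node
child                   xpath('./child::*')                 All children of the current node
descendant              xpath('./descendant::*')            All descendants of the current node (children, grandchildren, and so on)
following               xpath('./following::*')             Everything in the document after the current node's closing tag
following-sibling       xpath('./following-sibling::*')     All siblings after the current node
parent                  xpath('./parent::*')                The parent of the current node
preceding               xpath('./preceding::*')             Everything in the document before the current node's opening tag
preceding-sibling       xpath('./preceding-sibling::*')     All siblings before the current node
self                    xpath('./self::*')                  The current node


6. Functions
Functions make fuzzy matching easier

Function            Usage                                                             Description
starts-with         xpath('//div[starts-with(@id,"ma")]')                             div nodes whose id starts with ma
contains            xpath('//div[contains(@id,"ma")]')                                div nodes whose id contains ma
and                 xpath('//div[contains(@id,"ma") and contains(@id,"in")]')         div nodes whose id contains both ma and in
text()              xpath('//div[contains(text(),"ma")]')                             div nodes whose text contains ma

'''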
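A short sketch of a few axes and functions from the tables above, run against hypothetical markup. Axis queries are issued from a context element (p1 here), which is why they start with "./".

```python
from lxml import etree

page = etree.HTML(
    '<div id="main"><p id="p1">one</p><p id="p2">two</p><span id="mark">x</span></div>'
)
p1 = page.xpath("//p[@id='p1']")[0]  # context node for the axis queries

later_siblings = [e.get("id") for e in p1.xpath("./following-sibling::*")]
ancestor_tags = {e.tag for e in p1.xpath("./ancestor::*")}  # set: order not relied on
starts_with_ma = page.xpath("//*[starts-with(@id,'ma')]/@id")
text_has_tw = page.xpath("//p[contains(text(),'tw')]/@id")
print(later_siblings, ancestor_tags, starts_with_ma, text_has_tw)
```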

Element objects

from lxml.etree import _Element
for obj in ret:
    print(obj)
    print(type(obj))  # <class 'lxml.etree._Element'>

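'''
Element objects

class xml.etree.ElementTree.Element(tag, attrib={}, **extra)

  tag: string, the kind of data the element represents.
  text: string, the text inside the element before its first child.
  tail: string, the text that follows the element's end tag.
  attrib: dictionary, the element's attributes.

  # Attribute operations
  clear(): removes all descendants and attributes, and sets text and tail to None.
  get(key, default=None): returns the value of attribute key, or default if the attribute does not exist.
  items(): returns the attributes as a list of (key, value) pairs.
  keys(): returns a list of the attribute names.
  set(key, value): sets an attribute to the given value.

  # Operations on children
  append(subelement): adds a direct child.
  extend(subelements): appends a sequence of elements as children. # new in Python 2.7
  find(match): finds the first matching child; match can be a tag name or a path.
  findall(match): finds all matching children; match can be a tag name or a path.
  findtext(match): finds the first matching child and returns its text; match can be a tag name or a path.
  insert(index, element): inserts a child at the given position.
  iter(tag=None): creates an iterator over the element and its descendants, optionally only those with the given tag. # new in Python 2.7
  iterfind(match): finds all matches by tag name or path and returns them as an iterator.
  itertext(): iterates over the element and its descendants, yielding their text.
  remove(subelement): removes a child.



'''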
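Several of the members listed above can be exercised with the standard library's xml.etree.ElementTree, which the class signature above refers to; lxml elements offer a largely compatible interface. The XML snippet and variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<div class="d1"><p id="p1">first</p>tail text<p id="p2">second</p></div>'
)

first = root.find("p")                          # first matching child
ids = [p.get("id") for p in root.findall("p")]  # all matching children
second_text = root.findtext("p[@id='p2']")      # text of the first match for a path

root.set("lang", "en")                          # add an attribute
attr_names = sorted(root.keys())
all_text = "".join(root.itertext())             # text and tails in document order

print(root.tag, first.text, first.tail, ids, second_text, attr_names, all_text)
```

Note how "tail text" belongs to the first p element's tail rather than to the div's text, which is the distinction between text and tail described above.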