Using XPath and BeautifulSoup in Python

Date: 2019-11-17


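XPath basics

XPath syntax: path expressions select nodes or node-sets in an XML or HTML document.

Path expressions

nodename : selects the child elements of the current node that are named nodename

/ : selects from the root node

// : selects matching nodes anywhere in the document, regardless of depth

. : selects the current node

.. : selects the parent of the current node

@ : selects attributes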
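The path expressions above can be tried out with lxml. This is a minimal sketch: the XML document and its element names (classroom, student, teacher) are invented here for illustration and are not the article's xpathText.xml.

```python
from lxml import etree

# Hypothetical sample document, made up for this sketch.
XML = b"""<classroom>
  <student><name lang="en">Alice</name><age>19</age></student>
  <student><name lang="zh">Bo</name><age>22</age></student>
  <teacher><name lang="en">Dave</name><age>41</age></teacher>
</classroom>"""

root = etree.XML(XML)

# "/" starts at the document root: only student children of classroom.
students = root.xpath('/classroom/student')

# "//" matches at any depth: name elements under student and teacher alike.
names = root.xpath('//name/text()')

# "@" selects attributes; the result is a list of attribute values.
langs = root.xpath('//name/@lang')

# "." is the current node and ".." its parent, relative to the node queried.
first_name = root.xpath('//name')[0]
same_node = first_name.xpath('.')[0]
parent_tag = first_name.xpath('..')[0].tag

print(len(students), names, langs, parent_tag)
```

Note that xpath() always returns a list for node and attribute selections, even when only one node matches.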

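Predicate examples

Effect: path expression

Select the first student element that is a child of classroom: /classroom/student[1]

Select the last student element that is a child of classroom: /classroom/student[last()]

Select the second-to-last student element that is a child of classroom: /classroom/student[last()-1]

Select the first two student elements that are children of classroom: /classroom/student[position()<3]

Select all name elements that have an attribute named lang: //name[@lang]

Select all name elements whose lang attribute has the value eng: //name[@lang='eng']

Select all student elements of classroom whose age element is greater than 20: /classroom/student[age>20]

Select the name elements of all student elements of classroom whose age element is greater than 20: /classroom/student[age>20]/name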
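The predicates in the table can be checked against a small document. The sketch below assumes a made-up classroom document with three students; the helper names() is hypothetical and only exists to make the results easy to read.

```python
from lxml import etree

# Hypothetical sample document, made up for this sketch.
XML = b"""<classroom>
  <student><name lang="en">Alice</name><age>19</age></student>
  <student><name lang="eng">Bob</name><age>22</age></student>
  <student><name>Carol</name><age>26</age></student>
</classroom>"""
root = etree.XML(XML)


def names(expr):
    """Return the name text of every student element selected by expr."""
    return [student.findtext('name') for student in root.xpath(expr)]


first = names('/classroom/student[1]')            # positions start at 1
last = names('/classroom/student[last()]')
second_last = names('/classroom/student[last()-1]')
first_two = names('/classroom/student[position()<3]')
older = names('/classroom/student[age>20]')
with_lang = root.xpath('//name[@lang]/text()')
eng_only = root.xpath("//name[@lang='eng']/text()")

print(first, last, second_last, first_two, older, with_lang, eng_only)
```

Unlike Python list indexes, XPath positions are 1-based, so student[1] is the first student.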

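The wildcard "*" and the "|" operator

Effect: path expression

Select all child elements of classroom: /classroom/*

Select all elements in the document: //*

Select all name elements that have at least one attribute: //name[@*]

Select all name and age elements of student elements: //student/name | //student/age

Select all name elements of student elements under classroom, plus every age element in the document: /classroom/student/name | //age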
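A short sketch of the wildcard and union expressions with lxml, again on an invented sample document:

```python
from lxml import etree

# Hypothetical sample document, made up for this sketch.
XML = b"""<classroom>
  <student><name lang="en">Alice</name><age>19</age></student>
  <teacher><name>Dave</name><age>41</age></teacher>
</classroom>"""
root = etree.XML(XML)

# "*" matches any element name.
children = [e.tag for e in root.xpath('/classroom/*')]
all_tags = [e.tag for e in root.xpath('//*')]

# "@*" matches any attribute, so this keeps only names that have one.
names_with_attr = root.xpath('//name[@*]/text()')

# "|" merges the node-sets of two path expressions.
union = [e.tag for e in root.xpath('/classroom/student/name | //age')]

print(children, all_tags, names_with_attr, union)
```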

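An XPath axis step has the syntax axisname::nodetest[predicate]

Axis name: meaning

child : selects all children of the current node

parent : selects the parent of the current node

ancestor : selects all ancestors of the current node (parent, grandparent, and so on)

ancestor-or-self : selects all ancestors of the current node and the current node itself

descendant : selects all descendants of the current node

descendant-or-self : selects all descendants of the current node and the current node itself

preceding : selects all nodes in the document that come before the start tag of the current node

following : selects all nodes in the document that come after the end tag of the current node

preceding-sibling : selects all siblings before the current node

following-sibling : selects all siblings after the current node

self : selects the current node

attribute : selects all attributes of the current node

namespace : selects all namespace nodes of the current node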

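XPath axis examples

Effect: path expression

Select the teacher elements that are children of classroom: /classroom/child::teacher

Select the parent of every id element: //id/parent::*

Select all ancestors of every classid element: //classid/ancestor::*

Select all descendant elements of classroom: /classroom/descendant::*

Select all id elements that are descendants of a student element: //student/descendant::id

Select every classid element together with all of its ancestors: //classid/ancestor-or-self::*

Select /classroom/student itself and all of its descendant elements: /classroom/student/descendant-or-self::*

Select all siblings before /classroom/teacher; in the sample document these are all the student nodes: /classroom/teacher/preceding-sibling::*

Select all siblings after the second student in /classroom: /classroom/student[2]/following-sibling::*

Select all nodes before /classroom/teacher except its ancestors; this covers not only the student nodes but also their children: /classroom/teacher/preceding::*

Select all nodes after the second student in /classroom; in the sample document these are the teacher node and its children: /classroom/student[2]/following::*

Select the student nodes themselves; of little use on its own: //student/self::*

Select all attributes of the /classroom/teacher/name node: /classroom/teacher/name/attribute::*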
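A few of the axis expressions above, run with lxml. The document is a made-up stand-in for the article's sample file, and tags() is a hypothetical helper that reduces each result to its element names.

```python
from lxml import etree

# Hypothetical sample document, made up for this sketch.
XML = b"""<classroom>
  <student><id>1</id><name>Alice</name></student>
  <student><id>2</id><name>Bob</name></student>
  <teacher><classid>7</classid><name lang="en">Dave</name></teacher>
</classroom>"""
root = etree.XML(XML)


def tags(expr):
    """Return the tag names of the elements selected by expr."""
    return [element.tag for element in root.xpath(expr)]


child = tags('/classroom/child::teacher')
ancestors = tags('//classid/ancestor::*')
before = tags('/classroom/teacher/preceding-sibling::*')
after = tags('/classroom/student[2]/following::*')
desc_or_self = tags('/classroom/student[1]/descendant-or-self::*')
attributes = root.xpath('/classroom/teacher/name/attribute::*')

print(child, ancestors, before, after, desc_or_self, attributes)
```

Note the double colon between axis name and node test: it must be the ASCII "::".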

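XPath operator examples

Meaning: example

Select all student elements of classroom whose age element equals 20: /classroom/student[age=19+1], /classroom/student[age=5*4], /classroom/student[age=21-1], /classroom/student[age=40 div 2]

Comparisons such as greater than, less than, and not equal work the same way

or example: /classroom/student[age<20 or age>25] selects students whose age is less than 20 or greater than 25

and example: /classroom/student[age>20 and age<25] selects students whose age is strictly between 20 and 25

mod : computes the remainder of a division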
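The operators can be exercised the same way. In this sketch the document and the helper names() are assumptions made for illustration; note that div and mod need surrounding spaces so they are not read as part of an element name.

```python
from lxml import etree

# Hypothetical sample document, made up for this sketch.
XML = b"""<classroom>
  <student><name>Alice</name><age>19</age></student>
  <student><name>Bob</name><age>20</age></student>
  <student><name>Carol</name><age>23</age></student>
  <student><name>Dan</name><age>26</age></student>
</classroom>"""
root = etree.XML(XML)


def names(predicate):
    """Return the names of the students that satisfy the predicate."""
    return root.xpath('/classroom/student[%s]/name/text()' % predicate)


equal_20 = names('age=40 div 2')
outside = names('age<20 or age>25')
between = names('age>20 and age<25')
even_age = names('age mod 2 = 0')

# An expression that yields a number comes back as a Python float.
remainder = root.xpath('7 mod 3')

print(equal_20, outside, between, even_age, remainder)
```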

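Example code

from lxml import etree

# Read the file as bytes so that lxml can handle the encoding declaration itself
with open(r'xpathText.xml', 'rb') as content_stream:
    content = content_stream.read()
root = etree.XML(content)
print(content.decode('utf-8'))
print('-------')
# All elements that follow the second student in document order
em = root.xpath('/classroom/student[2]/following::*')
print(em[0].xpath('./name/text()'))  # text content of the name element
print(em[0].xpath('./name/@lang'))  # value of the lang attribute of the name element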


BeautifulSoup basics

There are two ways to create a BeautifulSoup object. 1. From a string: soup = BeautifulSoup(html_str, 'lxml'), where 'lxml' names the parser to use

2. From a file: soup = BeautifulSoup(open('index.html'), 'lxml')
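Both ways of creating the object can be sketched as follows. Two assumptions are made here: the standard-library 'html.parser' is used instead of 'lxml' so the snippet runs without extra dependencies, and a temporary file stands in for index.html.

```python
import os
import tempfile

from bs4 import BeautifulSoup

HTML = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# 1. From a string.
soup_from_str = BeautifulSoup(HTML, 'html.parser')

# 2. From an open file object (a temporary file plays the role of index.html).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'index.html')
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write(HTML)
    with open(path, 'rb') as handle:
        soup_from_file = BeautifulSoup(handle, 'html.parser')

print(soup_from_str.title.string, soup_from_file.p.string)
```

Opening the file in a with block closes it once parsing is done, which the bare open('index.html') form does not guarantee.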

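There are four kinds of objects: Tag, NavigableString, BeautifulSoup and Comment

1) Tag

In HTML, each tag together with its contents is a Tag object. How is a Tag extracted?

soup.title extracts the title tag and soup.a extracts an a tag. Accessing soup.<tagname> returns the first matching tag in the document

A Tag has two most important properties: its name and its attributes. Every Tag has a name, available through .name

A Tag's name can be changed; the change affects all HTML generated from the current Beautiful Soup object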

from bs4 import BeautifulSoup

html_str = """<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
 <a href="http://example.com/lacie" class="sister" id="link2">
 <!--Lacie -->
 </a>
 and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.
</p><p class="story">……</p>
</body>
</html>"""
soup = BeautifulSoup(html_str, 'lxml')
# soup = BeautifulSoup(open(r'index.html', 'rb'), 'lxml')
print(soup.prettify())  # print the document in formatted form
print(soup.name)
print(soup.title.name)  # print the name of the title tag
soup.title.name = 'mytitle'  # rename the title tag to mytitle
print(soup.title)  # the tag has been renamed, so this prints None
print(soup.mytitle)  # prints the mytitle Tag

Output

(contents of the whole document)
[document]
title
None
<mytitle>The Dormouse's story</mytitle>

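Getting Tag attributes: in <p class="title"><b>The Dormouse's story</b></p>, the p Tag has a class attribute whose value is title. An attribute value can be changed in the same way as a tag name: soup.p['class'] = 'myclass' replaces title with myclass.

Reading an attribute works like a dictionary lookup:

# Tag attributes are accessed like dictionary entries
print(soup.p['class'])
print(soup.p.get('class'))

Output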

['title']
['title']

To get all attributes of a Tag, use print(soup.p.attrs), which returns them as a dictionary mapping attribute names to values

The output looks like this

{'class': ['title']}
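Reading, changing, adding and deleting attributes can be sketched on a small fragment. The fragment and the data-note attribute are invented for illustration, and 'html.parser' is assumed in place of 'lxml'.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title" id="intro"><b>Hi</b></p>', 'html.parser')
p = soup.p

before = p['class']         # class is multi-valued, so this is a list
p['class'] = 'myclass'      # overwrite an existing attribute
p['data-note'] = 'added'    # add a new attribute
del p['id']                 # remove an attribute
missing = p.get('id')       # get() returns None instead of raising KeyError

attrs = dict(p.attrs)
print(before, missing, attrs)
print(p)
```

p['id'] would raise KeyError after the deletion, which is why get() is the safer choice when an attribute may be absent.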

2) NavigableString: once a tag has been obtained, how is the text inside it retrieved? Use .string.

print(soup.b.string)  # print the content of the b Tag
print(type(soup.b.string))  # print the type of that content, which is NavigableString

Output

The Dormouse's story
<class 'bs4.element.NavigableString'>

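3) BeautifulSoup

The BeautifulSoup object represents the entire content of a document. Most of the time it can be treated as a Tag object; it is a special kind of Tag. For example

print(type(soup.name))
print(soup.name)
print(soup.attrs)

Output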

<class 'str'>
[document]
{}

4) Comment: the comment part of a document. For example

print(soup.a.string)
print(type(soup.a.string))

Output

Elsie 
<class 'bs4.element.Comment'>
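Because .string can hand back a Comment just as readily as ordinary text, code that wants only visible text may need to tell the two apart. A minimal sketch, where visible_text() is a hypothetical helper and 'html.parser' is assumed as the parser:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

HTML = '<p><a id="link1"><!-- Elsie --></a><a id="link3">Tillie</a></p>'
soup = BeautifulSoup(HTML, 'html.parser')


def visible_text(tag):
    """Return tag.string as plain text, or None for comments and empty tags."""
    content = tag.string
    if content is None or isinstance(content, Comment):
        return None
    return str(content)


first, second = soup.find_all('a')

# Comment is a subclass of NavigableString, so check for Comment first.
is_navigable = isinstance(first.string, NavigableString)
print(is_navigable, visible_text(first), visible_text(second))
```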

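Traversing the document

1) Child nodes

The .contents and .children attributes of a Tag are very important; both give access to its direct children. .contents returns the children of a Tag as a list:

print(soup.html.contents)
print(soup.html.contents[1])  # soup.html.contents[1].string would print the text directly; .string is explained further down

Output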

['\n', <head><mytitle>The Dormouse's story</mytitle></head>, '\n', <body>
<p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a class="sister" href="http://example.com/lacie" id="link2">
<!--Lacie -->
</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p><p class="story">……</p>
</body>, '\n']
<head><mytitle>The Dormouse's story</mytitle></head>

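.children returns an iterator over the direct children of a Tag, which can be used in a loop

for child in soup.html.children:  # loop over the direct children only
    print(child)

Output (each '\n' child shows up as two blank lines, because print adds a newline of its own)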


<head><mytitle>The Dormouse's story</mytitle></head>


<body>
<p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a class="sister" href="http://example.com/lacie" id="link2">
<!--Lacie -->
</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p><p class="story">……</p>
</body>

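The .descendants attribute iterates recursively over all descendants of a tag. head has only one direct child, title (renamed to mytitle above), but that tag in turn has a child: the string 'The Dormouse's story'.

In this case the string is also a descendant of the <head> tag

for child in soup.head.descendants:  # recursive loop over all descendants
    print(child)

Output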

<mytitle>The Dormouse's story</mytitle>
The Dormouse's story

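How is the text content of a tag retrieved? Three attributes are involved: .string, .strings and .stripped_strings

.string behaves in a particular way. If a tag contains no other tags, .string returns the text inside it. If a tag contains exactly one

child tag, .string returns the innermost text. If a tag has several children, it is ambiguous which child's text .string should refer to, so .string is None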

print(soup.head.string)
print(soup.mytitle.string)
print(soup.html.string)

Output

The Dormouse's story
The Dormouse's story
None
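The three cases of .string can be seen side by side on a small fragment. The fragment and its id values are invented for this sketch, and 'html.parser' is assumed as the parser.

```python
from bs4 import BeautifulSoup

HTML = ('<div>'
        '<p id="leaf">text</p>'
        '<p id="nested"><b>deep</b></p>'
        '<p id="mixed">a<b>b</b></p>'
        '</div>')
soup = BeautifulSoup(HTML, 'html.parser')

leaf = soup.find(id='leaf').string      # no child tags: the text itself
nested = soup.find(id='nested').string  # one child tag: the innermost text
mixed = soup.find(id='mixed').string    # several children: None

# .strings still reaches every piece of text when .string gives up.
pieces = list(soup.find(id='mixed').strings)

print(leaf, nested, mixed, pieces)
```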

.strings is mainly used when a tag contains several strings; it can be iterated over in a loop

for stri in soup.strings:
    print(repr(stri))

Output

'\n'
"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were\n    '
'\n'
'\n'
'\n'
'\n    and\n    '
'Tillie'
';\n    and they lived at the bottom of a well.\n'
'……'
'\n'
'\n'

.stripped_strings strips leading and trailing whitespace from each string and skips strings that consist of whitespace only. For example

for stri in soup.stripped_strings:
    print(repr(stri))

Output

"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'and'
'Tillie'
';\n    and they lived at the bottom of a well.'
'……'

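2) Parent nodes

Every Tag or string has a parent: the Tag that contains it. The .parent attribute returns the parent of an element

print(soup.mytitle.parent) prints <head><mytitle>........</mytitle></head>

The .parents attribute iterates over all ancestors of an element. The following walks from the <a> tag up to the root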

print(soup.a)
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

Output

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
p
body
html
[document]

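3) Sibling nodes: nodes on the same level as the current node. .next_sibling returns the next sibling of a node and .previous_sibling returns the previous one.

If no such node exists, None is returned

.next_siblings and .previous_siblings iterate over all following or preceding siblings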
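A small sketch of sibling navigation. The fragment is made up for illustration and 'html.parser' is assumed; note that the siblings of a tag are often plain strings rather than tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

HTML = ('<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
        ' and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(HTML, 'html.parser')
link2 = soup.find(id='link2')

# The direct neighbours of link2 are text nodes, not the other <a> tags.
next_node = link2.next_sibling
prev_node = link2.previous_sibling

# Filter the sibling iterators to keep only tags.
following_ids = [s['id'] for s in link2.next_siblings if isinstance(s, Tag)]
preceding_ids = [s['id'] for s in link2.previous_siblings if isinstance(s, Tag)]

# The last child has no next sibling.
end = soup.find(id='link3').next_sibling

print(repr(next_node), repr(prev_node), following_ids, preceding_ids, end)
```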

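4) Next and previous elements

The .next_element and .previous_element attributes apply to all nodes regardless of nesting level. For example, in <head><title>The Dormouse's story</title></head>

the next element after head is title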
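The difference between .next_element and .next_sibling is easiest to see on the head example itself. A minimal sketch, assuming 'html.parser' and a shortened document with no whitespace between tags:

```python
from bs4 import BeautifulSoup

HTML = ("<html><head><title>The Dormouse's story</title></head>"
        "<body><p>x</p></body></html>")
soup = BeautifulSoup(HTML, 'html.parser')
head = soup.head

# next_element follows parse order, so it steps into head.
inside = head.next_element.name

# next_sibling stays on the same level and skips over head's contents.
beside = head.next_sibling.name

# From the title tag the next element is its own string ...
title_text = soup.title.next_element

# ... and from that string the parse order continues with body.
after_text = soup.title.string.next_element.name

print(inside, beside, repr(title_text), after_text)
```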

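To walk through all following or preceding elements, use the .next_elements and .previous_elements iterators, which move forward or backward through the parsed document

for elem in soup.html.next_elements:  # similar to a depth-first traversal
    print(repr(elem))

Output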

'\n'
<head><mytitle>The Dormouse's story</mytitle></head>
<mytitle>The Dormouse's story</mytitle>
"The Dormouse's story"
'\n'
<body>
<p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a class="sister" href="http://example.com/lacie" id="link2">
<!--Lacie -->
</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p><p class="story">……</p>
</body>
'\n'
<p class="title"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
"The Dormouse's story"
<p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a class="sister" href="http://example.com/lacie" id="link2">
<!--Lacie -->
</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>
'Once upon a time there were three little sisters; and their names were\n    '
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
' Elsie '
'\n'
<a class="sister" href="http://example.com/lacie" id="link2">
<!--Lacie -->
</a>
'\n'
'Lacie '
'\n'
'\n    and\n    '
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
'Tillie'
';\n    and they lived at the bottom of a well.\n'
<p class="story">……</p>
'……'
'\n'
'\n'

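Searching the document

Only find_all() is covered here; the other search methods work similarly

Function signature

find_all(name, attrs, recursive, text, limit, **kwargs)

1) The name argument

The name argument finds all tags with the given name; string nodes are ignored automatically. It accepts a string, a regular expression, a list, True, or a function

String example: find all <b> tags in the document; the return value is a list:

print(soup.find_all('b'))
# Output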
[<b>The Dormouse's story</b>]

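When a regular expression is passed, tag names are matched with the pattern's search() method. The following lists all tags whose names start with b, that is, <body> and <b>

import re

for tag in soup.find_all(re.compile('^b')):
    print(tag.name)
# Output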
body
b

Passing a list

print(soup.find_all(['a', 'b']))  # find all <a> and <b> tags in the document

Passing True: True matches every tag, so all tags are returned, but string nodes are not

for tag in soup.find_all(True):
    print(tag.name)
# Output
html
head
mytitle
body
p
b
p
a
a
a
p

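If none of these filters fits, a function can be used instead. It takes a single Tag argument and returns True if the element matches

and False otherwise. For example, to keep only elements that have both a class attribute and an id attribute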

def hasClass_Id(tag):
    return tag.has_attr('class') and tag.has_attr('id')
print(soup.find_all(hasClass_Id))
# Output
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">
<!--Lacie -->
</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

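2) kwargs

kwargs are ordinary Python keyword arguments. If a keyword argument is not one of the built-in search parameter names, the search treats it as the name of a tag attribute to filter on

. The value for such an attribute filter can be a string, a regular expression, a list, or True

String: print(soup.find_all(id='link2')) checks the id attribute of every tag

Regular expression: print(soup.find_all(href=re.compile('elsie'))) finds tags whose href attribute contains 'elsie'

True: print(soup.find_all(id=True)) finds every tag in the tree that has an id attribute, whatever its value:

To filter by class, append an underscore to the argument name, because class is a Python keyword: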

soup.find_all('a',class_='sister')
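The keyword filters above, collected into one runnable sketch. The fragment is a trimmed variant of the sister links invented for illustration, ids() is a hypothetical helper, and 'html.parser' is assumed as the parser.

```python
import re

from bs4 import BeautifulSoup

HTML = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister big" id="link2">Lacie</a>
<a href="http://example.com/tillie" id="link3">Tillie</a>
<b>no id</b></p>"""
soup = BeautifulSoup(HTML, 'html.parser')


def ids(tags):
    """Return the id attribute of each tag in the result list."""
    return [tag.get('id') for tag in tags]


by_id = ids(soup.find_all(id='link2'))                   # exact string
by_regex = ids(soup.find_all(href=re.compile('elsie')))  # regular expression
any_id = ids(soup.find_all(id=True))                     # attribute present
by_class = ids(soup.find_all('a', class_='sister'))      # one of the classes

print(by_id, by_regex, any_id, by_class)
```

class_ matches a tag when any one of its classes equals the given value, which is why the link with class="sister big" is included.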

Some attribute names cannot be used as keyword arguments, such as the HTML5 data-* attributes. For those, pass a dictionary through the attrs parameter of find_all()

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
print(data_soup.find_all(attrs={"data-foo": "value"}))
# data_soup.find_all(data-foo='value')  # SyntaxError: data-foo is not a valid keyword argument name
# Output
[<div data-foo="value">foo!</div>]

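3) The text argument

The text argument searches the string content of the document. Like name, it accepts a string, a regular expression, a list, or True. In the sample document Elsie and Lacie occur only inside comments, with surrounding whitespace, so of the three exact strings below only Tillie matches

print(soup.find_all(text=['Tillie', 'Elsie', 'Lacie']))
print(soup.find_all(text=re.compile('Dormouse')))

Output

['Tillie']
["The Dormouse's story", "The Dormouse's story"]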

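4) The limit argument

find_all() returns every match, which can be slow on a large document tree. If not all results are needed, the limit argument caps the number of results returned

soup.find_all('a', limit=2) returns only the first two results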
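A quick sketch of limit on an invented fragment, with 'html.parser' assumed as the parser. It also shows find(), one of the "other methods", which returns the first match directly instead of a list.

```python
from bs4 import BeautifulSoup

HTML = ('<p><a id="link1">Elsie</a><a id="link2">Lacie</a>'
        '<a id="link3">Tillie</a></p>')
soup = BeautifulSoup(HTML, 'html.parser')

# Stop searching as soon as two matches have been collected.
first_two = [a['id'] for a in soup.find_all('a', limit=2)]

# find() returns the first match itself, or None when nothing matches.
first = soup.find('a')['id']
nothing = soup.find('table')

print(first_two, first, nothing)
```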

5) The recursive argument

When find_all() is called on a tag, Beautiful Soup searches all of the tag's descendants. To search only the tag's direct children, pass

recursive=False

print(soup.find_all('mytitle'))
print(soup.find_all('mytitle', recursive=False))
# Output
[<mytitle>The Dormouse's story</mytitle>]
[]
