Parsing HTML with XPath

Date: 2019-11-30

Tags: xpath, parsing, html. Category: HTML


XPath

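XPath is a language for finding information in XML documents. It can be used to traverse the elements and attributes of an XML document. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.

In web scraping it is mainly used to parse HTML.

The HTML to parse: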

from lxml import etree

# The HTML to parse (the text nodes are Chinese labels such as "第一個a標籤", "the first a tag")
html_str = """
<li data_group="server" class="content"> 
    <a href="/commands.html" class="index" name="a1">第一個a標籤</a>
    <a href="/commands.html" class="index2" name="a2">第二個a標籤</a>
    <a href="/commands/flushdb.html">
        <span class="first">
            這是第一個span標籤
            <span class="second">
            這是第二個span標籤,第一個下的子span標籤
            </span>
        </span>
        <span class="third">這是第三個span標籤</span>
        <h3>這是一個h3</h3>
    </a></li>
"""

1. Reading and parsing a file

# Parse the file xpath.html (assumed to contain the markup shown above)
html = etree.parse('xpath.html')
print(html, type(html))  # <lxml.etree._ElementTree object at 0x00000141445A08C8> <class 'lxml.etree._ElementTree'>
a = html.xpath("//a")
print(a, type(a))  # [<Element a at 0x141445a0808>, <Element a at 0x141445a0908>, <Element a at 0x141445a0948>] <class 'list'>
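The snippet above reads xpath.html from disk, while html_str is defined but never parsed. As a minimal sketch, assuming the markup is already in memory (the usual case in a scraper), the string can be parsed directly with etree.HTML(). The shortened sample below is an assumption that mirrors the structure of html_str.

```python
from lxml import etree

# Shortened copy of the sample markup (assumption: same structure as html_str).
html_str = """
<li data_group="server" class="content">
    <a href="/commands.html" class="index" name="a1">第一個a標籤</a>
    <a href="/commands.html" class="index2" name="a2">第二個a標籤</a>
    <a href="/commands/flushdb.html"><span class="first">span</span></a>
</li>
"""

# etree.HTML() uses lxml's lenient HTML parser and returns the root <html>
# element; the missing <html>/<body> wrappers are added automatically.
root = etree.HTML(html_str)
print(root.tag)    # html

links = root.xpath("//a")
hrefs = root.xpath("//a/@href")
print(len(links))  # 3
print(hrefs)       # ['/commands.html', '/commands.html', '/commands/flushdb.html']
```

Note that etree.HTML() returns an element rather than an _ElementTree, but xpath() is called on it the same way.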

2. Getting a tag's attributes

# Get the href and the text of every a tag
a = html.xpath("//a")
a_href = html.xpath("//a/@href")
a_text = html.xpath("//a/text()")
print(a, type(a))   # [<Element a at 0x191c1691888>, <Element a at 0x191c1691848>, <Element a at 0x191c1691948>] <class 'list'>
print(a_href, type(a_href))  # ['/commands.html', '/commands.html', '/commands/flushdb.html'] <class 'list'>
print(a_text, type(a_text), len(a_text))
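The //a/@href and //a/text() lists are built independently, so their lengths can differ and the items do not necessarily line up tag by tag. A sketch of one way to keep each href together with its text is to select the elements first and read both values from each element. The sample markup and the names pairs, hrefs, and texts are illustrative assumptions.

```python
from lxml import etree

# Illustrative sample (assumption): the third a tag has no direct text.
html_str = """
<li>
    <a href="/commands.html" name="a1">第一個a標籤</a>
    <a href="/commands.html" name="a2">第二個a標籤</a>
    <a href="/commands/flushdb.html"><span>nested</span></a>
</li>
"""
# The sample is well-formed, so the XML parser accepts it.
root = etree.fromstring(html_str)

hrefs = root.xpath("//a/@href")
texts = root.xpath("//a/text()")
print(len(hrefs), len(texts))  # 3 2  (the lists do not line up)

pairs = []
for a in root.xpath("//a"):
    href = a.get("href")             # attribute value, or None if absent
    text = (a.text or "").strip()    # .text is the text before the first child
    pairs.append((href, text))
print(pairs)
```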

3. Finding a specific tag

# Find the span tag with class="first"
span_first = html.xpath("//span[@class='first']")
span_first_text = html.xpath("//span[@class='first']/text()")
print(span_first, type(span_first))   # [<Element span at 0x...>] <class 'list'>
print(span_first_text, type(span_first_text))  # ['這是第一個span標籤\n\t\t', '\n\t'] <class 'list'>
# Find the second a tag
a_second = html.xpath("//a")[1]
# print(a_second, type(a_second))  # <Element a at 0x23844950808> <class 'lxml.etree._Element'>
a_second_text = a_second.text  # lxml elements expose .text; they have no get_text() method
print(a_second_text, type(a_second_text))   # 第二個a標籤 <class 'str'>
a_second_href = a_second.get("href")
print(a_second_href)  # /commands.html
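Selecting a specific tag usually comes down to a predicate in square brackets. The following sketch, on an illustrative copy of the sample markup (an assumption), shows attribute predicates, a positional predicate, and a few XPath 1.0 functions. Keep in mind that XPath positions start at 1, while Python list indexes start at 0.

```python
from lxml import etree

# Illustrative sample (assumption), modelled on the markup in this post.
html_str = """
<li>
    <a href="/commands.html" class="index" name="a1">第一個a標籤</a>
    <a href="/commands.html" class="index2" name="a2">第二個a標籤</a>
    <a href="/commands/flushdb.html"><span class="first">nested</span></a>
</li>
"""
root = etree.fromstring(html_str)

# Match on an attribute value
by_name = root.xpath("//a[@name='a2']/text()")
print(by_name)                      # ['第二個a標籤']

# Second a tag: XPath position (1-based) vs. Python index (0-based)
second_xpath = root.xpath("//li/a[2]/@name")
second_python = root.xpath("//li/a")[1].get("name")
print(second_xpath, second_python)  # ['a2'] a2

# Combine conditions, test a prefix, test for a missing attribute
both = root.xpath("//a[@class='index' and @name='a1']/@name")
prefixed = root.xpath("//a[starts-with(@class, 'index')]/@name")
no_class = root.xpath("//a[not(@class)]/@href")
print(both)      # ['a1']
print(prefixed)  # ['a1', 'a2']
print(no_class)  # ['/commands/flushdb.html']
```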

4. Child tags versus descendant tags

# Find every span at any depth below an a tag under li (descendants)
span_all = html.xpath("//li/a//span")
print(span_all, type(span_all), len(span_all))  # [<Element span at 0x2d9dcd18888>, <Element span at 0x2d9dcd18988>, <Element span at 0x2d9dcd189c8>] <class 'list'> 3
# Find only the spans that are direct children of an a tag under li
span = html.xpath("//li/a/span")
print(span, type(span), len(span))  # [<Element span at 0x188548118c8>, <Element span at 0x18854811a08>] <class 'list'> 2
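Element addresses do not show which spans were matched. A small sketch that returns the class attribute instead makes the difference between / (direct children) and // (all descendants) visible. The sample markup is an illustrative assumption with the same nesting as the post's HTML.

```python
from lxml import etree

# Illustrative sample (assumption): span "second" is nested inside span "first".
html_str = """
<li>
    <a href="/commands/flushdb.html">
        <span class="first">outer
            <span class="second">inner</span>
        </span>
        <span class="third">sibling</span>
    </a>
</li>
"""
root = etree.fromstring(html_str)

# // matches spans at any depth below the a tag
descendants = root.xpath("//li/a//span/@class")
# / matches only spans that are direct children of the a tag
children = root.xpath("//li/a/span/@class")

print(descendants)  # ['first', 'second', 'third']
print(children)     # ['first', 'third']
```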

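Path expressions

Expression	Description
nodename	Selects all nodes with the name nodename.
/	Selects from the root node.
//	Selects nodes in the document from the current node that match the selection, regardless of where they are.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.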
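As a rough sketch of how these path expressions behave in lxml, the example below runs each one against a small illustrative document (an assumption). Calling xpath() on an element evaluates relative paths from that element, which is where ".", "..", and a bare node name are most useful.

```python
from lxml import etree

# Illustrative sample (assumption), parsed with the XML parser so that
# <li> is the document's root element.
html_str = """
<li class="content">
    <a href="/commands/flushdb.html">
        <span class="first"><span class="second">inner</span></span>
        <span class="third">sibling</span>
    </a>
</li>
"""
root = etree.fromstring(html_str)

# "/" starts at the document root; "@" selects an attribute
abs_href = root.xpath("/li/a/@href")
print(abs_href)       # ['/commands/flushdb.html']

# "//" matches at any depth
all_classes = root.xpath("//span/@class")
print(all_classes)    # ['first', 'second', 'third']

# "." and ".." are relative to the element xpath() is called on
inner = root.xpath("//span[@class='second']")[0]
current = inner.xpath(".")[0].get("class")
parent = inner.xpath("..")[0].get("class")
grandparent = inner.xpath("../..")[0].tag
print(current, parent, grandparent)  # second first a

# A bare node name selects child elements with that name
a = root.xpath("//a")[0]
child_spans = a.xpath("span/@class")
print(child_spans)    # ['first', 'third']
```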

Wildcards

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches any node of any kind.
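The three wildcards differ in what kind of node they accept. A minimal sketch on an illustrative one-line document (an assumption; the comment and the leading text are added only to show what node() picks up):

```python
from lxml import etree

# Illustrative sample (assumption): <li> holds text, an element, a comment,
# and another element, with no whitespace between them.
xml_str = (
    '<li data_group="server" class="content">lead'
    '<a href="/commands.html" name="a1">first</a>'
    '<!-- note -->'
    '<h3>title</h3></li>'
)
root = etree.fromstring(xml_str)

# * matches element nodes only
child_tags = [el.tag for el in root.xpath("/li/*")]
print(child_tags)  # ['a', 'h3']

# @* matches every attribute of the selected element
li_attrs = root.xpath("/li/@*")
print(li_attrs)    # ['server', 'content']

# node() also matches text and comment nodes
nodes = root.xpath("/li/node()")
print(len(nodes))  # 4: the text 'lead', <a>, the comment, <h3>
```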

XPath operators

Operator	Description	Example	Return value
|	Computes the union of two node-sets	//book | //cd	Returns a node-set with all book and cd elements
+	Addition	6 + 4	10
-	Subtraction	6 - 4	2
*	Multiplication	6 * 4	24
div	Division	8 div 4	2
=	Equal	price=9.80	true if price is 9.80; false if price is 9.90.
!=	Not equal	price!=9.80	true if price is 9.90; false if price is 9.80.
<	Less than	price<9.80	true if price is 9.00; false if price is 9.90.
<=	Less than or equal to	price<=9.80	true if price is 9.00; false if price is 9.90.
>	Greater than	price>9.80	true if price is 9.90; false if price is 9.80.
>=	Greater than or equal to	price>=9.80	true if price is 9.90; false if price is 9.70.
or	Or	price=9.80 or price=9.70	true if price is 9.80; false if price is 9.50.
and	And	price>9.00 and price<9.90	true if price is 9.80; false if price is 8.50.
mod	Modulus (division remainder)	5 mod 2	1
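To make the operator table concrete, here is a sketch that evaluates a few of these expressions with lxml. The shop/book/cd document is a hypothetical example, not taken from the post. In lxml, numeric XPath results come back as Python floats and boolean results as bool.

```python
from lxml import etree

# Hypothetical sample document (assumption) with book and cd prices.
xml_str = """
<shop>
    <book><price>9.80</price></book>
    <book><price>8.50</price></book>
    <cd><price>9.90</price></cd>
</shop>
"""
root = etree.fromstring(xml_str)

# | builds the union of two node-sets
union_tags = [el.tag for el in root.xpath("//book | //cd")]
print(union_tags)  # ['book', 'book', 'cd']

# Arithmetic operators return floats
total = root.xpath("6 + 4")
quotient = root.xpath("8 div 4")
remainder = root.xpath("5 mod 2")
print(total, quotient, remainder)  # 10.0 2.0 1.0

# Comparisons and and/or are typically used inside predicates
in_range = root.xpath("//price[. > 9.00 and . < 9.90]/text()")
either = root.xpath("//price[. = 9.80 or . = 8.50]/text()")
print(in_range)    # ['9.80']
print(either)      # ['9.80', '8.50']

# A comparison used as the whole expression returns a boolean
has_980 = root.xpath("//book/price = 9.80")
print(has_980)     # True
```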


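Question: what is the difference between html.xpath("//li/a/text()")[1] and html.xpath("//li/a[1]/text()")?

a_second_2 = html.xpath("//li/a/text()")[1]
a_second_1 = html.xpath("//li/a[1]/text()")
print(a_second_2, a_second_1)   # 第二個a標籤 ['第一個a標籤']

"""
a_second_2 prints 第二個a標籤, the text of the second a tag, as a single string.
a_second_1 prints ['第一個a標籤'], the text of the first a tag, as a list.
For node-selecting expressions like these, xpath() returns a list.
a_second_1: the XPath predicate a[1] is 1-based and selects the first a tag under li; the result is still a list.
a_second_2: the expression collects the text nodes of every a tag under li, and the Python index [1] (0-based) then takes the second item of that list.
"""

"""
Printing the text of each a tag (when nothing matches, the list is empty):
html.xpath("//li/a[1]/text()")   # ['第一個a標籤']
html.xpath("//li/a[2]/text()")   # ['第二個a標籤']
html.xpath("//li/a[3]/text()")   # ['\n\t', '\n\t', '\n\t', '\n\t']
html.xpath("//li/a/text()")      # ['第一個a標籤', '第二個a標籤', '\n\t', '\n\t', '\n\t', '\n\t']
When an a tag contains other tags, the whitespace ('\n\t') between those child tags is returned as text nodes as well.
"""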