Introduction to XPath selectors and how to use them

Date: 2019-11-24

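1. Summary

In one sentence: XPath stands for XML Path Language. It is a language for locating information in structured documents such as XML and HTML. XPath uses path expressions to select a node or a set of nodes in a document, and nodes are selected by following a path, one step at a time.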

(1) How is XPath used?

Excerpt from the full listing in section 3 (skipped lines are marked with ...):

xml = loadXMLDoc("/example/xmle/books.xml");
path = "/bookstore/book/title";
...
// code for Mozilla, Firefox, Opera, etc.
else if (document.implementation && document.implementation.createDocument)
{
  var nodes = xml.evaluate(path, xml, null, XPathResult.ANY_TYPE, null);
  var result = nodes.iterateNext();

  while (result)
  ...

(2) How is a plugin used?

Using any plugin really comes down to the following steps:

(1) Include the plugin

(2) Call its functions
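The two steps above can be sketched in Python, where "including the plugin" is an import. This is only an illustration: it uses the standard library's `xml.etree.ElementTree` rather than any particular plugin, and the one-book document is made up for the example.

```python
# Step 1: bring in the library. ElementTree ships with Python.
import xml.etree.ElementTree as ET

# Step 2: call its functions. Parse a tiny (made-up) document, then query it.
root = ET.fromstring(
    "<bookstore><book><title>Learning XML</title></book></bookstore>"
)
titles = [t.text for t in root.findall("./book/title")]
print(titles)
```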

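2. The XPath selector

XPath stands for XML Path Language. It is a language for locating information in structured documents such as XML and HTML. XPath uses path expressions to select a node or a set of nodes in a document, and nodes are selected by following a path, one step at a time.

XPath is a language for finding information in an XML document. It can be used to traverse the elements and attributes of an XML document.

XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.

An understanding of XPath is therefore the foundation for many advanced XML applications.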

1. Syntax

1.1 Sample HTML document

The examples below use the following HTML document to demonstrate XPath: http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

<html>
    <head>
        <base href='http://example.com/' />
        <title>Example website</title>
    </head>
    <body>
        <div id='images'>
            <a href='image1.html'>Name: My image 1 <br/><img src='image1_thumb.jpg'/></a>
            <a href='image2.html'>Name: My image 2 <br/><img src='image2_thumb.jpg'/></a>
            <a href='image3.html'>Name: My image 3 <br/><img src='image3_thumb.jpg'/></a>
            <a href='image4.html'>Name: My image 4 <br/><img src='image4_thumb.jpg'/></a>
            <a>Name: My image 5 <br/><img src='image5_thumb.jpg'/></a>
        </div>
    </body>
</html>

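1.2 Selecting nodes

The table below lists the most commonly used XPath syntax. The examples refer to the HTML document above.

Expression	Description	Example	Result
nodename	Selects the child elements with this name	body	Selects the body elements that are children of the current node
/	Selects from the root node	/html	Selects the root element html
//	Selects matching nodes anywhere below the current node, regardless of position	//img	Selects all img elements, wherever they are in the document
.	Selects the current node	./img	Selects the img children of the current node
..	Selects the parent of the current node	../img	Selects the img children of the current node's parent
@	Selects attributes	//a[@href="image1.html"]	Selects all a elements whose href attribute is "image1.html"
		//a/@href	Returns the href attribute of every a element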
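The expressions in the table can be tried out with a short script. The sketch below is an assumption-laden illustration, not the article's own code: it uses Python's standard `xml.etree.ElementTree`, which implements only a subset of XPath (paths must be relative, so they start with `.`, and there is no `/@href` step), and it relies on the sample page being well-formed enough for an XML parser.

```python
import xml.etree.ElementTree as ET

# The sample page from section 1.1.
HTML = """<html>
  <head>
    <base href='http://example.com/' />
    <title>Example website</title>
  </head>
  <body>
    <div id='images'>
      <a href='image1.html'>Name: My image 1 <br/><img src='image1_thumb.jpg'/></a>
      <a href='image2.html'>Name: My image 2 <br/><img src='image2_thumb.jpg'/></a>
      <a href='image3.html'>Name: My image 3 <br/><img src='image3_thumb.jpg'/></a>
      <a href='image4.html'>Name: My image 4 <br/><img src='image4_thumb.jpg'/></a>
      <a>Name: My image 5 <br/><img src='image5_thumb.jpg'/></a>
    </div>
  </body>
</html>"""

root = ET.fromstring(HTML)  # root is the html element

# './/img': every img element below the current node, at any depth
imgs = root.findall(".//img")
print(len(imgs))  # 5

# './head/title': walk down through child elements, one step at a time
title = root.find("./head/title")
print(title.text)  # Example website

# '..': the parent of each matched node (every img sits inside an a)
parents = root.findall(".//img/..")
print([p.tag for p in parents])  # ['a', 'a', 'a', 'a', 'a']

# '[@href=...]': keep only elements with a given attribute value
link = root.find(".//a[@href='image1.html']")
print(link.get("href"))  # image1.html

# ElementTree cannot return attributes directly, so read them per element
hrefs = [a.get("href") for a in root.findall(".//a[@href]")]
print(hrefs)  # ['image1.html', 'image2.html', 'image3.html', 'image4.html']
```

With a full XPath engine such as lxml, the absolute forms from the table (`/html`, `//img`, `//a/@href`) can be passed as written.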

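1.3 Predicates

A predicate finds a specific node, or a node that contains a specified value. Predicates are written inside square brackets.

Path expression	Result
//body//a[1]	Selects each a element under body that is the first a child of its parent
//body//a[last()]	Selects each a element under body that is the last a child of its parent
//a[@href]	Selects all a elements that have an href attribute
//a[@href='image2.html']	Selects all a elements whose href attribute equals "image2.html"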
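The four predicates in the table can be checked against the sample page. This is a minimal sketch using the standard library's `xml.etree.ElementTree`, chosen here as an assumption because it needs no installation. It supports positional and attribute predicates, but paths must be written relative to an element, hence the leading `.`.

```python
import xml.etree.ElementTree as ET

# Body of the sample page from section 1.1 (head omitted for brevity).
HTML = """<html>
  <body>
    <div id='images'>
      <a href='image1.html'>Name: My image 1 <br/><img src='image1_thumb.jpg'/></a>
      <a href='image2.html'>Name: My image 2 <br/><img src='image2_thumb.jpg'/></a>
      <a href='image3.html'>Name: My image 3 <br/><img src='image3_thumb.jpg'/></a>
      <a href='image4.html'>Name: My image 4 <br/><img src='image4_thumb.jpg'/></a>
      <a>Name: My image 5 <br/><img src='image5_thumb.jpg'/></a>
    </div>
  </body>
</html>"""

root = ET.fromstring(HTML)

# [1]: the a element that is the first a child of its parent
first = root.findall(".//body//a[1]")
print(first[0].get("href"))  # image1.html

# [last()]: the a element that is the last a child of its parent
last = root.findall(".//body//a[last()]")
print(last[0].find("img").get("src"))  # image5_thumb.jpg

# [@href]: only a elements that carry an href attribute
with_href = root.findall(".//a[@href]")
print(len(with_href))  # 4

# [@href='image2.html']: match on the attribute value
second = root.findall(".//a[@href='image2.html']")
print(second[0].find("img").get("src"))  # image2_thumb.jpg
```

Note that positions are counted among siblings, so on a page with several containers `a[1]` would return one element per container.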

2. Using XPath in Python

Using XPath in Python requires installing a suitable library. The lxml library is recommended here.

Code example:

# -*- coding: utf-8 -*-

from lxml import etree

html = """<html>
    <head>
        <base href='http://example.com/' />
        <title>Example website</title>
    </head>
    <body>
        <div id='images'>
            <a href='image1.html'>Name: My image 1 <br/><img src='image1_thumb.jpg'/></a>
            <a href='image2.html'>Name: My image 2 <br/><img src='image2_thumb.jpg'/></a>
            <a href='image3.html'>Name: My image 3 <br/><img src='image3_thumb.jpg'/></a>
            <a href='image4.html'>Name: My image 4 <br/><img src='image4_thumb.jpg'/></a>
            <a>Name: My image 5 <br/><img src='image5_thumb.jpg'/></a>
        </div>
    </body>
</html>"""

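# Parse the HTML string into an element tree (etree is imported above)
tree = etree.HTML(html)

# Absolute path starting from the root node
page = tree.xpath('/html/head/base/@href')  # ['http://example.com/']
# '//' matches descendants at any depth; here the result is the same
page = tree.xpath('/html/head//@href')
# Text of the title element
page = tree.xpath("//title/text()")  # ['Example website']
# All a elements in the document
page = tree.xpath('//a')
# src of the img inside the a element whose href is 'image1.html'
page = tree.xpath("//a[@href='image1.html']/img/@src")
# href values that contain the character '3'
page = tree.xpath("//a[contains(@href, '3')]/@href")
# src of the img inside the last a element under body
page = tree.xpath("//body//a[last()]/img/@src")
# Chain predicates to match on several attributes at once; the sample
# page has no class or id on its a elements, so this returns []
page = tree.xpath('//a[@class="active"][@id="value"]/img/@src')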

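3. Common functions

Besides indexes and attributes, XPath also provides convenient functions that make node selection more precise. Here are a few commonly used ones:

# All a elements whose href attribute contains "promote.html"
//a[contains(@href,'promote.html')]

# All a elements whose text is "應用推廣" ("App promotion")
//a[text()='應用推廣']

# All a elements whose href attribute value starts with "/ads"
//a[starts-with(@href,'/ads')]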
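To show what these three functions test for, the sketch below reproduces each condition with plain Python string operations. It deliberately swaps in the standard library's `xml.etree.ElementTree`, whose XPath subset does not implement `contains()`, `starts-with()`, or `text()`, so the filtering is done in Python instead of in the path expression. The link markup is hypothetical and was invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical links, made up for this sketch.
PAGE = """<div>
  <a href='/ads/banner.html'>Banner</a>
  <a href='/pages/promote.html'>App promotion</a>
  <a href='/about.html'>About</a>
</div>"""

root = ET.fromstring(PAGE)
links = root.findall(".//a[@href]")  # every a element that has an href

# Same condition as //a[contains(@href,'promote.html')]
contains = [a for a in links if "promote.html" in a.get("href")]
print([a.get("href") for a in contains])  # ['/pages/promote.html']

# Same condition as //a[text()='App promotion'] for links without child
# elements (a.text holds only the text before the first child element)
by_text = [a for a in links if a.text == "App promotion"]
print([a.get("href") for a in by_text])  # ['/pages/promote.html']

# Same condition as //a[starts-with(@href,'/ads')]
starts = [a for a in links if a.get("href").startswith("/ads")]
print([a.get("href") for a in starts])  # ['/ads/banner.html']
```

With lxml, the XPath expressions in the comments can be passed to `xpath()` unchanged, which is usually the shorter route.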

Reference: xpath選擇器 (XPath selectors), moon's blog, CSDN
https://blog.csdn.net/qq_32942549/article/details/78400675

3. An XPath usage example

(1) The XML to work with (books.xml)

<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book category="COOKING">
  <title lang="en">Everyday Italian</title>
  <author>Giada De Laurentiis</author>
  <year>2005</year>
  <price>30.00</price>
</book>

<book category="CHILDREN">
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

<book category="WEB">
  <title lang="en">XQuery Kick Start</title>
  <author>James McGovern</author>
  <author>Per Bothner</author>
  <author>Kurt Cagle</author>
  <author>James Linn</author>
  <author>Vaidyanathan Nagarajan</author>
  <year>2003</year>
  <price>49.99</price>
</book>

<book category="WEB">
  <title lang="en">Learning XML</title>
  <author>Erik T. Ray</author>
  <year>2003</year>
  <price>39.95</price>
</book>

</bookstore>

(2) Requirement and code

Select all titles

The following example selects all title nodes:

/bookstore/book/title

<html>
<body>
<script type="text/javascript">
// Fetch an XML file synchronously and return the parsed document.
function loadXMLDoc(dname)
{
  var xhttp;
  if (window.XMLHttpRequest)
  {
    xhttp = new XMLHttpRequest();
  }
  else
  {
    // Legacy Internet Explorer
    xhttp = new ActiveXObject("Microsoft.XMLHTTP");
  }
  xhttp.open("GET", dname, false);
  xhttp.send("");
  return xhttp.responseXML;
}

var xml = loadXMLDoc("/example/xmle/books.xml");
var path = "/bookstore/book/title";
// code for IE
if (window.ActiveXObject)
{
  var nodes = xml.selectNodes(path);

  for (var i = 0; i < nodes.length; i++)
  {
    document.write(nodes[i].childNodes[0].nodeValue);
    document.write("<br />");
  }
}
// code for Mozilla, Firefox, Opera, etc.
else if (document.implementation && document.implementation.createDocument)
{
  var nodes = xml.evaluate(path, xml, null, XPathResult.ANY_TYPE, null);
  var result = nodes.iterateNext();

  // Walk the result iterator and print the text of each title
  while (result)
  {
    document.write(result.childNodes[0].nodeValue);
    document.write("<br />");
    result = nodes.iterateNext();
  }
}
</script>

</body>
</html>
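For comparison, the same query can be sketched in Python. This is a rough counterpart rather than a port of the listing above: it assumes the standard library's `xml.etree.ElementTree`, embeds a trimmed copy of books.xml (titles only) instead of fetching the file, and writes the path relative to the root element, because ElementTree does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of books.xml: only the title of each book is kept.
BOOKS = """<bookstore>
  <book category="COOKING"><title lang="en">Everyday Italian</title></book>
  <book category="CHILDREN"><title lang="en">Harry Potter</title></book>
  <book category="WEB"><title lang="en">XQuery Kick Start</title></book>
  <book category="WEB"><title lang="en">Learning XML</title></book>
</bookstore>"""

root = ET.fromstring(BOOKS)  # root is the bookstore element

# /bookstore/book/title, written relative to the bookstore root
titles = [t.text for t in root.findall("./book/title")]

for title in titles:
    print(title)
```

The loop prints one title per line, in document order, which matches what the browser script writes to the page.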