Using XPath

Date: 2019-11-18


Introduction to XPath:

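(1) Previously, when scraping a web page, we used regular expressions to extract the information we wanted. That approach is fairly complex, and a single mistake in the pattern means nothing matches, so we can use XPath for the extraction instead.
(2) XPath stands for XML Path Language. It was originally designed for extracting information from XML documents, but it works equally well on HTML documents, where an XPath expression locates one or more HTML nodes.
(3) For an explanation of what an HTML node is, see http://www.javashuo.com/article/p-uskpxpvw-br.html . In Python, the lxml library is used for the extraction; it can be installed with pip3 install lxml.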
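
To make the contrast in point (1) concrete, here is a minimal sketch that extracts link text both ways. The markup in page is a hypothetical sample invented for this illustration, and etree.HTML() is used to parse a string rather than a file. The regular expression is deliberately naive; it is meant to show how easily a pattern misses a tag whose attributes are laid out differently.

```python
import re
from lxml import etree

# Hypothetical sample markup, used only for this illustration.
page = (
    '<ul>'
    '<li><a href="link1.html">first item</a></li>'
    '<li><a class="x" href="link2.html">second item</a></li>'
    '</ul>'
)

# Regex approach: the pattern assumes href is the first attribute,
# so the second link (which has class before href) is silently missed.
regex_titles = re.findall(r'<a href="[^"]*">(.*?)</a>', page)

# XPath approach: describe the node path and let the parser deal with the markup.
tree = etree.HTML(page)
xpath_titles = tree.xpath('//li/a/text()')

print(regex_titles)   # only the first link matches the pattern
print(xpath_titles)   # both links are found
```

A more careful regular expression could of course handle this case; the point is only that the XPath expression does not need to know how the attributes are written.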

XPath rules:

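// selects matching nodes at any depth (at the start of an expression, anywhere in the document); for example, //p selects all <p> nodes
/ selects direct children of the current node; for example, //ul/li selects the <li> nodes directly under a <ul>
.. selects the parent of the current node; for example, //body/.. selects <html>, the node one level above <body>
@ selects attributes; for example, //a[@href] selects the <a> nodes that have an href attribute, and //a/@href returns the href attribute values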
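
The four rules above can be tried out with a short sketch like the following. The markup passed to etree.HTML() is a hypothetical sample, not the index.html used later, and the variable names are chosen only for this illustration.

```python
from lxml import etree

# Hypothetical sample markup for illustration.
doc = etree.HTML(
    '<body><p>intro</p>'
    '<ul><li><a href="a.html">A</a></li><li><a>B</a></li></ul>'
    '</body>'
)

all_p = doc.xpath('//p')             # every <p> anywhere in the document
items = doc.xpath('//ul/li')         # <li> nodes that are direct children of a <ul>
parent = doc.xpath('//body/..')      # the parent of <body>, i.e. the <html> node
with_href = doc.xpath('//a[@href]')  # <a> nodes that have an href attribute
hrefs = doc.xpath('//a/@href')       # the href attribute values themselves

print(len(all_p), len(items), parent[0].tag, len(with_href), hrefs)
```

Note the difference between the last two expressions: a predicate such as [@href] filters nodes, while a trailing /@href step returns the attribute values.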

XPath usage:

Assume index.html contains the following; we use XPath to extract the content we want:

<div>
    <ul>
        <li class="item-1"><a href="link1.html">first item</a> </li>
        <li class="item-2"><a href="link2.html">second item</a> </li>
        <li class="item-3"><a href="link3.html">third item</a> </li>
        <li class="item-4"><a href="link4.html">fourth item</a> </li>
        <li class="item-5 id-6"><a href="link5.html">fifth item</a> </li>
    </ul>
</div>

from lxml import etree

html = etree.parse('./index.html', etree.HTMLParser())   # etree.parse() loads a local file; etree.HTMLParser() is an HTML parser used to parse the HTML file

result = html.xpath('//li')                              # all <li> nodes, result: [<Element li at 0x488030>, <Element li at 0x484fd0>, ......
result = html.xpath('//li/a')                            # the <a> nodes directly under every <li>, result: [<Element a at 0x3c28030>, <Element a at 0x3c24fd0>, ......
result = html.xpath('//li/a/text()')                     # the text of those <a> nodes, result: ['first item', 'second item', 'third item', 'fourth item', 'fifth item']
result = html.xpath('//li/a/@href')                      # the href attribute values of those <a> nodes, result: ['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
result = html.xpath('//li[1]/a/text()')                  # the link text in the first <li> (XPath positions start at 1), result: ['first item']
result = html.xpath('//li[last()]/a/text()')             # the link text in the last <li>, result: ['fifth item']
result = html.xpath('//li[last()-2]/a/text()')           # the link text in the third-from-last <li>, result: ['third item']
result = html.xpath('//li[position()<3]/a/text()')       # the link text in the <li> nodes whose position is less than 3, result: ['first item', 'second item']
result = html.xpath('//a[@href="link4.html"]')           # the <a> node whose href equals "link4.html", result: [<Element a at 0x33b4fd0>]
result = html.xpath('//a[@href="link4.html"]/../@class') # go to the parent of that <a> node, then read the parent's class attribute, result: ['item-4']
result = html.xpath('//li[1]/ancestor::*')               # the ancestor axis selects ancestor nodes; here, every node above the first <li>, result: [<Element html at 0x3738058>, <Element body at 0x3734fd0>, ......
result = html.xpath('//li[1]/ancestor::div')             # only the <div> ancestors of the first <li>, result: [<Element div at 0x3554fd0>]
result = html.xpath('//li[1]/attribute::*')              # the attribute axis selects attribute values; here, all attribute values of the first <li>, result: ['item-1']
result = html.xpath('//li[1]/child::*')                  # the child axis selects direct child elements; here, those of the first <li>, result: [<Element a at 0xb58030>]
result = html.xpath('//li[contains(@class, "item") and contains(@class, "id")]')    # the last <li> above has two class values, so contains() is needed for a substring match; this selects nodes whose class contains both "item" and "id"
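
As a hedged illustration of why contains() is needed for the last <li>, the sketch below compares an exact match on @class with the substring match. It parses a shortened copy of index.html from a string with etree.HTML() so it can run without the file; the variable names snippet, exact, fuzzy and links are chosen only for this example.

```python
from lxml import etree

# A shortened copy of the index.html above, kept in a string for convenience.
snippet = '''
<div>
    <ul>
        <li class="item-4"><a href="link4.html">fourth item</a> </li>
        <li class="item-5 id-6"><a href="link5.html">fifth item</a> </li>
    </ul>
</div>
'''
html = etree.HTML(snippet)

# Exact comparison: the attribute value is "item-5 id-6", not "item-5", so nothing matches.
exact = html.xpath('//li[@class="item-5"]/a/text()')

# Substring comparison: matches because the value contains both "item" and "id".
fuzzy = html.xpath('//li[contains(@class, "item") and contains(@class, "id")]/a/text()')

# Element results can be processed further in Python, e.g. to pair each href with its text.
links = [(a.get('href'), a.text) for a in html.xpath('//li/a')]

print(exact)
print(fuzzy)
print(links)
```

Keep in mind that contains() is a plain substring test, so contains(@class, "id") would also match a class such as "hidden"; for stricter matching the predicate has to be tightened accordingly.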
