XPath Usage Summary

 

Reference: https://cuiqingcai.com/5545.html

 

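XPath

  • XML Path Language
  • Finds information in XML documents; it works on HTML documents as well
  • Locates information with path-selection expressions

Common XPath rules

  • nodename: selects the child elements of the current node with that name
  • /: selects direct children of the current node
  • //: selects descendants of the current node, at any depth
  • .: selects the current node
  • ..: selects the parent of the current node
  • @: selects attributes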
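
The rules above can be tried one at a time against a small document. This is a minimal sketch: the two-item snippet and the variable names are made up for illustration and are not part of the original example.

```python
from lxml import etree

# Hypothetical two-item snippet, used only to illustrate the rules above.
html = (
    '<div><ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '</ul></div>'
)
root = etree.HTML(html)

ul = root.xpath('//ul')[0]                   # //       : descendants at any depth
li_tags = [e.tag for e in ul.xpath('li')]    # nodename : child elements named li
a_tags = [e.tag for e in ul.xpath('li/a')]   # /        : direct children
current_tag = ul.xpath('.')[0].tag           # .        : the current node
parent_tag = ul.xpath('..')[0].tag           # ..       : the parent node
hrefs = ul.xpath('.//a/@href')               # @        : attribute values

print(li_tags, a_tags, current_tag, parent_tag, hrefs)
```

Calling xpath() on an element rather than on the whole tree makes that element the "current node" for relative expressions such as `li` or `..`.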

 

text = '''
  <div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

 

Selecting all nodes

from lxml import etree

selector = etree.HTML(text)
result = selector.xpath('//*')
print(result)

Output

[<Element html at 0x1761bfd5508>, <Element body at 0x1761bfd5a88>, <Element div at 0x1761bfd5ac8>, <Element ul at 0x1761bfd5b08>, <Element li at 0x1761bfd5e88>, <Element a at 0x1761bfd5f08>, <Element li at 0x1761bfd5f48>, <Element a at 0x1761bfd5f88>, <Element li at 0x1761bfd5fc8>, <Element a at 0x1761bfd5ec8>, <Element li at 0x1761bfdb048>, <Element a at 0x1761bfdb088>, <Element li at 0x1761bfdb0c8>, <Element a at 0x1761bfdb108>]

Child nodes

from lxml import etree

selector = etree.HTML(text)
result = selector.xpath('//li/a')
print(result)

Output

[<Element a at 0x1761c02dec8>, <Element a at 0x1761c02de88>, <Element a at 0x1761c02df08>, <Element a at 0x1761c02df48>, <Element a at 0x1761c02df88>]

Parent node

from lxml import etree

selector = etree.HTML(text)
result = selector.xpath('//li/..')
print(result)

Output

[<Element ul at 0x1761ae7c288>]
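
Only one element comes back even though several li nodes match, because an XPath node-set holds each node once. The sketch below illustrates this and also shows that `..` is shorthand for the `parent::*` axis; the short snippet is a made-up stand-in for the sample document.

```python
from lxml import etree

# Hypothetical snippet: two <li> elements sharing one parent.
html = '<ul><li>first item</li><li>second item</li></ul>'
root = etree.HTML(html)

# Both <li> nodes have the same parent, so the result holds a single <ul>.
parents_short = [e.tag for e in root.xpath('//li/..')]
# parent::* is the unabbreviated form of ".." and selects the same node.
parents_long = [e.tag for e in root.xpath('//li/parent::*')]

print(parents_short, parents_long)
```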

Attribute matching

from lxml import etree

selector = etree.HTML(text)
result = selector.xpath('//li[@class="item-0"]')
print(result)

Output

[<Element li at 0x1761afe2dc8>, <Element li at 0x1761c067748>]

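Note: the value in [@class="item-0"] must be quoted. XPath accepts either single or double quotes, so use the kind that differs from the quotes delimiting the Python string.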
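
A minimal sketch of the quoting options for an attribute value, using a made-up two-item snippet. The last form relies on lxml's support for XPath variables passed as keyword arguments, which avoids nested quotes altogether; the variable name `cls` is chosen here for illustration.

```python
from lxml import etree

# Hypothetical snippet for illustration.
html = '<ul><li class="item-0">first item</li><li class="item-1">second item</li></ul>'
root = etree.HTML(html)

# Double quotes inside a single-quoted Python string.
double_inside = root.xpath('//li[@class="item-0"]/text()')
# Single quotes inside a double-quoted Python string.
single_inside = root.xpath("//li[@class='item-0']/text()")
# XPath variable ($cls) supplied as a keyword argument.
with_variable = root.xpath('//li[@class=$cls]/text()', cls='item-0')

print(double_inside, single_inside, with_variable)
```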

Getting text

from lxml import etree

selector = etree.HTML(text)
result1 = selector.xpath('//li[@class="item-0"]/text()')
result2 = selector.xpath('//li[@class="item-0"]/a/text()')
print(result1)
print(result2)

Output

['\n     ']
['first item', 'fifth item']

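Note: //li[@class="item-0"]/text() returns ['\n     '] because "/" selects only direct children. The item text sits inside the <a> child, so the only direct text node is the whitespace after </a> in the last <li>, which has no closing tag.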
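
To make the difference between direct and nested text easier to see, here is a minimal sketch on a made-up snippet where the li has visible text of its own. It also shows two ways to collect everything under a node: `//text()` and the XPath `string()` function.

```python
from lxml import etree

# Hypothetical snippet: the <li> has its own text after the <a> child.
html = '<ul><li class="item-0"><a href="link1.html">first item</a> (new)</li></ul>'
root = etree.HTML(html)

# "/" reaches only text nodes that are direct children of <li>.
direct = root.xpath('//li[@class="item-0"]/text()')
# Stepping into <a> first reaches the link text.
nested = root.xpath('//li[@class="item-0"]/a/text()')
# "//" reaches text nodes at any depth below <li>.
all_text = root.xpath('//li[@class="item-0"]//text()')
# string() concatenates all text under the first matching node.
joined = root.xpath('string(//li[@class="item-0"])')

print(direct, nested, all_text, joined)
```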

Getting attributes

from lxml import etree

selector = etree.HTML(text)
result = selector.xpath('//li[@class="item-0"]/a/@href')
print(result)

Output

['link1.html', 'link5.html']

Matching multi-valued attributes

from lxml import etree

text1 = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''

selector = etree.HTML(text1)
result1 = selector.xpath('//li[@class="li"]/a/text()')
result2 = selector.xpath('//li[contains(@class,"li")]/a/text()')
print(result1)
print(result2)

Output

[]
['first item']
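
One caveat worth sketching: contains() is a plain substring test, so contains(@class, "li") would also match a class such as "list". A common XPath 1.0 idiom, shown below on a made-up snippet, pads the normalized class string with spaces and looks for the padded token instead. This is offered as an option, not as something the original example requires.

```python
from lxml import etree

# Hypothetical snippet: the second <li> has a class that merely contains "li".
html = (
    '<ul>'
    '<li class="li li-first"><a href="link.html">first item</a></li>'
    '<li class="list"><a href="other.html">other item</a></li>'
    '</ul>'
)
root = etree.HTML(html)

# Substring test: matches both "li li-first" and "list".
loose = root.xpath('//li[contains(@class, "li")]/a/text()')
# Token test: matches only when "li" is a whole space-separated class name.
token = root.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(loose)
print(token)
```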

Matching multiple attributes

from lxml import etree

text2 = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''

selector = etree.HTML(text2)
result = selector.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')
print(result)

Output

['first item']

Selecting by position

from lxml import etree
 
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
selector = etree.HTML(text)
result1 = selector.xpath('//li[1]/a/text()')
print(result1)
result2 = selector.xpath('//li[last()]/a/text()')
print(result2)
result3 = selector.xpath('//li[position()<3]/a/text()')
print(result3)
result4 = selector.xpath('//li[last()-2]/a/text()')
print(result4)

Output

['first item']
['fifth item']
['first item', 'second item']
['third item']
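
Positions in XPath start at 1, and a predicate such as [1] is evaluated separately for each parent. With a single ul this makes no difference, but with several lists it does. The sketch below uses a made-up document with two lists to contrast `//li[1]` with `(//li)[1]`.

```python
from lxml import etree

# Hypothetical document with two separate lists.
html = (
    '<div>'
    '<ul><li>a1</li><li>a2</li></ul>'
    '<ul><li>b1</li><li>b2</li></ul>'
    '</div>'
)
root = etree.HTML(html)

# [1] applies per parent: the first <li> of each <ul>.
per_parent = root.xpath('//li[1]/text()')
# Parentheses build the full node-set first, then pick its first member.
first_overall = root.xpath('(//li)[1]/text()')
# The same applies to last().
last_overall = root.xpath('(//li)[last()]/text()')

print(per_parent, first_overall, last_overall)
```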

Axis selection

from lxml import etree
 
text3 = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html"><span>first item</span></a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
selector = etree.HTML(text3)
result1 = selector.xpath('//li[1]/ancestor::*')
print(result1)
result2 = selector.xpath('//li[1]/ancestor::div')
print(result2)
result3 = selector.xpath('//li[1]/attribute::*')
print(result3)
result4 = selector.xpath('//child::a[@href="link1.html"]')
print(result4)
result5 = selector.xpath('//li[1]/descendant::span')
print(result5)
result6 = selector.xpath('//li[1]/following::*[2]')
print(result6)
result7 = selector.xpath('//li[1]/following-sibling::*')
print(result7)

Output

[<Element html at 0x1761c02db88>, <Element body at 0x1761c07bf08>, <Element div at 0x1761c078308>, <Element ul at 0x1761c086088>]
[<Element div at 0x1761c078308>]
['item-0']
[<Element a at 0x1761c086288>]
[<Element span at 0x1761c06e6c8>]
[<Element a at 0x1761c06e688>]
[<Element li at 0x1761c078b08>, <Element li at 0x1761c078648>, <Element li at 0x1761c0864c8>, <Element li at 0x1761c086448>]
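
Element reprs with memory addresses are hard to compare, so printing tag names can make the axes easier to follow. This is a minimal sketch on a shortened, made-up version of the list; the helper `tags` is a hypothetical convenience function, not part of lxml.

```python
from lxml import etree

# Hypothetical three-item version of the list above.
html = (
    '<ul>'
    '<li><a href="link1.html"><span>first item</span></a></li>'
    '<li><a href="link2.html">second item</a></li>'
    '<li><a href="link3.html">third item</a></li>'
    '</ul>'
)
root = etree.HTML(html)


def tags(expr):
    """Return the tag names of the elements selected by expr (illustrative helper)."""
    return [e.tag for e in root.xpath(expr)]


# ancestor: every element above the first <li>, in document order.
ancestors = tags('//li[1]/ancestor::*')
# following: every element after the first <li> closes, at any depth.
following = tags('//li[1]/following::*')
# following-sibling: only later elements that share the same parent.
siblings = tags('//li[1]/following-sibling::*')

print(ancestors)
print(following)
print(siblings)
```

The html and body entries appear because etree.HTML wraps a fragment in a full document, just as in the output above.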