XPath Syntax and the lxml Library

Date: 2019-12-12


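1. XPath

1) What is XPath?

XPath (XML Path Language) is a language for locating information in XML and HTML documents. It can be used to traverse the elements and attributes of such documents.

2) XPath development tools

Chrome extension: XPath Helper.
Firefox extension: Try XPath.

1.1 XPath syntax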

<?xml version="1.0" encoding="ISO-8859-1"?>

<bookstore>

<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>

</bookstore>

Sample XML document

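1.1.1 Selecting nodes

XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path, that is, a sequence of steps.

The most useful path expressions are listed below:

Expression	Description
nodename	Selects the child elements of the current node that are named nodename.
/	At the start of a path, selects from the root node; otherwise selects direct children of the preceding node.
//	Selects matching nodes anywhere below the current node (anywhere in the document when it starts the path), regardless of depth.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.

Examples: the table below lists some path expressions and their results:

Path expression	Result
bookstore	Selects the bookstore elements that are children of the current node.
/bookstore	Selects the root element bookstore. Note: a path that starts with a forward slash ( / ) is always an absolute path to an element.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, wherever they are in the document.
bookstore//book	Selects all book elements that are descendants of bookstore, wherever they are below it.
//book[@lang]	Selects all book elements that have a lang attribute (none in the sample document, where lang is set on title).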
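The expressions above can be tried from Python with lxml (covered in section 2). The following is a minimal sketch, assuming the sample document is inlined as a string without its XML declaration; the variable names are illustrative only.

```python
from lxml import etree

# Same content as the sample document; the declaration is omitted so a str can be parsed.
doc = etree.fromstring(
    "<bookstore>"
    "<book><title lang='eng'>Harry Potter</title><price>29.99</price></book>"
    "<book><title lang='eng'>Learning XML</title><price>39.95</price></book>"
    "</bookstore>"
)

first_title = doc.xpath("/bookstore/book/title")[0]

# "." is the context node itself, ".." is its parent, "@lang" is one of its attributes.
same_node = first_title.xpath(".")[0]
parent_tag = first_title.xpath("..")[0].tag
lang = first_title.xpath("@lang")

# A leading "//" searches the whole document, even when called on an inner element.
all_books = first_title.xpath("//book")

# No book element carries a lang attribute in this sample, so the result is empty.
books_with_lang = doc.xpath("//book[@lang]")

print(parent_tag, lang, len(all_books), books_with_lang)
```

Calling xpath() on an inner element is what makes the relative forms ( . , .. and @ ) useful: they are resolved against that element rather than against the document root.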

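1.1.2 Predicates

Predicates are used to find a specific node or a node that contains a specific value.

Predicates are written inside square brackets.

Examples: the table below lists some path expressions with predicates and their results:

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]	Selects all book elements of bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title	Selects the title elements of all book elements of bookstore whose price element has a value greater than 35.00.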
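To see how the positional and value predicates behave, here is a small sketch using lxml. The first two books come from the sample document; the third book ("Untitled Draft") is a hypothetical entry added only so that last()-1 and position()<3 give distinct results. The helper name titles is also illustrative.

```python
from lxml import etree

# Two books from the sample document plus one hypothetical third book.
doc = etree.fromstring(
    "<bookstore>"
    "<book><title lang='eng'>Harry Potter</title><price>29.99</price></book>"
    "<book><title lang='eng'>Learning XML</title><price>39.95</price></book>"
    "<book><title>Untitled Draft</title><price>10.00</price></book>"
    "</bookstore>"
)


def titles(expr):
    """Return the title text of every book selected by expr."""
    return [book.findtext("title") for book in doc.xpath(expr)]


first = titles("/bookstore/book[1]")                # positions start at 1, not 0
last = titles("/bookstore/book[last()]")
second_to_last = titles("/bookstore/book[last()-1]")
first_two = titles("/bookstore/book[position()<3]")
expensive = titles("/bookstore/book[price>35.00]")  # price is compared as a number
english = doc.xpath("//title[@lang='eng']/text()")

print(first, last, second_to_last, first_two, expensive, english)
```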

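1.1.3 Wildcards

XPath wildcards can be used to select unknown XML elements.

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches a node of any type.

Examples: the table below lists some path expressions and their results:

Path expression	Result
/bookstore/*	Selects all child elements of the bookstore element.
//*	Selects all elements in the document.
//title[@*]	Selects all title elements that have at least one attribute.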
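A short sketch of the three wildcards in lxml follows. It assumes a one-book version of the sample document; the id attribute on book is a hypothetical addition so that @* has more than one attribute to match.

```python
from lxml import etree

# One book from the sample; the id attribute is added for illustration only.
doc = etree.fromstring(
    "<bookstore>"
    "<book id='b1'><title lang='eng'>Harry Potter</title><price>29.99</price></book>"
    "</bookstore>"
)

child_tags = [el.tag for el in doc.xpath("/bookstore/book/*")]  # any child element
all_tags = [el.tag for el in doc.xpath("//*")]                  # every element
attr_values = doc.xpath("//@*")                                 # every attribute value
with_attrs = [el.tag for el in doc.xpath("//*[@*]")]            # elements having any attribute
title_nodes = doc.xpath("//title/node()")                       # child nodes of any type

print(child_tags, all_tags, attr_values, with_attrs, title_nodes)
```

Note that attribute matches come back as plain strings and text nodes as strings too, while element matches come back as element objects.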

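1.1.4 Selecting several paths

You can select several paths at once by using the "|" operator in a path expression.

Examples: the table below lists some path expressions and their results:

Path expression	Result
//book/title | //book/price	Selects all title and price elements of book elements.
//title | //price	Selects all title and price elements in the document.
/bookstore/book/title | //price	Selects all title elements of the book elements of bookstore, plus all price elements in the document.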

Detailed XPath syntax reference: http://www.w3school.com.cn/xpath/index.asp

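2. The lxml library

lxml is a Python parsing library. It parses both HTML and XML, supports XPath queries, and is fast because it is built on the C libraries libxml2 and libxslt.

2.1 Commonly used attributes and methods of the lxml classes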

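object ---+
          |
         _Element

# =====================================
# Properties
# =====================================

attrib  # dictionary of the element's attributes
base  # base URL of the original document, or None
sourceline  # line number in the original source, or None
tag  # tag name
tail  # text that follows the element's end tag, up to the next sibling or the parent's end tag
text  # text before the first child element
prefix  # namespace prefix (XML)
nsmap  # mapping of namespace prefixes to URIs (XML)


# =====================================
# Instance methods (commonly used)
# =====================================

xpath(self, _path, namespaces=None, extensions=None, smart_strings=True, **_variables)
# Evaluates an XPath expression with this element as the context node; node-set expressions return a list (empty if nothing matches), other expressions return a string, float or bool

getparent(self)
# Returns the parent element, or None

getprevious(self)
# Returns the preceding sibling, or None

getnext(self)
# Returns the following sibling, or None

getchildren(self)
# Returns all direct children (deprecated in lxml; list(element) is preferred)

getroottree(self)
# Returns the ElementTree of the document that contains this element

find(self, path, namespaces=None)
# Returns the first subelement matching the tag name or path, or None

findall(self, path, namespaces=None)
# Returns all subelements matching the tag name or path

findtext(self, path, default=None, namespaces=None)
# Returns the text of the first subelement matching the tag name or path

clear(self)
# Resets the element: removes all subelements and attributes, and clears text and tail

get(self, key, default=None)
# Returns the value of the attribute key, or default if it is missing

items(self)
# Returns the attribute (name, value) pairs, in arbitrary order

keys(self)
# Returns a list of all attribute names, in arbitrary order

values(self)
# Returns a list of all attribute values, in arbitrary order

set(self, key, value)
# Sets an attribute

Class _Element (top-level base class)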
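The difference between text and tail, and the navigation methods listed above, are easiest to see on a tiny document. The snippet below is an illustrative sketch; the markup and variable names are made up for the example.

```python
from lxml import etree

root = etree.fromstring("<p class='intro'>Hello <b>bold</b> world<i>!</i></p>")
b = root.find("b")

# text is the text before the first child; tail is the text after an element's end tag.
root_text = root.text             # 'Hello '
b_text, b_tail = b.text, b.tail   # 'bold', ' world'

# Navigation between parent and siblings.
parent_tag = b.getparent().tag    # 'p'
next_tag = b.getnext().tag        # 'i'
previous = b.getprevious()        # None, because b is the first child

# Attribute access.
root.set("id", "p1")
attrs = dict(root.attrib)         # {'class': 'intro', 'id': 'p1'}
missing = root.get("lang", "n/a") # default is returned for a missing attribute

# find/findall/findtext accept a tag name or a simple path.
child_tags = [child.tag for child in root.findall("*")]
bang = root.findtext("i")

print(root_text, b_text, b_tail, parent_tag, next_tag, previous)
print(attrs, missing, child_tags, bang)
```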

object ---+
          |
   _Element ---+
               |
              ElementBase

# =====================================
# Functions (commonly used)
# =====================================

HTML(text, parser=None, base_url=None)
# Parses an HTML document from a string and returns its root element

fromstring(text, parser=None, base_url=None)
# Parses an XML document or fragment from a string and returns its root element

tostring(element_or_tree, encoding=None, method="xml", xml_declaration=None, pretty_print=False, with_tail=True, standalone=None, doctype=None, exclusive=False, with_comments=True, inclusive_ns_prefixes=None)
# Serializes an element or tree to an encoded byte string (or to str when encoding="unicode")

tounicode(element_or_tree, method="xml", pretty_print=False, with_tail=True, doctype=None)
# Serializes an element or tree to a Unicode string (deprecated; tostring(..., encoding="unicode") is preferred)

lxml.etree

object ---+
          |
etree._Element ---+
                  |
    etree.ElementBase ---+
                         |
          object ---+    |
                    |    |
            HtmlMixin ---+
                         |
                        HtmlElement


# =====================================
# Functions (commonly used)
# =====================================

fromstring(html, base_url=None, parser=None, **kwargs)
# Parses an HTML string and returns a single element or a document tree, depending on the input

tostring(doc, pretty_print=False, include_meta_content_type=False, encoding=None, method="html", with_tail=True, doctype=None)
# Serializes an element or document tree to a string

Class HtmlMixin

object ---+
          |
          HtmlMixin

# =====================================
# Properties
# =====================================
base_url  # URL of the document
head  # the <head> element
body  # the <body> element
forms  # list of all form elements
label  # the <label> element associated with this element
classes  # set-like view of the values in the class attribute

# =====================================
# Instance methods (commonly used)
# =====================================

drop_tag(self)
# Removes the tag but keeps its children and text, merging them into the parent

drop_tree(self)
# Removes the element together with its children and text, but keeps its tail text, which is joined to the previous sibling or the parent

find_class(self, class_name)
# Finds elements by class name

get_element_by_id(self, id, *default)
# Finds the element with the given id

set(self, key, value=None)
# Sets an attribute

text_content(self)
# Returns the text of the element and all of its descendants

lxml.html
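The HtmlMixin helpers can be sketched on a small fragment. The markup, ids and class names below are hypothetical and exist only for this example.

```python
from lxml import html

# A single-element fragment; lxml.html.fromstring returns that element.
doc = html.fromstring(
    "<div>"
    "<p class='note big'>Hello <b>bold</b> world</p>"
    "<p id='second' class='note'>Second</p>"
    "</div>"
)

notes = doc.find_class("note")                 # elements whose class list contains "note"
first_note = notes[0]
is_big = "big" in first_note.classes           # set-like view of the class attribute
full_text = first_note.text_content()          # text of the element and its descendants
second_text = doc.get_element_by_id("second").text

# drop_tag() removes the <b> tag but keeps its text and tail in the parent.
first_note.find("b").drop_tag()
after_drop = first_note.text
remaining_children = len(first_note)

print(doc.tag, len(notes), is_big, full_text, second_text, after_drop, remaining_children)
```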

2.2 Parsing HTML from a string

To parse an HTML string, use lxml.etree.HTML.

# Use the etree module from lxml
from lxml import etree

# Note: the last <li> below is deliberately left without its closing tag
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

# Parse the string into an HTML document with etree.HTML
htmlElementTree = etree.HTML(text)

# Serialize the HTML document back to a string
result = etree.tostring(htmlElementTree, encoding='utf-8').decode('utf-8')

print(result)


The output is as follows:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 </div>
</body></html>

As you can see, lxml repairs the HTML automatically. In this example it not only closed the li tag but also added the body and html tags.

2.3 Parsing HTML from a file

Besides parsing strings directly, lxml can also read from a file. Create a file named hello.html:

<!-- hello.html -->
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

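To parse an HTML file, use lxml.etree.parse(). By default this function uses the XML parser, so HTML that is not well-formed causes a parse error. In that case, create an HTMLParser yourself and pass it in. Example:

from lxml import etree
# Read the external file hello.html
parser = etree.HTMLParser()  # the HTML parser repairs missing markup while parsing
htmlElementTree = etree.parse('hello.html', parser=parser)
result = etree.tostring(htmlElementTree, encoding='utf-8', pretty_print=True).decode('utf-8')
print(result)

The output is essentially the same as before: the fragment is again wrapped in html and body tags.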
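The difference between the default XML parser and an explicit HTMLParser can be sketched without a file on disk. The snippet below assumes an in-memory io.BytesIO object standing in for hello.html, and a deliberately broken fragment made up for the example.

```python
import io
from lxml import etree

# Two <li> tags are never closed, which is tolerated in HTML but not in XML.
broken = b"<ul><li>first<li>second</ul>"

# Default parser: XML rules apply, so the mismatched tags raise an error.
try:
    etree.parse(io.BytesIO(broken))
    xml_error = None
except etree.XMLSyntaxError as exc:
    xml_error = exc

# Explicit HTML parser: the markup is repaired while parsing.
tree = etree.parse(io.BytesIO(broken), etree.HTMLParser())
items = tree.xpath("//li/text()")

print(type(xml_error).__name__, items)
```

etree.parse() accepts a file name or a file-like object, so the same two calls apply unchanged to a real file path.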

2.4 Combining XPath with lxml

# -*- coding: utf-8 -*-
from lxml import etree
import requests

# Scrape the now-showing movies from the Douban Movies home page
headers = {
    "User-Agent":'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}

response = requests.get('https://movie.douban.com', headers=headers)
text = response.text
parser = etree.HTMLParser()
html = etree.fromstring(text, parser=parser)
ul = html.xpath('//ul[@class="ui-slide-content"]')[0]
li_list = ul.xpath('./li')
movie_list = []
for li in li_list:
    # Skip list items that carry no movie data
    if li.xpath('./@data-title') != []:
        # Each attribute query returns a list of matching values
        data_title = li.xpath('./@data-title')
        data_release = li.xpath('./@data-release')
        data_rate = li.xpath('./@data-rate')
        data_duration = li.xpath('./@data-duration')
        data_director = li.xpath('./@data-director')
        data_actors = li.xpath('./@data-actors')
        data_poster = li.xpath('.//img/@src')
        data = {
            'data_title': data_title,
            'data_release': data_release,
            'data_rate': data_rate,
            'data_duration': data_duration,
            'data_director': data_director,
            'data_actors': data_actors,
            'data_poster': data_poster
        }
        movie_list.append(data)


print(movie_list)

Scraping now-showing movie information from Douban Movies
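The same extraction logic can be exercised offline, which is convenient because the live page may change or reject requests. The sketch below is an assumption-laden variant, not the scraper above: the SAMPLE markup is hypothetical and only mirrors the attribute names used in the script, and the helper parse_movies and the FIELDS tuple are illustrative names. It uses Element.get() so that each field is a string rather than a one-item list.

```python
from lxml import etree

# Hypothetical markup that mimics the attributes the scraper reads.
SAMPLE = """
<ul class="ui-slide-content">
  <li class="ui-slide-item" data-title="Film A" data-release="2019" data-rate="8.1"
      data-duration="120 min" data-director="Director A" data-actors="Actor 1 / Actor 2">
    <img src="https://example.com/a.jpg"/>
  </li>
  <li class="ui-slide-item"></li>
</ul>
"""

FIELDS = ("title", "release", "rate", "duration", "director", "actors")


def parse_movies(markup):
    """Return one dict per <li> that carries a data-title attribute."""
    root = etree.HTML(markup)
    movies = []
    # The predicate [@data-title] filters out empty placeholder items.
    for li in root.xpath('//ul[@class="ui-slide-content"]/li[@data-title]'):
        movie = {name: li.get("data-" + name) for name in FIELDS}
        posters = li.xpath(".//img/@src")
        movie["poster"] = posters[0] if posters else None
        movies.append(movie)
    return movies


movies = parse_movies(SAMPLE)
print(movies)
```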

Use the XML below to practice finding elements of interest with lxml and XPath:

<?xml version="1.0" encoding="utf8"?>
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>

from lxml import etree

xml="""<?xml version="1.0" encoding="utf8"?>
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>
"""
#1) Get the root element
root = etree.fromstring(xml.encode('utf-8'))#<Element bookstore at 0x2044cf28e08>
#2) Select all book child elements; note that xpath() returns a list here
booklist=root.xpath('book')#[<Element book at 0x1bf9d0bddc8>, <Element book at 0x1bf9d0bdd88>]
#3) Select the root element bookstore
bookstore = root.xpath('/bookstore')#[<Element bookstore at 0x2563d6e6ec8>]
#4) Select the title children of all book children
titlelist1 = root.xpath('/bookstore/book/title')#[<Element title at 0x1ceb6736f48>, <Element title at 0x1ceb6736f88>]
titlelist2 = root.xpath('book/title')#[<Element title at 0x22da6316fc8>, <Element title at 0x22da6333048>]
#5) Select all title elements anywhere in the document
titlelist = root.xpath('//title')#[<Element title at 0x195107c3048>, <Element title at 0x195107c3088>]
#6) Select the price elements that are descendants of the book children
pricelist = root.xpath('book//price')#[<Element price at 0x200d84321c8>, <Element price at 0x200d8432208>]
#7) Select the values of all lang attributes in the document
langValue = root.xpath('//@lang')#['eng', 'eng']
#8) Get the first book child of bookstore
book = root.xpath('/bookstore/book[1]')#[<Element book at 0x25f421920c8>]
#9) Get the last book child of bookstore
book_last = root.xpath('/bookstore/book[last()]')#[<Element book at 0x1bf133f2048>]
#10) Select the second-to-last book child of bookstore
print(root.xpath('/bookstore/book[last()-1]'))#[<Element book at 0x1ff5cbf2088>]
#11) Select the first two book children of bookstore
print(root.xpath('/bookstore/book[position()<3]'))#[<Element book at 0x172ac252088>, <Element book at 0x172ac252048>]
#12) Select the title elements anywhere in the document that have a lang attribute
print(root.xpath('//title[@lang]'))#[<Element title at 0x1a2431cb188>, <Element title at 0x1a2431cb1c8>]
#13) Select the title elements anywhere in the document whose lang attribute equals eng
print(root.xpath("//title[@lang='eng']"))#[<Element title at 0x1ac988f1188>, <Element title at 0x1ac988f11c8>]
#14) Select the book children of bookstore whose price child is greater than 35
print(root.xpath('/bookstore/book[price>35.00]'))#[<Element book at 0x2a907bf1088>]
#15) Select the title children of the book children of bookstore whose price child is greater than 35
print(root.xpath('/bookstore/book[price>35.00]/title'))#[<Element title at 0x1f309bf11c8>]
#16) Select all child elements of bookstore
print(root.xpath('/bookstore/*'))#[<Element book at 0x24fe7e51108>, <Element book at 0x24fe7e510c8>]
#17) Select all elements in the document
print(root.xpath('//*'))#[<Element bookstore at 0x195e1061188>, <Element book at 0x195e1061108>, <Element title at 0x195e10611c8>, <Element price at 0x195e10612c8>, <Element book at 0x195e10610c8>, <Element title at 0x195e1061208>, <Element price at 0x195e1061308>]
#18) Select all title elements that have at least one attribute
print(root.xpath('//title[@*]'))#[<Element title at 0x1eb712c1208>, <Element title at 0x1eb712c1248>]
#19) Select all child nodes of the current node; the '\n' strings are whitespace text nodes
print(root.xpath('node()'))#['\n    ', <Element book at 0x23822bb1148>, '\n    ', <Element book at 0x23822bb1108>, '\n']
#20) Select all nodes in the document: elements and text nodes (attribute nodes are not matched by node() here)
print(root.xpath('//node()'))#[<Element bookstore at 0x2013d601208>, '\n    ', <Element book at 0x2013d601188>, '\n        ', <Element title at 0x2013d601248>, 'Harry Potter', '\n        ', <Element price at 0x2013d601348>, '29.99', '\n    ', '\n    ', <Element book at 0x2013d601148>, '\n        ', <Element title at 0x2013d601288>, 'Learning XML', '\n        ', <Element price at 0x2013d601388>, '39.95', '\n    ', '\n']
#21) Select all title and price elements of book elements
print(root.xpath('//book/title|//book/price'))#[<Element title at 0x1c64d751248>, <Element price at 0x1c64d751348>, <Element title at 0x1c64d751288>, <Element price at 0x1c64d751388>]
#22) Select all title and price elements
print(root.xpath('//title|//price'))#[<Element title at 0x212757e1288>, <Element price at 0x212757e1388>, <Element title at 0x212757e12c8>, <Element price at 0x212757e13c8>]

xml_1="""<?xml version="1.0" encoding="utf8"?>
<bookstore>
    <book>
        <title lang="eng">Harry Potter</title>
        <price>29.99</price>
        <content>分部內容
            <part1>
                HarryPotter and the Philosopher's Stone
                    <br>
                    1.大難不死的男孩
                    <br>
                    2.悄悄消失的玻璃
                    <br>
                    3.貓頭鷹傳書
                    <br>
                    4.鑰匙保管員
            </part1>
            <part2>HarryPotter and the Chamber of Secrets</part2>
            <part3>HarryPotter and the Prisoner of Azkaban</part2>
            <part3>HarryPotter and the Prisoner of Azkaban</part2>
        </content>
    </book>
    <book>
        <title lang="eng">Learning XML</title>
        <price>39.95</price>
    </book>
</bookstore>
"""

#23) Get the text of every price element
root = etree.fromstring(xml_1.encode('utf-8'),parser=etree.HTMLParser())
#way1
print(root.xpath('//price/text()'))#['29.99', '39.95']
print(type(root.xpath('//price/text()')[0]))#<class 'lxml.etree._ElementUnicodeResult'>, a str subclass
#way2
price_list = root.xpath('//price')
for price in price_list:
    print(price.xpath("string(.)"))#XPath 1.0 does not allow a function call as a path step, so root.xpath('//price/string(.)') raises an error; call string(.) on each element instead
    #29.99
    #39.95
print(root.xpath('//content/part1/text()'))#["\n                HarryPotter and the Philosopher's Stone\n                    ", '\n                    1.大難不死的男孩\n                    ', '\n                    2.悄悄消失的玻璃\n                    ', '\n                    3.貓頭鷹傳書\n                    ', '\n                    4.鑰匙保管員\n            ']
#24) Notes
#1. Use the Element.xpath() method to select elements with XPath. It returns a list for node-set expressions; expressions such as string(.) return a single value.
#2. To get an attribute of a tag: href = html.xpath('//a/@href')
#3. To get the text of a tag, use text() in the XPath expression: root.xpath('//price/text()')

XPath practice
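The difference between text() and string(), and the fact that xpath() does not always return a list, can be checked with a few lines. This is an illustrative sketch on a made-up one-book document in which the title contains a nested em element.

```python
from lxml import etree

root = etree.fromstring(
    "<book><title lang='eng'>Harry <em>Potter</em></title><price>29.99</price></book>"
)

# text() returns only the text nodes that are direct children of title.
direct_text = root.xpath("//title/text()")

# string() concatenates the text of the element and all of its descendants.
full_text = root.xpath("string(//title)")

# Non-node-set expressions return a single value instead of a list.
price_count = root.xpath("count(//price)")
has_cheap = root.xpath("boolean(//price[. < 30])")

print(direct_text, repr(full_text), price_count, has_cheap)
```

When a page nests inline tags inside the text of interest, string(.) called on each matched element (or lxml.html's text_content()) usually gives the complete text, while text() gives the fragments.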

# -*- coding: utf-8 -*-
from lxml import etree
import requests

BASE_DOMAIN = 'https://www.dytt8.net'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}

def get_detail_urls(url):
    response = requests.request(method='get',url=url,headers=headers)
    html=response.text
    parser = etree.HTMLParser()
    root = etree.fromstring(html,parser=parser)
    movies_url_list = root.xpath('//table[@class="tbspan"]//a/@href')
    #movies_urls = list(map(lambda url:BASE_DOMAIN + url,movies_url_list))
    movies_urls = list(map(lambda url:''.join((BASE_DOMAIN,url)),movies_url_list))
    return movies_urls

def parse_detail_page(url):
    movie = {}
    response = requests.request(method='get',url=url,headers=headers)
    html = response.content.decode('gbk')
    parser = etree.HTMLParser()
    root = etree.fromstring(html, parser=parser)
    title = root.xpath('//h1/font[@color="#07519a"]/text()')[0]
    movie['title'] = title
    zoom = root.xpath('//div[@id="Zoom"]')[0]
    infors = zoom.xpath('.//p/text()')
    for index,infor in enumerate(infors):
        if infor.startswith('◎年　　代'):
            movie['年代'] = infor.replace('◎年　　代','').strip()
        elif infor.startswith('◎產　　地'):
            movie['產地'] = infor.replace('◎產　　地','').strip()
        elif infor.startswith('◎類　　別'):
            movie['類別'] = infor.replace('◎類　　別', '').strip()
        elif infor.startswith('◎語　　言'):
            movie['語言'] = infor.replace('◎語　　言', '').strip()
        elif infor.startswith('◎字　　幕'):
            movie['字幕'] = infor.replace('◎字　　幕', '').strip()
        elif infor.startswith('◎豆瓣評分'):
            movie['豆瓣評分'] = infor.replace('◎豆瓣評分', '').strip()
        elif infor.startswith('◎片　　長'):
            movie['片長'] = infor.replace('◎片　　長', '').strip()
        elif infor.startswith('◎導　　演'):
            movie['導演'] = infor.replace('◎導　　演', '').strip()
        elif infor.startswith('◎主　　演'):
            # The cast list continues on the following lines until the next "◎" field
            movie['主演'] = [infor.replace('◎主　　演', '').strip()]
            for actor in infors[index+1:]:
                if actor.startswith('◎'):
                    break
                movie['主演'].append(actor.strip())
        elif infor.startswith('◎簡　　介'):
            # The synopsis is everything that follows the label
            profile = infor.replace('◎簡　　介', '').strip()
            for line in infors[index+1:]:
                profile = profile + line.strip()
            movie['簡介'] = profile
    # Look up the download link once, after the loop, and tolerate a missing link
    links = root.xpath('//td[@bgcolor="#fdfddf"]/a/@href')
    movie['下載地址'] = links[0] if links else None
    return movie

def spider():
    #url = 'https://www.dytt8.net/html/gndy/dyzz/list_23_1.html'
    base_url = 'https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html'
    movies = []
    for i in range(1,2):
        url = base_url.format(i)
        movies_urls = get_detail_urls(url)
        for detail_url in movies_urls:
            movie = parse_detail_page(detail_url)
            movies.append(movie)
    return movies
if __name__ == '__main__':
    movies = spider()
    print(movies)

Scraping movie information from Dianying Tiantang (dytt8.net)
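The long if/elif chain for single-line fields could also be written as a lookup table. The following is only a sketch of that alternative, not the script above: the function name parse_info_lines, the LABELS mapping and the sample lines are hypothetical, and multi-line fields such as the cast list and the synopsis are deliberately left out.

```python
# Maps the label text found on the page to the key stored in the result dict.
LABELS = {
    "◎年　　代": "年代",
    "◎產　　地": "產地",
    "◎類　　別": "類別",
    "◎語　　言": "語言",
    "◎片　　長": "片長",
}


def parse_info_lines(lines):
    """Extract single-line fields from text lines that start with a known label."""
    movie = {}
    for line in lines:
        line = line.strip()
        for label, key in LABELS.items():
            if line.startswith(label):
                # strip() also removes the full-width spaces that follow the label
                movie[key] = line[len(label):].strip()
                break
    return movie


# Made-up sample lines in the same shape as the detail page text.
sample = ["◎年　　代　2019", "◎產　　地　美國", "◎片　　長　120分鐘", "unrelated line"]
info = parse_info_lines(sample)
print(info)
```

Adding a field then means adding one entry to LABELS instead of another elif branch.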

To be continued
