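Introduction:
I've recently been learning web scraping with Python, and these are my study notes on lxml, a data-parsing module.
About lxml, XPath, and parsers:
lxml is a Python parsing library that handles both HTML and XML, supports XPath queries, and parses very efficiently. XPath, short for XML Path Language, is a language for locating information in XML documents. It was originally designed for searching XML documents, but it works just as well for searching HTML documents.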
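To make this concrete, here is a minimal sketch of the parse-then-query workflow. The HTML snippet and variable names are made up for illustration; only `etree.HTML` and `.xpath()` come from lxml itself.

```python
from lxml import etree

# A tiny hypothetical HTML snippet, used only for illustration.
html_text = (
    "<html><body>"
    "<p>Hello, <a href='http://example.com/'>example</a></p>"
    "</body></html>"
)

root = etree.HTML(html_text)       # parse the string with lxml's HTML parser
hrefs = root.xpath("//a/@href")    # attribute values come back as strings
texts = root.xpath("//a/text()")   # text nodes come back as strings too

print(hrefs)
print(texts)
```

The same `.xpath()` call works on any element in the tree, not just the root, which is what makes relative paths useful later on.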
Node relationships in XML/HTML documents:
Parent
Children
Siblings
Ancestors
Descendants
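The relationships listed above map onto XPath axes and lxml helper methods. The following is a rough sketch on a hypothetical `library` document (the tag names are invented for this example) showing how each relationship can be reached from a single `title` node.

```python
from lxml import etree

# Hypothetical document: library > shelf > book > (title, author)
xml_text = (
    "<library><shelf><book>"
    "<title>A</title><author>X</author>"
    "</book></shelf></library>"
)
root = etree.fromstring(xml_text)
title = root.xpath("//title")[0]
book = root.xpath("//book")[0]

parent = title.getparent().tag                                   # the node directly above title
children = [child.tag for child in book]                         # nodes directly below book
siblings = [e.tag for e in title.xpath("following-sibling::*")]  # nodes sharing title's parent
ancestors = [e.tag for e in title.xpath("ancestor::*")]          # every node above title
descendants = [e.tag for e in root.xpath("descendant::*")]       # every node below the root

print(parent, children, siblings)
print(ancestors)
print(descendants)
```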
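XPath syntax:
Expression   Description
nodename     Selects the child elements with that name (relative to the current node)
//           Selects matching nodes anywhere in the document, regardless of depth
/            Selects from the root node
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes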
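Each of the expressions above can be tried out on a small document. This sketch uses a trimmed-down, hypothetical version of a `notebook` document with two `user` entries; it is only meant to show what each expression returns.

```python
from lxml import etree

# Trimmed-down hypothetical sample with two users.
xml_text = """<notebook>
  <user id="1"><name>lizibin</name></user>
  <user id="2"><name>wsq</name></user>
</notebook>"""
root = etree.fromstring(xml_text)

from_root = root.xpath("/notebook/user/name/text()")  # "/" starts at the document root
anywhere = root.xpath("//name/text()")                 # "//" matches at any depth
users = root.xpath("user")                             # nodename: child elements called user
first_user = users[0]
current = first_user.xpath("./name/text()")            # "." is the current node
parent_tag = first_user.xpath("..")[0].tag             # ".." is the parent node
ids = root.xpath("//user/@id")                         # "@" selects an attribute

print(from_root, anywhere)
print(len(users), current, parent_tag, ids)
```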
Parser comparison:
Parser          Speed     Difficulty
re              Fastest   Hard
BeautifulSoup   Slow      Very easy
lxml            Fast      Easy
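To get a feel for the difficulty column, here is a small sketch that pulls the same links out of a hypothetical HTML fragment with `re` and with lxml. The regular expression is my own rough pattern, not a robust HTML matcher, and BeautifulSoup is left out to keep the example short. No timing is measured here.

```python
import re
from lxml import etree

# Hypothetical fragment with two links.
html_text = (
    '<p><a href="http://example.com/elsie" id="link1">Elsie</a> '
    '<a href="http://example.com/lacie" id="link2">Lacie</a></p>'
)

# re: the pattern has to spell out the tag structure by hand.
links_re = re.findall(r'<a\s+[^>]*href="([^"]+)"', html_text)

# lxml: the structure is parsed once, then queried with a short XPath.
links_lxml = etree.HTML(html_text).xpath("//a/@href")

print(links_re)
print(links_lxml)
```

Both approaches return the same two URLs here, but the regular expression would need reworking if the attributes used single quotes or a different layout, whereas the XPath stays the same.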
Study notes:
# -*- coding: utf-8 -*-
from lxml import etree
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p><b>The Dormouse's story</b></p>
<p>Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p>...</p>
"""
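selector = etree.HTML(html_doc)  # build an element tree from the HTML string
# Collect every link on the page. The <p> tags in html_doc carry no class
# attribute, so the path matches //p/a rather than //p[@class="story"]/a.
links = selector.xpath('//p/a/@href')
for link in links:
    print(link)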
xml_test = """
<?xml version='1.0'?>
<?xml-stylesheet type="text/css" href="first.css"?>
<notebook>
<user id="1" category='cb' class="dba python linux">
<name>lizibin</name>
<sex>m</sex>
<address>sjz</address>
<age>28</age>
<concat>
<email>konigerwin@163.com</email>
<phone>135......</phone>
</concat>
</user>
<user id="2" category='za'>
<name>wsq</name>
<sex>f</sex>
<address>shanghai</address>
<age>25</age>
<concat>
<email>konigerwiner@163.com</email>
<phone>135......</phone>
</concat>
</user>
<user id="3" category='za'>
<name>liqian</name>
<sex>f</sex>
<address>SH</address>
<age>28</age>
<concat>
<email>konigerwinarry@163.com</email>
<phone>135......</phone>
</concat>
</user>
<user id="4" category='cb'>
<name>qiangli</name>
<sex>f</sex>
<address>SH</address>
<age>29</age>
<concat>
<email>konigerwinarry@163.com</email>
<phone>135......</phone>
</concat>
</user>
<user id="5" class="dba linux c java python test teacher">
<name>buzhidao</name>
<sex>f</sex>
<address>SH</address>
<age>999</age>
<concat>
<email>konigerwinarry@163.com</email>
<phone>135......</phone>
</concat>
</user>
</notebook>
"""
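# r = requests.get('http://xxx.com/abc.xml')  # an XML file on a remote server can be fetched the same way
# etree.HTML(r.text.encode('utf-8'))
xml_code = etree.HTML(xml_test)  # build an etree object (parsed here with the lenient HTML parser)
# Select every name element (prints Element objects, shown with their memory addresses)
print(xml_code.xpath('//name'))
# Select the text of every name element (the actual data)
print(xml_code.xpath('//name/text()'))
print('')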
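# Select everything, using notebook as the root node
notebook = xml_code.xpath('//notebook')
# Get the text of the first name node (the data)
print(notebook[0].xpath('.//name/text()')[0])
# Keep the first name element itself (despite the variable name, it holds a name node)
addres = notebook[0].xpath('.//name')[0]
# Get the address text of the sibling that shares this name node's parent
print(addres.xpath('../address/text()'))
# Select an attribute value (no address in this sample has a lang attribute, so the list is empty)
print(addres.xpath('../address/@lang'))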
# Select the name of the first user under notebook
print(xml_code.xpath('//notebook/user[1]/name/text()'))
# Select the name of the last user under notebook
print(xml_code.xpath('//notebook/user[last()]/name/text()'))
# Select the name of the second-to-last user under notebook
print(xml_code.xpath('//notebook/user[last()-1]/name/text()'))
# Select the address of the first two users under notebook
print(xml_code.xpath('//notebook/user[position()<3]/address/text()'))
# Select the names of all users whose category is cb
print(xml_code.xpath('//notebook/user[@category="cb"]/name/text()'))
# Select the names of all users younger than 30
print(xml_code.xpath('//notebook/user[age<30]/name/text()'))
# Select the class attribute of every user whose class attribute contains dba
print(xml_code.xpath('//notebook/user[contains(@class,"dba")]/@class'))
# Select the names of those same users
print(xml_code.xpath('//notebook/user[contains(@class,"dba")]/name/text()'))
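One thing worth noting about the `contains(@class, "dba")` predicate: it is a plain substring test. The sketch below uses a trimmed, partly hypothetical sample (the `dbadmin` class value is invented to show the pitfall) and parses it with `etree.fromstring`, lxml's XML parser, instead of `etree.HTML`. The `concat`/`normalize-space` idiom is a common XPath 1.0 workaround for matching a whole class token.

```python
from lxml import etree

# Trimmed sample; the "dbadmin" class on user 2 is a made-up value for this demo.
xml_text = b"""<?xml version='1.0'?>
<notebook>
  <user id="1" class="dba python linux"><name>lizibin</name><age>28</age></user>
  <user id="2" class="dbadmin"><name>wsq</name><age>25</age></user>
  <user id="5" class="dba linux"><name>buzhidao</name><age>999</age></user>
</notebook>"""

root = etree.fromstring(xml_text)  # strict XML parser; notebook is the root element

# Substring test: also matches "dbadmin".
loose = root.xpath('//user[contains(@class, "dba")]/name/text()')

# Whole-token test: pad the class list with spaces and look for " dba ".
exact = root.xpath(
    '//user[contains(concat(" ", normalize-space(@class), " "), " dba ")]/name/text()'
)

# Numeric comparison on a child element's text.
young = root.xpath('//user[age < 30]/name/text()')

print(loose)
print(exact)
print(young)
```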