XPath Parsing with lxml

BeautifulSoup can use lxml as its default parser, and lxml can also be used on its own. The two are compared below:

  • BeautifulSoup and lxml work differently. BeautifulSoup is DOM-based: it loads the whole document and parses the entire DOM tree, so its time and memory overhead are considerably higher. lxml, by contrast, queries and processes HTML/XML documents with XPath and only traverses the parts it needs, so it is faster. Fortunately, BeautifulSoup can now use lxml as its default parsing library.

  • For details on XPath syntax, see: https://www.cnblogs.com/guguobao/p/9401643.html

  • Example:

#coding:utf-8

from lxml import etree
html_str = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
html = etree.HTML(html_str)    # build the element tree; missing closing tags are repaired
result = etree.tostring(html)  # serialize the tree back to bytes
print(result.decode('utf-8'))

Notice that the final </body> and </html> tags in html_str are not closed, yet etree.tostring(html) automatically repairs the HTML.
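The auto-repair is easiest to see on a small fragment. A minimal sketch (the broken string below is made up for illustration):

```python
from lxml import etree

# A deliberately unclosed fragment: no </p>, </body>, or </html>
broken = "<html><body><p>hello"
doc = etree.HTML(broken)

# Serializing the parsed tree yields well-formed HTML with the
# missing closing tags filled in
fixed = etree.tostring(doc).decode("utf-8")
print(fixed)  # ends with </p></body></html>
```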

from lxml import etree
# etree.parse uses a strict XML parser by default; pass HTMLParser so the
# non-well-formed HTML in index.html is accepted
html = etree.parse('index.html', etree.HTMLParser())
result = etree.tostring(html, pretty_print=True)
print(result.decode('utf-8'))

Besides reading from a string, lxml can also read an HTML file directly. Suppose html_str has been saved as index.html; it can then be parsed with the parse method (code above).
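The save-then-parse round trip can be sketched end to end. This is an illustrative sketch, assuming html_str is the string defined earlier and index.html is written to the current directory:

```python
from lxml import etree

# Assumption: a shortened stand-in for the html_str defined earlier
html_str = "<html><head><title>The Dormouse's story</title></head><body><p>..."

# Save the string to index.html, as the text above assumes
with open("index.html", "w", encoding="utf-8") as f:
    f.write(html_str)

# Read it back; HTMLParser tolerates the missing closing tags
tree = etree.parse("index.html", etree.HTMLParser())
print(tree.xpath("//title/text()"))  # the <title> text survives the round trip
```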

Next, use an XPath expression to extract the URLs from the HTML:

html = etree.HTML(html_str)
urls = html.xpath(".//*[@class='sister']/@href")  # href of every element with class="sister"
print(urls)
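A few more XPath queries against the same document round out the example. A minimal sketch (html_str is repeated in shortened form so the snippet runs on its own):

```python
from lxml import etree

# Shortened stand-in for the html_str defined earlier
html_str = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
html = etree.HTML(html_str)

print(html.xpath("//title/text()"))               # the page title text
print(html.xpath("//a[@id='link2']/text()"))      # one link selected by id
print(html.xpath("count(//a[@class='sister'])"))  # number of matching links, as a float
```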