While building a visual crawler-configuration project, the users configuring a crawler enter XPath and CSS selector paths themselves to extract data or drive browser actions. Since users will sometimes enter an invalid XPath or CSS path, the backend needs to validate these expressions.
For XPath validation, we use `etree.XPathEvalError` from the third-party lxml module. lxml really is a superb tool for parsing scraped data: when `etree`'s `xpath()` method is given an invalid XPath expression, it raises an `XPathEvalError`.
The code is as follows:
```python
from io import StringIO

from lxml import etree


def _validXpathExpression(xpath):
    """Check whether an XPath expression is valid.

    :param xpath: the XPath expression to check
    :return: True if valid, False otherwise
    """
    # Parse a tiny dummy document just to have a tree to evaluate against.
    tree = etree.parse(StringIO('<foo><bar></bar></foo>'))
    try:
        tree.xpath(xpath)
        return True
    except etree.XPathEvalError:
        return False
```
The function returns True only when the given XPath expression is valid. Verification:
```
>>> print(_validXpathExpression('./div[@class="name"]/a/text()'))
True
>>> print(_validXpathExpression('./div(@class="name")'))
False
```
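Note that this check is purely syntactic: a well-formed expression that matches nothing still passes. In a configuration backend, the validator would typically filter a batch of user-supplied paths before saving them. A minimal sketch (the `user_paths` list is made up for illustration):

```python
from io import StringIO

from lxml import etree


def _validXpathExpression(xpath):
    """Return True if the XPath expression is syntactically valid."""
    tree = etree.parse(StringIO('<foo><bar></bar></foo>'))
    try:
        tree.xpath(xpath)
        return True
    except etree.XPathEvalError:
        return False


# Hypothetical user input: keep only the syntactically valid paths.
user_paths = ['./div[@class="name"]/a/text()', './div(@class="name")', '//a/@href']
valid_paths = [p for p in user_paths if _validXpathExpression(p)]
print(valid_paths)  # the invalid './div(@class="name")' is dropped
```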
For CSS selector validation, the idea is to use the `css_to_xpath()` method from the third-party cssselect package. When the given CSS selector is invalid, it raises a `SelectorError`.
The code is as follows:
```python
from cssselect.parser import SelectorError
from cssselect.xpath import HTMLTranslator


def _validCssExpression(css):
    """Check whether a CSS selector is valid.

    :param css: the CSS selector to check
    :return: True if valid, False otherwise
    """
    try:
        HTMLTranslator().css_to_xpath(css)
        return True
    except SelectorError:
        return False
```
The function returns True only when the given CSS selector is valid. Verification:
```
>>> print(_validCssExpression('.content>a'))
True
>>> print(_validCssExpression('.content>a[123]'))
False
```