When it comes to crawling and parsing content, we have plenty of options. Many people, for example, assume Jsoup alone can solve everything: HTTP requests, DOM manipulation, and CSS selector queries are all very convenient with it.
The catch is the selector: a single expression can only ever select nodes. If I want a text value or a node's attribute, I have to make a second call on the returned Element object.
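To make that two-step dance concrete, here is a minimal Jsoup sketch (my own illustration, reusing the zhidao.baidu.com page that appears later in this post):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTwoStepDemo {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://zhidao.baidu.com/daily").get();
        // Step 1: the selector can only hand back Element nodes...
        for (Element link : doc.select("h2 > a")) {
            // Step 2: ...so text and attribute values need a second call.
            System.out.println(link.text() + " -> " + link.attr("href"));
        }
    }
}
```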
Then I happened to receive an interesting requirement: describe the content to extract with just one expression, and pull out the title, link, and other fields of every news item on a news page.
XPath is a perfect fit for that. For example:
```java
static void crawlByXPath(String url, String xpathExp) throws IOException,
        ParserConfigurationException, SAXException, XPathExpressionException {
    String html = Jsoup.connect(url).post().html();
    // javax.xml expects well-formed XML, so hand it the HTML as an InputSource
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document document = builder.parse(new InputSource(new StringReader(html)));
    XPathFactory xPathFactory = XPathFactory.newInstance();
    XPath xPath = xPathFactory.newXPath();
    XPathExpression expression = xPath.compile(xpathExp);
    // evaluate against the parsed document, not the raw HTML string
    System.out.println(expression.evaluate(document));
}
```
Unfortunately, hardly any real-world website makes it through the documentBuilder.parse call in that code: the parser insists on well-formed XML, and XPath is just as strict about the DOM it runs on.
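To see just how strict that is, here is a hedged illustration (my own snippet, not from the original post): even one unclosed tag is enough to make DocumentBuilder give up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class StrictParserDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical markup: one unclosed <br> is enough to break XML parsing.
        String sloppyHtml = "<html><body><br><p>typical real-world HTML</body></html>";
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(sloppyHtml)));
        } catch (SAXException e) {
            // Typically something like: The element type "br" must be
            // terminated by the matching end-tag "</br>".
            System.out.println("parse failed: " + e.getMessage());
        }
    }
}
```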
So the HTML needs a cleaning pass first, which is why I added this dependency:
```xml
<dependency>
    <groupId>net.sourceforge.htmlcleaner</groupId>
    <artifactId>htmlcleaner</artifactId>
    <version>2.9</version>
</dependency>
```
HtmlCleaner solves exactly this problem, and it even ships with its own XPath support. A single HtmlCleaner.clean call takes care of the cleanup:
```java
public static void main(String[] args) throws IOException, XPatherException {
    String url = "http://zhidao.baidu.com/daily";
    String contents = Jsoup.connect(url).post().html();
    HtmlCleaner hc = new HtmlCleaner();
    TagNode tn = hc.clean(contents);            // one clean() call repairs the markup
    String xpath = "//h2/a/@href";
    Object[] objects = tn.evaluateXPath(xpath); // HtmlCleaner's own XPath engine
    System.out.println(objects.length);
}
```
But HtmlCleaner raises a new problem of its own: when I write the expression as "//h2/a[contains(@href,'daily')]/@href", it tells me the contains function is not supported.
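For reference, this is roughly how that failure shows up (my own reproduction sketch; HtmlCleaner's built-in engine covers only a subset of XPath 1.0):

```java
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;

public class ContainsDemo {
    public static void main(String[] args) {
        String html = "<html><body><h2><a href='/daily/1'>news</a></h2></body></html>";
        TagNode tn = new HtmlCleaner().clean(html);
        try {
            tn.evaluateXPath("//h2/a[contains(@href,'daily')]/@href");
        } catch (XPatherException e) {
            // Function calls like contains() are outside the supported subset.
            e.printStackTrace();
        }
    }
}
```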
javax.xml.xpath, on the other hand, does support XPath functions, so now the question is: how do we combine the two? HtmlCleaner provides DomSerializer, which converts a TagNode into an org.w3c.dom.Document, for example:
```java
Document dom = new DomSerializer(new CleanerProperties()).createDOM(tn);
```
With that, each library gets to play to its strengths:
```java
public static void main(String[] args) throws IOException, XPatherException,
        ParserConfigurationException, XPathExpressionException {
    String url = "http://zhidao.baidu.com/daily";
    String exp = "//h2/a[contains(@href,'daily')]/@href";
    String html = null;
    try {
        Connection connect = Jsoup.connect(url);
        html = connect.get().body().html();
    } catch (IOException e) {
        e.printStackTrace();
    }
    // HtmlCleaner repairs the markup, DomSerializer hands it to javax.xml
    HtmlCleaner hc = new HtmlCleaner();
    TagNode tn = hc.clean(html);
    Document dom = new DomSerializer(new CleanerProperties()).createDOM(tn);
    // javax.xml.xpath evaluates the expression, contains() and all
    XPath xPath = XPathFactory.newInstance().newXPath();
    Object result = xPath.evaluate(exp, dom, XPathConstants.NODESET);
    if (result instanceof NodeList) {
        NodeList nodeList = (NodeList) result;
        System.out.println(nodeList.getLength());
        for (int i = 0; i < nodeList.getLength(); i++) {
            Node node = nodeList.item(i);
            System.out.println(node.getNodeValue() == null
                    ? node.getTextContent() : node.getNodeValue());
        }
    }
}
```