After a comprehensive comparison (several thousand words omitted here), HtmlUnit was chosen as the tool for parsing web pages.
The HtmlUnit package is pulled in through Maven:
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.28</version>
</dependency>
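Below is the core logic for parsing the book categories. Most of the effort goes into analyzing the page source, finding the pattern of nodes level by level, and then extracting the data you need.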
import java.util.LinkedList;
import java.util.List;

import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.DomNodeList;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public List<BookCategory> categoryFromDangdang() {
    List<BookCategory> lsCategory = new LinkedList<BookCategory>();
    String categoryUrl = "http://category.dangdang.com/?ref=www-0-C";
    try {
        HtmlPage page = webClientGetPage(categoryUrl, false, false, null);
        List<DomElement> ll = page.getElementsByTagName("div");
        DomElement bookElement = null;
        for (int i = 0; i < ll.size(); i++) {
            DomElement e = ll.get(i);
            String s = e.getAttribute("class");
            if (s.equalsIgnoreCase("classify_con")) {
                System.out.println("find book. class=" + s);
                // Top-level book node found within the whole HTML page
                bookElement = e;
                break;
            }
        }
        if (bookElement != null) {
            DomElement eClassify_books = bookElement.getFirstElementChild().getFirstElementChild();
            String s = eClassify_books.getAttribute("class");
            // Locate the region that holds the book categories
            if (s.equalsIgnoreCase("classify_books")) {
                System.out.println("find classify_books. class=" + s);
                String rootCategory = "";
                for (DomElement e : eClassify_books.getChildElements()) {
                    s = e.getAttribute("class");
                    if (s.equalsIgnoreCase("classify_books_detail")) {
                        // Description of the root book category
                        DomElement eRoot = e.getElementsByTagName("h3").get(0).getFirstElementChild();
                        String url = eRoot.getAttribute("href");
                        String name = eRoot.getTextContent();
                        rootCategory = urlToCategory(url);
                        System.out.println("find book rootCategory." + " name=" + name + " category=" + rootCategory);
                    } else if (s.contains("classify_kind")) {
                        // A concrete first-level book category
                        DomElement eCategory = e.getFirstElementChild().getFirstElementChild();
                        String url = eCategory.getAttribute("href");
                        String name = eCategory.getTextContent();
                        String category = urlToCategory(url);
                        // Traditional Chinese characters display incorrectly here, so set the name by hand
                        if (category.equalsIgnoreCase("cp01.59.00.00.00.00"))
                            name = "港臺圖書";
                        System.out.println("find book category. " + " name=" + name + " category=" + category);
                        BookCategory bookCategory = new BookCategory();
                        bookCategory.setTitle(name);
                        bookCategory.setCategory(category);
                        bookCategory.setCategory_parent(rootCategory);
                        bookCategory.setCache(0);
                        lsCategory.add(bookCategory);

                        // Second-level categories
                        DomElement ul = e.getElementsByTagName("ul").get(0);
                        DomNodeList<HtmlElement> ulList = ul.getElementsByTagName("li");
                        for (int j = 0; j < ulList.size(); j++) {
                            HtmlElement he = ulList.get(j);
                            if (he.getAttribute("name").equalsIgnoreCase("cat_3")) {
                                DomElement eSubCategory = he.getFirstElementChild();
                                url = eSubCategory.getAttribute("href");
                                name = eSubCategory.getTextContent();
                                String subCategory = urlToCategory(url);
                                System.out.println("===========find book sub category. " + " name=" + name + " category=" + subCategory);
                                BookCategory bookSubCategory = new BookCategory();
                                bookSubCategory.setTitle(name);
                                bookSubCategory.setCategory(subCategory);
                                bookSubCategory.setCategory_parent(category);
                                bookSubCategory.setCache(1);
                                lsCategory.add(bookSubCategory);
                            }
                        }
                    }
                }
            }
        }
        //stringToFile(result,"E:\\category.html");
    } catch (Exception e) {
        // Any failure (network error, unexpected page layout) is logged; the categories collected so far are still returned
        e.printStackTrace();
        System.out.println("Exception=" + e);
    }
    System.out.println("find book category finish. ");
    return lsCategory;
}
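The parsed categories are shown in the figure below:
This gives us all of Dangdang's book categories. Because the category data lives on a single page, this step is relatively simple.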
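The categoryFromDangdang method above relies on a BookCategory bean and a urlToCategory helper that are not shown in this post. The following is only a rough sketch of what they might look like, not the original implementation. The wrapper class name CategoryHelpers, the bean's field types, the regular expression, and the sample URL are all assumptions; the only facts taken from the code above are the setter names and the code format seen in "cp01.59.00.00.00.00".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical container class; the name is an assumption made for this sketch.
class CategoryHelpers {

    // Minimal stand-in for BookCategory. Field names and types are inferred
    // from the setters called in categoryFromDangdang, not from the real class.
    public static class BookCategory {
        private String title;
        private String category;
        private String category_parent;
        private int cache;

        public void setTitle(String title) { this.title = title; }
        public void setCategory(String category) { this.category = category; }
        public void setCategory_parent(String parent) { this.category_parent = parent; }
        public void setCache(int cache) { this.cache = cache; }

        public String getTitle() { return title; }
        public String getCategory() { return category; }
        public String getCategory_parent() { return category_parent; }
        public int getCache() { return cache; }
    }

    // Assumed code shape: "cp" followed by six two-digit groups separated by dots,
    // matching the "cp01.59.00.00.00.00" value that appears in the parsing code.
    private static final Pattern CATEGORY_CODE =
            Pattern.compile("cp\\d{2}(?:\\.\\d{2}){5}");

    // Returns the first category code found in the URL, or an empty string when
    // there is none, so callers can safely call equalsIgnoreCase on the result.
    public static String urlToCategory(String url) {
        if (url == null) {
            return "";
        }
        Matcher m = CATEGORY_CODE.matcher(url);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        // The URL below is an assumed example of a category link.
        String code = urlToCategory("http://category.dangdang.com/cp01.59.00.00.00.00.html");
        System.out.println(code); // prints cp01.59.00.00.00.00
    }
}
```

Returning an empty string rather than null for an unmatched link keeps the comparisons in the parsing loop from throwing; whether the original helper behaves this way is unknown.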
You can also parse the first page under each category to obtain the number of pages and the number of books for that category.