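The detail page carries the book's title, publisher, author, summary, and other specifics, so it is both the hardest part of the page parsing and the core of the data.
First, locate the first-level nodes involved: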
// First-level nodes
DomElement productMainElement = null;
DomElement productContentElement = null;

BookSpider bookSpider = new BookSpider(); // records the proxy in use

try {
    HtmlPage page = webClientGetPage(url, true, false, bookSpider);
    if (page == null || !page.isHtmlPage() || !page.hasChildNodes()) {
        System.out.println("bookFromDangdangUrl failed. url = " + url);
        return null;
    }

    //stringToFile(page.asXml(),"E:\\book.html");

    List<DomElement> ll = page.getElementsByTagName("div");

    for (int i = 0; i < ll.size(); i++) {
        DomElement e = ll.get(i);

        String cls = e.getAttribute("class");

        if (cls.equalsIgnoreCase("product_main clearfix")) {
            System.out.println("find productMainElement. cls = " + cls);

            productMainElement = e;
        }

        if (cls.equalsIgnoreCase("product_content clearfix")) {
            System.out.println("find productContentElement. cls = " + cls);

            productContentElement = e;
        }

        // Stop scanning once both nodes have been found
        if (productMainElement != null && productContentElement != null)
            break;
    }
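The loop above compares the whole class attribute against a literal such as "product_main clearfix", which stops matching if the page adds, drops, or reorders a class. Below is a minimal sketch of a more tolerant check, assuming it is acceptable to match on a single class token. `ClassMatch` and `hasClass` are hypothetical names, not part of the original code or of HtmlUnit.

```java
import java.util.Arrays;

final class ClassMatch {
    // Hypothetical helper: true if the class attribute contains the given
    // token as a whole whitespace-separated word, ignoring case.
    static boolean hasClass(String classAttr, String token) {
        if (classAttr == null || token == null || token.isEmpty()) {
            return false;
        }
        return Arrays.stream(classAttr.trim().split("\\s+"))
                     .anyMatch(t -> t.equalsIgnoreCase(token));
    }

    public static void main(String[] args) {
        System.out.println(hasClass("product_main clearfix", "product_main"));     // true
        System.out.println(hasClass("clearfix  product_main", "product_main"));    // true
        System.out.println(hasClass("product_main_old clearfix", "product_main")); // false
    }
}
```

In the loop, a call such as `hasClass(cls, "product_main")` could stand in for the `equalsIgnoreCase` comparison, at the cost of also matching elements that merely share that one class.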
Then find the second-level nodes that are needed:
// Second-level nodes
DomElement picInfoElement = null;  // image node
DomElement showInfoElement = null; // detail node
if (productMainElement != null) {
    List<HtmlElement> lsElement = productMainElement.getElementsByTagName("div");

    for (int i = 0; i < lsElement.size(); i++) {
        DomElement e = lsElement.get(i);

        String cls = e.getAttribute("class");
        if (cls != null) {
            if (cls.equalsIgnoreCase("pic_info")) {
                System.out.println("find picInfoElement. class = " + cls);

                // image node
                picInfoElement = e;
            }

            if (cls.equalsIgnoreCase("show_info")) {
                System.out.println("find showInfoElement. class = " + cls);

                // detail node
                showInfoElement = e;
            }
        }

        if (picInfoElement != null && showInfoElement != null)
            break;
    }
}
Now the actual parsing can begin. Taking the image node as an example:
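// Image
if (picInfoElement != null) {
    // The <img> is expected three levels below pic_info; any missing level makes this chain throw a NullPointerException
    DomElement imgElement = picInfoElement.getFirstElementChild().getFirstElementChild().getFirstElementChild();
    image = imgElement.getAttribute("src").trim();
}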
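The chained `getFirstElementChild()` calls above assume that every level exists, so a page with a slightly different layout would end in a `NullPointerException`. The following sketch shows one way to walk the same chain defensively. To stay dependency-free it uses the JDK's built-in `org.w3c.dom` parser on a well-formed fragment rather than HtmlUnit, and `SafeDescent`, `descend`, and `imageSrc` are hypothetical names introduced only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class SafeDescent {
    // First child that is an element, or null if there is none.
    static Element firstElementChild(Node parent) {
        if (parent == null) return null;
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) return (Element) n;
        }
        return null;
    }

    // Follows the first-element-child chain `depth` levels down;
    // returns null as soon as a level is missing.
    static Element descend(Element start, int depth) {
        Element cur = start;
        for (int i = 0; i < depth && cur != null; i++) {
            cur = firstElementChild(cur);
        }
        return cur;
    }

    // Trimmed src attribute found `depth` levels below the root, or "" if unavailable.
    static String imageSrc(String xml, int depth) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        Element img = descend(root, depth);
        return img == null ? "" : img.getAttribute("src").trim();
    }

    public static void main(String[] args) {
        String full = "<div class=\"pic_info\"><div><a><img src=\" cover.jpg \"/></a></div></div>";
        String broken = "<div class=\"pic_info\"><div/></div>";
        System.out.println(imageSrc(full, 3));             // cover.jpg
        System.out.println(imageSrc(broken, 3).isEmpty()); // true
    }
}
```

The same loop-with-null-check idea carries over to HtmlUnit's `DomElement`, since its `getFirstElementChild()` also returns null when there is no child element.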
The nodes holding the basic information are located as follows:
// Nodes related to the basic information
DomElement nameInfoElement = null;
DomElement messboxInfoElement = null;
DomElement priceInfoElement = null;
if (showInfoElement != null) {
    DomElement saleBoxElement = showInfoElement.getFirstElementChild().getFirstElementChild();
    List<HtmlElement> lsElement = saleBoxElement.getElementsByTagName("div");

    for (int i = 0; i < lsElement.size(); i++) {
        HtmlElement e = lsElement.get(i);

        String cls = e.getAttribute("class");
        if (cls == null)
            continue;

        if (cls.equalsIgnoreCase("name_info")) {
            System.out.println("find nameInfoElement. class = " + cls);

            nameInfoElement = e;
        }

        if (cls.equalsIgnoreCase("messbox_info")) {
            System.out.println("find messboxInfoElement. class = " + cls);

            messboxInfoElement = e;
        }

        if (cls.equalsIgnoreCase("price_info clearfix")) {
            System.out.println("find priceInfoElement. class = " + cls);

            priceInfoElement = e;
        }

        if (nameInfoElement != null && messboxInfoElement != null && priceInfoElement != null)
            break;
    }
}
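All three lookups so far repeat the same pattern: scan the `div` descendants, compare the class attribute, and stop once everything has been found. As a rough sketch, that pattern can be folded into one helper that returns a map from class value to node. This version again uses the JDK's `org.w3c.dom` parser instead of HtmlUnit so it runs on its own, `DivIndex`, `byClass`, and `foundClasses` are hypothetical names, and unlike the original loops it keeps the first match for each class rather than the latest one.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DivIndex {
    // Maps each wanted class value (lower case) to the first <div> carrying it.
    static Map<String, Element> byClass(Element root, Set<String> wanted) {
        Map<String, Element> found = new LinkedHashMap<>();
        NodeList divs = root.getElementsByTagName("div");
        // Stop early once every wanted class has been seen.
        for (int i = 0; i < divs.getLength() && found.size() < wanted.size(); i++) {
            Element e = (Element) divs.item(i);
            String cls = e.getAttribute("class").toLowerCase(Locale.ROOT);
            if (wanted.contains(cls)) {
                found.putIfAbsent(cls, e);
            }
        }
        return found;
    }

    // Convenience for demos: the class values that were found, in document order.
    static String foundClasses(String xml, String... wanted) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        Set<String> set = new HashSet<>(Arrays.asList(wanted));
        return String.join(",", byClass(root, set).keySet());
    }

    public static void main(String[] args) {
        String xml = "<div class=\"sale_box\"><div class=\"name_info\"/>"
                   + "<div class=\"messbox_info\"/><div class=\"price_info clearfix\"/></div>";
        System.out.println(foundClasses(xml, "name_info", "messbox_info", "price_info clearfix"));
    }
}
```

A class missing from the returned map plays the role of the `null` checks in the original code.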
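With that, the nodes holding the basic information have essentially all been located, and they can be processed in the same way as the image node.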
// Title
if (nameInfoElement != null) {

    // Prefer the <h1> text; fall back to the node's full text when there is no <h1>
    if (nameInfoElement.getElementsByTagName("h1").isEmpty()) {
        title = nameInfoElement.getTextContent().trim();
    } else {
        title = nameInfoElement.getElementsByTagName("h1").get(0).getTextContent().trim();
    }

    title = strSeparatorContent(title);

    // The subtitle is optional and is read from the first <h2>, if any
    if (!nameInfoElement.getElementsByTagName("h2").isEmpty()) {
        subtitle = nameInfoElement.getElementsByTagName("h2").get(0).getTextContent().trim();
    }

    subtitle = strSeparatorContent(subtitle);
}
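The title logic above boils down to "use the first `h1` if present, otherwise the whole text; use the first `h2` for the subtitle if present". Here is a small self-contained sketch of that fallback, using the JDK's `org.w3c.dom` parser on a well-formed fragment in place of HtmlUnit. `TitleText`, `textOfFirst`, and `titleAndSubtitle` are hypothetical names, the empty-string default for a missing subtitle is an assumption, and the original `strSeparatorContent` clean-up step is not reproduced because its definition is not shown.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TitleText {
    // Trimmed text of the first <tag> descendant, or null when there is none.
    static String textOfFirst(Element parent, String tag) {
        NodeList hits = parent.getElementsByTagName(tag);
        return hits.getLength() == 0 ? null : hits.item(0).getTextContent().trim();
    }

    // Returns {title, subtitle} for a well-formed name_info fragment.
    static String[] titleAndSubtitle(String xml) {
        Element nameInfo;
        try {
            nameInfo = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        String title = textOfFirst(nameInfo, "h1");
        if (title == null) {
            // No <h1>: fall back to the full text of the node.
            title = nameInfo.getTextContent().trim();
        }
        String subtitle = textOfFirst(nameInfo, "h2");
        // Assumption: a missing subtitle is represented as an empty string.
        return new String[] { title, subtitle == null ? "" : subtitle };
    }

    public static void main(String[] args) {
        String[] r = titleAndSubtitle(
                "<div class=\"name_info\"><h1> Java </h1><h2> 2nd ed. </h2></div>");
        System.out.println(r[0] + " | " + r[1]); // Java | 2nd ed.
    }
}
```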
Parsing of the remaining nodes is not repeated here; following the approach above, all of the data can be extracted.