A Complete Solution for a Book Information Database (Part 4): Parsing Book Details

Date: 2019-11-15


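The detail page holds the book's title, publisher, author, summary, and other specifics. That makes it both the hardest part of the whole page-parsing job and the source of the core data.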

First, locate the top-level nodes involved:

        // Top-level nodes
        DomElement productMainElement = null;
        DomElement productContentElement = null;

        BookSpider bookSpider = new BookSpider(); // records which proxy was used

        try {
            HtmlPage page = webClientGetPage(url, true, false, bookSpider);
            if(page == null || !page.isHtmlPage() || !page.hasChildNodes()){
                System.out.println("bookFromDangdangUrl failed.  url = " + url);
                return null;
            }

            // Debugging aid: dump the fetched page to disk
            //stringToFile(page.asXml(),"E:\\book.html");

            // Scan every <div> on the page and pick out the two containers by class
            List<DomElement> ll = page.getElementsByTagName("div");

            for(int i=0;i<ll.size();i++) {
                DomElement e = ll.get(i);

                String cls = e.getAttribute("class");

                if(cls.equalsIgnoreCase("product_main clearfix")) {
                    System.out.println("find productMainElement. cls = " + cls);

                    productMainElement = e;
                }

                if(cls.equalsIgnoreCase("product_content clearfix")) {
                    System.out.println("find productContentElement. cls = " + cls);

                    productContentElement = e;
                }

                // Stop scanning once both nodes have been found
                if(productMainElement != null && productContentElement != null)
                    break;
            }
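The scan above can be factored into a small reusable helper. The following is only a sketch under stated assumptions: it uses the JDK's built-in `org.w3c.dom` parser on a well-formed XHTML string instead of HtmlUnit so that it runs without extra dependencies, and the names `DivFinder`, `parse`, and `findDivByClass` are hypothetical, not part of the article's code. It returns the first matching `<div>` rather than keeping the last one seen.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DivFinder {

    // Parses a well-formed XHTML string; checked exceptions are wrapped for brevity.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse markup", ex);
        }
    }

    // Returns the first <div> below root whose class attribute equals cls
    // (case-insensitive, mirroring the article), or null if there is none.
    static Element findDivByClass(Element root, String cls) {
        NodeList divs = root.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element e = (Element) divs.item(i);
            // getAttribute returns "" when the attribute is absent, never null
            if (cls.equalsIgnoreCase(e.getAttribute("class"))) {
                return e;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical, heavily simplified stand-in for the real detail page
        Document doc = parse(
                "<html><body>"
              + "<div class='product_main clearfix'><div class='pic_info'/></div>"
              + "<div class='product_content clearfix'/>"
              + "</body></html>");
        Element root = doc.getDocumentElement();
        Element productMain = findDivByClass(root, "product_main clearfix");
        Element productContent = findDivByClass(root, "product_content clearfix");
        System.out.println("both found: " + (productMain != null && productContent != null));
    }
}
```

With a helper like this, each container lookup becomes a single call, at the cost of one scan per class instead of one shared scan.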

Next, locate the second-level nodes that are needed:

            // Second-level nodes
            DomElement picInfoElement = null;  // image node
            DomElement showInfoElement = null; // detail node
            if(productMainElement != null) {
                List<HtmlElement> lsElement = productMainElement.getElementsByTagName("div");

                for(int i=0;i<lsElement.size();i++) {
                    DomElement e = lsElement.get(i);

                    String cls = e.getAttribute("class");
                    if(cls != null){
                        if(cls.equalsIgnoreCase("pic_info")) {
                            System.out.println("find picInfoElement. class = " + cls);

                            // Image node
                            picInfoElement = e;
                        }

                        if(cls.equalsIgnoreCase("show_info")) {
                            System.out.println("find showInfoElement. class = " + cls);

                            // Detail node, which holds the book's basic information
                            showInfoElement = e;
                        }
                    }

                    // Stop scanning once both nodes have been found
                    if(picInfoElement != null && showInfoElement != null)
                        break;
                }
            }
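The comparisons above match the whole `class` attribute, so a value such as `product_main clearfix` only matches if the names appear in exactly that order with exactly one space. A possible alternative, sketched here with the hypothetical names `ClassTokens` and `hasClass`, is to test for a single class name among the whitespace-separated tokens. It keeps the article's case-insensitive comparison, which is an assumption carried over rather than a requirement.

```java
final class ClassTokens {

    // True if the whitespace-separated class attribute contains the given token.
    // Comparison is case-insensitive to mirror the article's equalsIgnoreCase.
    static boolean hasClass(String classAttr, String token) {
        if (classAttr == null || token == null || token.isEmpty()) {
            return false;
        }
        for (String part : classAttr.trim().split("\\s+")) {
            if (part.equalsIgnoreCase(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Order and extra spacing no longer matter
        System.out.println(hasClass("clearfix   product_main", "product_main"));
        // A longer name that merely starts with the token is not a match
        System.out.println(hasClass("product_main_old", "product_main"));
    }
}
```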

With those nodes in hand, the actual parsing can begin. Taking the image node as an example:

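            // Image
            if(picInfoElement != null) {
                // Assumes the image element sits three levels below pic_info;
                // any missing level in this chain would throw a NullPointerException
                DomElement imgElement = picInfoElement.getFirstElementChild().getFirstElementChild().getFirstElementChild();
                image = imgElement.getAttribute("src").trim();
            }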
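The three chained `getFirstElementChild()` calls above fail with a `NullPointerException` as soon as one level is missing. The sketch below shows one way to make such a descent tolerant of layout changes. It is an illustration only: it uses the JDK's `org.w3c.dom` API rather than HtmlUnit, the markup in `main` is a made-up stand-in for the real page, and `SafeDescent`, `descend`, and `imageSrc` are hypothetical names.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class SafeDescent {

    // Parses a well-formed XHTML string and returns its root element.
    static Element parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse markup", ex);
        }
    }

    // First child that is an element (text and comment nodes are skipped), or null.
    static Element firstElementChild(Element parent) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                return (Element) n;
            }
        }
        return null;
    }

    // Follows the first-element-child link `levels` times; empty if the chain ends early.
    static Optional<Element> descend(Element start, int levels) {
        Element cur = start;
        for (int i = 0; i < levels && cur != null; i++) {
            cur = firstElementChild(cur);
        }
        return Optional.ofNullable(cur);
    }

    // Returns the trimmed src three levels down, or null if it cannot be found.
    static String imageSrc(Element picInfo) {
        return descend(picInfo, 3)
                .map(img -> img.getAttribute("src").trim())
                .filter(src -> !src.isEmpty())
                .orElse(null);
    }

    public static void main(String[] args) {
        Element picInfo = parse(
                "<div class='pic_info'><div><a><img src=' cover.jpg '/></a></div></div>");
        System.out.println(imageSrc(picInfo));
    }
}
```

Returning `null` for a missing image keeps the caller's existing null checks usable; returning the `Optional` itself would be an equally reasonable design.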

The nodes that hold the basic information are located as follows:

            // Nodes related to the basic information
            DomElement nameInfoElement = null;
            DomElement messboxInfoElement = null;
            DomElement priceInfoElement = null;
            if(showInfoElement != null) {
                // The sale box is the first grandchild element of show_info
                DomElement saleBoxElement = showInfoElement.getFirstElementChild().getFirstElementChild();
                List<HtmlElement> lsElement = saleBoxElement.getElementsByTagName("div");

                for(int i=0;i<lsElement.size();i++) {
                    HtmlElement e = lsElement.get(i);

                    String cls = e.getAttribute("class");
                    if(cls == null)
                        continue;

                    if(cls.equalsIgnoreCase("name_info")) {
                        System.out.println("find nameInfoElement. class = " + cls);

                        nameInfoElement = e;
                    }

                    if(cls.equalsIgnoreCase("messbox_info")) {
                        System.out.println("find messboxInfoElement. class = " + cls);

                        messboxInfoElement = e;
                    }

                    if(cls.equalsIgnoreCase("price_info clearfix")) {
                        System.out.println("find priceInfoElement. class = " + cls);

                        priceInfoElement = e;
                    }

                    // Stop scanning once all three nodes have been found
                    if(nameInfoElement != null && messboxInfoElement != null && priceInfoElement != null)
                        break;
                }
            }
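As the number of target nodes grows, the chain of `if` blocks above grows with it. One possible way to keep a single pass while avoiding the repetition is to collect matches into a map keyed by class value. This is a sketch under assumptions: it runs on the JDK's `org.w3c.dom` API instead of HtmlUnit, it uses exact (case-sensitive) matching on the whole attribute, the `sale_box` class in the sample markup is invented, and `DivCollector` and `collectByClass` are hypothetical names.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DivCollector {

    // Parses a well-formed XHTML string and returns its root element.
    static Element parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse markup", ex);
        }
    }

    // Scans the <div> descendants of root once and records the first element
    // seen for each wanted class value; stops early when all have been found.
    static Map<String, Element> collectByClass(Element root, Set<String> wanted) {
        Map<String, Element> found = new LinkedHashMap<>();
        NodeList divs = root.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength() && found.size() < wanted.size(); i++) {
            Element e = (Element) divs.item(i);
            String cls = e.getAttribute("class");
            if (wanted.contains(cls)) {
                found.putIfAbsent(cls, e);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Element saleBox = parse(
                "<div class='sale_box'>"
              + "<div class='name_info'><h1>Title</h1></div>"
              + "<div class='messbox_info'/>"
              + "<div class='price_info clearfix'/>"
              + "</div>");
        Map<String, Element> nodes = collectByClass(
                saleBox, Set.of("name_info", "messbox_info", "price_info clearfix"));
        System.out.println(nodes.keySet());
    }
}
```

A missing node simply has no entry in the map, so `nodes.get("name_info")` returning `null` plays the same role as the `nameInfoElement != null` check.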

That locates essentially all of the basic-information nodes. Each one can then be processed in the same way as the image node.

            // Title
            if(nameInfoElement != null) {

                // Prefer the <h1> text; fall back to the text of the whole node
                if(nameInfoElement.getElementsByTagName("h1").isEmpty()){
                    title = nameInfoElement.getTextContent().trim();
                }else{
                    title  = nameInfoElement.getElementsByTagName("h1").get(0).getTextContent().trim();
                }

                title = strSeparatorContent(title);

                // Subtitle: only set when the node contains an <h2>
                if(!nameInfoElement.getElementsByTagName("h2").isEmpty()){
                    subtitle  = nameInfoElement.getElementsByTagName("h2").get(0).getTextContent().trim();
                }

                subtitle = strSeparatorContent(subtitle);
            }
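The article does not show `strSeparatorContent`, so its behavior is unknown. Text taken from `getTextContent()` usually carries line breaks and runs of spaces from the page layout, so a cleanup step of some kind is typically needed. The following is a purely hypothetical stand-in, not the author's helper: `TitleText.collapseWhitespace` collapses whitespace runs (including the no-break space U+00A0 and the ideographic space U+3000, which `\s` does not cover by default) into single spaces.

```java
final class TitleText {

    // Collapses every run of whitespace into one space and trims the ends.
    // Returns "" for null so callers can pass an unset subtitle safely.
    static String collapseWhitespace(String raw) {
        if (raw == null) {
            return "";
        }
        return raw.replaceAll("[\\s\\u00A0\\u3000]+", " ").trim();
    }

    public static void main(String[] args) {
        String scraped = "  Effective\n\t Java \u3000 3rd Edition ";
        System.out.println("[" + collapseWhitespace(scraped) + "]");
    }
}
```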

Parsing of the remaining nodes is not repeated here. Following the approach shown above is enough to extract all of the data.
