Scraping data from a web page with jsoup.

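jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It offers a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.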
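As a small illustration of the CSS-selector style, here is a minimal sketch that parses an HTML string instead of a live URL. The HTML fragment, the class name `SelectorSketch`, and the helper `extractTitles` are made up for this example; only `Jsoup.parse`, `select`, `text`, and `absUrl` are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SelectorSketch {

    // Hypothetical fragment that imitates the structure of a post list.
    static final String SAMPLE_HTML =
            "<div id=\"post_list\">"
            + "<div class=\"post_item\"><div class=\"post_item_body\">"
            + "<h3><a class=\"titlelnk\" href=\"/demo/p/1.html\">First post</a></h3>"
            + "</div></div>"
            + "</div>";

    // Returns one "title -> absolute URL" string per matching link.
    static List<String> extractTitles(String html) {
        // The base URI lets jsoup resolve relative href values.
        Document doc = Jsoup.parse(html, "http://www.cnblogs.com/");
        List<String> out = new ArrayList<>();
        // A single CSS selector replaces a chain of getElementsByClass calls.
        for (Element link : doc.select("#post_list .post_item a.titlelnk")) {
            out.add(link.text() + " -> " + link.absUrl("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        extractTitles(SAMPLE_HTML).forEach(System.out::println);
    }
}
```

Parsing from a string like this is also a handy way to try out selectors without hitting the network.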

Below is a demo that parses the data on the cnblogs (博客園) home page:

package com.haojiahong.test;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.haojiahong.domain.PostItem;

public class ZhuaHtmlDataTest {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the home page
        Document doc = Jsoup.connect("http://www.cnblogs.com/").get();
        // Container that holds the list of posts
        Element content = doc.getElementById("post_list");
        Elements datas = content.getElementsByClass("post_item");
        for (Element data : datas) {
            PostItem postItem = new PostItem();
            Elements itemBodys = data.getElementsByClass("post_item_body");

            // Get the title link
            Elements titles = itemBodys.get(0).getElementsByClass("titlelnk");
            // Get the summary
            Elements summarys = itemBodys.get(0).getElementsByClass(
                    "post_item_summary");
            // Get the author and the author's link from the footer
            Elements foots = itemBodys.get(0).getElementsByClass("lightblue");
            postItem.setTitleName(titles.get(0).text());
            postItem.setTitleUrl(titles.get(0).attr("href"));
            postItem.setSummary(summarys.get(0).text());
            postItem.setFootWriter(foots.get(0).text());
            postItem.setFootWriterUrl(foots.get(0).attr("href"));
            System.out.println(postItem.toString());

        }
    }
}
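The demo assumes every element it looks up exists: `getElementById` returns `null` when the id is missing, and `get(0)` on an empty `Elements` throws. If the page layout changes, a more defensive variant may be useful. The following is only a sketch: the class name `SafeScrapeSketch`, the helper `extract`, the output format, and the user-agent string are assumptions, and `selectFirst` needs a reasonably recent jsoup release (1.11 or later, as far as I know).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SafeScrapeSketch {

    // Returns "title | url | author" rows, skipping items without a title link.
    static List<String> extract(Document doc) {
        List<String> rows = new ArrayList<>();
        Element list = doc.getElementById("post_list");
        if (list == null) {
            return rows; // container not found: layout changed or request blocked
        }
        for (Element item : list.getElementsByClass("post_item")) {
            // selectFirst returns null instead of throwing when nothing matches.
            Element title = item.selectFirst(".titlelnk");
            Element author = item.selectFirst(".lightblue");
            if (title == null) {
                continue;
            }
            String authorName = (author == null) ? "" : author.text();
            rows.add(title.text() + " | " + title.attr("href") + " | " + authorName);
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.cnblogs.com/")
                .userAgent("Mozilla/5.0 (jsoup demo)") // assumed value, adjust as needed
                .timeout(10_000)                       // milliseconds
                .get();
        extract(doc).forEach(System.out::println);
    }
}
```

Keeping the extraction in a method that takes a `Document` also makes it possible to test the logic against a saved HTML string rather than the live site.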

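The demo uses a JavaBean class, PostItem.java, which makes it easier to read and parse the data. This is also why it pays to keep object-oriented thinking in mind at all times.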
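To make the benefit concrete, here is a minimal sketch that uses only the standard library. The `Post` class is a hypothetical, trimmed-down stand-in for the bean (two fields only) and the sample data is invented; the point is that once each row is an object, it can be grouped or filtered without touching the HTML again.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PostGroupingSketch {

    // Hypothetical stand-in for the bean, reduced to two fields.
    static final class Post {
        final String title;
        final String author;

        Post(String title, String author) {
            this.title = title;
            this.author = author;
        }
    }

    // Groups post titles by author; TreeMap keeps authors in sorted order.
    static Map<String, List<String>> titlesByAuthor(List<Post> posts) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Post p : posts) {
            grouped.computeIfAbsent(p.author, k -> new ArrayList<>()).add(p.title);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Post> posts = new ArrayList<>();
        posts.add(new Post("Intro to jsoup", "alice"));
        posts.add(new Post("CSS selectors", "bob"));
        posts.add(new Post("Parsing tips", "alice"));
        System.out.println(titlesByAuthor(posts));
    }
}
```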

package com.haojiahong.domain;

/**
 * Post information.
 *
 * @author haojiahong
 * @createtime 2015-7-17 2:11:54 PM
 */
public class PostItem {

    private String titleName;
    private String titleUrl;
    private String summary;
    private String footWriter;
    private String footWriterUrl;

    // Note: the summary is stored but intentionally left out of the printed form.
    @Override
    public String toString() {
        return "Post title: " + titleName + ", post URL: " + titleUrl
                + ", author: " + footWriter + ", author URL: " + footWriterUrl;
    }

    public String getTitleName() {
        return titleName;
    }

    public void setTitleName(String titleName) {
        this.titleName = titleName;
    }

    public String getTitleUrl() {
        return titleUrl;
    }

    public void setTitleUrl(String titleUrl) {
        this.titleUrl = titleUrl;
    }

    public String getSummary() {
        return summary;
    }

    public void setSummary(String summary) {
        this.summary = summary;
    }

    public String getFootWriter() {
        return footWriter;
    }

    public void setFootWriter(String footWriter) {
        this.footWriter = footWriter;
    }

    public String getFootWriterUrl() {
        return footWriterUrl;
    }

    public void setFootWriterUrl(String footWriterUrl) {
        this.footWriterUrl = footWriterUrl;
    }
}

The parsed result looks like this:

Post title: 常見正則表達式, post URL: http://www.cnblogs.com/dandandeyoushangnan/p/4661977.html, author: 淡淡的憂傷IT男, author URL: http://www.cnblogs.com/dandandeyoushangnan/
Post title: 用jQuery寫了一個模態框插件感受挺好看的在博客園分享一下!, post URL: http://www.cnblogs.com/YingYue/p/4661944.html, author: 周建旭的博客, author URL: http://www.cnblogs.com/YingYue/
Post title: 小議 html 實體解析, post URL: http://www.cnblogs.com/52cik/p/js-entity.html, author: 亂碼., author URL: http://www.cnblogs.com/52cik/
Post title: WPF入門教程系列十三——依賴屬性(三), post URL: http://www.cnblogs.com/chillsrc/p/4661658.html, author: DotNet菜園, author URL: http://www.cnblogs.com/chillsrc/
Post title: IOS NSNotification Center 通知中心的使用, post URL: http://www.cnblogs.com/jerehedu/p/4661608.html, author: 傑瑞教育, author URL: http://www.cnblogs.com/jerehedu/
Post title: 網絡IO之阻塞、非阻塞、同步、異步總結, post URL: http://www.cnblogs.com/Fly-Wind/p/io.html, author: Fly_Wind, author URL: http://www.cnblogs.com/Fly-Wind/
Post title: 跨域解決方案之HTML5 postMessage, post URL: http://www.cnblogs.com/hutuzhu/p/4661526.html, author: 彼岸花在開, author URL: http://www.cnblogs.com/hutuzhu/
Post title: Windows Azure Virtual Machine (24) Azure VM支持多網卡功能, post URL: http://www.cnblogs.com/threestone/p/4661454.html, author: Lei Zhang的博客, author URL: http://www.cnblogs.com/threestone/
Post title: UVa 673 Parentheses Balance(棧的使用), post URL: http://www.cnblogs.com/hfc-xx/p/4661443.html, author: 黃鳳成, author URL: http://www.cnblogs.com/hfc-xx/
Post title: ECMAScript 6教程 (三) Class和Module(類和模塊), post URL: http://www.cnblogs.com/jasonnode/p/4661422.html, author: Jason-node, author URL: http://www.cnblogs.com/jasonnode/
Post title: GROUP BY的擴展, post URL: http://www.cnblogs.com/ivictor/p/4660984.html, author: iVictor, author URL: http://www.cnblogs.com/ivictor/
Post title: ASP.NET MVC 過濾器開發與使用, post URL: http://www.cnblogs.com/JinvidLiang/p/4660200.html, author: 々蕞噯の﹎, author URL: http://www.cnblogs.com/JinvidLiang/
Post title: JavaScript「並不是」一切皆對象, post URL: http://www.cnblogs.com/myvin/p/4660138.html, author: myvin, author URL: http://www.cnblogs.com/myvin/
Post title: Android CollapsingToolbarLayout, post URL: http://www.cnblogs.com/wingyip/p/4609891.html, author: wingyip, author URL: http://www.cnblogs.com/wingyip/
Post title: C#基礎系列——Attribute特性使用, post URL: http://www.cnblogs.com/landeanfen/p/4642819.html, author: 懶得安分, author URL: http://www.cnblogs.com/landeanfen/
Post title: SQL Server表分區的NULL值問題, post URL: http://www.cnblogs.com/lyhabc/p/4660846.html, author: 樺仔, author URL: http://www.cnblogs.com/lyhabc/
Post title: 認真分析mmap:是什麼 爲何 怎麼用, post URL: http://www.cnblogs.com/huxiao-tee/p/4660352.html, author: 胡瀟, author URL: http://www.cnblogs.com/huxiao-tee/
Post title: 上週熱點回顧(7.13-7.19), post URL: http://www.cnblogs.com/cmt/p/4660705.html, author: 博客園團隊, author URL: http://www.cnblogs.com/cmt/
Post title: Python開發入門與實戰11-單元測試, post URL: http://www.cnblogs.com/haozi0804/p/4660652.html, author: wuch, author URL: http://www.cnblogs.com/haozi0804/
Post title: 【Oracle 集羣】11G RAC 知識圖文詳細教程之RAC在LINUX上使用NFS安裝前準備(六), post URL: http://www.cnblogs.com/baiboy/p/orc6.html, author: 伏草唯存, author URL: http://www.cnblogs.com/baiboy/

Every post listed on the home page is parsed this way (20 in the run above). It really is that handy, haha.
