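Visual Paradigm is an excellent UML modeling tool. If you want to know more about it, see its official website; I won't introduce it further here. I recently needed it for some design work, and it comes with very thorough, well-written online tutorials. The trouble is that there are a great many of them, and having everyone reach an external site is slow and makes for inefficient study. Looking more closely, I noticed that every article offers a PDF download, and the sample files can be downloaded too. That makes things easy: why not write a program to fetch them all? So I handed the job to HulkZ, who delivered after half a day. I looked it over and found that, although there is room for optimization in places, it is sound overall, so I am writing this article to walk through it.
Note: HulkZ has not graduated yet; he is currently in his fourth year at university.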
    public class VisualParadigmMain {
        public static void main(String[] args) throws Exception {
            Spider spider = new SpiderImpl("UTF-8");
            Watcher watcher = new WatcherImpl();
            watcher.addProcessor(new VisualParadigmMainProcessor());
            QuickNameFilter<HtmlNode> nodeFilter = new QuickNameFilter<HtmlNode>();
            nodeFilter.setNodeName("li");
            nodeFilter.setIncludeAttribute("class", "tutorialLeftMenuItem");
            watcher.setNodeFilter(nodeFilter);
            spider.addWatcher(watcher);
            spider.processUrl("http://www.visual-paradigm.com/tutorials/");
        }
    }
In short: create a spider that reads pages as UTF-8, create a watcher, tell it to pass every li tag whose class is tutorialLeftMenuItem to VisualParadigmMainProcessor, and then process the URL http://www.visual-paradigm.com/tutorials/.
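The watcher idea described above (select nodes by tag name and attribute value, then hand each match to a processor) can be sketched in plain Java. This is only an illustration of the pattern, not TinySpider's implementation: `SketchNode`, `NodeProcessor`, `FilterSketch`, and the sample href are hypothetical names that use nothing beyond the JDK.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parsed HTML element; not TinySpider's HtmlNode.
class SketchNode {
    final String name;
    final Map<String, String> attributes = new HashMap<String, String>();
    final List<SketchNode> children = new ArrayList<SketchNode>();

    SketchNode(String name) {
        this.name = name;
    }

    SketchNode attr(String key, String value) {
        attributes.put(key, value);
        return this;
    }

    SketchNode add(SketchNode child) {
        children.add(child);
        return this;
    }
}

// Hypothetical callback, playing the role a Processor plays in the article.
interface NodeProcessor {
    void process(SketchNode node);
}

class FilterSketch {
    // Walks the tree depth-first and calls the processor for every node whose
    // tag name and attribute value both match.
    static void watch(SketchNode root, String nodeName, String attrKey,
                      String attrValue, NodeProcessor processor) {
        if (root.name.equals(nodeName)
                && attrValue.equals(root.attributes.get(attrKey))) {
            processor.process(root);
        }
        for (SketchNode child : root.children) {
            watch(child, nodeName, attrKey, attrValue, processor);
        }
    }

    // Convenience wrapper that collects the matches instead of acting on them.
    static List<SketchNode> findAll(SketchNode root, String nodeName,
                                    String attrKey, String attrValue) {
        final List<SketchNode> matches = new ArrayList<SketchNode>();
        watch(root, nodeName, attrKey, attrValue, new NodeProcessor() {
            public void process(SketchNode node) {
                matches.add(node);
            }
        });
        return matches;
    }

    public static void main(String[] args) {
        // The href below is a made-up example, not a real tutorial path.
        SketchNode menu = new SketchNode("ul")
            .add(new SketchNode("li").attr("class", "tutorialLeftMenuItem")
                .add(new SketchNode("a").attr("href", "/tutorials/category.jsp")))
            .add(new SketchNode("li").attr("class", "other"));
        for (SketchNode li : findAll(menu, "li", "class", "tutorialLeftMenuItem")) {
            System.out.println(li.children.get(0).attributes.get("href"));
        }
    }
}
```

Keeping the filter (what to match) separate from the processor (what to do with a match) is what lets the same crawling loop be reused for the menu, the category pages, and the article pages.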
So far the flow is clear. Next, let's look at VisualParadigmMainProcessor.
    public class VisualParadigmMainProcessor implements Processor {
        public void process(String url, HtmlNode node, Map<String, Object> parameters) throws Exception {
            // The category link inside the matched <li>
            HtmlNode a = node.getSubNode("a");
            // One directory per tutorial category, named after the link text
            File file = new File("E:\\臨時\\spider\\" + a.getPureText().trim());
            if (!file.exists()) {
                file.mkdirs();
            }
            // Crawl the category page the link points to
            VisualParadigmList.process(a.getAttribute("href"));
        }
    }
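The node passed in here is the li tag with class tutorialLeftMenuItem mentioned above. The processor takes the <a> tag beneath it, creates a directory named after the link text, and then hands the category link to the VisualParadigmList class.
A tutorial category consists mainly of individual tutorial articles, so the next step is naturally to process those articles.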
    public class VisualParadigmList {
        public static void process(String url) throws Exception {
            Spider spider = new SpiderImpl("UTF-8");
            Watcher watcher = new WatcherImpl();
            watcher.addProcessor(new VisualParadigmListProcessor());
            QuickNameFilter<HtmlNode> nodeFilter = new QuickNameFilter<HtmlNode>();
            nodeFilter.setNodeName("div");
            nodeFilter.setIncludeAttribute("class", "tutorial-link-container");
            watcher.setNodeFilter(nodeFilter);
            spider.addWatcher(watcher);
            spider.processUrl("http://www.visual-paradigm.com" + url);
            System.out.println(System.currentTimeMillis() + "-" + url);
        }
    }

This creates another spider, looks on the category page for div tags whose class is "tutorial-link-container", and passes each one it finds to VisualParadigmListProcessor.
The VisualParadigmListProcessor class looks like this:
    public class VisualParadigmListProcessor implements Processor {
        public void process(String url, HtmlNode node, Map<String, Object> parameters) throws Exception {
            HtmlNode a = node.getSubNode("a");
            VisualParadigmPage.process(a.getPureText(), a.getAttribute("href"));
        }
    }
In other words, the link text and URL of the a tag beneath it are handed to the VisualParadigmPage class.
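The classes here build absolute URLs by prepending `http://www.visual-paradigm.com` to each `href`, which works as long as every link is root-relative. As a hedged alternative, the JDK's `java.net.URI.resolve` handles root-relative, page-relative, and already-absolute links alike. `LinkResolver` and the sample paths below are hypothetical names for illustration.

```java
import java.net.URI;

class LinkResolver {
    // Resolves an href found on a page against that page's own URL.
    // URI.create throws IllegalArgumentException for hrefs containing
    // characters that are illegal in a URI (for example unescaped spaces).
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.visual-paradigm.com/tutorials/";
        // Root-relative, page-relative, and absolute hrefs (all made up).
        System.out.println(resolve(page, "/tutorials/example.jsp"));
        System.out.println(resolve(page, "example.jsp"));
        System.out.println(resolve(page, "http://www.visual-paradigm.com/other/file.pdf"));
    }
}
```

Resolving against the page URL rather than a fixed host also keeps working if a link points to a different host or a different directory.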
That URL is an individual tutorial page, which has a PDF link in its upper-right corner. What remains is to download these PDFs. First, the VisualParadigmPage class:
    public class VisualParadigmPage {
        public static void process(String title, String url) throws Exception {
            Spider spider = new SpiderImpl("UTF-8");
            Watcher watcher = new WatcherImpl();
            watcher.addProcessor(new VisualParadigmPageProcessor(title));
            QuickNameFilter<HtmlNode> nodeFilter = new QuickNameFilter<HtmlNode>();
            nodeFilter.setNodeName("html");
            watcher.setNodeFilter(nodeFilter);
            spider.addWatcher(watcher);
            spider.processUrl("http://www.visual-paradigm.com" + url);
            System.out.println(System.currentTimeMillis() + "-" + url);
        }
    }
    public class VisualParadigmPageProcessor implements Processor {
        static HttpClient httpClient = new HttpClient();
        private final String title;

        public VisualParadigmPageProcessor(String title) {
            this.title = title;
        }

        public void process(String url, HtmlNode node, Map<String, Object> parameters) throws Exception {
            // Page title, used as the PDF file name
            NameFilter<HtmlNode> titleUrlFilter = new NameFilter<HtmlNode>(node);
            titleUrlFilter.setNodeName("title");
            HtmlNode titleNode = titleUrlFilter.findNode();

            // <a class="pdf notranslate"> holds the PDF link
            NameFilter<HtmlNode> pdfUrlFilter = new NameFilter<HtmlNode>(node);
            pdfUrlFilter.setNodeName("a");
            pdfUrlFilter.setIncludeAttribute("class", "pdf notranslate");
            HtmlNode pdfNode = pdfUrlFilter.findNode();

            // <ol class="contentPoint"> lists the other attachments;
            // it gets its own filter rather than reusing pdfUrlFilter
            NameFilter<HtmlNode> olFilter = new NameFilter<HtmlNode>(node);
            olFilter.setNodeName("ol");
            olFilter.setIncludeAttribute("class", "contentPoint");
            HtmlNode olNode = olFilter.findNode();

            if (pdfNode != null && titleNode != null) {
                String pdfUrl = "http://www.visual-paradigm.com" + pdfNode.getAttribute("href");
                saveUrl(titleNode.getPureText() + ".pdf", pdfUrl);
            }
            if (olNode != null && olNode.getSubNodes("a") != null) {
                for (HtmlNode aNode : olNode.getSubNodes("a")) {
                    String vppUrl = "http://www.visual-paradigm.com" + aNode.getAttribute("href");
                    saveUrl(aNode.getPureText(), vppUrl);
                }
            }
        }

        private void saveUrl(String name, String urlAddress) throws IOException {
            String fileName = "E:\\臨時\\spider\\" + title + "\\" + name;
            GetMethod getMethod = new GetMethod(urlAddress);
            try {
                int iGetResultCode = httpClient.executeMethod(getMethod);
                if (iGetResultCode == HttpStatus.SC_OK) {
                    InputStream inputStream = getMethod.getResponseBodyAsStream();
                    try {
                        OutputStream outputStream = new FileOutputStream(fileName);
                        try {
                            byte[] buffer = new byte[4096];
                            int n;
                            while ((n = inputStream.read(buffer)) != -1) {
                                outputStream.write(buffer, 0, n);
                            }
                        } finally {
                            outputStream.close();
                        }
                    } finally {
                        inputStream.close();
                    }
                }
            } finally {
                // Always hand the connection back, even if the download fails
                getMethod.releaseConnection();
            }
        }
    }
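VisualParadigmPageProcessor finds the title in the article, then the PDF link, then the links to any other attachments, and saves whatever it finds. That completes the coding.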
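The saving step boils down to copying an input stream into a file. Below is a minimal sketch of that copy on its own, assuming Java 7 or later for try-with-resources. `StreamSaver`, `save`, and the sample file names are hypothetical, and the HTTP request is left out so the example runs without a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class StreamSaver {
    // Copies everything from 'in' to 'target', creating missing parent
    // directories first. Returns the number of bytes written.
    static long save(InputStream in, File target) throws IOException {
        File parent = target.getParentFile();
        if (parent != null && !parent.exists() && !parent.mkdirs()) {
            throw new IOException("Cannot create directory " + parent);
        }
        long total = 0;
        // try-with-resources closes both streams even if the copy fails.
        try (InputStream source = in;
             OutputStream out = new FileOutputStream(target)) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = source.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Writes a placeholder file under the system temp directory.
        File dir = new File(System.getProperty("java.io.tmpdir"), "spider-sketch");
        File target = new File(new File(dir, "UML Modeling"), "sample.pdf");
        byte[] fakeBody = "not a real PDF".getBytes("UTF-8");
        long written = save(new ByteArrayInputStream(fakeBody), target);
        System.out.println(written + " bytes -> " + target);
    }
}
```

In a real download the `InputStream` would come from the HTTP response body; the copy loop itself stays the same.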
    Directory of D:\BaiduYunDownload\VPTutorials

    2014/11/02  20:06    <DIR>    .
    2014/11/02  20:06    <DIR>    ..
    2014/11/02  20:05    <DIR>    Business Modeling
    2014/11/02  20:05    <DIR>    Business Process Modeling
    2014/11/02  20:05    <DIR>    Business Rule
    2014/11/02  20:05    <DIR>    Code Engineering
    2014/11/02  20:05    <DIR>    Customization
    2014/11/02  20:05    <DIR>    Data Modeling
    2014/11/02  20:05    <DIR>    Database Tools
    2014/11/02  20:05    <DIR>    Design Animation
    2014/11/02  20:05    <DIR>    Diagramming
    2014/11/02  20:05    <DIR>    Enterprise Architecture
    2014/11/02  20:05    <DIR>    Glossary
    2014/11/02  20:05    <DIR>    Grid
    2014/11/02  20:05    <DIR>    IDE Integration
    2014/11/02  20:05    <DIR>    Impact Analysis
    2014/11/02  20:05    <DIR>    Interoperability
    2014/11/02  20:05    <DIR>    Modeling Toolset
    2014/11/02  20:05    <DIR>    Object Relational Mapping
    2014/11/02  20:05    <DIR>    Plug-in Development
    2014/11/02  20:05    <DIR>    Process Simulation
    2014/11/02  20:05    <DIR>    Project Referencing
    2014/11/02  20:06    <DIR>    Reporting
    2014/11/02  20:06    <DIR>    Requirements Capturing
    2014/11/02  20:06    <DIR>    SoaML Modeling
    2014/11/02  20:06    <DIR>    Team Collaboration
    2014/11/02  20:06    <DIR>    UML Modeling
    2014/11/02  20:06    <DIR>    Use Case Modeling
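Above is the result he delivered to me: 108 MB in total. The task was done very nicely.
HulkZ has already pushed this code to me, and it now lives in the TinySpider project.
While working on this, we also found that TinySpider could not handle HTML documents delivered with gzip compression, so support for gzip-compressed HTML content was added as well.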
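For readers curious what gzip support involves, here is a minimal sketch using only `java.util.zip`. It is not the code that was added to TinySpider; `GzipHtmlSketch` and its methods are hypothetical. It assumes the whole response body is already in memory, checks for the gzip magic bytes (0x1f 0x8b), and decompresses only when they are present.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipHtmlSketch {
    // A gzip stream always starts with the two bytes 0x1f 0x8b.
    static boolean isGzip(byte[] body) {
        return body.length >= 2
            && (body[0] & 0xff) == 0x1f
            && (body[1] & 0xff) == 0x8b;
    }

    // Returns the body as text, decompressing it first if it is gzip data.
    static String decode(byte[] body, String charset) throws IOException {
        if (!isGzip(body)) {
            return new String(body, charset);
        }
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                plain.write(buffer, 0, n);
            }
        }
        return plain.toString(charset);
    }

    // Helper used only by the demo: compresses text the way a server might.
    static byte[] gzip(String text, String charset) throws IOException {
        ByteArrayOutputStream packed = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(packed)) {
            out.write(text.getBytes(charset));
        }
        return packed.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] packed = gzip("<html><title>UML Modeling</title></html>", "UTF-8");
        System.out.println(isGzip(packed));
        System.out.println(decode(packed, "UTF-8"));
    }
}
```

A crawler would more commonly decide based on the response's `Content-Encoding: gzip` header; sniffing the magic bytes is a simple fallback for when the header is unavailable.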
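For more, see the Tiny framework website at http://www.tinygroup.org, or have a look at my blog; I trust you won't leave empty-handed.
You can also add me on QQ to talk directly, or join the Tiny group to interact with TinyFans.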