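1. HtmlUnit is an open-source Java tool for analyzing web pages. Once a page has been loaded, HtmlUnit can be used to inspect its content. The project simulates a running browser and has been called an open-source Java implementation of a browser. Because this browser has no user interface, it also runs very quickly.
2. Download: http://sourceforge.net/projects/htmlunit/?source=directory
3. Fetching a given page
The first problem a web crawler faces is how to fetch a web page. Fetching is actually easy and not as complicated as you might think: with the open-source HtmlUnit package, four main lines of code are enough.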
    import java.io.IOException;
    import java.net.MalformedURLException;
    import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class Main {

        public static void main(String[] args)
                throws FailingHttpStatusCodeException, MalformedURLException, IOException {
            // Create the headless browser.
            final WebClient mWebClient = new WebClient();
            // Load the page.
            final HtmlPage mHtmlPage = mWebClient.getPage("http://www.baidu.com");
            // Print the page's text content.
            System.out.println(mHtmlPage.asText());
            // Close all windows and release resources.
            mWebClient.closeAllWindows();
        }

    }
Output:
    Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
    SEVERE: runtimeError: message=[An invalid or illegal selector was specified (selector: ':checked' error: Invalid selector: *:checked).] sourceName=[http://s1.bdstatic.com/r/www/cache/static/jquery/jquery-1.10.2.min_f2fb5194.js] line=[14] lineSource=[null] lineOffset=[0]
    Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
    SEVERE: runtimeError: message=[An invalid or illegal selector was specified (selector: ':enabled' error: Invalid selector: *:enabled).] sourceName=[http://s1.bdstatic.com/r/www/cache/static/jquery/jquery-1.10.2.min_f2fb5194.js] line=[14] lineSource=[null] lineOffset=[0]
    Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
    SEVERE: runtimeError: message=[The data necessary to complete this operation is not yet available.] sourceName=[http://s1.bdstatic.com/r/www/cache/static/jquery/jquery-1.10.2.min_f2fb5194.js] line=[10] lineSource=[null] lineOffset=[0]
    Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    WARNING: CSS error: 'http://www.baidu.com/' [1:81] Error in expression. (Invalid token ";". Was expecting one of: <S>, <NUMBER>, "inherit", <IDENT>, <STRING>, <PLUS>, <HASH>, <EMS>, <EXS>, <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, <FREQ_HZ>, <FREQ_KHZ>, <PERCENTAGE>, <DIMENSION>, <URI>, <FUNCTION>, "-".)
    Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    WARNING: CSS error: 'http://www.baidu.com/' [1:143] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
    Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
    WARNING: CSS warning: 'http://www.baidu.com/' [1:143] Ignoring the following declarations in this rule.
    Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    WARNING: CSS error: 'http://www.baidu.com/' [1:339] Error in expression. (Invalid token ";". Was expecting one of: <S>, <NUMBER>, "inherit", <IDENT>, <STRING>, <PLUS>, <HASH>, <EMS>, <EXS>, <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, <FREQ_HZ>, <FREQ_KHZ>, <PERCENTAGE>, <DIMENSION>, <URI>, <FUNCTION>, "-".)
    Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    WARNING: CSS error: 'http://www.baidu.com/' [2:204] Error in declaration. (Invalid token "normal". Was expecting one of: <S>, ":".)
    Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    WARNING: CSS error: 'http://www.baidu.com/' [2:970] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
    Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
    WARNING: CSS warning: 'http://www.baidu.com/' [2:970] Ignoring the following declarations in this rule.
    Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    WARNING: CSS error: 'http://www.baidu.com/' [4:856] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
    Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
    WARNING: CSS warning: 'http://www.baidu.com/' [4:856] Ignoring the following declarations in this rule.
    Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    WARNING: CSS error: 'http://www.baidu.com/' [4:1016] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
    Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
    WARNING: CSS warning: 'http://www.baidu.com/' [4:1016] Ignoring the following declarations in this rule.
    Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    WARNING: CSS error: 'http://www.baidu.com/' [5:68] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
    Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
    WARNING: CSS warning: 'http://www.baidu.com/' [5:68] Ignoring the following declarations in this rule.
    Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    WARNING: CSS error: 'http://www.baidu.com/' [6:751] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
    Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
    WARNING: CSS warning: 'http://www.baidu.com/' [6:751] Ignoring the following declarations in this rule.
    Feb 03, 2015 11:46:02 AM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
    WARNING: CSS error: 'http://www.baidu.com/' [8:127] Error in expression; ':' found after identifier "progid".
    Feb 03, 2015 11:46:03 AM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
    WARNING: Obsolete content type encountered: 'text/javascript'.
    Feb 03, 2015 11:46:03 AM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
    WARNING: Obsolete content type encountered: 'text/javascript'.
    百度一下,你就知道
    百度一下
    新聞hao123地圖視頻貼吧登陸設置更多產品
    把百度設爲主頁關於百度About Baidu
    ©2015 Baidu 使用百度前必讀 京ICP證030173號
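Running the program above returns the full content of the Baidu home page, but it also produces a large number of warnings. There are two main reasons for these warnings:
1. HtmlUnit's JavaScript support is not very good.
2. HtmlUnit's CSS support is not very good.
With these two points in mind, rewrite the code and disable what should be disabled. Turning off unneeded features also makes the program run faster, and a web crawler has no need for CSS support anyway.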
    import java.io.IOException;
    import java.net.MalformedURLException;
    import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class Main {

        public static void main(String[] args)
                throws FailingHttpStatusCodeException, MalformedURLException, IOException {
            final WebClient mWebClient = new WebClient();
            // Disable CSS processing; a crawler does not need it.
            mWebClient.getOptions().setCssEnabled(false);
            // Disable JavaScript execution.
            mWebClient.getOptions().setJavaScriptEnabled(false);
            final HtmlPage mHtmlPage = mWebClient.getPage("http://www.baidu.com");
            System.out.println(mHtmlPage.asText());
            mWebClient.closeAllWindows();
        }

    }
Output:
    百度一下,你就知道
    搜索設置 | 登陸
    新 聞 網 頁 貼 吧 知 道 MP3 圖 片 視 頻 地 圖
    百度一下
    輸入法
    手寫
    拼音
    關閉
    空間 百科 hao123 | 更多>>
    把百度設爲主頁
    加入百度推廣 | 搜索風雲榜 | 關於百度 | About Baidu
    ©2014 Baidu 使用百度前必讀 京ICP證030173號
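When JavaScript has to stay enabled, another way to cut the noise is to lower the log level rather than disable the feature. The sketch below assumes the messages go through java.util.logging, which the format of the first log above suggests; if your project routes HtmlUnit's logging through a different backend, it will have no effect. The class name QuietHtmlUnitLogs and its methods are made up for this illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietHtmlUnitLogs {

    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so a configured logger could otherwise be garbage-collected.
    private static final Logger HTMLUNIT_LOGGER =
            Logger.getLogger("com.gargoylesoftware.htmlunit");

    /** Switches off java.util.logging output for the HtmlUnit package and its children. */
    public static void silence() {
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
    }

    /** Reports whether the HtmlUnit logger is currently switched off. */
    public static boolean isSilenced() {
        return HTMLUNIT_LOGGER.getLevel() == Level.OFF;
    }

    public static void main(String[] args) {
        // Call this before creating the WebClient.
        silence();
        System.out.println("HtmlUnit logging silenced: " + isSilenced());
    }
}
```

Note that this hides the messages only; the scripts and style sheets are still downloaded and processed, so it does not bring the speed-up that disabling them does.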
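Using HtmlUnit: Introduction: put simply, HtmlUnit is a browser, a headless one written in Java. Because it has no user interface, it runs reasonably fast. HtmlUnit provides a set of APIs that cover a good range of tasks, such as filling in forms, submitting forms, and simulating clicks on links. Because it embeds the Rhino JavaScript engine, it can also execute JavaScript.
Uses: automated web testing (its original purpose), a browser, a web crawler
Using the key APIs: before introducing the APIs, it helps to understand how WebClient, WebWindow, and Page relate to one another. Every page ultimately lives inside a WebWindow object. A WebClient automatically creates a WebWindow when it is constructed, and calling getPage loads the new page into that WebWindow. You can think of WebClient as the IE engine and WebWindow as the browser window that displays the page. The relationship among the three is shown in the figure below:
[Figure: relationship among WebClient, WebWindow, and Page]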
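The containment relationship described above can be pictured with a toy model. This is not HtmlUnit code: ToyClient, ToyWindow, and ToyPage are hypothetical stand-ins written only to show who holds what, and the real classes do far more.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowModelSketch {

    /** Hypothetical stand-in for a loaded page. */
    static class ToyPage {
        final String url;
        ToyPage(String url) { this.url = url; }
    }

    /** Hypothetical stand-in for a window; it holds at most one page at a time. */
    static class ToyWindow {
        ToyPage enclosedPage; // null until a page is loaded
    }

    /** Hypothetical stand-in for the client; it owns the windows. */
    static class ToyClient {
        final List<ToyWindow> windows = new ArrayList<>();
        final ToyWindow currentWindow = new ToyWindow();

        ToyClient() {
            // A first window is created together with the client.
            windows.add(currentWindow);
        }

        /** "Loads" a page into the current window, replacing the previous one. */
        ToyPage getPage(String url) {
            ToyPage page = new ToyPage(url);
            currentWindow.enclosedPage = page;
            return page;
        }
    }

    public static void main(String[] args) {
        ToyClient client = new ToyClient();
        ToyPage page = client.getPage("http://www.baidu.com");
        // The page returned by getPage is the same object the client's window holds.
        System.out.println(client.currentWindow.enclosedPage == page);
    }
}
```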
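1. Simulating a specific browser. You can also specify the browser version (the latest HtmlUnit release at the time of writing, 2.13, can simulate Chrome, Firefox, and IE).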
    // Simulate Chrome; for other browsers, change the constant after "BrowserVersion."
    final WebClient mWebClient = new WebClient(BrowserVersion.CHROME);
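2. Finding specific elements. Specific HTML elements can be obtained from an HtmlPage through its get methods or through XPath, as in the following examples.
Method 1: using a get method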
    import java.io.IOException;
    import java.net.MalformedURLException;

    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlDivision;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class Main {

        public static void main(String[] args)
                throws FailingHttpStatusCodeException, MalformedURLException, IOException {
            final WebClient mWebClient = new WebClient(BrowserVersion.CHROME);
            mWebClient.getOptions().setCssEnabled(false);
            mWebClient.getOptions().setJavaScriptEnabled(false);
            final HtmlPage mHtmlPage = mWebClient.getPage("http://www.yanyulin.info/");
            // Get the content of the element with id "hed" from the Yanyulin blog.
            HtmlDivision mdiv = (HtmlDivision) mHtmlPage.getElementById("hed");
            System.out.println(mdiv.asText());
            mWebClient.closeAllWindows();
        }

    }
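Method 2: using XPath. XPath is typically used when an element cannot be found by id or when a more complex search is needed; see an XPath tutorial for the syntax.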
    // This also prints the content of "hed". In "//div", the leading "//" searches the
    // whole document for div elements and puts them into a list; we then take the first div.
    HtmlDivision mdiv = (HtmlDivision) mHtmlPage.getByXPath("//div").get(0);
    System.out.println(mdiv.asXml());
Output:
    <div id="hed">
      <div class="top_part">
        <div style="float:left;">
          <a href="http://www.yanyulin.info">
            <img src="http://www.yanyulin.info/theme/images/logo.png" alt="煙雨林-關注程序員的IT科技博客" title="煙雨林-關注程序員的IT科技博客" width="127px" height="29px"/>
          </a>
        </div>
        <div align="right" style="padding: 5px 0 0 0px;">
          <div class="side_search">
            <form action="http://zhannei.baidu.com/cse/search" method="get" target="_blank" class="bdcs-search-form" id="bdcs-search-form">
              <input type="hidden" name="s" value="36283161565572170"/>
              <input type="hidden" name="entry" value="1"/>
              <input type="text" name="q" class="search_input" id="bdcs-search-form-input" placeholder="找找看"/>
              <input type="submit" class="search_btn" id="bdcs-search-form-submit" value="找找看"/>
            </form>
          </div>
          <div id="google_search" class="side_search" style="margin-top:4px">
            <form method="get" action="http://www.google.com.hk/search" target="_blank">
              <input type="text" name="q" class="search_input" placeholder="Google一下"/>
              <input type="submit" value="Google" name="btnG" id="btnG" class="search_btn" title="Google"/>
              <input type="hidden" name="ie" value="UTF-8"/>
              <input type="hidden" name="oe" value="UTF-8"/>
              <input type="hidden" name="hl" value="zh-CN"/>
              <input type="hidden" name="domains" value="www.yanyulin.info"/>
              <input type="hidden" name="sitesearch" value="www.yanyulin.info"/>
            </form>
          </div>
        </div>
      </div>
    </div>
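To try out an XPath expression without fetching a live page, the JDK's own javax.xml.xpath package can be used as a stand-in. The sketch below is not HtmlUnit code, and the SAMPLE markup is made up to mimic the nesting of the output above; it only shows why taking the first match of "//div" yields the outermost "hed" element, and how a predicate narrows the search.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {

    // Made-up, well-formed markup with the same kind of nesting as the page above.
    static final String SAMPLE =
            "<html><body>"
          + "<div id=\"hed\"><div class=\"top_part\"><div class=\"side_search\"/></div></div>"
          + "<div id=\"footer\"/>"
          + "</body></html>";

    /** Evaluates an XPath expression against SAMPLE and returns the matching nodes. */
    static NodeList select(String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(SAMPLE)), XPathConstants.NODESET);
    }

    /** Returns the id attribute of the first match ("" if the attribute is absent). */
    static String firstId(String expression) throws XPathExpressionException {
        return ((Element) select(expression).item(0)).getAttribute("id");
    }

    public static void main(String[] args) throws XPathExpressionException {
        // "//div" matches every div in the document, however deeply nested.
        System.out.println(select("//div").getLength());
        // The first match is the div that opens first in the document.
        System.out.println(firstId("//div"));
        // A predicate selects one particular div directly.
        System.out.println(firstId("//div[@id='footer']"));
    }
}
```

Relying on the position of a match is fragile when the page layout changes, so an expression with a predicate such as "//div[@id='hed']" is usually the safer choice.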
3. Configuring a proxy server. Proxy configuration is simple: you only need to set the address, the port, and the user name and password.
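    // The proxy host is given as a bare host name (no "http://" scheme), followed by the port.
    final WebClient mWebClient = new WebClient(BrowserVersion.CHROME, "127.0.0.1", 8080);
    // Register the user name and password to authenticate with.
    final DefaultCredentialsProvider credentialsProvider =
            (DefaultCredentialsProvider) mWebClient.getCredentialsProvider();
    credentialsProvider.addCredentials("username", "password");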
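Proxy settings are often passed around as a single URL, whereas the WebClient constructor takes the host and the port as separate arguments. A small sketch of splitting such a URL with the JDK's java.net.URI follows; it assumes the host argument should be a bare host name, and the class ProxyAddressSketch and its methods are made up for this illustration.

```java
import java.net.URI;

public class ProxyAddressSketch {

    /** Extracts the bare host name from a proxy URL such as "http://127.0.0.1:8080". */
    static String hostOf(String proxyUrl) {
        return URI.create(proxyUrl).getHost();
    }

    /** Extracts the port from a proxy URL, or -1 if the URL does not specify one. */
    static int portOf(String proxyUrl) {
        return URI.create(proxyUrl).getPort();
    }

    public static void main(String[] args) {
        String proxyUrl = "http://127.0.0.1:8080"; // example value only
        System.out.println(hostOf(proxyUrl) + " " + portOf(proxyUrl)); // prints "127.0.0.1 8080"
    }
}
```

The two values can then be handed to the constructor in place of the literals.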
4. Simulating form submission
++++++++++++++++