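A while back I wrote a small online novel scraper and reader (see http://www.javashuo.com/article/p-vrqcfmvu-bo.html). When we tried to scrape a novel's table of contents from Qidian (起點網), we found that the catalog data was not in the HTML. It is fetched with an AJAX request when the page loads, and the div that holds it stays hidden until you click "目錄" (Catalog). After some digging we did eventually find the interface URL and got the data by building a POST request with HttpClient, but that approach is tedious and costly. Is there another way?
Some searching turned up a gem: HtmlUnit. Official site: http://htmlunit.sourceforge.net
The following introduction is quoted from the official site: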
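For reference, the interface-based approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the post does not give the real interface URL or its parameters, so `CATALOG_ENDPOINT` and the `bookId` form field are placeholders, and the JDK's `java.net.http.HttpClient` (Java 11+) stands in for whichever HttpClient library was actually used.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

final class CatalogRequestSketch {

    // Placeholder endpoint: the real interface URL is not given in this post.
    static final String CATALOG_ENDPOINT = "https://example.com/ajax/book/catalog";

    /** Builds a form-encoded POST request for the catalog of one book. */
    static HttpRequest buildCatalogRequest(String bookId) {
        // "bookId" is an assumed parameter name, used for illustration only.
        String form = "bookId=" + URLEncoder.encode(bookId, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(CATALOG_ENDPOINT))
                .timeout(Duration.ofSeconds(10))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();
    }

    /** Sends the request and returns the response body as a string. */
    static String send(HttpRequest request) throws IOException, InterruptedException {
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }

    public static void main(String[] args) {
        // Only build and print the request; send(...) is not called here
        // because the endpoint above is a placeholder.
        HttpRequest request = buildCatalogRequest("1209977");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

The cost mentioned above comes from everything this sketch leaves out: the endpoint, its parameters, and any required headers or tokens have to be reverse-engineered for each site.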
HtmlUnit is a "GUI-Less browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser.
It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating Chrome, Firefox or Internet Explorer depending on the configuration used.
It is typically used for testing purposes or to retrieve information from web sites.
HtmlUnit is not a generic unit testing framework. It is specifically a way to simulate a browser for testing purposes and is intended to be used within another testing framework such as JUnit or TestNG. Refer to the document "Getting Started with HtmlUnit" for an introduction.
HtmlUnit is used as the underlying "browser" by different Open Source tools like Canoo WebTest, JWebUnit, WebDriver, JSFUnit, WETATOR, Celerity, Spring MVC Test HtmlUnit, ...
HtmlUnit was originally written by Mike Bowler of Gargoyle Software and is released under the Apache 2 license. Since then, it has received many contributions from other developers, and would not be where it is today without their assistance.
HtmlUnit provides excellent JavaScript support, simulating the behavior of the configured browser (Firefox or Internet Explorer). It uses the Rhino JavaScript engine for the core language (plus workarounds for some Rhino bugs) and provides the implementation for the objects specific to execution in a browser.
In short: HtmlUnit is a browser without a GUI for Java programs. It models HTML documents and exposes an API for loading pages, filling out forms and clicking links. Its JavaScript support, built on the Rhino engine, copes with fairly complex AJAX libraries while simulating the configured browser. It is meant for testing or for retrieving information from web sites, usually inside a framework such as JUnit or TestNG, and it is released under the Apache 2 license.
Quick start guide: http://htmlunit.sourceforge.net/gettingStarted.html
Maven dependency:
    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.32</version>
    </dependency>
For the catalog we were trying to fetch earlier, we can do this:
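    import java.io.IOException;

    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    // Create a WebClient that simulates a specific browser.
    // try-with-resources closes the client when we are done.
    try (WebClient webClient = new WebClient(BrowserVersion.FIREFOX_52)) {
        // A few important settings
        webClient.getOptions().setJavaScriptEnabled(true); // enable JavaScript
        webClient.setAjaxController(new NicelyResynchronizingAjaxController()); // set up AJAX handling
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(true); // throw on failing HTTP status codes
        webClient.getOptions().setThrowExceptionOnScriptError(true); // throw on JavaScript errors
        webClient.getOptions().setCssEnabled(false); // disable CSS: no UI, so nothing to render
        webClient.getOptions().setTimeout(10000); // timeout, in milliseconds

        // Load the Qidian book detail / catalog page
        HtmlPage page = webClient.getPage("https://book.qidian.com/info/1209977");

        // Give background JavaScript up to 5 seconds to finish
        webClient.waitForBackgroundJavaScript(5000);

        // Simulate a click on "目錄" (Catalog)
        page = page.getHtmlElementById("j_catalogPage").click();

        // Print the page source
        System.out.println(page.asXml());
    } catch (IOException e) {
        e.printStackTrace();
    }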
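Before the JavaScript runs, the catalog data is missing from the page source.
Once the JavaScript has made its request and rendered the data, fetching the page source again gives us HTML that includes the catalog data.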
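With the catalog now in the markup, the next step is pulling the chapter titles out of it. Below is a minimal sketch that relies on assumptions: the `//li/a` structure and the sample markup are made up for illustration and are not Qidian's confirmed markup. To stay dependency-free it runs the JDK's `javax.xml.xpath` over a small well-formed string. Real `asXml()` output may carry a DOCTYPE or namespaces that need extra handling, and with HtmlUnit on the classpath, calling `getByXPath` on the page directly is probably the more natural route.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class CatalogExtractSketch {

    /** Returns the trimmed text of every link found inside a list item. */
    static List<String> chapterTitles(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // "//li/a" is an assumed structure, chosen for illustration only.
            NodeList links = (NodeList) xpath.evaluate(
                    "//li/a", new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++) {
                titles.add(links.item(i).getTextContent().trim());
            }
            return titles;
        } catch (XPathExpressionException e) {
            // Raised for malformed input as well as for a bad expression.
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the catalog part of page.asXml().
        String sample = "<div><ul>"
                + "<li><a href=\"/c/1\">Chapter 1</a></li>"
                + "<li><a href=\"/c/2\"> Chapter 2 </a></li>"
                + "</ul></div>";
        for (String title : chapterTitles(sample)) {
            System.out.println(title);
        }
    }
}
```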
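Even these few lines show how powerful HtmlUnit is: in theory, it can simulate anything a browser can do. I am noting it down here for now and will add it to the online novel scraper and reader (see http://www.javashuo.com/article/p-vrqcfmvu-bo.html) when I have time.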