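Goal: crawling dynamic web pages
Note: "dynamic web page" here covers two cases: 1) pages that require user interaction, such as the usual login step; 2) pages whose content is generated by JS / AJAX, e.g. the HTML contains <div id="test"></div> and JS turns it into <div id="test"><span>aaa</span></div>.
The crawler here is built on WebCollector 2, which is convenient to use, but handling dynamic pages still depends on another API: Selenium 2 (which integrates HtmlUnit and PhantomJS).
1) Crawling pages that require login, e.g. Sina Weibo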
- import java.util.Set;
-
- import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
- import cn.edu.hfut.dmic.webcollector.model.Links;
- import cn.edu.hfut.dmic.webcollector.model.Page;
- import cn.edu.hfut.dmic.webcollector.net.HttpRequesterImpl;
-
- import org.openqa.selenium.Cookie;
- import org.openqa.selenium.WebElement;
- import org.openqa.selenium.htmlunit.HtmlUnitDriver;
- import org.jsoup.nodes.Element;
- import org.jsoup.select.Elements;
-
- public class WebCollector1 extends DeepCrawler {
-
- public WebCollector1(String crawlPath) {
- super(crawlPath);
-
- try {
- String cookie=WebCollector1.WeiboCN.getSinaCookie("yourAccount", "yourPwd");
- HttpRequesterImpl myRequester=(HttpRequesterImpl) this.getHttpRequester();
- myRequester.setCookie(cookie);
- } catch (Exception e) {
- e.printStackTrace();
- }
- }
-
- @Override
- public Links visitAndGetNextLinks(Page page) {
-
- Elements weibos=page.getDoc().select("div.c");
- for(Element weibo:weibos){
- System.out.println(weibo.text());
- }
-
- return null;
- }
-
- public static void main(String[] args) {
- WebCollector1 crawler=new WebCollector1("/home/hu/data/weibo");
- crawler.setThreads(3);
-
- for(int i=0;i<5;i++){
- crawler.addSeed("http://weibo.cn/zhouhongyi?vt=4&page="+i);
- }
- try {
- crawler.start(1);
- } catch (Exception e) {
- e.printStackTrace();
- }
- }
-
- public static class WeiboCN {
-
-
- public static String getSinaCookie(String username, String password) throws Exception{
- StringBuilder sb = new StringBuilder();
- HtmlUnitDriver driver = new HtmlUnitDriver();
- driver.setJavascriptEnabled(true);
- driver.get("http://login.weibo.cn/login/");
-
- WebElement mobile = driver.findElementByCssSelector("input[name=mobile]");
- mobile.sendKeys(username);
- WebElement pass = driver.findElementByCssSelector("input[name^=password]");
- pass.sendKeys(password);
- WebElement rem = driver.findElementByCssSelector("input[name=remember]");
- rem.click();
- WebElement submit = driver.findElementByCssSelector("input[name=submit]");
- submit.click();
-
- Set<Cookie> cookieSet = driver.manage().getCookies();
- driver.close();
- for (Cookie cookie : cookieSet) {
- sb.append(cookie.getName()+"="+cookie.getValue()+";");
- }
- String result=sb.toString();
- if(result.contains("gsid_CTandWM")){
- return result;
- }else{
- throw new Exception("weibo login failed");
- }
- }
- }
-
- }
* The custom path /home/hu/data/weibo (WebCollector1 crawler=new WebCollector1("/home/hu/data/weibo");) is where the crawler stores its data in the embedded Berkeley DB database.
* The code is largely taken from the sample by the WebCollector author.
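The cookie handling in getSinaCookie comes down to two steps: joining the cookies into one "name=value;" string, and checking that the login marker cookie is present. Below is a minimal stand-alone sketch of just that step. It uses a plain Map instead of Selenium's Cookie objects, and the names CookieHeaderSketch, join, and requireLoggedIn are hypothetical, not part of WebCollector or Selenium.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CookieHeaderSketch {

    // Joins cookies as "name=value;" pairs, the same format getSinaCookie builds.
    public static String join(Map<String, String> cookies) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append(';');
        }
        return sb.toString();
    }

    // Returns the header unchanged if the marker cookie name occurs in it, otherwise throws.
    public static String requireLoggedIn(String header, String marker) {
        if (!header.contains(marker)) {
            throw new IllegalStateException("login failed: missing cookie " + marker);
        }
        return header;
    }

    public static void main(String[] args) {
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("gsid_CTandWM", "abc123"); // made-up value, for illustration only
        cookies.put("other", "xyz");           // made-up cookie, for illustration only
        System.out.println(requireLoggedIn(join(cookies), "gsid_CTandWM"));
    }
}
```

Keeping the check separate from the join makes it easy to swap in a different marker name if the site changes its session cookie.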
2) Crawling HTML elements generated dynamically by JS
- import java.util.List;
-
- import org.openqa.selenium.By;
- import org.openqa.selenium.WebDriver;
- import org.openqa.selenium.WebElement;
-
- import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
- import cn.edu.hfut.dmic.webcollector.model.Links;
- import cn.edu.hfut.dmic.webcollector.model.Page;
-
- public class WebCollector3 extends DeepCrawler {
-
- public WebCollector3(String crawlPath) {
- super(crawlPath);
-
- }
-
- @Override
- public Links visitAndGetNextLinks(Page page) {
-
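- WebDriver driver = PageUtils.getWebDriver(page);
- try {
- List<WebElement> divInfos=driver.findElements(By.cssSelector("#feed_content span"));
- for(WebElement divInfo:divInfos){
- System.out.println("Text: " + divInfo.getText());
- }
- } finally {
- // quit the driver so each visited page does not leave a PhantomJS process behind
- driver.quit();
- }
- return null;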
- }
-
- public static void main(String[] args) {
- WebCollector3 crawler=new WebCollector3("/home/hu/data/wb");
- // one seed is enough: the URL does not depend on a page number
- crawler.addSeed("http://cq.qq.com/baoliao/detail.htm?294064");
- try {
- crawler.start(1);
- } catch (Exception e) {
- e.printStackTrace();
- }
- }
-
- }
PageUtils.java
- import java.io.BufferedReader;
- import java.io.IOException;
- import java.io.InputStream;
- import java.io.InputStreamReader;
-
- import org.openqa.selenium.JavascriptExecutor;
- import org.openqa.selenium.WebDriver;
- import org.openqa.selenium.chrome.ChromeDriver;
- import org.openqa.selenium.htmlunit.HtmlUnitDriver;
- import org.openqa.selenium.ie.InternetExplorerDriver;
- import org.openqa.selenium.phantomjs.PhantomJSDriver;
-
- import com.gargoylesoftware.htmlunit.BrowserVersion;
-
- import cn.edu.hfut.dmic.webcollector.model.Page;
-
- public class PageUtils {
- public static HtmlUnitDriver getDriver(Page page) {
- HtmlUnitDriver driver = new HtmlUnitDriver();
- driver.setJavascriptEnabled(true);
- driver.get(page.getUrl());
- return driver;
- }
-
- public static HtmlUnitDriver getDriver(Page page, BrowserVersion browserVersion) {
- HtmlUnitDriver driver = new HtmlUnitDriver(browserVersion);
- driver.setJavascriptEnabled(true);
- driver.get(page.getUrl());
- return driver;
- }
-
- public static WebDriver getWebDriver(Page page) {
-
-
- System.setProperty("phantomjs.binary.path", "D:\\Installs\\Develop\\crawling\\phantomjs-2.0.0-windows\\bin\\phantomjs.exe");
- WebDriver driver = new PhantomJSDriver();
- driver.get(page.getUrl());
-
- return driver;
- }
-
- public static String getPhantomJSDriver(Page page) {
- Runtime rt = Runtime.getRuntime();
- Process process = null;
- try {
- process = rt.exec("D:\\Installs\\Develop\\crawling\\phantomjs-2.0.0-windows\\bin\\phantomjs.exe " +
- "D:\\workspace\\crawlTest1\\src\\crawlTest1\\parser.js " +
- page.getUrl().trim());
- InputStream in = process.getInputStream();
- InputStreamReader reader = new InputStreamReader(
- in, "UTF-8");
- BufferedReader br = new BufferedReader(reader);
- StringBuffer sbf = new StringBuffer();
- String tmp = "";
- while((tmp = br.readLine())!=null){
- sbf.append(tmp);
- }
- return sbf.toString();
- } catch (IOException e) {
- e.printStackTrace();
- }
-
- return null;
- }
- }
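One detail in getPhantomJSDriver: readLine() strips line terminators and the loop appends each line without a separator, so the returned HTML is collapsed onto one line. If the line breaks matter, for example inside <pre> blocks or inline scripts, a helper along these lines could be used. This is only a sketch; the class name StreamTextSketch and the method readAll are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class StreamTextSketch {

    // Reads the whole stream as UTF-8 text and puts a "\n" back after every line.
    public static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader (and the underlying stream) when done
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] html = "<div id=\"test\">\n<span>aaa</span>\n</div>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(html)));
    }
}
```

In getPhantomJSDriver the same helper would take process.getInputStream() as its argument.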
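2.1) The HtmlUnitDriver getDriver methods are the older approach and are outdated here; use WebDriver getWebDriver instead. Note that HtmlUnitDriver is itself a WebDriver implementation rather than a Selenium 1.x (Selenium RC) API, so the real change is programming against the WebDriver interface.
2.2) Several approaches come up here: HtmlUnitDriver, ChromeDriver, PhantomJSDriver, and native PhantomJS. According to http://blog.csdn.net/five3/article/details/19085303, their pros and cons are as follows: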
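| Driver type | Pros | Cons | Use case |
| --- | --- | --- | --- |
| Real browser driver | Faithfully simulates user behavior | Low efficiency and stability | Compatibility testing |
| HtmlUnit | Fast | Its JS engine is not one used by mainstream browsers | Testing pages with little JS |
| PhantomJS | Medium speed; behavior close to a real browser | Cannot emulate different or specific browsers | Non-GUI functional testing |

* Real browser drivers include Firefox, Chrome, and IE.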
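2.3) When using PhantomJSDriver I hit this error: ClassNotFoundException: org.openqa.selenium.browserlaunchers.Proxies. The cause turned out to be a bug affecting selenium 2.44; it was only resolved after I obtained phantomjsdriver-1.2.1.jar through Maven.
2.4) I also tried calling PhantomJS natively, that is, invoking PhantomJS directly without Selenium (see the getPhantomJSDriver method above). The native route needs a JS script; the parser.js code is as follows: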
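- var system = require('system');
- // args[0] is the script name, args[1] is the URL passed in by the caller
- var address = system.args[1];
- var page = require('webpage').create();
- var url = address;
- page.open(url, function (status) {
-
- if (status !== 'success') {
- console.log('Unable to load ' + url);
- } else {
- // print the rendered HTML, including elements generated by JS
- console.log(page.content);
- }
- phantom.exit();
- });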
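On the Java side, getPhantomJSDriver passes a single concatenated string to Runtime.exec, which splits it on whitespace, so any path or URL containing a space would be broken into separate arguments. Passing each argument separately through ProcessBuilder avoids that. The following is a sketch under that assumption; the class name PhantomCommandSketch and its methods are hypothetical, and the paths in main are placeholders.

```java
import java.util.Arrays;
import java.util.List;

public class PhantomCommandSketch {

    // Argument list: executable, script, then the URL (seen as system.args[1] in parser.js).
    public static List<String> buildCommand(String phantomPath, String scriptPath, String url) {
        return Arrays.asList(phantomPath, scriptPath, url.trim());
    }

    // Creates, but does not start, a process for the command above.
    public static ProcessBuilder newProcess(String phantomPath, String scriptPath, String url) {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(phantomPath, scriptPath, url));
        // Merge stderr into stdout so a single reader drains all output.
        pb.redirectErrorStream(true);
        return pb;
    }

    public static void main(String[] args) {
        // Placeholder paths; substitute the real phantomjs binary and parser.js locations.
        ProcessBuilder pb = newProcess("phantomjs", "parser.js", " http://example.com/ ");
        System.out.println(pb.command());
    }
}
```

Calling pb.start() then returns the Process whose getInputStream() carries the HTML printed by parser.js.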
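3) Closing notes
3.1) In my experience, HtmlUnitDriver + PhantomJSDriver is currently the most reliable approach for crawling dynamic pages.
3.2) This process involved many packages and executables, and downloading them ran into the firewall more than once; anyone who needs them can ask me.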
Reference
http://www.ibm.com/developerworks/cn/web/1309_fengyq_seleniumvswebdriver/
http://blog.csdn.net/smilings/article/details/7395509
http://phantomjs.org/download.html
http://blog.csdn.net/five3/article/details/19085303
http://phantomjs.org/quick-start.html