Building a Selenium Module and a DSL Module (Implemented in Kotlin) for a Crawler Framework

Date: 2019-11-24


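NetDiscovery is a crawler framework built on Vert.x and RxJava 2. I recently added two modules to it: a Selenium module and a DSL module.

1. Selenium Module

This module lets the crawler drive a browser the way a person would, so that pages can be fetched through simulated user behavior.

Selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as a real user would operate it. Supported browsers include IE (7, 8, 9, 10, 11), Mozilla Firefox, Safari, Google Chrome, and Opera. Its main uses are compatibility testing (checking that your application works well across different browsers and operating systems) and functional testing (creating regression tests that verify software functionality and user requirements). It also supports recording actions and automatically generating test scripts in languages such as .NET, Java, and Perl.

Selenium comprises a set of tools and APIs: Selenium IDE, Selenium RC, Selenium WebDriver, and Selenium Grid.

Of these, Selenium WebDriver is the tool for browser automation. It consists of client libraries for different languages plus "drivers" that automate actions in the browser.

1.1 Supporting Multiple Browsers

Thanks to Selenium WebDriver, the Selenium module can work with several browsers. It currently supports Chrome, Firefox, IE, and PhantomJS (a headless, scriptable WebKit browser engine).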

package com.cv4j.netdiscovery.selenium;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.ie.InternetExplorerDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.remote.CapabilityType;
import org.openqa.selenium.remote.DesiredCapabilities;

/**
 * Created by tony on 2018/1/28.
 */
public enum Browser implements WebDriverInitializer {

    CHROME {
        @Override
        public WebDriver init(String path) {
            System.setProperty("webdriver.chrome.driver", path);
            return new ChromeDriver();
        }
    },
    FIREFOX {
        @Override
        public WebDriver init(String path) {
            System.setProperty("webdriver.gecko.driver", path);
            return new FirefoxDriver();
        }
    },
    IE {
        @Override
        public WebDriver init(String path) {
            System.setProperty("webdriver.ie.driver", path);
            return new InternetExplorerDriver();
        }
    },
    PHANTOMJS {
        @Override
        public WebDriver init(String path) {

            DesiredCapabilities capabilities = new DesiredCapabilities();
            capabilities.setCapability("phantomjs.binary.path", path);
            capabilities.setCapability(CapabilityType.ACCEPT_SSL_CERTS, true);
            capabilities.setJavascriptEnabled(true);
            capabilities.setCapability("takesScreenshot", true);
            capabilities.setCapability("cssSelectorsEnabled", true);

            return new PhantomJSDriver(capabilities);
        }
    }
}
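
The enum above relies on constant-specific method bodies: each constant implements the interface method itself, so a caller can select a browser by name and get the matching initialization. The following is a minimal, dependency-free sketch of that pattern only. `Launcher`, `Engine`, `launch` and the returned strings are hypothetical stand-ins for `WebDriverInitializer`, `Browser` and real `WebDriver` instances, not part of NetDiscovery.

```java
import java.util.Locale;

class EngineDemo {

    // Hypothetical stand-in for WebDriverInitializer
    interface Launcher {
        String init(String path);
    }

    // Each constant supplies its own init(), as Browser does above
    enum Engine implements Launcher {
        CHROME {
            @Override
            public String init(String path) {
                return "chrome driver at " + path;
            }
        },
        FIREFOX {
            @Override
            public String init(String path) {
                return "gecko driver at " + path;
            }
        }
    }

    // Resolve a constant from configuration text, ignoring case and surrounding spaces
    static String launch(String name, String path) {
        return Engine.valueOf(name.trim().toUpperCase(Locale.ROOT)).init(path);
    }

    public static void main(String[] args) {
        System.out.println(launch("chrome", "example/chromedriver"));
    }
}
```

Adding another browser then means adding one more constant, without touching the code that calls `init`.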

1.2 WebDriverPool

The reason for a WebDriverPool is that starting a WebDriver process is expensive, so drivers are kept in an object pool. I use Apache Commons Pool to implement the pooling.

package com.cv4j.netdiscovery.selenium.pool;

import org.apache.commons.pool2.impl.GenericObjectPool;
import org.openqa.selenium.WebDriver;

/**
 * Created by tony on 2018/3/9.
 */
public class WebDriverPool {

    private static GenericObjectPool<WebDriver> webDriverPool = null;

    /**
     * If you want to use WebDriverPool, you must call init() first.
     *
     * @param config
     */
    public static void init(WebDriverPoolConfig config) {

        webDriverPool = new GenericObjectPool<>(new WebDriverPooledFactory(config));
        webDriverPool.setMaxTotal(Integer.parseInt(System.getProperty(
                "webdriver.pool.max.total", "20"))); // maximum number of objects in the pool
        webDriverPool.setMinIdle(Integer.parseInt(System.getProperty(
                "webdriver.pool.min.idle", "1")));   // minimum number of idle objects
        webDriverPool.setMaxIdle(Integer.parseInt(System.getProperty(
                "webdriver.pool.max.idle", "20"))); // maximum number of idle objects allowed

        try {
            webDriverPool.preparePool();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static WebDriver borrowOne() {

        if (webDriverPool!=null) {

            try {
                return webDriverPool.borrowObject();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        return null;
    }

    public static void returnOne(WebDriver driver) {

        if (webDriverPool!=null) {

            webDriverPool.returnObject(driver);
        }
    }

    public static void destory() {

        if (webDriverPool!=null) {

            webDriverPool.clear();
            webDriverPool.close();
        }
    }

    public static boolean hasWebDriverPool() {

        return webDriverPool!=null;
    }
}
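
To make the benefit of pooling concrete, here is a dependency-free sketch of the borrow/return discipline. `TinyPool` and `runJobs` are hypothetical and far simpler than Commons Pool's `GenericObjectPool` (no validation, eviction or idle limits), and a `StringBuilder` stands in for an expensive `WebDriver`.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class TinyPoolDemo {

    // Hypothetical fixed-size pool: objects are created once, then reused
    static final class TinyPool<T> {
        private final BlockingQueue<T> idle;

        TinyPool(int size, Supplier<T> factory) {
            idle = new ArrayBlockingQueue<>(size);
            for (int i = 0; i < size; i++) {
                idle.add(factory.get()); // pay the creation cost up front
            }
        }

        T borrow() {
            try {
                return idle.take(); // blocks while every object is in use
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for a pooled object", e);
            }
        }

        void giveBack(T obj) {
            idle.offer(obj);
        }
    }

    // Runs the given number of jobs and returns how many objects were ever created
    static int runJobs(int jobs, int poolSize) {
        AtomicInteger created = new AtomicInteger();
        TinyPool<StringBuilder> pool = new TinyPool<>(poolSize, () -> {
            created.incrementAndGet();
            return new StringBuilder();
        });
        for (int i = 0; i < jobs; i++) {
            StringBuilder sb = pool.borrow();
            try {
                sb.setLength(0); // reset state left by the previous borrower
                sb.append("job ").append(i);
            } finally {
                pool.giveBack(sb); // always return, even if the job throws
            }
        }
        return created.get();
    }

    public static void main(String[] args) {
        System.out.println(runJobs(100, 2));
    }
}
```

However many jobs run, only `poolSize` objects are created. The `try`/`finally` around each use is the part that carries over to `borrowOne()` and `returnOne()`.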

1.3 SeleniumAction

Selenium can simulate browser behavior such as clicking, scrolling, and going back. Here a SeleniumAction class is abstracted to represent a simulated event.

package com.cv4j.netdiscovery.selenium.action;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;

/**
 * Created by tony on 2018/3/3.
 */
public abstract class SeleniumAction {

    public abstract SeleniumAction perform(WebDriver driver);

    public SeleniumAction doIt(WebDriver driver) {

        return perform(driver);
    }

    public static SeleniumAction clickOn(By by) {
        return new ClickOn(by);
    }

    public static SeleniumAction getUrl(String url) {
        return new GetURL(url);
    }

    public static SeleniumAction goBack() {
        return new GoBack();
    }

    public static SeleniumAction closeTabs() {
        return new CloseTab();
    }
}

1.4 SeleniumDownloader

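The Downloader is the crawler framework's download component; network requests can be implemented with, for example, the Vert.x WebClient or OkHttp3. To use Selenium, requests must go through SeleniumDownloader.

SeleniumDownloader accepts one or more SeleniumAction objects. Multiple actions are executed in the order in which they were added.

Importantly, the webDriver in SeleniumDownloader is borrowed from WebDriverPool and is returned to the pool when the downloader is closed.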

package com.cv4j.netdiscovery.selenium.downloader;

import com.cv4j.netdiscovery.core.config.Constant;
import com.cv4j.netdiscovery.core.domain.Request;
import com.cv4j.netdiscovery.core.domain.Response;
import com.cv4j.netdiscovery.core.downloader.Downloader;
import com.cv4j.netdiscovery.selenium.action.SeleniumAction;
import com.cv4j.netdiscovery.selenium.pool.WebDriverPool;
import com.safframework.tony.common.utils.Preconditions;
import io.reactivex.Maybe;
import io.reactivex.MaybeEmitter;
import io.reactivex.MaybeOnSubscribe;
import io.reactivex.functions.Function;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;

import java.util.LinkedList;
import java.util.List;

/**
 * Created by tony on 2018/1/28.
 */
public class SeleniumDownloader implements Downloader {

    private WebDriver webDriver;
    private List<SeleniumAction> actions = new LinkedList<>();

    public SeleniumDownloader() {

        this.webDriver = WebDriverPool.borrowOne(); // borrow a webDriver from the pool
    }

    public SeleniumDownloader(SeleniumAction action) {

        this.webDriver = WebDriverPool.borrowOne(); // borrow a webDriver from the pool
        this.actions.add(action);
    }

    public SeleniumDownloader(List<SeleniumAction> actions) {

        this.webDriver = WebDriverPool.borrowOne(); // borrow a webDriver from the pool
        this.actions.addAll(actions);
    }

    @Override
    public Maybe<Response> download(Request request) {

        return Maybe.create(new MaybeOnSubscribe<String>(){

            @Override
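            public void subscribe(MaybeEmitter<String> emitter) throws Exception {

                if (webDriver!=null) {
                    webDriver.get(request.getUrl());

                    if (Preconditions.isNotBlank(actions)) {

                        actions.forEach(
                                action-> action.perform(webDriver)
                        );
                    }

                    emitter.onSuccess(webDriver.getPageSource());
                } else {
                    // No driver was borrowed (pool not initialized): complete empty instead of never terminating
                    emitter.onComplete();
                }
            }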
        }).map(new Function<String, Response>() {

            @Override
            public Response apply(String html) throws Exception {

                Response response = new Response();
                response.setContent(html.getBytes());
                response.setStatusCode(Constant.OK_STATUS_CODE);
                response.setContentType(getContentType(webDriver));
                return response;
            }
        });
    }

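    /**
     * Asks the browser for the current document's content type,
     * falling back to "text/html" when none is available.
     *
     * @param webDriver
     * @return
     */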
    private String getContentType(final WebDriver webDriver) {

        if (webDriver instanceof JavascriptExecutor) {

            final JavascriptExecutor jsExecutor = (JavascriptExecutor) webDriver;
            // TODO document.contentType does not exist.
            final Object ret = jsExecutor
                    .executeScript("return document.contentType;");
            if (ret != null) {
                return ret.toString();
            }
        }
        return "text/html";
    }


    @Override
    public void close() {

        if (webDriver!=null) {
            WebDriverPool.returnOne(webDriver); // return the webDriver to the pool
        }
    }
}
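
The ordering guarantee comes from iterating the action list after the page has been loaded. The sketch below reproduces only that control flow using JDK classes. `Action`, `named` and the string log are hypothetical stand-ins for `SeleniumAction` and a real `WebDriver`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ActionChainDemo {

    // Hypothetical stand-in for SeleniumAction; the "driver" is just a log of what happened
    interface Action {
        void perform(List<String> driverLog);
    }

    // Builds an action that records its own name when performed
    static Action named(String name) {
        return log -> log.add(name);
    }

    // Mirrors the downloader's flow: load the page, run each action in list order, read the source
    static List<String> download(String url, List<Action> actions) {
        List<String> log = new ArrayList<>();
        log.add("get " + url);
        actions.forEach(action -> action.perform(log));
        log.add("read page source");
        return log;
    }

    public static void main(String[] args) {
        List<Action> actions = Arrays.asList(
                named("type keyword"), named("click search"), named("sort by sales"));
        System.out.println(download("https://search.jd.com/", actions));
    }
}
```

Because the page source is read only after the last action, the parser sees the page as the actions left it.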

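1.5 Some Useful Utilities

In addition, the Selenium module has a utility class containing browser operations such as scrollTo, scrollBy, and clickElement.

A more distinctive feature is taking a screenshot of the current page, or of a specific region of it.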

    public static void taskScreenShot(WebDriver driver,String pathName){

        // Passing OutputType.FILE to getScreenshotAs() returns the captured screen as a file.
        File srcFile = ((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
        // Use the IOUtils utility class's copyFile() method to save the file returned by getScreenshotAs().

        try {
            IOUtils.copyFile(srcFile, new File(pathName));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void taskScreenShot(WebDriver driver,WebElement element,String pathName) {

        // Passing OutputType.FILE to getScreenshotAs() returns the captured screen as a file.
        File srcFile = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
        // Use the IOUtils utility class's copyFile() method to save the file returned by getScreenshotAs().

        try {
            // Get the element's position within its frame
            Point p = element.getLocation();
            // Get the element's width and height
            int width = element.getSize().getWidth();
            int height = element.getSize().getHeight();
            // Rectangle describing the region to crop
            Rectangle rect = new Rectangle(width, height);
            BufferedImage img = ImageIO.read(srcFile);
            BufferedImage dest = img.getSubimage(p.getX(), p.getY(), rect.width, rect.height);
            ImageIO.write(dest, "png", srcFile);
            IOUtils.copyFile(srcFile, new File(pathName));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Takes a screenshot of a given region.
     *
     * @param driver
     * @param x
     * @param y
     * @param width
     * @param height
     * @param pathName
     */
    public static void taskScreenShot(WebDriver driver,int x,int y,int width,int height,String pathName) {

        // Passing OutputType.FILE to getScreenshotAs() returns the captured screen as a file.
        File srcFile = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
        // Use the IOUtils utility class's copyFile() method to save the file returned by getScreenshotAs().

        try {
            // Rectangle describing the region to crop
            Rectangle rect = new Rectangle(width, height);
            BufferedImage img = ImageIO.read(srcFile);
            BufferedImage dest = img.getSubimage(x, y, rect.width, rect.height);
            ImageIO.write(dest, "png", srcFile);
            IOUtils.copyFile(srcFile, new File(pathName));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
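
The region variants depend on `BufferedImage.getSubimage`, which throws a `RasterFormatException` when the requested rectangle extends past the image. One way to guard against that, sketched here with JDK classes only, is to clamp the rectangle to the image bounds first. `crop` is a hypothetical helper and is not part of the module's utility class.

```java
import java.awt.image.BufferedImage;

class CropDemo {

    // Crops (x, y, width, height) out of img after clamping the region to the image bounds
    static BufferedImage crop(BufferedImage img, int x, int y, int width, int height) {
        int left = Math.max(0, Math.min(x, img.getWidth() - 1));
        int top = Math.max(0, Math.min(y, img.getHeight() - 1));
        int w = Math.max(1, Math.min(width, img.getWidth() - left));
        int h = Math.max(1, Math.min(height, img.getHeight() - top));
        return img.getSubimage(left, top, w, h);
    }

    public static void main(String[] args) {
        // An 800x600 blank image stands in for a full-page screenshot
        BufferedImage page = new BufferedImage(800, 600, BufferedImage.TYPE_INT_RGB);
        // The requested 300x300 region starts 100 pixels from the right and bottom edges
        BufferedImage part = crop(page, 700, 500, 300, 300);
        System.out.println(part.getWidth() + "x" + part.getHeight()); // prints 100x100
    }
}
```

Clamping silently shrinks the region. Whether that or an explicit error is preferable depends on how the screenshot is used.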

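1.6 An Example Using the Selenium Module

Search JD.com for my new book, "RxJava 2.x 實戰", sort the results by sales volume, and then fetch the information of the top ten items.

1.6.1 Create several actions and execute them in order.

Step 1: open the browser and type in the keyword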

package com.cv4j.netdiscovery.example.jd;

import com.cv4j.netdiscovery.selenium.Utils;
import com.cv4j.netdiscovery.selenium.action.SeleniumAction;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

/**
 * Created by tony on 2018/6/12.
 */
public class BrowserAction extends SeleniumAction{

    @Override
    public SeleniumAction perform(WebDriver driver) {

        try {
            String searchText = "RxJava 2.x 實戰";
            String searchInput = "//*[@id=\"keyword\"]";
            WebElement userInput = Utils.getWebElementByXpath(driver, searchInput);
            userInput.sendKeys(searchText);
            Thread.sleep(3000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        return null;
    }
}

Step 2: click the search button to run the search

package com.cv4j.netdiscovery.example.jd;

import com.cv4j.netdiscovery.selenium.Utils;
import com.cv4j.netdiscovery.selenium.action.SeleniumAction;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;

/**
 * Created by tony on 2018/6/12.
 */
public class SearchAction extends SeleniumAction {

    @Override
    public SeleniumAction perform(WebDriver driver) {

        try {
            String searchBtn = "/html/body/div[2]/form/input[4]";
            Utils.clickElement(driver, By.xpath(searchBtn));
            Thread.sleep(3000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        return null;
    }
}

Step 3: click "Sales" on the search results to sort them

package com.cv4j.netdiscovery.example.jd;

import com.cv4j.netdiscovery.selenium.Utils;
import com.cv4j.netdiscovery.selenium.action.SeleniumAction;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;

/**
 * Sorts the results by sales volume.
 * Created by tony on 2018/6/12.
 */
public class SortAction extends SeleniumAction{

    @Override
    public SeleniumAction perform(WebDriver driver) {

        try {
            String saleSortBtn = "//*[@id=\"J_filter\"]/div[1]/div[1]/a[2]";
            Utils.clickElement(driver, By.xpath(saleSortBtn));
            Thread.sleep(3000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        return null;
    }
}
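
All three actions pause with a fixed `Thread.sleep(3000)`, which waits the full three seconds even when the page is ready sooner. A possible alternative, sketched here with JDK classes only, is to poll for a condition and stop as soon as it holds. `waitUntil` and its parameters are hypothetical and are not part of NetDiscovery. Selenium also ships its own `WebDriverWait` for this purpose.

```java
import java.util.function.BooleanSupplier;

class WaitDemo {

    // Polls the condition until it is true or the timeout elapses; returns whether it became true
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in for "the element is present": becomes true after roughly 200 ms
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 200, 3000, 50);
        System.out.println(ready);
    }
}
```

In an action, the condition would be a check against the driver, for example that the element to be clicked can be found.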

1.6.2 Create the parser class PriceParser

After the actions above have run, the returned HTML is parsed, and the parsed product information is passed on to the Pipeline.

package com.cv4j.netdiscovery.example.jd;

import com.cv4j.netdiscovery.core.domain.Page;
import com.cv4j.netdiscovery.core.parser.Parser;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

/**
 * Created by tony on 2018/6/12.
 */
public class PriceParser implements Parser{

    @Override
    public void process(Page page) {

        String pageHtml = page.getHtml().toString();
        Document document = Jsoup.parse(pageHtml);
        Elements elements = document.select("div[id=J_goodsList] li[class=gl-item]");
        page.getResultItems().put("goods_elements",elements);
    }
}

1.6.3 Create the Pipeline class PricePipeline

It prints the information of the ten best-selling items.

package com.cv4j.netdiscovery.example.jd;

import com.cv4j.netdiscovery.core.domain.ResultItems;
import com.cv4j.netdiscovery.core.pipeline.Pipeline;

import lombok.extern.slf4j.Slf4j;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Created by tony on 2018/6/12.
 */
@Slf4j
public class PricePipeline implements Pipeline {

    @Override
    public void process(ResultItems resultItems) {

        Elements elements = resultItems.get("goods_elements");
        if (elements != null) {
            // Print at most ten items, even when fewer than ten were found
            for (int i = 0; i < Math.min(10, elements.size()); i++) {
                Element element = elements.get(i);
                String storeName = element.select("div[class=p-shop] a").first().text();
                String goodsName = element.select("div[class=p-name p-name-type-2] a em").first().text();
                String goodsPrice = element.select("div[class=p-price] i").first().text();
                log.info(storeName + " " + goodsName + " ￥" + goodsPrice);
            }
        }
    }
}

1.6.4 Complete JDSpider

At this point the actions are executed in order, and the downloader is a SeleniumDownloader.

package com.cv4j.netdiscovery.example.jd;

import com.cv4j.netdiscovery.core.Spider;
import com.cv4j.netdiscovery.selenium.Browser;
import com.cv4j.netdiscovery.selenium.action.SeleniumAction;
import com.cv4j.netdiscovery.selenium.downloader.SeleniumDownloader;
import com.cv4j.netdiscovery.selenium.pool.WebDriverPool;
import com.cv4j.netdiscovery.selenium.pool.WebDriverPoolConfig;

import java.util.ArrayList;
import java.util.List;

/**
 * Created by tony on 2018/6/12.
 */
public class JDSpider {

    public static void main(String[] args) {
        
        // Set the browser driver executable and the browser type; the driver must match the operating system.
        WebDriverPoolConfig config = new WebDriverPoolConfig("example/chromedriver",Browser.CHROME);
        WebDriverPool.init(config); // init must be called before WebDriverPool can be used

        List<SeleniumAction> actions = new ArrayList<>();
        actions.add(new BrowserAction());
        actions.add(new SearchAction());
        actions.add(new SortAction());

        SeleniumDownloader seleniumDownloader = new SeleniumDownloader(actions);

        String url = "https://search.jd.com/";

        Spider.create()
                .name("jd")
                .url(url)
                .downloader(seleniumDownloader)
                .parser(new PriceParser())
                .pipeline(new PricePipeline())
                .run();
    }
}

2. DSL Module

This module is written in Kotlin and uses Kotlin's language features to wrap the framework in a DSL.

package com.cv4j.netdiscovery.dsl

import com.cv4j.netdiscovery.core.Spider
import com.cv4j.netdiscovery.core.downloader.Downloader
import com.cv4j.netdiscovery.core.parser.Parser
import com.cv4j.netdiscovery.core.pipeline.Pipeline
import com.cv4j.netdiscovery.core.queue.Queue

/**
 * Created by tony on 2018/5/27.
 */
class SpiderWrapper {

    var name: String? = null

    var parser: Parser? = null

    var queue: Queue? = null

    var downloader: Downloader? = null

    var pipelines:Set<Pipeline>? = null

    var urls:List<String>? = null

}

fun spider(init: SpiderWrapper.() -> Unit):Spider {

    val wrap = SpiderWrapper()

    wrap.init()

    return configSpider(wrap)
}

private fun configSpider(wrap:SpiderWrapper):Spider {

    val spider = Spider.create(wrap.queue)
            .name(wrap.name)

    // Only set the urls when they were provided
    wrap.urls?.let {

        spider.url(it)
    }

    spider.downloader(wrap.downloader)
            .parser(wrap.parser)

    // Register each pipeline; a named parameter avoids the nested "it"
    wrap.pipelines?.forEach { pipeline ->

        spider.pipeline(pipeline)
    }

    return spider
}

For example, use the DSL to create a crawler and run it.

val spider = spider {

    name = "tony"

    urls = listOf("http://www.163.com/","https://www.baidu.com/")

    pipelines = setOf(ConsolePipeline())
}

spider.run()

It is equivalent to the following Java code:

Spider.create().name("tony")
        .url("http://www.163.com/", "https://www.baidu.com/")
        .pipeline(new ConsolePipeline())
        .run();
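
Plain Java has no lambdas with receivers, but the same "configure inside a block" shape can be approximated with a `Consumer`. The sketch below is a rough analogue for comparison only. `Wrapper`, `Crawler` and this `spider` method are hypothetical and do not use the NetDiscovery classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class SpiderDslDemo {

    // Hypothetical mutable holder, playing the role of SpiderWrapper
    static final class Wrapper {
        String name;
        List<String> urls = new ArrayList<>();
    }

    // Hypothetical result object, standing in for a configured Spider
    static final class Crawler {
        final String name;
        final List<String> urls;

        Crawler(String name, List<String> urls) {
            this.name = name;
            this.urls = urls;
        }
    }

    // Same shape as the Kotlin spider { ... } function: create, let the caller fill in, then build
    static Crawler spider(Consumer<Wrapper> init) {
        Wrapper wrap = new Wrapper();
        init.accept(wrap);
        return new Crawler(wrap.name, new ArrayList<>(wrap.urls));
    }

    public static void main(String[] args) {
        Crawler crawler = spider(w -> {
            w.name = "tony";
            w.urls = Arrays.asList("http://www.163.com/", "https://www.baidu.com/");
        });
        System.out.println(crawler.name + " " + crawler.urls.size());
    }
}
```

The explicit `w.` prefix on every property is what Kotlin's receiver type `SpiderWrapper.() -> Unit` removes.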

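A DSL can simplify code, improve development efficiency, and model the problem at a higher level of abstraction. That said, DSLs also have drawbacks: what they can express is limited, and they are not Turing-complete.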