Building a General-Purpose Crawler Framework with Vert.x and RxJava 2

Date: 2019-11-17


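Recently I needed to monitor some data for work. Although there are plenty of excellent crawler frameworks available, I still decided to implement a complete crawler framework from scratch.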

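For the technology stack, I did not build the project on Spring and chose the more lightweight Vert.x instead. Spring felt too heavy for this job, whereas Vert.x is a lightweight, high-performance framework that runs on the JVM. It is event-driven and asynchronous, builds on Netty (a fully asynchronous Java network server), and adds many other features on top.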

GitHub repository: https://github.com/fengzhizi715/NetDiscovery

1. Features of the crawler framework

The crawler framework consists of a crawler engine (SpiderEngine) and crawlers (Spider). A SpiderEngine can manage multiple Spiders.

1.1 Spider

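A Spider is built from several components: downloader, queue, parser, pipeline, and a proxy IP pool (proxypool). The proxy pool is a separate project that I wrote a while ago. Crawling often requires switching proxy IPs, so I brought it into the framework.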

proxypool repository: https://github.com/fengzhizi715/ProxyPool

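The other four components are all interfaces, and the framework ships with some built-in implementations. For example, there are several built-in downloaders, implemented with the Vert.x WebClient, HTTP client, OkHttp3, and Selenium. Developers can pick whichever fits their situation or write an entirely new downloader.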

The download method of Downloader returns a Maybe<Response>.

package com.cv4j.netdiscovery.core.downloader;

import com.cv4j.netdiscovery.core.domain.Request;
import com.cv4j.netdiscovery.core.domain.Response;
import io.reactivex.Maybe;

/**
 * Created by tony on 2017/12/23.
 */
public interface Downloader {

    Maybe<Response> download(Request request);

    void close();
}

In Spider, the Maybe object drives the subsequent chain of calls: the Response is converted into a Page object, the Page is parsed, and once parsing is finished the pipelines are run.

downloader.download(request)
        .observeOn(Schedulers.io())
        // Convert the raw Response into a Page.
        .map(response -> {
            Page page = new Page();
            page.setHtml(new Html(response.getContent()));
            page.setRequest(request);
            page.setUrl(request.getUrl());
            page.setStatusCode(response.getStatusCode());
            return page;
        })
        // Parse the Page if a parser is configured.
        .map(page -> {
            if (parser != null) {
                parser.process(page);
            }
            return page;
        })
        // Hand the parsed result items to each pipeline.
        .map(page -> {
            if (Preconditions.isNotBlank(pipelines)) {
                pipelines.stream()
                        .forEach(pipeline -> pipeline.process(page.getResultItems()));
            }
            return page;
        })
        .subscribe(page -> {
            log.info(page.getUrl());
            // Run the optional per-request callback.
            if (request.getAfterRequest() != null) {
                request.getAfterRequest().process(page);
            }
        }, throwable -> log.error(throwable.getMessage()));

Using RxJava 2 here makes the whole crawler framework feel more reactive :)
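To make the stage order easier to follow without the framework on the classpath, here is a minimal JDK-only sketch of the same download, page, parse, pipeline flow. It is not NetDiscovery's code: it swaps RxJava's Maybe for CompletableFuture, and Response, Page, Parser, and Pipeline below are simplified, hypothetical stand-ins for the framework's types.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-ins; the real framework classes carry more state.
class PipelineSketch {

    static final class Response {
        final String content;
        final int statusCode;

        Response(String content, int statusCode) {
            this.content = content;
            this.statusCode = statusCode;
        }
    }

    static final class Page {
        String url;
        String html;
        int statusCode;
        final Map<String, Object> resultItems = new LinkedHashMap<>();
    }

    interface Parser {
        void process(Page page);
    }

    interface Pipeline {
        void process(Map<String, Object> resultItems);
    }

    // Same stage order as the RxJava chain: Response -> Page -> parse -> pipelines.
    static CompletableFuture<Page> crawl(String url,
                                         CompletableFuture<Response> download,
                                         Parser parser,
                                         List<Pipeline> pipelines) {
        return download
                .thenApply(response -> {
                    Page page = new Page();
                    page.url = url;
                    page.html = response.content;
                    page.statusCode = response.statusCode;
                    return page;
                })
                .thenApply(page -> {
                    if (parser != null) {
                        parser.process(page);
                    }
                    return page;
                })
                .thenApply(page -> {
                    if (pipelines != null) {
                        pipelines.forEach(p -> p.process(page.resultItems));
                    }
                    return page;
                });
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        Parser parser = page -> page.resultItems.put("length", page.html.length());
        Pipeline collect = items -> sink.add(items.toString());

        // A completed future stands in for a real download.
        Page page = crawl(
                "http://example.com",
                CompletableFuture.completedFuture(new Response("<title>Hi</title>", 200)),
                parser,
                Collections.singletonList(collect)).join();

        System.out.println(page.url + " " + page.statusCode + " " + sink);
    }
}
```

Each stage only sees the output of the previous one, which is what lets the parser and the pipelines be swapped independently of the downloader.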

1.2 SpiderEngine

A SpiderEngine can contain multiple Spiders: addSpider() adds an existing Spider to the SpiderEngine, and createSpider() creates a new Spider and adds it to the SpiderEngine.

If the httpd(port) method is called on the SpiderEngine, the individual Spiders in it can also be monitored through the HTTP endpoints below.

1.2.1 Get the status of a single spider

http://localhost:{port}/netdiscovery/spider/{spiderName}

Method: GET

1.2.2 Get the status of all spiders in the SpiderEngine

http://localhost:{port}/netdiscovery/spiders/

Method: GET

1.2.3 Change the status of a spider

http://localhost:{port}/netdiscovery/spider/{spiderName}/status

Method: POST

Parameters:

{
    "status":2   // pause the spider
}

status	Effect
2	Pause the spider
3	Resume the spider from a pause
4	Stop the spider
Example of using the framework

Create a SpiderEngine and then three Spiders. Each spider fetches one page at a fixed interval.

SpiderEngine engine = SpiderEngine.create();

Spider spider = Spider.create()
        .name("tony1")
        .repeatRequest(10000, "http://www.163.com")
        .initialDelay(10000);

engine.addSpider(spider);

Spider spider2 = Spider.create()
        .name("tony2")
        .repeatRequest(10000, "http://www.baidu.com")
        .initialDelay(10000);

engine.addSpider(spider2);

Spider spider3 = Spider.create()
        .name("tony3")
        .repeatRequest(10000, "http://www.126.com")
        .initialDelay(10000);

engine.addSpider(spider3);

// Expose the monitoring endpoints on port 8080, then start all spiders.
engine.httpd(8080);
engine.run();
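The "fetch one page at a fixed interval" behaviour can be pictured with a plain ScheduledExecutorService. This is only an illustration of the scheduling idea, not how NetDiscovery implements repeatRequest or initialDelay. It assumes both values in the example above are milliseconds, which the source does not state explicitly, and the class and method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of periodic fetching; not the framework's scheduler.
class RepeatRequestSketch {

    // Runs `fetch` after initialDelayMs, then every periodMs, and stops after `times` runs.
    static int repeat(Runnable fetch, long initialDelayMs, long periodMs, int times) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch remaining = new CountDownLatch(times);
        AtomicInteger runs = new AtomicInteger();
        try {
            scheduler.scheduleAtFixedRate(() -> {
                if (remaining.getCount() == 0) {
                    return; // already done; ignore any tick that fires before shutdown
                }
                try {
                    fetch.run();
                } catch (RuntimeException e) {
                    // An uncaught exception would cancel all future runs, so log and continue.
                    System.err.println("fetch failed: " + e.getMessage());
                }
                runs.incrementAndGet();
                remaining.countDown();
            }, initialDelayMs, periodMs, TimeUnit.MILLISECONDS);
            remaining.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        // Short delays keep the demo quick; the article's example uses 10000.
        int runs = repeat(() -> System.out.println("fetch http://www.163.com"), 100, 100, 3);
        System.out.println("runs: " + runs);
    }
}
```

A crawler that never stops on its own would drop the run limit and rely on an external signal, such as the stop status described earlier, to shut the scheduler down.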

After the program has been running for a while, open http://localhost:8080/netdiscovery/spiders in a browser.

We can then see the status of the three running spiders.

Formatted, the JSON looks like this:

{
	"code": 200,
	"data": [{
		"downloaderType": "VertxDownloader",
		"leftRequestSize": 0,
		"queueType": "DefaultQueue",
		"spiderName": "tony2",
		"spiderStatus": 1,
		"totalRequestSize": 7
	}, {
		"downloaderType": "VertxDownloader",
		"leftRequestSize": 0,
		"queueType": "DefaultQueue",
		"spiderName": "tony3",
		"spiderStatus": 1,
		"totalRequestSize": 7
	}, {
		"downloaderType": "VertxDownloader",
		"leftRequestSize": 0,
		"queueType": "DefaultQueue",
		"spiderName": "tony1",
		"spiderStatus": 1,
		"totalRequestSize": 7
	}],
	"message": "success"
}