Cendertron: Distributed Deployment and Stability Optimization of a Security Crawler

Date: 2019-11-06


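Cendertron is a Puppeteer-based dynamic crawler for Web 2.0 applications and a sensitive-information leak detection tool. It supplies the URL targets for the subsequent basic scans and POC scans in Chaos-Scanner. An earlier article introduced the basic usage of Cendertron; here we briefly describe the crawler parameter design and the cluster architecture used in real scanning scenarios. It has to be said that even an elegant design needs to be tested against large amounts of real data and accumulated experience; compared with the previous version of Cendertron, most of the changes come from adapting the details.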

Elastic Cluster Deployment Based on Docker Swarm

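In the Docker in Practice series, we covered the concepts and configuration of Docker and Docker Swarm in detail. Here we again use the routing mesh provided by Docker Swarm to expose multiple nodes on the same port. This requires that part of each crawler node's state be stored centrally; we use Redis as the central store.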

In fact, both the POC nodes and the crawler nodes in Chaos Scanner follow this scheduling approach, although the POC scanning nodes rely mainly on RabbitMQ for task distribution.

[Figure: overall logic flow of the crawler during scan scheduling]

We can write a Compose file, docker-compose.yml, on top of the base image:

version: '3'
services:
  crawlers:
    image: cendertron
    ports:
      - '${CENDERTRON_PORT}:3000'
    deploy:
      replicas: 2
    volumes:
      - wsat_etc:/etc/wsat

volumes:
  wsat_etc:
    driver: local
    driver_opts:
      o: bind
      type: none
      device: /etc/wsat/

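Here we mount the Redis configuration into the container as a volume. In Chaos Scanner, the unified registry shared by the different machines is reduced to this single configuration file: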

{
  "db": {
    "redis": {
      "host": "x.x.x.x",
      "port": 6379,
      "password": "xx-xx-xx-xx"
    }
  }
}

Once Redis is configured, we can create the service with the following commands:

# Create the service
> docker stack deploy wsat --compose-file docker-compose.yml --resolve-image=changed

# Set the number of instances
> docker service scale wsat_crawlers=5

We also provide a way to start scans of several targets in one request; the URLs are separated by |:

POST /scrape

{
  "urls": "http://baidu.com|http://google.com"
}
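
To make the separator convention concrete, here is a minimal sketch of how the `urls` field might be split into individual targets. The helper name `splitTargets`, the whitespace trimming, the http/https check, and the de-duplication are all assumptions for illustration; they are not taken from the Cendertron source.

```typescript
// Hypothetical helper (not from the Cendertron source): split the "urls"
// field of a /scrape request body into individual crawl targets.
function splitTargets(urls: string): string[] {
  const targets: string[] = [];
  for (const raw of urls.split('|')) {
    const candidate = raw.trim();
    // Assumption: only absolute http(s) URLs are accepted as targets
    const isHttp = /^https?:\/\//i.test(candidate);
    // Assumption: a target listed twice is crawled only once
    if (isHttp && targets.indexOf(candidate) === -1) {
      targets.push(candidate);
    }
  }
  return targets;
}
```

One consequence of this convention is that a target URL containing a literal | would be split in two, so such a character would have to be percent-encoded by the caller.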

Once the cluster is running, the ctop command shows the status of the containers started on a single machine.

The htop command shows that CPU utilization across the whole system is very high.

Design for Failure and Monitoring First

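In the Testing and High-Availability Assurance series, we specifically discussed the design-for-failure principles of high-availability architecture.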

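One of the most important of these principles is monitoring coverage: at design time we assume that the production system will fail, and we add measures to the control system so that when a particular failure does occur it can be remedied promptly. In a business scenario as varied as crawling, we need even more to be able to inspect the current state of the system at any time, so that we can spot where the current strategies and parameters are inappropriate.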

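In cluster mode, each crawler's status information is stored in Redis, and every crawler reports periodically. The reported information expires automatically, so if a node's status information is missing when we inspect the current system state, that crawler has hung within the current time window.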
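
The report-and-expire pattern can be sketched as follows. To keep the example self-contained, a small in-memory `ExpiringStore` stands in for Redis (where a key written with an expiry behaves the same way); the class, the `scheduler:` key prefix, the 30-second TTL, and both function names are assumptions for illustration, not Cendertron's actual implementation.

```typescript
// Hypothetical in-memory stand-in for a Redis key with an expiry:
// an entry can no longer be read once its TTL has elapsed.
class ExpiringStore {
  private entries: { [key: string]: { value: string; expiresAt: number } } = {};

  set(key: string, value: string, ttlMs: number, now: number): void {
    this.entries[key] = { value, expiresAt: now + ttlMs };
  }

  get(key: string, now: number): string | undefined {
    const entry = this.entries[key];
    if (entry === undefined || entry.expiresAt <= now) {
      delete this.entries[key];
      return undefined;
    }
    return entry.value;
  }
}

// Assumed value: must be longer than the interval between two reports
const HEARTBEAT_TTL_MS = 30 * 1000;

// Called periodically by each scheduler; every report renews the TTL.
function reportStatus(
  store: ExpiringStore,
  schedulerId: string,
  status: object,
  now: number
): void {
  store.set(`scheduler:${schedulerId}`, JSON.stringify(status), HEARTBEAT_TTL_MS, now);
}

// A scheduler whose key has expired has not reported within the window.
function findSilentSchedulers(
  store: ExpiringStore,
  knownIds: string[],
  now: number
): string[] {
  return knownIds.filter(id => store.get(`scheduler:${id}`, now) === undefined);
}
```

The appeal of this design is that no separate failure detector is needed: a hung crawler simply stops renewing its key, and its absence is the signal.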

As before, we use the GET /_ah/health endpoint to view the state of the whole system, as shown below:

{
  "success": true,
  "mode": "cluster",
  "schedulers": [
    {
      "id": "a8621dc0-afb3-11e9-94e5-710fb88b1291",
      "browserStatus": [
        {
          "targetsCnt": 4,
          "useCount": 153,
          "urls": [
            {
              "url": ""
            },
            {
              "url": "about:blank"
            },
            {
              "url": ""
            },
            {
              "url": "http://180.100.134.161:8091/xygjitv-web/#/enter_index_db/film"
            }
          ]
        }
      ],
      "runingCrawlers": [
        {
          "id": "dabd6260-b216-11e9-94e5-710fb88b1291",
          "entryPage": "http://180.100.134.161:8091/xygjitv-web/",
          "progress": "0.44",
          "startedAt": 1564414684039,
          "option": {
            "depth": 4,
            "maxPageCount": 500,
            "timeout": 1200000,
            "navigationTimeout": 30000,
            "pageTimeout": 60000,
            "isSameOrigin": true,
            "isIgnoreAssets": true,
            "isMobile": false,
            "ignoredRegex": ".*logout.*",
            "useCache": true,
            "useWeakfile": false,
            "useClickMonkey": false,
            "cookies": [
              {
                "name": "PHPSESSID",
                "value": "fbk4vjki3qldv1os2v9m8d2nc4",
                "domain": "180.100.134.161:8091"
              },
              {
                "name": "security",
                "value": "low",
                "domain": "180.100.134.161:8091"
              }
            ]
          },
          "spiders": [
            {
              "url": "http://180.100.134.161:8091/xygjitv-web/",
              "type": "page",
              "option": {
                "allowRedirect": false,
                "depth": 1
              },
              "isClosed": true,
              "currentStep": "Finished"
            }
          ]
        }
      ],
      "localRunningCrawlerCount": 1,
      "localFinishedCrawlerCount": 96,
      "reportTime": "2019-7-29 23:38:34"
    }
  ],
  "cache": ["Crawler#http://baidu.com"],
  "pageQueueLen": 31
}
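
As one way of consuming this response, the sketch below condenses it into a per-crawler summary and flags crawlers that have run longer than their own `timeout`. The interfaces cover only the fields used here (including `runingCrawlers`, spelled as in the response); the function name `summarizeCrawlers` and the "overdue" rule are assumptions for illustration.

```typescript
// Only the fields of the health response that this sketch reads
interface RunningCrawler {
  id: string;
  progress: string; // reported as a string, e.g. "0.44"
  startedAt: number; // epoch milliseconds
  option: { timeout: number };
}

interface SchedulerStatus {
  id: string;
  runingCrawlers: RunningCrawler[]; // spelled as in the response
}

interface HealthResponse {
  success: boolean;
  schedulers: SchedulerStatus[];
}

interface CrawlerSummary {
  schedulerId: string;
  crawlerId: string;
  progress: number;
  overdue: boolean;
}

// Hypothetical helper: one summary row per running crawler.
function summarizeCrawlers(health: HealthResponse, now: number): CrawlerSummary[] {
  const summaries: CrawlerSummary[] = [];
  for (const scheduler of health.schedulers) {
    for (const crawler of scheduler.runingCrawlers) {
      summaries.push({
        schedulerId: scheduler.id,
        crawlerId: crawler.id,
        progress: parseFloat(crawler.progress),
        // Assumed rule: a crawler running past its own timeout is suspicious
        overdue: now - crawler.startedAt > crawler.option.timeout
      });
    }
  }
  return summaries;
}
```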

Parameter Tuning

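Because of network jitter and many other factors, Cendertron can hardly guarantee absolute stability and consistency; in practice it is mostly a matter of trade-offs between efficiency and performance. To close, here are the parameters currently built into Cendertron; src/config.ts contains all of the configuration: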

export interface ScheduleOption {
  // Number of concurrent crawlers
  maxConcurrentCrawler: number;
}

export const defaultScheduleOption: ScheduleOption = {
  maxConcurrentCrawler: 1
};

export const defaultCrawlerOption: CrawlerOption = {
  // Crawl depth
  depth: 4,

  // Maximum number of pages a single crawler will fetch
  maxPageCount: 500,
  // Overall timeout defaults to 20 minutes
  timeout: 20 * 60 * 1000,
  // Navigation timeout is 30 s
  navigationTimeout: 30 * 1000,
  // Per-page timeout is 60 s
  pageTimeout: 60 * 1000,

  isSameOrigin: true,
  isIgnoreAssets: true,
  isMobile: false,
  ignoredRegex: '.*logout.*',

  // Whether to use the cache
  useCache: true,
  // Whether to scan for sensitive files
  useWeakfile: false,
  // Whether to simulate user interaction (click monkey)
  useClickMonkey: false
};

export const defaultPuppeteerPoolConfig = {
  max: 1, // default
  min: 1, // default
  // how long a resource can stay idle in pool before being removed
  idleTimeoutMillis: Number.MAX_VALUE, // default.
  // how long an acquire call waits for a resource before timing out; here twice the page timeout
  acquireTimeoutMillis: defaultCrawlerOption.pageTimeout * 2,
  // maximum number of times an individual resource can be reused before being destroyed; set to 0 to disable
  maxUses: 0, // default
  // function to validate an instance prior to use; see https://github.com/coopernurse/node-pool#createpool
  validator: () => Promise.resolve(true), // defaults to always resolving true
  // validate resource before borrowing; required for `maxUses` and `validator`
  testOnBorrow: true // default
  // For all opts, see opts at https://github.com/coopernurse/node-pool#createpool
};
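
The `option` object reported for each running crawler in the health output matches these defaults, which suggests that per-request options are layered on top of them. The sketch below shows one plausible way to do that, together with the pool's acquire timeout derived from the page timeout. Only a subset of the fields is redeclared so the block stands alone; `resolveCrawlerOption` and `acquireTimeoutFor` are hypothetical names, not functions from the Cendertron source.

```typescript
// Subset of the crawler options listed above; field names match src/config.ts
interface CrawlerOption {
  depth: number;
  maxPageCount: number;
  timeout: number;
  navigationTimeout: number;
  pageTimeout: number;
}

const defaultCrawlerOption: CrawlerOption = {
  depth: 4,
  maxPageCount: 500,
  timeout: 20 * 60 * 1000,
  navigationTimeout: 30 * 1000,
  pageTimeout: 60 * 1000
};

// Hypothetical helper: fields present in `overrides` win,
// everything else falls back to the default.
function resolveCrawlerOption(overrides: Partial<CrawlerOption>): CrawlerOption {
  return { ...defaultCrawlerOption, ...overrides };
}

// Hypothetical helper mirroring acquireTimeoutMillis = pageTimeout * 2
function acquireTimeoutFor(option: CrawlerOption): number {
  return option.pageTimeout * 2;
}
```

With the defaults above, a page may hold a pooled browser for up to 60 s, so a waiting acquire call is given 120 s before it times out.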