[System Integration] Deploying mesos-exporter and Prometheus to monitor Mesos tasks

A few days ago I deployed InfluxDB and Grafana on top of cAdvisor on our Mesos platform to monitor runtime information for Mesos and the Docker apps. I found that this monitoring stack is not a good fit for a Mesos + Docker architecture, for the following reasons:

1) The Mesos task ID and the Docker container name do not match

cAdvisor is designed around a single Docker host and does not take a Mesos data center into account;

cAdvisor labels the data it scrapes with the Docker container name (what you see in docker ps), while Mesos identifies running tasks by task ID (what you see in the Mesos UI or its metrics). A Mesos task may be a Docker container or a non-containerized process, and Mesos task IDs and Docker container names follow completely different naming schemes.

As a result, once cAdvisor has scraped the data, it is hard for users to tell which Mesos task it belongs to.

2) cAdvisor and Grafana do not support alerting

 

After some research, I found that mesos-exporter + Prometheus + Alertmanager is a good combination that solves the problems above:

mesos-exporter is a tool developed by Mesosphere that exports monitoring data of the Mesos cluster, including its tasks, and exposes it to Prometheus; Prometheus is a monitoring tool that combines a time-series database, graphing, and statistics in one; Alertmanager is Prometheus's alerting component.

Setup steps:

1. Build mesos-exporter

git clone https://github.com/mesosphere/mesos_exporter.git
cd mesos_exporter
docker build -f Dockerfile -t mesosphere/mesos-exporter .
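
The Marathon definition below pulls the image from the private registry at 172.31.17.36:5000, so the freshly built image needs to be re-tagged and pushed there first. A minimal sketch, assuming that registry accepts pushes from the build host:

# re-tag the locally built image for the private registry used in the app definitions below
docker tag mesosphere/mesos-exporter 172.31.17.36:5000/mesos-exporter:latest
docker push 172.31.17.36:5000/mesos-exporter:latest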

2. Pull the prometheus and alertmanager images
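
A sketch of this step, assuming the official prom/prometheus and prom/alertmanager images from Docker Hub are used and then pushed to the same private registry referenced by the Marathon definitions below:

# pull the upstream images (pin specific tags in production instead of latest)
docker pull prom/prometheus
docker pull prom/alertmanager
# re-tag and push to the private registry used by the app definitions below
docker tag prom/prometheus 172.31.17.36:5000/prometheus:latest
docker tag prom/alertmanager 172.31.17.36:5000/alertmanager:latest
docker push 172.31.17.36:5000/prometheus:latest
docker push 172.31.17.36:5000/alertmanager:latest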

3. Deploy mesos-exporter, alertmanager, and prometheus

mesos-exporter:

{
  "id": "mesos-exporter-slave",
  "instances": 6,
  "cpus": 0.2,
  "mem": 128,
  "args": [
      "-slave=http://127.0.0.1:5051",
      "-timeout=5s"
  ],
  "constraints": [
      ["hostname","UNIQUE"],
      ["hostname", "LIKE", "slave[1-6]"]
  ],
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "172.31.17.36:5000/mesos-exporter:latest",
      "network": "HOST"
    },
    "volumes": [
      {
        "containerPath": "/etc/localtime",
        "hostPath": "/etc/localtime",
        "mode": "RO"
      }
    ]
  }
}

Please open port 9110/tcp in the firewall on each slave.
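
For example, on a slave that uses firewalld (adjust to whatever firewall tooling your distribution uses), followed by a quick check that the exporter answers:

# open the mesos-exporter port
firewall-cmd --permanent --add-port=9110/tcp
firewall-cmd --reload
# verify the exporter is reachable (address taken from the Prometheus targets below)
curl -s http://172.31.17.31:9110/metrics | head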

 

alert-manager:

{
  "id": "alertmanager",
  "instances": 1,
  "cpus": 0.5,
  "mem": 128,
  "constraints": [
      ["hostname","UNIQUE"],
      ["hostname", "LIKE", "slave[1-6]"]
  ],
  "labels": {
    "HAPROXY_GROUP":"external",
    "HAPROXY_0_VHOST":"alertmanager.test.com"
  },
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "172.31.17.36:5000/alertmanager:latest",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 9093, "hostPort": 0, "servicePort": 0, "protocol": "tcp" }
      ]
    },
    "volumes": [
      {
        "containerPath": "/etc/localtime",
        "hostPath": "/etc/localtime",
        "mode": "RO"
      },
      {
        "containerPath": "/etc/alertmanager/config.yml",
        "hostPath": "/var/nfsshare/alertmanager/config.yml",
        "mode": "RO"
      },
      {
        "containerPath": "/alertmanager",
        "hostPath": "/var/nfsshare/alertmanager/data",
        "mode": "RW"
      }
    ]
  }
}

 

prometheus:

{
  "id": "prometheus",
  "instances": 1,
  "cpus": 0.5,
  "mem": 128,
  "args": [
      "-config.file=/etc/prometheus/prometheus.yml", 
      "-storage.local.path=/prometheus",
      "-web.console.libraries=/etc/prometheus/console_libraries",
      "-web.console.templates=/etc/prometheus/consoles",
      "-alertmanager.url=http://alertmanager.test.com"
  ],
  "constraints": [
      ["hostname","UNIQUE"],
      ["hostname", "LIKE", "slave[1-6]"]
  ],
  "labels": {
    "HAPROXY_GROUP":"external",
    "HAPROXY_0_VHOST":"prometheus.test.com"
  },
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "172.31.17.36:5000/prometheus:latest",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 9090, "hostPort": 0, "servicePort": 0, "protocol": "tcp" }
      ]
    },
    "volumes": [
      {
        "containerPath": "/etc/localtime",
        "hostPath": "/etc/localtime",
        "mode": "RO"
      },
      {
        "containerPath": "/etc/prometheus",
        "hostPath": "/var/nfsshare/prometheus/conf",
        "mode": "RO"
      },
      {
        "containerPath": "/prometheus",
        "hostPath": "/var/nfsshare/prometheus/data",
        "mode": "RW"
      }
    ]
  }
}
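
Each of the three definitions above can then be submitted to Marathon's /v2/apps endpoint. A sketch, assuming Marathon listens on master1:8080 and the definition was saved as prometheus.json (both are assumptions):

# submit an app definition to Marathon (repeat for each of the three JSON files)
curl -X POST -H "Content-Type: application/json" \
     -d @prometheus.json http://master1:8080/v2/apps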

 

4. Prometheus configuration

prometheus.yml

# my global config
global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # Evaluate rules every 15 seconds.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
      monitor: 'codelab-monitor'

# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  # - "first.rules"
  # - "second.rules"

scrape_configs:
  - job_name: 'mesos-slaves'
    scrape_interval: 5s
    metrics_path: '/metrics'
    scheme: 'http'
    target_groups:
      - targets: ['172.31.17.31:9110', '172.31.17.32:9110', '172.31.17.33:9110', '172.31.17.34:9110', '172.31.17.35:9110', '172.31.17.36:9110']
        labels:
          group: 'office'
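
After Prometheus reloads this configuration, a quick way to confirm that all six exporters are being scraped is to query the built-in up metric; a sketch, assuming a Prometheus version that exposes the v1 HTTP query API and the prometheus.test.com vhost configured above:

# every mesos-exporter target should report up == 1
curl -sG 'http://prometheus.test.com/api/v1/query' --data-urlencode 'query=up{job="mesos-slaves"}'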

  

To be added ...

 

5. Alerting setup

To be added ...

 

6. Integration with Grafana

Prometheus's built-in graphing is fairly limited, so it can be integrated with Grafana and let Grafana take over the graphing.

 

Data source settings: in Grafana, add a data source of type Prometheus and point it at the Prometheus address (http://prometheus.test.com as configured above).
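
Besides clicking through the web UI, the data source can also be created with Grafana's HTTP API; a minimal sketch, assuming Grafana is reachable at grafana.test.com with the default admin account (both are assumptions):

# register Prometheus as a Grafana data source (host and credentials are assumptions)
curl -X POST -u admin:admin -H "Content-Type: application/json" \
     -d '{"name":"prometheus","type":"prometheus","url":"http://prometheus.test.com","access":"proxy","isDefault":true}' \
     http://grafana.test.com/api/datasources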

 

7. Appendix: Mesos metrics and statistics endpoints

http://master1:5050/metrics/snapshot

http://slave4:5051/metrics/snapshot

http://master1:5050/master/state.json

http://slave4:5051/monitor/statistics.json

Based on the data from these endpoints, users can write their own monitoring programs.
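
For example, the per-task resource statistics and the master's counters can be fetched and pretty-printed like this:

# per-task resource usage reported by a slave
curl -s http://slave4:5051/monitor/statistics.json | python -m json.tool
# cluster-level counters from the master
curl -s http://master1:5050/metrics/snapshot | python -m json.tool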
