Prometheus + Grafana: A Complete Monitoring Stack

Date: 2021-05-17


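I have been evaluating monitoring tools. We had always used Zabbix, which has been very stable (it never caused a problem). It covers system-level metrics such as CPU, memory and disk, as well as service health, Elasticsearch, Hive, Kafka lag and so on. A few problems, however, could not be solved with it: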

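1. Zabbix cannot monitor Flink in any practical way. It can be done through API calls, but Flink exposes several hundred metrics that would each have to be added by hand, and Zabbix fetches items one at a time, which is extremely tedious and inefficient.
2. JVM-level runtime state for Elasticsearch, Spring, Kafka and similar services is hard to obtain.
3. For Kafka, we have a large number of topics, each consumed by many group IDs. Collecting everything again means Zabbix fetching items one by one. (I later found that templates can handle this too, but every new topic or group still has to be added manually.)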

For a detailed comparison, see this article, which covers it thoroughly: http://dockone.io/article/10437

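Our setup is pure Java plus big data, and the machines usually sit in customer environments with no Internet access and can lose power at any time. For that scenario, I summarized the pros and cons as follows: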

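Tool	Advantages	Disadvantages
Zabbix	Stable; few processes (one server plus agents is enough); complete documentation; everything can be configured in the web UI; supports triggered actions (this is very important)	Poor support for dynamic workloads such as Docker, Flink and Hive (custom scripts can work around this)
Prometheus	Simple to deploy; many community plugins (all kinds of exporters); many projects support it officially	Every additional monitored component needs its own exporter process; alerts and actions cannot be configured in the UI (the built-in UI is very basic)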

Prometheus architecture diagram (image not included).
Its components are:

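Prometheus Server: collects and stores data, answers PromQL queries, and holds the alerting configuration.
Pushgateway: an intermediate collection point to which batch and short-lived jobs report their metrics.
Exporters: processes that report metrics, for example node_exporter for machine metrics and MongoDB_exporter for MongoDB.
Alertmanager: handles advanced notification management.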
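To make the exporter role concrete, here is a minimal sketch of what an exporter does: it serves current values as plain text at /metrics and waits for Prometheus to pull them. It uses only the Python standard library; the metric name demo_queue_depth, its value and port 8000 are made up for illustration, and real exporters such as node_exporter are far more complete.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render {name: (help_text, value)} in the Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in sorted(samples.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    # Hypothetical sample data; a real exporter reads this from the monitored system.
    samples = {"demo_queue_depth": ("Items waiting in a demo queue.", 3)}

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.samples).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics until interrupted; the port is an arbitrary choice."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() and adding 127.0.0.1:8000 as a scrape target would be enough for Prometheus to start collecting demo_queue_depth.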

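The main data collection flow is as follows:

1. The Prometheus server periodically pulls data from statically configured hosts or from targets found through service discovery (ZooKeeper, Consul, DNS SRV lookup and so on).

2. When newly pulled data exceeds the configured in-memory buffer, Prometheus persists it to disk; it can also be persisted remotely, for example to the cloud.

3. Prometheus presents the data through PromQL, the API, Console, and other visualization components such as Grafana and Promdash.

4. Prometheus can be configured with rules that it evaluates on a schedule; when a condition is met, it pushes an alert to the configured Alertmanager.

5. When Alertmanager receives alerts, it groups, deduplicates and de-noises them according to its configuration, and finally sends out the notifications.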
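Step 5 is easier to picture with a toy example. The sketch below is not Alertmanager's implementation; it only illustrates the idea of dropping identical alerts and then bucketing the rest by a chosen label so that one notification can cover a whole group. The alert names and addresses are invented.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Drop exact duplicate label sets, then bucket alerts by the group_by labels."""
    seen = set()
    groups = defaultdict(list)
    for labels in alerts:
        fingerprint = tuple(sorted(labels.items()))
        if fingerprint in seen:
            continue  # same alert already recorded
        seen.add(fingerprint)
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical incoming alerts, each represented by its label set.
alerts = [
    {"alertname": "DiskFull", "instance": "10.0.0.1:9100"},
    {"alertname": "DiskFull", "instance": "10.0.0.1:9100"},  # duplicate
    {"alertname": "DiskFull", "instance": "10.0.0.2:9100"},
    {"alertname": "KafkaLag", "instance": "10.0.0.3:9308"},
]
grouped = group_alerts(alerts, ["alertname"])
for key, members in grouped.items():
    print(key, len(members))
```

Four incoming alerts collapse into two groups here, so two notifications would go out instead of four.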

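Installing Prometheus

Prometheus is written in Go. The prebuilt release used below runs without a Go toolchain, but Go is needed later to build kafka_exporter, so install it first.

Install Go:

yum install epel-release

yum install golang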

Install Prometheus

Download: https://github.com/prometheus/prometheus/releases/tag/v2.26.0

tar zxvf prometheus-2.26.0.linux-amd64.tar.gz && cd prometheus-2.26.0.linux-amd64

./prometheus --config.file=prometheus.yml
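Once Prometheus is running, a quick way to confirm that it answers is to query the built-in `up` metric through the HTTP API. The sketch below assumes the default listen address 127.0.0.1:9090; the helper names are my own, and the network call is wrapped in a function so nothing is fetched until it is called.

```python
import json
import urllib.parse
import urllib.request


def query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_up(response_text):
    """Map each scraped instance to True/False from an `up` query response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    return {
        item["metric"].get("instance", ""): item["value"][1] == "1"
        for item in payload["data"]["result"]
    }


def fetch_up(base_url="http://127.0.0.1:9090"):
    """Ask a running Prometheus which targets are up (assumed default address)."""
    with urllib.request.urlopen(query_url(base_url, "up"), timeout=5) as resp:
        return parse_up(resp.read().decode("utf-8"))
```

With the stock prometheus.yml, fetch_up() should report the Prometheus server itself (localhost:9090) as its only target.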

Install Grafana

vim /etc/yum.repos.d/grafana.repo

[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

Then install: yum install grafana -y

systemctl start grafana-server

systemctl enable grafana-server

The Grafana page should now be reachable at http://ip:3000 (username/password: admin/admin).

Then add Prometheus as a data source.
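The data source can be added in the web UI; for machines that are set up repeatedly it can also be scripted against Grafana's HTTP API (POST /api/datasources). The following is a rough sketch under a few assumptions: Grafana on 127.0.0.1:3000 with the default admin/admin login, Prometheus on 127.0.0.1:9090, and helper names that are my own.

```python
import base64
import json
import urllib.request


def datasource_payload(prometheus_url, name="Prometheus"):
    """JSON body describing a Prometheus data source that Grafana proxies to."""
    return {
        "name": name,
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",
        "isDefault": True,
    }


def build_request(grafana_url, user, password, payload):
    """Prepare (but do not send) the POST request with basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url.rstrip('/')}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
        method="POST",
    )


# Assumed addresses and credentials; adjust to the actual environment.
request = build_request(
    "http://127.0.0.1:3000", "admin", "admin",
    datasource_payload("http://127.0.0.1:9090"),
)
```

Sending it is one more call, urllib.request.urlopen(request), once Grafana is up.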

Install node_exporter

Download: https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz
tar zxvf node_exporter-1.1.2.linux-amd64.tar.gz && cd node_exporter-1.1.2.linux-amd64
./node_exporter   # start node_exporter; the default port is 9100
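What node_exporter serves on port 9100 is plain text, one sample per line, which is why so many tools can produce and consume it. As an illustration only, here is a deliberately small parser for that format; it assumes samples carry no trailing timestamp and ignores escaping rules, and the sample text is hand-written rather than captured from a real host.

```python
def parse_samples(text):
    """Return {series: float_value} for each sample line; comments and blanks are skipped."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field; label values may contain spaces.
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


# Hand-written sample in the exposition format (values are invented).
sample = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_filesystem_avail_bytes{device="/dev/sda1",mountpoint="/"} 1.5e+10
"""
parsed = parse_samples(sample)
print(parsed["node_load1"])
```

The same function can be pointed at the body returned by http://127.0.0.1:9100/metrics to check that the exporter is producing data before wiring it into Prometheus.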

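Add node_exporter scraping to the Prometheus configuration

Stop the Prometheus process first, then edit prometheus.yml and append the following at the end, under scrape_configs:

  - job_name: 'node_exporter_local'
    static_configs:
      - targets: ['127.0.0.1:9100']

After saving, start Prometheus again: ./prometheus --config.file=prometheus.yml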
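Indentation is the easiest thing to get wrong when pasting scrape jobs, and every exporter added in this post needs one. As a small convenience, the sketch below renders a static job with the indentation expected under scrape_configs; the function name is my own and it only covers the static_configs case used in this post.

```python
def scrape_job(job_name, targets):
    """Render one static scrape job, indented to sit under `scrape_configs:`."""
    target_list = ", ".join(f"'{t}'" for t in targets)
    return (
        f"  - job_name: '{job_name}'\n"
        f"    static_configs:\n"
        f"      - targets: [{target_list}]\n"
    )


print(scrape_job("node_exporter_local", ["127.0.0.1:9100"]), end="")
```

The promtool binary that ships in the Prometheus tarball can validate the result with `./promtool check config prometheus.yml`. Prometheus also reloads its configuration on SIGHUP, which avoids a full restart.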

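Import a system monitoring dashboard into Grafana

On the Grafana website, find a recent node exporter dashboard and open it. The page shows the dashboard ID under "Get this dashboard". Import it by that ID, choose Prometheus as the data source, and the panels should appear.

https://grafana.com/grafana/dashboards/8919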

Add Elasticsearch monitoring

[https://github.com/justwatchcom/elasticsearch_exporter](https://github.com/justwatchcom/elasticsearch_exporter)

tar zxvf elasticsearch_exporter-1.1.0.linux-amd64.tar.gz
cd elasticsearch_exporter-1.1.0.linux-amd64
# Start es_exporter; the default listening port is 9114
./elasticsearch_exporter --es.uri="https://username:password@10.0.81.101:9200" --es.ssl-skip-verify

Then edit prometheus.yml and add the following under scrape_configs:
  - job_name: 'es_monitoring'
    static_configs:
      - targets: ['127.0.0.1:9114']

Restart Prometheus: ./prometheus --config.file=prometheus.yml

Import the Elasticsearch dashboard template into Grafana: https://github.com/justwatchcom/elasticsearch_exporter/blob/master/examples/grafana/dashboard.json

When importing, choose the JSON option.

Add Kafka monitoring

Download: [https://github.com/danielqsj/kafka_exporter/releases](https://github.com/danielqsj/kafka_exporter/releases)

tar zxvf kafka_exporter-1.3.0.tar.gz

cd kafka_exporter-1.3.0

# The author's latest release does not include the compiled binary, so build it first

go build # produces an executable named kafka_exporter

./kafka_exporter --tls.enabled --tls.ca-file="/etc/ca-cert" --tls.cert-file="/etc/kafka.client.pem" --tls.key-file="/etc/kafka.client.key" --tls.insecure-skip-tls-verify --kafka.server=10.0.81.22:9091 --log.level="debug"

Edit prometheus.yml and add the following under scrape_configs, then restart as before: ./prometheus --config.file=prometheus.yml

  - job_name: 'kafka_monitoring'
    static_configs:
      - targets: ['127.0.0.1:9308']

Import the Kafka dashboard into Grafana: ID 7589
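The number we care about most here is consumer lag across every topic and group ID, which was the painful part with Zabbix. Conceptually, lag is the partition's latest offset minus the group's committed offset, summed over partitions; kafka_exporter reports it per partition (as kafka_consumergroup_lag in the versions I am aware of), so nothing has to be registered per topic or group. The sketch below shows only the arithmetic, with invented topic names, group names and offsets.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Sum per-partition lag (end offset minus committed offset) for each (group, topic)."""
    lag = {}
    for (group, topic, partition), committed in committed_offsets.items():
        end = end_offsets[(topic, partition)]
        key = (group, topic)
        # Clamp at zero in case the two offsets were sampled at slightly different times.
        lag[key] = lag.get(key, 0) + max(end - committed, 0)
    return lag


# Hypothetical offsets: topic "orders" with two partitions and two consumer groups.
end_offsets = {("orders", 0): 120, ("orders", 1): 80}
committed_offsets = {
    ("billing", "orders", 0): 100,
    ("billing", "orders", 1): 80,
    ("audit", "orders", 0): 120,
    ("audit", "orders", 1): 50,
}
lag = consumer_lag(end_offsets, committed_offsets)
print(lag)
```

In Grafana the same aggregation would normally be expressed in PromQL, for example by summing the lag metric by consumer group and topic, rather than computed by hand.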

Monitoring Flink

I am using Flink 1.12.0, installed under /opt/.

The Flink distribution ships with the Prometheus reporter jar. First copy it into Flink's lib directory:

cp /opt/flink-1.12.0/plugins/metrics-prometheus/flink-metrics-prometheus-1.12.0.jar /opt/flink-1.12.0/lib/

Then add the metrics reporter settings to /opt/flink-1.12.0/conf/flink-conf.yaml:

metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: 127.0.0.1
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: myJob
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: false

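After the change, restart Flink and make sure all jobs are running normally.

Edit prometheus.yml and add the following under scrape_configs, then restart as before: ./prometheus --config.file=prometheus.yml

  - job_name: 'flink_monitoring'
    static_configs:
      - targets: ['127.0.0.1:9091']

Import a Grafana dashboard for Flink: https://grafana.com/grafana/dashboards?search=flink

Install Pushgateway, which receives the metrics Flink pushes to port 9091 and therefore must be running for the configuration above to produce data: https://github.com/prometheus/pushgateway/releases/tag/v1.4.0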

wget https://github.com/prometheus/pushgateway/releases/download/v1.4.0/pushgateway-1.4.0.linux-amd64.tar.gz

tar zxvf pushgateway-1.4.0.linux-amd64.tar.gz
cd pushgateway-1.4.0.linux-amd64
./pushgateway
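To check that Pushgateway is working before involving Flink, a test metric can be pushed by hand. Pushgateway accepts the text exposition format at /metrics/job/<job>, optionally followed by label/value pairs. The sketch below assumes Pushgateway on 127.0.0.1:9091; the job name demo_job, the label and the metric are placeholders, and the helper names are my own.

```python
import urllib.parse
import urllib.request


def push_url(gateway, job, grouping=None):
    """Build the Pushgateway URL /metrics/job/<job>[/<label>/<value>...]."""
    parts = ["metrics", "job", urllib.parse.quote(job, safe="")]
    for name, value in (grouping or {}).items():
        parts += [urllib.parse.quote(name, safe=""), urllib.parse.quote(value, safe="")]
    return f"{gateway.rstrip('/')}/" + "/".join(parts)


def push_body(metrics):
    """Render {name: value} as untyped samples; each line ends with a newline."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())


def push(gateway, job, metrics, grouping=None):
    """Send the metrics with PUT, replacing earlier metrics of the same group."""
    req = urllib.request.Request(
        push_url(gateway, job, grouping),
        data=push_body(metrics).encode("utf-8"),
        method="PUT",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

For example, push("http://127.0.0.1:9091", "demo_job", {"demo_value": 1}) should make demo_value visible on the Pushgateway web page and, after the next scrape, in Prometheus.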