Original article address
The previous article described how to monitor the JVM with Prometheus + Grafana. This article shows how to use Prometheus + Alertmanager to raise alerts on certain JVM conditions.
The scripts mentioned in this article can be downloaded here.
Tools used: Docker, JMX exporter, Prometheus, Alertmanager.
The overall steps are: start the JVMs to be monitored with the JMX exporter attached, configure Prometheus to scrape their metrics, configure Prometheus's alert trigger rules, and configure Alertmanager to send notifications.
The alerting flow is roughly: when a rule's condition has held for long enough, Prometheus fires the alert to Alertmanager, which groups, inhibits, and routes it, and finally sends a notification (an email in this article).
First, start the JVMs to be monitored.
1) Create a new directory named prom-jvm-demo.
2) Download the JMX exporter into this directory.
3) Create a file named simple-config.yml with the following content:
---
lowercaseOutputLabelNames: true
lowercaseOutputName: true
whitelistObjectNames: ["java.lang:type=OperatingSystem"]
rules:
 - pattern: 'java.lang<type=OperatingSystem><>((?!process_cpu_time)\w+):'
   name: os_$1
   type: GAUGE
   attrNameSnakeCase: true
4) Run the following commands to start three Tomcat instances, remembering to replace <path-to-prom-jvm-demo> with the correct path (-Xms and -Xmx are deliberately set very small here so that the alert conditions will actually be triggered):
docker run -d \
  --name tomcat-1 \
  -v <path-to-prom-jvm-demo>:/jmx-exporter \
  -e CATALINA_OPTS="-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml" \
  -p 6060:6060 \
  -p 8080:8080 \
  tomcat:8.5-alpine

docker run -d \
  --name tomcat-2 \
  -v <path-to-prom-jvm-demo>:/jmx-exporter \
  -e CATALINA_OPTS="-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml" \
  -p 6061:6060 \
  -p 8081:8080 \
  tomcat:8.5-alpine

docker run -d \
  --name tomcat-3 \
  -v <path-to-prom-jvm-demo>:/jmx-exporter \
  -e CATALINA_OPTS="-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml" \
  -p 6062:6060 \
  -p 8082:8080 \
  tomcat:8.5-alpine
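A quick sanity check, not part of the original steps: list the three containers and their port mappings to confirm they are all up (standard docker CLI flags; the name filter matches the containers started above).

# Show the tomcat-1/2/3 containers, their status and port mappings
docker ps --filter "name=tomcat-" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"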
5) Visit http://localhost:8080|8081|8082 to check that each Tomcat started successfully.
6) Visit the corresponding http://localhost:6060|6061|6062 to see the metrics exposed by the JMX exporter.
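The same check can be done from the command line. The grep pattern below assumes the jvm_* metrics are exposed by the agent's built-in JVM collectors (they are what the alert rules below rely on) and that the os_* metrics come from simple-config.yml.

# Spot-check one exporter; repeat with ports 6061 and 6062 for the other two
curl -s http://localhost:6060/metrics \
  | grep -E '^(os_|jvm_memory_bytes_|jvm_gc_collection_seconds)' \
  | head -n 20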
Note: the simple-config.yml given here only exposes JVM information; for more sophisticated configuration, refer to the JMX exporter documentation.
1) Next, configure Prometheus. In the prom-jvm-demo directory created earlier, create a file named prom-jmx.yml with the following content:
scrape_configs:
  - job_name: 'java'
    static_configs:
      - targets:
          - '<host-ip>:6060'
          - '<host-ip>:6061'
          - '<host-ip>:6062'

# Address of Alertmanager
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - '<host-ip>:9093'

# Alert trigger rules to load
rule_files:
  - '/prometheus-config/prom-alert-rules.yml'
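One detail worth stressing: <host-ip> must be an address of the Docker host that is reachable from inside the Prometheus container; localhost or 127.0.0.1 would point at the container itself. The snippet below is just one way to list candidate addresses on a Linux host (which interface to pick depends on your setup):

# Print interface name and IPv4 address for each interface; pick one reachable from containers
ip -4 -o addr show | awk '{print $2, $4}'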
2) Create a file named prom-alert-rules.yml containing the alert trigger rules:
# severity levels, from most to least severe: red, orange, yellow, blue
groups:
  - name: jvm-alerting
    rules:
      # Down for more than 30 seconds
      - alert: instance-down
        expr: up == 0
        for: 30s
        labels:
          severity: yellow
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 seconds."

      # Down for more than 1 minute
      - alert: instance-down
        expr: up == 0
        for: 1m
        labels:
          severity: orange
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

      # Down for more than 5 minutes
      - alert: instance-down
        expr: up == 0
        for: 5m
        labels:
          severity: red
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

      # Heap usage above 50%
      - alert: heap-usage-too-much
        expr: jvm_memory_bytes_used{job="java", area="heap"} / jvm_memory_bytes_max * 100 > 50
        for: 1m
        labels:
          severity: yellow
        annotations:
          summary: "JVM Instance {{ $labels.instance }} memory usage > 50%"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 50%] for more than 1 minute. Current usage: {{ $value }}%"

      # Heap usage above 80%
      - alert: heap-usage-too-much
        expr: jvm_memory_bytes_used{job="java", area="heap"} / jvm_memory_bytes_max * 100 > 80
        for: 1m
        labels:
          severity: orange
        annotations:
          summary: "JVM Instance {{ $labels.instance }} memory usage > 80%"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 80%] for more than 1 minute. Current usage: {{ $value }}%"

      # Heap usage above 90%
      - alert: heap-usage-too-much
        expr: jvm_memory_bytes_used{job="java", area="heap"} / jvm_memory_bytes_max * 100 > 90
        for: 1m
        labels:
          severity: red
        annotations:
          summary: "JVM Instance {{ $labels.instance }} memory usage > 90%"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 90%] for more than 1 minute. Current usage: {{ $value }}%"

      # Old GC time above 30% of wall-clock time over the last 5 minutes
      - alert: old-gc-time-too-much
        expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.3
        for: 5m
        labels:
          severity: yellow
        annotations:
          summary: "JVM Instance {{ $labels.instance }} Old GC time > 30% running time"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 30% running time] for more than 5 minutes. Old GC seconds in the last 5m: {{ $value }}"

      # Old GC time above 50% of wall-clock time over the last 5 minutes
      - alert: old-gc-time-too-much
        expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.5
        for: 5m
        labels:
          severity: orange
        annotations:
          summary: "JVM Instance {{ $labels.instance }} Old GC time > 50% running time"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 50% running time] for more than 5 minutes. Old GC seconds in the last 5m: {{ $value }}"

      # Old GC time above 80% of wall-clock time over the last 5 minutes
      - alert: old-gc-time-too-much
        expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.8
        for: 5m
        labels:
          severity: red
        annotations:
          summary: "JVM Instance {{ $labels.instance }} Old GC time > 80% running time"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 80% running time] for more than 5 minutes. Old GC seconds in the last 5m: {{ $value }}"
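Before starting Prometheus in the next step, both files can be validated offline. This is a sketch that assumes promtool is shipped inside the prom/prometheus image (it is in recent images), so it can be run through Docker without installing anything locally:

# 'check config' validates prom-jmx.yml and also loads the rule_files it references
docker run --rm \
  -v <path-to-prom-jvm-demo>:/prometheus-config \
  --entrypoint promtool \
  prom/prometheus check config /prometheus-config/prom-jmx.yml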
3) Start Prometheus:
docker run -d \
  --name=prometheus \
  -p 9090:9090 \
  -v <path-to-prom-jvm-demo>:/prometheus-config \
  prom/prometheus --config.file=/prometheus-config/prom-jmx.yml
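Besides the web UI, the scrape targets can be checked from the command line through Prometheus's standard v1 HTTP API:

# Should list the three '<host-ip>:606x' targets together with their scrape health
curl -s http://localhost:9090/api/v1/targets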
4) Visit http://localhost:9090/alerts and you should see the alert rules configured above. If you do not see all three instances yet, wait a little and refresh.
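You can also evaluate the rule expressions by hand to see the raw values the alerts are based on. For example, the heap-usage expression used above can be sent to the standard query API (curl's --data-urlencode takes care of the encoding):

# Current heap usage percentage per instance, same expression as the heap alerts
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=jvm_memory_bytes_used{job="java",area="heap"} / jvm_memory_bytes_max * 100'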
1) Now configure Alertmanager. Create a file named alertmanager-config.yml with the following content:
global:
  smtp_smarthost: '<smtp.host:ip>'
  smtp_from: '<from>'
  smtp_auth_username: '<username>'
  smtp_auth_password: '<password>'

# The directory from which notification templates are read.
templates:
  - '/alertmanager-config/*.tmpl'

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname', 'instance']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3h

  # A default receiver
  receiver: "user-a"

# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
  - source_match:
      severity: 'red'
    target_match_re:
      severity: ^(blue|yellow|orange)$
    # Apply inhibition if the alertname and instance is the same.
    equal: ['alertname', 'instance']
  - source_match:
      severity: 'orange'
    target_match_re:
      severity: ^(blue|yellow)$
    # Apply inhibition if the alertname and instance is the same.
    equal: ['alertname', 'instance']
  - source_match:
      severity: 'yellow'
    target_match_re:
      severity: ^(blue)$
    # Apply inhibition if the alertname and instance is the same.
    equal: ['alertname', 'instance']

receivers:
  - name: 'user-a'
    email_configs:
      - to: '<user-a@domain.com>'
Change the smtp_* settings and the email address of user-a at the bottom.
Note: because domestic (Chinese) mail providers almost universally do not support TLS, and Alertmanager at the time did not support SSL, you originally had to use Gmail or another TLS-capable mailbox to send alert emails; see this issue. That problem has since been fixed, and below is an example configuration for an Alibaba Cloud enterprise mailbox:
smtp_smarthost: 'smtp.qiye.aliyun.com:465'
smtp_hello: 'company.com'
smtp_from: 'username@company.com'
smtp_auth_username: 'username@company.com'
smtp_auth_password: password
smtp_require_tls: false
2) Create a file named alert-template.tmpl; this is the email body template:
{{ define "email.default.html" }}
<h2>Summary</h2>
<p>{{ .CommonAnnotations.summary }}</p>

<h2>Description</h2>
<p>{{ .CommonAnnotations.description }}</p>
{{ end }}
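Before starting the container in the next step, the configuration can be validated with amtool, which is bundled in the prom/alertmanager image; the check-config subcommand is assumed to be available in the image tag used here.

# Validate alertmanager-config.yml without starting Alertmanager
docker run --rm \
  -v <path-to-prom-jvm-demo>:/alertmanager-config \
  --entrypoint amtool \
  prom/alertmanager:master check-config /alertmanager-config/alertmanager-config.yml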
3) Run the following command to start Alertmanager:
docker run -d \
  --name=alertmanager \
  -v <path-to-prom-jvm-demo>:/alertmanager-config \
  -p 9093:9093 \
  prom/alertmanager:master --config.file=/alertmanager-config/alertmanager-config.yml
4) Visit http://localhost:9093 and check whether the alerts sent by Prometheus have arrived (wait a moment if nothing shows up yet).
Wait a while (up to 5 minutes) and check whether the alert email arrives. If it does not, verify the configuration, or run docker logs alertmanager to inspect Alertmanager's log; the cause is usually a mail configuration error.
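Two checks that usually narrow the problem down, sketched here under the assumption that the v1 Alertmanager HTTP API is still available in the image tag you run (newer releases moved to v2):

# SMTP and template errors show up in the container log
docker logs alertmanager

# Alerts that Alertmanager has actually received from Prometheus
curl -s http://localhost:9093/api/v1/alerts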