Alerting on Abnormal JVM Conditions with Prometheus + Alertmanager

The previous article described how to monitor the JVM with Prometheus + Grafana. This article shows how to use Prometheus + Alertmanager to send alerts for certain JVM conditions.

The scripts mentioned in this article can be downloaded here.

Overview

Tools used:

  • Docker: most of the applications in this article are started with Docker.
  • Prometheus: scrapes and stores metrics and provides querying; this article focuses on its alerting features.
  • Grafana: data visualization (not the focus of this article; it just lets the reader see the abnormal metrics at a glance).
  • Alertmanager: sends alert notifications to the relevant people.
  • JMX exporter: exposes JVM-related metrics from JMX.
  • Tomcat: used to simulate a Java application.

Here is the overall procedure:

  1. Use the JMX exporter to start a small HTTP server inside each Java process.
  2. Configure Prometheus to scrape the metrics exposed by that HTTP server.
  3. Configure Prometheus alerting rules:

    • heap usage exceeds 50%, 80%, 90% of the maximum
    • an instance has been down for more than 30 seconds, 1 minute, 5 minutes
    • old GC time over the last 5 minutes exceeds 30%, 50%, 80%
  4. Configure Grafana to connect to Prometheus and set up a dashboard.
  5. Configure Alertmanager's notification rules.

The alerting flow is roughly as follows:

  1. Prometheus evaluates the alerting rules; if a rule fires, it sends the alert to Alertmanager.
  2. When Alertmanager receives an alert, it decides whether to send a notification and, if so, to whom.

Step 1: Start a few Java applications

1) Create a new directory named prom-jvm-demo.

2) Download the JMX exporter into this directory.

3) Create a file named simple-config.yml with the following content:

---
lowercaseOutputLabelNames: true
lowercaseOutputName: true
whitelistObjectNames: ["java.lang:type=OperatingSystem"]
rules:
 - pattern: 'java.lang<type=OperatingSystem><>((?!process_cpu_time)\w+):'
   name: os_$1
   type: GAUGE
   attrNameSnakeCase: true
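
For reference, this rule turns numeric attributes of the java.lang:type=OperatingSystem MBean into gauges named os_<attribute> (ProcessCpuTime is excluded by the negative lookahead). Purely as an illustration, and assuming a HotSpot JVM where attributes such as AvailableProcessors and SystemCpuLoad exist, the exporter output would contain lines roughly like:

os_available_processors 4.0
os_system_cpu_load 0.12

The exact attribute set depends on the JVM and the operating system.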

4) Run the following commands to start three Tomcat instances, remembering to replace <path-to-prom-jvm-demo> with the correct path (-Xms and -Xmx are deliberately set very small here so that the alert conditions will be triggered):

docker run -d \
  --name tomcat-1 \
  -v <path-to-prom-jvm-demo>:/jmx-exporter \
  -e CATALINA_OPTS="-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml" \
  -p 6060:6060 \
  -p 8080:8080 \
  tomcat:8.5-alpine

docker run -d \
  --name tomcat-2 \
  -v <path-to-prom-jvm-demo>:/jmx-exporter \
  -e CATALINA_OPTS="-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml" \
  -p 6061:6060 \
  -p 8081:8080 \
  tomcat:8.5-alpine

docker run -d \
  --name tomcat-3 \
  -v <path-to-prom-jvm-demo>:/jmx-exporter \
  -e CATALINA_OPTS="-Xms32m -Xmx32m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.3.1.jar=6060:/jmx-exporter/simple-config.yml" \
  -p 6062:6060 \
  -p 8082:8080 \
  tomcat:8.5-alpine

5) Visit http://localhost:8080|8081|8082 to check that each Tomcat has started successfully.

6) Visit the corresponding http://localhost:6060|6061|6062 to see the metrics exposed by the JMX exporter.

Note: the simple-config.yml provided here only exposes JVM information; for more complex configurations, refer to the JMX exporter documentation.
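
You can also sanity-check an endpoint from the command line. A minimal sketch, assuming tomcat-1 is published on host port 6060 as in the commands above and that curl is available on the host:

# list the heap metrics that the alert rules below rely on
curl -s http://localhost:6060/metrics | grep -E '^jvm_memory_bytes_(used|max)'

You should see jvm_memory_bytes_used and jvm_memory_bytes_max series labelled with area="heap" and area="nonheap"; these come from the agent's built-in JVM collectors rather than from simple-config.yml.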

Step 2: Start Prometheus

1) In the prom-jvm-demo directory created earlier, create a file named prom-jmx.yml with the following content:

scrape_configs:
  - job_name: 'java'
    static_configs:
    - targets:
      - '<host-ip>:6060'
      - '<host-ip>:6061'
      - '<host-ip>:6062'

# address of the Alertmanager
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - '<host-ip>:9093'

# load the alerting rule files
rule_files:
  - '/prometheus-config/prom-alert-rules.yml'

2) Create a file named prom-alert-rules.yml; this file contains the alerting rules:

# severity levels, from most to least severe: red, orange, yellow, blue
groups:
  - name: jvm-alerting
    rules:

    # instance down for more than 30 seconds
    - alert: instance-down
      expr: up == 0
      for: 30s
      labels:
        severity: yellow
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 seconds."

    # instance down for more than 1 minute
    - alert: instance-down
      expr: up == 0
      for: 1m
      labels:
        severity: orange
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

    # instance down for more than 5 minutes
    - alert: instance-down
      expr: up == 0
      for: 5m
      labels:
        severity: red
      annotations:
        summary: "Instance {{ $labels.instance }} down"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

    # heap usage above 50%
    - alert: heap-usage-too-much
      expr: jvm_memory_bytes_used{job="java", area="heap"} / jvm_memory_bytes_max * 100 > 50
      for: 1m
      labels:
        severity: yellow
      annotations:
        summary: "JVM Instance {{ $labels.instance }} memory usage > 50%"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 50%] for more than 1 minute. Current usage: {{ $value }}%"

    # heap usage above 80%
    - alert: heap-usage-too-much
      expr: jvm_memory_bytes_used{job="java", area="heap"} / jvm_memory_bytes_max * 100 > 80
      for: 1m
      labels:
        severity: orange
      annotations:
        summary: "JVM Instance {{ $labels.instance }} memory usage > 80%"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 80%] for more than 1 minute. Current usage: {{ $value }}%"
    
    # heap usage above 90%
    - alert: heap-usage-too-much
      expr: jvm_memory_bytes_used{job="java", area="heap"} / jvm_memory_bytes_max * 100 > 90
      for: 1m
      labels:
        severity: red
      annotations:
        summary: "JVM Instance {{ $labels.instance }} memory usage > 90%"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [heap usage > 90%] for more than 1 minute. Current usage: {{ $value }}%"

    # Old GC took more than 30% of the last 5 minutes
    - alert: old-gc-time-too-much
      expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.3
      for: 5m
      labels:
        severity: yellow
      annotations:
        summary: "JVM Instance {{ $labels.instance }} Old GC time > 30% running time"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 30% running time] for more than 5 minutes. Old GC time in the last 5 minutes: {{ $value }} seconds"

    # Old GC took more than 50% of the last 5 minutes
    - alert: old-gc-time-too-much
      expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.5
      for: 5m
      labels:
        severity: orange
      annotations:
        summary: "JVM Instance {{ $labels.instance }} Old GC time > 50% running time"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 50% running time] for more than 5 minutes. Old GC time in the last 5 minutes: {{ $value }} seconds"

    # Old GC took more than 80% of the last 5 minutes
    - alert: old-gc-time-too-much
      expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.8
      for: 5m
      labels:
        severity: red
      annotations:
        summary: "JVM Instance {{ $labels.instance }} Old GC time > 80% running time"
        description: "{{ $labels.instance }} of job {{ $labels.job }} has been in status [Old GC time > 80% running time] for more than 5 minutes. Old GC time in the last 5 minutes: {{ $value }} seconds"
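
Before starting Prometheus you can optionally validate both files with promtool. This is only a sketch: it assumes the promtool binary bundled in the prom/prometheus image, and it mounts the directory at /prometheus-config so that the rule_files path inside prom-jmx.yml resolves:

docker run --rm \
  -v <path-to-prom-jvm-demo>:/prometheus-config \
  --entrypoint promtool \
  prom/prometheus check config /prometheus-config/prom-jmx.yml

promtool check config also loads the rule files referenced by the configuration, so a typo in prom-alert-rules.yml is reported here as well.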

3) Start Prometheus:

docker run -d \
  --name=prometheus \
  -p 9090:9090 \
  -v <path-to-prom-jvm-demo>:/prometheus-config \
  prom/prometheus --config.file=/prometheus-config/prom-jmx.yml

4) Visit http://localhost:9090/alerts and you should see the alerting rules configured earlier:

(Screenshot: the configured alert rules shown on the Prometheus /alerts page)

If you do not see all three instances yet, wait a moment and try again.
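
You can also confirm that all three targets are being scraped by querying the standard Prometheus HTTP API from the command line (assuming curl on the host):

curl -s 'http://localhost:9090/api/v1/query?query=up'

Each of the three instances should be present with the value 1; an instance at 0 (or missing entirely) is exactly what the instance-down rules alert on.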

Step 3: Configure Grafana

Refer to the previous article, Monitoring the JVM with Prometheus + Grafana.

Step 4: Start Alertmanager

1) Create a file named alertmanager-config.yml with the following content:

global:
  smtp_smarthost: '<smtp.host:ip>'
  smtp_from: '<from>'
  smtp_auth_username: '<username>'
  smtp_auth_password: '<password>'

# The directory from which notification templates are read.
templates: 
- '/alertmanager-config/*.tmpl'

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname', 'instance']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first 
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3h 

  # A default receiver
  receiver: "user-a"

# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is 
# already critical.
inhibit_rules:
- source_match:
    severity: 'red'
  target_match_re:
    severity: ^(blue|yellow|orange)$
  # Apply inhibition if the alertname and instance is the same.
  equal: ['alertname', 'instance']
- source_match:
    severity: 'orange'
  target_match_re:
    severity: ^(blue|yellow)$
  # Apply inhibition if the alertname and instance is the same.
  equal: ['alertname', 'instance']
- source_match:
    severity: 'yellow'
  target_match_re:
    severity: ^(blue)$
  # Apply inhibition if the alertname and instance is the same.
  equal: ['alertname', 'instance']

receivers:
- name: 'user-a'
  email_configs:
  - to: '<user-a@domain.com>'

Modify the smtp_* settings and the email address of user-a at the bottom.

Note: almost no mail provider in mainland China supports TLS, and Alertmanager did not support SSL at the time, so you had to use Gmail or another TLS-capable mailbox to send alert emails (see this issue). That limitation has since been fixed; below is an example configuration for Alibaba Cloud enterprise mail:

smtp_smarthost: 'smtp.qiye.aliyun.com:465'
smtp_hello: 'company.com'
smtp_from: 'username@company.com'
smtp_auth_username: 'username@company.com'
smtp_auth_password: password
smtp_require_tls: false

2) Create a file named alert-template.tmpl; this is the template for the email body:

{{ define "email.default.html" }}
<h2>Summary</h2>
  
<p>{{ .CommonAnnotations.summary }}</p>

<h2>Description</h2>

<p>{{ .CommonAnnotations.description }}</p>
{{ end}}
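
Before starting Alertmanager you can optionally validate the configuration with amtool. A sketch, assuming the amtool binary bundled in the prom/alertmanager image:

docker run --rm \
  -v <path-to-prom-jvm-demo>:/alertmanager-config \
  --entrypoint amtool \
  prom/alertmanager:master check-config /alertmanager-config/alertmanager-config.yml

A successful check lists the receivers and templates it found, which catches YAML indentation mistakes before they show up as a crash-looping container.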

3) Run the following command to start Alertmanager:

docker run -d \
  --name=alertmanager \
  -v <path-to-prom-jvm-demo>:/alertmanager-config \
  -p 9093:9093 \
  prom/alertmanager:master --config.file=/alertmanager-config/alertmanager-config.yml

4) Visit http://localhost:9093 and check whether the alerts sent by Prometheus have arrived (if you don't see them yet, wait a moment):

(Screenshot: the alerts received from Prometheus, shown in the Alertmanager web UI)
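
The same information is available from the Alertmanager HTTP API if you prefer the command line. A sketch; the exact path depends on the Alertmanager version (v1 shown here, newer releases expose /api/v2/alerts instead):

curl -s http://localhost:9093/api/v1/alerts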

Step 5: Wait for the email

Wait a while (up to 5 minutes) and check whether the email has arrived. If it hasn't, check that the configuration is correct, or run docker logs alertmanager to inspect the Alertmanager logs; in most cases the cause is a mail configuration error.
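
For example (plain Docker CLI, nothing specific to this setup):

docker logs --tail 100 alertmanager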
