Alertmanager Alerting Rules Explained

This article covers the alerting and notification rules of Prometheus and Alertmanager. The Prometheus configuration file is prometheus.yml, and the Alertmanager configuration file is alertmanager.yml.

Alert: Prometheus sends a detected abnormal event to Alertmanager; this does not mean sending an email notification.
Notification: Alertmanager sends out a notification about the abnormal event (email, webhook, etc.).

Alerting rules

In prometheus.yml, set how frequently alerting rules are evaluated:

# How frequently to evaluate rules.
[ evaluation_interval: <duration> | default = 1m ]

In prometheus.yml, specify the rule files (wildcards such as rules/*.rules are allowed):

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
 - "/etc/prometheus/alert.rules"

Alerting rules are based on the following template:

ALERT <alert name>
  IF <expression>
  [ FOR <duration> ]
  [ LABELS <label set> ]
  [ ANNOTATIONS <label set> ]

Where:

Alert name is the alert identifier. It does not need to be unique.

Expression is the condition that is evaluated to trigger the alert. It usually uses existing metrics, as returned by the /metrics endpoint.

Duration is the period for which the rule must hold. For example, 5s means 5 seconds.

Label set is a set of labels that can be used in the message template.
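
This is the legacy rule syntax; Prometheus 2.x expresses the same fields as YAML rule groups, which is also the format used by the ConfigMap below. A minimal sketch with a hypothetical InstanceDown rule, mapping each template field to its YAML key:

groups:
- name: example.rules
  rules:
  - alert: InstanceDown          # <alert name>
    expr: up == 0                # IF <expression>
    for: 5m                      # FOR <duration>
    labels:                      # LABELS <label set>
      severity: critical
    annotations:                 # ANNOTATIONS <label set>
      summary: "Instance {{ $labels.instance }} is down"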

In prometheus-k8s-statefulset.yaml, define a ruleSelector that selects the alerting-rule labels; the prometheus-k8s-rules.yaml rule file then carries those labels:

ruleSelector:
    matchLabels:
      role: prometheus-rulefiles
      prometheus: k8s

prometheus-k8s-rules.yaml provides the prometheus-rulefiles rules as a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-k8s-rules
  namespace: monitoring
  labels:
    role: prometheus-rulefiles
    prometheus: k8s
data:
  pod.rules.yaml: |+
    groups:
    - name: noah_pod.rules
      rules:
      - alert: Pod_all_cpu_usage
        expr: (sum by(name)(rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 10
        for: 5m
        labels:
          severity: critical
          service: pods
        annotations:
          description: Container {{ $labels.name }} CPU usage is above 10% (current value is {{ $value }})
          summary: Dev CPU load alert
      - alert: Pod_all_memory_usage
        expr: sort_desc(avg by(name)(irate(container_memory_usage_bytes{name!=""}[5m]))*100) > 1024*10^3*2
        for: 10m
        labels:
          severity: critical
        annotations:
          description: Container {{ $labels.name }} memory usage is above 2G (current value is {{ $value }})
          summary: Dev memory load alert
      - alert: Pod_all_network_receive_usage
        expr: sum by (name)(irate(container_network_receive_bytes_total{container_name="POD"}[1m])) > 1024*1024*50
        for: 10m
        labels:
          severity: critical
        annotations:
          description: Container {{ $labels.name }} network_receive usage is above 50M (current value is {{ $value }})
          summary: network_receive load alert

Once the configuration files are in place, prometheus-operator reloads the configuration automatically.
If you later modify the ConfigMap, you only need to apply it again:

kubectl apply -f prometheus-k8s-rules.yaml

Compare the email notification against the rules (alertmanager.yml still has to be configured before any email is received).

Alert aggregation

Notification rules

Set up route and receivers in alertmanager.yml:

global:
  # ResolveTimeout is the time after which an alert is declared resolved
  # if it has not been updated.
  resolve_timeout: 5m

  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'xxxxx'
  smtp_from: 'xxxxxxx'
  smtp_auth_username: 'xxxxx'
  smtp_auth_password: 'xxxxxx'
  # The API URL to use for Slack notifications.
  slack_api_url: 'https://hooks.slack.com/services/some/api/token'

# The files from which notification templates are read.
templates:
- '*.tmpl'

# The root route on which each incoming alert enters.
route:

  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.

  group_by: ['alertname', 'cluster', 'service']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.

  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.

  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.

  #repeat_interval: 1m
  repeat_interval: 15m

  # A default receiver

  # If an alert isn't caught by a route, send it to default.
  receiver: default

  # All the above attributes are inherited by all child routes and can
  # be overwritten on each.

  # The child route trees.
  routes:
  - match:
      severity: critical
    receiver: email_alert

receivers:
- name: 'default'
  email_configs:
  - to : 'yi.hu@dianrong.com'
    send_resolved: true

- name: 'email_alert'
  email_configs:
  - to : 'yi.hu@dianrong.com'
    send_resolved: true

Terminology

Route

The route block defines the dispatch policy for alerts. It is a tree structure, matched depth-first from left to right.

// Match does a depth-first left-to-right search through the route tree
// and returns the matching routing nodes.
func (r *Route) Match(lset model.LabelSet) []*Route {
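
As an illustration of this depth-first, left-to-right matching, consider the hypothetical route tree below (receiver names are made up): an alert labeled team=frontend and severity=critical descends into the first child route, then into its nested critical route, and never reaches the second sibling unless continue: true is set on the first child.

route:
  receiver: default            # fallback when no child route matches
  routes:
  - match:
      team: frontend           # first (left-most) child, checked before its siblings
    receiver: frontend-team
    routes:
    - match:
        severity: critical     # nested child, checked depth-first
      receiver: frontend-pager
  - match:
      severity: critical       # only reached if the alert did not match team=frontend
    receiver: ops-pager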

Alert

An Alert is an alert as received by Alertmanager; its type is shown below.

// Alert is a generic representation of an alert in the Prometheus eco-system.
type Alert struct {
    // Label value pairs for purpose of aggregation, matching, and disposition
    // dispatching. This must minimally include an "alertname" label.
    Labels LabelSet `json:"labels"`

    // Extra key/value information which does not define alert identity.
    Annotations LabelSet `json:"annotations"`

    // The known time range for this alert. Both ends are optional.
    StartsAt     time.Time `json:"startsAt,omitempty"`
    EndsAt       time.Time `json:"endsAt,omitempty"`
    GeneratorURL string    `json:"generatorURL"`
}

具備相同Lables的Alert(key和value都相同)纔會被認爲是同一種。在prometheus rules文件配置的一條規則可能會產生多種報警

Group

Alertmanager groups alerts according to the group_by configuration. With the rules below, three alerts fire once go_goroutines reaches 4, and Alertmanager splits them into two groups (for example, when grouping by label1) before notifying the receivers.

ALERT test1
  IF go_goroutines > 1
  LABELS {label1="l1", label2="l2", status="test"}
ALERT test2
  IF go_goroutines > 2
  LABELS {label1="l2", label2="l2", status="test"}
ALERT test3
  IF go_goroutines > 3
  LABELS {label1="l2", label2="l1", status="test"}

Main processing flow

  1. When an alert is received, its labels determine which routes it belongs to (there can be multiple matching routes; a route has multiple groups, and a group has multiple alerts).

  2. The alert is assigned to a group; if no matching group exists, a new one is created.

  3. A new group waits for the duration given by group_wait (more alerts for the same group may arrive in the meantime), determines whether each alert is resolved based on resolve_timeout, and then sends a notification.

  4. An existing group waits for the duration given by group_interval, checks whether its alerts are resolved, and sends a notification when the group has changed or the time since the last notification exceeds repeat_interval (see the sketch below).
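
To make the timing parameters in steps 3 and 4 concrete, here is a minimal route sketch (values are illustrative) with comments describing how each duration is used:

route:
  receiver: default
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s      # a new group buffers incoming alerts this long before the first notification
  group_interval: 5m   # an existing group waits this long before notifying about newly added or changed alerts
  repeat_interval: 15m # an unchanged, still-firing group is re-notified at most once per this interval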

Alertmanager

Alertmanager is a buffer for alerts. It has the following characteristics:

It can receive alerts through a dedicated endpoint (not specific to Prometheus).

It can route alerts to receivers such as HipChat, email, or others.

It is smart enough to recognise that a similar notification has already been sent, so when something goes wrong you are not flooded with thousands of emails.

An Alertmanager client (in this case Prometheus) sends a POST request with all alerts to be processed to /api/v1/alerts. For example:

[
  {
    "labels": {
      "alertname": "low_connected_users",
      "severity": "warning"
    },
    "annotations": {
      "description": "Instance play-app:9000 under lower load",
      "summary": "play-app:9000 of job playframework-app is under lower load"
    }
  }
]

Alert workflow

Once these alerts are stored in Alertmanager, they can be in any of the following states:

(Figure: alert state transition flow)

  • Inactive: nothing is happening here.

  • Pending: the client has told us that this alert must fire. However, the alert may still be grouped, inhibited, or silenced. Once all of these checks pass, it moves to Firing.

  • Firing: the alert is sent to the notification pipeline, which contacts all of the alert's receivers. Once the client reports that the alert is resolved, it transitions back to the Inactive state.

Prometheus has a dedicated endpoint that lets us list all alerts and follow their state transitions. Each state shown by Prometheus, and the condition that causes each transition, is listed below:

The rule condition is not met: the alert is not active (Inactive).

The rule condition is met: the alert is now active (Pending). Some validation is performed to avoid flooding the receivers with messages.

The alert is sent to the receivers (Firing).
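
These three lines correspond to the Inactive, Pending, and Firing states. The for clause in a rule controls how long an alert stays Pending before it starts Firing, as in this minimal sketch (alert name and threshold are made up):

- alert: HighGoroutines
  expr: go_goroutines > 100    # when this becomes true the alert turns Pending
  for: 5m                      # if it stays true for 5m the alert turns Firing
  labels:
    severity: warning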

 

Receivers (receiver)

As the name suggests, a receiver configures where and how alerts are delivered.
Generic configuration format:

# The unique name of the receiver.
name: <string>

# Configurations for several notification integrations.
email_configs:
  [ - <email_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
slack_configs:
  [ - <slack_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
webhook_configs:
  [ - <webhook_config>, ... ]

Email receiver (email_config)

# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]

# The email address to send notifications to.
to: <tmpl_string>
# The sender address.
[ from: <tmpl_string> | default = global.smtp_from ]
# The SMTP host through which emails are sent.
[ smarthost: <string> | default = global.smtp_smarthost ]

# The HTML body of the email notification.
[ html: <tmpl_string> | default = '{{ template "email.default.html" . }}' ]

# Further headers email header key/value pairs. Overrides any headers
# previously set by the notification implementation.
[ headers: { <string>: <tmpl_string>, ... } ]


Slack receiver (slack_config)

# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]

# The Slack webhook URL.
[ api_url: <string> | default = global.slack_api_url ]

# The channel or user to send notifications to.
channel: <tmpl_string>

# API request data as defined by the Slack webhook API.
[ color: <tmpl_string> | default = '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}' ]
[ username: <tmpl_string> | default = '{{ template "slack.default.username" . }}' ]
[ title: <tmpl_string> | default = '{{ template "slack.default.title" . }}' ]
[ title_link: <tmpl_string> | default = '{{ template "slack.default.titlelink" . }}' ]
[ pretext: <tmpl_string> | default = '{{ template "slack.default.pretext" . }}' ]
[ text: <tmpl_string> | default = '{{ template "slack.default.text" . }}' ]
[ fallback: <tmpl_string> | default = '{{ template "slack.default.fallback" . }}' ]
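
For reference, a minimal Slack receiver sketch (the receiver name and channel are hypothetical; api_url falls back to the global slack_api_url shown earlier):

receivers:
- name: 'slack_alert'          # hypothetical receiver name
  slack_configs:
  - channel: '#alerts'         # channel or user to notify
    send_resolved: true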

Webhook receiver (webhook_config)

# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]

# The endpoint to send HTTP POST requests to.
url: <string>

Alertmanager sends HTTP POST requests to the configured endpoint in the following format:

{
  "version": "2",
  "status": "<resolved|firing>",
  "alerts": [
    {
      "labels": <object>,
      "annotations": <object>,
      "startsAt": "<rfc3339>",
      "endsAt": "<rfc3339>"
    },
    ...
  ]
}
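
For reference, a minimal webhook receiver sketch (the receiver name and URL are hypothetical); whatever listens at that URL must accept the JSON payload shown above:

receivers:
- name: 'webhook_alert'                    # hypothetical receiver name
  webhook_configs:
  - url: 'http://alert-gateway:8080/hook'  # hypothetical endpoint that receives the POST body above
    send_resolved: true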

Inhibition

Inhibition is a mechanism that, once an alert has fired, suppresses repeated notifications for other alerts caused by it.

For example, when an alert fires reporting that an entire cluster is unreachable, Alertmanager can be configured to ignore all other alerts caused by that outage. This prevents notifications for hundreds or thousands of alerts that are unrelated to the actual problem.
Inhibition is configured via the Alertmanager configuration file.

Inhibition allows muting notifications for some alerts while other alerts are already firing. For example, if the same alert (based on the alert name) is already firing at critical severity, we can configure an inhibition to mute any warning-level notification. The relevant part of alertmanager.yml looks like this:

inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['low_connected_users']

An inhibit rule mutes alerts matching one set of matchers (target_match) while one or more alerts matching another set (source_match) exist. The two alerts must also agree on the labels listed under equal.

# Matchers that have to be fulfilled in the alerts to be muted.
target_match:
  [ <labelname>: <labelvalue>, ... ]
target_match_re:
  [ <labelname>: <regex>, ... ]

# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
source_match:
  [ <labelname>: <labelvalue>, ... ]
source_match_re:
  [ <labelname>: <regex>, ... ]

# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]

Silences

Silences are a way to quickly mute alerts temporarily. They are configured directly through a dedicated page in the Alertmanager web console. This is useful for avoiding notification spam while you work on a serious production issue.


References

Alertmanager and inhibit_rules reference: https://www.kancloud.cn/huyipow/prometheus/527563
