Alertmanager alerting rules explained

Date: 2019-11-09

Please credit the source when reposting. Original link: http://tailnode.tk/2017/03/al...html

Overview

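This article describes the alerting and notification rules of prometheus and alertmanager. The prometheus configuration file is prometheus.yml and the alertmanager configuration file is alertmanager.yml.
Alerting: prometheus sends the abnormal events it detects to alertmanager; it does not mean sending email notifications.
Notification: alertmanager sends out notifications about those events (email, webhook, etc.).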

Alerting rules

Set how often the alerting rules are evaluated in prometheus.yml:

# How frequently to evaluate rules.
[ evaluation_interval: <duration> | default = 1m ]

Specify the rule files in prometheus.yml (wildcards are allowed, e.g. rules/*.rules):

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - rules/mengyuan.rules

Add mengyuan.rules to the rules directory (this is the Prometheus 1.x rule syntax; Prometheus 2.0 and later use YAML rule files):

ALERT goroutines_gt_70
  IF go_goroutines > 70
  FOR 5s
  LABELS { status = "yellow" }
  ANNOTATIONS {
    summary = "goroutines above 70, current value {{ $value }}",
    description = "current instance {{ $labels.instance }}",
  }

ALERT goroutines_gt_90
  IF go_goroutines > 90
  FOR 5s
  LABELS { status = "red" }
  ANNOTATIONS {
    summary = "goroutines above 90, current value {{ $value }}",
    description = "current instance {{ $labels.instance }}",
  }
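
To make the interaction between evaluation_interval and the FOR clause concrete, here is a minimal Python sketch of a simplified model: an alert is pending from the first evaluation at which the IF condition holds, and firing once the condition has held for at least the FOR duration. The function name alert_states, the sample values, and the 5-second evaluation interval are assumptions made for illustration; this is not Prometheus code.

```python
def alert_states(samples, threshold, for_seconds, eval_interval):
    """Return the alert state at each evaluation: inactive, pending, or firing.

    samples holds one metric value per rule evaluation.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, value in enumerate(samples):
        now = i * eval_interval
        if value > threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # condition broke, so the FOR clock resets
            states.append("inactive")
    return states


# goroutines_gt_70 with FOR 5s, evaluated every 5 seconds (assumed interval)
print(alert_states([60, 75, 80, 65, 72, 74],
                   threshold=70, for_seconds=5, eval_interval=5))
```

In this model a short FOR combined with a long evaluation interval means the alert fires at the second consecutive evaluation that exceeds the threshold, since the state is only updated when the rule is evaluated.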

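Once the configuration files are in place, prometheus has to reload them. There are two ways:

Send a POST request to /-/reload through the HTTP API, e.g. curl -X POST http://localhost:9090/-/reload
Send the SIGHUP signal to the prometheus process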
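
Both reload methods can also be scripted. The following Python sketch uses only the standard library; the helper names are hypothetical, and the base URL assumes prometheus listens on localhost:9090 as in the curl example. Newer Prometheus releases only enable the /-/reload endpoint when started with --web.enable-lifecycle, so the HTTP variant may be rejected there.

```python
import os
import signal
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build (but do not send) the POST request for the /-/reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_via_http(base_url="http://localhost:9090"):
    """Send the reload request and return the HTTP status code."""
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


def reload_via_signal(pid):
    """Send SIGHUP to the prometheus process with the given pid (POSIX only)."""
    os.kill(pid, signal.SIGHUP)
```

Calling reload_via_http() corresponds to the curl command, and reload_via_signal(pid) to the SIGHUP method; the pid has to be looked up separately.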

Compare the email notification with the rules (alertmanager.yml must also be configured before any email is received).

Notification rules

Set route and receivers in alertmanager.yml:

route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first 
  # notification.
  group_wait: 5s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 1m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3h 

  # A default receiver
  receiver: mengyuan

receivers:
- name: 'mengyuan'
  webhook_configs:
  - url: http://192.168.0.53:8080
  email_configs:
  - to: 'mengyuan@tenxcloud.com'
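
For the webhook_configs entry, something has to listen on the configured url. Below is a minimal Python sketch of such a listener. It assumes the webhook body is JSON with an "alerts" list whose entries carry "status", "labels" and "annotations"; check the payload format of the Alertmanager version in use. The names summarize, WebhookHandler and serve are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(payload):
    """Turn a webhook payload into one readable line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append("[%s] %s: %s" % (
            alert.get("status", "unknown"),
            labels.get("alertname", "?"),
            annotations.get("summary", ""),
        ))
    return lines


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body posted by alertmanager and print a summary.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize(payload):
            print(line)
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Listen on the port used by the webhook url in the config (8080)."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Running serve() on the host named in the url would print one line per alert for every notification that alertmanager delivers.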

Terminology

Route

The route attribute defines how alerts are dispatched. It is a tree structure that is matched depth-first, from left to right.

// Match does a depth-first left-to-right search through the route tree
// and returns the matching routing nodes.
func (r *Route) Match(lset model.LabelSet) []*Route {
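
The depth-first, left-to-right search can be sketched in Python as follows. This is a simplified model, not the alertmanager source: it supports equality matchers only, and the receiver names and the continue_ flag spelling are made up for illustration. It assumes that a node whose children do not match is itself the match, and that the search stops at the first matching child unless that child asks to continue.

```python
class Route:
    def __init__(self, receiver, match=None, routes=None, continue_=False):
        self.receiver = receiver
        self.match = match or {}      # label name -> required value (equality only)
        self.routes = routes or []    # child routes, in configuration order
        self.continue_ = continue_    # keep checking later siblings after a match

    def matches(self, labels):
        return all(labels.get(k) == v for k, v in self.match.items())


def match_routes(route, labels):
    """Depth-first, left-to-right search; returns the matching routes."""
    if not route.matches(labels):
        return []
    found = []
    for child in route.routes:
        child_matches = match_routes(child, labels)
        found.extend(child_matches)
        if child_matches and not child.continue_:
            break  # first matching child wins unless it sets continue
    # If no child matched, the current node itself is the match.
    return found or [route]


# Hypothetical route tree: a default root with two children, one nested.
tree = Route("default", routes=[
    Route("yellow-team", match={"status": "yellow"}),
    Route("red-team", match={"status": "red"}, routes=[
        Route("red-db", match={"job": "db"}),
    ]),
])

print([r.receiver for r in match_routes(tree, {"status": "red", "job": "db"})])
```

An alert that matches none of the children falls back to the root route, which is why the top-level route needs a default receiver.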

Alert

An Alert is an alert received by alertmanager. Its type is shown below.

// Alert is a generic representation of an alert in the Prometheus eco-system.
type Alert struct {
    // Label value pairs for purpose of aggregation, matching, and disposition
    // dispatching. This must minimally include an "alertname" label.
    Labels LabelSet `json:"labels"`

    // Extra key/value information which does not define alert identity.
    Annotations LabelSet `json:"annotations"`

    // The known time range for this alert. Both ends are optional.
    StartsAt     time.Time `json:"startsAt,omitempty"`
    EndsAt       time.Time `json:"endsAt,omitempty"`
    GeneratorURL string    `json:"generatorURL"`
}

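Only Alerts with identical Labels (both keys and values equal) are treated as the same alert. A single rule in a prometheus rules file can therefore produce several distinct alerts, for example one per instance.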
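
A small Python sketch of this identity rule: two alerts are the same only if their complete label sets are equal. Here the identity is modelled as a frozenset of label pairs, which is an assumption for illustration (alertmanager itself derives a fingerprint from the label set); the instance values are made up.

```python
def alert_identity(labels):
    """Identity of an alert: the full set of label name/value pairs."""
    return frozenset(labels.items())


# One rule (goroutines_gt_70) evaluated against two instances.
a = {"alertname": "goroutines_gt_70", "status": "yellow", "instance": "host-a:9090"}
b = {"alertname": "goroutines_gt_70", "status": "yellow", "instance": "host-b:9090"}
a_again = dict(a)

print(alert_identity(a) == alert_identity(a_again))  # same labels, same alert
print(alert_identity(a) == alert_identity(b))        # instance differs, different alert
```

Annotations are deliberately left out of the identity, matching the comment in the Alert struct that they do not define alert identity.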

Group

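alertmanager groups Alerts according to the group_by setting. With the rules below, three alerts are received when go_goroutines equals 4, and alertmanager splits them into two groups when notifying the receivers. That count holds when group_by names label1 (or label2); the group_by: ['alertname'] setting shown earlier would give three groups instead.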

ALERT test1
  IF go_goroutines > 1
  LABELS {label1="l1", label2="l2", status="test"}
ALERT test2
  IF go_goroutines > 2
  LABELS {label1="l2", label2="l2", status="test"}
ALERT test3
  IF go_goroutines > 3
  LABELS {label1="l2", label2="l1", status="test"}
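
The effect of group_by on these three alerts can be checked with a short Python sketch. It models a group key as the values of the group_by labels and ignores any extra labels (such as instance or job) that real alerts would carry; the function name group_alerts is hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Group alerts by the values of the labels listed in group_by."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels["alertname"])
    return dict(groups)


alerts = [
    {"alertname": "test1", "label1": "l1", "label2": "l2", "status": "test"},
    {"alertname": "test2", "label1": "l2", "label2": "l2", "status": "test"},
    {"alertname": "test3", "label1": "l2", "label2": "l1", "status": "test"},
]

# Print the number of groups produced by several group_by choices.
for group_by in (["label1"], ["label2"], ["alertname"], ["status"]):
    print(group_by, len(group_alerts(alerts, group_by)))
```

Grouping by label1 or by label2 yields two groups, grouping by alertname yields three, and grouping by status yields a single group.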

Main processing flow

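When an Alert is received, its labels determine which Routes it belongs to (there can be several Routes; a Route has multiple Groups, and a Group has multiple Alerts)
The Alert is assigned to a Group; if no matching Group exists, a new one is created
A new Group waits for the time set by group_wait (more Alerts for the same Group may arrive in the meantime), uses resolve_timeout to decide whether each Alert is resolved, and then sends the notification
An existing Group waits for the time set by group_interval and checks whether its Alerts are resolved; a notification is sent when the time since the last notification exceeds repeat_interval or when the Group has been updated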
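
The last two steps boil down to a decision made each time a Group is flushed. The Python sketch below is a simplified reading of that decision, not alertmanager's implementation; the function name should_notify, its parameters, and the use of plain seconds are assumptions.

```python
def should_notify(now, last_sent, repeat_interval, group_changed):
    """Decide at a flush whether a group sends a notification.

    All times are in seconds. last_sent is None if nothing was sent yet.
    """
    if last_sent is None:
        return True   # new group: first notification after group_wait
    if group_changed:
        return True   # alerts were added or resolved since the last send
    return now - last_sent > repeat_interval


REPEAT_INTERVAL = 3 * 60 * 60  # repeat_interval: 3h

# New group flushed after group_wait (5s): notify.
print(should_notify(5, None, REPEAT_INTERVAL, False))
# Unchanged group one group_interval (1m) later: stay silent.
print(should_notify(65, 5, REPEAT_INTERVAL, False))
# Group received a new alert in the meantime: notify.
print(should_notify(65, 5, REPEAT_INTERVAL, True))
# Unchanged group, but more than repeat_interval has passed: notify again.
print(should_notify(5 + REPEAT_INTERVAL + 60, 5, REPEAT_INTERVAL, False))
```

With the values from the configuration, an unchanged group is therefore re-notified roughly every three hours, while any change to the group is reported at the next group_interval flush.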

TODO

How a restart affects the sending of alerts and notifications
Whether alertmanager can be run as a cluster
