The Alertmanager Alerting Component

Date: 2019-11-08

Prometheus Alertmanager

Overview

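Alertmanager and Prometheus are two separate components. The Prometheus server sends alerts to Alertmanager according to its alerting rules. Alertmanager then manages those alerts, including silencing, inhibition, and aggregation, and sends notifications through channels such as email, PagerDuty, and HipChat.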

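The main steps for setting up alerting and notifications are:

Install and configure Alertmanager
Configure Prometheus to talk to Alertmanager via the -alertmanager.url flag
Create alerting rules in Prometheus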

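Alertmanager introduction and mechanisms

Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating and grouping them and routing them to the correct receiver, such as email or Slack. It also supports grouping, silencing, and inhibition of alerts.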

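Grouping

Grouping combines alerts of a similar nature into a single notification. This is especially useful during larger outages, when many systems fail at once and hundreds or thousands of alerts may fire simultaneously.

For example, suppose dozens or hundreds of service instances are running when a network failure occurs, and half of them can no longer reach the database. If the Prometheus alerting rules are configured to send an alert for each service instance, hundreds of alerts are sent to Alertmanager.

As a user, you only want to receive a single page while still being able to see exactly which instances are affected. Alertmanager can therefore be configured to group these alerts and send a single, compact notification.

Grouping of alerts, the timing of grouped notifications, and the receivers of those notifications are configured through the routing tree in the Alertmanager configuration file.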
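
To make the grouping idea concrete, here is a minimal Python sketch, not Alertmanager's actual implementation. It buckets alerts by the values of a chosen set of labels, so that each bucket could become one notification. The function name group_alerts and the sample alerts are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels named in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        # Assumption of this sketch: a missing label counts as an empty string.
        key = tuple(alert.get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts: three instances lose their database connection.
alerts = [
    {"alertname": "DBUnreachable", "cluster": "eu1", "instance": "web-1"},
    {"alertname": "DBUnreachable", "cluster": "eu1", "instance": "web-2"},
    {"alertname": "DBUnreachable", "cluster": "us1", "instance": "web-9"},
]

groups = group_alerts(alerts, ["cluster", "alertname"])
for key, members in groups.items():
    # One notification per group, still listing every affected instance.
    print(key, [a["instance"] for a in members])
```

Grouping by cluster and alertname collapses the three alerts into two notifications while keeping the affected instances visible inside each group.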

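Inhibition

Inhibition suppresses notifications for alerts that are caused by another alert that is already firing (for example, the network is unreachable, which in turn triggers connection alerts for other services).

For example, when an alert fires indicating that an entire cluster is unreachable, Alertmanager can be configured in advance to mute all other alerts that result from it. This prevents notifications for hundreds or thousands of alerts that are unrelated to the actual issue.

Inhibition is also configured through the Alertmanager configuration file.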

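Silences

Silences are a straightforward way to mute alerts for a given period of time. A silence is configured with matchers, just like the routing tree. Incoming alerts are checked against the equality or regular-expression matchers of each active silence. If they match, no notification is sent for that alert.

A visual editor is available to help build routing trees.

Silences are configured in the Alertmanager web interface.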
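
The matching logic described above can be sketched in Python as follows. This is an illustration only: the dictionary layout (name, value, is_regex, starts_at, ends_at) is an assumption of this example and not Alertmanager's internal data model. Regular expressions are treated as fully anchored here.

```python
import re
from datetime import datetime, timedelta, timezone


def matcher_matches(matcher, labels):
    """Check one matcher (equality or anchored regex) against an alert's labels."""
    value = labels.get(matcher["name"], "")
    if matcher.get("is_regex"):
        return re.fullmatch(matcher["value"], value) is not None
    return value == matcher["value"]


def is_silenced(labels, silences, now):
    """True if any silence is active at `now` and all of its matchers match."""
    for silence in silences:
        active = silence["starts_at"] <= now < silence["ends_at"]
        if active and all(matcher_matches(m, labels) for m in silence["matchers"]):
            return True
    return False


# Hypothetical two-hour silence for InstanceDown on the web-N instances.
now = datetime(2019, 11, 8, 12, 0, tzinfo=timezone.utc)
silences = [{
    "matchers": [
        {"name": "alertname", "value": "InstanceDown"},
        {"name": "instance", "value": r"web-\d+", "is_regex": True},
    ],
    "starts_at": now - timedelta(hours=1),
    "ends_at": now + timedelta(hours=1),
}]

print(is_silenced({"alertname": "InstanceDown", "instance": "web-3"}, silences, now))
print(is_silenced({"alertname": "InstanceDown", "instance": "db-1"}, silences, now))
```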

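Alertmanager configuration

Alertmanager is configured via command-line flags and a configuration file. The command-line flags configure immutable system parameters, while the configuration file defines inhibition rules, notification routing, and notification receivers.

To view all available command-line flags, run alertmanager -h.

Alertmanager can reload its configuration at runtime. If the new configuration is not well-formed, the changes are not applied and an error is logged. A reload is triggered by sending SIGHUP to the process or by sending an HTTP POST request to the /-/reload endpoint.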
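
As a minimal sketch of the HTTP variant, the following Python code builds and sends the POST request to /-/reload using only the standard library. The base URL http://localhost:9093 is an assumption (adjust it to wherever your Alertmanager listens), and the helper names are hypothetical.

```python
from urllib.request import Request, urlopen


def build_reload_request(base_url="http://localhost:9093"):
    """Build an HTTP POST request for the /-/reload endpoint (empty body)."""
    return Request(base_url.rstrip("/") + "/-/reload", data=b"", method="POST")


def reload_alertmanager(base_url="http://localhost:9093"):
    """Trigger a configuration reload and return the HTTP status code."""
    with urlopen(build_reload_request(base_url), timeout=5) as response:
        return response.status
```

Calling reload_alertmanager() against a running instance should return the HTTP status of the reload request. If the new configuration is rejected, urlopen raises an HTTPError instead.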

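Configuration file

To specify which configuration file to load, use the -config.file flag. The file is written in YAML and follows the scheme described below. Parameters in brackets are optional. For non-list parameters, the value is set to the specified default.

Generic placeholders are defined as follows: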

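<duration>: a duration matching the regular expression [0-9]+(ms|[smhdwy])
<labelname>: a string matching the regular expression [a-zA-Z_][a-zA-Z0-9_]*
<labelvalue>: a string of Unicode characters
<filepath>: a valid file path
<boolean>: a boolean, either true or false
<string>: a regular string
<tmpl_string>: a string that is template-expanded before use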
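
To illustrate the <duration> placeholder, here is a small Python sketch that validates a value against the regular expression above and converts it to a timedelta. The unit lengths used for w (7 days) and y (365 days) are assumptions of this sketch, and parse_duration is a hypothetical helper, not part of Alertmanager.

```python
import re
from datetime import timedelta

# Same pattern as the <duration> placeholder, with capture groups added.
_DURATION_RE = re.compile(r"([0-9]+)(ms|[smhdwy])")

# Length of each unit in milliseconds (w = 7 days, y = 365 days assumed).
_UNIT_MS = {
    "ms": 1,
    "s": 1000,
    "m": 60 * 1000,
    "h": 60 * 60 * 1000,
    "d": 24 * 60 * 60 * 1000,
    "w": 7 * 24 * 60 * 60 * 1000,
    "y": 365 * 24 * 60 * 60 * 1000,
}


def parse_duration(text):
    """Convert a string such as '30s' or '5m' into a timedelta."""
    match = _DURATION_RE.fullmatch(text)
    if match is None:
        raise ValueError("not a valid duration: %r" % text)
    amount, unit = match.groups()
    return timedelta(milliseconds=int(amount) * _UNIT_MS[unit])


print(parse_duration("5m"))
print(parse_duration("4h"))
```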

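The global section specifies parameters that are valid in all other configuration contexts. They also serve as defaults for other configuration sections and can be overridden there.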

global:
  # ResolveTimeout is the time after which an alert is declared resolved
  # if it has not been updated.
  [ resolve_timeout: <duration> | default = 5m ]
  # The default SMTP From header field.
  [ smtp_from: <tmpl_string> ]
  # The default SMTP smarthost used for sending emails.
  [ smtp_smarthost: <string> ]
  # SMTP authentication information.
  [ smtp_auth_username: <string> ]
  [ smtp_auth_password: <string> ]
  [ smtp_auth_secret: <string> ]
  # The default SMTP TLS requirement.
  [ smtp_require_tls: <boolean> | default = true ]
  # The API URL to use for Slack notifications.
  [ slack_api_url: <string> ]
  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/generic/2010-04-15/create_event.json" ]
  [ opsgenie_api_host: <string> | default = "https://api.opsgenie.com/" ]
# Files from which custom notification template definitions are read.
# The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'.
templates:
  [ - <filepath> ... ]
# The root node of the routing tree.
route: <route>
# A list of notification receivers.
receivers:
  - <receiver> ...
# A list of inhibition rules.
inhibit_rules:
  [ - <inhibit_rule> ... ]

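Routes (route)

A route block defines a node in the routing tree and its children. Optional configuration parameters that are not set are inherited from the parent node.

Every alert enters the routing tree at the configured top-level route, which must match all alerts (that is, it must not have any configured matchers). The alert then traverses the child nodes. If continue is set to false, traversal stops after the first matching child. If continue is true on a matching node, the alert continues to be matched against subsequent siblings. If an alert does not match any child of a node (no child matches, or none exist), the alert is handled according to the configuration of the current node.

Route configuration format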

# The receiver to which matching alerts are sent.
[ receiver: <string> ]
# The labels by which incoming alerts are grouped together.
[ group_by: '[' <labelname>, ... ']' ]
# Whether an alert should continue matching subsequent sibling nodes.
[ continue: <boolean> | default = false ]
# A set of equality matchers an alert has to fulfill to match the node
# and be sent to its receiver.
match:
  [ <labelname>: <labelvalue>, ... ]
# A set of regex-matchers an alert has to fulfill to match the node
# and be sent to its receiver.
match_re:
  [ <labelname>: <regex>, ... ]
# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> ]
# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification
# has already been sent. (Usually ~5min or more.)
[ group_interval: <duration> ]
# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> ]
# Zero or more child routes.
routes:
  [ - <route> ... ]

Example:

# The root route with all parameters, which are inherited by the child
# routes if they are not overwritten.
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  # All alerts that do not match the following child routes
  # will remain at the root node and be dispatched to 'default-receiver'.
  routes:
  # All alerts with service=mysql or service=cassandra
  # are dispatched to the database pager.
  - receiver: 'database-pager'
    group_wait: 10s
    match_re:
      service: mysql|cassandra
  # All alerts with the team=frontend label match this sub-route.
  # They are grouped by product and environment rather than cluster
  # and alertname.
  - receiver: 'frontend-pager'
    group_by: [product, environment]
    match:
      team: frontend

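Inhibition rules (inhibit_rule)

An inhibition rule mutes an alert matching one set of matchers (the target) while another alert exists that matches a second set of matchers (the source). Both alerts must have equal values for the labels listed in equal.

Inhibition rule configuration format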

# Matchers that have to be fulfilled in the alerts to be muted.
target_match:
  [ <labelname>: <labelvalue>, ... ]
target_match_re:
  [ <labelname>: <regex>, ... ]
# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
source_match:
  [ <labelname>: <labelvalue>, ... ]
source_match_re:
  [ <labelname>: <regex>, ... ]
# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]
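
How the three parts of a rule interact can be sketched in Python. This is a simplified illustration rather than Alertmanager's implementation: alerts are plain label dictionaries, regular expressions are treated as fully anchored, and the rule, alert names, and helper functions are hypothetical.

```python
import re


def matches(labels, equal_matchers, regex_matchers):
    """True if labels satisfy every equality and anchored regex matcher."""
    for name, value in equal_matchers.items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in regex_matchers.items():
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def is_inhibited(target, firing, rule):
    """True if some other firing alert is a source that mutes `target`."""
    if not matches(target, rule.get("target_match", {}), rule.get("target_match_re", {})):
        return False
    for source in firing:
        if source is target:
            continue  # in this sketch an alert never inhibits itself
        if not matches(source, rule.get("source_match", {}), rule.get("source_match_re", {})):
            continue
        # The labels listed in `equal` must agree between source and target.
        if all(source.get(n, "") == target.get(n, "") for n in rule.get("equal", [])):
            return True
    return False


# Hypothetical rule: a cluster-wide outage mutes page/warning alerts
# that belong to the same cluster.
rule = {
    "source_match": {"alertname": "ClusterUnreachable"},
    "target_match_re": {"severity": "warning|page"},
    "equal": ["cluster"],
}
outage = {"alertname": "ClusterUnreachable", "cluster": "eu1", "severity": "critical"}
same_cluster = {"alertname": "DBConnectionFailed", "cluster": "eu1", "severity": "page"}
other_cluster = {"alertname": "DBConnectionFailed", "cluster": "us1", "severity": "page"}
firing = [outage, same_cluster, other_cluster]

for alert in firing:
    print(alert["alertname"], alert["cluster"], is_inhibited(alert, firing, rule))
```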

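Receivers (receiver)

As the name suggests, a receiver configures where alert notifications are delivered.

General configuration format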

# The unique name of the receiver.
name: <string>
# Configurations for several notification integrations.
email_configs:
  [ - <email_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
slack_configs:
  [ - <slack_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
webhook_configs:
  [ - <webhook_config>, ... ]

Email receiver (email_config)

# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]
# The email address to send notifications to.
to: <tmpl_string>
# The sender address.
[ from: <tmpl_string> | default = global.smtp_from ]
# The SMTP host through which emails are sent.
[ smarthost: <string> | default = global.smtp_smarthost ]
# The HTML body of the email notification.
[ html: <tmpl_string> | default = '{{ template "email.default.html" . }}' ] 
# Further headers email header key/value pairs. Overrides any headers
# previously set by the notification implementation.
[ headers: { <string>: <tmpl_string>, ... } ]

Slack receiver (slack_config)

# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]
# The Slack webhook URL.
[ api_url: <string> | default = global.slack_api_url ]
# The channel or user to send notifications to.
channel: <tmpl_string>
# API request data as defined by the Slack webhook API.
[ color: <tmpl_string> | default = '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}' ]
[ username: <tmpl_string> | default = '{{ template "slack.default.username" . }}' ]
[ title: <tmpl_string> | default = '{{ template "slack.default.title" . }}' ]
[ title_link: <tmpl_string> | default = '{{ template "slack.default.titlelink" . }}' ]
[ pretext: <tmpl_string> | default = '{{ template "slack.default.pretext" . }}' ]
[ text: <tmpl_string> | default = '{{ template "slack.default.text" . }}' ]
[ fallback: <tmpl_string> | default = '{{ template "slack.default.fallback" . }}' ]

Webhook receiver (webhook_config)

# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]
# The endpoint to send HTTP POST requests to.
url: <string>

Alertmanager sends HTTP POST requests to the configured endpoint in the following format:

{
  "version": "3",
  "groupKey": <number>,     // key identifying the group of alerts (e.g. to deduplicate)
  "status": "<resolved|firing>",
  "receiver": <string>,
  "groupLabels": <object>,
  "commonLabels": <object>,
  "commonAnnotations": <object>,
  "externalURL": <string>,  // backlink to the Alertmanager.
  "alerts": [
    {
      "labels": <object>,
      "annotations": <object>,
      "startsAt": "<rfc3339>",
      "endsAt": "<rfc3339>"
    },
    ...
  ]
}
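
A receiving endpoint typically turns this payload into something human-readable. The following Python sketch shows one way to do that. The sample payload, its label values, and the output format are invented for illustration, and summarize is a hypothetical helper.

```python
import json


def summarize(payload):
    """Return one human-readable line per alert in a webhook payload."""
    status = payload.get("status", "unknown").upper()
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append("[%s] %s (%s): %s" % (
            status,
            labels.get("alertname", "<unnamed>"),
            labels.get("instance", "n/a"),
            annotations.get("summary", ""),
        ))
    return lines


# Hypothetical payload following the field layout described above.
SAMPLE = json.loads("""
{
  "version": "3",
  "groupKey": 1234567890,
  "status": "firing",
  "receiver": "webhook-demo",
  "groupLabels": {"alertname": "InstanceDown"},
  "commonLabels": {"alertname": "InstanceDown", "severity": "page"},
  "commonAnnotations": {},
  "externalURL": "http://alertmanager.example:9093",
  "alerts": [
    {
      "labels": {"alertname": "InstanceDown", "instance": "web-1:9100", "severity": "page"},
      "annotations": {"summary": "Instance web-1:9100 down"},
      "startsAt": "2019-11-08T12:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ]
}
""")

for line in summarize(SAMPLE):
    print(line)
```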

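A DingTalk webhook can be added to deliver alerts through DingTalk. Because DingTalk expects the POST data in its own format, a simple forwarding script is needed: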

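import json
from urllib.request import Request, urlopen

from flask import Flask, request

app = Flask(__name__)

# DingTalk robot webhook; replace xxxx with the robot's access token.
DINGTALK_URL = 'https://oapi.dingtalk.com/robot/send?access_token=xxxx'


@app.route('/', methods=['POST'])
def send():
    # Forward the raw Alertmanager payload to DingTalk as a text message.
    post_data = request.get_data(as_text=True)
    alert_data(post_data)
    # Flask requires a response body; returning None would raise an error.
    return 'ok'


def alert_data(data):
    # json.dumps escapes the payload so the request body is valid JSON.
    send_data = json.dumps({'msgtype': 'text', 'text': {'content': data}})
    req = Request(DINGTALK_URL, data=send_data.encode('utf-8'))
    req.add_header('Content-Type', 'application/json')
    return urlopen(req).read()


if __name__ == '__main__':
    app.run(host='0.0.0.0')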

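Alerting rules

Alerting rules let you define alert conditions based on Prometheus expression language expressions and send notifications about firing alerts to an external service.

Defining alerting rules

Alerting rules are defined in the following format: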

ALERT <alert name>
  IF <expression>
  [ FOR <duration> ]
  [ LABELS <label set> ]
  [ ANNOTATIONS <label set> ]

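The optional FOR clause makes Prometheus wait for a certain duration between first encountering a new expression output vector element (such as an instance with a high HTTP error rate) and counting an alert as firing for that element. Elements that are active but not yet firing are in the pending state.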
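
The resulting state transitions can be sketched in a few lines of Python. This is only a model of the behaviour described above. Times are plain seconds, and alert_state is a hypothetical helper.

```python
def alert_state(active_since, now, for_seconds):
    """Classify one output element of an alerting rule.

    active_since: time (in seconds) at which the expression first returned
                  this element, or None if it is currently absent.
    now:          current evaluation time in seconds.
    for_seconds:  the FOR duration in seconds (0 if no FOR clause is given).
    """
    if active_since is None:
        return "inactive"
    if now - active_since >= for_seconds:
        return "firing"
    return "pending"


# With FOR 5m, an instance that went down at t=0 is pending until t=300.
for t in (0, 120, 300, 600):
    print(t, alert_state(active_since=0, now=t, for_seconds=300))
```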

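The LABELS clause specifies a set of additional labels to attach to the alert. Any existing conflicting labels are overwritten. Label values can be templated.

The ANNOTATIONS clause specifies labels used to store longer additional information, such as alert descriptions or links. Annotation values can be templated.

Templating: label and annotation values can be templated using console templates. The $labels variable holds the label key/value pairs of an alert instance, and $value holds the evaluated value of an alert instance.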

# To insert a firing element's label values:
{{ $labels.<labelname> }}
# To insert the numeric expression value of the firing element:
{{ $value }}
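
To show what the expansion produces, here is a toy Python substitute for the template engine. It handles only the two placeholders shown above, using regular expressions, and is not how Prometheus evaluates templates. The helper name expand and the sample values are hypothetical.

```python
import re

_LABEL_RE = re.compile(r"\{\{\s*\$labels\.([a-zA-Z_][a-zA-Z0-9_]*)\s*\}\}")
_VALUE_RE = re.compile(r"\{\{\s*\$value\s*\}\}")


def expand(template, labels, value):
    """Replace {{ $labels.<name> }} and {{ $value }} in a template string."""
    # Assumption of this sketch: unknown labels expand to an empty string.
    text = _LABEL_RE.sub(lambda m: labels.get(m.group(1), ""), template)
    return _VALUE_RE.sub(lambda m: str(value), text)


labels = {"instance": "web-1:9100", "job": "node"}
print(expand("Instance {{ $labels.instance }} down", labels, 0))
print(expand("{{ $labels.instance }} of job {{ $labels.job }} "
             "(current value: {{ $value }}s)", labels, 1.5))
```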

Alerting rule examples:

# Alert for any instance that is unreachable for >5 minutes.
ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS { severity = "page" }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} down",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.",
  }
# Alert for any instance that has a median request latency >1s.
ALERT APIHighRequestLatency
  IF api_http_request_latencies_second{quantile="0.5"} > 1
  FOR 1m
  ANNOTATIONS {
    summary = "High request latency on {{ $labels.instance }}",
    description = "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)",
  }

Inspecting alerts during runtime

To manually inspect which alerts are active (pending or firing), open the "Alerts" tab in the web interface of your Prometheus instance.

For pending and firing alerts, Prometheus also stores synthetic time series of the form ALERTS{alertname="<alert name>", alertstate="pending|firing", <additional alert labels>}. The sample value is set to 1 as long as the alert is in the indicated active (pending or firing) state, and a single 0 value gets written out when an alert transitions from active to inactive state. Once inactive, the time series does not get further updates.
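
These synthetic series can be queried like any other time series. As a sketch, the following Python code builds a request URL for the Prometheus HTTP query API (/api/v1/query). The base URL http://localhost:9090 is an assumption about where Prometheus listens, and alerts_query_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def alerts_query_url(base_url="http://localhost:9090", alertstate="firing"):
    """Build a query URL selecting the ALERTS series in the given state."""
    query = 'ALERTS{alertstate="%s"}' % alertstate
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": query})


print(alerts_query_url())
print(alerts_query_url(alertstate="pending"))
```

Fetching the resulting URL with any HTTP client returns the matching series as JSON.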

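Sending alert notifications

Prometheus's alerting rules are good at figuring out what is broken right now, but they are not a complete notification solution. Another layer is needed on top of the simple alert definitions to add summarization, notification rate limiting, silencing, and alert dependencies. In the Prometheus ecosystem, Alertmanager takes on this role.

Prometheus can therefore periodically send information about alert states to an Alertmanager instance, which then takes care of dispatching the right notifications. The Alertmanager instance is specified via the -alertmanager.url command-line flag.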

Links: https://www.jianshu.com/p/239b145e2acc https://www.jianshu.com/p/b9dcdaa117c7
