Please credit the source when reposting. Original post: http://tailnode.tk/2017/03/al...html
This article covers the alerting and notification rules of Prometheus and Alertmanager. The Prometheus configuration file is prometheus.yml and the Alertmanager configuration file is alertmanager.yml.

- Alerting: Prometheus sends a detected abnormal event to Alertmanager. This does not mean sending an email notification.
- Notification: Alertmanager sends out a notification for such an event (email, webhook, etc.).
Specify the rule evaluation interval in prometheus.yml:

```yaml
# How frequently to evaluate rules.
[ evaluation_interval: <duration> | default = 1m ]
```
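For instance, a global block in prometheus.yml might look like this (the values are illustrative, not a recommendation):

```yaml
global:
  scrape_interval: 15s     # how often targets are scraped
  evaluation_interval: 30s # how often rules are evaluated
```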
Specify the rule files in prometheus.yml (globs are supported, e.g. rules/*.rules):

```yaml
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - rules/mengyuan.rules
```
Add mengyuan.rules in the rules directory:

```
ALERT goroutines_gt_70
  IF go_goroutines > 70
  FOR 5s
  LABELS { status = "yellow" }
  ANNOTATIONS {
    summary = "goroutines over 70, current value {{ $value }}",
    description = "current instance {{ $labels.instance }}",
  }

ALERT goroutines_gt_90
  IF go_goroutines > 90
  FOR 5s
  LABELS { status = "red" }
  ANNOTATIONS {
    summary = "goroutines over 90, current value {{ $value }}",
    description = "current instance {{ $labels.instance }}",
  }
```
Once the configuration files are in place, Prometheus has to re-read them. There are two ways to do this (see the sketch after the list):

- Send a POST request to /-/reload via the HTTP API, e.g. curl -X POST http://localhost:9090/-/reload
- Send a SIGHUP signal to the Prometheus process
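A quick sketch of both methods from a shell, assuming Prometheus listens on its default port and the binary is named prometheus:

```bash
# Method 1: the HTTP reload endpoint
curl -X POST http://localhost:9090/-/reload

# Method 2: SIGHUP (pidof may return several PIDs; kill accepts them all)
kill -HUP $(pidof prometheus)
```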
Compare the resulting email notification with the rules (alertmanager.yml still has to be configured before any email is delivered).
Set up the route and receivers sections of alertmanager.yml:
```yaml
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 5s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 1m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3h

  # A default receiver
  receiver: mengyuan

receivers:
- name: 'mengyuan'
  webhook_configs:
  - url: http://192.168.0.53:8080
  email_configs:
  - to: 'mengyuan@tenxcloud.com'
```
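The receiver above posts to a webhook at http://192.168.0.53:8080. A minimal sketch of a Go server that could sit there and log incoming alerts might look like the following; the structs keep only a subset of the fields in Alertmanager's webhook payload, and the exact field set can vary between Alertmanager versions:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// Alert keeps a subset of the per-alert fields in the webhook payload.
type Alert struct {
	Status      string            `json:"status"`
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    time.Time         `json:"startsAt"`
	EndsAt      time.Time         `json:"endsAt"`
}

// Message is the top-level body Alertmanager posts to the webhook.
type Message struct {
	Version     string            `json:"version"`
	Status      string            `json:"status"`
	Receiver    string            `json:"receiver"`
	GroupLabels map[string]string `json:"groupLabels"`
	Alerts      []Alert           `json:"alerts"`
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		var msg Message
		if err := json.NewDecoder(r.Body).Decode(&msg); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		for _, a := range msg.Alerts {
			log.Printf("[%s] labels=%v annotations=%v", a.Status, a.Labels, a.Annotations)
		}
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```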
The route property defines the dispatch policy for alerts. It is a tree structure, and incoming alerts are matched against it depth-first, left to right:
```go
// Match does a depth-first left-to-right search through the route tree
// and returns the matching routing nodes.
func (r *Route) Match(lset model.LabelSet) []*Route {
```
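In the configuration, this tree is built from nested routes blocks. A sketch of a two-level tree follows; the child receiver names are hypothetical and would have to exist under receivers:

```yaml
route:
  receiver: mengyuan        # default receiver at the root
  group_by: ['alertname']
  routes:                   # child routes, tried depth-first, left to right
  - match:
      status: red
    receiver: oncall        # hypothetical receiver
  - match:
      status: yellow
    receiver: mengyuan
```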
An Alert is the report that Alertmanager receives; its type is defined as follows.
```go
// Alert is a generic representation of an alert in the Prometheus eco-system.
type Alert struct {
	// Label value pairs for purpose of aggregation, matching, and disposition
	// dispatching. This must minimally include an "alertname" label.
	Labels LabelSet `json:"labels"`

	// Extra key/value information which does not define alert identity.
	Annotations LabelSet `json:"annotations"`

	// The known time range for this alert. Both ends are optional.
	StartsAt     time.Time `json:"startsAt,omitempty"`
	EndsAt       time.Time `json:"endsAt,omitempty"`
	GeneratorURL string    `json:"generatorURL"`
}
```
Only alerts with identical Labels (both keys and values the same) are considered the same alert. A single rule configured in a Prometheus rules file can therefore produce several distinct alerts.
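Alert identity boils down to a fingerprint of the full label set; a minimal sketch using the prometheus/common/model package, which the Prometheus ecosystem uses for this:

```go
package main

import (
	"fmt"

	"github.com/prometheus/common/model"
)

func main() {
	a := model.LabelSet{"alertname": "test1", "label1": "l1"}
	b := model.LabelSet{"alertname": "test1", "label1": "l1"}
	c := model.LabelSet{"alertname": "test1", "label1": "l2"}

	// Identical key/value pairs yield the same fingerprint, so a and b
	// count as the same alert; c is a different one.
	fmt.Println(a.Fingerprint() == b.Fingerprint()) // true
	fmt.Println(a.Fingerprint() == c.Fingerprint()) // false
}
```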
Alertmanager groups alerts according to the group_by configuration. With the rules below, when go_goroutines equals 4 three alerts fire, and Alertmanager splits them into two groups when notifying the receivers.
```
ALERT test1
  IF go_goroutines > 1
  LABELS {label1="l1", label2="l2", status="test"}

ALERT test2
  IF go_goroutines > 2
  LABELS {label1="l2", label2="l2", status="test"}

ALERT test3
  IF go_goroutines > 3
  LABELS {label1="l2", label2="l1", status="test"}
```
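The two-group split follows if the grouping label is label1 (an assumption; the post does not show the group_by used for this test): test1 lands in the group label1="l1", while test2 and test3 share label1="l2". A toy sketch of that projection:

```go
package main

import (
	"fmt"

	"github.com/prometheus/common/model"
)

// groupKey projects an alert's labels onto the group_by labels, which is
// roughly how Alertmanager decides which group an alert belongs to.
func groupKey(labels model.LabelSet, groupBy []model.LabelName) string {
	key := model.LabelSet{}
	for _, n := range groupBy {
		if v, ok := labels[n]; ok {
			key[n] = v
		}
	}
	return key.String()
}

func main() {
	alerts := []model.LabelSet{
		{"alertname": "test1", "label1": "l1", "label2": "l2", "status": "test"},
		{"alertname": "test2", "label1": "l2", "label2": "l2", "status": "test"},
		{"alertname": "test3", "label1": "l2", "label2": "l1", "status": "test"},
	}
	groupBy := []model.LabelName{"label1"} // assumption: grouping on label1

	groups := map[string][]model.LabelSet{}
	for _, a := range alerts {
		k := groupKey(a, groupBy)
		groups[k] = append(groups[k], a)
	}
	fmt.Println(len(groups), "groups") // prints: 2 groups
}
```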
Handling of an incoming alert works as follows (a toy restatement follows the list):

- When an alert is received, its labels determine which routes it belongs to (several routes may match; one route holds multiple groups, and one group holds multiple alerts).
- The alert is assigned to a group; if no matching group exists, a new one is created.
- A new group waits for the duration given by group_wait (alerts belonging to the same group may arrive during the wait), determines from resolve_timeout whether each alert is resolved, and then sends the notification.
- An existing group waits for the duration given by group_interval, checks whether its alerts are resolved, and sends a notification when the interval since the last notification exceeds repeat_interval or the group has been updated.
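A toy restatement of the notification decision for an existing group, a sketch of the rules above rather than Alertmanager's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// shouldNotify: an existing group notifies when it has been updated, or
// when repeat_interval has elapsed since the last notification.
func shouldNotify(lastSent time.Time, repeatInterval time.Duration, updated bool) bool {
	return updated || time.Since(lastSent) > repeatInterval
}

func main() {
	lastSent := time.Now().Add(-4 * time.Hour)
	fmt.Println(shouldNotify(lastSent, 3*time.Hour, false))   // true: repeat_interval elapsed
	fmt.Println(shouldNotify(time.Now(), 3*time.Hour, true))  // true: group updated
}
```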
Still to be looked into:

- the impact of a restart on sending alerts and notifications
- whether multiple instances can form a cluster