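Alerting is a very important part of any monitoring system: the system has to send out an event the moment a failure occurs and notify the people concerned. Prometheus provides Alertmanager for this purpose.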
Step 1: in the prometheus.yml configuration file, configure the Alertmanager address.
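As a minimal sketch of what this step could look like, the fragment below points Prometheus at a single Alertmanager instance. The address `localhost:9093` is an assumption (9093 is Alertmanager's default port) and should be replaced with the real host and port.

```yaml
# prometheus.yml (fragment)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # Assumed address; replace with your Alertmanager host:port.
            - 'localhost:9093'
```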
Step 2: write the triggers (alerting rules), that is, the conditions under which an alert is generated.
In prometheus.yml, fill in the path to the rule file.
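A sketch of the corresponding prometheus.yml entry might look like the following. The relative path is an assumption that alert_rule.yml sits next to prometheus.yml; adjust it to wherever the file actually lives.

```yaml
# prometheus.yml (fragment)
rule_files:
  # Assumed location of the rule file; adjust as needed.
  - 'alert_rule.yml'
```

Before reloading Prometheus, the rule file can be checked with `promtool check rules alert_rule.yml`.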
Contents of alert_rule.yml:
groups:
  - name: node
    rules:
      - alert: node_cpu>80%
        expr: (1-rate(node_cpu_seconds_total{mode="idle"}[1m]))*100 > 80
        labels:
          severity: 3
      - alert: node_mem_availble<20%
        expr: node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes*100 < 20
        labels:
          severity: 3
      - alert: node_cpu_load>10
        expr: node_load1 > 10
        labels:
          severity: 3
      - alert: node_disk<20%
        expr: node_filesystem_avail_bytes{device!='nsfs'}/node_filesystem_size_bytes{device!='nsfs'}*100 < 20
        labels:
          severity: 3
  - name: docker
    rules:
      - alert: docker_cpu>50%
        expr: rate(container_cpu_usage_seconds_total{image!=''}[1m])*100 > 50
        labels:
          severity: 3
      - alert: docker_restarted
        expr: changes(container_start_time_seconds[1m]) != 0
        labels:
          severity: 4
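Here expr is the condition that produces an alert: when the expression holds, the alert fires. The labels below it are attached to the alert. In this case one label is added, the alert level severity, which can be set to a custom value from 1 to 5 to distinguish alerts of different levels.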
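To illustrate how the severity label separates levels, here is a hypothetical extra rule group that is not part of the original rule file. The group name `example` and the alert name `target_down` are made up for this sketch. `up` is the metric Prometheus records for every scrape target, and it is 0 when the scrape fails.

```yaml
groups:
  - name: example
    rules:
      # Hypothetical rule for illustration only.
      - alert: target_down
        # Fires when Prometheus cannot scrape a target.
        expr: up == 0
        labels:
          # Highest level in the 1-5 scheme.
          severity: 5
```

With the Alertmanager routing configured in step 3, severity 5 is the level that is additionally delivered to the 'message' and 'call' receivers.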
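Step 3: decide how generated alerts are handled. Should a message be sent? To whom? Through which channel? All of this is configured in the alertmanager.yml configuration file.
Its contents are as follows: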
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'wechat'
  routes:
    - match_re:
        severity: 1|2|3|4|5
      receiver: 'wechat'
      continue: true
    - match:
        severity: 5
      receiver: 'message'
      continue: true
    - match:
        severity: 5
      receiver: 'call'
      continue: true
receivers:
  - name: 'wechat'
    webhook_configs:
      - url: 'http://localhost/alert_wechat'
  - name: 'message'
    webhook_configs:
      - url: 'http://localhost/alert_message'
  - name: 'call'
    webhook_configs:
      - url: 'http://localhost/alert_call'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
The receiver type used here is the webhook: Alertmanager POSTs the alert content to the specified URL.