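This setup monitors Pulsar with Prometheus + prometheus-webhook-dingtalk + Alertmanager and sends alerts through DingTalk, so that a Pulsar backlog or failure can be handled right away. For installation details, see the earlier blog post "Monitoring Pulsar with Prometheus + Grafana + Alertmanager and Sending DingTalk Alerts".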
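In the non-production environment, Pulsar backlog alerts worked normally. In production, however, the backlog crossed the alert threshold and the backlog alert fired, but DingTalk never received it.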
My first thought was that Alertmanager itself might be at fault. To test whether Alertmanager could send alerts at all, I changed the threshold in the alert rules file from 0 to 1:
groups:
- name: node
  rules:
  - alert: InstanceDown
    expr: up == 1   # temporarily changed from "up == 0" so the alert fires for this test
    for: 1m
    labels:
      status: danger
    annotations:
      summary: "Instance {{ $labels.instance }} down."
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minutes."
DingTalk received this test alert, which shows that the alerting pipeline itself works.
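Next, I checked the configuration files of every component: Prometheus, prometheus-webhook-dingtalk, and Alertmanager. All of the settings looked correct. Restarting every component did not help either. While restarting, I watched each component's logs. The Prometheus log looked normal, but the prometheus-webhook-dingtalk log showed something unusual: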
level=info ts=2021-08-12T06:12:23.622Z caller=entry.go:22 component=web http_scheme=http http_proto=HTTP/1.1 http_method=POST remote_addr=10.7.7.48:7510 user_agent=Alertmanager/0.21.0 uri=http://10.7.7.28:8060/dingtalk/webhook1/send resp_status=400 resp_bytes_length=27 resp_elapsed_ms=228.690575 msg="request complete"
level=error ts=2021-08-12T06:12:33.620Z caller=dingtalk.go:103 component=web target=webhook1 msg="Failed to send notification to DingTalk" respCode=460101 respMsg="message too long, exceed 20000 bytes"
The error msg="Failed to send notification to DingTalk" respCode=460101 respMsg="message too long, exceed 20000 bytes" says that the alert message exceeded DingTalk's 20000-byte limit, so it could not be sent.
I then went to the Prometheus console to look at the metric behind the alert.
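The expression matched 60 results, each carrying many labels. Sending all of that in a single message exceeded DingTalk's size limit, which is why DingTalk never received the alert.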
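The reason one message grows this large is that Alertmanager batches all firing alerts of a group into a single webhook request. As a rough sketch (not the configuration used in this post), the Alertmanager settings that control this batching look like the following. The receiver name `dingtalk` and all timing values are assumptions for illustration. The URL is the one that appears in the log above.

```yaml
# alertmanager.yml (illustrative fragment, values are assumptions)
route:
  receiver: dingtalk              # hypothetical receiver name
  # Alerts sharing these label values are sent together in one notification.
  # Adding "topic" would yield one smaller notification per topic.
  group_by: ['alertname', 'topic']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: dingtalk
    webhook_configs:
      - url: http://10.7.7.28:8060/dingtalk/webhook1/send
        # Caps the alerts included per webhook message; extra alerts are
        # truncated, so use with care. The default 0 includes all alerts.
        max_alerts: 20
```

Grouping more finely produces more, smaller notifications, whereas `max_alerts` silently drops the overflow, so which trade-off is acceptable depends on the environment.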
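Once the cause is known to be an oversized alert payload, the fix is straightforward: shrink the payload and the alert goes out normally. In the rules file, use a PromQL expression to filter out the series that are not needed: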
- alert: TooManyBacklogsOnTopic
  expr: pulsar_msg_backlog{job="node-broker"} > 40000
  for: 30s
  labels:
    status: warning
  annotations:
    #summary: "Backlogs of topic are more than 50000."
    #description: "Backlogs of topic {{ $labels.topic }} is more than 50000 , current value is {{ $value }}."
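The selector matches the job label exactly (change the job value to whatever your own setup uses), which filters out the unnecessary series. The expression now matches 30 results, half as many as before.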
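Filtering by job reduces the number of alerts. A complementary idea, offered here only as a sketch and not as what this post deployed, is to reduce the labels each alert carries by aggregating in PromQL. The group name `pulsar-backlog` is hypothetical, and the sketch assumes the metric has a `topic` label, as the commented-out description above suggests.

```yaml
groups:
  - name: pulsar-backlog          # hypothetical group name
    rules:
      - alert: TooManyBacklogsOnTopic
        # "sum by (topic)" keeps only the topic label on each result,
        # so every alert carries far fewer labels than the raw series.
        expr: sum by (topic) (pulsar_msg_backlog{job="node-broker"}) > 40000
        for: 30s
        labels:
          status: warning
```

Note that this sums the backlog across all series that share a topic, so the threshold may need adjusting if a topic is reported by more than one series.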
Now let's confirm that DingTalk receives the alert.
DingTalk received the alert. Problem solved!