Alertmanager DingTalk alerts fail with a 400 error

Date: 2021-08-12


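Background

The setup uses Prometheus, prometheus-webhook-dingtalk, and Alertmanager to monitor Pulsar and send alerts through DingTalk, so that a Pulsar backlog or failure can be handled as soon as it occurs. For installation details, see the earlier blog post "Monitoring Pulsar with Prometheus + Grafana + Alertmanager and Sending DingTalk Alerts".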

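Symptom

In the non-production environment, Pulsar backlog alerts were delivered normally. In production, with the backlog alert threshold configured, the backlog alert was triggered, but DingTalk never received it.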

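Troubleshooting steps

The first suspicion was Alertmanager itself, so we started by testing whether Alertmanager could send any alert at all. In the alert rule file, change the comparison value from 0 to 1 so that the InstanceDown rule fires for every instance that is up: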

groups:
  - name: node
    rules:
      - alert: InstanceDown
        # Temporarily "up == 1" instead of "up == 0" to force the alert to fire
        expr: up == 1
        for: 1m
        labels:
          status: danger
        annotations:
          summary: "Instance {{ $labels.instance }} down."
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

The alert arrived in DingTalk, which shows that the alerting pipeline itself works.
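As an alternative to editing a rule, a test alert can be pushed straight into Alertmanager through its v2 HTTP API (`POST /api/v2/alerts`). The following is only a minimal sketch: the alert name `ManualTest`, the annotation text, and the helper names `build_test_alert` and `send_test_alert` are hypothetical and do not come from the original setup.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib import request


def build_test_alert(alertname="ManualTest", minutes=5):
    """Build a one-alert payload in the shape the Alertmanager v2 API accepts."""
    now = datetime.now(timezone.utc)
    return [{
        # Label and annotation values are placeholders for illustration.
        "labels": {"alertname": alertname, "status": "danger"},
        "annotations": {"summary": "Manual test alert"},
        "startsAt": now.isoformat(),
        # The alert resolves on its own once endsAt has passed.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_test_alert(base_url, payload):
    """POST the payload to <base_url>/api/v2/alerts and return the HTTP status."""
    req = request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Usage would look like `send_test_alert("http://localhost:9093", build_test_alert())`, assuming Alertmanager listens on its default port 9093 on the local host; adjust the address to your own deployment.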

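Next, we checked the configuration files of all components (Prometheus, prometheus-webhook-dingtalk, and Alertmanager), and everything looked correct. Restarting all components did not fix the problem. During the restart we checked each component's logs: the Prometheus logs were normal, but the prometheus-webhook-dingtalk log showed an error: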

level=info ts=2021-08-12T06:12:23.622Z caller=entry.go:22 component=web http_scheme=http http_proto=HTTP/1.1 http_method=POST remote_addr=10.7.7.48:7510 user_agent=Alertmanager/0.21.0 uri=http://10.7.7.28:8060/dingtalk/webhook1/send resp_status=400 resp_bytes_length=27 resp_elapsed_ms=228.690575 msg="request complete"
level=error ts=2021-08-12T06:12:33.620Z caller=dingtalk.go:103 component=web target=webhook1 msg="Failed to send notification to DingTalk" respCode=460101 respMsg="message too long, exceed 20000 bytes"

The entry msg="Failed to send notification to DingTalk" respCode=460101 respMsg="message too long, exceed 20000 bytes" says that the alert message is too long, so the message could not be sent.
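To get a feel for the 20000-byte figure in the log, the size of a rendered message can be measured in bytes rather than characters. The sketch below is an assumption-laden illustration: it measures the UTF-8 size of the text only, while the log does not say exactly which part of the request DingTalk counts. The names `payload_size` and `exceeds_limit` are hypothetical helpers.

```python
# Limit taken from the respMsg in the log ("exceed 20000 bytes").
DINGTALK_LIMIT_BYTES = 20000


def payload_size(text):
    """Return the size of the text in bytes when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def exceeds_limit(text, limit=DINGTALK_LIMIT_BYTES):
    """Return True if the UTF-8 encoded text is larger than the limit."""
    return payload_size(text) > limit
```

Note that non-ASCII text grows faster than its character count suggests: each common Chinese character takes 3 bytes in UTF-8, so a message with many Chinese labels or annotations reaches the limit sooner.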

Next, check the alert expression in the Prometheus console.

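The expression matched 60 results, each carrying many labels. Sending all of that information at once exceeded DingTalk's size limit, which is why DingTalk never received the alert.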
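The same check can be done outside the console. The Prometheus HTTP API returns instant-query results as JSON under `data.result`, one entry per series, each with a `metric` label map. The sketch below counts the series and roughly estimates how many bytes their labels take up; the function name `summarize_result` is hypothetical, and the byte count is only an approximation of what ends up in the notification, since the actual message depends on the notification template.

```python
import json


def summarize_result(response):
    """Return (series_count, label_bytes) for an instant-query response.

    `response` is the parsed JSON body of GET /api/v1/query.
    """
    result = response["data"]["result"]
    # Serialize each label set to JSON and measure its UTF-8 size.
    label_bytes = sum(
        len(json.dumps(series["metric"], ensure_ascii=False).encode("utf-8"))
        for series in result
    )
    return len(result), label_bytes
```

If the label bytes alone approach the 20000-byte limit reported in the prometheus-webhook-dingtalk log, the notification is likely to be rejected once the template text is added on top.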

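Solution

Once the cause is known (the alert content is too large), the fix is straightforward: reduce the alert content and the alert can be sent normally. In the rule file, use a PromQL label matcher to filter out the values that are not needed: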

      - alert: TooManyBacklogsOnTopic
        expr: pulsar_msg_backlog{job="node-broker"} > 40000
        for: 30s
        labels:
          status: warning
        annotations:
          #summary: "Backlogs of topic are more than 50000."
          #description: "Backlogs of topic {{ $labels.topic }} is more than 50000 , current value is {{ $value }}."

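The matcher selects series by an exact match on the job label; change the job value to the one used in your own environment. With the unneeded values filtered out, the expression now matches 30 results, half as many as before.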

Now let's confirm that DingTalk receives the alert.

DingTalk received the alert. Problem solved!