DingTalk alerting with Prometheus and Alertmanager

alertmanager

Alertmanager can run on a remote server.

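Alerting mechanism

You define monitoring rules in Prometheus, that is, triggers that fire an alert when a value exceeds the configured threshold. Prometheus pushes the firing alerts to Alertmanager, which runs them through a series of processing steps and then delivers them to the recipients.

Installation and configuration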

wget https://github.com/prometheus/alertmanager/releases/download/v0.19.0/alertmanager-0.19.0.linux-amd64.tar.gz
tar zxf alertmanager-0.19.0.linux-amd64.tar.gz
mv alertmanager-0.19.0.linux-amd64  /usr/local/alertmanager && cd /usr/local/alertmanager && ls

Configuration file
cat alertmanager.yml

global:
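  resolve_timeout: 5m   ## global setting: an alert is declared resolved if it has not been updated for this long

route:
  group_by: ['alertname']  ## labels used to group alerts
  group_wait: 10s          ## wait 10s after a group's first alert so that related alerts go out in the same notification
  group_interval: 10s    ## wait before notifying about new alerts added to a group that has already been notified
  repeat_interval: 1h    ## re-send interval for alerts that are still firing (set to 1h here)
  receiver: 'web.hook'   ## receiver

## receivers
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'

## inhibition rules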
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

Email receiver configuration

cat alertmanager.yml 
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'         # SMTP server address
  smtp_from: 'xxx@163.com'                  # sender address
  smtp_auth_username: 'xxx@163.com'         # authentication username
  smtp_auth_password: 'xxxx'                # authentication password
  smtp_require_tls: false                   # disable TLS

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m
  receiver: 'email'                      # name of the receiver that gets the alerts
receivers:
- name: 'email'                          # receiver name
  email_configs:                         # email settings
  - to: 'xx@xxx.com'                     # recipient

Check the configuration file
./amtool check-config alertmanager.yml

Run as a systemd service

cat > /usr/lib/systemd/system/alertmanager.service <<EOF
[Unit]
Description=alertmanager

[Service]
Restart=on-failure
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml

[Install]
WantedBy=multi-user.target
EOF

Integrating with Prometheus

alerting:
  alertmanagers:
  - static_configs:
    - targets:
       - 127.0.0.1:9093   ## Alertmanager address

rule_files:
  - "rules/*.yml"         ## alerting rule files

Configuring alerting rules
Rule directory: /usr/local/prometheus/rules

/usr/local/prometheus/rules]# cat example.yml
groups:
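- name: exports.rules     ## name of this rule group; rules of the same kind, here all checking whether exporters are up
  rules:

  - alert: ExporterDown    ## alert name
    expr: up == 0        ## alert expression: watches the up metric; 0 means the monitored target is not up
    for: 1m              ## fire only after the value has been 0 for one minute
    labels:              ## labels; used here for the alert severity
      severity: ERROR
    annotations:         ## notification text; uses the values of $labels.instance and $labels.job
      summary: "Instance {{ $labels.instance }} is down"
      description: "Instance {{ $labels.instance }} of job {{ $labels.job }} is down"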


Template variables used in the rule:

{{ $labels.instance }}  # the instance label value taken from the up series
{{ $labels.job }}       # the job label value
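
To make the substitution concrete, here is a minimal Python sketch that mimics how `{{ $labels.instance }}`-style placeholders are filled from an alert's labels. It is not the Prometheus template engine (Prometheus uses Go templates); `render_annotation` and the sample label values are hypothetical and only illustrate the idea.

```python
import re

# Matches placeholders such as {{ $labels.instance }} or {{ $value }}.
_PLACEHOLDER = re.compile(r"\{\{\s*\$(labels\.(\w+)|value)\s*\}\}")


def render_annotation(template, labels, value=None):
    """Fill {{ $labels.<name> }} and {{ $value }} placeholders (illustration only)."""
    def substitute(match):
        if match.group(1) == "value":
            return "" if value is None else str(value)
        # Unknown labels render as an empty string in this sketch.
        return labels.get(match.group(2), "")
    return _PLACEHOLDER.sub(substitute, template)


# Hypothetical label set, shaped like the labels of an `up` series.
labels = {"instance": "192.168.1.145:9100", "job": "node"}
summary = render_annotation("Instance {{ $labels.instance }} is down", labels)
print(summary)
```

Each alert carries its own label set, so the same rule produces a different summary for every instance that fires.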

Alerts with the same alert name, that is, the same alertname (taken from the alert field in the rule file), are merged into a single email.

Alert routing

The routing policy is set in the Alertmanager configuration file.

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m
  receiver: 'email'

Routing example

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'xxx@163.com'
  smtp_auth_username: 'xxx@163.com'
  smtp_auth_password: 'xxx'
  smtp_require_tls: false

route:
  receiver: 'default-receiver'                  ## default receiver, used when no child route matches
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]                ## grouping labels
  routes:                                       ## child routes
  - receiver: 'database-pager'                  ## receiver name
    group_wait: 10s                             ## group wait for this route
    match_re:                                   ## regex match
      service: mysql|cassandra                  ## matches alerts whose service label is mysql or cassandra
  - receiver: 'frontend-pager'                  ## receiver name
    group_by: [product, environment]            ## grouping labels
    match:                                      ## exact match
      team: frontend                            ## matches alerts whose team label is frontend
receivers:                                      ## receiver definitions
- name: 'default-receiver'                      ## receiver name
  email_configs:                                ## email settings
  - to: 'xxx.xx.com'                            ## recipient; the entries below follow the same pattern
- name: 'database-pager'
  email_configs:
  - to: 'xxx.xx.com'
- name: 'frontend-pager'
  email_configs:
  - to: 'xxx@.xx.com'
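
The way the route tree in this example picks a receiver can be sketched in a few lines of Python. This is a simplified model, not Alertmanager's code: it handles one level of child routes and ignores options such as `continue`; `matches` and `pick_receiver` are hypothetical names.

```python
import re


def matches(route, labels):
    """Return True if every matcher of the route is satisfied by the labels."""
    for name, expected in route.get("match", {}).items():
        if labels.get(name, "") != expected:
            return False
    for name, pattern in route.get("match_re", {}).items():
        # Alertmanager anchors regex matchers, so use fullmatch here.
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def pick_receiver(root, labels):
    """Return the receiver of the first matching child route, else the root's."""
    for child in root.get("routes", []):
        if matches(child, labels):
            return child["receiver"]
    return root["receiver"]


# The route tree from the example, reduced to the fields used for matching.
ROOT = {
    "receiver": "default-receiver",
    "routes": [
        {"receiver": "database-pager", "match_re": {"service": "mysql|cassandra"}},
        {"receiver": "frontend-pager", "match": {"team": "frontend"}},
    ],
}

print(pick_receiver(ROOT, {"service": "mysql"}))
print(pick_receiver(ROOT, {"team": "frontend"}))
print(pick_receiver(ROOT, {"service": "redis"}))
```

An alert that matches no child route, such as the hypothetical `service: redis` one, falls back to the receiver of the top-level route.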

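Alert noise reduction

Noise reduction means keeping the number of alert emails as small as possible so that critical information is not buried. Alertmanager offers several mechanisms for this; the main ones are grouping, inhibition, and silences. Incoming alerts are matched against the route tree, which determines their group and receiver; before a group's notification is sent, inhibited and silenced alerts are filtered out.

Mechanism	Description
Grouping	Merges alerts of a similar nature into a single notification
Inhibition	While an alert is firing, suppresses notifications for other alerts caused by it
Silences	A simple way to mute notifications for a given period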

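Grouping: alerts are grouped by alert name; if several alerts share the same name, they are merged into one email.
The alert name being matched is the one defined in
the Prometheus alerting rules:
/usr/local/prometheus/rules/*.yml


- alert: NodeDown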
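
As a rough illustration of grouping, the following Python sketch buckets alerts by the labels listed in `group_by`. It only models the bucketing step, not the timers; `group_alerts` and the sample alerts are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts: two share an alert name, one does not.
alerts = [
    {"labels": {"alertname": "NodeDown", "instance": "10.0.0.1:9100"}},
    {"labels": {"alertname": "NodeDown", "instance": "10.0.0.2:9100"}},
    {"labels": {"alertname": "MemoryRunningOut", "instance": "10.0.0.1:9100"}},
]
for key, members in group_alerts(alerts, ["alertname"]).items():
    print(key, len(members))
```

With `group_by: ['alertname']` the two NodeDown alerts end up in one group and would be delivered in one notification.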

Inhibition: removes redundant alerts; it is configured in Alertmanager.

inhibit_rules:
  - source_match:          
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['instance']

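When an alert with severity critical is received, it inhibits alerts with severity warning. The severity is the label you define when writing the rules. The last line lists the labels that must be equal for inhibition to apply; only instance is kept here. As the simplest example, consider how Alertmanager handles a critical alert followed by a warning alert with the same instance value.

Suppose you monitor nginx: nginx going down is a warning, and the host going down is critical. If the server running nginx dies, nginx is certainly down as well. Prometheus detects this and sends Alertmanager two alerts, one for the host and one for nginx. Alertmanager sees that one is critical, the other is warning, and both carry the same instance label value, meaning they occurred on the same machine. It therefore sends only the critical alert and inhibits the warning, so the notification we receive says the server is down.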
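
The nginx scenario can be traced with a small Python model of the inhibition rule. This is a sketch of the matching logic only, under the assumption that a missing label compares as an empty string; `is_inhibited` and the sample alerts are hypothetical.

```python
def is_inhibited(target, firing_alerts, rule):
    """Return True if some firing alert inhibits `target` under `rule`."""
    def has_labels(alert, wanted):
        return all(alert["labels"].get(k) == v for k, v in wanted.items())

    if not has_labels(target, rule["target_match"]):
        return False
    for source in firing_alerts:
        if source is target or not has_labels(source, rule["source_match"]):
            continue
        # All labels listed in `equal` must carry the same value on both alerts.
        if all(source["labels"].get(k, "") == target["labels"].get(k, "")
               for k in rule["equal"]):
            return True
    return False


RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["instance"],
}
host_down = {"labels": {"alertname": "HostDown", "severity": "critical", "instance": "10.0.0.1"}}
nginx_down = {"labels": {"alertname": "NginxDown", "severity": "warning", "instance": "10.0.0.1"}}
other_nginx = {"labels": {"alertname": "NginxDown", "severity": "warning", "instance": "10.0.0.2"}}

firing = [host_down, nginx_down, other_nginx]
print(is_inhibited(nginx_down, firing, RULE))   # same instance as the critical alert
print(is_inhibited(other_nginx, firing, RULE))  # different instance
```

Only the warning on the machine that also has a critical alert is suppressed; the nginx warning from the other instance is still delivered.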

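Silences:

A mechanism for muting notifications for a given period, selected by label matchers. For example, if a server is scheduled for maintenance that may involve reboots, many alerts will fire during that window; configure a silence for that period and the matching alerts will not be sent.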
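
A silence boils down to a set of label matchers plus a time window. The Python sketch below models that check with equality matchers only; `is_silenced`, the field names, and the maintenance window are hypothetical and do not mirror Alertmanager's API.

```python
from datetime import datetime, timezone


def is_silenced(labels, silence, now):
    """Return True if `now` is inside the window and all matchers equal the labels."""
    if not (silence["starts_at"] <= now < silence["ends_at"]):
        return False
    return all(labels.get(k) == v for k, v in silence["matchers"].items())


# Hypothetical two-hour maintenance window for one instance.
maintenance = {
    "matchers": {"instance": "10.0.0.1:9100"},
    "starts_at": datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),
    "ends_at": datetime(2024, 1, 1, 4, 0, tzinfo=timezone.utc),
}
alert_labels = {"alertname": "NodeDown", "instance": "10.0.0.1:9100"}

during = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
after = datetime(2024, 1, 1, 5, 0, tzinfo=timezone.utc)
print(is_silenced(alert_labels, maintenance, during))
print(is_silenced(alert_labels, maintenance, after))
```

Once the window has passed, the same alert is no longer muted, which is why a silence suits planned maintenance better than editing the rules.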

Alert example

Memory monitoring
PromQL:
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes )* 100 > 80
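
To see what the expression computes, here is the same arithmetic in Python with made-up numbers. The sample sizes are hypothetical; only the formula comes from the expression above.

```python
def memory_used_percent(total, free, buffers, cached):
    """(total - free - buffers - cached) / total * 100, as in the expression."""
    return (total - free - buffers - cached) / total * 100


GIB = 1024 ** 3
# Hypothetical sample: 16 GiB total, 1 GiB free, 0.5 GiB buffers, 1.5 GiB cached.
usage = memory_used_percent(16 * GIB, 1 * GIB, 0.5 * GIB, 1.5 * GIB)
print(usage, usage > 80)
```

Buffers and cache are subtracted because the kernel can reclaim them, so they are not counted as used memory here.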

Write the rule:

cd  /usr/local/prometheus/rules
cat memory.yml 
groups:
- name: memory_rules
  rules:

  - alert: MemoryRunningOut
    expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes )* 100 > 80    # the alert fires when this expression returns data
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} is running out of memory"
      description: "{{ $labels.instance }} is running out of memory; current usage is {{ $value }}"

Configure alert routing

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 5m
  receiver: 'default-receiver'
  routes: 
    - group_by: ['mysql']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 5m
      receiver: 'mysql-pager'
      match_re:
        job: mysql

receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'xxx@xx.com'
- name: 'mysql-pager'
  email_configs:
  - to: 'xxx@xx.cn'

inhibit_rules:
  - source_match:          
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['instance']

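DingTalk alerting

Building the DingTalk webhook bridge

# Install the Go toolchain
wget -c https://storage.googleapis.com/golang/go1.8.3.linux-amd64.tar.gz
tar -C /usr/local/ -zxvf go1.8.3.linux-amd64.tar.gz 
mkdir -p /home/gocode
# Quote the delimiter so $PATH, $GOROOT and $GOPATH are written literally instead of being expanded now
cat << 'EOF' >> /etc/profile
export GOROOT=/usr/local/go         # Go installation path
export GOPATH=/home/gocode          # default path for installed packages
export PATH=$PATH:$GOROOT/bin:$GOPATH/bin
EOF
source  /etc/profile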

----------------------------------------
# Install the DingTalk plugin
cd /home/gocode/
mkdir -p src/github.com/timonwong/
cd /home/gocode/src/github.com/timonwong/
git clone  https://github.com/timonwong/prometheus-webhook-dingtalk.git
cd prometheus-webhook-dingtalk
make
# Output of a successful build
[root@mini-install prometheus-webhook-dingtalk]# make 
>> formatting code
>> building binaries
 >   prometheus-webhook-dingtalk
>> checking code style
>> running tests
?       github.com/timonwong/prometheus-webhook-dingtalk/chilog [no test files]
?       github.com/timonwong/prometheus-webhook-dingtalk/cmd/prometheus-webhook-dingtalk        [no test files]
?       github.com/timonwong/prometheus-webhook-dingtalk/models [no test files]
?       github.com/timonwong/prometheus-webhook-dingtalk/notifier       [no test files]
?       github.com/timonwong/prometheus-webhook-dingtalk/template       [no test files]
?       github.com/timonwong/prometheus-webhook-dingtalk/template/internal/deftmpl      [no test files]
?       github.com/timonwong/prometheus-webhook-dingtalk/webrouter      [no test files]

# Create a symlink
ln -s  /home/gocode/src/github.com/timonwong/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk /usr/local/bin/prometheus-webhook-dingtalk


## Check the help output
prometheus-webhook-dingtalk --help
usage: prometheus-webhook-dingtalk --ding.profile=DING.PROFILE [<flags>]

Flags:
  -h, --help              Show context-sensitive help (also try --help-long and --help-man).
      --web.listen-address=":8060"  
                          The address to listen on for web interface.
      --ding.profile=DING.PROFILE ...  
                          Custom DingTalk profile (can be given multiple times, <profile>=<dingtalk-url>).
      --ding.timeout=5s   Timeout for invoking DingTalk webhook.
      --template.file=""  Customized template file (see template/default.tmpl for example)
      --log.level=info    Only log messages with the given severity or above. One of: [debug, info, warn, error]
      --version           Show application version.

Starting the DingTalk plugin

Start the plugin with the DingTalk robot webhook URL you have obtained


prometheus-webhook-dingtalk --ding.profile="webhook=https://oapi.dingtalk.com/robot/send?access_token=OOOOOOXXXXXXOXOXOX9b46d54e780d43b98a1951489e3a0a5b1c6b48e891e86bd"

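# Note: several webhook names can be configured; each name is tied to the alert URL used later
# About the --ding.profile flag: to support sending alerts to several DingTalk custom robots at once, --ding.profile can be given multiple times on the command line, for example: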
prometheus-webhook-dingtalk \
    --ding.profile="webhook1=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx" \
    --ding.profile="webhook2=https://oapi.dingtalk.com/robot/send?access_token=yyyyyyyyyyy"
This defines two webhooks, webhook1 and webhook2, which send alerts to different DingTalk groups.
Then add the matching receivers to the Alertmanager configuration (note the url below):
receivers:
- name: send_to_dingding_webhook1
  webhook_configs:
  - send_resolved: false
    url: http://localhost:8060/dingtalk/webhook1/send
- name: send_to_dingding_webhook2
  webhook_configs:
  - send_resolved: false
    url: http://localhost:8060/dingtalk/webhook2/send
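
The link between a profile name and the receiver url follows the pattern visible in the config above. A minimal Python sketch of that mapping, assuming the plugin listens on localhost:8060 as in this article; `receiver_url` is a hypothetical helper, not part of the plugin.

```python
def receiver_url(profile_arg, listen_address="localhost:8060"):
    """Map a '<profile>=<dingtalk-url>' argument to the Alertmanager webhook url."""
    # Split on the first '=' only; the DingTalk URL itself contains '='.
    name, _, _dingtalk_url = profile_arg.partition("=")
    return "http://%s/dingtalk/%s/send" % (listen_address, name)


for arg in (
    "webhook1=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx",
    "webhook2=https://oapi.dingtalk.com/robot/send?access_token=yyyyyyyyyyy",
):
    print(receiver_url(arg))
```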

## Run the DingTalk plugin as a systemd service
cat >  /usr/lib/systemd/system/dingtalk.service <<EOF
[Unit]
Description=prometheus-webhook-dingtalk
[Service]
Restart=on-failure
ExecStart=/usr/local/bin/prometheus-webhook-dingtalk --ding.profile="webhook=https://oapi.dingtalk.com/robot/send?access_token=XXXXXXXXOOOOOOO0d43b98a1951489e3a0a5b1c6b48e891e86bd"
[Install]
WantedBy=multi-user.target
EOF

systemctl   daemon-reload
systemctl   status   dingtalk    # reports an error at this point; ignore it and go straight to: systemctl start dingtalk

## Check the listening port
[root@mini-install system]# ss  -tanlp  | grep  80
LISTEN     0      128         :::8060                    :::*                   users:(("prometheus-webh",pid=18541,fd=3))

## Quick test
curl   -H "Content-Type: application/json"  -d '{ "version": "4", "status": "firing", "description":"description_content"}'   http://localhost:8060/dingtalk/webhook/send

## Data format sent by the Alertmanager webhook
The webhook receiver allows configuring a generic receiver:
# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]

# The endpoint to send HTTP POST requests to.
url: <string>

# The HTTP client's configuration.
[ http_config: <http_config> | default = global.http_config ]

The Alertmanager will send HTTP POST requests in the following JSON format to the configured endpoint:
{
  "version": "4",
  "groupKey": <string>,    // key identifying the group of alerts (e.g. to deduplicate)
  "status": "<resolved|firing>",
  "receiver": <string>,
  "groupLabels": <object>,
  "commonLabels": <object>,
  "commonAnnotations": <object>,
  "externalURL": <string>,  // backlink to the Alertmanager.
  "alerts": [
    {
      "status": "<resolved|firing>",
      "labels": <object>,
      "annotations": <object>,
      "startsAt": "<rfc3339>",
      "endsAt": "<rfc3339>",
      "generatorURL": <string> // identifies the entity that caused the alert
    },
    ...
  ]
}
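
To get a feel for this payload, the Python sketch below parses a sample document in the format above and prints one line per alert. All values in `SAMPLE` are invented for illustration (`groupKey` is left out for brevity), and `summarize` is a hypothetical helper.

```python
import json

# Invented sample following the documented webhook format.
SAMPLE = """
{
  "version": "4",
  "status": "firing",
  "receiver": "web.hook",
  "groupLabels": {"alertname": "NodeDown"},
  "commonLabels": {"alertname": "NodeDown", "severity": "critical"},
  "commonAnnotations": {},
  "externalURL": "http://alertmanager.example:9093",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "NodeDown", "instance": "10.0.0.1:9100", "severity": "critical"},
      "annotations": {"summary": "Instance 10.0.0.1:9100 is down"},
      "startsAt": "2024-01-01T02:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://prometheus.example:9090/graph"
    }
  ]
}
"""


def summarize(payload):
    """Return one 'status alertname instance: summary' line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        lines.append("%s %s %s: %s" % (
            alert.get("status", "unknown"),
            labels.get("alertname", "?"),
            labels.get("instance", "?"),
            alert.get("annotations", {}).get("summary", ""),
        ))
    return lines


for line in summarize(json.loads(SAMPLE)):
    print(line)
```

A custom webhook receiver, such as the Python one later in this article, works on exactly this structure: it loops over `alerts` and reads `labels` and `annotations` from each entry.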

alertmanager

Configuration

wget  https://github.com/prometheus/alertmanager/releases/download/v0.19.0/alertmanager-0.19.0.linux-amd64.tar.gz
tar  zxvf  alertmanager-0.19.0.linux-amd64.tar.gz  
ln  -sv  `pwd`/alertmanager-0.19.0.linux-amd64     /usr/local/alertmanager  

# Run as a systemd service
cat  >  /usr/lib/systemd/system/alertmanager.service <<EOF
[Unit]
Description=alertmanager
[Service]
Restart=on-failure
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml
[Install]
WantedBy=multi-user.target
EOF

systemctl   daemon-reload     # then start the service

# Edit the configuration file
cd    /usr/local/alertmanager
vim  alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://localhost:8060/dingtalk/webhook/send'

Integrating with Prometheus

pwd 
/usr/local/prometheus

mkdir  rules  &&  cd  !$
cat example.yml 
groups:
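- name: exports.rules     ## name of this rule group; rules of the same kind, here all checking whether exporters are up
  rules:

  - alert: ExporterDown   ## alert name
    expr: up == 0        ## alert expression: watches the up metric and acts when it equals 0
    for: 1m              ## fire after the value has been 0 for one minute
    labels:              ## labels; used here for the alert severity
      severity: ERROR
    annotations:         ## notification text; uses the values of $labels.instance and $labels.job
      summary: "Exporter on instance {{ $labels.instance }} is down!!"
      description: "Exporter of job {{ $labels.job }} on instance {{ $labels.instance }} has been down for one minute!!"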

cat  prometheus.yml
 # Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
        - 127.0.0.1:9093
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"


## Start all the services

Once node monitoring is working, shut down one node to see the effect.

References:
https://blog.rj-bai.com/post/158.html
DingTalk plugin author:
https://theo.im/blog/2017/10/16/release-prometheus-alertmanager-webhook-for-dingtalk/
https://github.com/timonwong/prometheus-webhook-dingtalk
Building the DingTalk plugin: https://blog.51cto.com/9406836/2419876




http://ylzheng.com/2018/03/01/alertmanager-webhook-dingtalk/

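DingTalk alerting, Python version

DingTalk is said to be fairly strict about the alert data format. To use DingTalk's markdown message format, write your own API that reshapes the data sent by Alertmanager and forwards it to the DingTalk robot.

Python implementation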

import os
import json
import requests
import arrow

from flask import Flask
from flask import request

app = Flask(__name__)


@app.route('/', methods=['POST', 'GET'])
def send():
    if request.method == 'POST':
        post_data = request.get_data()
        send_alert(bytes2json(post_data))
        return 'success'
    else:
        return 'weclome to use prometheus alertmanager dingtalk webhook server!'


def bytes2json(data_bytes):
    data = data_bytes.decode('utf8').replace("'", '"')
    return json.loads(data)


def send_alert(data):
    token = os.getenv('ROBOT_TOKEN')
    if not token:
        print('you must set ROBOT_TOKEN env')
        return
    url = 'https://oapi.dingtalk.com/robot/send?access_token=%s' % token
    for output in data['alerts'][:]:
        # annotations

        send_data = {
            "msgtype": "markdown",
            "markdown": {
                "title": "prometheus_alert",
                "text": "## Alert source: prometheus_alertmanager \n" +
                        "**Severity**: %s \n\n" % output['labels']['status'] +
                        "**Alert**: %s \n\n" % output['labels']['alertname'] +
                        "**Instance**: %s \n\n" % output['labels']['instance'] +
                        "**Details**: %s \n\n" % output['annotations']['summary'] +
                        "**Started at**: %s \n\n" % arrow.get(output['startsAt']).to('Asia/Shanghai').format('YYYY-MM-DD HH:mm:ss ZZ') +
                        "**Ended at**: %s \n" % arrow.get(output['endsAt']).to('Asia/Shanghai').format('YYYY-MM-DD HH:mm:ss ZZ')
            }
        }
        req = requests.post(url, json=send_data)
        result = req.json()
        if result['errcode'] != 0:
            print('notify dingtalk error: %s' % result['errcode'])


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8060)

Packaging the program as a container

# Working directory
# tree
.
├── Dockerfile
└── main.py

main.py is the Flask code

#cat  Dockerfile
FROM tiangolo/uwsgi-nginx-flask:python3.7
# Environment variable: the DingTalk robot token
ENV ROBOT_TOKEN  47f07271e8a24b6a63486bBSJDFKj346556jhjk9892fk545jjf234jFJ89489JFKSDLF2KgfhsJK234
RUN pip install requests flask arrow  -i  https://pypi.tuna.tsinghua.edu.cn/simple --no-cache-dir
COPY main.py  /app
EXPOSE 80


## Build the image
docker  build  -t dingding  .

## Start the container
docker  run  -d  --restart=always  -p 8060:80  dingding 



## Successful test
curl localhost:8060
weclome to use prometheus alertmanager dingtalk webhook server!




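Test data:
[root@t1 ~]# cat  data.json 
{
  "version": "3",   
  "status": "firing",
  "receiver": "jdhf",
  "alerts": [
    {
      "labels": {"instance": "192.168.1.145:9100", "alertname": "Available space in /home", "status": "critical"},
      "annotations": {"summary": "Available disk space on the root mount point is below 4G; currently available: 2G"}
    }
  ]
}

curl  127.0.0.1:8060  -X POST -d @data.json  --header "Content-Type: application/json"   
# The test fails
The data Alertmanager sends to the DingTalk notifier contains additional fields (startsAt and endsAt) that this test data lacks. For the test to succeed, edit main.py and remove the start time and end time lines.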
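
An alternative to deleting those lines is to read every field with a default, so that both the reduced test data and a full Alertmanager payload are accepted. The sketch below shows this with plain dictionary lookups and leaves out the time-zone conversion; `build_markdown` and the sample alert are hypothetical and not part of the original main.py.

```python
def build_markdown(alert):
    """Build the DingTalk markdown text, tolerating missing fields."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    rows = [
        # Fall back from `severity` to the `status` label used by the test data.
        ("Severity", labels.get("severity", labels.get("status", "n/a"))),
        ("Alert", labels.get("alertname", "n/a")),
        ("Instance", labels.get("instance", "n/a")),
        ("Details", annotations.get("summary", "n/a")),
        ("Started at", alert.get("startsAt", "n/a")),
        ("Ended at", alert.get("endsAt", "n/a")),
    ]
    body = " \n\n".join("**%s**: %s" % row for row in rows)
    return "## Alert source: prometheus_alertmanager \n" + body


# Hypothetical alert without startsAt/endsAt, like the test data.
text = build_markdown({
    "labels": {"instance": "192.168.1.145:9100", "alertname": "HomeDiskLow"},
    "annotations": {"summary": "less than 4G available"},
})
print(text)
```

With this approach a missing key shows up as `n/a` in the message instead of raising `KeyError` inside the request handler.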


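## Errors
# DingTalk group notification returns {"errcode":310000,"errmsg":"keywords not in content"}: how to fix
The message does not contain any of the custom keywords configured in the DingTalk robot's security settings, or the public IP has not been added there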
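
A cheap guard is to check the message for a configured keyword before posting it. The following Python sketch assumes the robot accepts a message when at least one keyword occurs in the text; `missing_keyword` and the keyword list are hypothetical and must mirror whatever is configured on your robot.

```python
def missing_keyword(text, keywords):
    """Return True if none of the configured keywords occurs in the text."""
    return not any(keyword in text for keyword in keywords)


KEYWORDS = ["alert"]  # hypothetical: use the keywords set in the robot's security settings
print(missing_keyword("## alert: NodeDown", KEYWORDS))
print(missing_keyword("NodeDown on 10.0.0.1", KEYWORDS))
```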

#######################
## At this point the Alertmanager configuration file is
/usr/local/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - send_resolved: true
    url: 'http://localhost:8060'
    #url: 'http://localhost:8060/dingtalk/webhook/send'

The modified Python alerting version:

import os
import json
import requests
import arrow

from flask import Flask
from flask import request

app = Flask(__name__)


@app.route('/', methods=['POST', 'GET'])
def send():
    if request.method == 'POST':
        post_data = request.get_data()
        send_alert(bytes2json(post_data))
        return 'success'
    else:
        return 'weclome to use prometheus alertmanager dingtalk webhook server!'


def bytes2json(data_bytes):
    data = data_bytes.decode('utf8').replace("'", '"')
    return json.loads(data)


def send_alert(data):
    token = os.getenv('ROBOT_TOKEN')
    if not token:
        print('you must set ROBOT_TOKEN env')
        return
    url = 'https://oapi.dingtalk.com/robot/send?access_token=%s' % token
    for output in data['alerts'][:]:
        try:
            pod_name = output['labels']['pod']
        except KeyError:
            try:
                pod_name = output['labels']['pod_name']
            except KeyError:
                pod_name = 'null'
                
        try:
            namespace = output['labels']['namespace']
        except KeyError:
            namespace = 'null'

        try:
            message = output['annotations']['message']
        except KeyError:
            try:
                message = output['annotations']['description']
            except KeyError:
                message = 'null'

        send_data = {
            "msgtype": "markdown",
            "markdown": {
                "title": "prometheus_alert",
                "text": "## Alert source: prometheus_alert \n" +
                        "**Severity**: %s \n\n" % output['labels']['severity'] +
                        "**Alert**: %s \n\n" % output['labels']['alertname'] +
                        "**Affected pod**: %s \n\n" % pod_name +
                        "**Affected namespace**: %s \n\n" % namespace +
                        "**Details**: %s \n\n" % message +
                        "**Status**: %s \n\n" % output['status'] +
                        "**Started at**: %s \n\n" % arrow.get(output['startsAt']).to('Asia/Shanghai').format('YYYY-MM-DD HH:mm:ss ZZ') +
                        "**Ended at**: %s \n" % arrow.get(output['endsAt']).to('Asia/Shanghai').format('YYYY-MM-DD HH:mm:ss ZZ')
            }
        }
        req = requests.post(url, json=send_data)
        result = req.json()
        if result['errcode'] != 0:
            print('notify dingtalk error: %s' % result['errcode'])


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Source:
https://www.jianshu.com/p/ed014d15aec8
