alertmanager
Alertmanager can run on a remote server, separate from Prometheus.
Alerting mechanism
Define your monitoring rules in Prometheus, i.e. configure a trigger: when a value crosses the configured threshold, an alert fires. Prometheus pushes the firing alert to Alertmanager, which runs it through a processing pipeline and then delivers it to the receivers.
Installation and configuration
```shell
wget https://github.com/prometheus/alertmanager/releases/download/v0.19.0/alertmanager-0.19.0.linux-amd64.tar.gz
tar zxf alertmanager-0.19.0.linux-amd64.tar.gz
mv alertmanager-0.19.0.linux-amd64 /usr/local/alertmanager   # move the extracted directory, not the tarball
cd /usr/local/alertmanager && ls
```
Configuration file
```shell
cat alertmanager.yml
```

```yaml
global:
  resolve_timeout: 5m        # global setting: how long to wait before marking an alert resolved
route:
  group_by: ['alertname']    # which label Alertmanager groups alerts by
  group_wait: 10s            # after the first alert in a group arrives, wait 10s for others to join the same notification
  group_interval: 10s        # interval between notifications for the same group
  repeat_interval: 1h        # how often a still-firing alert is re-sent, default 1h
  receiver: 'web.hook'       # default receiver
# alert receivers
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
# alert noise reduction (inhibition)
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
```
Email receiver configuration
```shell
cat alertmanager.yml
```

```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'   # SMTP server address
  smtp_from: 'xxx@163.com'            # sender address
  smtp_auth_username: 'xxx@163.com'   # auth username
  smtp_auth_password: 'xxxx'          # auth password
  smtp_require_tls: false             # disable TLS
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m
  receiver: 'email'                   # name of the receiver group for alerts
receivers:
- name: 'email'                       # receiver group name
  email_configs:                      # email settings
  - to: 'xx@xxx.com'                  # recipient
```
Check the configuration file
./amtool check-config alertmanager.yml
Configure as a systemd service
```shell
cat > /usr/lib/systemd/system/alertmanager.service <<EOF
[Unit]
Description=alertmanager

[Service]
Restart=on-failure
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml

[Install]
WantedBy=multi-user.target
EOF
```
Integration with Prometheus
```yaml
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093      # Alertmanager address
rule_files:
  - "rules/*.yml"           # files containing the alert rules
```
Configuring alert rules
The alert rules live under /usr/local/prometheus/rules
```shell
/usr/local/prometheus/rules]# cat example.yml
```

```yaml
groups:
- name: exports.rules        # group name for these alerts; they all check whether an exporter instance is up
  rules:
  - alert: ExporterDown      # alert name
    expr: up == 0            # alert expression: the up metric is 0 when the scraped node is down, which triggers the steps below
    for: 1m                  # fire only after the condition has held for one minute
    labels:                  # alert severity
      severity: ERROR
    annotations:             # notification text; {{ $labels.instance }} and {{ $labels.job }} are filled in from the alert's labels
      summary: "Instance {{ $labels.instance }} is down"
      description: "Instance {{ $labels.instance }} of job {{ $labels.job }} is down"
```
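The `for: 1m` clause keeps the alert in a pending state: the expression must stay true for a full minute before the alert actually fires. A minimal Python sketch of that hold-down behavior (an illustration only, not Prometheus's actual implementation):

```python
def should_fire(samples, hold_seconds=60):
    """samples: list of (timestamp, up_value) pairs in order.
    Fire only if up == 0 has held continuously for hold_seconds."""
    pending_since = None
    fired = False
    for ts, up in samples:
        if up == 0:
            if pending_since is None:
                pending_since = ts          # condition just became true: pending
            if ts - pending_since >= hold_seconds:
                fired = True                # held long enough: firing
        else:
            pending_since = None            # condition broke, reset the timer
            fired = False
    return fired

# A 30s blip never fires; a full minute of downtime does.
```

This is why a single failed scrape does not page anyone: the condition has to stay true for the whole `for` window.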
Explanation of the template variables:

```
{{ $labels.instance }}   # the instance label extracted from the up metric
{{ $labels.job }}        # the job label
```
Alerts with the same alert name, i.e. alertname (taken from the `alert:` field in the rule file), are merged into a single email and sent together.
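That merging behavior can be sketched in a few lines of Python (a simplified model; the label values are made up for illustration):

```python
from collections import defaultdict

def group_alerts(alerts, group_by=("alertname",)):
    """Bucket alerts by the values of the group_by labels, the way
    Alertmanager merges same-named alerts into one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

alerts = [
    {"labels": {"alertname": "ExporterDown", "instance": "10.0.0.1:9100"}},
    {"labels": {"alertname": "ExporterDown", "instance": "10.0.0.2:9100"}},
    {"labels": {"alertname": "OutOfMemory", "instance": "10.0.0.1:9100"}},
]
grouped = group_alerts(alerts)
# The two ExporterDown alerts land in one group -> one email with both instances.
```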
Alert routing
The routing policy is set in the Alertmanager configuration file:
```yaml
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m
  receiver: 'email'
```
Routing example
```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'xxx@163.com'
  smtp_auth_username: 'xxx@163.com'
  smtp_auth_password: 'xxx'
  smtp_require_tls: false
route:
  receiver: 'default-receiver'       # default receiver, used when no child route matches
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]     # grouping
  routes:                            # child routes
  - receiver: 'database-pager'       # receiver name
    group_wait: 10s                  # grouping
    match_re:                        # regex match
      service: mysql|cassandra       # alerts whose service label is mysql or cassandra
  - receiver: 'frontend-pager'       # receiver name
    group_by: [product, environment] # grouping
    match:                           # exact match
      team: frontend                 # alerts whose team label is frontend
receivers:                           # receiver definitions
- name: 'default-receiver'           # receiver name
  email_configs:                     # email settings
  - to: 'xxx.xx.com'                 # recipient, and likewise below
- name: 'database-pager'
  email_configs:
  - to: 'xxx.xx.com'
- name: 'frontend-pager'
  email_configs:
  - to: 'xxx@.xx.com'
```
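A simplified Python model of how the routing tree above picks a receiver: the first matching child route wins, otherwise the root receiver applies. (The real tree is recursive and supports `continue`; this flat sketch omits both.)

```python
import re

def pick_receiver(labels, routes, default="default-receiver"):
    """Walk the child routes in order; a route matches when every
    `match` label is equal and every `match_re` label matches its regex."""
    for route in routes:
        exact = route.get("match", {})
        regex = route.get("match_re", {})
        if all(labels.get(k) == v for k, v in exact.items()) and \
           all(re.fullmatch(v, labels.get(k, "")) for k, v in regex.items()):
            return route["receiver"]
    return default                      # nothing matched: fall back to the root

routes = [
    {"receiver": "database-pager", "match_re": {"service": "mysql|cassandra"}},
    {"receiver": "frontend-pager", "match": {"team": "frontend"}},
]
```

For example, an alert labeled `service=mysql` lands at database-pager, `team=frontend` at frontend-pager, and anything else at default-receiver.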
Alert noise reduction
Noise reduction means keeping the number of alert emails down so that key information is not drowned out. Alertmanager has several mechanisms for this; the main ones are grouping, inhibition, and silences. When Alertmanager receives alerts it first groups them, then places them on the notification queue, where inhibition and silences are applied; finally the route tree dispatches each alert to its receiver.

| Mechanism | Description |
| --- | --- |
| Grouping (group) | Merge alerts of a similar nature into a single notification |
| Inhibition | When an alert fires, stop sending the other alerts it implies |
| Silences | A simple mechanism to mute notifications for a specific period |
Grouping: alerts are grouped by alert name; if several alerts share the same name, they are merged into one email.
The matched alert name is the one defined in the Prometheus alert rules under /usr/local/prometheus/rules/*.yml, e.g.:

```yaml
- alert: NodeDown
```
Inhibition: eliminates redundant alerts; configured in Alertmanager:
```yaml
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['instance']
```

When Alertmanager receives a critical alert, this rule suppresses warning alerts. The severity levels are the ones you define when writing your rules, and the `equal` line lists the labels that must match for inhibition to apply; here only `instance` is kept. The simplest case: Alertmanager receives a critical alert and then a warning alert whose `instance` value is the same, and has to decide what to do with the pair.

Say we monitor nginx: an nginx outage fires at warning level, while the host going down fires at critical. If the server running nginx dies, nginx obviously dies with it. Prometheus notices and sends Alertmanager two alerts, one for the host and one for nginx. Alertmanager sees one critical and one warning with the same `instance` label, meaning both happened on the same machine, so it sends only the critical alert; the warning is inhibited. All we receive is the notification that the server is down.
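That matching logic can be sketched in Python (a simplified model of the rule above, not Alertmanager's implementation):

```python
def is_inhibited(alert, firing, rule):
    """True if `alert` should be suppressed: it matches target_match and
    some alert in `firing` matches source_match with equal labels."""
    target = rule["target_match"]
    if any(alert["labels"].get(k) != v for k, v in target.items()):
        return False                    # this alert is not a target at all
    source = rule["source_match"]
    for other in firing:
        if all(other["labels"].get(k) == v for k, v in source.items()) and \
           all(other["labels"].get(l) == alert["labels"].get(l)
               for l in rule["equal"]):
            return True                 # critical alert on the same instance
    return False

rule = {"source_match": {"severity": "critical"},
        "target_match": {"severity": "warning"},
        "equal": ["instance"]}
```

With this rule, a warning on 10.0.0.1 is dropped while a critical alert for 10.0.0.1 is firing, but a warning on a different instance still goes out.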
Silences: a mechanism for muting notifications during a specific period, using label matchers to pick the batch of alerts not to send. If, say, you have server maintenance scheduled that may involve reboots, plenty of alerts will fire during the window; configure a silence for that period so those alerts are not sent.
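Silences can be created in the web UI, with `amtool`, or over the HTTP API. Below is a Python sketch that builds the request body for a maintenance-window silence; the `/api/v2/silences` endpoint and field names are what recent Alertmanager releases expose, so verify against your version (all values here are hypothetical):

```python
from datetime import datetime, timedelta, timezone

def build_silence(matchers, hours, author, comment):
    """Build the JSON body for Alertmanager's silence API
    (POST /api/v2/silences in recent releases)."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [{"name": k, "value": v, "isRegex": False}
                     for k, v in matchers.items()],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }

body = build_silence({"instance": "10.0.0.1:9100"}, 2,
                     "ops", "server maintenance, reboot expected")
# import requests
# requests.post("http://127.0.0.1:9093/api/v2/silences", json=body)
```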
Alert examples
Monitoring memory
PromQL:
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100 > 80
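The arithmetic is plain percentage math; a quick Python check of the same expression with made-up byte counts (in GiB for readability):

```python
def memory_used_percent(total, free, buffers, cached):
    """Same arithmetic as the PromQL expression above:
    used = total - free - buffers - cached, as a percentage of total."""
    return (total - free - buffers - cached) / total * 100

# 8 GiB total, 1 GiB free, 0.5 GiB buffers, 1.5 GiB cached -> 62.5% used
pct = memory_used_percent(8, 1, 0.5, 1.5)
alert = pct > 80          # below the threshold, so no alert fires
```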
Write the rule:
```shell
cd /usr/local/prometheus/rules
cat memory.yml
```

```yaml
groups:
- name: memory_rules
  rules:
  - alert: OutOfMemory
    expr: (node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100 > 80   # fires when the expression returns data
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} is running out of memory"
      description: "{{ $labels.instance }} is running out of memory, current usage is {{ $value }}"
```
Configure alert routing
```yaml
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 5m
  receiver: 'default-receiver'
  routes:
  - group_by: ['mysql']
    group_wait: 10s
    group_interval: 10s
    repeat_interval: 5m
    receiver: 'mysql-pager'
    match_re:
      job: mysql
receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'xxx@xx.com'
- name: 'mysql-pager'
  email_configs:
  - to: 'xxx@xx.cn'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['instance']
```
DingTalk alerting
Building the DingTalk webhook plugin
```shell
# install the Go toolchain
wget -c https://storage.googleapis.com/golang/go1.8.3.linux-amd64.tar.gz
tar -C /usr/local/ -zxvf go1.8.3.linux-amd64.tar.gz
mkdir -p /home/gocode

# quote 'EOF' so $PATH and friends are written literally instead of being expanded now
cat << 'EOF' >> /etc/profile
export GOROOT=/usr/local/go      # where Go was installed
export GOPATH=/home/gocode       # default package path
export PATH=$PATH:$GOROOT/bin:$GOPATH/bin
EOF
source /etc/profile

# build the DingTalk plugin
cd /home/gocode/
mkdir -p src/github.com/timonwong/
cd /home/gocode/src/github.com/timonwong/
git clone https://github.com/timonwong/prometheus-webhook-dingtalk.git
cd prometheus-webhook-dingtalk
make
```

A successful build looks like this:

```shell
[root@mini-install prometheus-webhook-dingtalk]# make
>> formatting code
>> building binaries
>  prometheus-webhook-dingtalk
>> checking code style
>> running tests
?   github.com/timonwong/prometheus-webhook-dingtalk/chilog [no test files]
?   github.com/timonwong/prometheus-webhook-dingtalk/cmd/prometheus-webhook-dingtalk [no test files]
?   github.com/timonwong/prometheus-webhook-dingtalk/models [no test files]
?   github.com/timonwong/prometheus-webhook-dingtalk/notifier [no test files]
?   github.com/timonwong/prometheus-webhook-dingtalk/template [no test files]
?   github.com/timonwong/prometheus-webhook-dingtalk/template/internal/deftmpl [no test files]
?   github.com/timonwong/prometheus-webhook-dingtalk/webrouter [no test files]
```

Symlink the binary and check the flags:

```shell
ln -s /home/gocode/src/github.com/timonwong/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk /usr/local/bin/prometheus-webhook-dingtalk

prometheus-webhook-dingtalk --help
usage: prometheus-webhook-dingtalk --ding.profile=DING.PROFILE [<flags>]

Flags:
  -h, --help                  Show context-sensitive help (also try --help-long and --help-man).
      --web.listen-address=":8060"
                              The address to listen on for web interface.
      --ding.profile=DING.PROFILE ...
                              Custom DingTalk profile (can be given multiple times, <profile>=<dingtalk-url>).
      --ding.timeout=5s       Timeout for invoking DingTalk webhook.
      --template.file=""      Customized template file (see template/default.tmpl for example)
      --log.level=info        Only log messages with the given severity or above. One of: [debug, info, warn, error]
      --version               Show application version.
```
Starting the DingTalk plugin
Start the plugin with the webhook URL of the DingTalk robot you created:
```shell
prometheus-webhook-dingtalk --ding.profile="webhook=https://oapi.dingtalk.com/robot/send?access_token=OOOOOOXXXXXXOXOXOX9b46d54e780d43b98a1951489e3a0a5b1c6b48e891e86bd"
```

Note: several webhook profiles can be configured, and the profile name is what the alert URL refers to later. To send alerts to multiple DingTalk robots at once, pass `--ding.profile` several times:

```shell
prometheus-webhook-dingtalk \
  --ding.profile="webhook1=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx" \
  --ding.profile="webhook2=https://oapi.dingtalk.com/robot/send?access_token=yyyyyyyyyyy"
```

This defines two webhooks, webhook1 and webhook2, for sending alerts to different DingTalk groups. Then add the matching receivers to the Alertmanager configuration (note the URLs):

```yaml
receivers:
- name: send_to_dingding_webhook1
  webhook_configs:
  - send_resolved: false
    url: http://localhost:8060/dingtalk/webhook1/send
- name: send_to_dingding_webhook2
  webhook_configs:
  - send_resolved: false
    url: http://localhost:8060/dingtalk/webhook2/send
```

Configure the plugin as a systemd service (the unit file must live under the systemd unit directory to be found):

```shell
cat > /usr/lib/systemd/system/dingtalk.service <<EOF
[Unit]
Description=prometheus-webhook-dingtalk

[Service]
Restart=on-failure
ExecStart=/usr/local/bin/prometheus-webhook-dingtalk --ding.profile="webhook=https://oapi.dingtalk.com/robot/send?access_token=XXXXXXXXOOOOOOO0d43b98a1951489e3a0a5b1c6b48e891e86bd"

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl start dingtalk   # `systemctl status dingtalk` errors before the first start; ignore it and just start
```

Check the listener and run a quick test:

```shell
[root@mini-install system]# ss -tanlp | grep 8060
LISTEN  0  128  :::8060  :::*  users:(("prometheus-webh",pid=18541,fd=3))

curl -H "Content-Type: application/json" \
     -d '{"version": "4", "status": "firing", "description": "description_content"}' \
     http://localhost:8060/dingtalk/webhook/send
```

The data format Alertmanager passes to a webhook. The webhook receiver allows configuring a generic receiver:

```yaml
# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]
# The endpoint to send HTTP POST requests to.
url: <string>
# The HTTP client's configuration.
[ http_config: <http_config> | default = global.http_config ]
```

The Alertmanager will send HTTP POST requests in the following JSON format to the configured endpoint:

```json
{
  "version": "4",
  "groupKey": <string>,              // key identifying the group of alerts (e.g. to deduplicate)
  "status": "<resolved|firing>",
  "receiver": <string>,
  "groupLabels": <object>,
  "commonLabels": <object>,
  "commonAnnotations": <object>,
  "externalURL": <string>,           // backlink to the Alertmanager.
  "alerts": [
    {
      "status": "<resolved|firing>",
      "labels": <object>,
      "annotations": <object>,
      "startsAt": "<rfc3339>",
      "endsAt": "<rfc3339>",
      "generatorURL": <string>       // identifies the entity that caused the alert
    },
    ...
  ]
}
```
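For local testing it helps to post a payload in exactly this shape to the plugin. A Python sketch that builds one (all field values are made up; the target URL assumes the plugin is running locally with a profile named `webhook`):

```python
from datetime import datetime, timezone

def sample_payload():
    """A minimal alert payload in the documented webhook JSON format,
    usable for smoke-testing a local receiver such as the DingTalk plugin."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "version": "4",
        "groupKey": '{}:{alertname="ExporterDown"}',
        "status": "firing",
        "receiver": "web.hook",
        "groupLabels": {"alertname": "ExporterDown"},
        "commonLabels": {"alertname": "ExporterDown", "job": "node"},
        "commonAnnotations": {"summary": "instance down"},
        "externalURL": "http://127.0.0.1:9093",
        "alerts": [{
            "status": "firing",
            "labels": {"alertname": "ExporterDown",
                       "instance": "10.0.0.1:9100", "job": "node"},
            "annotations": {"summary": "instance 10.0.0.1:9100 is down"},
            "startsAt": now,
            "endsAt": "0001-01-01T00:00:00Z",   # zero time: not yet resolved
            "generatorURL": "http://127.0.0.1:9090/graph",
        }],
    }

# import requests
# requests.post("http://localhost:8060/dingtalk/webhook/send", json=sample_payload())
```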
alertmanager
Configuration
```shell
wget https://github.com/prometheus/alertmanager/releases/download/v0.19.0/alertmanager-0.19.0.linux-amd64.tar.gz
tar zxvf alertmanager-0.19.0.linux-amd64.tar.gz
ln -sv `pwd`/alertmanager-0.19.0.linux-amd64 /usr/local/alertmanager

# configure as a systemd service
cat > /usr/lib/systemd/system/alertmanager.service <<EOF
[Unit]
Description=alertmanager

[Service]
Restart=on-failure
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload   # then start the service

# edit the configuration file
cd /usr/local/alertmanager
vim alertmanager.yml
```

```yaml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://localhost:8060/dingtalk/webhook/send'
```
Integration with Prometheus
```shell
pwd
/usr/local/prometheus
mkdir rules && cd !$
cat example.yml
```

```yaml
groups:
- name: exports.rules        # group name for these alerts; they all check whether an exporter instance is up
  rules:
  - alert: ExporterDown      # alert name
    expr: up == 0            # alert expression: trigger the steps below when the up metric is 0
    for: 1m                  # only after it has been 0 for one minute
    labels:                  # alert severity
      severity: ERROR
    annotations:             # notification text, built from {{ $labels.instance }} and {{ $labels.job }}
      summary: "Instance {{ $labels.instance }}: exporter down!"
      description: "Instance {{ $labels.instance }} of job {{ $labels.job }}: the exporter has been down for a minute!"
```

```shell
cat prometheus.yml
```

```yaml
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"
```

Then start all the services.
Once the nodes report healthy, shut one down to see the effect.
References:
https://blog.rj-bai.com/post/158.html
DingTalk plugin author:
https://theo.im/blog/2017/10/16/release-prometheus-alertmanager-webhook-for-dingtalk/
https://github.com/timonwong/prometheus-webhook-dingtalk
DingTalk plugin build notes: https://blog.51cto.com/9406836/2419876
http://ylzheng.com/2018/03/01/alertmanager-webhook-dingtalk/
DingTalk alerting, Python version
DingTalk is strict about the alert payload format (so I'm told). To use DingTalk's markdown message format, I wrote a small API of my own that reshapes what Alertmanager sends before forwarding it to the DingTalk robot.
The Python implementation:
```python
import os
import json

import requests
import arrow
from flask import Flask, request

app = Flask(__name__)


@app.route('/', methods=['POST', 'GET'])
def send():
    if request.method == 'POST':
        post_data = request.get_data()
        send_alert(bytes2json(post_data))
        return 'success'
    else:
        return 'welcome to use prometheus alertmanager dingtalk webhook server!'


def bytes2json(data_bytes):
    data = data_bytes.decode('utf8').replace("'", '"')
    return json.loads(data)


def send_alert(data):
    token = os.getenv('ROBOT_TOKEN')
    if not token:
        print('you must set ROBOT_TOKEN env')
        return
    url = 'https://oapi.dingtalk.com/robot/send?access_token=%s' % token
    for output in data['alerts'][:]:
        # build a DingTalk markdown message from the alert's labels/annotations
        send_data = {
            "msgtype": "markdown",
            "markdown": {
                "title": "prometheus_alert",
                "text": "## Alerting program: prometheus_alertmanager \n" +
                        "**Severity**: %s \n\n" % output['labels']['status'] +
                        "**Alert name**: %s \n\n" % output['labels']['alertname'] +
                        "**Instance**: %s \n\n" % output['labels']['instance'] +
                        "**Details**: %s \n\n" % output['annotations']['summary'] +
                        "**Started at**: %s \n\n" % arrow.get(output['startsAt']).to('Asia/Shanghai').format('YYYY-MM-DD HH:mm:ss ZZ') +
                        "**Ended at**: %s \n" % arrow.get(output['endsAt']).to('Asia/Shanghai').format('YYYY-MM-DD HH:mm:ss ZZ')
            }
        }
        req = requests.post(url, json=send_data)
        result = req.json()
        if result['errcode'] != 0:
            print('notify dingtalk error: %s' % result['errcode'])


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8060)
```
Packaging the program as a container
Working directory layout (main.py is the Flask code above):

```shell
# tree
.
├── Dockerfile
└── main.py
```

```dockerfile
# cat Dockerfile
FROM tiangolo/uwsgi-nginx-flask:python3.7
# set the DingTalk robot token as an environment variable
ENV ROBOT_TOKEN 47f07271e8a24b6a63486bBSJDFKj346556jhjk9892fk545jjf234jFJ89489JFKSDLF2KgfhsJK234
RUN pip install requests flask arrow -i https://pypi.tuna.tsinghua.edu.cn/simple --no-cache-dir
COPY main.py /app
EXPOSE 80
```

Build the image, then start the container and smoke-test it:

```shell
docker run -d --restart=always -p 8060:80 dingding

curl localhost:8060
welcome to use prometheus alertmanager dingtalk webhook server!
```

Test data (note the single quotes: our server's `bytes2json` rewrites them to double quotes before parsing):

```shell
[root@t1 ~]# cat data.json
{
    "version": "3",
    "status": "firing",
    "receiver": "jdhf",
    "alerts": [
        {
            "labels": {'instance': "192.168.1.145:9100", 'alertname': "home directory usage", 'status': "critical"},
            "annotations": {'summary': "Disk space on the root mount point is below 4G! Currently available: 2G"}
        }
    ]
}

curl 127.0.0.1:8060 -X POST -d @data.json --header "Content-Type: application/json"
```

This test fails because real Alertmanager payloads carry extra fields our hand-made test data lacks; to make it pass, edit main.py and drop the start/end time fields.

Errors: if the DingTalk group notification returns `{"errcode":310000,"errmsg":"keywords not in content"}`, the fix is to set the custom keyword or whitelist the public IP in the robot's security settings.

The Alertmanager configuration at this point:

```yaml
# /usr/local/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - send_resolved: true
    url: 'http://localhost:8060'
    #url: 'http://localhost:8060/dingtalk/webhook/send'
```
The modified Python version:
```python
import os
import json

import requests
import arrow
from flask import Flask, request

app = Flask(__name__)


@app.route('/', methods=['POST', 'GET'])
def send():
    if request.method == 'POST':
        post_data = request.get_data()
        send_alert(bytes2json(post_data))
        return 'success'
    else:
        return 'welcome to use prometheus alertmanager dingtalk webhook server!'


def bytes2json(data_bytes):
    data = data_bytes.decode('utf8').replace("'", '"')
    return json.loads(data)


def send_alert(data):
    token = os.getenv('ROBOT_TOKEN')
    if not token:
        print('you must set ROBOT_TOKEN env')
        return
    url = 'https://oapi.dingtalk.com/robot/send?access_token=%s' % token
    for output in data['alerts'][:]:
        # fall back gracefully when a label or annotation is absent
        pod_name = output['labels'].get('pod', output['labels'].get('pod_name', 'null'))
        namespace = output['labels'].get('namespace', 'null')
        message = output['annotations'].get('message', output['annotations'].get('description', 'null'))
        send_data = {
            "msgtype": "markdown",
            "markdown": {
                "title": "prometheus_alert",
                "text": "## Alerting program: prometheus_alert \n" +
                        "**Severity**: %s \n\n" % output['labels']['severity'] +
                        "**Alert name**: %s \n\n" % output['labels']['alertname'] +
                        "**Failing pod**: %s \n\n" % pod_name +
                        "**Failing namespace**: %s \n\n" % namespace +
                        "**Details**: %s \n\n" % message +
                        "**Status**: %s \n\n" % output['status'] +
                        "**Started at**: %s \n\n" % arrow.get(output['startsAt']).to('Asia/Shanghai').format('YYYY-MM-DD HH:mm:ss ZZ') +
                        "**Ended at**: %s \n" % arrow.get(output['endsAt']).to('Asia/Shanghai').format('YYYY-MM-DD HH:mm:ss ZZ')
            }
        }
        req = requests.post(url, json=send_data)
        result = req.json()
        if result['errcode'] != 0:
            print('notify dingtalk error: %s' % result['errcode'])


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)