- The task here is to send an alert email with Alertmanager.
- The environment uses the binary releases of the Prometheus components.
- We will monitor one node's memory and send an email alert when usage exceeds 2% (an intentionally low threshold, for testing).
For a Kubernetes cluster, follow the official Prometheus documentation instead.
Environment preparation
Download the binaries from https://prometheus.io/download/
https://github.com/prometheus/prometheus/releases/download/v2.0.0/prometheus-2.0.0.linux-amd64.tar.gz
https://github.com/prometheus/alertmanager/releases/download/v0.12.0/alertmanager-0.12.0.linux-amd64.tar.gz
https://github.com/prometheus/node_exporter/releases/download/v0.15.2/node_exporter-0.15.2.linux-amd64.tar.gz
Extract the archives
/root/
├── alertmanager -> alertmanager-0.12.0.linux-amd64
├── alertmanager-0.12.0.linux-amd64
├── alertmanager-0.12.0.linux-amd64.tar.gz
├── node_exporter-0.15.2.linux-amd64
├── node_exporter-0.15.2.linux-amd64.tar.gz
├── prometheus -> prometheus-2.0.0.linux-amd64
├── prometheus-2.0.0.linux-amd64
└── prometheus-2.0.0.linux-amd64.tar.gz
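The steps above condensed into a shell sketch (versions and paths match the tree; adjust them for your environment):

cd /root
# fetch the linux-amd64 builds listed above
wget https://github.com/prometheus/prometheus/releases/download/v2.0.0/prometheus-2.0.0.linux-amd64.tar.gz
wget https://github.com/prometheus/alertmanager/releases/download/v0.12.0/alertmanager-0.12.0.linux-amd64.tar.gz
wget https://github.com/prometheus/node_exporter/releases/download/v0.15.2/node_exporter-0.15.2.linux-amd64.tar.gz
# unpack each archive
tar xzf prometheus-2.0.0.linux-amd64.tar.gz
tar xzf alertmanager-0.12.0.linux-amd64.tar.gz
tar xzf node_exporter-0.15.2.linux-amd64.tar.gz
# convenience symlinks, as shown in the tree
ln -s prometheus-2.0.0.linux-amd64 prometheus
ln -s alertmanager-0.12.0.linux-amd64 alertmanager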
Lab architecture
Configure alertmanager
Create alert.yml
[root@n1 alertmanager]# ls
alertmanager  alert.yml  amtool  data  LICENSE  NOTICE  simple.yml
alert.yml defines: who sends, which events, to whom, and how they are sent, etc.
cat alert.yml
global:
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'maotai@163.com'
  smtp_auth_username: 'maotai@163.com'
  smtp_auth_password: '123456'

templates:
  - '/root/alertmanager/template/*.tmpl'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 10m
  receiver: default-receiver

receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'maotai@foxmail.com'

- Once the config is in place, just start it: ./alertmanager -config.file=./alert.yml
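A quick sanity check that Alertmanager came up and loaded the config (assuming the default listen port 9093; the exact JSON shape depends on the Alertmanager version):

# returns JSON with the running version and the loaded configuration
curl http://localhost:9093/api/v1/status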
Configure prometheus
Alert rule configuration, rule.yml (referenced from prometheus.yml)
Send an email alert when memory usage exceeds 2% (testing threshold).
$ cat rule.yml
groups:
- name: test-rule
  rules:
  - alert: NodeMemoryUsage
    expr: (node_memory_MemTotal - (node_memory_MemFree + node_memory_Buffers + node_memory_Cached)) / node_memory_MemTotal * 100 > 2
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }}: High Memory usage detected"
      description: "{{ $labels.instance }}: Memory usage is above 2% (current value is: {{ $value }})"
The key is this expression:
(node_memory_MemTotal - (node_memory_MemFree + node_memory_Buffers + node_memory_Cached)) / node_memory_MemTotal * 100 > 2
labels: attaches labels to this rule
annotations (alert description): this part is the content of the alert notification
Where do the monitoring keys (node_memory_MemTotal/node_memory_Buffers/node_memory_Cached) come from? (explained later)
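Before wiring the expression into a rule, it can be evaluated by hand against the Prometheus query API (a sketch, assuming Prometheus listens on 192.168.14.11:9090 as configured below):

# URL-encode the expression; the result is the current memory usage percentage per instance
curl -G 'http://192.168.14.11:9090/api/v1/query' \
  --data-urlencode 'query=(node_memory_MemTotal - (node_memory_MemFree + node_memory_Buffers + node_memory_Cached)) / node_memory_MemTotal * 100'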
prometheus.yml configuration
- Add the node_exporter job
- Add the alerting rules via rule_files; the rule_files section references rule.yml
$ cat prometheus.yml
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]

rule_files:
  - /root/prometheus/rule.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['192.168.14.11:9090']

  - job_name: linux
    static_configs:
      - targets: ['192.168.14.11:9100']
        labels:
          instance: db1
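The Prometheus 2.x tarball ships a promtool binary next to the server; it can catch YAML and rule mistakes before you start the process (a sketch, run from /root/prometheus):

# validate prometheus.yml, including the files referenced under rule_files
./promtool check config prometheus.yml
# validate the alerting rules on their own
./promtool check rules rule.yml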
Once configured, start Prometheus and open the web UI; you can now see the node target.
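The same check works without the web UI through the targets API (address as configured above):

# lists every scrape target together with its health (up/down) and last error
curl http://192.168.14.11:9090/api/v1/targets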
View the metrics exposed by node_exporter
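node_exporter serves plain-text metrics on port 9100 by default; the node_memory_* keys used in the rule come straight from this output:

# show only the memory series referenced by the alert expression
curl -s http://192.168.14.11:9100/metrics | grep '^node_memory_'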
Check the Alerts page to see the state the alert rule is in (inactive / pending / firing).
The keys used in these expressions can be seen here (provided the corresponding exporter is installed); write the alert expressions against these keys.
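If you prefer the API over the expression browser's dropdown, the full list of metric names can also be pulled from the label-values endpoint (a sketch against the same server):

# every metric name currently known to this Prometheus server
curl http://192.168.14.11:9090/api/v1/label/__name__/values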
Check the received email
WeChat alert configuration
global:
  # The smarthost and SMTP sender used for mail notifications.
  resolve_timeout: 6m
  smtp_smarthost: '172.16.100.14:25'
  smtp_from: 'svnbuild_yf@iflytek.com'
  smtp_auth_username: 'svnbuild_yf'
  smtp_auth_password: 'tag#write@2015313'
  smtp_require_tls: false
  # The auth token for Hipchat.
  hipchat_auth_token: '1234556789'
  # Alternative host for Hipchat.
  hipchat_api_url: 'https://hipchat.foobar.org/'
  wechat_api_url: "https://qyapi.weixin.qq.com/cgi-bin/"
  wechat_api_secret: "4tQroVeB0xUcccccccc65Yfkj2Nkt90a80MH3ayI"
  wechat_api_corp_id: "wxaf5acxxxx5f8eb98"

# The directory from which notification templates are read.
templates:
  - 'templates/*.tmpl'

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 3s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 1h

  # A default receiver
  receiver: ybyang2

  routes:
  - match:
      job: "11"
      #service: "node_exporter"
    routes:
    - match:
        status: yellow
      receiver: ybyang2
    - match:
        status: orange
      receiver: berlin

# Inhibition rules allow to mute a set of alerts given that another alert is
# firing.
# We use this to mute any warning-level notifications if the same alert is
# already critical.
inhibit_rules:
- source_match:
    service: 'up'
  target_match:
    service: 'mysql'
  # Apply inhibition if the alertname is the same.
  equal: ["instance"]
- source_match:
    service: "mysql"
  target_match:
    service: "mysql-query"
  equal: ['instance']
- source_match:
    service: "A"
  target_match:
    service: "B"
  equal: ["instance"]
- source_match:
    service: "B"
  target_match:
    service: "C"
  equal: ["instance"]

receivers:
- name: 'ybyang2'
  email_configs:
  - to: 'ybyang2@iflytek.com'
    send_resolved: true
    html: '{{ template "email.default.html" . }}'
    headers: { Subject: "[mail] 測試技術部監控告警郵件" }
- name: "berlin"
  wechat_configs:
  - send_resolved: true
    to_user: "@all"
    to_party: ""
    to_tag: ""
    agent_id: "1"
    corp_id: "wxaf5a99ccccc5f8eb98"
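To exercise the routing (for example the status: orange branch that goes to the WeChat receiver) without waiting for Prometheus to fire, a synthetic alert can be pushed straight into Alertmanager's v1 API; the label values here simply mirror the match rules above:

# push a fake firing alert; per the routes above it should reach the "berlin" wechat receiver
curl -XPOST http://localhost:9093/api/v1/alerts -d '[
  {
    "labels": {
      "alertname": "TestWechat",
      "job": "11",
      "status": "orange",
      "instance": "db1"
    },
    "annotations": {
      "summary": "manual test alert"
    }
  }
]'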