Kubernetes Monitoring with Prometheus in Practice: Email Alerts and WeChat Alerts (Part 2)

Date: 2020-05-26

Tags: kubernetes, monitoring, prometheus, hands-on, email alerts, WeChat

Kubernetes plugin

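The above covers the dashboard features of Grafana that we use most often. Grafana also supports other administrative tasks, such as adding users and granting them permissions, and it can be extended with plugins. For example, there is a plugin dedicated to monitoring Kubernetes clusters: grafana-kubernetes-app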

To install this plugin, run the install command inside the Grafana Pod:

$ kubectl get pods -n kube-ops
NAME                          READY     STATUS    RESTARTS   AGE
grafana-75f64c6759-gknxz      1/1       Running   0          1d
node-exporter-48b6g           1/1       Running   0          7d
node-exporter-4swrs           1/1       Running   0          7d
node-exporter-4w2dd           1/1       Running   0          7d
node-exporter-fcp9x           1/1       Running   0          7d
prometheus-56b6d68c48-6xpvw   1/1       Running   0          7d
[root@k8s-master ~]# kubectl exec -it grafana-75f64c6759-gknxz /bin/bash -n kube-ops
grafana@grafana-75f64c6759-gknxz:/usr/share/grafana$ grafana-cli plugins install grafana-kubernetes-app
installing grafana-kubernetes-app @ 1.0.1
from url: https://grafana.com/api/plugins/grafana-kubernetes-app/versions/1.0.1/download
into: /var/lib/grafana/plugins

✔ Installed grafana-kubernetes-app successfully

Restart grafana after installing plugins . <service grafana-server restart>

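Grafana must be restarted for the plugin to take effect. Here we simply delete the Pod and let it be recreated. Back in the Grafana UI, the plugins page now lists a Kubernetes plugin. Click it to enable it, then click the link next to Next up to configure a cluster.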

Here we can add a new Kubernetes cluster. Fill in the cluster's access address, https://kubernetes.default. The important part is the cluster certificates: check both TLS Client Auth and With CA Cert.

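For the certificate data, use the certificate information from the kubectl configuration file we use to access the cluster (~/.kube/config). The fields certificate-authority-data, client-certificate-data, and client-key-data correspond to the CA certificate, the client certificate, and the client private key. The values in the config file are base64-encoded, so decode them before pasting them in.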
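The decoding step can be sketched in Python. This is only a minimal sketch under assumptions: the `cluster` dictionary stands in for the matching entry of ~/.kube/config, its value is a fake placeholder generated in the code rather than a real certificate, and `decode_kubeconfig_field` is a hypothetical helper name.

```python
import base64


def decode_kubeconfig_field(value: str) -> str:
    """Decode one base64-encoded kubeconfig field into PEM text."""
    return base64.b64decode(value).decode("utf-8")


# Placeholder data standing in for an entry of ~/.kube/config (not a real certificate).
fake_pem = "-----BEGIN CERTIFICATE-----\nFAKE\n-----END CERTIFICATE-----\n"
cluster = {"certificate-authority-data": base64.b64encode(fake_pem.encode()).decode()}

ca_cert = decode_kubeconfig_field(cluster["certificate-authority-data"])
print(ca_cert)  # the decoded PEM text is what goes into the CA Cert field
```

The same helper would apply to client-certificate-data and client-key-data.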

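Once configured, click Deploy (in fact, the relevant resources were already deployed in earlier lessons), then click Save. The cluster's monitoring data then becomes available.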

This shows the state of the whole cluster, and several dashboards are available for inspection.

AlertManager

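Alertmanager receives the alerts sent by Prometheus. It supports a rich set of notification channels and makes it easy to deduplicate alerts, reduce noise, and group them, which makes it a modern alert notification system.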

Installation

According to the official documentation at https://prometheus.io/docs/alerting/configuration/, after downloading the Alertmanager binary we can run it with the following command:

$ ./alertmanager --config.file=simple.yml

The --config.file flag specifies the configuration file. Since we again want to run inside the Kubernetes cluster, we install it from a Docker image: prom/alertmanager:v0.15.3

First, provide the configuration file. As before, we use a ConfigMap resource object (alertmanager-conf.yaml):

apiVersion: v1
kind: ConfigMap
metadata:
  name: alert-config
  namespace: kube-ops
data:
  config.yml: |-
    global:
      # How long to wait before declaring an alert resolved once it is no longer reported
      resolve_timeout: 5m
      # Email delivery settings
      smtp_smarthost: 'smtp.qq.com:587'
      smtp_from: 'zhaikun1992@qq.com'
      smtp_auth_username: 'zhaikun1992@qq.com'
      smtp_auth_password: '**'  # replace with your own password
      smtp_hello: 'qq.com'
      smtp_require_tls: false
    # The root route that every alert enters; it defines how alerts are dispatched
    route:
      # Labels used to regroup incoming alerts. For example, alerts that carry
      # cluster=A and alertname=LatencyHigh are aggregated into a single group
      group_by: ['alertname', 'cluster']
      # When a new alert group is created, wait at least group_wait before sending
      # the first notification, so that more alerts for the same group can be
      # collected and sent together
      group_wait: 30s
      # After the first notification has been sent, wait group_interval before
      # sending a notification about new alerts in the group
      group_interval: 5m
      # If a notification has already been sent successfully, wait repeat_interval
      # before sending it again
      repeat_interval: 5m
      # Default receiver: an alert that matches no child route is sent here
      receiver: default
      # All of the attributes above are inherited by child routes and can be
      # overridden on each of them
      routes:
      - receiver: email
        group_wait: 10s
        match:
          team: node
    receivers:
    - name: 'default'
      email_configs:
      - to: 'zhai_kun@suixingpay.com'
        send_resolved: true
    - name: 'email'
      email_configs:
      - to: 'zhaikun1992@qq.com'
        send_resolved: true
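To make the effect of group_by more concrete, here is a rough Python sketch of how alerts could be bucketed by the values of the grouping labels. It only illustrates the idea and is not Alertmanager's actual implementation; the sample alerts and the `group_alerts` helper are hypothetical.

```python
from collections import defaultdict

GROUP_BY = ("alertname", "cluster")  # mirrors group_by in the route


def group_alerts(alerts):
    """Bucket alerts (given as label dictionaries) by their group_by label values."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in GROUP_BY)
        groups[key].append(labels)
    return dict(groups)


# Hypothetical sample alerts, each represented by its label set.
alerts = [
    {"alertname": "LatencyHigh", "cluster": "A", "instance": "node1"},
    {"alertname": "LatencyHigh", "cluster": "A", "instance": "node2"},
    {"alertname": "LatencyHigh", "cluster": "B", "instance": "node3"},
]
groups = group_alerts(alerts)
for key, members in groups.items():
    print(key, len(members))
```

The two alerts from cluster A land in one group and would be delivered in a single notification, while the alert from cluster B forms its own group.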

There is a pitfall here. If the mail server is addressed on port 25 or 465, the following error may be reported:

smtp.*****.com:465 fail to send mail alert due to 'does not advertise the STARTTLS extension

This is a bug.

Create it:

$ kubectl create -f alertmanager-conf.yaml
configmap "alert-config" created

Next, configure the Alertmanager container. We can add it directly to the existing Prometheus Pod. The corresponding YAML is:

      - name: alertmanager
        image: prom/alertmanager:v0.15.3
        imagePullPolicy: IfNotPresent
        args:
        - "--config.file=/etc/alertmanager/config.yml"
        - "--storage.path=/alertmanager/data"
        ports:
        - containerPort: 9093
          name: http
        volumeMounts:
        - mountPath: "/etc/alertmanager"
          name: alertcfg
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 100m
            memory: 256Mi
      volumes:
      - name: alertcfg
        configMap:
          name: alert-config

Here the alert-config ConfigMap created above is mounted as a volume at /etc/alertmanager, and the startup argument --config.file=/etc/alertmanager/config.yml points to the configuration file. Now update the Prometheus Pod:

$ kubectl apply -f prome-deploy.yaml
deployment.extensions "prometheus" configured

After the Alertmanager container starts, we still need to tell Prometheus where Alertmanager is so that Prometheus can reach it. Add the following to the Prometheus ConfigMap manifest:

alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]

After updating this resource object, wait a moment and then trigger a reload:

$ kubectl delete -f prome-cm.yaml
configmap "prometheus-config" deleted
$ kubectl create -f prome-cm.yaml
configmap "prometheus-config" created
$ kubectl get svc -n kube-ops
NAME         TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)                         AGE
grafana      NodePort   10.97.81.127    <none>        3000:30489/TCP                  75d
prometheus   NodePort   10.111.210.47   <none>        9090:31990/TCP,9093:30250/TCP   77d
$ curl -X POST "http://10.111.210.47:9090/-/reload"
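The reload call can also be scripted. The following Python sketch builds the same POST request as the curl command using only the standard library. The address is the ClusterIP from the Service listing and is an assumption about your environment; `build_reload_request` and `reload_prometheus` are hypothetical helper names. Note that Prometheus 2.x only serves the /-/reload endpoint when it is started with --web.enable-lifecycle.

```python
import urllib.request

# Assumed address: the ClusterIP and port of the prometheus Service.
PROMETHEUS_URL = "http://10.111.210.47:9090"


def build_reload_request(base_url: str) -> urllib.request.Request:
    """Build the POST request that asks Prometheus to reload its configuration."""
    return urllib.request.Request(f"{base_url}/-/reload", data=b"", method="POST")


def reload_prometheus(base_url: str, timeout: float = 5.0) -> int:
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


request = build_reload_request(PROMETHEUS_URL)
print(request.get_method(), request.full_url)
```

Calling reload_prometheus(PROMETHEUS_URL) from a machine that can reach the Service sends the request; the block itself only constructs it, so it can be run anywhere.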

Alerting rules

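So far we have only got Alertmanager running; it does not yet know what to alert on, because nothing defines when an alert should be raised. We still need alerting rules that specify which data to alert on.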

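Alerting rules let you define alert conditions with Prometheus expression language expressions and send notifications to an external receiver when an alert fires.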

Add the following alerting rule configuration to the Prometheus configuration file:

rule_files:
- /etc/prometheus/rules.yml

rule_files lists the alerting rule files. Here we again provide rules.yml through a ConfigMap mounted under /etc/prometheus:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-ops
data:
  prometheus.yml: |
    ...
  rules.yml: |
    groups:
    - name: test-rule
      rules:
      - alert: NodeMemoryUsage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 20
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{$labels.instance}}: High Memory usage detected"
          description: "{{$labels.instance}}: Memory usage is above 20% (current value is: {{ $value }})"
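The arithmetic in the NodeMemoryUsage expression can be checked on plain numbers. The sketch below repeats the same formula in Python; the node's memory figures are made up for illustration and `memory_usage_percent` is a hypothetical helper name.

```python
GIB = 1024 ** 3


def memory_usage_percent(total, free, buffers, cached):
    """Same arithmetic as the NodeMemoryUsage expression, applied to plain numbers."""
    return (total - (free + buffers + cached)) / total * 100


# Hypothetical node: 8 GiB total, 2 GiB free, 0.5 GiB buffers, 1.5 GiB cached.
usage = memory_usage_percent(8 * GIB, 2 * GIB, GIB // 2, 3 * GIB // 2)
print(usage)       # 50.0
print(usage > 20)  # True, so the rule's condition holds for this node
```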

Above we defined an alerting rule named NodeMemoryUsage, in which:

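The for clause makes Prometheus wait until the expression has been continuously true for the given duration before the alert fires; until then the alert is pending.
The labels clause specifies a set of additional labels to attach to the alert.
The annotations clause specifies another set of labels that are not treated as part of the alert's identity. They are typically used to store extra information, for example for displaying the alert.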

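For demonstration purposes the threshold in the expression is set to 20. Update the ConfigMap resource object again. Because the prometheus-config ConfigMap is already mounted into the Prometheus Pod as a volume at /etc/prometheus, the rules.yml file appears in that directory after the update, so the rule_files path configured earlier is valid. Once the update is done, trigger a reload again. The alerts page of the Prometheus dashboard then shows the configured alerting rule.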

Because we configured a for duration, the alert is initially in the PENDING state. After two minutes it enters the FIRING state.

The page shows the alerting rule we just defined together with its state. During its lifecycle an alert is in one of three states:

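inactive: the alert is neither firing nor pending
pending: the alert condition is met but has not yet held for the configured for duration
firing: the alert condition has held for at least the configured for duration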
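The relationship between the for duration and the three states can be sketched as a small state function. This is an illustrative model rather than Prometheus's actual code; the 30 second evaluation interval and the `alert_state` helper are assumptions made for the example.

```python
def alert_state(condition_history, for_seconds, interval_seconds):
    """Return the alert state after a series of rule evaluations.

    condition_history holds one boolean per evaluation, oldest first,
    telling whether the alert expression returned a result.
    """
    active_for = None  # seconds the condition has been continuously true
    for condition_met in condition_history:
        if not condition_met:
            active_for = None          # condition cleared: back to inactive
        elif active_for is None:
            active_for = 0             # condition just became true
        else:
            active_for += interval_seconds
    if active_for is None:
        return "inactive"
    return "firing" if active_for >= for_seconds else "pending"


# for: 2m with an assumed evaluation interval of 30 seconds
print(alert_state([False], 120, 30))
print(alert_state([True, True], 120, 30))
print(alert_state([True] * 5, 120, 30))
```

With these inputs the three calls correspond to the inactive, pending, and firing states in turn; any evaluation in which the condition is false resets the timer.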

The state here is now firing, which means the alert is active. The alert carries the label team=node, and the Alertmanager configuration at the top contains this route:

routes:
- receiver: email
  group_wait: 10s
  match:
    team: node

So this alert is handled by the email receiver. That receiver is configured with a mailbox, so under normal circumstances we now receive an alert email.
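How a label such as team=node selects a receiver can be sketched in a few lines of Python. This simplified model only covers exact-match child routes with a default fallback, as used in this configuration; it is not Alertmanager's routing code, and `pick_receiver` is a hypothetical helper name.

```python
# Child routes in order: (labels that must match exactly, receiver name).
ROUTES = [({"team": "node"}, "email")]
DEFAULT_RECEIVER = "default"


def pick_receiver(labels):
    """Return the receiver of the first child route whose match labels all agree."""
    for match, receiver in ROUTES:
        if all(labels.get(name) == value for name, value in match.items()):
            return receiver
    return DEFAULT_RECEIVER


print(pick_receiver({"alertname": "NodeMemoryUsage", "team": "node"}))  # email
print(pick_receiver({"alertname": "SomethingElse"}))                    # default
```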

The email contains a View In AlertManager link. We can also reach the Alertmanager dashboard through the NodePort:

$ kubectl get svc -n kube-ops
NAME         TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)                         AGE
grafana      NodePort   10.97.81.127    <none>        3000:30489/TCP                  75d
prometheus   NodePort   10.111.210.47   <none>        9090:31990/TCP,9093:30250/TCP   77d

Open port 30250 on the IP of any node to view the Alertmanager dashboard.

This page offers operations such as filtering and grouping, and it introduces two new concepts: Inhibition and Silences.

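Inhibition: inhibition suppresses notifications for certain alerts when certain other alerts are already firing. For example, when an alert is firing that reports an entire cluster as unreachable, Alertmanager can be configured to mute all other alerts concerning that cluster. This prevents notifications for hundreds or thousands of firing alerts that are unrelated to the actual issue. Inhibition is configured in the Alertmanager configuration file.
Silences: a silence is a straightforward way to mute alerts for a given period. A silence is configured with matchers, just like the routing tree. Incoming alerts are checked against the equality or regular expression matchers of each active silence; if they match, no notification is sent to the receivers for those alerts.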
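The matcher check for silences can be sketched as follows. This is a simplified model under assumptions: a silence is represented as a list of (label, value, is_regex) tuples, regular expressions are applied to the whole label value, and the instance names are hypothetical. It is not Alertmanager's implementation.

```python
import re


def matcher_matches(matcher, labels):
    """Check one matcher, given as (label name, value, is_regex), against a label set."""
    name, value, is_regex = matcher
    actual = labels.get(name, "")
    if is_regex:
        return re.fullmatch(value, actual) is not None
    return actual == value


def is_silenced(labels, silences):
    """An alert is silenced if every matcher of at least one active silence matches."""
    return any(all(matcher_matches(m, labels) for m in silence) for silence in silences)


# One hypothetical silence: memory alerts on instances whose name starts with k8s-master.
silences = [[("alertname", "NodeMemoryUsage", False), ("instance", "k8s-master.*", True)]]

print(is_silenced({"alertname": "NodeMemoryUsage", "instance": "k8s-master"}, silences))
print(is_silenced({"alertname": "NodeMemoryUsage", "instance": "k8s-node01"}, silences))
```

Only the first alert is silenced, because both of the silence's matchers have to agree with the alert's labels.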

Because the route sets repeat_interval: 5m, as long as the test alert above keeps meeting its condition (memory usage above 20%), we receive an alert email every 5 minutes.

Now add a silence that matches the memory alert for the k8s-master node.

After it is added, the next notification no longer includes the alert for node k8s-master (the number of alerting nodes drops to 3).

The silence we added has an expiry time, so once that period has passed, the k8s-master alert notifications resume.

WeChat alerts

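1. Log in to WeChat Work (企業微信) and create a third-party application: click the create application button, then fill in the application details.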

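Open the prometheus application we just created and collect its details (the agent_id and Secret are needed in the receiver configuration later).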

Add an alert message template:

apiVersion: v1
kind: ConfigMap
metadata:
  name: wechat-tmpl
  namespace: kube-ops
data:
  wechat.tmpl: |
    {{ define "wechat.default.message" }}
    {{ range .Alerts }}
    ========start==========
    Alert source: prometheus_alert
    Severity: {{ .Labels.severity }}
    Alert name: {{ .Labels.alertname }}
    Instance: {{ .Labels.instance }}
    Summary: {{ .Annotations.summary }}
    Description: {{ .Annotations.description }}
    Started at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
    ========end==========
    {{ end }}
    {{ end }}
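The layout string "2006-01-02 15:04:05" in the template is Go's reference-time notation for a year-month-day hour:minute:second timestamp. As a rough illustration, the Python sketch below renders one alert in a similar shape; the field captions, sample alert data, and the `render_alert` helper are assumptions for the example, and the real rendering is done by Alertmanager's Go template engine.

```python
from datetime import datetime

# strftime pattern equivalent to Go's reference layout "2006-01-02 15:04:05".
GO_LAYOUT_AS_STRFTIME = "%Y-%m-%d %H:%M:%S"


def render_alert(labels, annotations, starts_at):
    """Approximate one iteration of the template's range loop in plain Python."""
    return "\n".join([
        "========start==========",
        f"Severity: {labels.get('severity', '')}",
        f"Alert name: {labels.get('alertname', '')}",
        f"Instance: {labels.get('instance', '')}",
        f"Summary: {annotations.get('summary', '')}",
        f"Started at: {starts_at.strftime(GO_LAYOUT_AS_STRFTIME)}",
        "========end==========",
    ])


# Hypothetical alert data for illustration.
message = render_alert(
    {"alertname": "NodeFilesystemUsage", "instance": "k8s-master"},
    {"summary": "k8s-master: High Filesystem usage detected"},
    datetime(2020, 5, 26, 10, 30, 0),
)
print(message)
```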

Mount the template into the Alertmanager Pod at /etc/alertmanager-tmpl:

        volumeMounts:
        ........
        - mountPath: "/etc/alertmanager-tmpl"
          name: wechattmpl
        ........
      volumes:
      .........
      - name: wechattmpl
        configMap:
          name: wechat-tmpl

Add a new alerting rule to rules.yml:

      - alert: NodeFilesystemUsage
        expr: (node_filesystem_size_bytes{device="rootfs"} - node_filesystem_free_bytes{device="rootfs"}) / node_filesystem_size_bytes{device="rootfs"} * 100 > 10
        for: 2m
        labels:
          team: wechat
        annotations:
          summary: "{{$labels.instance}}: High Filesystem usage detected"
          description: "{{$labels.instance}}: Filesystem usage is above 10% (current value is: {{ $value }})"

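Alertmanager supports WeChat notifications out of the box. The available options are documented on the official site; our configuration is as follows: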

apiVersion: v1
kind: ConfigMap
metadata:
  name: alert-config
  namespace: kube-ops
data:
  config.yml: |-
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.qq.com:587'
      smtp_from: 'zhaikun1992@qq.com'
      smtp_auth_username: 'zhaikun1992@qq.com'
      smtp_auth_password: '**'  # replace with your own password
      smtp_hello: 'qq.com'
      smtp_require_tls: true
    # WeChat alert template
    templates:
    - "/etc/alertmanager-tmpl/wechat.tmpl"
    route:
      group_by: ['alertname', 'cluster', 'alertname_wechat']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 5m
      receiver: default
      routes:
      - receiver: email
        group_wait: 10s
        match:
          team: node
      # Route for WeChat alerts
      - receiver: 'wechat'
        group_wait: 10s
        match:
          team: wechat
    receivers:
    - name: 'default'
      email_configs:
      - to: 'zhai_kun@suixingpay.com'
        send_resolved: true
    - name: 'email'
      email_configs:
      - to: 'zhaikun1992@qq.com'
        send_resolved: true
    # WeChat receiver
    - name: 'wechat'
      wechat_configs:
      - corp_id: '***'
        to_party: '**'
        to_user: "***"
        agent_id: '***'
        api_secret: '****'
        send_resolved: true

wechat_configs options

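send_resolved: whether to send a notification when an alert is resolved; defaults to false
api_secret: the Secret of the application created in WeChat Work
api_url: the WeChat API URL; the default is fine
corp_id: the enterprise ID, shown at the bottom of the My Enterprise page in WeChat Work
message: the alert message template; defaults to template "wechat.default.message"
agent_id: the agent_id of the application created in WeChat Work
to_user: the users who receive the message; use @all for all users
to_party: the departments that receive the message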
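To show how these options fit together, the sketch below builds the two pieces a WeChat Work notification needs: the token request URL (from corp_id and api_secret) and the JSON body of a text message (from agent_id, to_user, and to_party). It makes no network calls. The endpoint path and JSON field names reflect our understanding of the WeChat Work API and should be checked against its documentation; the helper names and sample values are hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumed default api_url of Alertmanager's WeChat integration.
API_URL = "https://qyapi.weixin.qq.com/cgi-bin/"


def token_url(corp_id, api_secret):
    """URL used to request an access token for the application."""
    return API_URL + "gettoken?" + urlencode({"corpid": corp_id, "corpsecret": api_secret})


def message_body(agent_id, content, to_user="", to_party=""):
    """JSON body of a plain text message addressed to users and/or departments."""
    return json.dumps({
        "touser": to_user,
        "toparty": to_party,
        "agentid": agent_id,
        "msgtype": "text",
        "text": {"content": content},
    }, ensure_ascii=False)


# Placeholder credentials and IDs, not real values.
print(token_url("my-corp-id", "my-secret"))
print(message_body("1000002", "test alert", to_user="@all"))
```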

Let's deploy and verify:

$ kubectl create -f alertmanager-cm.yaml
$ kubectl create -f prome-server.yaml
$ kubectl create -f prome-confmap.yaml
$ kubectl create -f wechat-tmpl.yaml

After about two minutes, our WeChat Work account receives the alert message.
