Monitoring Kubernetes with kube-state-metrics and WeChat Alerts

Preface

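Monitoring target       Implementation        Examples
Pod performance         cAdvisor              Container CPU and memory utilization
Node performance        node-exporter         Node CPU and memory utilization
K8s resource objects    kube-state-metrics    Pod/Deployment/Service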

Data collection

Here we use kube-state-metrics to collect data about Kubernetes resource objects.

Architecture diagram

[Figure: architecture diagram]

Monitoring metrics

The metric categories include:

  • CronJob Metrics
  • DaemonSet Metrics
  • Deployment Metrics
  • Job Metrics
  • LimitRange Metrics
  • Node Metrics
  • PersistentVolume Metrics
  • PersistentVolumeClaim Metrics
  • Pod Metrics
  • Pod Disruption Budget Metrics
  • ReplicaSet Metrics
  • ReplicationController Metrics
  • ResourceQuota Metrics
  • Service Metrics
  • StatefulSet Metrics
  • Namespace Metrics
  • Horizontal Pod Autoscaler Metrics
  • Endpoint Metrics
  • Secret Metrics
  • ConfigMap Metrics

Taking Pod as an example:

  • kube_pod_info
  • kube_pod_owner
  • kube_pod_status_phase
  • kube_pod_status_ready
  • kube_pod_status_scheduled
  • kube_pod_container_status_waiting
  • kube_pod_container_status_terminated_reason
  • ...
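To illustrate how these Pod metrics can be used, the following is a minimal sketch of a Prometheus recording-rule group that aggregates two of the metrics listed above. The group name and the recorded series names are hypothetical and chosen only for illustration; only the kube_pod_* metric names come from kube-state-metrics.

```yaml
groups:
  - name: kube-state-metrics-examples        # hypothetical group name
    rules:
      # Number of pods in each phase (Pending, Running, ...) per namespace
      - record: namespace_phase:kube_pod_status_phase:sum      # hypothetical series name
        expr: sum by (namespace, phase) (kube_pod_status_phase)
      # Number of containers currently in the waiting state per namespace
      - record: namespace:kube_pod_container_status_waiting:sum  # hypothetical series name
        expr: sum by (namespace) (kube_pod_container_status_waiting)
```

Each kube_pod_status_phase series is 1 for the phase the Pod is currently in and 0 otherwise, so summing by phase yields a pod count per phase.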

Deploying kube-state-metrics

By default, the resources are created in the kube-system namespace; it is best not to change the namespace in the YAML files.

# Fetch the YAML files
git clone https://gitee.com/tengfeiwu/kube-state-metrics_prometheus_wechat.git
# Deploy kube-state-metrics (run from the directory that contains kube-state-metrics-configs)
kubectl apply -f kube-state-metrics-configs
# Check the pod status
kubectl get pod -n kube-system
NAME                                   READY   STATUS    RESTARTS   AGE
kube-state-metrics-c698dc7b5-zstz9     1/1     Running   0          32m

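Check the logs of the kube-state-metrics-c698dc7b5-zstz9 container (for example with kubectl logs -n kube-system kube-state-metrics-c698dc7b5-zstz9), as shown in the figure:

[Figure: kube-state-metrics container logs]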

Connecting the data to Prometheus

Once Prometheus is ready, simply add the corresponding scrape job:

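# Append to the scrape job list at the end of sidecar/cm-kube-mon-sidecar.yaml,
# indented to match the existing job entries
- job_name: 'kube-state-metrics'
  static_configs:
    - targets: ['kube-state-metrics.kube-system.svc.cluster.local:8080']
# Re-apply the ConfigMap; Prometheus does not need a restart (hot reload is configured)
kubectl apply -f sidecar/cm-kube-mon-sidecar.yaml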
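The static target in this job resolves only if a Service named kube-state-metrics exists in the kube-system namespace and exposes port 8080. The manifests in the cloned repository presumably create one; the following is only a sketch of what such a Service could look like. The selector label and the port name are assumptions and must match the labels and ports of the actual kube-state-metrics Deployment.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics          # gives the DNS name kube-state-metrics.kube-system.svc.cluster.local
  namespace: kube-system
spec:
  selector:
    app.kubernetes.io/name: kube-state-metrics   # assumed pod label; check the Deployment
  ports:
    - name: http-metrics            # assumed port name
      port: 8080                    # port used in the scrape target
      targetPort: 8080              # container port serving /metrics
```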

Open the Prometheus web UI and check on the Targets page that the new job has taken effect, as shown in the figure:

[Figure: kube-state-metrics target in the Prometheus web UI]

Connecting the data to Grafana

Once Grafana is ready, add the Prometheus data source and import a ready-made or custom dashboard:

[Figure: Grafana dashboard]
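The data source can be added through the Grafana UI, or declaratively with a provisioning file. The following is a minimal sketch of such a file, placed under Grafana's provisioning/datasources directory. The Prometheus URL is an assumption (a Service named prometheus on port 9090 in the kube-mon namespace) and should be replaced with the address of your own Prometheus.

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                   # Grafana's backend proxies the queries
    url: http://prometheus.kube-mon.svc.cluster.local:9090   # assumed address
    isDefault: true
```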

Prometheus alerting rules

Edit the sidecar/rules-cm-kube-mon-sidecar.yaml configuration file and add the following alert rules.

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-mon
data:
  alert-rules.yaml: |-
    groups:
    - name: White box monitoring
      rules:
      - alert: Pod-Restart
        expr: changes(kube_pod_container_status_restarts_total{pod !~ "analyzer.*"}[10m]) > 0
        for: 1m
        labels:
          severity: warning
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Pod: {{ $labels.pod }}  Restart"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          pod: "{{ $labels.pod }}"
          container: "{{ $labels.container }}"

      - alert: Pod-Unknown/Failed
        expr: kube_pod_status_phase{phase="Unknown"} == 1 or kube_pod_status_phase{phase="Failed"} == 1
        for: 1m
        labels:
          severity: critical
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Pod: {{ $labels.pod }} is in the Unknown or Failed phase"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          pod: "{{ $labels.pod }}"
          container: "{{ $labels.container }}"

      - alert: Daemonset Unavailable
        expr: kube_daemonset_status_number_unavailable > 0
        for: 5m
        labels:
          severity: critical
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Daemonset: {{ $labels.daemonset }} has unavailable daemon pods"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          daemonset: "{{ $labels.daemonset }}"

      - alert: Job-Failed
        # kube_job_status_failed counts failed pods and can exceed 1, so compare with > 0
        expr: kube_job_status_failed > 0
        for: 5m
        labels:
          severity: warning
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Job: {{ $labels.job_name }} Failed"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          job: "{{ $labels.job_name }}"

      - alert: Pod NotReady
        # Note: this rule and several below select job=~".*kubernetes-service-endpoints".
        # If kube-state-metrics is scraped only by a job named 'kube-state-metrics',
        # adjust the selector, otherwise these expressions match no series.
        expr: sum by (namespace, pod, cluster_id) (max by(namespace, pod, cluster_id)(kube_pod_status_phase{job=~".*kubernetes-service-endpoints",phase=~"Pending|Unknown"}) * on(namespace, pod, cluster_id)group_left(owner_kind) topk by(namespace, pod) (1, max by(namespace, pod,owner_kind, cluster_id) (kube_pod_owner{owner_kind!="Job"}))) > 0
        for: 5m
        labels:
          severity: warning
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Pod: {{ $labels.pod }} has been NotReady for more than 5 minutes"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"

      - alert: Deployment replicas
        expr: (kube_deployment_spec_replicas{job=~".*kubernetes-service-endpoints"} != kube_deployment_status_replicas_available{job=~".*kubernetes-service-endpoints"}) and (changes(kube_deployment_status_replicas_updated{job=~".*kubernetes-service-endpoints"}[5m]) == 0)
        for: 5m
        labels:
          severity: warning
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Deployment: {{ $labels.deployment }} available replicas do not match the desired replicas"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          deployment: "{{ $labels.deployment }}"

      - alert: Statefulset replicas
        expr: (kube_statefulset_status_replicas_ready{job=~".*kubernetes-service-endpoints"} != kube_statefulset_status_replicas{job=~".*kubernetes-service-endpoints"}) and (changes(kube_statefulset_status_replicas_updated{job=~".*kubernetes-service-endpoints"}[5m]) == 0)
        for: 5m
        labels:
          severity: warning
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Statefulset: {{ $labels.statefulset }} ready replicas do not match the current replicas"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          statefulset: "{{ $labels.statefulset }}"

      - alert: PersistentVolume
        expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending",job=~".*kubernetes-service-endpoints"} > 0
        for: 5m
        labels:
          severity: critical
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "PV: {{ $labels.persistentvolume }} is in the Failed or Pending phase"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          persistentvolume: "{{ $labels.persistentvolume }}"

      - alert: PersistentVolumeClaim
        expr: kube_persistentvolumeclaim_status_phase{phase=~"Failed|Pending|Lost",job=~".*kubernetes-service-endpoints"} > 0
        for: 5m
        labels:
          severity: critical
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "PVC: {{ $labels.persistentvolumeclaim }} is in the Failed, Pending, or Lost phase"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          persistentvolumeclaim: "{{ $labels.persistentvolumeclaim }}"

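      - alert: k8s service
        # kube_service_status_load_balancer_ingress is 1 whenever the series exists, so "!= 1" never fires;
        # instead, select LoadBalancer Services that have no ingress series at all
        expr: kube_service_spec_type{type="LoadBalancer"} unless on(namespace, service) kube_service_status_load_balancer_ingress
        for: 5m
        labels:
          severity: critical
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
        annotations:
          summary: "Service: {{ $labels.service }} load balancer ingress is DOWN!"
          k8scluster: "{{ $labels.k8scluster}}"
          namespace: "{{ $labels.namespace }}"
          service: "{{ $labels.service }}"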

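Apply the updated sidecar/rules-cm-kube-mon-sidecar.yaml configuration file as follows:

# Apply the ConfigMap, then wait about a minute; hot reload is configured, so Prometheus does not need a restart
kubectl apply -f sidecar/rules-cm-kube-mon-sidecar.yaml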

Integrating AlertManager

Deploying WeChat alerting

# Fetch the YAML files
git clone https://gitee.com/tengfeiwu/kube-state-metrics_prometheus_wechat.git && cd thanos/AlertManager
# Deploy AlertManager
## Replace the WeChat settings with your own before applying
kubectl apply -f cm-kube-mon-alertmanager.yaml
kubectl apply -f wechat-template-kube-mon.yaml
kubectl apply -f deploy-kube-mon-alertmanager.yaml
kubectl apply -f svc-kube-mon-alertmanager.yaml
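The WeChat settings that need to be changed live in the AlertManager configuration held by cm-kube-mon-alertmanager.yaml. The contents of that file are not shown here, so the following is only a sketch of what a WeChat Work receiver typically looks like in an Alertmanager configuration. The receiver name, grouping and timing values, template path, and all angle-bracket placeholders are assumptions to be replaced with your own values.

```yaml
global:
  resolve_timeout: 5m
templates:
  - '/etc/alertmanager/template/*.tmpl'     # assumed mount path of the WeChat template
route:
  receiver: wechat                          # hypothetical receiver name
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
receivers:
  - name: wechat
    wechat_configs:
      - send_resolved: true                 # also send recovery messages
        corp_id: '<your-corp-id>'           # WeChat Work corporation ID
        agent_id: '<your-agent-id>'         # ID of the WeChat Work application
        api_secret: '<your-app-secret>'     # secret of that application
        to_party: '<department-id>'         # department that receives the alerts
        message: '{{ template "wechat.default.message" . }}'
```

Setting send_resolved to true is what makes AlertManager send a recovery message when an alert stops firing. If wechat-template-kube-mon.yaml defines a template under a different name, reference that name in the message field instead of the default shown here.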

Checking alert status

[Figure: alert status]

WeChat alert and recovery messages

Alert message

[Figure: WeChat alert message]

Recovery message

[Figure: WeChat recovery message]
