Monitoring target | Implementation | Example |
---|---|---|
Pod performance | cAdvisor | Container CPU / memory utilization |
Node performance | node-exporter | Node CPU / memory utilization |
K8S resource objects | kube-state-metrics | Pod / Deployment / Service |
Here we use kube-state-metrics to collect Kubernetes resource data.

The metric categories it exposes, taking Pod as an example, are shown in the screenshots in the original post.
By default the resources are created in the kube-system namespace; it is best not to change the namespace in the YAML files.
```shell
# Fetch the yml files
git clone https://gitee.com/tengfeiwu/kube-state-metrics_prometheus_wechat.git
# Deploy kube-state-metrics
kubectl apply -f kube-state-metrics-configs
# Check pod status
kubectl get pod -n kube-system
NAME                                 READY   STATUS    RESTARTS   AGE
kube-state-metrics-c698dc7b5-zstz9   1/1     Running   0          32m
```
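Once the pod is Running, it helps to confirm that the exporter is actually serving metrics before wiring it into Prometheus. A minimal sketch, assuming the manifests above created the default `kube-state-metrics` Service on port 8080:

```shell
# Port-forward the kube-state-metrics Service (default name/port assumed)
kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080 &
sleep 2
# Sample a few of the exposed resource-state metrics
curl -s http://localhost:8080/metrics | grep '^kube_pod_status_phase' | head -n 5
```

If the grep returns nothing, check the Service name and port in the applied manifests.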
Fetch the logs of the kube-state-metrics-c698dc7b5-zstz9 container, as shown in the figure:
Once Prometheus is ready, add the corresponding scrape job:
```yaml
# Append at the end of sidecar/cm-kube-mon-sidecar.yaml
    - job_name: 'kube-state-metrics'
      static_configs:
      - targets: ['kube-state-metrics.kube-system.svc.cluster.local:8080']
```
```shell
# Re-apply the ConfigMap; no need to restart Prometheus (hot reload is configured)
kubectl apply -f sidecar/cm-kube-mon-sidecar.yaml
```
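To double-check that the hot reload picked up the new job, you can query the Prometheus HTTP API instead of clicking through the UI. A sketch, assuming Prometheus is reachable on localhost:9090 (for example via a port-forward):

```shell
# List active targets and look for the new job name
curl -s http://localhost:9090/api/v1/targets \
  | grep -o '"job":"kube-state-metrics"' | head -n 1
```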
Open the Prometheus web UI and check whether the configuration has taken effect under Targets, as shown below:
Once Grafana is ready, import a chosen or custom dashboard in Grafana and add the Prometheus data source:
Modify the sidecar/rules-cm-kube-mon-sidecar.yaml configuration file and add the following alerting rules.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-mon
data:
  alert-rules.yaml: |-
    groups:
    - name: White box monitoring
      rules:
      - alert: PodRestart
        expr: changes(kube_pod_container_status_restarts_total{pod!~"analyzer.*"}[10m]) > 0
        for: 1m
        labels:
          severity: warning
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster }}_{{ $labels.namespace }}"
        annotations:
          summary: "Pod: {{ $labels.pod }} Restart"
          k8scluster: "{{ $labels.k8scluster }}"
          namespace: "{{ $labels.namespace }}"
          pod: "{{ $labels.pod }}"
          container: "{{ $labels.container }}"
      - alert: PodUnknownOrFailed
        expr: kube_pod_status_phase{phase="Unknown"} == 1 or kube_pod_status_phase{phase="Failed"} == 1
        for: 1m
        labels:
          severity: critical
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster }}_{{ $labels.namespace }}"
        annotations:
          summary: "Pod: {{ $labels.pod }} is Unknown or Failed"
          k8scluster: "{{ $labels.k8scluster }}"
          namespace: "{{ $labels.namespace }}"
          pod: "{{ $labels.pod }}"
          container: "{{ $labels.container }}"
      - alert: DaemonsetUnavailable
        expr: kube_daemonset_status_number_unavailable > 0
        for: 5m
        labels:
          severity: critical
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster }}_{{ $labels.namespace }}"
        annotations:
          summary: "Daemonset: {{ $labels.daemonset }} has unavailable pods"
          k8scluster: "{{ $labels.k8scluster }}"
          namespace: "{{ $labels.namespace }}"
          daemonset: "{{ $labels.daemonset }}"
      - alert: JobFailed
        expr: kube_job_status_failed == 1
        for: 5m
        labels:
          severity: warning
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster }}_{{ $labels.namespace }}"
        annotations:
          summary: "Job: {{ $labels.job_name }} Failed"
          k8scluster: "{{ $labels.k8scluster }}"
          namespace: "{{ $labels.namespace }}"
          job: "{{ $labels.job_name }}"
      - alert: PodNotReady
        expr: |
          sum by (namespace, pod, cluster_id) (
            max by (namespace, pod, cluster_id) (kube_pod_status_phase{job=~".*kubernetes-service-endpoints",phase=~"Pending|Unknown"})
            * on (namespace, pod, cluster_id) group_left(owner_kind)
            topk by (namespace, pod) (1, max by (namespace, pod, owner_kind, cluster_id) (kube_pod_owner{owner_kind!="Job"}))
          ) > 0
        for: 5m
        labels:
          severity: warning
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster }}_{{ $labels.namespace }}"
        annotations:
          summary: "Pod: {{ $labels.pod }} has been NotReady for more than 5 minutes"
          k8scluster: "{{ $labels.k8scluster }}"
          namespace: "{{ $labels.namespace }}"
      - alert: DeploymentReplicasMismatch
        expr: (kube_deployment_spec_replicas{job=~".*kubernetes-service-endpoints"} != kube_deployment_status_replicas_available{job=~".*kubernetes-service-endpoints"}) and (changes(kube_deployment_status_replicas_updated{job=~".*kubernetes-service-endpoints"}[5m]) == 0)
        for: 5m
        labels:
          severity: warning
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster }}_{{ $labels.namespace }}"
        annotations:
          summary: "Deployment: {{ $labels.deployment }} available replicas do not match the desired count"
          k8scluster: "{{ $labels.k8scluster }}"
          namespace: "{{ $labels.namespace }}"
          deployment: "{{ $labels.deployment }}"
      - alert: StatefulsetReplicasMismatch
        expr: (kube_statefulset_status_replicas_ready{job=~".*kubernetes-service-endpoints"} != kube_statefulset_status_replicas{job=~".*kubernetes-service-endpoints"}) and (changes(kube_statefulset_status_replicas_updated{job=~".*kubernetes-service-endpoints"}[5m]) == 0)
        for: 5m
        labels:
          severity: warning
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster }}_{{ $labels.namespace }}"
        annotations:
          summary: "Statefulset: {{ $labels.statefulset }} ready replicas do not match the desired count"
          k8scluster: "{{ $labels.k8scluster }}"
          namespace: "{{ $labels.namespace }}"
          statefulset: "{{ $labels.statefulset }}"
      - alert: PersistentVolumeError
        expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending",job=~".*kubernetes-service-endpoints"} > 0
        for: 5m
        labels:
          severity: critical
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster }}_{{ $labels.namespace }}"
        annotations:
          summary: "PV: {{ $labels.persistentvolume }} is in Failed or Pending state"
          k8scluster: "{{ $labels.k8scluster }}"
          namespace: "{{ $labels.namespace }}"
          persistentvolume: "{{ $labels.persistentvolume }}"
      - alert: PersistentVolumeClaimError
        expr: kube_persistentvolumeclaim_status_phase{phase=~"Failed|Pending|Lost",job=~".*kubernetes-service-endpoints"} > 0
        for: 5m
        labels:
          severity: critical
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster }}_{{ $labels.namespace }}"
        annotations:
          summary: "PVC: {{ $labels.persistentvolumeclaim }} is in Failed, Pending or Lost state"
          k8scluster: "{{ $labels.k8scluster }}"
          namespace: "{{ $labels.namespace }}"
          persistentvolumeclaim: "{{ $labels.persistentvolumeclaim }}"
      - alert: ServiceLoadBalancerIngressDown
        expr: kube_service_status_load_balancer_ingress != 1
        for: 5m
        labels:
          severity: critical
          service: prometheus_bot
          receiver_group: "{{ $labels.k8scluster }}_{{ $labels.namespace }}"
        annotations:
          summary: "Service: {{ $labels.service }} load balancer ingress is DOWN!"
          k8scluster: "{{ $labels.k8scluster }}"
          namespace: "{{ $labels.namespace }}"
          service_name: "{{ $labels.service }}"
```
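Before applying the ConfigMap, it can be worth validating the rule syntax locally with `promtool`. A sketch, assuming `promtool` is installed; the extraction path and temp file name are illustrative:

```shell
# Render the ConfigMap client-side and extract the embedded rule file
kubectl create -f sidecar/rules-cm-kube-mon-sidecar.yaml --dry-run=client \
  -o jsonpath='{.data.alert-rules\.yaml}' > /tmp/alert-rules.yaml
# Validate rule syntax and PromQL expressions
promtool check rules /tmp/alert-rules.yaml
```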
Update the sidecar/rules-cm-kube-mon-sidecar.yaml configuration file, as follows:

```shell
# Apply the rules ConfigMap; Prometheus has hot reload configured,
# so wait about a minute for the rules to take effect — no restart needed
kubectl apply -f rules-cm-kube-mon-sidecar.yaml
```
```shell
# Fetch the yml files
git clone https://gitee.com/tengfeiwu/kube-state-metrics_prometheus_wechat.git && cd thanos/AlertManager
# Deploy AlertManager
## Change the WeChat settings to your own first
kubectl apply -f cm-kube-mon-alertmanager.yaml
kubectl apply -f wechat-template-kube-mon.yaml
kubectl apply -f deploy-kube-mon-alertmanager.yaml
kubectl apply -f svc-kube-mon-alertmanager.yaml
```
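After the manifests are applied, a quick way to exercise the WeChat route end to end is to post a synthetic alert to AlertManager's v2 API. A sketch, assuming the Service from the manifests above listens on port 9093; the Service name and the alert payload are illustrative:

```shell
# Check the AlertManager pod, then expose the Service locally
kubectl -n kube-mon get pods
kubectl -n kube-mon port-forward svc/alertmanager 9093:9093 &
sleep 2
# Fire a hand-crafted test alert; it should arrive via the WeChat template
curl -s -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning","service":"prometheus_bot"}}]'
```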