When deploying web sites in the past, one part of the architecture diagram was always the monitoring layer. Building an effective monitoring platform matters a great deal for operations: it is the only way to keep our servers and services running stably and efficiently. There are several common open-source monitoring tools, such as Zabbix, Nagios, Open-Falcon and Prometheus, each with its own strengths and weaknesses (interested readers can look them up). For monitoring a Kubernetes cluster, however, Prometheus is by far the friendliest choice, so today we will walk through deploying a complete Prometheus monitoring stack for K8S.
1. Prometheus architecture
2. K8S monitoring metrics and implementation approach
3. Deploying Prometheus on K8S
4. Configuration walkthrough for K8S-based service discovery
5. Deploying Grafana on K8S
6. Monitoring Pods, Nodes and resource objects in the K8S cluster
7. Visualizing Prometheus monitoring data with Grafana
8. Alerting rules and alert notifications
Prometheus is a monitoring system originally built at SoundCloud. It became a community open-source project in 2012 and has a very active developer and user community. To emphasise its open-source and independently maintained nature, Prometheus joined the Cloud Native Computing Foundation (CNCF) in 2016, becoming the second hosted project after Kubernetes. Official sites: https://prometheus.io and https://github.com/prometheus
Prometheus stores all data as time series; samples with the same metric name and the same set of labels belong to the same series. Each time series is uniquely identified by its metric name and a set of key-value pairs (also called labels). Time series format:
`<metric name>{<label name>=<label value>, ...}`
Example: `api_http_requests_total{method="POST", handler="/messages"}`
Instance: a target that can be scraped is called an instance. Job: a collection of instances serving the same purpose is called a job. For example:

```yaml
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['192.168.1.10:9090']
```
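To see how the job and instance labels show up in practice, here is a minimal check against the Prometheus HTTP query API, assuming a Prometheus server is reachable on localhost:9090 as in the example config above:

```bash
# The built-in "up" metric has one series per scraped instance,
# labelled with the job and instance it came from (1 = scrape succeeded).
curl -s 'http://localhost:9090/api/v1/query?query=up'

# Filtering by label, matching the time-series format shown above:
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{job="node"}'
```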
- Monitoring Kubernetes itself
- Monitoring Pods
| Monitoring target | Implementation | Examples |
|---|---|---|
| Pod performance | cAdvisor | Container CPU and memory utilization |
| Node performance | node-exporter | Node CPU and memory utilization |
| K8S resource objects | kube-state-metrics | Pod/Deployment/Service |

Service discovery:
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config
| IP address | Role | Notes |
|---|---|---|
| 192.168.73.136 | nfs | |
| 192.168.73.138 | k8s-master | |
| 192.168.73.139 | k8s-node01 | |
| 192.168.73.140 | k8s-node02 | |
| 192.168.73.135 | k8s-node03 | |
```
[root@k8s-master src]# git clone https://github.com/zhangdongdong7/k8s-prometheus.git
Cloning into 'k8s-prometheus'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
[root@k8s-master src]# cd k8s-prometheus/
[root@k8s-master k8s-prometheus]# ls
alertmanager-configmap.yaml          kube-state-metrics-rbac.yaml             prometheus-rbac.yaml
alertmanager-deployment.yaml         kube-state-metrics-service.yaml          prometheus-rules.yaml
alertmanager-pvc.yaml                node_exporter-0.17.0.linux-amd64.tar.gz  prometheus-service.yaml
alertmanager-service.yaml            node_exporter.sh                         prometheus-statefulset-static-pv.yaml
grafana.yaml                         OWNERS                                   prometheus-statefulset.yaml
kube-state-metrics-deployment.yaml   prometheus-configmap.yaml                README.md
```
RBAC (Role-Based Access Control) handles authorization. First, write the authorization YAML:
```
[root@k8s-master prometheus-k8s]# vim prometheus-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
  - nonResourceURLs:
      - "/metrics"
    verbs:
      - get
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: kube-system
```
Create it:
[root@k8s-master prometheus-k8s]# kubectl apply -f prometheus-rbac.yaml
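An optional sanity check to confirm the ServiceAccount, ClusterRole and ClusterRoleBinding were all created:

```bash
kubectl get serviceaccount prometheus -n kube-system
kubectl get clusterrole prometheus
kubectl get clusterrolebinding prometheus
```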
A ConfigMap is used to hold configuration that does not need to be encrypted. Change the IP addresses under the nodes job to match your own environment.
```
[root@k8s-master prometheus-k8s]# vim prometheus-configmap.yaml
# Prometheus configuration format https://prometheus.io/docs/prometheus/latest/configuration/configuration/
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  prometheus.yml: |
    rule_files:
    - /etc/config/rules/*.rules

    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090

    - job_name: kubernetes-nodes
      scrape_interval: 30s
      static_configs:
      - targets:
        - 192.168.73.135:9100
        - 192.168.73.138:9100
        - 192.168.73.139:9100
        - 192.168.73.140:9100

    - job_name: kubernetes-apiservers
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: default;kubernetes;https
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_service_name
        - __meta_kubernetes_endpoint_port_name
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-nodes-kubelet
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-nodes-cadvisor
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __metrics_path__
        replacement: /metrics/cadvisor
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

    - job_name: kubernetes-service-endpoints
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scrape
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_service_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name

    - job_name: kubernetes-services
      kubernetes_sd_configs:
      - role: service
      metrics_path: /probe
      params:
        module:
        - http_2xx
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_probe
      - source_labels:
        - __address__
        target_label: __param_target
      - replacement: blackbox
        target_label: __address__
      - source_labels:
        - __param_target
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name

    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name

    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager:80"]
```
Create it:
[root@k8s-master prometheus-k8s]# kubectl apply -f prometheus-configmap.yaml
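An optional check of the ConfigMap, and a hedged sketch of reloading the configuration later if you edit it (the configmap-reload sidecar in the StatefulSet below does this automatically, and manual reload works because --web.enable-lifecycle is enabled there; the NodePort 30090 is defined later in this post):

```bash
# Inspect the rendered prometheus.yml stored in the ConfigMap
kubectl -n kube-system get configmap prometheus-config -o yaml | head -n 20

# After an edit, ask a running Prometheus to reload its configuration
curl -X POST http://192.168.73.139:30090/-/reload
```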
A StorageClass is used here for dynamic provisioning so that Prometheus data is persisted; for how to set that up, see the earlier post on NFS dynamic storage provisioning in k8s (《k8s中的NFS動態存儲供給》). Alternatively, prometheus-statefulset-static-pv.yaml can be used for persistence with statically provisioned volumes.
```
[root@k8s-master prometheus-k8s]# vim prometheus-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    k8s-app: prometheus
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v2.2.1
spec:
  serviceName: "prometheus"
  replicas: 1
  podManagementPolicy: "Parallel"
  updateStrategy:
    type: "RollingUpdate"
  selector:
    matchLabels:
      k8s-app: prometheus
  template:
    metadata:
      labels:
        k8s-app: prometheus
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: prometheus
      initContainers:
      - name: "init-chown-data"
        image: "busybox:latest"
        imagePullPolicy: "IfNotPresent"
        command: ["chown", "-R", "65534:65534", "/data"]
        volumeMounts:
        - name: prometheus-data
          mountPath: /data
          subPath: ""
      containers:
      - name: prometheus-server-configmap-reload
        image: "jimmidyson/configmap-reload:v0.1"
        imagePullPolicy: "IfNotPresent"
        args:
        - --volume-dir=/etc/config
        - --webhook-url=http://localhost:9090/-/reload
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
          readOnly: true
        resources:
          limits:
            cpu: 10m
            memory: 10Mi
          requests:
            cpu: 10m
            memory: 10Mi
      - name: prometheus-server
        image: "prom/prometheus:v2.2.1"
        imagePullPolicy: "IfNotPresent"
        args:
        - --config.file=/etc/config/prometheus.yml
        - --storage.tsdb.path=/data
        - --web.console.libraries=/etc/prometheus/console_libraries
        - --web.console.templates=/etc/prometheus/consoles
        - --web.enable-lifecycle
        ports:
        - containerPort: 9090
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 9090
          initialDelaySeconds: 30
          timeoutSeconds: 30
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 9090
          initialDelaySeconds: 30
          timeoutSeconds: 30
        # based on 10 running nodes with 30 pods each
        resources:
          limits:
            cpu: 200m
            memory: 1000Mi
          requests:
            cpu: 200m
            memory: 1000Mi
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
        - name: prometheus-data
          mountPath: /data
          subPath: ""
        - name: prometheus-rules
          mountPath: /etc/config/rules
      terminationGracePeriodSeconds: 300
      volumes:
      - name: config-volume
        configMap:
          name: prometheus-config
      - name: prometheus-rules
        configMap:
          name: prometheus-rules
  volumeClaimTemplates:
  - metadata:
      name: prometheus-data
    spec:
      storageClassName: managed-nfs-storage
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: "16Gi"
```
Create it:
[root@k8s-master prometheus-k8s]# kubectl apply -f prometheus-statefulset.yaml
Check the status:
```
[root@k8s-master prometheus-k8s]# kubectl get pod -n kube-system
NAME                                    READY   STATUS    RESTARTS   AGE
alertmanager-5d75d5688f-fmlq6           2/2     Running   0          8d
coredns-5bd5f9dbd9-wv45t                1/1     Running   1          8d
grafana-0                               1/1     Running   2          14d
kube-state-metrics-7c76bdbf68-kqqgd     2/2     Running   6          13d
kubernetes-dashboard-7d77666777-d5ng4   1/1     Running   5          14d
prometheus-0                            2/2     Running   6          14d
```
You can see a pod named prometheus-0; this is the stateful deployment we just created with the StatefulSet controller. A status of Running is normal; if it is not Running, use kubectl describe pod prometheus-0 -n kube-system to inspect the error details.
A NodePort is used here to pin the service to a fixed access port, which is easier to remember.
```
[root@k8s-master prometheus-k8s]# vim prometheus-service.yaml
kind: Service
apiVersion: v1
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    kubernetes.io/name: "Prometheus"
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  type: NodePort
  ports:
  - name: http
    port: 9090
    protocol: TCP
    targetPort: 9090
    nodePort: 30090
  selector:
    k8s-app: prometheus
```
Create it:
[root@k8s-master prometheus-k8s]# kubectl apply -f prometheus-service.yaml
Check:
```
[root@k8s-master prometheus-k8s]# kubectl get pod,svc -n kube-system
NAME                                        READY   STATUS    RESTARTS   AGE
pod/coredns-5bd5f9dbd9-wv45t                1/1     Running   1          8d
pod/kubernetes-dashboard-7d77666777-d5ng4   1/1     Running   5          14d
pod/prometheus-0                            2/2     Running   6          14d

NAME                           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)          AGE
service/kube-dns               ClusterIP   10.0.0.2     <none>        53/UDP,53/TCP    13d
service/kubernetes-dashboard   NodePort    10.0.0.127   <none>        443:30001/TCP    16d
service/prometheus             NodePort    10.0.0.33    <none>        9090:30090/TCP   13d
```
Use any NodeIP plus the NodePort to access the UI, i.e. http://NodeIP:Port, in this case http://192.168.73.139:30090. A successful visit looks like the screenshot below:
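Besides the web UI, the same NodePort can be queried from the command line, which is handy for scripted checks; a small sketch, assuming the 30090 NodePort above:

```bash
# Summarise the health of the discovered scrape targets
curl -s http://192.168.73.139:30090/api/v1/targets | grep -o '"health":"[a-z]*"' | sort | uniq -c

# Run an instant query, e.g. how many targets are currently up
curl -s 'http://192.168.73.139:30090/api/v1/query?query=sum(up)'
```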
As the web access above shows, Prometheus's built-in UI is fairly limited: its visualization features are not complete enough for day-to-day monitoring, so we usually pair Prometheus with Grafana for visualizing the data. References: https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/prometheus and https://grafana.com/grafana/download. The project cloned earlier already contains a Grafana YAML; adjust it for your own environment:
```
[root@k8s-master prometheus-k8s]# vim grafana.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: grafana
  namespace: kube-system
spec:
  serviceName: "grafana"
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana
        ports:
        - containerPort: 3000
          protocol: TCP
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 256Mi
        volumeMounts:
        - name: grafana-data
          mountPath: /var/lib/grafana
          subPath: grafana
      securityContext:
        fsGroup: 472
        runAsUser: 472
  volumeClaimTemplates:
  - metadata:
      name: grafana-data
    spec:
      storageClassName: managed-nfs-storage
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: "1Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: kube-system
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 3000
    nodePort: 30091
  selector:
    app: grafana
```
Use any NodeIP plus the NodePort to access it, i.e. http://NodeIP:Port, in this case http://192.168.73.139:30091. On the login screen, the default username and password are both admin; you will be asked to change the password after logging in.
The interface after logging in looks like this:
Step one is to add a data source: click the Create your first data source panel and fill it in as shown below.
Step two: after filling in the form, click the green Save & Test button at the bottom; the message "Data source is working" indicates the data source was added successfully.
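If you prefer to script this step rather than click through the UI, the Grafana HTTP API can create the same data source; a sketch assuming the default admin/admin credentials, the 30091 NodePort above, and the in-cluster DNS name of the prometheus service in kube-system:

```bash
curl -s -X POST http://admin:admin@192.168.73.139:30091/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{
        "name": "Prometheus",
        "type": "prometheus",
        "url": "http://prometheus.kube-system.svc:9090",
        "access": "proxy",
        "isDefault": true
      }'
```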
Pod: the kubelet on each node exposes the metrics interface provided by cAdvisor, from which performance data for all Pods and containers on that node can be collected. Exposed endpoints: http://NodeIP:10255/metrics/cadvisor (read-only port) and https://NodeIP:10250/metrics/cadvisor (secure port).
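The secure 10250 endpoint requires a bearer token. One way to poke it by hand, using the token of the prometheus ServiceAccount created earlier (an optional manual check, assuming the ServiceAccount still has an auto-generated token secret, as in clusters of this vintage):

```bash
# Grab the ServiceAccount token that Prometheus itself uses
TOKEN=$(kubectl -n kube-system get secret \
  $(kubectl -n kube-system get sa prometheus -o jsonpath='{.secrets[0].name}') \
  -o jsonpath='{.data.token}' | base64 -d)

# Query the kubelet's cAdvisor metrics on one node (certificate verification skipped)
curl -sk -H "Authorization: Bearer $TOKEN" \
  https://192.168.73.139:10250/metrics/cadvisor | head
```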
Node: node resource utilization is collected with the node_exporter collector. https://github.com/prometheus/node_exporter Docs: https://prometheus.io/docs/guides/node-exporter/ Use the node_exporter.sh script to deploy the node_exporter collector on each of the servers; the script can be run as-is without modification.
```
[root@k8s-master prometheus-k8s]# cat node_exporter.sh
#!/bin/bash
wget https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz
tar zxf node_exporter-0.17.0.linux-amd64.tar.gz
mv node_exporter-0.17.0.linux-amd64 /usr/local/node_exporter

cat <<EOF >/usr/lib/systemd/system/node_exporter.service
[Unit]
Description=https://prometheus.io

[Service]
Restart=on-failure
ExecStart=/usr/local/node_exporter/node_exporter --collector.systemd --collector.systemd.unit-whitelist=(docker|kubelet|kube-proxy|flanneld).service

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable node_exporter
systemctl restart node_exporter
[root@k8s-master prometheus-k8s]# ./node_exporter.sh
```
```
[root@k8s-master prometheus-k8s]# ps -ef|grep node_exporter
root      6227      1  0 Oct08 ?        00:06:43 /usr/local/node_exporter/node_exporter --collector.systemd --collector.systemd.unit-whitelist=(docker|kubelet|kube-proxy|flanneld).service
root    118269 117584  0 23:27 pts/0    00:00:00 grep --color=auto node_exporter
```
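You can also confirm each node is actually serving metrics on port 9100, which is the port the kubernetes-nodes job in the ConfigMap scrapes:

```bash
# Show a few sample metrics from the local node_exporter
curl -s http://localhost:9100/metrics | grep -E '^node_(cpu|memory|filesystem)' | head
```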
Resource objects: kube-state-metrics collects status information about the various resource objects in K8S; a single instance deployed in the cluster is enough. https://github.com/kubernetes/kube-state-metrics
```
[root@k8s-master prometheus-k8s]# vim kube-state-metrics-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources:
  - daemonsets
  - deployments
  - replicasets
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources:
  - statefulsets
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources:
  - cronjobs
  - jobs
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources:
  - horizontalpodautoscalers
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kube-state-metrics-resizer
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
- apiGroups: [""]
  resources:
  - pods
  verbs: ["get"]
- apiGroups: ["extensions"]
  resources:
  - deployments
  resourceNames: ["kube-state-metrics"]
  verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kube-state-metrics-resizer
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
[root@k8s-master prometheus-k8s]# kubectl apply -f kube-state-metrics-rbac.yaml
```
```
[root@k8s-master prometheus-k8s]# cat kube-state-metrics-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    k8s-app: kube-state-metrics
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v1.3.0
spec:
  selector:
    matchLabels:
      k8s-app: kube-state-metrics
      version: v1.3.0
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
        version: v1.3.0
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: lizhenliang/kube-state-metrics:v1.3.0
        ports:
        - name: http-metrics
          containerPort: 8080
        - name: telemetry
          containerPort: 8081
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
      - name: addon-resizer
        image: lizhenliang/addon-resizer:1.8.3
        resources:
          limits:
            cpu: 100m
            memory: 30Mi
          requests:
            cpu: 100m
            memory: 30Mi
        env:
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
        command:
        - /pod_nanny
        - --config-dir=/etc/config
        - --container=kube-state-metrics
        - --cpu=100m
        - --extra-cpu=1m
        - --memory=100Mi
        - --extra-memory=2Mi
        - --threshold=5
        - --deployment=kube-state-metrics
      volumes:
      - name: config-volume
        configMap:
          name: kube-state-metrics-config
---
# Config map for resource configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-state-metrics-config
  namespace: kube-system
  labels:
    k8s-app: kube-state-metrics
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
[root@k8s-master prometheus-k8s]# kubectl apply -f kube-state-metrics-deployment.yaml
```
```
[root@k8s-master prometheus-k8s]# cat kube-state-metrics-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "kube-state-metrics"
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
    protocol: TCP
  - name: telemetry
    port: 8081
    targetPort: telemetry
    protocol: TCP
  selector:
    k8s-app: kube-state-metrics
[root@k8s-master prometheus-k8s]# kubectl apply -f kube-state-metrics-service.yaml
```
```
[root@k8s-master prometheus-k8s]# kubectl get pod,svc -n kube-system
NAME                                        READY   STATUS    RESTARTS   AGE
pod/alertmanager-5d75d5688f-fmlq6           2/2     Running   0          9d
pod/coredns-5bd5f9dbd9-wv45t                1/1     Running   1          9d
pod/grafana-0                               1/1     Running   2          15d
pod/kube-state-metrics-7c76bdbf68-kqqgd     2/2     Running   6          14d
pod/kubernetes-dashboard-7d77666777-d5ng4   1/1     Running   5          16d
pod/prometheus-0                            2/2     Running   6          15d

NAME                           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE
service/alertmanager           ClusterIP   10.0.0.207   <none>        80/TCP              13d
service/grafana                NodePort    10.0.0.74    <none>        80:30091/TCP        15d
service/kube-dns               ClusterIP   10.0.0.2     <none>        53/UDP,53/TCP       14d
service/kube-state-metrics     ClusterIP   10.0.0.194   <none>        8080/TCP,8081/TCP   14d
service/kubernetes-dashboard   NodePort    10.0.0.127   <none>        443:30001/TCP       17d
service/prometheus             NodePort    10.0.0.33    <none>        9090:30090/TCP      14d
[root@k8s-master prometheus-k8s]#
```
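Before checking the Prometheus targets page, you can confirm that kube-state-metrics is actually exposing resource-object metrics; a sketch using a temporary port-forward to the ClusterIP service (an optional check, run interactively on the master):

```bash
# Forward local port 8080 to the kube-state-metrics service, then sample some metrics
kubectl -n kube-system port-forward svc/kube-state-metrics 8080:8080 &
sleep 2
curl -s http://localhost:8080/metrics | grep -E '^kube_(pod|deployment)_' | head
kill %1   # stop the port-forward again
```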
When collecting data with Prometheus we usually want to monitor the Pods, Nodes and resource objects in the K8S cluster, which means the corresponding exporters and collectors that expose metrics APIs have to be in place. These were configured in section 4.3; the state of each collector can be checked under the Status > Targets page of the Prometheus UI, as shown below:
Only when every target is in the UP state can the built-in UI be used to query data for a given metric, as shown below:
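For example, once the cAdvisor and node targets are UP, queries like the following return data. These are illustrative only: the pod_name label assumed here matches the cAdvisor metrics of this Kubernetes generation (newer versions use pod instead), and the 30090 NodePort is the one defined earlier.

```bash
# Per-pod CPU usage over the last 5 minutes
curl -s -G 'http://192.168.73.139:30090/api/v1/query' \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total{image!=""}[5m])) by (pod_name)'

# Node memory utilisation in percent
curl -s -G 'http://192.168.73.139:30090/api/v1/query' \
  --data-urlencode 'query=100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)'
```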
As the screenshots above show, the visualization features of Prometheus's own UI are fairly limited, so we combine it with Grafana to visualize the monitoring data. Grafana was already deployed in the previous section, so what remains is to add dashboards and panels for the metrics we care about. In practice the Grafana community offers many mature dashboard templates that can be used directly; just adjust the PromQL queries in the panels to fit your environment. https://grafana.com/grafana/dashboards
Recommended templates:
If a panel shows no data after a template is imported, click Edit on the panel, check its PromQL query, and debug that query in the Prometheus UI to confirm it returns values. The dashboard after adjustment looks like this:
Resource state monitoring: 6417. Likewise, import the resource state template; after adjustment the dashboard shows the state of the various K8S resource objects, as below.
Node monitoring: 9276. Likewise, import the node template; after adjustment the dashboard shows the basic situation of each node, as below.
We will use email to deliver the alert notifications.
```
[root@k8s-master prometheus-k8s]# vim prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-system
data:
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} : {{ $labels.mountpoint }} partition usage is too high"
          description: "{{ $labels.instance }}: {{ $labels.mountpoint }} partition usage is above 80% (current value: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage is too high"
          description: "{{ $labels.instance }} memory usage is above 80% (current value: {{ $value }})"
      - alert: NodeCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage is too high"
          description: "{{ $labels.instance }} CPU usage is above 60% (current value: {{ $value }})"
[root@k8s-master prometheus-k8s]# kubectl apply -f prometheus-rules.yaml
```
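The rule expressions can also be syntax-checked offline with promtool; a sketch that assumes promtool is available locally (it ships in the Prometheus release tarball) and pulls the node.rules group back out of the ConfigMap into a plain file:

```bash
# Extract the node.rules block from the ConfigMap and lint it
kubectl -n kube-system get configmap prometheus-rules \
  -o jsonpath='{.data.node\.rules}' > /tmp/node.rules
promtool check rules /tmp/node.rules
```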
```
[root@k8s-master prometheus-k8s]# vim alertmanager-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'mail.goldwind.com.cn:587'          # SMTP server; check your mail account settings
      smtp_from: 'goldwindscada@goldwind.com.cn'          # set to your own sender mailbox
      smtp_auth_username: 'goldwindscada@goldwind.com.cn'
      smtp_auth_password: 'Dbadmin@123'
    receivers:
    - name: default-receiver
      email_configs:
      - to: "zhangdongdong27459@goldwind.com.cn"
    route:
      group_interval: 1m
      group_wait: 10s
      receiver: default-receiver
      repeat_interval: 1m
[root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-configmap.yaml
```
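Once Alertmanager itself is running (deployment below), you can confirm it picked up this configuration and list any active alerts through its HTTP API; a sketch using a temporary port-forward to the alertmanager service defined later in this post:

```bash
# Forward local port 9093 to the alertmanager service (port 80 -> container 9093)
kubectl -n kube-system port-forward svc/alertmanager 9093:80 &
sleep 2
curl -s http://localhost:9093/api/v1/status   # loaded configuration and version info
curl -s http://localhost:9093/api/v1/alerts   # currently firing alerts, if any
kill %1
```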
```
[root@k8s-master prometheus-k8s]# vim alertmanager-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
spec:
  storageClassName: managed-nfs-storage
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: "2Gi"
[root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-pvc.yaml
```
```
[root@k8s-master prometheus-k8s]# vim alertmanager-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    k8s-app: alertmanager
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v0.14.0
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: alertmanager
      version: v0.14.0
  template:
    metadata:
      labels:
        k8s-app: alertmanager
        version: v0.14.0
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      containers:
      - name: prometheus-alertmanager
        image: "prom/alertmanager:v0.14.0"
        imagePullPolicy: "IfNotPresent"
        args:
        - --config.file=/etc/config/alertmanager.yml
        - --storage.path=/data
        - --web.external-url=/
        ports:
        - containerPort: 9093
        readinessProbe:
          httpGet:
            path: /#/status
            port: 9093
          initialDelaySeconds: 30
          timeoutSeconds: 30
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
        - name: storage-volume
          mountPath: "/data"
          subPath: ""
        resources:
          limits:
            cpu: 10m
            memory: 50Mi
          requests:
            cpu: 10m
            memory: 50Mi
      - name: prometheus-alertmanager-configmap-reload
        image: "jimmidyson/configmap-reload:v0.1"
        imagePullPolicy: "IfNotPresent"
        args:
        - --volume-dir=/etc/config
        - --webhook-url=http://localhost:9093/-/reload
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
          readOnly: true
        resources:
          limits:
            cpu: 10m
            memory: 10Mi
          requests:
            cpu: 10m
            memory: 10Mi
      volumes:
      - name: config-volume
        configMap:
          name: alertmanager-config
      - name: storage-volume
        persistentVolumeClaim:
          claimName: alertmanager
[root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-deployment.yaml
```
```
[root@k8s-master prometheus-k8s]# vim alertmanager-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "Alertmanager"
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 9093
  selector:
    k8s-app: alertmanager
  type: "ClusterIP"
[root@k8s-master prometheus-k8s]# kubectl apply -f alertmanager-service.yaml
```
```
[root@k8s-master prometheus-k8s]# kubectl get pod,svc -n kube-system -o wide
NAME                                        READY   STATUS    RESTARTS   AGE   IP            NODE             NOMINATED NODE   READINESS GATES
pod/alertmanager-5d75d5688f-fmlq6           2/2     Running   4          10d   172.17.15.2   192.168.73.140   <none>           <none>
pod/coredns-5bd5f9dbd9-qxvmz                1/1     Running   0          42m   172.17.33.2   192.168.73.138   <none>           <none>
pod/grafana-0                               1/1     Running   3          16d   172.17.31.2   192.168.73.139   <none>           <none>
pod/kube-state-metrics-7c76bdbf68-hv56m     2/2     Running   0          23h   172.17.15.3   192.168.73.140   <none>           <none>
pod/kubernetes-dashboard-7d77666777-d5ng4   1/1     Running   6          17d   172.17.31.4   192.168.73.139   <none>           <none>
pod/prometheus-0                            2/2     Running   8          16d   172.17.83.2   192.168.73.135   <none>           <none>

NAME                           TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)             AGE   SELECTOR
service/alertmanager           ClusterIP   10.0.0.207   <none>        80/TCP              14d   k8s-app=alertmanager
service/grafana                NodePort    10.0.0.74    <none>        80:30091/TCP        16d   app=grafana
service/kube-dns               ClusterIP   10.0.0.2     <none>        53/UDP,53/TCP       42m   k8s-app=kube-dns
service/kube-state-metrics     ClusterIP   10.0.0.194   <none>        8080/TCP,8081/TCP   15d   k8s-app=kube-state-metrics
service/kubernetes-dashboard   NodePort    10.0.0.127   <none>        443:30001/TCP       18d   k8s-app=kubernetes-dashboard
service/prometheus             NodePort    10.0.0.33    <none>        9090:30090/TCP      15d   k8s-app=prometheus
```