In the first two chapters, setting up Prometheus by hand was costly and fairly tedious, and if we also want high availability for components such as Prometheus and AlertManager themselves, the cost rises further. We could of course meet these needs with our own custom manifests, and we know that Prometheus already has native Kubernetes support in its codebase and can monitor the cluster automatically through service discovery. There is, however, a more advanced way to deploy Prometheus: the Operator framework.
Operator
An Operator, a pattern developed by CoreOS, is an application-specific controller that extends the Kubernetes API and is used to create, configure and manage complex stateful applications such as databases, caches and monitoring systems. Operators are built on the Kubernetes concepts of resources and controllers while adding application-specific domain knowledge. The key to building an Operator is the design of its CRDs (Custom Resource Definitions).
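To make the CRD idea concrete, here is a minimal sketch of a CustomResourceDefinition; the group and resource names (example.com, Backup) are made up for illustration and are not part of the Prometheus Operator:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  # the name must be <plural>.<group>
  name: backups.example.com
spec:
  group: example.com
  version: v1
  scope: Namespaced
  names:
    plural: backups
    singular: backup
    kind: Backup

Once such a CRD is registered, kubectl can create and list Backup objects just like built-in resources, and the Operator's controller watches those objects and does the actual work.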
An Operator encodes the operational knowledge that a human operator has about a piece of software, and uses Kubernetes' powerful abstractions to manage that software at scale. CoreOS ships several official Operator implementations, among them the protagonist of this chapter, the Prometheus Operator. The core of an Operator is built on two Kubernetes concepts: resources (CRDs) and controllers.
At the time of writing, CoreOS provides four such Operators.
Next, we will use the Operator to deploy Prometheus.
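Before deploying the full stack, here is a minimal sketch of the two central custom resources the Prometheus Operator introduces: a Prometheus object describing the server to run, and a ServiceMonitor describing what it should scrape. The label and port values are illustrative only; the real manifests we apply below are more complete:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
  namespace: monitoring
spec:
  replicas: 1
  serviceAccountName: prometheus-k8s   # assumed to exist already
  serviceMonitorSelector:              # pick up ServiceMonitors carrying this label
    matchLabels:
      team: frontend
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
  labels:
    team: frontend
spec:
  selector:
    matchLabels:
      app: example-app                 # Services carrying this label get scraped
  endpoints:
  - port: web                          # port name on the Service
    interval: 30s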
We will install directly from the Prometheus-Operator source code; a one-command Helm install is also possible, but installing from source lets us see more of the implementation details. First, clone the repository:
git clone https://github.com/coreos/prometheus-operator
cd prometheus-operator/contrib/kube-prometheus/manifests
The manifests directory contains all the resource manifests we need; change into it and create everything in one go:
kubectl apply -f .
After the deployment finishes, a namespace named monitoring is created and all the resource objects are placed in it. The Operator also automatically creates 4 CRD resource objects:
kubectl get crd |grep coreos
alertmanagers.monitoring.coreos.com     2019-03-18T02:43:57Z
prometheuses.monitoring.coreos.com      2019-03-18T02:43:58Z
prometheusrules.monitoring.coreos.com   2019-03-18T02:43:58Z
servicemonitors.monitoring.coreos.com   2019-03-18T02:43:58Z
We can list all the Pods in the monitoring namespace. The alertmanager and prometheus instances are managed by StatefulSet controllers, and there is also a central prometheus-operator Pod, which manages the other resource objects and watches them for changes:
kubectl get pods -n monitoring
NAME                                   READY   STATUS    RESTARTS   AGE
alertmanager-main-0                    2/2     Running   0          37m
alertmanager-main-1                    2/2     Running   0          34m
alertmanager-main-2                    2/2     Running   0          33m
grafana-7489c49998-pkl8w               1/1     Running   0          40m
kube-state-metrics-d6cf6c7b5-7dwpg     4/4     Running   0          27m
node-exporter-dlp25                    2/2     Running   0          40m
node-exporter-fghlp                    2/2     Running   0          40m
node-exporter-mxwdm                    2/2     Running   0          40m
node-exporter-r9v92                    2/2     Running   0          40m
prometheus-adapter-84cd9c96c9-n92n4    1/1     Running   0          40m
prometheus-k8s-0                       3/3     Running   1          37m
prometheus-k8s-1                       3/3     Running   1          37m
prometheus-operator-7b74946bd6-vmbcj   1/1     Running   0          40m
Check the Services that were created:
kubectl get svc -n monitoring
NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
alertmanager-main       ClusterIP   10.110.43.207   <none>        9093/TCP            40m
alertmanager-operated   ClusterIP   None            <none>        9093/TCP,6783/TCP   38m
grafana                 ClusterIP   10.109.160.0    <none>        3000/TCP            40m
kube-state-metrics      ClusterIP   None            <none>        8443/TCP,9443/TCP   40m
node-exporter           ClusterIP   None            <none>        9100/TCP            40m
prometheus-adapter      ClusterIP   10.105.174.21   <none>        443/TCP             40m
prometheus-k8s          ClusterIP   10.97.195.143   <none>        9090/TCP            40m
prometheus-operated     ClusterIP   None            <none>        9090/TCP            38m
prometheus-operator     ClusterIP   None            <none>        8080/TCP            40m
A ClusterIP Service was created for both grafana and prometheus. To reach these two services from outside the cluster we could create Ingress objects or use NodePort Services; for simplicity we will switch them to NodePort here. Edit the grafana and prometheus-k8s Services and change their type to NodePort:
kubectl edit svc grafana -n monitoring
kubectl edit svc prometheus-k8s -n monitoring
kubectl get svc -n monitoring
NAME             TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
.....
grafana          NodePort   10.109.160.0    <none>        3000:31740/TCP   42m
prometheus-k8s   NodePort   10.97.195.143   <none>        9090:31310/TCP   42m
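If you prefer the Ingress route mentioned above instead of NodePort, a minimal sketch for Grafana could look like the following; it assumes an ingress controller is already running in the cluster and that grafana.example.com is a hostname you control (both are assumptions, not part of the kube-prometheus manifests):

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
spec:
  rules:
  - host: grafana.example.com   # hypothetical hostname
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 3000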
Once the change is made we can access both services, for example the Prometheus targets page:
Most of the targets are healthy; only two or three are not reaching their monitoring endpoints, for example the kube-controller-manager and kube-scheduler system components. This comes down to how the ServiceMonitors are defined. Let's first look at the ServiceMonitor for kube-scheduler (prometheus-serviceMonitorKubeScheduler.yaml):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: kube-scheduler
  name: kube-scheduler
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s        # scrape every 30s
    port: http-metrics   # name of the port on the matched Service
  jobLabel: k8s-app
  namespaceSelector:     # match Services in this namespace; use any: true to match all namespaces
    matchNames:
    - kube-system
  selector:              # labels of the Service to match; with matchLabels all listed labels must match, with matchExpressions a Service matching at least one expression is selected
    matchLabels:
      k8s-app: kube-scheduler
This is a typical ServiceMonitor declaration. Through selector.matchLabels it matches Services in the kube-system namespace carrying the label k8s-app=kube-scheduler, but no such Service exists in our cluster, so we need to create one by hand (prometheus-kubeSchedulerService.yaml):
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-scheduler
  labels:
    k8s-app: kube-scheduler
spec:
  selector:
    component: kube-scheduler
  ports:
  - name: http-metrics
    port: 10251
    targetPort: 10251
    protocol: TCP
The most important parts here are the labels and selector sections. The labels must match the selector of the ServiceMonitor object above, while the Service's own selector is component=kube-scheduler. Why that label? We can describe the kube-scheduler Pod to find out:
$ kubectl describe pod kube-scheduler-k8s-master -n kube-system
Name:                 kube-scheduler-k8s-master
Namespace:            kube-system
Priority:             2000000000
PriorityClassName:    system-cluster-critical
Node:                 k8s-master/172.16.138.40
Start Time:           Tue, 19 Feb 2019 21:15:05 -0500
Labels:               component=kube-scheduler
                      tier=control-plane
......
The Pod carries the two labels component=kube-scheduler and tier=control-plane. The first one identifies it more specifically, so it is the better choice, and with it the Service above is associated with our Pod. Create the Service directly:
$ kubectl create -f prometheus-kubeSchedulerService.yaml
$ kubectl get svc -n kube-system -l k8s-app=kube-scheduler
NAME             TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
kube-scheduler   ClusterIP   10.103.165.58   <none>        10251/TCP   4m
After creating it, wait a short while and check the state of the kube-scheduler target in Prometheus:
The target is now discovered, but scraping it fails. The error occurs because our cluster was set up with kubeadm, where kube-scheduler binds to 127.0.0.1 by default, while Prometheus tries to reach it through the node IP, so the connection is refused. We only need to change the bind address of kube-scheduler to 0.0.0.0. Since kube-scheduler runs as a static Pod, we simply edit the corresponding YAML file (kube-scheduler.yaml) in the static Pod directory:
$ cd /etc/kubernetes/manifests
# change the --address flag under command in kube-scheduler.yaml to 0.0.0.0
$ vim kube-scheduler.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --address=0.0.0.0
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
....
After saving the change, move the file out of the directory for a moment and then move it back; the static Pod is then recreated automatically. Then check whether the kube-scheduler target in Prometheus has become healthy:
Now let's look at the ServiceMonitor definition for kube-controller-manager:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manager
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    metricRelabelings:
    - action: drop
      regex: etcd_(debugging|disk|request|server).*
      sourceLabels:
      - __name__
    port: http-metrics
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: kube-controller-manager
It selects a Service with the label k8s-app: kube-controller-manager, but again no such Service exists in the cluster, so we create one manually.
Before creating it, we need to confirm the Pod's labels:
$ kubectl describe pod kube-controller-manager-k8s-master -n kube-system
Name:                 kube-controller-manager-k8s-master
Namespace:            kube-system
Priority:             2000000000
PriorityClassName:    system-cluster-critical
Node:                 k8s-master/172.16.138.40
Start Time:           Tue, 19 Feb 2019 21:15:16 -0500
Labels:               component=kube-controller-manager
                      tier=control-plane
....
Create the Service:
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-controller-manager
  labels:
    k8s-app: kube-controller-manager
spec:
  selector:
    component: kube-controller-manager
  ports:
  - name: http-metrics
    port: 10252
    targetPort: 10252
    protocol: TCP
After creating it, check the target:
We hit the same problem as before, and we fix it the same way. Edit kube-controller-manager.yaml:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  creationTimestamp: null
  labels:
    component: kube-controller-manager
    tier: control-plane
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-controller-manager
    - --node-monitor-grace-period=10s
    - --pod-eviction-timeout=10s
    - --address=0.0.0.0   # changed
......
After saving, move the file out of the directory for a moment and move it back so the static Pod is recreated, then check whether the kube-controller-manager target in Prometheus is healthy:
CoreDNS exposes its metrics on port 9153. Let's check whether any Service under kube-system exposes that port:
kubectl get svc -n kube-system
NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)         AGE
heapster                  ClusterIP   10.96.28.220    <none>        80/TCP          19d
kube-controller-manager   ClusterIP   10.99.208.51    <none>        10252/TCP       1h
kube-dns                  ClusterIP   10.96.0.10      <none>        53/UDP,53/TCP   188d
kube-scheduler            ClusterIP   10.103.165.58   <none>        10251/TCP       2h
kubelet                   ClusterIP   None            <none>        10250/TCP       5h
kubernetes-dashboard      NodePort    10.103.15.27    <none>        443:30589/TCP   131d
monitoring-influxdb       ClusterIP   10.103.155.57   <none>        8086/TCP        19d
tiller-deploy             ClusterIP   10.104.114.83   <none>        44134/TCP       18d
The kube-dns Service does not expose a metrics port, even though the metrics endpoint itself is enabled on the backend, so we need to expose it through a Service of our own. Create one:
apiVersion: v1
kind: Service
metadata:
  namespace: kube-system
  name: kube-prometheus-prometheus-coredns
  labels:
    k8s-app: prometheus-operator-coredns
spec:
  selector:
    k8s-app: kube-dns
  ports:
  - name: metrics
    port: 9153
    targetPort: 9153
    protocol: TCP
The Service we just created carries the label k8s-app: prometheus-operator-coredns, so we also need to change the label selector in the CoreDNS ServiceMonitor to match:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: coredns
  name: coredns
  namespace: monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 15s
    port: metrics
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: prometheus-operator-coredns
Create and inspect the two resources:
$ kubectl apply -f prometheus-serviceMonitorCoreDNS.yaml
$ kubectl create -f prometheus-KubeDnsSvc.yaml
$ kubectl get svc -n kube-system
NAME                                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
kube-prometheus-prometheus-coredns    ClusterIP   10.100.205.135   <none>        9153/TCP   1h
Now check again whether the coredns target in Prometheus is healthy:
With the scrape targets configured, we can look at the dashboards in Grafana, which is reachable through its NodePort in the same way. Log in for the first time with admin:admin. On the home page you can see that the Prometheus data source is already wired up, and the dashboards should be showing data:
Besides the Kubernetes resource objects, nodes and components, we sometimes need to add custom monitoring targets driven by actual business requirements. Adding a custom target is also very simple.
Next we demonstrate how to add monitoring for an etcd cluster.
Whether the etcd cluster runs outside Kubernetes or was installed inside it by kubeadm, we treat it here as an independent, external cluster, because the procedure is the same in both cases.
etcd clusters are usually secured with HTTPS certificate authentication, so for Prometheus to reach the etcd metrics we have to provide the corresponding certificates.
Since this demo environment was built with kubeadm, we can use kubectl to look up the certificate paths etcd was started with:
$ kubectl get pods -n kube-system | grep etcd
etcd-k8s-master    1/1   Running   2773   188d
etcd-k8s-node01    1/1   Running   2      104d
$ kubectl get pod etcd-k8s-master -n kube-system -o yaml
.....
spec:
  containers:
  - command:
    - etcd
    - --advertise-client-urls=https://172.16.138.40:2379
    - --initial-advertise-peer-urls=https://172.16.138.40:2380
    - --initial-cluster=k8s-master=https://172.16.138.40:2380
    - --listen-client-urls=https://127.0.0.1:2379,https://172.16.138.40:2379
    - --listen-peer-urls=https://172.16.138.40:2380
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --client-cert-auth=true
    - --data-dir=/var/lib/etcd
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --name=k8s-master
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-client-cert-auth=true
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --snapshot-count=10000
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    image: registry.cn-hangzhou.aliyuncs.com/google_containers/etcd-amd64:3.2.18
    imagePullPolicy: IfNotPresent
    livenessProbe:
      exec:
        command:
        - /bin/sh
        - -ec
        - ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt
          --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
          get foo
      failureThreshold: 8
      initialDelaySeconds: 15
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 15
    name: etcd
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/etcd
      name: etcd-data
    - mountPath: /etc/kubernetes/pki/etcd
      name: etcd-certs
......
  tolerations:
  - effect: NoExecute
    operator: Exists
  volumes:
  - hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data
  - hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
    name: etcd-certs
.....
The etcd certificates all live under /etc/kubernetes/pki/etcd on the node, so we first store the ones we need in the cluster as a Secret (run this on the node where etcd runs):
$ kubectl -n monitoring create secret generic etcd-certs --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.crt --from-file=/etc/kubernetes/pki/etcd/healthcheck-client.key --from-file=/etc/kubernetes/pki/etcd/ca.crt
secret/etcd-certs created
Then reference the etcd-certs object created above in the prometheus resource object and update it:
  nodeSelector:
    beta.kubernetes.io/os: linux
  replicas: 2
  secrets:
  - etcd-certs
After the update, the etcd certificate files created above are available inside the Prometheus Pods; we can check the exact path from inside a Pod:
$ kubectl exec -it prometheus-k8s-0 /bin/sh -n monitoring
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ ls /etc/prometheus/
config_out/         console_libraries/  consoles/           prometheus.yml      rules/              secrets/
/prometheus $ ls /etc/prometheus/secrets/etcd-certs/
ca.crt                   healthcheck-client.crt   healthcheck-client.key
/prometheus $
With the certificates Prometheus needs to reach the etcd cluster in place, we can now create the ServiceMonitor object (prometheus-serviceMonitorEtcd.yaml):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd-k8s
  namespace: monitoring
  labels:
    k8s-app: etcd-k8s
spec:
  jobLabel: k8s-app
  endpoints:
  - port: port
    interval: 30s
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
      certFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.key
      insecureSkipVerify: true
  selector:
    matchLabels:
      k8s-app: etcd
  namespaceSelector:
    matchNames:
    - kube-system
Here we create a ServiceMonitor named etcd-k8s in the monitoring namespace. The basic properties are the same as in the earlier examples: it matches Services in the kube-system namespace that carry the label k8s-app=etcd, and jobLabel names the label used for the job name. What differs is the endpoints section, where we configure the certificates used to access etcd. endpoints supports many scrape parameters, such as relabelling and proxyUrl; tlsConfig configures TLS for the scraped endpoint, and because the serverName in the certificate issued for etcd may not match, we also set insecureSkipVerify=true.
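If you would rather not skip verification, the tlsConfig block also accepts a serverName field that is checked against the certificate instead of the target address. A sketch for the endpoints entry above, assuming the etcd server certificate was issued for a name such as etcd.kube-system (check the SANs of your own certificate first; the value here is hypothetical):

    tlsConfig:
      caFile: /etc/prometheus/secrets/etcd-certs/ca.crt
      certFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.crt
      keyFile: /etc/prometheus/secrets/etcd-certs/healthcheck-client.key
      serverName: etcd.kube-system   # hypothetical value; must match a name in the server certificate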
Create the ServiceMonitor object directly:
$ kubectl create -f prometheus-serviceMonitorEtcd.yaml
servicemonitor.monitoring.coreos.com/etcd-k8s created
The ServiceMonitor is created, but there is no matching Service object yet, so we need to create one by hand (prometheus-etcdService.yaml):
apiVersion: v1
kind: Service
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: port
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Endpoints
metadata:
  name: etcd-k8s
  namespace: kube-system
  labels:
    k8s-app: etcd
subsets:
- addresses:
  - ip: 172.16.138.40
    nodeName: etcd-k8s-master
  - ip: 172.16.138.41
    nodeName: etcd-k8s-node01
  ports:
  - name: port
    port: 2379
    protocol: TCP
Unlike before, this Service does not match Pods through a label selector, because as mentioned above the etcd cluster is often external to the Kubernetes cluster. In that case we define an Endpoints object ourselves. Note that its metadata must be consistent with the Service, and the Service's clusterIP is set to None. If this is unfamiliar, refer back to the earlier chapter on Services.
In the Endpoints' subsets we list the addresses of the etcd members. Our test cluster was created with the nodes' host IPs specified. (A 2-member etcd cluster is also not a recommended setup: etcd is quorum based, so 2 members give no more failure tolerance than 1.) Create the Service resource directly:
$ kubectl create -f prometheus-etcdService.yaml
service/etcd-k8s created
endpoints/etcd-k8s created
Shortly after creating it, the etcd monitoring targets appear on the Prometheus targets page:
Once data is being collected, we can import dashboard 3070 in Grafana to get etcd monitoring graphs.
Now we know how to write a custom ServiceMonitor object, but what if we need a custom alerting rule? Looking at the Alerts page of the Prometheus dashboard, there are already a number of alerting rules, some of which are firing:
Where do these alerts come from, and how should they be delivered to us? In the hand-written setup we specified the AlertManager instances and the alerting rule files directly in the Prometheus configuration file; how does that work with the Operator-based deployment? We can look at the AlertManager-related configuration on the Config page of the Prometheus dashboard:
alerting:
  alert_relabel_configs:
  - separator: ;
    regex: prometheus_replica
    replacement: $1
    action: labeldrop
  alertmanagers:
  - kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names:
        - monitoring
    scheme: http
    path_prefix: /
    timeout: 10s
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_name]
      separator: ;
      regex: alertmanager-main
      replacement: $1
      action: keep
    - source_labels: [__meta_kubernetes_endpoint_port_name]
      separator: ;
      regex: web
      replacement: $1
      action: keep
rule_files:
- /etc/prometheus/rules/prometheus-k8s-rulefiles-0/*.yaml
The alertmanagers section above is populated through Kubernetes service discovery with the endpoints role: it keeps endpoints whose Service is named alertmanager-main and whose port is named web. Let's look at the alertmanager-main Service:
kubectl describe svc alertmanager-main -n monitoring
Name:              alertmanager-main
Namespace:         monitoring
Labels:            alertmanager=main
Annotations:       kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"alertmanager":"main"},"name":"alertmanager-main","namespace":"monitoring"},...
Selector:          alertmanager=main,app=alertmanager
Type:              ClusterIP
IP:                10.110.43.207
Port:              web  9093/TCP
TargetPort:        web/TCP
Endpoints:         10.244.0.31:9093,10.244.2.42:9093,10.244.3.40:9093
Session Affinity:  None
Events:            <none>
The Service name is indeed alertmanager-main and the port is named web, matching the rules above, so Prometheus and AlertManager are correctly connected. The corresponding alerting rule files are all the YAML files under /etc/prometheus/rules/prometheus-k8s-rulefiles-0/. We can enter a Prometheus Pod to verify that there are YAML files in that directory:
$ kubectl exec -it prometheus-k8s-0 /bin/sh -n monitoring
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ ls /etc/prometheus/rules/prometheus-k8s-rulefiles-0/
monitoring-prometheus-k8s-rules.yaml
/prometheus $ cat /etc/prometheus/rules/prometheus-k8s-rulefiles-0/monitoring-prometheus-k8s-rules.yaml
groups:
- name: k8s.rules
  rules:
  - expr: |
      sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])) by (namespace)
    record: namespace:container_cpu_usage_seconds_total:sum_rate
  - expr: |
      sum by (namespace, pod_name, container_name) (
        rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])
      )
    record: namespace_pod_name_container_name:container_cpu_usage_seconds_total:sum_rate
...........
This YAML file is in fact the content of a PrometheusRule object we created earlier:
$ cat prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: prometheus-k8s-rules
  namespace: monitoring
spec:
  groups:
  - name: k8s.rules
    rules:
    - expr: |
        sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])) by (namespace)
      record: namespace:container_cpu_usage_seconds_total:sum_rate
    - expr: |
        sum by (namespace, pod_name, container_name) (
          rate(container_cpu_usage_seconds_total{job="kubelet", image!="", container_name!=""}[5m])
        )
      record: namespace_pod_name_container_name:container_cpu_usage_seconds_total:sum_rate
.....
Our PrometheusRule is named prometheus-k8s-rules and lives in the monitoring namespace, so we can infer that whenever a PrometheusRule resource is created, a file named <namespace>-<name>.yaml is generated under the prometheus-k8s-rulefiles-0 directory above. So if we ever need a custom alert, all we have to do is define a PrometheusRule resource object. Why is Prometheus able to pick up PrometheusRule objects at all? That is determined by the prometheus resource object we created: it has an important property, ruleSelector, a filter used to match rules, and it requires PrometheusRule objects labelled with prometheus=k8s and role=alert-rules:
ruleSelector:
matchLabels:
prometheus: k8s
role: alert-rules
So to define a custom alerting rule we only need to create a PrometheusRule object carrying the labels prometheus=k8s and role=alert-rules. As an example, let's add an alert on etcd availability: an etcd cluster stays available as long as more than half of its members are up, so we raise an alert once enough members are down that one more failure would make the cluster unavailable. Create the file prometheus-etcdRules.yaml:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: etcd-rules
  namespace: monitoring
spec:
  groups:
  - name: etcd
    rules:
    - alert: EtcdClusterUnavailable
      annotations:
        summary: etcd cluster small
        description: If one more etcd peer goes down the cluster will be unavailable
      expr: |
        count(up{job="etcd"} == 0) > (count(up{job="etcd"}) / 2 - 1)
      for: 3m
      labels:
        severity: critical
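To make the expression concrete: assuming a 3-member cluster (the member count here is only an illustration), count(up{job="etcd"}) is 3, so the alert fires as soon as count(up{job="etcd"} == 0) exceeds 3/2 - 1 = 0.5, that is, when a single member is down, which is exactly the point at which one further failure (2 of 3 down) would cost the cluster its quorum.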
$ kubectl create -f prometheus-etcdRules.yaml
Note that the labels must include at least prometheus=k8s and role=alert-rules. After creating the rule, wait a short while and look at the rules folder inside the container again:
$ kubectl exec -it prometheus-k8s-0 /bin/sh -n monitoring
Defaulting container name to prometheus.
Use 'kubectl describe pod/prometheus-k8s-0 -n monitoring' to see all of the containers in this pod.
/prometheus $ ls /etc/prometheus/rules/prometheus-k8s-rulefiles-0/
monitoring-etcd-rules.yaml            monitoring-prometheus-k8s-rules.yaml
The rule file we created has been injected into the rulefiles directory, which confirms the assumption above. The new alerting rule then also shows up on the Alerts page of the Prometheus dashboard:
We now know how to add an alerting rule, but how do these alerts get delivered? In the earlier chapters we configured the various receivers in the AlertManager configuration file; now that the component is created through the Operator's alertmanager resource object, how do we modify its configuration?
First change the alertmanager-main Service to a NodePort Service; afterwards we can inspect the AlertManager configuration under the Status page of its web UI:
$ kubectl edit svc alertmanager-main -n monitoring
......
selector:
alertmanager: main
app: alertmanager
sessionAffinity: None
type: NodePort
.....
This configuration actually comes from the alertmanager-secret.yaml file we created earlier in the prometheus-operator/contrib/kube-prometheus/manifests directory:
apiVersion: v1
data:
  alertmanager.yaml: Imdsb2JhbCI6IAogICJyZXNvbHZlX3RpbWVvdXQiOiAiNW0iCiJyZWNlaXZlcnMiOiAKLSAibmFtZSI6ICJudWxsIgoicm91dGUiOiAKICAiZ3JvdXBfYnkiOiAKICAtICJqb2IiCiAgImdyb3VwX2ludGVydmFsIjogIjVtIgogICJncm91cF93YWl0IjogIjMwcyIKICAicmVjZWl2ZXIiOiAibnVsbCIKICAicmVwZWF0X2ludGVydmFsIjogIjEyaCIKICAicm91dGVzIjogCiAgLSAibWF0Y2giOiAKICAgICAgImFsZXJ0bmFtZSI6ICJEZWFkTWFuc1N3aXRjaCIKICAgICJyZWNlaXZlciI6ICJudWxsIg==
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
type: Opaque
We can base64-decode the value of the alertmanager.yaml key:
echo Imdsb2JhbCI6IAogICJyZXNvbHZlX3RpbWVvdXQiOiAiNW0iCiJyZWNlaXZlcnMiOiAKLSAibmFtZSI6ICJudWxsIgoicm91dGUiOiAKICAiZ3JvdXBfYnkiOiAKICAtICJqb2IiCiAgImdyb3VwX2ludGVydmFsIjogIjVtIgogICJncm91cF93YWl0IjogIjMwcyIKICAicmVjZWl2ZXIiOiAibnVsbCIKICAicmVwZWF0X2ludGVydmFsIjogIjEyaCIKICAicm91dGVzIjogCiAgLSAibWF0Y2giOiAKICAgICAgImFsZXJ0bmFtZSI6ICJEZWFkTWFuc1N3aXRjaCIKICAgICJyZWNlaXZlciI6ICJudWxsIg== | base64 -d
The decoded result:
"global":
  "resolve_timeout": "5m"
"receivers":
- "name": "null"
"route":
  "group_by":
  - "job"
  "group_interval": "5m"
  "group_wait": "30s"
  "receiver": "null"
  "repeat_interval": "12h"
  "routes":
  - "match":
      "alertname": "DeadMansSwitch"
    "receiver": "null"
The content matches the configuration we saw above, so if we want to add our own receivers or message templates, this is the file to change:
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:587'
  smtp_from: 'zhaikun1992@qq.com'
  smtp_auth_username: 'zhaikun1992@qq.com'
  smtp_auth_password: '***'
  smtp_hello: 'qq.com'
  smtp_require_tls: true
templates:
- "/etc/alertmanager-tmpl/wechat.tmpl"
route:
  group_by: ['job', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5m
  receiver: default
  routes:
  - receiver: 'wechat'
    group_wait: 10s
    match:
      alertname: CoreDNSDown
receivers:
- name: 'default'
  email_configs:
  - to: 'zhai_kun@suixingpay.com'
    send_resolved: true
- name: 'wechat'
  wechat_configs:
  - corp_id: '***'
    to_party: '*'
    to_user: "**"
    agent_id: '***'
    api_secret: '***'
    send_resolved: true
Save the content above as alertmanager.yaml, then create a Secret object from that file:
# delete the original secret object
kubectl delete secret alertmanager-main -n monitoring
secret "alertmanager-main" deleted
# load our own configuration file into a new secret
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml -n monitoring
We added two receivers: by default alerts are sent by email, and the CoreDNSDown alert is routed to WeChat. Shortly after completing the steps above we receive a WeChat message:
The alert also arrives in the mailbox:
Checking the Status page of the AlertManager UI again, the configuration shown there has changed to ours:
The AlertManager configuration can also use templates (.tmpl files), which are added to the Secret object together with the alertmanager.yaml configuration file, for example:
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-example
data:
  alertmanager.yaml: {BASE64_CONFIG}
  template_1.tmpl: {BASE64_TEMPLATE_1}
  template_2.tmpl: {BASE64_TEMPLATE_2}
  ...
The templates are placed in the same path as the configuration file. To actually use them, they still have to be referenced in alertmanager.yaml:
templates:
- '*.tmpl'
Once created, the Secret object is mounted into the AlertManager Pods created by the AlertManager object.
Example: create a file named alertmanager-tmpl.yaml with the following content:
{{ define "wechat.default.message" }}
{{ range .Alerts }}
========start==========
Alerting program: prometheus_alert
Severity: {{ .Labels.severity }}
Alert name: {{ .Labels.alertname }}
Affected host: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }}
Fired at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
========end==========
{{ end }}
{{ end }}
Delete the original secret object:
$ kubectl delete secret alertmanager-main -n monitoring
secret "alertmanager-main" deleted
Create the new secret object:
$ kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml --from-file=alertmanager-tmpl.yaml -n monitoring
secret/alertmanager-main created
After a little while the alert arrives on WeChat. Because of how the labels are defined, not all of the template's fields are filled in; you can customise it to your actual situation.
Now consider a question: if our Kubernetes cluster contains a great many Services and Pods, do we really need to create a matching ServiceMonitor object for every single one of them? That would make things tedious again.
To solve this, Prometheus Operator lets us provide additional scrape configuration, through which we can add our own service discovery and monitor things automatically. Just as in the hand-written setup, we want Prometheus Operator to automatically discover and monitor Services that carry the annotation prometheus.io/scrape=true. The Prometheus configuration we used for this previously looked as follows:
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
For a Service in the cluster to be discovered automatically it needs to declare the annotation prometheus.io/scrape=true. Save the configuration above as prometheus-additional.yaml and create a corresponding Secret object from it:
$ kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitoring
secret/additional-configs created
Once created, the configuration is stored base64-encoded as the value of the prometheus-additional.yaml key:
$ kubectl get secret additional-configs -n monitoring -o yaml
apiVersion: v1
data:
  prometheus-additional.yaml: LSBqb2JfbmFtZTogJ2t1YmVybmV0ZXMtc2VydmljZS1lbmRwb2ludHMnCiAga3ViZXJuZXRlc19zZF9jb25maWdzOgogIC0gcm9sZTogZW5kcG9pbnRzCiAgcmVsYWJlbF9jb25maWdzOgogIC0gc291cmNlX2xhYmVsczogW19fbWV0YV9rdWJlcm5ldGVzX3NlcnZpY2VfYW5ub3RhdGlvbl9wcm9tZXRoZXVzX2lvX3NjcmFwZV0KICAgIGFjdGlvbjoga2VlcAogICAgcmVnZXg6IHRydWUKICAtIHNvdXJjZV9sYWJlbHM6IFtfX21ldGFfa3ViZXJuZXRlc19zZXJ2aWNlX2Fubm90YXRpb25fcHJvbWV0aGV1c19pb19zY2hlbWVdCiAgICBhY3Rpb246IHJlcGxhY2UKICAgIHRhcmdldF9sYWJlbDogX19zY2hlbWVfXwogICAgcmVnZXg6IChodHRwcz8pCiAgLSBzb3VyY2VfbGFiZWxzOiBbX19tZXRhX2t1YmVybmV0ZXNfc2VydmljZV9hbm5vdGF0aW9uX3Byb21ldGhldXNfaW9fcGF0aF0KICAgIGFjdGlvbjogcmVwbGFjZQogICAgdGFyZ2V0X2xhYmVsOiBfX21ldHJpY3NfcGF0aF9fCiAgICByZWdleDogKC4rKQogIC0gc291cmNlX2xhYmVsczogW19fYWRkcmVzc19fLCBfX21ldGFfa3ViZXJuZXRlc19zZXJ2aWNlX2Fubm90YXRpb25fcHJvbWV0aGV1c19pb19wb3J0XQogICAgYWN0aW9uOiByZXBsYWNlCiAgICB0YXJnZXRfbGFiZWw6IF9fYWRkcmVzc19fCiAgICByZWdleDogKFteOl0rKSg/OjpcZCspPzsoXGQrKQogICAgcmVwbGFjZW1lbnQ6ICQxOiQyCiAgLSBhY3Rpb246IGxhYmVsbWFwCiAgICByZWdleDogX19tZXRhX2t1YmVybmV0ZXNfc2VydmljZV9sYWJlbF8oLispCiAgLSBzb3VyY2VfbGFiZWxzOiBbX19tZXRhX2t1YmVybmV0ZXNfbmFtZXNwYWNlXQogICAgYWN0aW9uOiByZXBsYWNlCiAgICB0YXJnZXRfbGFiZWw6IGt1YmVybmV0ZXNfbmFtZXNwYWNlCiAgLSBzb3VyY2VfbGFiZWxzOiBbX19tZXRhX2t1YmVybmV0ZXNfc2VydmljZV9uYW1lXQogICAgYWN0aW9uOiByZXBsYWNlCiAgICB0YXJnZXRfbGFiZWw6IGt1YmVybmV0ZXNfbmFtZQo=
kind: Secret
metadata:
  creationTimestamp: 2019-03-20T03:38:37Z
  name: additional-configs
  namespace: monitoring
  resourceVersion: "29056864"
  selfLink: /api/v1/namespaces/monitoring/secrets/additional-configs
  uid: a579495b-4ac1-11e9-baf3-005056930126
type: Opaque
Then we only need to add this extra configuration to the declaration of the prometheus resource object (prometheus-prometheus.yaml):
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: k8s
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  baseImage: quay.io/prometheus/prometheus
  nodeSelector:
    beta.kubernetes.io/os: linux
  replicas: 2
  secrets:
  - etcd-certs
  resources:
    requests:
      memory: 400Mi
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: v2.5.0
After adding it, update the prometheus CRD resource object directly:
$ kubectl apply -f prometheus-prometheus.yaml
prometheus.monitoring.coreos.com/k8s configured
After a short while, check in the Prometheus dashboard whether the configuration has taken effect:
The configuration page of the Prometheus dashboard now shows the corresponding configuration, but when we switch to the targets page the new scrape job does not appear. Check the Prometheus Pod logs:
$ kubectl logs -f prometheus-k8s-0 prometheus -n monitoring
level=error ts=2019-03-20T03:55:01.298281581Z caller=main.go:240 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:302: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list pods at the cluster scope"
level=error ts=2019-03-20T03:55:02.29813427Z caller=main.go:240 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:301: Failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list services at the cluster scope"
level=error ts=2019-03-20T03:55:02.298431046Z caller=main.go:240 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:300: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list endpoints at the cluster scope"
level=error ts=2019-03-20T03:55:02.299312874Z caller=main.go:240 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:302: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list pods at the cluster scope"
level=error ts=2019-03-20T03:55:03.299674406Z caller=main.go:240 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:301: Failed to list *v1.Service: services is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list services at the cluster scope"
level=error ts=2019-03-20T03:55:03.299757543Z caller=main.go:240 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:300: Failed to list *v1.Endpoints: endpoints is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list endpoints at the cluster scope"
level=error ts=2019-03-20T03:55:03.299907982Z caller=main.go:240 component=k8s_client_runtime err="github.com/prometheus/prometheus/discovery/kubernetes/kubernetes.go:302: Failed to list *v1.Pod: pods is forbidden: User \"system:serviceaccount:monitoring:prometheus-k8s\" cannot list pods at the cluster scope"
There are many errors of the form xxx is forbidden, which points to an RBAC permission problem. From the configuration of the prometheus resource object we know that Prometheus is bound to a ServiceAccount named prometheus-k8s, which in turn is bound to a ClusterRole named prometheus-k8s (prometheus-clusterRole.yaml):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
The rules above clearly contain no list permissions on Services or Pods, hence the errors. To fix this, we just add the required permissions:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
Update the ClusterRole resource object and then recreate all the Prometheus Pods; the kubernetes-service-endpoints job should then appear on the targets page:
$ kubectl apply -f prometheus-clusterRole.yaml
clusterrole.rbac.authorization.k8s.io/prometheus-k8s configured
Two Services are now monitored automatically, both belonging to CoreDNS, because the Service carries two special annotations:
$ kubectl describe svc kube-dns -n kube-system
Name:              kube-dns
Namespace:         kube-system
....
Annotations:       prometheus.io/port=9153
                   prometheus.io/scrape=true
...
That is why it was discovered automatically. We can of course use the same approach to configure auto-discovery of Pods, Ingresses and other resource objects.
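For your own workloads, a Service only needs to carry the same annotations to be picked up by this scrape job. A sketch, where the application name and metrics port are made up for illustration:

apiVersion: v1
kind: Service
metadata:
  name: my-app            # hypothetical application
  namespace: default
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"   # port the /metrics endpoint listens on
spec:
  selector:
    app: my-app
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080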
When we restarted the Prometheus Pods after fixing the permissions, you may have noticed that the previously collected data was gone. That is because the Prometheus instance created through the prometheus CRD does not persist its data. This is clear from the volume mounts of the generated Prometheus Pod:
............
    volumeMounts:
    - mountPath: /etc/prometheus/config_out
      name: config-out
      readOnly: true
    - mountPath: /prometheus
      name: prometheus-k8s-db
    - mountPath: /etc/prometheus/rules/prometheus-k8s-rulefiles-0
.........
  volumes:
  - name: config
    secret:
      defaultMode: 420
      secretName: prometheus-k8s
  - emptyDir: {}
The Prometheus data directory /prometheus is mounted from an emptyDir volume, and the lifetime of emptyDir data is tied to the Pod's lifetime, so when the Pod goes away the data is lost; that is why the old data disappeared after we recreated the Pods. Monitoring data in production clearly needs to be persisted, and the prometheus CRD provides a way to configure this. Because Prometheus is ultimately deployed through a StatefulSet controller, we persist the data through a StorageClass; we already set one up earlier with Rook, so we can use it directly. Add the following to the prometheus CRD resource object (prometheus-prometheus.yaml):
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: rook-ceph-block
        resources:
          requests:
            storage: 10Gi
Note that storageClassName is the name of the StorageClass object we created earlier. Then update the prometheus CRD resource; after the update, two PVC and PV resource objects are generated automatically:
$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                            STORAGECLASS      REASON   AGE
pvc-dba11961-4ad6-11e9-baf3-005056930126   10Gi       RWO            Delete           Bound    monitoring/prometheus-k8s-db-prometheus-k8s-0   rook-ceph-block            1m
pvc-dbc6bac5-4ad6-11e9-baf3-005056930126   10Gi       RWO            Delete           Bound    monitoring/prometheus-k8s-db-prometheus-k8s-1   rook-ceph-block            1m
$ kubectl get pvc -n monitoring
NAME                                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
prometheus-k8s-db-prometheus-k8s-0   Bound    pvc-dba11961-4ad6-11e9-baf3-005056930126   10Gi       RWO            rook-ceph-block   2m
prometheus-k8s-db-prometheus-k8s-1   Bound    pvc-dbc6bac5-4ad6-11e9-baf3-005056930126   10Gi       RWO            rook-ceph-block   2m
Looking at the data directory of the Prometheus Pods again, it is now backed by a PVC object:
.......
    volumeMounts:
    - mountPath: /etc/prometheus/config_out
      name: config-out
      readOnly: true
    - mountPath: /prometheus
      name: prometheus-k8s-db
      subPath: prometheus-db
    - mountPath: /etc/prometheus/rules/prometheus-k8s-rulefiles-0
      name: prometheus-k8s-rulefiles-0
.........
  volumes:
  - name: prometheus-k8s-db
    persistentVolumeClaim:
      claimName: prometheus-k8s-db-prometheus-k8s-0
.........
Now, even if a Pod goes away, the data is no longer lost. Let's test this. First run an arbitrary query and note the result:
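For example, you could run one of the expressions from the recording rules above, such as sum(rate(container_cpu_usage_seconds_total{job="kubelet", image!=""}[5m])) by (namespace), in the Prometheus UI and remember the graph it returns.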
Delete the Pods:
kubectl delete pod prometheus-k8s-1 -n monitoring
kubectl delete pod prometheus-k8s-0 -n monitoring
Check the Pod status:
kubectl get pod -n monitoring
NAME                                   READY   STATUS              RESTARTS   AGE
alertmanager-main-0                    2/2     Running             0          2d
alertmanager-main-1                    2/2     Running             0          2d
alertmanager-main-2                    2/2     Running             0          2d
grafana-7489c49998-pkl8w               1/1     Running             0          2d
kube-state-metrics-d6cf6c7b5-7dwpg     4/4     Running             0          2d
node-exporter-dlp25                    2/2     Running             0          2d
node-exporter-fghlp                    2/2     Running             0          2d
node-exporter-mxwdm                    2/2     Running             0          2d
node-exporter-r9v92                    2/2     Running             0          2d
prometheus-adapter-84cd9c96c9-n92n4    1/1     Running             0          2d
prometheus-k8s-0                       0/3     ContainerCreating   0          3s
prometheus-k8s-1                       3/3     Running             0          9s
prometheus-operator-7b74946bd6-vmbcj   1/1     Running             0          2d
The Pods are being recreated. Once they are ready, query the data again:
The data is intact; nothing was lost.