Prometheus Operator: A More Elegant Monitoring Tool for Kubernetes

Date: 2019-11-11

1. Introduction to Kubernetes Operators

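With Kubernetes, managing and scaling web applications, mobile backends, and API services has become fairly simple. The reason is that these applications are usually stateless, so basic Kubernetes API objects such as Deployment can scale them and recover them from failures without any extra work.

Managing stateful applications such as databases, caches, or monitoring systems is a different challenge. These systems need application-specific knowledge to scale and upgrade correctly, and to reconfigure themselves effectively when data is lost or unavailable. We would like this operational know-how to be encoded in software, so that complex applications can be run and managed correctly on top of Kubernetes.

An Operator is software that extends the Kubernetes API through the TPR mechanism (ThirdPartyResource, since superseded by CRD, CustomResourceDefinition) and builds application-specific knowledge into it, so that users can create, configure, and manage applications. Like the built-in Kubernetes resources, an Operator does not manage a single application instance but many instances across the cluster.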
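The core idea behind an Operator, comparing the desired state declared in a custom resource with what is actually running and then acting on the difference, can be illustrated with a small sketch. This is a conceptual toy, not how any real Operator is implemented: the `DesiredState` class and the `reconcile` function are hypothetical names invented for this example, and no Kubernetes API is called.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DesiredState:
    """Hypothetical custom resource: what the user asked for."""
    name: str
    replicas: int


def reconcile(desired: DesiredState, running: List[str]) -> List[str]:
    """Return the actions needed to move the observed state to the desired one."""
    actions = []
    # Scale up: create the instances that are missing.
    for ordinal in range(len(running), desired.replicas):
        actions.append(f"create {desired.name}-{ordinal}")
    # Scale down: remove surplus instances, highest ordinal first.
    for instance in reversed(running[desired.replicas:]):
        actions.append(f"delete {instance}")
    return actions


if __name__ == "__main__":
    want = DesiredState(name="prometheus", replicas=2)
    print(reconcile(want, running=[]))
    print(reconcile(want, running=["prometheus-0", "prometheus-1", "prometheus-2"]))
```

A real Operator runs this kind of comparison every time a watched resource changes, and the "actions" are calls to the Kubernetes API rather than strings.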

2. Introduction to Prometheus Operator

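The Prometheus Operator for Kubernetes provides simple monitoring definitions for Kubernetes services, as well as deployment and management of Prometheus instances.

Once installed, the Prometheus Operator provides the following features:

Create/Destroy: Easily launch a Prometheus instance in a Kubernetes namespace, for a specific application or team, using the Operator.
Simple Configuration: Configure the fundamentals of Prometheus, such as versions, persistence, retention policies, and replicas, from a native Kubernetes resource.
Target Services via Labels: Automatically generate monitoring target configurations based on familiar Kubernetes label queries; there is no need to learn a Prometheus-specific configuration language.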

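The architecture of Prometheus Operator is shown below:

[Figure: Prometheus Operator architecture diagram]

The components in this architecture run in the Kubernetes cluster as different kinds of resources, each with its own role: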

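Operator: The Operator deploys and manages Prometheus Server according to custom resources (defined through Custom Resource Definitions, CRDs). It also watches for changes to these custom resources and reacts to them. It is the control center of the whole system.
Prometheus: The Prometheus resource declaratively describes the desired state of a Prometheus deployment.
Prometheus Server: The Prometheus Server cluster that the Operator deploys according to the contents of a Prometheus custom resource. These custom resources can be regarded as the StatefulSets used to manage the Prometheus Server cluster.
ServiceMonitor: ServiceMonitor is also a custom resource. It describes a list of targets to be monitored by Prometheus. It uses labels to select the matching Service endpoints, and Prometheus Server scrapes metrics through the selected Services.
Service: A Service fronts the Pods in the cluster that serve metrics, so that a ServiceMonitor can select it and Prometheus Server can scrape it. Put simply, it is the thing Prometheus monitors, for example a Node Exporter Service or a MySQL Exporter Service.
Alertmanager: Alertmanager is also a custom resource type. The Operator deploys an Alertmanager cluster according to the resource description.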

3. Deploying Prometheus Operator

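Environment:

Kubernetes version: 1.12, installed with kubeadm
helm version: v2.11.0

We install with Helm. The Helm chart, prometheus-operator, has been modified to fit our actual usage.

It bundles Grafana and the exporters that monitor Kubernetes. Note that I configured Grafana to store its data in MySQL; the details are in another article, "Deploying Prometheus and Grafana with Helm to Monitor Kubernetes".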

cd helm/prometheus-operator/
helm install --name prometheus-operator --namespace monitoring -f values.yaml ./

To use Prometheus Operator more flexibly, adding custom monitoring is essential. Here we use ceph-exporter as an example.

The following section of values.yaml uses a ServiceMonitor to add the monitoring:

serviceMonitor:
  enabled: true  # enable monitoring
  # on what port are the metrics exposed by the exporter
  exporterPort: 9128
  # for apps that have deployed outside of the cluster, list their addresses here
  endpoints: []
  # Are we talking http or https?
  scheme: http
  # service selector label key to target ceph exporter pods
  serviceSelectorLabelKey: app
  # default rules are in templates/ceph-exporter.rules.yaml
  prometheusRules: {}
  # Custom Labels to be added to ServiceMonitor
  # Testing showed that adding the Prometheus Operator release label to the ServiceMonitor is enough for monitoring to work
  additionalServiceMonitorLabels:
    release: prometheus-operator
  # Custom Labels to be added to Prometheus Rules CRD
  additionalRulesLabels: {}

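The most important parameter is additionalServiceMonitorLabels. Testing showed that a ServiceMonitor must carry a label already used by the Prometheus Operator release before its targets are picked up.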

[root@lab1 templates]# kubectl get servicemonitor -n monitoring ceph-exporter -o yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: 2018-10-30T06:51:12Z
  generation: 1
  labels:
    app: ceph-exporter
    chart: ceph-exporter-0.1.0
    heritage: Tiller
    prometheus: ceph-exporter
    release: prometheus-operator
  name: ceph-exporter
  namespace: monitoring
  resourceVersion: "13937459"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/monitoring/servicemonitors/ceph-exporter
  uid: 30569173-dc10-11e8-bcf3-000c293d66a5
spec:
  endpoints:
  - interval: 30s
    port: http
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      app: ceph-exporter
      release: ceph-exporter

[root@lab1 prometheus-operator]# kubectl get pod -n monitoring  prometheus-operator-operator-7459848949-8dddt -o yaml|more
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: 2018-10-30T00:39:37Z
  generateName: prometheus-operator-operator-7459848949-
  labels:
    app: prometheus-operator-operator
    chart: prometheus-operator-0.1.6
    heritage: Tiller
    pod-template-hash: "745984894
    release: prometheus-operator

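Key points:

The ServiceMonitor's labels must include at least one label that matches a label on the prometheus-operator Pod (here, release: prometheus-operator). This is presumably because the Prometheus custom resource selects ServiceMonitors by label through its serviceMonitorSelector field;
The ServiceMonitor's spec parameters (endpoints, namespaceSelector, selector) must be set correctly;
The Service must be reachable by Prometheus, and all of its endpoints must be healthy;
If you run into problems, enable debug logging for both Prometheus Operator and Prometheus. The logs contain little other information, but the Prometheus Operator debug log lists the ServiceMonitors it currently sees, which lets you confirm whether the ServiceMonitor you installed has been matched.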
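The label matching described in the key points follows the usual Kubernetes matchLabels rule: a selector matches when every key/value pair it lists is present on the object. The sketch below illustrates that rule in plain Python. The ServiceMonitor labels and its spec.selector are copied from the output above; `assumed_prometheus_selector` and `assumed_service_labels` are assumptions made for illustration, since neither the Prometheus resource nor the Service labels are shown in this article.

```python
from typing import Dict


def matches(match_labels: Dict[str, str], labels: Dict[str, str]) -> bool:
    """True when every key/value pair in the selector is present in labels."""
    return all(labels.get(key) == value for key, value in match_labels.items())


# Labels of the ceph-exporter ServiceMonitor, taken from the kubectl output above.
service_monitor_labels = {
    "app": "ceph-exporter",
    "chart": "ceph-exporter-0.1.0",
    "heritage": "Tiller",
    "prometheus": "ceph-exporter",
    "release": "prometheus-operator",
}

# ASSUMPTION: the Prometheus resource selects ServiceMonitors by the release label.
assumed_prometheus_selector = {"release": "prometheus-operator"}

# spec.selector.matchLabels of the ServiceMonitor, taken from the output above.
service_selector = {"app": "ceph-exporter", "release": "ceph-exporter"}

# ASSUMPTION: labels on the ceph-exporter Service (not shown in the article).
assumed_service_labels = {"app": "ceph-exporter", "release": "ceph-exporter"}

if __name__ == "__main__":
    # Step 1: does Prometheus pick up the ServiceMonitor?
    print(matches(assumed_prometheus_selector, service_monitor_labels))
    # Step 2: does the ServiceMonitor pick up the Service?
    print(matches(service_selector, assumed_service_labels))
    # Without the extra release label, step 1 fails.
    print(matches(assumed_prometheus_selector, {"release": "ceph-exporter"}))
```

Both steps have to succeed before targets appear in Prometheus, which is why a ServiceMonitor whose own labels only say release: ceph-exporter would be ignored under this assumption.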

After a successful installation, check the related resources:

[root@lab1 prometheus-operator]# kubectl get service,servicemonitor,ep -n monitoring
NAME                                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/alertmanager-operated                          ClusterIP   None             <none>        9093/TCP,6783/TCP   12d
service/ceph-exporter                                  ClusterIP   10.100.57.62     <none>        9128/TCP            46h
service/monitoring-mysql-mysql                         ClusterIP   10.108.93.155    <none>        3306/TCP            42d
service/prometheus-operated                            ClusterIP   None             <none>        9090/TCP            12d
service/prometheus-operator-alertmanager               ClusterIP   10.98.42.209     <none>        9093/TCP            6d19h
service/prometheus-operator-grafana                    ClusterIP   10.103.100.150   <none>        80/TCP              6d19h
service/prometheus-operator-kube-state-metrics         ClusterIP   10.110.76.250    <none>        8080/TCP            6d19h
service/prometheus-operator-operator                   ClusterIP   None             <none>        8080/TCP            6d19h
service/prometheus-operator-prometheus                 ClusterIP   10.111.24.83     <none>        9090/TCP            6d19h
service/prometheus-operator-prometheus-node-exporter   ClusterIP   10.97.126.74     <none>        9100/TCP            6d19h

NAME                                                                               AGE
servicemonitor.monitoring.coreos.com/ceph-exporter                                 1d
servicemonitor.monitoring.coreos.com/prometheus-operator                           8d
servicemonitor.monitoring.coreos.com/prometheus-operator-alertmanager              6d
servicemonitor.monitoring.coreos.com/prometheus-operator-apiserver                 6d
servicemonitor.monitoring.coreos.com/prometheus-operator-coredns                   6d
servicemonitor.monitoring.coreos.com/prometheus-operator-kube-controller-manager   6d
servicemonitor.monitoring.coreos.com/prometheus-operator-kube-etcd                 6d
servicemonitor.monitoring.coreos.com/prometheus-operator-kube-scheduler            6d
servicemonitor.monitoring.coreos.com/prometheus-operator-kube-state-metrics        6d
servicemonitor.monitoring.coreos.com/prometheus-operator-kubelet                   6d
servicemonitor.monitoring.coreos.com/prometheus-operator-node-exporter             6d
servicemonitor.monitoring.coreos.com/prometheus-operator-operator                  6d
servicemonitor.monitoring.coreos.com/prometheus-operator-prometheus                6d

NAME                                                     ENDPOINTS                                                                 AGE
endpoints/alertmanager-operated                          10.244.6.174:9093,10.244.6.174:6783                                       12d
endpoints/ceph-exporter                                  10.244.2.59:9128                                                          46h
endpoints/monitoring-mysql-mysql                         10.244.6.171:3306                                                         42d
endpoints/prometheus-operated                            10.244.2.60:9090,10.244.6.175:9090                                        12d
endpoints/prometheus-operator-alertmanager               10.244.6.174:9093                                                         6d19h
endpoints/prometheus-operator-grafana                    10.244.6.106:3000                                                         6d19h
endpoints/prometheus-operator-kube-state-metrics         10.244.2.163:8080                                                         6d19h
endpoints/prometheus-operator-operator                   10.244.6.113:8080                                                         6d19h
endpoints/prometheus-operator-prometheus                 10.244.2.60:9090,10.244.6.175:9090                                        6d19h
endpoints/prometheus-operator-prometheus-node-exporter   192.168.105.92:9100,192.168.105.93:9100,192.168.105.94:9100 + 4 more...   6d19h

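4. Adding dashboards to Grafana

The _dashboards directory in the prometheus-operator chart above contains dashboards that I have modified, and they are fairly comprehensive. I import them manually through the Grafana UI, which leaves them freely editable afterwards and is very convenient in day-to-day use. By contrast, if the dashboard JSON files are placed in the dashboards directory and installed by Helm, the installed dashboards cannot be edited directly in Grafana, which makes them cumbersome to work with.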

5. Adding alerts to Alertmanager

Add a PrometheusRule. Here is an example:

[root@lab1 ceph-exporter]# kubectl get prometheusrule -n monitoring ceph-exporter -o yaml 
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: 2018-10-30T06:51:12Z
  generation: 1
  labels:
    app: prometheus
    chart: ceph-exporter-0.1.0
    heritage: Tiller
    prometheus: ceph-exporter
    release: ceph-exporter
  name: ceph-exporter
  namespace: monitoring
  resourceVersion: "13965150"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheusrules/ceph-exporter
  uid: 30543ec9-dc10-11e8-bcf3-000c293d66a5
spec:
  groups:
  - name: ceph-exporter.rules
    rules:
    - alert: Ceph
      annotations:
        description: There is no running ceph exporter.
        summary: Ceph exporter is down
      expr: absent(up{job="ceph-exporter"} == 1)
      for: 5m
      labels:
        severity: critical
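The rule above combines two ideas: absent(up{job="ceph-exporter"} == 1) is true only when no ceph-exporter target reports up, and for: 5m keeps the alert in the pending state until that has been continuously true for five minutes. The following sketch mimics both in plain Python to make the behavior concrete. It is a simplified model and not Prometheus code; `absent_up` and `alert_state` are hypothetical helper names, and the sample data is made up.

```python
from typing import Dict, List, Optional, Tuple

Sample = Tuple[Dict[str, str], float]  # (labels, value) of one "up" series


def absent_up(samples: List[Sample], job: str) -> bool:
    """Mimic absent(up{job="<job>"} == 1): True when no target of the job is up."""
    return not any(
        labels.get("job") == job and value == 1 for labels, value in samples
    )


def alert_state(history: List[Tuple[int, bool]], hold_seconds: int) -> str:
    """Derive the alert state from (timestamp, condition) pairs, oldest first."""
    active_since: Optional[int] = None
    state = "inactive"
    for timestamp, condition in history:
        if not condition:
            # Any evaluation where the condition is false resets the timer.
            active_since, state = None, "inactive"
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        state = "firing" if held >= hold_seconds else "pending"
    return state


if __name__ == "__main__":
    up_now = [({"job": "ceph-exporter"}, 0.0), ({"job": "node-exporter"}, 1.0)]
    print(absent_up(up_now, "ceph-exporter"))
    print(alert_state([(0, True), (60, True), (300, True)], hold_seconds=300))
    print(alert_state([(0, True), (120, False), (180, True), (420, True)], hold_seconds=300))
```

Note that the expression also covers the case where the exporter has disappeared from service discovery entirely, because then there is no up series for the job at all.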

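The default rules for monitoring Kubernetes are already numerous and comprehensive; you can adjust them in prometheus-operator/templates/all-prometheus-rules.yaml.

Alert routing and receivers can be changed in the following section under alertmanager: in values.yaml: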

  config:
    global:
      resolve_timeout: 5m
      # The smarthost and SMTP sender used for mail notifications.
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'xxxxxx@163.com'
      smtp_auth_username: 'xxxxxx@163.com'
      smtp_auth_password: 'xxxxxx'
      # The API URL to use for Slack notifications.
      slack_api_url: 'https://hooks.slack.com/services/some/api/token'
    route:
      group_by: ["job", "alertname"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'noemail'
      routes:
      - match:
          severity: critical
        receiver: critical_email_alert
      - match_re:
          # Alertmanager anchors regexes to the whole value, so ".*" is needed to match any suffix
          alertname: "^KubeJob.*"
        receiver: default_email

    receivers:
      - name: 'default_email'
        email_configs:
        - to : 'xxxxxx@163.com'
          send_resolved: true

      - name: 'critical_email_alert'
        email_configs:
        - to : 'xxxxxx@163.com'
          send_resolved: true

      - name: 'noemail'
        email_configs:
        - to : 'null@null.cn'
          send_resolved: false
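The route section of this config can be read as a small decision procedure: child routes are tried in order, the first one that matches decides the receiver, and anything else falls back to the top-level receiver. The sketch below models that in plain Python under simplifying assumptions: it ignores grouping, nested routes, and the continue option, and `pick_receiver` and `route_matches` are hypothetical names. The regular expression is written as ^KubeJob.* here because Alertmanager anchors match_re patterns to the whole label value, so a pattern like ^KubeJob* would only accept "KubeJo" followed by zero or more "b" characters.

```python
import re
from typing import Dict, List

# Child routes in the order they appear in the config.
ROUTES: List[dict] = [
    {"match": {"severity": "critical"}, "receiver": "critical_email_alert"},
    {"match_re": {"alertname": "^KubeJob.*"}, "receiver": "default_email"},
]
DEFAULT_RECEIVER = "noemail"  # receiver of the top-level route


def route_matches(route: dict, labels: Dict[str, str]) -> bool:
    """Check the exact-match and regex-match conditions of one route."""
    exact_ok = all(
        labels.get(key) == value for key, value in route.get("match", {}).items()
    )
    # fullmatch models the anchoring that Alertmanager applies to regexes.
    regex_ok = all(
        re.fullmatch(pattern, labels.get(key, "")) is not None
        for key, pattern in route.get("match_re", {}).items()
    )
    return exact_ok and regex_ok


def pick_receiver(labels: Dict[str, str]) -> str:
    """Return the receiver of the first matching child route, else the default."""
    for route in ROUTES:
        if route_matches(route, labels):
            return route["receiver"]
    return DEFAULT_RECEIVER


if __name__ == "__main__":
    print(pick_receiver({"alertname": "Ceph", "severity": "critical"}))
    print(pick_receiver({"alertname": "KubeJobFailed", "severity": "warning"}))
    print(pick_receiver({"alertname": "KubePodCrashLooping", "severity": "warning"}))
```

With this layout, critical alerts and Kubernetes job alerts are mailed, while everything else lands on the noemail receiver, which effectively silences it.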

  ## Alertmanager template files to format alerts
  ## ref: https://prometheus.io/docs/alerting/notifications/
  ##      https://prometheus.io/docs/alerting/notification_examples/
  ##
  templateFiles:
    template_1.tmpl: |-
      {{ define "cluster" }}{{ .ExternalURL | reReplaceAll ".*alertmanager\\.(.*)" "$1" }}{{ end }}

      {{ define "slack.k8s.text" }}
      {{- $root := . -}}
      {{ range .Alerts }}
       *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
       *Cluster:*  {{ template "cluster" $root }}
       *Description:* {{ .Annotations.description }}
       *Graph:* <{{ .GeneratorURL }}|:chart_with_upwards_trend:>
       *Runbook:* <{{ .Annotations.runbook }}|:spiral_note_pad:>
       *Details:*
         {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
         {{ end }}