Monitoring Kubernetes Cluster Applications

Date: 2019-12-13


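Prometheus collects metrics through a public HTTP(S) endpoint, so there is no need to install a separate monitoring agent: a service only has to expose a metrics endpoint, and Prometheus pulls data from it periodically. For an ordinary HTTP service, we can reuse the service itself and add a /metrics endpoint for Prometheus. The metric data it returns is also in an easy-to-read text format.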
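To make the idea of "adding a /metrics endpoint to an existing HTTP service" concrete, here is a minimal sketch using only the Python standard library. The metric name demo_requests_total, the counter, and the port are hypothetical and chosen for illustration; a real service would more likely use a Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter; a real service would track its own values.
REQUEST_COUNT = {"value": 0}


def render_metrics() -> str:
    """Render the current counter in the Prometheus text exposition format."""
    return (
        "# HELP demo_requests_total Requests served by this demo process.\n"
        "# TYPE demo_requests_total counter\n"
        f"demo_requests_total {REQUEST_COUNT['value']}\n"
    )


class Handler(BaseHTTPRequestHandler):
    def _send_text(self, body: bytes, content_type: str) -> None:
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        if self.path == "/metrics":
            # The metrics endpoint lives next to the normal routes of the service.
            self._send_text(render_metrics().encode("utf-8"), "text/plain; version=0.0.4")
        else:
            # Every other path counts as an application request.
            REQUEST_COUNT["value"] += 1
            self._send_text(b"hello\n", "text/plain")


def serve(port: int = 8000) -> None:
    """Start the demo server; the port is an arbitrary choice for this sketch."""
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Calling serve() and then requesting /metrics with curl should return the three lines produced by render_metrics().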

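Many services now ship with a built-in /metrics endpoint from the start, for example the Kubernetes components and the Istio service mesh. Services that do not natively provide the endpoint can still be monitored through an exporter, such as mysqld_exporter or node_exporter. An exporter is somewhat like the agent in a traditional monitoring system: it runs as a service, collects metrics from the target service, and exposes them to Prometheus.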

Monitoring ordinary applications

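For ingress, we use Traefik as our ingress-controller; it is the bridge between services inside the Kubernetes cluster and external users. Traefik has a built-in /metrics endpoint, but it has to be enabled in the configuration: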

[metrics]
  [metrics.prometheus]
    entryPoint = "traefik"
    buckets = [0.1, 0.3, 1.2, 5.0]

Earlier versions enabled it with the two flags --web and --web.metrics.prometheus, so check the documentation for the version you run.

Add the configuration above to the traefik.toml configuration file, then update the ConfigMap and Pod resources. Once the Traefik Pod is running, we can look up the service IP:

$ kubectl get svc -n kube-system
NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                       AGE
......
traefik-ingress-service   NodePort    10.101.33.56     <none>        80:31692/TCP,8080:32115/TCP   63d

Then use curl to check that the Prometheus metrics endpoint is enabled (accessing it through the NodePort also works):

$ curl 10.101.33.56:8080/metrics
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.000121036
go_gc_duration_seconds{quantile="0.25"} 0.000210328
go_gc_duration_seconds{quantile="0.5"} 0.000279974
go_gc_duration_seconds{quantile="0.75"} 0.000420738
go_gc_duration_seconds{quantile="1"} 0.001191494
go_gc_duration_seconds_sum 0.004353914
go_gc_duration_seconds_count 12
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 63
......
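The text format returned by /metrics is simple enough to read by hand, and also to parse. The following is a rough sketch, not a complete parser: the function name parse_samples is hypothetical, and it assumes sample lines carry no optional timestamp. The sample text is an excerpt of the output above.

```python
SAMPLE = """\
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 63
go_gc_duration_seconds{quantile="0.5"} 0.000279974
go_gc_duration_seconds_count 12
"""


def parse_samples(text: str) -> dict:
    """Map each series (metric name plus any {labels}) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the # HELP / # TYPE comments
        # Assumption: no trailing timestamp, so the value is the last field.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


if __name__ == "__main__":
    for series, value in parse_samples(SAMPLE).items():
        print(series, value)
```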

The curl output shows that Traefik's metrics endpoint is enabled, so we can now add this /metrics endpoint to prometheus.yml, directly below the default prometheus job (prome-cm.yaml):

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-ops
data:
  prometheus.yml: |
    global:
      scrape_interval: 30s
      scrape_timeout: 30s
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
    - job_name: 'traefik'
      static_configs:
      - targets: ['traefik-ingress-service.kube-system.svc.cluster.local:8080']

This is only a very simple configuration. scrape_configs supports many more parameters, for example:

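basic_auth and bearer_token: used when the /metrics endpoint requires authentication; both a traditional username/password and a token added to the request header are supported
kubernetes_sd_configs or consul_sd_configs: used to discover monitoring targets automatically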
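To illustrate what the two authentication options mean on the wire, the sketch below builds the Authorization header a scraper would send in each case. The helper names and the example credentials are hypothetical; this shows the standard HTTP header shapes, not Prometheus internals.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """HTTP Basic auth: base64 of 'username:password'."""
    raw = f"{username}:{password}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


def bearer_token_header(token: str) -> dict:
    """Bearer auth: the token is passed as-is in the header."""
    return {"Authorization": f"Bearer {token}"}


if __name__ == "__main__":
    # Placeholder credentials for illustration only.
    print(basic_auth_header("user", "pass"))
    print(bearer_token_header("example-token"))
```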

Because the Service for Traefik is named traefik-ingress-service and lives in the kube-system namespace, the targets entry must use the FQDN form: traefik-ingress-service.kube-system.svc.cluster.local. If Traefik and Prometheus are deployed in the same namespace, servicename:serviceport is enough. Now update the ConfigMap resource:

$ kubectl delete -f prome-cm.yaml
configmap "prometheus-config" deleted
$ kubectl create -f prome-cm.yaml
configmap "prometheus-config" created
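The naming rule for the targets entry (FQDN across namespaces, short name within the same namespace) can be sketched as a small helper. The function service_target is hypothetical, and cluster.local is assumed as the cluster domain, matching the address used in this article.

```python
def service_target(service: str, namespace: str, port: int,
                   same_namespace: bool = False,
                   cluster_domain: str = "cluster.local") -> str:
    """Build a scrape target address for a Kubernetes Service."""
    if same_namespace:
        # Within one namespace the short service name resolves on its own.
        return f"{service}:{port}"
    # Across namespaces, use <service>.<namespace>.svc.<cluster domain>.
    return f"{service}.{namespace}.svc.{cluster_domain}:{port}"


if __name__ == "__main__":
    print(service_target("traefik-ingress-service", "kube-system", 8080))
    print(service_target("redis", "kube-ops", 9121, same_namespace=True))
```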

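The content of the Prometheus configuration file has now changed, and after a short while the prometheus.yml file mounted into the Pod is updated too. Because we started Prometheus with the --web.enable-lifecycle flag earlier, a single reload request is enough to apply the configuration: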

$ kubectl get svc -n kube-ops
NAME         TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
prometheus   NodePort   10.102.74.90   <none>        9090:30358/TCP   3d
$ curl -X POST "http://10.102.74.90:9090/-/reload"

A ConfigMap mounted into a Pod as a Volume is hot-updated only after a certain interval, so wait a little before reloading.
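The "wait, then reload" step can also be scripted. This is a minimal sketch with the standard library; the helper names and the 60-second default wait are assumptions for illustration, not a value taken from Kubernetes or Prometheus documentation.

```python
import time
import urllib.request


def build_reload_request(base_url: str) -> urllib.request.Request:
    """Build the POST request for the Prometheus /-/reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_after_wait(base_url: str, wait_seconds: float = 60.0) -> int:
    """Sleep to give the mounted ConfigMap time to update, then trigger a reload.

    wait_seconds is an arbitrary assumption; tune it for your cluster.
    Returns the HTTP status code of the reload call.
    """
    time.sleep(wait_seconds)
    with urllib.request.urlopen(build_reload_request(base_url), timeout=10) as resp:
        return resp.status
```

For the Service shown above, the call would be reload_after_wait("http://10.102.74.90:9090").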

The reload URL accepts a POST request, and it can be reached through the Service's CLUSTER-IP:PORT. After reloading, open the Prometheus Dashboard and check the scrape targets.

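The traefik job we just added now appears there. Switching to the Graph page, we can find a number of Traefik metrics. To learn what a metric means, look at the corresponding /metrics endpoint, which usually carries a comment for each metric.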
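The same target check can be done without the Dashboard, through the Prometheus HTTP API endpoint /api/v1/targets. The sketch below only parses a response; SAMPLE_TARGETS is a hand-written, heavily trimmed payload for illustration (real responses contain more fields), and the function name is hypothetical.

```python
import json

# Hand-written, trimmed example of an /api/v1/targets response (illustrative only).
SAMPLE_TARGETS = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "prometheus", "instance": "localhost:9090"}, "health": "up"},
            {"labels": {"job": "traefik",
                        "instance": "traefik-ingress-service.kube-system.svc.cluster.local:8080"},
             "health": "up"},
        ]
    },
})


def target_health_by_job(payload: str) -> dict:
    """Group the health of every active target by its job label."""
    data = json.loads(payload)
    health = {}
    for target in data["data"]["activeTargets"]:
        health.setdefault(target["labels"]["job"], []).append(target["health"])
    return health


if __name__ == "__main__":
    print(target_health_by_job(SAMPLE_TARGETS))
```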

With that, we have configured our first Kubernetes application in Prometheus.

Monitoring applications with an exporter

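As mentioned above, some applications do not come with a /metrics endpoint for Prometheus. In that case we need an exporter service to provide metrics to Prometheus. The Prometheus project provides official exporters for many applications, and there are many third-party implementations as well; the official website lists them under "exporters".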
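Conceptually, an exporter reads an application's native statistics and re-publishes them in the Prometheus text format. The sketch below illustrates that translation for text in the key:value style of the Redis INFO command. It is a simplified, hypothetical illustration: the function name, the redis_ prefix handling, and the sample input are assumptions, and this is not how any particular exporter is implemented.

```python
# Hand-written sample in the style of Redis INFO output (illustrative only).
SAMPLE_INFO = (
    "# Server\r\n"
    "redis_version:4.0.14\r\n"
    "# Clients\r\n"
    "connected_clients:2\r\n"
    "# Memory\r\n"
    "used_memory:1024\r\n"
    "used_memory_human:1.00K\r\n"
)


def info_to_metrics(info_text: str, prefix: str = "redis_") -> str:
    """Turn numeric key:value lines into Prometheus-style sample lines."""
    lines = []
    for raw in info_text.splitlines():
        raw = raw.strip()
        if not raw or raw.startswith("#") or ":" not in raw:
            continue  # skip section headers and blank lines
        key, _, value = raw.partition(":")
        try:
            float(value)
        except ValueError:
            continue  # skip non-numeric fields such as version strings
        lines.append(f"{prefix}{key} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(info_to_metrics(SAMPLE_INFO), end="")
```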

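Here we use a redis-exporter service to monitor Redis. For this kind of application, the exporter is usually deployed as a sidecar in the same Pod as the main application. We deploy a redis application and use redis-exporter to collect monitoring data for Prometheus, with the following manifest (prome-redis.yaml):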

# Note: extensions/v1beta1 Deployments were removed in Kubernetes 1.16. On newer
# clusters use apps/v1, which also requires a spec.selector matching the Pod labels.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: redis
  namespace: kube-ops
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9121"
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:4
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 6379
      - name: redis-exporter
        image: oliver006/redis_exporter:latest
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 9121
---
kind: Service
apiVersion: v1
metadata:
  name: redis
  namespace: kube-ops
spec:
  selector:
    app: redis
  ports:
  - name: redis
    port: 6379
    targetPort: 6379
  - name: prom
    port: 9121
    targetPort: 9121

The redis Pod above contains two containers: the main redis application and redis_exporter. Now create the application:

$ kubectl create -f prome-redis.yaml
deployment.extensions "redis" created
service "redis" created

Once it is created, we can see that the redis Pod contains two containers:

$ kubectl get pods -n kube-ops
NAME                          READY     STATUS    RESTARTS   AGE
prometheus-8566cd9699-gt9wh   1/1       Running   0          3d
redis-544b6c8c54-8xd2g        2/2       Running   0          3m
$ kubectl get svc -n kube-ops
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
prometheus   NodePort    10.102.74.90    <none>        9090:30358/TCP      3d
redis        ClusterIP   10.104.131.44   <none>        6379/TCP,9121/TCP   5m

We can verify through port 9121 that metrics are being collected:

$ curl 10.104.131.44:9121/metrics
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
......
# HELP redis_used_cpu_user_children used_cpu_user_childrenmetric
# TYPE redis_used_cpu_user_children gauge
redis_used_cpu_user_children{addr="redis://localhost:6379",alias=""} 0

As before, we now only need to update the Prometheus configuration file:

- job_name: 'redis'
  static_configs:
  - targets: ['redis:9121']

Because the redis Service and Prometheus are in the same namespace, the service name alone is enough.

After updating the configuration file, reload it:

$ kubectl delete -f prome-cm.yaml
configmap "prometheus-config" deleted
$ kubectl create -f prome-cm.yaml
configmap "prometheus-config" created
# wait a moment, then run the reload
$ curl -X POST "http://10.102.74.90:9090/-/reload"

Now look at the scrape targets in the Prometheus Dashboard again.

The redis job is now in effect. The Graph page shows many Redis metrics.

Select any metric, for example redis_exporter_scrapes_total, and click Execute to see the corresponding graph.
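The same metric can be fetched programmatically with an instant query against the Prometheus HTTP API (/api/v1/query). The sketch below builds the query URL and extracts values from a response; SAMPLE_QUERY_RESPONSE is a hand-written example with a made-up timestamp and value, and the helper names are hypothetical.

```python
import json
import urllib.parse

# Hand-written example of an instant-query response (timestamp and value are made up).
SAMPLE_QUERY_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "redis_exporter_scrapes_total", "job": "redis"},
             "value": [1576200000.0, "42"]},
        ],
    },
})


def build_query_url(base_url: str, expr: str) -> str:
    """Build the instant-query URL for a PromQL expression."""
    return base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def vector_values(payload: str) -> list:
    """Return (labels, value) pairs from an instant-vector response."""
    data = json.loads(payload)
    if data.get("status") != "success":
        raise ValueError("query did not succeed")
    # Each sample value is [timestamp, "value as string"].
    return [(item["metric"], float(item["value"][1])) for item in data["data"]["result"]]


if __name__ == "__main__":
    print(build_query_url("http://10.102.74.90:9090", "redis_exporter_scrapes_total"))
    print(vector_values(SAMPLE_QUERY_RESPONSE))
```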