The monitoring metrics that Prometheus Operator collects by default cannot fully cover real-world monitoring needs, so we need to add custom monitoring targets for our own workloads. The steps to add a custom monitoring target are:
1. Create a ServiceMonitor object, which tells Prometheus about the new scrape target
2. Associate the ServiceMonitor object with a Service object that exposes the metrics endpoint
3. Make sure the Service object can actually serve the metrics data
Below, this article uses adding Redis monitoring as an example.
k8s-redis-and-exporter-deployment.yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: redis
  name: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9121"
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 6379
      - name: redis-exporter
        image: oliver006/redis_exporter:latest
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 9121
While deploying Redis, we run redis_exporter as a sidecar in the same Pod as the Redis service.
Also note that we added the annotations prometheus.io/scrape: "true" and prometheus.io/port: "9121".
apiVersion: v1
kind: Service
metadata:
  name: redis-svc
  namespace: redis
  labels:
    app: redis
spec:
  type: NodePort
  ports:
  - name: redis
    port: 6379
    targetPort: 6379
  - name: redis-exporter
    port: 9121
    targetPort: 9121
  selector:
    app: redis
Check the deployed resources and verify that metrics data can be retrieved.
[root@]# kubectl get po,ep,svc -n redis
NAME                         READY   STATUS    RESTARTS   AGE
pod/redis-78446485d8-sp57x   2/2     Running   0          116m

NAME                  ENDPOINTS                                AGE
endpoints/redis-svc   100.102.126.3:9121,100.102.126.3:6379    6m5s

NAME                TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)                         AGE
service/redis-svc   NodePort   10.105.111.177   <none>        6379:32357/TCP,9121:31019/TCP   6m5s

Verify the metrics:

[root@qd01-stop-k8s-master001 MyDefine]# curl 10.105.111.177:9121/metrics
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 8
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
............
To have Prometheus scrape this Redis instance, all that remains is to create the ServiceMonitor object.
prometheus-serviceMonitorRedis.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: redis-k8s
  namespace: monitoring
  labels:
    app: redis
spec:
  jobLabel: redis
  endpoints:
  - port: redis-exporter
    interval: 30s
    scheme: http
  selector:
    matchLabels:
      app: redis
  namespaceSelector:
    matchNames:
    - redis
Apply it and check the ServiceMonitor:
[root@]# kubectl apply -f prometheus-serviceMonitorRedis.yaml
servicemonitor.monitoring.coreos.com/redis-k8s created

[root@]# kubectl get serviceMonitor -n monitoring
NAME        AGE
redis-k8s   11s
Now switch to the Prometheus UI and look at the Targets page; you will see the redis-k8s target that was just created.
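If the Prometheus UI is not exposed externally, a quick way to reach it is a port-forward; this is a minimal sketch assuming the default kube-prometheus Service name prometheus-k8s on port 9090:

# Forward the Prometheus web UI to localhost
kubectl port-forward svc/prometheus-k8s 9090:9090 -n monitoring
# Then open http://localhost:9090/targets and look for the redis-k8s job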
You can now query the Redis metrics collected by redis_exporter.
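For example, a few PromQL queries you might run in the Prometheus UI (these metric names are the ones commonly exposed by redis_exporter; adjust them if your exporter version differs):

# Is the Redis instance up? (1 = up, 0 = down)
redis_up

# Memory currently used by Redis, in bytes
redis_memory_used_bytes

# Number of connected clients
redis_connected_clients

# Command throughput over the last 5 minutes
rate(redis_commands_processed_total[5m])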
We can now collect Redis metrics, but no alerting rules have been configured for them yet. We need to add alerting rules ourselves based on the metrics we actually care about.
First, let's look at the rules Prometheus ships with by default, which look roughly as follows.
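If you prefer the command line to the UI, these default rules are defined by the PrometheusRule objects that kube-prometheus installs; a minimal way to list and peek at them (resource names vary with the kube-prometheus version):

kubectl get prometheusrule -n monitoring
kubectl get prometheusrule prometheus-k8s-rules -n monitoring -o yaml | head -n 40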
Now let's add a rule for Redis. On the Prometheus Config page, look at the AlertManager-related configuration:
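The alerting section shown there is generated by the Operator; it looks roughly like the following sketch (the exact relabeling rules depend on the kube-prometheus version):

alerting:
  alertmanagers:
  - kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names:
        - monitoring
    relabel_configs:
    # keep only the Service named alertmanager-main
    - action: keep
      source_labels: [__meta_kubernetes_service_name]
      regex: alertmanager-main
    # keep only the endpoint port named web
    - action: keep
      source_labels: [__meta_kubernetes_endpoint_port_name]
      regex: web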
From the alertmanagers configuration above we can see that the instances are obtained through Kubernetes service discovery with the endpoints role, matching the Service named alertmanager-main with the port named web. Let's take a look at the alertmanager-main Service:
[root@]# kubectl describe svc alertmanager-main -n monitoring
Name:              alertmanager-main
Namespace:         monitoring
Labels:            alertmanager=main
Annotations:       <none>
Selector:          alertmanager=main,app=alertmanager
Type:              ClusterIP
IP:                10.111.141.65
Port:              web  9093/TCP
TargetPort:        web/TCP
Endpoints:         100.118.246.1:9093,100.64.147.129:9093,100.98.81.194:9093
Session Affinity:  ClientIP
Events:            <none>
We can see that the Service name is indeed alertmanager-main and its port is named web, matching the rules above, so Prometheus and AlertManager are correctly wired together. The corresponding alerting rule files are all the YAML files under /etc/prometheus/rules/prometheus-k8s-rulefiles-0/. We can enter the Prometheus Pod to verify that this directory contains those YAML files:
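A quick way to check, assuming the default kube-prometheus pod name prometheus-k8s-0 and container name prometheus:

# List the rule files mounted into the Prometheus container
kubectl exec -n monitoring prometheus-k8s-0 -c prometheus -- \
  ls /etc/prometheus/rules/prometheus-k8s-rulefiles-0/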
That YAML file is in fact the content of the PrometheusRule resource we created earlier:
The PrometheusRule here has the name prometheus-k8s-rules and the namespace monitoring. From this we can infer that whenever we create a PrometheusRule resource object, a corresponding <namespace>-<name>.yaml file is automatically generated under the prometheus-k8s-rulefiles-0 directory above. So if we want a custom alert later, we only need to define a PrometheusRule resource object. Why is Prometheus able to pick up this PrometheusRule resource at all? Look at the prometheus resource we created (prometheus-prometheus.yaml): it contains a very important property, ruleSelector, which is the filter used to match rules. It requires PrometheusRule objects that carry the prometheus=k8s and role=alert-rules labels:
ruleSelector:
  matchLabels:
    prometheus: k8s
    role: alert-rules
So to define a custom alerting rule, we just need to create a PrometheusRule object carrying the prometheus=k8s and role=alert-rules labels. For example, let's add an alert for whether Redis is available: we can use the redis_up metric to check whether Redis is up. Create the file prometheus-redisRules.yaml:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: redis-rules
  namespace: monitoring
spec:
  groups:
  - name: redis
    rules:
    - alert: RedisUnavailable
      annotations:
        summary: redis instance info
        description: If redis_up == 0, redis will be unavailable
      expr: |
        redis_up == 0
      for: 3m
      labels:
        severity: critical
After creating the PrometheusRule, we can see our own redis-rules:
kubectl apply -f prometheus-redisRules.yaml

kubectl get prometheusrule -n monitoring
NAME                   AGE
etcd-rules             4d18h
prometheus-k8s-rules   17d
redis-rules            15s
Note that the object must carry both the prometheus=k8s and role=alert-rules labels, since the ruleSelector above matches on both. After it is created, wait a moment and then look at the rules folder inside the container again, as shown below:
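For instance (same assumptions as above about the pod and container names; per the <namespace>-<name>.yaml convention the generated file should be named monitoring-redis-rules.yaml):

kubectl exec -n monitoring prometheus-k8s-0 -c prometheus -- \
  ls /etc/prometheus/rules/prometheus-k8s-rulefiles-0/ | grep redis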
We can see that the rule file we created has been injected into the corresponding rulefiles folder. Then go to the Alerts page of Prometheus, where the alerting rule we just created is now listed:
Now we know how to add an alerting rule, but how are these alert notifications actually delivered?
That is what AlertManager needs to be configured for.
Here I will use email and WeChat as examples.
The AlertManager configuration file alertmanager.yaml is created from the alertmanager-secret.yaml file. Here is the default configuration:
cat alertmanager-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    "global":
      "resolve_timeout": "5m"
    "inhibit_rules":
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "critical"
      "target_match_re":
        "severity": "warning|info"
    - "equal":
      - "namespace"
      - "alertname"
      "source_match":
        "severity": "warning"
      "target_match_re":
        "severity": "info"
    "receivers":
    - "name": "Default"
    - "name": "Watchdog"
    - "name": "Critical"
    "route":
      "group_by":
      - "namespace"
      "group_interval": "5m"
      "group_wait": "30s"
      "receiver": "Default"
      "repeat_interval": "12h"
      "routes":
      - "match":
          "alertname": "Watchdog"
        "receiver": "Watchdog"
      - "match":
          "severity": "critical"
        "receiver": "Critical"
type: Opaque
Now we need to modify this file to add the WeChat and email settings. As a prerequisite, you need to have your WeChat Work (企業微信) credentials ready; there are plenty of tutorials online for obtaining them.
First, create the alertmanager.yaml file:
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.51os.club:25'
  smtp_from: 'amos'
  smtp_auth_username: 'amos@51os.club'
  smtp_auth_password: 'Mypassword'
  smtp_hello: '51os.club'
  smtp_require_tls: false
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_secret: 'SGGc4x-RDcVD_ptvVhYrxxxxxxxxxxOhWVWIITRxM'
  wechat_api_corp_id: 'ww419xxxxxxxx735e1c0'

templates:
- '*.tmpl'

route:
  group_by: ['job', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: default
  routes:
  - receiver: wechat
    continue: true
    match:
      alertname: Watchdog

receivers:
- name: 'default'
  email_configs:
  - to: '10xxxx1648@qq.com'
    send_resolved: true
- name: 'wechat'
  wechat_configs:
  - send_resolved: false
    corp_id: 'ww419xxxxxxxx35e1c0'
    to_party: '13'
    message: '{{ template "wechat.default.message" . }}'
    agent_id: '1000003'
    api_secret: 'SGGc4x-RDcxxxxxxxxY6YwfZFsO9OhWVWIITRxM'
I have added two receivers here: the default one delivers via email, and the Watchdog alert is additionally routed to the wechat receiver for delivery through WeChat.
Then create a templates file, wechat.tmpl, which is the template used for the WeChat messages:
{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 -}}
AlertType: {{ $alert.Labels.alertname }}
AlertLevel: {{ $alert.Labels.severity }}
=====================
{{- end }}
===Alert Info===
Alert Info: {{ $alert.Annotations.message }}
Alert Time: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
===More Info===
{{ if gt (len $alert.Labels.instance) 0 -}}InstanceIp: {{ $alert.Labels.instance }};{{- end -}}
{{- if gt (len $alert.Labels.namespace) 0 -}}InstanceNamespace: {{ $alert.Labels.namespace }};{{- end -}}
{{- if gt (len $alert.Labels.node) 0 -}}NodeIP: {{ $alert.Labels.node }};{{- end -}}
{{- if gt (len $alert.Labels.pod_name) 0 -}}PodName: {{ $alert.Labels.pod_name }}{{- end }}
=====================
{{- end }}
{{- end }}

{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 -}}
AlertType: {{ $alert.Labels.alertname }}
AlertLevel: {{ $alert.Labels.severity }}
=====================
{{- end }}
===Alert Info===
Alert Info: {{ $alert.Annotations.message }}
Alert Start Time: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
Alert Fix Time: {{ $alert.EndsAt.Format "2006-01-02 15:04:05" }}
===More Info===
{{ if gt (len $alert.Labels.instance) 0 -}}InstanceIp: {{ $alert.Labels.instance }};{{- end -}}
{{- if gt (len $alert.Labels.namespace) 0 -}}InstanceNamespace: {{ $alert.Labels.namespace }};{{- end -}}
{{- if gt (len $alert.Labels.node) 0 -}}NodeIP: {{ $alert.Labels.node }};{{- end -}}
{{- if gt (len $alert.Labels.pod_name) 0 -}}PodName: {{ $alert.Labels.pod_name }};{{- end }}
=====================
{{- end }}
{{- end }}
{{- end }}
Now delete the original alertmanager-main secret first, then recreate the alertmanager-main secret from alertmanager.yaml and wechat.tmpl:
kubectl delete secret alertmanager-main -n monitoring
kubectl create secret generic alertmanager-main --from-file=alertmanager.yaml --from-file=wechat.tmpl -n monitoring
Once the steps above are done, we will soon receive a WeChat message, and the same alert will also arrive in the mailbox:
Looking at the AlertManager configuration again, we can see it has now been replaced with our configuration above.
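One way to double-check from the command line, a minimal sketch assuming the default alertmanager-main Service on port 9093:

# Decode the Secret to confirm the new alertmanager.yaml is in place
kubectl get secret alertmanager-main -n monitoring -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d

# Or port-forward AlertManager and inspect the running config in the UI under Status
kubectl port-forward svc/alertmanager-main 9093:9093 -n monitoring
# then open http://localhost:9093/#/status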