kubernetes(k8s) Prometheus+grafana監控告警安裝部署

時間 2019-11-21

標籤 kubernetes k8s prometheus+grafana prometheus grafana 監控告警安裝部署简体版

原文原文鏈接

主機數據收集

主機數據的採集是集羣監控的基礎；外部模塊收集各個主機採集到的數據分析就能對整個集羣完成監控和告警等功能。通常主機數據採集和對外提供數據使用cAdvisor 和node-exporter等工具。javascript

cAdvisor

概述

Kubernetes的生態中，cAdvisor是做爲容器監控數據採集的Agent，其部署在每一個節點上，內部代碼結構大體以下：代碼結構很良好，collector和storage部分基本可作到增量擴展開發。php

cAdvisor.png

關於cAdvisor支持自定義指標方式能力，其自身是經過容器部署的時候設置lable標籤項：io.cadvisor.metric.開頭的lable，而value則爲自定義指標的配置文件，形以下：css

{
  "endpoint" : { "protocol": "https", "port": 8000, "path": "/nginx_status" }, "metrics_config" : [ { "name" : "activeConnections", "metric_type" : "gauge", "units" : "number of active connections", "data_type" : "int", "polling_frequency" : 10, "regex" : "Active connections: ([0-9]+)" }, { "name" : "reading", "metric_type" : "gauge", "units" : "number of reading connections", "data_type" : "int", "polling_frequency" : 10, "regex" : "Reading: ([0-9]+) .*" }, { "name" : "writing", "metric_type" : "gauge", "data_type" : "int", "units" : "number of writing connections", "polling_frequency" : 10, "regex" : ".*Writing: ([0-9]+).*" }, { "name" : "waiting", "metric_type" : "gauge", "units" : "number of waiting connections", "data_type" : "int", "polling_frequency" : 10, "regex" : ".*Waiting: ([0-9]+)" } ] }

但kubernetes1.6開始 cAdvisor 被集成到kubernetes中，能夠在安裝kubernetes經過參數激活 cAdvisorhtml

當前cAdvisor只支持http接口方式，也就是被監控容器應用必須提供http接口，因此能力較弱，若是咱們在collector這一層作擴展加強，提供數據庫，mq等等標準應用的監控模式是頗有價值的。在此以前的另外一種方案就是如上圖所示搭配promethuese（其內置有很是豐富的標準應用的插件涵蓋了APM所需的採集大部分插件），可是這每每會致使系統更復雜（若是應用層並不是想使用promethuse）java

在Kubernetes監控生態中，通常是以下的搭配使用：node

cAdvisor-promethus.png

Node-exporter

概述

node-exporter 運行在節點上採集節點主機自己的cpu和內存等使用信息，並對外提供獲取主機性能開銷的信息。linux

部署

下面是node-exporter在k8s下的部署文件nginx

apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true' labels: app: node-exporter name: node-exporter name: node-exporter namespace: kube-system spec: clusterIP: None ports: - name: scrape port: 9100 protocol: TCP selector: app: node-exporter type: ClusterIP --- apiVersion: extensions/v1beta1 kind: DaemonSet metadata: name: node-exporter namespace: kube-system spec: template: metadata: labels: app: node-exporter name: node-exporter spec: containers: - image: prom/node-exporter:latest name: node-exporter ports: - containerPort: 9100 hostPort: 9100 name: scrape hostNetwork: true hostPID: true restartPolicy: Always

監控

完成對kubernetes的監控, 監控收集數據通常有PULL和PUSH兩種方式。PULL方式是監控平臺從集羣中的主機上主動拉取採集到的主機信息，而PUSH方式是主機將採集到的信息推送到監控平臺。經常使用的監控平臺是Prometheus,是採用PULL的方式採集主機信息。git

Prometheus

概述

Prometheus 是源於 Google Borgmon 的一個系統監控和報警工具，用 Golang 語言開發。基本原理是經過 HTTP 協議週期性地抓取被監控組件的狀態（pull 方式），這樣作的好處是任意組件只要提供 HTTP 接口就能夠接入監控系統，不須要任何 SDK 或者其餘的集成過程。github

這樣作很是適合虛擬化環境好比 VM 或者 Docker ，故其爲爲數很少的適合 Docker、Mesos 、Kubernetes 環境的監控系統之一，被不少人稱爲下一代監控系統。

特性

自定義多維度的數據模型
很是高效的存儲平均一個採樣數據佔 ~3.5 bytes左右，320萬的時間序列，每30秒採樣，保持60天，消耗磁盤大概228G。
強大的查詢語句
輕鬆實現數據可視化

優勢

很是少的外部依賴，安裝使用超簡單
已經有很是多的系統集成例如：docker HAProxy Nginx JMX等等
服務自動化發現
直接集成到代碼
設計思想是按照分佈式、微服務架構來實現的

組件

Prometheus server
主要負責數據採集和存儲，提供 PromQL 查詢語言的支持；
Push Gateway
支持臨時性 Job 主動推送指標的中間網關；
Exporters
提供被監控組件信息的 HTTP 接口被叫作 exporter ，目前互聯網公司經常使用的組件大部分都有 exporter 能夠直接使用，好比 Varnish、Haproxy、Nginx、MySQL、Linux 系統信息 (包括磁盤、內存、CPU、網絡等等)；
PromDash
使用 rails 開發的 dashboard，用於可視化指標數據；
WebUI
9090 端口提供的圖形化功能；
Alertmanager
用來進行報警；
APIclients
提供 HTTPAPI 接口

Prometheus

部署

安裝Rbac

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""] resources: - nodes - nodes/proxy - services - endpoints - pods verbs: ["get", "list", "watch"] - apiGroups: - extensions resources: - ingresses verbs: ["get", "list", "watch"] - nonResourceURLs: ["/metrics"] verbs: ["get"] --- apiVersion: v1 kind: ServiceAccount metadata: name: prometheus namespace: kube-system --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: prometheus roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: prometheus subjects: - kind: ServiceAccount name: prometheus namespace: kube-system

安裝Configmap

# cat configmap.yaml apiVersion: v1 kind: ConfigMap metadata: name: prometheus-config namespace: kube-system data: prometheus.yml: | global: scrape_interval: 15s evaluation_interval: 15s scrape_configs: - job_name: 'kubernetes-apiservers' kubernetes_sd_configs: - role: endpoints scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token relabel_configs: - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name] action: keep regex: default;kubernetes;https - job_name: 'kubernetes-nodes' kubernetes_sd_configs: - role: node scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token relabel_configs: - action: labelmap regex: __meta_kubernetes_node_label_(.+) - target_label: __address__ replacement: kubernetes.default.svc:443 - source_labels: [__meta_kubernetes_node_name] regex: (.+) target_label: __metrics_path__ replacement: /api/v1/nodes/${1}/proxy/metrics - job_name: 'kubernetes-cadvisor' kubernetes_sd_configs: - role: node scheme: https tls_config: ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token relabel_configs: - action: labelmap regex: __meta_kubernetes_node_label_(.+) - target_label: __address__ replacement: kubernetes.default.svc:443 - source_labels: [__meta_kubernetes_node_name] regex: (.+) target_label: __metrics_path__ replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor - job_name: 'kubernetes-service-endpoints' kubernetes_sd_configs: - role: endpoints relabel_configs: - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape] action: keep regex: true - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme] action: replace target_label: __scheme__ regex: (https?) - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path] action: replace target_label: __metrics_path__ regex: (.+) - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port] action: replace target_label: __address__ regex: ([^:]+)(?::\d+)?;(\d+) replacement: $1:$2 - action: labelmap regex: __meta_kubernetes_service_label_(.+) - source_labels: [__meta_kubernetes_namespace] action: replace target_label: kubernetes_namespace - source_labels: [__meta_kubernetes_service_name] action: replace target_label: kubernetes_name - job_name: 'kubernetes-services' kubernetes_sd_configs: - role: service metrics_path: /probe params: module: [http_2xx] relabel_configs: - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe] action: keep regex: true - source_labels: [__address__] target_label: __param_target - target_label: __address__ replacement: blackbox-exporter.example.com:9115 - source_labels: [__param_target] target_label: instance - action: labelmap regex: __meta_kubernetes_service_label_(.+) - source_labels: [__meta_kubernetes_namespace] target_label: kubernetes_namespace - source_labels: [__meta_kubernetes_service_name] target_label: kubernetes_name - job_name: 'kubernetes-ingresses' kubernetes_sd_configs: - role: ingress relabel_configs: - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe] action: keep regex: true - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path] regex: (.+);(.+);(.+) replacement: ${1}://${2}${3} target_label: __param_target - target_label: __address__ replacement: blackbox-exporter.example.com:9115 - source_labels: [__param_target] target_label: instance - action: labelmap regex: __meta_kubernetes_ingress_label_(.+) - source_labels: [__meta_kubernetes_namespace] target_label: kubernetes_namespace - source_labels: [__meta_kubernetes_ingress_name] target_label: kubernetes_name - job_name: 'kubernetes-pods' kubernetes_sd_configs: - role: pod relabel_configs: - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] action: keep regex: true - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] action: replace target_label: __metrics_path__ regex: (.+) - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port] action: replace regex: ([^:]+)(?::\d+)?;(\d+) replacement: $1:$2 target_label: __address__ - action: labelmap regex: __meta_kubernetes_pod_label_(.+) - source_labels: [__meta_kubernetes_namespace] action: replace target_label: kubernetes_namespace - source_labels: [__meta_kubernetes_pod_name] action: replace target_label: kubernetes_pod_name

部署 Prometheus

---
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  labels:
    name: prometheus-deployment
  name: prometheus
  namespace: kube-system spec: replicas: 1 selector: matchLabels: app: prometheus template: metadata: labels: app: prometheus spec: containers: - image: prom/prometheus:v2.0.0 name: prometheus command: - "/bin/prometheus" args: - "--config.file=/etc/prometheus/prometheus.yml" - "--storage.tsdb.path=/prometheus" - "--storage.tsdb.retention=24h" ports: - containerPort: 9090 protocol: TCP volumeMounts: - mountPath: "/prometheus" name: data - mountPath: "/etc/prometheus" name: config-volume resources: requests: cpu: 100m memory: 100Mi limits: cpu: 500m memory: 2500Mi serviceAccountName: prometheus volumes: - name: data emptyDir: {} - name: config-volume configMap: name: prometheus-config

---
kind: Service
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus
  namespace: kube-system
spec:
  type: NodePort ports: - port: 9090 targetPort: 9090 nodePort: 30003 selector: app: prometheus

測試

訪問集羣中 http://prometheus地址:30003後。在Graph頁面
輸入:

node_cpu

查詢命令能夠看到節點cpu的使用信息。prometheus監控節點信息成功。
訪問targets頁面能夠看到prometheus採集的監控信息的來源。

告警

Prometheus的告警是使用AlertManger來一同完成的。Prometheus在監控信息超過設定閥值時就將告警信息發送給AlertManger模塊,AlertManger模塊負責告警。

AlertManger

概述

Alertmanager與Prometheus是相互分離的兩個組件。Prometheus服務器根據報警規則將警報發送給Alertmanager，而後Alertmanager將silencing、inhibition、aggregation等消息經過電子郵件、PaperDuty和HipChat發送通知。

設置警報和通知的主要步驟：

安裝配置Alertmanager
配置Prometheus經過-alertmanager.url標誌與Alertmanager通訊
在Prometheus中建立告警觸發規則。
在Alertmanager中設置告警通知規則

告警通知規則

Alertmanager處理由例如Prometheus服務器等客戶端發來的警報。它負責刪除重複數據、分組，並將警報經過路由發送到正確的接收器，好比電子郵件、Slack等。Alertmanager還支持groups,silencing和警報抑制的機制。

分組

分組是指將同一類型的警報分類爲單個通知。當許多系統同時宕機時，頗有可能成百上千的警報會同時生成，這種機制特別有用。
例如，當數十或數百個服務的實例在運行，網絡發生故障時，有可能一半的服務實例不能訪問數據庫。在prometheus告警規則中配置爲每個服務實例都發送警報的話，那麼結果是數百警報被髮送至Alertmanager。

可是做爲用戶只想看到單一的報警頁面，同時仍然可以清楚的看到哪些實例受到影響，所以，能夠經過配置Alertmanager將警報分組打包，併發送一個相對看起來緊湊的通知。

分組警報、警報時間，以及接收警報的receiver是在alertmanager配置文件中經過路由樹配置的。

抑制(Inhibition)

抑制是指當警報發出後，中止重複發送由此警報引起其餘錯誤的警報的機制。(好比網絡不可達，致使其餘服務鏈接相關警報)

例如，當整個集羣網絡不可達，此時警報被觸發，能夠事先配置Alertmanager忽略由該警報觸發而產生的全部其餘警報，這能夠防止通知數百或數千與此問題不相關的其餘警報。

抑制機制也是經過Alertmanager的配置文件來配置。

沉默(Silences)

Silences是一種簡單的特定時間不告警的機制。silences警告是經過匹配器(matchers)來配置，就像路由樹同樣。傳入的警報會匹配RE，若是匹配，將不會爲此警報發送通知。
silences報警機制能夠經過Alertmanager的Web頁面進行配置。

接收

使用Receiver定義各類通知用戶的途徑，告警通過分組，過濾處理後選擇匹配的通知渠道發送給接收用戶。

部署

報警觸發規則

定義報警規則
報警規則經過如下格式定義：

ALERT <alert name> IF <expression> [ FOR <duration> ] [ LABELS <label set> ] [ ANNOTATIONS <label set> ]

可選的FOR語句，使得Prometheus在表達式輸出的向量元素（例如高HTTP錯誤率的實例）之間等待一段時間，將警報計數做爲觸發此元素。若是元素是active，可是沒有firing的，就處於pending狀態。
LABELS（標籤）語句容許指定一組標籤附加警報上。將覆蓋現有衝突的任何標籤，標籤值也能夠被模板化。
ANNOTATIONS（註釋）它們被用於存儲更長的其餘信息，例如警報描述或者連接，註釋值也能夠被模板化。
Templating(模板) 標籤和註釋值可使用控制檯模板進行模板化。 $labels變量保存警報實例的標籤鍵/值對，value保存警報實例的評估值。$

# To insert a firing element's label values: {{ $labels.<labelname> }} # To insert the numeric expression value of the firing element: {{ $value }}

例子:
prometheus.yml中配置Prometheus和AlertManager通訊的通訊方式以及告警觸發規則。

在prometheus.yml中配置Prometheus來和AlertManager通訊

alerting:
  alertmanagers:
    - static_configs:
      - targets: ["alertManager域名:9093"]

在prometheus.yml中指定匹配報警規則的間隔

# How frequently to evaluate rules. [ evaluation_interval: <duration> | default = 1m ]

在prometheus.yml中指定規則文件（可以使用通配符，如rules/*.rules）

rule_files:
    - /etc/prometheus/rules.yml

其中rule_files就是用來指定報警規則的，這裏咱們將rules.yml用ConfigMap的形式掛載到/etc/prometheus目錄下面便可:

rules.yml: |
    groups:
    - name: test-rule rules: - alert: NodeFilesystemUsage expr: (node_filesystem_size{device="rootfs"} - node_filesystem_free{device="rootfs"}) / node_filesystem_size{device="rootfs"} * 100 > 80 for: 2m labels: team: node annotations: summary: "{{$labels.instance}}: High Filesystem usage detected" description: "{{$labels.instance}}: Filesystem usage is above 80% (current value is: {{ $value }}" - alert: NodeMemoryUsage expr: (node_memory_MemTotal - (node_memory_MemFree+node_memory_Buffers+node_memory_Cached )) / node_memory_MemTotal * 100 > 80 for: 2m labels: team: node annotations: summary: "{{$labels.instance}}: High Memory usage detected" description: "{{$labels.instance}}: Memory usage is above 80% (current value is: {{ $value }}" - alert: NodeCPUUsage expr: (100 - (avg by (instance) (irate(node_cpu{job="kubernetes-node-exporter",mode="idle"}[5m])) * 100)) > 80 for: 2m labels: team: node annotations: summary: "{{$labels.instance}}: High CPU usage detected" description: "{{$labels.instance}}: CPU usage is above 80% (current value is: {{ $value }}"

配置文件設置好後，須要讓prometheus從新讀取，有兩種方法：

經過HTTP API向/-/reload發送POST請求，
例：curl -X POST http://localhost:9090/-/reload
向prometheus進程發送SIGHUP信號

告警通知規則

全局
要指定加載的配置文件，須要使用-config.file標誌。該文件使用YAML來完成，經過下面的描述來定義。帶括號的參數表示是可選的，對於非列表的參數的值，將被設置爲指定的缺省值。

通用佔位符定義解釋:

<duration> : 與正則表達式匹配的持續時間值,[0-9]+(ms|[smhdwy]) <labelname>: 與正則表達式匹配的字符串,[a-zA-Z_][a-zA-Z0-9_]* <labelvalue>: unicode字符串 <filepath>: 有效的文件路徑 <boolean>: boolean類型，true或者false <string>: 字符串 <tmpl_string>: 模板變量字符串

global全局配置文件參數在全部配置上下文生效，做爲其餘配置項的默認值,可被覆蓋.

global: resolve_timeout: 30s smtp_smarthost: "smtp.163.com:25" smtp_from: 'xiyanxiyan10@163.com' smtp_auth_username: "xiyanxiyan10@163.com" smtp_auth_password: "xiyanxiyan10" smtp_require_tls: false

路由(route)
路由塊定義了路由樹及其子節點。若是沒有設置的話，子節點的可選配置參數從其父節點繼承。

每一個警報都會在配置的頂級路由中進入路由樹，該路由樹必須匹配全部警報（即沒有任何配置的匹配器）。而後遍歷子節點。若是continue的值設置爲false，它在第一個匹配的子節點以後就中止；若是continue的值爲true，警報將繼續進行後續子節點的匹配。若是警報不匹配任何節點的任何子節點（沒有匹配的子節點，或不存在），該警報基於當前節點的配置處理。

route: receiver: mailhook group_wait: 30s group_interval: 1m repeat_interval: 1m group_by: [NodeMemoryUsage, NodeCPUUsage, NodeFilesystemUsage] routes: - receiver: mailhook group_wait: 10s match: team: node

設置alertmanager.yml的的route與receivers

route:
  # The labels by which incoming alerts are grouped together. For example, # multiple alerts coming in for cluster=A and alertname=LatencyHigh would # be batched into a single group. group_by: ['alertname'] # When a new group of alerts is created by an incoming alert, wait at # least 'group_wait' to send the initial notification. # This way ensures that you get multiple alerts for the same group that start # firing shortly after another are batched together on the first # notification. group_wait: 5s # When the first notification was sent, wait 'group_interval' to send a batch # of new alerts that started firing for that group. group_interval: 1m # If an alert has successfully been sent, wait 'repeat_interval' to # resend them. repeat_interval: 3h # A default receiver receiver: mengyuan receivers: - name: 'mengyuan' webhook_configs: - url: http://192.168.0.53:8080 email_configs: - to: 'xiyanxiyan10@hotmail.com'

路由配置格式

#報警接收器 [ receiver: <string> ] #分組 [ group_by: '[' <labelname>, ... ']' ] # Whether an alert should continue matching subsequent sibling nodes. [ continue: <boolean> | default = false ] # A set of equality matchers an alert has to fulfill to match the node. #根據匹配的警報，指定接收器 match: [ <labelname>: <labelvalue>, ... ] # A set of regex-matchers an alert has to fulfill to match the node. match_re: #根據匹配正則符合的警告，指定接收器 [ <labelname>: <regex>, ... ] # How long to initially wait to send a notification for a group # of alerts. Allows to wait for an inhibiting alert to arrive or collect # more initial alerts for the same group. (Usually ~0s to few minutes.) [ group_wait: <duration> ] # How long to wait before sending notification about new alerts that are # in are added to a group of alerts for which an initial notification # has already been sent. (Usually ~5min or more.) [ group_interval: <duration> ] # How long to wait before sending a notification again if it has already # been sent successfully for an alert. (Usually ~3h or more). [ repeat_interval: <duration> ] # Zero or more child routes. routes: [ - <route> ... ]

例子：

// Match does a depth-first left-to-right search through the route tree // and returns the matching routing nodes. func (r *Route) Match(lset model.LabelSet) []*Route { Alert Alert是alertmanager接收到的報警，類型以下。 // Alert is a generic representation of an alert in the Prometheus eco-system. type Alert struct { // Label value pairs for purpose of aggregation, matching, and disposition // dispatching. This must minimally include an "alertname" label. Labels LabelSet `json:"labels"` // Extra key/value information which does not define alert identity. Annotations LabelSet `json:"annotations"` // The known time range for this alert. Both ends are optional. StartsAt time.Time `json:"startsAt,omitempty"` EndsAt time.Time `json:"endsAt,omitempty"` GeneratorURL string `json:"generatorURL"` }

具備相同Lables的Alert（key和value都相同）纔會被認爲是同一種。在prometheus rules文件配置的一條規則可能會產生多種報警

抑制規則 inhibit_rule

抑制規則，是存在另外一組匹配器匹配的狀況下，使其餘被引起警報的規則靜音。這兩個警報，必須有一組相同的標籤。

抑制配置格式

# Matchers that have to be fulfilled in the alerts to be muted. ##必須在要須要靜音的警報中履行的匹配者 target_match: [ <labelname>: <labelvalue>, ... ] target_match_re: [ <labelname>: <regex>, ... ] # Matchers for which one or more alerts have to exist for the # inhibition to take effect. #必須存在一個或多個警報以使抑制生效的匹配者。 source_match: [ <labelname>: <labelvalue>, ... ] source_match_re: [ <labelname>: <regex>, ... ] # Labels that must have an equal value in the source and target # alert for the inhibition to take effect. #在源和目標警報中必須具備相等值的標籤才能使抑制生效 [ equal: '[' <labelname>, ... ']' ]

接收器(receiver)

顧名思義，警報接收的配置。
通用配置格式

# The unique name of the receiver. name: <string> # Configurations for several notification integrations. email_configs: [ - <email_config>, ... ] pagerduty_configs: [ - <pagerduty_config>, ... ] slack_config: [ - <slack_config>, ... ] opsgenie_configs: [ - <opsgenie_config>, ... ] webhook_configs: [ - <webhook_config>, ... ] 郵件接收器email_config # Whether or not to notify about resolved alerts. #警報被解決以後是否通知 [ send_resolved: <boolean> | default = false ] # The email address to send notifications to. to: <tmpl_string> # The sender address. [ from: <tmpl_string> | default = global.smtp_from ] # The SMTP host through which emails are sent. [ smarthost: <string> | default = global.smtp_smarthost ] # The HTML body of the email notification. [ html: <tmpl_string> | default = '{{ template "email.default.html" . }}' ] # Further headers email header key/value pairs. Overrides any headers # previously set by the notification implementation. [ headers: { <string>: <tmpl_string>, ... } ]

Slcack接收器slack_config

# Whether or not to notify about resolved alerts. [ send_resolved: <boolean> | default = true ] # The Slack webhook URL. [ api_url: <string> | default = global.slack_api_url ] # The channel or user to send notifications to. channel: <tmpl_string> # API request data as defined by the Slack webhook API. [ color: <tmpl_string> | default = '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}' ] [ username: <tmpl_string> | default = '{{ template "slack.default.username" . }}' [ title: <tmpl_string> | default = '{{ template "slack.default.title" . }}' ] [ title_link: <tmpl_string> | default = '{{ template "slack.default.titlelink" . }}' ] [ pretext: <tmpl_string> | default = '{{ template "slack.default.pretext" . }}' ] [ text: <tmpl_string> | default = '{{ template "slack.default.text" . }}' ] [ fallback: <tmpl_string> | default = '{{ template "slack.default.fallback" . }}' ]

Webhook接收器webhook_config

# Whether or not to notify about resolved alerts. [ send_resolved: <boolean> | default = true ] # The endpoint to send HTTP POST requests to. url: <string>

Alertmanager會使用如下的格式向配置端點發送HTTP POST請求：

{
  "version": "3", "groupKey": <number> // key identifying the group of alerts (e.g. to deduplicate) "status": "<resolved|firing>", "receiver": <string>, "groupLabels": <object>, "commonLabels": <object>, "commonAnnotations": <object>, "externalURL": <string>, // backling to the Alertmanager. "alerts": [ { "labels": <object>, "annotations": <object>, "startsAt": "<rfc3339>", "endsAt": "<rfc3339>" }, ... ] }

例子:

receivers: - name: mailhook email_configs: - to: "xiyanxiyan10@hotmail.com" html: '{{ template "alert.html" . }}' headers: { Subject: "[WARN] Warn Email" }

集羣信息展現

集羣的總體信息已經收集彙總在Prometheus中，但Prometheus主要是對外提供數據獲取接口，並不負責完成完善的圖形展現，所以須要使用DashBoard工具對接Prometheus完成集羣信息的圖形化展現.

Grafana

概述

grafana 是一款採用 go 語言編寫的開源應用，主要用於大規模指標數據的可視化展示，基於商業友好的 Apache License 2.0 開源協議。grafana有熱插拔控制面板和可擴展的數據源，目前已經支持絕大部分經常使用的時序數據庫。

目前grafana支持的數據源

支持的數據源

上文已經提到，Grafana支持不少的數據源，主要支持的有以下數據源：

部署

部署服務

# cat grafana-deploy.yaml --- apiVersion: extensions/v1beta1 kind: Deployment metadata: name: grafana-core namespace: kube-system labels: app: grafana component: core spec: replicas: 1 template: metadata: labels: app: grafana component: core spec: containers: - image: grafana/grafana:4.2.0 name: grafana-core imagePullPolicy: IfNotPresent # env: resources: # keep request = limit to keep this container in guaranteed class limits: cpu: 100m memory: 100Mi requests: cpu: 100m memory: 100Mi env: # The following env variables set up basic auth twith the default admin user and admin password. - name: GF_AUTH_BASIC_ENABLED value: "true" - name: GF_AUTH_ANONYMOUS_ENABLED value: "false" # - name: GF_AUTH_ANONYMOUS_ORG_ROLE # value: Admin # does not really work, because of template variables in exported dashboards: # - name: GF_DASHBOARDS_JSON_ENABLED # value: "true" readinessProbe: httpGet: path: /login port: 3000 # initialDelaySeconds: 30 # timeoutSeconds: 1 volumeMounts: - name: grafana-persistent-storage mountPath: /var volumes: - name: grafana-persistent-storage emptyDir: {} --- apiVersion: v1 kind: Service metadata: name: grafana namespace: kube-system labels: app: grafana component: core spec: type: NodePort ports: - port: 3000 targetPort: 3000 nodePort: 30009 selector: app: grafana