Environment information
# Kubernetes version (installed with kubeadm):
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-11T13:17:17Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-11T13:09:17Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}

# Helm version:
$ helm version
version.BuildInfo{Version:"v3.3.4", GitCommit:"a61ce5633af99708171414353ed49547cf05013d", GitTreeState:"clean", GoVersion:"go1.14.9"}
Deploy the Prometheus Operator with Helm. The official chart documentation is on GitHub; note that Kubernetes 1.16 or later and Helm 3 or later are required.
# Add the repos
$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo add stable https://charts.helm.sh/stable
$ helm repo update

# Check that the repos were added; if prometheus-community shows up as below, it was added successfully
$ helm repo list
NAME                    URL
prometheus-community    https://prometheus-community.github.io/helm-charts
stable                  https://charts.helm.sh/stable
Because we need to add PVC-backed persistent storage to the Prometheus Operator, add extra scrape targets, and change some of the default configuration, we first download the chart locally, edit values.yaml, and then run the install command.
# Download the chart
$ helm pull prometheus-community/kube-prometheus-stack
$ ls -l
-rw-r--r-- 1 root root 326161 Dec 21 10:24 kube-prometheus-stack-12.2.3.tgz
# Unpack it
$ tar -xzvf kube-prometheus-stack-12.2.3.tgz
# Edit values.yaml
$ cd kube-prometheus-stack   # change into the chart directory
$ vi values.yaml
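If you prefer not to keep the unpacked chart around, a possible alternative (not what is done in this post) is to export only the default values and pass the edited file at install time:

# Alternative: export the chart's default values, edit them, then install with -f
$ helm show values prometheus-community/kube-prometheus-stack > values.yaml
$ helm install prometheus prometheus-community/kube-prometheus-stack -f values.yaml -n monitoring --create-namespace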
In my environment the PVs are backed by hostPath directories, so before editing values.yaml we first create the corresponding PVs in the cluster.
$ mkdir -p /promethous/{alert,grafana,promethous}   # create the hostPath directories used by the PVs
Save the following as prometheus-pv.yaml and run kubectl create -f prometheus-pv.yaml to create the PVs.
---
# storageClass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: prometheus
provisioner: kubernetes.io/no-provisioner
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
# alertmanager pv
apiVersion: v1
kind: PersistentVolume
metadata:
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    use: alert
  name: alert-pv
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 20Gi
  local:
    path: /promethous/alert
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - 192.168.0.13   # node affinity; here the PV is pinned to the node 192.168.0.13
  persistentVolumeReclaimPolicy: Delete
  storageClassName: prometheus
  volumeMode: Filesystem
---
# grafana pv
apiVersion: v1
kind: PersistentVolume
metadata:
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    use: grafana
  name: grafana-pv
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  local:
    path: /promethous/grafana
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - 192.168.0.13
  persistentVolumeReclaimPolicy: Delete
  storageClassName: prometheus
  volumeMode: Filesystem
---
# prometheus pv
apiVersion: v1
kind: PersistentVolume
metadata:
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    use: prometheus
  name: prometheus-pv
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 70Gi
  local:
    path: /promethous/promethous
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - 192.168.0.13
  persistentVolumeReclaimPolicy: Delete
  storageClassName: prometheus
  volumeMode: Filesystem
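After applying the manifest, it is worth confirming that the StorageClass and PVs exist; because volumeBindingMode is WaitForFirstConsumer, the PVs will stay in the Available state until the chart's PVCs are created later:

$ kubectl get storageclass prometheus
$ kubectl get pv -l 'use in (alert,grafana,prometheus)'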
values.yaml is long, so only the parts that need to be changed are shown one by one below. I have uploaded the complete modified values.yaml here for readers who want it (note: some of it still needs to be adapted to your own environment).
Alertmanager configuration:
# The alertmanager section mainly adds alert templates, alert receivers, and an ingress to expose the Alertmanager UI;
# the wiring between Alertmanager and Prometheus is already handled by the chart defaults.
alertmanager:
  enabled: true
  apiVersion: v2
  serviceAccount:
    create: true
    name: ""
    annotations: {}
  podDisruptionBudget:
    enabled: false
    minAvailable: 1
    maxUnavailable: ""
  config:
    global:
      resolve_timeout: 5m
      # email alerting settings
      smtp_hello: 'kubernetes'
      smtp_from: 'example@163.com'
      smtp_smarthost: 'smtp.163.com:25'
      smtp_auth_username: 'test'
      smtp_auth_password: 'USFRGHSFQTCJNDAHQ'   # this is an authorization code, not the mailbox password; set it according to the mail provider you use
      # WeChat Work alerting settings, see https://www.cnblogs.com/miaocbin/p/13706164.html
      wechat_api_secret: 'RRTAFGGSS0G_KFSl6FYBVlHyMo'
      wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
      wechat_api_corp_id: 'ssssdghsetyxsg'
    templates:
    - '/etc/alertmanager/config/*.tmpl'   # path to the alert templates
    route:
      group_by: ['job']   # group alerts by the job label
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'wechat'   # default receiver
      # route alerts whose alertname label equals Watchdog to the email receiver
      routes:
      - match:
          alertname: Watchdog
        receiver: 'email'
    # email alert receiver
    receivers:
    - name: 'email'
      email_configs:
      - to: 'test@example.com'
        html: '{{ template "template_email.tmpl" }}'
    # WeChat Work alert receiver
    - name: 'wechat'
      wechat_configs:
      - send_resolved: true
        message: '{{ template "template_wechat.tmpl" . }}'
        to_party: '2'
        agent_id: '1000002'
  tplConfig: false
  templateFiles:
    # email alert template
    template_email.tmpl: |-
      {{ define "cluster" }}{{ .ExternalURL | reReplaceAll ".*alertmanager\\.(.*)" "$1" }}{{ end }}
      {{ define "slack.myorg.text" }}
      {{- $root := . -}}
      {{ range .Alerts }}
        *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
        *Cluster:* {{ template "cluster" $root }}
        *Description:* {{ .Annotations.description }}
        *Graph:* <{{ .GeneratorURL }}|:chart_with_upwards_trend:>
        *Runbook:* <{{ .Annotations.runbook }}|:spiral_note_pad:>
        *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
      {{ end }}
      {{ end }}
    # WeChat alert template, adapted from https://www.cnblogs.com/miaocbin/p/13706164.html
    template_wechat.tmpl: |-
      {{ define "template_wechat.tmpl" }}
      {{- if gt (len .Alerts.Firing) 0 -}}
      {{- range $index, $alert := .Alerts -}}
      {{- if eq $index 0 }}
      ========= Alert firing =========
      Status: {{ .Status }}
      Severity: {{ .Labels.severity }}
      Alert name: {{ $alert.Labels.alertname }}
      Affected host: {{ $alert.Labels.instance }}
      Summary: {{ $alert.Annotations.summary }}
      Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};
      Threshold: {{ .Annotations.value }}
      Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
      ========= = end = =========
      {{- end }}
      {{- end }}
      {{- end }}
      {{- if gt (len .Alerts.Resolved) 0 -}}
      {{- range $index, $alert := .Alerts -}}
      {{- if eq $index 0 }}
      ========= Alert resolved =========
      Alert name: {{ .Labels.alertname }}
      Status: {{ .Status }}
      Summary: {{ $alert.Annotations.summary }}
      Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};
      Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
      Resolved at: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
      {{- if gt (len $alert.Labels.instance) 0 }}
      Instance: {{ $alert.Labels.instance }}
      {{- end }}
      ========= = end = =========
      {{- end }}
      {{- end }}
      {{- end }}
      {{- end }}
  # expose Alertmanager through an ingress (it can also be exposed via NodePort if preferred)
  ingress:
    enabled: true             # enable the ingress
    ingressClassName: nginx   # if the cluster runs more than one ingress controller, specify which one to use; here it is exposed through ingress-nginx-controller
    annotations: {}
    labels: {}
    hosts:
    - alertmanager.test.com   # host name for Alertmanager
    paths: []
    tls: []
  secret:
    annotations: {}
  # persistent storage
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: prometheus      # storage class to use
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 20Gi
        selector:                         # match a specific PV
          matchLabels:
            use: alert                    # must match the label on the PV defined earlier
Grafana configuration:
grafana:
  enabled: true
  defaultDashboardsEnabled: true
  adminPassword: admin@123   # set the default admin password
  plugins:
  - grafana-kubernetes-app   # optional plugin that ships Kubernetes-related dashboards
  # persistence settings
  persistence:
    type: pvc
    enabled: true
    selector:                        # pick the PV with this label
      matchLabels:
        use: grafana
    storageClassName: prometheus     # storage class to use
    accessModes:
    - ReadWriteOnce
    size: 10Gi
    finalizers:
    - kubernetes.io/pvc-protection
  ingress:
    enabled: true   # enable the ingress
    annotations:
      kubernetes.io/ingress.class: nginx   # ingress class
    labels: {}
    hosts:
    - grafana.sz.com   # Grafana host name
    path: /
The Prometheus section mainly adds the ingress settings, the extra ingress controller scrape target, and the persistence settings:
prometheus:
  enabled: true
  annotations: {}
  serviceAccount:
    create: true
    name: ""
  service:
    annotations: {}
    labels: {}
    clusterIP: ""
    port: 9090
    targetPort: 9090
    externalIPs: []
    nodePort: 30090
    loadBalancerIP: ""
    loadBalancerSourceRanges: []
    type: ClusterIP   # change to NodePort to expose port 30090 defined above
    sessionAffinity: ""
  ingress:
    enabled: true   # enable the ingress
    ingressClassName: nginx
    annotations: {}
    labels: {}
    hosts:
    - prometheus.test.com
    paths: []
    tls: []
  image:
    repository: quay.io/prometheus/prometheus
    tag: v2.22.1
    sha: ""
  # persistence settings
  storageSpec:
    volumeClaimTemplate:
      spec:
        storageClassName: prometheus
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
        selector:
          matchLabels:
            use: prometheus
  # extra ingress controller scrape target; the ingress controller deployment needs the
  # prometheus.io/scrape: "true" annotation before metrics will actually be scraped
  additionalScrapeConfigs:
  - job_name: 'ingress-nginx-endpoints'
    kubernetes_sd_configs:
    - role: pod
      namespaces:
        names:
        - ingress-nginx
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
      action: replace
      target_label: __scheme__
      regex: (https?)
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    - source_labels: [__meta_kubernetes_pod_container_port_name]
      action: keep
      regex: metrics
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      target_label: __address__
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
    - source_labels: [__meta_kubernetes_service_name]
      regex: prometheus-server
      action: drop
Modify the ingress controller deployment and add the prometheus.io/scrape: "true" annotation. Note that it must be added under spec.template.metadata.annotations:
$ kubectl edit deploy -n ingress-nginx ingress-nginx-controller
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
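The same annotation can also be added non-interactively; the following one-liner sketch uses kubectl patch against the same deployment and namespace:

$ kubectl patch deployment ingress-nginx-controller -n ingress-nginx \
    -p '{"spec":{"template":{"metadata":{"annotations":{"prometheus.io/scrape":"true"}}}}}'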
In Kubernetes 1.19.4 some components changed the ports on which they serve metrics, so the port numbers in the default serviceMonitor definitions need to be adjusted:
# kubeControllerManager
kubeControllerManager:
  enabled: true
  service:
    port: 10257        # the HTTPS port is 10257
    targetPort: 10257
  serviceMonitor:
    https: true              # HTTPS is used by default
    insecureSkipVerify: true # skip certificate verification

# etcd
kubeEtcd:
  enabled: true
  service:
    port: 2381
    targetPort: 2381
  serviceMonitor:
    scheme: http
    insecureSkipVerify: false

# kubeScheduler
kubeScheduler:
  enabled: true
  service:
    port: 10259
    targetPort: 10259
  serviceMonitor:
    https: true
    insecureSkipVerify: true

# kubeProxy
kubeProxy:
  enabled: true
  service:
    port: 10249
    targetPort: 10249
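If you are unsure which ports the control-plane components in your cluster actually listen on, a quick check on the master node is the sketch below (the grep patterns are assumptions about how the process names appear in the ss output):

$ ss -tlnp | grep -E 'etcd|kube-controller|kube-scheduler|kube-proxy'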
After making these changes, save the file and run the following command from the directory containing values.yaml to install the Prometheus Operator:
$ helm install prometheus . -n monitoring --create-namespace   # install into the monitoring namespace used by the commands below
Check the installation progress with the following command:
$ kubectl get pod -n monitoring --watch
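Once the pods are up, it is also worth confirming that the PVCs created by the chart have bound to the hostPath-backed PVs created earlier:

$ kubectl get pvc -n monitoring
$ kubectl get pv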
Once all pods are Running, open the ingress hosts defined above in a browser. If your ingress controller is not listening on port 80, append the controller's port to the URL, for example http://prometheus.test.com:<ingress-port>.
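If the host names are not resolvable in your environment, a minimal way to test is to point them at the node where the ingress controller is reachable and probe one of them with curl; the IP below is an assumption, so substitute the address of your own ingress node:

$ echo '192.168.0.13 prometheus.test.com alertmanager.test.com grafana.sz.com' >> /etc/hosts
$ curl -I http://prometheus.test.com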
By default, the etcd, kube-scheduler and kube-controller-manager targets in Prometheus fail to scrape. On a cluster deployed with kubeadm, some of these metrics ports listen only on 127.0.0.1 by default, so for Prometheus to scrape them the static pod manifests must be changed to make the ports listen on an address Prometheus can reach.
$ cd /etc/kubernetes/manifests   # on the master node, change into this directory

# vi etcd.yaml, add "--listen-metrics-urls=http://0.0.0.0:2381"
spec:
  containers:
  - command:
    - etcd
    - --listen-metrics-urls=http://0.0.0.0:2381
    ...

# vi kube-controller-manager.yaml, set "--bind-address=0.0.0.0"
spec:
  containers:
  - command:
    - kube-controller-manager
    - --bind-address=0.0.0.0
    ...

# vi kube-scheduler.yaml, set "--bind-address=0.0.0.0"
spec:
  containers:
  - command:
    - kube-scheduler
    - --bind-address=0.0.0.0
    ...
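The kubelet recreates these static pods automatically after the manifests are saved. As a quick sanity check from the master node that the ports are now reachable (a sketch; replace 192.168.0.13 with your own master's IP), /healthz is used for the HTTPS components because their /metrics endpoints require authentication:

$ curl -s  http://127.0.0.1:2381/metrics | head     # etcd metrics over plain HTTP
$ curl -sk https://192.168.0.13:10257/healthz       # kube-controller-manager secure port
$ curl -sk https://192.168.0.13:10259/healthz       # kube-scheduler secure port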
Configure the Kubernetes plugin in Grafana
In the left-hand menu choose Plugins, then click the Kubernetes icon that appears.

Click the link to open the configuration page.
Enter a name for the Kubernetes cluster and the cluster's API address. Since Grafana runs inside the cluster, the in-cluster service address can be used here. Enable the TLS and CA authentication toggles and paste the corresponding certificate data below: in /root/.kube/config on a master node, certificate-authority-data, client-certificate-data and client-key-data map to CA Cert, Client Cert and Client Key respectively. Note that the values in the config file are base64-encoded and must be decoded with base64 -d before use.
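For example, the three values can be extracted and decoded like this (a sketch run on the master node; the output file names are arbitrary):

$ grep 'certificate-authority-data' /root/.kube/config | awk '{print $2}' | base64 -d > ca.crt
$ grep 'client-certificate-data'    /root/.kube/config | awk '{print $2}' | base64 -d > client.crt
$ grep 'client-key-data'            /root/.kube/config | awk '{print $2}' | base64 -d > client.key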
Select prometheus as the data source and click Save.

After saving, a new Kubernetes icon appears in the left-hand menu; click it to see the corresponding dashboards.
That completes the installation!

How to use custom monitoring and alerting:

Custom alerts can be added by creating PrometheusRule resource objects:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    # these two labels are required for the operator to pick up the rule; they must match the
    # ruleSelector.matchLabels defined on the Prometheus resource, which you can inspect with
    # kubectl get prometheus -n monitoring prometheus-kube-prometheus-prometheus -oyaml
    app: kube-prometheus-stack
    release: prometheus
  name: deployment-status
  namespace: monitoring
spec:
  groups:
  - name: deployment-status
    rules:
    - alert: DeploymentUnavailable   # alert name shown in the Prometheus UI
      annotations:
        summary: deployment {{ $labels.deployment }} unavailable   # available in alert templates as {{ $alert.Annotations.summary }}
        description: 'workload {{ $labels.deployment }} has {{ $value }} unavailable instance(s)'
      # trigger expression
      expr: |
        kube_deployment_status_replicas_unavailable > 0
      for: 3m   # Alertmanager receives the alert after the condition has held for 3 minutes
      labels:
        severity: critical   # labels defined here are passed through to Alertmanager
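Assuming the manifest above is saved as deployment-status-rule.yaml (the file name is arbitrary), apply it and confirm that the object exists; the rule should then also appear under Status -> Rules in the Prometheus UI:

$ kubectl apply -f deployment-status-rule.yaml
$ kubectl get prometheusrules -n monitoring deployment-status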
Custom scrape targets can be added by creating ServiceMonitor objects, or directly in the chart's values.yaml as was done earlier for the ingress controller; see the official documentation for the details and remaining options, which are not repeated here. A minimal ServiceMonitor sketch follows.
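In the sketch below, the Service name my-app, its default namespace, its app: my-app label and the metrics port name are placeholders for illustration; the release: prometheus label is what the chart's default serviceMonitorSelector matches for a release named prometheus:

$ cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                 # hypothetical application
  namespace: monitoring
  labels:
    release: prometheus        # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app              # labels on the target Service (placeholder)
  namespaceSelector:
    matchNames:
    - default                  # namespace where the target Service lives
  endpoints:
  - port: metrics              # the *name* of the port in the Service
    interval: 30s
EOF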
References:
https://github.com/prometheus...
https://www.cnblogs.com/miaocbin/p/13706164.html