Prometheus Operator Tutorial: Sharding Prometheus by Service Dimension

Date: 2020-08-11

Original link: https://fuckcloudnative.io/posts/aggregate-metrics-user-prometheus-operator/

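Prometheus itself only supports single-node deployment. It has no built-in clustering, no high availability, and no horizontal scaling, and its storage is limited by the capacity of the local disk. As the volume of collected data grows, the number of time series a single Prometheus instance can handle hits a ceiling. CPU and memory usage both climb, and memory usually becomes the bottleneck first. The main reasons are:

Prometheus flushes a block of data to disk every 2 hours. Until the flush, all of that data lives in memory, so memory consumption scales with the volume of samples collected.
Loading historical data reads it from disk into memory, so the larger the query range, the more memory is used. There is some room for optimization here.
Unreasonable queries also increase memory usage, for example Group operations or Rate over a large range.

At this point you can either add memory or shard the workload so that each instance has fewer metrics to scrape. This article discusses how to split a Prometheus deployed through Prometheus Operator into several instances along service dimensions.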

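1. Splitting Prometheus by service dimension

Prometheus recommends splitting by function or service. If there are many services to scrape, each Prometheus instance is configured to scrape and store the metrics of only one service or a subset of services. Splitting Prometheus into multiple instances according to the services being scraped also achieves a degree of horizontal scaling.

In a Kubernetes cluster, Prometheus instances can be split by namespace. For example, send all monitoring data related to Kubernetes cluster components to one Prometheus instance, and everything else to another.

Prometheus Operator controls the deployment of Prometheus instances through the Prometheus CRD resource. Labels specified in its serviceMonitorNamespaceSelector and podMonitorNamespaceSelector fields restrict the namespaces in which ServiceMonitor and PodMonitor objects are picked up, which in turn limits the targets each instance scrapes. For example, label the kube-system namespace with monitoring-role=system and all other namespaces with monitoring-role=others.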
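As a rough illustration of how the labels and the selectors fit together, the sketch below declares the labels on two namespaces and shows a heavily trimmed Prometheus resource that only selects namespaces labeled monitoring-role=system. The choice of default as the second namespace is just an example, and the Prometheus resource is deliberately incomplete (it omits the monitor selectors, image, resources and everything else a real instance needs), so treat it as a reading aid rather than something to apply as-is.

```yaml
# Declarative equivalent of labeling namespaces by hand (sketch).
apiVersion: v1
kind: Namespace
metadata:
  name: kube-system
  labels:
    monitoring-role: system
---
apiVersion: v1
kind: Namespace
metadata:
  name: default            # example only; any non-system namespace
  labels:
    monitoring-role: others
---
# Trimmed Prometheus resource: only the namespace selectors are shown.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: system
  namespace: monitoring
spec:
  # Only look for ServiceMonitor objects in namespaces carrying this label.
  serviceMonitorNamespaceSelector:
    matchLabels:
      monitoring-role: system
  # Same restriction for PodMonitor objects.
  podMonitorNamespaceSelector:
    matchLabels:
      monitoring-role: system
```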

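2. Splitting the alerting rules

Once Prometheus is split into multiple instances, the default alerting rules can no longer be used. They are written against the metrics of all targets, and no single Prometheus instance can see the metrics of every target any more, so alerts would fire constantly. To fix this, split the alerting rules so that they correspond one-to-one with the service dimension of each Prometheus instance. Following the split described above, only two sets of alerting rules are needed. Give them different labels, then select the matching rules in the Prometheus CRD resource by specifying those labels in the ruleSelector field.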

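3. Centralized data storage

With alerting solved, one problem remains: the monitoring data is now scattered. Querying it from Grafana requires adding several data sources, data from different data sources cannot be aggregated in a single query, and dashboards offer no global view, which makes querying confusing.

To solve this, Prometheus can be relieved of storing data: it only sends the scraped samples through Remote Write to a remote storage adapter, and Grafana's data source is set to the address of the remote storage, which provides the global view. VictoriaMetrics is used as the remote storage here. VictoriaMetrics is a high-performance, cost-effective, scalable time series database that can serve as long-term storage for Prometheus. It comes in a single-node version and a cluster version, both open source. For ingestion rates below one million data points per second, the official recommendation is to use the single-node version rather than the cluster version. This article uses the single-node version for demonstration, with the following architecture:

[Figure: architecture diagram]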

4. Hands-on

With the plan settled, it is time to put it into practice.

Deploying VictoriaMetrics

First deploy a single-instance VictoriaMetrics. The complete YAML is as follows:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: victoriametrics
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: victoriametrics
  name: victoriametrics
  namespace: kube-system
spec:
  serviceName: victoriametrics
  selector:
    matchLabels:
      app: victoriametrics
  replicas: 1
  template:
    metadata:
      labels:
        app: victoriametrics
    spec:
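      # Node label specific to the author's cluster; adjust or remove it to
      # match your own nodes, otherwise the Pod will stay Pending.
      nodeSelector:
        blog: "true"
      containers: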
      - args:
        - --storageDataPath=/storage
        - --httpListenAddr=:8428
        - --retentionPeriod=1
        image: victoriametrics/victoria-metrics
        imagePullPolicy: IfNotPresent
        name: victoriametrics
        ports:
        - containerPort: 8428
          protocol: TCP
        readinessProbe:
          httpGet:
            path: /health
            port: 8428
          initialDelaySeconds: 30
          timeoutSeconds: 30
        livenessProbe:
          httpGet:
            path: /health
            port: 8428
          initialDelaySeconds: 120
          timeoutSeconds: 30
        resources:
          limits:
            cpu: 2000m
            memory: 2000Mi
          requests:
            cpu: 2000m
            memory: 2000Mi
        volumeMounts:
        - mountPath: /storage
          name: storage-volume
      restartPolicy: Always
      priorityClassName: system-cluster-critical
      volumes:
      - name: storage-volume
        persistentVolumeClaim:
          claimName: victoriametrics
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: victoriametrics
  name: victoriametrics
  namespace: kube-system
spec:
  ports:
  - name: http
    port: 8428
    protocol: TCP
    targetPort: 8428
  selector:
    app: victoriametrics
  type: ClusterIP

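A few startup flags deserve attention:

storageDataPath: Path to the data directory. VictoriaMetrics stores all of its data in this directory.
retentionPeriod: Retention period for data, in months. Older data is deleted automatically. The default is 1 month.
httpListenAddr: TCP address to listen on for HTTP requests. By default it listens on port 8428 on all network interfaces.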
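To make the flags above concrete, here is a small sketch of how the container args in the StatefulSet might look if three months of data should be kept instead of one. The value 3 is only an illustrative assumption; whatever retention you choose, the PersistentVolumeClaim needs to be sized to hold it.

```yaml
# Fragment of the container spec (sketch only, not a full manifest).
args:
  - --storageDataPath=/storage   # must match the volumeMounts mountPath
  - --httpListenAddr=:8428       # must match the containerPort and Service port
  - --retentionPeriod=3          # assumed value: keep three months of data
```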

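Labeling namespaces

To restrict the namespaces that targets are scraped from, the namespaces need to be labeled so that each Prometheus instance only scrapes metrics from specific namespaces. Following the plan above, label kube-system with monitoring-role=system: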

$ kubectl label ns kube-system monitoring-role=system

Label the other namespaces with monitoring-role=others. For example:

$ kubectl label ns monitoring monitoring-role=others
$ kubectl label ns default monitoring-role=others
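Labeling every remaining namespace by hand is easy to forget when new namespaces are created later. One possible alternative, which is not what the original setup does, is to let the others instance select namespaces with a matchExpressions rule instead of matchLabels. In Kubernetes label selectors, the NotIn operator also matches objects that do not carry the key at all, so unlabeled namespaces would be picked up as well. The fragment below is a sketch of that idea under those assumptions.

```yaml
# Fragment of the Prometheus spec for the "others" instance (sketch only).
spec:
  serviceMonitorNamespaceSelector:
    matchExpressions:
      - key: monitoring-role
        operator: NotIn        # also matches namespaces without this label
        values:
          - system
  podMonitorNamespaceSelector:
    matchExpressions:
      - key: monitoring-role
        operator: NotIn
        values:
          - system
```

With this variant only kube-system needs an explicit label; the trade-off is that any new namespace is scraped by the others instance by default.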

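Splitting PrometheusRule

The alerting rules need to be split into two PrometheusRule objects according to the monitoring targets. Concretely, consolidate the rules related to the kube-system namespace into one PrometheusRule, and change its name and labels: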

# prometheus-rules-system.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: system
    role: alert-rules
  name: prometheus-system-rules
  namespace: monitoring
spec:
  groups:
...
...

Put the remaining rules into another PrometheusRule, again changing the name and labels:

# prometheus-rules-others.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: others
    role: alert-rules
  name: prometheus-others-rules
  namespace: monitoring
spec:
  groups:
...
...
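The rule groups are elided in the two manifests above. For readers who have not written a PrometheusRule before, the following sketch shows what a complete object with a single rule could look like. The object name, group name, alert name, threshold and annotation text are all hypothetical and are not taken from the default rule set; only the two metadata labels matter for the split, because they are what ruleSelector matches on.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: others          # matched by ruleSelector of the "others" instance
    role: alert-rules
  name: example-others-rules    # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: example.rules       # hypothetical group
      rules:
        - alert: ExampleTargetDown
          # Fires when any target scraped by this instance is unreachable.
          expr: up == 0
          for: 10m
          labels:
            severity: warning
          annotations:
            message: "Target {{ $labels.job }}/{{ $labels.instance }} has been down for 10 minutes."
```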

Then delete the default PrometheusRule:

$ kubectl -n monitoring delete prometheusrule prometheus-k8s-rules

Add the two new PrometheusRule objects:

$ kubectl apply -f prometheus-rules-system.yaml
$ kubectl apply -f prometheus-rules-others.yaml

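If you really don't know how to split the rules, or would rather not do it yourself and just want a ready-made result, see here:

Splitting Prometheus

The next step is to split the Prometheus instance. According to the plan above, two instances are needed: one to monitor the kube-system namespace, the other to monitor the remaining namespaces: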

# prometheus-prometheus.yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: system 
  name: system
  namespace: monitoring
spec:
  remoteWrite:
    - url: http://victoriametrics.kube-system.svc.cluster.local:8428/api/v1/write
      queueConfig:
        maxSamplesPerSend: 10000
  retention: 2h 
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  image: quay.io/prometheus/prometheus:v2.17.2
  nodeSelector:
    # kubernetes.io/os replaces the deprecated beta.kubernetes.io/os label
    kubernetes.io/os: linux
  podMonitorNamespaceSelector:
    matchLabels:
      monitoring-role: system 
  podMonitorSelector: {}
  replicas: 1 
  resources:
    requests:
      memory: 400Mi
    limits:
      memory: 2Gi
  ruleSelector:
    matchLabels:
      prometheus: system 
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: 
    matchLabels:
      monitoring-role: system 
  serviceMonitorSelector: {}
  version: v2.17.2
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    prometheus: others
  name: others
  namespace: monitoring
spec:
  remoteWrite:
    - url: http://victoriametrics.kube-system.svc.cluster.local:8428/api/v1/write
      queueConfig:
        maxSamplesPerSend: 10000
  retention: 2h
  alerting:
    alertmanagers:
    - name: alertmanager-main
      namespace: monitoring
      port: web
  image: quay.io/prometheus/prometheus:v2.17.2
  nodeSelector:
    # kubernetes.io/os replaces the deprecated beta.kubernetes.io/os label
    kubernetes.io/os: linux
  podMonitorNamespaceSelector: 
    matchLabels:
      monitoring-role: others 
  podMonitorSelector: {}
  replicas: 1
  resources:
    requests:
      memory: 400Mi
    limits:
      memory: 2Gi
  ruleSelector:
    matchLabels:
      prometheus: others 
      role: alert-rules
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector:
    matchLabels:
      monitoring-role: others 
  serviceMonitorSelector: {}
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
  version: v2.17.2

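Points to note in this configuration:

remoteWrite specifies the remote storage that remote write sends data to.
ruleSelector specifies which PrometheusRule objects to use.
The memory limit is set to 2Gi; adjust it to your own situation.
retention sets the time data is kept on local disk to 2 hours. Since remote storage is configured, there is no need to keep data locally for long, so keep this as short as possible.
Custom Prometheus configuration can be specified through additionalScrapeConfigs in the others instance. You can of course split further and put it into additional instances.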
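The others instance above refers to a Secret named additional-scrape-configs with the key prometheus-additional.yaml, but the Secret itself is not shown. A minimal sketch of what it could contain follows. The job name and target address are placeholders invented for illustration; the content under the key is an ordinary list of Prometheus scrape configs.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: additional-scrape-configs   # must match additionalScrapeConfigs.name
  namespace: monitoring             # same namespace as the Prometheus resource
stringData:
  # Key must match additionalScrapeConfigs.key
  prometheus-additional.yaml: |
    - job_name: example-static            # hypothetical job
      static_configs:
        - targets:
            - example.default.svc:8080    # hypothetical target
```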

Delete the default Prometheus instance:

$ kubectl -n monitoring delete prometheus k8s

Create the new Prometheus instances:

$ kubectl apply -f prometheus-prometheus.yaml

Check the running status:

$ kubectl -n monitoring get prometheus
NAME     VERSION   REPLICAS   AGE
system   v2.17.2   1          29h
others   v2.17.2   1          29h

$ kubectl -n monitoring get sts
NAME                READY   AGE
prometheus-system   1/1     29h
prometheus-others   1/1     29h
alertmanager-main   1/1     25d

Check the memory usage of each Prometheus instance:

$ kubectl -n monitoring top pod -l app=prometheus
NAME                  CPU(cores)   MEMORY(bytes)
prometheus-others-0   12m          110Mi
prometheus-system-0   121m         1182Mi

Finally, the Prometheus Service also needs to be changed. The YAML is as follows:

apiVersion: v1
kind: Service
metadata:
  labels:
    prometheus: system 
  name: prometheus-system
  namespace: monitoring
spec:
  ports:
  - name: web
    port: 9090
    targetPort: web
  selector:
    app: prometheus
    prometheus: system
  sessionAffinity: ClientIP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    prometheus: others
  name: prometheus-others
  namespace: monitoring
spec:
  ports:
  - name: web
    port: 9090
    targetPort: web
  selector:
    app: prometheus
    prometheus: others
  sessionAffinity: ClientIP

Delete the default Service:

$ kubectl -n monitoring delete svc prometheus-k8s

Create the new Services:

$ kubectl apply -f prometheus-service.yaml

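Changing the Grafana data source

Once Prometheus has been split successfully, the last step is to change Grafana's data source to the address of VictoriaMetrics, so that Grafana shows a global view and supports aggregated queries.

Open the Grafana settings page and change the data source to http://victoriametrics.kube-system.svc.cluster.local:8428:

Click the Explore menu:

Enter up in the query box, then press Shift+Enter to run the query:

The query results include all namespaces.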
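The data source change above is made through the Grafana UI. If your Grafana is configured through provisioning files instead, the same change could be expressed roughly as in the sketch below. The data source name and the isDefault choice are assumptions; the type is prometheus because VictoriaMetrics is queried here through its Prometheus-compatible API.

```yaml
# Grafana data source provisioning file (sketch).
apiVersion: 1
datasources:
  - name: VictoriaMetrics     # hypothetical display name
    type: prometheus          # queried via the Prometheus-compatible API
    access: proxy             # Grafana backend proxies the requests
    url: http://victoriametrics.kube-system.svc.cluster.local:8428
    isDefault: true           # assumption: use it as the default data source
```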


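This article came about because every node in my k3s cluster is short on resources while there are many targets to monitor, so Prometheus used up all the memory on its node and kept getting OOM-killed. To make full use of my cloud hosts I had to find another way, and this article is the result.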
