Manually Deploying Prometheus on k8s

Date: 2019-12-10

Tags: manual deployment, k8s, prometheus


Introduction

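Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. It is a standalone open-source project, and in 2016 it joined the Cloud Native Computing Foundation (CNCF) as its second hosted project after Kubernetes.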

Features

Compared with other traditional monitoring tools, Prometheus has the following main features:

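A multi-dimensional data model, with time series identified by a metric name and key/value pairs
A flexible query language
No reliance on distributed storage; data is kept on the local disk
Time series are collected by pulling over HTTP
Pushing time series is also supported (through an intermediary gateway)
Targets are discovered through service discovery or static configuration
Support for multiple modes of graphing and dashboards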
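To make the multi-dimensional data model concrete, here is a minimal Python sketch (not Prometheus source code) in which a time series is identified by its metric name plus its sorted set of label key/value pairs. The metric name http_requests_total and its labels are hypothetical examples, and label value escaping is ignored.

```python
# Sketch of the multi-dimensional data model: a series is identified by its
# metric name plus its full label set. Names and values are hypothetical.


def series_id(metric: str, labels: dict) -> str:
    """Render a series identity in PromQL-style notation (no escaping)."""
    # Sorting the labels makes the identity independent of insertion order.
    pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{metric}{{{pairs}}}"


# Maps series identity -> list of (unix_timestamp, value) samples.
store = {}


def append_sample(metric: str, labels: dict, ts: float, value: float) -> None:
    store.setdefault(series_id(metric, labels), []).append((ts, value))


# Same metric name, different label values: two distinct time series.
append_sample("http_requests_total", {"method": "GET", "code": "200"}, 1.0, 10.0)
append_sample("http_requests_total", {"method": "GET", "code": "500"}, 1.0, 2.0)
# Same label set given in a different order: appended to the first series.
append_sample("http_requests_total", {"code": "200", "method": "GET"}, 16.0, 12.0)

for sid, samples in store.items():
    print(sid, samples)
```

Changing any single label value creates a new series, which is why labels with unbounded values (user IDs, request IDs) are usually avoided.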

Components

Prometheus is made up of multiple components, many of which are optional:

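Prometheus Server: scrapes metrics and stores the time series data
exporter: exposes metrics for Prometheus to scrape
pushgateway: a gateway that jobs can push metrics to
alertmanager: the component that handles alerts
adhoc: for querying the data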
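As an illustration of what an exporter does, the sketch below serves a few hypothetical metrics in the Prometheus text exposition format using only the Python standard library. The metric names, their values, and port 8000 are assumptions for illustration; real exporters normally use an official client library such as prometheus_client rather than hand-rolling the format.

```python
# Minimal exporter sketch using only the standard library. Metric names,
# values, and the port are hypothetical.
from http.server import BaseHTTPRequestHandler, HTTPServer

# name -> (type, help text, current value)
METRICS = {
    "demo_queue_length": ("gauge", "Current number of queued jobs.", 7),
    "demo_jobs_processed_total": ("counter", "Jobs processed since start.", 42),
}


def render_metrics(metrics: dict) -> str:
    """Render each metric as # HELP and # TYPE lines plus one sample line."""
    lines = []
    for name, (mtype, help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        # Content type of the text exposition format, version 0.0.4.
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering scrapes on http://<host>:<port>/metrics."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() and adding the host's address with port 8000 to a static_configs targets list would let Prometheus scrape these values.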

Most Prometheus components are written in Go, which makes them easy to build and deploy as static binaries.

Architecture

The figure below, provided by the Prometheus project, shows its architecture and some related ecosystem components:

[Figure: Prometheus architecture]

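The overall flow is fairly simple: Prometheus scrapes metrics from instrumented jobs, either directly or, for short-lived jobs, through an intermediary Pushgateway. It stores all scraped samples locally and runs rules over this data to record new aggregated time series or to generate alerts. Grafana or other tools can then be used to visualize the data.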
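For the Pushgateway path, a short-lived job pushes its metrics to the gateway, which Prometheus then scrapes like any other target. The sketch below builds such a push as an HTTP POST to /metrics/job/<job> on the gateway. The address http://pushgateway:9091, the job name, and the metric are hypothetical (9091 is the Pushgateway's default port), and job names containing "/" are rejected because they need a special encoding this sketch does not implement.

```python
# Sketch of a short-lived job pushing one metric to a Pushgateway.
# The gateway address and metric name are hypothetical; 9091 is the
# Pushgateway's default port.
import urllib.request


def build_push_request(gateway: str, job: str, body: str) -> urllib.request.Request:
    """Build a POST of text-format metrics to /metrics/job/<job>."""
    if "/" in job:
        # Job names with '/' need special URL encoding that this sketch skips.
        raise ValueError("job name must not contain '/'")
    url = f"{gateway.rstrip('/')}/metrics/job/{job}"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def push(gateway: str, job: str, body: str) -> int:
    """Send the push and return the HTTP status code."""
    with urllib.request.urlopen(build_push_request(gateway, job, body), timeout=5) as resp:
        return resp.status


payload = (
    "# TYPE demo_batch_last_success_unixtime gauge\n"
    "demo_batch_last_success_unixtime 1575936000\n"
)
req = build_push_request("http://pushgateway:9091/", "demo_batch", payload)
print(req.get_method(), req.full_url)
```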

Installation

Because Prometheus is written in Go, installing it is very simple: download the binary and run it. Get the version for your platform from https://prometheus.io/download.

Prometheus is started with a YAML configuration file. When running the binary directly, use the following command:

$ ./prometheus --config.file=prometheus.yml

A basic prometheus.yml looks like this:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  # - "first.rules"
  # - "second.rules"

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

This configuration file contains three sections: global, rule_files, and scrape_configs.

The global section controls the global configuration of the Prometheus server:

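scrape_interval: how often Prometheus scrapes metrics from its targets. The built-in default is 1m; this file overrides it with 15s
evaluation_interval: how often rules are evaluated. Prometheus uses rules to create new time series or to generate alerts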
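To get a feel for what these intervals mean, the arithmetic below computes how many samples a single time series accumulates per day. It is a sketch that only understands single-unit durations such as "15s" or "1m"; Prometheus itself accepts a richer duration syntax.

```python
# Arithmetic for the global settings: samples per series per day.
# Handles single-unit durations only (e.g. "15s", "1m").
import re

SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 3600, "d": 86400}
SECONDS_PER_DAY = 86400


def parse_duration(text: str) -> int:
    """Parse a single-unit Prometheus-style duration into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * SECONDS_PER_UNIT[match.group(2)]


def samples_per_day(scrape_interval: str) -> int:
    return SECONDS_PER_DAY // parse_duration(scrape_interval)


print(samples_per_day("15s"))  # 5760 with this file's 15s interval
print(samples_per_day("1m"))   # 1440 with the built-in 1m default
```

Shortening the interval from 1m to 15s quadruples the number of stored samples, which matters for the storage sizing discussed later.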

The rule_files section specifies where rule files are located. Prometheus loads rules from these files to generate new time series or alerts. No rules are configured yet.

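scrape_configs controls which resources Prometheus monitors. Because Prometheus exposes its own metrics over HTTP, it can also monitor its own health. The default configuration contains a single job named prometheus, which scrapes the time series exposed by the Prometheus server itself. This job has a single statically configured target: port 9090 on localhost. By default Prometheus scrapes metrics from the target's /metrics path, so the default job scrapes the URL http://localhost:9090/metrics. The collected time series describe the state and performance of the Prometheus server. To monitor other resources, simply add them under this section.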
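The URL derivation described above can be sketched as follows: for each statically configured target, Prometheus combines the scheme (default http), the target address, and metrics_path (default /metrics). The config is written here as a Python dict purely for illustration; this is not how Prometheus parses its YAML.

```python
# Sketch: how a static_configs target becomes a scrape URL, using the
# defaults scheme=http and metrics_path=/metrics.

DEFAULT_SCHEME = "http"
DEFAULT_METRICS_PATH = "/metrics"

scrape_configs = [
    {"job_name": "prometheus", "static_configs": [{"targets": ["localhost:9090"]}]},
]


def scrape_urls(configs):
    """Yield (job_name, url) for every statically configured target."""
    for job in configs:
        scheme = job.get("scheme", DEFAULT_SCHEME)
        path = job.get("metrics_path", DEFAULT_METRICS_PATH)
        for static in job.get("static_configs", []):
            for target in static["targets"]:
                yield job["job_name"], f"{scheme}://{target}{path}"


for job_name, url in scrape_urls(scrape_configs):
    print(job_name, url)
```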

Since we are going to run Prometheus inside Kubernetes, we simply run it from the Docker image.

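To keep things manageable, all resource objects are installed in the kube-ops namespace; if it does not exist yet, create it first (for example with kubectl create namespace kube-ops).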

To manage the configuration file conveniently, we store prometheus.yml in a ConfigMap (prometheus-cm.yaml):

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-ops
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      scrape_timeout: 15s
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']

For now we only configure Prometheus to monitor itself. Then create the resource object:

$ kubectl create -f prometheus-cm.yaml
configmap "prometheus-config" created

The configuration is now in place. Later, whenever new resources need to be monitored, we only need to update this ConfigMap. Next, create the Prometheus Pod through a Deployment (prometheus-deploy.yaml):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: kube-ops
  labels:
    app: prometheus
spec:
  selector:              # required by apps/v1; must match the template labels
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - image: prom/prometheus:v2.4.3
        name: prometheus
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=24h"
        - "--web.enable-admin-api"  # enables the admin HTTP API, which includes features such as deleting time series
        - "--web.enable-lifecycle"  # enables hot reload: a POST to localhost:9090/-/reload applies config changes immediately
        ports:
        - containerPort: 9090
          protocol: TCP
          name: http
        volumeMounts:
        - mountPath: "/prometheus"
          subPath: prometheus
          name: data
        - mountPath: "/etc/prometheus"
          name: config-volume
        resources:
          requests:
            cpu: 100m
            memory: 512Mi
          limits:
            cpu: 100m
            memory: 512Mi
      securityContext:
        runAsUser: 0
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: prometheus
      - configMap:
          name: prometheus-config
        name: config-volume

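When starting the program, besides pointing it at prometheus.yml, we use storage.tsdb.path to set where TSDB data is stored and storage.tsdb.retention to set how long data is kept (Prometheus 2.7 and later deprecate this flag in favor of storage.tsdb.retention.time). The web.enable-admin-api flag enables access to the admin API. The web.enable-lifecycle flag is very important because it enables hot reloading: once prometheus.yml has been updated, sending an HTTP POST request to localhost:9090/-/reload applies the change immediately, so be sure to add this flag.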
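The reload itself is a single HTTP call; from a shell it is curl -X POST http://localhost:9090/-/reload. A Python sketch of the same call follows, assuming Prometheus was started with --web.enable-lifecycle and is reachable at http://localhost:9090 (for example through kubectl port-forward). The helper names are hypothetical.

```python
# Sketch of triggering the reload. Assumes --web.enable-lifecycle is set and
# Prometheus is reachable at base_url (e.g. via kubectl port-forward).
import urllib.request


def build_reload_request(base_url: str) -> urllib.request.Request:
    """The lifecycle endpoint expects POST (or PUT) rather than GET."""
    return urllib.request.Request(base_url.rstrip("/") + "/-/reload", data=b"", method="POST")


def reload_prometheus(base_url: str = "http://localhost:9090") -> int:
    """Send the reload; urlopen raises HTTPError if Prometheus rejects it."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=10) as resp:
        return resp.status
```

If the new configuration is invalid, Prometheus keeps running with the old one and reports an error, which surfaces here as urllib.error.HTTPError.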

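We mount the ConfigMap holding prometheus.yml into the Pod as a volume. When the ConfigMap is updated, the file inside the Pod is updated as well, after a delay that depends on the kubelet sync period; we then send the reload request above and the new Prometheus configuration takes effect. In addition, to persist the time series data, we bind the data directory to a PVC, so that PVC must be created beforehand (prometheus-volume.yaml):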

apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    server: 10.151.30.57
    path: /data/k8s
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus
  namespace: kube-ops
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

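Here we simply use NFS as the storage backend to create the PV and PVC. Note that the Prometheus documentation does not support NFS for its local TSDB storage, so treat this as a test setup: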

$ kubectl create -f prometheus-volume.yaml
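To judge whether 10Gi is enough, the Prometheus storage documentation gives a rough formula: needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample, with samples taking roughly 1 to 2 bytes each. The sketch below applies it; the 100,000 active series is a hypothetical input, and WAL and compaction headroom are ignored.

```python
# Rough disk estimate following the Prometheus storage docs:
#   needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample
# Ignores WAL and compaction headroom. The series count is hypothetical.

GIB = 1024 ** 3


def needed_disk_bytes(retention_s: float, active_series: int,
                      scrape_interval_s: float, bytes_per_sample: float = 2.0) -> float:
    # Each active series contributes one sample per scrape interval.
    samples_per_second = active_series / scrape_interval_s
    return retention_s * samples_per_second * bytes_per_sample


# 24h retention (--storage.tsdb.retention=24h), 15s scrape interval,
# 100,000 active series, and the pessimistic 2 bytes per sample.
estimate = needed_disk_bytes(24 * 3600, 100_000, 15)
print(f"{estimate / GIB:.2f} GiB of the 10Gi volume")
```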

In addition to the points above, we also need to configure RBAC, because Prometheus needs to read information from the Kubernetes API. So we declare a serviceAccount named prometheus (prometheus-rbac.yaml):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - services
  - endpoints
  - pods
  - nodes/proxy
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-ops

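The resources we want to read may exist in any namespace, so we use a ClusterRole. Notice that the rules include a nonResourceURLs entry, which grants permissions on non-resource URLs such as /metrics rather than on API resources; this is something we have rarely needed before. Then create the resource objects: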

$ kubectl create -f prometheus-rbac.yaml
serviceaccount "prometheus" created
clusterrole.rbac.authorization.k8s.io "prometheus" created
clusterrolebinding.rbac.authorization.k8s.io "prometheus" created

One more thing to note: we must add a securityContext and set runAsUser to 0, because the Prometheus image now runs as the nobody user; otherwise you will see permission denied errors like the following:

level=error ts=2018-10-22T14:34:58.632016274Z caller=main.go:617 err="opening storage failed: lock DB directory: open /data/lock: permission denied"

Now we can create the Prometheus resource object:

$ kubectl create -f prometheus-deploy.yaml
deployment.extensions "prometheus" created
$ kubectl get pods -n kube-ops
NAME                          READY     STATUS    RESTARTS   AGE
prometheus-6dd775cbff-zb69l   1/1       Running   0          20m
$ kubectl logs -f prometheus-6dd775cbff-zb69l -n kube-ops
......
level=info ts=2018-10-22T14:44:40.535385503Z caller=main.go:523 msg="Server is ready to receive web requests."

Once the Pod is created, we also need a Service object so that the Prometheus web UI can be reached from outside the cluster (prometheus-svc.yaml):

apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: kube-ops
  labels:
    app: prometheus
spec:
  selector:
    app: prometheus
  type: NodePort
  ports:
  - name: web
    port: 9090
    targetPort: http

For easy testing we create a NodePort Service here; alternatively, we could create an Ingress object and access Prometheus through a domain name:

$ kubectl create -f prometheus-svc.yaml
service "prometheus" created
$ kubectl get svc -n kube-ops
NAME         TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
prometheus   NodePort   10.111.118.104   <none>        9090:30987/TCP   24s

Then we can open the Prometheus web UI at http://<any-node-IP>:30987.

[Figure: Prometheus web UI]

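For consistency, Prometheus uses UTC for all of its data, so the dashboard opened by default shows a warning about this, and we need to specify our current time when querying. Next we can look at some of the targets the monitoring system is currently scraping.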
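Because sample timestamps are UTC Unix times, converting them for display is a job for the client. The sketch below shows the conversion with Python's datetime; UTC+8 is an assumed local zone, so substitute your own offset.

```python
# Converting a UTC Unix timestamp to a local zone for display.
# UTC+8 is an assumed local zone; pick your own offset.
from datetime import datetime, timedelta, timezone

LOCAL_TZ = timezone(timedelta(hours=8))


def utc_and_local(unix_ts: float, local_tz: timezone = LOCAL_TZ) -> tuple:
    """Return ISO-8601 strings for the same instant in UTC and in local_tz."""
    utc_time = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
    return utc_time.isoformat(), utc_time.astimezone(local_tz).isoformat()


# 1540219480 is 2018-10-22T14:44:40Z, the time of the "Server is ready" log line shown earlier.
print(utc_and_local(1540219480))
```

Both strings denote the same instant; only the rendering changes, which is why Prometheus can store everything in UTC without losing information.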

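Since we have not configured any alerts yet, the Alerts menu is empty. After a while, we can open the Graph menu to view some of the metrics Prometheus has scraped from itself; the "- insert metrics at cursor -" drop-down lists the metrics collected so far.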

For example, select the scrape_duration_seconds metric and click Execute. If no data appears, switch to the Graph tab, set the time back to the current moment, and execute again; you should then see a chart of the data.
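The same query can be issued without the web UI through the Prometheus HTTP API, which evaluates an instant query at GET /api/v1/query. The sketch below builds that URL and decodes the JSON result; the base URL http://192.0.2.10:30987 is a placeholder standing in for any node IP plus the NodePort above.

```python
# Sketch: an instant query through the HTTP API at /api/v1/query.
# The base URL is a placeholder for <any-node-IP>:<NodePort>.
import json
import urllib.request
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str, at=None) -> str:
    """Build the query URL; `at` is an optional Unix timestamp for the evaluation time."""
    params = {"query": promql}
    if at is not None:
        params["time"] = str(at)
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


def instant_query(base_url: str, promql: str, at=None) -> list:
    """Run the query and return data.result from the JSON response."""
    with urllib.request.urlopen(build_query_url(base_url, promql, at), timeout=10) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


print(build_query_url("http://192.0.2.10:30987/", "scrape_duration_seconds"))
```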

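Besides using the scraped metrics directly, we can also use the powerful PromQL. PromQL is the ad hoc query language Prometheus provides for aggregating and presenting data: to get the data you want, find the corresponding function and apply it.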
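As a taste of PromQL aggregation, the toy model below mimics what a query such as sum by (job) (scrape_duration_seconds) computes over an instant vector: series are grouped by the job label, all other labels are dropped, and values in each group are summed. The sample values and the node job are hypothetical, and real PromQL evaluation (lookback, staleness handling) is considerably more involved.

```python
# Toy model of `sum by (job) (scrape_duration_seconds)` over an instant
# vector. Sample values and the "node" job are hypothetical.
from collections import defaultdict

instant_vector = [
    ({"__name__": "scrape_duration_seconds", "job": "prometheus", "instance": "localhost:9090"}, 0.004),
    ({"__name__": "scrape_duration_seconds", "job": "node", "instance": "10.0.0.1:9100"}, 0.020),
    ({"__name__": "scrape_duration_seconds", "job": "node", "instance": "10.0.0.2:9100"}, 0.030),
]


def sum_by(vector, label_names):
    """Sum sample values grouped by the given labels; other labels are dropped."""
    groups = defaultdict(float)
    for labels, value in vector:
        # A missing label counts as the empty string, as in PromQL.
        key = tuple((name, labels.get(name, "")) for name in label_names)
        groups[key] += value
    return dict(groups)


for group, total in sum_by(instant_vector, ["job"]).items():
    print(dict(group), total)
```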
