Prometheus Monitoring Solution (Including Kubernetes Monitoring)

Date: 2019-11-13

Tags: prometheus, monitoring solution, kubernetes


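Introduction to Prometheus and installation

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community. It is now a standalone open-source project, maintained independently of any company. To emphasize this, and to clarify the project's governance structure, Prometheus joined the Cloud Native Computing Foundation in 2016 as the second hosted project, after Kubernetes.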

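Features

The main features of Prometheus are:

A multi-dimensional data model
A flexible query language
No reliance on distributed storage; single server nodes are autonomous
Time series collection via a pull model over HTTP
Pushing time series is also supported via an intermediary gateway
Targets are discovered via service discovery or static configuration
Multiple modes of graphing and dashboarding are supported, including Grafana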
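Components

The Prometheus ecosystem consists of multiple components, some of which are optional:

The main Prometheus server, which scrapes and stores time series data
Client libraries for instrumenting application code
A push gateway for supporting short-lived jobs
A GUI dashboard builder based on Rails/SQL
Special-purpose exporters that expose metrics from tools such as HAProxy, StatsD and Graphite in a format Prometheus can scrape
An Alertmanager for handling alerts
A command-line querying tool
Various other support tools

Most Prometheus components are written in Go, which makes them easy to build and deploy.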

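Architecture

The Prometheus architecture diagram shows the overall architecture and the role of some of the ecosystem components.

The Prometheus server scrapes metrics from targets, either directly or through an intermediary push gateway for short-lived jobs. It stores all scraped samples locally and runs rules over this data to aggregate it into new time series or to generate alerts. The collected data can then be queried with PromQL and visualized through Grafana or other API consumers.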

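When it fits

Prometheus works very well for recording purely numeric time series. It suits both machine-centric monitoring (servers and other hardware metrics) and monitoring of highly dynamic service-oriented architectures. For today's popular microservices, its multi-dimensional data collection and query language are a particular strength.

Prometheus is designed for reliability: it is the system you turn to during an outage to quickly locate and diagnose problems. Setting it up places no heavy requirements on hardware or on other services.

When it does not fit

Prometheus values reliability: you can always view the available statistics about your system, even under very harsh conditions. If you need 100% accuracy in the collected data, it is not a good choice. For example, it is not suitable for a real-time billing system.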

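Installing Prometheus

Download the release tarball for your platform, then extract it and change into the extracted directory:

tar xvfz prometheus-*.tar.gz

cd prometheus-*

Before starting the Prometheus server, you need a configuration file for it.

Prometheus pulls metrics from its targets over HTTP. The server also exposes metrics about itself in the same way, so it can scrape and monitor its own health.

Collecting only the data Prometheus produces about itself is not very useful in practice, but it makes a good starting example. Save the following basic configuration to a file named prometheus.yml: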

global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  # - "first.rules"
  # - "second.rules"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

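The example configuration file has three blocks: global, rule_files, and scrape_configs.

The global block controls the Prometheus server's global configuration. Two options are set here. The first, scrape_interval, controls how often Prometheus scrapes targets; it can be overridden for individual targets. In this case the global setting is to scrape every 15 seconds. The second, evaluation_interval, controls how often Prometheus evaluates rules. Prometheus uses rules to create new time series and to generate alerts.

The rule_files block specifies the location of any rules the Prometheus server should load.

The scrape_configs block controls which resources Prometheus monitors. Because Prometheus also exposes data about itself over an HTTP endpoint, it can scrape and monitor its own health. The default configuration contains a single job, called prometheus, which scrapes the time series data exposed by the Prometheus server. The job contains a single, statically configured target: localhost on port 9090. Because the metrics path defaults to /metrics, this job scrapes the URL http://localhost:9090/metrics.

You can view the metrics Prometheus exposes about itself by opening its metrics endpoint:

http://ip:9090/metrics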
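
To make the format of the /metrics page more concrete, here is a minimal Python sketch that parses a few lines of the text exposition format. The function name parse_metrics and the SAMPLE text are made up for illustration, and the parser assumes simple lines without timestamps and without commas or spaces inside label values; it is not a complete implementation of the format.

```python
def parse_metrics(text):
    """Parse simple 'name{labels} value' lines, skipping comments and blanks."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and empty lines carry no samples
        series, _, value = line.rpartition(" ")
        name, labels = series, {}
        if "{" in series:
            name, _, rest = series.partition("{")
            for pair in rest.rstrip("}").split(","):
                if pair:
                    key, _, val = pair.partition("=")
                    labels[key.strip()] = val.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical excerpt of what a /metrics page might contain.
SAMPLE = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{code="200",handler="query"} 1027
http_requests_total{code="400",handler="query"} 3
process_open_fds 12
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

Each sample is a metric name, a set of labels and a numeric value, which is the multi-dimensional data model described earlier.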

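Among other things, the /metrics page lists the server's HTTP-related metrics.

Official documentation: https://prometheus.io/docs/prometheus/latest/getting_started/

Chinese translation: https://github.com/1046102779/prometheus/blob/master/introduction/install.md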

Installing Grafana

Official installation steps:

http://docs.grafana.org/installation/rpm/

Download and install Grafana:

wget https://s3-us-west-2.amazonaws.com/grafana-releases/release/grafana-5.0.4-1.x86_64.rpm

yum install initscripts fontconfig

rpm -Uvh grafana-5.0.4-1.x86_64.rpm

Then add Prometheus as a data source in Grafana.

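Common operators and functions
Official documentation:

https://prometheus.io/docs/prometheus/latest/querying/operators/

Chinese translation:

https://github.com/1046102779/prometheus/blob/master/prometheus/querying/operators.md

Common operators:

+, -, *, /, %, ^ (addition, subtraction, multiplication, division, modulo, power)

==, !=, >, <, >=, <= (equal, not equal, greater than, less than, greater or equal, less or equal)

Aggregation operators:

sum (sum), min (minimum), max (maximum), avg (average), count (number of elements)

stddev (standard deviation), stdvar (variance), count_values (number of elements with the same value), bottomk (smallest k elements), topk (largest k elements)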
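
As a rough illustration of what aggregation operators do, the Python sketch below groups samples by a set of labels and sums them, and picks the k largest samples. The helper names sum_by and topk and the SAMPLES data are invented for this example; Prometheus itself evaluates these operators inside PromQL, so this only mirrors the idea.

```python
from collections import defaultdict


def sum_by(samples, by):
    """Sum sample values per group, where a group is defined by the 'by' labels."""
    groups = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels.get(name, "")) for name in by)
        groups[key] += value
    return dict(groups)


def topk(samples, k):
    """Return the k samples with the largest values."""
    return sorted(samples, key=lambda sample: sample[1], reverse=True)[:k]


# Hypothetical instant vector: (labels, value) pairs.
SAMPLES = [
    ({"job": "prometheus", "handler": "query"}, 10.0),
    ({"job": "prometheus", "handler": "query_range"}, 4.0),
    ({"job": "node-mysql", "handler": "metrics"}, 7.0),
]

if __name__ == "__main__":
    print(sum_by(SAMPLES, ["job"]))
    print(topk(SAMPLES, 2))
```

Labels not listed in by are dropped from the result, which is why an aggregation such as sum ... by (job) returns one series per job.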
Usage examples:

Select the series of the metric http_requests_total whose job and handler labels have the given values:
http_requests_total{job="prometheus", handler="query"}
Select the last 5 minutes of samples for the same series:
http_requests_total{job="prometheus", handler="query"}[5m]
Select series whose job name ends with eus:
http_requests_total{job=~".*eus"}
Select series whose status does not match 4xx:
http_requests_total{status!~"4.."}
Per-second rate of http_requests_total, averaged over the last 5 minutes:
rate(http_requests_total[5m])
Per-second request rate, summed per job:
sum(rate(http_requests_total[5m])) by (job)
Unused memory per instance, in MiB (the two metrics stand in for a memory limit and a memory usage metric):
(node_memory_CommitLimit_bytes - node_memory_NFS_Unstable_bytes) / 1024 / 1024
The same value, summed per instance and job, in MiB:
sum(node_memory_CommitLimit_bytes - node_memory_NFS_Unstable_bytes) by (instance, job) / 1024 / 1024
Suppose the data looks like this:
http_requests_total{code="503",handler="query_range",instance="localhost:9090",job="prometheus",method="get"}
http_requests_total{code="400",handler="query_range",instance="localhost:9090",job="prometheus",method="get"}
http_requests_total{code="400",handler="query",instance="localhost:9090",job="prometheus",method="get"}
Take the 5 largest http_requests_total series:
topk(5, http_requests_total)
Take the 3 largest http_requests_total series within each (handler, instance) group:
topk(3, http_requests_total) by (handler,instance)
Count the number of series, grouped by id:
count(container_cpu_system_seconds_total) by (id)
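
The label matchers used in the selectors above (=, !=, =~, !~) can be sketched in Python as follows. One detail worth knowing is that Prometheus regular expressions are fully anchored, so ".*eus" must match the whole label value. The function name matches and the example label sets are assumptions for illustration, and Python's re module is only an approximation of the RE2 syntax Prometheus uses.

```python
import re


def matches(labels, matchers):
    """Return True if a label set satisfies every (name, op, pattern) matcher."""
    for name, op, pattern in matchers:
        value = labels.get(name, "")  # a missing label behaves like an empty value
        if op == "=":
            ok = value == pattern
        elif op == "!=":
            ok = value != pattern
        elif op == "=~":
            ok = re.fullmatch(pattern, value) is not None  # fully anchored
        elif op == "!~":
            ok = re.fullmatch(pattern, value) is None
        else:
            raise ValueError(f"unknown matcher operator: {op}")
        if not ok:
            return False
    return True


if __name__ == "__main__":
    series = {"job": "prometheus", "status": "404"}
    print(matches(series, [("job", "=~", ".*eus")]))
    print(matches(series, [("status", "!~", "4..")]))
    print(matches(series, [("job", "=~", "prom")]))
```

The last call returns False because "prom" only matches a prefix of the value; a selector would need "prom.*" to select this series.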
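Using functions:

1. absent()
absent(v instant-vector) returns an empty vector if the vector passed to it has any elements, and a 1-element vector with the value 1 if the vector passed to it has no elements. The result has no metric name, only labels.
This is useful for alerting when no time series exist for a given metric.
absent(nonexistent{job="prometheus"})
2. irate()
irate(v range-vector) takes a range vector and returns, for each series, (last value - previous value) / (difference between their timestamps). It is based on the last two data points and adjusts for breaks in monotonicity, such as a counter reset when a service instance restarts.
The following expression returns the per-second rate of HTTP requests for each time series in the range vector, looking up to 5 minutes back for the two most recent data points:
irate(http_requests_total{job="node-mysql"}[5m])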
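
The irate calculation described above can be sketched in a few lines of Python. The function name irate_sketch and the sample data are illustrative assumptions; the sketch only mirrors the "last two points, with counter-reset handling" idea and is not Prometheus's actual code.

```python
def irate_sketch(samples):
    """Per-second rate from the last two (timestamp_seconds, counter_value) samples.

    samples must be ordered oldest first. Returns None if fewer than two
    samples are available, since no rate can be computed then.
    """
    if len(samples) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    delta = v_last - v_prev
    if delta < 0:
        # The counter went backwards, so assume it was reset to zero
        # and count only what has accumulated since the reset.
        delta = v_last
    return delta / (t_last - t_prev)


if __name__ == "__main__":
    steady = [(0, 100.0), (15, 130.0), (30, 160.0)]
    after_restart = [(0, 100.0), (15, 5.0)]
    print(irate_sketch(steady))
    print(irate_sketch(after_restart))
```

Because only the last two points are used, irate reacts quickly to changes but is noisier than rate, which averages over the whole range.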
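3. predict_linear()
predict_linear(v range-vector, t scalar) predicts the value of each time series t seconds from now, using simple linear regression over the range vector. The result keeps the labels but drops the metric name. The following expression predicts the value 5 seconds from now based on the last 5 minutes of samples:
predict_linear(http_requests_total{code="200",instance="localhost:9090",job="prometheus",method="get"}[5m], 5)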
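
A minimal Python sketch of the idea behind predict_linear is an ordinary least-squares line fitted to the samples and evaluated t seconds ahead. The function name predict_linear_sketch and the data are invented for illustration, and as a simplifying assumption the timestamp of the last sample is treated as "now" (Prometheus uses the query evaluation time).

```python
def predict_linear_sketch(samples, t):
    """Fit value = slope * time + intercept by least squares and extrapolate.

    samples is a list of (timestamp_seconds, value) pairs with at least two
    distinct timestamps; t is the number of seconds after the last sample.
    """
    n = len(samples)
    mean_x = sum(ts for ts, _ in samples) / n
    mean_y = sum(value for _, value in samples) / n
    covariance = sum((ts - mean_x) * (value - mean_y) for ts, value in samples)
    variance = sum((ts - mean_x) ** 2 for ts, _ in samples)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope * (samples[-1][0] + t) + intercept


if __name__ == "__main__":
    # Hypothetical series growing by 6 every 60 seconds.
    data = [(0, 10.0), (60, 16.0), (120, 22.0), (180, 28.0)]
    print(predict_linear_sketch(data, 600))
```

This kind of extrapolation is typically used for questions such as whether a disk will fill up within the next few hours.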
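4. rate()
rate(v range-vector) takes a range vector and returns the per-second average rate of increase of each time series, roughly (last value - first value) / elapsed seconds. The result keeps the labels but drops the metric name.
Average HTTP request size in bytes over the last 5 minutes:
rate(http_request_size_bytes_sum[5m]) / rate(http_request_size_bytes_count[5m])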
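
The division of the two rates above can be worked through with a simplified Python sketch. It computes the increase over the window (treating any drop as a counter reset) divided by the elapsed time, and deliberately ignores the extrapolation to window boundaries that Prometheus applies. The names increase, simple_rate and the sample series are assumptions for illustration.

```python
def increase(samples):
    """Total counter increase over the samples, treating any drop as a reset."""
    total = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value >= previous:
            total += value - previous
        else:
            total += value  # counter restarted from zero
        previous = value
    return total


def simple_rate(samples):
    """Per-second average increase between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


if __name__ == "__main__":
    # Hypothetical _sum (bytes) and _count (requests) series over two minutes.
    size_sum = [(0, 1000.0), (60, 4000.0), (120, 7000.0)]
    size_count = [(0, 10.0), (60, 25.0), (120, 40.0)]
    average_size = simple_rate(size_sum) / simple_rate(size_count)
    print(average_size)
```

Here 6000 bytes were received in 30 requests over the window, so the ratio of the two rates is the average request size of 200 bytes.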
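Monitoring services with Prometheus

Official documentation:

https://prometheus.io/docs/prometheus/latest/configuration/configuration/

Chinese translation:

https://github.com/1046102779/prometheus/blob/master/operating/configuration.md

Prometheus is configured through command-line flags and a configuration file. The command-line flags configure system parameters (such as storage locations and how much data to keep on disk and in memory), while the configuration file defines everything related to scrape jobs and their instances, as well as which rule files to load.

Run prometheus -h to see all available command-line flags.

Use the --config.file command-line flag to specify the configuration file Prometheus should load at startup.

The configuration file is written in YAML and follows the schema described in the official documentation, where brackets indicate that a parameter is optional. For non-list parameters, the value is set to the specified default.

Example configuration.

The global section specifies parameters that are valid in all other configuration contexts. They also serve as defaults for the other configuration sections.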

# my global config
global:
  # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  scrape_interval: 15s
  # Evaluate rules every 15 seconds. The default is every 1 minute.
  evaluation_interval: 15s
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]
          # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - /etc/prometheus/rules.yml
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  # Monitor the node and MySQL on the node
  - job_name: node-mysql
    static_configs:
      - targets: ['192.168.81.173:9100','192.168.81.173:9104']
  # Monitor Kubernetes
  - job_name: 'kubernetes-nodes-cadvisor'
    kubernetes_sd_configs:
      - api_server: 'http://localhost:8080'
        role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_role]
        action: replace
        target_label: kubernetes_role
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:10255'
        target_label: __address__
  - job_name: 'kubernetes_node'
    kubernetes_sd_configs:
      - role: node
        api_server: 'http://localhost:8080'
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
  - job_name: 'kubernetes-services'
    metrics_path: /probe
    params:
      module: [http_2xx]
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: kubernetes_name
  - job_name: 'kubernetes-service-endpoints'
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__]
        action: replace
        target_label: nodeIp
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name

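
The replace rules in the relabel_configs above are easier to follow with a worked example. The Python sketch below imitates a single replace action: the source label values are joined with ";", the regex must match the whole joined string, and the replacement is written to the target label. The function name relabel_replace and the label values are hypothetical, and Python's re module only approximates Prometheus's RE2 regex syntax.

```python
import re


def relabel_replace(labels, source_labels, regex, replacement, target_label,
                    separator=";"):
    """Apply one 'replace' relabeling step and return the new label set."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)  # relabel regexes are fully anchored
    result = dict(labels)
    if match is None:
        return result  # no match: the labels stay unchanged
    # Translate $1 / ${1} style references into Python's \g<1> form.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    result[target_label] = match.expand(template)
    return result


if __name__ == "__main__":
    pod = {
        "__address__": "10.0.0.5:8080",
        "__meta_kubernetes_pod_annotation_prometheus_io_port": "9102",
    }
    rewritten = relabel_replace(
        pod,
        ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"],
        r"([^:]+)(?::\d+)?;(\d+)",
        "$1:$2",
        "__address__",
    )
    print(rewritten["__address__"])
```

In this example the pod is scraped on the port named in its prometheus.io/port annotation instead of the discovered port. A pod without that annotation does not match the regex and keeps its original address.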
Prometheus monitors services mainly through exporters: an exporter installed alongside the service converts its metrics into a format Prometheus can read. Exporters already exist for most common services: https://prometheus.io/docs/instrumenting/exporters/

Monitoring MySQL

On the Prometheus server, configure the job and static_configs as in the configuration above, then install mysqld_exporter on the MySQL host:

wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.10.0/mysqld_exporter-0.10.0.linux-amd64.tar.gz -O mysqld_exporter-0.10.0.linux-amd64.tar.gz

Grant the exporter's MySQL user the privileges it needs:

GRANT REPLICATION CLIENT, PROCESS ON *.* TO 'prom'@'localhost' IDENTIFIED BY 'abc123';

GRANT SELECT ON performance_schema.* TO 'prom'@'localhost';

Create the mysqld_exporter credentials file:

vim .my.cnf

[client]

user=prom

password=abc123

Start mysqld_exporter:

./mysqld_exporter -config.my-cnf=".my.cnf"

The exporter now listens on port 9104, and the MySQL monitoring setup is complete.

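Monitoring Kubernetes

Prometheus can discover scrape targets in many ways, including through Kubernetes: it calls the API server on the master to obtain the list of nodes and then scrapes each node.

Configuration: add the corresponding jobs to the Prometheus server configuration file. The configuration above monitors both container metrics and node metrics on every node. The node metrics require node-exporter to be deployed as pods in the cluster, using the following YAML file: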

apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  labels:
    app: node-exporter
    name: node-exporter
  name: node-exporter
spec:
  clusterIP: None
  ports:
    - name: scrape
      port: 9100
      protocol: TCP
  selector:
    app: node-exporter
  type: ClusterIP
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  template:
    metadata:
      labels:
        app: node-exporter
      name: node-exporter
    spec:
      containers:
        - image: prom/node-exporter
          name: node-exporter
          ports:
            - containerPort: 9100
              hostPort: 9100
              name: scrape
      hostNetwork: true

Create the node-exporter pods:

kubectl create -f node_export_pod.yaml

Then check the status of the targets in Prometheus.

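Alerting
Official documentation: https://prometheus.io/docs/alerting/configuration/

Chinese translation:

https://github.com/1046102779/prometheus/blob/master/alerting/configuration.md

Alerting with Prometheus is split into two parts. Alerting rules in the Prometheus server send alerts to an Alertmanager. The Alertmanager then manages those alerts, including silencing, inhibition and aggregation, and sends out notifications via methods such as email, webhook and HipChat.

The overall approach to setting up alerting:

1. Start Alertmanager with ./alertmanager --config.file=simple.yml, where the configuration file defines the notification channels (such as email or webhook).

2. Start Prometheus with ./prometheus --config.file=prometheus.yml, where the configuration file specifies the Alertmanager host to talk to and the rule files.

3. Define the alerting policies and templates in those rule files.

Configuring alerts

1. Download and extract Alertmanager:

tar -zxvf alertmanager-0.15.0-rc.1.linux-amd64.tar.gz

cd alertmanager-0.15.0-rc.1.linux-amd64

Create the notification configuration file:

vim alert.yml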
global:
  resolve_timeout: 6m
  smtp_smarthost: 'mail.baiwutong.com:25'
  smtp_from: 'monit@baiwutong.com'
  smtp_auth_username: 'monit@baiwutong.com'
  smtp_auth_password: 'xxxxxxx'
  smtp_require_tls: false
templates:
  - '/root/alertmanager/template/*.tmpl'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 3s
  group_interval: 5m
  repeat_interval: 10m
  receiver: default-receiver
  routes:
    # match_re is used here because ".*" is a regular expression; match compares values literally
    - match_re:
        job: ".*"
      routes:
        - match:
            status: yellow
          receiver: default-receiver
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'guoyinzhao@baiwutong.com'
        send_resolved: true
        # headers: { Subject: "[mail] Test department monitoring alert email" }

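
To illustrate what group_by does in the route above, the following Python sketch batches alerts that share the same values for the group_by labels, so that each batch could be sent as one notification. The function name group_alerts and the alert data are invented for this example; timing settings such as group_wait and group_interval are not modeled.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Group alert label sets by their values for the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical firing alerts, each represented by its labels.
    alerts = [
        {"alertname": "NodeCPUUsage", "cluster": "prod", "service": "node", "instance": "a:9100"},
        {"alertname": "NodeCPUUsage", "cluster": "prod", "service": "node", "instance": "b:9100"},
        {"alertname": "NodeMemoryUsage", "cluster": "prod", "service": "node", "instance": "a:9100"},
    ]
    grouped = group_alerts(alerts, ["alertname", "cluster", "service"])
    for key, members in grouped.items():
        print(key, len(members))
```

With this grouping, the two NodeCPUUsage alerts from different instances end up in one notification instead of two separate emails.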
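Start Alertmanager:

nohup ./alertmanager --config.file=alert.yml &

Configuring the Alertmanager host and rule file path

In the Prometheus configuration file, specify the Alertmanager host and the path to the alerting rule file: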

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]
          # - alertmanager:9093
rule_files:
  - /etc/prometheus/rules.yml

Configuring alerting rules

groups:
  - name: test-rule
    rules:
      - alert: NodeMemoryUsage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: High Memory usage detected"
          description: "{{$labels.instance}}: Memory usage is above 80% (current value is: {{ $value }})"
      - alert: NodeFilesystemUsage
        expr: (node_filesystem_size_bytes{device="rootfs"} - node_filesystem_free_bytes{device="rootfs"}) / node_filesystem_size_bytes{device="rootfs"} * 100 > 80
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{$labels.instance}}: High Filesystem usage detected"
          description: "{{$labels.instance}}: Filesystem usage is above 80% (current value is: {{ $value }})"
      - alert: NodeCPUUsage
        expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 3m
        labels:
          team: node
        annotations:
          summary: "{{$labels.instance}}: High CPU usage detected"
          description: "{{$labels.instance}}: CPU usage is above 80% (current value is: {{ $value }})"

Main reference: official site: https://prometheus.io/docs/introduction/overview/
