Prometheus Monitoring in Practice

Date: 2019-11-25

1. Introduction to Prometheus

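Prometheus is an open-source monitoring and alerting system that includes its own time series database (TSDB). Since its inception in 2012 it has been adopted by many companies and organizations. Its main features are: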

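A multi-dimensional data model (a time series is identified by a metric name and a set of key/value pairs)
A flexible query language (PromQL) that works across those dimensions
No reliance on distributed storage; each server node works on its own
Time series collection through a pull model over HTTP
Pushing of time series is supported through an intermediary gateway
Targets are found through service discovery or static configuration
Support for multiple kinds of visualization and dashboards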
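
To make the data model and the pull approach a little more concrete, here is a minimal Python sketch of how one sample (metric name, labels, value) might be rendered as a line of the text format a target serves over HTTP. The function name render_sample is hypothetical, and the sketch leaves out label-value escaping and the HELP/TYPE comment lines.

```python
def render_sample(name, labels, value):
    """Render one sample as a single line of Prometheus text exposition format."""
    if labels:
        # Labels are sorted only to make the output deterministic.
        pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


line = render_sample(
    "http_requests_total",
    {"job": "apiserver", "handler": "/api/comments"},
    1027,
)
print(line)
```

In practice one of the client libraries listed in the next part would produce this output; the sketch only shows what "metric name plus key/value pairs" looks like on the wire.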

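The Prometheus ecosystem is made up of several components. Most of them work independently, so you can pick and configure only the services you need. The main ones are: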

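The main Prometheus server, which scrapes and stores time series data
Client libraries for instrumenting applications or writing exporters (Go, Java, Python, Ruby)
A push gateway for supporting short-lived jobs
Visualization dashboards (two options, PromDash and Grafana; Grafana is the mainstream choice today)
Alertmanager, a separate alert management component that handles alert aggregation, dispatch, and silencing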

2. Simple Prometheus Deployment (Linux, CentOS)

First download the matching package from the official site (https://prometheus.io/download/), copy it to the Linux machine, and extract it. The commands are:

tar xvfz prometheus-*.tar.gz
cd prometheus-*

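The Prometheus directory contains a prometheus.yml file, which is the main configuration file for the whole Prometheus process. The default configuration includes most of the standard settings plus a job that monitors Prometheus itself: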

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).


# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ['localhost:9090']

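job_name is the name of the monitored job, and each job_name must be unique. The targets parameter under static_configs is the key setting: it determines which service addresses are scraped. From the Prometheus directory, the service can be started and stopped with the following commands.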
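
The two rules above (unique job names, and targets deciding what gets scraped) can be sketched in Python. This is only an illustration under stated assumptions: scrape_configs is a hand-written dictionary mirroring the YAML, the second job named "node" and its address localhost:9100 are hypothetical, and the helper names are made up for this example.

```python
from collections import Counter

# Hypothetical in-memory equivalent of the scrape_configs section.
# The "node" job and its target are assumptions added for illustration.
scrape_configs = [
    {"job_name": "prometheus", "static_configs": [{"targets": ["localhost:9090"]}]},
    {"job_name": "node", "static_configs": [{"targets": ["localhost:9100"]}]},
]


def duplicate_job_names(configs):
    """Return the job names that appear more than once, sorted."""
    counts = Counter(config["job_name"] for config in configs)
    return sorted(name for name, seen in counts.items() if seen > 1)


def all_targets(configs):
    """Flatten every static target address across all jobs."""
    return [
        target
        for config in configs
        for static in config.get("static_configs", [])
        for target in static["targets"]
    ]


print(duplicate_job_names(scrape_configs))
print(all_targets(scrape_configs))
```

For the real file, the promtool binary shipped in the same package can validate it with promtool check config prometheus.yml.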

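# Start the Prometheus server in the background
nohup ./prometheus --config.file=prometheus.yml &

# Find the running Prometheus process
ps -ef | grep prometheus

# Stop the server; replace {prometheus-id} with the PID shown by ps.
# A plain kill (SIGTERM) lets it shut down cleanly; use kill -9 only as a last resort.
kill {prometheus-id}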

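Once the service is running, the monitoring status can be viewed from inside the virtual machine at http://localhost:9090, or remotely through the virtual machine's mapped address. The page looks roughly like the figure below.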

3. PromQL Syntax Notes

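In the first query box on that page, PromQL expressions can be used to process and display the collected data, so the commonly used PromQL syntax is noted below.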

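Common operators:

+, -, *, /, %, ^ (addition, subtraction, multiplication, division, modulo, exponentiation)
==, !=, >, <, >=, <= (equal, not equal, greater than, less than, greater than or equal, less than or equal)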

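Common aggregation operators:

sum (sum), min (minimum), max (maximum), avg (average), count (number of elements)
stddev (standard deviation), stdvar (variance), count_values (number of elements with each distinct value), bottomk (smallest k elements), topk (largest k elements)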

Usage examples:

Select the series of the metric http_requests_total that match the given job and handler labels:
http_requests_total{job="apiserver", handler="/api/comments"}

Same selector as above, but returning the samples from the last 5 minutes:
http_requests_total{job="apiserver", handler="/api/comments"}[5m]

Match series whose job name ends with server:
http_requests_total{job=~".*server"}

Match series whose status is not 4xx:
http_requests_total{status!~"4.."}
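
The two regex matchers above can be mimicked in Python to see which label values they select. Prometheus anchors regex matchers at both ends, so re.fullmatch is the closest analogue; note that Prometheus uses RE2 syntax, which differs from Python's re in some details. The label values below are invented for illustration.

```python
import re


def matches(value, pattern):
    """Approximate a PromQL regex matcher, which must match the whole label value."""
    return re.fullmatch(pattern, value) is not None


jobs = ["apiserver", "server-backup", "webserver"]
statuses = ["200", "404", "500", "4000"]

# Equivalent of job=~".*server"
server_jobs = [job for job in jobs if matches(job, ".*server")]
# Equivalent of status!~"4.."
non_4xx = [status for status in statuses if not matches(status, "4..")]

print(server_jobs)
print(non_4xx)
```

Because of the anchoring, "server-backup" is not selected by ".*server", and "4000" is not excluded by "4..", since that pattern only covers exactly three characters.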

Per-second rate of http_requests_total, averaged over the last 5 minutes:
rate(http_requests_total[5m])

The same per-second rate, summed per job:
sum(rate(http_requests_total[5m])) by (job)
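
As a rough illustration of what rate() computes for a counter, the Python sketch below sums the increases between consecutive samples, treats a drop in value as a counter reset, and divides by the elapsed time. It is a simplification: the real function also extrapolates to the edges of the time window. The function name simple_rate and the sample values are assumptions, and timestamps are expected to be strictly increasing.

```python
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value) pairs, oldest first."""
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter was reset, so the new value counts from zero.
        increase += value if value < previous else value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Invented samples, one per minute, with a counter reset before t=180.
samples = [(0, 100), (60, 160), (120, 220), (180, 10), (240, 70)]
print(simple_rate(samples))
```

Here the total increase is 60 + 60 + 10 + 60 = 190 over 240 seconds, so the result is about 0.79 requests per second.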

Unused memory of each instance, in MB:
(instance_memory_limit_bytes - instance_memory_usage_bytes) / 1024 / 1024

Unused memory in MB, summed by app and proc:
sum(instance_memory_limit_bytes - instance_memory_usage_bytes) by (app, proc) / 1024 / 1024

Suppose the data looks like this:
instance_cpu_time_ns{app="lion", proc="web", rev="34d0f99", env="prod", job="cluster-manager"}
instance_cpu_time_ns{app="elephant", proc="worker", rev="34d0f99", env="prod", job="cluster-manager"}
instance_cpu_time_ns{app="turtle", proc="api", rev="4d3a513", env="prod", job="cluster-manager"}
instance_cpu_time_ns{app="fox", proc="widget", rev="4d3a513", env="prod", job="cluster-manager"}

Grouped by app and proc, take the top three groups by CPU time rate:
topk(3, sum(rate(instance_cpu_time_ns[5m])) by (app, proc))

Number of series per app:
count(instance_cpu_time_ns) by (app)
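
The grouping in the last two queries can be sketched in plain Python. The label sets come from the sample data above, reduced to app and proc; the numeric values, the second lion/web series, and the helper names sum_by, topk and count_by are invented for illustration.

```python
from collections import defaultdict

# (labels, value) pairs; the values stand in for per-second rates and are invented.
series = [
    ({"app": "lion", "proc": "web"}, 4.0),
    ({"app": "elephant", "proc": "worker"}, 2.5),
    ({"app": "turtle", "proc": "api"}, 1.0),
    ({"app": "fox", "proc": "widget"}, 3.0),
    ({"app": "lion", "proc": "web"}, 1.5),  # a second lion/web instance
]


def sum_by(series, keys):
    """Sum values per group, like sum(...) by (keys)."""
    totals = defaultdict(float)
    for labels, value in series:
        totals[tuple(labels[key] for key in keys)] += value
    return dict(totals)


def topk(k, grouped):
    """Keep the k groups with the largest values, like topk(k, ...)."""
    return sorted(grouped.items(), key=lambda item: item[1], reverse=True)[:k]


def count_by(series, keys):
    """Count series per group, like count(...) by (keys)."""
    counts = defaultdict(int)
    for labels, _ in series:
        counts[tuple(labels[key] for key in keys)] += 1
    return dict(counts)


top_three = topk(3, sum_by(series, ("app", "proc")))
per_app = count_by(series, ("app",))
print(top_three)
print(per_app)
```

With these made-up values the two lion/web series are merged into one group before topk is applied, which is why the aggregation is nested inside topk in the query.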

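Average HTTP response time over the last 5 minutes, where basename stands for the name of a histogram or summary metric: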
rate(basename_sum[5m]) / rate(basename_count[5m])
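
Any of the expressions in this section can also be evaluated outside the web page, through the HTTP API endpoint /api/v1/query on the same server. The Python sketch below only builds the request URL and decodes it again, so it runs without a Prometheus server; the helper name build_query_url is hypothetical and the base address is the localhost one used earlier.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urlencode({"query": promql})  # percent-encodes the expression
    return base_url.rstrip("/") + "/api/v1/query?" + params


expression = "rate(basename_sum[5m]) / rate(basename_count[5m])"
url = build_query_url("http://localhost:9090", expression)
print(url)

# Decoding the query string gives back the original expression.
decoded = parse_qs(urlparse(url).query)["query"][0]
print(decoded)
```

Fetching that URL from a running server, for example with urllib.request.urlopen, should return a JSON document with the query result.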

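Follow-up posts will cover using Prometheus together with other components and monitoring various services. If anything here is lacking or mistaken, please point it out in the comments.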
