Prometheus Quick Start

Date: 2019-11-26


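Prometheus is an open-source, metrics-based monitoring system. It has a simple yet powerful data model and query language that let us analyze how our applications are behaving. Prometheus was started in 2012, is written mainly in Go, and is released under the Apache 2.0 license. A large number of organizations now run it in production. In 2016 Prometheus became the second project to join the Cloud Native Computing Foundation (CNCF).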

Deploying Prometheus

Create the prometheus user

Download the package for your platform and extract it into a directory

hostname$ tar xf prometheus-2.10.0.linux-amd64.tar.gz
hostname$ mv prometheus-2.10.0.linux-amd64 /opt/prometheus

Startup script

hostname$ sudo vim  /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus instance
Wants=network-online.target
After=network-online.target
After=postgresql.service mariadb.service mysql.service

[Service]
User=prometheus
Group=prometheus
Type=simple
Restart=on-failure
WorkingDirectory=/opt/prometheus/
RuntimeDirectory=prometheus
RuntimeDirectoryMode=0750
ExecStart=/opt/prometheus/prometheus \
  --storage.tsdb.retention=15d \
  --config.file=/opt/prometheus/prometheus.yml \
  --web.max-connections=512 \
  --web.read-timeout=5m \
  --storage.tsdb.path=/opt/data/prometheus \
  --query.timeout=2m \
  --query.max-concurrency=200
LimitNOFILE=10000
TimeoutStopSec=20

[Install]
WantedBy=multi-user.target


Startup parameters

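--web.read-timeout=5m: maximum time to wait while reading a request; keeps too many idle connections from tying up resources
--web.max-connections=512: maximum number of simultaneous connections
--storage.tsdb.retention=15d: how long collected data is kept. Once Prometheus starts scraping, samples are held in memory and on disk. Keep them too long and disk and memory cannot cope; keep them too briefly and the history is gone. 15 days is a reasonable setting. Newer Prometheus releases deprecate this flag in favor of --storage.tsdb.retention.time
--storage.tsdb.path=/opt/data/prometheus: where the data is stored. This matters: choose the location deliberately and make sure the prometheus user can write to it, because a carelessly placed data directory can fill up the root (/) filesystem
--query.timeout=2m, --query.max-concurrency=200: guard against too many users querying at the same time, and against a single oversized query that never returns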

Configuration file

# my global config
global:
  scrape_interval:     15s # How often to scrape targets. The default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    static_configs:
    - targets: ['192.168.48.130:9090']  # IP address of this host

/opt/prometheus/prometheus.yml

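Start the service (sudo systemctl daemon-reload, then sudo systemctl start prometheus) and open port 9090 in a browser: Prometheus is now up and running.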

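Deploying node_exporter

The Prometheus community provides node_exporter to collect system metrics from the hosts being monitored. Here it is downloaded and deployed on the node c1.heboan.com.

Create the prometheus user

Download the package for your platform and extract it into a directory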

hostname$ tar xf node_exporter-0.18.1.linux-amd64.tar.gz
hostname$ mv node_exporter-0.18.1.linux-amd64 /opt/node_exporter

hostname$ sudo vim /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/opt/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target

Startup script

node_exporter listens on port 9100 by default and serves its metrics over HTTP

$ curl http://c1.heboan.com:9100/metrics
....
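# HELP node_memory_MemFree_bytes Memory information field MemFree_bytes.   # this line describes what the metric means
# TYPE node_memory_MemFree_bytes gauge  # this line gives the metric's data type, here gauge
node_memory_MemFree_bytes 1.619521536e+09   # this line is the sample itself, a key/value pair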


node_exporter collects a large number of metrics, and each one is exposed in this three-line form.
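The three lines can be pulled apart mechanically. The following is a minimal Python sketch rather than an official parser: the function name parse_metric_block is made up for illustration, and it only handles the simple unlabelled case shown in the curl output.

```python
def parse_metric_block(text):
    """Split the HELP/TYPE/sample lines of one unlabelled metric into a dict."""
    result = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith("# HELP "):
            # "# HELP <name> <free-form description>"
            _, _, name, help_text = line.split(" ", 3)
            result["name"] = name
            result["help"] = help_text
        elif line.startswith("# TYPE "):
            # "# TYPE <name> <type>", for example gauge or counter
            _, _, name, metric_type = line.split(" ", 3)
            result["type"] = metric_type
        elif line and not line.startswith("#"):
            # "<name> <value>": the sample itself
            name, value = line.split()
            result["value"] = float(value)
    return result


sample = """
# HELP node_memory_MemFree_bytes Memory information field MemFree_bytes.
# TYPE node_memory_MemFree_bytes gauge
node_memory_MemFree_bytes 1.619521536e+09
"""

metric = parse_metric_block(sample)
print(metric["name"], metric["type"], metric["value"])
```

Metrics that carry labels, such as the per-core CPU counters, expose one sample line per label combination under a single HELP/TYPE pair, which this sketch does not attempt to handle.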

With node_exporter set up, it needs to be hooked into Prometheus: edit prometheus.yml, then restart Prometheus.

scrape_configs:
  ...
  - job_name: 'aliyun'
    static_configs:
    - targets: ['c1.heboan.com:9100']   # several node_exporter addresses can be listed here

Then open the Prometheus web UI: c1.heboan.com now shows up as a monitored target.
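The same check can be scripted against the Prometheus HTTP API, whose /api/v1/targets endpoint reports the health of every scrape target. The Python sketch below is an illustration under assumptions: the sample payload is hand-written and trimmed to the fields used, and the helper name target_health is hypothetical.

```python
import json


def target_health(payload):
    """Map each active target's instance label to its reported health."""
    targets = payload["data"]["activeTargets"]
    return {t["labels"]["instance"]: t["health"] for t in targets}


# Hand-written, trimmed sample of an /api/v1/targets response (illustrative only).
sample_response = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"instance": "192.168.48.130:9090", "job": "prometheus"}, "health": "up"},
      {"labels": {"instance": "c1.heboan.com:9100", "job": "aliyun"}, "health": "up"}
    ]
  }
}
""")

health = target_health(sample_response)
print(health)

# Against a live server the payload could be fetched with, for example:
#   from urllib.request import urlopen
#   payload = json.load(urlopen("http://192.168.48.130:9090/api/v1/targets"))
```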

Follow the same steps to add c2.heboan.com.

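Viewing monitoring data

node_exporter is now collecting system metrics on c1.heboan.com and is hooked into Prometheus, so we can use the query language in the Prometheus web UI to pull out the metrics we want.

Example: get the CPU usage of a monitored host over the last 5 minutes

Formula: (1 - idle increase over 5 minutes / total increase over 5 minutes) * 100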

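First query the total time the CPU has spent running (the metric node_cpu_seconds_total). CPU time is split into modes such as system, iowait, irq, user and idle; added together they give the total running time. Note that the series are reported per core.

Filter by label to keep only idle time (mode="idle").

Compute the increase over 5 minutes (increase(...[5m])).

Because each core is counted separately, add the cores together with sum.

That sum covers every core on every machine, whereas we want the cores of each machine added up separately, so we need by (instance).

That gives the increase in idle CPU time over 5 minutes. The increase in total CPU time is then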

# no label filter
sum(increase(node_cpu_seconds_total[5m])) by (instance)

Then apply the formula.
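To make the arithmetic concrete, here is a small Python sketch of the formula. The per-mode numbers are invented for illustration (a dual-core machine observed for 5 minutes, so 2 cores * 300 s = 600 CPU seconds in total); they are not measurements from the hosts in this article.

```python
def cpu_usage_percent(idle_increase, total_increase):
    """(1 - idle increase / total increase) * 100, as in the formula above."""
    return (1 - idle_increase / total_increase) * 100


# Hypothetical increases in CPU seconds over 5 minutes, already summed
# over both cores of a single machine.
increase_by_mode = {
    "idle": 450.0,
    "user": 90.0,
    "system": 45.0,
    "iowait": 15.0,
}

total = sum(increase_by_mode.values())  # denominator: all modes together
usage = cpu_usage_percent(increase_by_mode["idle"], total)
print(total, usage)
```

With these numbers the machine was idle for 450 of 600 CPU seconds, so the usage comes out at 25 percent.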

Click Graph to see the trend plotted as a chart.

These charts are temporary. To save them and come back to them at any time, use Grafana.

Deploying and using Grafana

Installing Grafana

# Download the package from the official site, for example grafana-6.2.5-1.x86_64.rpm
# then install it locally
yum localinstall grafana-6.2.5-1.x86_64.rpm

# Start the service
systemctl start grafana-server

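Grafana listens on port 3000. Open the web UI and log in with the default account admin/admin; you will be asked to change the password, so pick a new one. Once logged in, click "Add data source" and choose "prometheus".

To add a dashboard, go back to the Home Dashboard and click "New dashboard", then "Add Query".

Click the gear icon to open the dashboard settings and add variables.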

General

Define a few variables

$interval

$env

$node

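Save the dashboard and view it; the result looks like this

Now let's draw a panel for CPU usage. Click the add_panel icon and select "Choose Visualization".

For the data source, choose prometheus, the one configured earlier. The query is as follows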

 (1- sum(increase(node_cpu_seconds_total{mode="idle", instance=~"$node"}[5m]))/ sum(increase(node_cpu_seconds_total{instance=~"$node"}[5m]))) * 100

# $node is the variable configured earlier; it matches each node
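The instance=~"$node" matcher is a regular-expression match, which is what lets one panel serve one or several nodes. The Python sketch below models that idea under stated assumptions: it supposes $node is a multi-value variable whose selections are joined into an alternation such as (a|b), and the instance names and helper names are hypothetical. Grafana's own interpolation and escaping rules are not reproduced here.

```python
import re


def node_pattern(selected):
    """Build a regex alternation such as (a|b) from the selected instances."""
    escaped = [re.escape(value) for value in selected]
    return escaped[0] if len(escaped) == 1 else "(" + "|".join(escaped) + ")"


def matching_instances(pattern, instances):
    """PromQL regex matchers are fully anchored, so fullmatch models =~."""
    return [name for name in instances if re.fullmatch(pattern, name)]


# Hypothetical instance labels known to Prometheus.
known = ["c1.heboan.com:9100", "c2.heboan.com:9100", "192.168.48.130:9090"]

both = matching_instances(node_pattern(["c1.heboan.com:9100", "c2.heboan.com:9100"]), known)
only_c1 = matching_instances(node_pattern(["c1.heboan.com:9100"]), known)
partial = matching_instances("c1", known)  # anchored: a bare prefix matches nothing
print(both, only_c1, partial)
```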

Visualization

General

The final result looks like this

Alerting with Alertmanager

Metrics alone are not enough: when a metric goes wrong, a notification has to go out. That is the job of Alertmanager.

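With Prometheus, we decide under which conditions an alert should be raised. Prometheus then fires the alert and sends it to Alertmanager, which groups and throttles the alerts it receives before sending notifications.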
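Grouping is the part that keeps one incident from producing a flood of messages. As a rough Python sketch of the idea (not Alertmanager's implementation), alerts that share the same values for the grouping labels end up in one bucket, and each bucket yields one notification. The alert hostMemUsageAlert and the helper group_alerts are hypothetical, added only to show a second group.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts as Prometheus might send them.
alerts = [
    {"labels": {"alertname": "hostCpuUsageAlert", "instance": "c1.heboan.com:9100"}},
    {"labels": {"alertname": "hostCpuUsageAlert", "instance": "c2.heboan.com:9100"}},
    {"labels": {"alertname": "hostMemUsageAlert", "instance": "c1.heboan.com:9100"}},
]

groups = group_alerts(alerts, group_by=["alertname"])
for key, members in groups.items():
    # One notification per group instead of one per alert.
    print(key, [a["labels"]["instance"] for a in members])
```

Three alerts collapse into two notifications here; in Alertmanager the grouping labels come from the group_by setting of the route.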

First, configure the alerting rules on the Prometheus server

...
rule_files:
  - "first_rules.yml"

...

/opt/prometheus/prometheus.yml (above) and first_rules.yml (below)

groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr:  1 - sum(increase(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) / sum(increase(node_cpu_seconds_total[5m])) by (instance) > 0.8
    for: 1m


# Fires when CPU usage over 5 minutes is above 80% and stays there for 1 minute
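The for clause means the alert does not fire the first time the expression is true: it sits in a pending state until the condition has held for the whole duration. The Python sketch below is a simplified model of that behaviour, not Prometheus code. It assumes the 15s evaluation_interval from the configuration file, and the sequence of evaluation results is made up.

```python
def alert_states(expr_results, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    expr_results holds one boolean per evaluation, True when the
    expression (CPU usage > 0.8) returned a result.
    """
    states = []
    active_for = None  # seconds since the expression first became true
    for holds in expr_results:
        if not holds:
            active_for = None  # condition cleared: start over
            states.append("inactive")
            continue
        active_for = 0 if active_for is None else active_for + eval_interval_s
        states.append("firing" if active_for >= for_s else "pending")
    return states


# Hypothetical run: the condition holds from the 2nd to the 6th evaluation.
results = [False, True, True, True, True, True, False]
states = alert_states(results, eval_interval_s=15, for_s=60)
print(states)
```

In this run the alert is pending for four evaluations and only fires on the fifth, 60 seconds after the condition first held; a single evaluation where the condition no longer holds resets it.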

Now stress the CPU. The machine has two cores, so open two terminals on c1.heboan.com and run the following command in each

time echo "scale=50000; 4*a(1)" | bc -l -q

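Check the chart

The Prometheus web UI shows that the alert has fired

To actually send notifications, by email for example, we need Alertmanager. I installed it on the same server as Prometheus, but it can run anywhere else as long as the network between them works.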

hostname$ tar xf alertmanager-0.17.0.linux-amd64.tar.gz
hostname$ mv alertmanager-0.17.0.linux-amd64 /opt/alertmanager

/opt/alertmanager/alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:25'
  smtp_from: 'sellsa@qq.com'
  smtp_auth_username: 'sellsa@qq.com'
  smtp_auth_password: 'mailbox authorization code'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email'


receivers:
- name: 'email'
  email_configs:
  - to: 'heboan@qq.com'

Start Alertmanager; it listens on port 9093

cd /opt/alertmanager
./alertmanager --config.file="alertmanager.yml"

In prometheus.yml, configure which Alertmanager the alerts are pushed to

...
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']

...

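Finally, run the CPU stress test again and the alert email arrives

The alert message is not very detailed and carries no description, so we can edit first_rules.yml to add some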

groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr:  1 - sum(increase(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) / sum(increase(node_cpu_seconds_total[5m])) by (instance) > 0.8
    for: 1m
    annotations:
      description: '{{ $labels.instance }} of job {{ $labels.job }}: CPU usage over 5 minutes exceeded 80% for 1 minute'
      summary: 'Instance {{ $labels.instance }}'
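The {{ $labels.name }} placeholders are filled from the labels of the series that triggered the alert. The Python sketch below imitates that substitution with a regular expression; it is a stand-in for illustration, not the Go template engine Prometheus actually uses, and the function name render is made up. It also shows one consequence of the rule as written: because the expression aggregates with by (instance), the resulting series carries only the instance label, so {{ $labels.job }} comes out empty.

```python
import re


def render(template, labels):
    """Substitute {{ $labels.<name> }} placeholders; missing labels become empty."""
    def replace(match):
        return labels.get(match.group(1), "")
    return re.sub(r"\{\{\s*\$labels\.(\w+)\s*\}\}", replace, template)


# Labels left on the series after sum(...) by (instance): only instance survives.
labels = {"instance": "c1.heboan.com:9100"}

summary = render("Instance {{ $labels.instance }}", labels)
prefix = render("{{ $labels.instance }} of job {{ $labels.job }}", labels)
print(summary)
print(repr(prefix))
```

If the job name should appear in the message, one option is to aggregate with by (instance, job) on both sides of the division so that the label is preserved.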
