prometheus+grafana

Date: 2019-12-07

Tags: prometheus+grafana, prometheus, grafana


Setting up the central Prometheus server

# Runs on the monitoring host and collects metrics from the node servers

Download: https://prometheus.io/download/

Unpack the archive

Run: nohup ./prometheus --config.file=./prometheus.yml &>> ./prometheus.log &

Access: http://192.168.1.24:9090

Setting up node-exporter on the monitored nodes

# Runs on every server whose metrics need to be collected

Download: https://prometheus.io/download/

Unpack the archive

Run: nohup ./node_exporter &>> ./node_exporter.log &

Reload: kill -1 PID

Access: http://192.168.1.24:9100
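To confirm that node_exporter is serving data, a single metric can be pulled out of its /metrics output. The sketch below assumes the Prometheus text exposition format, where a label-free metric appears as one `name value` line; `metric_value` is a hypothetical helper name, not part of node_exporter.

```shell
# metric_value NAME: read exposition-format text on stdin and print the
# value of the label-free metric NAME (hypothetical helper).
metric_value() {
  awk -v name="$1" '$1 == name { print $2 }'
}

# Usage against a running exporter (address taken from the steps above):
#   curl -s http://192.168.1.24:9100/metrics | metric_value node_load1
```

Comment lines such as `# HELP` and `# TYPE` are skipped automatically because their first field is `#`, not the metric name.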

Add the nodes to the Prometheus scrape configuration:

vim prometheus.yml

Add:

  - job_name: '21'
    static_configs:
      - targets: ['192.168.1.21:9100']
  - job_name: '24'
    static_configs:
      - targets: ['192.168.1.24:9100']
  - job_name: '20'
    static_configs:
      - targets: ['192.168.1.20:9100']

# targets holds the addresses of the metric sources; separate multiple addresses with commas
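Because one job can list several targets, the three node entries could also be grouped under a single job, with the hosts told apart by their `instance` label. A small sketch that prints such a stanza; the helper `scrape_job` and the job name `nodes` are illustrative assumptions, not something the original setup uses.

```shell
# scrape_job JOB_NAME HOST:PORT [HOST:PORT ...]
# Print one scrape job whose targets list holds every address given
# (hypothetical helper; indentation matches entries under scrape_configs).
scrape_job() {
  name=$1
  shift
  targets=""
  for t in "$@"; do
    # Append each address in single quotes, comma-separated
    targets="${targets:+$targets, }'$t'"
  done
  printf "  - job_name: '%s'\n" "$name"
  printf "    static_configs:\n"
  printf "      - targets: [%s]\n" "$targets"
}

# Usage:
#   scrape_job nodes 192.168.1.20:9100 192.168.1.21:9100 192.168.1.24:9100
```

The printed lines can be pasted under scrape_configs in prometheus.yml in place of the three separate jobs.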

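Setting up Alertmanager for alert notifications

Can run on any server; it collects alerts and sends them as messages to the operations staff

Download: https://prometheus.io/download/

Unpack the archive

Run: nohup ./alertmanager --config.file=./alertmanager.yml &>> ./alertmanager.log &

Access: http://192.168.1.24:9093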

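Setting up Grafana for dashboards

A user-friendly web front end that makes server performance easier to monitor

Download: https://grafana.com/get

Unpack the archive

Run: nohup ./grafana-server &>> ./grafana-server.log &

Access: http://192.168.1.24:3000

Add the monitoring host to Grafana:

Click Save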
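As an alternative to clicking through the UI, Grafana can also read data sources from provisioning files at startup. A minimal sketch, assuming the monitoring host is added as a Prometheus data source at 192.168.1.24:9090 (the address used in the steps above) and that Grafana uses its default provisioning directory; the helper name `datasource_yaml` and the output file name are made up for illustration.

```shell
# datasource_yaml URL: print a Grafana data source provisioning file that
# registers the Prometheus server at URL (hypothetical helper).
datasource_yaml() {
  cat <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: $1
    isDefault: true
EOF
}

# Usage, run from the unpacked Grafana directory (path is an assumption):
#   datasource_yaml http://192.168.1.24:9090 > conf/provisioning/datasources/prometheus.yaml
```

Grafana has to be restarted before it picks up a new provisioning file.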

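Add the Kubernetes monitoring dashboard template to Grafana

Download: https://grafana.com/dashboards

Select the downloaded template

Select the monitoring host

Add it and start using it

Data has to be collected for a while before anything shows up, so be patient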

Basic Grafana usage

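Email alerts

alertmanager.yml holds the mail account settings; see the configuration file details for the full file

prometheus.yml specifies the Alertmanager address and the rule_files paths

vim first_rules.yml to define the alerting rules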
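Before restarting the services it can help to validate the three files. promtool ships in the Prometheus archive and amtool in the Alertmanager archive; the sketch assumes both are on PATH (otherwise call them as ./promtool and ./amtool) and that the files sit in the current directory. `run_if_present` and `check_configs` are hypothetical helper names.

```shell
# run_if_present TOOL [ARGS...]: run TOOL when it is installed,
# otherwise report that the check was skipped.
run_if_present() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "skip: $1 not found in PATH"
  fi
}

# check_configs: validate the Prometheus config, the rule file and the
# Alertmanager config; stops at the first failing check.
check_configs() {
  run_if_present promtool check config prometheus.yml &&
    run_if_present promtool check rules first_rules.yml &&
    run_if_present amtool check-config alertmanager.yml
}

# Usage:
#   check_configs && echo "all configs look valid"
```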

Configuration files in detail

prometheus.yml

# my global config
global:
  # How often Prometheus scrapes its targets; here, metrics are collected every 15 seconds
  scrape_interval: 15s
  # How often rules are evaluated; here, the configured rule set is evaluated every 15 seconds
  evaluation_interval: 15s
  # Extra labels added to the metrics, useful for telling Prometheus servers apart;
  # several Prometheus servers can share one Alertmanager
  external_labels:
    monitor: 'codelab-monitor'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        # Address of Alertmanager (see the Alertmanager setup section)
        - targets: ["192.168.1.24:9093"]
        # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # Rule files to load; each file defines its rules in YAML groups
  - "/work/prometheus-2.5.0.linux-amd64/first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # Name of the job. The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: '21'
    static_configs:
      - targets: ['192.168.1.21:9100']
  - job_name: '24'
    static_configs:
      - targets: ['192.168.1.24:9100']
  - job_name: '20'
    static_configs:
      # Addresses of the metric sources; separate multiple addresses with commas
      - targets: ['192.168.1.20:9100']

alertmanager.yml

global:
  resolve_timeout: 5m
  # Mail account settings
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'hxqxiaoqi1990@163.com'
  smtp_auth_username: 'hxqxiaoqi1990@163.com'
  smtp_auth_password: 'YOUR_SMTP_AUTH_CODE'
  smtp_require_tls: false

# Template used to render alert notifications
templates:
  - '/work/alertmanager-0.15.3.linux-amd64/template/123.tmpl'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'mail'

receivers:
  #- name: 'web.hook'
  #  webhook_configs:
  #    - url: 'http://127.0.0.1:5001/'
  - name: 'mail'
    email_configs:
      - to: 'hxqxiaoqi1990@163.com'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

first_rules.yml

groups:
  - name: test-rule
    rules:
      - alert: clients
        expr: node_load1 > 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: Too many clients detected"
          description: "{{$labels.instance}}: Client num is above 80% (current value is: {{ $value }})"

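set from=hxqxiaoqi1990@163.com # account used to send mail
set smtp=smtp.163.com # outgoing mail server
set smtp-auth-user=hxqxiaoqi1990@163.com # your mailbox account
set smtp-auth-password=YOUR_SMTP_AUTH_CODE # authorization code
set smtp-auth=login

# Never finishes on its own; keeps a CPU core busy until interrupted with Ctrl+C
cat /dev/urandom | md5sum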

Memory rules

groups:
  - name: test-rule
    rules:
      - alert: "Memory alert"
        expr: 100 - ((node_memory_MemAvailable_bytes * 100) / node_memory_MemTotal_bytes) > 10
        for: 1s
        labels:
          severity: warning
        annotations:
          summary: "Service name: {{$labels.alertname}}"
          description: "Business 500 alert: {{ $value }}"
          value: "{{ $value }}"
  - name: test-rule2
    rules:
      - alert: "Memory alert"
        expr: 100 - ((node_memory_MemAvailable_bytes * 100) / node_memory_MemTotal_bytes) > 40
        for: 1s
        labels:
          severity: test
        annotations:
          summary: "Service name: {{$labels.alertname}}"
          description: "Business 500 alert: {{ $value }}"
          value: "{{ $value }}"

((node_memory_MemTotal_bytes -(node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes) )/node_memory_MemTotal_bytes ) * 100 > ${value}
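Both memory expressions can be reproduced by hand from /proc/meminfo, which is where node_exporter reads these values on Linux. The sketch below is only an illustration of the arithmetic; the function names are hypothetical, and the kB units in the file cancel out because only ratios are computed.

```shell
# mem_used_pct [FILE]: 100 - MemAvailable * 100 / MemTotal,
# read from a /proc/meminfo-style file (default: /proc/meminfo).
mem_used_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) printf "%.1f\n", 100 - (avail * 100 / total)
      else exit 1
    }
  ' "${1:-/proc/meminfo}"
}

# mem_used_pct_classic [FILE]: (MemTotal - (MemFree + Buffers + Cached)) / MemTotal * 100
mem_used_pct_classic() {
  awk '
    $1 == "MemTotal:" { total = $2 }
    $1 == "MemFree:"  { free = $2 }
    $1 == "Buffers:"  { buffers = $2 }
    $1 == "Cached:"   { cached = $2 }
    END {
      if (total > 0) printf "%.1f\n", (total - (free + buffers + cached)) / total * 100
      else exit 1
    }
  ' "${1:-/proc/meminfo}"
}

# Usage on a monitored node:
#   mem_used_pct
#   mem_used_pct_classic
```

The two numbers usually differ somewhat, since MemAvailable is the kernel's own estimate of reclaimable memory rather than a plain sum of free, buffer and cache pages.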

CPU rule

100 - ((avg by (instance,job,env)(irate(node_cpu_seconds_total{mode="idle"}[30s]))) *100) > ${value}

Disk rule

100 - (node_filesystem_avail_bytes{fstype !~ "nfs|rpc_pipefs|rootfs|tmpfs",device!~"/etc/auto.misc|/dev/mapper/centos-home",mountpoint !~ "/boot|/net|/selinux"} /node_filesystem_size_bytes{fstype !~ "nfs|rpc_pipefs|rootfs|tmpfs",device!~"/etc/auto.misc|/dev/mapper/centos-home",mountpoint !~ "/boot|/net|/selinux"} ) * 100 > ${value}

Network traffic rule:

(irate(node_network_transmit_bytes_total{device!~"lo"}[1m]) / 1000) > ${value}
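irate takes the last two samples inside the range and returns their per-second increase, so dividing by 1000 turns bytes per second into KB per second. A worked sketch of that arithmetic with made-up sample values; `tx_kb_per_s` is a hypothetical helper and it ignores counter resets, which Prometheus handles itself.

```shell
# tx_kb_per_s PREV_BYTES LAST_BYTES SECONDS: per-second increase between
# two counter samples, divided by 1000 as in the rule (hypothetical helper).
tx_kb_per_s() {
  awk -v prev="$1" -v last="$2" -v dt="$3" \
    'BEGIN { printf "%.1f\n", (last - prev) / dt / 1000 }'
}

# Two samples taken 15 s apart (the scrape interval used in prometheus.yml):
#   tx_kb_per_s 1000000 4000000 15    # prints 200.0
```

Read this way, a threshold of 80000 corresponds to 80000 KB/s of outbound traffic on one interface.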

Application CPU usage

process_cpu_usage{job="${app}"} * 100 > ${value}

Alert templates

groups:
  - name: down
    rules:
      - alert: "Down alert"
        expr: up == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Down alert"
          description: "Alert time:"
          value: "Used: {{ $value }}"
  - name: memory
    rules:
      - alert: "Memory alert"
        expr: ((node_memory_MemTotal_bytes -(node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes) )/node_memory_MemTotal_bytes ) * 100 > 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Memory alert"
          description: "Alert time:"
          value: "Used: {{ $value }}%"
  - name: cpu
    rules:
      - alert: "CPU alert"
        expr: 100 - ((avg by (instance,job,env)(irate(node_cpu_seconds_total{mode="idle"}[30s]))) *100) > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "CPU alert"
          description: "Alert time:"
          value: "Used: {{ $value }}%"
  - name: disk
    rules:
      - alert: "Disk alert"
        expr: 100 - (node_filesystem_avail_bytes{fstype !~ "nfs|rpc_pipefs|rootfs|tmpfs",device!~"/etc/auto.misc|/dev/mapper/centos-home",mountpoint !~ "/boot|/net|/selinux"} /node_filesystem_size_bytes{fstype !~ "nfs|rpc_pipefs|rootfs|tmpfs",device!~"/etc/auto.misc|/dev/mapper/centos-home",mountpoint !~ "/boot|/net|/selinux"} ) * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Disk alert"
          description: "Alert time:"
          value: "Used: {{ $value }}%"
  - name: net
    rules:
      - alert: "Network alert"
        expr: (irate(node_network_transmit_bytes_total{device!~"lo"}[1m]) / 1000) > 80000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Network alert"
          description: "Alert time:"
          value: "Used: {{ $value }}KB"
