Monitoring with Prometheus

Date: 2019-12-01

Tags: monitoring, prometheus

1. prometheus-webhook-dingtalk

GitHub: [Releases · timonwong/prometheus-webhook-dingtalk · GitHub](https://github.com/timonwong/prometheus-webhook-dingtalk/releases)
Download: [prometheus-webhook-dingtalk-0.3.0.linux-amd64.tar.gz](https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v0.3.0/prometheus-webhook-dingtalk-0.3.0.linux-amd64.tar.gz)

Download the version you need from GitHub, then extract it:

wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v0.3.0/prometheus-webhook-dingtalk-0.3.0.linux-amd64.tar.gz
tar xf prometheus-webhook-dingtalk-0.3.0.linux-amd64.tar.gz -C /data; cd /data
mv prometheus-webhook-dingtalk-0.3.0.linux-amd64 prometheus-webhook-dingtalk

Edit the configuration file (the message template):
# cat default.tmpl

{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
{{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}

{{ define "__text_alert_list" }}{{ range . }}
**Labels**
{{ range .Labels.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**Annotations**
{{ range .Annotations.SortedPairs }}> - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**Source:** [{{ .GeneratorURL }}]({{ .GeneratorURL }})

{{ end }}{{ end }}

{{ define "ding.link.title" }}{{ template "__subject" . }}{{ end }}
{{ define "ding.link.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
{{ template "__text_alert_list" .Alerts.Firing }}
{{ end }}

Start the service:
# cat prometheus-webhook-dingtalk.sh

#!/bin/bash
nohup prometheus-webhook-dingtalk --web.listen-address="0.0.0.0:8060" --ding.profile="test=https://oapi.dingtalk.com/robot/send?access_token=89f3cedfb3c3cdb031bdf10f8fc52bf1add575e9b5fb6f462a8cca6859af4" >>/data/prometheus-webhook-dingtalk/nohub.out 2>&1 &

The value of --ding.profile has the form <profile name>=<webhook URL>. The webhook URL is generated by a DingTalk robot, so create a DingTalk robot of your own first.
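
To keep the robot token out of the start script, the --ding.profile value can be assembled from its two parts. The following is only a sketch: the function name `ding_profile` and the `DING_TOKEN` variable are made up for illustration, and the URL prefix is copied from the start script above.

```shell
#!/bin/bash
# Build the value for --ding.profile from a profile name and a robot token.
# ding_profile is a hypothetical helper, not part of prometheus-webhook-dingtalk.
ding_profile() {
  # $1 = profile name (it later appears in the webhook path)
  # $2 = access token of the DingTalk robot
  printf '%s=https://oapi.dingtalk.com/robot/send?access_token=%s\n' "$1" "$2"
}
```

It could then be used as `prometheus-webhook-dingtalk --ding.profile="$(ding_profile test "$DING_TOKEN")"`, with `DING_TOKEN` exported in the environment beforehand.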

2. Alertmanager
GitHub and downloads: [Releases · prometheus/alertmanager · GitHub](https://github.com/prometheus/alertmanager/releases)

Download the version you need from GitHub, then extract it:

wget https://github.com/prometheus/alertmanager/releases/download/v0.15.1/alertmanager-0.15.1.linux-amd64.tar.gz
tar xf alertmanager-0.15.1.linux-amd64.tar.gz -C /data ;cd /data
mv alertmanager-0.15.1.linux-amd64 alertmanager

Edit the configuration file. I use DingTalk for alerting, so this post configures DingTalk:
# cat alertmanager.yml

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'test'
receivers:
- name: 'test'
  webhook_configs:
   - url: "http://127.0.0.1:8060/dingtalk/test/send"
     send_resolved: true

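The url here is the address of prometheus-webhook-dingtalk, which converts alerts into a message format that DingTalk accepts. The path segment after /dingtalk/ is the profile name given in --ding.profile.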
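
Before wiring up Prometheus, the bridge can be exercised by hand with a payload shaped like the one Alertmanager sends to webhook receivers. This is a rough sketch: the alert name `ManualTest`, its labels and the helper names are invented for the test, and the field layout follows the Alertmanager webhook format (version 4) as I understand it.

```shell
#!/bin/bash
# URL of the bridge for a given profile, assuming it listens on 127.0.0.1:8060.
bridge_url() {
  printf 'http://127.0.0.1:8060/dingtalk/%s/send\n' "$1"
}

# Print a hand-written payload that imitates an Alertmanager webhook call.
# All alert names and label values here are made up for the test.
sample_webhook_payload() {
  cat <<'EOF'
{
  "version": "4",
  "groupKey": "manual-test",
  "status": "firing",
  "receiver": "test",
  "groupLabels": {"alertname": "ManualTest"},
  "commonLabels": {"alertname": "ManualTest", "severity": "warning"},
  "commonAnnotations": {"summary": "manual test alert"},
  "externalURL": "http://127.0.0.1:9093",
  "alerts": [
    {
      "status": "firing",
      "labels": {"alertname": "ManualTest", "severity": "warning"},
      "annotations": {"summary": "manual test alert"},
      "startsAt": "2019-12-01T00:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "http://127.0.0.1:9090/graph"
    }
  ]
}
EOF
}

# Post the sample payload to the bridge; needs curl and a running bridge.
send_test_alert() {
  sample_webhook_payload | curl --silent --show-error --fail \
    --header 'Content-Type: application/json' \
    --data-binary @- "$(bridge_url "${1:-test}")"
}
```

Running `send_test_alert test` should make a message appear in the DingTalk group if the robot and the profile are set up correctly.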

Start Alertmanager:
# cat alertmanager.sh

#!/bin/bash
nohup alertmanager --config.file="/data/alertmanager/alertmanager.yml" --storage.path="/data/alertmanager/data" --web.listen-address="0.0.0.0:9093" >>/data/alertmanager/nohub.out 2>&1 &

Alertmanager web UI:
http://ip:9093

3. Prometheus

GitHub: [Releases · prometheus/prometheus · GitHub](https://github.com/prometheus/prometheus/releases)

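3.1 Prometheus components
1) prometheus: the main program. It scrapes and stores the data, and exposes PromQL for querying and aggregating the monitoring data.
2) *_exporter: exposes an endpoint from which Prometheus Server collects data. Prometheus polls these exporters and stores what it scrapes.
3) alertmanager: handles alerting, delivered through email or DingTalk.
4) pushgateway: for short-lived processes such as batch jobs, Prometheus provides the Push Gateway. These clients push their data to the Push Gateway, which then exposes it through a pull interface for Prometheus Server to scrape.

Prometheus collects data mainly by pull, so a monitored host only has to expose an HTTP endpoint, which keeps its load and resource usage low.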
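
For the push side, a batch job can hand a sample to the Push Gateway with a single HTTP request. The sketch below assumes a Push Gateway at 127.0.0.1:9091, which this post does not install; the helper names, the job name and the metric name are made up for illustration.

```shell
#!/bin/bash
# Format one sample in the Prometheus text format: "<name> <value>".
metric_line() {
  printf '%s %s\n' "$1" "$2"
}

# The Push Gateway accepts pushes under /metrics/job/<job name>.
pushgateway_url() {
  printf 'http://%s/metrics/job/%s\n' "$1" "$2"
}

# Push a single sample; needs curl and a reachable Push Gateway.
push_metric() {
  # $1 = host:port, $2 = job name, $3 = metric name, $4 = value
  metric_line "$3" "$4" | curl --silent --show-error --fail \
    --data-binary @- "$(pushgateway_url "$1" "$2")"
}
```

A nightly job could end with `push_metric 127.0.0.1:9091 nightly_backup backup_last_success_timestamp_seconds "$(date +%s)"`, and Prometheus would then scrape the value from the Push Gateway like from any other target.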

3.2 Installation
Download: [Releases · prometheus/prometheus · GitHub](https://github.com/prometheus/prometheus/releases)
Download the version you need from GitHub, then extract it:

wget https://github.com/prometheus/prometheus/releases/download/v2.3.2/prometheus-2.3.2.linux-amd64.tar.gz
tar xf prometheus-2.3.2.linux-amd64.tar.gz -C /data ;cd /data
mv prometheus-2.3.2.linux-amd64 prometheus

Then edit the configuration file and define a scrape job for each monitored target:
# cat prometheus.yml

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
#remote_write:
#  - url: "http://10.2.79.208:9201/write"
#remote_read:
#  - url: "http://10.2.79.208:9201/read"
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 127.0.0.1:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "/data/prometheus/mongodb-rules.yml"
  - "/data/prometheus/consul-rules.yml"
  - "/data/prometheus/redis-rules.yml"
  - "/data/prometheus/nginx-rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'mongodb1'
    static_configs:
    - targets: ['10.10.8.70:9218']
  - job_name: 'mongodb1-system'
    static_configs:
    - targets: ['10.10.8.70:9100']

  - job_name: 'mongodb2'
    static_configs:
    - targets: ['10.10.5.108:9218']

rule_files: lists the paths of the alerting rule files. You can define your own alerting rules in them.
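
Rule files are easy to get wrong, so it helps to validate them before starting or reloading Prometheus. A minimal sketch, assuming the promtool binary from the release tarball sits at /data/prometheus/promtool; the function name `check_rule_files` is my own.

```shell
#!/bin/bash
# Path to promtool; assumed to be the copy shipped in the Prometheus tarball.
PROMTOOL="${PROMTOOL:-/data/prometheus/promtool}"

# Run "promtool check rules" on every file given and report the broken ones.
# Returns 0 only if all files pass.
check_rule_files() {
  failed=0
  for rule_file in "$@"; do
    if ! "$PROMTOOL" check rules "$rule_file"; then
      echo "invalid rule file: $rule_file" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

For the setup in this post the call would be `check_rule_files /data/prometheus/*-rules.yml`. The whole configuration can be checked in a similar way with `promtool check config /data/prometheus/prometheus.yml`.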

# cat consul-rules.yml

---
groups:
- name: consul
  rules:
  - alert: consul_catalog_service_node_healthy
    expr: consul_catalog_service_node_healthy < 1
    for: 60s
    labels:
      severity: critical
    annotations:
      description: '{{ $labels.node }} {{ $labels.service_id }} is unhealthy'
      summary: 'a service is unhealthy, check it in Consul'

  - alert: consul_node_health
    expr: consul_exporter_build_info < 1
    for: 60s
    labels:
      severity: critical
    annotations:
      description: '{{ $labels.instance }} consul server is down'
      summary: 'consul server is down'

  - alert: consul_health_service_status
    expr: consul_health_service_status < 1
    for: 60s
    labels:
      severity: critical
    annotations:
      description: '{{ $labels.node }} {{ $labels.service_id }} is unhealthy'
      summary: 'a service is unhealthy, check it in Consul'

# cat mongodb-rules.yml

---
groups:
- name: mongodb
  rules:
  - alert: mongodb_mongod_connections
    expr: mongodb_mongod_connections{state='current'} and mongodb_mongod_connections < 0
    for: 10s
    labels:
      severity: critical
    annotations:
      description: '{{ $labels.instance }} of {{ $labels.job }}: connection count is low'
      summary: 'connections are too low, MongoDB may be down'

  - alert: mongodb_mongod_connections
    expr: mongodb_mongod_connections{state='current'} and mongodb_mongod_connections > 600
    for: 10s
    labels:
      severity: warning
    annotations:
      description: '{{ $labels.instance }} of {{ $labels.job }}: connection count is high'
      summary: 'too many connections'

  - alert: mongodb_mongod_memory
    expr: mongodb_mongod_memory{type='virtual'} and mongodb_mongod_memory < 5000
    for: 5s
    labels:
      severity: critical
    annotations:
      description: '{{ $labels.instance }} of {{ $labels.job }}: {{ $labels.type }} memory is too low'
      summary: 'MongoDB may be down'

  - alert: mongodb_mongod_replset_member_health
    expr: mongodb_mongod_replset_member_health != 1
    for: 5s
    labels:
      severity: critical
    annotations:
      description: '{{ $labels.name }} {{ $labels.state }} is down'
      summary: 'one of the replica set nodes is down'

  - alert: mongodb_mongod_replset_my_state
    expr: mongodb_mongod_replset_my_state{job='mongodb3'} and mongodb_mongod_replset_my_state != 1
    for: 5s
    labels:
      severity: critical
    annotations:
      description: 'the replica set master has changed, {{ $labels.job }} is not master'
      summary: 'mongodb3 master is down, check the status'

# cat redis-rules.yml

---
groups:
- name: redis
  rules:
  - alert: redis_instantaneous_ops_per_sec
    expr: redis_instantaneous_ops_per_sec < 50
    for: 120s
    labels:
      severity: critical
    annotations:
      description: '{{ $labels.job }} is unhealthy'
      summary: 'redis-prod ops/sec is too low, Redis may be congested; check it with "redis-cli slowlog get"'

# cat nginx-rules.yml

---
groups:
- name: nginx-exporter
  rules:
  - alert: status_code_499
    expr: status_code_499 > 300
    for: 60s
    labels:
      severity: critical
    annotations:
      description: 'status_code_499: {{ $value }}'
      summary: 'too many nginx 499 responses, check the load balancer log /var/log/nginx/share.log'

  - alert: status_code_400
    expr: status_code_400 > 50
    for: 60s
    labels:
      severity: critical
    annotations:
      description: 'status_code_400: {{ $value }}'
      summary: 'too many nginx 400 responses, check the load balancer log /var/log/nginx/share.log'

The nginx metrics come from an exporter I wrote myself: https://github.com/cuishuaigit/nginx_exporter

Start Prometheus:
# cat prometheus.sh

#!/bin/bash
nohup prometheus --config.file="/data/prometheus/prometheus.yml" --web.listen-address="0.0.0.0:9090"  --storage.tsdb.path="/data/prometheus/data"  --web.console.libraries="/data/prometheus/console_libraries"  --web.console.templates="/data/prometheus/consoles"  --web.enable-admin-api --log.level=info >>/data/prometheus/nohub.out 2>&1 &
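
Because the services are started in the background with nohup, a failed start is easy to miss. The sketch below polls an HTTP endpoint until it answers. The function name `wait_for_url` and the `CURL` and `RETRY_DELAY` variables are my own additions; the Prometheus path /-/healthy is the server's health endpoint, while using the same path on Alertmanager 0.15.1 is an assumption I have not verified.

```shell
#!/bin/bash
# Poll a URL until it answers with an HTTP success code or the attempts run out.
# $1 = URL, $2 = number of attempts (default 10).
# CURL and RETRY_DELAY can be overridden, which is mainly useful for testing.
wait_for_url() {
  url=$1
  attempts=${2:-10}
  while [ "$attempts" -gt 0 ]; do
    if ${CURL:-curl} --silent --fail --output /dev/null "$url"; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "${RETRY_DELAY:-1}"
  done
  return 1
}
```

After running prometheus.sh, `wait_for_url http://127.0.0.1:9090/-/healthy || tail /data/prometheus/nohub.out` would show the end of the log if Prometheus did not come up.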