Prometheus Notes

Date: 2019-11-24

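Preface

Prometheus is a monitoring application, similar to Nagios.

Installation

1. Download the prometheus-2.2.0.linux-amd64 archive from the official site, extract it, and run ./prometheus. The important part is the configuration file.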

a. To hot-reload the configuration file remotely, start Prometheus with the --web.enable-lifecycle flag. The reload call is: curl -X POST http://localhost:9090/-/reload
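
The same reload call can be made from a script. The following is a minimal sketch using only the Python standard library, assuming Prometheus was started with --web.enable-lifecycle and listens on localhost:9090. The function names build_reload_request and reload_config are made up for illustration.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build the POST request for the /-/reload lifecycle endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    # An empty body plus method="POST" mirrors `curl -X POST`.
    return urllib.request.Request(url, data=b"", method="POST")


def reload_config(base_url="http://localhost:9090", timeout=5):
    """Ask Prometheus to re-read its config; returns the HTTP status code."""
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Calling reload_config() against a running server should behave like the curl command. If the lifecycle flag is missing, urlopen raises urllib.error.HTTPError instead of returning a status.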

b. The key thing to master is the prometheus.yml configuration file, which Prometheus loads at startup.

[root@vm-local1 prometheus-2.2.0.linux-amd64]# cat prometheus.yml 
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["localhost:9093"]  #Alertmanager must be downloaded separately; it listens on port 9093 by default
      
    

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - rules/mengyuan.rules     #to send alerts you have to write rules; this names the rule file

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:    #scrape configuration, i.e. which hosts to scrape
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'  #job name

    # metrics_path defaults to '/metrics'  #the URL path scraped on each target is /metrics by default
    # scheme defaults to 'http'.   #the scheme is http

    static_configs:
      - targets: ['localhost:9090','localhost:9100']
        labels:
          group: 'zus'    #targets are the hosts (clients) to scrape; I have two here and label both with group zus
  - job_name: dj   #a second job
    static_configs:
      - targets: ['localhost:8000']  #my custom client written with Django

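Note:

localhost:9090 is Prometheus itself, which exposes its own metrics endpoint by default. Port 9100 is node_exporter, a monitoring client provided by the Prometheus project.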

2. Install the Prometheus client

Download the node_exporter-0.16.0-rc.1.linux-amd64 client from the official site, extract it, and run ./node_exporter. It listens on port 9100 by default.
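
To check what the client actually serves, its /metrics text can be read into a dictionary. This is a rough sketch, not a full parser: it assumes each sample line ends with the value (no trailing timestamp), and parse_samples is a hypothetical helper name.

```python
def parse_samples(text):
    """Parse simple sample lines from a /metrics response into a dict."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field; everything before
        # it (metric name plus optional {labels}) is kept as the key.
        name, value = line.rsplit(None, 1)
        samples[name] = float(value)
    return samples
```

The input text could be fetched with urllib.request.urlopen("http://localhost:9100/metrics").read().decode(), assuming node_exporter is running locally on its default port.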

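3. Writing a custom client is simple: the endpoint only has to return data in the format shown below. I use Django here, but anything works as long as the format is correct.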

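from django.http import HttpResponse

def metrics(req):
    # One "name value" sample per line; the last line must end with a newline.
    ss = "feiji 32" + "\n" + "caidian 31" + "\n"
    return HttpResponse(ss, content_type="text/plain")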
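
If the values come from a dictionary rather than a hard-coded string, the response body can be built by a small helper. A minimal sketch, where render_metrics is a hypothetical name and the metric names are the ones used in the view above:

```python
def render_metrics(values):
    """Render a {name: number} mapping as 'name value' lines."""
    # Every sample gets its own line, and every line ends with a newline.
    return "".join(
        "{} {}\n".format(name, value) for name, value in values.items()
    )
```

For example, render_metrics({"feiji": 32, "caidian": 31}) produces the two-line body, which can be passed straight to HttpResponse.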

4. Write the rules/mengyuan.rules rule file. Rules are the prerequisite for sending alerts.

[root@vm-local1 rules]# cat mengyuan.rules 
groups:
- name: zus
  rules:

  # Alert for any instance that is unreachable for >5 minutes.
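  - alert: InstanceDown   #alert name, anything you like
    expr: up == 0   #an expression: up is 0 when the target cannot be scraped (for example, the host is down); the alert triggers when the condition is true, and the value is available as $value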
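    for: 5s         #if the condition still holds after 5s, the alert fires and is sent to Alertmanager (the "5 minutes" wording in this rule's comment and description does not match this value)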
    labels:
      severity: page  #a label attached to this type of alert
    annotations:
      summary: "Instance {{ $labels.instance }} down dangqian  {{ $value }}"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
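
The up series that this rule tests can also be inspected through the Prometheus HTTP API (GET /api/v1/query?query=up). Below is a small sketch that picks the down instances out of such a response. It assumes the standard instant-query JSON shape, and down_instances is a made-up helper name.

```python
import json


def down_instances(payload):
    """Return the instances whose `up` sample is 0 in an instant-query response."""
    data = json.loads(payload)
    if data.get("status") != "success":
        raise ValueError("query failed: {!r}".format(data.get("error")))
    down = []
    for sample in data["data"]["result"]:
        # Each vector sample is {"metric": {labels}, "value": [timestamp, "string"]}.
        _timestamp, value = sample["value"]
        if float(value) == 0:
            down.append(sample["metric"].get("instance"))
    return down
```

This only mirrors the expr part of the rule; the for clause is evaluated by Prometheus itself and is not modelled here.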

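5. Now install Alertmanager

a. Download alertmanager-0.15.0-rc.1.linux-amd64 from the official site.

Again, the important part is the configuration file. Create it and edit it: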

[root@vm-local1 alertmanager-0.15.0-rc.1.linux-amd64]# cat alertmanager.yml 
route:
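  receiver: mengyuan2  #default receiver name; one is required, and it must match a - name under receivers
  group_wait: 1s  #wait 1s before sending
  group_interval: 1s #sending interval of 1s
  repeat_interval: 1m  #wait 1m before sending the same alert again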
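  group_by: ["zus"]   #group_by takes label names; "zus" is a label value here, which is why groupLabels is empty in the payload in step 6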
  routes:      #routing: alerts whose labels match severity: page go to receiver mengyuan; if routes is omitted, everything goes to the default mengyuan2
  - receiver: mengyuan  
    match:
      severity: page

receivers:
- name: 'mengyuan'
  webhook_configs:  #I use the webhook receiver type here; alert notifications are POSTed to 127.0.0.1:8000
  - url: http://127.0.0.1:8000
    send_resolved: true
- name: 'mengyuan2'
  webhook_configs:
  - url: http://127.0.0.1:8000/2
    send_resolved: true
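
The routing decision in this file can be pictured with a few lines of Python. This is a deliberately simplified model, not how Alertmanager is implemented: it handles only one level of routes and exact match comparisons, and pick_receiver and ROUTE are names invented for the sketch.

```python
def pick_receiver(labels, route):
    """Return the receiver name for an alert's labels under a one-level route tree."""
    for child in route.get("routes", []):
        match = child.get("match", {})
        # A child route applies only if every label in `match` is equal.
        if all(labels.get(key) == value for key, value in match.items()):
            return child["receiver"]
    # No child matched: fall back to the top-level (default) receiver.
    return route["receiver"]


# The route section of alertmanager.yml above, written as a dict.
ROUTE = {
    "receiver": "mengyuan2",
    "routes": [{"receiver": "mengyuan", "match": {"severity": "page"}}],
}
```

With this model, an alert labelled severity: page is sent to mengyuan, and anything else falls through to mengyuan2.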

6. Receiving the alert messages in Django

Django's request.body receives JSON data, roughly like this:

{"receiver":"mengyuan","status":"resolved","alerts":[{"status":"resolved","labels":{"alertname":"InstanceDown","group":"zus","instance":"localhost:9100","job":"prometheus","severity":"page"},"annotations":{"description":"localhost:9100 of job prometheus has been down for more than 5 minutes.","summary":"Instance localhost:9100 down dangqian 0"},"startsAt":"2018-04-06T22:34:13.51281763+08:00","endsAt":"2018-04-06T23:07:43.514552824+08:00","generatorURL":"http://vm-local1:9090/graph?g0.expr=up+%3D%3D+0\u0026g0.tab=1"}],"groupLabels":{},"commonLabels":{"alertname":"InstanceDown","group":"zus","instance":"localhost:9100","job":"prometheus","severity":"page"},"commonAnnotations":{"description":"localhost:9100 of job prometheus has been down for more than 5 minutes.","summary":"Instance localhost:9100 down dangqian 0"},"externalURL":"http://vm-local1:9093","version":"4","groupKey":"{}/{severity=\"page\"}:{}"}

At this point I can act on the received data and call a mail API or some other third-party alerting API.
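
As a starting point for that step, the payload can be reduced to one readable line per alert before it is handed to a mail or messaging API. A minimal sketch, assuming the JSON shape shown above; summarize_alerts is a hypothetical helper, and the actual sending is left out.

```python
import json


def summarize_alerts(body):
    """Turn an Alertmanager webhook body (bytes or str) into one line per alert."""
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append("[{}] {} on {}: {}".format(
            alert.get("status", "unknown"),
            labels.get("alertname", "?"),
            labels.get("instance", "?"),
            annotations.get("summary", ""),
        ))
    return lines
```

Inside a Django view this would be called as summarize_alerts(request.body), and the resulting lines joined into the message text.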

Summary:

I am only a beginner myself. These are just my notes.
