Prometheus

Date: 2020-06-02

Tags: Prometheus


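Prometheus is an open-source toolkit for system monitoring and alerting.

Its main features are:

A multi-dimensional data model (time series identified by a metric name and key/value pairs)

A flexible query language

No dependency on distributed storage

Time series collection via a pull model over HTTP

Support for pushing time series through an intermediary gateway

Monitoring targets discovered via service discovery or static configuration

Multiple modes of graphing and dashboard support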
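To make the data model concrete: a time series is identified by its metric name together with its set of label key/value pairs. The following is a minimal Python sketch of that identity. The metric and label names (`http_requests_total`, `method`, `code`) are illustrative assumptions, not taken from the article.

```python
def series_id(name, labels):
    """Render a series identity as name{k1="v1",k2="v2"} with labels sorted by key."""
    if not labels:
        return name
    pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{pairs}}}"


# Same metric name, different label values: two distinct time series.
a = series_id("http_requests_total", {"method": "GET", "code": "200"})
b = series_id("http_requests_total", {"method": "POST", "code": "200"})
print(a)
print(b)
```

Sorting the labels means the same label set always yields the same identity, regardless of the order in which the pairs were supplied.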

Components

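Prometheus consists of multiple components, many of which are optional:

The main Prometheus server, which scrapes and stores time series data

Client libraries for instrumenting application code

A push gateway for short-lived jobs

A Rails/SQL-based GUI dashboard

Special-purpose exporters (for HAProxy, StatsD, Ganglia, and others)

An alertmanager for handling alerts

A command-line query tool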

Architecture

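The overall architecture of Prometheus and its components:

Prometheus scrapes metrics either directly from targets or, for short-lived jobs, through an intermediary push gateway. It stores all scraped data locally and runs predefined rules over that data to produce new time series or to send alerts. PromDash or other API clients can then visualize the collected data.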

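1. Prometheus Server:

Responsible for scraping and storing data, and provides support for the PromQL query language.

2. Client SDKs:

Official client libraries are available for Go, Java, Scala, Python, and Ruby. Many third-party libraries also exist, covering Node.js, PHP, Erlang, and others.

3. Push Gateway:

An intermediary gateway that lets short-lived jobs push their metrics.

4. PromDash:

A dashboard built with Rails for visualizing metric data.

5. Exporter:

"Exporter" is the collective name for Prometheus's data collection components. An exporter gathers data from its target and converts it into a format Prometheus supports. Unlike traditional collection agents, it does not send data to a central server; it waits for the central server to come and scrape it.

Prometheus offers many kinds of exporters for collecting the runtime state of different services. Those currently available cover databases, hardware, message brokers, storage systems, HTTP servers, JMX, and more.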
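The pull model described above can be illustrated with a small exporter-like HTTP endpoint. This is a minimal sketch using only the Python standard library, not the official client library; the metric names and values in `METRICS` and the port in `run` are assumptions chosen for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric values; a real exporter would read them from its target.
METRICS = {"demo_queue_length": 7, "demo_workers_busy": 3}


def render(metrics):
    """Render metrics as text: one '<name> <value>' sample per line."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run(port=8000):
    """Serve /metrics until interrupted; the exporter never contacts Prometheus itself."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


# Show what a scrape of /metrics would return.
print(render(METRICS), end="")
```

Calling `run()` starts the server. The process only answers requests, so Prometheus has to be told where to find it through a scrape target such as `host:8000`.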

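6. Alertmanager:

The alert manager, which handles alert notifications.

7. prometheus_cli:

A command-line tool.

8. Other auxiliary tools:

Various export tools that convert data between Prometheus and the formats used by tools such as HAProxy, StatsD, and Graphite.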

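How Prometheus works

1. The Prometheus daemon periodically scrapes metrics from its targets; each target must expose an HTTP endpoint for it to scrape. Targets can be specified through the configuration file, text files, ZooKeeper, Consul, DNS SRV lookup, and other mechanisms. Prometheus monitors by pulling: the server pulls data directly from the targets, or indirectly from an intermediary gateway to which clients push.

2. Prometheus stores all scraped data locally, cleans up and aggregates it according to configured rules, and records the results as new time series.

3. The collected data can be visualized through PromQL and other APIs. Prometheus supports many charting options, such as Grafana, the bundled PromDash, and its own template engine. It also provides an HTTP API for queries, so you can produce whatever custom output you need.

4. The Pushgateway lets clients actively push metrics to it, while Prometheus simply scrapes the gateway on its regular schedule.

5. Alertmanager is a component independent of Prometheus. Alert conditions are written as Prometheus query expressions, and Alertmanager provides very flexible ways of delivering the notifications.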
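The HTTP API mentioned in step 3 can be sketched as follows. The code builds the URL for an instant query against `/api/v1/query` and extracts values from a response. The `sample` document is hand-written in the shape of an instant-vector response and is not real server output; the base URL assumes a local server on the default port 9090.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against /api/v1/query."""
    query = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{query}"


def extract_values(response_text):
    """Map each series' `instance` label to its value as a float."""
    payload = json.loads(response_text)
    if payload["status"] != "success":
        raise ValueError(payload.get("error", "query failed"))
    return {
        r["metric"].get("instance", ""): float(r["value"][1])
        for r in payload["data"]["result"]
    }


# Hand-written sample in the shape of an instant-vector response.
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "up", "instance": "localhost:9090"},
         "value": [1591056000.0, "1"]},
    ]},
})

url = instant_query_url("http://localhost:9090", 'up{job="prometheus"}')
print(url)
print(extract_values(sample))
```

Sample values arrive as strings inside the JSON, which is why the sketch converts them with `float`.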

--------------------------------

Basic environment setup:

1. Installing and configuring Prometheus:

tar -xf prometheus-2.0.0.linux-amd64.tar.gz

cd prometheus-2.0.0.linux-amd64

Configuration file: cat prometheus.yml

# my global config
global:
  scrape_interval: 15s     # Scrape targets every 15 seconds. The default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093   # Alertmanager is a separate download; its default port is 9093.

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"   # Alerts require rules; list the rule files here.
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:   # Which hosts to scrape
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # - job_name: 'node_exporter'
    # metrics_path defaults to '/metrics' (the URL path scraped on each target)
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
      - targets:
        - 192.168.31.113:9100
        - 192.168.31.117:9100
        - 192.168.31.117:8000

  - job_name: mysql
    static_configs:
      - targets: ["localhost:9104"]
        labels:
          instance: db1

  - job_name: linux
    static_configs:
      - targets: ["localhost:9100"]
        labels:
          instance: db1

  - job_name: 'kubernetes-node'
    scheme: https   # scheme defaults to http; set to https here
    tls_config:
      insecure_skip_verify: true   # skip TLS certificate verification
    kubernetes_sd_configs:
      # Prometheus 2.x takes a single api_server; the api_servers list is from older releases.
      - api_server: 'http://10.3.1.141:8080'
        role: node
    relabel_configs:   # rewrite meta labels
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)   # yields labels such as kubernetes_io_hostname="xxxx", useful for Grafana graphs

-----------------------------

Start Prometheus with one of the following:

./prometheus

./prometheus --config.file=prometheus.yml

nohup ./prometheus --config.file=prometheus.yml &

---------------------------

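The configuration file is organized into these sections: global, alerting, rule_files, and scrape_configs.

global controls the Prometheus server's global settings.

scrape_interval sets how often targets are scraped.

evaluation_interval sets how often rules are evaluated; Prometheus uses the rules in rule_files to produce new time series values.

alerting lists the Alertmanager instances that alerts are sent to.

rule_files specifies the locations of the rule files.

scrape_configs defines which resources Prometheus monitors. Prometheus exposes its own metrics over HTTP, so it can also monitor its own health.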

------------------------------------------

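2. Installing node_exporter

node_exporter collects monitoring information about a server.

node_exporter listens on port 9100 by default, and Prometheus scrapes its data from there.

tar -xf node_exporter-0.15.1.linux-amd64.tar.gz

cd node_exporter-0.15.1.linux-amd64

./node_exporter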

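A custom client

Any endpoint works as long as the data it returns is in this format. Django is used here, but only the format matters.

from django.http import HttpResponse

def metrics(req):
    # One "<metric_name> <value>" sample per line, ending with a newline
    ss = "feiji 32" + "\n" + "caidian 31" + "\n"
    return HttpResponse(ss, content_type="text/plain")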
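To show what a scraper reads from such a response, here is a deliberately simplified parser for `<name> <value>` lines. It is a sketch for illustration only and is not how Prometheus parses input: labels, timestamps, and escaping are ignored.

```python
def parse_samples(text):
    """Parse simple '<name> <value>' lines, skipping blank and '#' comment lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.split()
        samples[name] = float(value)
    return samples


# The body returned by the Django view in the article.
body = "feiji 32\ncaidian 31\n"
print(parse_samples(body))
```

Each line becomes one sample, so the view above yields two metrics named `feiji` and `caidian`.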

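3. Write the rules file rules/mengyuan.rules; rules are a prerequisite for sending alerts

vi mengyuan.rules

groups:
- name: zus
  rules:
  # Alert for any instance that is unreachable for >5 seconds.
  - alert: InstanceDown   # the alert name is arbitrary
    expr: up == 0   # an expression: up is 0 when the target cannot be scraped (for example, the host is down); when true the alert triggers, and the value is available as $value
    for: 5s   # if the condition still holds after 5s, the alert is sent to Alertmanager
    labels:
      severity: page   # a label attached to this kind of alert
    annotations:
      summary: "Instance {{ $labels.instance }} down current {{ $value }}"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 seconds."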
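The effect of the `for` clause can be illustrated with a simplified model of rule evaluation: an alert is pending while its condition holds, and firing once the condition has held for at least the `for` duration. This sketch is an approximation for illustration, not the Prometheus implementation. It assumes the 15s evaluation interval from the global config and the `for: 5s` from the rule.

```python
def alert_states(up_values, eval_interval, for_duration):
    """Return the alert state after each rule evaluation.

    up_values: value of `up` at each evaluation (1 = reachable, 0 = down).
    eval_interval, for_duration: seconds.
    Simplified model of `expr: up == 0` combined with a `for` clause.
    """
    states = []
    active_since = None  # evaluation time at which the condition first held
    for i, up in enumerate(up_values):
        now = i * eval_interval
        if up != 0:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_duration else "pending")
    return states


states = alert_states([1, 0, 0, 1, 0], eval_interval=15, for_duration=5)
print(states)
```

In this model, a `for` value shorter than the evaluation interval means the alert fires on the second consecutive failing evaluation, since the condition is only checked once per interval.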

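4. Installing Alertmanager

Download and install alertmanager-0.15.0-rc.1.linux-amd64

Create or edit the configuration:

cat alertmanager.yml

route:
  receiver: mengyuan2   # default receiver (required); must match a "- name" under receivers
  group_wait: 1s        # wait 1s before sending the first notification for a new group
  group_interval: 1s    # wait 1s between notifications for the same group
  repeat_interval: 1m   # wait 1m before re-sending a notification
  group_by: ["zus"]
  routes:   # alerts labelled severity: page go to receiver mengyuan; without routes, everything goes to the default mengyuan2
  - receiver: mengyuan
    match:
      severity: page

receivers:
- name: 'mengyuan'
  webhook_configs:   # webhook receiver: alert notifications are sent to the configured URL, here 127.0.0.1:8000
  - url: http://127.0.0.1:8000
    send_resolved: true
- name: 'mengyuan2'
  webhook_configs:
  - url: http://127.0.0.1:8000/2
    send_resolved: true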
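The routing behaviour configured in alertmanager.yml can be sketched as a small lookup: an alert goes to the first child route whose `match` labels all agree with the alert's labels, and otherwise to the default receiver. This is a simplified model that assumes every route names a receiver and ignores options such as `match_re` and `continue`. The `DiskFull` alert is a hypothetical example.

```python
def pick_receiver(alert_labels, route):
    """Return the receiver for an alert: first matching child route, else the default."""
    for child in route.get("routes", []):
        match = child.get("match", {})
        if all(alert_labels.get(k) == v for k, v in match.items()):
            return pick_receiver(alert_labels, child)
    return route["receiver"]


# Mirrors the route section of the article's alertmanager.yml.
route = {
    "receiver": "mengyuan2",
    "routes": [{"receiver": "mengyuan", "match": {"severity": "page"}}],
}

paged = pick_receiver({"alertname": "InstanceDown", "severity": "page"}, route)
other = pick_receiver({"alertname": "DiskFull", "severity": "warning"}, route)
print(paged)
print(other)
```

Because the InstanceDown rule sets `severity: page`, its notifications reach `mengyuan`; any alert without that label falls through to `mengyuan2`.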

# A further prometheus.yml scrape job that uses file-based service discovery
- job_name: '***'
  scrape_interval: 120s
  scrape_timeout: 30s
  file_sd_configs:
    - files:
      - /prometheus/*.json
  relabel_configs:
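Files matched by `file_sd_configs` contain a JSON list of target groups, each with a `targets` list and optional `labels`. The sketch below generates such a document. The file name `nodes.json` and the `env` label are hypothetical; the addresses reuse node_exporter targets that appear earlier in the article.

```python
import json


def file_sd_document(groups):
    """Serialize (targets, labels) pairs into the JSON list read by file_sd_configs."""
    return json.dumps(
        [{"targets": list(targets), "labels": dict(labels)} for targets, labels in groups],
        indent=2,
    )


# Hypothetical content for a file such as /prometheus/nodes.json.
doc = file_sd_document([
    (["192.168.31.113:9100", "192.168.31.117:9100"], {"env": "demo"}),
])
print(doc)
```

Writing the output to a path matched by the `files` pattern lets targets be added or removed by editing the file, without changing prometheus.yml.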
