Monitoring and Alerting for Spring Boot Microservices with Prometheus + Grafana

Date: 2020-01-23

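Preface

Keywords: Prometheus; Grafana; Alertmanager; Spring Boot; Spring Boot Actuator; monitoring; alerting

In the previous article, "Spring Boot Actuator in Detail: Health Checks, Metrics Collection, and Monitoring", we looked at what the Spring Boot Actuator module does, how to configure it, and its most important endpoints.

As I mentioned there, my main goal is to add monitoring and alerting to all the microservices in our project. Bringing in Spring Boot Actuator was only the first step. In this article I will cover:

How to integrate the monitoring and alerting system Prometheus and the visualization front end Grafana
How to define custom metrics and instrument the application
How Prometheus integrates with Alertmanager to send alerts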

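Theory

Prometheus

Prometheus (普羅米修斯 in Chinese) was inspired by Google's Borgmon monitoring system. Former Google engineers began developing it as open-source software at SoundCloud in 2012, and version 1.0 was released in July 2016. Prometheus can be regarded as an open-source implementation of Borgmon, Google's internal monitoring system.

The figure below shows the architecture of Prometheus and some of its ecosystem components. Alertmanager handles alerting and Grafana handles visualization of the monitoring data; both come up again later in this article.

For now it is enough to know that Prometheus consists of:

A data collector that periodically pulls metrics over HTTP at a configured interval.
A time-series database that stores all the metric data.
A simple user interface where you can visualize, query, and monitor all the metrics.

For details, see the official Prometheus documentation.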

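Grafana

Grafana is an open-source application written in Go that pulls data from data sources such as Elasticsearch, Prometheus, Graphite, and InfluxDB and visualizes it in polished graphs.

Alerts can be sent not only by Prometheus's Alertmanager; Grafana supports alerting as well. In Grafana you define alert rules directly on the graphs, set thresholds visually, and receive notifications through channels such as DingTalk and email. Most importantly, the rules are defined in an intuitive way, and Grafana evaluates them continuously and sends notifications.

Because Grafana's alerting is comparatively limited, most alerts are sent through Prometheus Alertmanager instead.

Note that the Prometheus UI also has simple graphs, but Grafana's visualizations are far better.

Further reading:

Official documentation

Grafana 全面瓦解 (a comprehensive walkthrough of Grafana)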

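Alertmanager

Besides collecting and storing data, the Prometheus platform lets you define alerting rules. To turn those rules into actual notifications, however, it needs the Alertmanager component.

Alertmanager supports grouping (merging multiple related alerts into one notification), inhibition (suppressing certain alerts while other, related alerts are firing), and silencing (muting matching alerts for a given period of time).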
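
To make the grouping idea concrete, here is a minimal sketch in plain Java. It is not Alertmanager's implementation: the class name AlertGroupingSketch and the representation of an alert as a bare label map are assumptions made for illustration only.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch: an alert is reduced to its label set.
class AlertGroupingSketch {

    /** Buckets alerts by the value of one label; each bucket would become one notification. */
    static Map<String, List<Map<String, String>>> groupBy(
            String label, List<Map<String, String>> alerts) {
        return alerts.stream().collect(Collectors.groupingBy(
                alert -> alert.getOrDefault(label, ""), // alerts without the label share one bucket
                TreeMap::new,
                Collectors.toList()));
    }

    public static void main(String[] args) {
        List<Map<String, String>> alerts = List.of(
                Map.of("alertname", "svc-down", "instance", "a:8080"),
                Map.of("alertname", "svc-down", "instance", "b:8080"),
                Map.of("alertname", "order-error-rate-high", "instance", "a:8080"));
        groupBy("alertname", alerts).forEach((name, group) ->
                System.out.println(name + ": " + group.size() + " alert(s) in one notification"));
    }
}
```

Grouping by a single label keeps the sketch short; the real component groups by a list of labels and adds timing on top, which is left out here.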

Further reading: the official introduction to Alertmanager

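Monitoring Java applications

Monitoring modes

Monitoring systems currently collect metrics in one of two ways: push or pull.

Push-based systems include Elasticsearch, InfluxDB, and OpenTSDB. Your program has to push its metrics to the monitoring system over TCP, UDP, or a similar transport. With TCP, the application itself is easily affected if the monitoring system goes down or becomes a bottleneck; with UDP, you no longer have to worry about the monitoring system, but data is easily lost.

The main representative of the pull model is Prometheus. With pull, the application does not need to care about the state of the monitoring system. In addition, service discovery mechanisms such as DNS-SRV or Consul can add new targets automatically.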
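
As a rough illustration of the pull model, the sketch below exposes a single counter over HTTP using only the JDK's built-in HttpServer. The class name PullTargetSketch, the metric name demo_requests_total, and port 8081 are assumptions for this example; a Spring Boot application would rely on Actuator and Micrometer instead of hand-written code like this.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a pull target: it only answers GET /metrics and never
// needs to know where the monitoring system lives or whether it is up.
class PullTargetSketch {
    static final AtomicLong requests = new AtomicLong(); // incremented by application code

    /** Renders the current counter value as text that a scraper can parse. */
    static String render() {
        return "# TYPE demo_requests_total counter\n"
                + "demo_requests_total " + requests.get() + "\n";
    }

    /** Starts an HTTP server that serves the rendered metrics at /metrics. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/metrics", exchange -> {
            byte[] body = render().getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "text/plain; version=0.0.4");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        requests.incrementAndGet(); // simulate one unit of application work
        start(8081);                // port 8081 is an arbitrary choice for this sketch
    }
}
```

If the monitoring system is down, nothing happens on the application side: the endpoint simply is not called, which is the decoupling the pull model provides.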

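How monitoring works

Prometheus monitors an application in a very simple way: the process only needs to expose an HTTP endpoint that returns the current monitoring samples. A program that does this is called an exporter, and each running instance of an exporter is called a target. Prometheus periodically polls these targets to fetch samples. For the application, exposing an HTTP endpoint with the monitoring data is enough, as long as the data follows a particular format, the metrics format:

<metric name>{<label name>=<label value>, ...}

It has three parts, and the metric name and label names must each match a regular expression:

metric name: the name of the metric, which describes what the sample measures; must match [a-zA-Z_:][a-zA-Z0-9_:]*
label name: a label captures one dimension of the sample; must match [a-zA-Z_][a-zA-Z0-9_]*
label value: the value of the label; the format is unrestricted

Note that label values should come from a small, enumerable set rather than unbounded values such as user IDs or email addresses. Every distinct label combination creates a new time series, so unbounded values consume a lot of memory and defeat the purpose of collecting metrics.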
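
The naming rules can be checked mechanically. The following sketch renders one sample line and rejects names that do not match the Prometheus naming patterns; the class MetricLineSketch and its methods are hypothetical helpers for illustration, not part of Micrometer or the Prometheus client.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper that renders a single sample in the metrics format.
class MetricLineSketch {
    static final Pattern METRIC_NAME = Pattern.compile("[a-zA-Z_:][a-zA-Z0-9_:]*");
    static final Pattern LABEL_NAME = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");

    /** Renders one sample line; throws if a metric or label name is invalid. */
    static String render(String metric, Map<String, String> labels, double value) {
        if (!METRIC_NAME.matcher(metric).matches()) {
            throw new IllegalArgumentException("invalid metric name: " + metric);
        }
        // Sort labels by name so the output is deterministic.
        String body = new TreeMap<>(labels).entrySet().stream()
                .map(e -> {
                    if (!LABEL_NAME.matcher(e.getKey()).matches()) {
                        throw new IllegalArgumentException("invalid label name: " + e.getKey());
                    }
                    return e.getKey() + "=\"" + escape(e.getValue()) + "\"";
                })
                .collect(Collectors.joining(","));
        return metric + (body.isEmpty() ? "" : "{" + body + "}") + " " + value;
    }

    /** Escapes backslash, double quote, and newline inside a label value. */
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        System.out.println(render("jvm_buffer_count_buffers", Map.of("id", "direct"), 11.0));
    }
}
```

Label values are only escaped, never validated, which mirrors the "unrestricted" rule above. The cardinality advice still applies: each new value produces a new series.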

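Micrometer

The previous section outlined how Prometheus monitoring works. So how does our Spring Boot application provide such an HTTP endpoint, with data in the metrics format described above?

Recall that in "Spring Boot Actuator in Detail: Health Checks, Metrics Collection, and Monitoring" I mentioned that the Actuator module can integrate with external application monitoring systems, Prometheus among them. So how does Spring Boot Actuator connect a Spring Boot application to a monitoring system like Prometheus?

The bridge is Micrometer. Micrometer provides a common API for collecting performance data on the Java platform. The application only uses Micrometer's common API to record metrics, and Micrometer takes care of adapting them to the different monitoring systems.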

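Hands-on part 1

Next, we walk through a real demo step by step.

To create the initial demo project, see "Spring Boot Actuator in Detail: Health Checks, Metrics Collection, and Monitoring".

The hands-on material is split into two parts. This part covers integrating the application with Prometheus and Grafana to collect and visualize metrics.

Step 1. Add the dependency

To integrate a Spring Boot application with Prometheus, add the micrometer-registry-prometheus dependency.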

<!-- Micrometer Prometheus registry  -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

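Once the dependency is added, Spring Boot auto-configures a PrometheusMeterRegistry and a CollectorRegistry, which collect and export metrics in a format Prometheus can scrape (the metrics format described above).

All of this data is exposed at Actuator's /prometheus endpoint, which Prometheus scrapes periodically to collect the metrics.

Actuator's `/prometheus` endpoint

Let's stay with the demo project from the previous article and take a closer look at this endpoint. With the micrometer-registry-prometheus dependency added, opening http://localhost:8080/actuator/prometheus shows the following: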

# HELP jvm_buffer_total_capacity_bytes An estimate of the total capacity of the buffers in this pool
# TYPE jvm_buffer_total_capacity_bytes gauge
jvm_buffer_total_capacity_bytes{id="direct",} 90112.0
jvm_buffer_total_capacity_bytes{id="mapped",} 0.0
# HELP tomcat_sessions_expired_sessions_total  
# TYPE tomcat_sessions_expired_sessions_total counter
tomcat_sessions_expired_sessions_total 0.0
# HELP jvm_classes_unloaded_classes_total The total number of classes unloaded since the Java virtual machine has started execution
# TYPE jvm_classes_unloaded_classes_total counter
jvm_classes_unloaded_classes_total 1.0
# HELP jvm_buffer_count_buffers An estimate of the number of buffers in the pool
# TYPE jvm_buffer_count_buffers gauge
jvm_buffer_count_buffers{id="direct",} 11.0
jvm_buffer_count_buffers{id="mapped",} 0.0
# HELP system_cpu_usage The "recent cpu usage" for the whole system
# TYPE system_cpu_usage gauge
system_cpu_usage 0.0939447637893599
# HELP jvm_gc_max_data_size_bytes Max size of old generation memory pool
# TYPE jvm_gc_max_data_size_bytes gauge
jvm_gc_max_data_size_bytes 2.841116672E9

# ... many more lines omitted ...

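As you can see, these are the application's metrics, laid out in the metrics format described above:

<metric name>{<label name>=<label value>, ...}

Step 2. Install and configure Prometheus

For installation, see the official documentation, which is short but thorough. You can install from the binary release or run it with Docker; I won't repeat the details here.

Prometheus official website

Configure Prometheus

Next, we configure Prometheus to collect the metrics from our demo project's /actuator/prometheus endpoint.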

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
  # demo job
  - job_name: 'springboot-actuator-prometheus-test' # job name
    metrics_path: '/actuator/prometheus' # path to scrape metrics from
    scrape_interval: 5s # scrape interval for this job
    basic_auth: # Spring Security basic auth
      username: 'actuator'
      password: 'actuator'
    static_configs:
    - targets: ['10.60.45.113:8080'] # instance address; the scheme defaults to http

Pay particular attention to this part of the configuration:

  # demo job
  - job_name: 'springboot-actuator-prometheus-test' # job name
    metrics_path: '/actuator/prometheus' # path to scrape metrics from
    scrape_interval: 5s # scrape interval for this job
    basic_auth: # Spring Security basic auth
      username: 'actuator'
      password: 'actuator'
    static_configs:
    - targets: ['10.60.45.113:8080'] # instance address; the scheme defaults to http
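
To see what this job amounts to on the wire, the sketch below builds (without sending) an equivalent HTTP request using the JDK's java.net.http API. It only illustrates how the target, metrics_path, and basic_auth settings combine; the class ScrapeRequestSketch is hypothetical, and Prometheus itself is written in Go and does not use this code.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical sketch of the request implied by the demo job's settings.
class ScrapeRequestSketch {

    /** Builds the Authorization header value for HTTP basic auth. */
    static String basicAuth(String username, String password) {
        String credentials = username + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    /** Builds, but does not send, a GET request for target + metrics_path over http. */
    static HttpRequest scrapeRequest(String target, String metricsPath,
                                     String username, String password) {
        return HttpRequest.newBuilder(URI.create("http://" + target + metricsPath))
                .header("Authorization", basicAuth(username, password))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = scrapeRequest(
                "10.60.45.113:8080", "/actuator/prometheus", "actuator", "actuator");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().firstValue("Authorization").orElse("<none>"));
    }
}
```

Because basic auth only Base64-encodes the credentials, a real deployment would normally scrape over https rather than the default http scheme.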

Test

With the configuration in place, start Prometheus to try it out. If you use Docker, run the following command in the directory that contains prometheus.yml:

docker run -d -p 9090:9090 \
    -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus --config.file=/etc/prometheus/prometheus.yml

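Open http://ip:9090 and you will see a page like this:

Click "Insert metric at cursor" to choose a metric, click Graph to display it as a chart, and click the Execute button to see a result similar to the figure below:

You can also type PromQL into the input box for more advanced queries.
PromQL is Prometheus's own query language, which makes it easy to aggregate and analyze the monitoring samples.

Hot-reloading the configuration

Prometheus can reload its configuration without a restart. Note that this endpoint is only enabled when Prometheus is started with the --web.enable-lifecycle flag: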

curl -X POST http://ip:9090/-/reload

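Step 3. Install and configure Grafana

As you can see, Prometheus's built-in UI is quite bare-bones, so we bring in Grafana for friendlier, more production-ready visualization.

1. Start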

$ docker run -d --name=grafana -p 3000:3000 grafana/grafana

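2. Log in

Open http://ip:3000/login. The initial username/password is admin/admin, and you are asked to change the password on first login.

3. Configure the data source

Under Configuration, click Add Data Source. You will see a page like this:

Choose Prometheus as the data source, enter the address of your Prometheus server, and click Save & Test:

4. Create a monitoring dashboard

Click the + button in the navigation bar and then Dashboard. You will see a page similar to this:

Click Add Query to get to a page similar to this:

In the Metrics field, enter the metric you want to query. The available metrics are the ones listed at the Spring Boot application's /actuator/prometheus endpoint, for example jvm_memory_used_bytes, jvm_threads_states_threads, and jvm_threads_live_threads. Grafana offers helpful autocompletion, and you can use PromQL for more complex calculations such as aggregation, sums, and averages. To draw several lines, click the Add Query button again.

Then click Visualization below it to choose the visualization type and related options. I won't go into detail here and leave it to the reader to explore.

Next, click General for the basic settings, which I also skip here:

5. The dashboard marketplace

By now you should know how to visualize a single metric. Most people will find, though, that configuring things this way becomes tedious once there are many metrics.

Are there ready-made, general-purpose dashboard templates?

Go to Grafana Labs - Dashboards and search by keyword to find the dashboard you need 😎😎.

These existing dashboards are also a quick way to learn how panels are configured and how dashboards are used.

6. Import a dashboard

Here are two dashboards I find particularly useful:

JVM (Micrometer)
Spring Boot Statistics
A note on the second one: when I first imported it, it showed no data. If you run into the same problem, open the dashboard settings and change the Definition of the $application and $instance variables under Variables.

I also recommend customizing these two dashboards, or combining panels from both.

Importing is simple. First, pick the dashboard you want on Grafana Labs - Dashboards and note its ID.

Then click the Import button:

Enter the ID, complete the configuration, and click Import:

The result looks like this: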

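Hands-on part 2

Part 2 covers how to define custom metrics (for example business data; this is also known as instrumentation) and how to use Alertmanager for alerting.

Step 1. Custom (business) metrics

Example requirement: for an order service, monitor [real-time order amount] and [order failure rate over the last 10 minutes].

1. Create the Prometheus monitoring class `PrometheusCustomMonitor`

It defines three custom metrics:

requests_error_total: number of failed orders
order_request_count: total number of order requests
order_amount_sum: order amount statistics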

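import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import javax.annotation.PostConstruct;

@Component
public class PrometheusCustomMonitor {

    /**
     * Number of failed requests
     */
    private Counter requestErrorCount;

    /**
     * Number of order requests
     */
    private Counter orderCount;

    /**
     * Order amount statistics
     */
    private DistributionSummary amountSum;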

    private final MeterRegistry registry;

    @Autowired
    public PrometheusCustomMonitor(MeterRegistry registry) {
        this.registry = registry;
    }

    @PostConstruct
    private void init() {
        requestErrorCount = registry.counter("requests_error_total", "status", "error");
        orderCount = registry.counter("order_request_count", "order", "test-svc");
        amountSum = registry.summary("order_amount_sum", "orderAmount", "test-svc");
    }

    public Counter getRequestErrorCount() {
        return requestErrorCount;
    }

    public Counter getOrderCount() {
        return orderCount;
    }

    public DistributionSummary getAmountSum() {
        return amountSum;
    }
}

2. Add the `/order` endpoint

When flag="1", the endpoint throws an exception to simulate a failed order. The endpoint updates order_request_count and order_amount_sum.

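import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import javax.annotation.Resource;
import java.util.Random;

@RestController
public class TestController {

    @Resource
    private PrometheusCustomMonitor monitor;
    
    //....

    @RequestMapping("/order")
    public String order(@RequestParam(defaultValue = "0") String flag) throws Exception {
        // Count every order request, successful or not
        monitor.getOrderCount().increment();
        if ("1".equals(flag)) {
            // Simulate a failed order
            throw new Exception("Something went wrong");
        }
        Random random = new Random();
        int amount = random.nextInt(100);
        // Record the order amount
        monitor.getAmountSum().record(amount);
        return "Order placed, amount: " + amount;
    }
}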

PS: In a real project, record business metrics with AOP rather than putting the calls into business code as this demo does.
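
The non-invasive idea can be sketched without Spring. The example below uses a JDK dynamic proxy as a stand-in for a Spring AOP aspect, and AtomicLong fields as stand-ins for the Micrometer counters. OrderService, MeteredProxySketch, and the field names are hypothetical, chosen only to show that the business code itself never touches a metric.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: counting happens in a wrapper, not in the business code.
class MeteredProxySketch {

    /** Business interface; it knows nothing about metrics. */
    interface OrderService {
        String placeOrder(boolean fail) throws Exception;
    }

    static final AtomicLong orderCount = new AtomicLong(); // stands in for order_request_count
    static final AtomicLong errorCount = new AtomicLong(); // stands in for requests_error_total

    /** Wraps a service so each business call is counted and each failure is recorded. */
    static OrderService metered(OrderService target) {
        return (OrderService) Proxy.newProxyInstance(
                OrderService.class.getClassLoader(),
                new Class<?>[] {OrderService.class},
                (proxy, method, args) -> {
                    if (method.getDeclaringClass() == Object.class) {
                        return method.invoke(target, args); // toString() etc. are not orders
                    }
                    orderCount.incrementAndGet();
                    try {
                        return method.invoke(target, args);
                    } catch (InvocationTargetException e) {
                        errorCount.incrementAndGet();
                        throw e.getCause(); // rethrow the original business exception
                    }
                });
    }

    public static void main(String[] args) {
        OrderService service = metered(fail -> {
            if (fail) {
                throw new Exception("order failed");
            }
            return "ok";
        });
        try {
            service.placeOrder(false);
            service.placeOrder(true);
        } catch (Exception e) {
            System.out.println("caught: " + e.getMessage());
        }
        System.out.println("orders=" + orderCount.get() + ", errors=" + errorCount.get());
    }
}
```

In a Spring project the same separation would typically come from an @Aspect with an around advice, which is not shown here.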

3. Add the global exception handler `GlobalExceptionHandler`

It counts failed orders in requests_error_total:

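import org.springframework.web.bind.annotation.ControllerAdvice;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.ResponseBody;

import javax.annotation.Resource;

@ControllerAdvice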
public class GlobalExceptionHandler {

    @Resource
    private PrometheusCustomMonitor monitor;

    @ResponseBody
    @ExceptionHandler(value = Exception.class)
    public String handle(Exception e) {
        monitor.getRequestErrorCount().increment();
        return "error, message: " + e.getMessage();
    }
}

Test:

Start the project and call http://localhost:8080/order and http://localhost:8080/order?flag=1 to simulate successful and failed orders. Then open http://localhost:8080/actuator/prometheus; the custom metrics are now exposed by the /prometheus endpoint:

# HELP requests_error_total  
# TYPE requests_error_total counter
requests_error_total{application="springboot-actuator-prometheus-test",status="error",} 41.0
# HELP order_request_count_total  
# TYPE order_request_count_total counter
order_request_count_total{application="springboot-actuator-prometheus-test",order="test-svc",} 94.0
# HELP order_amount_sum  
# TYPE order_amount_sum summary
order_amount_sum_count{application="springboot-actuator-prometheus-test",orderAmount="test-svc",} 53.0
order_amount_sum_sum{application="springboot-actuator-prometheus-test",orderAmount="test-svc",} 2701.0
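
A summary exposes two series, _count and _sum, and dividing one by the other gives the mean recorded value. The sketch below reads the two sample lines from the output above and computes the average order amount. SummaryMeanSketch is a hypothetical helper, and it assumes sample lines without a trailing timestamp.

```java
// Hypothetical helper: derives the mean from a summary's _sum and _count lines.
class SummaryMeanSketch {

    /** Extracts the sample value, assumed to be the last space-separated token. */
    static double valueOf(String sampleLine) {
        String trimmed = sampleLine.trim();
        return Double.parseDouble(trimmed.substring(trimmed.lastIndexOf(' ') + 1));
    }

    /** Mean recorded value; NaN when nothing has been recorded yet. */
    static double mean(String sumLine, String countLine) {
        double count = valueOf(countLine);
        return count == 0 ? Double.NaN : valueOf(sumLine) / count;
    }

    public static void main(String[] args) {
        String countLine = "order_amount_sum_count{application=\"springboot-actuator-prometheus-test\","
                + "orderAmount=\"test-svc\",} 53.0";
        String sumLine = "order_amount_sum_sum{application=\"springboot-actuator-prometheus-test\","
                + "orderAmount=\"test-svc\",} 2701.0";
        System.out.println("average order amount: " + mean(sumLine, countLine));
    }
}
```

With the numbers above, 53 successful orders totalling 2701 give an average of roughly 51 per order, which is plausible for amounts drawn uniformly from 0 to 99.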

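4. Add the corresponding panels in Grafana

I create a new dashboard for this demonstration and skip the steps already covered above.

First, the order failure rate over the last 10 minutes: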

sum(rate(requests_error_total{application="springboot-actuator-prometheus-test"}[10m])) / sum(rate(order_request_count_total{application="springboot-actuator-prometheus-test"}[10m])) * 100
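
To show the arithmetic behind this query, here is a simplified sketch. It treats rate() as the plain per-second increase between the first and last sample in the window, ignoring the counter-reset handling and extrapolation that Prometheus performs, so it approximates the idea and does not reimplement PromQL. The class name and the sample numbers are made up for illustration.

```java
// Hypothetical sketch of the failure-rate arithmetic; not a PromQL implementation.
class FailureRateSketch {

    /** Per-second increase of a counter over a window; ignores resets and extrapolation. */
    static double simpleRate(double firstValue, double lastValue, double windowSeconds) {
        return (lastValue - firstValue) / windowSeconds;
    }

    /** Failed share of all orders in percent; NaN when there were no orders at all. */
    static double failurePercent(double errorRate, double orderRate) {
        return orderRate == 0 ? Double.NaN : errorRate / orderRate * 100;
    }

    public static void main(String[] args) {
        double window = 600; // 10 minutes, matching [10m]
        double errorRate = simpleRate(41, 47, window);  // 6 new errors in the window
        double orderRate = simpleRate(94, 124, window); // 30 new orders in the window
        System.out.println("failure rate (%): " + failurePercent(errorRate, orderRate));
    }
}
```

Because both rates use the same window, the window length cancels out and the result is simply new errors divided by new orders: 6 out of 30 here, about 20%.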

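Next, the total order amount:

The final result:

Step 2. Add alerting

Example alert rules:

Is the service down?

Is the order failure rate over the last 10 minutes above 10%?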

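1. Deploy Alertmanager

Here we deploy from the binary release.

The latest version of Alertmanager can be downloaded from the Prometheus website at https://prometheus.io/downloa...
After downloading and unpacking, you will find a default alertmanager.yml configuration file. Add the email settings to it: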

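# Global settings
global:
  resolve_timeout: 5m
  smtp_smarthost: 'xxxxxx'
  smtp_from: 'xxxx@xx.com'
  smtp_auth_username: 'xxxx@xx.com'
  smtp_auth_password: 'XXXXXX'
# Routing
route:
  receiver: 'default-receiver' # root route: the default receiver
  group_by: ['alertname'] # labels to group alerts by
  group_wait: 10s # how long to wait before the first notification for a new group, so related alerts can be collected into it
  group_interval: 1m  # how long to wait before notifying about new alerts added to an already notified group
  repeat_interval: 1m # how long to wait before re-sending a notification that was already sent
  routes: # child routes, selected by their match labels
  - receiver: 'rhf-mail-receiver'
    group_wait: 10s
    match: # match on a custom label
      team: rhf
# Receivers
receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'xxxx@xx.com'
- name: 'rhf-mail-receiver'
  email_configs:
  - to: 'xxxx@xx.com'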
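
The routing part of this configuration can be read as a small lookup: an alert goes to the first child route whose match labels it carries, and otherwise to the root receiver. The sketch below mirrors that reading for the two receivers above. It is a simplified illustration, not Alertmanager's routing code, and RouteMatchSketch is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of "first matching child route wins, else the root receiver".
class RouteMatchSketch {
    static final String DEFAULT_RECEIVER = "default-receiver";

    /** Child routes in declaration order: required labels mapped to a receiver name. */
    static final Map<Map<String, String>, String> CHILD_ROUTES = new LinkedHashMap<>();
    static {
        CHILD_ROUTES.put(Map.of("team", "rhf"), "rhf-mail-receiver");
    }

    /** Returns the receiver for an alert with the given labels. */
    static String receiverFor(Map<String, String> alertLabels) {
        for (Map.Entry<Map<String, String>, String> route : CHILD_ROUTES.entrySet()) {
            // A route matches when every one of its labels is present with the same value.
            if (alertLabels.entrySet().containsAll(route.getKey().entrySet())) {
                return route.getValue();
            }
        }
        return DEFAULT_RECEIVER;
    }

    public static void main(String[] args) {
        System.out.println(receiverFor(Map.of("alertname", "svc-down", "team", "rhf")));
        System.out.println(receiverFor(Map.of("alertname", "disk-full")));
    }
}
```

An alert labelled team: rhf therefore reaches rhf-mail-receiver, while anything without that label falls back to default-receiver.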

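The notification integrations currently built in include email, instant messaging (such as Slack and HipChat), mobile push (such as Pushover), and incident management tools (such as PagerDuty, Opsgenie, and VictorOps). Alertmanager also supports webhooks, which let developers add custom integrations (DingTalk, WeCom, and so on).

Further reading on configuration:

Further reading 1

Further reading 2

Start

Alertmanager stores its data locally, by default under data/, so make sure that directory exists before starting Alertmanager: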

./alertmanager

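You can also change settings with command-line flags when starting Alertmanager: --config.file sets the path to the Alertmanager configuration file, and --storage.path sets the data directory.

To check that it is running, open port 9093 after startup:

The Alerts menu shows the alerts Alertmanager has received, the Silences menu lets you create silences through the UI, and the Status menu shows Alertmanager's configuration.

Hot-reloading the configuration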

curl -X POST http://ip:9093/-/reload

2. Define the alert rules

In the Prometheus directory, create test-svc-alert-rule.yaml with the following alert rules:

groups:
- name: svc-alert-rule
  rules:
  - alert: svc-down # is the service down?
    expr: sum(up{job="springboot-actuator-prometheus-test"}) == 0
    for: 1m
    labels: # custom labels
      severity: critical
      team: rhf # our team name; corresponds to the match label in the Alertmanager route
    annotations:
      summary: "The order service is down, please check!"
  - alert: order-error-rate-high # is the 10-minute order failure rate above 10%?
    expr: sum(rate(requests_error_total{application="springboot-actuator-prometheus-test"}[10m])) / sum(rate(order_request_count_total{application="springboot-actuator-prometheus-test"}[10m])) > 0.1
    for: 1m
    labels:
      severity: major
      team: rhf
    annotations:
      summary: "The order service is responding abnormally!"
      description: "The 10-minute order error rate has exceeded 10% (current value: {{ $value }})"
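
The for: 1m clause means a rule does not fire the moment its expression becomes true: the alert first sits in a pending state and fires only once the expression has stayed true for the whole duration. The sketch below models that behavior for a single alert. ForClauseSketch and its State enum are hypothetical names, and evaluation times are passed in explicitly so the logic is easy to follow.

```java
// Hypothetical sketch of the pending -> firing transition controlled by "for".
class ForClauseSketch {
    enum State { INACTIVE, PENDING, FIRING }

    private final long forMillis;
    private long activeSince = -1; // -1 means the expression is currently false

    ForClauseSketch(long forMillis) {
        this.forMillis = forMillis;
    }

    /** Feeds the result of one rule evaluation and returns the resulting alert state. */
    State evaluate(boolean expressionTrue, long nowMillis) {
        if (!expressionTrue) {
            activeSince = -1; // any false evaluation resets the timer
            return State.INACTIVE;
        }
        if (activeSince < 0) {
            activeSince = nowMillis; // first evaluation at which the expression is true
        }
        return nowMillis - activeSince >= forMillis ? State.FIRING : State.PENDING;
    }

    public static void main(String[] args) {
        ForClauseSketch alert = new ForClauseSketch(60_000); // for: 1m
        // Evaluate every 15 seconds, matching evaluation_interval: 15s
        for (long t = 0; t <= 60_000; t += 15_000) {
            System.out.println(t / 1000 + "s -> " + alert.evaluate(true, t));
        }
    }
}
```

With a 15-second evaluation interval, the alert stays pending for the evaluations at 0s through 45s and fires at 60s. A single false evaluation in between would reset it to inactive, which is how for filters out short spikes.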

In a real project, you can keep all alert rules in a rule directory and reference them with rule/*.yaml.

3. Configure Prometheus

In prometheus.yml, reference the test-svc-alert-rule.yaml alert rules and enable Alertmanager.

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # Alertmanager listens on port 9093 by default
      - localhost:9093  
rule_files:
  - /data/prometheus-stack/prometheus/rule/*.yaml