Building a Polished Black-Box Monitoring Platform

Date: 2020-04-22

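Overview

In a monitoring system, monitoring is generally divided into white-box monitoring and black-box monitoring:

Black-box monitoring: focuses on symptoms, usually things that are happening right now, such as an alert firing or a business API misbehaving. It is monitoring from the user's point of view, and its emphasis is on alerting on failures while they are in progress.

White-box monitoring: focuses on causes, that is, metrics exposed from inside the system. For example, the Redis info output showing "redis slave down" is an internal metric. The emphasis is on the cause: black-box monitoring may show that Redis is down, while the internal information shows "redis port is refused connection".

White-box monitoring comes in many kinds, covering middleware, storage, and web servers. For example, Redis exposes internal metrics through info; MySQL exposes internal information through show variables (runtime counters come from show global status); Nginx exposes internal information through nginx_status. Business metrics can be collected through instrumentation or commands.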

Blackbox Exporter

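Earlier material covered white-box monitoring with Prometheus: we monitor host resource usage, container state, and runtime data from databases and middleware, and use the collected metrics to anticipate service health. For black-box monitoring, Blackbox Exporter is the official solution provided by the Prometheus community. It lets users probe the network over HTTP, HTTPS, DNS, TCP, and ICMP.

Blackbox Exporter use cases

HTTP tests
    Define request header information
    Check HTTP status / HTTP response headers / HTTP body content
TCP tests
    Monitor the port status of business components
    Define and monitor application-layer protocols
ICMP tests
    Host liveness checks
POST tests
    API connectivity
SSL certificate expiry time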

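Dashboards built with Grafana:

1. First, a look at our own charts: multiple portal metrics plus SSL monitoring:

2. Network route monitoring:

3. API status monitoring:

Blackbox Exporter deployment:

1. Install the exporter: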

[root@cinder1 src]# wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.16.0/blackbox_exporter-0.16.0.linux-amd64.tar.gz
[root@cinder1 src]# tar -zxvf blackbox_exporter-0.16.0.linux-amd64.tar.gz -C /usr/local
[root@cinder1 src]# mv /usr/local/blackbox_exporter-0.16.0.linux-amd64 /usr/local/blackbox

2. Register it as a systemd service:

[root@cinder1 src]# cat /etc/systemd/system/blackbox_exporter.service 
[Unit]
Description=blackbox_exporter
After=network.target 

[Service]
WorkingDirectory=/usr/local/blackbox
ExecStart=/usr/local/blackbox/blackbox_exporter \
         --config.file=/usr/local/blackbox/blackbox.yml
[Install]
WantedBy=multi-user.target

3. Check that it started correctly:

[root@cinder1 src]# ss -tunlp|grep 9115
tcp    LISTEN     0      128      :::9115                 :::*                   users:(("blackbox_export",pid=2517722,fd=3))

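ICMP monitoring

Collecting ICMP metrics lets us confirm whether the network path to a remote party has problems, which is an important part of monitoring. To find out which routes from around the country to our data center are in trouble, we came up with two approaches:
1. Collect ping and access data from nodes all over the country. Vendors such as Tingyun offer this kind of service, but it costs money.
2. The method I currently use: find ping test nodes in various regions and ping them actively from our data center to see which route is failing. Let's get started.

1. Add the scrape job to Prometheus; Blackbox Exporter can run with its default configuration:

  - job_name: "icmp_ping"
    metrics_path: /probe
    params:
      module: [icmp]  # use the icmp module
    file_sd_configs:
    - refresh_interval: 10s
      files:
      - "/home/prometheus/conf/ping_status*.yml"  # target files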
    relabel_configs:
    - source_labels: [__address__]
      regex: (.*)(:80)?
      target_label: __param_target
      replacement: ${1}
    - source_labels: [__param_target]
      target_label: instance
    - source_labels: [__param_target]
      regex: (.*)
      target_label: ping
      replacement: ${1}
    - source_labels: []
      regex: .*
      target_label: __address__
      replacement: 192.168.1.14:9115
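
To make the relabeling above easier to follow, here is a minimal Python sketch that mimics the four rules for a single target. It is only an illustration, not how Prometheus is implemented; the function name relabel_icmp_target is hypothetical. It relies on two facts: Prometheus anchors relabel regexes to the whole value, and labels starting with __param_ become URL parameters of the scrape request.

```python
import re


def relabel_icmp_target(address, exporter="192.168.1.14:9115"):
    """Mimic the icmp_ping relabel rules for one target (illustrative only)."""
    labels = {"__address__": address}

    # Rule 1: copy capture group 1 of the address into the ?target= parameter.
    # fullmatch stands in for Prometheus anchoring the regex to the whole value.
    match = re.fullmatch(r"(.*)(:80)?", labels["__address__"])
    if match:
        labels["__param_target"] = match.group(1)

    # Rules 2 and 3: expose the probed address as the instance and ping labels.
    target = labels.get("__param_target", "")
    labels["instance"] = target
    labels["ping"] = target

    # Rule 4: the scrape itself goes to the Blackbox Exporter, not the target.
    labels["__address__"] = exporter
    return labels


if __name__ == "__main__":
    print(relabel_icmp_target("220.181.38.150"))
```

One detail the sketch makes visible: because (.*) is greedy, it captures the whole value and the optional (:80) group never strips a port. That is harmless here since the ping targets carry no port.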

2. Ping target configuration:

[root@cinder1 conf]# cat ping_status.yml
- targets: ['220.181.38.150','14.215.177.39','180.101.49.12','14.215.177.39','180.101.49.11','14.215.177.38','14.215.177.38']
  labels:
    group: 'Tier-1 cities - China Telecom network monitoring'
- targets: ['112.80.248.75','163.177.151.109','61.135.169.125','163.177.151.110','180.101.49.11','61.135.169.121','180.101.49.11']
  labels:
    group: 'Tier-1 cities - China Unicom network monitoring'
- targets: ['183.232.231.172','36.152.44.95','182.61.200.6','36.152.44.96','220.181.38.149']
  labels:
    group: 'Tier-1 cities - China Mobile network monitoring'

# These addresses were collected from websites that offer ping test nodes across the country; you can get them from those sites.

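3. Add the Grafana dashboard

This Grafana dashboard is one I built myself, because I could not find a suitable one online. You can download it from GitHub. Here is how it looks:

HTTP metrics monitoring:

1. Configure Prometheus for HTTP GET probes:

  - job_name: "blackbox"
    metrics_path: /probe
    params:
      module: [http_2xx]  # use the http module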
    file_sd_configs: 
    - refresh_interval: 1m
      files: 
      - "/home/prometheus/conf/blackbox*.yml"
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 192.168.1.14:9115

2. The target file looks something like this:

[root@cinder1 conf]# cat /home/prometheus/conf/blackbox-dis.yml
- targets:
  - https://www.zhibo8.cc
  - https://www.baidu.com
# URLs to probe
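
With this job, Prometheus does not scrape the listed sites directly. It asks the Blackbox Exporter to probe them, passing the module and the target as URL parameters. The sketch below shows roughly what the resulting scrape URL looks like; the helper name probe_url is hypothetical, and the exporter address is the one used in this article.

```python
from urllib.parse import urlencode


def probe_url(exporter, module, target):
    """Build the /probe URL that a scrape of one target resolves to."""
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter}/probe?{query}"


if __name__ == "__main__":
    print(probe_url("192.168.1.14:9115", "http_2xx", "https://www.baidu.com"))
```

Opening such a URL by hand is a convenient way to debug a single target before wiring it into Prometheus.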

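3. Add a Grafana dashboard:

You can import dashboard 9965. As seen earlier, this dashboard also provides SSL expiry checks.

API GET request checks

1. The Prometheus configuration is the same as before, so let's go straight to it:

  - job_name: "check_get"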
    metrics_path: /probe
    params:
      module: [http_2xx]  # Look for a HTTP 200 response.
    file_sd_configs:
    - refresh_interval: 1m
      files:
      - "/home/prometheus/conf/service_get.yml"
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 192.168.1.14:9115

2. Example API targets:

[root@cinder1 conf]# cat service_get.yml 
- targets:
  - http://10.10.1.123:10000/pmkb/atc_tcbi
  - http://10.10.1.123:10000/pmkb/get_ship_lock_count
  - http://10.10.1.123:10000/pmkb/get_terminal_count_by_city
  - http://10.10.1.123:10000/pmkb/get_terminal_monitor?industry=1
  - http://10.10.1.123:10000/pmkb/get_terminal_comparison?industry=1
  - http://10.10.1.123:10000/pmkb/get_terminal_city_count_industry?industry=1
  - http://10.10.1.123:10000/pmkb/industry_stat?industry=1
  - http://10.10.1.123:10000/pmkb/get_company_car_count?industry=1
  - http://10.10.1.123:10000/pmkb/get_terminal_month_countbyi?industry=1
  labels:
    group: 'service'

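3. The Grafana dashboard is custom-built like the earlier one and can be downloaded from GitHub.

API POST request status checks:

1. First, edit blackbox.yml for the POST endpoint and define a module of our own:

[root@cinder1 blackbox]# cat blackbox.yml
modules:
  http_2xx:
    prober: http
  http_post_2xx:   # the module name is up to you
    prober: http
    http:
      method: POST
      headers:
        Content-Type: application/json   # request header to add
      body: '{"username":"admin","password":"123456"}'  # request payload; a login API is used as the example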
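
As a rough picture of what this module asks the exporter to do, the following Python sketch builds an equivalent request without sending it. It is an approximation for illustration, not the exporter's actual code; build_post_probe_request and is_2xx are hypothetical names. The 2xx check reflects the module default when no explicit status codes are listed.

```python
import json
import urllib.request


def build_post_probe_request(url, payload):
    """Build (but do not send) a POST request shaped like the http_post_2xx module."""
    # Compact separators reproduce the body string used in blackbox.yml.
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_2xx(status_code):
    """Return True for status codes the probe counts as success by default."""
    return 200 <= status_code < 300


if __name__ == "__main__":
    req = build_post_probe_request(
        "http://10.2.4.103:5000/devops/api/v1.0/login",
        {"username": "admin", "password": "123456"},
    )
    print(req.get_method(), req.full_url, req.data)
```

Passing the request to urllib.request.urlopen with a timeout would perform the call; that step is left out because the endpoint in this article is internal.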

2. Add it to Prometheus:

  - job_name: "check_service"
    metrics_path: /probe
    params:
      module: [http_post_2xx]  # must match the module name defined in blackbox.yml
    file_sd_configs: 
    - refresh_interval: 1m
      files: 
      - "/home/prometheus/conf/service_post.yml"
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 192.168.1.14:9115

3. Target configuration:

[root@cinder1 conf]# cat service_post.yml 
- targets:
  - http://10.2.4.103:5000/devops/api/v1.0/login
  labels:
    group: 'service'

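4. Add the Grafana dashboard. This one is also custom-built and can be downloaded from GitHub.

TCP port status checks:

As I understand it, this works much like telnet: it checks whether a port is reachable.

1. Prometheus configuration:

  - job_name: 'port_status'
    metrics_path: /probe
    params:
      module: [tcp_connect]  # use the tcp module
    static_configs:
      - targets: ['10.10.1.35:8068','10.10.1.35:8069']  # host:port pairs to check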
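        labels:
          group: 'tcp'
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance  # one series per probed host:port instead of a shared static instance label
    - target_label: __address__
      replacement: 192.168.1.14:9115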
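
The telnet comparison can be made concrete with a few lines of Python. This is a minimal sketch of a TCP connect check, assuming a plain connection attempt with a timeout is all that is needed; tcp_port_open is a hypothetical helper, not part of Blackbox Exporter.

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the TCP handshake; closing right away is enough.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    for port in (8068, 8069):
        print("10.10.1.35", port, tcp_port_open("10.10.1.35", port, timeout=2.0))
```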

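2. Charts:

The charts can be integrated into the Grafana 9965 dashboard from earlier:

Alert rule definitions:

1. Service health:
For ICMP, TCP, HTTP, and POST probes, the probe_success metric shows whether the check passed:
probe_success == 0 ## connectivity failure
probe_success == 1 ## connectivity OK
The alert simply checks whether this metric equals 0 and fires when it does.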
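
When debugging a probe by hand, the same check can be applied to the text returned by the exporter's /probe endpoint. The sketch below reads probe_success out of a Prometheus text-format response; parse_probe_success is a hypothetical helper and the sample text is illustrative, not captured output.

```python
def parse_probe_success(metrics_text):
    """Return True/False from the probe_success sample, or None if it is absent."""
    for line in metrics_text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if parts[0] == "probe_success" and len(parts) >= 2:
            return float(parts[1]) == 1.0
    return None


if __name__ == "__main__":
    # Illustrative sample only; a real response contains many more metrics.
    sample = "# TYPE probe_success gauge\nprobe_success 0\n"
    print(parse_probe_success(sample))
```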

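2. The http module also reports the certificate expiry time, so we can add alerts based on it:

probe_ssl_earliest_cert_expiry: the certificate expiry time, as a Unix timestamp in seconds.

# Converting units gives the remaining time in days: (probe_ssl_earliest_cert_expiry - time())/86400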
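
The same arithmetic, written out in Python as a small sketch. The function names are hypothetical; the 30-day default mirrors the threshold used by the SSL alert rule in this article.

```python
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(expiry_unix_ts, now=None):
    """Days left before a certificate expires, given its expiry as a Unix timestamp."""
    if now is None:
        now = time.time()
    return (expiry_unix_ts - now) / SECONDS_PER_DAY


def needs_renewal(expiry_unix_ts, threshold_days=30, now=None):
    """True when fewer than threshold_days remain."""
    return days_until_expiry(expiry_unix_ts, now) < threshold_days


if __name__ == "__main__":
    now = time.time()
    expiry = now + 10 * SECONDS_PER_DAY  # made-up expiry, 10 days from now
    print(round(days_until_expiry(expiry, now), 1), needs_renewal(expiry, now=now))
```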

3. Combining the above, we can define the following alert rules:

[root@cinder1 rules]# cat blackbox.yml 
groups:
- name: blackbox_network_stats
  rules:
  - alert: blackbox_network_stats
    expr: probe_success == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "API/host/port {{ $labels.instance }} is unreachable"
      description: "Please investigate as soon as possible"

## SSL check

[root@cinder1 rules]# cat ssl.yml 
groups:
- name: check_ssl_status
  rules:
  - alert: "SSL certificate expiry warning"
    expr: (probe_ssl_earliest_cert_expiry - time())/86400 < 30
    for: 1h
    labels:
      severity: warn
    annotations:
      description: 'The certificate for {{$labels.instance}} expires in {{ printf "%.1f" $value }} days; please renew it soon'
      summary: "SSL certificate expiry warning"

4. After restarting, log in to the web UI to check:

5. One API turns out to have a problem, and we also received the corresponding WeChat alert: