The Right Way to Monitor a Redis Cluster with Prometheus

Date: 2020-08-07


The Right Way to Monitor Redis with Prometheus (Redis Cluster)

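Monitoring a Redis cluster with Prometheus follows the same pattern as monitoring anything else: use an exporter.
The exporter collects the metrics and exposes them over HTTP for Prometheus to scrape. Grafana then uses those metrics to draw dashboards. Prometheus also evaluates the collected data against the alerting rules you define to decide whether to send an alert to Alertmanager, and Alertmanager in turn decides whether to send out a notification.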

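An alert in Prometheus moves through three states:

Inactive: the rule's condition is not currently met.
Pending: the condition is met but has not yet held for the waiting time you configured, i.e. the for in the rule.
Firing: the condition has held for the whole for duration; the alert is sent to Alertmanager, which notifies by email, DingTalk, and so on.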
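
The three states above can be sketched as a small state machine. This is only an illustration of how the for duration gates the move from Pending to Firing; the class name AlertTracker and its fields are made up for this example and are not part of Prometheus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertTracker:
    """Tracks one alert through inactive -> pending -> firing (illustrative only)."""
    for_seconds: float                    # the rule's `for` duration, in seconds
    active_since: Optional[float] = None  # when the condition first became true

    def update(self, condition_true: bool, now: float) -> str:
        # Condition no longer holds: the alert drops back to inactive.
        if not condition_true:
            self.active_since = None
            return "inactive"
        # Condition just became true: remember when.
        if self.active_since is None:
            self.active_since = now
        # Fire only once the condition has held for the whole `for` window.
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


tracker = AlertTracker(for_seconds=300)  # corresponds to `for: 5m`
for t, cond in [(0, True), (299, True), (300, True), (310, False)]:
    print(t, tracker.update(cond, now=t))
```

With for: 5m, a condition that flips back to false before 300 seconds have passed never reaches Firing, which is why short spikes do not produce notifications.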

But I digress. Let's start monitoring the Redis cluster.

Monitoring a Redis cluster with redis_exporter

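Whatever application you want to monitor, you use the corresponding exporter, which you can look up on the official Prometheus site under EXPORTERS AND INTEGRATIONS.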

For Redis, use redis_exporter. Link: https://github.com/oliver006/redis_exporter

It supports Redis 2.x through 5.x.

Installation and flags

Download:

wget https://github.com/oliver006/redis_exporter/releases/download/v1.3.5/redis_exporter-v1.3.5.linux-amd64.tar.gz   
tar zxvf redis_exporter-v1.3.5.linux-amd64.tar.gz
cd redis_exporter-v1.3.5.linux-amd64/
./redis_exporter <flags>

redis_exporter supports many flags, but only a few of them matter to us.

./redis_exporter --help
Usage of ./redis_exporter:
    -redis.addr string
    	Address of the Redis instance to scrape (default "redis://localhost:6379")
    -redis.password string
    	Password of the Redis instance to scrape
    -web.listen-address string
    	Address to listen on for web interface and telemetry. (default ":9121")

Monitoring a single Redis instance

nohup ./redis_exporter -redis.addr 172.18.11.138:6379 -redis.password xxxxx &

Add the single instance to Prometheus:

  - job_name: redis_since
    static_configs:
      - targets: ['172.18.11.138:9121']
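
Before wiring the exporter into Prometheus, it can help to confirm that it actually reports redis_up as 1. The sketch below parses the text that an exporter's /metrics endpoint returns. The SAMPLE text is hand-written for illustration (it is not real exporter output), and the parser is simplified: it assumes no timestamps and keeps any labels as part of the key.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse samples from Prometheus text exposition format (simplified)."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated token on the line.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts
        samples[key] = float(value)
    return samples


# Hand-written sample for illustration only, not captured from a real exporter.
SAMPLE = """\
# TYPE redis_up gauge
redis_up 1
# TYPE redis_connected_clients gauge
redis_connected_clients 12
"""

metrics = parse_metrics(SAMPLE)
print("exporter can reach Redis:", metrics["redis_up"] == 1.0)
```

To run it against a live exporter, fetch the text with urllib.request.urlopen("http://172.18.11.138:9121/metrics").read().decode() and pass the result to parse_metrics.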

Approaches to monitoring a Redis cluster

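This took some effort. I went through a lot of material online, and nearly all of it covers monitoring a single instance. Only the following article covers a cluster, and its cluster happens to have no password:
Monitoring a Redis cluster with Prometheus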

The approaches I tried:
Both of the following fail with an authentication error:

level=error msg="Redis INFO err: NOAUTH Authentication required."

Method 1

nohup ./redis_exporter -redis.addr 172.18.11.139:7000 172.18.11.139:7001 172.18.11.140:7002 172.18.11.140:7003 172.18.11.141:7004 172.18.11.141:7005 -redis.password xxxxx &

Method 2

nohup ./redis_exporter -redis.addr redis://h:Lcsmy.312==/@172.18.11.139:7000 redis://h:Lcsmy.312==/@172.18.11.139:7001 redis://h:Lcsmy.312==/@172.18.11.140:7002 redis://h:Lcsmy.312==/@172.18.11.140:7003 redis://h:Lcsmy.312==/@172.18.11.141:7004 redis://h:Lcsmy.312==/@172.18.11.141:7005 -redis.password xxxxx &

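I then considered the crudest approach: starting one redis_exporter per instance. But that way many cluster-level queries stop working, for example cluster_slot_fail, so I abandoned it.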

nohup ./redis_exporter -redis.addr 172.18.11.139:7000  -redis.password xxxxxx  -web.listen-address 172.18.11.139:9121 > /dev/null 2>&1 &
nohup ./redis_exporter -redis.addr 172.18.11.139:7001  -redis.password xxxxxx  -web.listen-address 172.18.11.139:9122 > /dev/null 2>&1 &
nohup ./redis_exporter -redis.addr 172.18.11.140:7002  -redis.password xxxxxx  -web.listen-address 172.18.11.139:9123 > /dev/null 2>&1 &
nohup ./redis_exporter -redis.addr 172.18.11.140:7003  -redis.password xxxxxx  -web.listen-address 172.18.11.139:9124 > /dev/null 2>&1 &
nohup ./redis_exporter -redis.addr 172.18.11.141:7004  -redis.password xxxxxx  -web.listen-address 172.18.11.139:9125 > /dev/null 2>&1 &
nohup ./redis_exporter -redis.addr 172.18.11.141:7005  -redis.password xxxxxx  -web.listen-address 172.18.11.139:9126 > /dev/null 2>&1 &

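In the end I had to open an issue on GitHub. After some back-and-forth with the author in my Chinglish, I finally understood. It turns out the official documentation already covers this: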

scrape_configs:
  ## config for the multiple Redis targets that the exporter will scrape
  - job_name: 'redis_exporter_targets'
    static_configs:
      - targets:
        - redis://first-redis-host:6379
        - redis://second-redis-host:6379
        - redis://second-redis-host:6380
        - redis://second-redis-host:6381
    metrics_path: /scrape
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: <<REDIS-EXPORTER-HOSTNAME>>:9121
  
  ## config for scraping the exporter itself
  - job_name: 'redis_exporter'
    static_configs:
      - targets:
        - <<REDIS-EXPORTER-HOSTNAME>>:9121
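
To make the relabel_configs above easier to follow, here is a rough sketch of what the three rules do to each target and which URL Prometheus ends up requesting. The function name scrape_url is made up for this illustration, and the addresses are the ones used elsewhere in this post; the real relabeling happens inside Prometheus.

```python
from typing import Tuple
from urllib.parse import urlencode


def scrape_url(target: str, exporter: str, metrics_path: str = "/scrape") -> Tuple[str, str]:
    """Mimic the three relabel rules and return (request URL, instance label)."""
    labels = {"__address__": target}
    # Rule 1: copy __address__ into __param_target (becomes the ?target= parameter).
    labels["__param_target"] = labels["__address__"]
    # Rule 2: copy __param_target into instance, so each series names its Redis node.
    labels["instance"] = labels["__param_target"]
    # Rule 3: overwrite __address__ so the request goes to the exporter instead.
    labels["__address__"] = exporter
    query = urlencode({"target": labels["__param_target"]})
    url = "http://{}{}?{}".format(labels["__address__"], metrics_path, query)
    return url, labels["instance"]


url, instance = scrape_url("redis://172.18.11.139:7000", "172.18.11.139:9121")
print(url)
print(instance)
```

So every entry under targets is really a Redis address handed to one shared exporter, and the instance label keeps the Redis node's address rather than the exporter's.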

Redis cluster in practice

Start redis_exporter

nohup ./redis_exporter -redis.password xxxxx  &

The key part
How to configure it in Prometheus:

  - job_name: 'redis_exporter_targets'
    static_configs:
      - targets:
        - redis://172.18.11.139:7000
        - redis://172.18.11.139:7001
        - redis://172.18.11.140:7002
        - redis://172.18.11.140:7003
        - redis://172.18.11.141:7004
        - redis://172.18.11.141:7005
    metrics_path: /scrape
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 172.18.11.139:9121
  ## config for scraping the exporter itself
  - job_name: 'redis_exporter'
    static_configs:
      - targets:
        - 172.18.11.139:9121

With this configuration the cluster's metrics are collected. However, the log shows:

time="2019-12-17T09:10:49+08:00" level=error msg="Couldn't connect to redis instance"

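Over lunch it suddenly dawned on me: as long as the exporter can connect to one node of the cluster, it can query the metrics of the other nodes as well. (The error most likely came from the exporter's default -redis.addr, redis://localhost:6379, where presumably no Redis was listening on this host.) So I changed the start command to: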

nohup ./redis_exporter -redis.addr 172.18.11.141:7005  -redis.password xxxxx &

The Prometheus configuration stays the same.

Here are a few screenshots:

Alerting rules

groups:
- name: Redis
  rules:
    - alert: RedisDown
      expr: redis_up  == 0
      for: 5m
      labels:
        severity: error
      annotations:
        summary: "Redis down (instance {{ $labels.instance }})"
        description: "Redis is down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
    - alert: MissingBackup
      expr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24
      for: 5m
      labels:
        severity: error
      annotations:
        summary: "Missing backup (instance {{ $labels.instance }})"
        description: "Redis has not been backed up for 24 hours\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
    - alert: OutOfMemory
      expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Out of memory (instance {{ $labels.instance }})"
        description: "Redis is running out of memory (> 90%)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
    - alert: ReplicationBroken
      expr: delta(redis_connected_slaves[1m]) < 0
      for: 5m
      labels:
        severity: error
      annotations:
        summary: "Replication broken (instance {{ $labels.instance }})"
        description: "Redis instance lost a slave\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
    - alert: TooManyConnections
      expr: redis_connected_clients > 1000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Too many connections (instance {{ $labels.instance }})"
        description: "Redis instance has too many connections\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"       
    - alert: NotEnoughConnections
      expr: redis_connected_clients < 5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Not enough connections (instance {{ $labels.instance }})"
        description: "Redis instance should have more connections (> 5)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
    - alert: RejectedConnections
      expr: increase(redis_rejected_connections_total[1m]) > 0
      for: 5m
      labels:
        severity: error
      annotations:
        summary: "Rejected connections (instance {{ $labels.instance }})"
        description: "Some connections to Redis have been rejected\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
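
As a quick sanity check on the thresholds, the arithmetic behind the OutOfMemory and MissingBackup expressions can be worked through by hand. The sketch below mirrors only the comparison in each expr; the function names and the sample numbers are made up for illustration, and the for: 5m waiting period is not modeled.

```python
def out_of_memory(used_bytes: float, total_system_bytes: float,
                  threshold_percent: float = 90.0) -> bool:
    """Mirror: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90"""
    return used_bytes / total_system_bytes * 100 > threshold_percent


def missing_backup(now: float, last_save_timestamp: float,
                   max_age_seconds: float = 60 * 60 * 24) -> bool:
    """Mirror: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24"""
    return now - last_save_timestamp > max_age_seconds


GIB = 2 ** 30
# 15 GiB used out of 16 GiB is 93.75 %, above the 90 % threshold.
print(out_of_memory(15 * GIB, 16 * GIB))
# 8 GiB used out of 16 GiB is 50 %, below the threshold.
print(out_of_memory(8 * GIB, 16 * GIB))
# Last RDB save 100000 s ago is older than 86400 s (24 hours).
print(missing_backup(now=200000, last_save_timestamp=100000))
```

Note that the OutOfMemory expression compares against the host's total memory, so a Redis instance limited by its own maxmemory setting could fill up well before this rule triggers.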