Redis Monitoring Tools, Commands, and Tuning

Date: 2019-11-26

Tags: redis, monitoring, tools, commands. Category: Redis


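1. Graphical monitoring

While performance-testing Redis, I found RedisLive on GitHub, a well-reviewed monitoring tool written in Python. After fiddling with it for quite a while, I discovered that its home page references Google's jsapi script, so it has to reach Google's servers online. According to Stack Overflow, downloading the script locally does not solve the problem either. What a trap!

Just as I was about to give up, I found redis-monitor, a project forked from RedisLive, probably by a Chinese developer. It removes the dependency on Google's jsapi and improves the management of multiple Redis instances, so at last I could see the long-awaited graphs.

First, make sure Python is installed.

Then download and install the following Python packages. You can download each tar.gz manually, extract it, and run python setup.py install one by one, or install them directly with pip: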

tornado: a Python web framework
redis-py: the Python client for Redis (installed from PyPI as redis)
python-dateutil
backports.ssl_match_hostname
argparse
setuptools
six

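Next, download and extract redis-monitor-master from GitHub and edit src/redis_live.conf.

You must configure a separate Redis instance to store the monitoring data, and you can configure several Redis instances to be monitored at the same time.

Starting redis-monitor is a bit cumbersome: you need to start two foreground processes and two background processes: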

#in src/script/redis-monitor.sh add redis-monitor as a startup service

#start web with port 8888
$ python redis_live.py

# start info collector
$ python redis_monitor.py

#start daemon
$ python redis_live_daemon.py
$ python redis_monitor_daemon.py

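2. Command-line monitoring

As shown above, graphical monitoring of Redis is attractive and intuitive, but it is cumbersome to install.

If you only want a quick look at the load on Redis, the commands it ships with are entirely sufficient.

2.1 Throughput

The INFO command provided by Redis shows not only the real-time throughput (ops/sec) but also other useful runtime information. Below, grep filters out some of the more important real-time fields, such as connected and blocked clients, memory in use, rejected connections, and the instantaneous operation rate and network traffic: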

[root@vm redis-3.0.3]# src/redis-cli -h 127.0.0.1 info | grep -e "connected_clients" -e "blocked_clients" -e "used_memory_human" -e "used_memory_peak_human" -e "rejected_connections" -e "evicted_keys" -e "instantaneous"

connected_clients:1
blocked_clients:0
used_memory_human:799.66K
used_memory_peak_human:852.35K
instantaneous_ops_per_sec:0
instantaneous_input_kbps:0.00
instantaneous_output_kbps:0.00
rejected_connections:0
evicted_keys:0
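
The same filtering can be done in a script. The following is a minimal sketch, assuming the INFO reply is already available as text (for example captured from redis-cli); the helper names parse_info and select_metrics, the WATCHED tuple, and the sample text are made up for illustration. Unlike grep, it matches the watched field names exactly, apart from the instantaneous prefix.

```python
def parse_info(text):
    """Parse the text reply of the Redis INFO command into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Clients".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


# Hypothetical watch list, mirroring the grep patterns used above.
WATCHED = (
    "connected_clients", "blocked_clients", "used_memory_human",
    "used_memory_peak_human", "rejected_connections", "evicted_keys",
)


def select_metrics(fields):
    """Keep the watched fields plus every field starting with 'instantaneous'."""
    return {
        key: value for key, value in fields.items()
        if key in WATCHED or key.startswith("instantaneous")
    }


# Made-up sample input in the INFO format, not real server output.
sample = """# Clients
connected_clients:1
blocked_clients:0
# Memory
used_memory_human:799.66K
# Stats
instantaneous_ops_per_sec:0
total_commands_processed:42
"""
metrics = select_metrics(parse_info(sample))
print(metrics)
```

Values are kept as strings on purpose, since fields such as used_memory_human carry a unit suffix.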

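2.2 Latency

2.2.1 Client-side PING

Latency can be monitored from the client: use the PING command provided by Redis to ping the server continuously and record how long the server takes to reply with PONG.

Below, open two terminals: one measures the latency, the other watches the commands the server receives: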

[root@vm redis-3.0.3]# src/redis-cli --latency -h 127.0.0.1
min: 0, max: 1, avg: 0.08

[root@vm redis-3.0.3]# src/redis-cli -h 127.0.0.1
127.0.0.1:6379> monitor
OK
1439361594.867170 [0 127.0.0.1:59737] "PING"
1439361594.877413 [0 127.0.0.1:59737] "PING"
1439361594.887643 [0 127.0.0.1:59737] "PING"
1439361594.897858 [0 127.0.0.1:59737] "PING"
1439361594.908063 [0 127.0.0.1:59737] "PING"
1439361594.918277 [0 127.0.0.1:59737] "PING"
1439361594.928469 [0 127.0.0.1:59737] "PING"
1439361594.938693 [0 127.0.0.1:59737] "PING"
1439361594.948899 [0 127.0.0.1:59737] "PING"
1439361594.959110 [0 127.0.0.1:59737] "PING"

If we deliberately use the DEBUG command to create latency, the output changes:

[root@vm redis-3.0.3]# src/redis-cli --latency -h 127.0.0.1
min: 0, max: 1995, avg: 1.60 (2361 samples)

[root@vm redis-3.0.3]# src/redis-cli -h 127.0.0.1
127.0.0.1:6379> debug sleep 1
OK
(1.00s)
127.0.0.1:6379> debug sleep .15
OK
127.0.0.1:6379> debug sleep .5
OK
(0.50s)
127.0.0.1:6379> debug sleep 2
OK
(2.00s)
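
The client-side measurement in this section can be approximated in a few lines. This is only a sketch of the idea, not how redis-cli is implemented: measure_latency is a hypothetical helper that times any callable, and fake_ping stands in for a real PING so the example runs without a server. With redis-py one could pass redis.Redis(host="127.0.0.1").ping instead, assuming a reachable server.

```python
import time


def measure_latency(ping, samples=100, interval=0.01):
    """Call ping() repeatedly and return (min, max, avg) latency in milliseconds."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        ping()
        latencies.append((time.perf_counter() - start) * 1000.0)
        # Pause between samples; the MONITOR output above shows roughly 10 ms gaps.
        time.sleep(interval)
    return min(latencies), max(latencies), sum(latencies) / len(latencies)


def fake_ping():
    """Stand-in for a real PING round trip (hypothetical, sleeps about 2 ms)."""
    time.sleep(0.002)


lo, hi, avg = measure_latency(fake_ping, samples=5, interval=0)
print(f"min: {lo:.2f}, max: {hi:.2f}, avg: {avg:.2f} (5 samples)")
```

Because the timer wraps the whole call, the result includes client library overhead as well as the network round trip.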

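2.2.2 Server-side internal mechanism

Latency monitoring inside the server takes slightly more effort, because the default latency-monitor threshold is 0, which means it is disabled. Although its space and time overhead is small, Redis keeps it off by default for the sake of performance.

So we first have to enable it by setting a reasonable threshold, such as the 100 ms used in the command below: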

127.0.0.1:6379> CONFIG SET latency-monitor-threshold 100
OK

Because Redis executes commands very quickly, we use the DEBUG command to artificially create some slow commands:

127.0.0.1:6379> debug sleep 2
OK
(2.00s)
127.0.0.1:6379> debug sleep .15
OK
127.0.0.1:6379> debug sleep .5
OK

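Next, use the LATENCY subcommands to inspect the latency records:

LATEST: four columns giving the event name, the Unix timestamp of the most recent latency spike, the most recent latency, and the all-time maximum latency (both latencies in milliseconds).
HISTORY: the latency time series for an event, which can be used to produce graphs or reports.
GRAPH: draws the time series as an ASCII graph. The labels written vertically at the bottom show how long ago each spike occurred.
RESET: clears the latency records.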

127.0.0.1:6379> latency latest
1) 1) "command"
   2) (integer) 1439358778
   3) (integer) 500
   4) (integer) 2000

127.0.0.1:6379> latency history command
1) 1) (integer) 1439358773
   2) (integer) 2000
2) 1) (integer) 1439358776
   2) (integer) 150
3) 1) (integer) 1439358778
   2) (integer) 500
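
To make the relationship between the LATEST and HISTORY replies above concrete, here is a small sketch that rebuilds a LATEST-style row from HISTORY samples. The helper name latest_from_history is made up, and there is one caveat: HISTORY keeps only a bounded number of recent samples, so the maximum computed this way can be lower than the all-time maximum the server reports.

```python
def latest_from_history(event, history):
    """Build a LATENCY LATEST-style row from LATENCY HISTORY samples.

    history is a list of (unix_timestamp, latency_ms) pairs, oldest first.
    Returns (event, timestamp_of_latest, latest_ms, max_ms).
    """
    if not history:
        raise ValueError("no latency samples recorded")
    last_ts, last_ms = history[-1]
    max_ms = max(ms for _, ms in history)
    return event, last_ts, last_ms, max_ms


# The three samples shown in the LATENCY HISTORY output above.
history = [(1439358773, 2000), (1439358776, 150), (1439358778, 500)]
row = latest_from_history("command", history)
print(row)
```

For these three samples the row agrees with the LATENCY LATEST reply shown earlier.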

127.0.0.1:6379> latency graph command
command - high 2000 ms, low 150 ms (all time high 2000 ms)
--------------------------------------------------------------------------------
# 
|  
|  
|_#

666
mmm

Run one more DEBUG command and the graph changes: a new bar appears, and the 2s label beneath it means that this spike happened just two seconds ago:

127.0.0.1:6379> debug sleep 1.5
OK
(1.50s)
127.0.0.1:6379> latency graph command
command - high 2000 ms, low 150 ms (all time high 2000 ms)
--------------------------------------------------------------------------------
#
|  #
|  |
|_#|

2222
333s
mmm

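Another interesting subcommand is DOCTOR, which prints advice: for example, enabling the slow log to trace the cause of a problem further, checking whether large objects are being evicted or expired, and suggestions for operating system settings.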

127.0.0.1:6379> latency doctor
Dave, I have observed latency spikes in this Redis instance. You don't mind talking about it, do you Dave?

1. command: 3 latency spikes (average 883ms, mean deviation 744ms, period 210.00 sec). Worst all time event 2000ms.

I have a few advices for you:

- Check your Slow Log to understand what are the commands you are running which are too slow to execute. Please check http://redis.io/commands/slowlog for more information.
- Deleting, expiring or evicting (because of maxmemory policy) large objects is a blocking operation. If you have very large objects that are often deleted, expired, or evicted, try to fragment those objects into multiple smaller objects.
- I detected a non zero amount of anonymous huge pages used by your process. This creates very serious latency events in different conditions, especially when Redis is persisting on disk. To disable THP support use the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled', make sure to also add it into /etc/rc.local so that the command will be executed again after a reboot. Note that even if you have already disabled THP, you still need to restart the Redis process to get rid of the huge pages already created.

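2.2.3 Measuring the latency baseline

Part of the latency comes from the environment, such as the operating system kernel or the virtualization layer. Redis provides a way to measure this baseline latency: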

[root@vm redis-3.0.3]# src/redis-cli --intrinsic-latency 100 -h 127.0.0.1
Max latency so far: 2 microseconds.
Max latency so far: 3 microseconds.
Max latency so far: 26 microseconds.
Max latency so far: 37 microseconds.
Max latency so far: 1179 microseconds.
Max latency so far: 1623 microseconds.
Max latency so far: 1795 microseconds.
Max latency so far: 2142 microseconds.

35818026 total runs (avg latency: 2.7919 microseconds / 27918.90 nanoseconds per run).
Worst run took 767x longer than the average latency.

The argument after --intrinsic-latency is the test duration in seconds; 100 seconds is usually enough.
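
The idea behind this kind of measurement can be sketched as a tight loop that records the worst gap between consecutive iterations. The function intrinsic_latency below is a hypothetical illustration, not the redis-cli implementation. Since Python's interpreter overhead dominates each iteration, the absolute numbers are not comparable with redis-cli's; only the shape of the measurement carries over.

```python
import time


def intrinsic_latency(duration_s):
    """Spin for duration_s seconds; return (runs, avg_us, max_us) per iteration."""
    runs = 0
    max_us = 0.0
    begin = time.perf_counter()
    prev = begin
    while True:
        now = time.perf_counter()
        # Time elapsed since the previous iteration, in microseconds.
        gap_us = (now - prev) * 1e6
        if gap_us > max_us:
            max_us = gap_us
        prev = now
        runs += 1
        if now - begin >= duration_s:
            break
    avg_us = (prev - begin) * 1e6 / runs
    return runs, avg_us, max_us


runs, avg_us, max_us = intrinsic_latency(0.2)
print(f"{runs} total runs (avg latency: {avg_us:.4f} microseconds)")
print(f"Worst run took {max_us / avg_us:.0f}x longer than the average latency.")
```

A large ratio between the worst and the average iteration suggests that the process was descheduled or otherwise stalled by the environment.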

2.3 Continuous real-time monitoring

The Unix watch command is a very handy tool: it can monitor the output of any command in real time.

For example, with a small change the commands shown above become continuous real-time monitoring tools:

[root@vm redis-3.0.3]# watch -n 1 -d 'src/redis-cli -h 127.0.0.1 info | grep -e "connected_clients" -e "blocked_clients" -e "used_memory_human" -e "used_memory_peak_human" -e "rejected_connections" -e "evicted_keys" -e "instantaneous"'

Every 1.0s: src/redis-cli -h 127.0.0.1 info | grep -e...  Wed Aug 12 14:30:40 2015

connected_clients:1
blocked_clients:0
used_memory_human:799.66K
used_memory_peak_human:852.35K
instantaneous_ops_per_sec:0
instantaneous_input_kbps:0.01
instantaneous_output_kbps:1.23
rejected_connections:0
evicted_keys:0

[root@vm redis-3.0.3]# watch -n 1 -d "src/redis-cli -h 127.0.0.1 latency graph command"

Every 1.0s: src/redis-cli -h 127.0.0.1 latency graph command                                                                                                               Wed Aug 12 14:33:25 2015

command - high 2000 ms, low 150 ms (all time high 2000 ms)
--------------------------------------------------------------------------------
#
|  #
|  |
|_#|

4441
0006
mmmm
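
The -d flag makes watch highlight what changed between refreshes. A script that polls INFO can do something similar by comparing two snapshots. The sketch below assumes each snapshot is already a dict of field names to string values; changed_fields is a made-up helper, and the sample values are taken from the outputs in sections 2.1 and 2.3.

```python
def changed_fields(previous, current):
    """Return {field: (old, new)} for fields whose value differs between snapshots."""
    changes = {}
    for key, new in current.items():
        old = previous.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


# Two snapshots built from the outputs shown in sections 2.1 and 2.3.
before = {
    "connected_clients": "1",
    "instantaneous_input_kbps": "0.00",
    "instantaneous_output_kbps": "0.00",
}
after = {
    "connected_clients": "1",
    "instantaneous_input_kbps": "0.01",
    "instantaneous_output_kbps": "1.23",
}
diff = changed_fields(before, after)
print(diff)
```

Fields that appear only in the newer snapshot are reported with None as the old value.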

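2.4 Slow log

Operations such as SORT, LREM, and SUNION can be very slow on large objects. When using them, check the algorithmic complexity of each command in the official command reference, and use the slow log mentioned earlier to monitor how long operations take. Just as mainstream databases provide a slow query log, Redis provides a log of slow operations.

Note that this log counts only the time spent executing the operation itself.

slowlog-log-slower-than sets the threshold for slow operations in microseconds (so the value 500 used below means 0.5 ms), and slowlog-max-len sets how many entries are kept, because the slow log, like the latency records, is held in memory: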

127.0.0.1:6379> config set slowlog-log-slower-than 500
OK 
127.0.0.1:6379> debug sleep 1
OK
(0.50s)
127.0.0.1:6379> debug sleep .6
OK
127.0.0.1:6379> slowlog get 10
1) 1) (integer) 2
   2) (integer) 1439369937
   3) (integer) 473178
   4) 1) "debug"
      2) "sleep"
      3) ".6"
2) 1) (integer) 1
   2) (integer) 1439369821
   3) (integer) 499357
   4) 1) "debug"
      2) "sleep"
      3) "1"
3) 1) (integer) 0
   2) (integer) 1439365058
   3) (integer) 417846
   4) 1) "debug"
      2) "sleep"
      3) "1"

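The four fields of each entry are: the auto-incrementing ID of the entry, the Unix timestamp at which the command ran, the execution time of the command in microseconds, and the command with its arguments.

Note that the durations recorded for the DEBUG commands above are shorter than the requested sleep arguments (about 0.5 s for debug sleep 1, in line with the (0.50s) that redis-cli printed). The slow log records only the time spent executing the command, not the I/O with the client.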
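
Since the raw entries use Unix timestamps and microseconds, a small formatting step makes them easier to read. The following sketch assumes the entries have already been fetched and are available as (id, timestamp, microseconds, argv) tuples; format_slowlog is a made-up helper, and the sample data is copied from the first two entries above.

```python
from datetime import datetime, timezone


def format_slowlog(entries):
    """Turn raw SLOWLOG GET entries into readable dicts.

    Each entry is (id, unix_timestamp, duration_in_microseconds, argv).
    """
    rows = []
    for entry_id, ts, micros, argv in entries:
        rows.append({
            "id": entry_id,
            # Render the timestamp in UTC to keep the output host-independent.
            "when": datetime.fromtimestamp(ts, tz=timezone.utc).isoformat(),
            "duration_ms": micros / 1000.0,
            "command": " ".join(argv),
        })
    return rows


# The first two entries from the SLOWLOG GET output above.
raw = [
    (2, 1439369937, 473178, ["debug", "sleep", ".6"]),
    (1, 1439369821, 499357, ["debug", "sleep", "1"]),
]
rows = format_slowlog(raw)
for row in rows:
    print(row)
```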

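3. Official optimization advice

3.1 Network latency

Clients can connect to Redis over TCP/IP or over a Unix domain socket.

Typically, on a 1 Gbit/s network, TCP/IP latency is about 200 us (microseconds), while a Unix domain socket can be as low as 30 us.

Unix domain sockets are a fairly commonly used technique. For details, see how Nginx and PHP-FPM are configured to talk over a domain socket.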

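What is a domain socket?
Wikipedia: "A Unix domain socket, or IPC socket, is an endpoint that lets two or more processes on the same operating system exchange data. Compared with pipes, Unix domain sockets can use both byte streams and datagrams, whereas pipes can only carry byte streams. The interface of Unix domain sockets is very similar to that of Internet sockets, but they do not use an underlying network protocol to communicate. Unix domain sockets are a standard component of POSIX operating systems. They use file system paths as their addresses, which processes on the system can reference, so two processes can open the same Unix domain socket to communicate. This communication happens entirely inside the kernel and never travels over the network."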

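On the network side, what we can do is reduce the round-trip time (RTT). The official documentation offers the following advice:

Persistent connections: do not connect to and disconnect from the server frequently; keep connections long-lived where possible (Jedis currently works this way).
Domain socket: if the client and the Redis server are on the same machine, use a Unix domain socket.
Multi-argument commands: prefer multi-argument commands such as mset/mget/hmset/hmget over pipelining.
Pipelining: failing that, use pipelining to reduce the number of round trips.
Lua scripts: for commands that have data dependencies and therefore cannot be pipelined, consider running a Lua script on the Redis server.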
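
A back-of-the-envelope calculation shows why batching matters. The sketch below treats the 200 us and 30 us figures quoted above as the cost of one round trip, which is an assumption, and it deliberately ignores server processing time and payload size. The helper network_time_us and the batch size of 100 are made up for illustration.

```python
import math


def network_time_us(commands, rtt_us, batch_size=1):
    """Estimate the time spent on round trips when commands are sent in batches.

    Simplification: ignores server processing time and payload size.
    """
    round_trips = math.ceil(commands / batch_size)
    return round_trips * rtt_us


TCP_RTT_US = 200  # figure quoted above for gigabit TCP/IP (assumed per round trip)
UDS_RTT_US = 30   # figure quoted above for a Unix domain socket

one_by_one = network_time_us(1000, TCP_RTT_US)
pipelined = network_time_us(1000, TCP_RTT_US, batch_size=100)
local_socket = network_time_us(1000, UDS_RTT_US)
print(one_by_one, pipelined, local_socket)  # 200000 2000 30000
```

Under these assumptions, 1000 commands sent one at a time over TCP spend about 200 ms waiting on the network, while batches of 100 bring that down to about 2 ms.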

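3.2 Disk I/O

3.2.1 Writing to disk

Although Redis, like Nginx, uses a single-threaded model built on I/O multiplexing, the Redis 3.0.3 used here does not offer a CPU affinity setting the way Nginx does. This avoids having forked child processes run on the same CPU core that the Redis main process is pinned to, which would let background processes interfere with the main process. So it is better to let the operating system schedule the Redis main process and its background processes by itself. Conversely, if persistence is disabled, could setting CPU affinity for Redis improve performance further?

3.2.2 Operating system swap

If the system runs short of memory, some of the pages belonging to Redis may be swapped out to disk. The smaps file under /proc shows whether any pages have been swapped.

If a large number of pages turn out to be swapped, follow up with vmstat and iostat to investigate the cause. The smaps check looks like this: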

[root@vm redis-3.0.3]# src/redis-cli -h 127.0.0.1 info | grep process_id
process_id:24191

[root@vm redis-3.0.3]# cat /proc/24191/smaps | grep "Swap"
Swap: 0 kB
Swap: 0 kB
Swap: 0 kB
Swap: 0 kB
Swap: 0 kB
            ...
Swap: 0 kB
Swap: 0 kB
Swap: 0 kB
Swap: 0 kB
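
Scanning hundreds of Swap lines by eye is tedious, so it can help to add them up. This is a minimal sketch, assuming a Linux host and permission to read the file; total_swap_kb and read_smaps are made-up helper names, and the sample text is invented so the example runs anywhere.

```python
def total_swap_kb(smaps_text):
    """Sum every 'Swap:' line of a /proc/<pid>/smaps dump, in kB."""
    total = 0
    for line in smaps_text.splitlines():
        parts = line.split()
        # Lines look like "Swap:   0 kB"; "SwapPss:" lines are skipped.
        if len(parts) >= 2 and parts[0] == "Swap:":
            total += int(parts[1])
    return total


def read_smaps(pid):
    """Read the smaps file of a process (Linux only, needs permission)."""
    with open(f"/proc/{pid}/smaps") as handle:
        return handle.read()


# Made-up sample; on a real host one would call total_swap_kb(read_smaps(pid)).
sample = "Size: 4 kB\nSwap: 0 kB\nSwapPss: 0 kB\nSwap: 12 kB\nSwap: 8 kB\n"
print(total_swap_kb(sample))  # 20
```

The process ID would come from the process_id field of INFO, as in the commands above.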

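3.3 Other factors

3.3.1 Forking child processes

Both writing an RDB file and rewriting the AOF file require forking a background process. The main cost of fork is copying the page table, and the time it takes varies between systems; Xen is affected particularly badly.

3.3.2 Transparent Huge Pages

In addition, if Linux has THP (Transparent Huge Pages) enabled, latency is affected severely.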
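
Whether THP is on can be read from the same sysfs file that the Redis warnings mention, where the active option is the one shown in square brackets. The sketch below only parses the file contents; active_thp_mode is a made-up helper, and the two sample strings are typical contents rather than output captured from the machine used in this article.

```python
THP_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"  # path from the Redis warning


def active_thp_mode(contents):
    """Return the bracketed (active) option from the THP 'enabled' file contents."""
    for token in contents.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active option found in: %r" % contents)


# Typical contents of the file; on a Linux host, read THP_PATH instead.
print(active_thp_mode("[always] madvise never"))  # always
print(active_thp_mode("always madvise [never]"))  # never
```

A result of never corresponds to the state the Redis warning asks for.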

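3.3.3 Key expiration

Redis removes expired keys in two ways at once, passively and actively:

Passive: when a client accesses a key and finds that it has expired, the key is removed.
Active: every 100 ms Redis samples a batch of keys and removes the expired ones. If more than 25% of the sampled keys were expired, the cycle repeats immediately.

So, to keep Redis from blocking because more than 25% of the keys expire at the same moment, add some randomness to the expiration times you set.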
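
Randomizing expiration times can be as simple as spreading each TTL around its nominal value. The sketch below is one possible approach, not an official recommendation; jittered_ttl and the 10% spread are made up for illustration, and the result would be passed to EXPIRE or a similar command by the caller.

```python
import random


def jittered_ttl(base_seconds, spread=0.1, rng=random):
    """Return base_seconds randomly shifted by up to +/- spread (a fraction)."""
    if not 0 <= spread < 1:
        raise ValueError("spread must be in [0, 1)")
    low = base_seconds * (1 - spread)
    high = base_seconds * (1 + spread)
    # Round to whole seconds and never go below 1 second.
    return max(1, round(rng.uniform(low, high)))


rng = random.Random(42)  # seeded only to make the demonstration repeatable
ttls = [jittered_ttl(3600, spread=0.1, rng=rng) for _ in range(5)]
print(ttls)
```

With a one-hour base and a 10% spread, keys written together expire over a window of about twelve minutes instead of in the same instant.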

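4. The last resort: WatchDog

The last resort offered by the official documentation is the WatchDog. It prints the full call stack of a slow operation to the Redis log. Unlike the latency and slow-operation records described earlier, which are kept in memory, the WatchDog writes to the log file, so remember to set the logfile path in redis.conf before using it: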

[root@vm redis-3.0.3]# src/redis-cli -h 127.0.0.1
127.0.0.1:6379> CONFIG SET watchdog-period 500
OK
127.0.0.1:6379> debug sleep 1
OK

[root@vm redis-3.0.3]# tailf redis.log 
      `-._    `-.__.-' _.-'                                       
          `-._        _.-' `-.__.-'                                               

51091:M 12 Aug 15:36:53.337 # Server started, Redis version 3.0.3
51091:M 12 Aug 15:36:53.338 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
51091:M 12 Aug 15:36:53.338 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
51091:M 12 Aug 15:36:53.343 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
51091:M 12 Aug 15:36:53.343 * DB loaded from disk: 0.000 seconds
51091:M 12 Aug 15:36:53.343 * The server is now ready to accept connections on port 6379

51091:signal-handler (1439365058) 
--- WATCHDOG TIMER EXPIRED ---
src/redis-server 127.0.0.1:6379(logStackTrace+0x43)[0x450363]
/lib64/libpthread.so.0(__nanosleep+0x2d)[0x3c0740ef3d]
/lib64/libpthread.so.0[0x3c0740f710]
/lib64/libpthread.so.0[0x3c0740f710]
/lib64/libpthread.so.0(__nanosleep+0x2d)[0x3c0740ef3d]
src/redis-server 127.0.0.1:6379(debugCommand+0x58d)[0x45180d]
src/redis-server 127.0.0.1:6379(call+0x72)[0x4201b2]
src/redis-server 127.0.0.1:6379(processCommand+0x3e5)[0x4207d5]
src/redis-server 127.0.0.1:6379(processInputBuffer+0x4f)[0x42c66f]
src/redis-server 127.0.0.1:6379(readQueryFromClient+0xc2)[0x42c7b2]
src/redis-server 127.0.0.1:6379(aeProcessEvents+0x13c)[0x41a52c]
src/redis-server 127.0.0.1:6379(aeMain+0x2b)[0x41a7eb]
src/redis-server 127.0.0.1:6379(main+0x2cd)[0x423c8d]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3c0701ed5d]
src/redis-server 127.0.0.1:6379[0x419b49]
51091:signal-handler (51091:signal-handler (1439365058) ) --------