Redis cluster master-replica replication keeps breaking, with errors about I/O being too high

Date: 2020-05-23

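Cause of the problem:
1. This cluster handles very frequent Redis operations, with 1-2 GB of data operated on per minute. As a result, automatic AOF rewrites are very frequent, and the RDB dumps created for master-replica replication are also very frequent. The previous configuration could no longer keep up.
The following errors were logged: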
6943:M 19 Jul 20:22:57.326 # Connection with slave 10.215.84.40:6009 lost.
32944:C 19 Jul 20:23:14.920 * DB saved on disk
32944:C 19 Jul 20:23:14.990 * RDB: 379 MB of memory used by copy-on-write
6943:M 19 Jul 20:23:15.174 * Background saving terminated with success
6943:M 19 Jul 20:23:15.838 * Slave 10.215.84.40:6009 asks for synchronization
6943:M 19 Jul 20:23:15.838 * Full resync requested by slave 10.215.84.40:6009
6943:M 19 Jul 20:23:15.838 * Starting BGSAVE for SYNC with target: disk
6943:M 19 Jul 20:23:15.912 * Background saving started by pid 32945
6943:M 19 Jul 20:23:17.064 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
6943:M 19 Jul 20:23:19.020 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
6943:M 19 Jul 20:23:20.875 # Client id=697 addr=10.215.84.40:38199 fd=16 name= age=5 idle=5 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=10494 oll=54 omem=272849776 events=r cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits.
6943:M 19 Jul 20:23:20.925 # Connection with slave 10.215.84.40:6009 lost.
32945:C 19 Jul 20:23:38.764 * DB saved on disk
32945:C 19 Jul 20:23:38.829 * RDB: 482 MB of memory used by copy-on-write
6943:M 19 Jul 20:23:38.984 * Background saving terminated with success
6943:M 19 Jul 20:23:39.870 * Slave 10.215.84.40:6009 asks for synchronization
6943:M 19 Jul 20:23:39.870 * Full resync requested by slave 10.215.84.40:6009
6943:M 19 Jul 20:23:39.870 * Starting BGSAVE for SYNC with target: disk
6943:M 19 Jul 20:23:39.943 * Background saving started by pid 32946
6943:M 19 Jul 20:23:40.044 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
6943:M 19 Jul 20:23:45.561 # Client id=698 addr=10.215.84.40:41896 fd=16 name= age=6 idle=6 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=15819 oll=52 omem=272817312 events=r cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits
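The omem field in the "scheduled to be closed" lines is the number to watch. The following is a minimal Python sketch, not part of the original troubleshooting, that extracts omem from such a line and compares it with a hard limit. It assumes the master was still running with the Redis default of client-output-buffer-limit slave 256mb 64mb 60 (the post does not state the earlier value); the function and constant names are illustrative.

```python
import re

# Assumption: the master still used the Redis default hard limit for
# replica clients (256 MB); the post does not state the earlier value.
DEFAULT_HARD_LIMIT = 256 * 1024 * 1024

def parse_client_fields(log_line):
    """Return the key=value fields of a Redis client log line as a dict."""
    return dict(re.findall(r"(\w[\w-]*)=(\S*)", log_line))

# Client description taken from the first disconnect in the log above.
line = ("Client id=697 addr=10.215.84.40:38199 fd=16 name= age=5 idle=5 "
        "flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=10494 "
        "oll=54 omem=272849776 events=r cmd=psync")

fields = parse_client_fields(line)
omem = int(fields["omem"])          # bytes queued for this replica
exceeded = omem > DEFAULT_HARD_LIMIT

print(f"omem={omem} hard_limit={DEFAULT_HARD_LIMIT} exceeded={exceeded}")
```

Under that assumption, both disconnects in the log happen with omem a little over 256 MB, which would be consistent with the replica link being dropped by the output buffer hard limit while the RDB was still being produced.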
Solution:
Step 1:
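The analysis below takes the replica's point of view, because that reflects the replication flow more clearly. (The excerpt above was captured on the master, as the M role marker shows; the replica's own log records the matching steps.) On the replica side, one full sync consists of these main steps:
1. The replica receives the RDB file
2. The replica flushes its old data
3. The replica loads the RDB file
At this point one full sync is complete. But wait: what is the "Connection with master lost" message in the replica's log, and why does another sync start right after it?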
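Literally, "Connection with master lost" means that the replica's connection to the master has timed out. In Redis, the master and the replica need to track each other's state. The master periodically sends PING to its replicas, and each replica periodically reports back with REPLCONF ACK. When either side fails to receive these messages in time, for whatever reason, it considers the replication link broken.
That raises the question of what "in time" means. Redis sets this with the repl-timeout parameter, whose default is 60s. The Redis configuration file (redis.conf) explains repl-timeout in detail: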
# The following option sets the replication timeout for:
#
# 1) Bulk transfer I/O during SYNC, from the point of view of slave.
# 2) Master timeout from the point of view of slaves (data, pings).
# 3) Slave timeout from the point of view of masters (REPLCONF ACK pings).
#
# It is important to make sure that this value is greater than the value
# specified for repl-ping-slave-period otherwise a timeout will be detected
# every time there is low traffic between the master and the slave.
#
# repl-timeout 60
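Now look back at the replica's sync log: loading the RDB file took the replica nearly three minutes, which is longer than repl-timeout. The replica therefore concluded that its connection to the master was lost, so it tried to reconnect and sync again.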
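The two conditions described here (repl-timeout must exceed the ping period, and a long RDB load can outlast repl-timeout) can be checked with a short Python sketch. It assumes the defaults repl-timeout 60 and repl-ping-slave-period 10 when a setting is absent, and it rounds "nearly three minutes" to 180 seconds; the helper names are hypothetical.

```python
def parse_conf(text):
    """Parse 'key value' lines from a redis.conf fragment, skipping comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(" ")
        settings[key] = value.strip()
    return settings

def timeout_problems(settings, load_seconds):
    """List reasons why this timeout setup could trigger repeated resyncs."""
    # Assumed defaults when the fragment does not set a value.
    timeout = int(settings.get("repl-timeout", 60))
    ping_period = int(settings.get("repl-ping-slave-period", 10))
    problems = []
    if timeout <= ping_period:
        problems.append(
            "repl-timeout must be greater than repl-ping-slave-period")
    if load_seconds > timeout:
        problems.append(
            f"RDB load of {load_seconds}s exceeds repl-timeout of {timeout}s")
    return problems

conf = """
# repl-timeout 60
repl-timeout 60
repl-ping-slave-period 10
"""

# 180 s stands in for the "nearly three minutes" of RDB loading.
problems = timeout_problems(parse_conf(conf), load_seconds=180)
for problem in problems:
    print(problem)
```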
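Partial sync
One more point: whenever a master-replica sync starts, Redis first attempts a partial sync, and only falls back to a full sync if the partial sync fails.
Every write request the master receives is also written into a buffer called repl_backlog. When a sync starts, Redis first checks whether the data held in repl_backlog is enough to bring the replica up to date. That check-and-replay process is the partial sync.
A full sync is a heavyweight, slow operation, so in many cases the partial sync mechanism greatly reduces the time and cost of synchronizing.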
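The resync problem
With the replication mechanism roughly understood, turn back to the three minutes spent loading the RDB file. During that time the master keeps receiving requests from the front end, and those requests keep being appended to repl_backlog. Because Redis is single-threaded, however, the replica cannot accept the master's replicated writes while it is loading. Data keeps flowing into repl_backlog without being consumed.
Once repl_backlog is full, it can no longer satisfy a partial sync, so the partial sync fails and another full sync is needed. This repeats in an endless loop, which is the master-replica resync problem. It not only eats bandwidth but also degrades the master's service.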
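The loop described above can be illustrated with a toy Python model of the backlog as a fixed-size window over the master's write stream. This is a simplified sketch: real Redis also compares replication IDs, and the class and method names here are invented for illustration.

```python
class Backlog:
    """Toy model of the master's replication backlog (a fixed-size window)."""

    def __init__(self, size):
        self.size = size          # capacity in bytes
        self.master_offset = 0    # total bytes of writes seen so far

    def write(self, nbytes):
        """Record nbytes of new writes arriving on the master."""
        self.master_offset += nbytes

    def first_available_offset(self):
        """Oldest offset still held; anything older has been overwritten."""
        return max(0, self.master_offset - self.size)

    def can_partial_resync(self, replica_offset):
        """True if everything the replica is missing is still in the window."""
        return (self.first_available_offset()
                <= replica_offset
                <= self.master_offset)

backlog = Backlog(size=1024 * 1024)       # 1 MB window
backlog.write(500_000)
replica_offset = backlog.master_offset    # replica is caught up, then stalls

backlog.write(400_000)                    # small gap: still inside the window
short_gap_ok = backlog.can_partial_resync(replica_offset)

backlog.write(2_000_000)                  # long stall: window has moved past
long_gap_ok = backlog.can_partial_resync(replica_offset)

print(short_gap_ok, long_gap_ok)
```

The longer the replica is busy loading, the further the window moves past its offset, which is why a larger backlog makes a partial sync more likely to succeed.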
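Solution
By now the fix is obvious: increase repl_backlog.
The default repl_backlog size in Redis is 1M, which is quite small. We once set it to 100M in our cluster and still saw occasional resyncs. After raising it to 200M, everything was quiet. The repl_backlog size can be changed with the following command: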
# 200M
redis-cli -h xxx -p xxx config set repl-backlog-size 209715200
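As a rough sizing aid, the backlog size in bytes and the stall it can absorb can be worked out in a few lines of Python. This sketch assumes that the 1-2 GB per minute mentioned at the top of the post all lands in the replication stream, which the post does not confirm; the function name is hypothetical.

```python
MB = 1024 * 1024
GB = 1024 * MB

def backlog_coverage_seconds(backlog_bytes, write_bytes_per_minute):
    """Seconds of writes the backlog can hold at a steady write rate."""
    return backlog_bytes * 60 / write_bytes_per_minute

backlog_bytes = 200 * MB    # the value passed to repl-backlog-size above
print(backlog_bytes)

# Assumed write rates: the 1-2 GB per minute quoted for this cluster.
coverage = {rate: backlog_coverage_seconds(backlog_bytes, rate * GB)
            for rate in (1, 2)}
for rate, seconds in coverage.items():
    print(f"{rate} GB/min -> backlog covers about {seconds:.1f} s")
```

Under that assumption a 200M backlog covers only a short stall, which may explain why this step alone did not resolve the problem here.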
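After this change CPU usage was no longer as high, but the problem was still not solved.
Step 2: adjust the automatic AOF rewrite thresholds
Changed to: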
127.0.0.1:6007> config get auto*
1) "auto-aof-rewrite-percentage"
2) "200"
3) "auto-aof-rewrite-min-size"
4) "264217728"
127.0.0.1:6007>
The original values were:
127.0.0.1:6007> config get auto*
1) "auto-aof-rewrite-percentage"
2) "0"
3) "auto-aof-rewrite-min-size"
4) "64217728"
127.0.0.1:6007>
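To see what the new thresholds change, here is a small Python sketch of the automatic rewrite rule as I understand it: a rewrite starts only when the AOF is larger than auto-aof-rewrite-min-size and has grown by at least auto-aof-rewrite-percentage since the last rewrite. The file sizes in the example are made up, and the function is an illustration rather than Redis's actual code.

```python
def should_auto_rewrite(current_size, base_size, percentage, min_size):
    """Decide whether an automatic AOF rewrite would be triggered.

    current_size: AOF size now, in bytes
    base_size:    AOF size right after the last rewrite, in bytes
    percentage:   auto-aof-rewrite-percentage (0 disables auto rewrite)
    min_size:     auto-aof-rewrite-min-size, in bytes
    """
    if percentage == 0:
        return False
    if current_size <= min_size:
        return False
    base = base_size if base_size else 1   # avoid dividing by zero
    growth = current_size * 100 // base - 100
    return growth >= percentage

# Hypothetical sizes: 300 MB after the last rewrite, 700 MB now.
base, current = 300_000_000, 700_000_000

with_100 = should_auto_rewrite(current, base, 100, 264217728)
with_200 = should_auto_rewrite(current, base, 200, 264217728)
print(with_100, with_200)
```

With a threshold of 200 the file has to triple before a rewrite starts, so rewrites happen less often and compete less with replication for disk I/O.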

Step 3: raise the replication output buffer limits and time window (change both nodes together)
127.0.0.1:6007> config set client-output-buffer-limit "slave 2528435456 135108864 300"
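For details, see: "Redis master-replica interruption error: Unable to partial resync with the slave for lack of backlog (Slave request was: 2595405802), causing the replica to reload the RDB into memory every minute".
If replication still does not recover, increase the value 2528435456.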
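The three numbers in client-output-buffer-limit are a hard limit, a soft limit, and the number of seconds the soft limit may be exceeded. The Python sketch below models that rule under my reading of it (close at once at the hard limit, or after staying over the soft limit for longer than the time window) and applies it to the omem value from the log. The function names are illustrative.

```python
def parse_limit(value):
    """Split 'class hard soft seconds' into a dict with integer limits."""
    cls, hard, soft, seconds = value.split()
    return {"class": cls, "hard": int(hard),
            "soft": int(soft), "seconds": int(seconds)}

def should_close(limit, omem, seconds_over_soft):
    """Return True if a client with this buffer usage would be dropped."""
    if limit["hard"] and omem >= limit["hard"]:
        return True       # hard limit: closed immediately
    if (limit["soft"] and omem >= limit["soft"]
            and seconds_over_soft > limit["seconds"]):
        return True       # soft limit exceeded for too long
    return False

raised = parse_limit("slave 2528435456 135108864 300")   # Step 3 value
lowered = parse_limit("slave 528435456 135108864 180")   # value used afterwards

omem = 272849776   # taken from the first "scheduled to be closed" log line

print(should_close(raised, omem, seconds_over_soft=6))
print(should_close(lowered, omem, seconds_over_soft=6))
print(should_close(lowered, omem, seconds_over_soft=200))
```

Both settings tolerate the burst seen in the log for a few seconds; the lowered one simply gives up sooner if the replica stays behind.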

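Finally:
1. Once the sync has finished and replication has recovered, lower the limits changed in Step 3 with config set client-output-buffer-limit "slave 528435456 135108864 180". Otherwise the buffers can take up too much memory, which has a significant impact.
2. Make the configuration identical on all nodes in this cluster.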
