During a flash sale, the slave instances of one shard in a Redis cluster saw their response time suddenly spike to tens of seconds. After the alert fired, ops removed one slave that was due to be decommissioned anyway; the problem eased but was not resolved. They then removed the other, healthy slave, and the problem disappeared. The master log at the time:
09:59:11.842 # Client id=19768058 addr=xx.xxx.xx.xx:46599 fd=7 name= age=235951 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=10581 omem=268636408 events=rw cmd=replconf scheduled to be closed ASAP for overcoming of output buffer limits.
09:59:11.851 # Client id=19770026 addr=xx.xxx.xx.x:64139 fd=6 name= age=208571 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=10581 omem=268636408 events=rw cmd=replconf scheduled to be closed ASAP for overcoming of output buffer limits.
09:59:11.863 # Connection with slave xx.xxx.xx.x:xxxx lost.
09:59:11.878 # Connection with slave xx.xxx.xx.x:xxxx lost.
The slave log:
09:59:11.866 # Connection with master lost.
09:59:43.057 # I/O error trying to sync with MASTER: connection lost
10:00:17.720 # I/O error trying to sync with MASTER: connection lost
10:00:48.585 # I/O error trying to sync with MASTER: connection lost
10:01:20.326 # I/O error trying to sync with MASTER: connection lost
The logs of the two slaves were identical, so only one is excerpted here.
The master log shows that the master proactively closed the connections because the slaves' client output buffers hit their limit. At the time, the master's CPU was at 100%, it kept running bgsave, and the slow log contained many psync commands.
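For reference, this kind of state can be inspected directly on the master with redis-cli; the host and port below are placeholders, and the fields matched are the same ones visible in the logs above:

# Replica connections and their output-buffer usage (flags=S; omem is the buffer size in bytes)
redis-cli -h <master-host> -p <port> client list | grep 'flags=S'

# Is a bgsave in progress, and how many full vs. partial resyncs have been served?
redis-cli -h <master-host> -p <port> info persistence | grep rdb_bgsave_in_progress
redis-cli -h <master-host> -p <port> info stats | grep -E 'sync_full|sync_partial'

# Recent slow-log entries (psync showed up here during the incident)
redis-cli -h <master-host> -p <port> slowlog get 10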
The slave log shows the connection to the master being dropped, after which the slave kept re-establishing the connection and copying the data over again.
Grafana monitoring showed that the write traffic of this workload had reached 123 MB for the first time, while client-output-buffer-limit slave was set to 256mb 64mb 60. That means the master closes a slave connection once that slave's output buffer exceeds the 256 MB hard limit, or stays above the 64 MB soft limit for 60 seconds (the omem=268636408, roughly 256 MB, in the master log shows the hard limit was hit). The disconnected slave then requested resynchronization with psync, but repl-backlog-size was only 64mb; at this write rate the offsets needed for a partial resync had already been pushed out of the replication backlog, so the master fell back to a full resync, and once it completed the slave loaded the RDB file, blocking while doing so.
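The limits in play can be read off a running instance; host and port are again placeholders:

# Current replication buffer settings on the master
redis-cli -h <master-host> -p <port> config get repl-backlog-size
redis-cli -h <master-host> -p <port> config get client-output-buffer-limit

# How much of the backlog is actually populated (repl_backlog_histlen etc.)
redis-cli -h <master-host> -p <port> info replication | grep repl_backlog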
This sequence formed a loop that kept repeating, leaving the slaves essentially unable to serve read requests (latency sat around 30 s). The master itself was barely affected, which is why removing the slaves alleviated and ultimately resolved the problem.
For workloads with heavy write traffic, the replication-related buffers, repl-backlog-size and client-output-buffer-limit slave, can be raised to twice the peak traffic, or even left unlimited like the default for normal client buffers.
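A rough sketch of that tuning follows. The sizes here are illustrative only and should be derived from roughly twice your measured peak write traffic; the redis.conf equivalents are included so the change survives a restart.

# Enlarge the replication backlog so partial resyncs survive longer disconnections:
redis-cli -h <master-host> -p <port> config set repl-backlog-size 536870912    # 512 MB

# Relax the slave output buffer limit (hard 1 GB, soft 512 MB for 120 s);
# "slave 0 0 0" would disable the limit entirely, like normal clients:
redis-cli -h <master-host> -p <port> config set client-output-buffer-limit "slave 1073741824 536870912 120"

# redis.conf equivalents:
#   repl-backlog-size 512mb
#   client-output-buffer-limit slave 1gb 512mb 120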