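Redis Sentinel is Redis's high-availability solution. It was officially introduced in Redis 2.8.
In the earlier master-slave replication setup, if the master failed, someone had to manually promote a slave to master, repoint the remaining slaves at the new master, and update the master address used by the application. The whole process required manual intervention.
Below we walk through Sentinel's failover process using the logs.
Sentinel's failover process
The cluster topology is as follows.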
Role        IP         Port   runID
Master      127.0.0.1  6379
Slave-1     127.0.0.1  6380
Slave-2     127.0.0.1  6381
Sentinel-1  127.0.0.1  26379  d4424b8684977767be4f5abd1e364153fbb0adbd
Sentinel-2  127.0.0.1  26380  18311edfbfb7bf89fe4b67d08ef432053db62fff
Sentinel-3  127.0.0.1  26381  3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8
Kill the master process with kill -9.
1. The slaves react first.
They immediately log the following.
28244:S 08 Oct 16:03:34.184 # Connection with master lost.
28244:S 08 Oct 16:03:34.184 * Caching the disconnected master state.
28244:S 08 Oct 16:03:34.548 * Connecting to MASTER 127.0.0.1:6379
28244:S 08 Oct 16:03:34.548 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:03:34.548 # Error condition on socket for SYNC: Connection refused
28244:S 08 Oct 16:03:35.556 * Connecting to MASTER 127.0.0.1:6379
28244:S 08 Oct 16:03:35.556 * MASTER <-> SLAVE sync started
...
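2. The Sentinel logs show nothing until 30 s later, which follows from the setting "sentinel down-after-milliseconds mymaster 30000".
Below are the log outputs of each Sentinel node and of the slaves, in turn.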
Sentinel-1
28087:X 08 Oct 16:04:04.277 # +sdown master mymaster 127.0.0.1 6379
28087:X 08 Oct 16:04:04.379 # +new-epoch 1
28087:X 08 Oct 16:04:04.385 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28087:X 08 Oct 16:04:05.388 # +odown master mymaster 127.0.0.1 6379 #quorum 3/2
28087:X 08 Oct 16:04:05.388 # Next failover delay: I will not start a failover before Mon Oct 8 16:10:04 2018
28087:X 08 Oct 16:04:05.631 # +config-update-from sentinel 18311edfbfb7bf89fe4b67d08ef432053db62fff 127.0.0.1 26380 @ mymaster 127.0.0.1 6379
28087:X 08 Oct 16:04:05.631 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
28087:X 08 Oct 16:04:35.656 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
Sentinel-2
28163:X 08 Oct 16:04:04.289 # +sdown master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.366 # +odown master mymaster 127.0.0.1 6379 #quorum 3/2
28163:X 08 Oct 16:04:04.366 # +new-epoch 1
28163:X 08 Oct 16:04:04.366 # +try-failover master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.373 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # 3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8 voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # d4424b8684977767be4f5abd1e364153fbb0adbd voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.450 # +elected-leader master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.450 # +failover-state-select-slave master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.528 # +selected-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.528 * +failover-state-send-slaveof-noone slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.586 * +failover-state-wait-promotion slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:05.543 # +promoted-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:05.543 # +failover-state-reconf-slaves master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:05.629 * +slave-reconf-sent slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.554 # -odown master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.555 * +slave-reconf-inprog slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.555 * +slave-reconf-done slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.606 # +failover-end master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.606 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
28163:X 08 Oct 16:04:36.687 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
Sentinel-3
28234:X 08 Oct 16:04:04.288 # +sdown master mymaster 127.0.0.1 6379
28234:X 08 Oct 16:04:04.378 # +new-epoch 1
28234:X 08 Oct 16:04:04.385 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28234:X 08 Oct 16:04:04.385 # +odown master mymaster 127.0.0.1 6379 #quorum 2/2
28234:X 08 Oct 16:04:04.385 # Next failover delay: I will not start a failover before Mon Oct 8 16:10:04 2018
28234:X 08 Oct 16:04:05.630 # +config-update-from sentinel 18311edfbfb7bf89fe4b67d08ef432053db62fff 127.0.0.1 26380 @ mymaster 127.0.0.1 6379
28234:X 08 Oct 16:04:05.630 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28234:X 08 Oct 16:04:05.630 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28234:X 08 Oct 16:04:05.630 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
28234:X 08 Oct 16:04:35.709 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
slave2 (127.0.0.1:6380)
28244:S 08 Oct 16:04:04.762 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:04:04.762 # Error condition on socket for SYNC: Connection refused
28244:S 08 Oct 16:04:05.630 * SLAVE OF 127.0.0.1:6381 enabled (user request from 'id=6 addr=127.0.0.1:43880 fd=12 name= age=148 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=224 qbuf-free=32544 obl=81 oll=0 omem=0 events=r cmd=slaveof')
28244:S 08 Oct 16:04:05.636 # CONFIG REWRITE executed with success.
28244:S 08 Oct 16:04:05.770 * Connecting to MASTER 127.0.0.1:6381
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:04:05.770 * Non blocking connect for SYNC fired the event.
28244:S 08 Oct 16:04:05.770 * Master replied to PING, replication can continue...
28244:S 08 Oct 16:04:05.770 * Trying a partial resynchronization (request b95802ca8afd97c578b355a5838d219681d0af27:24302).
28244:S 08 Oct 16:04:05.770 * Successful partial resynchronization with master.
28244:S 08 Oct 16:04:05.770 # Master replication ID changed to a4022bb5c361353a4773fd460cec5cdcc5c02031
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.
slave3 (127.0.0.1:6381)
28253:S 08 Oct 16:04:03.655 * MASTER <-> SLAVE sync started
28253:S 08 Oct 16:04:03.655 # Error condition on socket for SYNC: Connection refused
28253:M 08 Oct 16:04:04.586 # Setting secondary replication ID to b95802ca8afd97c578b355a5838d219681d0af27, valid up to offset: 24302. New replication ID is a4022bb5c361353a4773fd460cec5cdcc5c02031
28253:M 08 Oct 16:04:04.586 * Discarding previously cached master state.
28253:M 08 Oct 16:04:04.586 * MASTER MODE enabled (user request from 'id=9 addr=127.0.0.1:49316 fd=8 name=sentinel-18311edf-cmd age=137 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=r cmd=exec')
28253:M 08 Oct 16:04:04.593 # CONFIG REWRITE executed with success.
28253:M 08 Oct 16:04:05.770 * Slave 127.0.0.1:6380 asks for synchronization
28253:M 08 Oct 16:04:05.770 * Partial resynchronization request from 127.0.0.1:6380 accepted. Sending 156 bytes of backlog starting from offset 24302.
Putting the logs above together, we can see the following.
Every Sentinel node marks 127.0.0.1 6379 as subjectively down (Subjectively Down, abbreviated sdown).
28163:X 08 Oct 16:04:04.289 # +sdown master mymaster 127.0.0.1 6379
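Once the quorum setting is reached, Sentinel-2 marks the master objectively down (Objectively Down, abbreviated odown). Comparing with the logs of the other two Sentinel nodes shows that Sentinel-2 was the first to reach the objective-down verdict. Next comes the Sentinel leader election. Generally, whichever Sentinel first completes the objective-down determination becomes the leader, and only the Sentinel leader can perform the failover.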
28163:X 08 Oct 16:04:04.366 # +odown master mymaster 127.0.0.1 6379 #quorum 3/2
28163:X 08 Oct 16:04:04.366 # +new-epoch 1
28163:X 08 Oct 16:04:04.366 # +try-failover master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.373 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # 3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8 voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # d4424b8684977767be4f5abd1e364153fbb0adbd voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.450 # +elected-leader master mymaster 127.0.0.1 6379
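The "#quorum 3/2" suffix on the +odown line can be read as "3 Sentinels agree, 2 required". The following is a minimal Python sketch of that check. The reading of the suffix is an assumption (it is consistent with Sentinel-3 logging "#quorum 2/2"), and the function names are illustrative, not part of Redis.

```python
import re

# Assumption: in "#quorum A/B", A is the number of Sentinels that agree
# the master is down and B is the configured quorum.
QUORUM_SUFFIX = re.compile(r"#quorum (\d+)/(\d+)\s*$")


def parse_quorum(message):
    """Return (agreeing, quorum) from an +odown message, or None if absent."""
    match = QUORUM_SUFFIX.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_objectively_down(agreeing, quorum):
    """The master counts as objectively down once agreement reaches quorum."""
    return agreeing >= quorum


line = "+odown master mymaster 127.0.0.1 6379 #quorum 3/2"
agreeing, quorum = parse_quorum(line)
print(is_objectively_down(agreeing, quorum))  # True
```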
Look for a suitable slave to become the master
28163:X 08 Oct 16:04:04.450 # +failover-state-select-slave master mymaster 127.0.0.1 6379
+failover-state-select-slave <instance details> -- New failover state is select-slave: we are trying to find a suitable slave for promotion.
127.0.0.1 6381 is chosen as the new master
28163:X 08 Oct 16:04:04.528 # +selected-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
+selected-slave <instance details> -- We found the specified good slave to promote.
Instruct node 6381 to run slaveof no one, making it the master
28163:X 08 Oct 16:04:04.528 * +failover-state-send-slaveof-noone slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
+failover-state-send-slaveof-noone <instance details> -- We are trying to reconfigure the promoted slave as master, waiting for it to switch.
Wait for node 6381 to be promoted to master
28163:X 08 Oct 16:04:04.586 * +failover-state-wait-promotion slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
Confirm that node 6381 has been promoted to master
28163:X 08 Oct 16:04:05.543 # +promoted-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
Now look at slave3's log output between 16:04:04.528 and 16:04:05.543. It enabled MASTER mode and rewrote its configuration file.
28253:M 08 Oct 16:04:04.586 # Setting secondary replication ID to b95802ca8afd97c578b355a5838d219681d0af27, valid up to offset: 24302. New replication ID is a4022bb5c361353a4773fd460cec5cdcc5c02031
28253:M 08 Oct 16:04:04.586 * Discarding previously cached master state.
28253:M 08 Oct 16:04:04.586 * MASTER MODE enabled (user request from 'id=9 addr=127.0.0.1:49316 fd=8 name=sentinel-18311edf-cmd age=137 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=r cmd=exec')
28253:M 08 Oct 16:04:04.593 # CONFIG REWRITE executed with success.
The failover enters the slave-reconfiguration phase
28163:X 08 Oct 16:04:05.543 # +failover-state-reconf-slaves master mymaster 127.0.0.1 6379
Instruct node 6380 to replicate from the new master
28163:X 08 Oct 16:04:05.629 * +slave-reconf-sent slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
+slave-reconf-sent <instance details> -- The leader sentinel sent the SLAVEOF command to this instance in order to reconfigure it for the new slave.
Compare slave2's log at this moment; the timestamps match closely. It performed an incremental (partial) resynchronization.
28244:S 08 Oct 16:04:05.630 * SLAVE OF 127.0.0.1:6381 enabled (user request from 'id=6 addr=127.0.0.1:43880 fd=12 name= age=148 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=224 qbuf-free=32544 obl=81 oll=0 omem=0 events=r cmd=slaveof')
28244:S 08 Oct 16:04:05.636 # CONFIG REWRITE executed with success.
28244:S 08 Oct 16:04:05.770 * Connecting to MASTER 127.0.0.1:6381
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:04:05.770 * Non blocking connect for SYNC fired the event.
28244:S 08 Oct 16:04:05.770 * Master replied to PING, replication can continue...
28244:S 08 Oct 16:04:05.770 * Trying a partial resynchronization (request b95802ca8afd97c578b355a5838d219681d0af27:24302).
28244:S 08 Oct 16:04:05.770 * Successful partial resynchronization with master.
28244:S 08 Oct 16:04:05.770 # Master replication ID changed to a4022bb5c361353a4773fd460cec5cdcc5c02031
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.
At the same moment the Sentinels also log output. Taking sentinel1 as an example, the log shows that it updates its configuration at this point.
28087:X 08 Oct 16:04:05.631 # +config-update-from sentinel 18311edfbfb7bf89fe4b67d08ef432053db62fff 127.0.0.1 26380 @ mymaster 127.0.0.1 6379
28087:X 08 Oct 16:04:05.631 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
+switch-master <master name> <oldip> <oldport> <newip> <newport> -- The master new IP and address is the specified one after a configuration change. This is the message most external users are interested in.
The synchronization has not yet completed.
28163:X 08 Oct 16:04:06.555 * +slave-reconf-inprog slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
+slave-reconf-inprog <instance details> -- The slave being reconfigured showed to be a slave of the new master ip:port pair, but the synchronization process is not yet complete.
Master-slave synchronization is complete.
28163:X 08 Oct 16:04:06.555 * +slave-reconf-done slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
+slave-reconf-done <instance details> -- The slave is now synchronized with the new master.
The failover is complete.
28163:X 08 Oct 16:04:06.606 # +failover-end master mymaster 127.0.0.1 6379
After a successful failover, the master switch message is published
28163:X 08 Oct 16:04:06.606 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
The slaves are attached to the new master. Note that the old master is registered as a slave of the new master.
28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
+slave <instance details> -- A new slave was detected and attached.
30 s later, the old master is marked subjectively down.
28163:X 08 Oct 16:04:36.687 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
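The 30 s gaps seen in the logs can be checked directly from the timestamps. The following is a small Python sketch that parses one log line and measures the time between two of them. The field layout is inferred from the lines shown above, the year 2018 is taken from the "Next failover delay" message, and the helper names are illustrative.

```python
import re
from datetime import datetime

# Layout inferred from the logs above:
# "<pid>:<role> <day> <month> <time> <level> <message>"
LOG_LINE = re.compile(
    r"^(?P<pid>\d+):(?P<role>[A-Z]) "
    r"(?P<stamp>\d{2} [A-Za-z]{3} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>\S) (?P<message>.*)$"
)

# The log lines carry no year; 2018 comes from the "Next failover delay" line.
ASSUMED_YEAR = "2018"


def parse_line(line):
    """Split one log line into pid, role, time, level and message."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    when = datetime.strptime(
        ASSUMED_YEAR + " " + match.group("stamp"), "%Y %d %b %H:%M:%S.%f"
    )
    return {
        "pid": int(match.group("pid")),
        "role": match.group("role"),
        "time": when,
        "level": match.group("level"),
        "message": match.group("message"),
    }


def seconds_between(first_line, second_line):
    """Elapsed seconds from the first log line to the second."""
    delta = parse_line(second_line)["time"] - parse_line(first_line)["time"]
    return delta.total_seconds()


lost = "28244:S 08 Oct 16:03:34.184 # Connection with master lost."
sdown = "28163:X 08 Oct 16:04:04.289 # +sdown master mymaster 127.0.0.1 6379"
print(seconds_between(lost, sdown))  # 30.105
```

The result is slightly above 30 s, which matches down-after-milliseconds being set to 30000.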
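In summary, Sentinel performs a failover as follows.
1. Every second, each Sentinel node sends a ping command to the master, the slaves, and the other Sentinel nodes as a heartbeat check to confirm that they are reachable. If a node gives no valid reply for longer than down-after-milliseconds, the Sentinel node marks it subjectively down.
2. If the node marked subjectively down is the master, the Sentinel node uses the sentinel is-master-down-by-addr command to ask the other Sentinel nodes for their view of the master. Once the number of Sentinels that agree reaches <quorum>, the Sentinel node marks the master objectively down. If a slave or a Sentinel node is marked subjectively down, no failover follows.
3. The Sentinels elect a leader, which carries out the rest of the failover. The election algorithm is based on Raft.
4. The Sentinel leader starts the failover.
5. It selects a suitable slave as the new master.
6. The Sentinel leader runs slaveof no one on the slave chosen in the previous step, making it the master.
7. It sends commands to the remaining slaves so that they become slaves of the new master; how the replication is rolled out depends on the parallel-syncs parameter.
8. It updates the old master to a slave and keeps it under Sentinel's management, so that it replicates from the new master once it recovers.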
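The steps above map onto the event names the leader logged. The following is a minimal Python sketch that checks whether a list of Sentinel messages contains those events in that order. The order is copied from the Sentinel-2 log in this article; the function name is illustrative, and other Redis versions may log a different sequence.

```python
# Leader-side events, in the order they appear in the Sentinel-2 log above.
EXPECTED_ORDER = [
    "+try-failover",
    "+elected-leader",
    "+failover-state-select-slave",
    "+selected-slave",
    "+failover-state-send-slaveof-noone",
    "+failover-state-wait-promotion",
    "+promoted-slave",
    "+failover-state-reconf-slaves",
    "+failover-end",
    "+switch-master",
]


def failover_completed_in_order(messages):
    """True if each tracked event occurs exactly once, in the expected order.

    Events outside EXPECTED_ORDER (for example +sdown or -odown) are ignored.
    """
    seen = []
    for message in messages:
        parts = message.split()
        if parts and parts[0] in EXPECTED_ORDER:
            seen.append(parts[0])
    return seen == EXPECTED_ORDER
```

Feeding it the message part of each Sentinel-2 line (everything after the "#" or "*" level marker) should return True for the run shown here.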
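Sentinel leader election process.
Sentinel's leader election is based on the Raft protocol.
1. Every online Sentinel node is eligible to become leader. When it confirms that the master is subjectively down, it sends the sentinel is-master-down-by-addr command to the other Sentinel nodes, asking to be made leader.
2. A Sentinel node that receives the command agrees to the request if it has not already agreed to another Sentinel node's sentinel is-master-down-by-addr request; otherwise it refuses.
3. If the Sentinel node finds that its vote count is greater than or equal to max(quorum, num(sentinels)/2+1), it becomes the leader.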
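The three rules above can be sketched in a few lines of Python. This is a simplified model, not Sentinel's implementation: it ignores epochs and timeouts, reads num(sentinels)/2 as integer division, and uses the hypothetical labels S1 to S3 for the Sentinels.

```python
def votes_needed(quorum, num_sentinels):
    """Votes required to become leader: max(quorum, num(sentinels)/2 + 1)."""
    return max(quorum, num_sentinels // 2 + 1)


def elect_leader(requests, quorum, num_sentinels):
    """Tally votes and return the winning candidate, or None.

    requests is a list of (voter, candidate) pairs in arrival order.
    Each voter grants only the first request it receives (rule 2).
    """
    granted = {}
    for voter, candidate in requests:
        granted.setdefault(voter, candidate)

    tally = {}
    for candidate in granted.values():
        tally[candidate] = tally.get(candidate, 0) + 1

    needed = votes_needed(quorum, num_sentinels)
    for candidate, count in tally.items():
        if count >= needed:
            return candidate
    return None


# As in the logs above: all three Sentinels vote for Sentinel-2.
print(elect_leader([("S2", "S2"), ("S3", "S2"), ("S1", "S2")], 2, 3))  # S2
```

Because the threshold is never below a majority, at most one candidate can reach it in a given round.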
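New master selection process.
1. Discard all slaves that are already down or disconnected.
2. Discard slaves that have not replied to the leader Sentinel's INFO command in the last 5 seconds.
3. Discard all slaves whose connection to the failed master has been broken for more than down-after-milliseconds*10 milliseconds.
4. Select the slave with the highest priority.
5. If priorities tie, select the slave with the largest replication offset.
6. If offsets also tie, select the slave with the smallest runid.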
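The filter-then-rank logic above can be sketched in Python as follows. The Replica class and its field names are hypothetical. The sketch also assumes the redis.conf convention that a lower slave-priority number means a higher priority and that a priority of 0 means "never promote". The sample offsets are the ones shown in the sentinel slaves output later in this article.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical view of one slave as seen by the leader Sentinel."""
    name: str
    runid: str
    priority: int            # slave-priority; lower number = preferred, 0 = never promote
    repl_offset: int         # slave-repl-offset
    is_down: bool = False
    ms_since_info_reply: int = 0
    ms_master_link_down: int = 0


def select_new_master(replicas, down_after_ms):
    """Apply rules 1-3 as filters, then rules 4-6 as an ordering."""
    candidates = [
        r for r in replicas
        if not r.is_down
        and r.ms_since_info_reply <= 5000
        and r.ms_master_link_down <= down_after_ms * 10
        and r.priority != 0
    ]
    if not candidates:
        return None
    # Lowest priority number first, then largest offset, then smallest runid.
    return min(candidates, key=lambda r: (r.priority, -r.repl_offset, r.runid))


replicas = [
    Replica("127.0.0.1:6380", "983b87fd070c7f052b26f5135bbb30fdeb170a54", 100, 70375),
    Replica("127.0.0.1:6381", "b88059cce9104dd4e0366afd6ad07a163dae8b15", 100, 71040),
]
print(select_new_master(replicas, 30000).name)  # 127.0.0.1:6381
```

With equal priorities the slave with the larger replication offset wins, so 6381 is chosen here.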
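Three periodic monitoring tasks
1. Every 10 seconds, each Sentinel node sends the info command to the master and the slaves to obtain the latest topology. This serves the following purposes:
1> Running info against the master returns the slaves' information, which is why the slaves do not need to be configured explicitly on the Sentinel nodes.
2> A newly added slave is noticed promptly.
3> After a node becomes unreachable or a failover occurs, the topology is refreshed through the info command.
2. Every 2 seconds, each Sentinel node publishes its view of the master, together with its own information, on the __sentinel__:hello channel of the Redis data nodes. Each Sentinel node also subscribes to that channel to learn about the other Sentinel nodes and their views of the master. This serves the following purposes:
1> Discovering new Sentinel nodes: by subscribing to the master's __sentinel__:hello channel, a Sentinel learns about the other Sentinel nodes; if one is newly added, it stores that node's information and opens a connection to it.
2> Sentinel nodes exchange the master's state, which is the basis for the later objective-down decision and leader election.
3. Every second, each Sentinel node sends a ping command to the master, the slaves, and the other Sentinel nodes as a heartbeat check to confirm that they are reachable. This task is the main basis for failure detection.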
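The three intervals can be summarized in a small Python sketch that reports which tasks fall due on a given second. It is only a way to visualize the schedule; the task labels are made up, and Sentinel's real timer is not aligned to whole seconds like this.

```python
# Interval in seconds for each periodic task described above.
TASK_INTERVALS = {
    "info": 10,   # refresh topology from master and slaves
    "hello": 2,   # publish on __sentinel__:hello
    "ping": 1,    # heartbeat to master, slaves and other Sentinels
}


def due_tasks(elapsed_seconds):
    """Return the tasks whose interval divides the elapsed time."""
    return [
        name
        for name, interval in TASK_INTERVALS.items()
        if elapsed_seconds % interval == 0
    ]


print(due_tasks(4))   # ['hello', 'ping']
print(due_tasks(10))  # ['info', 'hello', 'ping']
```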
Sentinel parameters
# bind 127.0.0.1 192.168.1.1
# protected-mode no
port 26379
# sentinel announce-ip <ip>
# sentinel announce-port <port>
dir /tmp
sentinel monitor mymaster 127.0.0.1 6379 2
# sentinel auth-pass <master-name> <password>
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000
# sentinel notification-script mymaster /var/redis/notify.sh
# sentinel client-reconfig-script mymaster /var/redis/reconfig.sh
sentinel deny-scripts-reconfig yes
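Where:
dir: sets Sentinel's working directory.
sentinel monitor mymaster 127.0.0.1 6379 2: the 2 is the quorum, meaning that at least two Sentinel nodes must consider the master subjectively down before it can be marked objectively down. The usual recommendation is to set it to half the number of Sentinel nodes plus 1. Beyond that, quorum also affects the Sentinel leader election: to elect a leader, at least max(quorum, num(sentinels) / 2 + 1) Sentinel nodes must take part in the election.
sentinel down-after-milliseconds mymaster 30000: each Sentinel node periodically sends ping commands to determine whether the Redis nodes and the other Sentinel nodes are reachable.
If no valid reply arrives from the master within the specified time, it is marked subjectively down. Note that this parameter is used to judge not only the master's state but also the state of the slaves under that master and of the other Sentinels. The default is 30 s.
sentinel parallel-syncs mymaster 1: the number of slaves allowed to be repointed at the new master at the same time during a failover. If this value (numslaves) is set high, replication itself does not block the master, but several nodes resynchronizing from the new master at once increases its network and disk I/O load.
sentinel failover-timeout mymaster 180000: defines the failover timeout. The default is 180000 milliseconds, i.e. 3 min. Note that this is not the total duration of a failover; it applies to several different phases of the failover.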
# Specifies the failover timeout in milliseconds. It is used in many ways:
#
# - The time needed to re-start a failover after a previous failover was
#   already tried against the same master by a given Sentinel, is two
#   times the failover timeout.
#
# - The time needed for a slave replicating to a wrong master according
#   to a Sentinel current configuration, to be forced to replicate
#   with the right master, is exactly the failover timeout (counting since
#   the moment a Sentinel detected the misconfiguration).
#
# - The time needed to cancel a failover that is already in progress but
#   did not produced any configuration change (SLAVEOF NO ONE yet not
#   acknowledged by the promoted slave).
#
# - The maximum time a failover in progress waits for all the slaves to be
#   reconfigured as slaves of the new master. However even after this time
#   the slaves will be reconfigured by the Sentinels anyway, but not with
#   the exact parallel-syncs progression as specified.
First scenario: if Redis Sentinel fails to fail over a master, the earliest it will start another failover for that master is 2 × failover-timeout later. The Sentinel log shows this (28234:X 08 Oct 16:04:04.385 # Next failover delay: I will not start a failover before Mon Oct 8 16:10:04 2018)
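The time in that log message can be reproduced with simple arithmetic. The following Python sketch assumes the delay is counted from the second in which the Sentinel voted (16:04:04) and that the log's date is 8 October 2018; the function name is illustrative.

```python
from datetime import datetime, timedelta


def next_failover_allowed(failover_start, failover_timeout_ms):
    """Earliest time a new failover may start: start + 2 * failover-timeout."""
    return failover_start + timedelta(milliseconds=2 * failover_timeout_ms)


start = datetime(2018, 10, 8, 16, 4, 4)
print(next_failover_allowed(start, 180000))  # 2018-10-08 16:10:04
```

With failover-timeout at 180000 ms the delay is 360 s, i.e. 6 min, which matches the 16:10:04 in the log.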
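sentinel notification-script: defines the notification script. When a WARNING-level event occurs in Sentinel, the script is called with two arguments: the event type and the event description.
sentinel client-reconfig-script: when the master changes, the script defined by this parameter is called with the following arguments: <master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>
Scripts must follow certain rules.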
# SCRIPTS EXECUTION
#
# sentinel notification-script and sentinel reconfig-script are used in order
# to configure scripts that are called to notify the system administrator
# or to reconfigure clients after a failover. The scripts are executed
# with the following rules for error handling:
#
# If script exits with "1" the execution is retried later (up to a maximum
# number of times currently set to 10).
#
# If script exits with "2" (or an higher value) the script execution is
# not retried.
#
# If script terminates because it receives a signal the behavior is the same
# as exit code 1.
#
# A script has a maximum running time of 60 seconds. After this limit is
# reached the script is terminated with a SIGKILL and the execution retried.
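The retry rules in that comment can be expressed as a small decision function. This Python sketch models only the rules as quoted; how Sentinel counts attempts internally is an assumption here, and the function and parameter names are illustrative.

```python
MAX_RETRIES = 10  # "currently set to 10" in the quoted comment


def should_retry(exit_code, killed_by_signal, attempts_so_far):
    """Decide whether a script run should be retried later.

    killed_by_signal covers any terminating signal, including the
    SIGKILL sent after the 60-second running-time limit.
    """
    if attempts_so_far >= MAX_RETRIES:
        return False
    if killed_by_signal:
        return True  # same behavior as exit code 1
    return exit_code == 1  # 0 is success; 2 or higher is never retried


print(should_retry(1, False, 1))  # True
print(should_retry(2, False, 1))  # False
```

A script that hits a transient problem should therefore exit with 1, and one that hits a permanent problem should exit with 2 or higher.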
sentinel deny-scripts-reconfig: disallows setting notification-script and client-reconfig-script through SENTINEL SET.
Common Sentinel operations
sentinel masters
Outputs the state of the monitored masters
127.0.0.1:26379> sentinel masters
1)  1) "name"
    2) "mymaster"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "6379"
    7) "runid"
    8) "6ab2be5db3a37c10f2473c8fb9daed147a32df3e"
    9) "flags"
   10) "master"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "639"
   19) "last-ping-reply"
   20) "639"
   21) "down-after-milliseconds"
   22) "30000"
   23) "info-refresh"
   24) "2075"
   25) "role-reported"
   26) "master"
   27) "role-reported-time"
   28) "759682"
   29) "config-epoch"
   30) "0"
   31) "num-slaves"
   32) "2"
   33) "num-other-sentinels"
   34) "2"
   35) "quorum"
   36) "2"
   37) "failover-timeout"
   38) "180000"
   39) "parallel-syncs"
   40) "1"
You can also view the state of a single master
sentinel master mymaster
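The reply is a flat list that alternates field names and values. When processing it in a program, it is often convenient to turn it into a dictionary first. The following is a minimal Python sketch that assumes the reply has already been read into a list of strings; the helper name is illustrative.

```python
def pairs_to_dict(flat_reply):
    """Convert ["name", "mymaster", "ip", "127.0.0.1", ...] into a dict."""
    if len(flat_reply) % 2 != 0:
        raise ValueError("expected an even number of items (field/value pairs)")
    return dict(zip(flat_reply[0::2], flat_reply[1::2]))


# A few fields taken from the sentinel masters output above.
reply = ["name", "mymaster", "ip", "127.0.0.1", "port", "6379",
         "quorum", "2", "num-slaves", "2"]
master = pairs_to_dict(reply)
print(master["ip"], master["port"])  # 127.0.0.1 6379
```

All values arrive as strings, so numeric fields such as quorum need an explicit int() before comparison.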
sentinel slaves mymaster
View the state of a master's slaves
127.0.0.1:26379> sentinel slaves mymaster
1)  1) "name"
    2) "127.0.0.1:6380"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "6380"
    7) "runid"
    8) "983b87fd070c7f052b26f5135bbb30fdeb170a54"
    9) "flags"
   10) "slave"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "178"
   19) "last-ping-reply"
   20) "178"
   21) "down-after-milliseconds"
   22) "30000"
   23) "info-refresh"
   24) "6160"
   25) "role-reported"
   26) "slave"
   27) "role-reported-time"
   28) "489019"
   29) "master-link-down-time"
   30) "0"
   31) "master-link-status"
   32) "ok"
   33) "master-host"
   34) "127.0.0.1"
   35) "master-port"
   36) "6379"
   37) "slave-priority"
   38) "100"
   39) "slave-repl-offset"
   40) "70375"
2)  1) "name"
    2) "127.0.0.1:6381"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "6381"
    7) "runid"
    8) "b88059cce9104dd4e0366afd6ad07a163dae8b15"
    9) "flags"
   10) "slave"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "178"
   19) "last-ping-reply"
   20) "178"
   21) "down-after-milliseconds"
   22) "30000"
   23) "info-refresh"
   24) "2918"
   25) "role-reported"
   26) "slave"
   27) "role-reported-time"
   28) "489019"
   29) "master-link-down-time"
   30) "0"
   31) "master-link-status"
   32) "ok"
   33) "master-host"
   34) "127.0.0.1"
   35) "master-port"
   36) "6379"
   37) "slave-priority"
   38) "100"
   39) "slave-repl-offset"
   40) "71040"
sentinel sentinels mymaster
View the state of the other Sentinels
127.0.0.1:26379> sentinel sentinels mymaster
1)  1) "name"
    2) "738ccbddaa0d4379d89a147613d9aecfec765bcb"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "26381"
    7) "runid"
    8) "738ccbddaa0d4379d89a147613d9aecfec765bcb"
    9) "flags"
   10) "sentinel"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "475"
   19) "last-ping-reply"
   20) "475"
   21) "down-after-milliseconds"
   22) "30000"
   23) "last-hello-message"
   24) "79"
   25) "voted-leader"
   26) "?"
   27) "voted-leader-epoch"
   28) "0"
2)  1) "name"
    2) "7251bb129ca373ad0d8c7baf3b6577ae2593079f"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "26380"
    7) "runid"
    8) "7251bb129ca373ad0d8c7baf3b6577ae2593079f"
    9) "flags"
   10) "sentinel"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "475"
   19) "last-ping-reply"
   20) "475"
   21) "down-after-milliseconds"
   22) "30000"
   23) "last-hello-message"
   24) "985"
   25) "voted-leader"
   26) "?"
   27) "voted-leader-epoch"
   28) "0"
sentinel get-master-addr-by-name <master name>
Returns the IP address and port of the master named <master name>. If a failover is in progress, it shows the new master's information.
127.0.0.1:26379> sentinel get-master-addr-by-name mymaster
1) "127.0.0.1"
2) "6379"
sentinel reset <pattern>
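Resets the configuration of all masters matching <pattern> (glob-style).
If a slave goes down, it remains under Sentinel's management, so once it recovers it rejoins the previous replication setup even if its configuration file has no slaveof option. Likewise, if the master goes down, after a restart it joins the previous replication setup as a slave by default.
But often we really do want to remove the old master or a slave, and this is where sentinel reset comes in. It resets the configuration based on the current master's state (they'll refresh the list of slaves within the next 10 seconds, only adding the ones listed as correctly replicating from the current master INFO output). Crucially, slaves that are not in a healthy state are dropped from the current configuration. A dropped node then does not rejoin the previous replication setup automatically after it recovers (note that its slaveof setting must be removed from its configuration file).
Note that this command affects only the current Sentinel node; to remove a node, run reset on every Sentinel node.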
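Since the reset has to be repeated on every Sentinel, it can help to generate the commands in one place. The following Python sketch only builds the redis-cli argument lists and does not run them. The ports are the three Sentinel ports used in this article, and the function name is illustrative.

```python
def reset_commands(sentinel_ports, pattern, host="127.0.0.1"):
    """Build one 'redis-cli ... sentinel reset <pattern>' command per Sentinel."""
    return [
        ["redis-cli", "-h", host, "-p", str(port), "sentinel", "reset", pattern]
        for port in sentinel_ports
    ]


for command in reset_commands([26379, 26380, 26381], "mymaster"):
    print(" ".join(command))
```

Each list can be passed to subprocess.run as-is if the commands should be executed from Python.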
sentinel failover <master name>
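Forces a failover of the master named <master name>. Unlike a regular failover, it does not require a Sentinel leader election; the current Sentinel node carries out the failover directly.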
sentinel ckquorum <master name>
Checks whether the number of currently reachable Sentinel nodes meets <quorum>
127.0.0.1:26379> sentinel ckquorum mymaster
OK 3 usable Sentinels. Quorum and failover authorization can be reached
sentinel flushconfig
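Forces the Sentinel node's configuration to be flushed to disk. Sentinel nodes use this command themselves quite often; developers and operators only need it when the configuration file has been corrupted or lost for external reasons (for example, a damaged disk).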
sentinel remove <master name>
Stops the current Sentinel node from monitoring the master named <master name>.
[root@slowtech redis-4.0.11]# grep -Ev "^#|^$" sentinel_26379.conf
port 26379
dir "/tmp"
sentinel myid 2467530fa249dbbc435c50fbb0dc2a4e766146f8
sentinel deny-scripts-reconfig yes
sentinel monitor mymaster 127.0.0.1 6381 2
sentinel config-epoch mymaster 12
sentinel leader-epoch mymaster 0
sentinel known-slave mymaster 127.0.0.1 6380
sentinel known-slave mymaster 127.0.0.1 6379
sentinel known-sentinel mymaster 127.0.0.1 26381 738ccbddaa0d4379d89a147613d9aecfec765bcb
sentinel known-sentinel mymaster 127.0.0.1 26380 7251bb129ca373ad0d8c7baf3b6577ae2593079f
sentinel current-epoch 12
[root@slowtech redis-4.0.11]# redis-cli -p 26379
127.0.0.1:26379> sentinel remove mymaster
OK
127.0.0.1:26379> quit
[root@slowtech redis-4.0.11]# grep -Ev "^#|^$" sentinel_26379.conf
port 26379
dir "/tmp"
sentinel myid 2467530fa249dbbc435c50fbb0dc2a4e766146f8
sentinel deny-scripts-reconfig yes
sentinel current-epoch 12
sentinel set <name> <option> <value>
Parameter Usage
quorum sentinel set mymaster quorum 3
down-after-milliseconds sentinel set mymaster down-after-milliseconds 30000
failover-timeout sentinel set mymaster failover-timeout 18000
parallel-syncs sentinel set mymaster parallel-syncs 3
notification-script sentinel set mymaster notification-script /tmp/a.sh
client-reconfig-script sentinel set mymaster client-reconfig-script /tmp/b.sh
auth-pass sentinel set mymaster auth-pass masterpassword
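Note that:
1. The sentinel set command affects only the current Sentinel node.
2. If sentinel set succeeds, the configuration file is updated immediately. This differs from ordinary Redis data nodes, where config rewrite must be run after a change to persist it to the configuration file.
3. Keep the configuration of all Sentinel nodes as consistent as possible.
4. Sentinel does not support the config command. To view parameter settings, use the SENTINEL MASTER command.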
References:
1. 《Redis開發與運維》
2. 《Redis設計與實現》
3. 《Redis 4.X Cookbook》
4. Official documentation