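A Deep Dive into Sentinel, the Redis High-Availability Solution

Redis Sentinel is the high-availability solution for Redis. It was officially introduced in Redis 2.8.

In the earlier master-slave replication setup, if the master failed, a slave had to be promoted to master by hand, the remaining slaves had to be repointed at the new master, and the master address used by the application had to be changed. The whole process required manual intervention.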

 

Below, we walk through Sentinel's failover process in detail using the logs.

 

Sentinel's failover process

The cluster topology is as follows.

Role         IP          Port    runID
Master       127.0.0.1   6379
Slave-1      127.0.0.1   6380
Slave-2      127.0.0.1   6381
Sentinel-1   127.0.0.1   26379   d4424b8684977767be4f5abd1e364153fbb0adbd
Sentinel-2   127.0.0.1   26380   18311edfbfb7bf89fe4b67d08ef432053db62fff
Sentinel-3   127.0.0.1   26381   3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8

 

The master process is killed with kill -9.

1. The slaves react first.

They immediately output the following.

28244:S 08 Oct 16:03:34.184 # Connection with master lost.
28244:S 08 Oct 16:03:34.184 * Caching the disconnected master state.
28244:S 08 Oct 16:03:34.548 * Connecting to MASTER 127.0.0.1:6379
28244:S 08 Oct 16:03:34.548 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:03:34.548 # Error condition on socket for SYNC: Connection refused
28244:S 08 Oct 16:03:35.556 * Connecting to MASTER 127.0.0.1:6379
28244:S 08 Oct 16:03:35.556 * MASTER <-> SLAVE sync started
...

2. The Sentinel logs show output only 30s later, which follows from the setting "sentinel down-after-milliseconds mymaster 30000".
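The 30s gap can be checked directly from the timestamps. The following is a minimal sketch, not part of Redis: the helper name parse_ts is made up for illustration, and it assumes the log line layout shown in this article (PID:role, day, month, time).

```python
from datetime import datetime

def parse_ts(line: str) -> datetime:
    # Log lines look like "28244:S 08 Oct 16:03:34.184 # message".
    # Fields 1..3 hold day, month and time; the year is not logged.
    # %b expects English month abbreviations (the default C locale).
    parts = line.split()
    return datetime.strptime(" ".join(parts[1:4]), "%d %b %H:%M:%S.%f")

# First line logged by the slave and the first +sdown logged by Sentinel-1.
lost = "28244:S 08 Oct 16:03:34.184 # Connection with master lost."
sdown = "28087:X 08 Oct 16:04:04.277 # +sdown master mymaster 127.0.0.1 6379"

delay_ms = (parse_ts(sdown) - parse_ts(lost)).total_seconds() * 1000
print(round(delay_ms))  # just over the configured 30000 ms
```

The small excess over 30000 ms is expected, because Sentinel only evaluates the timeout periodically.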

Below are the log outputs of each Sentinel node and of the slaves, in turn.

Sentinel-1

28087:X 08 Oct 16:04:04.277 # +sdown master mymaster 127.0.0.1 6379
28087:X 08 Oct 16:04:04.379 # +new-epoch 1
28087:X 08 Oct 16:04:04.385 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28087:X 08 Oct 16:04:05.388 # +odown master mymaster 127.0.0.1 6379 #quorum 3/2
28087:X 08 Oct 16:04:05.388 # Next failover delay: I will not start a failover before Mon Oct  8 16:10:04 2018
28087:X 08 Oct 16:04:05.631 # +config-update-from sentinel 18311edfbfb7bf89fe4b67d08ef432053db62fff 127.0.0.1 26380 @ mymaster 127.0.0.1 6379
28087:X 08 Oct 16:04:05.631 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
28087:X 08 Oct 16:04:35.656 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381

Sentinel-2

28163:X 08 Oct 16:04:04.289 # +sdown master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.366 # +odown master mymaster 127.0.0.1 6379 #quorum 3/2
28163:X 08 Oct 16:04:04.366 # +new-epoch 1
28163:X 08 Oct 16:04:04.366 # +try-failover master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.373 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # 3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8 voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # d4424b8684977767be4f5abd1e364153fbb0adbd voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.450 # +elected-leader master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.450 # +failover-state-select-slave master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.528 # +selected-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.528 * +failover-state-send-slaveof-noone slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.586 * +failover-state-wait-promotion slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:05.543 # +promoted-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:05.543 # +failover-state-reconf-slaves master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:05.629 * +slave-reconf-sent slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.554 # -odown master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.555 * +slave-reconf-inprog slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.555 * +slave-reconf-done slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.606 # +failover-end master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:06.606 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
28163:X 08 Oct 16:04:36.687 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381

Sentinel-3

28234:X 08 Oct 16:04:04.288 # +sdown master mymaster 127.0.0.1 6379
28234:X 08 Oct 16:04:04.378 # +new-epoch 1
28234:X 08 Oct 16:04:04.385 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28234:X 08 Oct 16:04:04.385 # +odown master mymaster 127.0.0.1 6379 #quorum 2/2
28234:X 08 Oct 16:04:04.385 # Next failover delay: I will not start a failover before Mon Oct  8 16:10:04 2018
28234:X 08 Oct 16:04:05.630 # +config-update-from sentinel 18311edfbfb7bf89fe4b67d08ef432053db62fff 127.0.0.1 26380 @ mymaster 127.0.0.1 6379
28234:X 08 Oct 16:04:05.630 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28234:X 08 Oct 16:04:05.630 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28234:X 08 Oct 16:04:05.630 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381
28234:X 08 Oct 16:04:35.709 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381

 

slave2 (127.0.0.1:6380)

28244:S 08 Oct 16:04:04.762 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:04:04.762 # Error condition on socket for SYNC: Connection refused
28244:S 08 Oct 16:04:05.630 * SLAVE OF 127.0.0.1:6381 enabled (user request from 'id=6 addr=127.0.0.1:43880 fd=12 name= age=148 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=224 qbuf-free=32544 obl=81 oll=0 omem=0 events=r cmd=slaveof')
28244:S 08 Oct 16:04:05.636 # CONFIG REWRITE executed with success.
28244:S 08 Oct 16:04:05.770 * Connecting to MASTER 127.0.0.1:6381
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:04:05.770 * Non blocking connect for SYNC fired the event.
28244:S 08 Oct 16:04:05.770 * Master replied to PING, replication can continue...
28244:S 08 Oct 16:04:05.770 * Trying a partial resynchronization (request b95802ca8afd97c578b355a5838d219681d0af27:24302).
28244:S 08 Oct 16:04:05.770 * Successful partial resynchronization with master.
28244:S 08 Oct 16:04:05.770 # Master replication ID changed to a4022bb5c361353a4773fd460cec5cdcc5c02031
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.

 

slave3 (127.0.0.1:6381)

28253:S 08 Oct 16:04:03.655 * MASTER <-> SLAVE sync started
28253:S 08 Oct 16:04:03.655 # Error condition on socket for SYNC: Connection refused
28253:M 08 Oct 16:04:04.586 # Setting secondary replication ID to b95802ca8afd97c578b355a5838d219681d0af27, valid up to offset: 24302. New replication ID is a4022bb5c361353a4773fd460cec5cdcc5c02031
28253:M 08 Oct 16:04:04.586 * Discarding previously cached master state.
28253:M 08 Oct 16:04:04.586 * MASTER MODE enabled (user request from 'id=9 addr=127.0.0.1:49316 fd=8 name=sentinel-18311edf-cmd age=137 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=r cmd=exec')
28253:M 08 Oct 16:04:04.593 # CONFIG REWRITE executed with success.
28253:M 08 Oct 16:04:05.770 * Slave 127.0.0.1:6380 asks for synchronization
28253:M 08 Oct 16:04:05.770 * Partial resynchronization request from 127.0.0.1:6380 accepted. Sending 156 bytes of backlog starting from offset 24302.

 

Putting the logs above together, we can see the following.

Every Sentinel node judged 127.0.0.1 6379 to be subjectively down (Subjectively Down, abbreviated sdown).

28163:X 08 Oct 16:04:04.289 # +sdown master mymaster 127.0.0.1 6379

 

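Once the quorum setting was reached, Sentinel-2 judged the master to be objectively down (Objectively Down, abbreviated odown). Comparing with the logs of the other two Sentinel nodes, Sentinel-2 was the first to reach the objectively-down verdict. Next, a Sentinel leader election takes place. Generally speaking, whichever Sentinel reaches the objectively-down verdict first becomes the leader, and only the Sentinel leader can perform the failover.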

28163:X 08 Oct 16:04:04.366 # +odown master mymaster 127.0.0.1 6379 #quorum 3/2
28163:X 08 Oct 16:04:04.366 # +new-epoch 1
28163:X 08 Oct 16:04:04.366 # +try-failover master mymaster 127.0.0.1 6379
28163:X 08 Oct 16:04:04.373 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # 3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8 voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # d4424b8684977767be4f5abd1e364153fbb0adbd voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.450 # +elected-leader master mymaster 127.0.0.1 6379
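The votes in the excerpt above can be tallied mechanically. This is a small illustrative sketch: the function name count_votes and the regular expressions are assumptions based only on the log lines shown here, not on any Sentinel API.

```python
import re
from collections import Counter

LEADER = "18311edfbfb7bf89fe4b67d08ef432053db62fff"  # runID of Sentinel-2

SENTINEL2_LOG = """\
28163:X 08 Oct 16:04:04.373 # +vote-for-leader 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # 3e9eb1aa9378d89cfe04fe21bf4a05a901747fa8 voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
28163:X 08 Oct 16:04:04.385 # d4424b8684977767be4f5abd1e364153fbb0adbd voted for 18311edfbfb7bf89fe4b67d08ef432053db62fff 1
"""

def count_votes(log_text: str) -> Counter:
    """Count votes per (candidate runID, epoch) as seen by one Sentinel."""
    votes = Counter()
    for line in log_text.splitlines():
        # The Sentinel's own vote, then votes reported by its peers.
        own = re.search(r"\+vote-for-leader (\w+) (\d+)", line)
        peer = re.search(r"\w+ voted for (\w+) (\d+)", line)
        match = own or peer
        if match:
            votes[(match.group(1), int(match.group(2)))] += 1
    return votes

votes = count_votes(SENTINEL2_LOG)
print(votes[(LEADER, 1)])
```

All three Sentinels voted for Sentinel-2 in epoch 1, which is why it logs +elected-leader immediately afterwards.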

 

Look for a suitable slave to become the master

28163:X 08 Oct 16:04:04.450 # +failover-state-select-slave master mymaster 127.0.0.1 6379

+failover-state-select-slave <instance details> -- New failover state is select-slave: we are trying to find a suitable slave for promotion.

 

127.0.0.1 6381 is chosen as the new master

28163:X 08 Oct 16:04:04.528 # +selected-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379

+selected-slave <instance details> -- We found the specified good slave to promote.

 

Instruct node 6381 to execute slaveof no one so that it becomes the master

28163:X 08 Oct 16:04:04.528 * +failover-state-send-slaveof-noone slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379

+failover-state-send-slaveof-noone <instance details> -- We are trying to reconfigure the promoted slave as master, waiting for it to switch.

 

Wait for node 6381 to be promoted to master

28163:X 08 Oct 16:04:04.586 * +failover-state-wait-promotion slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379

 

Confirm that node 6381 has been promoted to master

28163:X 08 Oct 16:04:05.543 # +promoted-slave slave 127.0.0.1:6381 127.0.0.1 6381 @ mymaster 127.0.0.1 6379

 

Now look at slave3's log output in the window from 16:04:04.528 to 16:04:05.543. It enabled MASTER mode and rewrote its configuration file.

28253:M 08 Oct 16:04:04.586 # Setting secondary replication ID to b95802ca8afd97c578b355a5838d219681d0af27, valid up to offset: 24302. New replication ID is a4022bb5c361353a4773fd460cec5cdcc5c02031
28253:M 08 Oct 16:04:04.586 * Discarding previously cached master state.
28253:M 08 Oct 16:04:04.586 * MASTER MODE enabled (user request from 'id=9 addr=127.0.0.1:49316 fd=8 name=sentinel-18311edf-cmd age=137 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=r cmd=exec')
28253:M 08 Oct 16:04:04.593 # CONFIG REWRITE executed with success.

 

The failover enters the slave reconfiguration phase

28163:X 08 Oct 16:04:05.543 # +failover-state-reconf-slaves master mymaster 127.0.0.1 6379

 

Instruct node 6380 to replicate from the new master

28163:X 08 Oct 16:04:05.629 * +slave-reconf-sent slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379

+slave-reconf-sent <instance details> -- The leader sentinel sent the SLAVEOF command to this instance in order to reconfigure it for the new slave.

 

Look at slave2's log output at this point in time; it largely matches. It performed a partial (incremental) resynchronization.

28244:S 08 Oct 16:04:05.630 * SLAVE OF 127.0.0.1:6381 enabled (user request from 'id=6 addr=127.0.0.1:43880 fd=12 name= age=148 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=224 qbuf-free=32544 obl=81 oll=0 omem=0 events=r cmd=slaveof')
28244:S 08 Oct 16:04:05.636 # CONFIG REWRITE executed with success.
28244:S 08 Oct 16:04:05.770 * Connecting to MASTER 127.0.0.1:6381
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync started
28244:S 08 Oct 16:04:05.770 * Non blocking connect for SYNC fired the event.
28244:S 08 Oct 16:04:05.770 * Master replied to PING, replication can continue...
28244:S 08 Oct 16:04:05.770 * Trying a partial resynchronization (request b95802ca8afd97c578b355a5838d219681d0af27:24302).
28244:S 08 Oct 16:04:05.770 * Successful partial resynchronization with master.
28244:S 08 Oct 16:04:05.770 # Master replication ID changed to a4022bb5c361353a4773fd460cec5cdcc5c02031
28244:S 08 Oct 16:04:05.770 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.

 

At the same point in time, the Sentinels also log output; take sentinel1 as an example. The log shows that it updates its configuration at this moment.

28087:X 08 Oct 16:04:05.631 # +config-update-from sentinel 18311edfbfb7bf89fe4b67d08ef432053db62fff 127.0.0.1 26380 @ mymaster 127.0.0.1 6379
28087:X 08 Oct 16:04:05.631 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28087:X 08 Oct 16:04:05.631 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381

+switch-master <master name> <oldip> <oldport> <newip> <newport> -- The master new IP and address is the specified one after a configuration change. This is the message most external users are interested in.
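Because this is the event most clients care about, it is worth showing how its five fields can be pulled apart. The sketch below parses a log line as printed above; the SwitchMaster type and parse_switch_master function are hypothetical names chosen for illustration.

```python
from typing import NamedTuple

class SwitchMaster(NamedTuple):
    name: str
    old_ip: str
    old_port: int
    new_ip: str
    new_port: int

def parse_switch_master(line: str) -> SwitchMaster:
    """Extract <master name> <oldip> <oldport> <newip> <newport> from a log line."""
    marker = "+switch-master "
    # Everything after the marker is the space-separated event payload.
    payload = line[line.index(marker) + len(marker):].split()
    name, old_ip, old_port, new_ip, new_port = payload
    return SwitchMaster(name, old_ip, int(old_port), new_ip, int(new_port))

event = parse_switch_master(
    "28087:X 08 Oct 16:04:05.631 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381"
)
print(event.new_ip, event.new_port)
```

A client that receives this event would reconnect to the new address; how the event is delivered to the client is outside the scope of this sketch.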

 

The synchronization process is not yet complete.

28163:X 08 Oct 16:04:06.555 * +slave-reconf-inprog slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379

+slave-reconf-inprog <instance details> -- The slave being reconfigured showed to be a slave of the new master ip:port pair, but the synchronization process is not yet complete.

 

Master-slave synchronization is complete.

28163:X 08 Oct 16:04:06.555 * +slave-reconf-done slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6379

+slave-reconf-done <instance details> -- The slave is now synchronized with the new master.

 

The failover is complete.

28163:X 08 Oct 16:04:06.606 # +failover-end master mymaster 127.0.0.1 6379

 

After the failover succeeds, the master switch message is published

28163:X 08 Oct 16:04:06.606 # +switch-master mymaster 127.0.0.1 6379 127.0.0.1 6381

 

Attach the slave information to the new master. Note that the original master is registered as a slave of the new master.

28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6380 127.0.0.1 6380 @ mymaster 127.0.0.1 6381
28163:X 08 Oct 16:04:06.606 * +slave slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381

+slave <instance details> -- A new slave was detected and attached.

 

30s later, the original master is judged to be subjectively down.

28163:X 08 Oct 16:04:36.687 # +sdown slave 127.0.0.1:6379 127.0.0.1 6379 @ mymaster 127.0.0.1 6381

 

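Taken together, Sentinel performs a failover as follows

1. Every second, each Sentinel node sends a ping command to the master, the slaves, and the other Sentinel nodes as a heartbeat check, to confirm whether those nodes are currently reachable. When a node gives no valid reply for longer than down-after-milliseconds, the Sentinel node judges it to be subjectively down.

2. If the node judged subjectively down is the master, the Sentinel node uses the sentinel is-master-down-by-addr command to ask the other Sentinel nodes for their verdict on the master. When the number of Sentinels that consider it down reaches <quorum> (the log above shows "#quorum 2/2" as sufficient), the Sentinel node judges the master to be objectively down. If a slave or a Sentinel node is judged subjectively down, no subsequent failover takes place.

3. A leader is elected among the Sentinels to carry out the subsequent failover work. The election algorithm is based on Raft.

4. The Sentinel leader starts the failover.

5. A suitable slave is selected as the new master.

6. The Sentinel leader executes slaveof no one on the slave chosen in the previous step to make it the master.

7. The leader sends commands to the remaining slaves to make them slaves of the new master; how this proceeds depends on the parallel-syncs parameter.

8. The original master is updated to be a slave and kept under Sentinel's management, so that it replicates from the new master once it recovers.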
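Steps 1 and 2 boil down to a simple counting rule. The following is a minimal sketch of that rule only, assuming the Sentinel counts itself among those agreeing; the function name master_state and its parameters are made up for illustration and do not mirror Sentinel's internals.

```python
def master_state(local_sdown: bool, peers_reporting_down: int, quorum: int) -> str:
    """Classify the master as seen by one Sentinel: 'up', 'sdown' or 'odown'."""
    if not local_sdown:
        # Without a local subjectively-down verdict nothing else is evaluated.
        return "up"
    agreeing = 1 + peers_reporting_down  # this Sentinel plus agreeing peers
    return "odown" if agreeing >= quorum else "sdown"

# "#quorum 3/2" in the Sentinel-2 log: three Sentinels agree, quorum is 2.
print(master_state(True, 2, 2))
# "#quorum 2/2" in the Sentinel-3 log: exactly reaching the quorum is enough.
print(master_state(True, 1, 2))
```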

 

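Sentinel leader election process

Sentinel leader election is based on the Raft protocol.

1. Every online Sentinel node is eligible to become leader. When it confirms that the master is down (in the logs above, Sentinel-2 starts its candidacy right after logging +odown), it sends the sentinel is-master-down-by-addr command to the other Sentinel nodes, asking to be made leader.

2. A Sentinel node that receives the command grants the request if it has not already granted another Sentinel node's sentinel is-master-down-by-addr request; otherwise it refuses.

3. If the Sentinel node finds that its vote count is greater than or equal to max(quorum, num(sentinels)/2+1), it becomes the leader.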
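The threshold in step 3 is easy to get wrong when quorum and the number of Sentinels differ, so here is a short worked sketch. The function name votes_needed is hypothetical; the formula is the one stated in step 3.

```python
def votes_needed(quorum: int, num_sentinels: int) -> int:
    """Votes a candidate needs: max(quorum, num(sentinels)/2+1)."""
    majority = num_sentinels // 2 + 1  # integer division, as in the formula
    return max(quorum, majority)

print(votes_needed(2, 3))  # the setup in this article: 3 Sentinels, quorum 2
print(votes_needed(2, 5))  # a low quorum does not lower the majority requirement
print(votes_needed(4, 5))  # a high quorum raises the bar above the majority
```

With the three Sentinels and quorum 2 used here, two votes suffice, and Sentinel-2 received three.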

 

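Selection process for the new master

1. Discard all slaves that are already down or disconnected.

2. Discard slaves that have not replied to the leader Sentinel's INFO command within the last 5 seconds.

3. Discard all slaves whose connection to the failed master has been broken for more than down-after-milliseconds * 10 milliseconds.

4. Among the remaining slaves, choose the one with the highest priority (the lowest slave-priority value).

5. If priorities are equal, choose the slave with the largest replication offset.

6. If offsets are also equal, choose the slave with the smallest runid.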
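The filter-then-rank procedure above can be sketched as follows. This is an illustration under stated assumptions, not Sentinel's actual code: the Replica fields and select_replica function are hypothetical names, and the check that skips slaves with slave-priority 0 is an addition taken from Redis's documented behaviour (priority 0 means "never promote"), not from the list above.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Replica:
    addr: str
    run_id: str
    priority: int                  # slave-priority; lower value ranks first
    repl_offset: int
    is_down: bool = False          # already down or disconnected
    info_age_ms: int = 0           # time since the last INFO reply
    master_link_down_ms: int = 0   # time disconnected from the old master

def select_replica(replicas: List[Replica], down_after_ms: int = 30000) -> Optional[Replica]:
    candidates = [
        r for r in replicas
        if not r.is_down                                   # rule 1
        and r.info_age_ms <= 5000                          # rule 2
        and r.master_link_down_ms <= down_after_ms * 10    # rule 3
        and r.priority != 0                                # assumption, see lead-in
    ]
    if not candidates:
        return None
    # Rules 4-6: lowest priority value, then largest offset, then smallest runid.
    return min(candidates, key=lambda r: (r.priority, -r.repl_offset, r.run_id))

# Illustrative values in the style of the "sentinel slaves mymaster" output.
slaves = [
    Replica("127.0.0.1:6380", "983b87fd070c7f052b26f5135bbb30fdeb170a54", 100, 70375),
    Replica("127.0.0.1:6381", "b88059cce9104dd4e0366afd6ad07a163dae8b15", 100, 71040),
]
print(select_replica(slaves).addr)
```

Both slaves share priority 100, so the larger replication offset decides.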

 

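Three periodic monitoring tasks

1. Every 10 seconds, each Sentinel node sends the info command to the master and the slaves to obtain the latest topology. Its purposes are:

    1> Running info against the master yields the slave information, which is why Sentinel nodes do not need the slaves to be configured explicitly.
    2> A newly added slave is detected promptly.
    3> After a node becomes unreachable or a failover occurs, the topology information is refreshed through the info command.

2. Every 2 seconds, each Sentinel node publishes to the __sentinel__:hello channel of the Redis data nodes its own verdict on the master together with information about itself. Each Sentinel node also subscribes to that channel to learn about the other Sentinel nodes and their verdicts on the master. Its purposes are:

   1> Discovering new Sentinel nodes: by subscribing to the master's __sentinel__:hello channel, a Sentinel learns about the other Sentinel nodes. If a Sentinel node is newly added, its information is saved and a connection to it is established.
   2> Sentinel nodes exchange the master's state, which serves as the basis for the later objectively-down verdict and leader election.

3. Every second, each Sentinel node sends a ping command to the master, the slaves, and the other Sentinel nodes as a heartbeat check, to confirm whether those nodes are currently reachable. This periodic task is the key basis for node failure detection.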

 

Sentinel-related parameters

# bind 127.0.0.1 192.168.1.1
# protected-mode no
port 26379
# sentinel announce-ip <ip>
# sentinel announce-port <port>
dir /tmp
sentinel monitor mymaster 127.0.0.1 6379 2
# sentinel auth-pass <master-name> <password>
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000
# sentinel notification-script mymaster /var/redis/notify.sh
# sentinel client-reconfig-script mymaster /var/redis/reconfig.sh
sentinel deny-scripts-reconfig yes

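Where:

dir: sets Sentinel's working directory.

sentinel monitor mymaster 127.0.0.1 6379 2: the trailing 2 is the quorum, meaning that at least two Sentinel nodes must consider the master subjectively down before it can be judged objectively down. It is generally recommended to set it to half the number of Sentinel nodes plus 1. The quorum also affects the Sentinel leader election: to elect a leader, at least max(quorum, num(sentinels) / 2 + 1) Sentinel nodes must take part in the election.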

 

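sentinel down-after-milliseconds mymaster 30000: each Sentinel node periodically sends ping commands to determine whether the Redis nodes and the other Sentinel nodes are reachable.

If no valid reply is received from the master within the specified time, it is judged subjectively down. Note that this parameter is used to judge not only the master's state but also the state of the slaves under that master and of the other Sentinels. The default is 30s.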

 

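sentinel parallel-syncs mymaster 1: how many slaves may be reconfigured to point at the new master at the same time during a failover. If this value is set high, then although replication does not block the master, several nodes syncing from the new master simultaneously increases its network and disk I/O load.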

 

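sentinel failover-timeout mymaster 180000: defines the failover timeout. The default is 180000 milliseconds, i.e. 3 minutes. Note that this is not the total duration of a failover; it applies to several distinct failover scenarios.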

# Specifies the failover timeout in milliseconds. It is used in many ways:
#
# - The time needed to re-start a failover after a previous failover was
#   already tried against the same master by a given Sentinel, is two
#   times the failover timeout.
#
# - The time needed for a slave replicating to a wrong master according
#   to a Sentinel current configuration, to be forced to replicate
#   with the right master, is exactly the failover timeout (counting since
#   the moment a Sentinel detected the misconfiguration).
#
# - The time needed to cancel a failover that is already in progress but
#   did not produced any configuration change (SLAVEOF NO ONE yet not
#   acknowledged by the promoted slave).
#
# - The maximum time a failover in progress waits for all the slaves to be
#   reconfigured as slaves of the new master. However even after this time
#   the slaves will be reconfigured by the Sentinels anyway, but not with
#   the exact parallel-syncs progression as specified.

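First scenario: if a failover has already been attempted against a master, a given Sentinel will not start another failover of that master until 2 times failover-timeout has elapsed. The Sentinel log shows this (28234:X 08 Oct 16:04:04.385 # Next failover delay: I will not start a failover before Mon Oct  8 16:10:04 2018)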
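The arithmetic behind that log line can be reproduced in a few lines. This is a worked example only; the variable names are made up, and the base time is the second at which Sentinel-3 logged its vote.

```python
from datetime import datetime, timedelta

failover_timeout_ms = 180000                     # sentinel failover-timeout mymaster 180000
vote_time = datetime(2018, 10, 8, 16, 4, 4)      # Sentinel-3 voted at 16:04:04

# The next failover may not start before two failover timeouts have passed.
earliest_retry = vote_time + timedelta(milliseconds=2 * failover_timeout_ms)
print(earliest_retry.strftime("%H:%M:%S"))
```

Two timeouts of 3 minutes each give 6 minutes, which lands on the 16:10:04 reported in the log.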

 

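sentinel notification-script: defines the notification script. When a WARNING-level event occurs in Sentinel, the script is called with two arguments: the event type and the event description.

sentinel client-reconfig-script: when the master changes, the script defined by this parameter is called with the following arguments: <master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>

Scripts must follow certain rules.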

# SCRIPTS EXECUTION
#
# sentinel notification-script and sentinel reconfig-script are used in order
# to configure scripts that are called to notify the system administrator
# or to reconfigure clients after a failover. The scripts are executed
# with the following rules for error handling:
#
# If script exits with "1" the execution is retried later (up to a maximum
# number of times currently set to 10).
#
# If script exits with "2" (or an higher value) the script execution is
# not retried.
#
# If script terminates because it receives a signal the behavior is the same
# as exit code 1.
#
# A script has a maximum running time of 60 seconds. After this limit is
# reached the script is terminated with a SIGKILL and the execution retried.
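To make the argument list and the exit-code rules concrete, here is a minimal sketch of what a client-reconfig-script could look like if written in Python. Everything in it is an assumption for illustration: the function names, the choice to return 2 on malformed input, and the print statement standing in for a real action such as updating a proxy.

```python
ARG_NAMES = ["master_name", "role", "state",
             "from_ip", "from_port", "to_ip", "to_port"]

def parse_reconfig_args(argv):
    """Map the seven positional arguments to a dict; raise ValueError if malformed."""
    if len(argv) != len(ARG_NAMES):
        raise ValueError("expected %d arguments, got %d" % (len(ARG_NAMES), len(argv)))
    args = dict(zip(ARG_NAMES, argv))
    args["from_port"] = int(args["from_port"])  # int() raises ValueError on junk
    args["to_port"] = int(args["to_port"])
    return args

def main(argv):
    try:
        args = parse_reconfig_args(argv)
    except ValueError:
        return 2  # retrying will not fix bad arguments, so ask for no retry
    # Hypothetical action: a real script would repoint clients here.
    print("master %s moved from %s:%d to %s:%d" % (
        args["master_name"], args["from_ip"], args["from_port"],
        args["to_ip"], args["to_port"]))
    return 0

# A real script would finish with:  import sys; sys.exit(main(sys.argv[1:]))
```

Returning 1 instead would be appropriate for transient errors, since Sentinel retries those, and the script should finish well within the 60-second limit.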

 

sentinel deny-scripts-reconfig: disallows setting notification-script and client-reconfig-script via SENTINEL SET.

 

Common Sentinel operations

  • PING This command simply returns PONG.
  • SENTINEL masters Show a list of monitored masters and their state.
  • SENTINEL master <master name> Show the state and info of the specified master.
  • SENTINEL slaves <master name> Show a list of slaves for this master, and their state.
  • SENTINEL sentinels <master name> Show a list of sentinel instances for this master, and their state.
  • SENTINEL get-master-addr-by-name <master name> Return the ip and port number of the master with that name. If a failover is in progress or terminated successfully for this master it returns the address and port of the promoted slave.
  • SENTINEL reset <pattern> This command will reset all the masters with matching name. The pattern argument is a glob-style pattern. The reset process clears any previous state in a master (including a failover in progress), and removes every slave and sentinel already discovered and associated with the master.
  • SENTINEL failover <master name> Force a failover as if the master was not reachable, and without asking for agreement to other Sentinels (however a new version of the configuration will be published so that the other Sentinels will update their configurations).
  • SENTINEL ckquorum <master name> Check if the current Sentinel configuration is able to reach the quorum needed to failover a master, and the majority needed to authorize the failover. This command should be used in monitoring systems to check if a Sentinel deployment is ok.
  • SENTINEL flushconfig Force Sentinel to rewrite its configuration on disk, including the current Sentinel state. Normally Sentinel rewrites the configuration every time something changes in its state (in the context of the subset of the state which is persisted on disk across restart). However sometimes it is possible that the configuration file is lost because of operation errors, disk failures, package upgrade scripts or configuration managers. In those cases a way to force Sentinel to rewrite the configuration file is handy. This command works even if the previous configuration file is completely missing.
  • SENTINEL MONITOR <name> <ip> <port> <quorum> This command tells the Sentinel to start monitoring a new master with the specified name, ip, port, and quorum. It is identical to the sentinel monitor configuration directive in sentinel.conf configuration file.
  • SENTINEL REMOVE <name> is used in order to remove the specified master: the master will no longer be monitored, and will totally be removed from the internal state of the Sentinel, so it will no longer be listed by SENTINEL masters and so forth.
  • SENTINEL SET <name> <option> <value> The SET command is very similar to the CONFIG SET command of Redis, and is used in order to change configuration parameters of a specific master. Multiple option / value pairs can be specified (or none at all). All the configuration parameters that can be configured via sentinel.conf are also configurable using the SET command.

sentinel masters

Outputs the status of the monitored masters

127.0.0.1:26379> sentinel masters
1)  1) "name"
    2) "mymaster"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "6379"
    7) "runid"
    8) "6ab2be5db3a37c10f2473c8fb9daed147a32df3e"
    9) "flags"
   10) "master"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "639"
   19) "last-ping-reply"
   20) "639"
   21) "down-after-milliseconds"
   22) "30000"
   23) "info-refresh"
   24) "2075"
   25) "role-reported"
   26) "master"
   27) "role-reported-time"
   28) "759682"
   29) "config-epoch"
   30) "0"
   31) "num-slaves"
   32) "2"
   33) "num-other-sentinels"
   34) "2"
   35) "quorum"
   36) "2"
   37) "failover-timeout"
   38) "180000"
   39) "parallel-syncs"
   40) "1"
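The reply above is a flat list of alternating field names and values, which is awkward to query. A small helper can turn such a list into a dictionary. This is a sketch with a hypothetical function name, pairs_to_dict, and it works on plain Python lists rather than on any particular client library's reply type.

```python
def pairs_to_dict(flat):
    """Convert ['name', 'mymaster', 'ip', '127.0.0.1', ...] into a dict."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of items")
    # Even positions hold field names, odd positions hold their values.
    return dict(zip(flat[0::2], flat[1::2]))

reply = ["name", "mymaster", "ip", "127.0.0.1", "port", "6379",
         "flags", "master", "num-slaves", "2", "quorum", "2"]
master = pairs_to_dict(reply)
print(master["ip"], master["port"], master["quorum"])
```

Values stay as strings here, matching how they appear in the redis-cli output.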

 

You can also view the status of a single master

sentinel master mymaster

 

sentinel slaves mymaster 

View the status of the slaves of a given master

127.0.0.1:26379> sentinel slaves mymaster
1)  1) "name"
    2) "127.0.0.1:6380"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "6380"
    7) "runid"
    8) "983b87fd070c7f052b26f5135bbb30fdeb170a54"
    9) "flags"
   10) "slave"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "178"
   19) "last-ping-reply"
   20) "178"
   21) "down-after-milliseconds"
   22) "30000"
   23) "info-refresh"
   24) "6160"
   25) "role-reported"
   26) "slave"
   27) "role-reported-time"
   28) "489019"
   29) "master-link-down-time"
   30) "0"
   31) "master-link-status"
   32) "ok"
   33) "master-host"
   34) "127.0.0.1"
   35) "master-port"
   36) "6379"
   37) "slave-priority"
   38) "100"
   39) "slave-repl-offset"
   40) "70375"
2)  1) "name"
    2) "127.0.0.1:6381"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "6381"
    7) "runid"
    8) "b88059cce9104dd4e0366afd6ad07a163dae8b15"
    9) "flags"
   10) "slave"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "178"
   19) "last-ping-reply"
   20) "178"
   21) "down-after-milliseconds"
   22) "30000"
   23) "info-refresh"
   24) "2918"
   25) "role-reported"
   26) "slave"
   27) "role-reported-time"
   28) "489019"
   29) "master-link-down-time"
   30) "0"
   31) "master-link-status"
   32) "ok"
   33) "master-host"
   34) "127.0.0.1"
   35) "master-port"
   36) "6379"
   37) "slave-priority"
   38) "100"
   39) "slave-repl-offset"
   40) "71040"

 

sentinel sentinels mymaster

View the status of the other Sentinels

127.0.0.1:26379> sentinel sentinels mymaster
1)  1) "name"
    2) "738ccbddaa0d4379d89a147613d9aecfec765bcb"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "26381"
    7) "runid"
    8) "738ccbddaa0d4379d89a147613d9aecfec765bcb"
    9) "flags"
   10) "sentinel"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "475"
   19) "last-ping-reply"
   20) "475"
   21) "down-after-milliseconds"
   22) "30000"
   23) "last-hello-message"
   24) "79"
   25) "voted-leader"
   26) "?"
   27) "voted-leader-epoch"
   28) "0"
2)  1) "name"
    2) "7251bb129ca373ad0d8c7baf3b6577ae2593079f"
    3) "ip"
    4) "127.0.0.1"
    5) "port"
    6) "26380"
    7) "runid"
    8) "7251bb129ca373ad0d8c7baf3b6577ae2593079f"
    9) "flags"
   10) "sentinel"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "475"
   19) "last-ping-reply"
   20) "475"
   21) "down-after-milliseconds"
   22) "30000"
   23) "last-hello-message"
   24) "985"
   25) "voted-leader"
   26) "?"
   27) "voted-leader-epoch"
   28) "0"

 

sentinel get-master-addr-by-name <master name>

Returns the IP address and port of the master named <master name>. If a failover is in progress, it shows the information of the new master.

127.0.0.1:26379> sentinel get-master-addr-by-name mymaster
1) "127.0.0.1"
2) "6379"

 

sentinel reset <pattern>

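Resets the configuration of the masters matching <pattern> (glob-style).

If a slave goes down, it remains under Sentinel's management, so after it recovers it rejoins the previous replication setup even if its configuration file contains no slaveof option. Likewise, if the master goes down, after a restart it will by default join the previous replication setup as a slave.

But often we actually want to remove the old master or a slave, and this is where sentinel reset comes in. It resets the configuration based on the current master's state (they'll refresh the list of slaves within the next 10 seconds, only adding the ones listed as correctly replicating from the current master INFO output). The key point is that slaves in an abnormal state are dropped from the current configuration. That way, after a dropped node recovers (note that the slaveof setting must be removed from its configuration file at this point), it does not automatically rejoin the previous replication setup.

Note that this command affects only the current Sentinel node. To remove a node, run the reset on every Sentinel node.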

 

sentinel failover <master name>

Forces a failover of the master named <master name>. Unlike a regular failover, no Sentinel leader election is needed; the current Sentinel node performs the subsequent failover directly.

 

sentinel ckquorum <master name>

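Checks whether the number of currently reachable Sentinel nodes reaches <quorum>, and whether the majority needed to authorize a failover is available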

127.0.0.1:26379> sentinel ckquorum mymaster
OK 3 usable Sentinels. Quorum and failover authorization can be reached

 

sentinel flushconfig

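Forces the Sentinel node's configuration to be flushed to disk. Sentinel itself uses this operation frequently; developers and operators need it only when an external cause (for example disk damage) has corrupted or lost the configuration file.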

 

sentinel remove <master name>

Stops the current Sentinel node from monitoring the master named <master name>.

[root@slowtech redis-4.0.11]# grep -Ev "^#|^$" sentinel_26379.conf 
port 26379
dir "/tmp"
sentinel myid 2467530fa249dbbc435c50fbb0dc2a4e766146f8
sentinel deny-scripts-reconfig yes
sentinel monitor mymaster 127.0.0.1 6381 2
sentinel config-epoch mymaster 12
sentinel leader-epoch mymaster 0
sentinel known-slave mymaster 127.0.0.1 6380
sentinel known-slave mymaster 127.0.0.1 6379
sentinel known-sentinel mymaster 127.0.0.1 26381 738ccbddaa0d4379d89a147613d9aecfec765bcb
sentinel known-sentinel mymaster 127.0.0.1 26380 7251bb129ca373ad0d8c7baf3b6577ae2593079f
sentinel current-epoch 12

[root@slowtech redis-4.0.11]# redis-cli -p 26379
127.0.0.1:26379> sentinel remove mymaster
OK
127.0.0.1:26379> quit

[root@slowtech redis-4.0.11]# grep -Ev "^#|^$" sentinel_26379.conf 
port 26379
dir "/tmp"
sentinel myid 2467530fa249dbbc435c50fbb0dc2a4e766146f8
sentinel deny-scripts-reconfig yes
sentinel current-epoch 12

 

sentinel set <name> <option> <value>

Parameter                  Usage
quorum                     sentinel set mymaster quorum 3
down-after-milliseconds    sentinel set mymaster down-after-milliseconds 30000
failover-timeout           sentinel set mymaster failover-timeout 18000
parallel-syncs             sentinel set mymaster parallel-syncs 3
notification-script        sentinel set mymaster notification-script /tmp/a.sh
client-reconfig-script     sentinel set mymaster client-reconfig-script /tmp/b.sh
auth-pass                  sentinel set mymaster auth-pass masterpassword

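Points to note:

1. The sentinel set command affects only the current Sentinel node.

2. If sentinel set succeeds, the configuration file is updated immediately. This differs from ordinary Redis data nodes, where config rewrite must be run after changing the configuration to persist it to the configuration file.

3. It is recommended to keep the configuration of all Sentinel nodes as consistent as possible.

4. Sentinel does not support the config command. To check parameter settings, use the SENTINEL MASTER command.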

 

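References:

1. 《Redis開發與運維》 (Redis Development and Operations)

2. 《Redis設計與實現》 (Redis Design and Implementation)

3. 《Redis 4.X Cookbook》

4. Official Redis documentation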
