Contents

1. Preface
2. Slave-initiated elections
3. Master response to an election
4. Election example
5. Hash slot propagation
6. A failover record (1)
6.1. Relevant parameters
6.2. Timeline
6.3. Log of another master
6.4. Log of another master
6.5. Slave log
7. A failover record (2)
7.1. Relevant parameters
7.2. Timeline
7.3. Log of another master
7.4. Log of another master
7.5. Slave log
8. Code for the slave's delayed election start
1. Preface

Official Redis reference: https://redis.io/topics/cluster-spec. Note that starting with Redis 5.0, "slave" has been renamed "replica"; configuration options and parts of the documentation and variable names have been renamed accordingly.
2. Slave-initiated elections

Redis Cluster handles master/slave failover with a majority-vote election. Only masters may vote, so a failover can succeed only while a majority of the masters is alive; the election itself is initiated by a slave.
Redis uses a concept similar to the term in the Raft algorithm, called an epoch. An epoch is an unsigned 64-bit integer, and a node's epoch starts at 0.
If a node receives an epoch larger than its own, it updates its own epoch to the received value (the network is assumed to be trusted, with no Byzantine-generals problem).
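As a minimal illustration (a hypothetical sketch, not Redis's actual code), the epoch adoption rule amounts to:

#include <stdint.h>

/* Hypothetical sketch: adopt the sender's epoch if it is newer.
 * A trusted network is assumed, so the larger value always wins. */
static void observe_epoch(uint64_t *my_current_epoch, uint64_t sender_epoch) {
    if (sender_epoch > *my_current_epoch)
        *my_current_epoch = sender_epoch;
}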
Every master broadcasts its epoch and the bitmap of the slots it serves in its ping and pong messages. When a slave starts an election, it creates a new epoch (one greater than the current one). The epoch value is persisted in the nodes.conf file, e.g. (the latest epoch is 27, and the most recent vote was cast for epoch 27):
vars currentEpoch 27 lastVoteEpoch 27
A slave starts an election only when its master is in FAIL state, and not at the very moment the master enters FAIL: it first waits the random delay below, to prevent multiple slaves from starting elections at the same time (the election starts at least 0.5 seconds after FAIL is observed):
500 milliseconds + random delay between 0 and 500 milliseconds + SLAVE_RANK * 1000 milliseconds
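A minimal C sketch of this delay formula (the real computation, quoted in section 8, lives in clusterHandleSlaveFailover(); slave_rank here stands for the SLAVE_RANK described below):

#include <stdlib.h>

/* Sketch of the election delay formula; slave_rank is 0 for the slave
 * with the most recent replication offset. */
static long election_delay_ms(int slave_rank) {
    return 500                 /* fixed delay: let the FAIL state propagate */
         + rand() % 500        /* random delay: desynchronize the slaves */
         + slave_rank * 1000L; /* rank penalty: less up to date, later start */
}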
A slave initiates an election only when all of the following hold:
1) its master is in FAIL state (not PFAIL);
2) its master serves at least one slot;
3) the replication link between the slave and its master has not been down longer than a configurable limit (this ensures the slave's data is reasonably complete; operationally it means you must not leave a slave unavailable for long, and should detect and restore abnormal slaves promptly through monitoring). A worked example of the limit follows this list.
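For condition 3), the limit follows from the check in clusterHandleSlaveFailover() quoted below: a slave may fail over automatically only if

data_age <= repl-ping-slave-period * 1000 + cluster-node-timeout * cluster-slave-validity-factor   (milliseconds)

With the parameters used in section 6 (repl-ping-slave-period 1, cluster-node-timeout 30000, cluster-slave-validity-factor 1), that is 1*1000 + 30000*1 = 31000 ms: a slave disconnected (beyond the node-timeout baseline the code subtracts first) for more than 31 seconds can no longer be promoted automatically.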
Log from a slave that could not automatically fail over because it had been unavailable for too long:

12961:S 06 Jan 2019 19:00:21.969 # Currently unable to failover: Disconnected from master for longer than allowed. Please check the 'cluster-replica-validity-factor' configuration option.
The relevant source code:
/* This function is called if we are a slave node and our master serving
 * a non-zero amount of hash slots is in FAIL state.
 *
 * The gaol of this function is:
 * 1) To check if we are able to perform a failover, is our data updated?
 * 2) Try to get elected by masters.
 * 3) Perform the failover informing all the other nodes.
 */
void clusterHandleSlaveFailover(void) {
    mstime_t data_age; /* how long we have been disconnected from the master, in ms */
    mstime_t auth_age = mstime() - server.cluster->failover_auth_time;
    int needed_quorum = (server.cluster->size / 2) + 1;
    int manual_failover = server.cluster->mf_end != 0 &&
                          server.cluster->mf_can_start;
    mstime_t auth_timeout, auth_retry_time;

    auth_timeout = server.cluster_node_timeout*2;
    if (auth_timeout < 2000) auth_timeout = 2000;
    auth_retry_time = auth_timeout*2;

    ......

    /* Set data_age to the number of seconds we are disconnected from
     * the master. */
    if (server.repl_state == REPL_STATE_CONNECTED) {
        data_age = (mstime_t)(server.unixtime - server.master->lastinteraction)
                   * 1000;
    } else {
        data_age = (mstime_t)(server.unixtime - server.repl_down_since) * 1000;
    }

    /* Remove the node timeout from the data age as it is fine that we are
     * disconnected from our master at least for the time it was down to be
     * flagged as FAIL, that's the baseline. */
    if (data_age > server.cluster_node_timeout)
        data_age -= server.cluster_node_timeout;

    /* Check if our data is recent enough according to the slave validity
     * factor configured by the user.
     *
     * Check bypassed for manual failovers. */
    if (server.cluster_slave_validity_factor &&
        data_age >
        (((mstime_t)server.repl_ping_slave_period * 1000) +
         (server.cluster_node_timeout * server.cluster_slave_validity_factor)))
    {
        /* The slave has been unavailable too long to fail over automatically. */
        if (!manual_failover) { /* manual failover is exempt */
            clusterLogCantFailover(CLUSTER_CANT_FAILOVER_DATA_AGE);
            return;
        }
    }

    ......

    /* Ask for votes if needed. */
    /* failover_auth_sent marks whether the vote request has already been sent. */
    if (server.cluster->failover_auth_sent == 0) {
        server.cluster->currentEpoch++;
        server.cluster->failover_auth_epoch = server.cluster->currentEpoch;
        serverLog(LL_WARNING,"Starting a failover election for epoch %llu.",
            (unsigned long long) server.cluster->currentEpoch);
        /* Broadcast FAILOVER_AUTH_REQUEST (the "vote for me to become master"
         * message) to all nodes, slaves included; note that only masters
         * respond to it. */
        clusterRequestFailoverAuth();
        server.cluster->failover_auth_sent = 1;
        clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
                             CLUSTER_TODO_UPDATE_STATE|
                             CLUSTER_TODO_FSYNC_CONFIG);
        return; /* Wait for replies. */
    }

    /* Check if we reached the quorum. */
    if (server.cluster->failover_auth_count >= needed_quorum) {
        /* We have the quorum, we can finally failover the master. */
        serverLog(LL_WARNING, "Failover election won: I'm the new master.");

        /* Update my configEpoch to the epoch of the election. */
        if (myself->configEpoch < server.cluster->failover_auth_epoch) {
            myself->configEpoch = server.cluster->failover_auth_epoch;
            serverLog(LL_WARNING,
                "configEpoch set to %llu after successful failover",
                (unsigned long long) myself->configEpoch);
        }

        /* Take responsibility for the cluster slots. */
        clusterFailoverReplaceYourMaster();
    } else {
        clusterLogCantFailover(CLUSTER_CANT_FAILOVER_WAITING_VOTES);
    }
}
The code above also shows that the cluster-slave-validity-factor configuration option determines whether a slave is eligible to fail over to master.
Before starting an election, the slave first increments its epoch (currentEpoch) and then asks the other masters to vote for it, by broadcasting a FAILOVER_AUTH_REQUEST packet to every master in the cluster.
After requesting votes, the slave waits up to twice NODE_TIMEOUT for the results, and at least 2 seconds regardless of NODE_TIMEOUT's value.
A master that grants its vote replies with FAILOVER_AUTH_ACK, and will not vote for another slave of the same master for the next NODE_TIMEOUT*2.
The slave simply discards any FAILOVER_AUTH_ACK whose epoch is smaller than its own. Once it has received FAILOVER_AUTH_ACKs from a majority of masters, it declares itself the winner of the election.
If the slave does not win within twice NODE_TIMEOUT (at least 2 seconds), it abandons the attempt and starts a new election four times NODE_TIMEOUT later (at least 4 seconds).
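To make these windows concrete: in the code quoted above, auth_timeout = cluster-node-timeout * 2 (floored at 2000 ms) and auth_retry_time = auth_timeout * 2. With the cluster-node-timeout of 30000 ms used in section 6, for example:

auth_timeout    = max(30000 * 2, 2000) = 60000 ms    (how long to wait for votes)
auth_retry_time = 60000 * 2            = 120000 ms   (how long before a new attempt)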
The reason elections are forcibly delayed by at least 0.5 seconds is to give the master's FAIL state time to spread across the whole cluster; otherwise only a few masters might know about it, and masters only vote for slaves whose master is in FAIL state. If a slave's master is not FAIL from another master's point of view, that master will not vote for it. Redis propagates FAIL via the gossip protocol. The random delay added on top of the fixed delay prevents multiple slaves from starting elections at the same time.
A slave's SLAVE_RANK is a value determined by how much of the master's data it has replicated: the slave with the most recent replication offset gets rank 0, the next gets rank 1, and so on. This lets the slave with the most complete data start its election first. If the best-ranked slave fails to get elected, the other slaves will follow with elections of their own soon after (at least 4 seconds later). A sketch of the rank computation follows.
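A minimal, self-contained sketch of the rank computation (hypothetical types for illustration; in Redis itself this is clusterGetSlaveRank() in cluster.c, which compares replication offsets among the failed master's slaves):

#include <stddef.h>

/* Hypothetical types: a slave's rank is the number of sibling slaves whose
 * replication offset is larger, so the most up-to-date slave gets rank 0
 * and therefore the shortest election delay. */
typedef struct {
    long long repl_offset; /* last replication offset received from the master */
} slave_t;

static int slave_rank(const slave_t *slaves, size_t nslaves, size_t me) {
    int rank = 0;
    for (size_t j = 0; j < nslaves; j++)
        if (j != me && slaves[j].repl_offset > slaves[me].repl_offset)
            rank++;
    return rank; /* feeds the SLAVE_RANK * 1000 ms term of the delay */
}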
After winning the election, the slave broadcasts a pong to every node in the cluster so that they reconfigure as quickly as possible (reflected in updates to nodes.conf). Nodes it cannot currently reach will eventually be reconfigured as well.
The other nodes may then observe two masters claiming the same slots; the conflict is resolved in favor of the master with the larger epoch.
After the slave becomes master, it does not start serving immediately but leaves a short time gap.
3. Master response to an election

When a master receives a slave's vote request FAILOVER_AUTH_REQUEST, it grants its vote only if all of the following conditions hold:
1) for a given epoch, it votes at most once;
2) it rejects every vote request carrying a smaller epoch;
3) it never votes for an epoch smaller than its lastVoteEpoch;
4) it votes only for a slave whose master it has marked as FAIL;
5) if the currentEpoch in the slave's request is smaller than the master's own currentEpoch, the request is ignored. Note, however, the following corner case, in which a delayed reply is nonetheless accepted:
① Suppose the master's currentEpoch is 5 and its lastVoteEpoch is 1 (this can happen after failed elections: currentEpoch is bumped on each attempt, but lastVoteEpoch stays unchanged because no vote was cast);
② the slave's currentEpoch is 3;
③ the slave increments it and starts an election with epoch 4; the master replies with its currentEpoch of 5, but the reply is delayed;
④ the slave starts another election, this time with epoch 5 (the epoch is incremented on every attempt); just then the delayed reply arrives, and the slave considers it a valid vote.
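Putting the conditions together, here is a minimal sketch of a master's vote-granting decision (hypothetical types; Redis implements this in clusterSendFailoverAuthIfNeeded() in cluster.c and tracks the vote time per failed master):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical voter state; lastVoteEpoch is persisted to nodes.conf. */
typedef struct {
    uint64_t current_epoch;   /* this master's currentEpoch */
    uint64_t last_vote_epoch; /* epoch of the last vote we cast */
    long long last_vote_time; /* when we last voted for this master's slaves */
} voter_t;

static bool grant_vote(voter_t *me, uint64_t req_epoch,
                       bool requesters_master_failed,
                       long long now_ms, long long node_timeout_ms) {
    if (req_epoch < me->current_epoch) return false;    /* stale request: ignore */
    if (req_epoch <= me->last_vote_epoch) return false; /* already voted this epoch */
    if (!requesters_master_failed) return false;        /* master not FAIL */
    /* don't vote twice for slaves of the same master within NODE_TIMEOUT*2 */
    if (now_ms - me->last_vote_time < node_timeout_ms * 2) return false;
    me->last_vote_epoch = req_epoch; /* persisted to nodes.conf in real code */
    me->last_vote_time = now_ms;
    return true;
}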
After voting, the master updates its local lastVoteEpoch with the epoch from the request and persists it to nodes.conf. Masters take no part in choosing the best slave; since the best slave has the best SLAVE_RANK, it simply gets to start its election earlier.
4. Election example

Suppose a master has three slaves A, B and C, and the master becomes unreachable:
1) slave A wins the election and becomes master;
2) A becomes unavailable because of a network partition;
3) slave B wins the next election and becomes master;
4) B becomes unavailable because of a network partition;
5) the partition heals and A is available again.
Now B is down and A is back up. At the same moment, slave C starts an election, trying to take over from B as master. Because C's master is unavailable, C can win the election and increments its configEpoch. A cannot reclaim the master role, because C is already master and C's epoch is larger.
5. Hash slot propagation

Hash slot configuration propagates in two ways:
1) Heartbeat messages. When a node sends a ping or pong message, it always includes the hash slots it serves (or that its master serves);
2) UPDATE messages. Since heartbeat packets also carry epoch information, a receiver that finds a heartbeat's slot information stale replies with the newer information, forcing the sender to update its hash slot mapping. A minimal sketch of this rule follows.
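A minimal sketch of the "greater configEpoch wins" rebinding rule used when slot ownership claims conflict (hypothetical types, not Redis's actual structures):

#include <stdbool.h>
#include <stdint.h>

#define NUM_SLOTS 16384

/* Hypothetical slot table: for each slot, remember the owner and the
 * configEpoch of the claim we accepted. */
typedef struct {
    int owner[NUM_SLOTS];             /* node owning each slot */
    uint64_t config_epoch[NUM_SLOTS]; /* epoch of the accepted claim */
} slot_table_t;

/* Accept a claim for a slot only if it carries a configEpoch greater
 * than the one currently recorded for that slot. */
static bool maybe_rebind_slot(slot_table_t *t, int slot,
                              int claimer, uint64_t claim_epoch) {
    if (claim_epoch <= t->config_epoch[slot]) return false; /* stale claim */
    t->owner[slot] = claimer;
    t->config_epoch[slot] = claim_epoch;
    return true;
}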
6. A failover record (1)

6.1. Relevant parameters

The test cluster runs on a single physical machine, with cluster-node-timeout larger than repl-timeout:

cluster-slave-validity-factor: 1
cluster-node-timeout: 30000
repl-ping-slave-period: 1
repl-timeout: 10
6.2. Timeline

The failover happens within roughly one second of the master being marked FAIL.

master A marks the master FAIL: 20:12:55.467
master B marks the master FAIL: 20:12:55.467
master A votes: 20:12:56.164
master B votes: 20:12:56.164
slave starts the election: 20:12:56.160
slave schedules the election: 20:12:55.558 (delayed by 579 ms)
slave detects master heartbeat timeout: 20:12:32.810 (the failover only happens 24 s after this)
slave hears from other masters that its master is FAIL: 20:12:55.467
last normal service before the switch: 20:12:22/279275 (service became abnormal at about second 22)
service restored after the switch: 20:12:59/278149
service unavailable for: about 37 seconds
6.3. Log of another master

This master's ID is c67dc9e02e25f2e6321df8ac2eb4d99789917783.

30613:M 04 Jan 2019 20:12:55.467 * FAIL message received from bfad383775421b1090eaa7e0b2dcfb3b38455079 about 44eb43e50c101c5f44f48295c42dda878b6cb3e9  // learned from another master that 44eb43e50c101c5f44f48295c42dda878b6cb3e9 has failed
30613:M 04 Jan 2019 20:12:55.467 # Cluster state changed: fail
30613:M 04 Jan 2019 20:12:56.164 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 30  // votes in the election
30613:M 04 Jan 2019 20:12:56.204 # Cluster state changed: ok
30613:M 04 Jan 2019 20:12:56.708 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9
6.4. Log of another master

This master's ID is bfad383775421b1090eaa7e0b2dcfb3b38455079.

30614:M 04 Jan 2019 20:12:55.467 * Marking node 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as failing (quorum reached).  // marks 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as failed
30614:M 04 Jan 2019 20:12:56.164 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 30  // votes in the election
30614:M 04 Jan 2019 20:12:56.709 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9
6.5. Slave log

The slave's master ID is 44eb43e50c101c5f44f48295c42dda878b6cb3e9; the slave's own ID is 0ae8b5400d566907a3d8b425d983ac3b7cbd8412.

30651:S 04 Jan 2019 20:12:32.810 # MASTER timeout: no data nor PING received...  // master timeout detected 10 seconds after the master went down, because repl-timeout is 10
30651:S 04 Jan 2019 20:12:32.810 # Connection with master lost.
30651:S 04 Jan 2019 20:12:32.810 * Caching the disconnected master state.
30651:S 04 Jan 2019 20:12:32.810 * Connecting to MASTER 1.9.16.9:4073
30651:S 04 Jan 2019 20:12:32.810 * MASTER <-> REPLICA sync started
30651:S 04 Jan 2019 20:12:32.810 * Non blocking connect for SYNC fired the event.
30651:S 04 Jan 2019 20:12:43.834 # Timeout connecting to the MASTER...
30651:S 04 Jan 2019 20:12:43.834 * Connecting to MASTER 1.9.16.9:4073
30651:S 04 Jan 2019 20:12:43.834 * MASTER <-> REPLICA sync started
30651:S 04 Jan 2019 20:12:43.834 * Non blocking connect for SYNC fired the event.
30651:S 04 Jan 2019 20:12:54.856 # Timeout connecting to the MASTER...
30651:S 04 Jan 2019 20:12:54.856 * Connecting to MASTER 1.9.16.9:4073
30651:S 04 Jan 2019 20:12:54.856 * MASTER <-> REPLICA sync started
30651:S 04 Jan 2019 20:12:54.856 * Non blocking connect for SYNC fired the event.
30651:S 04 Jan 2019 20:12:55.467 * FAIL message received from bfad383775421b1090eaa7e0b2dcfb3b38455079 about 44eb43e50c101c5f44f48295c42dda878b6cb3e9  // receives a FAIL message about its own master from another master
30651:S 04 Jan 2019 20:12:55.467 # Cluster state changed: fail
30651:S 04 Jan 2019 20:12:55.558 # Start of election delayed for 579 milliseconds (rank #0, offset 227360).  // schedules the election with a 579 ms delay: 500 ms fixed delay plus 79 ms random delay; the rank delay is 0 ms because the rank is 0
30651:S 04 Jan 2019 20:12:56.160 # Starting a failover election for epoch 30.  // starts the election
30651:S 04 Jan 2019 20:12:56.180 # Failover election won: I'm the new master.  // wins the election
30651:S 04 Jan 2019 20:12:56.180 # configEpoch set to 30 after successful failover
30651:M 04 Jan 2019 20:12:56.180 # Setting secondary replication ID to 154a9c2319403d610808477dcda3d4bede0f374c, valid up to offset: 227361. New replication ID is 927fb64a420236ee46d39389611ab2d8f6530b6a
30651:M 04 Jan 2019 20:12:56.181 * Discarding previously cached master state.
30651:M 04 Jan 2019 20:12:56.181 # Cluster state changed: ok
30651:M 04 Jan 2019 20:12:56.708 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9  // ignores a message from 1.9.16.9:4077, which is not a cluster member
7. A failover record (2)

7.1. Relevant parameters

The test cluster runs on a single physical machine, with cluster-node-timeout smaller than repl-timeout:

cluster-slave-validity-factor: 1
cluster-node-timeout: 10000
repl-ping-slave-period: 1
repl-timeout: 30
7.2. Timeline

The failover happens within roughly one second of the master being marked FAIL. Compared with record (1), cluster-node-timeout is 10000 ms instead of 30000 ms, so the cluster marks the master FAIL much sooner, which is why the outage shrinks from about 37 to about 17 seconds.

master A marks the master FAIL: 20:37:10.398
master B marks the master FAIL: 20:37:10.398
master A votes: 20:37:11.084
master B votes: 20:37:11.085
slave starts the election: 20:37:11.077
slave schedules the election: 20:37:10.475 (delayed by 539 ms)
slave detects master heartbeat timeout: never happened, because the slave became master before the replication timeout fired
slave hears from other masters that its master is FAIL: 20:37:10.398
last normal service before the switch: 20:36:55/266889 (service became abnormal at about second 56)
service restored after the switch: 20:37:12/265802
service unavailable for: about 17 seconds
7.3. Log of another master

This master's ID is c67dc9e02e25f2e6321df8ac2eb4d99789917783.

30613:M 04 Jan 2019 20:37:10.398 * Marking node 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as failing (quorum reached).
30613:M 04 Jan 2019 20:37:10.398 # Cluster state changed: fail
30613:M 04 Jan 2019 20:37:11.084 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 32
30613:M 04 Jan 2019 20:37:11.124 # Cluster state changed: ok
30613:M 04 Jan 2019 20:37:17.560 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9
7.4. Log of another master

This master's ID is bfad383775421b1090eaa7e0b2dcfb3b38455079.

30614:M 04 Jan 2019 20:37:10.398 * Marking node 44eb43e50c101c5f44f48295c42dda878b6cb3e9 as failing (quorum reached).
30614:M 04 Jan 2019 20:37:11.085 # Failover auth granted to 0ae8b5400d566907a3d8b425d983ac3b7cbd8412 for epoch 32
30614:M 04 Jan 2019 20:37:17.560 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9
7.5. Slave log

The slave's master ID is 44eb43e50c101c5f44f48295c42dda878b6cb3e9; the slave's own ID is 0ae8b5400d566907a3d8b425d983ac3b7cbd8412.

30651:S 04 Jan 2019 20:37:10.398 * FAIL message received from c67dc9e02e25f2e6321df8ac2eb4d99789917783 about 44eb43e50c101c5f44f48295c42dda878b6cb3e9
30651:S 04 Jan 2019 20:37:10.398 # Cluster state changed: fail
30651:S 04 Jan 2019 20:37:10.475 # Start of election delayed for 539 milliseconds (rank #0, offset 228620).
30651:S 04 Jan 2019 20:37:11.077 # Starting a failover election for epoch 32.
30651:S 04 Jan 2019 20:37:11.100 # Failover election won: I'm the new master.
30651:S 04 Jan 2019 20:37:11.100 # configEpoch set to 32 after successful failover
30651:M 04 Jan 2019 20:37:11.100 # Setting secondary replication ID to 0cf19d01597610c7933b7ed67c999a631655eafc, valid up to offset: 228621. New replication ID is 53daa7fa265d982aebd3c18c07ed5f178fc3f70b
30651:M 04 Jan 2019 20:37:11.101 # Connection with master lost.
30651:M 04 Jan 2019 20:37:11.101 * Caching the disconnected master state.
30651:M 04 Jan 2019 20:37:11.101 * Discarding previously cached master state.
30651:M 04 Jan 2019 20:37:11.101 # Cluster state changed: ok
30651:M 04 Jan 2019 20:37:17.560 * Ignoring FAIL message from unknown node 082c079149a9915612d21cca8e08c831a4edeade about 44eb43e50c101c5f44f48295c42dda878b6cb3e9
8. Code for the slave's delayed election start

// Excerpted from Redis 5.0.3
// cluster.c

/* This function is called if we are a slave node and our master serving
 * a non-zero amount of hash slots is in FAIL state.
 *
 * The gaol of this function is:
 * 1) To check if we are able to perform a failover, is our data updated?
 * 2) Try to get elected by masters.
 * 3) Perform the failover informing all the other nodes.
 */
void clusterHandleSlaveFailover(void) {
    ......

    /* Check if our data is recent enough according to the slave validity
     * factor configured by the user.
     *
     * Check bypassed for manual failovers. */
    if (server.cluster_slave_validity_factor &&
        data_age >
        (((mstime_t)server.repl_ping_slave_period * 1000) +
         (server.cluster_node_timeout * server.cluster_slave_validity_factor)))
    {
        if (!manual_failover) {
            clusterLogCantFailover(CLUSTER_CANT_FAILOVER_DATA_AGE);
            return;
        }
    }

    /* If the previous failover attempt timedout and the retry time has
     * elapsed, we can setup a new one. */
    if (auth_age > auth_retry_time) {
        server.cluster->failover_auth_time = mstime() +
            500 + /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
            random() % 500; /* Random delay between 0 and 500 milliseconds. */
        server.cluster->failover_auth_count = 0;
        server.cluster->failover_auth_sent = 0;
        server.cluster->failover_auth_rank = clusterGetSlaveRank();
        /* We add another delay that is proportional to the slave rank.
         * Specifically 1 second * rank. This way slaves that have a probably
         * less updated replication offset, are penalized. */
        server.cluster->failover_auth_time +=
            server.cluster->failover_auth_rank * 1000;
        /* However if this is a manual failover, no delay is needed. */
        if (server.cluster->mf_end) {
            server.cluster->failover_auth_time = mstime();
            server.cluster->failover_auth_rank = 0;
        }
        serverLog(LL_WARNING,
            "Start of election delayed for %lld milliseconds "
            "(rank #%d, offset %lld).",
            server.cluster->failover_auth_time - mstime(),
            server.cluster->failover_auth_rank,
            replicationGetSlaveOffset());
        /* Now that we have a scheduled election, broadcast our offset
         * to all the other slaves so that they'll updated their offsets
         * if our offset is better. */
        clusterBroadcastPong(CLUSTER_BROADCAST_LOCAL_SLAVES);
        return;
    }

    ......
}