A Deep Dive into Redis Replication

Date: 2019-11-05

Replication

A few things to understand ASAP about Redis replication.

1) Redis replication is asynchronous, but you can configure a master to stop accepting writes if it appears to be not connected with at least a given number of slaves.
2) Redis slaves are able to perform a partial resynchronization with the master if the replication link is lost for a relatively small amount of time. You may want to configure the replication backlog size (see the next sections of this file) with a sensible value depending on your needs.
3) Replication is automatic and does not need user intervention. After a network partition slaves automatically try to reconnect to masters and resynchronize with them.

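How replication is implemented

1. Set the master's address and port.

In short, this means running the SLAVEOF command. SLAVEOF is asynchronous: once the masterhost and masterport attributes are set, the replica returns OK to the client that sent SLAVEOF. The OK only means the replication request has been accepted; the actual replication work starts after OK is returned.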

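2. Establish the socket connection.

After SLAVEOF has run, the replica opens a socket connection to the master at the IP address and port given in the command. If the connection succeeds, the replica associates the socket with a file event handler dedicated to replication. This handler carries out the rest of the replication work, such as receiving the RDB file and receiving the write commands propagated by the master.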

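3. Send the PING command.

Once the replica has become a client of the master, it first sends a PING command to the master. The PING serves two purposes:

1. It checks that the socket can be read from and written to normally.

2. It checks that the master can process command requests normally.

If the replica reads a "PONG" reply, the network connection between master and replica is healthy and the master can process the commands the replica sends.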

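4. Authentication

After receiving "PONG" from the master, the replica moves on to authentication. It authenticates only if its masterauth option is set; otherwise this step is skipped.

When authentication is required, the replica sends the master an AUTH command whose argument is the value of the replica's masterauth option.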

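5. Send the port information.

After authentication, the replica runs REPLCONF listening-port <port-number> to tell the master which port it is listening on.

The master records this in the slave_listening_port attribute of the corresponding client state. It can be seen with info Replication.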

127.0.0.1:6379> info Replication
# Replication
role:master
connected_slaves:1
slave0:ip=127.0.0.1,port=6380,state=online,offset=3696,lag=0

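6. Synchronization.

The replica sends the PSYNC command to the master to perform synchronization, bringing its own database up to the master's current state.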
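Steps 3 to 6 amount to the replica sending a short series of commands over the socket. The sketch below shows how those commands could look on the wire, assuming the standard RESP encoding (an array of bulk strings). The helper names encode_command and handshake_commands are hypothetical, and PSYNC ? -1 is what a replica with no previous replication state requests. A real replica waits for each reply before sending the next command; this sketch only builds the byte strings.

```python
from typing import List, Optional


def encode_command(*args: str) -> bytes:
    """Encode one command as a RESP array of bulk strings."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        parts.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(parts)


def handshake_commands(listening_port: int,
                       masterauth: Optional[str] = None,
                       replid: str = "?",
                       offset: int = -1) -> List[bytes]:
    """Build the commands of steps 3 to 6, in order (illustrative only)."""
    commands = [encode_command("PING")]                      # step 3
    if masterauth is not None:                               # step 4, only if masterauth is set
        commands.append(encode_command("AUTH", masterauth))
    commands.append(                                         # step 5
        encode_command("REPLCONF", "listening-port", str(listening_port)))
    commands.append(encode_command("PSYNC", replid, str(offset)))  # step 6
    return commands


if __name__ == "__main__":
    for raw in handshake_commands(6380, masterauth="secret"):
        print(raw)
```

Without masterauth the list has three entries, which mirrors the text: authentication is skipped entirely when the option is not set.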

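7. Command propagation

Once synchronization is complete, master and replica enter the command propagation phase. From then on, the master keeps sending the write commands it executes to the replica, and the replica keeps receiving and executing them, which keeps the two consistent.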

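8. Heartbeat detection

During the command propagation phase, the replica by default sends the following command to the master once per second:

REPLCONF ACK <replication_offset>

Here replication_offset is the replica's current replication offset.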

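Sending REPLCONF ACK serves three purposes:

1> Checking the state of the network connection between master and replica.

2> Helping implement the min-slaves options (min-slaves-to-write and min-slaves-max-lag).

3> Detecting lost commands.

The REPLCONF ACK command and the replication backlog were both added in Redis 2.8. Before that, neither the master nor the replica would notice if a command was lost during propagation.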
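To make the second and third purposes concrete, here is a minimal sketch of the checks a master could run on the ACKs it receives. The function names and the idea of passing in a list of last-ACK timestamps are assumptions made for illustration; they are not Redis internals.

```python
from typing import Iterable


def good_replicas(last_ack_times: Iterable[float], now: float, max_lag: float) -> int:
    """Count replicas whose last REPLCONF ACK arrived at most max_lag seconds ago."""
    return sum(1 for t in last_ack_times if now - t <= max_lag)


def accepts_writes(last_ack_times: Iterable[float], now: float,
                   min_slaves_to_write: int, min_slaves_max_lag: float) -> bool:
    """Apply the min-slaves-to-write / min-slaves-max-lag rule."""
    if min_slaves_to_write == 0 or min_slaves_max_lag == 0:
        return True  # setting either option to 0 disables the check
    return good_replicas(last_ack_times, now, min_slaves_max_lag) >= min_slaves_to_write


def missing_bytes(master_repl_offset: int, acked_offset: int) -> int:
    """Bytes the replica has not acknowledged yet (0 means it is caught up)."""
    return max(0, master_repl_offset - acked_offset)


if __name__ == "__main__":
    acks = [100.0, 95.0, 80.0]  # hypothetical times of each replica's last ACK
    print(accepts_writes(acks, now=101.0, min_slaves_to_write=3, min_slaves_max_lag=10))
    print(missing_bytes(master_repl_offset=5698, acked_offset=5600))
```

With min-slaves-to-write 3 and min-slaves-max-lag 10, the third replica in the example (21 seconds since its last ACK) does not count as good, so writes would be refused.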

Replication-related parameters

slaveof <masterip> <masterport>
masterauth <master-password>
slave-serve-stale-data yes
slave-read-only yes
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-ping-slave-period 10
repl-timeout 60
repl-disable-tcp-nodelay no
repl-backlog-size 1mb
repl-backlog-ttl 3600
slave-priority 100
min-slaves-to-write 3
min-slaves-max-lag 10
slave-announce-ip 5.5.5.5
slave-announce-port 1234

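Where:

slaveof <masterip> <masterport>: enables replication. This one directive is all that is needed.

masterauth <master-password>: if the master has a password set through the requirepass parameter, the slave must set this parameter.

slave-serve-stale-data: whether the slave may serve clients while the link to the master is down or while replication is still being set up. The default is yes, which allows it to serve requests, although clients may read stale data.

slave-read-only: puts the slave in read-only mode. Note that read-only mode applies only to write operations from clients; it has no effect on administrative commands.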

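repl-diskless-sync, repl-diskless-sync-delay: whether to use diskless replication. To reduce disk load on the master, Redis supports diskless replication, in which the generated RDB file is not saved to disk but sent straight to the replica over the network. Diskless replication suits masters whose disks are slow but whose network bandwidth is plentiful. Note that, at the time of writing, diskless replication was still experimental.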

repl-ping-slave-period: the interval, in seconds, at which the master sends a PING command to its slaves.

repl-timeout: the replication timeout.

# The following option sets the replication timeout for:
#
# 1) Bulk transfer I/O during SYNC, from the point of view of slave.
# 2) Master timeout from the point of view of slaves (data, pings).
# 3) Slave timeout from the point of view of masters (REPLCONF ACK pings).
#
# It is important to make sure that this value is greater than the value
# specified for repl-ping-slave-period otherwise a timeout will be detected
# every time there is low traffic between the master and the slave.

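repl-disable-tcp-nodelay: when set to yes, the master waits for a short time before sending TCP packets. The exact delay depends on the Linux kernel and is typically 40 ms. This suits deployments where the network between master and replica is complex or bandwidth is tight. The default is no.

repl-backlog-size: the size of the replication backlog, a fixed-length queue kept on the master. It is used for partial resynchronization, introduced in Redis 2.8.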

# Set the replication backlog size. The backlog is a buffer that accumulates
# slave data when slaves are disconnected for some time, so that when a slave
# wants to reconnect again, often a full resync is not needed, but a partial
# resync is enough, just passing the portion of data the slave missed while
# disconnected.
#
# The bigger the replication backlog, the longer the time the slave can be
# disconnected and later be able to perform a partial resynchronization.
#
# The backlog is only allocated once there is at least a slave connected.

The backlog is allocated only once a slave has connected.
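A fixed-length queue of this kind can be sketched with a bounded deque. The Backlog class below is a hypothetical illustration, not the Redis data structure; its property names borrow from the INFO fields repl_backlog_first_byte_offset and repl_backlog_histlen so the relationship between them and master_repl_offset is easy to see.

```python
from collections import deque
from typing import Optional


class Backlog:
    """Keeps only the most recent `size` bytes of the replication stream."""

    def __init__(self, size: int):
        self.buf = deque(maxlen=size)   # oldest bytes fall off automatically
        self.master_repl_offset = 0     # total bytes ever written

    def feed(self, data: bytes) -> None:
        self.buf.extend(data)
        self.master_repl_offset += len(data)

    @property
    def histlen(self) -> int:
        return len(self.buf)

    @property
    def first_byte_offset(self) -> int:
        # Offset of the oldest byte still held in the backlog.
        return self.master_repl_offset - len(self.buf) + 1

    def read_from(self, offset: int) -> Optional[bytes]:
        """Return the stream from `offset` onward, or None if it is out of range."""
        if offset < self.first_byte_offset or offset > self.master_repl_offset + 1:
            return None
        skip = offset - self.first_byte_offset
        return bytes(list(self.buf)[skip:])


if __name__ == "__main__":
    backlog = Backlog(size=8)
    backlog.feed(b"abcdef")
    backlog.feed(b"ghij")               # the two oldest bytes are pushed out
    print(backlog.first_byte_offset, backlog.histlen, backlog.read_from(9))
```

A None result corresponds to the case where the requested data has already been overwritten, which is why a larger backlog tolerates longer disconnections.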

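repl-backlog-ttl: if all of a master's slaves have disconnected and none reconnects within the specified time, the master frees the backlog. repl-backlog-ttl sets that time. The default is 3600 s; a value of 0 means the backlog is never freed.

slave-priority: the slave's priority, used by Redis Sentinel during failover. The lower the value, the higher the priority for promotion to master. Note that a value of 0 means the slave never takes part in master election.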
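The priority rule can be illustrated in a few lines. This sketch considers slave-priority only; Sentinel applies further criteria when priorities tie, which are outside the scope of this article. The function name pick_candidate and the replica names are hypothetical.

```python
from typing import Dict, Optional


def pick_candidate(priorities: Dict[str, int]) -> Optional[str]:
    """Return the replica with the lowest non-zero slave-priority, if any."""
    eligible = {name: prio for name, prio in priorities.items() if prio != 0}
    if not eligible:
        return None  # every replica opted out with priority 0
    return min(eligible, key=eligible.get)


if __name__ == "__main__":
    # "c" has priority 0, so it is never promoted even though 0 is the smallest value.
    print(pick_candidate({"a": 100, "b": 10, "c": 0}))
```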

slave-announce-ip, slave-announce-port: commonly used with port forwarding or NAT, to report to the master the IP address and port at which the slave can actually be reached.

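The synchronization process

1. The replica sends the PSYNC command to the master.

2. On receiving PSYNC, the master runs BGSAVE to generate an RDB file in the background, and uses a buffer to record every write command executed from that point on.

3. When BGSAVE finishes, the master sends the resulting RDB file to the replica. The replica receives and loads it, bringing its database to the state the master's database was in when BGSAVE ran.

4. The master sends all the write commands recorded in the buffer to the replica. The replica executes them, bringing its database to the master's current state.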

Note that the buffer mentioned in step 2 is limited in size. The limit is set by client-output-buffer-limit slave 256mb 64mb 60. The syntax and meaning of this parameter are as follows:

# client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds>
#
# A client is immediately disconnected once the hard limit is reached, or if
# the soft limit is reached and remains reached for the specified number of
# seconds (continuously).

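In other words, if the buffer grows beyond 256 MB, or stays above 64 MB for 60 s, the master disconnects the replica immediately. After the disconnect, once 60 s (repl-timeout) have passed, the replica notices that it is no longer receiving data from the master and starts replication again.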
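The hard/soft rule is easy to get wrong, so here is a minimal sketch of it. The OutputBufferLimit class is a hypothetical illustration written from the description above, not Redis source code; in particular, the exact boundary comparisons are an assumption.

```python
from typing import Optional

MB = 1024 * 1024


class OutputBufferLimit:
    """Hard/soft limit check for one client's output buffer (illustrative)."""

    def __init__(self, hard_bytes: int, soft_bytes: int, soft_seconds: float):
        self.hard_bytes = hard_bytes
        self.soft_bytes = soft_bytes
        self.soft_seconds = soft_seconds
        self.soft_since: Optional[float] = None  # when the soft limit was first crossed

    def should_disconnect(self, used_bytes: int, now: float) -> bool:
        if used_bytes >= self.hard_bytes:
            return True                      # hard limit: disconnect at once
        if used_bytes >= self.soft_bytes:
            if self.soft_since is None:
                self.soft_since = now        # start timing the soft limit
            return now - self.soft_since > self.soft_seconds
        self.soft_since = None               # dropped below: the timer resets
        return False


if __name__ == "__main__":
    limit = OutputBufferLimit(256 * MB, 64 * MB, 60)  # slave 256mb 64mb 60
    for second, used in [(0, 10 * MB), (10, 70 * MB), (69, 70 * MB), (71, 70 * MB)]:
        print(second, limit.should_disconnect(used, now=second))
```

The soft limit only triggers when the buffer stays above 64 MB continuously; a single dip below it resets the 60 s timer.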

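Before Redis 2.8, if replication between master and replica was interrupted by a network problem, reconnecting still ran the SYNC command and performed a full resynchronization, which was inefficient. Starting with Redis 2.8, the PSYNC command replaces SYNC for the synchronization step of replication.

PSYNC has two modes: full resynchronization and partial resynchronization.

Log of a full resynchronization:

master: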

19544:M 05 Oct 20:44:04.713 * Slave 127.0.0.1:6380 asks for synchronization
19544:M 05 Oct 20:44:04.713 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for 'dc419fe03ddc9ba30cf2a2cf1894872513f1ef96', my replication IDs are 'f8a035fdbb7cfe435652b3445c2141f98a65e437' and '0000000000000000000000000000000000000000')
19544:M 05 Oct 20:44:04.713 * Starting BGSAVE for SYNC with target: disk
19544:M 05 Oct 20:44:04.713 * Background saving started by pid 20585
20585:C 05 Oct 20:44:04.723 * DB saved on disk
20585:C 05 Oct 20:44:04.723 * RDB: 0 MB of memory used by copy-on-write
19544:M 05 Oct 20:44:04.813 * Background saving terminated with success
19544:M 05 Oct 20:44:04.814 * Synchronization with slave 127.0.0.1:6380 succeeded

slave:

19746:S 05 Oct 20:44:04.288 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
19746:S 05 Oct 20:44:04.288 * SLAVE OF 127.0.0.1:6379 enabled (user request from 'id=3 addr=127.0.0.1:37128 fd=8 name= age=929 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=slaveof')
19746:S 05 Oct 20:44:04.712 * Connecting to MASTER 127.0.0.1:6379
19746:S 05 Oct 20:44:04.712 * MASTER <-> SLAVE sync started
19746:S 05 Oct 20:44:04.712 * Non blocking connect for SYNC fired the event.
19746:S 05 Oct 20:44:04.713 * Master replied to PING, replication can continue...
19746:S 05 Oct 20:44:04.713 * Trying a partial resynchronization (request dc419fe03ddc9ba30cf2a2cf1894872513f1ef96:1191).
19746:S 05 Oct 20:44:04.713 * Full resync from master: f8a035fdbb7cfe435652b3445c2141f98a65e437:1190
19746:S 05 Oct 20:44:04.713 * Discarding previously cached master state.
19746:S 05 Oct 20:44:04.814 * MASTER <-> SLAVE sync: receiving 224566 bytes from master
19746:S 05 Oct 20:44:04.814 * MASTER <-> SLAVE sync: Flushing old data
19746:S 05 Oct 20:44:04.815 * MASTER <-> SLAVE sync: Loading DB in memory
19746:S 05 Oct 20:44:04.817 * MASTER <-> SLAVE sync: Finished with success

Log of a partial resynchronization:

master:

19544:M 05 Oct 20:42:06.423 # Connection with slave 127.0.0.1:6380 lost.
19544:M 05 Oct 20:42:06.753 * Slave 127.0.0.1:6380 asks for synchronization
19544:M 05 Oct 20:42:06.753 * Partial resynchronization request from 127.0.0.1:6380 accepted. Sending 0 bytes of backlog starting from offset 1037.

slave:

19746:S 05 Oct 20:42:06.423 # Connection with master lost.
19746:S 05 Oct 20:42:06.423 * Caching the disconnected master state.
19746:S 05 Oct 20:42:06.752 * Connecting to MASTER 127.0.0.1:6379
19746:S 05 Oct 20:42:06.752 * MASTER <-> SLAVE sync started
19746:S 05 Oct 20:42:06.752 * Non blocking connect for SYNC fired the event.
19746:S 05 Oct 20:42:06.753 * Master replied to PING, replication can continue...
19746:S 05 Oct 20:42:06.753 * Trying a partial resynchronization (request f8a035fdbb7cfe435652b3445c2141f98a65e437:1037).
19746:S 05 Oct 20:42:06.753 * Successful partial resynchronization with master.
19746:S 05 Oct 20:42:06.753 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.

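In Redis 4.0, master_replid and offset are stored in the RDB file. When a replica is shut down gracefully and restarted, Redis can reload master_replid and offset from the RDB file, which makes partial resynchronization possible.

Partial resynchronization relies on three things:

1. The replication offsets of the master and the replica.

2. The master's replication backlog.

3. The node's run ID.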

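When a replica is promoted to master, the other replicas must resynchronize with the new master. Before Redis 4.0 this was a full resynchronization, because master_replid had changed. Since Redis 4.0, the new master records the old master's master_replid and offset, so it can accept partial resynchronization requests from the other replicas even when the master_replid in the request differs from its own. Under the hood, when slaveof no one is executed, master_replid and master_repl_offset+1 are copied into master_replid2 and second_repl_offset.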
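Putting the pieces together, the master's choice between a partial and a full resynchronization can be sketched as a single predicate over the fields that INFO replication reports. The function can_partial_resync is a hypothetical illustration of the rule described above, and the backlog values in the example are made up to match the logs shown earlier.

```python
from typing import Dict, Union

MasterState = Dict[str, Union[str, int]]


def can_partial_resync(master: MasterState, req_replid: str, req_offset: int) -> bool:
    """Decide whether a PSYNC <replid> <offset> request can be served from the backlog."""
    same_id = req_replid == master["master_replid"]
    previous_id = (req_replid == master["master_replid2"]
                   and req_offset <= master["second_repl_offset"])
    if not (same_id or previous_id):
        return False  # unknown replication history: full resync
    first = master["repl_backlog_first_byte_offset"]
    last = first + master["repl_backlog_histlen"]
    return first <= req_offset <= last  # requested data must still be in the backlog


if __name__ == "__main__":
    master = {
        "master_replid": "f8a035fdbb7cfe435652b3445c2141f98a65e437",
        "master_replid2": "0" * 40,
        "second_repl_offset": -1,
        "repl_backlog_first_byte_offset": 1,   # assumed value
        "repl_backlog_histlen": 1036,          # assumed value
    }
    print(can_partial_resync(master, "f8a035fdbb7cfe435652b3445c2141f98a65e437", 1037))
    print(can_partial_resync(master, "dc419fe03ddc9ba30cf2a2cf1894872513f1ef96", 1191))
```

The first request matches the accepted partial resynchronization in the log; the second has a replication ID the master does not recognize, so it falls back to a full resynchronization.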

Replication-related variables

# Replication
role:master
connected_slaves:2
slave0:ip=127.0.0.1,port=6380,state=online,offset=5698,lag=0
slave1:ip=127.0.0.1,port=6381,state=online,offset=5698,lag=0
master_replid:e071f49c8d9d6719d88c56fa632435fba83e145d
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:5698
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:5698

# Replication
role:slave
master_host:127.0.0.1
master_port:6379
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
slave_repl_offset:126
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:15715bc0bd37a71cae3d08b9566f001ccbc739de
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:126
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1
repl_backlog_histlen:126

Where:

role: Value is "master" if the instance is replica of no one, or "slave" if the instance is a replica of some master instance. Note that a replica can be master of another replica (chained replication).

master_replid: The replication ID of the Redis server. Each Redis node is dynamically assigned a 40-character hexadecimal string as its run ID when it starts; this is the master's run ID.

master_replid2: The secondary replication ID, used for PSYNC after a failover. When slaveof no one is executed, master_replid and master_repl_offset+1 are copied into master_replid2 and second_repl_offset.

master_repl_offset: The server's current replication offset.

second_repl_offset: The offset up to which replication IDs are accepted.

repl_backlog_active: Flag indicating replication backlog is active, that is, whether the backlog is enabled.

repl_backlog_size: Total size in bytes of the replication backlog buffer, as set by repl-backlog-size.

repl_backlog_first_byte_offset: The master offset of the replication backlog buffer, that is, the earliest master offset still held in the backlog.

repl_backlog_histlen: Size in bytes of the data in the replication backlog buffer.

If the instance is a replica, these additional fields are provided:

master_host: Host or IP address of the master.

master_port: Master listening TCP port.

master_link_status: Status of the link (up/down).

master_last_io_seconds_ago: Number of seconds since the last interaction with master. The master sends a PING to the replica every 10 s by default to check that the replica is alive and connected; this variable shows how long ago master and replica last exchanged a heartbeat.

master_sync_in_progress: Indicate the master is syncing to the replica. In my view, this should refer to a full or partial resynchronization.

slave_repl_offset: The replication offset of the replica instance.

slave_priority: The priority of the instance as a candidate for failover.

slave_read_only: Flag indicating if the replica is read-only.

If a SYNC operation is on-going, these additional fields are provided:

master_sync_left_bytes: Number of bytes left before syncing is complete.

master_sync_last_io_seconds_ago: Number of seconds since last transfer I/O during a SYNC operation.

If the link between master and replica is down, an additional field is provided:

master_link_down_since_seconds: Number of seconds since the link is down.

The following field is always provided:

connected_slaves: Number of connected replicas.

If the server is configured with the min-slaves-to-write (or starting with Redis 5 with the min-replicas-to-write) directive, an additional field is provided:

min_slaves_good_slaves: Number of replicas currently considered good.

For each replica, the following line is added:
slaveXXX: id, IP address, port, state, offset, lag.

slave0:ip=127.0.0.1,port=6381,state=online,offset=1288,lag=1

How to monitor replication lag between master and replica

# Replication
role:master
connected_slaves:2
slave0:ip=127.0.0.1,port=6381,state=online,offset=560,lag=0
slave1:ip=127.0.0.1,port=6380,state=online,offset=560,lag=0
master_replid:15715bc0bd37a71cae3d08b9566f001ccbc739de
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:560

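Here master_repl_offset is the master's replication offset and the offset in each slaveX line is that replica's replication offset. The difference between the two is the replication lag, in bytes.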
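A monitoring script could compute this difference directly from the text that info Replication returns on the master. The following is a minimal sketch; replication_lag is a hypothetical helper, and the sample text uses made-up offsets so that one replica shows a non-zero lag.

```python
from typing import Dict


def replication_lag(info_text: str) -> Dict[str, int]:
    """Map each replica's ip:port to master_repl_offset minus its offset (bytes)."""
    master_offset = None
    replica_offsets = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        if key == "master_repl_offset":
            master_offset = int(value)
        elif key.startswith("slave") and key[5:].isdigit():   # slave0, slave1, ...
            fields = dict(item.split("=", 1) for item in value.split(","))
            replica_offsets[f"{fields['ip']}:{fields['port']}"] = int(fields["offset"])
    if master_offset is None:
        raise ValueError("master_repl_offset not found in INFO output")
    return {addr: master_offset - off for addr, off in replica_offsets.items()}


SAMPLE = """# Replication
role:master
connected_slaves:2
slave0:ip=127.0.0.1,port=6381,state=online,offset=560,lag=0
slave1:ip=127.0.0.1,port=6380,state=online,offset=532,lag=1
master_repl_offset:560
"""

if __name__ == "__main__":
    print(replication_lag(SAMPLE))
```

Note that the lag field inside each slaveX line measures seconds since the last REPLCONF ACK, whereas the value computed here is a distance in bytes.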

How to size the backlog buffer

t * (master_repl_offset2 - master_repl_offset1 ) / (t2 - t1)

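Here t is how long a disconnection may last, in seconds, and master_repl_offset1 and master_repl_offset2 are the values of master_repl_offset sampled at times t1 and t2, so the fraction is the master's write rate in bytes per second.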
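As a worked example of the formula, the sketch below takes two samples of master_repl_offset and a tolerated outage length and returns a backlog size in bytes. The function name and the sample numbers are assumptions chosen for illustration.

```python
import math


def estimate_backlog_size(t: float,
                          offset1: int, time1: float,
                          offset2: int, time2: float) -> int:
    """Backlog bytes needed to cover a disconnection of t seconds."""
    if time2 <= time1:
        raise ValueError("time2 must be later than time1")
    write_rate = (offset2 - offset1) / (time2 - time1)  # bytes per second
    return math.ceil(t * write_rate)


if __name__ == "__main__":
    # master_repl_offset grew from 1000 to 61000 over 60 s, i.e. 1000 bytes/s;
    # covering a 5 minute outage then needs about 300000 bytes.
    print(estimate_backlog_size(300, 1000, 0, 61000, 60))
```

In practice it is reasonable to sample during peak write traffic and leave some headroom above the computed value.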

References:

1. 《Redis開發與運維》

2. 《Redis設計與實現》

3. 《Redis 4.X Cookbook》