How MongoDB Replica Set Synchronization Works

Date: 2019-11-24


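The official documentation says relatively little about how MongoDB synchronization works, and there is not much material online either. What follows is a summary pieced together from the official documentation, online resources, and the logs collected during testing.
Because every shard in MongoDB is itself a replica set, understanding how a replica set synchronizes is enough.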

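1. Initial Sync

Broadly speaking, synchronization in a MongoDB replica set consists of two steps:

1. Initial Sync: a full copy of the data
2. Replication: syncing the oplog

A member first copies the full data set through initial sync, then keeps applying incremental changes by continuously replaying the Primary's oplog through replication. Once initial sync completes, the member transitions from STARTUP2 to SECONDARY.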

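1.1 The initial sync process

1) Initial sync starts and records the latest timestamp on the sync source, t1
2) Copy all collection data and build the indexes (the time-consuming part)
3) Record the latest timestamp on the sync source, t2
4) Replay all oplog entries between t1 and t2
5) Initial sync ends

Put simply, the node walks through every collection in every database on the Primary, copies the data to itself, and then reads and replays the oplog entries written between the start and the end of the copy.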
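
The five steps can be illustrated with a small in-memory simulation. This is only a sketch: Node, write, latest_ts and initial_sync are hypothetical names made up for illustration, not MongoDB APIs, and the "oplog" is a plain Python list of upserts. It shows why replaying the entries between t1 and t2 repairs a copy that was taken while writes were still arriving.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a replica set member (hypothetical, not a MongoDB API)."""
    data: dict = field(default_factory=dict)    # {collection: {_id: document}}
    oplog: list = field(default_factory=list)   # [(ts, collection, _id, document)]

    def write(self, ts, coll, _id, doc):
        """Apply an upsert locally and record it in the oplog."""
        self.data.setdefault(coll, {})[_id] = doc
        self.oplog.append((ts, coll, _id, doc))

    def latest_ts(self):
        return self.oplog[-1][0] if self.oplog else 0


def initial_sync(source, target, concurrent_writes=()):
    """Copy every collection, then replay the oplog written during the copy."""
    t1 = source.latest_ts()                       # 1) timestamp when the copy starts
    pending = list(concurrent_writes)
    for coll in list(source.data):                # 2) copy collection by collection
        target.data[coll] = dict(source.data[coll])
        while pending:                            # simulate writes landing mid-copy
            source.write(*pending.pop(0))
    t2 = source.latest_ts()                       # 3) timestamp when the copy ends
    for ts, coll, _id, doc in source.oplog:       # 4) replay entries in (t1, t2]
        if t1 < ts <= t2:
            target.data.setdefault(coll, {})[_id] = doc
    return t1, t2                                 # 5) done
```

In this toy model every oplog entry is an idempotent upsert, so replaying an entry that the copy already picked up does no harm.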

After initial sync finishes, the Secondary opens a tailable cursor on the Primary's local.oplog.rs, keeps fetching newly written oplog entries from the Primary, and applies them to itself.

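1.2 When initial sync is triggered

A Secondary must perform an initial sync first in any of the following cases:

1) The oplog is empty
2) The _initialSyncFlag field in the local.replset.minvalid collection is set to true (used to handle a failed initial sync)
3) The in-memory flag initialSyncRequested is set to true (used by the resync command; resync applies only to the master/slave architecture and cannot be used with replica sets)

These three cases correspond to the following scenarios (cases 2 and 3 do not appear in the official documentation as far as I could find; they are taken from Zhang Youdong's blog):

1) A new node joins with no oplog at all, so it must run initial sync first
2) When initial sync starts, the node sets _initialSyncFlag to true and sets it back to false after finishing normally; if a node finds _initialSyncFlag set to true on restart, the previous initial sync failed partway through, and initial sync must be run again
3) When the user sends the resync command, initialSyncRequested is set to true, which forces a new initial sync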
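
The three conditions amount to a simple check. A minimal sketch, assuming the three values have already been read from the node; needs_initial_sync and its parameter names are hypothetical and only mirror the conditions listed above, not the server's actual code:

```python
def needs_initial_sync(oplog_is_empty, initial_sync_flag, initial_sync_requested):
    """Return the reason a member would start initial sync, or None if it would not."""
    if oplog_is_empty:
        return "new member: local oplog is empty"
    if initial_sync_flag:
        return "previous initial sync did not finish (_initialSyncFlag is true)"
    if initial_sync_requested:
        return "resync requested (initialSyncRequested is true)"
    return None
```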

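1.3 Questions and answers

1.3.1 While the data is being copied, can the source's oplog be overwritten and cause initial sync to fail?

In version 3.4 and later, no: the syncing node stores newly written oplog records locally while the copy runs.
[Figure: improvements to initial sync in 3.4, from Zhang Youdong's blog]

The official documentation states:

Initial sync builds all collection indexes as the documents are copied for each collection. In earlier versions of MongoDB (before 3.4), only the _id indexes are built during this stage.
While copying data, initial sync pulls newly added oplog records and stores them locally (new in 3.4).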

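2. Replication

2.1 The oplog sync process

After initial sync finishes, the Secondary opens a tailable cursor starting from the end timestamp, then keeps pulling oplog entries from the sync source and replaying them on itself. This work is not done by a single thread: to improve throughput, MongoDB splits pulling the oplog and replaying it across different threads.
The threads and their roles are as follows (I have not found this in the official documentation yet; it comes from Zhang Youdong's blog):

producer thread: continuously pulls oplog entries from the sync source and adds them to a BlockQueue. The BlockQueue holds at most 240MB of oplog data; once that threshold is exceeded, the producer must wait until replBatcher has consumed entries before pulling more.
replBatcher thread: takes oplog entries one by one from the producer thread's queue and puts them into its own queue, which holds at most 5000 entries and at most 512MB in total. When this queue is full, it must wait for oplogApplication to consume it.
oplogApplication thread: takes all entries currently in the replBatcher thread's queue and distributes them by docId (or by collection name if the storage engine does not support document-level locking) across the replWriter threads, which apply the oplog entries to the node. After all entries have been applied, the oplogApplication thread writes them in order to the local.oplog.rs collection.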
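
The fan-out step can be sketched as follows. This is an illustration under stated assumptions, not MongoDB's implementation: partition_batch and apply_batch are hypothetical names, the ops are plain dicts with made-up doc_id and value fields, and zlib.crc32 merely stands in for whatever hash the server uses. The point it shows is that ops on the same document always go to the same writer, so per-document order survives parallel apply, and the local oplog is written in order only after every writer has finished.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def partition_batch(batch, writer_count):
    """Group ops so that all ops on one document land in the same bucket."""
    buckets = [[] for _ in range(writer_count)]
    for op in batch:
        key = repr(op["doc_id"]).encode("utf-8")
        buckets[zlib.crc32(key) % writer_count].append(op)
    return buckets


def apply_batch(batch, store, local_oplog, writer_count=4):
    """Apply one batch with several writer threads, then append it to the oplog."""
    def apply_bucket(bucket):
        for op in bucket:                  # order inside a bucket is preserved
            store[op["doc_id"]] = op["value"]

    with ThreadPoolExecutor(max_workers=writer_count) as pool:
        # list() waits for every writer and re-raises any worker exception
        list(pool.map(apply_bucket, partition_batch(batch, writer_count)))
    local_oplog.extend(batch)              # written in original order, after apply
```

Partitioning by collection name instead of document id would only change the key passed to the hash.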

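[Figure: the producer, replBatcher, oplogApplication, and replWriter threads described above]

Statistics for the producer's buffer and for the apply threads can both be queried through db.serverStatus().metrics.repl.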

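2.2 Questions and answers about the process

2.2.1 Why does oplog replay involve so many threads?

As in MySQL, each thread does one job: a single thread pulls the oplog while other threads replay it, and using several replay threads speeds up replay.

2.2.2 Why is the replBatcher thread needed as a relay?

Oplog replay must preserve ordering, and DDL commands such as create and drop cannot run in parallel with ordinary CRUD operations. The replBatcher thread enforces these constraints.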
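
One way to picture this barrier is a batcher that never mixes a command with other operations. The sketch below is an assumption-laden illustration, not the server's code: make_batches is a hypothetical name, entries are dicts whose "op" field uses the oplog's one-letter codes ("i", "u", "d" for insert, update, delete and "c" for commands such as create or drop), and the 5000-entry limit is the figure quoted earlier in this article.

```python
def make_batches(ops, max_ops=5000):
    """Split an ordered op stream into batches; each command gets a batch of its own."""
    batches, current = [], []
    for op in ops:
        if op["op"] == "c":            # command entry (create, drop, ...)
            if current:                # flush everything that came before it
                batches.append(current)
                current = []
            batches.append([op])       # the command is applied alone
        else:
            current.append(op)
            if len(current) >= max_ops:
                batches.append(current)
                current = []
    if current:
        batches.append(current)
    return batches
```

Because batches are applied one after another, everything before a command is fully applied before the command runs, and nothing after it starts until it is done.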

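2.2.3 What can be done when oplog replay on a secondary cannot keep up with the primary?

Method 1: increase the number of replay threads

* On the mongod command line: mongod --setParameter replWriterThreadCount=32
* In the configuration file:

setParameter:
  replWriterThreadCount: 32

Method 2: increase the size of the oplog
Method 3: spread the writeOpsToOplog step across multiple replWriter threads so it runs concurrently; according to the official development log this has already been implemented (in version 3.4.0-rc2)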

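2.3 Notes

Initial sync copies data with a single thread and is fairly slow, so production environments should avoid triggering it wherever possible, which requires sizing the oplog sensibly.
When adding a new node, initial sync can be avoided by physical copy: copy the Primary's dbpath to the new node and start it directly.
When a Secondary lags because concurrent writes on the primary are too heavy, and sizeBytes in db.serverStatus().metrics.repl.buffer stays close to maxSizeBytes, raising the number of concurrent replWriter threads on the Secondary can help it catch up.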
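
The buffer check in the last note can be sketched as a small helper. It is only an illustration: buffer_pressure and the 0.9 threshold are assumptions, not values from MongoDB, and the function works on a plain dict shaped like the metrics.repl section of serverStatus so that it can be tried without a running server.

```python
def buffer_pressure(repl_metrics, threshold=0.9):
    """Return (fill_ratio, is_under_pressure) for the replication buffer.

    repl_metrics is a dict shaped like db.serverStatus().metrics.repl.
    With pymongo it could be obtained roughly as
    client.admin.command("serverStatus")["metrics"]["repl"].
    """
    buf = repl_metrics["buffer"]
    used = int(buf["sizeBytes"])
    cap = int(buf["maxSizeBytes"])
    ratio = used / cap if cap else 0.0
    return ratio, ratio >= threshold
```

A single reading near the limit says little; the note above is about sizeBytes staying close to maxSizeBytes over time, so the helper would be called repeatedly and the results compared.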

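3. Log Analysis

3.1 Initial sync logs

Set the log verbosity to 1, then filter the log:
grep -E "clone|index|oplog" mg36000.log > b.log
A portion of the filtered log is shown below.
Logs from adding a new node on 3.4.21

Posting the full log would add little because there is so much of it, so only the entries for one collection in the db01 database are shown.
They show that the collection's indexes are created first, and then the collection data and index data are cloned, which completes the clone of that collection. The cloner then moves on to the next collection.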

2019-08-21T16:50:10.880+0800 D STORAGE  [InitialSyncInserters-db01.test20] create uri: table:db01/index-27-154229953453504826 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "num" : 1 }, "name" : "num_1", "ns" : "db01.test2" }),
2019-08-21T16:50:10.882+0800 I INDEX    [InitialSyncInserters-db01.test20] build index on: db01.test2 properties: { v: 2, key: { num: 1.0 }, name: "num_1", ns: "db01.test2" }
2019-08-21T16:50:10.882+0800 I INDEX    [InitialSyncInserters-db01.test20]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-21T16:50:10.882+0800 D STORAGE  [InitialSyncInserters-db01.test20] create uri: table:db01/index-28-154229953453504826 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.test2" }),
2019-08-21T16:50:10.886+0800 I INDEX    [InitialSyncInserters-db01.test20] build index on: db01.test2 properties: { v: 2, key: { _id: 1 }, name: "_id_", ns: "db01.test2" }
2019-08-21T16:50:10.886+0800 I INDEX    [InitialSyncInserters-db01.test20]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-21T16:50:10.901+0800 D INDEX    [InitialSyncInserters-db01.test20]      bulk commit starting for index: num_1
2019-08-21T16:50:10.906+0800 D INDEX    [InitialSyncInserters-db01.test20]      bulk commit starting for index: _id_
2019-08-21T16:50:10.913+0800 D REPL     [repl writer worker 11] collection clone finished: db01.test2
2019-08-21T16:50:10.913+0800 D REPL     [repl writer worker 11]     collection: db01.test2, stats: { ns: "db01.test2", documentsToCopy: 2000, documentsCopied: 2000, indexes: 2, fetchedBatches: 1, start: new Date(1566377410875), end: new Date(1566377410913), elapsedMillis: 38 }
2019-08-21T16:50:10.920+0800 D STORAGE  [InitialSyncInserters-db01.collection10] create uri: table:db01/index-30-154229953453504826 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.collection1" }),
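
The per-collection "stats" lines in logs like these are convenient to summarize when comparing clone times. The helper below is a rough sketch: parse_clone_stats and STATS_RE are hypothetical names, and the pattern is written only against the line format visible in this excerpt, so other versions may need adjustments.

```python
import re

# Matches lines such as:
#   ... collection: db01.test2, stats: { ns: "db01.test2", documentsToCopy: 2000, ...
STATS_RE = re.compile(
    r"collection: (?P<ns>\S+), stats: \{.*?"
    r"documentsToCopy: (?P<to_copy>\d+), documentsCopied: (?P<copied>\d+), "
    r"indexes: (?P<indexes>\d+),.*?elapsedMillis: (?P<ms>\d+)"
)


def parse_clone_stats(line):
    """Extract clone statistics from one log line, or return None if it has none."""
    m = STATS_RE.search(line)
    if m is None:
        return None
    return {
        "ns": m["ns"],
        "documentsToCopy": int(m["to_copy"]),
        "documentsCopied": int(m["copied"]),
        "indexes": int(m["indexes"]),
        "elapsedMillis": int(m["ms"]),
    }
```

Running it over every line of the filtered log and keeping the non-None results gives one record per cloned collection.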

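Logs from adding a new node on 3.6.12

Compared with 3.4, the 3.6 logs make it explicit that the threads copying the database are the repl writer worker threads (judging from the documentation, this was already the case in 3.4).
They also make it explicit that cursors are used.
Otherwise there is no difference from 3.4: indexes are created, then the data is cloned.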

2019-08-22T13:59:39.444+0800 D STORAGE  [repl writer worker 9] create uri: table:db01/index-32-3334250984770678501 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.collection1" }),log=(enabled=true)
2019-08-22T13:59:39.446+0800 I INDEX    [repl writer worker 9] build index on: db01.collection1 properties: { v: 2, key: { _id: 1 }, name: "_id_", ns: "db01.collection1" }
2019-08-22T13:59:39.446+0800 I INDEX    [repl writer worker 9]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-22T13:59:39.447+0800 D REPL     [replication-1] Collection cloner running with 1 cursors established.
2019-08-22T13:59:39.681+0800 D INDEX    [repl writer worker 7]      bulk commit starting for index: _id_
2019-08-22T13:59:39.725+0800 D REPL     [repl writer worker 7] collection clone finished: db01.collection1
2019-08-22T13:59:39.725+0800 D REPL     [repl writer worker 7]     database: db01, stats: { dbname: "db01", collections: 1, clonedCollections: 1, start: new Date(1566453579439), end: new Date(1566453579725), elapsedMillis: 286 }
2019-08-22T13:59:39.725+0800 D REPL     [repl writer worker 7]     collection: db01.collection1, stats: { ns: "db01.collection1", documentsToCopy: 50000, documentsCopied: 50000, indexes: 1, fetchedBatches: 1, start: new Date(1566453579440), end: new Date(1566453579725), elapsedMillis: 285 }
2019-08-22T13:59:39.731+0800 D STORAGE  [repl writer worker 8] create uri: table:test/index-34-3334250984770678501 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "test.user1" }),log=(enabled=true)

Logs from adding a new node on 4.0.11

Cursors are used, and the logs are essentially the same as in 3.6.

2019-08-22T15:02:13.806+0800 D STORAGE  [repl writer worker 15] create uri: table:db01/index-30--463691904336459055 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "num" : 1 }, "name" : "num_1", "ns" : "db01.collection1" }),log=(enabled=false)
2019-08-22T15:02:13.816+0800 I INDEX    [repl writer worker 15] build index on: db01.collection1 properties: { v: 2, key: { num: 1.0 }, name: "num_1", ns: "db01.collection1" }
2019-08-22T15:02:13.816+0800 I INDEX    [repl writer worker 15]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-22T15:02:13.816+0800 D STORAGE  [repl writer worker 15] create uri: table:db01/index-31--463691904336459055 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.collection1" }),log=(enabled=false)
2019-08-22T15:02:13.819+0800 I INDEX    [repl writer worker 15] build index on: db01.collection1 properties: { v: 2, key: { _id: 1 }, name: "_id_", ns: "db01.collection1" }
2019-08-22T15:02:13.819+0800 I INDEX    [repl writer worker 15]      building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2019-08-22T15:02:13.820+0800 D REPL     [replication-0] Collection cloner running with 1 cursors established.

3.2 Replication logs

2019-08-22T15:15:17.566+0800 D STORAGE  [repl writer worker 2] create collection db01.collection2 { uuid: UUID("8e61a14e-280c-4da7-ad8c-f6fd086d9481") }
2019-08-22T15:15:17.567+0800 I STORAGE  [repl writer worker 2] createCollection: db01.collection2 with provided UUID: 8e61a14e-280c-4da7-ad8c-f6fd086d9481
2019-08-22T15:15:17.567+0800 D STORAGE  [repl writer worker 2] stored meta data for db01.collection2 @ RecordId(22)
2019-08-22T15:15:17.580+0800 D STORAGE  [repl writer worker 2] db01.collection2: clearing plan cache - collection info cache reset
2019-08-22T15:15:17.580+0800 D STORAGE  [repl writer worker 2] create uri: table:db01/index-43--463691904336459055 config: type=file,internal_page_max=16k,leaf_page_max=16k,checksum=on,prefix_compression=true,block_compressor=,,,,key_format=u,value_format=u,app_metadata=(formatVersion=8,infoObj={ "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "db01.collection2" }),log=(enabled=false)

This article is original content from the Yunqi Community (雲棲社區) and may not be reproduced without permission.
