ElasticSearch 寫操做剖析

時間 2019-11-07

原文原文鏈接

ElasticSearch 寫操做剖析

在看ElasticSearch權威指南基礎入門中關於：分片內部原理這一小節內容後，大體對ElasticSearch的索引、搜索底層實現有了一個初步的認識。記錄一下在看文檔的過程當中碰到的問題以及個人理解。此外，在文章的末尾，還討論分佈式系統中的主從複製原理，以及採用這種副本複製方案帶來的數據一致性問題。html

ElasticSearch index 操做背後發生了什麼？

更具體地，就是執行PUT操做向ElasticSearch添加一篇文檔時，底層發生的一系列操做。java

PUT user/profile/10
{
  "content":"向user索引中添加一篇id爲10的文檔"
}

經過PUT請求發起了索引新文檔的操做，該操做可以執行的前提是：集羣中有「必定數量」的活躍 shards。這個配置由wait_for_active_shards指定。ElasticSearch關於分片有2個重要的概念：primary shard 和 replica。在定義索引的時候指定索引有幾個主分片，以及每一個主分片有多少個副本。好比：node

PUT user
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  },

介紹一下集羣的環境：ElasticSearch6.3.2三節點集羣。定義了一個user索引，該索引有三個主分片，每一個主分片2個副本。如圖，每一個節點上有三個shards：一個 primary shard，二個replicagit

wait_for_active_shardsgithub

To improve the resiliency of writes to the system, indexing operations can be configured to wait for a certain number of active shard copies before proceeding with the operation. If the requisite number of active shard copies are not available, then the write operation must wait and retry, until either the requisite shard copies have started or a timeout occurs.redis

在索引一篇文檔時，經過wait_for_active_shards指定有多少個活躍的shards時，才能執行索引文檔的操做。默認狀況下，只要primary shard 是活躍的就能夠索引文檔。即wait_for_active_shards值爲1算法

By default, write operations only wait for the primary shards to be active before proceeding (i.e. wait_for_active_shards=1)sql

來驗證一下：在只有一臺節點的ElasticSearch上：三個primary shard 所有分配在一臺節點中，而且存在着未分配的replica緩存

執行：數據結構

PUT user/profile/10
{
  "content":"向user索引中添加一篇id爲10的文檔"
}

返回結果：

{
  "_index": "user",
  "_type": "profile",
  "_id": "10",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 3,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

在_shards 中，total 爲3，說明該索引操做應該在3個（一個primary shard，二個replica）分片中執行成功；可是successful爲1 說明 PUT操做在其中一個分片中執行成功了，就返回給client索引成功的確認。這個分片就是primary shard，由於只有一臺節點，另外2個replica 屬於 unassigned shards，不可能在2個replica 中執行成功。總之，默認狀況下，只要primary shard 是活躍的，就能索引文檔(index操做)

如今在單節點的集羣上，修改索引配置爲：wait_for_active_shards=2，這意味着一個索引操做至少要在2個分片上執行成功，才能返回給client acknowledge。

"settings": {
        "index.write.wait_for_active_shards": "2"
    }

再次向user索引中PUT 一篇文檔：

PUT user/profile/10
{
  "content":"向user索引中添加一篇id爲10的文檔"
}

返回結果：

{
  "statusCode": 504,
  "error": "Gateway Time-out",
  "message": "Client request timeout"
}

因爲是單節點的ElasticSearch，另外的2個replica沒法分配，所以不多是活躍的。而咱們指定的wait_for_active_shards爲2，但如今只有primary shard是活躍的，還差一個replica，所以沒法進行索引操做了。

The primary shard assigned to perform the index operation might not be available when the index operation is executed. Some reasons for this might be that the primary shard is currently recovering from a gateway or undergoing relocation. By default, the index operation will wait on the primary shard to become available for up to 1 minute before failing and responding with an error.

索引操做會在1分鐘以後超時。再看建立索引的org.elasticsearch.cluster.metadata.MetaDataCreateIndexService源代碼涉及到兩個參數：一個是wait_for_active_shards，另外一個是ackTimeout，就好理解了。

public void createIndex(final CreateIndexClusterStateUpdateRequest request,
                            final ActionListener<CreateIndexClusterStateUpdateResponse> listener) {
        onlyCreateIndex(request, ActionListener.wrap(response -> {
            if (response.isAcknowledged()) {
                activeShardsObserver.waitForActiveShards(new String[]{request.index()}, request.waitForActiveShards(), request.ackTimeout(),
                    shardsAcknowledged -> {
                        if (shardsAcknowledged == false) {
                            logger.debug("[{}] index created, but the operation timed out while waiting for " +
                                             "enough shards to be started.", request.index());
                        }

總結一下：因爲文檔最終是存在在某個ElasticSearch shard下面的，而每一個shard又設置了副本數。默認狀況下，在進行索引文檔操做時，ElasticSearch會檢查活躍的分片數量是否達到wait_for_active_shards設置的值。若未達到，則索引操做會超時，超時時間爲1分鐘。另外，值得注意的是：檢查活躍分片數量只是在開始索引數據的時候檢查，若檢查經過後，在索引文檔的過程當中，集羣中又有分片由於某些緣由掛掉了，那麼並不能保證這個文檔必定寫入到 wait_for_active_shards 個分片中去了。

由於索引文檔操做（也即寫操做）發生在檢查活躍分片數量操做以後。試想如下幾個問題：

問題1：檢查活躍分片數量知足 wait_for_active_shards 設置的值以後，在持續 bulk index 文檔過程當中有 shard 失效了(這裏的shard是replica)，那難道不能繼續索引文檔了？
問題2：在何時檢查集羣中的活躍分片數量？難道要每次client發送索引文檔請求時就要檢查一次嗎？仍是說週期性地隔多久檢查一次？
問題3：這裏的 check-then-act 並非原子操做，所以wait_for_active_shards這個配置參數又有多大的意義？

所以，官方文檔中是這麼說的：

It is important to note that this setting greatly reduces the chances of the write operation not writing to the requisite number of shard copies, but it does not completely eliminate the possibility, because this check occurs before the write operation commences. Once the write operation is underway, it is still possible for replication to fail on any number of shard copies but still succeed on the primary.

該參數只是儘量地保證新文檔可以寫入到咱們所要求的shard數量中（reduce the chance of ....）。好比：wait_for_active_shards設置爲2，那也只是儘量地保證將新文檔寫入了2個shard中，固然一個是primary shard，另一個是某個replica
check 操做發生在 write操做以前，某個doc的寫操做check actives shard發現符合要求，但check完以後，某個replica掛了，只要不是primary shard，那該doc的寫操做仍是會繼續進行。可是在返回給用戶響應中，會標識出有多少個分片失敗了。實際上，ES索引一篇文檔時，是要求primary shard寫入該文檔，而後primary shard將文件並行轉發給全部的replica，當全部的replica都"寫入"以後，給primary shard響應，primary shard才返回ACK給client。從這裏可看出，須要等到primary shard以及全部的replica都寫入文檔以後，client才能收到響應。那麼，wait_for_active_shards這個參數的意義何在呢？個人理解是：這個參數只是"儘量"保證doc寫入到全部的分片，若是活躍的分片數量未達到wait_for_active_shards，那麼寫是不容許的，而達到了以後，才容許寫，而又因爲check-then-act並非原子操做，並不能保證doc必定是成功地寫入到wait_for_active_shards個分片中去了。（總之，在正常狀況下，ES client 寫請求一篇doc時，該doc在primary shard上寫入，而後並行轉發給各個replica，在各個replica上也執行寫完成(這裏的完成有多是成功，也有多是失敗)，primary shard收到各個replica寫完成的響應，才返回響應給ES client。參考這篇文章）

最後，說一下wait_for_active_shards參數的取值：能夠設置成 all 或者是 1到 number_of_replicas+1 之間的任何一個整數。

Valid values are all or any positive integer up to the total number of configured copies per shard in the index (which is number_of_replicas+1)

number_of_replicas 爲索引指定的副本的數量，加1是指：再算上primary shard。好比前面user索引的副本數量爲2，那麼wait_for_active_shards最多設置爲3。

好，前面討論完了ElasticSearch可以執行索引操做（寫操做）了，接下來是在寫操做過程當中發生了什麼？好比說ElasticSearch是如何作到近實時搜索的？在將文檔寫入ElasticSearch時候發生了故障，那文檔會不會丟失？

因爲ElasticSearch底層是Lucene，在將一篇文檔寫入ElasticSearch，並最終能被Client查詢到，涉及到如下幾個概念：倒排索引、Lucene段、提交點、translog、ElasticSearch分片。這裏概念都是參考《ElasticSearch definitive guide》中相關的描述。

In Elasticsearch, the most basic unit of storage of data is a shard. But, looking through the Lucene lens makes things a bit different. Here, each Elasticsearch shard is a Lucene index, and each Lucene index consists of several Lucene segments. A segment is an inverted index of the mapping of terms to the documents containing those terms.

它們之間的關係示意圖以下：

一個ElasticSearch 索引可由多個 primary shard組成，每一個primary shard至關於一個Lucene Index；一個Lucene index 由多個Segment組成，每一個Segment是一個倒排索引結構表

從文檔的角度來看：文章會被analyze(好比分詞)，而後放到倒排索引(posting list)中。倒排索引之於ElasticSearch就至關於B+樹之於Mysql，是存儲引擎底層的存儲結構。

當文檔寫入ElasticSearch時，文檔首先被保存在內存索引緩存中(in-memeory indexing buffer)。而in-memory buffer是每隔1秒鐘刷新一次，刷新成一個個的可搜索的段(file system cache)--下圖中的綠色圓柱表示(segment)，而後這些段是每隔30分鐘同步到磁盤中持久化存儲，段同步到磁盤的過程稱爲提交 commit。(這裏要注意區份內存中2個不一樣的區域：一個是 indexing buffer，另外一個是file system cache。寫入indexing buffer中的文檔通過 refresh 變成 file system cache中的segments，從而搜索可見)

But the new segment is written to the filesystem cache first—which is cheap—and only later is it flushed to disk—which is expensive. But once a file is in the cache, it can be opened and read, just like any other file.

在這裏涉及到了兩個過程：① In-memory buffer中的文檔被刷新成段；②段提交同步到磁盤持久化存儲。

過程①默認是1秒鐘1次，而咱們所說的ElasticSearch是提供了近實時搜索，指的是：文檔的變化並非當即對搜索可見，但會在一秒以後變爲可見，一秒鐘以後，咱們寫入的文檔就能夠被搜索到了，就是由於這個緣由。另外ElasticSearch提供了 refresh API 來控制過程①。refresh操做強制把In-memory buffer中的內容刷新成段。refresh示意圖以下：

好比說，你能夠在每次index一篇文檔以後就調用一次refresh API，也即：每索引一篇文檔就強制刷新生成一個段，這會致使系統中存在大量的小段，因爲一次搜索須要查找全部的segments，所以大量的小段會影響搜索性能；此外，大量的小段也意味着OS打開了大量的文件描述符，在必定程度上影響系統資源消耗。這也是爲何ElasticSearch/Lucene提供了段合併操做的緣由，由於無論是1s一次refresh，仍是每次索引一篇文檔時手動執行refresh，均可能致使大量的小段(small segment)產生，大量的小段是會影響性能的。
此外，當咱們討論segments時，該segment既能夠是已提交的，也能夠是未提交的segment。所謂已提交的segment，就是這些segment是已經fsync到磁盤上持久化存儲了的，因爲已提交的segments已經持久化了，那麼它們對應的translog日誌也能夠刪除了。而未提交的segments中包含的文檔是搜索可見的，可是若是宕機，就可能致使未提交的segments包含的文檔丟失了，此時能夠從translog恢復。

對於過程②，就是將段刷新到磁盤中去，默認是每隔30分鐘一次，這個刷新過程稱爲提交。若是還將來得及提交時，發生了故障，那豈不是會丟失大量的文檔數據？這個時候，就引入了translog

每篇文檔寫入到In-memroy buffer中時，同時也會向 translog中寫一條記錄。In-memory buffer 每秒刷新一次，刷新後生成新段，in-memory被清空，文檔能夠被搜索。

而translog 默認是每5秒鐘刷新一次到磁盤，或者是在每次請求(index、delete、update、bulk)以後就刷新到磁盤。每5秒鐘刷新一次就是異步刷新，能夠經過以下方式開啓：

PUT /my_index/_settings
{
    "index.translog.durability": "async",
    "index.translog.sync_interval": "5s"
}

這種方式的話，仍是有可能會丟失文檔數據，好比Client發起index操做以後，ElasticSearch返回了200響應，可是因爲translog要等5秒鐘以後才刷新到磁盤，若是在5秒內系統宕機了，那麼這幾秒鐘內寫入的文檔數據就丟失了。

而在每次請求操做（index、delete、update、bulk）執行後就刷新translog到磁盤，則是translog同步刷新，好比說：當Client PUT一個文檔：

PUT user/profile/10
{
  "content":"向user索引中添加一篇id爲10的文檔"
}

在前面提到的三節點ElasticSearch集羣中，該user索引有三個primary shard，每一個primary shard2個replica，那麼translog須要在某個primary shard中刷新成功，而且在該primary shard的兩個replica中也刷新成功，纔會給Client返回 200 PUT成功響應。這種方式就保證了，只要Client接收到的響應是200，就意味着該文檔必定是成功索引到ElasticSearch中去了。由於translog是成功持久化到磁盤以後，再給Client響應的，系統宕機後在下一次重啓ElasticSearch時，就會讀取translog進行恢復。

By default, Elasticsearch fsyncs and commits the translog every 5 seconds if index.translog.durability is set to async or if set to request (default) at the end of every index, delete, update, or bulk request. More precisely, if set to request, Elasticsearch will only report success of an index, delete, update, or bulk request to the client after the translog has been successfully fsynced and committed on the primary and on every allocated replica.

這也是爲何，在咱們關閉ElasticSearch時最好進行一次flush操做，將段刷新到磁盤中。由於這樣會清空translog，那麼在重啓ElasticSearch就會很快（不須要恢復大量的translog了）

translog 也被用來提供實時 CRUD 。當你試着經過ID查詢、更新、刪除一個文檔，它會在嘗試從相應的段中檢索以前，首先檢查 translog 任何最近的變動。這意味着它老是可以實時地獲取到文檔的最新版本。

放一張總結性的圖，以下：

有個問題是：爲何translog能夠在每次請求以後刷新到磁盤？難道不會影響性能嗎？相比於將段(segment)刷新到磁盤，刷新translog的代價是要小得多的，由於translog是通過精心設計的數據結構，而段(segment)是用於搜索的"倒排索引"，咱們沒法作到每次將段刷新到磁盤；而刷新translog相比於段要輕量級得多(translog 可作到順序寫disk，而且數據結構比segment要簡單)，所以經過translog機制來保證數據不丟失又不太影響寫入性能。

Changes to Lucene are only persisted to disk during a Lucene commit, which is a relatively expensive operation and so cannot be performed after every index or delete operation.
......
All index and delete operations are written to the translog after being processed by the internal Lucene index but before they are acknowledged. In the event of a crash, recent transactions that have been acknowledged but not yet included in the last Lucene commit can instead be recovered from the translog when the shard recovers.

若是宕機，那些已經返回給client確認但還沒有 lucene commit 持久化到disk的transactions，能夠從translog中恢復。

總結一下：

這裏一共有三個地方有「刷新操做」，其中 refresh 的應用場景是每一個Index/Update/Delete 單個操做以後，要不要refresh一下？而flush 是針對索引而言的：要不要對 twitter 這個索引 flush 一下，使得在內存中的數據是否已經持久化到磁盤上了，這裏會引起Lucene的commit，當這些數據持久化磁盤上後，相應的translog就能夠刪除了（由於這些數據已經持久化到磁盤，那就是可靠的了，若是發生意外宕機，須要藉助translog恢復的是那些還沒有來得及flush到磁盤上的索引數據）

in-memory buffer 刷新生成segment

每秒一次，文檔刷新成segment就能夠被搜索到了，ElasticSearch提供了refresh API 來控制這個過程
translog 刷新到磁盤

由index.translog.durability來設置，或者由index.translog.flush_threshold_size來設置當translog達到必定大小以後刷新到磁盤(默認512MB)
段(segment) 刷新到磁盤（flush）

每30分鐘一次，ElasticSearch提供了flush API 來控制這個過程。在段被刷新到磁盤（就是一般所說的commit操做）中時，也會清空刷新translog。

寫操做的可靠性如何保證？

數據的可靠性保證是理解存儲系統差別的重要方面。好比說，要對比MySQL與ES的異同、對比ES與redis的區別，就能夠從數據的可靠性這個點入手。在MySQL裏面有redo log、基於 bin log的主從複製、有double write等機制，那ES的數據可靠性保證又是如何實現的呢？
我認爲在分佈式系統中討論數據可靠性須要從二個角度出發：單個副本數據的本地刷盤策略和寫多個副本的以後什麼時候向client返回響應也即同步複製、異步複製的問題。

1 單個副本數據的本地刷盤策略
在ES中，這個策略叫translog機制。當向ES索引一篇doc時，doc先寫入in-memory buffer，同時寫入translog。translog可配置成每次寫入以後，就flush disk。這與MySQL寫redo log日誌的的原理是相似的。
2 寫多個副本的以後什麼時候向client返回響應
這裏涉及到參數wait_for_active_shards，前面提到這個參數雖有一些"缺陷"(check-then-act)，但它仍是儘量地保證一篇doc在寫入primary shard而且也"寫入"(寫入並不表明落盤)了若干個replica以後，才返回ack給client。ES中的 wait_for_active_shards參數與Kafka中 broker配置 min.insync.replicas 原理是相似的。

存在的一些問題

這個 issue 和這個 issue 討論了index.write.wait_for_active_shards參數的前因後果。

以三節點ElasticSearch6.3.2集羣，索引設置爲3個primary shard，每一個primary shard 有2個replica 來討論：

client向其中一個節點發起Index操做索引文檔，這個寫操做請求固然是發送到primary shard上，可是當Client收到200響應時，該文檔是否已經複製到另外2個replica上？
Client將一篇文檔成功寫入到ElasticSearch了（收到了200響應），它能在replica所在的節點上 GET 到這篇文檔嗎？Client發起查詢請求，又能查詢到這篇文檔嗎？（注意：GET 和 Query 是不同的）
前面提到，當 index 一篇文檔時，primary shard 和2個replica 上的translog 要 都刷新 到磁盤，才返回 200 響應，那它是否與參數 index.write.wait_for_active_shards默認值矛盾？由於index.write.wait_for_active_shards默認值爲1，即：只要primary shard 是活躍的，就能夠進行 index 操做。也就是說：當Client收到200的index成功響應，此時primary shard 已經將文檔複製到2個replica 上了嗎？這兩個 replica 已經將文檔刷新成 segment了嗎？仍是說這兩個 replica 僅僅只是將索引該文檔的 translog 刷新到磁盤上了？

ElasticSearch副本複製方式討論

ElasticSearch索引是一個邏輯概念，囊括現實世界中的數據。好比定義一個 user 索引存儲全部的用戶資料信息。索引由若干個primary shard組成，就至關於把用戶資料信息分開成若干個部分存儲，每一個primary shard存儲user索引中的一部分數據。爲了保證數據可靠性不丟失，能夠爲每一個primary shard配置副本(replica)。顯然，primary shard 和它對應的replica 是不會存儲在同一臺機器(node)上的，由於若是該機器宕機了，那麼primary shard 和副本(replica) 都會丟失，那整個系統就丟失一部分數據了。

primary shard 和 replica 這種副本備份方案，稱爲主從備份。primary shard是主(single leader)，replica 是從 (multiple replica)。因爲是分佈式環境，可能存在多個Client同時向ElasticSearch發起索引文檔的請求，這篇文檔會根據文檔id 哈希到某個 primary shard，primary shard寫入該文檔並分發給 replica 進行存儲。因爲採用了哈希，這也是爲何在定義索引的時候，須要指定primary shard個數，而且 primary shard個數一經指定後不可修改的緣由。由於primary shard個數一旦改變，哈希映射結果就變了。而採用這種主從副本備份方案，這也是爲何索引操做(寫操做、update操做) 只能由 primary shard處理，而讀操做既能夠從 primary shard讀取，也能夠從 replica 讀取的緣由。相對於文檔而言，primary shard是single leader，全部的文檔修改操做都統一由primary shard處理，能避免一些併發修改衝突。可是默認狀況下，ElasticSearch 副本複製方式是異步的，也正如前面 index.write.wait_for_active_shards討論，只要primary shard 是活躍的就能夠進行索引操做，primary shard 將文檔「存儲」以後，就返回給client 響應，而後primary shard 再將該文檔同步給replicas，而這就是異步副本複製方式。在ElasticSearch官方討論論壇裏面，也有關於副本複製方式的討論：這篇文章提出了一個問題：Client向primary shard寫入文檔成功，primary shard 是經過何種方式將該文檔同步到 replica的？

其實以爲這裏的異步複製提法有點不許確，不過放到這裏，供你們討論參考吧

關於primary shard 將文檔同步給各個replica，涉及到 in-sync replica概念，在master節點中維護了一個 in-sync 副本列表。

Since replicas can be offline, the primary is not required to replicate to all replicas. Instead, Elasticsearch maintains a list of shard copies that should receive the operation. This list is called the in-sync copies and is maintained by the master node.

當index操做發送到 primary shard時，primary shard 並行轉發給 in-sync副本，等待各個in-sync副本給primary 響應。primary shard收到全部in-sync副本響應後，再給Client響應說：Index操做成功。因爲副本可能會出故障，當沒有in-sync副本，只有primary shard在正常工做時，此時的index操做只在primary shard上執行成功就返回給Client了，這裏就存在了所謂的「單點故障」，由於primary shard所在的節點掛了，那就會丟失 index 操做的文檔了。這個時候， index.write.wait_for_active_shards 參數就起做用了，若是將 index.write.wait_for_active_shards 設置爲2，那麼當沒有in-sync副本，只有primary shard在正常工做時，index 操做就會被拒絕。因此，在這裏，index.write.wait_for_active_shards參數起到一個避免單點故障的功能。更具體的細節可參考data-replication model

採用異步副本複製方式帶來的一個問題是：讀操做能讀取最新寫入的文檔嗎？若是咱們指定讀請求去讀primary shard（經過ElasticSearch 的路由機制），那麼是能讀到最新數據的。可是若是讀請求是由某個 replica 接收處理，那也許就不能讀取到剛纔最新寫入的文檔了。所以，從剛纔Client 讀請求的角度來看，ElasticSearch能提供哪一種程度的一致性呢？而出現這種一致性問題的緣由在於：爲了保證數據可靠性，採用了副本備份，引入了副本，致使副本和primary shard上的數據不一致，即：存在 replication lag 問題。因爲這種副本複製延遲帶來的問題，系統須要給Client 某種數據一致性的保證，好比說：

read your own write

Client可以讀取到它本身最新寫入的數據。好比用戶修改了暱稱，那TA訪問本身的主頁時，能看到本身修改了的暱稱，可是TA的好友可能並不能當即看到 TA 修改後的暱稱。好友請求的是某個 replica 上的數據，而 primary shard還將來得及把剛纔修改的暱稱同步到 replica上。
Monotonic reads

單調讀。每次Client讀取的值，是愈來愈新的值（站在Client角度來看的）。好比說NBA籃球比賽，Client每10分鐘讀一次比賽結果。第10分鐘讀取到的是 1:1，第20分鐘讀到的是2:2，第30分鐘讀到的是3:3，假設在第40分鐘時，實際比賽結果是4:4，Cleint在第40分鐘讀取的時候，讀到的值能夠是3:3 這意味着未讀取到最新結果而已，讀到的值也能夠是4:4，可是不能是2:2 。
consistent prefix reads

符合因果關係的一種讀操做。好比說，用戶1 和用戶2 對話：

用戶1：你如今幹嗎？
用戶2：寫代碼

對於Client讀：應該是先讀取到「你如今幹嗎？」，而後再讀取到「寫代碼。若是讀取結果順序亂了，Client就會莫名其妙。

正是因爲Client 有了系統給予的這種一致性保證，那麼Client（或者說應用程序）就能基於這種保證來開發功能，爲用戶提供服務。

那系統又是如何提供這種一致性保證的呢？或者說ElasticSearch集羣又提供了何種一致性保證？常常聽到的有：強一致性(linearizability)、弱一致性、最終一致性。對於強一致性，通俗的理解就是：實際上數據有多份(primary shard 以及多個 replica)，但在Client看來，表現得就只有一份數據。在多個 client 併發讀寫情形下，某個Client在修改數據A，而又有多個Client在同時讀數據A，linearizability 就要保證：若是某個Client讀取到了數據A，那在該Client以後的讀取請求返回的結果都不能比數據A要舊，至少是數據A的當前值（不能是數據A的舊值）。不說了，再說，我本身都不明白了。

至於系統如何提供這種一致性，會用到一些分佈式共識算法，我也沒有深刻地去研究過。關於副本複製方式的討論，也可參考這篇文章：分佈式系統理論之Quorum機制