This blog post surveys the common distributed consensus algorithms.
The focus is on comparing their similarities and differences, and on thinking through their design decisions and trade-offs.
Paxos is the earliest proposed distributed consensus algorithm.
See http://lamport.azurewebsites....
In practice, a single node usually plays the proposer/acceptor/learner roles simultaneously.
Proposer. A proposer sends a proposed value to a set of acceptors. It can be understood as a master node that is allowed to perform writes. Note that the Paxos protocol itself does not limit the number of proposers. In practice, to avoid the single point of failure of a central node generating totally ordered proposal numbers, a leader-election protocol is used; see the leader discussion below.
See Section 3 of the paper: Implementing a State Machine.
A central node must generate the totally ordered proposal numbers; only the proposer elected as leader may issue proposals.
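In practice the central sequencer itself is often avoided: each proposer can mint unique, totally ordered proposal numbers from a local counter plus its node id. A minimal sketch of this common scheme (the names `NODE_COUNT` and `node_id` are illustrative, not from the paper):

```python
# Interleave per-node counters so that no two nodes can ever
# produce the same proposal number, without any central sequencer.
NODE_COUNT = 3  # illustrative cluster size

def next_proposal_number(counter: int, node_id: int) -> int:
    """Return a proposal number unique across all nodes.

    counter is the proposer's local, monotonically increasing counter;
    node_id is in [0, NODE_COUNT).
    """
    return counter * NODE_COUNT + node_id

# Two different nodes at the same counter never collide:
assert next_proposal_number(0, 0) != next_proposal_number(0, 1)
# Advancing the local counter always yields a higher number:
assert next_proposal_number(1, 0) > next_proposal_number(0, 2)
```

Uniqueness is all Paxos safety needs; the numbers are not globally sequential, which is exactly why plain Paxos is not linearizable without further ordering (see below).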
In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm.
Acceptor. An acceptor may accept the proposed value. The acceptors are the nodes that form the majority (quorum).
Learner. A passive role that learns the agreed outcome; it can be understood as a read-only replica.
Client. Issues requests and receives responses.
The diagram below is from: https://en.wikipedia.org/wiki...
Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(1)
   |         |<---------X--X--X       |  |  Promise(1,{Va,Vb,Vc})
   |         X--------->|->|->|       |  |  Accept!(1,V)
   |         |<---------X--X--X------>|->|  Accepted(1,V)
   |<---------------------------------X--X  Response
   |         |          |  |  |       |  |
Because any highest proposal number has already been accepted by a majority, the following can never happen:
a lower-numbered proposal overwriting a higher-numbered one, e.g.
(proposal number 1, V1) overwriting (proposal number 2, V2).
Paxos allows multiple proposers to exist at the same time; the prepare phase guarantees that a higher proposal number is never overwritten.
The Paxos algorithm does not require proposal numbers to be generated in strict order, nor does it require learners to read strictly in proposal-number order.
Only under such strict ordering is Paxos linearizable.
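The prepare/accept rules above can be condensed into a single-decree acceptor. This is an illustrative sketch, not a complete implementation (no persistence, no networking):

```python
class Acceptor:
    """Minimal single-decree Paxos acceptor (illustrative sketch)."""

    def __init__(self):
        self.promised = -1    # highest proposal number promised so far
        self.accepted = None  # (number, value) last accepted, or None

    def prepare(self, n):
        """Phase 1: promise to ignore any proposal numbered below n,
        and report the previously accepted (number, value), if any."""
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered prepare was promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "reject"

a = Acceptor()
assert a.prepare(2) == ("promise", None)
assert a.accept(2, "V2") == "accepted"
# A stale proposal numbered 1 can no longer overwrite (2, "V2"):
assert a.accept(1, "V1") == "reject"
assert a.accepted == (2, "V2")
```

The two `if` guards are exactly the guarantee stated above: once a majority has promised n, no proposal below n can be accepted by that majority, so a lower number can never overwrite a higher one.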
ZAB is the distributed consensus algorithm implemented by ZooKeeper.
See: http://www.cs.cornell.edu/cou...
It is important in our setting that we enable multiple outstanding ZooKeeper operations and that a prefix of operations submitted concurrently by a ZooKeeper client are committed according to FIFO order.
Paxos does not directly support linearizability. If multiple primaries execute multiple transactions simultaneously, the order in which transactions commit cannot be guaranteed, and this leads to inconsistent final results.
The figure below illustrates this inconsistency: 27A -> 28B becomes 27C -> 28B. To avoid it, the system must be restricted to one proposal at a time, which obviously cripples throughput. Optimizing by batching transactions raises latency instead, and the batch size is hard to choose.
When multiple primaries execute transactions concurrently, different Paxos processes may hold different values for the same sequence number. A new master must run Paxos phase 1 for every value not yet learned, i.e. obtain the correct value from a majority.
ZAB achieves linearity through a single leader: the leader's broadcast is a totally ordered broadcast (the primary order, PO, in the paper).
Elect a leader, then abdeliver the leader's broadcasts.
New epoch: every successful leader election yields epoch_new > epoch_current.
Propose and commit correspond to Paxos's Promise/Accept.
Both propose and commit carry (e, (v, z)):
e: the epoch, corresponding to the master's term number
v: the committed transaction value, i.e. the actual data
z: the committed transaction identifier (zxid), i.e. the commit version number
It is precisely this total-order guarantee on commits that makes each node's sequence of value changes linearizable.
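Concretely, ZooKeeper packs the zxid into a single 64-bit integer: the high 32 bits hold the epoch e and the low 32 bits a per-epoch counter, so plain integer comparison yields the total order described above. A sketch:

```python
# zxid layout: high 32 bits = epoch, low 32 bits = counter.
# Integer comparison then means: higher epoch always wins;
# within an epoch, the higher counter wins.

def make_zxid(epoch: int, counter: int) -> int:
    """Pack epoch (high 32 bits) and counter (low 32 bits) into a zxid."""
    return (epoch << 32) | counter

# A transaction from a newer epoch orders after anything from an
# older epoch, regardless of the counters:
assert make_zxid(2, 0) > make_zxid(1, 99)
# Within one epoch, the counter decides:
assert make_zxid(2, 7) > make_zxid(2, 6)
```

This is why a new leader's proposals (with epoch_new > epoch_current) can never be ordered before the old leader's committed history.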
I have not yet found a precise definition of the prospective leader in the paper; a rough description appears in Section V, ZAB IN DETAIL:
When a process starts, it enters the ELECTION state. While in this state the process tries to elect a new leader or become a leader. If the process finds an elected leader, it moves to the FOLLOWING state and begins to follow the leader.
Once a majority of members have acked, it is safe to commit and reply to the client. Compare the discovery phase: because the most up-to-date history among all nodes is selected, a majority of acks is sufficient to report the write as successful.
A failure response does not necessarily mean the write did not take effect; only a success response guarantees that the write succeeded.
Raft is currently the easiest distributed consensus algorithm to implement and to understand. Like ZAB, Raft must first elect a leader.
Here paxos means single-decree Paxos, which on its own has little engineering value, while many of the details of multi-paxos were never spelled out.
Starting from the paper, Raft works out far more of the engineering details, e.g. log compaction and cluster membership changes, neither of which ZAB/Paxos addresses.
Like ZAB, Raft has a strong leader: a leader must be elected before the system can serve writes.
Like ZAB, followers vote and write the log entries broadcast by the leader (via the AppendEntries RPC).
Like ZAB, each elected leader corresponds to a totally ordered term.
The log corresponds to ZAB's propose and commit.
AppendEntries is an exceptionally economical design: the same RPC serves both as log replication and, when carrying no entries, as the leader's heartbeat.
https://raft.github.io/
offers a detailed, interactive leader-election walkthrough: you can manipulate the nodes at will and observe the behavior at every boundary condition. Following the textual Paxos flow above, here is a leader election in the normal case:
Node1                  Node2                  Node3
  |                      |                      |
timeout                  |                      |
  X----RequestVote------>|                      |
  X----------------------|----RequestVote------>|
  |<-------vote----------X                      |
  |<-------vote----------|----------------------X
become leader            |                      |
Figure 6: Logs are composed of entries, which are numbered sequentially. Each entry contains the term in which it was created (the number in each box) and a command for the state machine. An entry is considered committed if it is safe for that entry to be applied to state machines.
Study the Receiver implementation of the AppendEntries RPC carefully:
Once a majority of nodes return success, the entries carried by AppendEntries can be considered written, and the leader can reply to the client. This is guaranteed by the extra restriction imposed during elections: only a node with a more up-to-date log (compared by last log term, then log index) can win. Hence, as long as an entry has been written to a majority of nodes (even if not yet committed), the election is certain to choose a node that holds that entry.
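The election restriction just mentioned compares the last entry's term first, and uses the log index only as a tie-breaker. A sketch of the voter's check (the parameter names are illustrative):

```python
def candidate_log_ok(last_term, last_index, my_last_term, my_last_index):
    """Raft's election restriction: grant a vote only if the
    candidate's log is at least as up-to-date as the voter's own.
    'Up-to-date' means a later last term, or the same last term
    with a log at least as long."""
    if last_term != my_last_term:
        return last_term > my_last_term   # later term wins
    return last_index >= my_last_index    # longer log wins

# A candidate missing entries this voter holds is refused:
assert not candidate_log_ok(last_term=2, last_index=3,
                            my_last_term=2, my_last_index=4)
# A candidate with a later last term wins even with a shorter log:
assert candidate_log_ok(last_term=3, last_index=1,
                        my_last_term=2, my_last_index=9)
```

Because an entry replicated on a majority intersects every possible voting majority, at least one voter in any electing majority holds that entry and will refuse candidates that lack it; that is why the entry survives leader changes.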
Figure 11: Timeline for a configuration change. Dashed lines show configuration entries that have been created but not committed, and solid lines show the latest committed configuration entry. The leader first creates the C_old,new configuration entry in its log and commits it to C_old,new (a majority of C_old and a majority of C_new). Then it creates the C_new entry and commits it to a majority of C_new. There is no point in time in which C_old and C_new can both make decisions independently.
The key is the handover of which majority is entitled to make decisions.
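That handover can be made concrete: while C_old,new is in effect, committing requires a majority of both configurations, so neither side can ever decide alone. A sketch:

```python
def majority(acks, members):
    """True if the acknowledging nodes form a majority of `members`."""
    return len(set(acks) & set(members)) > len(members) // 2

def joint_commit_ok(acks, c_old, c_new):
    """During C_old,new, an entry commits only with a majority of
    BOTH the old and the new configuration (joint consensus)."""
    return majority(acks, c_old) and majority(acks, c_new)

c_old = {"a", "b", "c"}
c_new = {"c", "d", "e"}
# A majority of C_old alone is not enough during the transition:
assert majority({"a", "b"}, c_old)
assert not joint_commit_ok({"a", "b"}, c_old, c_new)
# Majorities of both configurations together can commit:
assert joint_commit_ok({"a", "b", "c", "d"}, c_old, c_new)
```

Once C_new is committed, the deciding majority is simply `majority(acks, c_new)`; there is never a moment when the two configurations can decide independently.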
The gossip protocol differs greatly from paxos/zab/raft: there is no leader and no linearizability guarantee. Through endlessly retried broadcasts it ensures that a change reaches every node after some period of time.
The key to an implementation is how to merge all nodes' gossip broadcasts into one state. For example, node a may declare that a holds k1=v1 while node b declares that b holds k1=v2; the merged state is then: a holds k1=v1 and b holds k1=v2. As long as the merge rule is deterministic and shared — e.g. keep both, prefer a, or prefer b — every node that receives the broadcasts converges to the same state.
Because there is no linearizability guarantee, a broadcast must carry the full state, not a delta.
Active thread (peer P):              Passive thread (peer Q):
(1) selectPeer(&Q);                  (1)
(2) selectToSend(&bufs);             (2)
(3) sendTo(Q, bufs);         ----->  (3) receiveFromAny(&P, &bufr);
(4)                                  (4) selectToSend(&bufs);
(5) receiveFrom(Q, &bufr);   <-----  (5) sendTo(P, bufs);
(6) selectToKeep(cache, bufr);       (6) selectToKeep(cache, bufr);
(7) processData(cache);              (7) processData(cache)

Figure 1: The general organization of a gossiping protocol.
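The merge rule described above — each node is authoritative for its own entries, and peers exchange full states rather than deltas — can be sketched as follows (a sketch; real protocols also attach version numbers for conflict resolution):

```python
def merge(state_a, state_b):
    """Merge two full gossip states.

    Each top-level key is the owning node's id, so entries never
    conflict: the per-node maps are simply united, with the owner's
    values kept.  Applying the same deterministic rule everywhere
    is what makes all receivers converge to the same state."""
    merged = {}
    for state in (state_a, state_b):
        for node, kv in state.items():
            merged.setdefault(node, {}).update(kv)
    return merged

# Node a says k1=v1 on a; node b says k1=v2 on b; both survive:
s = merge({"a": {"k1": "v1"}}, {"b": {"k1": "v2"}})
assert s == {"a": {"k1": "v1"}, "b": {"k1": "v2"}}
```

Because `merge` is commutative and idempotent, it does not matter in which order, or how many times, a node receives the broadcasts — hence the eventual (not linearizable) consistency.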
The core principle behind Paxos, ZAB, and Raft is the same:
whether a node resumes a commit after crashing, or other nodes carry the commit forward, consensus is reached once a majority of nodes agree.
https://en.wikipedia.org/wiki...
http://lamport.azurewebsites....
https://stackoverflow.com/que...
http://www.cs.cornell.edu/cou...
https://raft.github.io/
https://web.stanford.edu/~ous...