On distributed consensus algorithms: paxos, zab, raft, gossip

Date: 2020-12-26


Overview

This blog post discusses the common distributed consensus algorithms:

paxos
zab (zookeeper atomic broadcast)
raft
gossip

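The focus is on comparing what these algorithms have in common and where they differ, and on thinking through their design and trade-offs.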

paxos

Paxos is the earliest of the distributed consensus algorithms covered here.
See http://lamport.azurewebsites....

Key concepts

Roles

In practice, a single node often acts as proposer, acceptor, and learner at the same time.

proposer

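The proposer: "A proposer sends a proposed value to a set of acceptors." It can be thought of as a master node that is allowed to perform writes. Note that the Paxos protocol does not limit the number of proposers. In practice, to avoid making the central node that generates totally ordered proposal numbers a single point of failure, a leader-election protocol is used. See leader below.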

leader

See Section 3 of the paper: Implementing a State Machine.
In that implementation, a single central node must generate the totally ordered proposal numbers, and only the proposer elected as leader may issue proposals.
In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm.

acceptors

The voter: "An acceptor may accept the proposed value." Acceptors are the nodes that make up the majority, or quorum.

learner

A role that passively receives the outcome of the vote. It can be thought of as a read-only replica.

client

The client. It issues requests and receives responses.

work flow

Basic Paxos without failures

The diagram below is from: https://en.wikipedia.org/wiki...

Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(1)
   |         |<---------X--X--X       |  |  Promise(1,{Va,Vb,Vc})
   |         X--------->|->|->|       |  |  Accept!(1,V)
   |         |<---------X--X--X------>|->|  Accepted(1,V)
   |<---------------------------------X--X  Response
   |         |          |  |  |       |  |

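Core of the algorithm

An acceptor does not accept a prepare with a lower proposal number, and it replies with the highest-numbered proposal it has already accepted.
A proposed value is only chosen in phase 2.

Because a proposer must collect promises from a majority before it can send an accept, and every acceptor that promised then rejects lower-numbered requests, the following can never happen:
a lower-numbered proposal overwriting a higher-numbered one, for example
(proposal number 1, V1) overwriting (proposal number 2, V2).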
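
The acceptor rules above can be sketched in a few lines. This is a minimal single-value sketch under simplifying assumptions, not a full Paxos implementation: the class name Acceptor, the method names on_prepare / on_accept, and the tuple-shaped replies are all hypothetical, and no networking or persistence is modeled.

```python
from typing import Any, Optional, Tuple


class Acceptor:
    """Single-decree Paxos acceptor (illustrative; all names are hypothetical)."""

    def __init__(self) -> None:
        self.promised_n: int = -1                        # highest number promised
        self.accepted: Optional[Tuple[int, Any]] = None  # (number, value) last accepted

    def on_prepare(self, n: int):
        # Reject a prepare whose number is not higher than the current promise.
        if n <= self.promised_n:
            return ("nack", self.promised_n)
        self.promised_n = n
        # Promise, reporting the highest-numbered proposal accepted so far.
        return ("promise", self.accepted)

    def on_accept(self, n: int, value: Any):
        # Reject if a higher-numbered prepare has been promised in the meantime.
        if n < self.promised_n:
            return ("nack", self.promised_n)
        self.promised_n = n
        self.accepted = (n, value)
        return ("accepted", n, value)
```

With this sketch, an acceptor that has promised proposal 2 refuses Accept!(1, V1), which is exactly the "lower number cannot overwrite higher number" property.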

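Why does Paxos need a prepare phase?

Paxos allows several proposers to exist at the same time; the prepare phase ensures that a proposal with a higher proposal number is not overwritten.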
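
On the proposer side, the prepare phase is also where an already accepted value is discovered. A hedged sketch of that step follows; the function name choose_value and the representation of a promise as None or a (number, value) pair are assumptions made for illustration.

```python
from typing import Any, List, Optional, Tuple


def choose_value(promises: List[Optional[Tuple[int, Any]]], own_value: Any) -> Any:
    """Pick the value to send in phase 2, given promises from a majority.

    Each promise is None (nothing accepted yet) or the (number, value)
    pair that the acceptor had already accepted.
    """
    accepted = [p for p in promises if p is not None]
    if not accepted:
        # No acceptor in the majority has accepted anything: free choice.
        return own_value
    # Otherwise the proposer must adopt the highest-numbered accepted value.
    return max(accepted, key=lambda p: p[0])[1]
```

For example, choose_value([None, (1, "A"), (2, "B")], "X") yields "B", so a competing proposer ends up re-proposing the value that may already have been chosen.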

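Is Paxos linearizable?

Paxos does not explicitly require proposal numbers to be generated in strict order, nor does it require learners to read strictly in proposal-number order.
Paxos is linearizable only when both are strictly ordered.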

zab

ZAB is the distributed consensus algorithm implemented by ZooKeeper.
See: http://www.cs.cornell.edu/cou...

why not paxos

Linearizability

It is important in our setting that we enable multiple outstanding ZooKeeper operations and that a prefix of operations submitted concurrently by a ZooKeeper client are committed according to FIFO order.
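Paxos does not directly support this ordering. If several primaries execute several transactions concurrently, the order in which the transactions are committed cannot be guaranteed, which leads to an inconsistent final result.
The figure in the paper illustrates this inconsistency: 27A -> 28B becomes 27C -> 28B. The only way to avoid it is to allow just one proposal to be in flight at a time, which obviously gives poor throughput. Batching transactions as an optimization raises latency instead, and the right batch size is hard to choose.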

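Fast recovery

When several primaries execute several transactions concurrently, different Paxos processes may hold different values for the same sequence number. A new master must run Paxos phase 1 for every value that has not yet been learned, that is, obtain the correct value from a majority.

Key concepts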

Roles

leader

ZAB achieves a linear, that is, totally ordered broadcast (PO, primary order, in the paper) by having a single primary node, the leader, do the broadcasting.

follower

Takes part in leader election and abdelivers the leader's broadcasts.

epoch

A new epoch: every successful leader election yields epoch_new > epoch_current.

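proposes and commit

proposes and commit correspond roughly to Paxos's Promise/Accept.
Both proposes and commit carry (e, (v, z)):
e: epoch, the number of the master's term
v: commit transaction value, the actual value
z: commit transaction identifier (zxid), the version number of the commit.
It is the total ordering of commits that guarantees that value changes on every node are linearizable.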
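
The ordering on (e, z) can be illustrated by packing the epoch and a per-epoch counter into one integer, so that plain integer comparison gives the total order. The 32/32-bit layout below mirrors how ZooKeeper's zxid is commonly described, but treat the layout and the helper names make_zxid / split_zxid as assumptions of this sketch rather than as ZooKeeper's actual code.

```python
from typing import Tuple

COUNTER_BITS = 32                      # assumed width of the per-epoch counter
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_zxid(epoch: int, counter: int) -> int:
    """Pack (epoch, counter) so that a newer epoch always compares greater."""
    return (epoch << COUNTER_BITS) | (counter & COUNTER_MASK)


def split_zxid(zxid: int) -> Tuple[int, int]:
    """Recover (epoch, counter) from a packed zxid."""
    return zxid >> COUNTER_BITS, zxid & COUNTER_MASK
```

Under this layout, any transaction from epoch 2 sorts after every transaction from epoch 1, regardless of the counter values.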

work flow

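Leader election (discovery and synchronization)

How is the prospective leader determined?

I have not yet found a precise description of the prospective leader in the paper. There is a loose description in Section V, "Zab in Detail", as follows: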

When a process starts, it enters the ELECTION state. While in this state the process tries to elect a new leader or become a leader. If the process finds an elected leader, it moves to the FOLLOWING state and begins to follow the leader.

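Why does leader election need two phases, discovery and synchronization?

The discovery phase uses the cepoch of a majority of followers to ensure that the leader has the latest commit and history (f.p and h.f).
The synchronization phase ensures that the leader has synchronized the latest commit/history to a majority of nodes.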

broadcast

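Once a majority of members have acked, it is safe to commit and reply to the client. Recall the discovery phase: because it selects the latest history among the nodes, acks from a majority are enough to report a successful write.
A failure response does not necessarily mean the write failed. The possibilities are:

Success, but the response was never received.
Failure: the leader crashed before broadcasting.
Failure: the leader crashed before a majority acked, and the subsequent election did not pick the history of a follower that had acked.

A write reported as failed has not necessarily failed; a write reported as successful is guaranteed to have succeeded.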
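
Why a majority of acks is enough comes down to quorum intersection: any two majorities of the same cluster share at least one node, so the majority consulted during discovery always includes a node that acked the write. The brute-force check below is only a sketch; the function name majorities is hypothetical.

```python
from itertools import combinations
from typing import Iterable, List, Set


def majorities(nodes: Iterable[str]) -> List[Set[str]]:
    """Enumerate every smallest-possible majority of the given nodes."""
    members = list(nodes)
    need = len(members) // 2 + 1       # strict majority
    return [set(c) for c in combinations(members, need)]


# Every pair of majorities of a 5-node cluster overlaps in at least one node.
quorums = majorities(["n1", "n2", "n3", "n4", "n5"])
all_overlap = all(a & b for a in quorums for b in quorums)
```

Larger majorities are supersets of these minimal ones, so checking the minimal ones covers every case.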

raft

Raft is arguably the easiest of these distributed consensus algorithms to implement and to understand. Like ZAB, Raft must elect a leader first.

why not paxos

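Hard to understand.
Hard to build a practical implementation on.

Here Paxos means single-decree Paxos, which has little value in engineering practice, while many details of multi-Paxos are not specified.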

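Similarities and differences with zab

Differences

Leader election differs: Raft triggers an election through a randomized election timeout, and the election has only one phase.
Broadcast differs; see log below.

Judging from the papers, Raft considers more engineering details. Log compaction and cluster membership changes, for example, are not covered by the ZAB/Paxos papers.

Similarities

single master and multi followers
A notion of a term of office: term in raft, epoch in zab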

Core abstractions

Roles

leader

As in ZAB, Raft has a strong leader. A leader must be elected before the cluster can serve writes.

follower

As in ZAB, a follower can vote, and it writes the log entries broadcast by the leader (via the AppendEntries RPC).

term

As in ZAB, every elected leader corresponds to a totally ordered term.

log

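The log, that is, ZAB's propose and commit.
AppendEntries is very cleverly designed. The same RPC serves as

heartbeat
propose
commit (the leaderCommit field commits every log entry whose index is less than or equal to leaderCommit)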
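
The three uses listed above fall out of the message shape. The sketch below models the arguments named in the Raft paper (term, leaderId, prevLogIndex, prevLogTerm, entries, leaderCommit) as a Python dataclass; the snake_case names, the (term, command) entry representation, and the is_heartbeat helper are assumptions of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command); representation assumed for illustration


@dataclass
class AppendEntries:
    term: int                     # leader's term
    leader_id: int
    prev_log_index: int           # index of the entry preceding the new ones
    prev_log_term: int            # term of that preceding entry
    entries: List[Entry] = field(default_factory=list)  # empty for a heartbeat
    leader_commit: int = 0        # leader's commit index, piggybacked on every call

    def is_heartbeat(self) -> bool:
        # A heartbeat is simply an AppendEntries that carries no entries.
        return not self.entries
```

A message with entries acts as a propose, one without entries acts as a heartbeat, and either kind advances the follower's commit point through leader_commit.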

work flow

Leader election

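https://raft.github.io/
has a detailed walk-through of leader election. You can manipulate the nodes freely and observe the edge cases in each situation. Following the text diagram used for Paxos, here is a leader election in the normal case: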

Node1                 Node2                 Node3
   |                    |                     |
timeout                 |                     |
   X----RequestVote---->|                     |
   X--------------------|-----RequestVote---->|
   |<-------vote--------X                     |
   |<-------vote--------|---------------------X
become leader           |                     |
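
The "timeout" step in the diagram relies on each node drawing a random election timeout, so that usually one node times out first and asks for votes before the others do. A small sketch follows; the 150 to 300 ms range is the one suggested in the Raft paper, while the function names draw_timeouts and first_candidate are hypothetical.

```python
import random
from typing import Dict, Iterable


def draw_timeouts(node_ids: Iterable[str], rng: random.Random,
                  low_ms: float = 150.0, high_ms: float = 300.0) -> Dict[str, float]:
    """Give every node its own randomized election timeout in milliseconds."""
    return {node: rng.uniform(low_ms, high_ms) for node in node_ids}


def first_candidate(timeouts: Dict[str, float]) -> str:
    """The node whose timeout fires first becomes the candidate."""
    return min(timeouts, key=timeouts.get)


timeouts = draw_timeouts(["Node1", "Node2", "Node3"], random.Random())
candidate = first_candidate(timeouts)
```

Because the timeouts are spread over a range, split votes are rare, and when one does happen the nodes simply draw new timeouts and retry.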

Log replication

Figure 6: Logs are composed of entries, which are numbered
sequentially. Each entry contains the term in which it was
created (the number in each box) and a command for the state
machine. An entry is considered committed if it is safe for that
entry to be applied to state machines.

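Think carefully about the Receiver implementation of the AppendEntries RPC:

Do not accept entries from a term lower than the current term. Only two kinds of node can broadcast AppendEntries: first, an old leader that mistakenly believes it is still the leader; second, the current leader. A node in a partition may keep incrementing its term, but it will not broadcast until it has won votes from a majority and become leader. This guarantees that only entries from a leader of the current term are accepted.
Do not accept entries whose prevLogIndex and prevLogTerm do not match. A mismatch means that more than one leader has broadcast at that position, so the leader's log and the follower's log have diverged. The leader repairs the divergence by decrementing nextIndex and retrying, until all uncommitted entries agree.
If an existing entry conflicts with the leader's, use the leader's. The leader's log is authoritative.
If there are new entries, append them.
If leaderCommit > commitIndex, set commitIndex = min(leaderCommit, index of last new entry)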
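
The five receiver rules can be sketched as one method. This is an in-memory illustration only: the class name Follower, the snake_case parameter names, and the (term, command) entry representation are assumptions, log indexes are 1-based as in the paper, and persistence, voting state, and RPC transport are left out.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command); representation assumed for illustration


class Follower:
    def __init__(self) -> None:
        self.current_term = 0
        self.log: List[Entry] = []   # self.log[0] holds log index 1
        self.commit_index = 0

    def append_entries(self, term: int, prev_log_index: int, prev_log_term: int,
                       entries: List[Entry], leader_commit: int) -> bool:
        # 1. Reject a request from a stale term.
        if term < self.current_term:
            return False
        self.current_term = term
        # 2. Reject if the entry at prev_log_index is missing or has another term.
        if prev_log_index > 0:
            if len(self.log) < prev_log_index:
                return False
            if self.log[prev_log_index - 1][0] != prev_log_term:
                return False
        # 3 and 4. Drop a conflicting suffix, then append entries not yet present.
        for offset, entry in enumerate(entries):
            index = prev_log_index + 1 + offset
            if len(self.log) >= index:
                if self.log[index - 1][0] == entry[0]:
                    continue                 # already have this entry
                del self.log[index - 1:]     # conflict: the leader's log wins
            self.log.append(entry)
        # 5. Advance the commit index, bounded by the last new entry.
        if leader_commit > self.commit_index:
            self.commit_index = min(leader_commit, prev_log_index + len(entries))
        return True
```

When the call returns False because of rule 2, the leader is expected to retry with a smaller prevLogIndex, which is the nextIndex back-off described above.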

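Once a majority of nodes return success, the entries in that AppendEntries can be treated as written and the result can be returned to the client. This is guaranteed by an extra restriction applied during leader election: only a node whose log is at least as up to date (comparing the term and index of the last log entry) can win the election. So once an entry from the leader's current term has been written on a majority of nodes (even before the followers learn that it is committed), any later election is certain to choose a node that has that entry:

Without a leader change, no newer conflicting entry is written.
A leader change needs the agreement of a majority of nodes, so it is certain to choose a node that has the entry, because that entry is the latest.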
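
The election restriction is a comparison on the last log entry of the candidate and of the voter: the later term wins, and with equal terms the longer log wins. A sketch, with the function name log_is_up_to_date and the parameter names chosen for illustration:

```python
def log_is_up_to_date(cand_last_term: int, cand_last_index: int,
                      my_last_term: int, my_last_index: int) -> bool:
    """Return True if the candidate's log is at least as up to date as the voter's."""
    if cand_last_term != my_last_term:
        # Logs ending in different terms: the later term is more up to date.
        return cand_last_term > my_last_term
    # Same last term: the longer log is more up to date.
    return cand_last_index >= my_last_index
```

A voter grants its vote only when this returns True, which is why a candidate that lacks a majority-replicated entry cannot gather a majority of votes.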

Cluster membership changes

Figure 11: Timeline for a configuration change. Dashed lines
show configuration entries that have been created but not
committed, and solid lines show the latest committed
configuration entry. The leader first creates the C_old,new configuration
entry in its log and commits it to C_old,new (a majority of C_old
and a majority of C_new). Then it creates the C_new entry and
commits it to a majority of C_new. There is no point in time in
which C_old and C_new can both make decisions independently.

The key is the switch of which majority is able to make decisions.
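
During the joint phase, a decision needs a majority of C_old and, separately, a majority of C_new. The sketch below shows that rule only; the function names has_majority and joint_decision and the set-based representation of configurations are assumptions.

```python
from typing import Set


def has_majority(acks: Set[str], members: Set[str]) -> bool:
    """True if the acking nodes include a strict majority of the configuration."""
    return len(acks & members) > len(members) // 2


def joint_decision(acks: Set[str], c_old: Set[str], c_new: Set[str]) -> bool:
    """Under C_old,new an entry commits only with both majorities."""
    return has_majority(acks, c_old) and has_majority(acks, c_new)


c_old = {"a", "b", "c"}
c_new = {"c", "d", "e"}
```

With these configurations, acks from {"a", "c", "d"} satisfy both majorities, while {"a", "b"} (old only) or {"c", "d", "e"} (new only) do not, so neither side can decide on its own.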

gossip

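The gossip consistency algorithm differs considerably from paxos/zab/raft: there is no leader and no linearizability guarantee. By broadcasting with unlimited retries, it ensures that a change reaches every node after some period of time.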

work flow

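The key to implementing the algorithm is how to merge the gossip broadcasts from all nodes into a single state. For example, node a may declare that k1=v1 on node a, and node b may declare that k1=v2 on node b. The merged state is then: k1=v1 on a, k1=v2 on b. As long as the merge rule is the same everywhere (for example keep both, prefer a, or prefer b), nodes that receive the same broadcasts end up in the same state.
Because there is no linearizability guarantee, a broadcast should carry the full state rather than a delta.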
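
The merge described above can be sketched as follows. Each node gossips its full view, keyed by the origin node; the per-origin version counter, the State type, and the function names merge and resolve are assumptions added for this sketch so that a newer declaration from the same origin replaces an older one.

```python
from typing import Dict, List, Optional, Tuple

# origin node -> (version, full key/value map declared by that origin)
State = Dict[str, Tuple[int, Dict[str, str]]]


def merge(local: State, remote: State) -> State:
    """Merge two full views; per origin, the higher version wins."""
    merged = dict(local)
    for origin, (version, data) in remote.items():
        if origin not in merged or version > merged[origin][0]:
            merged[origin] = (version, data)
    return merged


def resolve(state: State, key: str, priority: List[str]) -> Optional[str]:
    """Deterministic read rule: take the value from the first origin in priority order."""
    for origin in priority:
        if origin in state and key in state[origin][1]:
            return state[origin][1][key]
    return None


view_a: State = {"a": (1, {"k1": "v1"})}
view_b: State = {"b": (1, {"k1": "v2"})}
```

Because merge gives the same result in either order, two nodes that have exchanged views hold the same state, and applying the same resolve rule ("prefer a" or "prefer b") then gives the same answer on both.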

Active thread (peer P):                 Passive thread (peer Q):
(1) selectPeer(&Q);                     (1)
(2) selectToSend(&bufs);                (2)
(3) sendTo(Q, bufs);            ----->  (3) receiveFromAny(&P, &bufr);
(4)                                     (4) selectToSend(&bufs);
(5) receiveFrom(Q, &bufr);      <-----  (5) sendTo(P, bufs);
(6) selectToKeep(cache, bufr);          (6) selectToKeep(cache, bufr);
(7) processData(cache);                 (7) processData(cache)
Figure 1: The general organization of a gossiping protocol.