zookeeper實現分析之二：ZAB

時間 2019-11-17

標籤 zookeeper 實現分析之二 zab 欄目 Zookeeper 简体版

原文原文鏈接

一年多前學習zookeeper時作的筆記，主要是翻譯自「ZooKeeper's atomic broadcast protocol:Theory and practice」，並添加了本身的一些理解，整理一下做爲一篇博客貼出來，後續有時間會分析一下在zookeeper源碼中，zab是如何實現的，以及zab與paxos的區別。 -------------------------------------------------------------------------- <h3>1 Consistency Guarantees</h3> Zookeeper不能保證強一致性，客戶端能看到older數據。Zookeeper提供順序一致性。 Zookeeper的一致性保證： 一、順序一致性：客戶端的更新通知是嚴格按照順序進行發送。 二、原子性：更新操做要麼成功要麼失敗，沒有中間狀態。 三、Single system image：無論客戶端鏈接哪個服務器，客戶端看到的都是the same view of service。 四、Reliability：一旦一個更新成功，那麼那就會被持久化，直到客戶端用新的更新覆蓋這個更新。 五、Timeliness：Zookeeper確保客戶端在必定時間內（幾十秒）完成或看到系統的數據更新。 那麼zab是如何確保這些一致性相關的特色。 Zab的兩個重要的要求以下： 一、支持同時處理多個outstanding的客戶端寫操做。一個outstanding事務的含義是事務已經被提交但沒有被commit。 二、有效的從crash狀態恢復過來。 Zookeeper能處理併發地處理多個客戶端的outstanding 寫請求，而且以FIFO順序commit這些寫操做。FIFO的特性對於zookeeper可以有效的從crash狀態恢復過來也是相當重要的。 原始的paxos協議不能同時處理多個outstanding transaction，paxos不要求通訊時的FIFO通道特性，paxos能夠容忍消息丟失和從新排序。 在paxos中，從primary crash中恢復過來並保證事務的序列化的能力不是足夠有效，而zab改進了這方面的能力，採用了一個事務ID來實現事務的totally order。 Zookeeper的性能要求以下： 一、低延時（low latency）。 二、 Good throughput。高吞吐量。 三、 Smooth failure handling。容錯。 在這種狀況下，爲了能有效地更新一個new primary的應用程序狀態，在zab中new primary會被指望擁有最高事務ID的進程，整個集羣能夠經過從new primary中拷貝事務，從而全部數據副本均可以達到一個一致性。 而在paxos，沒有采用相似zab的序列號，因此一個新的primary須要執行paxos算法的第一階段，以便於獲取到全部primary沒有學習到值。 <h5>2 ZAB協議和流程介紹</h5> Zab協議有四個階段，以下： 一、階段0：Leader election 二、階段1：Discovery（或者epoch establish） 三、階段2：Synchronization（或者sync with followers） 四、階段3：Broadcast 在Zab協議的實現時，合併爲三個階段： 一、 Fast Leader Election 二、 Recovery Phase 三、 Broadcast Phase 在實現中將discovery和synchronization這兩個phase合併成了broadcast phase。 ZAB的流程圖以下所示： <a href="http://static.oschina.net/uploads/img/201312/22194147_iKYR.jpg"><img title="1" style="border-right: 0px; border-top: 0px; display: inline; border-left: 0px; border-bottom: 0px" height="322" alt="1" src="http://static.oschina.net/uploads/img/201312/22194147_K2bP.jpg" width="500" border="0" /></a> CEPOCH = Follower sends its last promise to the prospective leader NEWEPOCH = Leader proposes a new epoch e' ACK-E = Follower acknowledges the new epoch proposal NEWLEADER = Prospective leader proposes itself as the new leader of epoch e' ACK-LD = Follower acknowledges the new leader proposal COMMIT-LD = Commit new leader proposal PROPOSE = Leader proposes a new transaction ACK = Follower acknowledges leader proosal COMMIT = Leader commits proposal <h3>3 Leader election</h3> <h4>3.1 leader election後置條件</h4> Leader election可能有多種方式，但在這裏咱們只分析一種，fast leader election。 Leader election後置條件： 一、條件：Leader election這個過程必須保證選舉出來的leader能看到全部歷史的commited transactions。 二、緣由：這個後置條件是爲了確保在後續recovery phase步驟中zookeeper replicas的一致性。它是防止follower中包含leader中沒有的committed transaction，並且在recovery phase中只有leader向follower和observer同步，follower不會向leader同步，若是出現這種狀況，那麼zookeeper的replicas就出現了不一致的狀況。 因此爲了達到這個後置條件，leader election須要選擇出一個擁有highest lastZxid的leader。 那麼fast leader election是如何選擇出一個擁有highest lastZxid的leader？ <h4>3.2 Fast leader election介紹</h4> 在進行fast leader election過程當中，爲了選舉出一個擁有highest lastZxid的leader（能看到最新的歷史committed transaction），處於election狀態的peer servers會對其餘peer server進行表決。Peer server會交換他們的vote（選舉）的通知。同時當peer server發現一個擁有recent history的peer server（peer server擁有higher history Zxid），peer server會更新其自身的vote。當選舉出一個leader後，而後進入recovery phase，fast leader election就結束了，假如vote選舉出來leader就是peer server自身，那麼peer server變成leading狀態（fast leader election過程當中，peer server自己的狀態是following），其餘的peer server則進入following狀態。若是後續的recovery phase和broadcast phase發生任何失敗的狀況，那麼peer server都會回到election狀態，從新啓動fast leader election。 <h4>3.3 Epoch number</h4> Epoch是用於區分每個round，每一次創建一個新的leader-follower關係，都會有一個惟一的epoch值去標識。就好像皇帝登基必須得有一個年號，與以前或以後的皇帝進行區分。 Epoch在兩個過程當中用到：一、leader election時。二、recovery過程（新創建一個leader-follower關係）。 一、過程1：每個fast leader election開始時epoch的值都爲0，epoch的值會在fast leader election過程當中進行更新。 我的理解每一個zookeeper節點剛啓動時沒有leader-follower關係視圖，那麼它就會認爲本身是leader，而後發起leader electoin，那麼這個leader election的epoch值爲0；在leader election過程當中，將epoch更新到currepoch值（其餘peer server中的最高的epoch）。使用epoch number來區分不一樣的fast leader election過程。就好像你想當皇帝，定了一個年號發起登基過程，若是當前有其餘皇帝存在，且他的年號比你的年號更新，那麼你就得更新年號，從新發起登基，誰支持的人多誰就是皇帝；若是沒有其餘皇帝存在，但有其餘人也在登基，那麼你們就一塊兒比比，看誰的年號更新，看誰的資格更老（一樣的epoch，vote值越大越優先），那麼選舉誰當皇帝。 二、過程2：在一個faster leader election結束後，新產生的leader會獲取epoch，其值爲lastest history zxid的高32位，而後對epoch自增，而後用新的epoch值做爲新zxid的高32，zxid的低32位爲0。一旦當上皇帝后，就發佈一個新的年號。 這裏有矛盾的地方： 兩個過程的epoch是不是同一個？過程1的epoch是不會持久化的。過程2中由於zxid是持久化的，那麼至關於epoch也是持久化的。因此不理解。 <h4>3.4 選取出highest zxid的leader</h4> 爲了能選舉出highest zxid的leader，那麼就須要對vote進行比較。 對於peer server集合 PSET = {p1, p2, p3, …., pn}，其中{1, 2, 3, …. , n }是peer server的ID. 那麼Pi的vote能夠用pair(Zi, i)表示，Zi是Pi的highest zxid，也是lastest zxid。 那麼兩個vote比較大小的準則是：    (Zi, i) >= (Zj, j) ： Zi > Zj 或者( Zi = Zj && i >= j ) 每個peer server都有一個惟一的ID，且都知道其replicas中保存的latest zxid，那麼全部的peer就會以必定順序進行排序。 <h4>3.5 Fast leader election持久化</h4> 在fast leader election過程當中，不會對任何數據進行持久化，不會把過程當中產生的值寫入到disk中。包括epoch number和ID但在fast leader election會使用已經持久化的latest zxid。 <h4>3.6 Fast leader election過程和僞碼</h4> 進行Fast leader election的先決條件： 一、每一個peer server都知道其餘peer的ip地址，並知道peer server的總數。 二、每一個peer server一開始都是發起一個vote，選取本身爲leader。向其餘全部的peer server發送vote的notification，並等待回覆。 三、根據peer server的狀態處理vote notification或則notifincation的回覆. 若是peer server處於election狀態，那麼peer server會收到其餘peer server的vote，若是收到的vote值更大，那麼peer server會更新其vote。 若是peer server不處於election狀態，那麼peer server會更新其所看到的leader-follower關係。 無論哪一種狀況下，當peer server檢測到大部分peers持有相同的vote時，那麼它會返回 Fast leader election邏輯僞代碼 主要有兩個邏輯分支： 一、正常過程，vote的notification的回覆的peer server的狀態爲election 二、另外過程，vote的notification的回覆的peer server的狀態爲leading/following 執行leader election的狀況較爲複雜，多是一個服務器節點新加入到zookeeper集羣中。也多是zookeeper集羣剛啓動，你們都處於leader election狀態。以上兩個邏輯分支能處理這些狀況。 ***初始化vote和peer server狀態*** 1 Peer P: 2 timeout <---T0 // use some reasonable timeout value 3 ReceivedVotes <--- 0; OutOfElection <--- 0; // key-value mappings where keys are server ids 4 P:state <--- election;  P:vote <---(P:lastZxid; P:id);  P:round <--- P:round + 1 1-4是初始化過程，設置超時時間，receivedVotes是收到的vote noficaton回覆。 進入election狀態，根據lastZxid和ServerID生成一個vote，vote的epoch自增。 ReceivedVotes做爲一個結果集合，在收到全部vote後，進行表決。OutOfElection用於保存狀態爲leading/folling的rspvote，用於表決先存在的leader/follower是否有效。 5 Send notification (P:vote, P:id, P:state, P:round) to all peers 向全部的peer server發送notification，一個notification包括vote，id，peer state，和vote的epoch number。 ***開始接收notification回覆的循環處理*** 6 while P:state = election do 7     n <---(null if P:queue = 0; for timeout milliseconds, otherwise pop from P:queue) 8     if n = null then 9          Send notification (P:vote, P:id, P:state, P:round) to all peers 10        timeout <---(2* timeout), unless a predefined upper bound has been reached 8-10是當notification回覆爲空時，有兩種狀況，一種是信令發送出去回覆超時，第二種是沒有創建於peer server的鏈接. 若是是第一種狀況，那麼從新發送notification；若是是第二種狀況，那麼創建與peer server的tcp鏈接. 11    else if n:state = election then //當nofication回覆不爲空，且peer server的狀態也是election時 12         if n:round > P:round then 13               P:round <--- n:round 14               ReceivedVotes <---0 15               if n:vote > (P:lastZxid; P:id) then P:voteßn:vote 16               else P:vote <---(P:lastZxid; P:id) 17               Send notification (P:vote, P:id, P:state, P:round) to all peers 這個邏輯分支是notification回覆中resvote的epoch要大於vote 的epoch（說明回覆中的peer vote的zxid > vote的zxid），那麼vote失效了，須要更新vote，比較回覆中的兩個vote值的大小，選擇值大的vote，而後從新發送notification。 18         else if n:round = P:round and n:vote > P:vote then 19              P:vote <--- n:vote 20              Send notification (P:vote, P:id, P:state, P:round) to all peers       當回覆中的rspvote的epoch等於vote的epoch，但rspvote > vote，那麼更新vote信息       而後從新將vote向全部的peer server發送。 21          else if n:round < P:round then goto line 6      Resvote的epoch小於vote的epoch，那麼這個回覆是無效的，        直接忽略，繼續下一個循環。 22          Put(ReceivedVotes(n:id); n:vote; n:round)     將rspvote放入到ReceivedVotes中。 23         if  ReceivedVotes = SizeEnsemble then 24                DeduceLeader(P.vote.id);  return P.vote      若是已經收到了全部peer server的vote，若是vote中的leaderID == currentPeer自己，      那麼currPeer爲leader，結束並返回這次vote結果。 25         else if P.vote has a quorum in ReceivedVotes                        and there are no new notifications within T0 milliseconds then 26                DeduceLeader(P.vote.id);  return P.vote        若是收到超過半數peer server的vote，那麼vote中的leaderID == currentPeer自己，           那麼currPeer爲leader，結束並返回這次vote結果. 27          end      邏輯分支1總結：          若是rspvote中epoch > vote epoch，更新epoch和vote後從新發起vote          若是rspvote中epoch < vote epoch，無效rspvote          其餘，都保存在結果集合中，若是有rspvote>vote，那麼將vote更新到rspvote；等待全部rspvote都收到，那麼vote的值應該爲結果集合中最大值，若是結果集合超過半數，那麼這次vote生效，leader爲vote中的serverID。若是serverID爲自己的serverID，那麼currpeer的狀態爲leader不然爲follower 28    else // state of n is LEADING or FOLLOWING 當rspvote的狀態爲following或leading，說明vote以外已經存在了一個leader，那麼此段邏輯主要是分紅兩部分：一部分是vote的表決；另外一部分是vote以外的leader/follower表決. 29         if n:round = P:round then 30             Put(ReceivedVotes(n.id); n:vote; n:round) 31             if n:state = LEADING then 32                 DeduceLeader(n:vote:id); return n:vote 33             else if n:vote:id = P:id and n:vote has a quorum in ReceivedVotes then 34                 DeduceLeader(n:vote:id); return n:vote 35             else if n:vote has a quorum in ReceivedVotes and the voted peer n:vote:id is in                      state LEADING and n:vote:id 2 OutOfElection then 36                  DeduceLeader(n:vote:id); return n:vote 37             end 38         end 以上部分是vote的表決，以上的邏輯跟代碼中不符合，代碼中的邏輯是： 若是rspvote的epoch==vote的epoch，放入到receivedVots中，若是rspvote的狀態是leader 且集合中的rspvote超過半數，那麼vote的表決的leader就是rspvote的leader。 39         Put(OutOfElection(n:id); n:vote; n:round) 40         if n:vote:id = P:id and n:vote has a quorum in OutOfElection then 41             P:round <--- n:round 42             DeduceLeader(n:vote:id); return n:vote 43         else if n:vote has a quorum in OutOfElection and the voted peer n:vote:id is in state                      LEADING and n:vote:id 2 OutOfElection then 44             P:round <--- n:round 45             DeduceLeader(n:vote:id); return n:vote 46          end 以上部分是對vote以外的leader/follower進行表決，OutOfElection是用來存放狀態爲leader/follow的rspstate，若是OutOfElection的rspvote超過半數，那麼說明election以外的leader./follow是有效地， 47  end    邏輯分支2總結：這部分是考慮到可能有部分peer server維持leader/follower的狀態，部分peer server處於election狀態，若是維持leader/follower狀態的peer server數據過半，那麼leader/follower就是有效地。或者vote的epoch等於leader的epoch，那麼若是有半數以上的rspvote，那麼當前的leader/follower也是有效的。 <h3>4 Discovery and synchronization</h3> 在broadcast階段，zookeeper集羣必須有一個leader peer，zookeeper集羣是primary/backup模式，那麼leader就是primary。Discovery和synchronization這兩個階段的做用就是將所有的zookeeper節點帶入到一個最終一致的狀態，特別是當從crash中恢復時。這兩個階段組成了zab的recovery部分，對於容許多個獨立事務的狀況下，保證事務的順序起着關鍵做用。 無論在discovery、synchronization仍是broadcast，一旦發生錯誤，那麼均可以回到leader election過程。 用戶若是須要使用zookeeper服務，那麼必須鏈接一個zookeeper節點。用戶向鏈接的服務器提交寫操做，而後zab協議層會執行一個broadcast；假如用戶向follower提交寫操做，那麼follower會把寫操做發送給leader；若是leader收到寫操做，leader會執行，而後向全部follower擴散這個寫操做對應的數據更新。讀操做能夠由與用戶相鏈接的zookeeper節點直接完成。用戶能夠經過發送sync命令保證數據副本的更新。 在zab協議中，zxid(transaction identifiers)對於實現順序一致性十分關鍵。在zookeeper中事務能夠用(v, z)表示，v是新狀態（znode），z則是zxid，一個identifier。那麼一個zxid也是一個pair(e, c)，e是一個primary Pe（能夠理解爲leader）的epoch number，c是一個整數值，做爲計數器使用。Primary每產生一個新的事務，那麼計數器c就會+1。 當一個新的epoch開始時，一個新的leader會被激活，此時c會被設置爲0，e會在前一個epoch的值上+1。 在代碼實現中e是zxid的高32位，c是zxid的低32位。 如下四個變量構成了一個peer的持久化狀態： 一、History：已經被接受的事務提案(transaction proposal)。 二、acceptedEpoch：最近收到的NEWEPOCH信令中的epoch number。 三、currentEpoch：最近收到的NEWLEADER信令中的epoch number。 四、lastZxid：history中的最近的zxid。 <h3>5 Discovery過程</h3> 在這個階段，followers會跟他們的將來預期中的leader進行通訊，準leader會收集accepted follower(已經創建鏈接的)的latest transactions，這個階段的目的是發現quorum peer server中的highest histroy transaction，而後創建一個新的epoch，這樣就能夠防止previous leader不會commit 新的proposals（由於previous leader的epoch已通過期了）。 在discovery階段的開始，一個follower peer會創建於準leader的leader-follower connection。 Follower同時只是鏈接一個leader。假如一個peer P不是leading狀態，其餘peer會考慮p是一個準leader，任何其餘leader-follower鏈接都會被p拒絕；一樣leader-follower鏈接的拒絕或其餘的failure能將follower從新帶入到leader election狀態。 1 Follower F: 2 Send the message FOLLOWERINFO(F:acceptedEpoch) to L 3   upon receiving NEWEPOCH(e0) from L do 4      if e0 > F:acceptedEpoch then 5          F:acceptedEpoch <--- e0 // stored to non-volatile memory 6          Send ACKEPOCH(F:currentEpoch; F:history; F:lastZxid) to L 7           goto Phase 2 8      else if e0 < F:acceptedEpoch then 9           F:state <--- election and goto Phase 0 (leader election) 10     end 11 end 這個過程是follower端，follower向準leader發送FOLLOWERINFO信令，告訴leader本身的信息，最重要的就是把accepted epoch發送給leader。而後接收leader的NEWLEADER信令，NEWLEADER信令中帶有new epoch（這個epoch表示這這一輪過程，每一次創建leader-follower關係，都會有一個新的epoch來惟一標識，與previous leader-follower進行區分）。Follower檢查這個new epoch是否有效，若是有效，follower更新自身的epoch並回復一個ACKEPOCH，上報當前follower的狀態，進入下一個階段。若是無效，那麼follower會從新跳到leader electoin階段。 12 Leader L: 13 upon receiving FOLLOWERINFO(e) messages from a quorum Q of connected followers do 14      Make epoch number e0 such that e0 > e for all e received through FOLLOWERINFO(e) 15      Propose NEWEPOCH(e0) to all followers in Q 16 end 17 upon receiving ACKEPOCH from all followers in Q do 18      Find the follower f in Q such that for all f0 2 Q n ffg: 19          either f0:currentEpoch < f:currentEpoch 20          or (f0:currentEpoch = f:currentEpoch) ^ (f0:lastZxid _z f:lastZxid) 21      L:history <--- f:history  // stored to non-volatile memory 22      goto Phase 2 23 end 這個是leader端的recovery過程，leader會生產一個new epoch，首先接收全部follower的epoch，肯定new epoch要大於全部的follower epoch。而後向全部follower發送NEWEPOCH信令，將new epoch下發到全部的follower中。 等待follower的ACKEPOCH回覆，若是全部的follower的currEpoch和zxid都小於等於leader的currEpoch和zxid，那麼進入下一個過程。 <h3>6 Synchronization過程</h3> 這個過程是將follower的數據副本與準leader的歷史數據進行同步，使得zookeeper集羣的數據處於一致的狀態。同步的方向是準leader向follower同步。同步的過程以下：leader與follower進行通訊，發送NEWLEADER信令，帶有歷史事務的highest zxid；follower收到這些信令後，決定是否更新歷史事務，而後響應leader。當leader看到quorum follower的響應後，就會向它們發送commit信令。在這以後leader就創建完成了。 1 Leader L: 2 Send the message NEWLEADER(e0;L:history) to all followers in Q 3 upon receiving ACKNEWLEADER messages from some quorum of followers do 4      Send a COMMIT message to all followers 5      goto Phase 3 6 end 這是leader端的過程，發送NEWLEADER，而後接受響應，最後發送commit，至此leader創建完畢。 7 Follower F: 8 upon receiving NEWLEADER(e0;H) from L do 9      if F:acceptedEpoch = e0 then 10         atomically 11             F:currentEpoch <--- e0 // stored to non-volatile memory 12             for each (v; z) in H, in order of zxids, do 13                  Accept the proposal (e0; (v; z)) 14             end 15             F:history <---H // stored to non-volatile memory 16         end 17         Send an ACKNEWLEADER(e0;H) to L 18     else 19          F:state <--- election and goto Phase 0 20     end 21 end 22 upon receiving COMMIT from L do 23      for each outstanding transaction (v; z) in F:history, in order of zxids, do 24          Deliver (v; z) 25      end 26      goto Phase 3 27 end 這是follower端的流程，先是收到NEWLEADER信令，而後原子地更新epoch和歷史事務，發送ACKNEWLEADER信令響應leader；而後等待commit信令，收到commit信令後進行處理，進入下一個階段。 <h3>7 代碼實現的Recovery phase</h3> 在實現discovery和synchronization時，沒有嚴格分紅兩個階段進行實現，在實現時進行了一些優化，合併成一個階段實現，那麼這個階段就是recovery phase；recovery階段就是將全部的zookeeper集羣的數據副本進入到最終一致性地狀態中，且創建出一個具備最高highest zxid的leader。 在實現中，第0階段的fast leader election與第一階段discovery緊密結合在一塊兒，faster leader election在實現時作了一個優化，它會選擇出一個most up-to-date的history（我的理解就是選擇出一個具備最新的commit事務的peer server），那麼這樣的一個leader被選舉出來後，在第一階段就不須要去與followers通訊去發現latest history。 那麼既然在fast leader election中包括了discovery階段的責任，那麼這個discovery階段就能夠被忽略，因此在實現時就將discovery和synchornization階段合併成一個recovery階段。這個階段是在fast leader election以後，且認爲leader擁有lastest history。 僞碼： 1 Leader L: 2 L:lastZxid <--- (L:lastZxid:epoch + 1; 0) 3 upon receiving FOLLOWERINFO(f:lastZxid) message from a follower f do 4      Send NEWLEADER(L:lastZxid) to f 5      if f:lastZxid  <=  L:history:lastCommittedZxid then 6          if f:lastZxid  <=  L:history:oldThreshold then 7              Send a SNAP message with a snapshot of the whole database of L 8          else 9              Send a DIFF({committed transaction (v; z) in L:history : f:lastZxid < z}) 10        end 11     else 12         Send a TRUNC(L:history:lastCommittedZxid) message to f 13     end 14 end 15 upon receiving ACKNEWLEADER messages from some quorum of followers do 16     goto Phase 3 // Algorithm 3 17 end 以上是leader端的流程，先生存一個新的zxid和epoch，接收follower的FOLLOWERINFO信令（包含follower的lastzxid），而後向follower發送NEWLEADER(包含leader的zxid)。而後根據FOLLOWERINFO中帶有的lastzxid對follower進行更新。分紅三種狀況……. History.lastCommittedZxid是最新committed的歷史事務。History.oldThreshold是過久的歷史提案，比leader上一次snapshot的時間還久。見2.6.2關於TRUNC的說明。 第一種狀況是TRUNC，follower丟棄從leader.latestZxid到follower.lasterZxid之間的提案。 第二種狀況是DIFF，follower接收新的提案從follower.lasterZxid到leader.lasterZxid之間的新提案。 第三種狀況是SNAP，follower中的提案太舊，leader將snap更新到follower上。 18 Follower F: 19 Connect to its prospective leader L 20 Send the message FOLLOWERINFO(F:lastZxid) to L 21 upon L denies connection do 22     F:state <--- election and goto Phase 0 23 end 24 upon receiving NEWLEADER(newLeaderZxid) from L do 25     if newLeaderZxid:epoch < F:lastZxid:epoch then 26         F:state <--- election and goto Phase 0 27     end 28     upon receiving a SNAP, DIFF, or TRUNC message do 29         if got TRUNC(lastCommittedZxid) then 30             Abort all proposals from lastCommittedZxid to F:lastZxid 31         else if got DIFF(H) then 32             Accept all proposals in H, in order of zxids, then commit all 33         else if got SNAP then 34             Copy the snapshot received to the database, and commit the changes 35         end 36         Send ACKNEWLEADER 37         goto Phase 3 // Algorithm 3 38    end 39 end 以上是follower的流程，首先是向leader鏈接，而後發送FOLLOWERINFO信令，若是leader拒絕鏈接，那麼follower從新回到leader election階段。接收NEWLEADER信令，若是信令中帶有的epoch無效（小於follower的epoch），那麼follower從新回到leader election狀態。 而後接收SNAP/DIFF/TRUNC信令，同步數據副本和zxid，最後回覆ACKNEWLEADER信令。進入到下一個階段。 這個同步的目的是讓全部數據副本都進入一個最終一致性狀態。爲了達到這個目的，任何副本中的committed transactions必須以一樣一種順序，甚至已經被提交的transaction但沒有被任何一個peer節點committ的事務必須被拋棄。SNAP和DIFF用於保證各個副本中的committed事務的順序一致性；而TRUNC用於處理已經被提交但沒有被committed的事務。 <h3>8 Broadcast phase</h3> Zookeeper peer之間的雙向通道使用TCP鏈接實現，TCP通訊的FIFO序列化特性對於實現broadcast協議相當重要。 假如沒有發生崩潰，那麼peers會一直停留在broadcast階段。第三階段中只能有一個leader。 Broadcast的過程是leader與follower之間的一個兩階段的提交過程（two-phase commit） 一、 leader與follower的通信通道(communication channel)是一個FIFO，全部都是是按順序處理。 二、 leader收到一個request後，會生成一個propose。而後執行兩階段提交. <a href="http://static.oschina.net/uploads/img/201312/22185922_040N.png"><img title="wps_clip_image-14769" style="border-top-width: 0px; display: inline; border-left-width: 0px; border-bottom-width: 0px; border-right-width: 0px" height="126" alt="wps_clip_image-14769" src="http://static.oschina.net/uploads/img/201312/22185922_KHTe.png" width="244" border="0" /></a> Broadcast的僞碼和流程 1 Leader L: 2 upon receiving a write request v do 3     Propose (e0; (v; z)) to all followers in Q, where z = (e0; c), such that z succeeds all zxid        values previously broadcast in e0 (c is the previous zxid's counter plus an increment of one) 4 end 5 upon receiving ACK((e0; (v; z))) from a quorum of followers do 6     Send COMMIT(e0; (v; z)) to all followers 7 end 以上是leader處理的兩階段提交的流程：首先leader受到寫請求v，而後生成一個提案(e,(v,z))，向全部follower發送此提案的內容，而後等待follower的ack；若是ack超過半數，那麼提案成立。向全部follower下發commit提案的命令。 8 // Reaction to an incoming new follower: 9 upon receiving FOLLOWERINFO(e) from some follower f do 10     Send NEWEPOCH(e0) to f 11     Send NEWLEADER(e0;L:history) to f 12 end 13 upon receiving ACKNEWLEADER from follower f do 14     Send a COMMIT message to f 15     Q <--- Q 並集 {f} 16 end 以上是一個新follower加入leader的流程：首先leader收到FOLLOWERINFO信令，而後向new follower發送NEWEPOCH信令，再發送NEWLEADER信令給new follower；等待new follower的ACKNEWLEADER，最後發送commit，至此new follower就加入到了集羣中。 17 Follower F: 18 if F is leading then Invokes ready(e0) 19 upon receiving proposal (e0; (v; z)) from L do 20     Append proposal (e0; (v; z)) to F:history 21     Send ACK((e0; (v; z))) to L 22 end 23 upon receiving COMMIT(e0; (v; z)) from L do 24     while there is some outstanding transaction (v0; z0) in F:history such that z0 < z do 25         Do nothing (wait) 26     end 27     Commit (deliver) transaction (v; z) 28 end 這是follower的broadcast流程：接收到leader的提案，而後將提案寫入到history中，而後發送響應。等待leader的commit信令，收到後執行commit 提案。 <h3>9 Zab所存在的問題</h3> <h4>9.1 acceptedEpoch和currentEpoch的做用</h4> 在recovery開始階段，準leader甚至在與大部分follower成功創建鏈接以前就增長其epoch（包括在lastZxid內）值。由於在recovery階段，follower在發現其epoch值要比準leader大時，會返回到leader election階段。那麼當準leader失去leader地位，併成爲previous leader（其epoch比準leader要小1）的一個follower，那麼準leader會發現previous leader的epoch值比其要小，那麼它會返回到leader election階段。這個現象會致使此peer一直在recovery階段和leader election階段之間循環。 因此使用lastZxid來存儲epoch number，沒有對一個tried epoch（我的理解是一個準leader在嘗試成爲leader時使用的epoch）和一個joined epoch（一個成功的leader所使用的epoch）進行區分。使用acceptedEpoch和currentEpoch的目的就是在於防止此類問題的發生。 <h4>9.2 Abandon follower proposal</h4> 假設一個集合{p1, p2, p3}，全部的peers都處於broadcast階段，且都已經同步到了最新的committed事務，事務的ID是（e= 1, c= 3），p1爲leader；一個新的提案，事務ID爲（1, 4）已經被leader p1發出，但在p2和p3收到事務以前，p1就已經發生了崩潰（好比已經放到socket緩存區中），那麼{p2, p3}會從新回到leader election，並選舉出一個新的leader。當p1恢復正常了，此時p2已經成爲了leader；那麼根據fast leader election，在recovery階段p2會將epoch設置爲2（p2.latestZxid = （2, 0）），那麼在broadcast階段，已經新的提案已經被quorum接收和commit，它的zxid爲（2， 1）。在這個時候leader p2的history.lastCommittedZxid = (2, 1)，而且p2的history.OlderThreshold = （1, 1）；那麼p1從新啓動後，p1會執行fast leader election，而後發現其餘peer已經創建leader-follower關係，且p2是leader，那麼p1會向發送FOLLOWERINFO（p1.latestZxid = （1, 4））。 在這種狀況下， p1.lastestZxid(1,4) < p2.history.lastCommittedZxid(2, 1) && p2.history.oldThreshold（1, 1）< p1.lastestZxid (1, 4)，那麼這種狀況下leader p2須要向p1發送TRUNC信令，讓follower放棄uncommitted proposal(1, 4)。 做者zy，QQ105789990node

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。