[Original] Big Data Basics: ZooKeeper (3) Election Algorithm

Any discussion of ZooKeeper's election algorithm has to mention Paxos, because ZooKeeper's election algorithm is a variant of the Paxos algorithm.

 

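The problem Paxos sets out to solve is this: a distributed network has many participants, and none of them is reliable (any of them may drop offline at any time). How can these participants reach agreement on a given proposal?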

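Real life is full of similar problems. Suppose a team of 10 people is organizing a team-building outing, and everyone has a place they would like to go. How does the team agree on a destination?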

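The simplest approach is to call the whole team into a meeting room. By majority rule, the team can quickly settle on a destination that most people are happy with.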

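Now change the problem: the 10 team members are travelling for work in 10 different places around the world, keep different schedules, and can only communicate by email. How do they settle on a destination now?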

1 The Paxos Algorithm

https://en.wikipedia.org/wiki/Paxos_(computer_science)

https://zh.wikipedia.org/zh-cn/Paxos%E7%AE%97%E6%B3%95

Paxos is a family of protocols for solving consensus in a network of unreliable processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communication medium may experience failures.

1.1 Roles

Paxos describes the actions of the processors by their roles in the protocol: client, acceptor, proposer, learner, and leader. In typical implementations, a single processor may play one or more roles at the same time. This does not affect the correctness of the protocol; it is usual to coalesce roles to improve the latency and/or number of messages in the protocol.

  • Client 
    • The Client issues a request to the distributed system, and waits for a response. For instance, a write request on a file in a distributed file server.
  • Acceptor (Voters) 
    • The Acceptors act as the fault-tolerant "memory" of the protocol. Acceptors are collected into groups called Quorums. Any message sent to an Acceptor must be sent to a Quorum of Acceptors. Any message received from an Acceptor is ignored unless a copy is received from each Acceptor in a Quorum.
  • Proposer 
    • A Proposer advocates a client request, attempting to convince the Acceptors to agree on it, and acting as a coordinator to move the protocol forward when conflicts occur.
  • Learner 
    • Learners act as the replication factor for the protocol. Once a Client request has been agreed on by the Acceptors, the Learner may take action (i.e.: execute the request and send a response to the client). To improve availability of processing, additional Learners can be added.
  • Leader 
    • Paxos requires a distinguished Proposer (called the leader) to make progress. Many processes may believe they are leaders, but the protocol only guarantees progress if one of them is eventually chosen. If two processes believe they are leaders, they may stall the protocol by continuously proposing conflicting updates. However, the safety properties are still preserved in that case.

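There are many clients, and each client has many ideas, but in the end only one idea from one client is accepted by everyone. To get an idea accepted, a client first passes it to a proposer. The proposer then submits the idea to multiple acceptors for a vote. If more than half of the acceptors approve, the idea is accepted by everyone. Learners are notified of the acceptors' voting results in a timely manner.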

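Because voting actually happens concurrently, the process is split into several phases, and the notion of a version (the proposal number) is added, which is somewhat similar to optimistic locking.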

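A vivid example: in some corrupt country, the government puts a project out to tender, and many companies (clients) want to win the project (idea). The decision is made by a panel of senior government insiders (acceptors), but they stay out of sight, so a company has to go through well-connected brokers (proposers) to deliver bribes (versions) to them. After receiving a bribe, each insider declares that they will no longer accept any bribe that is not higher than this one, and backs the company that paid it. A company that manages to bribe a majority of the panel wins the project.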

1.2 Basic Paxos

  • Phase 1a: Prepare

  • Phase 1b: Promise

  • Phase 2a: Accept Request

  • Phase 2b: Accepted
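
The four phases can be illustrated from the acceptor's side. The following is a minimal sketch of a single-decree acceptor, assuming proposal numbers are plain `long` values and values are strings. The class `PaxosAcceptorSketch` and its method names are hypothetical and do not come from ZooKeeper or any library.

```java
// Hypothetical single-decree Paxos acceptor; all names are illustrative.
final class PaxosAcceptorSketch {
    /** Reply to a prepare request (Phase 1b: Promise). */
    static final class Promise {
        final boolean ok;           // true if the acceptor promises to ignore lower numbers
        final long acceptedNumber;  // number of the proposal accepted so far, or -1 if none
        final String acceptedValue; // value accepted so far, or null if none

        Promise(boolean ok, long acceptedNumber, String acceptedValue) {
            this.ok = ok;
            this.acceptedNumber = acceptedNumber;
            this.acceptedValue = acceptedValue;
        }
    }

    private long promisedNumber = -1;  // highest prepare number answered so far
    private long acceptedNumber = -1;
    private String acceptedValue = null;

    /** Phase 1a -> 1b: promise only if n is higher than every prepare already answered. */
    synchronized Promise prepare(long n) {
        boolean ok = n > promisedNumber;
        if (ok) {
            promisedNumber = n;
        }
        return new Promise(ok, acceptedNumber, acceptedValue);
    }

    /** Phase 2a -> 2b: accept unless a prepare with a higher number was already answered. */
    synchronized boolean accept(long n, String value) {
        if (n < promisedNumber) {
            return false;
        }
        promisedNumber = n;
        acceptedNumber = n;
        acceptedValue = value;
        return true;
    }

    public static void main(String[] args) {
        PaxosAcceptorSketch acceptor = new PaxosAcceptorSketch();
        System.out.println(acceptor.prepare(1).ok);   // true: first prepare seen
        System.out.println(acceptor.prepare(2).ok);   // true: 2 > 1
        System.out.println(acceptor.accept(1, "x"));  // false: already promised 2
        System.out.println(acceptor.accept(2, "y"));  // true
    }
}
```

A later `prepare` with a higher number returns the already accepted number and value, which is the information a proposer needs before choosing what to propose in Phase 2a.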

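First, the legislators' roles are divided into proposers, acceptors, and learners (one participant may hold several roles). Proposers put forward proposals; a proposal consists of a proposal number and a proposed value. An acceptor that receives a proposal may accept it, and a proposal accepted by a majority of acceptors is said to be chosen. Learners can only "learn" chosen proposals. With the roles defined, the problem can be stated more precisely: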

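  1. A value can be chosen only after it has been put forward by a proposer (a value that has not yet been chosen is called a "proposal");
  2. In a single execution instance of the Paxos algorithm, only one value is chosen;
  3. Learners can learn only a value that has been chosen.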

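The author of Paxos, Leslie Lamport, arrived at the algorithm by progressively strengthening the three constraints above (mainly the second).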

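P1: An acceptor must accept the first proposal it receives.
P1a: An acceptor accepts a proposal numbered n if and only if it has not responded to a prepare request with a number greater than n.
P2: Once a proposal with value v has been chosen, every proposal chosen afterwards must have value v.
P2a: Once a proposal with value v has been chosen, every proposal accepted afterwards by any acceptor must have value v.
P2b: Once a proposal with value v has been chosen, every proposal issued afterwards by any proposer must have value v.
P2c: If a proposal numbered n has value v, then there is a majority such that either none of its members has accepted any proposal numbered less than n, or v is the value of the highest-numbered proposal among all proposals numbered less than n that they have accepted.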
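
Constraint P2c is what a proposer enforces when it picks the value for its accept request. The following is a minimal sketch, assuming the proposer has already collected promise replies from a majority of acceptors. `ProposerValueChoiceSketch`, `PromiseReply`, and `chooseValue` are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper showing how a proposer can satisfy P2c; names are illustrative.
final class ProposerValueChoiceSketch {
    /** What one acceptor reports in its promise. */
    static final class PromiseReply {
        final long acceptedNumber;  // number of the proposal it accepted, or -1 if none
        final String acceptedValue; // value of that proposal, or null if none

        PromiseReply(long acceptedNumber, String acceptedValue) {
            this.acceptedNumber = acceptedNumber;
            this.acceptedValue = acceptedValue;
        }
    }

    /**
     * If no acceptor in the majority has accepted anything, the proposer is free
     * to use its own value. Otherwise it must reuse the value of the
     * highest-numbered proposal reported by that majority.
     */
    static String chooseValue(List<PromiseReply> majorityReplies, String ownValue) {
        PromiseReply best = null;
        for (PromiseReply reply : majorityReplies) {
            boolean hasAccepted = reply.acceptedNumber >= 0;
            if (hasAccepted && (best == null || reply.acceptedNumber > best.acceptedNumber)) {
                best = reply;
            }
        }
        return best == null ? ownValue : best.acceptedValue;
    }

    public static void main(String[] args) {
        List<PromiseReply> fresh = Arrays.asList(
                new PromiseReply(-1, null), new PromiseReply(-1, null));
        System.out.println(chooseValue(fresh, "mine"));  // mine

        List<PromiseReply> mixed = Arrays.asList(
                new PromiseReply(3, "A"), new PromiseReply(5, "B"), new PromiseReply(-1, null));
        System.out.println(chooseValue(mixed, "mine"));  // B
    }
}
```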

2 Zookeeper Leader Election

Each ZooKeeper server acts as client + proposer + acceptor + learner.

Vote (proposal number and proposed value): myid, zxid, epoch

Initial value:

currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());

Votes received from other servers:

HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
 

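ZooKeeper's election algorithm makes many simplifications and merges the Paxos value and version into a single Vote. The goal is to elect a new leader in the shortest possible time while avoiding data loss and extra data synchronization. Consider the following cases: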

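  • First cluster startup: all servers have the same zxid and epoch, so once more than half of the servers have started, the one with the largest myid among them becomes leader. For example, with 5 servers whose ids are 1/2/3/4/5: if you start them in the order 1/2/3/4/5, server 3 becomes leader as soon as it starts, and servers 4 and 5 become followers when they start; if you start them in the order 5/4/3/2/1, server 5 becomes leader as soon as server 3 starts.
  • Cluster restart: once more than half of the servers have started, the one with the largest epoch becomes leader (the leader increments the epoch after each successful election); if all epochs are equal, the one with the largest zxid (that is, the one with the most recent data) becomes leader.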
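
The two cases above can be replayed with a small simulation. This is only a sketch under simplifying assumptions: it ignores message exchange and timing and simply picks the best (epoch, zxid, myid) among the servers that are up at the moment a majority is reached. `StartupElectionSketch`, `ServerState`, and `electInStartOrder` are hypothetical names, not ZooKeeper classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical replay of the startup scenarios; not ZooKeeper code.
final class StartupElectionSketch {
    static final class ServerState {
        final long myid;
        final long zxid;
        final long epoch;

        ServerState(long myid, long zxid, long epoch) {
            this.myid = myid;
            this.zxid = zxid;
            this.epoch = epoch;
        }
    }

    /** True if a wins against b: compare epoch first, then zxid, then myid. */
    static boolean beats(ServerState a, ServerState b) {
        if (a.epoch != b.epoch) return a.epoch > b.epoch;
        if (a.zxid != b.zxid) return a.zxid > b.zxid;
        return a.myid > b.myid;
    }

    /**
     * Starts servers one by one and returns the myid of the leader chosen as soon
     * as more than half of the ensemble is up, or -1 if a majority is never reached.
     */
    static long electInStartOrder(List<ServerState> startOrder, int ensembleSize) {
        List<ServerState> started = new ArrayList<>();
        for (ServerState server : startOrder) {
            started.add(server);
            if (started.size() > ensembleSize / 2) {
                ServerState best = started.get(0);
                for (ServerState candidate : started) {
                    if (beats(candidate, best)) best = candidate;
                }
                return best.myid;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        ServerState s1 = new ServerState(1, 0, 0), s2 = new ServerState(2, 0, 0),
                s3 = new ServerState(3, 0, 0), s4 = new ServerState(4, 0, 0),
                s5 = new ServerState(5, 0, 0);
        System.out.println(electInStartOrder(Arrays.asList(s1, s2, s3, s4, s5), 5)); // 3
        System.out.println(electInStartOrder(Arrays.asList(s5, s4, s3, s2, s1), 5)); // 5
    }
}
```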

 

Core election classes and call flow:

org.apache.zookeeper.server.quorum.Election

org.apache.zookeeper.server.quorum.FastLeaderElection implements Election

         lookForLeader

                  sendNotifications

                  totalOrderPredicate

                  termPredicate

                          org.apache.zookeeper.server.quorum.flexible.QuorumMaj

                                   containsQuorum

                  QuorumPeer.setPeerState

 

Majority check:

    public boolean containsQuorum(HashSet<Long> set){
        return (set.size() > half);
    }
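
In the call flow, `termPredicate` relies on `containsQuorum` to decide whether a vote has enough support. The following is a simplified sketch of that idea, not ZooKeeper's actual code. It assumes a plain majority quorum with `half = ensembleSize / 2`, and `VoteTallySketch`, `SimpleVote`, and `hasQuorum` are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical tally over received votes; not ZooKeeper code.
final class VoteTallySketch {
    static final class SimpleVote {
        final long id;
        final long zxid;
        final long epoch;

        SimpleVote(long id, long zxid, long epoch) {
            this.id = id;
            this.zxid = zxid;
            this.epoch = epoch;
        }

        boolean sameAs(SimpleVote other) {
            return id == other.id && zxid == other.zxid && epoch == other.epoch;
        }
    }

    /**
     * Collects the ids of all servers whose vote equals the proposal and checks
     * whether they form a strict majority of the ensemble.
     */
    static boolean hasQuorum(Map<Long, SimpleVote> recvset, SimpleVote proposal, int ensembleSize) {
        Set<Long> supporters = new HashSet<>();
        for (Map.Entry<Long, SimpleVote> entry : recvset.entrySet()) {
            if (proposal.sameAs(entry.getValue())) {
                supporters.add(entry.getKey());
            }
        }
        int half = ensembleSize / 2;  // assumption: simple majority quorum
        return supporters.size() > half;
    }

    public static void main(String[] args) {
        SimpleVote voteFor3 = new SimpleVote(3, 0, 0);
        Map<Long, SimpleVote> recvset = new HashMap<>();
        recvset.put(1L, voteFor3);
        recvset.put(2L, voteFor3);
        System.out.println(hasQuorum(recvset, voteFor3, 5)); // false: 2 of 5
        recvset.put(3L, voteFor3);
        System.out.println(hasQuorum(recvset, voteFor3, 5)); // true: 3 of 5
    }
}
```

With an even ensemble the same rule still requires strictly more than half, so 2 supporters out of 4 are not enough.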

Vote comparison:

    protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
        LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
                Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
        if(self.getQuorumVerifier().getWeight(newId) == 0){
            return false;
        }

        /*
         * We return true if one of the following three cases hold:
         * 1- New epoch is higher
         * 2- New epoch is the same as current epoch, but new zxid is higher
         * 3- New epoch is the same as current epoch, new zxid is the same
         *  as current zxid, but server id is higher.
         */

        return ((newEpoch > curEpoch) ||
                ((newEpoch == curEpoch) &&
                ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
    }