Distributed Systems Principles: CAP, 2PC, and 3PC

Date: 2019-12-14

1. CAP Theory

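CAP is the most discussed theory in distributed systems, especially in distributed storage. "What is the CAP theorem?" ranks No. 1 among the FAQs in Quora's distributed systems category. CAP is also widely known among programmers, but there is more to it than "C, A, and P cannot all be satisfied at once; you can pick at most two of the three." What follows tries to bring together various viewpoints and explain CAP from the angles of its history and of engineering practice.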

The CAP theorem

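CAP was proposed by Eric Brewer at the PODC conference in 2000 [1][2]. It is a conjecture about data consistency, service availability, and partition tolerance that Brewer arrived at while building search engines and distributed web caches at Inktomi [3]: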

It is impossible for a web service to provide the three following guarantees: Consistency, Availability and Partition-tolerance.

The conjecture was proved two years later [4] and became the CAP theorem we know today:

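Consistency: if the system reports success for a write, every subsequent read must return the new data; if it reports failure, no read may return that data. To the caller the data is strongly consistent (also called atomic or linearizable consistency) [5].
Availability: every read and write request receives a response within a bounded time; requests terminate and never wait forever.
Partition tolerance: when the network is partitioned, the separated nodes can still serve requests normally.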

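Partition literally means a network partition: network problems split the system into several separate parts. Some may say that network partitions are very unlikely, so can we ignore P and just guarantee CA [8]? To understand P, look back at how P is defined in the CAP proof [4]: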

In order to model partition tolerance, the network will be allowed to lose arbitrarily many messages sent from one node to another.

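In practice we face an unreliable network and machines that crash with some probability. Both can cause a partition, so in a distributed system implementation P is a requirement, not an option.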

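For distributed systems engineering, a more fitting statement of CAP is: given that partition tolerance must be satisfied, no algorithm can provide both data consistency and service availability: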

In a network subject to communication failures, it is impossible for any web service to implement an atomic read/write shared memory that guarantees a response to every request.
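The trade-off can be illustrated with a toy model. The sketch below assumes two in-memory replicas and a hypothetical `mode` switch ("CP" or "AP"); none of these names come from the original article. Under a partition the CP variant refuses the write, while the AP variant answers but lets the replicas diverge.

```python
class Replica:
    def __init__(self):
        self.value = 0  # last value this replica has stored


class UnavailableError(Exception):
    """Raised when the CP variant refuses a request during a partition."""


def write(replicas, partitioned, mode, value):
    """Write through replicas[0]; replication to replicas[1] fails while partitioned."""
    if partitioned and mode == "CP":
        # Consistency first: reject the write rather than let the replicas diverge.
        raise UnavailableError("cannot reach all replicas")
    replicas[0].value = value
    if not partitioned:
        replicas[1].value = value  # replication only succeeds when connected


def read(replicas, index):
    return replicas[index].value


# AP: every request gets an answer, but the isolated replica serves stale data.
ap = [Replica(), Replica()]
write(ap, partitioned=True, mode="AP", value=1)
ap_reads = (read(ap, 0), read(ap, 1))

# CP: the write is refused, so both replicas still agree.
cp = [Replica(), Replica()]
try:
    write(cp, partitioned=True, mode="CP", value=1)
    cp_rejected = False
except UnavailableError:
    cp_rejected = True
cp_reads = (read(cp, 0), read(cp, 1))
```

In this model `ap_reads` shows two different values for the same item, and `cp_reads` shows agreement at the cost of a rejected request.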

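Consistency in the CAP proof means strong consistency. Strong consistency requires the callee, made up of multiple nodes, to behave like a single node with atomic operations, which constrains both the timing and the ordering of data. Relaxing these requirements gives other consistency models: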

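Sequential consistency [13]: real-time ordering is not required. If operation A precedes operation B and every caller's reads after B also see the result of A, sequential consistency holds.
Eventual consistency [14]: the timing requirement is relaxed. At some point after the callee has completed the operation and responded, the data on the callee's nodes eventually converges.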

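In the CAP theorem, availability means that every read and write operation must terminate. In practice, availability means different things from the caller's and the callee's points of view. When P (a network partition) occurs, the caller can be restricted to read operations, trading part of availability for data consistency.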

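In engineering practice, a common approach is asynchronous replication combined with quorum/NRW. To the caller the data looks strongly consistent while the callee's nodes are only eventually consistent; to the caller the service looks available while the callee tolerates some nodes being unavailable (or cut off by the network).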
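A minimal sketch of the quorum/NRW idea, assuming N replicas, a read quorum R, and a write quorum W with R + W > N so that every read quorum overlaps the latest write quorum. The `QuorumStore` class and the single in-process version counter are illustrative assumptions, not taken from any real system.

```python
class QuorumStore:
    """Toy N/R/W store; one in-process coordinator assigns versions (an assumption)."""

    def __init__(self, n, r, w):
        if r + w <= n:
            raise ValueError("need R + W > N so read and write quorums overlap")
        self.n, self.r, self.w = n, r, w
        self.version = 0
        self.replicas = [(0, None) for _ in range(n)]  # (version, value) per replica

    def write(self, value, reachable):
        """Succeed only if at least W replicas are reachable to acknowledge."""
        if len(reachable) < self.w:
            return False
        self.version += 1
        for i in reachable:
            self.replicas[i] = (self.version, value)
        return True

    def read(self, reachable):
        """Ask R reachable replicas and return the value with the highest version."""
        if len(reachable) < self.r:
            raise RuntimeError("not enough replicas for a read quorum")
        replies = [self.replicas[i] for i in reachable[: self.r]]
        return max(replies, key=lambda pair: pair[0])[1]


store = QuorumStore(n=3, r=2, w=2)
wrote = store.write("v1", reachable=[0, 1])   # replica 2 is cut off and stays stale
latest = store.read(reachable=[2, 1])         # the quorum includes one up-to-date replica
rejected = store.write("v2", reachable=[2])   # only one replica reachable: below W
```

Replica 2 never saw the write, yet the read still returns the new value because any two replicas include at least one that acknowledged it. That is how the caller sees consistent data while individual nodes lag behind.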

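In a distributed system, the network formed by the nodes should be fully connected. Failures, however, can disconnect some nodes from others, splitting the network into several regions, with the data scattered across the disconnected regions. This is a partition.
If a data item is stored on only one node, then after a partition the parts that cannot reach that node cannot access the item. Such a partition cannot be tolerated.
The way to improve partition tolerance is to replicate each data item to multiple nodes, so that after a partition the item may still be present in every region. Tolerance improves.
Replicating data to multiple nodes raises a consistency problem: the copies on different nodes may disagree. To guarantee consistency, every write must wait until all nodes have written successfully, and that waiting in turn hurts availability.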

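In short, the more nodes hold the data, the higher the partition tolerance, but the more data must be replicated and updated and the harder consistency is to guarantee. To guarantee consistency, updating all nodes takes longer, and availability drops.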
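A rough back-of-the-envelope sketch of this trade-off, under the simplifying assumption (not made in the original) that each node is reachable independently with probability p. A write that must reach all n nodes succeeds with probability p^n, while the data stays readable from at least one node with probability 1 - (1 - p)^n.

```python
def write_all_availability(p, n):
    """Probability that a write reaching all n nodes succeeds."""
    return p ** n


def any_replica_availability(p, n):
    """Probability that at least one of n replicas is reachable."""
    return 1 - (1 - p) ** n


# Compare 1, 3, and 5 replicas with each node reachable 99% of the time.
for n in (1, 3, 5):
    print(n, write_all_availability(0.99, n), any_replica_availability(0.99, n))
```

Adding replicas pushes the second number up and the first number down, which is the tension the paragraph above describes.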

2. 2PC

Wikipedia: https://en.wikipedia.org/wiki/Two-phase_commit_protocol

The two-phase commit protocol (2PC) is a type of atomic commitment protocol (ACP). It is a distributed algorithm that coordinates all the processes that participate in a distributed atomic transaction on whether to commit or abort (roll back) the transaction (it is a specialized type of consensus protocol).

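In a distributed system, each node knows for certain whether its own part of a transaction succeeded or failed, but it cannot know how the transaction went on the other nodes. So when a transaction spans multiple nodes, a coordinator (Coordinator) is introduced to keep the transaction ACID. The other nodes are called participants (Participant; the Wikipedia excerpt below calls them cohorts). The coordinator schedules the participants' actions and ultimately decides whether they commit the transaction.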

Figure 1: 2PC. The coordinator's proposal passes and voters {1,2,3} reach a new consensus.

Phase 1: Commit-request phase (or voting phase)

The coordinator sends a query to commit message to all cohorts and waits until it has received a reply from all cohorts.
The cohorts execute the transaction up to the point where they will be asked to commit. They each write an entry to their undo log and an entry to their redo log.
Each cohort replies with an agreement message (cohort votes Yes to commit), if the cohort's actions succeeded, or an abort message (cohort votes No, not to commit), if the cohort experiences a failure that will make it impossible to commit.

Phase 2: Commit phase (or completion phase)

Success

If the coordinator received an agreement message from all cohorts during the commit-request phase:

The coordinator sends a commit message to all the cohorts.
Each cohort completes the operation, and releases all the locks and resources held during the transaction.
Each cohort sends an acknowledgment to the coordinator.
The coordinator completes the transaction when all acknowledgments have been received.

Failure

If any cohort votes No during the commit-request phase (or the coordinator's timeout expires):

The coordinator sends a rollback message to all the cohorts.
Each cohort undoes the transaction using the undo log, and releases the resources and locks held during the transaction.
Each cohort sends an acknowledgement to the coordinator.
The coordinator undoes the transaction when all acknowledgements have been received.

Message flow

Coordinator                                         Cohort
                              QUERY TO COMMIT
                -------------------------------->
                              VOTE YES/NO           prepare*/abort*
                <-------------------------------
commit*/abort*                COMMIT/ROLLBACK
                -------------------------------->
                              ACKNOWLEDGMENT        commit*/abort*
                <--------------------------------  
end

An * next to the record type means that the record is forced to stable storage.
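The two phases above can be sketched as a single-process simulation. This is only an illustration under strong assumptions: no network, no crashes, and no timeouts. The `Cohort` class, its `can_commit` flag, and the in-memory `log` list standing in for stable storage are all hypothetical.

```python
class Cohort:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"
        self.log = []  # stands in for records forced to stable storage

    def prepare(self):
        """Phase 1: do the work up to the commit point, then vote."""
        if self.can_commit:
            self.log.append("prepare")
            self.state = "prepared"
            return "YES"
        self.log.append("abort")
        self.state = "aborted"
        return "NO"

    def finish(self, decision):
        """Phase 2: apply the coordinator's decision and acknowledge it."""
        if self.state == "prepared":
            self.log.append(decision)
            self.state = "committed" if decision == "commit" else "aborted"
        return "ACK"


def two_phase_commit(cohorts):
    """Coordinator: collect every vote, then broadcast one decision."""
    votes = [c.prepare() for c in cohorts]                      # QUERY TO COMMIT / VOTE
    decision = "commit" if all(v == "YES" for v in votes) else "abort"
    acks = [c.finish(decision) for c in cohorts]                # COMMIT or ROLLBACK / ACK
    return decision if len(acks) == len(cohorts) else None


happy = [Cohort("a"), Cohort("b")]
happy_result = two_phase_commit(happy)

unhappy = [Cohort("a"), Cohort("b", can_commit=False)]
unhappy_result = two_phase_commit(unhappy)
```

A single No vote is enough to roll back every cohort, including those that voted Yes and were already prepared.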

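Drawbacks of 2PC:

1. Synchronous blocking

After a cohort has sent an agreement message to the coordinator, the cohort will block until a commit or rollback is received.

2. Single point of failure

The coordinator is a single point of failure. If it crashes during the commit phase, the cohorts block.

3. Data inconsistency

If the coordinator crashes after sending only some of the commit requests, the cohorts that received a commit request apply it while the others do not, leaving the data inconsistent.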

The flaw in 2PC

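2PC cannot cope with fail-stop node failures. Consider the situation in the figure below: the coordinator and voter3 both crash during the commit phase, and voter1 and voter2 have not received the commit message. voter1 and voter2 are now stuck, because they cannot tell which of two scenarios they are in:

(1) the previous round passed unanimously, voter3 was the first to receive the commit message, and it crashed after committing;

(2) voter3 voted no in the previous round, so the proposal never passed.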

Figure 3: 2PC. The coordinator and voter3 crash; voters {1,2} cannot determine the current state and are stuck.

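2PC fails in this fail-stop case because a voter commits as soon as it learns the result of the propose phase, without first telling the other voters that it has received that result. So when the coordinator and one voter both go down, the remaining voters can neither reconstruct the result of the propose phase nor tell whether the missing voter intended to commit or has already committed. 3PC was introduced to solve this problem.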
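The ambiguity can be made concrete with a small simulation. The sketch assumes that a voter knows only its own vote and whatever decision message it has received; `run_round` and `survivors` are hypothetical helper names.

```python
def run_round(votes, delivered_to):
    """Run 2PC until the coordinator crashes, having sent its decision only to `delivered_to`."""
    decision = "commit" if all(v == "YES" for v in votes.values()) else "abort"
    views = {}
    for voter, vote in votes.items():
        received = decision if voter in delivered_to else None
        views[voter] = (vote, received)  # everything this voter knows locally
    return decision, views


def survivors(views, crashed):
    """Local views of the voters that are still running."""
    return {voter: view for voter, view in views.items() if voter not in crashed}


# Scenario (1): unanimous yes; only voter3 receives commit before the crashes.
decision1, views1 = run_round(
    {"voter1": "YES", "voter2": "YES", "voter3": "YES"}, delivered_to={"voter3"}
)
# Scenario (2): voter3 voted no; nobody receives the decision before the crashes.
decision2, views2 = run_round(
    {"voter1": "YES", "voter2": "YES", "voter3": "NO"}, delivered_to=set()
)

same_view = survivors(views1, {"voter3"}) == survivors(views2, {"voter3"})
```

The two rounds end with opposite decisions, yet voter1 and voter2 hold exactly the same local information in both, so no rule based on that information alone can pick the right outcome.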

3. 3PC

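Besides introducing timeouts, 3PC splits the prepare phase of 2PC in two, giving three phases: CanCommit, PreCommit, and DoCommit.

The Commit_Request phase of 2PC corresponds to CanCommit + PreCommit in 3PC.

The Commit phase of 2PC corresponds to DoCommit in 3PC.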

1. Coordinator: sends a canCommit? message to the cohorts and moves to the waiting state.

2. Cohorts: receive a canCommit? message from the coordinator. If a cohort agrees, it sends a Yes message to the coordinator and moves to the prepared state. Otherwise it sends a No message and moves to the abort state.

3. Coordinator: If there is a failure, timeout, or if the coordinator receives a No message in the waiting state, the coordinator aborts the transaction and sends an abort message to all cohorts. Otherwise the coordinator will receive Yes messages from all cohorts within the time window, so it sends preCommit messages to all cohorts and moves to the prepared state.

4. Cohorts: In the prepared state, if the cohort receives an abort message from the coordinator, fails, or times out waiting for a commit, it aborts. If the cohort receives a preCommit message, it sends an ACK message back and awaits a final commit or abort.

5. Coordinator: If the coordinator succeeds in the prepared state, it will move to the commit state. However, if the coordinator times out while waiting for an acknowledgement from a cohort, it will abort the transaction.

6. Cohorts: If, after a cohort member receives a preCommit message, the coordinator fails or times out, the cohort member goes forward with the commit.
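The cohort side of these steps can be sketched as a small state machine. The state name "precommitted", the `doCommit` message string, and the `Cohort3PC` class are assumptions made for illustration; the transitions follow steps 2, 4, and 6 above.

```python
class Cohort3PC:
    def __init__(self, agrees=True):
        self.agrees = agrees
        self.state = "init"

    def on_message(self, msg):
        """Apply one coordinator message and return the reply, if any."""
        if self.state == "init" and msg == "canCommit?":
            self.state = "prepared" if self.agrees else "aborted"
            return "Yes" if self.agrees else "No"
        if self.state == "prepared" and msg == "preCommit":
            self.state = "precommitted"
            return "ACK"
        if self.state == "precommitted" and msg == "doCommit":
            self.state = "committed"
            return None
        if msg == "abort" and self.state in ("prepared", "precommitted"):
            self.state = "aborted"
        return None

    def on_timeout(self):
        """No word from the coordinator: abort before preCommit, commit after it."""
        if self.state == "prepared":
            self.state = "aborted"
        elif self.state == "precommitted":
            self.state = "committed"


# Normal run: all three messages arrive.
normal = Cohort3PC()
replies = [normal.on_message(m) for m in ("canCommit?", "preCommit", "doCommit")]

# Coordinator disappears before preCommit: the cohort aborts on timeout.
early = Cohort3PC()
early.on_message("canCommit?")
early.on_timeout()

# Coordinator disappears after preCommit: the cohort commits on timeout.
late = Cohort3PC()
late.on_message("canCommit?")
late.on_message("preCommit")
late.on_timeout()
```

The two timeout branches are what remove the indefinite blocking of 2PC: a cohort always has a default action once it stops hearing from the coordinator.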

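By adding the PreCommit phase, a voter learns the outcome of the propose-phase vote without committing yet; on entering the Commit phase, a voter can infer that every other voter also intends to commit, so it can commit safely.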

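In other words, 3PC adds a barrier to the commit phase of 2PC (in effect each voter tells all the others, "I have received the result of the proposal"). If the coordinator goes down before this barrier, the other voters can conclude that not every voter has received the result of the propose phase, so they abort or elect a new coordinator. If the coordinator goes down after the barrier, every voter commits with confidence, because it knows the other voters plan to do the same.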

Figure 4: 3PC. The coordinator's proposal passes and voters {1,2,3} reach a new consensus.

The flaws of 3PC

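3PC handles fail-stop failures effectively, but it cannot handle a network partition, where nodes cannot communicate with one another. Suppose that during the PreCommit phase the nodes are split in two, with the voters that received the preCommit message on one side and those that did not on the other. Each side may then elect a new coordinator, and the two coordinators may make different decisions.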

Figure 5: 3PC under a network partition. Voters {1,2,3} lose consensus.
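A minimal sketch of this split, assuming a simple recovery rule that is not spelled out in the original: a newly elected coordinator commits if any member on its side of the partition has seen preCommit, and aborts otherwise. The voter names and the `side_decision` helper are hypothetical.

```python
def side_decision(members, got_precommit):
    """Decision of a new coordinator that can only talk to `members`."""
    # Assumed recovery rule: commit if anyone on this side saw preCommit, else abort.
    return "commit" if any(m in got_precommit for m in members) else "abort"


# The partition hits during PreCommit: only voter1 received the message.
got_precommit = {"voter1"}
sides = [["voter1"], ["voter2", "voter3"]]

decisions = [side_decision(side, got_precommit) for side in sides]
diverged = len(set(decisions)) > 1
```

Each side acts reasonably on what it can see, and the result is still one side committing while the other aborts.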

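Pros and cons

Pros: it narrows the window in which participants block, and agreement can still be reached after a single point of failure.
Cons: it introduces the preCommit phase. If a network partition occurs during this phase and the coordinator cannot communicate with the participants, the participants still commit the transaction, leaving the data inconsistent.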

Neither two-phase nor three-phase commit fully solves the distributed consistency problem. Mike Burrows, the author of Google Chubby, said that "there is only one consensus protocol, and that's Paxos": all other approaches are just broken versions of Paxos.

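Besides network partitions, 3PC also cannot handle fail-recover errors. Put simply, suppose the coordinator crashes before receiving the acknowledgements of preCommit, and another voter takes over as coordinator and starts organizing all voters to commit. Meanwhile the original coordinator restarts, rejoins the network, and resumes the earlier round by sending abort to the voters, since it never received the preCommit acknowledgements. The original and the successor coordinator may then send contradictory commit and abort instructions to different nodes, so the nodes' states diverge.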

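This situation is equivalent to a more realistic, or more complex, assumption about the network environment: an asynchronous network, in which message delivery may take arbitrarily long. Solving it calls for the subject of the next article: Paxos.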

ref:

https://zhuanlan.zhihu.com/p/35298019