VMware Fault-Tolerant Virtual Machines -- Paper Translation

The Design of a Practical System for Fault-Tolerant Virtual Machines


Since I could not find an existing translation, I translated it myself to make it easier to summarize and recall later, so the quality of the translation is absolutely not guaranteed.

I only read chapters 1 and 2 to get the main ideas, and did not pay much attention to the technical details (chapter 3).

ABSTRACT

We have implemented a commercial enterprise-grade system for providing fault-tolerant virtual machines, based on the approach of replicating the execution of a primary virtual machine (VM) via a backup virtual machine on another server. We have designed a complete system in VMware vSphere 4.0 that is easy to use, runs on commodity servers, and typically reduces performance of real applications by less than 10%. In addition, the data bandwidth needed to keep the primary and secondary VM executing in lockstep is less than 20 Mbit/s for several real applications, which allows for the possibility of implementing fault tolerance over longer distances. An easy-to-use, commercial system that automatically restores redundancy after failure requires many additional components beyond replicated VM execution. We have designed and implemented these extra components and addressed many practical issues encountered in supporting VMs running enterprise applications. In this paper, we describe our basic design, discuss alternate design choices and a number of the implementation details, and provide performance results for both micro-benchmarks and real applications.

We have implemented a commercial, enterprise-grade system based on fault-tolerant virtual machines: it "replicates" the execution of a primary virtual machine via a backup virtual machine running on a different server. We have designed a complete, easy-to-use system in vSphere that runs on commodity servers and costs real applications less than 10% of their performance. Moreover, for most real applications the data bandwidth needed to keep the primary and backup VMs running in lockstep is under 20 Mbit/s, which makes fault tolerance across long distances possible. An easy-to-use commercial system that automatically restores redundancy after a failure also needs many components beyond replicated VM execution. We designed and implemented these extra components and dealt with the many problems encountered in supporting VMs that run enterprise applications. In this paper we describe our basic design, discuss alternative design options and some implementation details, and give performance data for micro-benchmarks and real applications.

1. INTRODUCTION

A common approach to implementing fault-tolerant servers is the primary/backup approach, where a backup server is always available to take over if the primary server fails. The state of the backup server must be kept nearly identical to the primary server at all times, so that the backup server can take over immediately when the primary fails, and in such a way that the failure is hidden to external clients and no data is lost. One way of replicating the state on the backup server is to ship changes to all state of the primary, including CPU, memory, and I/O devices, to the backup nearly continuously. However, the bandwidth needed to send this state, particularly changes in memory, can be very large.

A common approach to implementing fault-tolerant servers is primary/backup replication: a backup server is always ready to take over as soon as the primary fails. The backup's state must be kept nearly identical to the primary's at all times, so that when the primary goes down the backup can take over immediately, the failure stays invisible to external clients, and no data is lost. One way to replicate the primary's state on the backup is to continuously ship every change of the primary's state, including CPU, memory, and I/O devices. However, the bandwidth required to ship this state can be very high, especially for memory-related changes.

A different method for replicating servers that can use much less bandwidth is sometimes referred to as the state-machine approach. The idea is to model the servers as deterministic state machines that are kept in sync by starting them from the same initial state and ensuring that they receive the same input requests in the same order. Since most servers or services have some operations that are not deterministic, extra coordination must be used to ensure that a primary and backup are kept in sync. However, the amount of extra information needed to keep the primary and backup in sync is far less than the amount of state (mainly memory updates) that is changing in the primary.

Another way to replicate servers with much less bandwidth is the state-machine approach. It models each server as a deterministic state machine: if the machines start from the same initial state and then receive the same input requests in the same order, they stay in sync. However, most servers or services have some non-deterministic operations, so extra coordination is required to keep the primary and backup in sync. Even so, the extra information that has to be shipped is far smaller than the state updates of the machine itself.
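
As a toy illustration of the state-machine idea (my own sketch, not from the paper; the `Counter` machine is made up), the snippet below feeds the same ordered inputs to two deterministic state machines and checks that they end in the same state, without ever copying memory between them:

```go
package main

import "fmt"

// Counter is a trivially deterministic state machine: its next state
// depends only on its current state and the input it is given.
type Counter struct{ total int }

func (c *Counter) Apply(input int) { c.total += input }

func main() {
	primary, backup := &Counter{}, &Counter{}
	inputs := []int{3, 5, -2, 7} // same inputs, same order

	for _, in := range inputs {
		primary.Apply(in)
		backup.Apply(in) // replaying the identical input stream
	}
	// Both replicas reach the identical state without shipping memory updates.
	fmt.Println(primary.total == backup.total) // true
}
```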

Implementing coordination to ensure deterministic execution of physical servers is difficult, particularly as processor frequencies increase. In contrast, a virtual machine running on top of a hypervisor is an excellent platform for implementing the state-machine approach. A VM can be considered a well-defined state machine whose operations are the operations of the machine being virtualized (including all its devices). As with physical servers, VMs have some non-deterministic operations (e.g. reading a time-of-day clock or delivery of an interrupt), and so extra information must be sent to the backup to ensure that it is kept in sync. Since the hypervisor has full control over the execution of a VM, including delivery of all inputs, the hypervisor is able to capture all the necessary information about non-deterministic operations on the primary VM and to replay these operations correctly on the backup VM.

Implementing the coordination that guarantees deterministic execution of physical servers is very hard, especially as processor frequencies increase. In contrast, a virtual machine running on a hypervisor is an excellent platform for the state-machine approach. A VM is essentially a well-defined state machine whose operations are those of the machine being virtualized. Like physical servers, VMs have some non-deterministic operations, so extra coordination information must be sent to the backup to keep the two sides in sync. Because the hypervisor fully controls the VM's execution, including the delivery of all inputs, it can capture all the necessary information about non-deterministic operations on the primary VM and replay those operations correctly on the backup VM.

Hence, the state-machine approach can be implemented for virtual machines on commodity hardware, with no hardware modifications, allowing fault tolerance to be implemented immediately for the newest microprocessors. In addition, the low bandwidth required for the state-machine approach allows for the possibility of greater physical separation of the primary and the backup. For example, replicated virtual machines can be run on physical machines distributed across a campus, which provides more reliability than VMs running in the same building.

Therefore, the state-machine approach can be implemented by running virtual machines on commodity hardware, without special hardware modifications, and fault tolerance can be offered immediately even on the newest microprocessors. In addition, the low bandwidth of the state-machine approach makes it possible to place the primary and backup far apart physically. For example, the replicated virtual machines can run on physical machines spread across a campus rather than only within a single building.

We have implemented fault-tolerant VMs using the primary/backup approach on the VMware vSphere 4.0 platform, which runs fully virtualized x86 virtual machines in a highly-efficient manner. Since VMware vSphere implements a complete x86 virtual machine, we are automatically able to provide fault tolerance for any x86 operating systems and applications. The base technology that allows us to record the execution of a primary and ensure that the backup executes identically is known as deterministic replay. VMware vSphere Fault Tolerance is based on deterministic replay, but adds in the necessary extra protocols and functionality to build a complete fault-tolerant system. In addition to providing hardware fault tolerance, our system automatically restores redundancy after a failure by starting a new backup virtual machine on any available server in the local cluster. At this time, the production versions of both deterministic replay and VMware FT support only uni-processor VMs. Recording and replaying the execution of multi-processor VMs is still work in progress, with significant performance issues because nearly every access to shared memory can be a non-deterministic operation.

We have implemented fault-tolerant VMs on VMware vSphere 4.0; they are x86 virtual machines. Because VMware vSphere implements a complete x86 virtual machine, we can automatically provide fault tolerance for any x86 operating system and application. The technology that lets us record the primary's execution and make the backup execute equivalently is deterministic replay. VMware vSphere Fault Tolerance builds on deterministic replay but adds the extra protocols and functionality needed for a complete fault-tolerant system. Besides hardware fault tolerance, our system automatically restores redundancy after a failure by starting a new backup VM on any available server in the local cluster. So far, VMware FT supports only uni-processor VMs; recording and replaying the execution of multi-processor VMs runs into performance problems, because almost every access to shared memory can become a non-deterministic operation.

Bressoud and Schneider describe a prototype implementation of fault-tolerant VMs for the HP PA-RISC platform. Our approach is similar, but we have made some fundamental changes for performance reasons and investigated a number of design alternatives. In addition, we have had to design and implement many additional components in the system and deal with a number of practical issues to build a complete system that is efficient and usable by customers running enterprise applications. Similar to most other practical systems discussed, we only attempt to deal with fail-stop failures, which are server failures that can be detected before the failing server causes an incorrect externally visible action.

Bressoud and Schneider describe a prototype implementation of fault-tolerant VMs on the HP PA-RISC platform. Our approach is similar, but we made some fundamental changes for performance reasons and explored a large number of alternative designs. In addition, we designed and implemented many extra components in the system and handled a large number of practical issues to build a complete, efficient system able to run enterprise applications. Like most other practical systems, we only attempt to handle fail-stop failures, i.e. server failures that the system can detect before the failing server produces an externally visible incorrect action.

The rest of the paper is organized as follows. First, we describe our basic design and detail our fundamental protocols that ensure that no data is lost if a backup VM takes over after a primary VM fails. Then, we describe in detail many of the practical issues that must be addressed to build a robust, complete and automated system. We also describe several design choices that arise for implementing fault-tolerant VMs and discuss the tradeoffs in these choices. Next, we give performance results for our implementation for some benchmarks and some real enterprise applications. Finally, we describe related work and conclude.

The rest of the paper is organized as follows. First, we describe the basic design of VM-FT and the details of the fundamental protocols, which guarantee that no data is lost if a backup VM takes over after the primary fails. Then, we describe many practical issues that must be addressed. We also describe several design choices that arise when implementing fault-tolerant VMs and discuss the trade-offs among them. Next, we give performance results of our implementation on some benchmarks and real enterprise applications. Finally, we describe related work and conclude.

2. BASIC FT DESIGN

Figure 1 shows the basic setup of our system for fault-tolerant VMs. For a given VM for which we desire to provide fault tolerance (the primary VM), we run a backup VM on a different physical server that is kept in sync and executes identically to the primary virtual machine, though with a small time lag. We say that the two VMs are in virtual lock-step. The virtual disks for the VMs are on shared storage (such as a Fibre Channel or iSCSI disk array), and therefore accessible to the primary and backup VM for input and output. (We will discuss a design in which the primary and backup VM have separate non-shared virtual disks in Section 4.1.) Only the primary VM advertises its presence on the network, so all network inputs come to the primary VM. Similarly, all other inputs (such as keyboard and mouse) go only to the primary VM.

Figure 1 shows the basic setup of the VM-FT system. For a VM that we want to protect with fault tolerance (the primary VM), we run a backup VM on a different physical machine; it is kept in sync with the primary and executes identically, though with a small time lag. We call this virtual lock-step. The VMs' virtual disks are on shared storage and are therefore accessible to the primary and backup VM for input and output. Only the primary VM advertises its presence on the network, so all network inputs go to the primary; similarly, all other inputs go only to the primary VM.

All input that the primary VM receives is sent to the backup VM via a network connection known as the logging channel. For server workloads, the dominant input traffic is network and disk. Additional information, as discussed below in Section 2.1, is transmitted as necessary to ensure that the backup VM executes non-deterministic operations in the same way as the primary VM. The result is that the backup VM always executes identically to the primary VM. However, the outputs of the backup VM are dropped by the hypervisor, so only the primary produces actual outputs that are returned to clients. As described in Section 2.2, the primary and backup VM follow a specific protocol, including explicit acknowledgments by the backup VM, in order to ensure that no data is lost if the primary fails.

All input received by the primary VM is sent to the backup VM over a network connection called the logging channel. For server workloads, the dominant input traffic is network and disk. Some additional control information is also sent to the backup so that it executes non-deterministic operations exactly as the primary did. The backup VM's outputs, however, are dropped by the hypervisor, so only the primary produces actual outputs that are returned to clients. As described in Section 2.2, the primary and backup follow a specific protocol, including explicit acknowledgments from the backup VM, to ensure that no data is lost.

To detect if a primary or backup VM has failed, our system uses a combination of heartbeating between the relevant servers and monitoring of the traffic on the logging channel. In addition, we must ensure that only one of the primary or backup VM takes over execution, even if there is a split-brain situation where the primary and backup servers have lost communication with each other.

To detect whether the primary or backup VM has failed, our system uses heartbeats between the relevant servers and also monitors the traffic on the logging channel. In addition, we must make sure that only one of the primary or backup VM takes over execution, even in a split-brain situation where the primary and backup servers have lost communication with each other.

In the following sections, we provide more details on several important areas. In Section 2.1, we give some details on the deterministic replay technology that ensures that primary and backup VMs are kept in sync via the information sent over the logging channel. In Section 2.2, we describe a fundamental rule of our FT protocol that ensures that no data is lost if the primary fails. In Section 2.3, we describe our methods for detecting and responding to a failure in a correct fashion.

The following sections give more detail on several important areas. Section 2.1 covers the deterministic replay technology, which keeps the primary and backup VMs in sync via the information sent over the logging channel. Section 2.2 describes the fundamental rule of our FT protocol that guarantees no data is lost if the primary fails. Section 2.3 describes how we detect and respond to failures correctly.

2.1 Deterministic Replay Implementation

As we have mentioned, replicating server (or VM) execution can be modeled as the replication of a deterministic state machine. If two deterministic state machines are started in the same initial state and given the same inputs in the same order, then they will go through the same sequences of states and produce the same outputs. A virtual machine has a broad set of inputs, including incoming network packets, disk reads, and input from the keyboard and mouse. Non-deterministic events (such as virtual interrupts) and non-deterministic operations (such as reading the clock cycle counter of the processor) also affect the VM's state. This presents three challenges for replicating execution of any VM running any operating system and workload: (1) correctly capturing all the input and non-determinism necessary to ensure deterministic execution of a backup virtual machine, (2) correctly applying the inputs and non-determinism to the backup virtual machine, and (3) doing so in a manner that doesn't degrade performance. In addition, many complex operations in x86 microprocessors have undefined, hence non-deterministic, side effects. Capturing these undefined side effects and replaying them to produce the same state presents an additional challenge.

As mentioned, replicating a server's execution can be modeled as replicating a deterministic state machine. If two deterministic state machines start in the same state and then go through the same sequence of inputs, they produce the same sequence of states and the same outputs. A virtual machine has many kinds of inputs, including network packets, data read from disk, and keyboard and mouse input. Non-deterministic events (such as virtual interrupts) and non-deterministic operations (such as reading the CPU's cycle counter) also affect the VM's state. This leads to three challenges for replicating the execution of a VM running any operating system and workload: (1) correctly capturing all inputs and the non-determinism needed to guarantee deterministic execution of the backup VM; (2) correctly applying those inputs and non-deterministic behaviors to the backup VM; (3) doing (1) and (2) without hurting performance. Beyond these, many complex operations on x86 microprocessors have undefined side effects; capturing those side effects and replaying them so the backup reaches the same state is an additional challenge.

VMware deterministic replay provides exactly this functionality for x86 virtual machines on the VMware vSphere platform. Deterministic replay records the inputs of a VM and all possible non-determinism associated with the VM execution in a stream of log entries written to a log file. The VM execution may be exactly replayed later by reading the log entries from the file. For non-deterministic operations, sufficient information is logged to allow the operations to be reproduced with the same state change and output. For non-deterministic events such as timer or IO completion interrupts, the exact instruction at which the event occurred is also recorded. During replay, the event is delivered at the same point in the instruction stream. VMware deterministic replay implements an efficient event recording and event delivery mechanism that employs various techniques, including the use of hardware performance counters developed in conjunction with AMD and Intel.

On the vSphere platform, VMware implements deterministic replay for x86 virtual machines. Deterministic replay records the VM's inputs and all non-determinism associated with its execution in a stream of log entries. By reading the log file, the VM's execution can later be replayed exactly. For non-deterministic operations, enough information is logged so that the operation can be reproduced with the same state change and output. For non-deterministic events such as timer or IO completion interrupts, the exact instruction at which the event occurred is also recorded; during replay, the event is delivered at the same point in the instruction stream. VMware deterministic replay implements an efficient event recording and delivery mechanism.
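
A rough sketch of what such a replay log entry might carry (the field names are my own guesses, not VMware's actual on-the-wire format): the event payload plus the exact instruction position at which it must be re-delivered, so the backup injects it at the identical point in its instruction stream.

```go
package main

import "fmt"

// LogEntry is a hypothetical replay record: it pairs an input or event with
// the exact execution point at which it must be re-delivered on the backup.
type LogEntry struct {
	Kind       string // "net-input", "disk-read", "timer-irq", "rdtsc", ...
	InstrCount uint64 // instruction count where the event occurred on the primary
	Payload    []byte // input data, or the value returned by a non-deterministic op
}

// deliverAt replays entries in order, injecting each one only once the
// backup's own instruction counter has reached the recorded position.
func deliverAt(backupInstrCount uint64, pending []LogEntry) []LogEntry {
	for len(pending) > 0 && pending[0].InstrCount <= backupInstrCount {
		e := pending[0]
		fmt.Printf("replaying %s at instruction %d\n", e.Kind, e.InstrCount)
		pending = pending[1:]
	}
	return pending
}

func main() {
	entries := []LogEntry{
		{Kind: "net-input", InstrCount: 1000, Payload: []byte("GET /")},
		{Kind: "timer-irq", InstrCount: 2500},
	}
	entries = deliverAt(1200, entries) // delivers only the network input
	entries = deliverAt(3000, entries) // now the timer interrupt is delivered
	_ = entries
}
```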

​ Bressoud and Schneider mention dividing the execution of VM into epochs, where non-deterministic events such as interrupts are only delivered at the end of an epoch. The notion of epoch seems to be used as a batching mechanism because it's too expensive to deliver each interrupt separately at the exact instruction where it occurred. However, our event delivery mechanism is efficient enough that VMware deterministic replay has no need to use epochs. Each interrupt is recorded as it occurs and efficiently delivered at the appropriate instruction while being replayed.

Bressoud and Schneider mention dividing the VM's execution into epochs, where non-deterministic events such as interrupts are delivered only at the end of an epoch. The epoch is essentially a batching mechanism, because delivering each interrupt individually at the exact instruction where it occurred is too expensive. However, our event delivery mechanism is efficient enough that VMware deterministic replay has no need for epochs. Every interrupt is recorded as it occurs and is efficiently delivered at the right instruction during replay.

2.2 FT Protocol

For VMware FT, we use deterministic replay to produce the necessary log entries to record the execution of the primary VM, but instead of writing the log entries to disk, we send them to the backup VM via the logging channel. The backup VM replays the entries in real time, and hence executes identically to the primary VM. However, we must augment the log entries with a strict FT protocol on the logging channel in order to ensure that we achieve fault tolerance. Our fundamental requirement is the following:

Output Requirement: if the backup VM ever takes over after a failure of the primary, the backup VM will continue executing in a way that is entirely consistent with all outputs that the primary VM has sent to the external world.

For VMware FT, we produce the necessary log entries to record the primary VM's execution, but instead of writing them to disk, we send them to the backup VM over the logging channel. The backup VM replays the entries in real time and therefore executes "like" the primary VM. To guarantee fault tolerance, our fundamental requirement is the following:

Output Requirement: if the backup VM takes over after the primary VM fails, the backup VM continues executing in a way that is entirely consistent with all output the primary VM has sent to the external world.


Note that after a failover occurs (i.e. the backup VM takes over after the failure of the primary VM), the backup VM will likely start executing quite differently from the way the primary VM would have continued executing, because of the many non-deterministic events happening during execution. However, as long as the backup VM satisfies the Output Requirement, no externally visible state or data is lost during a failover to the backup VM, and the clients will notice no interruption or inconsistency in their service.

After a failover (i.e. the backup takes over from the failed primary), the backup will probably start executing quite differently from how the primary would have continued, because many non-deterministic events occur during execution. However, as long as the backup satisfies the Output Requirement and no externally visible state or data is lost while it takes over, clients will not observe any interruption or inconsistency.

The Output Requirement can be ensured by delaying any external output (typically a network packet) until the backup VM has received all information that will allow it to replay execution at least to the point of that output operation. One necessary condition is that the backup VM must have received all log entries generated prior to the output operation. These log entries will allow it to execute up to the point of the last log entry. However, suppose a failure were to happen immediately after the primary executed the output operation. The backup VM must know that it must keep replaying up to the point of the output operation and only "go live" (stop replaying and take over as the primary VM, as described in Section 2.3) at that point. If the backup were to go live at the point of the last log entry before the output operation, some non-deterministic event (e.g. timer interrupt delivered to the VM) might change its execution path before it executed the output operation.

The Output Requirement can be satisfied by delaying external output (especially outgoing network packets) until the backup has received all the information that lets it replay execution at least up to the point of that output. (In other words, the primary cannot emit the output right away; it must wait until the relevant log entries have reached the backup.) If the primary fails right after performing the output, the backup must replay up to the point of that output operation and go live exactly there; if it went live earlier, some non-deterministic event (e.g. a timer interrupt) could divert its execution and the output could be missed.

Given the above constraints, the easiest way to enforce the Output Requirement is to create a special log entry at each output operation. Then, the Output Requirement may be enforced by this specific rule:

Output Rule: the primary VM may not send an output to the external world, until the backup VM has received and acknowledged the log entry associated with the operation producing the output.

If the backup VM has received all the log entries, including the log entry for the output-producing operation, then the backup VM will be able to exactly reproduce the state of the primary VM at that output point, and so if the primary dies, the backup will correctly reach a state that is consistent with that output. Conversely, if the backup VM takes over without receiving all necessary log entries, then its state may quickly diverge such that it is inconsistent with the primary's output. The Output Rule is in some ways analogous to the approach described in [11], where an "externally synchronous" IO can actually be buffered, as long as it is actually written to disk before the next external communication.

If the backup has received all the log entries, including the one for the output operation, then when the primary dies the backup can reach a state that is consistent with the primary's output; otherwise things can go wrong. The essence of the Output Rule is that output to the external world may be buffered, as long as it is flushed before the next external communication.

Note that the Output Rule does not say anything about stopping the execution of the primary VM. We need only delay the sending of the output, but the VM itself can continue execution. Since operating systems do non-blocking network and disk outputs with asynchronous interrupts to indicate completion, the VM can easily continue execution and will not necessarily be immediately affected by the delay in the output. In contrast, previous work [3, 9] has typically indicated that the primary VM must be completely stopped prior to doing an output until the backup VM has acknowledged all necessary information from the primary VM. As an example, we show a chart illustrating the requirements of the FT protocol in Figure 2. This figure shows a timeline of events on the primary and backup VMs. The arrows going from the primary line to the backup line represent the transfer of log entries, and the arrows going from the backup line to the primary line represent acknowledgments. Information on asynchronous events, inputs, and output operations must be sent to the backup as log entries and acknowledged. As illustrated in the figure, an output to the external world is delayed until the primary VM has received an acknowledgment from the backup VM that it has received the log entry associated with the output operation. Given that the Output Rule is followed, the backup VM will be able to take over in a state consistent with the primary's last output.

Note that the Output Rule does not require the primary VM to stop executing. We only need to delay sending the output; the VM can keep doing other work. Because network and disk IO are non-blocking, the VM is largely unaffected by the output delay. The rest of the paragraph just walks through Figure 2.
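
A minimal sketch of how the Output Rule could be enforced (my own pseudo-implementation, not VMware's code; names are invented): each output is tagged with the sequence number of its log entry and held in a queue until the backup's acknowledgment covers that sequence number, while the primary keeps executing.

```go
package main

import "fmt"

// pendingOutput is an external output held back until the backup has
// acknowledged the log entry associated with producing it.
type pendingOutput struct {
	logSeq uint64 // sequence number of the output-producing log entry
	packet []byte
}

type primary struct {
	ackedSeq uint64          // highest log entry acknowledged by the backup
	queue    []pendingOutput // outputs delayed by the Output Rule
}

// produceOutput is called when the guest emits an external output; the
// primary does not stop, it just queues the packet behind its log entry.
func (p *primary) produceOutput(seq uint64, pkt []byte) {
	p.queue = append(p.queue, pendingOutput{logSeq: seq, packet: pkt})
	p.flush()
}

// onAck is called when the backup acknowledges log entries up to seq.
func (p *primary) onAck(seq uint64) {
	if seq > p.ackedSeq {
		p.ackedSeq = seq
	}
	p.flush()
}

// flush releases every queued output whose log entry has been acknowledged.
func (p *primary) flush() {
	for len(p.queue) > 0 && p.queue[0].logSeq <= p.ackedSeq {
		fmt.Printf("sending packet for log entry %d\n", p.queue[0].logSeq)
		p.queue = p.queue[1:]
	}
}

func main() {
	p := &primary{}
	p.produceOutput(42, []byte("reply")) // held: backup has not acked entry 42 yet
	p.onAck(42)                          // now the delayed packet is released
}
```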

​ We cannot guarantee that all outputs are produced exactly once in a failover situation. Without the use of transactions with two-phase commit when the primary intends to send an output, there is no way that the backup can determine if a primary crashed immediately before or after sending its last output. Fortunately, the network infrastructure (including the common use of TCP) is designed to deal with lost packets and identical (duplicate) packets. Note that incoming packets to the primary may also be lost during a failure of the primary and therefore won’t be delivered to the backup. However, incoming packets may be dropped for any number of reasons unrelated to server failure, so the network infrastructure, operating systems, and applications are all written to ensure that they can compensate for lost packets

We cannot guarantee that every output is produced exactly once in a failover. Simply put, if the primary crashes near an output operation, the backup cannot tell whether the primary died just before or just after sending its last output. Fortunately, the network infrastructure is designed to handle lost and duplicate packets, so duplicated output (at the packet level) is acceptable. Also note that while the primary is down the system does not respond to external input, so incoming packets are effectively dropped; this is fine, because the network infrastructure, operating systems, and applications all have ways to compensate for lost packets.

2.3 Detecting and Responding to Failure

As mentioned above, the primary and backup VMs must respond quickly if the other VM appears to have failed. If the backup VM fails, the primary VM will go live — that is, leave recording mode (and hence stop sending entries on the logging channel) and start executing normally. If the primary VM fails, the backup VM should similarly go live, but the process is a bit more complex. Because of its lag in execution, the backup VM will likely have a number of log entries that it has received and acknowledged, but have not yet been consumed because the backup VM hasn't reached the appropriate point in its execution yet. The backup VM must continue replaying its execution from the log entries until it has consumed the last log entry. At that point, the backup VM will stop replaying and start executing as a normal VM. In essence, the backup VM has been promoted to the primary VM (and is now missing a backup VM). Since it is no longer a backup VM, the new primary VM will now produce output to the external world when the guest OS does output operations. During the transition to normal mode, there may be some device-specific operations needed to allow this output to occur properly. In particular, for the purposes of networking, VMware FT automatically advertises the MAC address of the new primary VM on the network, so that physical network switches will know on what server the new primary VM is located. In addition, the newly promoted primary VM may need to reissue some disk IOs (as described in Section 3.4).

As above, each VM must react quickly when the other VM appears to have failed. If the backup fails, the primary stops sending log entries and simply runs as a normal single machine. If the primary fails, the backup goes live in a similar way, but the process is a bit more involved. Because of its execution lag, the backup may have received log entries that it has not yet consumed; it must keep replaying until it has consumed the last log entry, and only then go live and run as a normal VM. During the transition from backup to primary, some device-specific operations may be needed so the new primary can work properly; for example, for networking, the new primary advertises its MAC address on the network so that the outside world sees a consistent machine.

​ There are many possible ways to attempt to detect failure of the primary and backup VMs. VMware FT uses UDP heartbeating between servers that are running fault-tolerant VMs to detect when a server may have crashed. In addition, VMware FT monitors the logging traffic that is sent from the primary to the backup VM and the acknowledgments sent from the backup VM to the primary VM. Because of regular timer interrupts, the logging traffic should be regular and never stop for a functioning guest OS. Therefore, a halt in the flow of log entries or acknowledgments could indicate the failure of a VM. A failure is declared if heartbeating or logging traffic has stopped for longer than a specific timeout (on the order of a few seconds).

There are many possible ways to detect failures of the primary and backup; VMware FT uses UDP heartbeating between the servers. In addition, the logging traffic between primary and backup carries acknowledgments. Because of regular timer interrupts, the logging channel should see regular traffic for a functioning guest OS, so if the heartbeats or the logging traffic stop for longer than a specific timeout, a VM failure is declared. In other words, failures are detected with a timeout mechanism.
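
A simplified sketch of timeout-based failure detection (the type and the 3-second value are illustrative choices, not VMware's actual tuning): the detector remembers the last time a heartbeat or logging-channel traffic arrived and declares a failure once nothing has been seen for the timeout period.

```go
package main

import (
	"fmt"
	"time"
)

// failureDetector declares the peer dead when neither heartbeats nor
// logging-channel traffic have been observed within the timeout.
type failureDetector struct {
	lastSeen time.Time
	timeout  time.Duration // on the order of a few seconds, per the paper
}

// observe is called whenever a heartbeat, log entry, or ack arrives.
func (d *failureDetector) observe(now time.Time) { d.lastSeen = now }

// failed reports whether the peer has been silent for too long.
func (d *failureDetector) failed(now time.Time) bool {
	return now.Sub(d.lastSeen) > d.timeout
}

func main() {
	start := time.Now()
	d := &failureDetector{lastSeen: start, timeout: 3 * time.Second}

	d.observe(start.Add(1 * time.Second))             // heartbeat received
	fmt.Println(d.failed(start.Add(2 * time.Second))) // false: still alive
	fmt.Println(d.failed(start.Add(6 * time.Second))) // true: declare failure
}
```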

However, any such failure detection method is susceptible to a split-brain problem. If the backup server stops receiving heartbeats from the primary server, that may indicate that the primary server has failed, or it may just mean that all network connectivity has been lost between still functioning servers. If the backup VM then goes live while the primary VM is actually still running, there will likely be data corruption and problems for the clients communicating with the VM. Hence, we must ensure that only one of the primary or backup VM goes live when a failure is detected. To avoid split-brain problems, we make use of the shared storage that stores the virtual disks of the VM. When either a primary or backup VM wants to go live, it executes an atomic test-and-set operation on the shared storage. If the operation succeeds, the VM is allowed to go live. If the operation fails, then the other VM must have already gone live, so the current VM actually halts itself ("commits suicide"). If the VM cannot access the shared storage when trying to do the atomic operation, then it just waits until it can. Note that if shared storage is not accessible because of some failure in the storage network, then the VM would likely not be able to do useful work anyway because the virtual disks reside on the same shared storage. Thus, using shared storage to resolve split-brain situations does not introduce any extra unavailability.

However, the failure detection above cannot by itself solve the split-brain problem. If the link between the primary and backup breaks while both are actually running fine, each side concludes that the other has died and both try to become the new primary: that is split-brain.

VM-FT's solution is an atomic test-and-set operation on shared storage, similar in spirit to a compare-and-swap (CAS).

At its core this is a "who arbitrates the arbiter" kind of problem. (Raft later addresses it differently, with majority voting.)
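
A toy sketch of the go-live arbitration via an atomic test-and-set (this models the flag on shared storage with an in-memory atomic for illustration; the real system performs the operation on the shared disk array):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// goLiveFlag stands in for the test-and-set location on shared storage;
// 0 means nobody has gone live yet, 1 means some VM already has.
var goLiveFlag int32

// tryGoLive atomically claims the right to go live. Exactly one caller can
// win, even if primary and backup both attempt it during a split-brain.
func tryGoLive(name string) bool {
	if atomic.CompareAndSwapInt32(&goLiveFlag, 0, 1) {
		fmt.Println(name, "goes live")
		return true
	}
	fmt.Println(name, "loses the race and halts itself")
	return false
}

func main() {
	tryGoLive("backup VM")  // wins: becomes the new primary
	tryGoLive("primary VM") // loses: "commits suicide"
}
```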

One final aspect of the design is that once a failure has occurred and one of the VMs has gone live, VMware FT automatically restores redundancy by starting a new backup VM on another host. Though this process is not covered in most previous work, it is fundamental to making fault-tolerant VMs useful and requires careful design. More details are given in Section 3.1.

At run time the system detects lost redundancy and dynamically starts a new backup on another host.

3. PRACTICAL IMPLEMENTATION OF FT

Chapter 3 is mostly technical detail rather than overall architecture, so I did not translate it.

Section 2 described our fundamental design and protocols for FT. However, to create a usable, robust, and automatic system, there are many other components that must be designed and implemented.

3.1 Starting and Restarting FT VMs

One of the biggest additional components that must be designed is the mechanism for starting a backup VM in the same state as a primary VM. This mechanism will also be used when re-starting a backup VM after a failure has occurred. Hence, this mechanism must be usable for a running primary VM that is in an arbitrary state (i.e. not just starting up). In addition, we would prefer that the mechanism does not significantly disrupt the execution of the primary VM, since that will affect any current clients of the VM.

For VMware FT, we adapted the existing VMotion functionality of VMware vSphere. VMware VMotion [10] allows the migration of a running VM from one server to another server with minimal disruption — VM pause times are typically less than a second. We created a modified form of VMotion that creates an exact running copy of a VM on a remote server, but without destroying the VM on the local server. That is, our modified FT VMotion clones a VM to a remote host rather than migrating it. The FT VMotion also sets up a logging channel, and causes the source VM to enter logging mode as the primary, and the destination VM to enter replay mode as the new backup. Like normal VMotion, FT VMotion typically interrupts the execution of the primary VM by less than a second. Hence, enabling FT on a running VM is an easy, non-disruptive operation.

​ Another aspect of starting a backup VM is choosing a server on which to run it. Fault-tolerant VMs run in a cluster of servers that have access to shared storage, so all VMs can typically run on any server in the cluster. This flexibility allows VMware vSphere to restore FT redundancy even when one or more servers have failed. VMware vSphere implements a clustering service that maintains management and resource information. When a failure happens and a primary VM now needs a new backup VM to re-establish redundancy, the primary VM informs the clustering service that it needs a new backup. The clustering service determines the best server on which to run the backup VM based on resource usage and other constraints and invokes an FT VMotion to create the new backup VM. The result is that VMware FT typically can re-establish VM redundancy within minutes of a server failure, all without any noticeable interruption in the execution of a fault-tolerant VM

3.2 Managing the Logging Channel

There are a number of interesting implementation details in managing the traffic on the logging channel. In our implementation, the hypervisors maintain a large buffer for logging entries for the primary and backup VMs. As the primary VM executes, it produces log entries into the log buffer, and similarly, the backup VM consumes log entries from its log buffer. The contents of the primary's log buffer are flushed out to the logging channel as soon as possible, and log entries are read into the backup's log buffer from the logging channel as soon as they arrive. The backup sends acknowledgments back to the primary each time that it reads some log entries from the network into its log buffer. These acknowledgments allow VMware FT to determine when an output that is delayed by the Output Rule can be sent. Figure 3 illustrates this process.

If the backup VM encounters an empty log buffer when it needs to read the next log entry, it will stop execution until a new log entry is available. Since the backup VM is not communicating externally, this pause will not affect any clients of the VM. Similarly, if the primary VM encounters a full log buffer when it needs to write a log entry, it must stop execution until log entries can be flushed out. This stop in execution is a natural flow-control mechanism that slows down the primary VM when it is producing log entries at too fast a rate. However, this pause can affect clients of the VM, since the primary VM will be completely stopped and unresponsive until it can log its entry and continue execution. Therefore, our implementation must be designed to minimize the possibility that the primary log buffer fills up.

One reason that the primary log buffer may fill up is because the backup VM is executing too slowly and therefore consuming log entries too slowly. In general, the backup VM must be able to replay an execution at roughly the same speed as the primary VM is recording the execution. Fortunately, the overhead of recording and replaying in VMware deterministic replay is roughly the same. However, if the server hosting the backup VM is heavily loaded with other VMs (and hence overcommitted on resources), the backup VM may not be able to get enough CPU and memory resources to execute as fast as the primary VM, despite the best efforts of the backup hypervisor's VM scheduler. Beyond avoiding unexpected pauses if the log buffers fill up, there is another reason why we don't wish the execution lag to become too large. If the primary VM fails, the backup VM must "catch up" by replaying all the log entries that it has already acknowledged before it goes live and starts communicating with the external world. The time to finish replaying is basically the execution lag time at the point of the failure, so the time for the backup to go live is roughly equal to the failure detection time plus the current execution lag time. Hence, we don't wish the execution lag time to be large (more than a second), since that will add significant time to the failover time.

Therefore, we have an additional mechanism to slow down the primary VM to prevent the backup VM from getting too far behind. In our protocol for sending and acknowledging log entries, we send additional information to determine the real-time execution lag between the primary and backup VMs. Typically the execution lag is less than 100 milliseconds. If the backup VM starts having a significant execution lag (say, more than 1 second), VMware FT starts slowing down the primary VM by informing the scheduler to give it a slightly smaller amount of the CPU (initially by just a few percent). We use a slow feedback loop, which will try to gradually pinpoint the appropriate CPU limit for the primary VM that will allow the backup VM to match its execution. If the backup VM continues to lag behind, we continue to gradually reduce the primary VM's CPU limit. Conversely, if the backup VM catches up, we gradually increase the primary VM's CPU limit until the backup VM returns to having a slight lag.
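
A rough sketch of the kind of slow feedback loop described above (the thresholds and step size are illustrative guesses, not the actual tuning VMware uses): when the measured lag exceeds a bound, shave a few percent off the primary's CPU limit; when the backup catches up, give the CPU back.

```go
package main

import (
	"fmt"
	"time"
)

// adjustCPULimit nudges the primary's CPU limit (as a percentage) based on
// the measured execution lag of the backup, a few percent at a time.
func adjustCPULimit(cpuLimit float64, lag time.Duration) float64 {
	const (
		highLag = time.Second            // lag considered too large
		lowLag  = 100 * time.Millisecond // lag considered comfortable
		step    = 2.0                    // percent of CPU per adjustment
	)
	switch {
	case lag > highLag && cpuLimit > step:
		cpuLimit -= step // backup is falling behind: slow the primary down
	case lag < lowLag && cpuLimit < 100:
		cpuLimit += step // backup has caught up: restore the primary's CPU
	}
	return cpuLimit
}

func main() {
	limit := 100.0
	lags := []time.Duration{1500 * time.Millisecond, 1200 * time.Millisecond, 80 * time.Millisecond}
	for _, lag := range lags {
		limit = adjustCPULimit(limit, lag)
		fmt.Printf("lag=%v -> primary CPU limit %.0f%%\n", lag, limit)
	}
}
```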

Note that such slowdowns of the primary VM are very rare, and typically happen only when the system is under extreme stress. All the performance numbers of Section 5 include the cost of any such slowdowns.

3.3 Operation on FT VMs

Another practical matter is dealing with the various control operations that may be applied to the primary VM. For example, if the primary VM is explicitly powered off, the backup VM should be stopped as well, and not attempt to go live. As another example, any resource management change on the primary (such as increased CPU share) should also be applied to the backup. For these kinds of operations, special control entries are sent on the logging channel from the primary to the backup, in order to effect the appropriate operation on the backup.

In general, most operations on the VM should be initiated only on the primary VM. VMware FT then sends any necessary control entry to cause the appropriate change on the backup VM. The only operation that can be done independently on the primary and backup VMs is VMotion. That is, the primary and backup VMs can be VMotioned independently to other hosts. Note that VMware FT ensures that neither VM is moved to the server where the other VM is, since that situation would no longer provide fault tolerance.

VMotion of a primary VM adds some complexity over a normal VMotion, since the backup VM must disconnect from the source primary and re-connect to the destination primary VM at the appropriate time. VMotion of a backup VM has a similar issue, but adds an additional complexity. For a normal VMotion, we require that all outstanding disk IOs be quiesced (i.e. completed) just as the final switchover on the VMotion occurs. For a primary VM, this quiescing is easily handled by waiting until the physical IOs complete and delivering these completions to the VM. However, for a backup VM, there is no easy way to cause all IOs to be completed at any required point, since the backup VM must replay the primary VM's execution and complete IOs at the same execution point. The primary VM may be running a workload in which there are always disk IOs in flight during normal execution. VMware FT has a unique method to solve this problem. When a backup VM is at the final switchover point for a VMotion, it requests via the logging channel that the primary VM temporarily quiesce all of its IOs. The backup VM's IOs will then naturally be quiesced as well at a single execution point as it replays the primary VM's execution of the quiescing operation.

3.4 Implementation Issues for Disk IOs

There are a number of subtle implementation issues related to disk IO. First, given that disk operations are non-blocking and so can execute in parallel, simultaneous disk operations that access the same disk location can lead to non-determinism. Also, our implementation of disk IO uses DMA directly to/from the memory of the virtual machines, so simultaneous disk operations that access the same memory pages can also lead to non-determinism. Our solution is generally to detect any such IO races (which are rare), and force such racing disk operations to execute sequentially in the same way on the primary and backup.

Second, a disk operation can also race with a memory access by an application (or OS) in a VM, because the disk operation directly accesses the memory of a VM via DMA. For example, there could be a non-deterministic result if an application/OS in a VM is reading a memory block at the same time a disk read is occurring to that block. This situation is also unlikely, but we must detect it and deal with it if it happens. One solution is to set up page protection temporarily on pages that are targets of disk operations. The page protections result in a trap if the VM happens to make an access to a page that is also the target of an outstanding disk operation, and the VM can be paused until the disk operation completes. Because changing MMU protections on pages is an expensive operation, we choose instead to use bounce buffers. A bounce buffer is a temporary buffer that has the same size as the memory being accessed by a disk operation. A disk read operation is modified to read the specified data to the bounce buffer, and the data is copied to guest memory only as the IO completion is delivered. Similarly, for a disk write operation, the data to be sent is first copied to the bounce buffer, and the disk write is modified to write data from the bounce buffer. The use of the bounce buffer can slow down disk operations, but we have not seen it cause any noticeable performance loss.
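
A simplified sketch of the bounce-buffer idea for a disk read (function and parameter names are mine; the real code operates on guest physical pages inside the hypervisor): DMA lands in a private buffer, and guest memory is only touched at the point where the IO completion is delivered, which is a logged and therefore deterministically replayed event.

```go
package main

import "fmt"

// readViaBounceBuffer copies disk data into a private bounce buffer first,
// so the DMA never races with the guest touching its own memory; the guest
// page is updated only when the IO completion is delivered, a point that is
// logged and therefore replayed at the same place on the backup.
func readViaBounceBuffer(disk []byte, guestPage []byte, deliverCompletion func()) {
	bounce := make([]byte, len(guestPage))
	copy(bounce, disk) // stand-in for the DMA into the bounce buffer

	deliverCompletion()     // logged event: same instruction point on the backup
	copy(guestPage, bounce) // guest memory changes only at the completion point
}

func main() {
	disk := []byte("block-42")
	guestPage := make([]byte, len(disk))
	readViaBounceBuffer(disk, guestPage, func() { fmt.Println("IO completion delivered") })
	fmt.Printf("guest page now holds %q\n", guestPage)
}
```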

Third, there are some issues associated with disk IOs that are outstanding (i.e. not completed) on the primary when a failure happens, and the backup takes over. There is no way for the newly-promoted primary VM to be sure if the disk IOs were issued to the disk or completed successfully. In addition, because the disk IOs were not issued externally on the backup VM, there will be no explicit IO completion for them as the newly-promoted primary VM continues to run, which would eventually cause the guest operating system in the VM to start an abort or reset procedure. We could send an error completion that indicates that each IO failed, since it is acceptable to return an error even if the IO completed successfully. However, the guest OS might not respond well to errors from its local disk. Instead, we re-issue the pending IOs during the go-live process of the backup VM. Because we have eliminated all races and all IOs specify directly which memory and disk blocks are accessed, these disk operations can be re-issued even if they have already completed successfully (i.e. they are idempotent).
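
A small sketch of the go-live re-issue step (the types here are invented for illustration): every IO that was still in flight on the old primary is simply issued again, which is safe because each IO names its exact disk blocks and memory and is therefore idempotent.

```go
package main

import "fmt"

// diskIO names exactly which disk blocks and how much data it touches,
// which is what makes re-issuing it after failover harmless (idempotent).
type diskIO struct {
	block  int
	length int
	write  bool
}

func issue(io diskIO) { fmt.Printf("issuing IO on block %d (write=%v)\n", io.block, io.write) }

// goLive re-issues every IO that was still outstanding on the old primary,
// since the new primary cannot know which of them actually completed.
func goLive(pending []diskIO) {
	for _, io := range pending {
		issue(io) // safe even if the old primary already completed it
	}
}

func main() {
	goLive([]diskIO{{block: 7, length: 4096, write: true}, {block: 12, length: 4096}})
}
```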

3.5 Implementation Issues for Network IO

VMware vSphere provides many performance optimizations for VM networking. Some of these optimizations are based on the hypervisor asynchronously updating the state of the virtual machine's network device. For example, receive buffers can be updated directly by the hypervisor while the VM is executing. Unfortunately these asynchronous updates to a VM's state add non-determinism. Unless we can guarantee that all updates happen at the same point in the instruction stream on the primary and the backup, the backup's execution can diverge from that of the primary. The biggest change to the networking emulation code for FT is the disabling of the asynchronous network optimizations. The code that asynchronously updates VM ring buffers with incoming packets has been modified to force the guest to trap to the hypervisor, where it can log the updates and then apply them to the VM. Similarly, code that normally pulls packets out of transmit queues asynchronously is disabled for FT, and instead transmits are done through a trap to the hypervisor (except as noted below).

The elimination of the asynchronous updates of the network device combined with the delaying of sending packets described in Section 2.2 has provided some performance challenges for networking. We've taken two approaches to improving VM network performance while running FT. First, we implemented clustering optimizations to reduce VM traps and interrupts. When the VM is streaming data at a sufficient bit rate, the hypervisor can do one transmit trap per group of packets and, in the best case, zero traps, since it can transmit the packets as part of receiving new packets. Likewise, the hypervisor can reduce the number of interrupts to the VM for incoming packets by only posting the interrupt for a group of packets.

Our second performance optimization for networking involves reducing the delay for transmitted packets. As noted earlier, the hypervisor must delay all transmitted packets until it gets an acknowledgment from the backup for the appropriate log entries. The key to reducing the transmit delay is to reduce the time required to send a log message to the backup and get an acknowledgment. Our primary optimizations in this area involve ensuring that sending and receiving log entries and acknowledgments can all be done without any thread context switch. The VMware vSphere hypervisor allows functions to be registered with the TCP stack that will be called from a deferred-execution context (similar to a tasklet in Linux) whenever TCP data is received. This allows us to quickly handle any incoming log messages on the backup and any acknowledgments received by the primary without any thread context switches. In addition, when the primary VM enqueues a packet to be transmitted, we force an immediate log flush of the associated output log entry (as described in Section 2.2) by scheduling a deferred-execution context to do the flush.

4. DESIGN ALTERNATIVES

The rest covers design alternatives, experiments, performance tests, and so on; there is not much essential content for my purposes, so I'll stop here for now.
