[Translation] QEMU Internals: Overall architecture and threading model

Articles in this series:

  1. [Translation] QEMU Internals: Overall architecture and threading model (this post)
  2. [Translation] QEMU Internals: The vhost architecture
  3. [Translation] QEMU Internals: Big picture overview

Original post: http://blog.vmsplice.net/2011/03/qemu-internals-overall-architecture-and.html

Original date: March 5, 2011

About the author: Stefan Hajnoczi works on the virtualization team at Red Hat, where he develops and maintains QEMU's block layer, network subsystem, and tracing subsystem.

He currently works on multi-core device emulation in QEMU and host/guest file sharing using vsock; in the past he has worked on disk image formats, storage migration, and I/O performance optimization.

QEMU Internals: Overall architecture and threading model

This is the first post in the QEMU Internals series. Its goal is to share how QEMU works and to make it easier for new contributors to become familiar with the QEMU codebase.

Running a VM involves executing guest code, handling timers, processing I/O, and responding to monitor commands (commands that the user sends to QEMU). Doing all of this at once requires an architecture that can mediate resources safely without pausing guest execution just because a disk I/O or monitor command takes a long time to complete. There are two common programming architectures for responding to events from multiple sources:

1. Parallel architecture: split the work into processes or threads that execute simultaneously; I will call this the threaded architecture.
2. Event-driven architecture: react to events by running a main loop that dispatches them to event handlers. This is usually implemented with the select or poll family of system calls waiting on multiple file descriptors.

QEMU actually uses a hybrid architecture that combines event-driven programming with threads. It does this because an event-driven model runs in a single thread and therefore cannot take advantage of multi-core CPUs. In addition, in some cases it is simpler to write a dedicated thread for a task than to integrate it into the event-driven framework. Nevertheless, the core of QEMU is event-driven, and most of its code runs in that environment.

The event-driven core of QEMU

An event-driven architecture is built around an event loop that dispatches events to handler functions. QEMU's main event loop is main_loop_wait(), and it performs the following tasks:

1. Wait for file descriptors to become readable or writable. File descriptors play a central role because files, sockets, pipes, and many other resources are all file descriptors. File descriptor handlers can be added with qemu_set_fd_handler().
2. Run expired timers. Timers are added with qemu_mod_timer().
3. Run bottom halves (BHs), which are like timers that expire immediately. BHs are used to avoid reentrancy and call-stack overflow, and are scheduled with qemu_bh_schedule(). A sketch of these three registration APIs follows this list.
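As a rough illustration, the sketch below registers one event source of each kind using the functions named above (qemu_set_fd_handler(), qemu_new_timer()/qemu_mod_timer(), qemu_bh_new()/qemu_bh_schedule()). It is a minimal sketch against the 2011-era QEMU-internal APIs; the exact header names, signatures, and clock units have changed in later QEMU versions, so treat the details as assumptions rather than a copy-paste recipe.

```c
#include "qemu-common.h"   /* assumption: 2011-era QEMU headers; names differ in modern trees */
#include "qemu-timer.h"
#include "qemu-char.h"     /* qemu_set_fd_handler() was declared here in old trees (assumption) */

static void my_fd_read(void *opaque)   /* called when the fd becomes readable */
{
    /* read from the fd here; must not block for long */
}

static void my_timer_cb(void *opaque)  /* called when the timer expires */
{
}

static void my_bh_cb(void *opaque)     /* called on the next event loop iteration */
{
}

static QEMUTimer *my_timer;
static QEMUBH *my_bh;

void register_my_event_sources(int fd)
{
    /* 1. file descriptor: read handler only, no write handler */
    qemu_set_fd_handler(fd, my_fd_read, NULL, NULL);

    /* 2. timer: fire roughly one second from now on the realtime clock
     *    (clock name and millisecond units are era-specific assumptions) */
    my_timer = qemu_new_timer(rt_clock, my_timer_cb, NULL);
    qemu_mod_timer(my_timer, qemu_get_clock(rt_clock) + 1000);

    /* 3. bottom half: runs "immediately", i.e. on the next main_loop_wait() pass */
    my_bh = qemu_bh_new(my_bh_cb, NULL);
    qemu_bh_schedule(my_bh);
}
```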

When a file descriptor becomes ready, a timer expires, or a BH is scheduled, the event loop invokes a callback in response to the event. Callbacks run under two simple rules:

1. No other core code executes at the same time, so no synchronization is needed. Callbacks run sequentially and atomically with respect to the rest of the core code; at any given moment only one thread of control is executing core code.
2. No blocking system calls or long-running computations should be performed. Because the event loop must wait for a callback to return before it can service other events, callbacks must avoid spending an unbounded amount of time. Breaking this rule causes the guest to pause and the monitor to become unresponsive.

The second rule is sometimes hard to honor, and there is code in QEMU that blocks. In fact, qemu_aio_wait() even contains a nested event loop that waits on a subset of the file descriptors handled by the top-level event loop. Hopefully these violations will be removed by future restructuring of the code. New code almost never has a legitimate reason to block; if a blocking task really must be performed, one solution is to offload it to a dedicated worker thread.

Offloading specific tasks to worker threads

Although many I/O operations can be performed in a non-blocking fashion, some system calls have no non-blocking equivalent. Furthermore, some long-running computations simply hog the CPU and are difficult to break up into callbacks. In these cases, dedicated worker threads can be used to move such tasks out of core QEMU.

One example of worker threads is posix-aio-compat.c, an asynchronous file I/O implementation. When core QEMU issues an aio request, the request is placed on a queue. Worker threads take requests off the queue and execute them outside of core QEMU; since they run in their own threads, they can safely perform blocking operations without stalling the rest of QEMU. The implementation also takes care of the necessary synchronization and communication between the worker threads and core QEMU. A generic sketch of this queue-plus-workers pattern is shown below.
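The sketch below shows the general request-queue pattern in plain pthreads. It is not the actual posix-aio-compat.c code; the names (AioRequest, submit_request(), worker_main()) are hypothetical and only illustrate how blocking work can be pulled off a queue and executed outside the event loop.

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct AioRequest {
    struct AioRequest *next;
    void (*fn)(struct AioRequest *req);   /* the blocking work to perform */
} AioRequest;

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;
static AioRequest *queue_head;

/* Called from core QEMU (the event loop thread): enqueue and return immediately. */
void submit_request(AioRequest *req)
{
    pthread_mutex_lock(&queue_lock);
    req->next = queue_head;
    queue_head = req;
    pthread_cond_signal(&queue_cond);
    pthread_mutex_unlock(&queue_lock);
}

/* Worker thread: blocking calls are fine here because the event loop keeps running. */
static void *worker_main(void *opaque)
{
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (!queue_head) {
            pthread_cond_wait(&queue_cond, &queue_lock);
        }
        AioRequest *req = queue_head;
        queue_head = req->next;
        pthread_mutex_unlock(&queue_lock);

        req->fn(req);   /* may call preadv()/pwritev()/fsync() and block */
        /* completion must be signalled back to the event loop; see the next section */
    }
    return NULL;
}

void start_workers(int n)
{
    for (int i = 0; i < n; i++) {
        pthread_t tid;
        pthread_create(&tid, NULL, worker_main, NULL);
    }
}
```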

Another example is ui/vnc-jobs-async.c, which uses worker threads for CPU-intensive image compression and encoding.

Most of core QEMU is not thread-safe, so worker threads cannot call into core QEMU code. Simple utilities such as qemu_malloc() are thread-safe, but they are the exception rather than the rule. This poses a problem for communicating worker thread events back to core QEMU.

When a worker thread needs to notify core QEMU, a pipe or a qemu_eventfd() file descriptor is added to the event loop. The worker thread writes to that file descriptor, and the event loop invokes the corresponding callback when the descriptor becomes readable. In addition, a signal must be used to ensure that the event loop gets a chance to run under all circumstances. posix-aio-compat.c uses this approach, and it (especially the use of signals) makes more sense after the next section explains how guest code is executed. A sketch of the notification path is shown below.
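Here is a minimal sketch of the notification path using an ordinary pipe (qemu_eventfd() is, to my understanding, a wrapper that falls back to a pipe when eventfd is unavailable; that behavior is an assumption here). The read end is registered with the event loop via qemu_set_fd_handler(), and a worker thread writes a single byte to wake it up. completion_cb() and notifier_init() are hypothetical names for illustration.

```c
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

/* Prototype from QEMU (2011-era); normally provided by a QEMU header. */
typedef void IOHandler(void *opaque);
extern int qemu_set_fd_handler(int fd, IOHandler *fd_read,
                               IOHandler *fd_write, void *opaque);

static int notify_fds[2];   /* [0] = read end (event loop), [1] = write end (workers) */

/* Runs in the event loop thread when the read end becomes readable. */
static void completion_cb(void *opaque)
{
    char buf[64];
    /* Drain the pipe; the byte values carry no meaning. */
    while (read(notify_fds[0], buf, sizeof(buf)) > 0) {
        /* keep draining */
    }
    /* Safe to touch core QEMU state here: we are in the event loop thread. */
}

void notifier_init(void)
{
    if (pipe(notify_fds) < 0) {
        return;   /* error handling omitted in this sketch */
    }
    /* Non-blocking read end so the drain loop above cannot stall the event loop. */
    fcntl(notify_fds[0], F_SETFL, O_NONBLOCK);
    qemu_set_fd_handler(notify_fds[0], completion_cb, NULL, NULL);
}

/* Called from a worker thread when a request has completed. */
void notify_core_qemu(void)
{
    char byte = 0;
    ssize_t ret;
    do {
        ret = write(notify_fds[1], &byte, 1);
    } while (ret < 0 && errno == EINTR);
}
```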

Executing guest code

The discussion so far has focused on QEMU's event loop, but QEMU's more important job is to execute guest code.

There are two mechanisms for executing guest code: the Tiny Code Generator (TCG) and KVM. TCG emulates the guest using dynamic binary translation, also known as Just-in-Time (JIT) compilation. KVM uses the hardware virtualization extensions of modern CPUs to execute guest code safely and directly on the host CPU. The details of these mechanisms do not matter for this post; what matters is that both allow QEMU to jump into guest code and execute it.

Jumping into guest code hands control of execution over to the guest. A thread that is running guest code cannot simultaneously sit in the event loop, because the guest has (safe) control of the CPU. Typically the time spent in guest code is limited, because reads and writes of emulated device registers and other exceptions cause the guest to exit and hand control back to QEMU. In extreme cases, however, a guest can hog control for an unbounded amount of time, leaving QEMU unresponsive. The enter/exit cycle under KVM is illustrated by the sketch below.
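For the KVM case, the enter/exit cycle is visible in the Linux KVM ioctl interface: the vcpu thread calls the KVM_RUN ioctl, the CPU runs guest code until something needs userspace attention (an emulated device register access, a signal, etc.), and the ioctl returns with an exit reason. The sketch below is a bare-bones illustration of that loop, not QEMU's actual vcpu code; error handling and the device emulation itself are omitted.

```c
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <errno.h>
#include <stdio.h>

/* vcpu_fd is the file descriptor of a KVM vcpu; run is its mmap'ed struct kvm_run. */
void vcpu_loop(int vcpu_fd, struct kvm_run *run)
{
    for (;;) {
        /* Enter guest mode. The ioctl only returns when the guest "exits",
         * e.g. on port I/O or MMIO to an emulated device, or when a signal
         * interrupts guest execution. */
        if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
            if (errno == EINTR) {
                /* a signal kicked us out of the guest: let the event loop run */
                continue;
            }
            perror("KVM_RUN");
            return;
        }

        switch (run->exit_reason) {
        case KVM_EXIT_IO:
            /* emulate the port I/O access, then loop back into the guest */
            break;
        case KVM_EXIT_MMIO:
            /* emulate the memory-mapped device access */
            break;
        case KVM_EXIT_INTR:
            /* interrupted by a signal */
            break;
        default:
            /* other exit reasons: shutdown, failed entry, ... */
            break;
        }
    }
}
```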

To solve the problem of guest code hogging QEMU's thread of control, signals are used to break out of the guest. A UNIX signal yanks control away from the current flow of execution and invokes a signal handler. This lets QEMU interrupt guest code, return to its main loop, and start processing pending events. A sketch of the signal setup follows this paragraph.
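The sketch below shows the bare mechanism: install a handler for a signal and send that signal to the thread currently running guest code. The handler body can be empty; its only job is to make the kernel interrupt guest execution so that KVM_RUN (or TCG's execution loop) returns. The choice of SIGUSR1 and the function names are illustrative assumptions, not QEMU's actual signal plumbing.

```c
#include <pthread.h>
#include <signal.h>
#include <string.h>

static void kick_handler(int sig)
{
    /* Intentionally empty: the point is merely to interrupt the vcpu thread
     * so that guest execution stops and control returns to QEMU. */
    (void)sig;
}

void install_kick_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = kick_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;             /* no SA_RESTART: we want the syscall to return EINTR */
    sigaction(SIGUSR1, &sa, NULL);
}

/* Called e.g. when a timer expires or a worker thread completes a request. */
void kick_vcpu_thread(pthread_t vcpu_thread)
{
    pthread_kill(vcpu_thread, SIGUSR1);
}
```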

The consequence is that new events may not be handled immediately while QEMU is executing guest code. Most of the time QEMU eventually gets around to processing them, but the extra latency is itself a performance problem. For this reason, timers, I/O completion, and worker-thread notifications to core QEMU use signals to make sure the event loop runs right away.

At this point you may be wondering what the overall picture looks like for the event loop and an SMP guest with multiple vcpus. Now that the threading model and guest code execution have been covered, the next section describes the overall architecture.

iothread and non-iothread architecture

The traditional architecture is a single QEMU thread that executes both guest code and the event loop. This is the default, known as the non-iothread architecture, and it is what a plain ./configure && make build produces. The QEMU thread executes guest code until a signal or an exception yields control back; it then runs one iteration of the event loop with a non-blocking select, after which it dives back into guest code and repeats. A sketch of this loop is shown below.
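Conceptually the single thread simply alternates between the two roles, as in the sketch below. The helper names (run_guest_until_exit(), shutdown_requested, and the nonblocking flag to main_loop_wait()) are hypothetical simplifications standing in for QEMU's real cpu_exec()/kvm_cpu_exec() and main loop code.

```c
/* Hypothetical stand-ins for QEMU internals, declared here to keep the sketch self-contained. */
extern volatile int shutdown_requested;
extern void run_guest_until_exit(void);      /* stands in for cpu_exec()/kvm_cpu_exec() */
extern void main_loop_wait(int nonblocking); /* one iteration of QEMU's event loop */

/* Outline of the single-threaded (non-iothread) main loop. */
void qemu_single_thread_main(void)
{
    while (!shutdown_requested) {
        /* Run guest code (TCG or KVM) until a device access, exception,
         * or signal hands control back to QEMU. With -smp, this step
         * multiplexes between the vcpus. */
        run_guest_until_exit();

        /* Service file descriptors, expired timers, and bottom halves
         * without blocking, then go back into the guest. */
        main_loop_wait(1 /* nonblocking */);
    }
}
```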

If the guest is started with -smp 2, QEMU still does not create any additional threads. The single QEMU thread multiplexes between the two vcpus and the event loop. The non-iothread model therefore cannot exploit multicore hosts and performs poorly for SMP guests.

Note that even though there is only one core QEMU thread, there may be zero or more worker threads, which can be temporary or permanent. These threads only perform specialized tasks and never execute guest code or process events. I emphasize this because worker threads are easy to misread when monitoring the host, and people mistake them for vcpu threads. Remember: the non-iothread architecture only ever has one core QEMU thread.

The newer architecture is one QEMU thread per vcpu plus a dedicated event loop thread. This is the iothread architecture, enabled at build time with ./configure --enable-io-thread. Each vcpu thread can execute guest code in parallel, providing true SMP support, while the iothread runs the event loop. The rule that core QEMU code never runs concurrently is preserved by a global mutex that serializes the vcpu threads and the iothread whenever they enter core code. Most of the time the vcpu threads are executing guest code and do not hold the global mutex, and most of the time the iothread is blocked in select and does not hold it either. The sketch below illustrates the locking discipline.
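The locking discipline can be illustrated with a plain pthread mutex, as below. In real QEMU the lock is the global mutex (taken and released through dedicated helpers); the loop bodies and helper names here are simplified, hypothetical stand-ins.

```c
#include <pthread.h>

/* Hypothetical helpers standing in for QEMU internals. */
extern void run_guest_until_exit(void);
extern void emulate_device_access(void);
extern void wait_for_events(void);                    /* blocks in select(2) */
extern void dispatch_fd_timer_and_bh_callbacks(void);

static pthread_mutex_t global_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Each vcpu thread: drop the lock while in the guest, take it for core code. */
void *vcpu_thread(void *opaque)
{
    for (;;) {
        /* guest code runs WITHOUT the global mutex held */
        run_guest_until_exit();

        pthread_mutex_lock(&global_mutex);
        emulate_device_access();                /* core QEMU code: serialized */
        pthread_mutex_unlock(&global_mutex);
    }
}

/* The iothread: blocked in select() most of the time, lock only for callbacks. */
void *iothread(void *opaque)
{
    for (;;) {
        wait_for_events();                      /* global mutex NOT held */

        pthread_mutex_lock(&global_mutex);
        dispatch_fd_timer_and_bh_callbacks();   /* core QEMU code: serialized */
        pthread_mutex_unlock(&global_mutex);
    }
}
```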

Note that TCG is not thread-safe, so even under the iothread model it multiplexes the vcpus across a single QEMU thread. Only KVM can take advantage of per-vcpu threads.

Conclusion and outlook
Hopefully this helps communicate QEMU's overall architecture (which KVM also uses).
The details in this post are likely to change in the future, and I hope to see the default switch from non-iothread to iothread, and perhaps even the removal of the non-iothread architecture.
I will try to update this post as the qemu project changes.

 

 

The original post follows:

QEMU Internals: Overall architecture and threading model

This is the first post in a series on  QEMU Internals aimed at developers. It is designed to share knowledge of how QEMU works and make it easier for new contributors to learn about the QEMU codebase.

Running a guest involves executing guest code, handling timers, processing I/O, and responding to monitor commands. Doing all these things at once requires an architecture capable of mediating resources in a safe way without pausing guest execution if a disk I/O or monitor command takes a long time to complete. There are two popular architectures for programs that need to respond to events from multiple sources:
  1. Parallel architecture splits work into processes or threads that can execute simultaneously. I will call this threaded architecture.
  2. Event-driven architecture reacts to events by running a main loop that dispatches to event handlers. This is commonly implemented using the select(2) or poll(2) family of system calls to wait on multiple file descriptors.

QEMU actually uses a  hybrid architecture that combines event-driven programming with threads. It makes sense to do this because an event loop cannot take advantage of multiple cores since it only has a single thread of execution. In addition, sometimes it is simpler to write a dedicated thread to offload one specific task rather than integrate it into an event-driven architecture. Nevertheless, the core of QEMU is event-driven and most code executes in that environment.

The event-driven core of QEMU

An event-driven architecture is centered around the event loop which dispatches events to handler functions. QEMU's main event loop is  main_loop_wait() and it performs the following tasks:

  1. Waits for file descriptors to become readable or writable. File descriptors play a critical role because files, sockets, pipes, and various other resources are all file descriptors. File descriptors can be added using qemu_set_fd_handler().
  2. Runs expired timers. Timers can be added using qemu_mod_timer().
  3. Runs bottom-halves (BHs), which are like timers that expire immediately. BHs are used to avoid reentrancy and overflowing the call stack. BHs can be added using qemu_bh_schedule().

When a file descriptor becomes ready, a timer expires, or a BH is scheduled, the event loop invokes a callback that responds to the event. Callbacks have two simple rules about their environment:
  1. No other core code is executing at the same time so synchronization is not necessary. Callbacks execute sequentially and atomically with respect to other core code. There is only one thread of control executing core code at any given time.
  2. No blocking system calls or long-running computations should be performed. Since the event loop waits for the callback to return before continuing with other events, it is important to avoid spending an unbounded amount of time in a callback. Breaking this rule causes the guest to pause and the monitor to become unresponsive.

This second rule is sometimes hard to honor and there is code in QEMU which blocks. In fact there is even a nested event loop in  qemu_aio_wait() that waits on a subset of the events that the top-level event loop handles. Hopefully these violations will be removed in the future by restructuring the code. New code almost never has a legitimate reason to block and one solution is to use dedicated worker threads to offload long-running or blocking code.

Offloading specific tasks to worker threads

Although many I/O operations can be performed in a non-blocking fashion, there are system calls which have no non-blocking equivalent. Furthermore, sometimes long-running computations simply hog the CPU and are difficult to break up into callbacks. In these cases dedicated  worker threads can be used to carefully move these tasks out of core QEMU.

One example user of worker threads is  posix-aio-compat.c, an asynchronous file I/O implementation. When core QEMU issues an aio request it is placed on a queue. Worker threads take requests off the queue and execute them outside of core QEMU. They may perform blocking operations since they execute in their own threads and do not block the rest of QEMU. The implementation takes care to perform necessary synchronization and communication between worker threads and core QEMU.

Another example is  ui/vnc-jobs-async.c which performs compute-intensive image compression and encoding in worker threads.

Since the majority of core QEMU code is  not thread-safe, worker threads cannot call into core QEMU code. Simple utilities like  qemu_malloc() are thread-safe but that is the exception rather than the rule. This poses a problem for communicating worker thread events back to core QEMU.

When a worker thread needs to notify core QEMU, a pipe or a  qemu_eventfd() file descriptor is added to the event loop. The worker thread can write to the file descriptor and the callback will be invoked by the event loop when the file descriptor becomes readable. In addition, a signal must be used to ensure that the event loop is able to run under all circumstances. This approach is used by  posix-aio-compat.c and makes more sense (especially the use of signals) after understanding how guest code is executed.

Executing guest code

So far we have mainly looked at the event loop and its central role in QEMU. Equally as important is the ability to  execute guest code, without which QEMU could respond to events but would not be very useful.

There are two mechanisms for executing guest code: Tiny Code Generator (TCG) and KVM. TCG emulates the guest using dynamic binary translation, also known as Just-in-Time (JIT) compilation. KVM takes advantage of hardware virtualization extensions present in modern Intel and AMD CPUs for safely executing guest code directly on the host CPU. For the purposes of this post the actual techniques do not matter but what matters is that both TCG and KVM allow us to jump into guest code and execute it.

Jumping into guest code takes away our control of execution and gives control to the guest. While a thread is running guest code it cannot simultaneously be in the event loop because the guest has (safe) control of the CPU. Typically the amount of time spent in guest code is limited because reads and writes to emulated device registers and other exceptions cause us to leave the guest and give control back to QEMU. In extreme cases a guest can spend an unbounded amount of time without giving up control and this would make QEMU unresponsive.

In order to solve the problem of guest code hogging QEMU's thread of control  signals are used to break out of the guest. A UNIX signal yanks control away from the current flow of execution and invokes a signal handler function. This allows QEMU to take steps to leave guest code and return to its main loop where the event loop can get a chance to process pending events.

The upshot of this is that new events may not be detected immediately if QEMU is currently in guest code. Most of the time QEMU eventually gets around to processing events but this additional latency is a performance problem in itself. For this reason timers, I/O completion, and notifications from worker threads to core QEMU use signals to ensure that the event loop will be run immediately.

You might be wondering what the overall picture between the event loop and an SMP guest with multiple vcpus looks like. Now that the threading model and guest code has been covered we can discuss the overall architecture.

iothread and non-iothread architecture

The traditional architecture is a single QEMU thread that executes guest code and the event loop. This model is also known as  non-iothread or  !CONFIG_IOTHREAD and is the default when QEMU is built with  ./configure && make. The QEMU thread executes guest code until an exception or signal yields back control. Then it runs one iteration of the event loop without blocking in  select(2). Afterwards it dives back into guest code and repeats until QEMU is shut down.

If the guest is started with multiple vcpus using  -smp 2, for example, no additional QEMU threads will be created. Instead the single QEMU thread multiplexes between two vcpus executing guest code and the event loop. Therefore non-iothread fails to exploit multicore hosts and can result in poor performance for SMP guests.

Note that despite there being only one QEMU thread there may be zero or more worker threads. These threads may be temporary or permanent. Remember that they perform specialized tasks and do not execute guest code or process events. I wanted to emphasise this because it is easy to be confused by worker threads when monitoring the host and interpret them as vcpu threads. Remember that non-iothread only ever has one QEMU thread.

The newer architecture is one QEMU thread per vcpu plus a dedicated event loop thread. This model is known as  iothread or  CONFIG_IOTHREAD and can be enabled with  ./configure --enable-io-thread at build time. Each vcpu thread can execute guest code in parallel, offering true SMP support, while the iothread runs the event loop. The rule that core QEMU code never runs simultaneously is maintained through a global mutex that synchronizes core QEMU code across the vcpus and iothread. Most of the time vcpus will be executing guest code and do not need to hold the global mutex. Most of the time the iothread is blocked in  select(2) and does not need to hold the global mutex.

Note that TCG is not thread-safe so even under the iothread model it multiplexes vcpus across a single QEMU thread. Only KVM can take advantage of per-vcpu threads.

Conclusion and words about the future

Hopefully this helps communicate the overall architecture of QEMU (which KVM inherits). Feel free to leave questions in the comments below.
In the future the details are likely to change and I hope we will see a move to  CONFIG_IOTHREAD by default and maybe even a removal of  !CONFIG_IOTHREAD.
I will try to update this post as  qemu.git changes.