[Translation] QEMU Internals: Big Picture Overview

Series:

  1. [Translation] QEMU Internals: Overall Architecture and Threading Model
  2. [Translation] QEMU Internals: The vhost Architecture
  3. [Translation] QEMU Internals: Big Picture Overview (this post)

Original post: http://blog.vmsplice.net/2011/03/qemu-internals-big-picture-overview.html

Original date: March 9, 2011

About the author: Stefan Hajnoczi works on the virtualization team at Red Hat, where he develops and maintains QEMU's block layer, network subsystem, and tracing subsystem.

He currently works on multi-core device emulation in QEMU and host/guest file sharing using vsock; in the past he has worked on disk image formats, storage migration, and I/O performance optimization.

 

QEMU Internals: Big Picture Overview

In the previous post, [Translation] QEMU Internals: Overall Architecture and Threading Model, I dove straight into QEMU's threading model without describing the big-picture architecture.

This post gives an overview of QEMU's top-level architecture so that the threading model is easier to understand.

How a guest runs

A guest is created by running the qemu program, also known as qemu-kvm or simply kvm. On a host running 3 virtual machines you will see 3 corresponding qemu processes:

When a guest shuts down, its qemu process exits. For convenience, a reboot does not restart the qemu process, although shutting qemu down and starting it again would work just as well.

Guest RAM

Guest RAM is allocated when qemu starts up. The -mem-path option allows file-backed memory to be used, for example hugetlbfs. Either way, the RAM is mapped into the qemu process' address space and acts as the guest's "physical" memory.
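
To make the "mapped into the qemu process' address space" point concrete, here is a minimal C sketch (not QEMU's actual code) of how file-backed guest RAM could be set up: a file on hugetlbfs is mmap'ed and the mapping is registered with KVM as guest physical memory. The path /dev/hugepages/guest-ram and the function name map_guest_ram are illustrative assumptions, and error handling is omitted:

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/kvm.h>

/* Map a hugetlbfs-backed file and hand it to KVM as guest RAM.
 * size must be a multiple of the huge page size. */
static void *map_guest_ram(int vm_fd, size_t size)
{
    int fd = open("/dev/hugepages/guest-ram", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, size);

    /* From the guest's point of view, this mapping is its physical memory. */
    void *ram = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    struct kvm_userspace_memory_region region = {
        .slot            = 0,
        .guest_phys_addr = 0,                         /* guest physical address 0 */
        .memory_size     = size,
        .userspace_addr  = (uint64_t)(uintptr_t)ram,  /* where it lives in qemu */
    };
    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
    return ram;
}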

QEMU supports both big-endian and little-endian target architectures, so guest memory must be accessed with care from QEMU code. Endian conversion is performed by helper functions rather than by accessing guest RAM directly. This makes it possible to run a guest whose endianness differs from the host's.
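
As an illustration of the kind of helper involved (a sketch in the spirit of QEMU's accessor functions, not its actual code; load_be32 is a made-up name), a 32-bit big-endian guest value can be assembled byte by byte so that the result does not depend on the host's byte order:

#include <stdint.h>

/* Read a 32-bit big-endian value from guest memory, regardless of host endianness. */
static uint32_t load_be32(const void *guest_ptr)
{
    const uint8_t *p = guest_ptr;
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
            (uint32_t)p[3];
}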

KVM virtualization

KVM is a virtualization feature in the Linux kernel that lets a program like qemu safely execute guest code directly on the host CPU. This is only possible when the host CPU supports the target architecture. Today KVM is available on x86, ARMv8, ppc, s390, and MIPS CPUs.

To execute guest code with KVM, the qemu process opens /dev/kvm and issues the KVM_RUN ioctl. The KVM kernel module uses the hardware virtualization extensions found on Intel and AMD CPUs to execute guest code directly. When the guest accesses a hardware device register, halts the guest CPU, or performs other special operations, KVM exits back to qemu. At that point qemu emulates the desired outcome of the operation, or, in the case of a halted guest CPU, simply waits for the next guest interrupt.

The basic flow of a guest CPU is as follows:
open("/dev/kvm")
ioctl(KVM_CREATE_VM)
ioctl(KVM_CREATE_VCPU)
for (;;) {
ioctl(KVM_RUN)
switch (exit_reason) {
case KVM_EXIT_IO: /* ... */
case KVM_EXIT_HLT: /* ... */
}
}
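
For reference, here is a slightly fuller sketch of the same flow in C. It assumes error handling, guest memory setup, and register initialization are omitted; in a real VMM those steps are required before KVM_RUN does anything useful:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

int run_vcpu(void)
{
    int kvm_fd  = open("/dev/kvm", O_RDWR);
    int vm_fd   = ioctl(kvm_fd, KVM_CREATE_VM, 0);
    int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);

    /* The kvm_run structure is shared with the kernel and describes
     * why KVM_RUN returned (the exit reason). */
    int mmap_size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu_fd, 0);

    for (;;) {
        ioctl(vcpu_fd, KVM_RUN, 0);
        switch (run->exit_reason) {
        case KVM_EXIT_IO:
            /* emulate the port I/O access, then loop back into the guest */
            break;
        case KVM_EXIT_HLT:
            /* guest CPU halted: wait for the next guest interrupt */
            return 0;
        }
    }
}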

The host's view of a running guest

The host kernel schedules the qemu process like any other process. Multiple guests run alongside each other without knowing about one another. Applications such as Firefox or Apache also compete with qemu for host resources, although resource controls can be used to isolate qemu and give it higher priority.

Since qemu emulates a complete virtual machine inside a userspace process, it has no knowledge of which processes are running inside the guest. In other words, qemu provides guest RAM, the ability to execute guest code, and emulated hardware devices; therefore any operating system (or none at all) can run inside the guest. The host has no way to "peek" inside an arbitrary guest.

A guest has one vcpu thread per virtual CPU. A dedicated iothread runs a select(2) event loop to process I/O such as network packets and disk I/O completion. The details were discussed in the previous post, [Translation] QEMU Internals: Overall Architecture and Threading Model.
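
The shape of such an event loop is roughly as follows; this is a generic select(2) sketch rather than QEMU's actual iothread code, and handle_fd_event is a made-up callback used only for illustration:

#include <sys/select.h>

/* Wait on a set of file descriptors (tap device, disk completion eventfd, ...)
 * and dispatch a handler whenever one of them becomes readable. */
void event_loop(int *fds, int nfds_count, void (*handle_fd_event)(int fd))
{
    for (;;) {
        fd_set readfds;
        int maxfd = -1;

        FD_ZERO(&readfds);
        for (int i = 0; i < nfds_count; i++) {
            FD_SET(fds[i], &readfds);
            if (fds[i] > maxfd)
                maxfd = fds[i];
        }

        /* Block until at least one descriptor is ready. */
        select(maxfd + 1, &readfds, NULL, NULL, NULL);

        for (int i = 0; i < nfds_count; i++) {
            if (FD_ISSET(fds[i], &readfds))
                handle_fd_event(fds[i]);
        }
    }
}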

The following diagram shows the qemu process as seen from the host:

Further information

Hopefully this post has given you an overview of QEMU and KVM architecture.
Here are two presentations on KVM architecture that go into more detail:
Jan Kiszka's Linux Kongress 2010: Architecture of the Kernel-based Virtual Machine (KVM)
My own KVM Architecture Overview

 

 

=== Selected comments ===
-> Who opens the /dev/kvm device, the vcpu thread or the iothread?
It is opened by the main thread when qemu starts. Note that each vcpu has its own file descriptor, while /dev/kvm is a global file descriptor for the VM, not tied to any particular vcpu.

-> How does a vcpu thread provide the illusion of a CPU? Does it have to save and restore CPU context on context switches, or does it rely on hardware virtualization support?
Yes, hardware support is required. The kvm module's ioctl interface allows the vcpu register state to be manipulated. Just as a physical CPU has an initial register state at power-on, QEMU initializes the vcpu's registers at reset.
KVM requires hardware support, called VMX on Intel CPUs and SVM on AMD CPUs. The two are not compatible, so kvm has separate code to support each of them.
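
As a concrete (x86) sketch of what "manipulating vcpu register state through the ioctl interface" looks like, assuming an already-created vcpu file descriptor and omitting error handling:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Point the vcpu's instruction pointer at 'entry', the way a VMM might
 * program initial register state at reset. */
void reset_vcpu_rip(int vcpu_fd, unsigned long long entry)
{
    struct kvm_regs regs;

    ioctl(vcpu_fd, KVM_GET_REGS, &regs);   /* read current register state */
    regs.rip    = entry;                   /* start executing at 'entry'  */
    regs.rflags = 0x2;                     /* bit 1 is always set on x86  */
    ioctl(vcpu_fd, KVM_SET_REGS, &regs);   /* write it back               */
}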

-> How can the hypervisor and the guest interact directly?
It depends on what kind of interaction you need; virtio-serial can be used for arbitrary guest/host communication.
The qemu guest agent is built on top of virtio-serial and exposes a JSON RPC API.
It lets the host invoke a set of commands inside the guest (for example querying IP addresses, quiescing applications for backup, and so on).

-> Could you explain how an I/O operation issued by an application inside the guest flows through the stack?
There are several different code paths depending on KVM vs. TCG, MMIO vs. PIO, and whether ioeventfd is enabled.
The basic flow is that a guest memory or I/O access traps into the kvm kernel module; kvm returns from the KVM_RUN ioctl and hands control back to QEMU.
QEMU finds the emulated device responsible for that address range and calls its handler. Once the device has finished, QEMU re-enters the guest with the KVM_RUN ioctl.
If passthrough is not used, the guest sees an emulated NIC, and kvm/qemu never read or write the physical NIC directly. Instead they hand packets to the Linux network stack (for example via a tap device).
virtio provides emulated network, storage, and other devices; it applies a number of optimizations, but the basic principle is still emulating a "real" device.
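
The dispatch step described above ("QEMU finds the emulated device responsible for that address range") can be pictured with the following made-up sketch; mmio_region and dispatch_mmio_write are invented names and do not correspond to QEMU's real data structures:

#include <stdbool.h>
#include <stdint.h>

struct mmio_region {
    uint64_t base, size;
    void (*write)(uint64_t offset, const uint8_t *data, uint32_t len);
};

/* Find the emulated device that owns the faulting guest-physical address
 * and call its write handler. Returns true if a device handled the access. */
bool dispatch_mmio_write(struct mmio_region *regions, int nregions,
                         uint64_t addr, const uint8_t *data, uint32_t len)
{
    for (int i = 0; i < nregions; i++) {
        struct mmio_region *r = &regions[i];
        if (addr >= r->base && addr + len <= r->base + r->size) {
            r->write(addr - r->base, data, len);  /* device emulation runs here */
            return true;
        }
    }
    return false;  /* no device claims this address */
}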

-> How does the kvm kernel module catch a virtqueue kick?
The ioeventfd is registered as a device on KVM's I/O bus, and kvm signals the ioeventfd via ioeventfd_write().
The trace looks like this:
vmx_handle_exit with EXIT_REASON_IO_INSTRUCTION
--> handle_io
--> emulate_instruction
--> x86_emulate_instruction
--> x86_emulate_insn
--> writeback
--> segmented_write
--> emulator_write_emulated
--> emulator_read_write_onepage
--> vcpu_mmio_write
--> ioeventfd_write
This shows how the ioeventfd is signalled when the guest kicks the host.
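
A sketch of the userspace side of this mechanism, assuming an existing VM file descriptor; the doorbell address 0xc000 and the access length are arbitrary example values:

#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Register an eventfd for a guest port I/O address. When the guest writes
 * to that address, kvm signals the eventfd (via ioeventfd_write() in the
 * trace above) instead of exiting to QEMU. */
int register_kick_eventfd(int vm_fd)
{
    int efd = eventfd(0, 0);

    struct kvm_ioeventfd ioevent = {
        .addr  = 0xc000,                 /* guest PIO address of the doorbell */
        .len   = 2,
        .fd    = efd,
        .flags = KVM_IOEVENTFD_FLAG_PIO, /* port I/O rather than MMIO */
    };
    ioctl(vm_fd, KVM_IOEVENTFD, &ioevent);

    /* The iothread can now poll/read efd to learn about guest kicks. */
    return efd;
}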

 

 

The original post follows:

QEMU Internals: Big picture overview

Last week I started the  QEMU Internals series to share knowledge of how QEMU works. I dove straight in to the  threading model without a high-level overview. I want to go back and provide the big picture so that the details of the threading model can be understood more easily.

The story of a guest

A guest is created by running the  qemu program, also known as  qemu-kvm or just  kvm. On a host that is running 3 virtual machines there are 3  qemu processes:


When a guest shuts down the  qemu process exits. Reboot can be performed without restarting the  qemu process for convenience although it would be fine to shut down and then start  qemu again.

Guest RAM

Guest RAM is simply allocated when  qemu starts up. It is possible to pass in file-backed memory with  -mem-path such that  hugetlbfs can be used. Either way, the RAM is mapped in to the  qemu process' address space and acts as the "physical" memory as seen by the guest:


QEMU supports both big-endian and little-endian target architectures so guest memory needs to be accessed with care from QEMU code. Endian conversion is performed by helper functions instead of accessing guest RAM directly. This makes it possible to run a target with a different endianness from the host.

KVM virtualization

KVM is a virtualization feature in the Linux kernel that lets a program like  qemu safely execute guest code directly on the host CPU. This is only possible when the target architecture is supported by the host CPU. Today KVM is available on x86, ARMv8, ppc, s390, and MIPS CPUs.
In order to execute guest code using KVM, the  qemu process opens  /dev/kvm and issues the  KVM_RUN ioctl. The KVM kernel module uses hardware virtualization extensions found on modern Intel and AMD CPUs to directly execute guest code. When the guest accesses a hardware device register, halts the guest CPU, or performs other special operations, KVM exits back to  qemu. At that point  qemu can emulate the desired outcome of the operation or simply wait for the next guest interrupt in the case of a halted guest CPU.
The basic flow of a guest CPU is as follows:
open("/dev/kvm")
ioctl(KVM_CREATE_VM)
ioctl(KVM_CREATE_VCPU)
for (;;) {
     ioctl(KVM_RUN)
     switch (exit_reason) {
     case KVM_EXIT_IO:  /* ... */
     case KVM_EXIT_HLT: /* ... */
     }
}

The host's view of a running guest

The host kernel schedules  qemu like a regular process. Multiple guests run alongside without knowledge of each other. Applications like Firefox or Apache also compete for the same host resources as  qemu although resource controls can be used to isolate and prioritize  qemu.

Since  qemu system emulation provides a full virtual machine inside the  qemu userspace process, the details of what processes are running inside the guest are not directly visible from the host. One way of understanding this is that  qemu provides a slab of guest RAM, the ability to execute guest code, and emulated hardware devices; therefore any operating system (or no operating system at all) can run inside the guest. There is no ability for the host to peek inside an arbitrary guest.

Guests have a so-called  vcpu thread per virtual CPU. A dedicated  iothread runs a  select(2) event loop to process I/O such as network packets and disk I/O completion. For more details and possible alternate configuration, see the  threading model post.

The following diagram illustrates the  qemu process as seen from the host:

Further information

Hopefully this gives you an overview of QEMU and KVM architecture. Feel free to leave questions in the comments and check out other  QEMU Internals posts for details on these aspects of QEMU.

Here are two presentations on KVM architecture that cover similar areas if you are interested in reading more:
Jan Kiszka's Linux Kongress 2010: Architecture of the Kernel-based Virtual Machine (KVM)
KVM Architecture Overview
