【Repost】Understanding how QEMU emulates devices

Understanding QEMU devices

https://www.qemu.org/2018/02/09/understanding-qemu-devices/

July, 2017

Here are some notes that may help newcomers understand what is actually happening with QEMU devices:

With QEMU, one thing to remember is that we are trying to emulate what an Operating System (OS) would see on bare-metal hardware. Most bare-metal machines are basically giant memory maps, where software poking at a particular address will have a particular side effect (the most common side effect is, of course, accessing memory; but other common regions in memory include the register banks for controlling particular pieces of hardware, like the hard drive or a network card, or even the CPU itself). The end-goal of emulation is to allow a user-space program, using only normal memory accesses, to manage all of the side-effects that a guest OS is expecting.

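【The essence of a computer】
Most bare-metal machines are basically giant memory maps: when software pokes at a particular address, a particular side effect follows.
The most common side effect is accessing memory, but the memory map also contains register banks for controlling particular hardware, such as the hard drive, a network card, or even the CPU itself.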
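
A minimal Python sketch of the "giant memory map" idea may help here. The address layout and the `Ram`, `UartLike` and `MemoryMap` classes are hypothetical names chosen for illustration; none of this is QEMU's API. It only shows that the same store operation can mean "keep a byte" in one region and "cause a side effect" in another.

```python
class Ram:
    """Plain memory: a write just stores a byte, a read returns it."""
    def __init__(self, size):
        self.data = bytearray(size)

    def read(self, offset):
        return self.data[offset]

    def write(self, offset, value):
        self.data[offset] = value & 0xFF


class UartLike:
    """Hypothetical device: writing to register 0 'transmits' a character."""
    def __init__(self):
        self.transmitted = []

    def read(self, offset):
        return 0  # nothing to read in this toy device

    def write(self, offset, value):
        if offset == 0:
            self.transmitted.append(chr(value & 0xFF))  # the side effect


class MemoryMap:
    """Routes each address to whichever region claims it."""
    def __init__(self):
        self.regions = []  # list of (base, size, handler)

    def add(self, base, size, handler):
        self.regions.append((base, size, handler))

    def _find(self, addr):
        for base, size, handler in self.regions:
            if base <= addr < base + size:
                return handler, addr - base
        raise ValueError(f"unmapped address {addr:#x}")

    def read(self, addr):
        handler, offset = self._find(addr)
        return handler.read(offset)

    def write(self, addr, value):
        handler, offset = self._find(addr)
        handler.write(offset, value)


uart = UartLike()
memory_map = MemoryMap()
memory_map.add(0x0000, 0x1000, Ram(0x1000))  # made-up layout
memory_map.add(0x8000, 0x10, uart)

memory_map.write(0x0010, 65)  # ordinary store into RAM
memory_map.write(0x8000, 72)  # same operation, but it "sends" the letter H
```

After the two writes, address 0x0010 simply holds the value 65, while the write to 0x8000 has ended up in `uart.transmitted` instead of in memory.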

As an implementation detail, some hardware, like x86, actually has two memory spaces, where I/O space uses different assembly codes than normal; QEMU has to emulate these alternative accesses. Similarly, many modern CPUs provide themselves a bank of CPU-local registers within the memory map, such as for an interrupt controller.

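【Two address spaces】
x86 has two address spaces; I/O space is accessed with different assembly instructions than normal memory.
QEMU has to emulate these alternative accesses.
Similarly, many modern CPUs provide themselves a bank of CPU-local registers within the memory map, for example for an interrupt controller.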
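
As a rough illustration of "two spaces", the toy class below keeps memory and I/O ports apart, so the same number can name two unrelated things. `TwoSpaceMachine` and its method names are hypothetical; on real x86 the I/O side is reached through the dedicated IN and OUT instructions rather than method calls.

```python
class TwoSpaceMachine:
    """Toy machine with separate memory and I/O port address spaces."""
    def __init__(self):
        self.memory = {}   # memory space: address -> byte
        self.io_log = []   # records what reached the I/O space

    # Reached by ordinary load/store instructions.
    def store(self, addr, value):
        self.memory[addr] = value & 0xFF

    def load(self, addr):
        return self.memory.get(addr, 0)

    # Reached only by dedicated I/O instructions.
    def out(self, port, value):
        self.io_log.append((port, value & 0xFF))

    def in_(self, port):
        return 0xFF  # toy value: pretend no device answers on this port


machine = TwoSpaceMachine()
machine.store(0x60, 0x12)  # memory address 0x60
machine.out(0x60, 0x34)    # I/O port 0x60: same number, different space
```

An emulator has to model both paths, because the guest's choice of instruction decides which space a given number refers to.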

With certain hardware, we have virtualization hooks where the CPU itself makes it easy to trap on just the problematic assembly instructions (those that access I/O space or CPU internal registers, and therefore require side effects different than a normal memory access), so that the guest just executes the same assembly sequence as on bare metal, but that execution then causes a trap to let user-space QEMU then react to the instructions using just its normal user-space memory accesses before returning control to the guest. This is supported in QEMU through "accelerators".

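【CPU hardware support for virtualization】
Some hardware provides virtualization hooks:
the CPU can easily trap just the "problematic" assembly instructions
(those that access I/O space or CPU-internal registers, and therefore need side effects different from a normal memory access),
so the guest simply executes the same assembly sequence as on bare metal.
Executing such an instruction causes a trap; QEMU reacts to it using only its normal user-space memory accesses
and then returns control to the guest. QEMU supports this through "accelerators".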
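
The trap-and-resume cycle can be sketched with a Python generator standing in for the hardware. This is a toy model under stated assumptions: the instruction tuples, `run_guest`, `emulate` and the port numbers are all invented for illustration and do not correspond to any real accelerator interface.

```python
def run_guest(program, registers):
    """Toy 'hardware' run loop: executes ops until one needs emulation.

    Yields an exit record for each trapped op and resumes when the caller
    sends back the result (for reads) or None (for writes).
    """
    for op in program:
        kind = op[0]
        if kind == "add":            # ordinary instruction: no trap
            _, reg, amount = op
            registers[reg] += amount
        elif kind == "io_write":     # problematic instruction: trap
            _, port, reg = op
            yield ("io_write", port, registers[reg])
        elif kind == "io_read":      # also traps; the reply fills the register
            _, port, reg = op
            registers[reg] = yield ("io_read", port, None)


def emulate(program):
    """Toy user-space monitor: reacts to each exit, then resumes the guest."""
    registers = {"a": 0}
    port_state = {0x10: 7}           # hypothetical device register contents
    exits = 0
    guest = run_guest(program, registers)
    reply = None
    try:
        while True:
            # send(None) on the first iteration starts the generator.
            reason, port, value = guest.send(reply)
            exits += 1
            if reason == "io_write":
                port_state[port] = value
                reply = None
            else:                    # io_read
                reply = port_state.get(port, 0)
    except StopIteration:
        pass                         # guest program finished
    return registers, port_state, exits


program = [
    ("io_read", 0x10, "a"),   # traps
    ("add", "a", 1),          # runs without leaving the guest
    ("add", "a", 1),          # runs without leaving the guest
    ("io_write", 0x11, "a"),  # traps
]
final_registers, final_ports, exit_count = emulate(program)
```

Only the two I/O operations leave the guest; the two `add` operations run without involving the monitor, which is where the speed of this approach comes from.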

Virtualizing accelerators, such as KVM, can let a guest run nearly as fast as bare metal, where the slowdowns are caused by each trap from guest back to QEMU (a vmexit) to handle a difficult assembly instruction or memory address. QEMU also supports other virtualizing accelerators (such as HAXM or macOS’s Hypervisor.framework).

QEMU also has a TCG accelerator, which takes the guest assembly instructions and compiles them on the fly into comparable host instructions or calls to host helper routines; while not as fast as hardware acceleration, it allows cross-hardware emulation, such as running ARM code on x86.
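
To give a feel for "compiling on the fly", here is a toy translator that turns a block of made-up guest instructions into host (Python) callables and caches the result. The two-instruction guest language and the names `translate`, `execute` and `translation_cache` are assumptions made for this sketch; it does not reflect how TCG is actually structured.

```python
def translate(guest_block):
    """Turn a list of toy guest instructions into one host callable."""
    host_ops = []
    for instr in guest_block:
        name = instr[0]
        if name == "movi":           # movi reg, constant
            _, reg, const = instr
            host_ops.append(lambda regs, r=reg, c=const: regs.__setitem__(r, c))
        elif name == "add":          # add dst, src  (dst = dst + src)
            _, dst, src = instr
            host_ops.append(
                lambda regs, d=dst, s=src: regs.__setitem__(d, regs[d] + regs[s]))
        else:
            raise ValueError(f"unknown guest instruction {name!r}")

    def host_code(regs):
        for op in host_ops:
            op(regs)
    return host_code


translation_cache = {}


def execute(block_id, guest_block, regs):
    """Translate a block on first use, then reuse the cached host code."""
    if block_id not in translation_cache:
        translation_cache[block_id] = translate(guest_block)
    translation_cache[block_id](regs)


block = [("movi", "r0", 2), ("movi", "r1", 40), ("add", "r0", "r1")]
regs = {}
execute("entry", block, regs)   # translated, then run
execute("entry", block, regs)   # second run reuses the cached translation
```

Because the guest instructions are interpreted in terms of whatever the host can run, nothing requires guest and host to share an instruction set.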

The next thing to realize is what is happening when an OS is accessing various hardware resources. For example, most operating systems ship with a driver that knows how to manage an IDE disk - the driver is merely software that is programmed to make specific I/O requests to a specific subset of the memory map (wherever the IDE bus lives, which is specific to the hardware board). When the IDE controller hardware receives those I/O requests it then performs the appropriate actions (via DMA transfers or other hardware action) to copy data from memory to persistent storage (writing to disk) or from persistent storage to memory (reading from the disk).

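【The essence of a driver】
A driver is merely software whose job is to make specific I/O requests to a specific subset of the memory map.
When the IDE controller receives those requests, it performs the disk reads and writes.
When a disk is first initialized, the OS uses the driver to make enough accesses to the IDE portion of the memory map to partition the disk and create filesystems.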

When you first buy bare-metal hardware, your disk is uninitialized; you install the OS that uses the driver to make enough bare-metal accesses to the IDE hardware portion of the memory map to then turn the disk into a set of partitions and filesystems on top of those partitions.

So, how does QEMU emulate this? In the big memory map it provides to the guest, it emulates an IDE disk at the same address as bare-metal would. When the guest OS driver issues particular memory writes to the IDE control registers in order to copy data from memory to persistent storage, the QEMU accelerator traps accesses to that memory region, and passes the request on to the QEMU IDE controller device model. The device model then parses the I/O requests, and emulates them by issuing host system calls. The result is that guest memory is copied into host storage.

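【How QEMU implements the emulation】
In the big memory map that QEMU provides to the guest, it maps the IDE hardware at the same address as bare metal would.
When the guest accesses that region in order to write to the disk, the QEMU accelerator traps the access
and passes the request on to the QEMU IDE controller device model.
The device model parses the I/O request and emulates it by issuing host system calls.
The end result is that guest memory is copied into host storage.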
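
A device model of this kind can be sketched as a class whose register writes end in ordinary host file I/O. The three-register layout, the command codes and `ToyDiskController` are hypothetical and far simpler than real IDE; the sketch only shows the shape of "trapped register write in, host file access out".

```python
import os
import tempfile

SECTOR_SIZE = 512
REG_SECTOR, REG_GUEST_ADDR, REG_COMMAND = 0, 1, 2   # hypothetical layout
CMD_WRITE, CMD_READ = 1, 2


class ToyDiskController:
    """Hypothetical device model; real IDE registers are more involved."""
    def __init__(self, guest_memory, backing_path):
        self.guest_memory = guest_memory          # a bytearray
        self.backing = open(backing_path, "r+b")  # host file acting as the disk
        self.regs = {REG_SECTOR: 0, REG_GUEST_ADDR: 0}

    def register_write(self, reg, value):
        """Called whenever a guest access to the register bank is trapped."""
        if reg != REG_COMMAND:
            self.regs[reg] = value                # just latch the parameter
            return
        sector = self.regs[REG_SECTOR]
        addr = self.regs[REG_GUEST_ADDR]
        self.backing.seek(sector * SECTOR_SIZE)
        if value == CMD_WRITE:                    # guest memory -> host storage
            self.backing.write(self.guest_memory[addr:addr + SECTOR_SIZE])
            self.backing.flush()
        elif value == CMD_READ:                   # host storage -> guest memory
            data = self.backing.read(SECTOR_SIZE)
            self.guest_memory[addr:addr + len(data)] = data

    def close(self):
        self.backing.close()


# Usage: the "guest driver" writes one sector, then reads it back elsewhere.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.truncate(4 * SECTOR_SIZE)                   # a blank 4-sector disk

guest_memory = bytearray(4096)
guest_memory[0:5] = b"hello"
disk = ToyDiskController(guest_memory, path)
for reg, value in [(REG_SECTOR, 2), (REG_GUEST_ADDR, 0), (REG_COMMAND, CMD_WRITE),
                   (REG_GUEST_ADDR, 1024), (REG_COMMAND, CMD_READ)]:
    disk.register_write(reg, value)
disk.close()

with open(path, "rb") as f:
    f.seek(2 * SECTOR_SIZE)
    on_disk = f.read(5)                           # what landed in the host file
os.remove(path)
```

The guest side never sees the host file: it only writes registers, and the bytes appear in sector 2 of the backing file as a consequence.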

On the host side, the easiest way to emulate persistent storage is via treating a file in the host filesystem as raw data (a 1:1 mapping of offsets in the host file to disk offsets being accessed by the guest driver), but QEMU actually has the ability to glue together a lot of different host formats (raw, qcow2, qed, vhdx, …) and protocols (file system, block device, NBD, Ceph, gluster, …) where any combination of host format and protocol can serve as the backend that is then tied to the QEMU emulation providing the guest device.

Thus, when you tell QEMU to use a host qcow2 file, the guest does not have to know qcow2, but merely has its normal driver make the same register reads and writes as it would on bare metal, which cause vmexits into QEMU code, then QEMU maps those accesses into reads and writes in the appropriate offsets of the qcow2 file. When you first install the guest, all the guest sees is a blank uninitialized linear disk (regardless of whether that disk is linear in the host, as in raw format, or optimized for random access, as in the qcow2 format); it is up to the guest OS to decide how to partition its view of the hardware and install filesystems on top of that, and QEMU does not care what filesystems the guest is using, only what pattern of raw disk I/O register control sequences are issued.
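
The difference between a linear host layout and a remapped one can be shown with two toy backends that receive the same guest write. `MappedBackend` is a hypothetical allocate-on-write format invented for this sketch; it is not the qcow2 on-disk layout, and the 4-byte cluster size is chosen only to keep the numbers small.

```python
CLUSTER = 4  # tiny cluster size so the example is easy to follow


class RawBackend:
    """Guest offset N lives at host offset N."""
    def __init__(self, size):
        self.host = bytearray(size)

    def write(self, offset, data):
        self.host[offset:offset + len(data)] = data

    def read(self, offset, length):
        return bytes(self.host[offset:offset + length])


class MappedBackend:
    """Hypothetical sparse format: clusters are allocated on first write."""
    def __init__(self):
        self.table = {}          # guest cluster index -> host cluster index
        self.host = bytearray()  # grows as clusters are allocated

    def _host_offset(self, offset, allocate):
        index, within = divmod(offset, CLUSTER)
        if index not in self.table:
            if not allocate:
                return None
            self.table[index] = len(self.host) // CLUSTER
            self.host.extend(bytes(CLUSTER))
        return self.table[index] * CLUSTER + within

    def write(self, offset, data):
        for i, byte in enumerate(data):       # byte-at-a-time for clarity
            pos = self._host_offset(offset + i, allocate=True)
            self.host[pos] = byte

    def read(self, offset, length):
        out = bytearray()
        for i in range(length):
            pos = self._host_offset(offset + i, allocate=False)
            out.append(0 if pos is None else self.host[pos])  # holes read as 0
        return bytes(out)


raw, mapped = RawBackend(64), MappedBackend()
for backend in (raw, mapped):
    backend.write(40, b"guest")   # the guest issues the same write to both
```

Both backends return the same bytes when read at guest offset 40, yet the raw one occupies all 64 bytes on the host while the mapped one has allocated only two clusters.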

The next thing to realize is that emulating IDE is not always the most efficient. Every time the guest writes to the control registers, it has to go through special handling, and vmexits slow down emulation. Of course, different hardware models have different performance characteristics when virtualized. In general, however, what works best for real hardware does not necessarily work best for virtualization, and until recently, hardware was not designed to operate fast when emulated by software such as QEMU. Therefore, QEMU includes paravirtualized devices that are designed specifically for this purpose.

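【QEMU uses virtio to ease the performance problem】
Different hardware models have different performance characteristics when virtualized.
In general, however, what works best for real hardware does not necessarily work best for virtualization.
Until recently, hardware was not designed to run fast when emulated by software such as QEMU.
Therefore, QEMU includes paravirtualized devices designed specifically to ease this performance problem.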

The meaning of "paravirtualization" here is slightly different from the original one of "virtualization through cooperation between the guest and host". The QEMU developers have produced a specification for a set of hardware registers and the behavior for those registers which are designed to result in the minimum number of vmexits possible while still accomplishing what a hard disk must do, namely, transferring data between normal guest memory and persistent storage. This specification is called virtio; using it requires installation of a virtio driver in the guest. While no physical device exists that follows the same register layout as virtio, the concept is the same: a virtio disk behaves like a memory-mapped register bank, where the guest OS driver then knows what sequence of register commands to write into that bank to cause data to be copied in and out of other guest memory. Much of the speedup in virtio comes from its design - the guest sets aside a portion of regular memory for the bulk of its command queue, and only has to kick a single register to then tell QEMU to read the command queue (fewer mapped register accesses mean fewer vmexits), coupled with handshaking guarantees that the guest driver won’t be changing the normal memory while QEMU is acting on it.

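【How virtio speeds things up】
virtio works much like the IDE disk reads and writes described above:
it uses a register layout that no physical device implements,
and the virtio driver in the guest operates on those registers.
With virtio, the guest sets aside a region of regular memory as a command queue,
and a single register access is enough to tell QEMU to process the commands in the queue (greatly reducing the number of vmexits).
Handshaking guarantees that the guest driver does not modify that memory while QEMU is acting on it.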
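
The saving can be made concrete by counting trapped register writes in two toy devices that process the same eight requests. Both classes are hypothetical: `RegisterPerFieldDisk` stands in for a register-heavy design and `QueueDisk` for a queue-plus-kick design, and a plain Python list plays the role of the shared guest memory. Neither follows the actual virtio ring layout.

```python
class RegisterPerFieldDisk:
    """Hypothetical register-heavy device: every field is a trapped register."""
    def __init__(self):
        self.exits = 0
        self.completed = []
        self._pending = {}

    def register_write(self, name, value):
        self.exits += 1                       # each register write traps
        self._pending[name] = value
        if name == "command":
            self.completed.append((self._pending["sector"],
                                   self._pending["address"], value))
            self._pending = {}


class QueueDisk:
    """Hypothetical queue-style device: requests sit in ordinary guest memory."""
    def __init__(self, shared_queue):
        self.queue = shared_queue             # plain list standing in for guest RAM
        self.exits = 0
        self.completed = []

    def kick(self):
        self.exits += 1                       # only the notification traps
        while self.queue:
            self.completed.append(self.queue.pop(0))


requests = [(sector, 0x1000 * sector, "write") for sector in range(8)]

legacy = RegisterPerFieldDisk()
for sector, address, command in requests:
    legacy.register_write("sector", sector)
    legacy.register_write("address", address)
    legacy.register_write("command", command)

shared_queue = []
queued = QueueDisk(shared_queue)
shared_queue.extend(requests)                 # normal memory writes: no trap
queued.kick()                                 # one trapped register write
```

Both devices complete the same eight requests, but the first one takes 24 traps to do so and the second takes one.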

As an aside, just like recent hardware is fairly efficient to emulate, virtio is evolving to be more efficient to implement in hardware, of course without sacrificing performance for emulation or virtualization. Therefore, in the future, you could stumble upon physical virtio devices as well.

In a similar vein, many operating systems have support for a number of network cards, a common example being the e1000 card on the PCI bus. On bare metal, an OS will probe PCI space, see that a bank of registers with the signature for e1000 is populated, and load the driver that then knows what register sequences to write in order to let the hardware card transfer network traffic in and out of the guest. So QEMU has, as one of its many network card emulations, an e1000 device, which is mapped to the same guest memory region as a real one would live on bare metal.

The OS probes PCI space to see whether a bank of registers with the e1000 signature is populated.

And once again, the e1000 register layout tends to require a lot of register writes (and thus vmexits) for the amount of work the hardware performs, so the QEMU developers have added the virtio-net card (a PCI hardware specification, although no bare-metal hardware exists yet that actually implements it), such that installing a virtio-net driver in the guest OS can then minimize the number of vmexits while still getting the same side-effects of sending network traffic. If you tell QEMU to start a guest with a virtio-net card, then the guest OS will probe PCI space and see a bank of registers with the virtio-net signature, and load the appropriate driver like it would for any other PCI hardware.

virtio-net is a PCI hardware specification, although no bare-metal hardware implements it yet.
Check inside the guest:
# ethtool -i eth0
driver: virtio_net
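
The probe-and-bind step can be sketched as a table lookup on the signature found in each slot. The `probe` function, the slot names and the bus contents are hypothetical, and the two vendor/device ID pairs are an assumption about what QEMU's e1000 and legacy virtio-net devices present; check them against a PCI ID database before relying on them.

```python
# (vendor ID, device ID) -> driver name.
# Assumed values, not taken from the article: verify before relying on them.
DRIVER_TABLE = {
    (0x8086, 0x100E): "e1000",
    (0x1AF4, 0x1000): "virtio_net",
}


def probe(pci_slots):
    """Return {slot: driver} for each populated slot with a known signature."""
    bound = {}
    for slot, signature in pci_slots.items():
        if signature is None:            # empty slot: nothing answers
            continue
        driver = DRIVER_TABLE.get(signature)
        if driver is not None:           # unknown signatures stay unbound
            bound[slot] = driver
    return bound


# Made-up bus contents: one virtio-net card, one empty slot, one unknown device.
guest_bus = {
    "00:03.0": (0x1AF4, 0x1000),
    "00:04.0": None,
    "00:05.0": (0xABCD, 0x0001),
}
bound = probe(guest_bus)
```

The guest OS treats the virtual card like any other PCI hardware: the signature alone decides which driver is loaded, regardless of whether a physical device with that signature exists.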

In summary, even though QEMU was first written as a way of emulating hardware memory maps in order to virtualize a guest OS, it turns out that the fastest virtualization also depends on virtual hardware: a memory map of registers with particular documented side effects that has no bare-metal counterpart. And at the end of the day, all virtualization really means is running a particular set of assembly instructions (the guest OS) to manipulate locations within a giant memory map for causing a particular set of side effects, where QEMU is just a user-space application providing a memory map and mimicking the same side effects you would get when executing those guest instructions on the appropriate bare metal hardware.

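【The essence and evolution of QEMU】
In summary, even though QEMU was first written as a way of emulating hardware memory maps in order to virtualize a guest,
it turns out that the fastest virtualization also depends on virtual hardware:
a memory map of registers with particular side effects that has no bare-metal counterpart.

The essence of all virtualization
is that the guest runs a particular set of assembly instructions to manipulate locations within a giant memory map, causing a particular set of side effects.
QEMU is just a user-space application that provides the memory map and mimics the same side effects you would get when executing those guest instructions on bare-metal hardware.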

(This post is a slight update on an email originally posted to the qemu-devel list back in July 2017).
