[Translation] QEMU Internals: vhost architecture

Articles in this series:

  1. [Translation] QEMU Internals: Overall architecture and threading model
  2. [Translation] QEMU Internals: vhost architecture (this article)
  3. [Translation] QEMU Internals: Big picture overview

Original article: http://blog.vmsplice.net/2011/09/qemu-internals-vhost-architecture.html

Originally published: September 7, 2011

About the author: Stefan Hajnoczi works on the virtualization team at Red Hat, developing and maintaining the QEMU block layer, network subsystem, and tracing subsystem.

He currently works on multi-core device emulation in QEMU and host/guest file sharing using vsock; in the past he has worked on disk image formats, storage migration, and I/O performance optimization.

QEMU Internals: vhost architecture

This article explains how vhost provides in-kernel virtio device support for KVM.
I have recently been working on vhost-scsi and have answered many questions about ioeventfd, irqfd, and vhost, so I thought this article would be useful for understanding QEMU internals.

Vhost overview
In Linux, the vhost drivers provide in-kernel virtio device emulation.
Before vhost, the virtio backend was normally emulated by the QEMU userspace process.
Vhost implements the virtio backend inside the kernel, taking QEMU userspace out of the picture.
This allows the device emulation code to call directly into kernel subsystems instead of going through system calls from userspace.

The vhost-net driver emulates network card I/O in the host kernel. It is the oldest vhost device and the only one that has been accepted into mainline Linux. The vhost-blk and vhost-scsi projects are also under development.

In Linux 3.0 the vhost code lives in drivers/vhost/.

Common code used by all vhost devices is in drivers/vhost/vhost.c, including the vring access functions that every device needs in order to communicate with the guest.

The vhost-net driver code lives in drivers/vhost/net.c.

The vhost driver model
The vhost-net driver creates a /dev/vhost-net character device on the host; this device is the interface for configuring a vhost-net instance.
When QEMU is launched with -netdev tap,vhost=on, it opens /dev/vhost-net and initializes the vhost-net instance with several ioctl(2) calls.

These calls serve three purposes: associating the QEMU process with the vhost-net instance, preparing for virtio feature negotiation, and passing the guest physical memory mapping to the vhost-net driver.
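
As a rough illustration, the setup could look like the following minimal sketch. This is not QEMU's actual code: the single memory region, the variable names, and the omitted error handling are simplifications; only the ioctl names and structures from <linux/vhost.h> are real.

#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Sketch: open the vhost-net character device, take ownership, and hand a
 * (simplified, single-region) guest memory layout to the kernel driver. */
int vhost_net_setup(void *guest_ram, unsigned long long ram_size,
                    unsigned long long guest_phys_base)
{
    int vhost_fd = open("/dev/vhost-net", O_RDWR);

    /* Associate the calling process (QEMU) with this vhost-net instance. */
    ioctl(vhost_fd, VHOST_SET_OWNER, NULL);

    /* Real QEMU passes its full RAM layout; one region is enough for a sketch. */
    struct vhost_memory *mem =
        calloc(1, sizeof(*mem) + sizeof(struct vhost_memory_region));
    mem->nregions = 1;
    mem->regions[0].guest_phys_addr = guest_phys_base;
    mem->regions[0].memory_size     = ram_size;
    mem->regions[0].userspace_addr  = (unsigned long long)(uintptr_t)guest_ram;
    ioctl(vhost_fd, VHOST_SET_MEM_TABLE, mem);
    free(mem);

    return vhost_fd;
}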

During initialization the vhost driver creates a kernel thread named vhost-$pid, where $pid is the pid of the QEMU process. This thread is called the "vhost worker thread"; its job is to handle I/O events and perform the device emulation.

qemu-kvm process arguments from a production host:
qemu 2726347 1 /usr/libexec/qemu-kvm -netdev tap,fd=40,id=hostnet0,vhost=on,vhostfd=42
and the corresponding kernel thread:
root 2726349 2 [vhost-2726347]

In-kernel virtio emulation
Vhost does not emulate a complete virtio PCI adapter; it restricts itself to virtqueue operations.
QEMU is still used for operations such as virtio feature negotiation and live migration.
In other words, a vhost driver is not a self-contained virtio device implementation: it relies on userspace to handle the control plane, while the data plane is handled in the kernel.
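
To make the control-plane/data-plane split concrete, here is a hedged sketch of the feature negotiation that stays in userspace. The flow is simplified and guest_acked_features is a placeholder for whatever the guest accepted during virtio negotiation; only the VHOST_GET_FEATURES/VHOST_SET_FEATURES ioctls from <linux/vhost.h> are real.

#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Sketch: the data plane moves into the kernel, but feature negotiation is
 * still driven from userspace. QEMU asks the vhost backend which virtio
 * features it supports, negotiates with the guest as usual, and then tells
 * vhost which features were actually accepted. */
unsigned long long negotiate_vhost_features(int vhost_fd,
                                            unsigned long long guest_acked_features)
{
    unsigned long long backend_features;
    ioctl(vhost_fd, VHOST_GET_FEATURES, &backend_features);

    /* Enable only the bits that both the vhost backend and the guest agreed on. */
    unsigned long long enabled = backend_features & guest_acked_features;
    ioctl(vhost_fd, VHOST_SET_FEATURES, &enabled);

    return enabled;
}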

"vhost工做線程"等待virtqueue的kicks,而後處理virtqueue中的buffers數據。
對vhost-net驅動來講,就是從tx virtqueue中取出數據包而且把他們傳輸到tap設備的文件描述符中。

Polling the tap device's file descriptor is also done by the vhost worker thread (that is, it watches that fd for activity): the thread wakes up when packets arrive on the tap device and places them on the rx virtqueue so the guest can receive them.
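
vhost does this polling inside the kernel with its own wait-queue hooks, but the idea can be illustrated with a small userspace analogue that blocks until a tap file descriptor becomes readable. This is only an analogue, not vhost code; the device name "tap0" is a placeholder, the tap device must already exist, and error handling is omitted.

#include <fcntl.h>
#include <net/if.h>
#include <linux/if_tun.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Userspace analogue of "wake up when a packet arrives on the tap fd".
 * vhost-net does the equivalent in the kernel and then copies the packet
 * into the rx virtqueue for the guest. */
int main(void)
{
    int fd = open("/dev/net/tun", O_RDWR);

    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
    strncpy(ifr.ifr_name, "tap0", IFNAMSIZ - 1);   /* placeholder device name */
    ioctl(fd, TUNSETIFF, &ifr);

    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    for (;;) {
        poll(&pfd, 1, -1);                         /* sleep until a frame arrives */
        char frame[2048];
        ssize_t len = read(fd, frame, sizeof(frame));
        printf("received a %zd byte frame\n", len);
        /* vhost-net would now place this frame on the rx virtqueue. */
    }
}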

"vhost工做線程"的運行原理
vhost架構的一個神奇之處是它並非只爲KVM設計的。vhost爲用戶空間提供了接口,並且徹底不依賴於KVM內核模塊。
這意味着其餘用戶空間代碼在理論上也能夠使用vhost設備,好比libpcap(tcpdump使用的庫)若是須要方便、高效的IO接口時。【什麼場景?】

When the guest kicks the host because it has placed buffers on a virtqueue, there needs to be a way to notify the vhost worker thread that there is work to do.

Since vhost is independent of the KVM kernel module, the two cannot communicate directly. Instead, a vhost instance is set up with an eventfd file descriptor that the vhost worker thread watches for activity. The KVM kernel module has a feature called ioeventfd that hooks an eventfd up to a particular guest I/O exit (when the guest performs the I/O access, a vm exit hands CPU control back to the hypervisor). The QEMU userspace process registers an ioeventfd for the VIRTIO_PCI_QUEUE_NOTIFY hardware register access that kicks the virtqueue.

This is how the vhost worker thread gets notified by the KVM kernel module when the guest kicks the virtqueue.
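
A hedged sketch of how that kick path could be wired up is shown below. It assumes vm_fd is an already-created KVM VM file descriptor and notify_addr is the guest PIO address of the legacy VIRTIO_PCI_QUEUE_NOTIFY register; both are placeholders, and error handling is omitted. The structures and ioctls come from <linux/vhost.h> and <linux/kvm.h>.

#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>
#include <linux/kvm.h>

/* Sketch: route a guest kick (a write to VIRTIO_PCI_QUEUE_NOTIFY) straight to
 * the vhost worker thread through a single eventfd, bypassing QEMU userspace
 * on the hot path. */
int wire_up_kick(int vhost_fd, int vm_fd, unsigned int queue_index,
                 unsigned long long notify_addr)
{
    int kick_fd = eventfd(0, 0);

    /* Tell vhost to watch this eventfd for notifications on the virtqueue. */
    struct vhost_vring_file kick = { .index = queue_index, .fd = kick_fd };
    ioctl(vhost_fd, VHOST_SET_VRING_KICK, &kick);

    /* Tell KVM to signal the same eventfd when the guest writes this queue's
     * index to the notify register, instead of exiting to QEMU userspace. */
    struct kvm_ioeventfd io = {
        .datamatch = queue_index,
        .addr      = notify_addr,   /* PIO address of VIRTIO_PCI_QUEUE_NOTIFY */
        .len       = 2,
        .fd        = kick_fd,
        .flags     = KVM_IOEVENTFD_FLAG_DATAMATCH | KVM_IOEVENTFD_FLAG_PIO,
    };
    ioctl(vm_fd, KVM_IOEVENTFD, &io);

    return kick_fd;
}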

On the return trip, the vhost worker thread interrupts the guest in a similar way. Vhost takes a "call" file descriptor and writes to it in order to kick the guest.
The KVM kernel module has a feature called irqfd that allows an eventfd to trigger guest interrupts. The QEMU userspace process registers an irqfd for the virtio PCI device's interrupt and hands it to the vhost instance.
This is how the vhost worker thread can interrupt the guest.
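
The call side could be set up along the same lines, as in the sketch below. Again this is a simplified illustration, not QEMU code: gsi is a placeholder for whichever guest interrupt line the virtio PCI device's interrupt is routed to, and error handling is omitted.

#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>
#include <linux/kvm.h>

/* Sketch: give vhost a "call" eventfd and register the same eventfd with KVM
 * as an irqfd. When the worker thread writes to it, KVM injects the guest
 * interrupt mapped to gsi, again without a round trip through userspace. */
int wire_up_call(int vhost_fd, int vm_fd, unsigned int queue_index,
                 unsigned int gsi)
{
    int call_fd = eventfd(0, 0);

    /* The vhost worker thread writes to this eventfd to "call" the guest. */
    struct vhost_vring_file call = { .index = queue_index, .fd = call_fd };
    ioctl(vhost_fd, VHOST_SET_VRING_CALL, &call);

    /* Writes on the eventfd become interrupts on the given guest IRQ line. */
    struct kvm_irqfd irqfd = { .fd = call_fd, .gsi = gsi };
    ioctl(vm_fd, KVM_IRQFD, &irqfd);

    return call_fd;
}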

In the end, the vhost instance only knows about the guest memory mapping, a kick eventfd, and a call eventfd.

Reference code:
drivers/vhost/vhost.c - common code used by all vhost devices
drivers/vhost/net.c - the vhost-net driver
virt/kvm/eventfd.c - ioeventfd and irqfd

The QEMU userspace code that initializes the vhost instance:
hw/vhost.c - common vhost initialization code
hw/vhost_net.c - vhost-net initialization code

=== Selected comments ===
Comment 1:
The figure shows a DMA transfer happening inside QEMU.
Could you explain where this DMA transfer is initiated?
My understanding is that only the physical NIC's driver can perform DMA on its RX/TX buffers,
and that vhost itself cannot.
Does vhost communicate with the physical NIC's driver over sockets?
If so, is the transfer of the RX/TX buffers just an ordinary memcpy?
Reply 1:
On DMA and memory copies:
vhost-net supports zero-copy transmit. That is, the guest's RAM is mapped so that the physical NIC can DMA directly from it.
On the receive path there is still a copy from host kernel socket buffers into guest RAM (which is mapped there by the QEMU userspace process).
See the handle_tx() and handle_rx() functions in drivers/vhost/net.c.
On vhost talking to the physical NIC over sockets:
vhost-net does not communicate directly with the physical NIC's driver; it only talks to the tun driver (tap or macvtap devices).
The tap device is usually attached to a bridge so that traffic can reach the physical NIC.
vhost-net uses the in-kernel socket interface (struct socket), but it only works with tun driver instances; it refuses a regular socket file descriptor. See drivers/vhost/net.c:get_tap_socket().
Note that the tun driver's socket is not exposed to userspace as a socket file descriptor the way AF_PACKET sockets are; it is used only inside the kernel.


The original article follows:

QEMU Internals: vhost architecture

This post explains how vhost provides in-kernel virtio devices for KVM. I have been hacking on vhost-scsi and have answered questions about ioeventfd, irqfd, and vhost recently, so I thought this would be a useful QEMU Internals post.

Vhost overview

The vhost drivers in Linux provide in-kernel virtio device emulation. Normally the QEMU userspace process emulates I/O accesses from the guest. Vhost puts virtio emulation code into the kernel, taking QEMU userspace out of the picture. This allows device emulation code to directly call into kernel subsystems instead of performing system calls from userspace.

The vhost-net driver emulates the virtio-net network card in the host kernel. Vhost-net is the oldest vhost device and the only one which is available in mainline Linux. Experimental vhost-blk and vhost-scsi devices have also been developed.

In Linux 3.0 the vhost code lives in drivers/vhost/. Common code that is used by all devices is in drivers/vhost/vhost.c. This includes the virtio vring access functions which all virtio devices need in order to communicate with the guest. The vhost-net code lives in drivers/vhost/net.c.

The vhost driver model

The vhost-net driver creates a /dev/vhost-net character device on the host. This character device serves as the interface for configuring the vhost-net instance.

When QEMU is launched with -netdev tap,vhost=on it opens /dev/vhost-net and initializes the vhost-net instance with several ioctl(2) calls. These are necessary to associate the QEMU process with the vhost-net instance, prepare for virtio feature negotiation, and pass the guest physical memory mapping to the vhost-net driver.

During initialization the vhost driver creates a kernel thread called vhost-$pid, where $pid is the QEMU process pid. This thread is called the "vhost worker thread". The job of the worker thread is to handle I/O events and perform the device emulation.

In-kernel virtio emulation

Vhost does not emulate a complete virtio PCI adapter. Instead it restricts itself to virtqueue operations only. QEMU is still used to perform virtio feature negotiation and live migration, for example. This means a vhost driver is not a self-contained virtio device implementation, it depends on userspace to handle the control plane while the data plane is done in-kernel.

The vhost worker thread waits for virtqueue kicks and then handles buffers that have been placed on the virtqueue. In vhost-net this means taking packets from the tx virtqueue and transmitting them over the tap file descriptor.

File descriptor polling is also done by the vhost worker thread. In vhost-net the worker thread wakes up when packets come in over the tap file descriptor and it places them into the rx virtqueue so the guest can receive them.

Vhost as a userspace interface

One surprising aspect of the vhost architecture is that it is not tied to KVM in any way. Vhost is a userspace interface and has no dependency on the KVM kernel module. This means other userspace code, like libpcap, could in theory use vhost devices if they find them convenient high-performance I/O interfaces.

When a guest kicks the host because it has placed buffers onto a virtqueue, there needs to be a way to signal the vhost worker thread that there is work to do. Since vhost does not depend on the KVM kernel module they cannot communicate directly. Instead vhost instances are set up with an eventfd file descriptor which the vhost worker thread watches for activity. The KVM kernel module has a feature known as ioeventfd for taking an eventfd and hooking it up to a particular guest I/O exit. QEMU userspace registers an ioeventfd for the VIRTIO_PCI_QUEUE_NOTIFY hardware register access which kicks the virtqueue. This is how the vhost worker thread gets notified by the KVM kernel module when the guest kicks the virtqueue.

On the return trip from the vhost worker thread to interrupting the guest a similar approach is used. Vhost takes a "call" file descriptor which it will write to in order to kick the guest. The KVM kernel module has a feature called irqfd which allows an eventfd to trigger guest interrupts. QEMU userspace registers an irqfd for the virtio PCI device interrupt and hands it to the vhost instance. This is how the vhost worker thread can interrupt the guest.

In the end the vhost instance only knows about the guest memory mapping, a kick eventfd, and a call eventfd.

Where to find out more

Here are the main points to begin exploring the code:
  • drivers/vhost/vhost.c - common vhost driver code
  • drivers/vhost/net.c - vhost-net driver
  • virt/kvm/eventfd.c - ioeventfd and irqfd
The QEMU userspace code shows how to initialize the vhost instance:
  • hw/vhost.c - common vhost initialization code
  • hw/vhost_net.c - vhost-net initialization