The Road to Cloud Computing, on Alibaba Cloud: Further Analysis of the "Black 0.1 Second" Problem Based on Xen's I/O Model

Date: 2019-11-10


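After discovering that the "black 0.1 second" our cloud server suffers when reading from the OCS cache occurs while the socket is reading data, specifically on the first bytes of the read, and that even a socket write (for example, writing the cache key) can take more than 50 ms, our curiosity reached a new height.

In our own measurements, establishing a new TCP connection on the cloud server usually takes only about 3 ms. During the black 0.1 second, the TCP packet has already reached the network card, yet getting it from the network card into memory takes more than 100 ms, which is hard to imagine! Thinking it over, if .NET or Windows had a problem like this, Microsoft would not be the world's largest software company but the world's largest con artist; that possibility is really very, very small.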
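The kind of timing described above (connection setup, socket write, and the wait for the first byte of the reply) can be sketched as follows. This is only an illustration, not the measurement code used in this series: `start_echo_server` is a hypothetical loopback stand-in for the cache server, the payload is made up, and the 0.1 s stall is simulated on the server side simply so that the first-byte timer has something to show.

```python
import socket
import threading
import time


def start_echo_server(delay_s=0.0):
    """Start a loopback server that echoes one request after delay_s seconds.

    Hypothetical stand-in for the cache server; returns the (host, port)
    it listens on.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(1)

    def serve():
        conn, _ = srv.accept()
        with conn:
            data = conn.recv(1024)
            time.sleep(delay_s)  # simulated stall before the reply is sent
            conn.sendall(data)
        srv.close()

    threading.Thread(target=serve, daemon=True).start()
    return srv.getsockname()


def measure(addr, payload=b"get some-key\r\n"):
    """Return (connect_ms, write_ms, first_byte_ms) for one request."""
    t0 = time.perf_counter()
    sock = socket.create_connection(addr, timeout=5)
    t1 = time.perf_counter()
    with sock:
        sock.sendall(payload)
        t2 = time.perf_counter()
        first = sock.recv(1)  # block until the first byte of the reply arrives
        t3 = time.perf_counter()
    if not first:
        raise ConnectionError("peer closed before replying")
    return (t1 - t0) * 1000, (t2 - t1) * 1000, (t3 - t2) * 1000


if __name__ == "__main__":
    connect_ms, write_ms, first_byte_ms = measure(start_echo_server(delay_s=0.1))
    print(f"connect {connect_ms:.1f} ms, write {write_ms:.1f} ms, "
          f"first byte {first_byte_ms:.1f} ms")
```

Timing from inside the application only shows that the first byte was late; it cannot tell by itself whether the delay came from the server, the network, or the local machine, which is why a packet capture is needed alongside it.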

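So we believe the prime suspect for the "black 0.1 second" problem is still Alibaba Cloud's Xen virtual machines. Our earlier analysis of the black n seconds problem (n greater than 1) also pointed to Xen as the prime suspect. If Xen really is at fault, then this is not just an Alibaba Cloud problem, which raised our curiosity another notch.

Since the "black 0.1 second" occurs at Xen's network I/O layer, what are we waiting for? Time to learn how Xen virtualizes I/O!

A quick Google search turned up an important paper on Xen: Diagnosing Performance Overheads in the Xen Virtual Machine Environment.

The second paragraph of section "3.1 Xen" in this paper describes Xen's I/O model:

3.1 Xen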

The latest version of the Xen architecture introduces a new I/O model, where special privileged virtual machines called driver domains own specific hardware devices and run their I/O device drivers. All other domains (guest domains in Xen terminology) run a simple device driver that communicates via the device channel mechanism with the driver domain to access the real hardware devices. Driver domains directly access hardware devices that they own; however, interrupts from these devices are first handled by the VMM which then notifies the corresponding driver domain through virtual interrupts delivered over the event mechanism. The guest domain exchanges service requests and responses with the driver domain over an I/O descriptor ring in the device channel. An asynchronous inter-domain event mechanism is used to send notification of queued messages. To support high-performance devices, references to page-sized buffers are transferred over the I/O descriptor ring rather than actual I/O data (the latter would require copying). When data is sent by the guest domain, Xen uses a sharing mechanism where the guest domain permits the driver domain to map the page with the data and pin it for DMA by the device. When data is sent by the driver domain to the guest domain, Xen uses a page-remapping mechanism which maps the page with the data into the guest domain in exchange for an unused page provided by the guest domain.

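The world of virtual machines really is different. In Xen, it turns out, each physical device is managed by a dedicated privileged virtual machine (a driver domain), and all other virtual machines (guest domains; a cloud server runs in a guest domain) must go through the corresponding driver domain to access the physical device. The driver domain runs the device driver that can access the physical device directly, whereas the driver in a guest domain is merely an intermediary: it communicates with the driver domain through the device channel mechanism to carry out access to the physical device (see the figure below, from the slides for Diagnosing Performance Overheads in the Xen Virtual Machine Environment). (Key point 1: network I/O operations in a cloud server are ultimately carried out by the driver domain.)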
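To make the device channel idea more concrete, here is a toy model of the I/O descriptor ring described in the quoted paragraph. It is a simplified sketch and not Xen's actual ring layout: the class and method names (`DescriptorRing`, `guest_submit`, `driver_service`, `guest_collect`) are invented for illustration, and the `notifications` list merely stands in for the asynchronous event mechanism. What it does show is that the ring carries references to pages rather than the data itself, and that only the driver domain performs the real device I/O.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    req_id: int
    page_ref: int            # reference to a page-sized buffer, not the data
    status: str = "pending"


class DescriptorRing:
    """Toy fixed-size ring shared by one guest domain and one driver domain."""

    def __init__(self, size=4):
        self.size = size
        self.requests = deque()
        self.responses = deque()
        self.notifications = []  # stands in for the asynchronous event mechanism

    def in_flight(self):
        return len(self.requests) + len(self.responses)

    def guest_submit(self, req_id, page_ref):
        """Guest side: queue a request, or return False if the ring is full."""
        if self.in_flight() >= self.size:
            return False
        self.requests.append(Descriptor(req_id, page_ref))
        self.notifications.append("guest->driver")
        return True

    def driver_service(self, device_io):
        """Driver domain side: drain requests and perform the real device I/O."""
        handled = 0
        while self.requests:
            desc = self.requests.popleft()
            device_io(desc.page_ref)  # only the driver domain touches the device
            desc.status = "done"
            self.responses.append(desc)
            handled += 1
        if handled:
            self.notifications.append("driver->guest")
        return handled

    def guest_collect(self):
        """Guest side: pick up completed responses and free their ring slots."""
        done = list(self.responses)
        self.responses.clear()
        return done


if __name__ == "__main__":
    ring = DescriptorRing(size=2)
    ring.guest_submit(1, page_ref=0x10)
    ring.guest_submit(2, page_ref=0x11)
    ring.driver_service(lambda ref: print(f"device I/O on page {ref:#x}"))
    print([d.req_id for d in ring.guest_collect()])
```

Because a slot stays occupied until the guest collects the response, a slow consumer on either side makes the other side wait, which is one reason latency in this path is sensitive to scheduling of both domains.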

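Interrupts from physical devices are first handled by the VMM (Virtual Machine Monitor), which then uses the event mechanism to notify the corresponding driver domain through a virtual interrupt. (Key point 2: when the network card receives a packet, the interrupt is first handled by the VMM and then forwarded to the driver domain.) Section 6.1.1 of the paper makes the same point: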

For each packet received, the network device raises a physical interrupt which is first handled in Xen, and then the appropriate "event" is delivered to the correct driver domain through a virtual interrupt mechanism.
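The two-step delivery in key point 2 can be pictured with a very small model. This is a sketch under obvious simplifications: `ToyVMM`, `DriverDomain`, and their methods are hypothetical names, and a real hypervisor delivers events asynchronously and only when the target domain is scheduled, which this model does not capture.

```python
class DriverDomain:
    """A privileged domain that owns one or more physical devices."""

    def __init__(self, name):
        self.name = name
        self.pending_events = []

    def deliver_virtual_interrupt(self, event):
        self.pending_events.append(event)


class ToyVMM:
    """Maps each physical device to the driver domain that owns it."""

    def __init__(self):
        self.owners = {}

    def assign_device(self, device, domain):
        self.owners[device] = domain

    def on_physical_interrupt(self, device, packet_id):
        # Step 1: the VMM, not the driver domain, sees the physical interrupt.
        domain = self.owners[device]
        # Step 2: it forwards the interrupt as an event to the owning domain.
        domain.deliver_virtual_interrupt((device, packet_id))
        return domain.name


if __name__ == "__main__":
    vmm = ToyVMM()
    net_domain = DriverDomain("net-driver-domain")
    vmm.assign_device("eth0", net_domain)
    vmm.on_physical_interrupt("eth0", packet_id=1)
    print(net_domain.pending_events)
```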

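When the driver domain sends data from a physical device (for example, a network packet received by the network card) to a guest domain, Xen uses the page-remapping mechanism. The driver domain first reads the data from the physical device into memory, then exchanges those memory pages for unused pages in the guest domain, and the guest domain then reads those pages directly; it is a bit of a sleight-of-hand swap. (Key point 3: when a socket reads data, a page-remapping operation takes place between the driver domain and the guest domain.) Section 6.1.2 of the paper also mentions this: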

For each page of network data received, the page is remapped into the guest domain and the driver domain acquires a replacement page from the guest.
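The page exchange in key point 3 can be modeled in a few lines. This is a toy illustration only: `Domain` and `flip_page` are hypothetical names, pages are plain dictionary entries, and real page remapping is done by the hypervisor updating page tables. The point of the sketch is that ownership of the page moves while its contents are never copied, and that the driver domain ends up with a replacement page.

```python
class Domain:
    """A domain with a set of memory pages: page number -> contents."""

    def __init__(self, name, pages):
        self.name = name
        self.pages = dict(pages)  # a value of None marks an unused page


def flip_page(driver, guest, data_page, free_page):
    """Move driver's data_page into guest in exchange for guest's free_page."""
    if guest.pages.get(free_page, "missing") is not None:
        raise ValueError("guest must offer an unused page")
    data = driver.pages.pop(data_page)
    guest.pages.pop(free_page)
    guest.pages[data_page] = data   # ownership changes; contents are not copied
    driver.pages[free_page] = None  # replacement page for the next receive


if __name__ == "__main__":
    packet = bytearray(b"reply from cache")
    driver = Domain("driver-domain", {100: packet})
    guest = Domain("guest-domain", {7: None})
    flip_page(driver, guest, data_page=100, free_page=7)
    print(guest.pages[100] is packet)  # same buffer object, no copy was made
    print(driver.pages)
```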

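Now look at what happens during the "black 0.1 second". The Wireshark capture shows that the TCP packet from OCS had already reached the guest domain at that moment:

What does this tell us? First, look at a more detailed diagram of the Xen I/O architecture (image from [pdf] Xen I/O Overview):

We infer that the TCP packet had already reached Netfront in the diagram above, that is, the guest domain's network card. In other words, the physical network card had received the packet and raised an interrupt; the VMM had handled the interrupt and forwarded it to the driver domain; the driver domain had read the data from the network card into memory; and the page-remapping with the guest domain had already completed. When the socket reads data, it is actually reading the memory page remapped from the driver domain, and the "black 0.1 second" happened during that read.

Here is an even more detailed diagram (image from Optimizing Network Virtualization in Xen):

In the diagram above, the "black 0.1 second" occurs when the guest domain reads the packet data from Hypervisor Page Flipping.

From this analysis, we think the problem may occur when the guest domain reads data from the remapped memory page. That read involves not only memory but also the CPU: the instructions the CPU is executing, the CPU caches, and the distance between the CPU and memory. This is a more complicated problem, and at present we have neither enough knowledge nor enough reference material to analyze it. We can only leave the question here and hope that experienced readers can offer leads.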