Revisiting Network Support for RDMA

從新審視RDMA的網絡支持

本文爲SIGCOMM 2018會議論文。

筆者翻譯了該論文。因爲時間倉促,且筆者英文能力有限,錯誤之處在所不免;歡迎讀者批評指正。

本文及翻譯版本僅用於學習使用。若是有任何不當,請聯繫筆者刪除。

Abstract (摘要)

The advent of RoCE (RDMA over Converged Ethernet) has led to a significant increase in the use of RDMA in datacenter networks. To achieve good performance, RoCE requires a lossless network which is in turn achieved by enabling Priority Flow Control (PFC) within the network. However, PFC brings with it a host of problems such as head-of-the-line blocking, congestion spreading, and occasional deadlocks. Rather than seek to fix these issues, we instead ask: is PFC fundamentally required to support RDMA over Ethernet?

RoCE(RDMA over Converged Ethernet,基於融合以太網的RDMA)的出現使得RDMA在數據中心網絡中的使用量顯着增長。爲了得到良好的性能,RoCE要求網絡是不丟包網絡,這經過在網絡中啓用優先級流控(Priority Flow Control, PFC)來實現。然而,PFC帶來了許多問題,例如隊頭阻塞、擁塞擴散和偶爾的死鎖。 咱們不是解決這些問題,而是要問:爲了支持基於以太網的RDMA,PFC是不是必須的?

We show that the need for PFC is an artifact of current RoCE NIC designs rather than a fundamental requirement. We propose an improved RoCE NIC (IRN) design that makes a few simple changes to the RoCE NIC for better handling of packet losses. We show that IRN (without PFC) outperforms RoCE (with PFC) by 6-83% for typical network scenarios. Thus not only does IRN eliminate the need for PFC, it improves performance in the process! We further show that the changes that IRN introduces can be implemented with modest overheads of about 3-10% to NIC resources. Based on our results, we argue that research and industry should rethink the current trajectory of network support for RDMA.

咱們代表,對PFC的需求是當前RoCE NIC設計的產物,而不是基本要求。咱們提出了一種改進的RoCE NIC (IRN)設計,經過對RoCE NIC進行一些簡單的更改,以便更好地處理數據包丟失。咱們代表,對於典型的網絡場景,IRN(沒有PFC)優於RoCE(使用PFC) 6-83%。 所以,IRN不只消除了對PFC的需求,並且在此過程當中還提升了性能!咱們進一步代表,IRN引入的更改能夠經過大約3-10%的適度NIC資源開銷來實現。根據咱們的結果,咱們認爲研究界和工業界應從新考慮當前RDMA的網絡支持。

1 Introduction (引言)

Datacenter networks offer higher bandwidth and lower latency than traditional wide-area networks. However, traditional endhost networking stacks, with their high latencies and substantial CPU overhead, have limited the extent to which applications can make use of these characteristics. As a result, several large datacenters have recently adopted RDMA, which bypasses the traditional networking stacks in favor of direct memory accesses.

與傳統的廣域網相比,數據中心網絡具備更高的帶寬和更低的延遲。可是,傳統的主機端網絡協議棧具備高延遲和較大的CPU開銷,這限制了應用程序利用數據中心網絡高帶寬和低延遲特性的程度。所以,最近幾個大型數據中心採用了RDMA;RDMA繞過了傳統的網絡協議棧,採用直接內存訪問。

RDMA over Converged Ethernet (RoCE) has emerged as the canonical method for deploying RDMA in Ethernet-based datacenters [23, 38]. The centerpiece of RoCE is a NIC that (i) provides mechanisms for accessing host memory without CPU involvement and (ii) supports very basic network transport functionality. Early experience revealed that RoCE NICs only achieve good end-to-end performance when run over a lossless network, so operators turned to Ethernet's Priority Flow Control (PFC) mechanism to achieve minimal packet loss. The combination of RoCE and PFC has enabled a wave of datacenter RDMA deployments.

融合以太網上的RDMA(RoCE)已經成爲在基於以太網的數據中心中部署RDMA的規範方法[23,38]。RoCE的核心是一個NIC,它(i)提供了在沒有CPU參與的狀況下訪問主機內存的機制,並(ii)支持很是基本的網絡傳輸功能。早期的經驗代表,RoCE NIC只有在不丟包網絡上運行時才能取得良好的端到端性能,所以運營商轉向以太網優先級流控(PFC)機制,以實現最少的數據包丟失。 RoCE和PFC的組合推動了一波數據中心RDMA部署浪潮。

However, the current solution is not without problems. In particular, PFC adds management complexity and can lead to significant performance problems such as head-of-the-line blocking, congestion spreading, and occasional deadlocks [23, 24, 35, 37, 38]. Rather than continue down the current path and address the various problems with PFC, in this paper we take a step back and ask whether it was needed in the first place. To be clear, current RoCE NICs require a lossless fabric for good performance. However, the question we raise is: can the RoCE NIC design be altered so that we no longer need a lossless network fabric?

然而,目前的解決方案仍然存在問題。特別地,PFC增長了管理的複雜性,並可能致使嚴重的性能問題,如隊頭阻塞、擁塞傳播和偶爾的死鎖[23,24,35,37,38]。 與繼續沿着當前路徑解決PFC的各類問題不一樣,本文中咱們退後一步,詢問PFC是不是必須的。 須要明確的是,目前的RoCE NIC須要網絡是不丟包的才能得到良好的性能。 可是,咱們提出的問題是:RoCE網卡設計是否能夠改變,以便咱們再也不須要不丟包網絡?

We answer this question in the affirmative, proposing a new design called IRN (for Improved RoCE NIC) that makes two incremental changes to current RoCE NICs (i) more efficient loss recovery, and (ii) basic end-to-end flow control to bound the number of in-flight packets (§3). We show, via extensive simulations on a RoCE simulator obtained from a commercial NIC vendor, that IRN performs better than current RoCE NICs, and that IRN does not require PFC to achieve high performance; in fact, IRN often performs better without PFC (§4). We detail the extensions to the RDMA protocol that IRN requires (§5) and use comparative analysis and FPGA synthesis to evaluate the overhead that IRN introduces in terms of NIC hardware resources (§6). Our results suggest that adding IRN functionality to current RoCE NICs would add as little as 3-10% overhead in resource consumption, with no deterioration in message rates. 

咱們確定地回答了這個問題,提出了一個名爲IRN(Improved RoCE NIC,改進的RoCE NIC)的新設計,它對當前的RoCE NIC進行了兩個增量式更改(i)更有效的丟包恢復機制,以及(ii)基本的端到端流控(限制飛行中(in-flight)數據包的數量(§3))。 咱們在從商用NIC供應商處得到的RoCE仿真器上進行了大量仿真,結果代表IRN的性能優於當前的RoCE NIC,而且IRN不須要PFC來取得高性能。實際上,IRN在沒有PFC的狀況下一般表現更好(§4)。 咱們詳細介紹了IRN要求的對RDMA協議的擴展(§5),並使用比較分析和FPGA綜合來評估IRN在NIC硬件資源方面引入的開銷(§6)。咱們的結果代表,在當前的RoCE網卡中添加IRN功能會增長3-10%的資源開銷,而不會下降消息速率。

A natural question that arises is how IRN compares to iWARP? iWARP [33] long ago proposed a similar philosophy as IRN: handling packet losses efficiently in the NIC rather than making the network lossless. What we show is that iWARP’s failing was in its design choices. The differences between iWARP and IRN designs stem from their starting points: iWARP aimed for full generality which led them to put the full TCP/IP stack on the NIC, requiring multiple layers of translation between RDMA abstractions and traditional TCP bytestream abstractions. As a result, iWARP NICs are typically far more complex than RoCE ones, with higher cost and lower performance (§2). In contrast, IRN starts with the much simpler design of RoCE and asks what minimal features can be added to eliminate the need for PFC.

一個明顯的問題是與iWARP比較,IRN如何?好久之前,iWARP [33]提出了與IRN相似的哲學:在NIC中有效地處理數據包丟失而不是使網絡不丟包。咱們代表iWARP的失敗之處在於它的設計選擇。iWARP和IRN設計之間的差別源於他們的出發點:iWARP旨在實現全面的通用性,這使得他們將完整的TCP/IP協議棧實現於NIC中,須要在RDMA抽象和傳統的TCP字節流抽象之間進行多層轉換。 所以,iWARP NIC一般比RoCE更復雜、成本更高,且性能更低(§2)。相比之下,IRN從更簡單的RoCE設計開始,並詢問能夠經過添加哪些最小功能以消除對PFC的需求。

More generally: while the merits of iWARP vs. RoCE has been a long-running debate in industry, there is no conclusive or rigorous evaluation that compares the two architectures. Instead, RoCE has emerged as the de-facto winner in the marketplace, and brought with it the implicit (and still lingering) assumption that a lossless fabric is necessary to achieve RoCE’s high performance. Our results are the first to rigorously show that, counter to what market adoption might suggest, iWARP in fact had the right architectural philosophy, although a needlessly complex design approach. 

更通常地說:雖然iWARP與RoCE的優勢一直是業界長期爭論的問題,但沒有比較兩種架構的結論性或嚴格的評估。相反,RoCE已成爲市場上事實上的贏家,並帶來了隱含(而且仍然揮之不去)的假設,即不丟包網絡是實現RoCE高性能所必需的。咱們的結果是第一個嚴格代表(與市場採用建議相反),儘管具備一種沒必要要的複雜設計方法,iWARP實際上具備正確的架構理念。

Hence, one might view IRN and our results in one of two ways: (i) a new design for RoCE NICs which, at the cost of a few incremental modifications, eliminates the need for PFC and leads to better performance, or, (ii) a new incarnation of the iWARP philosophy which is simpler in implementation and faster in performance.

所以,能夠經過如下兩種方式之一來審視IRN和咱們的結果:(i)RoCE NIC的新設計,以少許增量修改成代價,消除了對PFC的需求並致使更好的性能,或者,(ii)iWARP理念的新實現,其實現更簡單,性能更好。

2 Background (背景)

We begin with reviewing some relevant background. 

咱們以回顧一些相關背景開始。

2.1 Infiniband RDMA and RoCE (Infiniband RDMA和ROCE)

RDMA has long been used by the HPC community in special-purpose Infiniband clusters that use credit-based flow control to make the network lossless [4]. Because packet drops are rare in such clusters, the RDMA Infiniband transport (as implemented on the NIC) was not designed to efficiently recover from packet losses. When the receiver receives an out-of-order packet, it simply discards it and sends a negative acknowledgement (NACK) to the sender. When the sender sees a NACK, it retransmits all packets that were sent after the last acknowledged packet (i.e., it performs a go-back-N retransmission).

長期以來,HPC(高性能計算)社區一直在特定用途的Infiniband集羣中使用RDMA,這些集羣使用基於信用的流控制來使網絡不丟包[4]。因爲數據包丟失在此類羣集中不多見,所以RDMA Infiniband傳輸層(在NIC上實現)並不是旨在高效恢復數據包丟失。 當接收方收到無序數據包時,它只是丟棄該數據包並向發送方發送否認確認(NACK)。當發送方看到NACK時,它從新發送在最後一個確認的數據包以後發送的全部數據包(即,它執行回退N重傳)。

To take advantage of the widespread use of Ethernet in datacenters, RoCE [5, 9] was introduced to enable the use of RDMA over Ethernet. RoCE adopted the same Infiniband transport design (including go-back-N loss recovery), and the network was made lossless using PFC. 

爲了利用以太網在數據中心中的普遍使用的優點,引入了RoCE [5,9],以便在以太網上使用RDMA(咱們對RoCE [5]及其後繼者RoCEv2 [9]使用術語RoCE,RoCEv2使得RDMA不只能夠經過以太網運行,還能夠運行在IP路由網絡。)。RoCE採用了相同的Infiniband傳輸層設計(包括回退N重傳),而且使用PFC使網絡不丟包。

2.2 Priority Flow Control (優先級流控)

Priority Flow Control (PFC) [6] is Ethernet's flow control mechanism, in which a switch sends a pause (or X-OFF) frame to the upstream entity (a switch or a NIC), when the queue exceeds a certain configured threshold. When the queue drains below this threshold, an X-ON frame is sent to resume transmission. When configured correctly, PFC makes the network lossless (as long as all network elements remain functioning). However, this coarse reaction to congestion is agnostic to which flows are causing it and this results in various performance issues that have been documented in numerous papers in recent years [23, 24, 35, 37, 38]. These issues range from mild (e.g., unfairness and head-of-line blocking) to severe, such as "pause spreading" as highlighted in [23] and even network deadlocks [24, 35, 37]. In an attempt to mitigate these issues, congestion control mechanisms have been proposed for RoCE (e.g., DCQCN [38] and Timely [29]) which reduce the sending rate on detecting congestion, but are not enough to eradicate the need for PFC. Hence, there is now a broad agreement that PFC makes networks harder to understand and manage, and can lead to myriad performance problems that need to be dealt with.

優先級流控(PFC)[6]是以太網的流量控制機制,當隊列超過某個特定的配置閾值時,交換機會向上遊實體(交換機或NIC)發送暫停(或X-OFF)幀。當隊列低於此閾值時,將發送X-ON幀以恢復傳輸。正確配置後,PFC使網絡不丟包(只要全部網絡元素保持正常運行)。然而,這種對擁堵的粗略反應對哪些流致使擁塞是不可知的,這致使了近年來在許多論文中記載的各類性能問題[23,24,35,37,38]。這些問題的範圍從輕微(例如,不公平和線頭阻塞)到嚴重(例如[23]中突出顯示的「暫停傳播」),甚至是網絡死鎖[24,35,37]。爲了緩解這些問題,已經爲RoCE提出了擁塞控制機制(例如,DCQCN [38]和Timely [29]),其下降了擁塞時的發送速率,可是不足以消除對PFC的需求。所以,如今廣泛認爲PFC使網絡更難理解和管理,而且可能致使須要處理的無數性能問題。
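To make the pause/resume behaviour above concrete, the following is a minimal sketch (in Python, written for this translation; the class, method and threshold names are illustrative assumptions, not taken from the paper or from any switch implementation) of the per-queue decision a PFC-enabled switch applies:

# Simplified per-ingress-queue PFC logic (illustrative sketch only).
class PfcQueue:
    def __init__(self, xoff_threshold_bytes, xon_threshold_bytes):
        self.xoff = xoff_threshold_bytes      # pause upstream above this occupancy
        self.xon = xon_threshold_bytes        # resume upstream below this occupancy
        self.occupancy = 0                    # bytes currently buffered
        self.paused_upstream = False

    def on_enqueue(self, pkt_bytes, send_pause_frame):
        self.occupancy += pkt_bytes
        if not self.paused_upstream and self.occupancy > self.xoff:
            send_pause_frame("X-OFF")         # ask the upstream NIC/switch to stop sending
            self.paused_upstream = True

    def on_dequeue(self, pkt_bytes, send_pause_frame):
        self.occupancy -= pkt_bytes
        if self.paused_upstream and self.occupancy < self.xon:
            send_pause_frame("X-ON")          # let the upstream entity resume transmission
            self.paused_upstream = False

Because the pause applies to an entire priority class on the upstream link, every flow sharing that class is stopped regardless of which flow filled the queue; this is the root of the head-of-the-line blocking and congestion spreading discussed above.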

2.3 iWARP vs RoCE (iWARP對比RoCE )

iWARP [33] was designed to support RDMA over a fully general (i.e., not loss-free) network. iWARP implements the entire TCP stack in hardware along with multiple other layers that it needs to translate TCP's byte stream semantics to RDMA segments. Early in our work, we engaged with multiple NIC vendors and datacenter operators in an attempt to understand why iWARP was not more broadly adopted (since we believed the basic architectural premise underlying iWARP was correct). The consistent response we heard was that iWARP is significantly more complex and expensive than RoCE, with inferior performance [13].

iWARP [33]旨在經過徹底通用(即非不丟包網絡)網絡支持RDMA。iWARP在硬件中實現了整個TCP棧以及將TCP的字節流語義轉換爲RDMA分段所需的多個其餘層。 在咱們的早期工做中,咱們與多家NIC供應商和數據中心運營商合做,試圖瞭解爲何iWARP沒有獲得更普遍的採用(由於咱們認爲iWARP的基本架構前提是正確的)。咱們聽到的一致反應是iWARP明顯比RoCE更復雜和昂貴,性能較差[13]。

We also looked for empirical datapoints to validate or refute these claims. We ran RDMA Write benchmarks on two machines connected to one another, using Chelsio T-580-CR 40Gbps iWARP NICs on both machines for one set of experiments, and Mellanox MCX416A-BCAT 56Gbps RoCE NICs (with link speed set to 40Gbps) for another. Both NICs had similar specifications, and at the time of purchase, the iWARP NIC cost $760, while the RoCE NIC cost $420. Raw NIC performance values for 64 bytes batched Writes on a single queue-pair are reported in Table 1. We find that iWARP has 3× higher latency and 4× lower throughput than RoCE.

咱們還尋找經驗數據點以驗證或駁斥這些觀點。咱們在兩臺相互鏈接的機器上運行RDMA寫入基準測試,兩臺機器上的Chelsio T-580-CR 40Gbps iWARP網卡用於一組實驗,Mellanox MCX416A-BCAT 56Gbps RoCE網卡(鏈路速度設置爲40Gbps)用於另外一組實驗。兩個NIC的規格相似,在購買時,iWARP NIC售價760美圓,而RoCE NIC售價420美圓。表1中給出了單個隊列對上64字節批量寫入的原始NIC性能值。咱們發現iWARP的延遲比RoCE高3倍,吞吐量低4倍。

表1: 單一隊列對情形下64字節RDMA寫操做的iWARP和RoCE NIC的原始性能
NIC 吞吐量 延遲
Chelsio T-580-CR (iWARP) 3.24 Mpps 2.89 us
Mellanox MCX416A-BCAT (RoCE) 14.7 Mpps 0.94 us

These price and performance differences could be attributed to many factors other than transport design complexity (such as differences in profit margins, supported features and engineering effort) and hence should be viewed as anecdotal evidence at best. Nonetheless, they show that our conjecture (in favor of implementing loss recovery at the endhost NIC) was certainly not obvious based on current iWARP NICs.

這些價格和性能差別可歸因於除傳輸層設計複雜性以外的許多因素(例如,利潤率差別、支持的特性和工程投入),所以充其量只能被視爲軼事證據。儘管如此,它們代表:基於當前的iWARP NIC,咱們的猜測(支持在終端主機NIC上實現丟包恢復)確定是不明顯的。

Our primary contribution is to show that iWARP, somewhat surprisingly, did in fact have the right philosophy: explicitly handling packet losses in the NIC leads to better performance than having a lossless network. However, efficiently handling packet loss does not require implementing the entire TCP stack in hardware as iWARP did. Instead, we identify the incremental changes to be made to current RoCE NICs, leading to a design which (i) does not require PFC yet achieves better network-wide performance than both RoCE and iWARP (§4), and (ii) is much closer to RoCE's implementation with respect to both NIC performance and complexity (§6) and is thus significantly less complex than iWARP.

咱們的主要貢獻是代表iWARP實際上確實具備正確的理念(儘管有些驚訝):在NIC中顯式處理數據包丟失致使比不丟包網絡更好的性能。可是,有效地處理數據包丟失並不須要像iWARP那樣在硬件中實現整個TCP棧。相反,咱們肯定了對當前RoCE NIC進行的增量更改,從而致使一種設計(i)不須要PFC,但取得比RoCE和iWARP(§4)更好的網絡範圍性能,而且(ii)就NIC性能和複雜度而言,更接近於RoCE的實現(§6),所以複雜度比iWARP顯著簡化。

3 IRN Design (IRN設計)

We begin with describing the transport logic for IRN. For simplicity, we present it as a general design independent of the specific RDMA operation types. We go into the details of handling specific RDMA operations with IRN later in §5. 

咱們首先描述IRN的傳輸層邏輯。 爲了簡單起見,咱們將其做爲一種獨立於特定RDMA操做類型的通用設計。咱們將在§5中詳細介紹如何使用IRN處理特定的RDMA操做。

Changes to the RoCE transport design may introduce overheads in the form of new hardware logic or additional per-flow state. With the goal of keeping such overheads as small as possible, IRN strives to make minimal changes to the RoCE NIC design in order to eliminate its PFC requirement, as opposed to squeezing out the best possible performance with a more sophisticated design (we evaluate the small overhead introduced by IRN later in §6).

對RoCE傳輸設計的更改可能會以新硬件邏輯或額外的每一個流狀態的形式引入開銷。爲了儘量減小這種開銷,IRN努力對RoCE NIC設計進行微小的更改,以消除其PFC要求,而不是經過更復雜的設計來擠出最佳性能(咱們在§6評估 IRN引入的小開銷)。

IRN, therefore, makes two key changes to current RoCE NICs, as described in the following subsections: (1) improving the loss recovery mechanism, and (2) basic end-to-end flow control (termed BDP-FC) which bounds the number of in-flight packets by the bandwidth-delay product of the network. We justify these changes by empirically evaluating their significance, and exploring some alternative design choices later in §4.3. Note that these changes are orthogonal to the use of explicit congestion control mechanisms (such as DCQCN [38] and Timely [29]) that, as with current RoCE NICs, can be optionally enabled with IRN.

所以,IRN對當前的RoCE NIC進行了兩個關鍵更改,如如下小節所述:(1)改進丟包恢復機制,以及(2)基本的端到端流控(稱爲BDP-FC),它經過網絡的帶寬-延遲積限制了飛行中的數據包數量。 咱們經過實證評估它們的重要性來證實這些變化的合理性,並在§4.3中探索一些替代設計選擇。 請注意,這些更改與使用顯式擁塞控制機制(例如DCQCN [38]和Timely [29])正交;與當前的RoCE NIC同樣,IRN能夠選擇性地啓用它們。

3.1 IRN’s Loss Recovery Mechanism (IRN丟包恢復機制)

As discussed in §2, current RoCE NICs use a go-back-N loss recovery scheme. In the absence of PFC, redundant retransmissions caused by go-back-N loss recovery result in significant performance penalties (as evaluated in §4). Therefore, the first change we make with IRN is a more efficient loss recovery, based on selective retransmission (inspired by TCP's loss recovery), where the receiver does not discard out of order packets and the sender selectively retransmits the lost packets, as detailed below.

如§2中所述,當前的RoCE NIC使用回退N丟包恢復方案。 在沒有PFC的狀況下,由回退N丟包恢復引發的冗餘重傳致使顯着的性能損失(如§4中所評估的)。所以,咱們使用IRN進行的第一個更改是基於選擇性重傳(受TCP的丟包恢復啓發)的更有效的丟失恢復,其中接收方不丟棄無序數據包,而且發送方選擇性地重傳丟失的數據包,以下所述。

Upon every out-of-order packet arrival, an IRN receiver sends a NACK, which carries both the cumulative acknowledgment (indicating its expected sequence number) and the sequence number of the packet that triggered the NACK (as a simplified form of selective acknowledgement or SACK).

在每一個亂序數據包到達時,IRN接收方發送NACK,其攜帶累積確認(指示其預期序列號)和觸發NACK的數據包的序列號(做爲選擇性確認或SACK的簡化形式)。
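As a rough illustration of the receiver behaviour just described, the sketch below (Python, with assumed function and field names; the real RoCE/IRN wire format is not shown) produces either a cumulative ACK or a NACK that carries both the cumulative acknowledgement and the PSN that triggered it:

# Illustrative IRN receiver reaction to an arriving data packet.
def on_data_packet(psn, expected_psn, received_psns):
    """Returns (ack_type, cumulative_ack, sacked_psn) and the new expected PSN."""
    received_psns.add(psn)                      # out-of-order packets are kept, not discarded
    if psn == expected_psn:
        while expected_psn in received_psns:    # advance the cumulative acknowledgement point
            expected_psn += 1
        return ("ACK", expected_psn, None), expected_psn
    # Out-of-order arrival: the NACK carries the expected sequence number
    # (cumulative ack) plus the PSN that triggered it (a simplified SACK).
    return ("NACK", expected_psn, psn), expected_psn

For example, if packets 0, 1 and 3 arrive in that order, the arrival of 3 produces ("NACK", 2, 3), telling the sender that everything below 2 was received and that 3 arrived out of order.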

An IRN sender enters loss recovery mode when a NACK is received or when a timeout occurs. It also maintains a bitmap to track which packets have been cumulatively and selectively acknowledged. When in the loss recovery mode, the sender selectively retransmits lost packets as indicated by the bitmap, instead of sending new packets. The first packet that is retransmitted on entering loss recovery corresponds to the cumulative acknowledgement value. Any subsequent packet is considered lost only if another packet with a higher sequence number has been selectively acked. When there are no more lost packets to be retransmitted, the sender continues to transmit new packets (if allowed by BDP-FC). It exits loss recovery when a cumulative acknowledgement greater than the recovery sequence is received, where the recovery sequence corresponds to the last regular packet that was sent before the retransmission of a lost packet. 

當收到NACK或發生超時時,IRN發送方進入丟包恢復模式。它還維護一個位圖,以跟蹤哪些數據包已被累積並有選擇地確認。當處於丟失恢復模式時,發送方選擇性地從新發送丟失的數據包(根據位圖信息),而不是發送新的數據包。 在進入丟失恢復時從新發送的第一個數據包對應於累積確認值。 只有在選擇性地確認了具備更高序列號的另外一個數據包時,才認爲後續數據包丟失了。 當沒有更多丟失的數據包要從新傳輸時,發送方繼續傳輸新數據包(若是BDP-FC容許)。 當接收到大於恢復序列號的累積確認時,它退出丟包恢復模式,其中恢復序列號對應於在重傳丟失分組以前發送的最後一個常規分組。
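The sender-side bookkeeping described above can be summarized with the following sketch (Python; a simplified reading of this paragraph rather than vendor NIC logic, with the bitmap modelled as a set and the BDP-FC check of §3.2 left out):

# Simplified IRN sender loss-recovery state (illustrative).
class IrnSender:
    def __init__(self):
        self.sacked = set()        # selectively acknowledged PSNs (a bitmap in hardware)
        self.cum_ack = 0           # all PSNs below this value are acknowledged
        self.next_new_psn = 0      # next never-sent PSN
        self.in_recovery = False
        self.recovery_seq = None   # last regular packet sent before the first retransmission

    def on_nack(self, cum_ack, sacked_psn):
        self.cum_ack = max(self.cum_ack, cum_ack)
        self.sacked.add(sacked_psn)
        if not self.in_recovery:
            self.in_recovery = True
            self.recovery_seq = self.next_new_psn - 1

    def on_ack(self, cum_ack):
        self.cum_ack = max(self.cum_ack, cum_ack)
        if self.in_recovery and self.cum_ack > self.recovery_seq:
            self.in_recovery = False            # exit loss recovery

    def next_to_send(self):
        if self.in_recovery:
            highest_sacked = max(self.sacked, default=-1)
            # A packet is treated as lost only if it is unacked and some
            # higher-numbered packet has been selectively acknowledged.
            for psn in range(self.cum_ack, highest_sacked):
                if psn not in self.sacked:
                    return ("retransmit", psn)
        return ("new", self.next_new_psn)       # subject to the BDP-FC cap of §3.2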

SACKs allow efficient loss recovery only when there are multiple packets in flight. For other cases (e.g., for single packet messages), loss recovery gets triggered via timeouts. A high timeout value can increase the tail latency of such short messages. However, keeping the timeout value too small can result in too many spurious retransmissions, affecting the overall results. An IRN sender, therefore, uses a low timeout value of RTOlow only when there are a small N number of packets in flight (such that spurious retransmissions remains negligibly small), and a higher value of RTOhigh otherwise. We discuss how the values of these parameters are set in §4, and how the timeout feature in current RoCE NICs can be easily extended to support this in §6. 

只有在有多個飛行中數據包時,SACK才能實現有效的丟包恢復。 對於其餘狀況(例如,對於單個數據包消息),經過超時觸發丟包恢復。 高超時值可能會增長此類短消息的尾部延遲。 可是,保持過小的超時值會致使過多的虛假重傳,從而影響總體結果。 所以,IRN發送方僅在存在少許N個飛行中數據包時使用RTOlow的低超時值(使得虛假重傳較小且可忽略),不然使用較高的RTOhigh值。 咱們將在§4中討論如何設置這些參數的值,以及在§6中討論如何輕鬆擴展當前RoCE NIC中的超時功能以支持此功能。
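The two-level timeout reduces to a one-line policy; in the sketch below the constants are the default values later given in §4.1 and are used here only for illustration:

# Timeout selection for an IRN sender (defaults taken from §4.1).
RTO_LOW_S, RTO_HIGH_S, N_SMALL = 100e-6, 320e-6, 3

def retransmission_timeout(packets_in_flight):
    # Use the aggressive timeout only when few packets are in flight,
    # so that spurious retransmissions remain negligible.
    return RTO_LOW_S if packets_in_flight <= N_SMALL else RTO_HIGH_S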

3.2 IRN’s BDP-FC Mechanism (IRN的BDP-FC機制)

The second change we make with IRN is introducing the notion of a basic end-to-end packet level flow control, called BDP-FC, which bounds the number of outstanding packets in flight for a flow by the bandwidth-delay product (BDP) of the network, as suggested in [17]. This is a static cap that we compute by dividing the BDP of the longest path in the network (in bytes) with the packet MTU set by the RDMA queue-pair (typically 1KB in RoCE NICs). An IRN sender transmits a new packet only if the number of packets in flight (computed as the difference between current packet’s sequence number and last acknowledged sequence number) is less than this BDP cap. 

咱們對IRN作出的第二個改變是引入了一個基本的端到端數據包級流控的概念,稱爲BDP-FC,它經過網絡的帶寬延遲積(BDP)限制數據流的飛行中數據包的數量, 如[17]中所建議的。 這是一個靜態上限,咱們經過將網絡中最長路徑的BDP(以字節爲單位)除以RDMA隊列對設置的數據包MTU(在RoCE網卡中一般爲1KB)來計算。 只有當飛行中的數據包數量(按當前數據包的序列號和最後確認的序列號之間的差別計算)小於此BDP上限時,IRN發送方纔會發送新數據包。
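A sketch of the BDP-FC admission check follows (Python; the 120KB and 1KB figures are the defaults quoted in §4.1, and the helper name is an assumption made for this illustration):

# BDP-FC: cap the number of in-flight packets at the network's BDP.
BDP_BYTES = 120 * 1000             # longest-path bandwidth-delay product (default case, §4.1)
MTU_BYTES = 1024                   # typical RDMA MTU on RoCE NICs
BDP_CAP = BDP_BYTES // MTU_BYTES   # the paper quotes ~110 MTU-sized packets

def can_send_new_packet(next_psn, last_acked_psn):
    in_flight = next_psn - last_acked_psn
    return in_flight < BDP_CAP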

BDP-FC improves the performance by reducing unnecessary queuing in the network. Furthermore, by strictly upper bounding the number of out-of-order packet arrivals, it greatly reduces the amount of state required for tracking packet losses in the NICs (discussed in more details in §6). 

BDP-FC經過減小網絡中沒必要要的排隊來提升性能。 此外,經過嚴格限制亂序數據包到達的數量,它大大減小了跟蹤NIC中數據包丟失所需的狀態量(在第6節中有更詳細的討論)。

As mentioned before, IRN’s loss recovery has been inspired by TCP’s loss recovery. However, rather than incorporating the entire TCP stack as is done by iWARP NICs, IRN: (1) decouples loss recovery from congestion control and does not incorporate any notion of TCP congestion window control involving slow start, AIMD or advanced fast recovery, (2) operates directly on RDMA segments instead of using TCP’s byte stream abstraction, which not only avoids the complexity introduced by multiple translation layers (as needed in iWARP), but also allows IRN to simplify its selective acknowledgement and loss tracking schemes. We discuss how these changes effect performance towards the end of §4. 

如前所述,IRN的丟包恢復受到了TCP丟包恢復的啓發。 可是,IRN不是像iWARP網卡那樣整合整個TCP棧,而是:(1)將丟包恢復與擁塞控制解耦,而且不包含涉及慢啓動、AIMD或高級快速恢復的TCP擁塞窗口控制的任何概念(2)直接在RDMA分段上運行,而不是使用TCP的字節流抽象,這不只避免了多個轉換層引入的複雜性(在iWARP中須要),並且還容許IRN簡化其選擇性確認和丟包跟蹤方案。 咱們將在§4結束時討論這些變化如何影響性能。

4 Evaluating IRN’s Transport Logic (評估IRN的傳輸層邏輯)

We now confront the central question of this paper: Does RDMA require a lossless network? If the answer is yes, then we must address the many difficulties of PFC. If the answer is no, then we can greatly simplify network management by letting go of PFC. To answer this question, we evaluate the network-wide performance of IRN's transport logic via extensive simulations. Our results show that IRN performs better than RoCE, without requiring PFC. We test this across a wide variety of experimental scenarios and across different performance metrics. We end this section with a simulation-based comparison of IRN with Resilient RoCE [34] and iWARP [33].

咱們如今面對本文的核心問題:RDMA是否須要不丟包網絡? 若是答案是確定的,那麼咱們必須解決PFC的諸多難題。 若是答案是否認的,那麼咱們能夠經過放棄PFC來大大簡化網絡管理。 爲了回答這個問題,咱們經過大量的模擬評估了IRN傳輸層邏輯的網絡性能。 咱們的結果代表IRN比RoCE表現更好,而且不須要PFC。 咱們在各類實驗場景和不一樣的性能指標中對此進行了測試。 咱們以IRN與彈性RoCE [34]和iWARP [33]的模擬比較結束本節。

4.1 Experimental Settings (實驗設置)

We begin with describing our experimental settings. 

咱們首先介紹實驗設置。

Simulator: Our simulator, obtained from a commercial NIC vendor, extends INET/OMNET++ [1, 2] to model the Mellanox ConnectX4 RoCE NIC [10]. RDMA queue-pairs (QPs) are modelled as UDP applications with either RoCE or IRN transport layer logic, that generate flows (as described later). We define a flow as a unit of data transfer comprising of one or more messages between the same source-destination pair as in [29, 38]. When the sender QP is ready to transmit data packets, it periodically polls the MAC layer until the link is available for transmission. The simulator implements DCQCN as implemented in the Mellanox ConnectX-4 ROCE NIC [34], and we add support for a NIC-based Timely implementation. All switches in our simulation are input-queued with virtual output ports, that are scheduled using round-robin. The switches can be configured to generate PFC frames by setting appropriate buffer thresholds.

模擬器:咱們的模擬器,從某個商業NIC供應商處得到,其擴展了INET/OMNET++ [1,2],以模擬Mellanox ConnectX4 RoCE NIC [10]。 RDMA隊列對(QP)被建模爲具備RoCE或IRN傳輸層邏輯的UDP應用,其生成數據流(如稍後所述)。咱們將數據流定義爲數據傳輸單元,包括與[29,38]中相同的源-目的地對之間的一個或多個消息。 當發送方QP準備好發送數據包時,它會週期性地輪詢MAC層,直到鏈路可用於傳輸。 模擬器實現了在Mellanox ConnectX-4 ROCE NIC [34]中實現的DCQCN,而且咱們添加了對基於NIC的Timely實現的支持。 咱們模擬中的全部交換機都使用虛擬輸出端口的輸入端口排隊機制,使用循環調度。 經過設置適當的緩衝閾值,能夠配置交換機以生成PFC幀。

Default Case Scenario: For our default case, we simulate a 54-server three-tiered fat-tree topology, connected by a fabric with full bisection-bandwidth constructed from 45 6-port switches organized into 6 pods [16]. We consider 40Gbps links, each with a propagation delay of 2µs, resulting in a bandwidth-delay product (BDP) of 120KB along the longest (6-hop) path. This corresponds to ∼110 MTU-sized packets (assuming typical RDMA MTU of 1KB). 

默認情景:對於咱們的默認情景,咱們模擬一個由54臺服務器構成的三層胖樹拓撲,由45個6端口交換機組成6個pod、具備全對分帶寬(full bisection-bandwidth)的網絡互連[16]。 咱們考慮40Gbps鏈路,每條鏈路的傳播延遲爲2μs,致使沿最長(6跳)路徑的帶寬延遲積(BDP)爲120KB(筆者注:40Gbps*2us*6*2=120KB,最後一個乘2的緣由是BDP計算的往返時間)。 這至關於約110個MTU大小的數據包(假設典型的RDMA MTU爲1KB)。
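For readers who want to check the 120KB figure (and the translator's note above), a small worked calculation under the stated assumptions of 40Gbps links and 2µs propagation per hop over 6 hops:

# Reproducing the 120KB longest-path BDP quoted above.
link_rate_bytes_per_s = 40e9 / 8           # 40Gbps link
one_way_delay_s = 6 * 2e-6                 # 6 hops x 2us propagation per hop
rtt_s = 2 * one_way_delay_s                # BDP is taken over the round-trip time
bdp_bytes = link_rate_bytes_per_s * rtt_s  # = 120,000 bytes ~= 120KB, i.e. ~110 1KB packets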

Each end host generates new flows with Poisson interarrival times [17, 30]. Each flow's destination is picked randomly and size is drawn from a realistic heavy-tailed distribution derived from [19]. Most flows are small (50% of the flows are single packet messages with sizes ranging between 32 bytes-1KB representing small RPCs such as those generated by RDMA based key-value stores [21, 25]), and most of the bytes are in large flows (15% of the flows are between 200KB-3MB, representing background RDMA traffic such as storage). The network load is set at 70% utilization for our default case. We use ECMP for load-balancing [23]. We vary different aspects from our default scenario (including topology size, workload pattern and link utilization) in §4.4.

每一個終端主機以Poisson到達間隔時間產生新數據流[17,30]。 每一個數據流的目的地都是隨機挑選的,大小取自源於[19]的現實重尾分佈。大多數流量很小(50%的數據流是單個數據包消息,大小介於32字節-1KB之間,表示小型RPC,例如基於RDMA的鍵值存儲生成的那些[21,25]),大多數字節都在大數據流中(15%的數據流在200KB-3MB之間,表明背景RDMA數據流,如存儲)。對於咱們的默認情景,網絡負載設置爲70%利用率。 咱們使用ECMP進行負載平衡[23]。 咱們在§4.4中改變默認情景的不一樣方面(包括拓撲大小、工做負載模式和鏈路利用率)。

Parameters: RTOhigh is set to an estimation of the maximum round trip time with one congested link. We compute this as the sum of the propagation delay on the longest path and the maximum queuing delay a packet would see if the switch buffer on a congested link is completely full. This is approximately 320µs for our default case. For IRN, we set RTOlow to 100µs (representing the desirable upper-bound on tail latency for short messages) with N set to a small value of 3. When using RoCE without PFC, we use a fixed timeout value of RTOhigh. We disable timeouts when PFC is enabled to prevent spurious retransmissions. We use buffers sized at twice the BDP of the network (which is 240KB in our default case) for each input port [17, 18]. The PFC threshold at the switches is set to the buffer size minus a headroom equal to the upstream link's bandwidth-delay product (needed to absorb all packets in flight along the link). This is 220KB for our default case. We vary these parameters in §4.4 to show that our results are not very sensitive to these specific choices. When using RoCE or IRN with Timely or DCQCN, we use the same congestion control parameters as specified in [29] and [38] respectively. For fair comparison with PFC-based proposals [37, 38], the flow starts at line-rate for all cases.

參數:RTOhigh設置爲具備一條擁塞鏈路的估計最大往返時間。咱們將其計算爲最長路徑上的傳播延遲與數據包在擁塞鏈路的交換機緩衝區徹底滿時所經歷的最大排隊延遲之和。對於咱們的默認狀況,這大約是320μs。對於IRN,咱們將RTOlow設置爲100μs(表示短消息的尾部延遲的指望上限),其中N設置爲較小的值3。當使用不帶PFC的RoCE時,咱們使用固定超時值RTOhigh。咱們在啓用PFC時禁用超時以防止虛假重傳。對於每一個輸入端口,咱們使用大小爲網絡BDP兩倍的緩衝區(在默認狀況下爲240KB)[17,18]。交換機的PFC閾值設置爲緩衝區大小減去淨空間(等於上游鏈路的帶寬延遲積,須要吸取沿鏈路上飛行中的全部數據包)。對於咱們的默認狀況,這是220KB。咱們在§4.4中改變這些參數,以代表咱們的結果對這些特定選擇不是很是敏感。當使用具備Timely或DCQCN的RoCE或IRN時,咱們分別使用與[29]和[38]中指定的相同的擁塞控制參數。爲了與基於PFC的方案進行公平比較[37,38],全部情形下數據流均從線速開始。
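The buffer and PFC-threshold numbers above follow from the topology constants; the sketch below is one plausible derivation (it assumes the upstream link's bandwidth-delay product is taken over a single-hop round trip of 4µs), not a statement of how the simulator computes them:

# Deriving the default buffer and PFC threshold values quoted above (illustrative).
link_rate_bytes_per_s = 40e9 / 8
hop_rtt_s = 2 * 2e-6                                   # one hop, both directions
headroom_bytes = link_rate_bytes_per_s * hop_rtt_s     # 20KB: absorbs packets in flight on the upstream link
buffer_bytes = 2 * 120 * 1000                          # per-input-port buffer: 2x BDP = 240KB
pfc_threshold_bytes = buffer_bytes - headroom_bytes    # = 220KB, matching the text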

Metrics: We primarily look at three metrics: (i) average slowdown, where slowdown for a flow is its completion time divided by the time it would have taken to traverse its path at line rate in an empty network, (ii) average flow completion time (FCT), (iii) 99%ile or tail FCT. While the average and tail FCTs are dominated by the performance of throughput-sensitive flows, the average slowdown is dominated by the performance of latency-sensitive short flows. 

度量標準:咱們主要考慮三個指標:(i)平均放緩,其中數據流的放緩是其完成時間除以在空網絡中以線速穿越其路徑所花費的時間,(ii)平均流完成時間(FCT),(iii)99%ile或尾部FCT。 雖然平均和尾部FCT主要受吞吐量敏感性數據流的影響,但平均減速主要受延遲敏感性短流的影響。

4.2 Basic Results (基礎結果)

We now present our basic results comparing IRN and RoCE for our default scenario. Unless otherwise specified, IRN is always used without PFC, while RoCE is always used with PFC for the results presented here. 

咱們如今提供咱們的基本結果,比較IRN和RoCE的默認狀況。 除非另有說明,IRN始終在沒有PFC的狀況下使用,而RoCE始終與PFC一塊兒使用。

4.2.1 IRN performs better than RoCE. We begin with comparing IRN’s performance with current RoCE NIC’s. The results are shown in Figure 1. IRN’s performance is up to 2.8-3.7× better than RoCE across the three metrics. This is due to the combination of two factors: (i) IRN’s BDP-FC mechanism reduces unnecessary queuing and (ii) unlike RoCE, IRN does not experience any congestion spreading issues, since it does not use PFC. (explained in more details below). 

4.2.1 IRN比RoCE表現更好。 咱們首先將IRN的性能與當前的RoCE NIC進行比較。 結果如圖1所示。在三個指標中,IRN的性能比RoCE高2.8-3.7倍。這是因爲兩個因素的綜合做用:(i)IRN的BDP-FC機制減小了沒必要要的排隊;(ii)與RoCE不一樣,IRN不會遇到任何擁塞傳播問題,由於它不使用PFC。 (在下面更詳細地解釋)。

圖1:比較IRN和RoCE的性能。

4.2.2 IRN does not require PFC. We next study how IRN's performance is impacted by enabling PFC. If enabling PFC with IRN does not improve performance, we can conclude that IRN's loss recovery is sufficient to eliminate the requirement for PFC. However, if enabling PFC with IRN significantly improves performance, we would have to conclude that PFC continues to be important, even with IRN's loss recovery. Figure 2 shows the results of this comparison. Remarkably, we find that not only is PFC not required, but it significantly degrades IRN's performance (increasing the value of each metric by about 1.5-2×). This is because of the head-of-the-line blocking and congestion spreading issues PFC is notorious for: pauses triggered by congestion at one link, cause queue build up and pauses at other upstream entities, creating a cascading effect. Note that, without PFC, IRN experiences significantly high packet drops (8.5%), which also have a negative impact on performance, since it takes about one round trip time to detect a packet loss and another round trip time to recover from it. However, the negative impact of a packet drop (given efficient loss recovery), is restricted to the flow that faces congestion and does not spread to other flows, as in the case of PFC. While these PFC issues have been observed before [23, 29, 38], we believe our work is the first to show that a well-designed loss-recovery mechanism outweighs a lossless network.

4.2.2 IRN不須要PFC。咱們接下來研究啓用PFC如何影響IRN的性能。若是啓用PFC不會使IRN的性能提升,咱們能夠得出結論,IRN的丟包恢復足以消除對PFC的要求。可是,若是啓用PFC能夠顯着提升IRN的性能,咱們必須得出結論,即便使用IRN的丟包恢復機制,PFC仍然很重要。圖2顯示了這種比較的結果。值得注意的是,咱們發現不只不須要PFC,並且它還會顯着下降IRN的性能(將每一個指標的值增長約1.5-2倍)。這是由於PFC的臭名昭着的線頭阻塞和擁塞擴散問題:一條鏈路上的擁塞引起的暫停,致使隊列創建並暫停其餘上游實體,從而產生級聯效應。請注意,若是沒有PFC,IRN會經歷顯著較高的數據包丟失(8.5%),這也會對性能產生負面影響,由於它須要大約一個往返時間來檢測數據包丟失以及另外一個往返時間以恢復丟包。然而,數據包丟失的負面影響(考慮有效的丟包恢復機制)僅限於經歷擁塞的數據流,而不會像PFC那樣擴散到其它數據流。雖然在[23,29,38]已經觀察到這些PFC問題,但咱們認爲咱們的工做首次代表設計良好的丟包恢復機制優於不丟包網絡。

圖2: 啓用PFC對IRN的影響。

4.2.3 RoCE requires PFC. Given the above results, the next question one might have is whether RoCE required PFC in the first place? Figure 3 shows the performance of RoCE with and without PFC. We find that the use of PFC helps considerably here. Disabling PFC degrades performance by 1.5-3× across the three metrics. This is because of the go-back-N loss recovery used by current RoCE NICs, which penalizes performance due to (i) increased congestion caused by redundant retransmissions and (ii) the time and bandwidth wasted by flows in sending these redundant packets.

4.2.3 RoCE須要PFC。 鑑於上述結果,下一個可能的問題是RoCE是否須要PFC? 圖3顯示了使用和不使用PFC的RoCE的性能。 咱們發現PFC的使用在這裏有很大幫助。 禁用PFC會使三個指標的性能下降1.5-3倍。 這是由於當前RoCE NIC使用的回退N丟包恢復,因爲(i)由冗餘重傳引發的擁塞增長以及(ii)發送這些冗餘分組時所浪費的時間和帶寬,這會損害性能。

圖3:不使用PFC對RoCE的影響。

4.2.4 Effect of Explicit Congestion Control. The previous comparisons did not use any explicit congestion control. However, as mentioned before, RoCE today is typically deployed in conjunction with some explicit congestion control mechanism such as Timely or DCQCN. We now evaluate whether using such explicit congestion control mechanisms affect the key trends described above.

4.2.4顯式擁塞控制的影響。 以前的比較沒有使用任何顯式的擁塞控制。 然而,如前所述,今天的RoCE一般與一些顯式擁塞控制機制(例如Timely或DCQCN)一塊兒部署。 咱們如今評估使用這種顯式擁塞控制機制是否影響上述關鍵趨勢。

Figure 4 compares IRN and RoCE’s performance when Timely or DCQCN is used. IRN continues to perform better by up to 1.5-2.2× across the three metrics. 

圖4比較了使用Timely或DCQCN時的IRN和RoCE的性能。 在三個指標中,IRN的表現仍然比RoCE高1.5-2.2倍。

圖4:使用顯式擁塞控制(Timely和DCQCN)時IRN和RoCE的性能比較。

Figure 5 evaluates the impact of enabling PFC with IRN, when Timely or DCQCN is used. We find that, IRN’s performance is largely unaffected by PFC, since explicit congestion control reduces both the packet drop rate as well as the number of pause frames generated. The largest performance improvement due to enabling PFC was less than 1%, while its largest negative impact was about 3.4%.

圖5評估了使用Timely或DCQCN時啓用PFC對IRN的影響。 咱們發現,因爲顯式擁塞控制下降了數據包丟包率以及生成的暫停幀數,所以IRN的性能在很大程度上不受PFC的影響。 因爲啓用PFC,最大性能提高不到1%,而其最大的負面影響約爲3.4%。

圖5:使用顯式擁塞控制時,啓用PFC對IRN的影響。

Finally, Figure 6 compares RoCE's performance with and without PFC, when Timely or DCQCN is used. We find that, unlike IRN, RoCE (with its inefficient go-back-N loss recovery) requires PFC, even when explicit congestion control is used. Enabling PFC improves RoCE's performance by 1.35× to 3.5× across the three metrics.

最後,圖6比較了使用Timely或DCQCN時,使用和不使用PFC時RoCE的性能。咱們發現,與IRN不一樣,RoCE(使用低效的回退N丟包恢復)須要PFC,即便使用顯式擁塞控制也是如此。 在三個指標中,啓用PFC可將RoCE的性能提升1.35倍至3.5倍。

圖6:當使用顯式擁塞控制時,不啓用PFC對RoCE的影響。

Key Takeaways: The following are, therefore, the three takeaways from these results: (1) IRN (without PFC) performs better than RoCE (with PFC), (2) IRN does not require PFC, and (3) RoCE requires PFC.

關鍵要點:所以,如下是這些結果的三個要點:(1)IRN(無PFC)的性能優於RoCE(帶PFC),(2)IRN不須要PFC,(3)RoCE須要PFC。

4.3 Factor Analysis of IRN (IRN的因子分析)

We now perform a factor analysis of IRN, to individually study the significance of the two key changes IRN makes to RoCE, namely (1) efficient loss recovery and (2) BDP-FC. For this we compare IRN's performance (as evaluated in §4.2) with two different variations that highlight the significance of each change: (1) enabling go-back-N loss recovery instead of using SACKs, and (2) disabling BDP-FC. Figure 7 shows the resulting average FCTs (we saw similar trends with other metrics). We discuss these results in greater details below.

咱們如今進行IRN的因子分析,以單獨研究IRN對RoCE作出的兩個關鍵更改的重要性,即(1)有效的丟包恢復和(2)BDP-FC。 爲此,咱們將IRN的性能(在§4.2中評估)與兩個不一樣的更改進行比較,突出了每一個更改的重要性:(1)啓用回退N丟包恢復而不是使用SACK,以及(2)禁用BDP-FC。 圖7顯示了獲得的平均FCT(咱們看到了與其餘指標相似的趨勢)。 咱們將在下面詳細討論這些結果。

圖7:本圖給出回退N丟包恢復和不使用BDP-FC時IRN的影響。y軸限制到3ms以更好地突出趨勢。

Need for Efficient Loss Recovery: The first two bars in Figure 7 compare the average FCT of default SACK-based IRN and IRN with go-back-N respectively. We find that the latter results in significantly worse performance. This is because of the bandwidth wasted by go-back-N due to redundant retransmissions, as described before.

須要有效的丟包恢復:圖7中的前兩個條形圖分別比較了默認的基於SACK的IRN和使用回退N的IRN的平均FCT。咱們發現後者會致使性能顯着變差。 這是由於因爲回退N的冗餘重傳致使的帶寬浪費,如前所述。

Before converging to IRN’s current loss recovery mechanism, we experimented with alternative designs. In particular we explored the following questions: 

在使用IRN目前的丟包恢復機制以前,咱們嘗試了其它替代設計。 咱們特別探討了如下問題:

(1) Can go-back-N be made more efficient? Go-back-N does have the advantage of simplicity over selective retransmission, since it allows the receiver to simply discard out-of-order packets. We, therefore, tried to explore whether we can mitigate the negative effects of go-back-N. We found that explicitly backing off on losses improved go-back-N performance for Timely (though, not for DCQCN). Nonetheless, SACK-based loss recovery continued to perform significantly better across different scenarios (with the difference in average FCT for Timely ranging from 20%-50%).

(1)回退N能夠更有效嗎?相比於選擇性重傳, Go-back-N確實具備簡單的優勢,由於它容許接收端簡單地丟棄亂序數據包。 所以,咱們試圖探討是否能夠減輕回退N的負面影響。 咱們發現:在丟包時顯式回退能夠提升Timely下回退N的性能(但對DCQCN無效)。 儘管如此,基於SACK的丟包恢復在不一樣的情景中繼續表現得更好(Timely的平均FCT差別在20%-50%之間)。

(2) Do we need SACKs? We tried a selective retransmit scheme without SACKs (where the sender does not maintain a bitmap to track selective acknowledgements). This performed better than go-back-N. However, it fared poorly when there were multiple losses in a window, requiring multiple round-trips to recover from them. The corresponding degradation in average FCT ranged from <1% up to 75% across different scenarios when compared to SACK-based IRN. 

(2)咱們須要SACK嗎?咱們嘗試了一種沒有SACK的選擇性重傳方案(發送方沒有維護位圖來跟蹤選擇性確認)。這比回退N表現得更好。然而,當窗口中出現多處丟包時,它的表現不好,須要屢次往返才能從中恢復。 與基於SACK的IRN相比,在不一樣狀況下,平均FCT的相應降級範圍從<1%到75%不等。

(3) Can the timeout value be computed dynamically? As described in §3, IRN uses two static (low and high) timeout values to allow faster recovery for short messages, while avoiding spurious retransmissions for large ones. We also experimented with an alternative approach of using dynamically computed timeout values (as with TCP), which not only complicated the design, but did not help since these effects were then be dominated by the initial timeout value. 

(3)能夠動態計算超時值嗎? 如§3所述,IRN使用兩個靜態(低和高)超時值,以便更快地恢復短消息,同時避免大消息的虛假重傳。 咱們還嘗試了一種使用動態計算超時值的替代方法(與TCP同樣),這不只使設計複雜化,並且沒有幫助,由於這些效果隨後由初始超時值主導。

Significance of BDP-FC: The first and the third bars in Figure 7 compare the average FCT of IRN with and without BDP-FC respectively. We find that BDP-FC significantly improves performance by reducing unnecessary queuing. Furthermore, it prevents a flow that is recovering from a packet loss from sending additional new packets and increasing congestion, until the loss has been recovered.

BDP-FC的重要性:圖7中的第一和第三條分別比較了有和沒有BDP-FC的IRN的平均FCT。 咱們發現BDP-FC能夠經過減小沒必要要的排隊來顯着提升性能。 此外,它能夠防止從丟包中恢復的數據流發送額外的新數據包並增長擁塞,直到丟包已經恢復。

Efficient Loss Recovery vs BDP-FC: Comparing the second and third bars in Figure 7 shows that the performance of IRN with go-back-N loss recovery is generally worse than the performance of IRN without BDP-FC. This indicates that of the two changes IRN makes, efficient loss recovery helps performance more than BDP-FC.

BDP-FC和高效丟包恢復對比:比較圖7中的第二和第三條柱,顯示:回退N丟包恢復的IRN性能一般比沒有BDP-FC的IRN性能差。 這代表在IRN作出的兩個更改中,有效的丟包恢復比BDP-FC更有助於提升性能。

 4.4 Robustness of Basic Results (基礎結果的魯棒性)

We now evaluate the robustness of the basic results from §4.2 across different scenarios and performance metrics.

咱們如今評估§4.2在不一樣場景和性能指標中的基礎結果的穩健性。

4.4.1 Varying Experimental Scenario. We evaluate the robustness of our results, as the experimental scenario is varied from our default case. In particular, we run experiments with (i) link utilization levels varied between 30%-90%, (ii) link bandwidths varied from the default of 40Gbps to 10Gbps and 100Gbps, (iii) larger fat-tree topologies with 128 and 250 servers, (iv) a different workload with flow sizes uniformly distributed between 500KB to 5MB, representing background and storage traffic for RDMA, (v) the per-port buffer size varied between 60KB-480KB, (vi) varying other IRN parameters (increasing RTOhigh value by up to 4 times the default of 320µs, and increasing the N value for using RTOlow to 10 and 15). We summarize our key observations here and provide detailed results for each of these scenarios in Appendix §A of an extended report [31].

4.4.1不一樣的實驗場景。 咱們評估當實驗場景偏離默認情景時結果的穩健性。 特別是,咱們進行了如下實驗:(i)鏈路利用率水平在30%-90%之間變化,(ii)鏈路帶寬從默認的40Gbps到10Gbps和100Gbps不等,(iii)具備128和250臺服務器的更大的胖樹拓撲結構,(iv)不一樣的工做負載,數據流大小均勻分佈在500KB到5MB之間,表明RDMA的背景和存儲數據流,(v)每端口緩衝區大小在60KB-480KB之間變化,(vi)改變其餘IRN參數(將RTOhigh值最多增長到默認值320μs的4倍,並將使用RTOlow的N值增長到10和15)。 咱們在此總結了咱們的主要觀察結果,並在擴展報告的附錄§A中爲每一個場景提供了詳細的結果[31]。

Overall Results: Across all of these experimental scenarios, we find that: 

整體結果:在全部這些實驗場景中,咱們發現:

(a) IRN (without PFC) always performs better than RoCE (with PFC), with the performance improvement ranging from 6% to 83% across different cases. 

(a) IRN(沒有PFC)老是比RoCE(使用PFC)表現更好,在不一樣狀況下性能改善範圍從6%到83%。

(b) When used without any congestion control, enabling PFC with IRN always degrades performance, with the maximum degradation across different scenarios being as high as 2.4×.

(b) 在沒有任何擁塞控制的狀況下,啓用PFC總會下降IRN的性能,不一樣狀況下的最大降級高達2.4倍。

(c) Even when used with Timely and DCQCN, enabling PFC with IRN often degrades performance (with the maximum degradation being 39% for Timely and 20% for DCQCN). Any improvement in performance due to enabling PFC with IRN stays within 1.6% for Timely and 5% for DCQCN. 

(c) 即便與Timely和DCQCN一塊兒使用,啓用PFC一般會下降IRN的性能(Timely的最大降級爲39%,DCQCN的降級最大爲20%)。 因爲啓用PFC致使的性能改善對Timely保持在1.6%以內,對於DCQCN保持在5%以內。

Some observed trends: The drawbacks of enabling PFC with IRN: 

一些觀察到的趨勢:使IRN啓用PFC的缺點:

(a) generally increase with increasing link utilization, as the negative impact of congestion spreading with PFC increases. 

(a) 一般隨着鏈路利用率的增長而增長,這是由於PFC致使擁塞擴散的負面影響增長。

(b) decrease with increasing bandwidths, as the relative cost of a round trip required to react to packet drops without PFC also increases. 

(b) 隨着帶寬的增長而減小,由於在沒有PFC的狀況下對丟包作出反應所需的往返的相對成本也會增長。

(c) increase with decreasing buffer sizes due to more pauses and greater impact of congestion spreading. 

(c) 隨着緩存大小的減小而增長,這是由於更多的暫停幀和更大的擁塞擴散的影響。

We further observe that increasing RTOhigh or N had a very small impact on our basic results, showing that IRN is not very sensitive to the specifc parameter values. 

咱們進一步觀察到,增長RTOhigh或N對咱們的基礎結果的影響很是小,代表IRN對特定參數值不是很是敏感。

4.4.2 Tail latency for small messages. We now look at the tail latency (or tail FCT) of the single-packet messages from our default scenario, which is another relevant metric in datacenters [29]. Figure 8 shows the CDF of this tail latency (from 90%ile to 99.9%ile), across different congestion control algorithms. Our key trends from §4.2 hold even for this metric. This is because IRN (without PFC) is able to recover from single-packet message losses quickly due to the low RTOlow timeout value. With PFC, these messages end up waiting in the queues for similar (or greater) duration due to pauses and congestion spreading. For all cases, IRN performs signifcantly better than RoCE. 

4.4.2小消息的尾部延遲。 咱們如今看一下咱們的默認場景的單數據包消息的尾部延遲(或尾部FCT),這是數據中心的另外一個相關指標[29]。 圖8顯示了不一樣擁塞控制算法的尾部延遲(從90%ile到99.9%ile)的CDF。 咱們關於§4.2的主要趨勢甚至適用於該指標。 這是由於因爲RTOlow超時值較低,IRN(無PFC)可以快速從單包消息丟失中恢復。 使用PFC,因爲暫停和擁塞傳播,這些消息最終會在隊列中等待類似(或更長)的持續時間。 對於全部狀況,IRN比RoCE表現更好。

圖8:本圖比較了不一樣擁塞控制算法下,IRN、使用PFC的IRN和RoCE(使用PFC)的單包消息尾部延遲。

4.4.3 Incast. We now evaluate incast scenarios, both with and without cross-traffic. The incast workload without any cross traffic can be identified as the best case for PFC, since only valid congestion-causing flows are paused without unnecessary head-of-the-line blocking.

4.4.3 Incast。 咱們如今評估incast場景,不管是否有交叉流量。 沒有任何交叉流量的incast工做負載被認爲PFC的最佳狀況,由於只有有效的致使擁塞的數據流被暫停而沒有沒必要要的線頭阻塞。

Incast without cross-traffic: We simulate the incast workload on our default topology by striping 150MB of data across M randomly chosen sender nodes that send it to a fixed destination node [17]. We vary M from 10 to 50. We consider the request completion time (RCT) as the metric for incast performance, which is when the last flow completes. For each M, we repeat the experiment 100 times and report the average RCT. Figure 9 shows the results, comparing IRN with RoCE. We find that the two have comparable performance: any increase in the RCT due to disabling PFC with IRN remained within 2.5%. The results comparing IRN’s performance with and without PFC looked very similar. We also varied our default incast setup by changing the bandwidths to 10Gbps and 100Gbps, and increasing the number of connections per machine. Any degradation in performance due to disabling PFC with IRN stayed within 9%. 

沒有交叉流量的Incast:咱們經過在M個隨機選擇的發送方節點上分送150MB數據來模擬咱們默認拓撲中的incast工做負載,這些節點將數據發送到一個固定的目標節點[17]。 咱們將M從10到50變化。咱們將請求完成時間(RCT)視爲incast性能的度量標準,即最後一個數據流完成的時間。 對於每一個M,咱們重複實驗100次並報告平均RCT。 圖9顯示了將IRN與RoCE進行比較的結果。 咱們發現二者具備類似的性能:因IRN禁用PFC而致使的RCT增長保持在2.5%之內。 比較IRN在有和沒有PFC的狀況下的性能結果很是類似。 咱們還經過將帶寬更改成10Gbps和100Gbps以及增長每臺計算機的鏈接數來改變咱們的默認incast設置。 因IRN禁用PFC而致使的性能降低保持在9%之內。

圖9: 本圖給出IRN(不使用PFC)與RoCE(使用PFC)相比,使用不一樣的擁塞控制算法,改變扇入度的狀況下,請求完成時間的比率。

Incast with cross traffic: In practice we expect incast to occur with other cross traffic in the network [23, 29]. We started an incast as described above with M = 30, along with our default case workload running at 50% link utilization level. The incast RCT for IRN (without PFC) was always lower than RoCE (with PFC) by 4%-30% across the three congestion control schemes. For the background workload, the performance of IRN was better than RoCE by 32%-87% across the three congestion control schemes and the three metrics (i.e., the average slowdown, the average FCT and the tail FCT). Enabling PFC with IRN generally degraded performance for both the incast and the cross-traffic by 1-75% across the three schemes and metrics, and improved performance only for one case (incast workload with DCQCN by 1.13%).

包含交叉流量的incast:在實踐中,咱們指望網絡中的incast與其它交叉流量同時存在[23,29]。 咱們如上所述開始了一個M = 30的incast,以及咱們的默認案例工做負載以50%的鏈路利用率運行。 在三種擁塞控制方案中,IRN(沒有PFC)的incast RCT老是低於RoCE(使用PFC)4%-30%。 對於後臺工做負載,在三種擁塞控制方案和三個指標(即平均減緩,平均FCT和尾部FCT)中,IRN的性能優於RoCE 32%-87%。 啓用PFC的IRN一般會使三個方案和指標中的incast和交叉流量的性能下降1-75%,而且僅針對一種狀況提升性能(使用DCQCN的incast工做負載,提高1.13%)。

4.4.4 Window-based congestion control. We also implemented conventional window-based congestion control schemes such as TCP’s AIMD and DCTCP [15] with IRN and observed similar trends as discussed in §4.2. In fact, when IRN is used with TCP’s AIMD, the benefits of disabling PFC were even stronger, because it exploits packet drops as a congestion signal, which is lost when PFC is enabled. 

4.4.4基於窗口的擁塞控制。 咱們還在IRN上實現了傳統的基於窗口的擁塞控制方案,如TCP的AIMD和DCTCP [15],並觀察到相似於§4.2中討論的趨勢。 事實上,當IRN與TCP的AIMD一塊兒使用時,禁用PFC的好處甚至更強,由於它利用數據包丟包做爲擁塞信號,而該信號在啓用PFC時就消失了。

Summary: Our key results i.e., (1) IRN (without PFC) performs better than RoCE (with PFC), and (2) IRN does not require PFC, hold across varying realistic scenarios, congestion control schemes and performance metrics. 

總結:咱們的關鍵結果,即(1)IRN(沒有PFC)比RoCE(使用PFC)表現更好,(2)IRN不須要PFC,適用於不一樣的現實場景,擁塞控制方案和性能指標。

4.5 Comparison with Resilient RoCE (與彈性RoCE比較)

A recent proposal on Resilient RoCE [34] explores the use of DCQCN to avoid packet losses in specific scenarios, and thus eliminate the requirement for PFC. However, as observed previously in Figure 6, DCQCN may not always be successful in avoiding packet losses across all realistic scenarios with more dynamic traffic patterns and hence PFC (with its accompanying problems) remains necessary. Figure 10 provides a direct comparison of IRN with Resilient RoCE. We find that IRN, even without any explicit congestion control, performs significantly better than Resilient RoCE, due to better loss recovery and BDP-FC.

最近關於彈性RoCE [34]的提案探討了在特定狀況下使用DCQCN來避免數據包丟失,從而消除了對PFC的要求。 然而,如先前在圖6中所觀察到的,DCQCN可能並不老是成功地避免在具備更多動態數據流模式的全部現實場景中的分組丟失,所以PFC(及其伴隨的問題)仍然是必要的。 圖10提供了IRN與彈性RoCE的直接比較。 咱們發現IRN,即便沒有任何顯式的擁塞控制,因爲更好的丟包恢復和BDP-FC,也比彈性RoCE表現得更好。

圖10: IRN和彈性RoCE(RoCE+DCQCN,沒有PFC)的比較。

4.6 Comparison with iWARP (與iWARP對比)

We finally explore whether IRN's simplicity over the TCP stack implemented in iWARP impacts performance. We compare IRN's performance (without any explicit congestion control) with full-blown TCP stack's, using INET simulator's in-built TCP implementation for the latter. Figure 11 shows the results for our default scenario. We find that absence of slow-start (with use of BDP-FC instead) results in 21% smaller slowdowns with IRN and comparable average and tail FCTs. These results show that in spite of a simpler design, IRN's performance is better than full-blown TCP stack's, even without any explicit congestion control. Augmenting IRN with TCP's AIMD logic further improves its performance, resulting in 44% smaller average slowdown and 11% smaller average FCT as compared to iWARP. Furthermore, IRN's simple design allows it to achieve message rates comparable to current RoCE NICs with very little overheads (as evaluated in §6). An iWARP NIC, on the other hand, can have up to 4× smaller message rate than a RoCE NIC (§2). Therefore, IRN provides a simpler and more performant solution than iWARP for eliminating RDMA's requirement for a lossless network.

咱們最後探討IRN相對於iWARP中實現的TCP棧的簡單性是否會影響性能。咱們將IRN的性能(沒有任何顯式的擁塞控制)與完整的TCP棧進行比較,使用INET模擬器的內置TCP實現來實現後者。圖11顯示了咱們的默認方案的結果。咱們發現,沒有慢啓動(而是使用BDP-FC)使IRN的平均減緩減小了21%,而且平均和尾部FCT至關。這些結果代表,儘管設計更簡單,但即便沒有任何顯式的擁塞控制,IRN的性能也優於完整的TCP棧。使用TCP的AIMD邏輯加強IRN進一步提升了其性能,與iWARP相比,平均減緩下降了44%,平均FCT下降了11%。此外,IRN的簡單設計使其可以以很是小的開銷實現與當前RoCE NIC至關的消息速率(如§6中所述)。另外一方面,iWARP NIC的消息速率最高可比RoCE NIC小4倍(§2)。所以,IRN提供了比iWARP更簡單且更高性能的解決方案,以消除RDMA對不丟包網絡的要求。

圖11: iWARP傳輸層(TCP棧)和IRN的對比。

5 Implementation Considerations (實現考量)

We now discuss how one can incrementally update RoCE NICs to support IRN's transport logic, while maintaining the correctness of RDMA semantics as defined by the Infiniband RDMA specification [4]. Our implementation relies on extensions to RDMA's packet format, e.g., introducing new fields and packet types. These extensions are encapsulated within IP and UDP headers (as in RoCEv2) so they only affect the endhost behavior and not the network behavior (i.e. no changes are required at the switches). We begin with providing some relevant context about different RDMA operations before describing how IRN supports them.

咱們如今討論如何逐步更新RoCE NIC以支持IRN的傳輸層邏輯,同時保持RDMA語義的正確性(如Infiniband RDMA規範所定義[4])。 咱們的實現依賴於對RDMA數據包格式的擴展,例如,引入新的域和數據包類型。這些擴展封裝在IP和UDP頭中(如RoCEv2中),所以它們隻影響端主機行爲,而不影響網絡行爲(即交換機不須要進行任何更改)。 在描述IRN如何支持它們以前,咱們首先提供一些關於不一樣RDMA操做的相關上下文。

5.1 Relevant Context (相關上下文)
The two remote endpoints associated with an RDMA message transfer are called a requester and a responder. The interface between the user application and the RDMA NIC is provided by Work Queue Elements or WQEs (pronounced as wookies). The application posts a WQE for each RDMA message transfer, which contains the application-specified metadata for the transfer. It gets stored in the NIC while the message is being processed, and is expired upon message completion. The WQEs posted at the requester and at the responder NIC are called Request WQEs and Receive WQEs respectively. Expiration of a WQE upon message completion is followed by the creation of a Completion Queue Element or a CQE (pronounced as cookie), which signals the message completion to the user application. There are four types of message transfers supported by RDMA NICs:

與RDMA消息傳輸相關聯的兩個遠程端點稱爲請求者和響應者。 用戶應用程序和RDMA NIC之間的接口由工做隊列元素或WQE(發音爲wookies)提供。 應用程序爲每一個RDMA消息傳輸發佈WQE,其中包含用於傳輸的應用程序特定的元數據。它在處理消息時存儲在NIC中,並在消息完成時過時。 在請求者和響應者NIC上發佈的WQE分別稱爲請求WQE和接收WQE。 在消息完成時WQE到期以後是建立完成隊列元素或CQE(發音爲cookie),其向用戶應用程序發信號通知消息完成。 RDMA NIC支持四種類型的消息傳輸:

Write: The requester writes data to responder's memory. The data length, source and sink locations are specified in the Request WQE, and typically, no Receive WQE is required. However, Write-with-immediate operation requires the user application to post a Receive WQE that expires upon completion to generate a CQE (thus signaling Write completion at the responder as well).

寫:請求者將數據寫入響應者的內存。 數據長度、源和接收者的位置在請求WQE中指定,一般不須要接收WQE。 然而,當即寫入操做要求用戶應用程序發佈在完成時到期的接收WQE以生成CQE(從而也在響應者處發信號通知寫入完成)。

Read: The requester reads data from responder's memory. The data length, source and sink locations are specified in the Request WQE, and no Receive WQE is required.

讀:請求者從響應者的內存中讀取數據。 數據長度、源和接收者的位置在請求WQE中指定,而且不須要接收WQE。

Send: The requester sends data to the responder. The data length and source location is specified in the Request WQE, while the sink location is specified in the Receive WQE.

發送:請求者將數據發送給響應者。 數據長度和源位置在請求WQE中指定,而接收者位置在接收WQE中指定。

Atomic: The requester reads and atomically updates the data at a location in the responder's memory, which is specified in the Request WQE. No Receive WQE is required. Atomic operations are restricted to single-packet messages.

原子:請求者讀取並原子地更新響應者內存中某個位置的數據,該位置在請求WQE中指定。 不須要接收WQE。 原子操做僅限於單包消息。

5.2 Supporting RDMA Reads and Atomics (支持RDMA讀和原子)

IRN relies on per-packet ACKs for BDP-FC and loss recovery. RoCE NICs already support per-packet ACKs for Writes and Sends. However, when doing Reads, the requester (which is the data sink) does not explicitly acknowledge the Read response packets. IRN, therefore, introduces packets for read (N)ACKs that are sent by a requester for each Read response packet. RoCE currently has eight unused opcode values available for the reliable connected QPs, and we use one of these for read (N)ACKs. IRN also requires the Read responder (which is the data source) to implement timeouts. New timer-driven actions have been added to the NIC hardware implementation in the past [34]. Hence, this is not an issue. 

IRN的BDP-FC和丟包恢復依賴於每包ACK。 對於寫和發送,RoCE NIC已經支持每一個數據包的ACK。 可是,在執行讀取時,請求者(數據接收器)不會顯式確認讀響應數據包。 所以,IRN引入了由請求者爲每一個讀響應數據包發送的讀(N)ACK數據包。 RoCE目前有八個未使用的操做碼值可用於可靠鏈接的QP,咱們使用其中一個用於讀(N)ACK。 IRN還須要Read響應者(它是數據源)實現超時。 過去,新的計時器驅動操做已添加到NIC硬件實現中[34]。 所以,這不是問題。

RDMA Atomic operations are treated similar to a single packet RDMA Read messages.

RDMA原子操做的處理相似於單包RDMA讀取消息。

Our simulations from §4 did not use ACKs for the RoCE (with PFC) baseline, modelling the extreme case of all Reads. Therefore, our results take into account the overhead of per-packet ACKs in IRN.

§4中的模擬對RoCE(帶PFC)基線不使用ACK,以模擬所有爲讀操做的極端狀況。 所以,咱們的結果考慮了IRN中每數據包ACK的開銷。

5.3 Supporting Out-of-order Packet Delivery (支持亂序包交付)

One of the key challenges for implementing IRN is supporting out-of-order (OOO) packet delivery at the receiver – current RoCE NICs simply discard OOO packets. A naive approach for handling OOO packets would be to store all of them in the NIC memory. The total number of OOO packets with IRN is bounded by the BDP cap (which is about 110 MTU-sized packets for our default scenario as described in §4.1). Therefore to support a thousand flows, a NIC would need to buffer 110MB of packets, which exceeds the memory capacity on most commodity RDMA NICs.

實現IRN的關鍵挑戰之一是在接收者上支持無序(OOO)數據包傳輸 - 當前的RoCE NIC只是簡單地丟棄OOO數據包。 處理OOO數據包的一種簡單方法是將全部數據包存儲在NIC內存中。 IRN中OOO數據包的總數受BDP上限的限制(對於咱們的默認狀況,大約爲110個MTU大小的數據包,如第4.1節所述)。所以,爲了支持一千個數據流,一個NIC須要增長110MB的數據包緩存,這超過大多數商用RDMA網卡的內存容量。

We therefore explore an alternate implementation strategy, where the NIC DMAs OOO packets directly to the final address in the application memory and keeps track of them using bitmaps (which are sized at BDP cap). This reduces NIC memory requirements from 1KB per OOO packet to only a couple of bits, but introduces some additional challenges that we address here. Note that partial support for OOO packet delivery was introduced in the Mellanox ConnectX-5 NICs to enable adaptive routing [11]. However, it is restricted to Write and Read operations. We improve and extend this design to support all RDMA operations with IRN. 

所以,咱們探索了另外一種實現策略,其中NIC將OOO數據包直接經過DMA發送到應用程序內存中的最終地址,並使用位圖(其大小爲BDP上限)追蹤它們。 這將NIC內存要求從每一個OOO數據包1KB減小到僅幾個比特,但引入了一些額外的挑戰,咱們將在此處加以解決。 請注意,Mellanox ConnectX-5網卡中引入了對OOO數據包傳輸的部分支持,以實現自適應路由[11]。 可是,它僅限於寫入和讀取操做。 咱們改進並擴展了此設計,以便IRN支持全部的RDMA操做。
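The memory argument in the two paragraphs above can be made concrete with a little arithmetic (a back-of-the-envelope sketch; the 110-packet cap and 1000-flow figure come from the text, everything else is rounding):

# Buffering OOO packets on the NIC vs. tracking them with bitmaps.
bdp_cap_packets = 110                  # max out-of-order packets per flow (BDP cap, §4.1)
mtu_bytes = 1024
flows = 1000

buffer_all_bytes = flows * bdp_cap_packets * mtu_bytes   # ~110MB of NIC packet memory
bitmap_bytes = flows * (bdp_cap_packets / 8.0)           # a few bits per packet: roughly 14KB
# (IRN actually keeps a couple of such bitmaps per QP, e.g. the 2-bitmap of §5.3.3,
#  but the total stays orders of magnitude below full packet buffering.)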

We classify the issues due to out-of-order packet delivery into four categories. 

咱們將由亂序數據包傳送引發的問題分爲四類。

5.3.1 First packet issues. For some RDMA operations, critical information is carried in the first packet of a message, which is required to process other packets in the message. Enabling OOO delivery, therefore, requires that some of the information in the first packet be carried by all packets. 

5.3.1第一個數據包問題。 對於某些RDMA操做,關鍵信息攜帶在消息的第一個數據包中,這是處理消息中其餘數據包所必需的。 所以,啓用OOO傳送要求第一個數據包中的某些信息由全部數據包承載。

In particular, the RETH header (containing the remote memory location) is carried only by the first packet of a Write message. IRN requires adding it to every packet. 

特別地,RETH報頭(包含遠程存儲器位置)僅由寫消息的第一個分組攜帶。 IRN須要將其添加到每一個數據包。

5.3.2 WQE matching issues. Some operations require every packet that arrives to be matched with its corresponding WQE at the responder. This is done implicitly for in-order packet arrivals. However, this implicit matching breaks with OOO packet arrivals. A work-around for this is assigning explicit WQE sequence numbers, that get carried in the packet headers and can be used to identify the corresponding WQE for each packet. IRN uses this workaround for the following RDMA operations: 

5.3.2 WQE匹配問題。 某些操做要求到達的每一個數據包與響應者處的相應WQE匹配。 對於有序分組到達,這是隱式完成的。 可是,這種隱式匹配會被OOO數據包到達破壞。 解決此問題的方法是分配顯式WQE序列號,這些序列號在數據包報頭中攜帶,可用於識別每一個數據包的相應WQE。 IRN將此解決方法用於如下RDMA操做:

Send and Write-with-immediate: It is required that Receive WQEs be consumed by Send and Write-with-immediate requests in the same order in which they are posted. Therefore, with IRN every Receive WQE, and every Request WQE for these operations, maintains a recv_WQE_SN that indicates the order in which they are posted. This value is carried in all Send packets and in the last Write-with-Immediate packet, and is used to identify the appropriate Receive WQE. IRN also requires the Send packets to carry the relative offset in the packet sequence number, which is used to identify the precise address when placing data. 

發送和當即寫入:要求發送和當即寫入請求按照發布的相同順序使用接收WQE。 所以,對於IRN,每一個接收WQE以及這些操做的每一個請求WQE都維護一個recv_WQE_SN,指示它們的發佈順序。 該值在全部發送數據包和最後一個Write-with-Immediate數據包中攜帶,用於標識相應的接收WQE。 IRN還要求發送(Send)數據包攜帶以數據包序列號表示的相對偏移量,用於在放置數據時識別精確地址。
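Putting §5.3.1 and this subsection together, the extra per-packet information IRN carries might be pictured as below (a Python dataclass; the field names follow the text, but the grouping and exact layout are assumptions for illustration, not the RoCE wire format):

# Illustrative view of the additional fields IRN carries for OOO delivery.
from dataclasses import dataclass
from typing import Optional

@dataclass
class IrnPacketFields:
    psn: int                                # packet sequence number
    reth_remote_addr: Optional[int] = None  # RETH info, carried in every Write packet (§5.3.1)
    recv_wqe_sn: Optional[int] = None       # matches Sends / last Write-with-Immediate to a Receive WQE
    msg_offset: Optional[int] = None        # relative offset within the message, for placing Send data
    read_wqe_sn: Optional[int] = None       # indexes the Read WQE buffer for Read/Atomic requests

On an out-of-order Send packet, for instance, the responder would use recv_wqe_sn to locate the right Receive WQE and msg_offset to compute the destination address, instead of relying on arrival order.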

Read/Atomic: The responder cannot begin processing a Read/Atomic request R, until all packets expected to arrive before R have been received. Therefore, an OOO Read/Atomic Request packet needs to be placed in a Read WQE buffer at the responder (which is already maintained by current RoCE NICs). With IRN, every Read/Atomic Request WQE maintains a read_WQE_SN, that is carried by all Read/Atomic request packets and allows identification of the correct index in this Read WQE buffer.

讀/原子:響應者沒法開始處理讀/原子請求R,直到全部預期在R以前到達的數據包被接收爲止。 所以,OOO讀/原子請求包須要放入響應者的讀WQE緩衝區中(當前RoCE NIC已維護該緩衝區)。 使用IRN,每一個讀/原子請求WQE維護一個read_WQE_SN,由全部讀/原子請求包攜帶,並容許在該讀WQE緩衝區中識別正確的索引。

5.3.3 Last packet issues. For many RDMA operations, critical information is carried in last packet, which is required to complete message processing. Enabling OOO delivery, therefore, requires keeping track of such last packet arrivals and storing this information at the endpoint (either on NIC or main memory), until all other packets of that message have arrived. We explain this in more details below. 

5.3.3最後的數據包問題。 對於許多RDMA操做,關鍵信息在最後一個數據包中攜帶,這是完成消息處理所必需的。 所以,啓用OOO傳送須要跟蹤最後的數據包到達並將此信息存儲在端點(NIC或主存儲器上),直到該消息的全部其餘數據包都到達。 咱們將在下面詳細解釋這一點。

A RoCE responder maintains a message sequence number (MSN) which gets incremented when the last packet of a Write/Send message is received or when a Read/Atomic request is received. This MSN value is sent back to the requester in the ACK packets and is used to expire the corresponding Request WQEs. The responder also expires its Receive WQE when the last packet of a Send or a Write-With-Immediate message is received and generates a CQE. The CQE is populated with certain meta-data about the transfer, which is carried by the last packet. IRN, therefore, needs to ensure that the completion signalling mechanism works correctly even when the last packet of a message arrives before others. For this, an IRN responder maintains a 2-bitmap, which in addition to tracking whether or not a packet p has arrived, also tracks whether it is the last packet of a message that will trigger (1) an MSN update and (2) in certain cases, a Receive WQE expiration that is followed by a CQE generation. These actions are triggered only after all packets up to p have been received. For the second case, the recv_WQE_SN carried by p (as discussed in §5.3.2) can identify the Receive WQE with which the meta-data in p needs to be associated, thus enabling a premature CQE creation. The premature CQE can be stored in the main memory, until it gets delivered to the application after all packets up to p have arrived. 

RoCE響應器維護消息序列號(MSN),當接收到寫入/發送消息的最後一個包或者接收到讀取/原子請求時,該消息序列號增長。該MSN值在ACK分組中被髮送回請求者,並用於使相應的請求WQE到期。當接收到Send或Write-With-Immediate消息的最後一個數據包併產生CQE時,響應者也使其接收WQE到期。 CQE填充有關於傳輸的某些元數據,其由最後一個分組攜帶。所以,IRN須要確保即便消息的最後一個數據包到達其餘數據包以前,完成信令機制也能正常工做。爲此,IRN響應者維護2位圖,除了跟蹤包p是否已到達以外,還跟蹤它是不是將觸發(1)MSN更新的消息的最後一個包,以及(2)在某些狀況下,接收WQE到期,而後是CQE生成。只有在收到全部直到p的數據包後纔會觸發這些操做。對於第二種狀況,由p攜帶的recv_WQE_SN(如第5.3.2節中所討論的)能夠識別與p中的元數據須要關聯的接收WQE,從而實現過早的CQE建立。過早的CQE能夠存儲在主存儲器中,直到它在到達p的全部數據包到達以後被傳送到應用程序。
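One way to read the 2-bitmap described above is as two parallel bit arrays indexed by PSN, with the completion actions deferred until the in-order point passes a message's last packet; the sketch below is an interpretation written for this translation, not the actual NIC data structure:

# Interpreting IRN's responder-side "2-bitmap" (illustrative).
class OooTracker:
    def __init__(self, window):
        self.arrived = [False] * window   # bit 1: has this packet arrived?
        self.is_last = [False] * window   # bit 2: is it the last packet of a message?
        self.base_psn = 0                 # left edge; everything below has been processed

    def on_packet(self, psn, last_of_message):
        i = psn - self.base_psn
        self.arrived[i] = True
        self.is_last[i] = last_of_message
        # MSN update / Receive WQE expiry / CQE delivery fire only once every
        # packet up to and including a message's last packet has arrived.
        while self.arrived[0]:
            if self.is_last[0]:
                self.complete_message(self.base_psn)
            self.arrived = self.arrived[1:] + [False]
            self.is_last = self.is_last[1:] + [False]
            self.base_psn += 1

    def complete_message(self, last_psn):
        print("message ending at PSN", last_psn, "done: update MSN, expire WQE, generate CQE")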

5.3.4 Application-level Issues. Certain applications (for example FaRM [21]) rely on polling the last packet of a Write message to detect completion, which is incompatible with OOO data placement. This polling based approach violates the RDMA specification (Sec o9-20 [4]) and is more expensive than officially supported methods (FaRM [21] mentions moving on to using the officially supported Write-with-Immediate method in the future for better scalability). IRN's design provides all of the Write completion guarantees as per the RDMA specification. This is discussed in more details in Appendix §B of the extended report [31].

5.3.4應用程序級問題。 某些應用程序(例如FaRM [21])依賴於輪詢Write消息的最後一個數據包來檢測完成,這與OOO數據放置不兼容。 這種基於輪詢的方法違反了RDMA規範(Sec o9-20 [4]),而且比官方支持的方法更昂貴(FaRM [21]提到未來使用官方支持的Write-with-Immediate方法以得到更好的可擴展性)。 IRN的設計根據RDMA規範提供了全部寫完成保證。 這在擴展報告[31]的附錄§B中有更詳細的討論。

OOO data placement can also result in a situation where data written to a particular memory location is overwritten by a retransmitted packet from an older message. Typically, applications using distributed memory frameworks assume relaxed memory ordering and use application layer fences whenever strong memory consistency is required [14, 36]. Therefore, both iWARP and Mellanox ConnectX-5, in supporting OOO data placement, expect the application to deal with the potential memory over-writing issue and do not handle it in the NIC or the driver. IRN can adopt the same strategy. Another alternative is to deal with this issue in the driver, by enabling the fence indicator for a newly posted request that could potentially overwrite an older one.

OOO數據放置還可能致使寫入特定存儲器位置的數據被來自較舊消息的從新發送的數據包覆蓋的狀況。 一般,使用分佈式內存框架的應用程序假定放寬內存排序,並在須要強內存一致性時使用應用程序層圍欄[14,36]。 所以,支持OOO數據放置的iWARP和Mellanox ConnectX-5都但願應用程序可以處理潛在的內存覆蓋問題,而不是在NIC或驅動程序中處理它。 IRN能夠採用相同的策略。 另外一種方法是在驅動程序中處理此問題,方法是爲新發布的請求啓用fence指示符,該請求可能會覆蓋舊的請求。

5.4 Other Considerations (其它考量)

Currently, the packets that are sent and received by a requester use the same packet sequence number (PSN ) space. This interferes with loss tracking and BDP-FC. IRN, therefore, splits the PSN space into two different ones (1) sPSN to track the request packets sent by the requester, and (2) rPSN to track the response packets received by the requester. This decoupling remains transparent to the application and is compatible with the current RoCE packet format. IRN can also support shared receive queues and send with invalidate operations and is compatible with use of end-to-end credit. We provide more details about these in Appendix §B of the extended report [31]. 

目前,請求者發送和接收的數據包使用相同的數據包序列號(PSN)空間。 這會干擾丟失跟蹤和BDP-FC。 所以,IRN將PSN空間分紅兩個不一樣的空間:(1)sPSN,用於跟蹤請求者發送的請求包,以及(2)rPSN,用於跟蹤請求者接收的響應包。 這種解耦對應用程序仍然是透明的,而且與當前的RoCE數據包格式兼容。 IRN還能夠支持共享接收隊列和帶失效的發送(send with invalidate)操做,並與端到端信用(credit)的使用兼容。 咱們在擴展報告的附錄§B中提供了有關這些的更多細節[31]。
