B4 and After: Managing Hierarchy, Partitioning, and Asymmetry for Availability and Scale in Google’s Software-Defined WAN

B4及以後:爲谷歌軟件定義WAN的可用性和擴展管理層次化、劃分和不對稱

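This is a SIGCOMM 2018 conference paper from Google.

The translator produced the Chinese version of this paper. Because time was short and the translator's English is limited, errors are inevitable; corrections from readers are welcome.

This article and its translation are for study purposes only. If anything here is inappropriate, please contact the translator to have it removed.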

ABSTRACT (摘要)

Private WANs are increasingly important to the operation of enterprises, telecoms, and cloud providers. For example, B4, Google’s private software-defined WAN, is larger and growing faster than our connectivity to the public Internet. In this paper, we present the five-year evolution of B4. We describe the techniques we employed to incrementally move from offering best-effort content-copy services to carrier-grade availability, while concurrently scaling B4 to accommodate 100x more traffic. Our key challenge is balancing the tension introduced by hierarchy required for scalability, the partitioning required for availability, and the capacity asymmetry inherent to the construction and operation of any large-scale network. We discuss our approach to managing this tension: i) we design a custom hierarchical network topology for both horizontal and vertical software scaling, ii) we manage inherent capacity asymmetry in hierarchical topologies using a novel traffic engineering algorithm without packet encapsulation, and iii) we re-architect switch forwarding rules via two-stage matching/hashing to deal with asymmetric network failures at scale.

私有廣域網對企業、電信和雲提供商的運維來講正變得愈來愈重要。例如,谷歌的私有軟件定義廣域網B4的鏈接規模和增加速度超過其到公共因特網的鏈接。本文中,咱們介紹了B4系統的5年演進過程。咱們描述了提供由盡力而爲的內容拷貝服務到電信級可用性逐漸轉變的技術,同時B4系統擴展到適應100倍以上的通訊量。咱們的關鍵挑戰是均衡如下因素致使的矛盾:可擴展性要求的層次化、可用性要求的網絡劃分以及任何大規模網絡構建和運維所固有的容量不對稱。咱們討論瞭解決這一矛盾的方法:1)咱們爲水平和垂直軟件擴展設計了一種定製的層次化網絡拓撲,2)咱們在層次化拓撲中使用一種創新的無需數據包封裝的流量工程算法來解決固有的容量不對稱,和3)咱們經過兩階段匹配/哈希重構了交換機轉發規則,以解決必定規模下的不對稱網絡故障問題。

1 Introduction (引言)

B4 [18] is Google’s private backbone network, connecting data centers across the globe (Figure 1). Its software-defined network control stacks enable flexible and centralized control, offering substantial cost and innovation benefits. In particular, by using centralized traffic engineering (TE) to dynamically optimize site-to-site pathing based on utilization and failures, B4 supports much higher levels of utilization and provides more predictable behavior.

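Figure 1: B4 worldwide topology. Each marker indicates a single site or multiple sites in close geographical proximity. As of January 2018, B4 consists of 33 sites.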

B4[18]是谷歌用來鏈接其全球數據中心的私有骨幹網絡(見圖1)。它所採用的軟件定義網絡控制棧具備靈活的和集中的網絡管理能力,帶來極大的成本和革新效益。具體地講,使用集中式流量工程(TE)基於利用率和故障動態優化站間選路,B4支持更高級別的利用率,提供更具預測性地行爲。

B4’s initial scope was limited to loss-tolerant, lower availability applications that did not require better than 99% availability, such as replicating indexes across data centers. However, over time our requirements have become more stringent, targeting applications with service level objectives (SLOs) of 99.99% availability. Specifically, an SLO of X% availability means that both network connectivity (i.e., packet loss is below a certain threshold) and promised bandwidth [25] between any pair of datacenter sites are available X% of the minutes in the trailing 30-day window. Table 1 shows the classification of applications into service classes with the required availability SLOs.

Table 1: Service classes with example applications and their assigned availability SLOs.

B4的初始範圍只限於容丟失的低可用性應用,這些應用不須要高於99%的可用性(如跨數據中心索引複製)。然而,隨着時間的遷移,咱們的需求變得更爲嚴格,以服務水平目標(SLOs)爲99.99%可用性的應用爲目標。具體地,SLO爲X%可用性指的是任意一對數據中心站點間的網絡連通性(即,數據丟包低於某一閾值)和許諾帶寬[25]在30天的時間窗口內的可用分鐘數達到X%[筆者注:30天包含43200分鐘,則10%的可用性要求可用分鐘數達到4320分鐘]。表1給出應用到服務等級的分類,這些服務等級具備指定的可用性SLO。
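The SLO definition above counts minutes in a trailing 30-day window (43,200 minutes). As a rough illustration only, the sketch below converts an availability percentage into the number of minutes that may miss the target; the function name and its default window are assumptions made for this example, not something taken from the paper.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_bad_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes in the trailing window that may fail the SLO (hypothetical helper)."""
    total_minutes = window_days * MINUTES_PER_DAY  # 43,200 for a 30-day window
    return total_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    # The two availability levels named in the text: the initial 99% scope
    # and the later 99.99% target.
    for slo in (99.0, 99.99):
        print(f"{slo}% availability leaves about {allowed_bad_minutes(slo):.2f} bad minutes")
```

Under this reading, moving from 99% to 99.99% shrinks the monthly budget from roughly 432 minutes to a little over 4 minutes.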

Matching the reported availability of carrier-grade networks initially appeared daunting. Given the inherent unreliability of long-haul links [17] as well as unavoidable downtime associated with necessary management operations [13], conventional wisdom dictates that this level of availability requires substantial over-provisioning and fast local failure recovery, e.g., within 50 milliseconds [7]. 

初看起來,匹配電信級網絡的可用性是使人畏懼的。考慮長距離傳輸鏈路的固有不可靠性[17]和不可避免的與必要的管理操做相關的宕機[13],傳統的知識認爲達到這一等級的可用性(電信級可用性)須要大量的超額配置和快速本地故障恢復(如50毫秒內[7])。

Complicating our push for better availability was the exponential growth of WAN traffic; our bandwidth requirements have grown by 100x over a five-year period. In fact, this doubling of bandwidth demand every nine months is faster than all of the rest of our infrastructure components, suggesting that applications derive significant benefits from plentiful cluster-to-cluster bandwidth. This scalability requirement spans multiple dimensions, including aggregate network capacity, the number of data center sites, the number of TE paths, and network aggregate prefixes. Moreover, we must achieve scale and availability without downtime for existing traffic. As a result, we have had to innovate aggressively and develop novel point solutions to a number of problems.

廣域網流量的指數增加使咱們對更高可用性的努力複雜化;咱們的帶寬需求在5年期間增加了100倍。事實上,帶寬需求每九個月翻一番的速度比其它基礎設施組件更快,代表應用能夠從大量的集羣到集羣帶寬中顯著受益。可擴展性需求包含多個維度,包括聚合網絡容量、數據中心站點數目、TE路徑數量和網絡聚合前綴。此外,咱們必須在不致使現有數據流中斷的情形下得到擴展性和可用性。所以,咱們須要激進地革新並研發解決多個問題的創新點方案。
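As a quick consistency check of the two growth figures quoted above (100x over five years, and a doubling every nine months), assuming steady exponential growth over the 60 months:

```latex
T_{\text{double}} = \frac{60\ \text{months} \cdot \ln 2}{\ln 100}
\approx \frac{60 \times 0.693}{4.605}
\approx 9.0\ \text{months}
```

The two statements therefore describe the same growth rate.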

In this paper, we present the lessons we learned from our five-year experience in managing the tension introduced by the network evolution required for achieving our availability and traffic demand requirements (§2). We gradually evolve B4 into a hierarchical topology (§3) while developing a decentralized TE architecture and scalable switch rule management (§4). Taken together, our measurement results show that our design changes have improved availability by two orders of magnitude, from 99% to 99.99%, while simultaneously supporting an increase in traffic scale of 100x over the past five years (§6).

本文中,咱們介紹了過去5年在解決網絡演進致使的矛盾方面的經驗教訓;網絡演進是取得咱們可用性和流量需求所必須的(第二部分)。咱們將B4逐漸演化爲一種層次化拓撲(第三部分),同時研發了一種去中心化的TE架構和可擴展交換機規則管理方法(第四部分)。整體來看,實驗結果代表咱們的設計改變將可用性提高了兩個量級(由99%提升到99.99%),同時在過去5年支持了100倍的流量增加(第6部分)。

2 BACKGROUND AND MOTIVATION (背景和動機) 

In this section, we present the key lessons we learned from our operational experience in evolving B4, the motivating examples which demonstrate the problems of alternative design choices, as well as an outline of our developed solutions.

本節,咱們給出B4演進過程當中運維方面的經驗教訓,展現其它設計選擇問題的動機性實例,以及咱們設計方案的概要。

2.1 Flat topology scales poorly and hurts availability (扁平拓撲擴展性差且損害可用性)

We learned that existing B4 site hardware topologies imposed rigid structure that made network expansion difficult. As a result, our conventional expansion practice was to add capacity by adding B4 sites next to the existing sites in close geographical proximity. However, this practice led to three problems that only manifested at scale. First, the increasing site count significantly slowed the central TE optimization algorithm, which operated on the site-level topology. The algorithm run time increased super-linearly with the site count, and this increasing runtime caused extended periods of traffic blackholing during data plane failures, ultimately violating our availability targets. Second, increasing the site count caused scaling pressure on limited space in switch forwarding tables. Third, and most important, this practice complicated capacity planning and confused application developers. Capacity planning had to account for inter-site WAN bandwidth constraints when compute and storage capacity were available in close proximity but behind a different site. Developers went from thinking about regional replication across clusters to having to understand the mapping of cluster to one of multiple B4 sites.

現有的B4站點硬件拓撲使用剛性結構,致使網絡難於擴展。其結果是,咱們傳統的擴展方法是經過增長B4站點的方式增長容量;增長的B4站點與現有站點在地理位置鄰近。然而,這種方法致使三個問題,這些問題只在必定規模後纔會呈現。首先,站點數量的增長致使集中TE優化算法(在站點級拓撲下運行)顯著變慢。隨着站點數量的增長,算法的運行時間超線性增加;運行時間的增加致使數據平面故障期間數據流黑洞的時間延長,最終違反咱們的可用性目標。其次,站點數量的增長致使有限交換機轉發表空間方面的擴展壓力。最後,最重要的是,這種方法使容量規劃複雜化並混淆應用開發者。當計算和存儲容量在鄰近點(一個不一樣的站點以後)可用時,容量規劃必須考慮站點間的WAN帶寬約束。開發者由考慮跨集羣的區域複製轉變到必須理解集羣到多個B4站點中一個站點的映射。

To solve this tension introduced by exponentially increasing bandwidth demands, we redesign our hardware topology to a hierarchical topology abstraction (details in Figure 4 and §3). In particular, each site is built with a two-layer topology abstraction: At the bottom layer, we introduce a "supernode" fabric, a standard two-layer Clos network built from merchant silicon switch chips. At the top layer, we loosely connect multiple supernodes into a full mesh topology to manage incremental expansion and inherent asymmetries resulting from network maintenance, upgrades and natural hardware/software failures. Based on our operational experience, this two-layer topology provides scalability, by horizontally adding more supernodes to the top layer as needed without increasing the site count, and availability, by vertically upgrading a supernode in place to a new generation without disrupting existing traffic.

爲了解決帶寬需求指數增加引入的矛盾,咱們從新設計了硬件拓撲爲層次化拓撲抽象(細節見圖4和第三部分)。具體地講,每一個站點構建爲2層拓撲抽象:在底層,咱們引入"超級節點」結構(由商用硅交換芯片構建的標準兩層Clos網絡);在上層,咱們鬆散的連接多個超級節點構成一個全mesh拓撲,以解決增量擴展和網絡維護、升級及軟硬件故障致使的固有不對稱。基於咱們的運維經驗,這種兩層拓撲提供了可擴展性(按需向上層水平增長更多的超級節點,而不須要增長站點數量)和可用性(垂直就地升級超級節點爲新一代設備,而無需打斷現有流量)。

2.2 Solving capacity asymmetry in hierarchical topology (解決層次化拓撲中的容量不對稱)

The two-layer hierarchical topology, however, also causes challenges in TE. We find that capacity asymmetry is inevitable at scale due to inherent network maintenance, operation, and data plane device instability. We design our topology to have symmetric WAN capacity at the supernode level, i.e., all supernodes have the same outgoing configured capacity toward other sites. However, Figure 2 shows that 6-20% of the site-level links in B4 still have > 5% capacity asymmetry. We define capacity asymmetry of a site-level link as $(\mathrm{avg}_{\forall i}(C_i) - \min_{\forall i}(C_i)) / \mathrm{avg}_{\forall i}(C_i)$, where $C_i$ is the total active capacity of each supernode $i$ toward the next site.

Figure 2: Fraction of site-level links in B4 with different magnitudes of capacity asymmetry, from October 2017 to January 2018.

然而,2層層次化拓撲給TE帶來挑戰。咱們發現,因爲固有的網絡維護、運維和數據平面設備的不穩定,在必定規模下容量不對稱問題是不可避免的。咱們設計的拓撲在超級節點級具備對稱的WAN容量,即全部的超級節點具備相同的向其它站點的出口容量。然而,圖2代表B4中6-20%的站點級鏈路仍然具備大於5%的容量不對稱。咱們定義站點級鏈路的容量不對稱爲$(\mathrm{avg}_{\forall i}(C_i) - \min_{\forall i}(C_i)) / \mathrm{avg}_{\forall i}(C_i)$,這裏$C_i$表示超級節點i向下一站點的總活躍容量。
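The asymmetry metric defined above is easy to evaluate directly. The following is a minimal sketch; the function name and the handling of a link with zero average capacity are assumptions made for this illustration.

```python
def capacity_asymmetry(capacities):
    """Return (avg(C_i) - min(C_i)) / avg(C_i) for one site-level link.

    `capacities` lists the total active capacity C_i of each supernode i
    toward the next site.
    """
    if not capacities:
        raise ValueError("need at least one supernode capacity")
    avg = sum(capacities) / len(capacities)
    if avg == 0:
        # Assumption for this sketch: a link with no active capacity at all
        # is reported as symmetric rather than dividing by zero.
        return 0.0
    return (avg - min(capacities)) / avg


if __name__ == "__main__":
    print(capacity_asymmetry([8, 8, 8, 8]))  # symmetric link
    print(capacity_asymmetry([10, 2]))       # one supernode lost most capacity
    print(capacity_asymmetry([10, 0]))       # one supernode lost all capacity
```

The metric is 0 for a perfectly symmetric link and reaches 1 when at least one supernode has no active capacity toward the next site.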

Moreover, we find that capacity asymmetry significantly impedes the efficiency of our hierarchical topology: in about 17% of the asymmetric site-level links, we have 100% site-level abstraction capacity loss because at least one supernode completely loses connectivity toward another site. To understand why capacity asymmetry management is critical in hierarchical topologies, we first present a motivating example. Figure 3a shows a scenario where supernodes A1 and A2 respectively have 10 and 2 units of active capacity toward site B, resulting from network operation or physical link failure. To avoid network congestion, we can only admit 4 units of traffic to this site-level link, because of the bottlenecked supernode A2 which has lower outgoing capacity. This indicates that 8 units of supernode-level capacity are wasted, as they cannot be used in the site-level topology due to capacity asymmetry.

 

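Figure 3: Example site-level topology with two sites, each consisting of two supernodes. The label on each link is its active capacity, and ingress traffic is split equally between A1 and A2. The subfigures show the maximum admissible traffic on site-level link (A, B) subject to link capacity constraints: (a) without sidelinks; (b) with sidelinks.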

此外,咱們發現容量不對稱顯著抑制了層次化拓撲的效率—在約17%的不對稱站點級鏈路中,具備100%的站點級抽象容量損失,這是由於至少一個超級節點徹底失去到另外一個站點的連接。爲了理解層次化次拓撲中容量不對稱管理的重要性,咱們首先給出一個動機性實例。圖3a中超級節點A1和A2到站點B分別有10和2個單位的有效容量,這是由網絡運維或物理鏈路故障致使的。因爲瓶頸超級節點A2具備更低的出口容量,爲了不網絡擁塞,此站點級鏈路僅容許4個單位的流量。這代表有8個單位的超級節點級容量是浪費的,由於容量不對稱致使他們沒法在站點級拓撲中使用。  

To reduce the inefficiency caused by capacity asymmetry, we introduce sidelinks, which interconnect supernodes within a site in a full mesh topology. Figure 3b shows how the use of sidelinks increases admissible traffic volume by 3x in this example. The central controller dynamically rebalances traffic within a site to accommodate asymmetric WAN link failures using these sidelinks. Since supernodes of a B4 site are located in close proximity, sidelinks, like other datacenter network links, are much cheaper and significantly more reliable than long-haul WAN links. Such heterogeneous link cost/reliability is a unique WAN characteristic which motivates our design. The flexible use of sidelinks with supernode-level TE not only improves WAN link utilization but also enables incremental, in-place network upgrade, providing substantial up-front cost savings on network deployment. Specifically, with non-shortest-path TE via sidelinks, disabling supernode-level links for upgrade/maintenance does not cause any downtime for existing traffic; it merely results in a slightly degraded site-level capacity.

爲了減小容量不對稱致使的低效問題,咱們引入了旁路(sidelink)的概念;旁路互連同一站點中的超級節點構成全mesh拓撲。圖3b展現瞭如何經過使用旁路將本例中的准入流量增長3倍。集中控制器利用旁路動態重均衡站點內的數據流以適應不對稱WAN鏈路故障。因爲B4站點內的超級節點的位置是鄰近的,與其餘數據中心網絡鏈路相似,旁路的價格更爲低廉且比長傳輸距離的WAN鏈路更爲可靠。這種異構鏈路的成本/可靠性是驅動咱們設計的獨特WAN特徵。旁路的靈活使用配合超級節點級TE不只提升了WAN鏈路的利用率,而且使得增量式就地網絡升級成爲可能,極大地下降了網絡部署的預付成本。具體地,使用旁路的非最短路徑TE,因升級/維護而斷掉超級節點級鏈路不會致使任何現有流量的中斷。即,僅致使站點級容量的輕微退化。
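The numbers in the Figure 3 example (10 and 2 units of capacity, 4 units admitted without sidelinks, 3x more with them) can be reproduced with a small model. This is only a sketch: it assumes ingress traffic is split evenly across supernodes, as the Figure 3 caption states, and it additionally assumes that sidelinks have enough capacity never to be the bottleneck. The function names are made up for this example.

```python
def admissible_without_sidelinks(capacities):
    """Each of the n supernodes carries D/n, so D is bounded by n * min(C_i)."""
    return len(capacities) * min(capacities)


def admissible_with_sidelinks(capacities):
    """With ample sidelinks traffic can be rebalanced, so D is bounded by sum(C_i)."""
    return sum(capacities)


if __name__ == "__main__":
    figure_3 = [10, 2]  # active capacity of A1 and A2 toward site B
    without = admissible_without_sidelinks(figure_3)
    with_sidelinks = admissible_with_sidelinks(figure_3)
    wasted = sum(figure_3) - without
    print(without, wasted, with_sidelinks, with_sidelinks / without)
```

For the example this gives 4 admitted units and 8 wasted units without sidelinks, and 12 units (3x) once traffic can be rebalanced inside the site.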

Evolving the existing site-level TE to a hierarchical TE (site-level, and then supernode-level) turned out to be challenging. TE typically requires tunnel encapsulation (e.g., via MPLS label [34], IP-in-IP encapsulation [30], or VLAN tag [3]), while off-the-shelf commodity switch chips can only hash on either the inner or outer layer packet header. With two layers of tunnel encapsulation, the packet header in both the inner and outer layer has very low entropy, making it impossible to enforce traffic splits. Another option is to overwrite the MAC address on each packet as a tag for supernode-level TE. However, we already reserve MAC addresses for more efficient hashing (§5). To solve this, we design and implement a novel intra-site TE algorithm (§4) which requires no packet tagging/encapsulation. Moreover, we find that our algorithm is scalable, by taking 200-700 milliseconds to run at our target scale, and effective, by reducing capacity loss due to topology abstraction to 0.6% on a typical site-level link even in the face of capacity asymmetry.

將現有的站點級TE演化爲層次化TE(站點級,其後是超級節點級)證實是具備挑戰性的。TE一般須要隧道封裝(如,MPLS標籤[34],IP-in-IP封裝[30]或VLAN標籤[3]),然而商用交換機芯片只能依據數據包內部或者外部報頭進行哈希計算。使用兩層隧道封裝,內部和外部數據報頭均只有很是低的熵值,致使沒法實施流量切分。另外一種可選方案是重寫每一個數據包的MAC地址做爲超級節點級TE的標籤。然而,咱們已經保留MAC地址以實行更有效的哈希計算(第五部分)。爲了解決這個問題,咱們設計和實現了新型站點內TE算法(第四部分),該算法不須要數據包打標/封裝。此外,該算法是可擴展的,在目標規模下運行時間爲200-700毫秒;該算法是高效的,在典型站點級鏈路上,即便在容量不對稱問題存在時也可減小拓撲抽象致使的容量損失到0.6%。

A further challenge is that TE updates can introduce routing blackholes and loops. With pre-installed tunnels, steering traffic from one tunnel to another is an atomic operation, as only the ingress source node needs to be updated to encapsulate traffic into a different tunnel. However, removing tunnel encapsulation complicates network updates. We find that applying TE updates in an arbitrary order results in more than 38% packet blackholing from forwarding loops in 2% of network updates (§6.3). To address this issue, earlier work enables loop-free update with two-phase commit [31] and dependency tree/forest based approach [28]. More recently, Dionysus [21] models the network update as a resource scheduling problem and uses critical path scheduling to dynamically find a feasible update schedule. However, these efforts assume tunneling/version tagging, leading to the previously described hashing problem for our hierarchical TE. We develop a simple dependency graph based solution to sequence supernode-level TE updates in a provably blackhole/loop free manner without any tunneling/tagging (§4.4).

另外一個挑戰是TE升級可能會引入路由黑洞和環路。使用預設隧道,因爲只有入口源節點須要更新從而將流量封裝到一個不一樣的隧道,將流量由一個隧道導向另外一個隧道是的過程是原子操做。然而,移除隧道封裝使網絡更新複雜化。咱們發現,以任意順序應用TE更新致使2%的網絡更新的前向環路中有多達38%的數據包黑洞(見6.3節)。爲了解決這個問題,前期工做採用兩階段提交[31]和基於依賴樹/森林的方法[28]實現無環更新。最近,Dionysus[21]將網絡更新建模爲資源調度問題,並使用關鍵路徑調度來動態搜索可行的更新調度方法。然而,這些工做假設隧道/版本打標,對咱們的層次化TE來講,致使前文所描述的哈希問題。咱們開發了一種簡單的基於依賴圖的方法,排定超級節點級TE更新操做的順序,其是可證實無黑洞/環路的,並且不須要任何隧道/標籤(見4.4節)。

2.3 Efficient switch rule management (高效交換機規則管理)

Merchant silicon switches support a limited number of matching and hashing rules. Our scaling targets suggested that we would hit the limits of switch matching rules at 32 sites using our existing packet forwarding pipeline. On the other hand, hierarchical TE requires many more switch hashing entries to perform two-layer, fine-grained traffic splitting. SWAN [16] studied the tradeoffs between throughput and the switch hashing rule limits and found that dynamic tunneling requires far fewer hashing rules than a fixed tunnel set. However, we find that dynamic tunneling in hierarchical TE still requires 8x more switch hashing rules than available to achieve maximal throughput at our target scale.

商用硅交換機支持有限數量的匹配和哈希規則。使用咱們現有的包轉發流水線,在32個站點時,咱們將達到交換機匹配規則的數量限制。另外一方面,層次化TE須要更多的交換機哈希項來執行兩層的細粒度流量劃分。SWAN[16]研究了吞吐量和交換機哈希規則限制之間的權衡,發現動態隧道化比靜態隧道集合須要更少的哈希規則。然而,咱們發如今層次下TE下采用動態隧道化仍然須要比達到目標規模下的最大吞吐量所須要的交換機哈希規則多8倍以上。

We optimize our switch forwarding behavior with two mechanisms (§5). First, we decouple flow matching rules into two switch pipeline tables. We find this hierarchical, two-phase matching mechanism increases the number of supported sites by approximately 60x. Second, we decentralize the path split rules into two-stage hashing across the edge and the backend stage of the Clos fabric. We find this optimization is key to hierarchical TE, which otherwise would offer 6% lower throughput, quite substantial in absolute terms at our scale, due to insufficient traffic split granularity.

咱們使用兩種機制優化交換機轉發行爲(第五部分)。首先,咱們將流匹配規則分解到兩個交換機流水錶。咱們發現這種層次化的兩階段匹配機制將支持的站點數目增長了接近60倍。其次,咱們將路徑分割規則去中心化爲跨Clos設施邊緣和後端階段的2階段哈希。咱們發現這種優化方法是層次化TE的關鍵,將吞吐量提高6%(在咱們的規模下,就絕對值來講是很是大的),這是由於不採用這種優化方法致使流量劃分粒度不足。

 3 SITE TOPOLOGY EVOLUTION (站點拓撲演化) 

Table 2 summarizes how B4 hardware and the site fabric topology abstraction have evolved from a flat topology to a scalable two-stage hierarchy over the years. We next present Saturn (§3.1), our first-generation network fabric, followed by Jumpgate (§3.2), a new generation of site network fabrics with improved hardware and topology abstraction.

Table 2: B4 fabric generations. All fabrics are built with merchant silicon switch chips.

 

表2總結了B4硬件和站點設施拓撲如何由扁平拓撲逐年演化爲可擴展的兩階段層次化拓撲。咱們接下來討論咱們第一代網絡設施Saturn(見3.1節),接着討論新一代站點網絡設施Jumpgate(見3.2節);jumpgate提高了硬件和拓撲結構。 

3.1 Saturn

Saturn was B4’s first-generation network fabric, deployed globally in 2010. As shown in Figure 4, the deployment consisted of two stages: A lower stage of four Saturn chassis, offering 5.12 Tbps of cluster-facing capacity, and an upper stage of two or four Saturn chassis, offering a total of 3.2 Tbps and 6.4 Tbps respectively toward the rest of B4. The difference between cluster and WAN facing capacity allowed the fabric to accommodate additional transit traffic. For availability, we physically partitioned a Saturn site across two buildings in a datacenter. This allowed Saturn sites to continue operating, albeit with degraded capacity, if outages caused some or all of the devices in a single building to become unreachable. Each physical rack contained two Saturn chassis, and we designed the switch to rack mapping to minimize the capacity loss upon any single rack failure.

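Figure 4: Evolution of B4 fabrics from the flat Saturn topology to the hierarchical Jumpgate topology, which introduces an abstraction layer called a supernode.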

Saturn是B4的第一代網絡設施,2010年在全球部署。如圖4所示,部署包括兩個步驟:4個Saturn底架構成的底層,提供5.12Tbps的面向集羣的容量;由2個或4個底架構成的上層,向B4的其餘部分分別提供3.2 Tbps和6.4 Tbps的容量。面向集羣和WAN的容量的差別使得設施能夠提供額外的中轉流量。針對可用性,咱們將單一Saturn的站點從物理上劃分到數據中心的兩個建築。這使得Saturn站點能夠在容量下降的情形下仍然工做(單一建築中的某些或者全部設備不可達)。每一個物理機架包含2個Saturn底架,咱們設計交換機到機架映射使得任意單機架故障情形下容量損失最小化。 

3.2 Jumpgate

Jumpgate is an umbrella term covering two flavors of B4 fabrics. Rather than inheriting the topology abstraction from Saturn, we recognize that the flat topology was inhibiting scale, and so we design a new custom hierarchical network topology in Jumpgate for both horizontal and vertical scaling of site-level capacity without impacting the scaling requirements of control software. As shown in Figure 4, we introduce the concept of a supernode as an intermediate topology abstraction layer. Each supernode is a 2-stage folded-Clos network. Half the ports in the lower stage are external-facing and can be flexibly allocated toward peering B4 sites, cluster fabrics, or other supernodes in the same site. We then build a Jumpgate site using a full mesh topology interconnecting supernodes. These intra-site links are called sidelinks. In addition, B4’s availability in Saturn was significantly reduced by having a single control domain per site. We had a number of outages triggered by a faulty domain controller that caused widespread damage to all traffic passing through the affected site. Hence, in Jumpgate we partition a site into multiple control domains, one per supernode. This way, we improve availability by reducing the blast radius of any domain controller fault to traffic transiting a single supernode.

Jumpgate是覆蓋兩種B4設施特色的涵蓋性術語。與其繼承Saturn的拓撲結構,咱們意識到扁平拓撲會限制擴展性,所以咱們在Jumpgate中設計了一種新的定製的層次化網絡拓撲;Jumpgate爲站點級容量提供水平和垂直擴展,而不影響控制軟件的擴展需求。如圖4所示,咱們引入超級節點做爲中間拓撲抽象層。每一個超級節點是一個2階段folded-Clos網絡。底層的一半端口是面向外部的,能夠靈活地分配給對等B4站點、集羣設施或者同一站點內其它的超級節點。接着,咱們使用互連超級節點的全mesh拓撲構建Jumpgate站點。這些互連鏈路稱爲旁路。此外,因爲每一個站點只有一個控制域,Saturn中B4的可用性顯著下降。咱們觀察到大量的由域控制器故障致使的中斷,這會普遍損害全部途徑受影響站點的數據流。所以,Jumpgate中將站點劃分爲多個控制域,每一個超級節點一個控制域。採用這種方式,經過減小任意域控制器故障的爆炸半徑到途徑單一超級節點的數據流,提高了B4的可用性。

We present two generations of Jumpgate where we improve availability by partitioning the site fabric into increasingly more supernodes and more control domains across generations, as shown in Figure 4. This new architecture solves the previously mentioned network expansion problem by incrementally adding new supernodes to a site with flexible sidelink reconfigurations. Moreover, this architecture also facilitates seamless fabric evolution by sequentially upgrading each supernode in place from one generation to the next without disrupting traffic in other supernodes.

咱們展現了兩代Jumpgate,不一樣代次間站點劃分爲更多的超級節點和控制域,從而提高了可用性,如圖4所示。這種新型架構解決了先前提到的網絡擴展問題;解決方案是使用靈活的旁路重配和增量增長站點內的超級節點數量。此外,這種架構經過依次升級每一個超級節點(由某一代到下一代)爲無縫設施演進提供了便利,而且無需干擾其餘超級節點的流量。

Jumpgate POP (JPOP): Strategically deploying transit-only sites improves B4’s overall availability by reducing the network cable span between datacenters. Transit sites also improve site-level topology "meshiness," which improves centralized TE’s ability to route around a failure. Therefore, we develop JPOP, a low-cost configuration for lightweight deployment in the transit POP sites supporting only transit traffic. Since POP sites are often constrained by power and physical space, we develop JPOP with high bandwidth density 16x40G merchant silicon (Figure 4), requiring a smaller number of switch racks per site.

JPOP:經過減小數據中心間網絡線纜的跨度,戰略性部署中轉站點提高了B4的總可用性。中轉站點同時提升了站點級拓撲的「meshiness」,進而提高了集中TE繞過故障的能力。所以,咱們開發了JPOP,一種用於只支持中轉流量的中轉POP站點輕量級部署的低成本配置方案。因爲POP站點一般受到電源和物理空間的限制,咱們使用高帶寬密度16x40G商用硅(圖4)構建JPOP,所以每一個站點只須要少許的交換機機架。 

Stargate: We globally deprecate Saturn with Stargate, a large network fabric to support organic WAN traffic demand growth in datacenters. A Stargate site consists of up to four supernodes, each a 2-stage folded-Clos network built with 32x40G merchant silicon (Figure 4). Stargate is deployed in datacenters and can provide up to 81.92 Tbps site-external capacity that can be split among WAN, cluster and sidelinks. Compared with Saturn, Stargate improves site capacity by more than 8x to keep up with growing traffic demands. A key for this growth is the increasing density of forwarding capacity in merchant silicon switch chips, which enables us to maintain a relatively simple topology. The improved capacity allows Stargate to subsume the campus aggregation network. As a result, we directly connect Stargate to Jupiter cluster fabrics [32], as demonstrated in Figure 5. This architecture change simplifies network modeling, capacity planning and management.

Figure 5: Stargate subsumes the campus aggregation network.

Stargate: 咱們在全球範圍內使用Stargate替代Saturn。Stargate是一種支持數據中心WAN流量需求增加的大型網絡設施。Stargate站點由多達4個超級節點構成,每一個超級節點由32x40G商用硅(圖4)構建成2階段folded-Clos網絡。Stargate部署於數據中心,能夠提供高達81.92 Tbps的站點到外部容量,這些容量能夠在WAN、集羣和旁路間劃分。與Saturn相比,Stargate將站點容量提高了8倍以上,從而知足增加的流量需求;這種增加的關鍵是商用硅交換芯片中轉發容量密度的增加,使得咱們能夠維護相對簡單的拓撲。容量的提高使得Stargate能夠歸入園區匯聚網絡。其結果是,咱們能夠直接將Stargate連接到Jupiter集羣設施[32],如圖5所示。架構改變簡化了網絡建模、容量規劃和管理。

4 HIERARCHICAL TRAFFIC ENGINEERING (層次化流量工程TE)

In this section, we start with two straw-man proposals for the capacity asymmetry problem (§4.1). We solve this problem by evolving from flat TE into a hierarchical TE architecture (§4.2) where a scalable, intra-site TE algorithm is developed to maximize site-level link capacity (§4.3). Finally, we present a dependency-graph based algorithm to sequence the supernode-level rule updates (§4.4). Both algorithms are highly scalable, blackhole and loop free, and do not require packet encapsulation/tagging. 

本節,咱們以兩個解決容量不對稱問題的稻草人方案爲起始(見4.1節):咱們經過扁平化TE演化爲層次化TE架構解決這個問題(見4.2節),開發了一種可擴展的站內TE算法,最大化站點間鏈路容量(見4.3節)。最後,咱們給出了基於依賴圖的超級節點級規則升級序列算法(見4.4節)。這些算法均是可擴展,無黑洞和環路的,而且不須要數據包封裝/打標。

4.1 Motivating Examples (驅動性示例)

Managing capacity asymmetry in hierarchical topologies requires a supernode-level load balancing mechanism to maximize capacity at the site-level topology abstraction. Additionally, we need the solution to be fast, improving availability by reducing the window of blackholing after data plane failures, and efficient, achieving high throughput within the hardware switch table limits. Finally, the solution must not require more than one layer of packet encapsulation. We discuss two straw-man proposals: 

在層次化拓撲中管理容量不對稱須要一種超級節點級負載均衡機制,以最大化站點級拓撲容量。此外,咱們須要解決方案快速(經過減小數據平面故障後黑洞窗口時間提升可用性)和高效(在硬件交換機表約束下取得高吞吐量)。最後,該解決方案不能使用多於一層的數據包封裝。咱們討論兩種稻草人方案: 

Flat TE on supernodes does not scale. With a hierarchical topology, one could apply TE directly to the full supernode-level topology. In this model, the central controller uses IP-in-IP encapsulation to load balance traffic across supernode-level tunnels. Our evaluation indicates that supporting high throughput with this approach leads to prohibitively high runtime, 188x higher than hierarchical TE, and it also requires a much larger switch table space (details in §6). This approach is untenable because the complexity of the TE problem grows super-linearly with the size of the topology. For example, suppose that each site has four supernodes, then a single site-to-site path with three hops can be represented by 4^3 = 64 supernode-level paths.

超級節點上的扁平化TE沒法擴展。使用層次化拓撲,能夠直接將TE應用於所有超級節點拓撲。這種模型中,集中控制器使用IP-in-IP封裝在超級節點隧道間均衡負載。咱們的評估代表採用這種方法來支持高吞吐量會致使太高的運行時間(比層次化TE高188倍),而且須要更多的交換機表空間(詳見第6部分)。因爲TE問題的複雜度隨着拓撲大小超線性增加,這種方法是不可行的。例如,假設每一個站點包含4個超級節點,那麼單條三跳的站點-站點路徑能夠由4^3=64條超級節點級路徑表示。
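The 4^3 = 64 figure above can be checked by brute-force enumeration. The sketch below assumes, as the example in the text does, that one of the supernodes is chosen independently at each of the three hops; the function names are hypothetical.

```python
from itertools import product


def count_supernode_paths(supernodes_per_site: int, hops: int) -> int:
    """Closed form: one independent supernode choice per hop."""
    return supernodes_per_site ** hops


def enumerate_supernode_choices(supernodes_per_site: int, hops: int):
    """List every combination of per-hop supernode indices explicitly."""
    return list(product(range(supernodes_per_site), repeat=hops))


if __name__ == "__main__":
    print(count_supernode_paths(4, 3))
    print(len(enumerate_supernode_choices(4, 3)))
    # One more hop multiplies the path count by the number of supernodes again.
    print(count_supernode_paths(4, 4))
```

Because every additional hop multiplies the count by the number of supernodes per site, the supernode-level path set grows exponentially with path length, which is one way to see why a flat formulation becomes expensive.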

Supernode-level shortest-path routing is inefficient against capacity asymmetry. An alternative approach combines site-level TE with supernode-level shortest path routing. Such two-stage, hierarchical routing achieves scalability and requires only one layer of encapsulation. Moreover, shortest path routing can route around complete WAN failure via sidelinks. However, it does not properly handle capacity asymmetry. For example, in Figure 3b, shortest path routing cannot exploit longer paths via sidelinks, resulting in suboptimal site-level capacity. One can tweak the cost metric of sidelinks to improve the abstract capacity between site A and B. However, changing metrics also affects the routing for other site-level links, as sidelink costs are shared by tunnels towards all nexthop sites. 

超級節點級最短路徑路由因容量不對稱問題而不高效。一種可選擇的方案是結合站點級TE和超級節點級最短路徑路由。這種兩階段的層次化路由是可擴展的,而且只須要一層封裝。此外,最短路徑路由能夠經過旁路徹底繞開WAN故障。然而,這種方法沒法有效的應對容量不對稱。例如,圖3b中,最短路徑路由沒法利用經過旁路的長路徑,致使站點級容量不理想。能夠微調旁路的代價度量提高站點A和站點B之間的容量。然而,因爲旁路的代價由全部到下一跳站點的隧道共享,改變度量值一樣會影響其餘站點級鏈路的路由。

4.2 Hierarchical TE Architecture (層次化TE架構)

Figure 6 demonstrates the architecture of hierarchical TE. In particular, we employ the following pathing hierarchy: 

圖6給出層次化TE的架構。具體地,咱們採用下述選路層次:

Flow Group (FG) specifies flow aggregation as a ⟨Source Site, Destination Site, Service Class⟩ tuple, where we currently map the service class to DSCP marking in the packet header. For scalability, the central controller allocates paths for each FG.

流分組(FG)將流聚合做爲元組(源站點、目標站點、服務等級),當前,咱們將服務等級映射爲數據包頭中的DSCP標記。爲了擴展性,集中控制器爲每一個FG分配路徑。

Tunnel Group (TG) maps an FG to a collection of tunnels (i.e., site-level paths) via IP-in-IP encapsulation. We set the traffic split with a weight for each tunnel using an approximately max-min fair optimization algorithm (§4.3 in [18]). 

隧道分組(TG)經過IP-in-IP封裝將FG映射到隧道集合(即,站點級路徑)。咱們使用一種近似的最大-最小公平優化算法爲每一個隧道的流量劃分設置權重值([18]的4.3節)。

Tunnel Split Group (TSG), a new pathing abstraction, specifies traffic distribution within a tunnel. Specifically, a TSG is a supernode-level rule which controls how to split traffic across supernodes in the self-site (other supernodes in the current site) and the next-site (supernodes in the tunnel’s next site).

隧道劃分組(TSG),一種新的選路抽象,指明某一隧道中的流量分佈。具體地,TSG是一種超級節點級規則,用於控制如何在自站點(當前站點中的其餘超級節點)和下一站點(隧道另外一站點的超級節點)間劃分流量。

Switch Split Group (SSG) specifies traffic split on each switch across physical links. The domain controller calculates SSGs.

交換機劃分組(SSG):指明在交換機上跨物理鏈路的流量劃分。域控制器計算SSG。
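One way to picture the four-level pathing hierarchy above (FG, TG, TSG, SSG) is as a set of nested records, each carrying integer split weights. The sketch below is purely illustrative: every class name, field name, and example value is an assumption made for this example and does not describe B4's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Mapping, Tuple


@dataclass(frozen=True)
class FlowGroup:
    """<Source Site, Destination Site, Service Class> tuple."""
    source_site: str
    destination_site: str
    service_class: str  # mapped to a DSCP marking in the packet header


@dataclass
class TunnelGroup:
    """Maps an FG to weighted tunnels; a tunnel is a site-level path."""
    flow_group: FlowGroup
    tunnel_weights: Dict[Tuple[str, ...], int] = field(default_factory=dict)


@dataclass
class TunnelSplitGroup:
    """Supernode-level rule: split between self-site and next-site supernodes."""
    supernode: str
    next_site: str
    self_site_weights: Dict[str, int] = field(default_factory=dict)
    next_site_weights: Dict[str, int] = field(default_factory=dict)


@dataclass
class SwitchSplitGroup:
    """Per-switch split across physical links, computed by the domain controller."""
    switch: str
    link_weights: Dict[str, int] = field(default_factory=dict)


def split_fractions(weights: Mapping[Hashable, int]) -> Dict[Hashable, float]:
    """Normalize integer weights into traffic fractions."""
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()} if total else {}


if __name__ == "__main__":
    fg = FlowGroup("A", "C", "example-class")  # hypothetical service class name
    tg = TunnelGroup(fg, {("A", "B", "C"): 3, ("A", "C"): 1})
    tsg = TunnelSplitGroup("B1", "C", {"B2": 1}, {"C2": 2})
    print(split_fractions(tg.tunnel_weights))
    print(split_fractions({**tsg.self_site_weights, **tsg.next_site_weights}))
```

Keeping weights as integers at every level mirrors the fact that each split is ultimately realized by a limited number of switch hashing entries.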

We decouple TG and TSG calculations for scalability. In particular, the TG algorithm is performed on top of a site-level abstract topology which is derived from the results of TSG calculations. TSG calculation is performed using only topology data, which is unaffected by TG algorithm results. We outline the hierarchical TE algorithm in the following steps. First, the domain controller calculates supertrunk (supernode-level link) capacity by aggregating the capacity of active physical links and then adjusting capacity based on fabric impairments. For example, the supernode Clos fabric may not have full bisection bandwidth because of failure or maintenance. Second, using supertrunk capacities, the central controller calculates TSGs for each outgoing site-level link. When the outgoing site-level link capacity is symmetric across supernodes, sidelinks are simply unused. Otherwise, the central controller generates TSGs to rebalance traffic arriving at each site supernode via sidelinks to match the outgoing supertrunk capacity. This is done via a fair-share allocation algorithm on a supernode-level topology including only the sidelinks of the site and the supertrunks in the site-level link (§4.3). Third, we use these TSGs to compute the effective capacity for each link in the site-level topology, which is in turn consumed by TG generation (§4.3 in [18]). Fourth, we generate a dependency-graph to sequence TE ops in a provably blackhole-free and loop-free manner (§4.4). Finally, we program FGs, TGs, and TSGs via the domain controllers, which in turn calculate SSGs based on the intradomain topology and implement hierarchical splitting rules across two levels in the Clos fabric for scalability (§5).

爲了可擴展性,咱們解耦TG和TSG計算。具體地,TG算法運行於站點級拓撲之上,且基於TSG的計算結果。TSG的計算只使用拓撲數據,而且不受TG運算結果的影響。咱們以以下步驟簡述層次化TE算法。首先,域控制器經過聚合有效物理鏈路的容量計算超級幹線(超級節點級鏈路)的容量,而後基於設施損害調整容量。例如,因爲故障或者維護,超級節點Clos設施可能沒有全折半帶寬。其次,使用超級幹線容量,中央控制器計算每一個出口站點級鏈路的TSG。當出口站點級鏈路的容量在超級節點間是對稱的,旁路不被使用。不然,集中控制器生成TSG經過旁路從新均衡到達每一個站點超級節點的流量,以匹配出口超級幹線容量;這是經過超級節點級拓撲(只包含站點的旁路和站點級鏈路的超級幹線)上的公平共享算法實現的。第三,咱們使用TSG計算站點級拓撲中每條鏈路的有效容量,結果接着由TG生成所消耗(見[18]中4.3節)。第四,咱們生成依賴圖以序列化TE操做,使得其是無黑洞和環路的(見4.4節)。最後,咱們經過域控制器編排FG,TG和TSG;根據域內拓撲計算SSG,跨Clos設施的兩層實現層次化劃分規則以達到可擴展性(第5部分)。

Figure 6: Example of hierarchical TE; supernode-level TE is the newly developed component that handles capacity asymmetry via sidelinks.

4.3 TSG Generation (TSG生成)

Problem statement: Supposing that the incoming traffic of a tunnel is equally split across all supernodes in the source site, calculate TSGs for each supernode within each site along the tunnel such that the amount of traffic admitted into this tunnel is maximized subject to supertrunk capacity constraints. Moreover, an integer is used to represent the relative weight of each outgoing supertrunk in the TSG split. The sum of the weights on each TSG cannot exceed a threshold, T, because of switch hashing table entry limits.

問題聲明:假設某隧道的輸入流量在源站點的全部超級節點間均勻劃分,在該隧道途徑的每一個站點內計算每一個超級節點的TSG使得該隧道的准入流量總量最大(受限於超級幹線約束)。此外,以整數表示該TSG劃分中每一個出口超級幹線的相對權重。每一個TSG中的權重總和不能超過閾值T,這是由於交換機哈希表項的限制。 
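The integer-weight constraint in the problem statement (weights per TSG summing to at most T) means an ideal fractional split has to be rounded. The text here does not say how B4 performs that rounding, so the sketch below swaps in a generic largest-remainder rounding as a stand-in; the function name and the final GCD reduction are assumptions of this example.

```python
from fractions import Fraction
from functools import reduce
from math import floor, gcd


def quantize_weights(shares, max_sum):
    """Round non-negative shares to integer weights whose sum is at most max_sum.

    Largest-remainder rounding (a stand-in technique, not the paper's method):
    scale shares to max_sum, take floors, then hand leftover units to the
    entries with the largest fractional parts.
    """
    if max_sum < 1:
        raise ValueError("max_sum must be at least 1")
    exact = [Fraction(s) for s in shares]
    total = sum(exact)
    if total <= 0 or any(s < 0 for s in exact):
        raise ValueError("shares must be non-negative with a positive sum")
    scaled = [s / total * max_sum for s in exact]
    weights = [floor(s) for s in scaled]
    leftover = max_sum - sum(weights)
    by_remainder = sorted(range(len(scaled)),
                          key=lambda i: scaled[i] - weights[i], reverse=True)
    for i in by_remainder[:leftover]:
        weights[i] += 1
    common = reduce(gcd, weights)  # shrink e.g. 8:4 to 2:1 to save hash entries
    return [w // common for w in weights]


if __name__ == "__main__":
    print(quantize_weights([2, 1], max_sum=12))
    print(quantize_weights([1, 1, 1], max_sum=4))
```

A small T makes the realized split coarser than the ideal one, which is the throughput-versus-table-space tension the constraint expresses.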

Examples: We first use examples of TSG calculations for fine-grained traffic engineering within a tunnel. In Figure 7a, traffic to supernodes Bi is equally split between two supernodes to the next tunnel site, C1 and C2. This captures a common scenario as the topology is designed with symmetric supernode-level capacity. However, capacity asymmetry may still occur due to data-plane failures or network operations. Figure 7b demonstrates a failure scenario where B1 completely loses connectivity to C1 and C2. To route around the failure, the TSG on B1 is programmed to only forward to B2. Figure 7c shows a scenario where B2 has twice the capacity toward C relative to B1. As a result, the calculated TSG for B1 has a 2:1 ratio for the split between the next-site (C2) and self-site (B2). For simplicity, the TSG at the next site C does not rebalance the incoming traffic. 

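Figure 7: Examples of TSG functionality under different failure scenarios. The traffic shown here is encapsulated into a tunnel A->B->C, and TSGs control how traffic is split at the supernode level (Ai, Bi and Ci). Each supertrunk has capacity c, and the label TC indicates the maximum tunnel capacity under the given TSG configuration.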

示例:咱們首先使用TSG計算的例子,用於計算某一隧道內細粒度流量工程。圖7a中,到超級節點Bi的流量在兩個到下一隧道站點的超級站點間均勻劃分(C1和C2)。因爲拓撲設計爲對稱超級節點級容量,圖7a展現了常見場景。然而,因爲數據平面故障和網絡操做,容量不對稱仍有可能發生。圖7b展現了一種故障場景,其中B1徹底失去與C1和C2的連接。爲了繞過故障,B1上的TSG編排爲僅轉發到B2。圖7c展現了B2相對於B1有兩倍到C的容量的場景。其結果是,計算出的B1的TSG在下一站點(C2)和自站點(B2)間劃分的比例爲2:1。簡化起見,下一站點C的TSG不須要從新均衡入口流量。 

We calculate TSGs independently for each site-level link. For our deployments, we assume balanced incoming traffic across supernodes as we observed this assumption was rarely violated in practice. This assumption allows TSGs to be reused across all tunnels that traverse the site-level link, enables parallel TSG computations, and allows us to implement a simpler solution which can meet our switch hardware limits. We discuss two rare scenarios where our approach can lead to suboptimal TSG splits. First, Figure 7d shows a scenario where B1 loses connectivity to both next site and self site. This scenario is uncommon and has happened only once, as the sidelinks are located inside a datacenter facility and hence much more reliable than WAN links. In this case, the site-level link (B,C) is unusable. One could manually recover the site-level link capacity by disabling B1 (making all its incident links unusable), but it comes at the expense of capacity from B1 to elsewhere. Second, Figure 7e shows a scenario with two consecutive asymmetric site-level links resulting from concurrent failure. Because TSG calculation is agnostic to incoming traffic balance at site B, the tunnel capacity is reduced to 9c/4, 25% lower than the optimal splitting where B1 forwards half of its traffic toward self-site (B2) to accommodate the incoming traffic imbalance between B1 and B2, as shown in Figure 7f. Our measurements show that this failure pattern happens rarely: to only 0.47% of the adjacent site-level link pairs on average. We leave more sophisticated per-tunnel TSG generation to future work.

咱們單獨爲站點級鏈路計算TSG。對於咱們的部署來講,咱們假設超級節點間的入口流量是均衡的;實踐中,咱們發現不多出現違反這一假設的情形。這一假設容許TSG在途徑站點級鏈路的全部隧道間被重用,使得並行TSG計算成爲可能,而且容許咱們實現一種簡單的方案知足硬件限制。咱們討論兩種不多出現的場景,這些場景中咱們的解決方案可能致使TSG劃分不理想。首先,圖7d展現了一種場景,其中B1失去子站點和下一站點的連接。這種場景並不常見,且只發生了一次,這是由於旁路在同一數據中心中,所以比WAN鏈路更爲可靠。這種情形下,站點級鏈路(B,C)不可用。能夠經過使B1失效(全部進入B1的鏈路無效)恢復站點級鏈路的容量,但這種方法的代價是B1到其餘地方的容量不可用。第二,圖7e顯示了併發故障致使的兩個連續不對稱站點級鏈路的情景。因爲站點B處入口流量均衡對TSG計算是不可知的,隧道容量下降到9c/4,比理想劃分下降了25%(B1轉發其一半的流量到自站點B2以適應B1和B2間入口流量的不均衡,如圖7f所示)。咱們的測試代表這種故障模式極少發生:平均只有0.47%的連接站點級鏈路對。咱們將更精緻的每隧道TSG生成留做將來工做。

TSG generation algorithm: We model the TSG calculation as an independent network flow problem for each directed site-level link. We first generate a graph GTSG where vertices include the supernodes in the source site of the site-level link (Si, 1 ≤ i ≤ N) and the destination site of the site-level link (D). We then add two types of links to the graph. First, we form a full mesh among the supernodes in the source site: (∀i, j ≠ i : Si ↔ Sj). Second, links are added between each supernode in the source site and the destination site: (∀i : Si ↔ D). We associate the aggregate capacity of the corresponding supertrunks to each link. Subsequently, we generate flows with infinite demand from each supernode Si toward the target site D and try to satisfy this demand by using two kinds of pathing groups (PG) with one-hop and two-hop paths respectively. We use a greedy exhaustive waterfill algorithm to iteratively allocate bottleneck capacity in a max-min fair manner. We present the TSG generation pseudo code in Algorithm 1.

TSG生成算法:咱們將TSG計算建模爲每條直連站點級鏈路的網絡流問題。首先,生成圖GTSG,圖中頂點包含站點級鏈路的源站點的超級節點(Si,1<=i<=N)和站點級鏈路的目的站點(D)。咱們增長兩種類型的鏈路到圖中。首先,咱們將源站點中全部超級節點構成全mesh(∀i, j 不等於 i : Si ↔ Sj)。第二,增長每一個源站點的超級節點和目的站點的連接:(∀i : Si ↔ D)。咱們將相應超級幹線容量關聯到每條鏈路。隨後,咱們生成由每一個超級節點i到目的站點D的具備無限需求的數據流,並嘗試使用兩種選路組(PG)知足這一需求。兩種選路組分別爲單跳路徑和2跳路徑。咱們使用貪心的窮舉瀑布算法以最大-最小公平方式迭代的分配瓶頸容量。TSG生成算法的僞代碼如算法1所示。
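The waterfill idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a reproduction of Algorithm 1: the function name `generate_tsg`, the rule that a flow spreads evenly over all usable two-hop paths once its direct path is full, and the use of exact fractions are choices made for this example. It also lets a flow keep growing after another flow has run out of paths, which the paper's equal-ingress assumption may treat differently.

```python
from fractions import Fraction

DIRECT = "D"  # next-hop label for the one-hop path Si -> D


def generate_tsg(direct_cap, side_cap):
    """Return alloc[i] = {next hop: rate} for each source supernode Si.

    direct_cap[i]  : capacity of the supertrunks Si -> D
    side_cap[i][j] : capacity of the sidelinks   Si -> Sj
    """
    n = len(direct_cap)
    res_direct = [Fraction(c) for c in direct_cap]
    res_side = {(i, j): Fraction(side_cap[i][j])
                for i in range(n) for j in range(n) if i != j}
    alloc = [{} for _ in range(n)]

    while True:
        # A flow uses its direct path while it has room; afterwards it
        # spreads evenly over every two-hop path that still has room.
        paths = {}
        for i in range(n):
            if res_direct[i] > 0:
                paths[i] = [DIRECT]
            else:
                via = [j for j in range(n) if j != i
                       and res_side[i, j] > 0 and res_direct[j] > 0]
                if via:
                    paths[i] = via
        if not paths:
            return alloc

        # Load placed on each link per unit of rate added to every flow.
        load_direct = [Fraction(0)] * n
        load_side = {}
        for i, hops in paths.items():
            share = Fraction(1, len(hops))
            for hop in hops:
                if hop == DIRECT:
                    load_direct[i] += share
                else:
                    load_side[i, hop] = share
                    load_direct[hop] += share

        # Raise all active flows together until the first link fills up.
        step = min(
            [res_direct[k] / load_direct[k]
             for k in range(n) if load_direct[k] > 0]
            + [res_side[e] / load for e, load in load_side.items()]
        )
        for i, hops in paths.items():
            share = Fraction(1, len(hops))
            for hop in hops:
                alloc[i][hop] = alloc[i].get(hop, 0) + share * step
        for k in range(n):
            res_direct[k] -= load_direct[k] * step
        for e, load in load_side.items():
            res_side[e] -= load * step


# Figure 7c: the second supernode has twice the capacity toward the next site.
tsg = generate_tsg([1, 2], [[0, 10], [10, 0]])
print(tsg[0])  # rates per next hop for the first supernode
```

With these inputs the first supernode sends 1 unit directly and 1/2 unit over the sidelink, which is the 2:1 next-site to self-site split described for Figure 7c. Each iteration saturates at least one link, so the loop terminates after at most as many rounds as there are links.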

Theorem 4.1. The generated TSGs do not form any loop.
定理4.1:生成的TSG不會造成環路。

Proof. Assume the generated graph has K + 1 vertices: Si (0 ≤ i ≤ K - 1), each of which represents a supernode in the source site, and D, which represents the destination site. Given the pathing constraint imposed in step (1), each flow can only use either the one-hop direct path (Si→D) or two-hop paths (Si→Sj≠i→D), while the one-hop direct path is strictly preferred over two-hop paths. Assume a loop with ℓ > 1 hops is formed with the forwarding sequence, without loss of generality: <S0, S1, ..., Sℓ-1, S0>. Note that the loop cannot contain the destination vertex D. Given the pathing preference and the loop sequence, the link (Si, D) must be bottlenecked prior to (Si+1, D), and (Sℓ-1, D) must be bottlenecked prior to (S0, D). We observe that the bottleneck sequence forms a loop, which is impossible given that the algorithm only bottlenecks each link once.

證實. 假設生成圖中有K+1個頂點:Si(0 ≤ i ≤ K - 1)表示源站點中的超級節點,和表示目的站點的D。給定步驟(1)中施加的選路限制,每條流智能使用單跳直連鏈路 (Si→D) 或兩跳路徑 (Si→Sj不等於i→D),且單跳路徑嚴格優先於兩跳路徑。假設ℓ > 1跳的環路在轉發序列中造成,不失通常性,假設環路爲<S0, S1, ..., Sℓ-1, S0>。注意到環路中不能包含目的頂點D。考慮到選路優先和環路序列,鏈路 (Si , D) 在 (Si+1, D)前成爲瓶頸,且(Sℓ-1, D)在(S0, D)前成爲瓶頸。咱們發現這種瓶頸序列造成環路,考慮到算法中每條鏈路只成爲瓶頸一次,上述序列是不可能的。

Algorithm 1: TSG generation

4.4 Sequencing TSG Updates (編排TSG更新順序)

We find that applying TSG updates in an arbitrary order could cause transient yet severe traffic looping/blackholing (§6.3), reducing availability. Hence, we develop a scalable algorithm to sequence TSG updates in a provably blackhole/loop-free manner as follows.

咱們發現以任意順序進行TSG更新會致使瞬時但嚴重的流量環路/黑洞(6.3節),下降可用性。爲此,咱們開發了編排TSG更新順序的可擴展算法,使得更新是無黑洞/環路的,描述以下。

TSG sequencing algorithm: Given the graph GTSG and the result of the TSGs described in §4.3, we create a dependency graph as follows. First, vertices in the dependency graph are the same as those in GTSG. We add a directed link from Si to Sj if Sj appears in the set of next hops for Si in the target TSG configuration. An additional directed link from Si to D is added if Si forwards any traffic directly to the next-site in the target TSG configuration. According to Theorem 4.1, this dependency graph will not contain a loop and is therefore a directed acyclic graph (DAG) with an existing topological ordering. We apply TSG updates to each supernode in the reverse topological ordering, and we show that the algorithm does not cause any transient loops or blackholes during transition as follows. Note that this dependency-based TSG sequencing is similar to how certain events are handled in Interior Gateway Protocol (IGP), such as link down, metric change [11] and migrations [33].

TSG序列化算法:給定圖GTSG和4.3節描述的TSG結果,咱們按以下方式構造依賴圖。首先,依賴圖中的頂點和圖GTSG中同樣。若是Sj出如今目標TSG配置中Si的下一跳集合中,增長直連鏈路Si->Sj。若是Si直接轉發任意流量到目標TSG配置中的下一跳,則增長Si到D的直連鏈路。根據定理4.1,依賴圖中不包含環路,所以是具備當前拓撲序的有向無環圖(DAG)。咱們以逆拓撲序應用每一個超級節點的TSG更新;在以下轉變期間,該算法不會致使瞬時環路或黑洞。注意,基於依賴的TSG序列和內部網關協議中(IGP)特定時間如何處理相似,如鏈路斷掉、度量值改變[11]和遷移[33]。
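The sequencing rule above can be sketched with Python's standard `graphlib` module (Python 3.9+). The function name `tsg_update_steps` and the dictionary layout of the target TSG are assumptions for illustration. Passing each supernode's next hops as its predecessors makes the sorter emit downstream vertices first, which is the reverse topological order of the forwarding graph.

```python
from graphlib import TopologicalSorter


def tsg_update_steps(target_tsg, dest="D"):
    """Group supernodes into update steps, downstream supernodes first.

    target_tsg maps each supernode to its next hops in the target TSG.
    Supernodes inside one step can be updated in any relative order.
    """
    sorter = TopologicalSorter(
        {node: set(hops) for node, hops in target_tsg.items()})
    sorter.prepare()  # raises graphlib.CycleError if the target has a loop
    steps = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        sorter.done(*ready)
        batch = [node for node in ready if node != dest]
        if batch:  # D itself needs no update
            steps.append(batch)
    return steps


target = {"S1": ["S2"], "S2": ["D"], "S3": ["D", "S2"]}
print(tsg_update_steps(target))  # S2 first, then S1 and S3 together
```

Here S2 is programmed first because both S1 and S3 forward to it in the target configuration; S1 and S3 then form a second step with no ordering between them.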

Theorem 4.2. Assuming that neither the original nor the target TSG configuration contains a loop, none of the intermediate TSG configurations contains a loop.

定理4.2. 假設原始或目標TSG配置中均不含環路,那麼中間TSG配置也不會包含環路。 

Proof. At any intermediate step, each vertex can be considered either resolved, in which the represented supernode forwards traffic using the target TSG, or unresolved, in which the traffic is forwarded using the original TSG. The vertex D is always considered resolved. We observe that the loop cannot be formed among resolved vertices, because the target TSG configuration does not contain a loop. Similarly, a loop cannot be formed among unresolved vertices, since the original TSG configuration is loop-free. Therefore, a loop can only be formed if it consists of at least one resolved vertex and at least one unresolved vertex. Thus, the loop must contain at least one link from a resolved vertex to an unresolved vertex. However, since the vertices are updated in the reverse topological ordering of the dependency graph, it is impossible for a resolved vertex to forward traffic to an unresolved vertex, and therefore a loop cannot be formed.

證實:在任意中間步驟,每一個頂點能夠認爲是已決斷的(超級節點使用目標TSG轉發流)或者未決斷的(超級解決使用原始TSG轉發流)。頂點D老是已決斷的。咱們發現環路不會在已決斷頂點間造成,這是由於目標TSG配置不包含環路。相似地,環路也不能在未決斷頂點間造成,由於原始的TSG配置是無環的。所以,環路必需要包含至少一個已決斷頂點和至少一個未決斷頂點。然而,因爲頂點以依賴圖的逆拓撲序更新,已決斷頂點不可能轉發流到未決斷頂點,所以沒法造成環路。
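A small check makes the argument concrete. The two-supernode configurations and the helper names `has_loop` and `intermediate_states` below are hypothetical; the sketch replays an update order one supernode at a time and looks for a forwarding loop in every mixed configuration.

```python
def has_loop(next_hops):
    """Depth-first search for a forwarding loop in {node: [next hops]}."""
    state = {}  # node -> "visiting" or "done"

    def visit(node):
        state[node] = "visiting"
        for hop in next_hops.get(node, ()):
            if state.get(hop) == "visiting":
                return True
            if hop not in state and visit(hop):
                return True
        state[node] = "done"
        return False

    return any(node not in state and visit(node) for node in list(next_hops))


def intermediate_states(original, target, order):
    """Yield the mixed configuration after each supernode is updated."""
    current = dict(original)
    for node in order:
        current[node] = target[node]
        yield dict(current)


original = {"S1": ["D"], "S2": ["S1"]}  # S2 detours through S1
target = {"S1": ["S2"], "S2": ["D"]}    # S1 detours through S2

# Reverse topological order of the target: S2 (next to D) first, then S1.
safe = [has_loop(s) for s in intermediate_states(original, target, ["S2", "S1"])]
# The opposite order briefly leaves S1 -> S2 and S2 -> S1 installed together.
unsafe = [has_loop(s) for s in intermediate_states(original, target, ["S1", "S2"])]
print(safe, unsafe)
```

Updating S2 before S1 never exposes a loop, while updating S1 first creates a transient S1/S2 loop even though both the original and target configurations are loop-free.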

Theorem 4.3. Consider a flow to be blackholing if it crosses a down link using a given TSG configuration. Assuming that the original TSG configuration may contain blackholing flows, the target TSG configuration is blackhole-free, and the set of down links is unchanged, none of the intermediate TSG configurations causes a flow to blackhole if it was not blackholing in the original TSG configuration.

定理4.3. 給定TSG配置,若是某流途徑失效鏈路那麼該留認爲具備黑洞。假設原始TSG配置可能包含黑洞流,目標TSG配置沒有黑洞,而且是失效鏈路未發生變化,那麼若是流在原始TSG配置中沒有黑洞,那麼在任何中間TSG配置中也不會具備黑洞。

Proof. See the definition of resolved/unresolved vertex in the proof of Theorem 4.2. Assume a flow Si → D is blackholing at an intermediate step during the transition from the original to the target TSG. Assume that this flow was not blackholing in the original TSG. Therefore, at least one transit vertex for this flow must have been resolved, and the blackhole must happen on or after the first resolved vertex. However, since the resolved vertices do not forward traffic back to unresolved vertices, the blackhole can only happen in resolved vertices, which contradicts our assumption.

證實:參見定理4.2證實中已決斷和未決斷頂點的定義。假設流Si->D在由原始TSG到目標TSG轉變的中間步驟中具備黑洞。假設該流在原始TSG中不具備黑洞。那麼,該留的至少一箇中轉頂點是已決斷的,且黑洞發生在第一個已決斷頂點或其後。然而,因爲已決斷頂點不會轉發流回到未決斷頂點,黑洞只能出如今已決斷頂點,與咱們的假設相悖。

5 EFFICIENT SWITCH RULE MANAGEMENT (高效交換機規則管理)

Off-the-shelf switch chips impose hard limits on the size of each table. In this section, we show that FG matching and traffic hashing are two key drivers pushing us against the limits of switch forwarding rules to meet our availability and scalability goals. To overcome these limits, we partition our FG matching rules into two switch pipeline tables to support 60x more sites (§5.1). We further decouple the hierarchical TE splits into two-stage hashing across the switches in our two-stage Clos supernode (§5.2). Without this optimization, we find that our hierarchical TE would lose 6% throughput as a result of coarser traffic splits. 

商用交換芯片具備嚴格的表大小限制。本節,咱們指出FG匹配和流哈希是推進咱們打破交換機轉發規則限制以知足可用性和可擴展性目標的關鍵驅動力。爲了解決這些限制,咱們將FG匹配規則劃分到兩個交換機流水錶以支持60倍數量的站點(5.1節)。咱們進一步解耦層次化TE劃分到跨2階段Clos超級節點交換機的2階段哈希計算(5.2節)。沒有上述優化方案的情形下,咱們發現層次化TE將損失6%的吞吐量,由於粗粒度流劃分。

5.1 Hierarchical FG Matching (層次化FG匹配) 

We initially implemented FG matching using Access Control List (ACL) tables to leverage their generic wildcard matching capability. The number of FG match entries was bounded by the ACL table size: 

最初,咱們採用訪問控制列表(ACL)表實現FG匹配,以利用他們的通配能力。FG匹配項的數量受限於ACL表大小:

sizeACL ≥ numSites × numPrefixes/Site × numServiceClasses  

given the ACL table size limit (sizeACL = 3K), the number of supported service classes (numServiceClasses = 6, see Table 1) and the average number of aggregate IPv4 and IPv6 cluster prefixes per site (numPrefixes/Site ≥ 16), we anticipated hitting the ACL table limit with ∼ 32 sites.

給定ACL表大小限制(sizeACL = 3K),支持的服務等級數量(參見表1,numServiceClasses = 6)和每站點平均聚合IPv4和IPv6集羣前綴(numPrefixes/Site ≥ 16),在有約32個站點時,達到ACL表限制。
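The site limit follows directly from the inequality above. The short calculation below reproduces it, assuming that "3K" means 3 × 1024 entries (an assumption; the text only says 3K) and using the lower bound of 16 prefixes per site.

```python
SIZE_ACL = 3 * 1024          # assumption: "3K" read as 3072 entries
PREFIXES_PER_SITE = 16       # lower bound quoted in the text
NUM_SERVICE_CLASSES = 6


def max_sites_single_stage(size_acl, prefixes_per_site, service_classes):
    """Largest numSites that satisfies sizeACL >= sites * prefixes * classes."""
    return size_acl // (prefixes_per_site * service_classes)


print(max_sites_single_stage(SIZE_ACL, PREFIXES_PER_SITE, NUM_SERVICE_CLASSES))  # 32
```

Every site costs 16 × 6 = 96 ACL entries, so the table is exhausted at 32 sites.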

Hence, we partition FG matching into two hierarchical stages, as shown in Figure 8b. We first move the cluster prefix matches to the Longest Prefix Match (LPM) table, which is much larger than the ACL table, storing up to several hundreds of thousands of entries. Even though the LPM entries cannot directly match DSCP bits for service class support, LPM entries can match against a Virtual Routing Forwarding (VRF) label. Therefore, we match DSCP bits via the Virtual Forwarding Plane (VFP) table, which allows the matched packets to be associated with a VRF label to represent their service class before entering the LPM table in the switch pipeline. We find this two-stage, scalable approach can support up to 1,920 sites.

Figure 8: Decoupling FG matching into two stages.

所以,咱們將FG匹配劃分爲兩個層次化階段,如圖8b所示。咱們首先將集羣前綴匹配移動到最長前綴匹配(LPM)表,該表比ACL表大得多,能夠存儲多達幾十萬表項。儘管LPM表項沒法直接匹配DSCP比特以支持服務等級,LPM表項能夠匹配虛擬路由轉發(VRF)標籤。所以,咱們經過虛擬轉發平面(VFP)表匹配DSCP標記,使得匹配的數據包在進入交換機流水線的LPM表以前能夠關聯一個VRF標籤標示其相應的服務等級。咱們發現這種2階段可擴展的方法能夠支持多達1920個站點。

This optimization also enables other useful features. We run TE as the primary routing stack and BGP/ISIS routing as a backup. In the face of critical TE issues, we can disable TE at the source sites by temporarily falling back to BGP/ISIS routing. This means that we delete the TE forwarding rules at the switches in the ingress sites, so that the packets can fall back to matching lower-priority BGP/ISIS forwarding rules without encapsulation. However, disabling TE end-to-end only for traffic between a single source-destination pair is more challenging, as we must also match cluster-facing ingress ports. Otherwise, even though the source site will not encapsulate packets, unencapsulated packets can follow the BGP/ISIS routes and later be incorrectly encapsulated at a transit site where TE is still enabled towards the given destination. Adding ingress port matching was only feasible with the scalable, two-stage FG match.

這種優化還可使能其餘有用的特徵。咱們以TE做爲主要的路由棧,BGP/ISIS路由做爲備份。面對關鍵TE問題時,咱們能夠在源站點斷掉TE臨時回滾到BGP/ISIS路由。這代表在入口站點咱們刪除交換機中TE轉發規則,這樣數據包能夠回滾回匹配低優先級BGP/ISIS轉發規則的方式,而不須要封裝。然而,僅使單一源-目的對(端到端)之間的流的TE失效是具備挑戰的,由於咱們必須匹配面向集羣的入口端口。不然,即便源站點不封裝數據包,爲封裝的數據包能夠隨着BGP/ISIS路由,隨後在中轉站點(到給定目標的TE仍然有效)被不正確的封裝。增長入口端口匹配只有在使用可擴展的兩階段FG匹配時纔可行。

5.2 Efficient Traffic Hashing By Partitioning (使用劃分的高效流哈希)

With hierarchical TE, the source site is responsible for implementing TG, TSG and SSG splits. In the original design, we collapsed the hierarchical splits and implemented them on only ingress edge switches. However, we anticipated approaching the hard limits on switch ECMP table size: 

使用層次化TE,源站點負責實現TG、TSG和SSG劃分。原始設計中,咱們摺疊層次劃分,並僅在入口邊交換機實現。然而,咱們預計這會接近交換機ECMP表大小的硬限制:

sizeECMP ≥ numSites × numPathingClasses × numTGs × numTSGs × numSSGs  

where numPathingClasses = 3 is the number of aggregated service classes which share common pathing constraints, numTGs = 4 is the tunnel split granularity, numTSGs = 32 is the per-supernode split granularity, and numSSGs = 16 splits across 16 switches in Stargate backend stage. To support up to 33 sites, we would need 198K ECMP entries while our switches support up to only sizeECMP = 14K entries, after excluding 2K BGP/ISIS entries. We could down-quantize the traffic splits to avoid hitting the table limit. However, we find the benefit of TE would decline sharply due to the poor granularity of traffic split. 

這裏,numPathingClasses = 3表示聚合服務等級的數量(共享相同的選路限制),numTGs = 4表示隧道劃分粒度,numTSGs = 32表示每超級節點劃分粒度,and numSSGs = 16在Stargate後端階段跨交換機劃分。爲了支持33個站點,須要198K ECMP表項,然而在排除2K BGP/ISIS表項後咱們的交換機只支持 sizeECMP = 14K表項,咱們下量化流劃分以免達到表限制。然而,咱們發現TE的優點將急劇降低,由於較差的流劃分粒度。
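The 198K figure can be checked by multiplying out the factors in the inequality. The sketch below does so, assuming K = 1024 (an assumption, though it is the reading under which the product comes out at exactly 198K).

```python
def ecmp_entries(num_sites, pathing_classes=3, tgs=4, tsgs=32, ssgs=16):
    """ECMP entries needed when all hierarchical splits are collapsed."""
    return num_sites * pathing_classes * tgs * tsgs * ssgs


K = 1024                    # assumption: table sizes are quoted in units of 1024
needed = ecmp_entries(33)   # 33 * 3 * 4 * 32 * 16 = 202752
available = 14 * K          # 14K entries left after the 2K BGP/ISIS entries

print(needed // K)          # 198
print(needed / available)   # roughly 14x more entries than the switch offers
```

The collapsed design therefore overshoots the table by more than an order of magnitude, which is why the splits are partitioned across the two Clos stages instead of being down-quantized.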

We overcome per-switch table limitations by decoupling traffic splitting rules across two levels in the Clos fabric, as shown in Table 3. First, the edge switches decide which tunnel to use (TG split) and which site the ingress traffic should be forwarded to (TSG split part 1). To propagate the decision to the backend switches, we encapsulate packets into an IP address used to specify the selected tunnel (TG split) and mark with a special source MAC address used to represent the self-/next-site target (TSG split part 1). Based on the tunnel IP address and source MAC address, backend switches decide the peer supernode the packet should be forwarded to (TSG split part 2) and the egress edge switch which has connectivity toward the target supernode (SSG split). To further reduce the required splitting rules on ingress switches, we configured a link aggregation group (LAG) for each edge switch toward viable backend switches. For simplicity, we consider a backend switch viable if the switch itself is active and has active connections to every edge switch in the supernode.

咱們經過接口流劃分規則到跨兩級Clos設施克服每交換機限制(如表3所示)。首先,邊緣交換機決定使用哪一個隧道(TG劃分)以及入口流應該轉發到哪一個站點(TSG劃分第一部分)。爲了把決定傳播到後端交換機,咱們將數據封裝到用於指定選定隧道的IP地址(TG劃分)而且使用特定的MAC地址用來表示自站點/下一站點目標(TSG劃分部分1)。基於隧道IP和源MAC地址,後端交換機決定數據包應該轉發到的對等超級節點(TSG劃分部分2)以及與目標超級節點具備鏈接的出口邊緣交換機(SSG劃分)。爲了進一步下降入口交換機上的劃分規則,咱們爲每一個邊緣交換機到可見後端交換機配置了鏈路聚合組(LAG)。爲了簡化起見,咱們認爲後端交換機是可見的,若是該交換機自身是有效的而且到超級節點中的每一個邊緣交換機具備有效鏈接。

6 EVALUATION (評估)

In this section, we present our evaluation of the evolved B4 network. We find that our approaches successfully scale B4 to accommodate 100x more traffic in the past five years (§6.1) while concurrently satisfying our stringent availability targets in every service class (§6.2). We also evaluate our design choices, including the use of sidelinks in hierarchical topology, hierarchical TE architecture, and the two-stage hashing mechanism (§6.3) to understand their tradeoffs in achieving our requirements.

本節,咱們評估演進的B4網絡。咱們發現,在過去5年,咱們的方法使得B4成功的擴展到應用100倍流量(6.1節),同時知足嚴格的每一個服務等級的可用性目標(6.2節)。咱們評估了設計選擇,包括層次化拓撲中旁路的應用,層次化TE架構和兩階段哈希機制(6.3節),用於理解取得咱們要求的權衡。

6.1 Scale (擴展性)

Figure 9 demonstrates that B4 has seen significant growth across multiple dimensions. In particular, aggregate B4 traffic increased by two orders of magnitude in the last five years, as shown in Figure 9a. On average, B4 traffic has doubled every nine months since its inception. B4 has been delivering more traffic and growing faster than our Internet-facing WAN. A key growth driver is that Stargate subsumed the campus aggregation network (§3.2) and started offering huge amounts of campus bandwidth in 2015.

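Figure 9: B4 scaling metrics over five years. (a) Total traffic: total byte count on B4's cluster-facing ports. (b) Number of B4 domains and sites. (c) Number of FGs per site. (d) Number of TGs per site. (e) Number of transit tunnels per site.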

圖9代表B4在多個維度取得長足的發展。具體地,B4聚合流量在過去的5年增加了2個量級,如圖9a所示。平均來說,B4流量每9個月翻一番。B4正在傳輸比面向因特網的WAN更多的流量且增加速度更快。關鍵的增加驅動是Stargate歸入了園區聚合網絡(3.2節),而且自2015年開始提供大量的園區帶寬。
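The two growth statements (doubling every nine months, and two orders of magnitude in five years) are consistent with each other, as a one-line compounding calculation shows. The sketch assumes steady exponential growth over exactly 60 months, which is an idealization of the curve in Figure 9a.

```python
MONTHS = 5 * 12          # five years
DOUBLING_PERIOD = 9      # months per doubling, from the text

# Assumes steady exponential growth, an idealization of Figure 9a.
growth = 2 ** (MONTHS / DOUBLING_PERIOD)
print(round(growth))     # about 100x
```

Five years contain roughly 6.7 doubling periods, and 2 raised to that power is just over 100.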

Figure 9b shows the growth of B4 topology size as the number of sites and control domains. These two numbers matched in Saturn-based B4 (single control domain per site) and have started diverging since the deployment of JPOP fabrics in 2013. To address scalability challenges, we considerably reduced the site count by deprecating Saturn fabrics with Stargate in 2015. Ironically, this presented scaling challenges during the migration period because both Saturn and Stargate fabrics temporarily co-existed at a site. 

圖9b給出B4拓撲大小的增加,以站點數量和控制域數量計。在基於Saturn的B4中(每站點一個控制域),這兩個數字相同,並在2013年開始分離,這是由於JPOP開始部署。爲了解決擴展性挑戰,自2015年經過使用Stargate替代Saturn,咱們極大的下降了站點數量。諷刺的是,這裏給出遷移期間擴展性挑戰,這一期間Saturn和Stargate在同一站點共存

Figure 9c shows that the number of FGs per source site has increased by 6x in the last five years. In 2015, we stopped distributing management IP addresses for switch and supertrunk interfaces. These IP addresses cannot be aggregated with cluster prefixes, and removing these addresses helped reduce the number of FGs per site by ∼ 30%.

圖9c給出每一個源站點的FG數量在過去5年增加了6倍。2015年,咱們中止爲交換機和超級幹線端口分發管理IP地址。這些IP地址沒法聚合爲集羣前綴,移除這些地址幫助將每一個站點的FG數量減小約30%。

B4 supports per service class tunneling. The initial feature was rolled out with two service classes in the beginning of 2013 and resulted in doubling the total number of TGs per site as shown in Figure 9d. After that, the TG count continues to grow with newly deployed sites and more service classes. 

B4支持每服務等級隧道。2013開始,最初是使用兩個服務等級,結果是致使每站點TG數量翻番,如圖9d所示。隨後,TG數量技術隨着新部署站點和更多服務等級增加。

The maximum number of transit tunnels per site is controlled by the central controller’s configuration. This constraint helps avoid installing more switch rules than available but also limits the number of backup tunnels which are needed for fast blackhole recovery upon data plane failure. In 2016, improvements to switch rule management enabled us to install more backup tunnels and improve availability against unplanned node/link failure.

每站點最大中轉隧道數量由集中控制器配置控制。這一約束幫助咱們避免安裝比可用規則更多的交換機規則,也限制了備份隧道的數量(對於數據平面故障的快速黑洞恢復是必須的)。2016年,交換機規則管理的提高使咱們可以安裝更多的備份隧道,並提高非計劃節點/鏈路故障時的可用性。

6.2 Availability (可用性) 

We measure B4 availability at the granularity of FG via two methodologies combined together: 

咱們以FG的粒度度量B4的可用性,這一度量經過結合兩個方法:

Bandwidth-based availability is defined as the bandwidth fulfillment ratio given by Google Bandwidth Enforcer [25]:

基於帶寬的可用性:定義爲谷歌帶寬實施者[25]給定的帶寬知足率:

allocation / min{demand, approval}

 

where demand is estimated based on short-term historical usage, approval is the approved minimum bandwidth per SLO contract, and allocation is the current bandwidth admitted by bandwidth enforcement, subject to network capacity and fairness constraints. This metric alone is insufficient because of bandwidth enforcement reaction delay and the inaccuracy of bandwidth estimation (e.g., due to TCP backoff during congestion).

這裏,需求基於短時間歷史使用量度量,批准指每SLO合約許可的最小帶寬,分配是當前准入帶寬(受限於網絡容量和公平性約束)。該度量是不充分的,由於帶寬實施響應延遲和帶寬評估的不許去性(如,因爲擁塞期間的TCP回退)。 
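The fulfillment ratio is simple enough to state as a function. In the sketch below, the function name and the handling of a zero denominator are assumptions; the text only defines the ratio itself.

```python
def bandwidth_availability(allocation, demand, approval):
    """Bandwidth fulfillment ratio: allocation / min(demand, approval)."""
    entitled = min(demand, approval)
    if entitled <= 0:
        # Assumption: with nothing requested or approved, count as fulfilled.
        return 1.0
    return allocation / entitled


# Demand below the approved minimum: judged against the demand of 10.
print(bandwidth_availability(allocation=8, demand=10, approval=20))
# Demand above the approved minimum: judged against the approval of 10.
print(bandwidth_availability(allocation=8, demand=20, approval=10))
```

Taking the minimum of demand and approval means an FG is never penalized for bandwidth it did not ask for, nor credited beyond what its SLO contract guarantees.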

Connectivity-based availability complements the limitations of bandwidth fulfillment. A network measurement system is used to proactively send probes between all cluster pairs in each service class. The probing results are grouped into B4 site pairs using 1-minute buckets, and the availability for each bucket is derived as follows:

基於連接的可用性做爲帶寬知足限制的互補。網絡度量系統用於主動在全部集羣對間的每一個服務等級發送探測。探測結果按B4站點對分組,並使用1-分鐘桶,每一個桶的可用性按以下計算:

where α = 5% is a sensitivity threshold which filters out most of the transient losses while capturing the bigger loss events which affect our availability budget. Beyond the threshold, availability decreases exponentially with a decay factor β = 2% as the traffic loss rate increases. The rationale is that most inter-datacenter services run in a parallel worker model where blackholing the transmission of any worker can disproportionately affect service completion time. 

這裏α = 5%爲敏感閾值,濾除大部分瞬時丟失,並捕獲影響可用性的大的丟失事件。閾值之上,可用性隨着流量丟失率的增長以衰減因子β = 2%指數級減小。合理性解釋是大部分數據中心間服務運行於並行工做者模式,任何工做者的黑洞傳輸能夠不成比例的影響服務完成時間。

Availability is calculated by taking the minimum between these two metrics. Figure 10 compares the measured and target availability in each service class. We see that initially B4 achieved lower than three nines of availability at 90th percentile flow across several service classes. At this time, B4 was not qualified to deliver SC3/SC4 traffic other than probing packets used for availability measurement. However, we see a clear improvement trend in the latest state: 4-7x in SC1/SC2 and 10-19x in SC3/SC4. As a result, B4 successfully achieves our availability targets in every service class.

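Figure 10: Network unavailability. The y-axis shows measured and target unavailability (using monthly data before and after the evolution), where the thick bars show the 95th percentile (across site pairs) and the error bars show the 90th and 99th percentiles. The x-axis shows service classes ordered by target unavailability.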

可用性經過這兩個度量值的較小者計算。圖10比較了每一個服務等級的測量的可用性和目標可用性。咱們發現B4最開始在多個服務等級的90th百分比的流取得低於三個九的可用性。那時,B4不適用於傳輸SC3/SC4流量(除了用於可用性測量的探測包)。然而,咱們見證了清晰的提高趨勢:SC1/SC2的4-7倍提高和SC3/SC4的10-19倍提高。其結果是,B4的每一個服務等級均可以取得咱們的可用性目標。

6.3 Design Tradeoffs (設計權衡)

Topology abstraction: Figure 11 demonstrates the importance of sidelinks in hierarchical topology. Without sidelinks, the central controller relies on ECMP to uniformly split traffic across supernodes toward the next site in the tunnel. With sidelinks, the central controller uses TSGs to minimize the impact of capacity loss due to hierarchical topology by rebalancing traffic across sidelinks to match the asymmetric connectivity between supernodes. In the median case, abstract capacity loss reduces from 3.2% (without sidelinks) to 0.6% (with sidelinks+TSG). Moreover, we observe that the abstract capacity loss without sidelinks is 100% at the 83rd percentile because at least one supernode has lost connectivity toward the next site. Under such failure scenarios, the vast majority of capacity loss can be effectively alleviated with sidelinks.

Figure 11: Capacity loss due to capacity asymmetry in the hierarchical topology.

拓撲:圖11展現層次化拓撲中旁路的重要性。沒有旁路時,集中控制器依賴於ECMP以在多個通向隧道中下一站點的超級節點間均勻劃分流量。使用旁路,集中控制器使用TSG最小化容量損失(因爲層次化拓撲)的影響(經過從新均衡流量以匹配超級節點間的不對稱連通)。平均情形下,抽象容量損失由3.2%(沒有旁路)減小到0.6%(使用旁路和TSG)。此外,咱們發現不使用旁路在83rd百分比時容量損失達100%,這是由於至少一個超級節點失去到下一站點的連通。在這種故障情景,使用旁路能夠有效避免大部分容量損失。

Hierarchical TE with two-stage hashing: Figure 12 quantifies the tradeoff between throughput and runtime as a function of TE algorithm and traffic split granularity. We compare our new hierarchical TE approach with flat TE, which directly runs a TE optimization algorithm (§4.3 in [18]) on the supernode-level topology. To evaluate TE-delivered capacity subject to the traffic split granularity limit, we linearly scale the traffic usage of each FG measured in Jan 2018, feed the adjusted traffic usage as demand to both flat and hierarchical TE, and then feed the demands, topology and TE pathing to a bandwidth allocation algorithm (see §5 in [25]) for deriving the maximum achievable throughput based on the max-min fair demand allocations.

Figure 12: Hierarchical TE with two-stage hashing runs in 2-4 seconds and achieves high throughput while satisfying switch hashing limits.

2階段哈希的層次化TE:圖12量化吞吐量和運行時間的權衡(做爲TE算法和流量劃分粒度的函數)。咱們比較了新的層次化TE方法和扁平TE(直接在超級節點級拓撲上運行TE優化算法,參見[18]中4.3節)。爲了評估受限於劃分粒度限制的TE容量,咱們線性擴展每一個FG流量使用量(2018年1月測量),將調整後的流量使用量做爲需求灌輸到扁平和層次化TE,接着將需求、拓撲和TE選路灌輸到帶寬分配算法([25]中第5部分)以得到基於最大-最小公平需求分配的最大可得吞吐量。 

We make several observations from these results. First, we find that hierarchical TE takes only 2-3 seconds to run, 188 times faster than flat TE. TSG generation runtime is consistently a small fraction of the overall runtime, ranging from 200 to 700 milliseconds. Second, when the maximum traffic split granularity is set to 1, both hierarchical and flat TE perform poorly, achieving less than 91% and 46% of their maximum throughput. The reason is simple: Each FG can only take the shortest available path, and the lack of sufficient path diversity leaves many links under-utilized, especially for flat TE as we have exponentially more links in the supernode-level graph. Third, we see that with our original 8-way traffic splitting, hierarchical TE achieves only 94% of its maximum throughput. By moving to two-stage hashing, we come close to maximal throughput via 128-way TG and TSG split granularity while satisfying switch hashing limits. 

咱們由上述結果中得出以下結論。首先,咱們發現層次化TE的運行時間爲2-3秒,比扁平化TE快188倍。TSG生成運行時間是總運行時間的一小部分,從200到700毫秒。第二,當最大流劃分粒度設置爲1時,層次化和扁平TE性能較差,取得少於91%和46%的最大吞吐量。緣由很簡單:每一個FG只能使用最短可用路徑,且充分路徑多樣性的缺失致使不少鏈路沒法充分使用,特別是對扁平拓撲(由於在超級節點級圖中有指數級數量更多的鏈路)。第三,咱們發現使用原始的8-路流劃分,層次化TE取得94%的最大吞吐量。經過採用兩階段哈希,咱們使用128-路TG和TSG劃分粒度達到接近最大化吞吐量,而且知足交換機哈希限制。

TSG sequencing: We evaluate the tradeoffs with and without TSG sequencing using a one-month trace. In this experiment, we exclude 17% of TSG ops which update only one supernode. Figure 13a shows that TSG sequencing takes only one or two "steps" in > 99.7% of the TSG ops. Each step consists of one or multiple TSG updates where their relative order is not enforced. Figure 13b compares the end-to-end latency of TSG ops with and without sequencing. We observe a 2x latency increase at the 99.7th percentile, while the worst case increase is 2.63x. Moreover, we find that the runtime of the TSG sequencing algorithm is negligible relative to the programming latency. Figure 13c shows the capacity available at each intermediate state during TSG ops. Without TSG sequencing, available capacity drops to zero due to blackholing/looping in ∼ 3% of the cases, while this number is improved by an order of magnitude with sequencing. Figure 13d demonstrates that without sequencing, more than 38% of the ingress traffic would be discarded due to the forwarding loop formed in > 2% of the intermediate states during TSG ops.

TSG順序:咱們使用一個月的trace評估使用和不使用TSG序列的權衡。實驗中,咱們排除17%的只更新一個超級節點的TSG操做。圖13a顯示只採用1步或2步的TSG序列多達TSG操做的99.7%。每一個步驟只有一個或多個TSG更新(其相對順序沒有規定)。圖13b比較使用和不使用序列化時TSG操做的端到端延遲。咱們發現99.7th百分比時2倍的延遲增加,最壞情形下有2.63倍的延遲增加。此外,咱們發現TSG序列算法的運行時間與程序延遲負相關。圖13c給出TSG操做期間每一箇中間狀態的可用容量。不使用TSG序列化,因爲3%情形下的黑洞/環路,可用容量降至0,使用序列化時該數字提高了一個數量級。圖13d代表不使用序列化,多達38%的入口流量因大於2%的中間狀態造成環路而丟失。

7 OPERATIONAL EXPERIENCE AND OPEN PROBLEMS (運維經驗和開放問題) 

In this section, we summarize our experiences learned from production B4 network as well as several open problems that remain active areas of work. 

本節,咱們總結生產環境中B4網絡的經驗教訓和一些仍然活躍的開放問題。

Management workflow simplification: Before the evolution, we relied on ECMP to uniformly load-balance traffic across Saturn chassis at each stage (§3.1), and therefore the traffic was typically bottlenecked by the chassis which had the least amount of outgoing capacity. In this design, the admissible capacity of a Saturn site drops significantly in the presence of capacity asymmetry resulting from failure or disruptive network operation. Consequently, we had to manually account for the capacity loss due to capacity asymmetry for disruptive network operations in order to ensure the degraded site capacity still meets the traffic demand requirements. Jumpgate’s improved handling of asymmetry using sidelinks and TSGs has reduced the need for manual interaction, as the TE system can automatically use the asymmetric capacity effectively.

管理工做流簡化:演化前,咱們依賴ECMP在Saturn底板的每一個階段均衡數據流(節3.1),所以,數據流受限於具備最小出口容量的瓶頸底板。設計中,當故障或網絡操做致使的容量不對稱存在時,Saturn的准入流量顯著下降。所以,咱們須要人爲考慮容量不對稱致使的容量損失以保證下降的站點容量仍然知足流量需求。Jumpgate採用旁路和TSG提升不對稱處理,減小了人工干預需求,這是由於TE系統能夠自動高效使用不對稱容量。 

By virtue of Jumpgate’s multiple independent control domains per site (§3.2), we now restrict operations to modify one domain at a time to limit potential impact. To assess a change’s impact on network availability, we perform impact analysis accounting for the projected capacity change, potential network failures, and other maintenance coinciding with the time window of network operation. We tightly couple our software controller with the impact analysis tool to accurately account for potential abstraction capacity loss due to disruptive network operations. Depending on the results, a network operation request can be approved or rejected. 

憑藉Jumpgate每站點多個獨立控制域(節3.2),咱們限制操做只能每次修改一個域以限制可能的影響。爲了評估改變對網絡可用性的影響,咱們以預測的容量改變、可能的網絡故障和其餘網絡操做窗口內的維護來執行影響分析。咱們將軟件控制器和影響分析工具緊耦合以精確覈算可能的容量損失(打斷性網絡操做致使)。基於這些結果,網絡操做可能被許可或拒絕。

To minimize potential impact on availability, we develop a mechanism called "drain" to shift traffic away from certain network entities before a disruptive change in a safe manner which prevents traffic loss. With the scale of B4, it is impractical for network operators and Site Reliability Engineers to use command-line interface (CLI) to manage each domain controller. Consequently, drain operations are invoked by management tools which orchestrate network operations through management RPCs exposed by the controllers.

爲了最小化可用性影響,咱們開發了稱爲「drain」的機制在中斷性改變發生前安全地將流量轉移出特定的網絡實體,以防止流丟失。在B4的規模下,網絡操做者和站點可靠性工程師不太可能使用命令行接口(CLI)管理每一個域控制器。所以,drain操做由管理工具調用;管理工具經過控制器暴露的管理RPC編排網絡操做。

Sidelink capacity planning: Sidelinks form a full mesh topology among supernodes to account for WAN capacity asymmetry caused by physical failure, network operation, and striping inefficiency. Up to 16% of B4 site capacity is dedicated to sidelinks to ensure that our TSG algorithm can fully utilize WAN capacity against common failure patterns. However, determining the optimal sidelink capacity that should be deployed to meet bandwidth guarantees is a hard provisioning problem that relies on long-term demand forecasting, cost estimates and business case analysis [5]. We are actively working on a log-driven statistical analysis framework that will allow us to plan sidelink capacity while minimizing costs in order to meet our network availability requirements. 

旁路容量規劃:旁路在超級節點間造成mesh拓撲,應對物理故障、網絡操做和劃分不充分致使的WAN容量不對稱。多達16%的B4站點容量指定給旁路,保證TSG算法能夠充分利用WAN容量。然而,肯定最優的旁路容量(知足帶寬保證)是困難的供應問題,依賴於長期需求預測、成本評估和商業案例分析[5]。咱們正在積極開發日誌驅動的統計分析框架使得咱們能夠規劃旁路的容量,同時最小化成本以知足網絡可用性需求。

Imbalanced ingress traffic handling: Our TSG algorithm assumes balanced incoming traffic across supernodes in a site (§4.3). This assumption simplifies our design and, more importantly, allows us to meet our switch hardware requirements at scale: tunnels that share the same site-level link use a common set of TSG rules programmed in switch forwarding tables. Compared with per-tunnel TSG allocation, which requires > 1K TSG rules and exceeds our hardware rule limit, our TSG algorithm requires up to N TSG rules, where N ≈ 6 is the number of peering sites. However, it does not handle imbalance of ingress traffic. We plan to study alternative designs such as computing TSGs for each pair of adjacent site-level links. This design will handle ingress traffic imbalance based on capacity asymmetry observed in the upstream site-level link, while requiring a slightly higher number of TSG rules (N(N - 1) = 30) in switch hardware.

不均衡入口流哈希:TSG算法假設站點內超級節點間的輸入流量是均衡的(4.3節)。這一假設簡化了咱們的設計,更重要的是,這容許咱們知足必定規模下的交換機硬件需求(共享相同站點鏈路的隧道使用相同的TSG規則集合,這些集合編程到交換機轉發表)。與每隧道TSG分配相比(要求>1K TSG規則,超過硬件規則限制),TSG算法須要多大N TSG規則,這裏N ≈ 6指對等站點的數量。然而,它並不處理入口流的不均衡。咱們計劃研究可選設計,例如爲每對相鄰站點級鏈路計算TSG。這一設計將處理基於容量不對稱的入口流量不對稱(上游站點級鏈路中觀察到),同時交換機硬件中須要稍多數量的TSG規則(N(N - 1) = 30) 。
