Large-scale cluster management at Google with Borg (Sections 7-8)

[Editor's note] These final two sections cover related work and lessons learned. They show that the move from Borg to Kubernetes involved a great deal of reflection, and that this work is far from finished and is still ongoing. We hope everyone can take something away from Google's experience and share it.

7. Related work

Resource scheduling has been studied for decades, in contexts as varied as wide-area HPC supercomputing clusters, networks of workstations, and large-scale server clusters. We focus here on the most relevant area: large-scale server clusters.

Several recent studies have analyzed cluster traces from Yahoo, Google, and Facebook [20, 52, 63, 68, 70, 80, 82], illustrating the challenges of scale and heterogeneity inherent in modern data centers and workloads. [69] contains a taxonomy of cluster-management architectures.

Apache Mesos [45] splits resource management and application placement between a central resource manager (somewhat like the Borgmaster minus its scheduler) and multiple "frameworks" such as Hadoop [41] and Spark [73], using an offer-based mechanism. Borg instead centralizes these functions, using a request-based mechanism that scales quite well. DRF [29, 35, 36, 66] was initially developed for Mesos; Borg uses priorities and admission quotas instead. The Mesos developers have announced ambitions to support speculative resource assignment and reclamation, and to fix some of the issues identified in [69].

YARN [76] is a Hadoop-centric cluster manager. Each application has a manager that negotiates for resources with a central resource manager; this is much the same scheme that Google MapReduce jobs have used to obtain resources from Borg since around 2008. YARN's resource manager only recently became fault tolerant. A related open-source effort is the Hadoop Capacity Scheduler [42], which provides multi-tenant support with capacity guarantees, hierarchical queues, elastic sharing, and fairness. YARN has recently been extended to support multiple resource types, priorities, preemptions, and advanced admission control [21]. The Tetris research prototype [40] supports makespan-aware job packing.

Facebook's Tupperware [64] is a Borg-like system for scheduling cgroup containers; only a few details have been disclosed, but it appears to provide some form of resource reclamation. Twitter has open-sourced Aurora [5], a Borg-like scheduler for long-running services that runs on top of Mesos, with a configuration language and state machine similar to Borg's.

Microsoft's Autopilot [48] provides "automating software provisioning and deployment; system monitoring; and carrying out repair actions to deal with faulty software and hardware" for Microsoft clusters. The Borg ecosystem provides similar features, but space precludes discussing them here; Isard [48] outlines many best practices that we embrace as well.

Quincy [49] uses a network-flow model to provide fairness- and data-locality-aware scheduling for DAG-style data processing on clusters of a few hundred nodes. Borg uses quota and priorities to share resources among users on clusters of tens of thousands of machines. Quincy handles execution graphs directly, while this is built separately on top of Borg.

Cosmos [44] focuses on batch processing, with an emphasis on ensuring that users get fair access to the resources they have contributed to the cluster. It uses a per-job manager to acquire resources; few details are publicly available.

Microsoft's Apollo system [13] uses per-job schedulers for short-lived batch jobs to achieve high throughput on clusters that appear to be comparable in size to Borg cells. Apollo uses opportunistic execution of lower-priority background work to boost utilization, at the cost of delays that can stretch to multiple days. Apollo nodes provide a prediction matrix of starting times as a function of two resource dimensions; the schedulers combine this with estimates of startup costs and remote-data-access costs to decide where to place work, modulated by random delays to reduce collisions. Borg uses a central scheduler for placement decisions based on state about prior allocations, can handle more resource dimensions, and focuses on the needs of high-availability, long-running applications; Apollo can probably handle a higher rate of task arrivals.

Alibaba's Fuxi (伏羲) [84] supports data-analysis workloads and has been running since 2009. Like the Borgmaster, a central FuxiMaster (replicated for fault tolerance) gathers resource-availability information from the nodes, accepts resource requests from applications, and matches one to the other. Fuxi's incremental scheduling policy is the reverse of Borg's: Fuxi matches newly available resources against the backlog of queued requests. Like Mesos, Fuxi allows "virtual resource" types to be defined. Only results for synthetic workloads have been published.

Omega [69] supports multiple parallel, specialized "verticals" that are each roughly equivalent to a Borgmaster minus its persistent store and link shards. Omega schedulers use optimistic concurrency control to manipulate a shared representation of desired and observed cell state held in a central persistent store, which is synced to and from the Borglets by a separate link component. The Omega architecture was designed to support multiple distinct workloads that each have their own application-specific RPC interface, state machines, and scheduling policies (e.g., long-running servers, batch jobs from multiple frameworks, storage infrastructure, virtual machines from GCE). In contrast, Borg offers a "one size fits all" RPC interface, state-machine semantics, and scheduling policy, which have grown in size and complexity over time as more and more disparate workloads had to be supported, while scalability has not (yet) been a problem (§3.4).

Google's open-source Kubernetes system [53] places applications in Docker containers [28] and distributes them across multiple machines. It runs both on bare metal (like Borg) and on hosts provided by cloud providers such as GCE. It is under active development by many of the same engineers who built Borg. Google also offers a hosted version called Google Container Engine [39]. We discuss how lessons from Borg are being applied to Kubernetes in the next section.

The high-performance computing community has a long tradition of work in this area (e.g., Maui, Moab, Platform LSF [2, 47, 50]), but the requirements of scale, workload, and fault tolerance there are quite different from those of Google's cells. In general, such systems achieve very high utilization by keeping large backlogs (queues) of pending work.

Virtualization providers such as VMware [77] and datacenter solution providers such as HP and IBM [46] offer cluster-management solutions that typically scale to around 1,000 machines. In addition, several research groups have prototyped systems that improve the quality of scheduling decisions in various ways (e.g., [25, 40, 72, 74]).

Finally, as we have pointed out, another important part of managing large-scale clusters is automation and "operator scale-out." [43] describes how planning for failures, multi-tenancy, health checking, admission control, and restartability are necessary to achieve a high ratio of machines per operator. Borg's design philosophy is the same, and it lets a single SRE support more than ten thousand machines.

8. Lessons and future work

In this section we recount some of the qualitative lessons we have learned from operating Borg in production over the past decade, and describe how these observations have shaped the design of Kubernetes [53].

8.1 Lessons learned

We begin with a few Borg features that have drawn complaints, and describe how Kubernetes handles them differently.

**Jobs are restrictive as the only grouping mechanism for tasks.** Borg has no first-class way to manage an entire multi-job service as a single entity, or to refer to related instances of a service (e.g., canary and production tracks). As a hack, users encode their service topology in the job name and build higher-level tools that parse these names. The other side of this problem is that there is no way to refer to an arbitrary subset of a job, which leads to rigid semantics and problems such as being unable to roll out upgrades or resize a job flexibly.

To avoid these difficulties, Kubernetes drops the job concept and instead organizes its scheduling units (pods) using labels: arbitrary key/value pairs that users can attach to any object in the system. The equivalent of a Borg job can be obtained by attaching a job:jobname label to a set of pods, but any other useful grouping can be expressed too, such as the service, the tier, or the release type (production, staging, test). Kubernetes selects the objects an operation applies to by means of label selectors, which is far more flexible than a single rigid job grouping.
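
As a rough sketch (plain Python, not the real Kubernetes API; the pod data and field names here are invented for illustration), label-based grouping and selection boils down to something like this:

```python
# A hypothetical in-memory view of pods, each carrying arbitrary key/value labels.
pods = [
    {"name": "web-0", "labels": {"job": "frontend", "tier": "web", "track": "production"}},
    {"name": "web-1", "labels": {"job": "frontend", "tier": "web", "track": "canary"}},
    {"name": "db-0",  "labels": {"job": "storage",  "tier": "db",  "track": "production"}},
]

def select(objects, selector):
    """Return the objects whose labels contain every key/value pair in the selector."""
    return [o for o in objects
            if all(o["labels"].get(k) == v for k, v in selector.items())]

# Equivalent of a Borg job: everything labelled job=frontend.
print([p["name"] for p in select(pods, {"job": "frontend"})])   # ['web-0', 'web-1']
# A grouping Borg cannot name directly: only the canary track.
print([p["name"] for p in select(pods, {"track": "canary"})])   # ['web-1']
```

The same selection step is what a rolling upgrade or a resize would use to name exactly the subset of pods it should act on.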

**One IP address per machine complicates things.** In Borg, all tasks on a machine use the single IP address of their host and therefore share the host's port space. This causes a number of difficulties: Borg has to schedule ports as a resource; tasks must declare in advance how many ports they need and find out at startup which ones they have been given; the Borglet has to enforce port isolation; and the naming and RPC systems have to handle ports as well as IP addresses.

Thanks to Linux namespaces, VMs, IPv6, and software-defined networking (SDN), Kubernetes can take a much more user-friendly approach that eliminates these complications: every pod and service gets its own IP address, so developers can choose their own ports rather than having the infrastructure choose for them, and the infrastructure is freed from the complexity of managing ports.
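
A toy comparison (hypothetical data structures, not code from either system) of why a shared host IP forces the scheduler to treat ports as a schedulable resource, and how per-pod IPs make that dimension disappear:

```python
# Borg-style: one IP per machine, so the scheduler must track each machine's
# port space and hand concrete ports to every task it places.
machine_ports_in_use = {"m1": {10000, 10001}, "m2": set()}

def place_with_shared_ip(ports_needed, machines, port_range=range(10000, 10100)):
    for machine, used in machines.items():
        free = [p for p in port_range if p not in used]
        if len(free) >= ports_needed:
            assigned = free[:ports_needed]
            used.update(assigned)
            return machine, assigned   # the task must adapt to these ports
    return None                        # no machine has enough free ports

# Kubernetes-style: each pod gets its own IP, so the port dimension vanishes
# and the pod simply listens on whatever ports its developer chose.
def place_with_ip_per_pod(machines):
    return next(iter(machines))

print(place_with_shared_ip(2, machine_ports_in_use))  # e.g. ('m1', [10002, 10003])
print(place_with_ip_per_pod(machine_ports_in_use))    # 'm1'
```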

**Optimizing for power users at the expense of casual ones.** Borg provides a large set of features aimed at "power users" so they can fine-tune how their programs run (the BCL offers about 230 parameters): the initial focus was on supporting Google's largest resource consumers, for whom efficiency gains bring the largest payoff. Unfortunately, the richness of this API makes things harder for "casual" users and constrains its evolution. Our solution has been to build automation tools and services on top of Borg that determine sensible settings from experimentation. This benefits from the freedom to experiment that failure-tolerant applications give us: even if the automation gets something wrong, it is a nuisance, not a disaster.

8.2 Lessons learned well

On the other hand, a number of Borg's design decisions have proved remarkably beneficial and have stood the test of time.

**Allocs are useful.** The Borg alloc abstraction gave rise to the widely used logsaver pattern (§2.4) and to another popular pattern in which a web server is paired with a helper that periodically loads fresh data for it. Allocs and packages allow such helper services to be developed by separate teams. The Kubernetes equivalent of an alloc is the pod: a resource envelope for one or more containers that are always scheduled onto the same machine and can share resources. Kubernetes uses helper containers in the same pod instead of tasks in an alloc, but the idea is the same.
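
A highly simplified, hypothetical picture of a pod (field names invented here, not the real Kubernetes object schema) shows how a main container and independently developed helpers share one resource envelope:

```python
# One resource envelope, several cooperating containers, always co-scheduled
# onto the same machine. This mirrors a Borg alloc holding several tasks.
web_pod = {
    "name": "frontend-7",
    "resources": {"cpu_millicores": 1000, "memory_mib": 2048},   # shared envelope
    "containers": [
        {"name": "web-server",  "image": "example/frontend:1.4"},
        # Helper built and released by a separate team: the logsaver pattern.
        {"name": "logsaver",    "image": "example/logsaver:2.0"},
        # Another common helper: periodically refreshes the data the server reads.
        {"name": "data-loader", "image": "example/data-loader:0.9"},
    ],
    "shared_volumes": ["logs", "data"],   # how the containers exchange files
}
```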

**Cluster management is more than task management.** Although Borg's primary role is to manage the lifecycles of tasks and machines, the applications that run on Borg benefit from many other cluster services, such as naming and load balancing. Kubernetes supports naming and load balancing through its service abstraction: a service has a name and a dynamic set of pods defined by a label selector. Under the covers, Kubernetes automatically load-balances connections among the pods that belong to the service, and keeps tracking them so that the balancing continues to work as pods fail and get rescheduled onto other machines.
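
Continuing the hypothetical sketch from above (again invented names, not the real API), a service is little more than a name, a label selector, and a balancing step over whichever pods currently match:

```python
import itertools

def make_service(name, selector, list_pods):
    """list_pods() returns the current pods; it is re-evaluated on every call,
    so the service keeps up as pods are rescheduled onto other machines."""
    rr = itertools.count()   # simple round-robin balancing

    def resolve():
        backends = [p for p in list_pods()
                    if all(p["labels"].get(k) == v for k, v in selector.items())]
        if not backends:
            raise RuntimeError(f"no backends behind service {name!r}")
        return backends[next(rr) % len(backends)]

    return resolve

pods = [
    {"name": "web-0", "ip": "10.1.0.4", "labels": {"job": "frontend"}},
    {"name": "web-1", "ip": "10.1.2.7", "labels": {"job": "frontend"}},
]
frontend = make_service("frontend", {"job": "frontend"}, lambda: pods)
print(frontend()["ip"], frontend()["ip"])   # alternates between the two pod IPs
```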

**Introspection is vital.** Although Borg almost always "just works," when something goes wrong, finding the root cause can be very challenging. A key design decision in Borg was to expose debugging information to all users rather than hide it: Borg has thousands of users, so "self-help" has to be the first step in debugging. Although this makes it harder for us to deprecate internal behaviors that users have come to depend on, it has been a success, and we have found no realistic alternative. To manage the enormous volume of data, we provide several levels of UI and debugging tools, so users can drill down into the error logs and event details of both the infrastructure itself and their applications.

Kubernetes aims to reproduce many of Borg's introspection techniques. For example, it ships with tools such as cAdvisor [15] for resource monitoring, and uses Elasticsearch/Kibana [30] and Fluentd [32] for log aggregation. A snapshot of an object's state can be fetched from the master. Kubernetes also has a unified event-recording mechanism that all components can use (e.g., a pod being scheduled, a container failing) and that clients can query.
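
A minimal sketch of such a shared event mechanism, assuming an invented Event structure (this is not the actual Kubernetes events API):

```python
import time
from dataclasses import dataclass, field

@dataclass
class Event:
    obj: str          # object the event is about, e.g. "pod/frontend-7"
    reason: str       # short machine-readable cause, e.g. "Scheduled"
    message: str      # human-readable detail
    timestamp: float = field(default_factory=time.time)

EVENTS: list[Event] = []   # any component appends; any client can read

def record(obj: str, reason: str, message: str) -> None:
    EVENTS.append(Event(obj, reason, message))

def events_for(obj: str) -> list[Event]:
    return [e for e in EVENTS if e.obj == obj]

record("pod/frontend-7", "Scheduled", "assigned to machine m42")
record("pod/frontend-7", "ContainerFailed", "web-server exited with code 137")
for e in events_for("pod/frontend-7"):
    print(e.reason, "-", e.message)
```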

**The master is the kernel of a distributed system.** The Borgmaster was originally designed as a monolithic system, but over time it became the kernel of an ecosystem of services and user jobs. For example, we split off the scheduler and the primary UI (Sigma) into separate processes, and added services for admission control, vertical and horizontal auto-scaling, re-packing tasks, periodic job submission (cron), workflow management, and archiving system actions for off-line querying. Together, these have allowed us to scale up the workload and the feature set without sacrificing performance or maintainability.

The Kubernetes architecture goes further: it has an API server at its core that is responsible only for processing requests and manipulating the underlying state objects. The cluster-management logic is built as small, composable micro-service clients of this API server, such as the replication controller, which maintains the desired number of pod replicas in the face of failures, and the node controller, which manages the machine lifecycle.
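
As a rough sketch of that controller pattern (the api_server object and its method names are assumptions made up for illustration, not the real Kubernetes client library), each controller is just a loop that nudges observed state toward desired state:

```python
import time

def replication_controller(api_server, selector, desired_replicas, pod_template):
    """Keep the number of pods matching `selector` equal to `desired_replicas`."""
    while True:
        pods = api_server.list_pods(selector)            # observe actual state
        if len(pods) < desired_replicas:                 # too few: create more
            for _ in range(desired_replicas - len(pods)):
                api_server.create_pod(pod_template)
        elif len(pods) > desired_replicas:               # too many: delete extras
            for pod in pods[desired_replicas:]:
                api_server.delete_pod(pod["name"])
        time.sleep(5)                                    # then reconcile again
```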

8.3 Conclusion

Virtually all of Google's cluster workloads have moved to Borg over the past decade. We continue to evolve it, and we have applied the lessons we learned from it to Kubernetes.

Acknowledgements

The authors of this paper performed the evaluations and wrote it, but the dozens of engineers who designed, implemented, and maintained Borg's components and ecosystem are the key to its success. Here we list only those who played the most direct roles in designing, implementing, and operating the Borgmaster and Borglets. Our apologies if we missed anybody.

The initial Borgmaster was primarily designed and implemented by Jeremy Dion and Mark Vandevoorde, together with Ben Smith, Ken Ashcraft, Maricia Scott, Ming-Yee Iu, and Monika Henzinger. The initial Borglet was primarily designed and implemented by Paul Menage.

Other contributors include Abhishek Rai, Abhishek Verma, Andy Zheng, Ashwin Kumar, Beng-Hong Lim, Bin Zhang, Bolu Szewczyk, Brian Budge, Brian Grant, Brian Wickman, Chengdu Huang, Cynthia Wong, Daniel Smith, Dave Bort, David Oppenheimer, David Wall, Dawn Chen, Eric Haugen, Eric Tune, Ethan Solomita, Gaurav Dhiman, Geeta Chaudhry, Greg Roelofs, Grzegorz Czajkowski, James Eady, Jarek Kusmierek, Jaroslaw Przybylowicz, Jason Hickey, Javier Kohen, Jeremy Lau, Jerzy Szczepkowski, John Wilkes, Jonathan Wilson, Joso Eterovic, Jutta Degener, Kai Backman, Kamil Yurtsever, Kenji Kaneda, Kevan Miller, Kurt Steinkraus, Leo Landa, Liza Fireman, Madhukar Korupolu, Mark Logan, Markus Gutschke, Matt Sparks, Maya Haridasan, Michael Abd-El-Malek, Michael Kenniston, Mukesh Kumar, Nate Calvin, Onufry Wojtaszczyk, Patrick Johnson, Pedro Valenzuela, Piotr Witusowski, Praveen Kallakuri, Rafal Sokolowski, Richard Gooch, Rishi Gosalia, Rob Radez, Robert Hagmann, Robert Jardine, Robert Kennedy, Rohit Jnagal, Roy Bryant, Rune Dahl, Scott Garriss, Scott Johnson, Sean Howarth, Sheena Madan, Smeeta Jalan, Stan Chesnutt, Temo Arobelidze, Tim Hockin, Todd Wang, Tomasz Blaszczyk, Tomasz Wozniak, Tomek Zielonka, Victor Marmol, Vish Kannan, Vrigo Gokhale, Walfredo Cirne, Walt Drummond, Weiran Liu, Xiaopan Zhang, Xiao Zhang, Ye Zhao, and Zohaib Maya.

The Borg SRE team has also been crucial to Borg's success, and includes Adam Rogoyski, Alex Milivojevic, Anil Das, Cody Smith, Cooper Bethea, Folke Behrens, Matt Liggett, James Sanford, John Millikin, Matt Brown, Miki Habryn, Peter Dahl, Robert van Gent, Seppi Wilhelmi, Seth Hettich, Torsten Marek, and Viraj Alankar. The Borg configuration language (BCL) and borgcfg tool were originally developed by Marcel van Lohuizen and Robert Griesemer.

Thanks to our reviewers (especially Eric Brewer, Malte Schwarzkopf, and Tom Rodeheffer), and to our shepherd, Christos Kozyrakis, for their feedback on this paper.

## References

[1] O. A. Abdul-Rahman and K. Aida. Towards understanding the usage behavior of Google cloud users: the mice and elephants phenomenon. In Proc. IEEE Int’l Conf. on Cloud Computing Technology and Science (CloudCom), pages 272–277, Singapore, Dec. 2014.

[2] Adaptive Computing Enterprises Inc., Provo, UT. Maui Scheduler Administrator's Guide, 3.2 edition, 2011.

[3] T. Akidau, A. Balikov, K. Bekiroğlu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle. MillWheel: fault-tolerant stream processing at internet scale. In Proc. Int'l Conf. on Very Large Data Bases (VLDB), pages 734–746, Riva del Garda, Italy, Aug. 2013.

[4] Y. Amir, B. Awerbuch, A. Barak, R. S. Borgstrom, and A. Keren. An opportunity cost approach for job assignment in a scalable computing cluster. IEEE Trans. Parallel Distrib.Syst., 11(7):760–768, July 2000.

[5] Apache Aurora. http://aurora.incubator.apache.org/, 2014.

[6] Aurora Configuration Tutorial. https://aurora.incubator.apache.org/documentation/latest/configuration-tutorial/, 2014.

[7] AWS. Amazon Web Services VM Instances. http://aws.amazon.com/ec2/instance-types/, 2014.

[8] J. Baker, C. Bond, J. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing scalable, highly available storage for interactive services. In Proc. Conference on Innovative Data Systems Research (CIDR), pages 223–234, Asilomar, CA, USA, Jan. 2011.

[9] M. Baker and J. Ousterhout. Availability in the Sprite distributed file system. Operating Systems Review,25(2):95–98, Apr. 1991.

[10] L. A. Barroso, J. Clidaras, and U. Hölzle. The datacenter as a computer: an introduction to the design of warehouse-scale machines. Morgan Claypool Publishers, 2nd edition, 2013.

[11] L. A. Barroso, J. Dean, and U. Hölzle. Web search for a planet: the Google cluster architecture. In IEEE Micro, pages 22–28, 2003.

[12] I. Bokharouss. GCL Viewer: a study in improving the understanding of GCL programs. Technical report, Eindhoven Univ. of Technology, 2008. MS thesis.

[13] E. Boutin, J. Ekanayake, W. Lin, B. Shi, J. Zhou, Z. Qian, M. Wu, and L. Zhou. Apollo: scalable and coordinated scheduling for cloud-scale computing. In Proc. USENIX Symp. on Operating Systems Design and Implementation (OSDI), Oct. 2014.

[14] M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In Proc. USENIX Symp. on Operating Systems Design and Implementation (OSDI), pages 335–350,Seattle, WA, USA, 2006.

[15] cAdvisor. https://github.com/google/cadvisor, 2014.

[16] CFS per-entity load patches. http://lwn.net/Articles/531853, 2013.

[17] cgroups. http://en.wikipedia.org/wiki/Cgroups, 2014.

[18] C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, R. Bradshaw, and N. Weizenbaum. FlumeJava: easy, efficient data-parallel pipelines. In Proc. ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), pages 363–375, Toronto, Ontario, Canada, 2010.

[19] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: a distributed storage system for structured data. ACM Trans. on Computer Systems, 26(2):4:1–4:26, June 2008.

[20] Y. Chen, S. Alspaugh, and R. H. Katz. Design insights for MapReduce from diverse production workloads. Technical Report UCB/EECS–2012–17, UC Berkeley, Jan. 2012.

[21] C. Curino, D. E. Difallah, C. Douglas, S. Krishnan, R. Ramakrishnan, and S. Rao. Reservation-based scheduling: if you’re late don’t blame us! In Proc. ACM Symp. on Cloud Computing (SoCC), pages 2:1–2:14, Seattle, WA, USA, 2014.

[22] J. Dean and L. A. Barroso. The tail at scale. Communications of the ACM, 56(2):74–80, Feb. 2013.

[23] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[24] C. Delimitrou and C. Kozyrakis. Paragon: QoS-aware scheduling for heterogeneous datacenters. In Proc. Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Mar. 2013.

[25] C. Delimitrou and C. Kozyrakis. Quasar: resource-efficient and QoS-aware cluster management. In Proc. Int’l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 127–144, Salt Lake City, UT, USA, 2014.

[26] S. Di, D. Kondo, and W. Cirne. Characterization and comparison of cloud versus Grid workloads. In International Conference on Cluster Computing (IEEE CLUSTER), pages 230–238, Beijing, China, Sept. 2012.

[27] S. Di, D. Kondo, and C. Franck. Characterizing cloud applications on a Google data center. In Proc. Int’l Conf. on Parallel Processing (ICPP), Lyon, France, Oct. 2013.

[28] Docker Project. https://www.docker.io/, 2014.

[29] D. Dolev, D. G. Feitelson, J. Y. Halpern, R. Kupferman, and N. Linial. No justified complaints: on fair sharing of multiple resources. In Proc. Innovations in Theoretical Computer Science (ITCS), pages 68–75, Cambridge, MA, USA, 2012.

[30] ElasticSearch. http://www.elasticsearch.org, 2014.

[31] D. G. Feitelson. Workload Modeling for Computer Systems Performance Evaluation. Cambridge University Press, 2014.

[32] Fluentd. http://www.fluentd.org/, 2014.

[33] GCE. Google Compute Engine. http://cloud.google.com/products/compute-engine/, 2014.

[34] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In Proc. ACM Symp. on Operating Systems Principles (SOSP), pages 29–43, Bolton Landing, NY, USA, 2003. ACM.

[35] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant Resource Fairness: fair allocation of multiple resource types. In Proc. USENIX Symp. on Networked Systems Design and Implementation (NSDI), pages 323–326, 2011.

[36] A. Ghodsi, M. Zaharia, S. Shenker, and I. Stoica. Choosy: max-min fair sharing for datacenter jobs with constraints. In Proc. European Conf. on Computer Systems (EuroSys), pages 365–378, Prague, Czech Republic, 2013.

[37] D. Gmach, J. Rolia, and L. Cherkasova. Selling T-shirts and time shares in the cloud. In Proc. IEEE/ACM Int’l Symp. on Cluster, Cloud and Grid Computing (CCGrid), pages 539–546, Ottawa, Canada, 2012.

[38] Google App Engine. http://cloud.google.com/AppEngine, 2014.

[39] Google Container Engine (GKE). https://cloud.google.com/container-engine/, 2015.

[40] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella. Multi-resource packing for cluster schedulers. In Proc. ACM SIGCOMM, Aug. 2014.

[41] Apache Hadoop Project. http://hadoop.apache.org/, 2009.

[42] Hadoop MapReduce Next Generation – Capacity Scheduler. http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html, 2013.

[43] J. Hamilton. On designing and deploying internet-scale services. In Proc. Large Installation System Administration Conf. (LISA), pages 231–242, Dallas, TX, USA, Nov. 2007.

[44] P. Helland. Cosmos: big data and big challenges. http://research.microsoft.com/en-us/events/fs2011/helland_cosmos_big_data_and_big_challenges.pdf, 2011.

[45] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: a platform for fine-grained resource sharing in the data center. In Proc. USENIX Symp. on Networked Systems Design and Implementation (NSDI), 2011.

[46] IBM Platform Computing. http://www-03.ibm.com/systems/technicalcomputing/platformcomputing/products/clustermanager/index.html.

[47] S. Iqbal, R. Gupta, and Y.-C. Fang. Planning considerations for job scheduling in HPC clusters. Dell Power Solutions, Feb. 2005.

[48] M. Isard. Autopilot: Automatic data center management. ACM SIGOPS Operating Systems Review, 41(2), 2007.

[49] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: fair scheduling for distributed computing clusters. In Proc. ACM Symp. on Operating Systems Principles (SOSP), 2009.

[50] D. B. Jackson, Q. Snell, and M. J. Clement. Core algorithms of the Maui scheduler. In Proc. Int’l Workshop on Job Scheduling Strategies for Parallel Processing, pages 87–102. Springer-Verlag, 2001.

[51] M. Kambadur, T. Moseley, R. Hank, and M. A. Kim. Measuring interference between live datacenter applications. In Proc. Int’l Conf. for High Performance Computing, Networking, Storage and Analysis (SC), Salt Lake City, UT, Nov. 2012.

[52] S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan. An analysis of traces from a production MapReduce cluster. In Proc. IEEE/ACM Int’l Symp. on Cluster, Cloud and Grid Computing (CCGrid), pages 94–103, 2010.

[53] Kubernetes. http://kubernetes.io, Aug. 2014.

[54] Kernel Based Virtual Machine. http://www.linux-kvm.org.

[55] L. Lamport. The part-time parliament. ACM Trans. on Computer Systems, 16(2):133–169, May 1998.

[56] J. Leverich and C. Kozyrakis. Reconciling high server utilization and sub-millisecond quality-of-service. In Proc. European Conf. on Computer Systems (EuroSys), page 4, 2014.

[57] Z. Liu and S. Cho. Characterizing machines and workloads on a Google cluster. In Proc. Int’l Workshop on Scheduling and Resource Management for Parallel and Distributed Systems (SRMPDS), Pittsburgh, PA, USA, Sept. 2012.

[58] Google LMCTFY project (let me contain that for you). http://github.com/google/lmctfy, 2014.

[59] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proc. ACM SIGMOD Conference, pages 135–146, Indianapolis, IN, USA, 2010.

[60] J. Mars, L. Tang, R. Hundt, K. Skadron, and M. L. Soffa. Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations. In Proc. Int’l Symp. on Microarchitecture (Micro), Porto Alegre, Brazil, 2011.

[61] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: interactive analysis of web-scale datasets. In Proc. Int’l Conf. on Very Large Data Bases (VLDB), pages 330–339, Singapore, Sept. 2010.

[62] P. Menage. Linux control groups. http://www.kernel. org/doc/Documentation/cgroups/cgroups.txt, 2007–2014.

[63] A. K. Mishra, J. L. Hellerstein, W. Cirne, and C. R. Das. Towards characterizing cloud backend workloads: insights from Google compute clusters. ACM SIGMETRICS Performance Evaluation Review, 37:34–41, Mar. 2010.

[64] A. Narayanan. Tupperware: containerized deployment at Facebook. http://www.slideshare.net/dotCloud/ tupperware-containerized-deployment-at-facebook, June 2014.

[65] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. Sparrow: distributed, low latency scheduling. In Proc. ACM Symp. on Operating Systems Principles (SOSP), pages 69–84, Farminton, PA, USA, 2013.

[66] D. C. Parkes, A. D. Procaccia, and N. Shah. Beyond Dominant Resource Fairness: extensions, limitations, and indivisibilities. In Proc. Electronic Commerce, pages 808–825, Valencia, Spain, 2012.

[67] Protocol buffers. https://developers.google.com/protocol-buffers/ and https://github.com/google/protobuf/, 2014.

[68] C. Reiss, A. Tumanov, G. Ganger, R. Katz, and M. Kozuch. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proc. ACM Symp. on Cloud Computing (SoCC), San Jose, CA, USA, Oct. 2012.

[69] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. Omega: flexible, scalable schedulers for large compute clusters. In Proc. European Conf. on Computer Systems (EuroSys), Prague, Czech Republic, 2013.

[70] B. Sharma, V. Chudnovsky, J. L. Hellerstein, R. Rifaat, and C. R. Das. Modeling and synthesizing task placement constraints in Google compute clusters. In Proc. ACM Symp. on Cloud Computing (SoCC), pages 3:1–3:14, Cascais, Portugal, Oct. 2011.

[71] E. Shmueli and D. G. Feitelson. On simulation and design of parallel-systems schedulers: are we doing the right thing? IEEE Trans. on Parallel and Distributed Systems, 20(7):983–996, July 2009.

[72] A. Singh, M. Korupolu, and D. Mohapatra. Server-storage virtualization: integration and load balancing in data centers. In Proc. Int’l Conf. for High Performance Computing, Networking, Storage and Analysis (SC), pages 53:1–53:12, Austin, TX, USA, 2008.

[73] Apache Spark Project. http://spark.apache.org/, 2014.

[74] A. Tumanov, J. Cipar, M. A. Kozuch, and G. R. Ganger. Alsched: algebraic scheduling of mixed workloads in heterogeneous clouds. In Proc. ACM Symp. on Cloud Computing (SoCC), San Jose, CA, USA, Oct. 2012.

[75] P. Turner, B. Rao, and N. Rao. CPU bandwidth control for CFS. In Proc. Linux Symposium, pages 245–254, July 2010.

[76] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O’Malley, S. Radia, B. Reed, and E. Baldeschwieler. Apache Hadoop YARN: Yet Another Resource Negotiator. In Proc. ACM Symp. on Cloud Computing (SoCC), Santa Clara, CA, USA, 2013.

[77] VMware VCloud Suite. http://www.vmware.com/products/vcloud-suite/.

[78] A. Verma, M. Korupolu, and J. Wilkes. Evaluating job packing in warehouse-scale computing. In IEEE Cluster, pages 48–56, Madrid, Spain, Sept. 2014.

[79] W. Whitt. Open and closed models for networks of queues. AT&T Bell Labs Technical Journal, 63(9), Nov. 1984.

[80] J. Wilkes. More Google cluster data. http://googleresearch.blogspot.com/2011/11/more-google-cluster-data.html, Nov. 2011.

[81] Y. Zhai, X. Zhang, S. Eranian, L. Tang, and J. Mars. HaPPy: Hyperthread-aware power profiling dynamically. In Proc. USENIX Annual Technical Conf. (USENIX ATC), pages 211–217, Philadelphia, PA, USA, June 2014. USENIX Association.

[82] Q. Zhang, J. Hellerstein, and R. Boutaba. Characterizing task usage shapes in Google’s compute clusters. In Proc. Int’l Workshop on Large-Scale Distributed Systems and Middleware (LADIS), 2011.

[83] X. Zhang, E. Tune, R. Hagmann, R. Jnagal, V. Gokhale, and J. Wilkes. CPI2: CPU performance isolation for shared compute clusters. In Proc. European Conf. on Computer Systems (EuroSys), Prague, Czech Republic, 2013.

[84] Z. Zhang, C. Li, Y. Tao, R. Yang, H. Tang, and J. Xu. Fuxi: a fault-tolerant resource management and job scheduling system at internet scale. In Proc. Int’l Conf. on Very Large Data Bases (VLDB), pages 1393–1404. VLDB Endowment Inc., Sept. 2014.

## Errata, 2015-04-23

Since the camera-ready copy was finalized, we have noticed a few omissions and ambiguities.

### The user perspective

SREs do much more than system administration (SA): they are the engineers responsible for Google's production services. They design and implement software, including automation systems, and manage applications and the underlying infrastructure services to ensure high reliability and performance at Google's scale.

### Acknowledgements

We inadvertently omitted Brad Strand, Chris Colohan, Divyesh Shah, Eric Wilcox, and Pavanish Nirula.

### References

[1] Michael Litzkow, Miron Livny, and Matt Mutka. "Condor - A Hunter of Idle Workstations". In Proc. Int'l Conf. on Distributed Computing Systems (ICDCS) , pages 104-111, June 1988.

[2] Rajesh Raman, Miron Livny, and Marvin Solomon. "Matchmaking: Distributed Resource Management for High Throughput Computing". In Proc. Int'l Symp. on High Performance Distributed Computing (HPDC) , Chicago, IL, USA, July 1998.
