Netflix的全週期開發者—運行您構建的內容（中英雙語）

時間 2019-12-12

原文原文鏈接

The year was 2012 and operating a critical service at Netflix was laborious. Deployments felt like walking through wet sand. Canarying was devolving into verifying endurance (「nothing broke after one week of canarying, let’s push it」) rather than correct functionality. Researching issues felt like bouncing a rubber ball between teams, hard to catch the root cause and harder yet to stop from bouncing between one another. All of these were signs that changes were needed.less

2012年的時候在Netflix運營一項關鍵服務很費力的。部署就像在潮溼的沙地上行走。金絲雀部署變成了對耐心的測試（「一週的金絲雀測試都沒問題發生因此咱們繼續推動吧」）而不是正確的功能。研究問題就好像皮球同樣在團隊之間被踢來踢去，很難抓住根源。全部這些跡象代表須要作一些改變了。運維

Fast forward to 2018. Netflix has grown to 125M global members enjoying 140M+ hours of viewing per day. We’ve invested significantly in improving the development and operations story for our engineering teams. Along the way we’ve experimented with many approaches to building and operating our services. We’d like to share one approach, including its pros and cons, that is relatively common within Netflix. We hope that sharing our experiences inspires others to debate the alternatives and learn from our journey.dom

時間快進到2018年。Netflix的全球用戶已增至1.25億，天天觀看時長超過1.4億小時。咱們已經在改善咱們的工程團隊的開發和運營方面投入了大量的資金。在此過程當中，咱們嘗試了許多方法來構建和運行咱們的服務。在此，咱們願意把其中一種咱們內部用得相對廣泛的方案，包括它的優缺點拿出來跟你們分享。咱們但願咱們的經驗分享能給你們一點啓發而且討論出可能的替代方案。ide

一個團隊的旅程

Edge Engineering is responsible for the first layer of AWS services that must be up for Netflix streaming to work. In the past, Edge Engineering had ops-focused teams and SRE specialists who owned the deploy+operate+support parts of the software life cycle. Releasing a new feature meant devs coordinating with the ops team on things like metrics, alerts, and capacity considerations, and then handing off code for the ops team to deploy and operate. To be effective at running the code and supporting partners, the ops teams needed ongoing training on new features and bug fixes. The primary upside of having a separate ops team was less developer interrupts when things were going well.工具

Edge Engineering（邊緣工程）負責AWS服務的第一層，Netflix流媒體必須依靠這些服務才能正常工做。在過去，Edge Engineering有專一運維的團隊以及SRE（網站可靠性工程師）專家，他們負責軟件生命週期的部署+運營+支撐這部分。發佈一個新特性意味着開發人員要在度量、警報和容量考慮事項等方面與運維團隊協調，而後將代碼交給運維團隊進行部署和操做。爲了有效地運行代碼並支持合做夥伴，運維團隊須要持續接受新特性和bug修復方面的培訓。擁有一個單獨的運維團隊的主要好處是，當事情進展順利時，開發人員的干擾更少。oop

When things didn’t go well, the costs added up. Communication and knowledge transfers between devs and ops/SREs were lossy, requiring additional round trips to debug problems or answer partner questions. Deployment problems had a higher time-to-detect and time-to-resolve due to the ops teams having less direct knowledge of the changes being deployed. The gap between code complete and deployed was much longer than today, with releases happening on the order of weeks rather than days. Feedback went from ops, who directly experienced pains such as lack of alerting/monitoring or performance issues and increased latencies, to devs, who were hearing about those problems second-hand.性能

當事情進展不順時，成本就會增長。開發人員和ops/SREs之間的交流和信息交換是有損耗的，須要額外的往返調試問題或回答合做夥伴的問題。部署問題由於運維團隊對所部署內容的更改了解較少。因此檢測和解決問題須要很長的時間。代碼完成與部署之間的鴻溝變得更大，發佈每每是以周爲量級而不是日。反饋從運維發起，這幫人直接經歷了缺乏告警/監控或者性能問題及時延增長這樣的痛苦，而後再傳遞到開發人員這裏時問題已是二手了。學習

To improve on this, Edge Engineering experimented with a hybrid model where devs could push code themselves when needed, and also were responsible for off-hours production issues and support requests. This improved the feedback and learning cycles for developers. But, having only partial responsibility left gaps. For example, even though devs could do their own deployments and debug pipeline breakages, they would often defer to the ops release specialist. For the ops-focused people, they were motivated to do the day to day work but found it hard to prioritize automation so that others didn’t need to rely on them.開發工具

爲了改進這一點，Edge Engineering嘗試了一種混合模型，在這種模型中，開發人員能夠在須要的時候本身推送代碼，同時還負責非工做時間的生產問題和支持請求。這改進了開發人員的反饋和學習週期。但這會出現部分的責任不到位的問題。例如，即便開發人員能夠執行他們本身的部署和調試管道中斷，他們每每也會交給運維處理。對於那些專一於運維的人來講，他們有動力去作天天的工做，可是很難會把無需別人依賴本身的自動化放在優先考慮的位置

In search of a better way, we took a step back and decided to start from first principles. What were we trying to accomplish and why weren’t we being successful?

爲了尋找更好的方法，咱們退了一步，決定從第一性原理開始。咱們想要完成什麼，爲何咱們沒有成功?

軟件生命週期

The purpose of the software life cycle is to optimize 「time to value」; to effectively convert ideas into working products and services for customers. Developing and running a software service involves a full set of responsibilities:

軟件生命週期的目的是優化「價值時間」;爲客戶有效地將想法轉化爲工做產品和服務。開發和運行軟件服務涉及一系列職責:

We had been segmenting these responsibilities. At an extreme, this means each functional area is owned by a different person/role:

咱們一直在劃分這些責任。在極端狀況下，這意味着每一個功能區由不一樣的人/角色擁有:

These specialized roles create efficiencies within each segment while potentially creating inefficiencies across the entire life cycle. Specialists develop expertise in a focused area and optimize what’s needed for that area. They get more effective at solving their piece of the puzzle. But software requires the entire life cycle to deliver value to customers. Having teams of specialists who each own a slice of the life cycle can create silos that slow down end-to-end progress. Grouping differing specialists together into one team can reduce silos, but having different people do each role adds communication overhead, introduces bottlenecks, and inhibits the effectiveness of feedback loops.

這些專門的角色在每個細分領域內創造出了效能，可是卻有可能形成整個生命週期的低效。專家在其聚焦的領域發展專業知識並針對該領域的須要進行優化。他們在解決特定領域的難題上變得愈來愈高效。可是軟件須要整個生命週期來爲客戶提供價值。各自精通生命週期的一小塊的專家團隊反而可能會製造出信息孤島，從而減慢端到端的進度。將不一樣的專家分組到一個團隊中能夠減小信息孤島，可是讓不一樣的人擔任每一個角色會增長溝通開銷，引入瓶頸，並抑制反饋迴環的有效性。

運行構建的內容

To rethink our approach, we drew inspiration from the principles of the devops movement. We could optimize for learning and feedback by breaking down silos and encouraging shared ownership of the full software life cycle:

爲了從新思考咱們的方法，咱們從devops運動的原則中得到了靈感。咱們能夠經過打破信息孤島和鼓勵共享整個軟件生命週期的全部權來優化學習和反饋:

「Operate what you build」 puts the devops principles in action by having the team that develops a system also be responsible for operating and supporting that system. Distributing this responsibility to each development team, rather than externalizing it, creates direct feedback loops and aligns incentives. Teams that feel operational pain are empowered to remediate the pain by changing their system design or code; they are responsible and accountable for both functions. Each development team owns deployment issues, performance bugs, capacity planning, alerting gaps, partner support, and so on.

「運營你開發的東西」經過讓開發系統的團隊也負責系統的運營和支持來踐行devops原則。把這個責任分攤給每一支開發團隊，而不是外化它，這樣就創建直接反饋迴環而且把激勵給統一塊兒來。感覺到運維痛苦的團隊被賦權經過改變系統設計或代碼來治療這種痛苦；他們要負責這兩種職能。每一支開發團隊都要負責部署問題、性能bug、能力規劃、告警差別、夥伴支持等等。

經過開發工具進行擴展

Ownership of the full development life cycle adds significantly to what software developers are expected to do. Tooling that simplifies and automates common development needs helps to balance this out. For example, if software developers are expected to manage rollbacks of their services, rich tooling is needed that can both detect and alert them of the problems as well as to aid in the rollback.

對整個開發生命週期的全部權給軟件開發者顯著增長了負擔。這就須要有簡化和自動化共同開發需求的工具來減輕負擔。比方說，若是軟件開發者預期要管理服務的回滾的話，就要有豐富的工具既能檢測到問題並予以告警，又能輔助進行回滾才行。

Netflix created centralized teams (e.g., Cloud Platform, Performance & Reliability Engineering, Engineering Tools) with the mission of developing common tooling and infrastructure to solve problems that every development team has. Those centralized teams act as force multipliers by turning their specialized knowledge into reusable building blocks. For example:

Netflix建立了集中的團隊(例如，雲平臺、性能和可靠性工程、工程工具)，其任務是開發通用的工具和基礎設施來解決每一個開發團隊都有的問題。這些集中的團隊將他們的專業知識轉化爲可重用的構建塊，從而起到了力量倍增器的做用。例如:

Empowered with these tools in hand, development teams can focus on solving problems within their specific product domain. As additional tooling needs arise, centralized teams assess whether the needs are common across multiple dev teams. When they are, collaborations ensue. Sometimes these local needs are too specific to warrant centralized investment. In that case the development team decides if their need is important enough for them to solve on their own.

有了這些工具在手，開發團隊能夠專一於解決他們特定產品領域中的問題。隨着其餘工具需求的出現，集中化團隊會評估多個開發團隊是否也有這些需求。若是有，接着就要協做。有時，這些地方需求過於具體，沒法保證集中投資。在這種狀況下，開發團隊決定他們的需求是否重要到足以讓他們本身解決問題。

Balancing local versus central investment in similar problems is one of the toughest aspects of our approach. In our experience the benefits of finding novel solutions to developer needs are worth the risk of multiple groups creating parallel solutions that will need to converge down the road. Communication and alignment are the keys to success. By starting well-aligned on the needs and how common they are likely to be, we can better match the investment to the benefits to dev teams across Netflix.

對相似問題在局部與集中投資間進行平衡是咱們的方案當中最棘手的地方。按照咱們的經驗尋找開發需求的新穎解決方案的好處，是值得冒險讓多支團隊同時開發在未來異曲同工的解決方案的。溝通與協調是成功的關鍵。經過協調好需求及其共性，咱們就能更好地將投資與跨開發團隊的好處進行匹配。

全週期開發者

By combining all of these ideas together, we arrived at a model where a development team, equipped with amazing developer productivity tools, is responsible for the full software life cycle: design, development, test, deploy, operate, and support.

把全部這些想法湊到一塊兒，咱們就得出了這麼一個模式，在配備了出色的開發者生產力工具以後，開發團隊將負責整個軟件生命週期：包括設計、開發、測試、部署、運營以及支持。

Full cycle developers are expected to be knowledgeable and effective in all areas of the software life cycle. For many new-to-Netflix developers, this means ramping up on areas they haven’t focused on before. We run dev bootcamps and other forms of ongoing training to impart this knowledge and build up these skills. Knowledge is necessary but not sufficient; easy-to-use tools for deployment pipelines (e.g., Spinnaker) and monitoring (e.g., Atlas) are also needed for effective full cycle ownership.

全週期開發者須要熟悉軟件生命週期各個領域而且高效。對於不少不熟悉Netflix的開發者來講，這意味着要在本身以前不怎麼關注的領域加把勁。咱們開設有dev新兵訓練營及其餘持續培訓形式來灌輸這種知識並培養技能。知識是必要非充分條件；部署管道和監控還須要有易用的工具才能支撐高效的全週期開發運營。

Full cycle developers apply engineering discipline to all areas of the life cycle. They evaluate problems from a developer perspective and ask questions like 「how can I automate what is needed to operate this system?」 and 「what self-service tool will enable my partners to answer their questions without needing me to be involved?」 This helps our teams scale by favoring systems-focused rather than humans-focused thinking and automation over manual approaches.

全週期開發者把工程規範應用到生命週期的各個領域。他們從開發者的角度去評估問題，會提出相似「我如何才能自動化該系統運營所需的東西？」以及「什麼樣的自服務工具能讓個人夥伴回答他們的問題而不須要個人參與？」優先考慮聚焦系統的辦法而不是聚焦於人的辦法，優先考慮自動化而不是手工，這幫助了咱們團隊實現伸縮性。

Moving to a full cycle developer model requires a mindset shift. Some developers view design+development, and sometimes testing, as the primary way that they create value. This leads to the anti-pattern of viewing operations as a distraction, favoring short term fixes to operational and support issues so that they can get back to their 「real job」. But the 「real job」 of full cycle developers is to use their software development expertise to solve problems across the full life cycle. A full cycle developer thinks and acts like an SWE, SDET, and SRE. At times they create software that solves business problems, at other times they write test cases for that, and still other times they automate operational aspects of that system.

轉向全週期開發者模式須要理念的轉變。一些開發者認爲設計+開發，或者有時候測試纔是創造價值的主要手段。這會致使一種反模式，認爲運營是分心的事情，更喜歡對運營和支持問題進行短時間性質的修補以便可以回到本身「真正的工做」上去。可是全週期開發者這項「真正的工做」是利用他們的軟件開發知識去解決全生命週期的問題。全週期開發者要像SWE、SDET以及SRE同樣思考和行動。有時候他們要建立軟件去解決商業問題，有時候他們寫相應的測試用例，還有些時候他們會對系統的運營方面進行自動化。

For this model to succeed, teams must be committed to the value it brings and be cognizant of the costs. Teams need to be staffed appropriately with enough headroom to manage builds and deployments, handle production issues, and respond to partner support requests. Time needs to be devoted to training. Tools need to be leveraged and invested in. Partnerships need to be fostered with centralized teams to create reusable components and solutions. All areas of the life cycle need to be considered during planning and retrospectives. Investments like automating alert responses and building self-service partner support tools need to be prioritized alongside business projects. With appropriate staffing, prioritization, and partnerships, teams can be successful at operating what they build. Without these, teams risk overload and burnout.

這一模式要想取得成功，團隊必須爲它所帶來的價值作奉獻而且要認識到所需的成本。團隊須要預留合理的人手去管理開發和部署，處理生產問題，而且對夥伴的支持請求做出響應。須要投入時間到培訓上。要利用好工具而且投資於工具。須要跟集中化團隊培養合做關係來建立出可重用的組件和解決方案。規劃和回顧階段要考慮到生命週期的各個領域。除了商業項目之外，像自動化告警響應和開發自服務夥伴支持工具這樣的投資須要優先考慮。有了合適的人力、恰當的優先次序，再加上合做關係，團隊就能成功地運營本身開發的東西。沒有這些，團隊就會有負擔太重精疲力竭的風險。

To apply this model outside of Netflix, adaptations are necessary. The common problems across your dev teams are likely similar — from the need for continuous delivery pipelines, monitoring/observability, and so on. But many companies won’t have the staffing to invest in centralized teams like at Netflix, nor will they need the complexity that Netflix’s scale requires. Netflix’s tools are often open source, and it may be compelling to try them as a first pass. However, other open source and SaaS solutions to these problems can meet most companies needs. Start with analysis of the potential value and count the costs, followed by the mindset-shift. Evaluate what you need and be mindful of bringing in the least complexity necessary.

在Netflix以外的地方應用這一模式須要進行必要的調整。開發團隊之間的共同問題多是相似的——好比持續交付管道的需求，好比監控/可觀察性等等。但不少公司並無像Netflix這樣有足夠的人力投資到集中化團隊上，或者也不須要Netflix這種規模致使的複雜性。Netflix的工具每每是開源的，因此一開始你想嘗試一下也正常。不過這些問題其餘的開源和SaaS解決方案也能知足大多數公司的需求。先從分析潛在價值和計算成本開始沒而後再進行觀念轉變。評估你須要什麼，當心不要引入沒必要要的複雜性。

權衡

The tech industry has a wide range of ways to solve development and operations needs (see devops topologies for an extensive list). The full cycle model described here is common at Netflix, but has its downsides. Knowing the trade-offs before choosing a model can increase the chance of success.

技術圈有很豐富的手段來解決開放和運營需求（延伸閱讀：devops拓撲）。這裏描述的全週期模型在Netflix很廣泛，但這種模式也有缺點。在選擇一種模式前先了解其中的利弊能夠提升成功的概率。

With the full cycle model, priority is given to a larger area of ownership and effectiveness in those broader domains through tools. Breadth requires both interest and aptitude in a diverse range of technologies. Some developers prefer focusing on becoming world class experts in a narrow field and our industry needs those types of specialists for some areas. For those experts, the need to be broad, with reasonable depth in each area, may be uncomfortable and sometimes unfulfilling. Some at Netflix prefer to be in an area that needs deep expertise without requiring ongoing breadth and we support them in finding those roles; others enjoy and welcome the broader responsibilities.

在全週期模式下，一我的要管的事情變寬了變多了。而一些開發者偏向於專一成爲比較狹窄的領域的世界級專家，在一些領域咱們也是須要那種類型的專家的。對於那些專家來講，須要一專多能，對每一個領域都懂一些的要求可能會感受不太舒服並且有時候勉爲其難。有些人寧願呆在須要深厚知識不須要持續擴展廣度的領域，咱們也支持他們去找到這樣的角色；有的則享受而且歡迎承擔更廣的責任。

In our experience with building and operating cloud-based systems, we’ve seen effectiveness with developers who value the breadth that owning the full cycle requires. But that breadth increases each developer’s cognitive load and means a team will balance more priorities every week than if they just focused on one area. We mitigate this by having an on-call rotation where developers take turns handling the deployment + operations + support responsibilities. When done well, that creates space for the others to do the focused, flow-state type work. When not done well, teams devolve into everyone jumping in on high-interrupt work like production issues, which can lead to burnout.

根據咱們開發和運營基於雲的系統的經驗，咱們見識過哪些重視擁有全週期所需的廣度的開發者的效能。可是這種廣度增長了每一位開發者的認知負荷，這意味着團隊每週將比僅關注一個領域要平衡更多的優先事項。爲此咱們採起了隨時待命的輪轉來緩解這一點：即讓開發者輪流分擔部署+運營+支持責任。作得很差的狀況下，就會出現人人都在當救火隊員去處理生產問題等高中斷的狀況，致使全部人精疲力竭。

Tooling and automation help to scale expertise, but no tool will solve every problem in the developer productivity and operations space. Netflix has a 「paved road」 set of tools and practices that are formally supported by centralized teams. We don’t mandate adoption of those paved roads but encourage adoption by ensuring that development and operations using those technologies is a far better experience than not using them. The downside of our approach is that the ideal of 「every team using every feature in every tool for their most important needs」 is near impossible to achieve. Realizing the returns on investment for our centralized teams’ solutions requires effort, alignment, and ongoing adaptations.

工具和自動化有助於擴展專業知識，但沒有一項工具能解決開發者生產力和運營領域的每個問題。Netflix有集中化團隊支撐的現成的一套工具和實踐。咱們不強求其餘團隊必定要用這些，可是經過確保開發和運營採用這些技術的體驗要比不用好得多來鼓勵他們採用。咱們的辦法很差之處在於「每一支團隊將每一項工具的每個功能用到其最重要的需求」上這個理想幾乎是不可能實現的。須要意識到咱們集中化團隊解決方案的投資回報須要努力、協調以及持續適配。

結論

The path from 2012 to today has been full of experiments, learning, and adaptations. Edge Engineering, whose earlier experiences motivated finding a better model, is actively applying the full cycle developer model today. Deployments are routine and frequent, canaries take hours instead of days, and developers can quickly research issues and make changes rather than bouncing the responsibilities across teams. Other groups are seeing similar benefits. However, we’re cognizant that we got here by applying and learning from alternate approaches. We expect tomorrow’s needs to motivate further evolution.

從2012年走到今天經歷了種種實驗、學習和適配的過程。Edge Engineering的早期經歷刺激了尋找更好模式的需求，今後全週期開發者模式就被咱們積極地應用到今天。部署是平常，進行得很頻繁，金絲雀行動只須要數小時而不是很多天了，開發者能夠迅速調研問題做出變動而不是在團隊之間踢皮球。其餘的團隊也看到了相似的好處。然而，咱們認識到咱們是經過應用替代方案並從中學習才走到今天的。咱們預期未來的需求還會推進進一步的演進。

Interested in seeing this model in action? Want to be a part of exploring how we evolve our approaches for the future? Consider joining us.