SRE and DevOps

Date: 2019-12-10

Tags: sre devops


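Preface

While searching for concepts related to SRE and DevOps, I came across an article that the Google Cloud Blog produced specifically on this topic. There are many Chinese translations of it, but none fully achieves the translator's ideals of faithfulness, expressiveness, and elegance. Here I repost Google's official article and YouTube videos, together with a carefully made community translation, and I have mirrored the videos to bilibili for easier viewing. I hope they give you a deeper understanding of SRE and DevOps.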

SRE vs. DevOps: competing standards or close friends?

Revision history

2019-06-25 - First draft

Read the original - https://wsgzao.github.io/post...

Further reading

SRE vs. DevOps: competing standards or close friends? - https://cloud.google.com/blog...
DevOps and SRE - https://blog.alswl.com/2018/0...

English original

SRE vs. DevOps: competing standards or close friends?

Seth Vargo: Staff Developer Advocate
Liz Fong-Jones: Site Reliability Engineer
May 9, 2018

Site Reliability Engineering (SRE) and DevOps are two trending disciplines with quite a bit of overlap. In the past, some have called SRE a competing set of practices to DevOps. But we think they're not so different after all.

What exactly is SRE and how does it relate to DevOps? Earlier this year, we (Liz Fong-Jones and Seth Vargo) launched a video series to help answer some of these questions and reduce the friction between the communities. This blog post summarizes the themes and lessons of each video in the series to offer actionable steps toward better, more reliable systems.

1. The difference between DevOps and SRE

It’s useful to start by understanding the differences and similarities between SRE and DevOps to lay the groundwork for future conversation.

The DevOps movement began because developers would write code with little understanding of how it would run in production. They would throw this code over the proverbial wall to the operations team, which would be responsible for keeping the applications up and running. This often resulted in tension between the two groups, as each group's priorities were misaligned with the needs of the business. DevOps emerged as a culture and a set of practices that aims to reduce the gaps between software development and software operation. However, the DevOps movement does not explicitly define how to succeed in these areas. In this way, DevOps is like an abstract class or interface in programming. It defines the overall behavior of the system, but the implementation details are left up to the author.

SRE, which evolved at Google to meet internal needs in the early 2000s independently of the DevOps movement, happens to embody the philosophies of DevOps, but has a much more prescriptive way of measuring and achieving reliability through engineering and operations work. In other words, SRE prescribes how to succeed in the various DevOps areas. For example, the table below illustrates the five DevOps pillars and the corresponding SRE practices:

DevOps	SRE
Reduce organization silos	Share ownership with developers by using the same tools and techniques across the stack
Accept failure as normal	Have a formula for balancing accidents and failures against new releases
Implement gradual change	Encourage moving quickly by reducing costs of failure
Leverage tooling & automation	Encourages "automating this year's job away" and minimizing manual systems work to focus on efforts that bring long-term value to the system
Measure everything	Believes that operations is a software problem, and defines prescriptive ways for measuring availability, uptime, outages, toil, etc.

If you think of DevOps like an interface in a programming language, class SRE implements DevOps. While the SRE program did not explicitly set out to satisfy the DevOps interface, both disciplines independently arrived at a similar set of conclusions. But just like in programming, classes often include more behavior than just what their interface defines, or they might implement multiple interfaces. SRE includes additional practices and recommendations that are not necessarily part of the DevOps interface.
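The interface analogy can be written out literally. The following is only an illustrative sketch: the method names are hypothetical, derived from the five pillars in the table, and the returned strings paraphrase the SRE column. It is not an API from the original post.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The five DevOps pillars as an interface: behavior named, nothing implemented."""

    @abstractmethod
    def reduce_organizational_silos(self) -> str: ...

    @abstractmethod
    def accept_failure_as_normal(self) -> str: ...

    @abstractmethod
    def implement_gradual_change(self) -> str: ...

    @abstractmethod
    def leverage_tooling_and_automation(self) -> str: ...

    @abstractmethod
    def measure_everything(self) -> str: ...


class SRE(DevOps):
    """One concrete implementation of the interface (hypothetical wording)."""

    def reduce_organizational_silos(self) -> str:
        return "share ownership with developers"

    def accept_failure_as_normal(self) -> str:
        return "balance failures against new releases with a formula"

    def implement_gradual_change(self) -> str:
        return "move quickly by reducing the cost of failure"

    def leverage_tooling_and_automation(self) -> str:
        return "automate this year's job away"

    def measure_everything(self) -> str:
        return "measure availability, uptime, outages and toil"

    def teach_customers(self) -> str:
        # Extra behavior that is not part of the DevOps interface,
        # mirroring the point that a class may do more than its interface.
        return "customer reliability engineering"
```

Instantiating `DevOps()` directly raises `TypeError`, which matches the idea that DevOps alone does not say how to succeed; only a concrete class such as `SRE` can be instantiated.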

DevOps and SRE are not two competing methods for software development and operations, but rather close friends designed to break down organizational barriers to deliver better software faster. If you prefer books, check out How SRE relates to DevOps (Betsy Beyer, Niall Richard Murphy, Liz Fong-Jones) for a more thorough explanation.

2. SLIs, SLOs, and SLAs

The SRE discipline collaboratively decides on a system's availability targets and measures availability with input from engineers, product owners and customers.

It can be challenging to have a productive conversation about software development without a consistent and agreed-upon way to describe a system's uptime and availability. Operations teams are constantly putting out fires, some of which end up being bugs in developers' code. But without a clear measurement of uptime and a clear prioritization on availability, product teams may not agree that reliability is a problem. This very challenge affected Google in the early 2000s, and it was one of the motivating factors for developing the SRE discipline.

SRE ensures that everyone agrees on how to measure availability, and what to do when availability falls out of specification. This process includes individual contributors at every level, all the way up to VPs and executives, and it creates a shared responsibility for availability across the organization. SREs work with stakeholders to decide on Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

SLIs are metrics over time such as request latency, throughput of requests per second, or failures per request. These are usually aggregated over time and then converted to a rate, average or percentile subject to a threshold.
SLOs are targets for the cumulative success of SLIs over a window of time (like "last 30 days" or "this quarter"), agreed upon by stakeholders.
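To make the two definitions concrete, here is a minimal sketch of how an SLI might be computed from raw requests and compared with an SLO. The `Request` type, the function names, the 300 ms threshold and the sample data are all assumptions made for illustration; real systems usually derive these numbers from monitoring data instead of in-memory lists.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    latency_ms: float  # how long the request took
    ok: bool           # whether it succeeded


def availability_sli(requests: Sequence[Request]) -> float:
    """SLI: fraction of requests in the window that succeeded."""
    if not requests:
        raise ValueError("no requests in the window")
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float) -> float:
    """SLI: fraction of requests served within the latency threshold."""
    if not requests:
        raise ValueError("no requests in the window")
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


def meets_slo(sli: float, target: float) -> bool:
    """SLO check: the aggregated SLI must reach the agreed target."""
    return sli >= target


# Hypothetical window: 998 fast successes and 2 slow failures.
window = [Request(120.0, True)] * 998 + [Request(900.0, False)] * 2

availability = availability_sli(window)
within_300ms = latency_sli(window, threshold_ms=300.0)
```

With this sample window the availability SLI is 998 out of 1000 requests, so it would satisfy a 99% SLO but miss a 99.9% one.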

The video also discusses Service Level Agreements (SLAs). Although not specifically part of the day-to-day concerns of SREs, an SLA is a promise by a service provider, to a service consumer, about the availability of a service and the ramifications of failing to deliver the agreed-upon level of service. SLAs are usually defined and negotiated by account executives for customers and offer a lower availability than the SLO. After all, you want to break your own internal SLO before you break a customer-facing SLA.

SLIs, SLOs and SLAs tie back closely to the DevOps pillar of "measure everything" and one of the reasons we say class SRE implements DevOps.

3. Risk and error budgets

We focus here on measuring risk through error budgets, which are quantitative ways in which SREs collaborate with product owners to balance availability and feature development. This video also discusses why 100% is not a viable availability target.

Maximizing a system's stability is both counterproductive and pointless. Unrealistic reliability targets limit how quickly new features can be delivered to users, and users typically won't notice extreme availability (like 99.999999%) because the quality of their experience is dominated by less reliable components like ISPs, cellular networks or WiFi. Having a 100% availability requirement severely limits a team or developer’s ability to deliver updates and improvements to a system. Service owners who want to deliver many new features should opt for less stringent SLOs, thereby giving them the freedom to continue shipping in the event of a bug. Service owners focused on reliability can choose a higher SLO, but accept that breaking that SLO will delay feature releases. The SRE discipline quantifies this acceptable risk as an "error budget." When error budgets are depleted, the focus shifts from feature development to improving reliability.
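The trade-off described above can be sketched in a few lines. This is a simplified, request-based model under stated assumptions: the function names and the two-outcome policy are hypothetical, and a real error-budget policy is negotiated with stakeholders and not hard-coded.

```python
def error_budget(slo_target: float) -> float:
    """Share of requests that may fail while still meeting the SLO."""
    if not 0.0 < slo_target < 1.0:
        # A 100% target leaves no budget at all, so this sketch rejects it.
        raise ValueError("SLO target must be strictly between 0 and 1")
    return 1.0 - slo_target


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_failures = error_budget(slo_target) * total_requests
    return (allowed_failures - failed_requests) / allowed_failures


def release_policy(remaining: float) -> str:
    """Ship while budget remains; otherwise shift focus to reliability."""
    return "ship features" if remaining > 0 else "focus on reliability"


# Hypothetical month: 1,000,000 requests against a 99.9% SLO.
healthy = budget_remaining(0.999, 1_000_000, failed_requests=400)
depleted = budget_remaining(0.999, 1_000_000, failed_requests=1_500)
```

In this example a 99.9% SLO over one million requests allows roughly 1,000 failures: 400 failures leave about 60% of the budget, while 1,500 failures overspend it by about half.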

As mentioned in the second video, leadership buy-in is an important pillar in the SRE discipline. Without this cooperation, nothing prevents teams from breaking their agreed-upon SLOs, forcing SREs to work overtime or waste too much time toiling to just keep the systems running. If SRE teams do not have the ability to enforce error budgets (or if the error budgets are not taken seriously), the system fails.

Risk and error budgets quantitatively accept failure as normal and enforce the DevOps pillar to implement gradual change. Non-gradual changes risk exceeding error budgets.

4. Toil and toil budgets

An important component of the SRE discipline is toil, toil budgets and ways to reduce toil. Toil occurs each time a human operator needs to manually touch a system during normal operations—but the definition of "normal" is constantly changing.

Toil is not simply "work I don't like to do." For example, the following tasks are overhead, but are specifically not toil: submitting expense reports, attending meetings, responding to email, commuting to work, etc. Instead, toil is specifically tied to the running of a production service. It is work that tends to be manual, repetitive, automatable, tactical and devoid of long-term value. Additionally, toil tends to scale linearly as the service grows. Each time an operator needs to touch a system, such as responding to a page, working a ticket or unsticking a process, toil has likely occurred.

The SRE discipline aims to reduce toil by focusing on the "engineering" component of Site Reliability Engineering. When SREs find tasks that can be automated, they work to engineer a solution to prevent that toil in the future. While minimizing toil is important, it's realistically impossible to completely eliminate. Google aims to ensure that at least 50% of each SRE's time is spent doing engineering projects, and these SREs individually report their toil in quarterly surveys to identify operationally overloaded teams. That being said, toil is not always bad. Predictable, repetitive tasks are great ways to onboard a new team member and often produce an immediate sense of accomplishment and satisfaction with low risk and low stress. Long-term toil assignments, however, quickly outweigh the benefits and can cause career stagnation.
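A toil budget can be checked the same way as any other measurement. The sketch below assumes a hypothetical survey format (per-SRE pairs of toil hours and total hours, grouped by team) and treats the 50% figure from the text as the ceiling for toil. It does not describe Google's actual survey tooling.

```python
from typing import Dict, List, Tuple

MAX_TOIL_FRACTION = 0.5  # at least 50% of time should go to engineering work

# Each entry is (toil_hours, total_hours) reported by one SRE.
Survey = Dict[str, List[Tuple[float, float]]]


def toil_fraction(reports: List[Tuple[float, float]]) -> float:
    """Share of a team's reported time that was spent on toil."""
    total_hours = sum(total for _, total in reports)
    if total_hours <= 0:
        raise ValueError("no hours reported")
    return sum(toil for toil, _ in reports) / total_hours


def overloaded_teams(survey: Survey, limit: float = MAX_TOIL_FRACTION) -> List[str]:
    """Teams whose toil share exceeds the budget, in alphabetical order."""
    return sorted(team for team, reports in survey.items()
                  if toil_fraction(reports) > limit)


# Hypothetical quarterly survey data.
survey: Survey = {
    "storage": [(30.0, 40.0), (25.0, 40.0)],
    "frontend": [(10.0, 40.0), (15.0, 40.0)],
}
flagged = overloaded_teams(survey)
```

Here the made-up "storage" team reports 55 of 80 hours as toil and is flagged, while "frontend" at 25 of 80 hours stays within budget.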

Toil and toil budgets are closely related to the DevOps pillars of "measure everything" and "reduce organizational silos."

5. Customer Reliability Engineering (CRE)

Finally, Customer Reliability Engineering (CRE) completes the tenets of SRE (with the help in the video of a futuristic friend). CRE aims to teach SRE practices to customers and service consumers.

In the past, Google did not talk publicly about SRE. We thought of it as a competitive advantage we had to keep secret from the world. However, every time a customer had a problem because they used a system in an unexpected way, we had to stop innovating and help solve the problem. That tiny bit of friction, spread across billions of users, adds up very quickly. It became clear that we needed to start talking about SRE publicly and teaching our customers about SRE practices so they could replicate them within their organizations.

Thus, in 2016, we launched the CRE program as both a means of helping our Google Cloud Platform (GCP) customers with improving their reliability, and a means of exposing Google SREs directly to the challenges customers face. The CRE program aims to reduce customer anxiety by teaching them SRE principles and helping them adopt SRE practices.

CRE aligns with the DevOps pillars of "reduce organization silos" by forcing collaboration across organizations, and it also closely relates to the concepts of "accepting failure as normal" and "measure everything" by creating a shared responsibility among all stakeholders in the form of shared SLOs.

Looking forward with SRE

We are working on some exciting new content across a variety of mediums to help showcase how users can adopt DevOps and SRE on Google Cloud, and we cannot wait to share them with you. What SRE topics are you interested in hearing about? Please give us a tweet or watch our videos.


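Chinese translation

The Chinese translation was originally written in Traditional Chinese; I converted it to Simplified Chinese and replaced the videos with bilibili links.

[[Good article translation] Are you looking for SRE or DevOps?](https://medium.com/kkstream/%...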

Neil Wei in KKStream
Aug 3, 2018

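Over the past six months our company has been hiring heavily, including quite a few DevOps and SRE openings. However, HR (and colleagues in other departments) do not understand how the two differ. Some even think SRE is the same as a traditional operations unit under a new name, having merely moved from managing server rooms to managing the cloud. So what exactly is the difference between the two?

This confusion leaves a bad impression on candidates who come to interview, as if we ourselves did not know what kind of people we were looking for. So I spent some time explaining the difference to HR, and after four months of supporting the SRE team I am leaving this translation, plus a few thoughts of my own.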

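First, please remember:
SRE is a DevOps (a banana is a fruit)

DevOps is NOT an SRE (a fruit is not a banana)

DevOps is not a job title; SRE is.

(This article is translated and published with the permission of Seth Vargo, one of the original authors.)

Original article: https://cloudplatform.googleblog.com/2018/05/SRE-vs-DevOps-competing-standards-or-close-friends.html?m=1

Main text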

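Site Reliability Engineering (SRE) and DevOps are two currently very popular development and operations cultures, and they overlap considerably. Early on, however, some people regarded SRE as a practice distinct from DevOps, believing that the two were different and that you had to pick one. Today most people lean toward the view that the two are in fact very similar.

So what exactly do SRE and DevOps have in common? Earlier this year, two Google engineers (Liz Fong-Jones and Seth Vargo) prepared a series of videos to answer these questions and to step in and reduce disagreement between the communities. This article summarizes the topics covered in the videos and how to actually build a more reliable system.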

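1. The difference between SRE and DevOps

Before we begin, let's look at what SRE and DevOps have in common, and where they differ.

The DevOps culture arose because in the early days (about ten years ago) many developers knew little about how their programs actually ran in the real world. A developer's job was to package the program and throw it to the operations department, at which point their own work cycle was over. The operations department was responsible for installing and deploying the program onto all production machines, and also had to use every available method and tool to keep it running normally, even though operations knew nothing about the program's implementation details.

This way of working easily set the two departments against each other. Each department had its own goals, and those goals could be at odds with the company's business needs. DevOps emerged to bring a new software development culture that narrows the gap between development and operations.

However, DevOps by its nature does not teach you how to succeed; it lays down some basic principles and leaves the rest to each team. In programming terms, DevOps is more like an abstract class or an interface: it defines the behavior this culture should have, while the implementation is decided jointly by the members of each department. As long as a practice conforms to this "interface", it can be called a practice of DevOps culture.

The term SRE was coined by Google. It is a set of norms and practices that Google developed over more than a decade to cope with its ever-growing internal systems. Unlike DevOps, it implements the abstract methods that DevOps defines, and it prescribes in more detail how to use software engineering methods, from an operations perspective, to keep systems stable. Put simply, SRE implements the DevOps interface. The five interfaces that DevOps defines, and how SRE implements each, are listed below: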

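DevOps: Reduce silos between organizations
SRE: Use the same tools as the development team throughout the development cycle and share ownership with them. (Note: infrastructure as code, configuration as code)

DevOps: Accept failure, and treat failure as a normal element of the development cycle
SRE: For new releases, establish a set of quantifiable metrics for weighing "accidents" and "failures".

DevOps: Change gradually
SRE: Encourage teams to deliver quickly by lowering the cost of troubleshooting failures (that is, there is no need to get everything perfect in one go; change gradually).

DevOps: Make good use of tooling and automation
SRE: Encourage teams to automate this year's work away and minimize what has to be done by hand, putting their energy into medium- and long-term system improvements.

DevOps: Everything can be measured
SRE: Believe that operations falls within software engineering, and prescribe measurements for availability, uptime, outages, what counts as toil, and so on.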

If you already accept that DevOps is an "interface", then in programming-language terms:

class SRE implements DevOps

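In practice the two still have many independent principles, and SRE does not implement every DevOps concept one-to-one, but in the end the two reach the same conclusions. As in a programming language, a class that implements an interface can extend it further and can also implement other interfaces; SRE includes more details that DevOps never defined.

In software development and operations, DevOps and SRE are not competing over which is the industry standard. On the contrary, both are approaches designed to reduce barriers between organizations and to deliver better software faster. If you want more detail, the book How SRE relates to DevOps (Betsy Beyer, Niall Richard Murphy, Liz Fong-Jones) is worth reading.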

2. SLIs, SLOs, and SLAs

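One of the principles of SRE is to provide different measurements for different roles. For engineers, PMs, and customers, how available the whole system is, and how that should be measured, are presented in different ways.

Without a way to measure a system's uptime and availability, it is very hard to operate a system that is already live. The operations team often ends up in permanent firefighting mode, and when the root cause is finally found, it may turn out to be a problem in code written by the development team.

Without an agreed way to measure uptime and availability, development teams often do not see "stability" as a potential problem. This problem troubled Google for many years, and it was one of the motivations for developing the SRE principles.

SRE makes sure everyone knows how to measure reliability and what to do when the service fails. This goes into such detail that when a problem occurs, everyone from VPs and CxOs down to every relevant employee in the organization has something they are expected to do. What each person should do is clearly specified. SREs talk with all stakeholders to decide on Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

SLIs define metrics related to how the system responds, such as response time, throughput per second, request volume, and so on. The metric is often converted into a ratio or an average.
SLOs are figures, agreed after discussion with stakeholders, for the level the SLIs are expected to maintain over a time window, for example "the level the SLIs must reach each month". They are more of an internal indicator.

The video also discusses Service Level Agreements (SLAs), even though these are not numbers SREs worry about day to day. For a provider of an online service, an SLA is a promise to customers that guarantees the percentage of time the service keeps running. It is usually negotiated with the customer, for example that downtime per year (or per month) must not exceed a certain number of minutes.

The concepts of SLI, SLO, and SLA are very close to the DevOps idea that "everything can be measured", which is one of the reasons we say class SRE implements DevOps.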

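3. Risk and error budgets

We evaluate risk with error budgets. An error budget is a quantified value describing how long a service may be unavailable per day (or per month). If a service's SLO is 99.9%, the development team effectively has a 0.1% error budget to spend. This value is a balance struck in discussion between the product owner and the development team. The video also explains why a zero error budget is not a suitable value.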
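As a worked example of the arithmetic, a 99.9% target can be converted into allowed downtime. The 30-day window is an assumption chosen for illustration; the text only says "per day (or per month)".

```latex
\begin{aligned}
\text{error budget} &= 1 - 0.999 = 0.001 \\
\text{window} &= 30 \times 24 \times 60 = 43\,200 \text{ minutes} \\
\text{allowed downtime} &= 0.001 \times 43\,200 = 43.2 \text{ minutes}
\end{aligned}
```

Under these assumptions the service may be unavailable for about 43 minutes in the window before the budget is spent.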

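Striving to keep a system at 100% availability is exhausting and pointless. Unrealistic targets limit how fast the development team can get new features into users' hands, and most users will not notice anyway (say, a reliability of 99.999999%), because their ISP, 3G/4G network, or home WiFi is probably less reliable than that. Striving for a 100% uninterrupted service severely limits how quickly the development team can deliver new features. To meet such a harsh constraint, developers often choose not to fix bugs, not to add features, and not to improve the system. Instead, some flexibility should be kept so the development team has room to work.

One of the SRE principles is to work out a tolerable "error budget". Only once this budget is exhausted should the focus shift to improving reliability instead of continuing to develop new features.

As the second video mentions, getting management to buy into this culture is the most important thing. The SLIs were set by everyone together; if people do not play by the rules, SREs are again reduced to doing endless grunt work just to keep the system at a certain level of stability, nobody knows about it (because no standard was set), and in the end the service is bound to fail. Risk and error budgets treat mistakes as normal, and one way to improve is to release new features continuously and in small increments, which is also in line with DevOps principles.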

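4. Toil and toil budgets

Another SRE principle is keeping toil under control. How do we reduce toil? And what is toil?

Work in operations that is manual, repetitive, and automatable, or that is one-off and has no lasting value, is toil.

Toil is not simply "things I don't want to do", however. For example, a company has many recurring tasks that happen again and again, such as meetings, communication, and replying to email; none of these is toil.

By contrast, logging in to a machine by hand every day, fetching a file, processing it, and then emailing out a report is toil, because it is manual, repetitive, and automatable.

The SRE principle is to try to eliminate such work with software engineering. When SREs find that something can be automated, they set about building the automation so the same work does not have to be done again. Minimizing toil is important, but in practice it can never be eliminated completely. Google strives to keep each SRE's day-to-day toil below 50% so that SREs can spend their time on more meaningful work, and the results are reviewed every quarter.

Toil is not entirely bad, either. For new team members, taking part in these routine tasks first helps them understand what the service requires, and it is relatively low-risk and low-stress. In the long run, though, no engineer should be doing toil all the time.

Toil management is also in line with the DevOps principles that everything can be measured and that silos between organizations should be reduced.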

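5. Customer Reliability Engineering (CRE)

Personally I feel this topic goes a bit far afield for now, so I will not translate it sentence by sentence.

In short, it is about how to spread SRE concepts so that GCP customers know how to use GCP's services correctly, and about promoting an SRE culture.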

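Personal afterword

Our company is in fact gradually transforming, and really is on the path from traditional development and operations working independently of each other toward putting DevOps culture into practice. After four months supporting the SRE department, I have taken part in many real-world challenges, worked with everyone to define automated processes and reduce existing toil, and we are gradually moving toward a DevOps culture. I hope everyone can understand that: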

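SRE is software engineering; SREs should not be merely operations staff or system administrators.
DevOps is not a job title, but SRE is. You would not go to a vegetable stall at the market and tell the owner you want to buy "greens"; you would say whether you want cabbage or bok choy!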

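The ideal is always perfect, though, and we still have to face reality. Our company is not called Google, and most people cannot get into Google. Google's SREs are probably better at writing code than most companies' software engineers, know networking better than network engineers, and know operations better than operations engineers. The SRE openings around us are often not as rosy as imagined: toil and manual work may still make up the majority, barriers between departments remain, there may be many SREs who cannot write code, and operations still takes up most of the daily work.

Traditional operations staff or IT network administrators who want to move toward SRE also need to change their mindset and step out of their comfort zone. In an era where everything is "as code" and everything is "as a service", those who do not write code are just waiting to be made obsolete.

Change is slow and has to be cultivated gradually, so let us... wait... a P0 incident just happened! That's all for now!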