Gwen Shapira, who at the time was an engineer at Cloudera and now is spreading the Kafka gospel, asked a question on Twitter that got me thinking.
I need to improve my proficiency in distributed systems theory. Where do I start? Any recommended books?
— Gwen (Chen) Shapira (@gwenshap) August 7, 2014
My response of old might have been “well, here’s the FLP paper, and here’s the Paxos paper, and here’s the Byzantine generals paper…”, and I’d have prescribed a laundry list of primary source material which would have taken at least six months to get through if you rushed.
But I’ve come to think that recommending a ton of theoretical papers is often precisely the wrong way to go about learning distributed systems theory (unless you are in a PhD program).
Papers are usually deep, usually complex, and require both serious study and, usually, significant experience to glean their important contributions and to place them in context.
What good is requiring that level of expertise of engineers?
And yet, unfortunately, there’s a paucity of good ‘bridge’ material that summarises, distills and contextualises the important results and ideas in distributed systems theory, particularly material that does so without condescending.
Considering that gap led me to another interesting question:
What distributed systems theory should a distributed systems engineer know?
A little theory is, in this case, not such a dangerous thing.
So I tried to come up with a list of what I consider the basic concepts that are applicable to my everyday job as a distributed systems engineer.
Let me know what you think I missed!
These four readings do a pretty good job of explaining what is challenging about building distributed systems.
Collectively they outline a set of abstract but technical difficulties that the distributed systems engineer has to overcome, and set the stage for the more detailed investigation in later sections.
Distributed Systems for Fun and Profit - a short book which tries to cover some of the basic issues in distributed systems, including the role of time and different strategies for replication.
Notes on distributed systems for young bloods - not theory, but a good practical counterbalance to keep the rest of your reading grounded.
A Note on Distributed Systems - a classic paper on why you can’t just pretend all remote interactions are like local objects.
The fallacies of distributed computing - eight fallacies of distributed computing that set the stage for the kinds of things system designers forget.
You should know about _safety and liveness properties_:
Many difficulties that the distributed systems engineer faces can be blamed on two underlying causes:
There is a very deep relationship between what, if anything, processes share about their knowledge of _time_, what failure scenarios are possible to detect, and what algorithms and primitives may be correctly implemented.
Most of the time, we assume that two different nodes have absolutely no shared knowledge of what time it is, or how quickly time passes.
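One common way to reason about ordering when nodes share no knowledge of time is a logical clock, which counts events instead of reading a wall clock. The following is a minimal sketch of a Lamport-style clock in Python; the class and method names are illustrative assumptions, not something the text above prescribes.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Logical clock: a counter that orders events without wall-clock time."""
    time: int = 0

    def tick(self) -> int:
        # A local event advances the counter by one.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Sending is itself an event; the timestamp travels with the message.
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # Jump past the sender's timestamp so the receive is ordered after the send.
        self.time = max(self.time, msg_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()                    # local event on a: a.time == 1
stamp = a.send()            # a.time == 2, the message carries 2
b_time = b.receive(stamp)   # b.time == max(0, 2) + 1 == 3
print(stamp, b_time)        # prints "2 3"
```

Neither node ever asks what time it is; the only guarantee is that a receive is always stamped later than the matching send.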
You should know:
A system that tolerates some faults without degrading must be able to act as though those faults had not occurred.
This usually means that parts of the system must do work redundantly, but doing more work than is absolutely necessary typically carries a cost both in performance and resource consumption.
This is the basic tension of adding fault tolerance to a system.
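The tension can be made concrete with a small Python sketch. It assumes a hypothetical helper, `run_redundantly`, that asks every replica for the same answer and returns the majority value; a replica that returns `None` stands in for one that crashed. The names and the voting scheme are illustrative assumptions only.

```python
from __future__ import annotations

from collections import Counter
from typing import Callable, Optional

Replica = Callable[[int], Optional[int]]


def run_redundantly(replicas: list[Replica], x: int) -> tuple[int, int]:
    """Ask every replica for its answer; return (majority answer, calls made)."""
    answers = [replica(x) for replica in replicas]
    # None models a replica that crashed and never answered.
    votes = Counter(a for a in answers if a is not None)
    top = votes.most_common(1)
    if not top or top[0][1] <= len(replicas) // 2:
        raise RuntimeError("too many replicas failed to form a majority")
    return top[0][0], len(answers)


def healthy(x: int) -> Optional[int]:
    return x * x


def crashed(x: int) -> Optional[int]:
    return None


result, calls = run_redundantly([healthy, crashed, healthy], 4)
print(result, calls)  # prints "16 3"
```

The caller sees the correct answer as though no fault had occurred, but the system made three calls to do the work of one.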
You should know:
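(Within the majority, one node is the leader and the rest are followers. The sequence of write requests received by the leader is authoritative and serialised, and the leader unilaterally requires the followers to accept its sequence of write requests. Followers may not resist or object: they are assumed to be non-malicious, rule-abiding and non-Byzantine.)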
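The leader-and-followers arrangement in the note above can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `Leader` and `Follower` classes are hypothetical, followers either acknowledge or are crashed, and there is no leader election, retry or log repair, all of which a real protocol needs.

```python
class Follower:
    def __init__(self, up=True):
        self.up = up
        self.log = []

    def append(self, seq, value):
        if not self.up:
            return False  # a crashed follower never acknowledges
        # Followers accept the leader's order without objection (non-Byzantine).
        self.log.append((seq, value))
        return True


class Leader:
    def __init__(self, followers):
        self.followers = followers
        self.log = []        # every write the leader has sequenced
        self.committed = []  # sequence numbers acknowledged by a majority

    def write(self, value):
        seq = len(self.log)  # the leader alone decides the order of writes
        self.log.append((seq, value))
        # The leader counts itself as one acknowledgement.
        acks = 1 + sum(f.append(seq, value) for f in self.followers)
        cluster_size = 1 + len(self.followers)
        if acks > cluster_size // 2:
            self.committed.append(seq)
            return True
        return False


leader = Leader([Follower(), Follower(up=False)])
ok = leader.write("x=1")      # 2 of 3 nodes acknowledge: committed
lonely = Leader([Follower(up=False), Follower(up=False)])
stuck = lonely.write("x=1")   # 1 of 3 nodes acknowledge: not committed
print(ok, stuck)              # prints "True False"
```

With three nodes the write survives one crashed follower, but not two, because a majority is no longer reachable.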
There are few agreed-upon basic building blocks in distributed systems, but more are beginning to emerge. You should know what the following problems are, and where to find a solution for them:
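Broadcast - sending a message to the whole cluster at the same time.
Chain replication - organising the nodes into a virtual linked list, which gives a clean way to ensure the consistency and ordering of write requests.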
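As a rough illustration of chain replication, the sketch below links nodes into a list, sends writes in at the head and treats the tail as the point of acknowledgement. The `ChainNode` class and `build_chain` helper are hypothetical names, and the sketch ignores failures and chain reconfiguration entirely.

```python
class ChainNode:
    def __init__(self, name):
        self.name = name
        self.store = {}
        self.next = None  # successor in the chain; None marks the tail

    def write(self, key, value):
        # Apply locally, then forward down the chain in order.
        self.store[key] = value
        if self.next is None:
            return self.name  # the tail acknowledges the write
        return self.next.write(key, value)


def build_chain(names):
    """Link the nodes into a list and return (head, tail)."""
    nodes = [ChainNode(n) for n in names]
    for node, successor in zip(nodes, nodes[1:]):
        node.next = successor
    return nodes[0], nodes[-1]


head, tail = build_chain(["a", "b", "c"])
acked_by = head.write("x", 1)  # writes enter at the head
value = tail.store["x"]        # reads go to the tail
print(acked_by, value)         # prints "c 1"
```

Because every write passes through the nodes in the same order and is acknowledged only by the tail, anything readable at the tail has already been applied by every node before it.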
Some facts just need to be internalised. There are more than this, naturally, but here’s a flavour:
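(In an asynchronous system, assume that nodes crash and stop rather than crash and recover. Two goals cannot both be met: 1) the result is always correct, and 2) every write request returns a result within finite time. This is the FLP result.)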
The most important exercise to repeat is to read descriptions of new, real systems, and to critique their design decisions. Do this over and over again. Some suggestions:
If you tame all the concepts and techniques on this list, I’d like to talk to you about engineering positions working with the menagerie of distributed systems we curate at Cloudera.