What are the classic papers in distributed systems?

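Thanks for the invite! Happy May Day!
In the Internet era, and especially since the arrival of big data, distributed systems have become an essential skill for every programmer. Excellent research and papers on distributed systems date back to the 1980s; here I list only the 15 papers from roughly the last 15 years that I consider most influential (15 within 15).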
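1. The Google File System: an epoch-making paper for distributed file systems. Its ideas of multi-replica storage, separating control flow from data flow, and append-oriented writes have become near-standard in the field. Its 5000+ citations give a sense of how far its influence reaches; Apache Hadoop's well-known HDFS is modeled on GFS;
2. MapReduce: Simplified Data Processing on Large Clusters: another major Google paper. With just two operations, Map and Reduce, it greatly reduces the complexity of distributed computation, so that any programmer who needs to can write a distributed program. The techniques it uses are well worth studying: simple but far from trivial! Hadoop built an open-source MapReduce based on this paper;
3. Bigtable: A Distributed Storage System for Structured Data: Google's distributed table system in the NoSQL space and one of the best showcases for the LSM tree, widely used for web index storage, YouTube data management, and other services. The corresponding open-source system in the Hadoop ecosystem is HBase (at my previous company I also built a comparable system called BladeCube, several times faster than HBase);
4. The Chubby lock service for loosely-coupled distributed systems: Google's distributed lock service, built on the Paxos protocol. Fewer people know this paper than the first three, but almost every backend engineer has worked with its open-source counterpart, ZooKeeper, so its influence is no smaller than theirs;
5. Finding a Needle in Haystack: Facebook's Photo Storage: Facebook's online photo storage system, currently one of the best solutions for small-file storage. Facebook now stores more than 300 PB of data in it; a senior schoolmate of mine works on this team and has told me many interesting stories (at my previous company I built a similar system, pallas, which supports not only replication but also Reed-Solomon/LRC erasure coding, with a number of performance optimizations);
6. Windows Azure Storage: a highly available cloud storage service with strong consistency: the overview paper for Windows Azure Storage and a very good description of a cloud storage architecture. Its idea of using layering to provide both availability and consistency has also inspired me a lot in real work;
7. GraphLab: A New Framework for Parallel Machine Learning: CMU's graph-based distributed machine learning framework, which has since spun off a dedicated company and is quite strong in distributed machine learning. Its single-machine version, GraphChi, needs only 2 to 3 minutes for matrix factorization at a dimension of millions;
8. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing: this is Spark, the most popular in-memory computing model of the last couple of years. RDDs and lineage greatly simplify the distributed computing framework; a few lines of Scala can often do what used to take over a thousand lines of MapReduce code, and it looks set to replace MapReduce;
9. Scaling Distributed Machine Learning with the Parameter Server: a major work by Mu Li, the rising star at Baidu. Most companies doing large-scale distributed learning today rely on a parameter server (ps). Its good scalability makes large-scale learning feasible in the big-data era, and Google's deep learning models are also trained with a parameter server. It is currently the most popular distributed learning framework, and Douban's open-source system Paracel is also an implementation of ps;
10. Dremel: Interactive Analysis of Web-Scale Datasets: Google's large-scale (near) real-time data analysis system, said to answer analysis queries over 1 PB of data in about 3 seconds. Internally it uses a query tree to speed up analysis. Its open-source implementation is Drill, and it is fairly influential in industry for real-time data analysis;
11. Pregel: a system for large-scale graph processing: Google's large-scale graph computing system, for a long time the main system used to compute Google's PageRank, with a large influence on open source as well (including GraphLab and GraphChi);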
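12. Spanner: Google's Globally-Distributed Database: the first truly global-scale distributed database, from Google. It discusses many design considerations around consistency; to keep things simple it uses GPS and atomic clocks to bound clock uncertainty (the paper reports a bound generally under 10 ms, not nanoseconds), which guarantees the time ordering of transactions. It is likewise a strong reference for distributed systems design;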
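13. Dynamo: Amazon’s Highly Available Key-value Store: Amazon's distributed NoSQL database, as significant to Amazon as Bigtable is to Google. Unlike Bigtable, Dynamo guarantees the AP of CAP and offers only a weak guarantee of C through vector clocks. The corresponding open-source system is Cassandra;
14. S4: Distributed Stream Computing Platform: Yahoo's stream processing system, one of the two most popular stream processing systems today (the other being Storm) and Yahoo's main ad computation platform;
15. Storm @Twitter: this system needs little introduction. It opened a new era of stream processing and is the first choice for stream processing at almost every company; definitely worth following.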
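To make the Map and Reduce operations from item 2 a bit more concrete, here is a minimal single-process sketch of word counting. It only illustrates the programming model; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example and are not the API of Google's MapReduce or of Hadoop, and a real framework would run the phases across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The user only writes map_phase and reduce_phase; everything else (grouping, scheduling, retries) is what the framework takes off the programmer's hands.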
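Over the last year or two my main focus has shifted to machine learning and I have done less research on distributed systems, so for now I will stop at these 15 papers, which cover the main areas of distributed systems. If I think of anything missing I will add it. Good luck!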

----------------------------------------------Divider-------------------------------------------------------
Two papers mentioned in the comments are also quite good, so I am adding them here.
1. Large-scale cluster management at Google with Borg;
2. F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business.

Edited on 2015-05-08

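1. Background knowledge
Architecture

Systems and networking
Communication: RPC, RMI, MOM...

Processes and threads:
User mode, kernel mode; lightweight processes; coroutines; Actor...

Distributed-systems problems
Synchronization and mutual exclusion: ensure that conflicting concurrent processes can share resources
Double-checked locking, Immutable Value, Future...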
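As a small illustration of the double-checked locking pattern named above, here is a sketch using Python threads. LazyConfig is a hypothetical class invented for this example; the point is only the check, lock, check-again shape, not any particular library's implementation.

```python
import threading


class LazyConfig:
    """Hypothetical holder that is created once, on first use."""

    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get(cls):
        # First check without the lock: the cheap path once initialized.
        if cls._instance is None:
            with cls._lock:
                # Second check under the lock: another thread may have
                # created the instance while we were waiting.
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance
```

The second check is what makes the pattern correct: without it, two threads that both passed the first check would each create an instance.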

Event demultiplexing and dispatching: Reactor, Proactor...

Election: choose one process from a set of processes to perform a special task

2. Distributed theory
Data structures
B-tree
Log-structured merge tree (LSM tree)
Merkle tree
Consistent hashing
DHT
Vector clock
Lock-free data structures
....
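Of the data structures listed above, consistent hashing is easy to show in a few lines. The sketch below is one common way to build it (a sorted ring with virtual nodes); the class name HashRing, the choice of MD5, and the default of 64 virtual nodes are assumptions made for this example, not something prescribed by any of the papers.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a point on the ring (a large integer)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes   # virtual nodes per physical node (assumed value)
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place the node's virtual nodes on the ring."""
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        """Drop every ring position that belongs to the node."""
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        """Return the first node clockwise from the key's position."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

The property that matters is that removing a node only moves the keys that node owned, and adding a node only moves keys onto the new node; all other keys stay where they were.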

CAP, BASE
CAP: Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services
BASE: An ACID Alternative

State and time ordering
Time, Clocks, and the Ordering of Events in a Distributed System
Virtual Time and Global States of Distributed Systems
Distributed Snapshots: Determining Global States of a Distributed System
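The core rule of the logical clock from the first paper above fits in a few lines. This is a minimal sketch; the class and method names are chosen for this example.

```python
class LamportClock:
    """Scalar logical clock: one counter per process."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event or send: advance the counter and return the timestamp."""
        self.time += 1
        return self.time

    def receive(self, message_time):
        """Receive: move past the sender's timestamp, then count the event."""
        self.time = max(self.time, message_time) + 1
        return self.time
```

If process A sends a message stamped with a.tick() and process B handles it with b.receive(...), the receive timestamp is always larger than the send timestamp, which is the "happened before" guarantee. The converse does not hold: a smaller timestamp does not prove that one event happened before another.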

2PC, 3PC, Paxos...
A Brief History of Consensus, 2PC and Transaction Commit
Paxos Made Simple
Paxos Made Practical
Paxos Made Live: An Engineering Perspective
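For orientation before reading the papers above, here is a sketch of the happy path of two-phase commit. Participant and two_phase_commit are hypothetical in-memory stand-ins written for this example; the sketch deliberately leaves out logging, timeouts, and coordinator failure, which are exactly the hard parts the readings discuss.

```python
class Participant:
    """Hypothetical in-memory participant; a real one would log to disk."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the local work could be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (voting): ask every participant for its vote.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2 (completion): broadcast the single global decision.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return "committed" if decision else "aborted"
```

A single "no" vote aborts the transaction everywhere, so all participants end in the same state.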

Consistency and transactions
Life beyond Distributed Transactions: an Apostate’s Opinion
Impossibility of distributed consensus with one faulty process.
Consensus on Transaction Commit.
Uniform consensus is harder than consensus

3. Distributed systems
Distributed infrastructure
Message queues
RabbitMQ, ZeroMQ...

Distributed lock services and coordination
The Chubby lock service for loosely-coupled distributed systems
ZooKeeper

Cluster monitoring
The ganglia distributed monitoring system: design, implementation, and experience
Chukwa: A large-scale monitoring system

Distributed storage systems
Distributed file systems
The Google file system.
Lustre
Ceph
Panasas

Distributed block storage
Sheepdog
Parallax
Petal

Distributed key-value stores
Dynamo: Amazon’s highly available key-value store

Distributed table systems
Amazon DynamoDB
Bigtable: A Distributed Storage System for Structured Data.

Distributed databases
Spanner: Google's Globally-Distributed Database

Distributed computing
Map-Reduce
MapReduce: Simplified Data Processing on Large Clusters

In-memory computing
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Stream processing
S4: Distributed Stream Computing Platform
Twitter Storm

Graph processing
GraphLab: A New Framework for Parallel Machine Learning
Pregel: a system for large-scale graph processing
4. Distributed applications
Images, video, etc.
Finding a Needle in Haystack: Facebook's Photo Storage

Search
Web search for a planet: The Google cluster architecture

IM

Edited on 2015-05-02

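Distributed systems is a very large field with many directions.
Since you are already about to read papers, you presumably have some foundation.

The University of Illinois course Advanced Distributed Systems lists the important papers for each direction (updated Spring 2015), which makes a useful reference (I list only the main papers; you can look up the optional ones yourself).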
https://courses.engr.illinois.edu/cs525/sched.htm

Before, There Were Clouds

Historical reflections: The rise, fall, and resurrection of software as a service, M. Campbell-Kelly, CACM, May 2009.
Above the clouds (see the latest version of the paper on the site), M. Armbrust et al, Berkeley RADLAB, 2009.

•Larry Ellison's Rant on Cloud Computing (Youtube video)

You can join the Google Groups on Cloud Computing

Cloud Computing Continued

MapReduce: Simplified Data Processing on Large Clusters, J. Dean et al, OSDI 2004 (Google)
Grid: a new infrastructure for 21st century science, I. Foster, Physics Today, 2002 (Argonne)

P2P Systems

The Gnutella protocol specification v 0.4

P2P Systems (contd.)

Chord: a scalable peer-to-peer lookup service for Internet applications, I. Stoica et al, SIGCOMM 2001
Pastry: scalable, distributed object location and routing for large-scale peer-to-peer systems, A. Rowstron et al, Middleware 2001.
Kelips, I. Gupta et al, IPTPS 2003

Key-value Stores and NoSQL

Others: MongoDB

Basic Distributed Algorithms Fundamentals and Sensor Networks

Time, clocks and the ordering of events in a distributed system, L. Lamport, Communications ACM 1978
Distributed snapshots: determining global states of distributed systems, Chandy and Lamport, ACM TOCS 1985
Impossibility of distributed consensus with one faulty process, Fischer, Lynch and Paterson, Journal ACM 1985

Paxos and Committing (please don't review the first paper)

(Indy will briefly present this paper) Paxos Made Simple, L. Lamport.

Paxos Quorum Leases: Fast Reads Without Sacrificing Writes, Iulian Moraru, David G. Andersen, Michael Kaminsky, SoCC 2014
Low-latency multi-datacenter databases using replicated commit, H. Mahmoud et al, VLDB 2013.

Cloud Programming

Hive - a warehousing solution over a map-reduce framework A. Thusoo et al, VLDB 2009
Storm (use the wiki or other web resources)

Stream Processing

Adaptive Stream Processing using Dynamic Batch Sizing, Tathagata Das, Yuan Zhong, Ion Stoica, Scott Shenker, SoCC 2014
Stream: The Stanford data stream management system,A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, J. Widom, Technical Report, Stanford University, 2004.

Somewhat Consistent

GentleRain: Cheap and Scalable Causal Consistency with Physical Clocks, Jiaqing Du, Calin Iorgulescu, Amitabha Roy, Willy Zwaenepoel, SoCC 2014
A Self-Configurable Geo-Replicated Cloud Storage System, Masoud Saeida Ardekani, and Douglas B. Terry,OSDI 2014

Litmus Tests

Salt: Combining ACID and BASE in a Distributed Database, Chao Xie, Chunzhi Su, Manos Kapritsos, Yang Wang, Navid Yaghmazadeh, Lorenzo Alvisi, and Prince Mahajan, OSDI 2014
Extracting More Concurrency from Distributed Transactions, Shuai Mu, Yang Cui, Yang Zhang, Wyatt Lloyd, Jinyang Li, OSDI 2014

Adaptivity

Starfish: a self-tuning system for big data analytics, H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, S. Babu, CIDR 2011.
Distributed Autonomous Virtual Resource Management in Datacenters Using Finite-Markov Decision Process, Liuhua Chen, Haiying Shen, Karan Sapra, SoCC 2014

Blowing Hot and Cold: Storage

f4: Facebook’s Warm BLOB Storage System, Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang, Sanjeev Kumar, OSDI 2014
Pelican: A Building Block for Exascale Cold Data Storage, Shobana Balakrishnan, Richard Black, Austin Donnelly, Paul England, Adam Glass, Dave Harper, and Sergey Legtchenko, Aaron Ogus, Eric Peterson and Antony Rowstron, OSDI 2014.

Reliability

Limplock: Understanding the Impact of Limpware on Scale-Out Cloud Systems, T. Do et al, SOCC 2013
Heading Off Correlated Failures through Independence-as-a-Service, Ennan Zhai, Ruichuan Chen, David Isaac Wolinsky, Bryan Ford, OSDI 2014.

A Touch of Sensor Nets

Directed diffusion: A scalable and robust communication paradigm for sensor networks, C. Intanagonwiwat et al, Mobicom 2000
A review of current routing protocols for ad hoc mobile wireless networks, E.M. Royer et al, IEEE Personal Communications 1999

Graph Processing

LFGraph: Simple and Fast Distributed Graph Analytics, I. Hoque, I. Gupta, TRIOS 2013
GraphX: Graph Processing in a Distributed Dataflow Framework, Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, Ion Stoica, OSDI 2014.

Latency is King

Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency, Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, Steven D. Gribble, SoCC 2014
PriorityMeister: Tail Latency QoS for Shared Networked Storage, Timothy Zhu, Alexey Tumanov, Michael A. Kozuch, Mor Harchol-Balter, Gregory R. Ganger, SoCC 2014

There's a P2P App for That

Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility, A. Rowstron et al, SOSP 2001
Ivy: A Read/Write Peer-to-Peer File System, Athicha Muthitacharoen, Robert Morris, Thomer M. Gil, and Benjie Chen, OSDI 2002

Process it In-network

TAG: A Tiny Aggregation service for ad-hoc sensor networks, S. Madden, et al, OSDI 2002
Synopsis diffusion for robust aggregation in sensor networks, S. Nath et al, ACM TOSN, 2008.

How does it Really Behave?

Measurement, modeling, and analysis of a peer-to-peer file-sharing workload, Krishna P. Gummadi et al, SOSP 2003
Understanding availability, R. Bhagwan et al, IPTPS 2003
Measurement and Modeling of a Large-scale Overlay for Multimedia Streaming, L. Vu, I. Gupta, J. Liang, K. Nahrstedt, QShine 2007
An Evaluation of Amazon's Grid Computing Services: EC2, S3 and SQS, Simson Garfinkel, Harvard TechRep., 2007
What do Real-Life Hadoop Workloads Look Like, Cloudera Blog

Low Fees Required - Probabilistic Membership

A gossip-style failure detection service, R. van Renesse et al, Middleware 1998
SWIM: Scalable Weakly-consistent Infection-style process group Membership protocol, A. Das et al, DSN 2002
On scalable and efficient distributed failure detectors, I. Gupta et al, PODC 2001

Cluster Scheduling

The Power of Choice in Data-Aware Cluster Scheduling, Shivaram Venkataraman, Aurojit Panda, Ganesh Ananthanarayanan, Michael J. Franklin, Ion Stoica, OSDI 2014.
Reservation-based scheduling: if you're late don't blame us! Carlo Curino, Djellel Difallah, Chris Douglas, Subru Krishnan, Raghu Ramakrishnan, Sriram Rao, SoCC 2014

Distributed Machine Learning

Project Adam: Building an Efficient and Scalable Deep Learning Training System, Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman, OSDI 2014
Scaling Distributed Machine Learning with the Parameter Server, Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, Bor-Yiing Su, OSDI 2014

Now Emerging

Apache Hadoop YARN: Yet Another Resource Negotiator, V. K. Vavilapalli, A. C Murthy et al, SOCC 2013
C-Hint: An Effective and Reliable Cache Management for RDMA-Accelerated Key-Value Stores, Yandong Wang, Xiaoqiao Meng, Li Zhang, Jian Tan, SoCC 2014

So Much Data!

What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems, Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, Anang D. Satria, SoCC 2014
Heterogeneity and dynamicity of clouds at scale: Google trace analysis, C. Reiss et al, SoCC 2012

Spreading the Rumor

Bimodal multicast, K Birman et al, ACM TOCS 1999
Epidemic algorithms for replicated database maintenance, A. Demers et al, PODC 1987.

How do Networks Look?

Exploring complex networks, Steven Strogatz, Nature 2001
Scaling properties of the Internet graph, A. Akella et al, PODC 2003
Mapping the Gnutella network, M. Ripeanu et al, IEEE Internet Computing Journal 2002

Edited on 2015-11-04

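"Classic" means validated by time.
The top-ranked answer lists its author's own picks for classics of the last 15 years, and many of them were published after 2010. Admittedly these are currently hot topics in distributed systems, but I think only a small part of them (1, 2, 3, 12, 13) can be called classics. The other papers are not badly written, but in my personal opinion they still fall short of classic.

The point of reading classics is to grasp the most fundamental ideas of a field: to know not only what works but why. Take Chubby: before reading about the implementation, shouldn't you first look at what the Paxos algorithm itself is?

In fact, graduate distributed systems courses at the better US universities generally have a reading list, and those lists are pretty much the classics.
For example CMU's: 15-712 Syllabus. If you had to pick the 30 most classic distributed systems papers from the 1970s to today, it would be roughly these. You will see that many of the papers on it are very old. Why still read them? Because their ideas have been inherited, and these papers help you understand the why. Of course, other answers have also mentioned some of the papers on it (such as Leslie Lamport's Paxos).

For those interested in databases, have a look at this: Reading List // 15-799 :: Advanced Topics in Database Systems (Fall 2013). Reynold Xin maintains this one: rxin/db-readings · GitHub

Posted on 2015-08-08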

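I think distributed systems does not really have a very clear knowledge map; it is more that people ran into different problems and came up with different solutions. So it is hard to name papers that are truly classic and foundational. As it happens, this semester I took Introduction to Distributed Systems, taught by Prof. Chen Kang in our department. Each lecture covers one or two papers; it was very interesting and I learned a lot. So here I share the papers covered in the course. They may not match exactly what the asker wants; for reference only.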
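1. GFS. One of Google's "troika", the distributed file system. Without question this should be the most classic paper in distributed systems; almost every topic in distributed systems, storage, and big data mentions it.
2. BigTable. One of Google's troika, the classic distributed key/value store. My understanding is that this kind of application is a simplified database. Its implementation resembles the multi-level page tables of an operating system.
3. Dynamo. A distributed key/value store developed by Amazon, but very different from Bigtable in both design and properties. It builds on consistent hashing, the idea behind DHTs (distributed hash tables), so that adding or removing nodes incurs a lower migration cost.
4. MapReduce. A distributed computing framework, and also one of Google's troika. It abstracts all distributed operations into two kinds, Map and Reduce, which makes programming very simple: you only need to implement these two interfaces. This is probably the earliest and most influential distributed computing framework, freeing programmers from hand-writing raw MPI programs.
5. Spark. A distributed computing framework and now a darling of the big-data era; it and MapReduce are probably the two most widely used computing frameworks. MapReduce goes through disk on every iteration, while Spark keeps data in memory, so it can be up to two orders of magnitude faster.
6. Dryad. A distributed computing framework from Microsoft. It was proposed early, but unfortunately has less influence than the previous two. Its interface abstracts a distributed computation as a directed acyclic graph; the programmer implements the computation at each vertex and the data transfer along each edge. It is more complex than MapReduce, but also more flexible.
7. Raft. A consensus protocol proposed in 2014, meant to replace Paxos because the latter is so complex and hard to understand (Lamport would say you are all amateurs). A classic model in distributed systems is the replicated state machine, and Raft is used to keep the replicas of that state machine consistent. The labs of MIT 6.824 (6.824 Home Page: Spring 2016) are basically built around Raft.
8. Time and vector clocks. It is hard to find a global time in a distributed system, because the clocks of different machines disagree. So Lamport proposed logical clocks, later generalized to vector clocks by Fidge and Mattern, to express the relative order of events in a distributed system.
9. Distributed Snapshot. Snapshots of a distributed system are another classic problem: when the system takes a snapshot, the machines are not synchronized and some messages are still in flight on the network, so obtaining a correct snapshot is hard. This paper gives a solution that can be proven correct.
10. Concurrency Control & Transaction. Strictly speaking this is not a paper but a book, Concurrency Control and Recovery in Database Systems by Bernstein, Hadzilacos, and Goodman (available from Microsoft Research), but it already has more than 5000 citations. It explains from a theoretical angle what a transaction is and how to guarantee serializability and recoverability.
11. Two-phase locking (2PL). Also from the book above. 2PL is a protocol; following it guarantees serializability, so that multiple transactions operating at the same time do not produce incorrect results.
12. OCC, optimistic concurrency control. How to process multiple transactions concurrently and correctly without taking locks. Also a classic paper from the database field.
13. Two-phase commit (2PC). A protocol for distributed transactions that ensures a transaction either commits on all machines or fails on all of them.
14. Byzantine fault tolerance. Raft and Paxos assume that all machines follow the correct logic and can only fail by stopping; Byzantine algorithms assume that some machines may be hijacked and deliberately disrupt normal operation, and they solve the consensus problem in that setting.
15. Memory Coherence in Shared Virtual Memory Systems. This belongs under distributed consistency, except that the application is distributed shared memory: provide a uniform interface so that all machines see the same memory space, while underneath there is a mapping from virtual memory to physical memory. The key issue is consistency across machines; the model used here is sequential consistency.
16. Lazy release consistency for software distributed shared memory. The same problem as above, distributed shared memory, but using release consistency.
17. Bayou. A replicated storage system for mobile devices, whose example application is a meeting-room scheduler. Using this system as an example, it realizes a very important concept in distributed systems: eventual consistency. Some things we see in daily life, such as different people seeing WeChat messages in different orders, are quite possibly due to eventual consistency.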
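Since vector clocks come up in the list above, here is a small sketch of how they order events. Clocks are plain dicts from node name to counter; the function names (vc_increment, vc_merge, vc_compare) are made up for this example.

```python
def vc_increment(clock, node):
    """Return a copy of clock with node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def vc_merge(a, b):
    """Element-wise maximum, used when a message is received."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


def vc_compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```

For example, two events on different nodes that never exchanged a message compare as "concurrent", which a single scalar clock cannot express. Once node B merges A's clock and then increments its own entry, A's earlier event compares as "before" B's new one.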
