[MapReduce] Google三駕馬車：GFS、MapReduce和Bigtable

時間 2019-11-18

標籤 mapreduce google 三駕馬車 gfs bigtable 欄目 Hadoop 简体版

原文原文鏈接

　　聲明：此文轉載自博客開發團隊的博客，尊重原創工做。該文適合學分佈式系統以前，做爲背景介紹來讀。html

　　談到分佈式系統，就不得不提Google的三駕馬車:Google FS[1],MapReduce[2],Bigtable[3]。linux

　　雖然Google沒有公佈這三個產品的源碼，可是他發佈了這三個產品的詳細設計論文。並且，Yahoo資助的Hadoop也有按照這三篇論文的開源Java實現:Hadoop對應MapReduce, Hadoop Distributed File System (HDFS)對應Google FS,Hbase對應Bigtable。不過在性能上Hadoop比Google要差不少，參見表1。數據庫

　　表1：Hbase和BigTable性能比較(來源於http://wiki.apache.org/lucene-hadoop/Hbase/PerformanceEvaluation)apache

Experiment編程	HBase20070916緩存	BigTable架構
random reads併發	272app	1212負載均衡
random reads (mem)	Not implemented	10811
random writes	1460	8850
sequential reads	267	4425
sequential writes	1278	8547
Scans	3692	15385

　　如下分別介紹這三個產品：

1. Google FS

　　GFS是一個可擴展的分佈式文件系統，用於大型的、分佈式的、對大量數據進行訪問的應用。它運行於廉價的普通硬件上，提供容錯功能。

　　圖1 GFS Architecture

　　（1）GFS的結構

　　1. GFS的結構圖見圖1，由一個master和大量的chunkserver構成，

　　2. 不像Amazon Dynamo的沒有主的設計，Google設置一個主來保存目錄和索引信息，這是爲了簡化系統結果，提升性能來考慮的，可是這就會形成主成爲單點故障或者瓶頸。爲了消除主的單點故障Google把每一個chunk設置的很大(64M)，這樣，因爲代碼訪問數據的本地性，application端和master的交互會減小，而主要數據流量都是Application和chunkserver之間的訪問。

　　3. 另外，master全部信息都存儲在內存裏,啓動時信息從chunkserver中獲取。提升了master的性能和吞吐量，也有利於master當掉後，很容易把後備j機器切換成master。

　　4. 客戶端和chunkserver都不對文件數據單獨作緩存，只是用linux文件系統本身的緩存

　　「The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas.」

　　「Having a single master vastly simplifies our design and enables the master to make sophisticated chunk placement and replication decisions using global knowledge. However,we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.」

　　「Neither the client nor the chunkserver caches file data.Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues.(Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files and so Linux’s buffer cache already keeps frequently accessed data in memory.」

　　（2）GFS的複製

　　GFS典型的複製到3臺機器上，參看圖2

　　圖2 一次寫操做的控制流和數據流

　　(3) 對外的接口

　　和文件系統相似，GFS對外提供create, delete,open, close, read, 和 write 操做。另外，GFS還新增了兩個接口snapshot and record append，snapshot。有關snapshot的解釋：

　　「Moreover, GFS has snapshot and record append operations. Snapshot creates a copy of a file or a directory tree at low cost.

Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append.」

2. MapReduce

　　MapReduce是針對分佈式並行計算的一套編程模型。

　　講到並行計算，就不能不談到微軟的Herb Sutter在2005年發表的文章」 The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software」[6]，主要意思是經過提升cpu主頻的方式來提升程序的性能很快就要過去了，cpu的設計方向也主要是多核，超線程等併發上。可是之前的程序並不能自動的獲得多核的好處，只有編寫併發程序，才能真正得到多核的好處。分佈式計算也是同樣。

　　圖3 MapReduce Execution overview

　　1）MapReduce是由Map和reduce組成,來自於Lisp，Map是影射，把指令分發到多個worker上去，Reduce是規約，把Map的worker計算出來的結果合併。（參見圖3）

　　2）Google的MapReduce實現使用GFS存儲數據。

　　3）MapReduce可用於Distributed Grep,Count of URL Access Frequency,ReverseWeb-Link Graph,Distributed Sort,Inverted Index

3. Bigtable

　　就像文件系統須要數據庫來存儲結構化數據同樣，GFS也須要Bigtable來存儲結構化數據。

　　1）BigTable 是創建在 GFS ，Scheduler ，Lock Service 和 MapReduce 之上的。

　　2）每一個Table都是一個多維的稀疏圖

　　3）爲了管理巨大的Table，把Table根據行分割，這些分割後的數據統稱爲：Tablets。每一個Tablets大概有 100-200 MB，每一個機器存儲100個左右的 Tablets。底層的架構是：GFS。因爲GFS是一種分佈式的文件系統，採用Tablets的機制後，能夠得到很好的負載均衡。好比：能夠把常常響應的表移動到其餘空閒機器上，而後快速重建。