Analysis of a Typical Distributed System: Bigtable

Date: 2019-11-16

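This is the third article in a series analyzing typical distributed systems. Its subject is Bigtable, a distributed storage system for structured data.

As a distributed storage system, Bigtable must, like other distributed systems, provide scalability, high availability, and high performance. In addition, Bigtable aims for wide applicability: it serves latency-sensitive, user-facing applications as well as high-throughput batch jobs.

Reading the whole paper, however, shows that Bigtable is built on several other Google systems, notably GFS and Chubby. GFS gives Bigtable scalable, reliable, highly available data storage, while Chubby keeps Bigtable's metadata highly available and strongly consistent. This design differs from GFS, analyzed earlier in this series, and from MongoDB, which I use day to day. In GFS and MongoDB the metadata server typically has two roles: maintaining metadata and centralized scheduling. In Bigtable the master is responsible only for scheduling.

Article URL: http://www.javashuo.com/article/p-seltsvwt-hv.html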

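Definition of Bigtable

The Bigtable paper dates from 2006, when relational databases still dominated. Some people online therefore say that Bigtable is hard to understand: it borrows terms from relational databases, such as row, column, and table, while its internals and usage differ greatly from those of a traditional relational database. It is 2018 now, though, and NoSQL is widely used, so today the paper is fairly easy to follow.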

A Bigtable is a sparse, distributed, persistent multidimensional sorted map.

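That is the paper's definition of Bigtable. Its key characteristics are sparse, distributed, multidimensional, and sorted map; one more keyword belongs on the list: structured.

The article understanding-hbase-and-bigtable explains each of these keywords in detail, with examples. Below, the terms are examined using the example from the paper:

The figure shows an example that stores web pages. Bigtable is a sorted map of key-value pairs: the key is (row:string, column:string, time:int64) and the value is an arbitrary string.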
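As a rough illustration of this data model, the sketch below treats a table as a map from (row, column, timestamp) to a string value and scans it in sorted key order. The class `TinyTable` and its methods are hypothetical names chosen for this example, not Bigtable's API, and a real system would not re-sort on every scan.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str, int]  # (row, column, timestamp)


class TinyTable:
    """Toy model of the (row, column, time) -> value map. Hypothetical API."""

    def __init__(self) -> None:
        self._cells: Dict[Key, str] = {}

    def put(self, row: str, column: str, ts: int, value: str) -> None:
        self._cells[(row, column, ts)] = value

    def get(self, row: str, column: str) -> Optional[str]:
        """Return the most recent version of a cell, or None if absent."""
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None

    def scan(self) -> List[Tuple[Key, str]]:
        """Cells ordered by row, then column, then newest version first."""
        return sorted(self._cells.items(),
                      key=lambda kv: (kv[0][0], kv[0][1], -kv[0][2]))


table = TinyTable()
table.put("com.cnn.www", "contents:", 3, "<html>v3")
table.put("com.cnn.www", "contents:", 5, "<html>v5")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
```

Here `table.get("com.cnn.www", "contents:")` returns the version written at timestamp 5, and a row that was never written simply has no entries, which is what makes the map sparse.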

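In the web-page example, the row is the URL, with the hostname reversed so that pages from the same site are stored as close together as possible. A column has the form column family:qualifier. In the figure, contents and anchor are column families, and a column family can contain one or more columns. The time component distinguishes versions written at different moments; based on it, Bigtable offers two garbage-collection policies: keep only the last n versions, or keep only versions that are new enough.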
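A minimal sketch of these two ideas follows. The helper names (`reverse_host`, `keep_last_n`, `keep_newer_than`) are hypothetical and only illustrate the behavior: reversing the hostname components makes pages of one domain sort next to each other, and the two policies prune a cell's list of versions.

```python
from typing import List, Tuple

Version = Tuple[int, str]  # (timestamp, value)


def reverse_host(host: str) -> str:
    """'maps.google.com' -> 'com.google.maps', so one domain sorts together."""
    return ".".join(reversed(host.split(".")))


def keep_last_n(versions: List[Version], n: int) -> List[Version]:
    """Keep only the n most recent versions, newest first."""
    return sorted(versions, reverse=True)[:n]


def keep_newer_than(versions: List[Version], min_ts: int) -> List[Version]:
    """Keep only versions with timestamp >= min_ts, newest first."""
    return sorted((v for v in versions if v[0] >= min_ts), reverse=True)


hosts = ["maps.google.com", "www.cnn.com", "mail.google.com"]
row_keys = sorted(reverse_host(h) for h in hosts)
versions = [(3, "v3"), (6, "v6"), (5, "v5")]
```

After sorting, the two google.com rows are adjacent in `row_keys`, which is the locality that the reversed key is meant to provide.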

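Bigtable holds structured data. Column families must be created when the table is defined, much like columns in a relational database. A table usually has only a few column families, but the columns within a family are added dynamically and can be numerous. In the example above, some pages are linked from a single site while others are linked from many, so although every row has the anchor column family, the number of columns it contains varies from row to row. This is why the table is called sparse.

Bigtable Storage

Bigtable is a distributed store, so scalability is the first problem to solve. How, then, does Bigtable partition its data?

The tablet is the basic unit of data partitioning and load balancing in Bigtable ("the unit of distribution and load balancing"). A tablet is roughly 100 MB to 200 MB in size, and the concept is equivalent to the chunk in GFS and MongoDB. Put simply, a tablet is a block of consecutive rows. Bigtable maintains the mapping from tablets to tablet servers, and data is also migrated in units of tablets.

Tablets use range-based partitioning: neighboring rows fall into the same tablet. Range-based partitioning is friendly to range queries; in the web-page example above, pages from the same site are kept close together. Its drawback is that writes can concentrate on a single tablet, so extra splits are needed to rebalance the load.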
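Range-based partitioning can be sketched as a sorted list of tablet end keys searched with binary search. The `TabletMap` class, its `locate` and `split` methods, and the server names are hypothetical and exist only for this illustration.

```python
import bisect
from typing import List


class TabletMap:
    """Maps row keys to tablets by range. Tablet i holds rows in
    (end_keys[i-1], end_keys[i]]; the last tablet is open-ended."""

    def __init__(self, end_keys: List[str], servers: List[str]) -> None:
        assert len(servers) == len(end_keys) + 1
        self.end_keys = end_keys      # sorted, inclusive upper bounds
        self.servers = servers        # servers[i] hosts tablet i

    def locate(self, row: str) -> int:
        """Index of the tablet whose range contains row."""
        return bisect.bisect_left(self.end_keys, row)

    def split(self, tablet: int, split_key: str, new_server: str) -> None:
        """Split a hot tablet at split_key (assumed to lie inside its range);
        the upper half of the range moves to new_server."""
        self.end_keys.insert(tablet, split_key)
        self.servers.insert(tablet + 1, new_server)


tablets = TabletMap(end_keys=["com.cnn.zzz", "com.google.zzz"],
                    servers=["ts1", "ts2", "ts3"])
```

Because neighboring keys land in the same tablet, a scan over one site touches few tablets, but a burst of writes to that site also lands on one server until `split` is called.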

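Inside a tablet, storage follows a design similar to an LSM (log-structured merge) tree: one memtable plus several SSTables (sorted string tables), as shown in the figure below:

The figure separates what is kept in memory from what is persisted to GFS: the memtable is an in-memory data structure, while the write-ahead log and the SSTables are persisted to GFS.

The memtable is easy to understand: it is a sorted dict. Once the memtable grows to a certain size, it is written to GFS as an SSTable.

The paper defines an SSTable as follows: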

a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings

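An SSTable is therefore also stored in sorted order, and it is the basic unit of physical storage for Bigtable data. Internally, an SSTable consists of multiple blocks (typically 64 KB each), with a block index stored at the end of the file. When an SSTable is opened, the block index is loaded into memory; a binary search over the index then finds the block that is needed, which speeds up disk reads. In special cases an SSTable can also be pinned entirely in memory.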
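The block-index lookup can be sketched as follows, assuming a toy layout in which the index stores the last key of each block. `ToySSTable` is a hypothetical name, and the two-entry blocks stand in for the 64 KB blocks on disk.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class ToySSTable:
    """Immutable sorted key/value file, modeled as a list of small blocks."""

    def __init__(self, items: Dict[str, str], block_size: int = 2) -> None:
        ordered: List[Tuple[str, str]] = sorted(items.items())
        # Blocks stand in for the 64 KB on-disk blocks.
        self.blocks = [ordered[i:i + block_size]
                       for i in range(0, len(ordered), block_size)]
        # Block index: last key of each block, held in memory once opened.
        self.index = [block[-1][0] for block in self.blocks]
        self.block_reads = 0

    def get(self, key: str) -> Optional[str]:
        i = bisect.bisect_left(self.index, key)   # binary search the index
        if i == len(self.blocks):
            return None                            # key is past the last block
        self.block_reads += 1                      # one simulated disk read
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None


sst = ToySSTable({"a": "1", "b": "2", "g": "3", "k": "4", "y": "5"})
```

A lookup reads at most one block: the index alone rules out keys beyond the last block, and otherwise points at the single block that could hold the key.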

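Writes are relatively simple: once a mutation has been appended to the commit log, it is inserted into the memtable. Reads, on the other hand, must merge the data in the memtable and the SSTables: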

A valid read operation is executed on a merged view of the sequence of SSTables and the memtable. Since the SSTables and the memtable are lexicographically sorted data structures, the merged view can be formed efficiently.

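Because recent writes live in memory, a given key may be in the memtable or in an SSTable at query time, and which SSTable holds it is not known in advance. As a simple example, suppose a tablet consists of a memtable and two SSTables (the first SSTable was created before the second):

First SSTable: a, k, z
Second SSTable: b, g, y
memtable: c, k, w

Looking up any key requires searching in the order (memtable, second SSTable, first SSTable). For key k, the lookup returns as soon as the memtable matches, even though the first SSTable also holds a k. For key g, the memtable misses and the second SSTable hits. For key m, every SSTable must be checked before the lookup can conclude that the key is absent. Two techniques speed this up: compaction and Bloom filters. Compaction reduces the number of SSTables a lookup has to read; a Bloom filter lets a read skip SSTables that cannot contain the key, so most lookups for nonexistent keys need no disk access.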
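The lookup order and the role of a Bloom filter can be sketched with the three tables from the example. This is an illustrative toy, not Bigtable's implementation: the classes `BloomFilter` and `Tablet` are hypothetical, the stored values such as "k@3" are invented because the example lists only keys, and the filter uses SHA-256 purely for convenience.

```python
import hashlib
from typing import Dict, Iterable, List, Optional


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, keys: Iterable[str], bits: int = 256,
                 hashes: int = 3) -> None:
        self.bits, self.hashes, self.mask = bits, hashes, 0
        for key in keys:
            for pos in self._positions(key):
                self.mask |= 1 << pos

    def _positions(self, key: str) -> List[int]:
        return [int(hashlib.sha256(f"{i}:{key}".encode()).hexdigest(), 16)
                % self.bits for i in range(self.hashes)]

    def might_contain(self, key: str) -> bool:
        return all((self.mask >> pos) & 1 for pos in self._positions(key))


class Tablet:
    def __init__(self, memtable: Dict[str, str],
                 sstables_newest_first: List[Dict[str, str]]) -> None:
        self.memtable = memtable
        self.sstables = sstables_newest_first
        self.filters = [BloomFilter(s) for s in self.sstables]
        self.sstable_probes = 0

    def get(self, key: str) -> Optional[str]:
        if key in self.memtable:              # newest data wins
            return self.memtable[key]
        for sst, bloom in zip(self.sstables, self.filters):
            if not bloom.might_contain(key):  # definitely absent: skip the read
                continue
            self.sstable_probes += 1
            if key in sst:
                return sst[key]
        return None


first = {"a": "a@1", "k": "k@1", "z": "z@1"}    # oldest SSTable
second = {"b": "b@2", "g": "g@2", "y": "y@2"}   # newer SSTable
tablet = Tablet({"c": "c@3", "k": "k@3", "w": "w@3"}, [second, first])
```

With this setup, `tablet.get("k")` is answered by the memtable without probing any SSTable, while `tablet.get("m")` returns None and usually skips both SSTables because their filters report the key as absent.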

There are several kinds of compaction:

minor compaction: When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS.
merging compaction: reads the contents of a few SSTables and the memtable, and writes out a new SSTable.
major compaction: A merging compaction that rewrites all SSTables into exactly one SSTable
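The three kinds of compaction can be sketched on a toy tablet. `CompactingTablet` and its tiny memtable limit are assumptions made for this example; plain dicts stand in for SSTables, and details such as deletion markers are not modeled.

```python
from typing import Dict, List


class CompactingTablet:
    """Toy tablet: a mutable memtable plus immutable SSTables, newest first."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable: Dict[str, str] = {}
        self.sstables: List[Dict[str, str]] = []   # index 0 is the newest
        self.memtable_limit = memtable_limit

    def write(self, key: str, value: str) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def minor_compaction(self) -> None:
        """Freeze the memtable and write it out as a new SSTable."""
        frozen, self.memtable = self.memtable, {}
        self.sstables.insert(0, dict(sorted(frozen.items())))

    def merging_compaction(self, count: int) -> None:
        """Merge the memtable and the `count` newest SSTables into one."""
        sources = [self.memtable] + self.sstables[:count]
        merged: Dict[str, str] = {}
        for table in reversed(sources):      # oldest first, newer overwrites
            merged.update(table)
        self.memtable = {}
        self.sstables[:count] = [dict(sorted(merged.items()))]

    def major_compaction(self) -> None:
        """A merging compaction over every SSTable, leaving exactly one."""
        self.merging_compaction(len(self.sstables))


t = CompactingTablet()
for key, value in [("a", "1"), ("k", "old"), ("b", "2"), ("k", "new"), ("c", "3")]:
    t.write(key, value)
```

After the five writes the tablet holds two SSTables and one memtable entry; calling `t.major_compaction()` leaves a single SSTable in which the newer value of k has replaced the older one.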

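For a detailed introduction to LSM-trees, see DDIA (Designing Data-Intensive Applications).

Bigtable System Architecture

The Building Blocks section of the paper lists the other components (services) that Bigtable relies on, the most important being GFS and Chubby. Bigtable itself has three parts: the master, the tablet servers, and the client. The overall architecture is shown in the following figure (from SlideShare):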

Chubby vs master

In Bigtable, Chubby provides the following functions:

to ensure that there is at most one active master at any time; --> only one master at any moment
to store the bootstrap location of Bigtable data (see Section 5.1); --> where the metadata starts
to discover tablet servers and finalize tablet server deaths; --> lifecycle monitoring of tablet servers
to store Bigtable schema information (the column family information for each table);
and to store access control lists.

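In a self-contained distributed storage system (GFS, MongoDB), the first three functions would all be provided by the metadata server. In Bigtable they have been moved into Chubby, which simplifies the design of the master itself.

The master's responsibilities are then mainly: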

assigning tablets to tablet servers,
detecting the addition and expiration of tablet servers
balancing tablet-server load
garbage collection of files in GFS.
In addition, it handles schema changes such as table and column family creations.

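In a Chinese-language guided translation of the Google File System paper (經典論文翻譯導讀之《Google File System》), the author concludes:

The common architectural pattern for distributed file systems is "centralized metadata control + distributed coordination and scheduling + partitioned storage".

The pattern has two roles: a coordination component and storage components. The coordination component handles centralized metadata control plus distributed coordination and scheduling; each storage component acts as a partition, responsible for the actual storage structure and for local reads and writes.

In Bigtable, Chubby handles centralized metadata control and the master handles distributed coordination and scheduling.

Metadata management: tablet location

As noted above, Chubby is in charge of centralized metadata control. Does that mean the location of every tablet is stored in Chubby? Clearly that would be impractical.

In fact, the system maintains tablet location information in a three-level hierarchy similar to a B+ tree: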

The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the location of all tablets in a special METADATA table. Each METADATA tablet contains the location of a set of user tablets.

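So Chubby stores only the location of the root tablet, which is a very small amount of data. The root tablet holds the locations of the METADATA tablets, and each METADATA tablet stores the locations of user tablets.

The system also takes steps to lighten the load on the tablet servers that host METADATA tablets. First, the SSTables of the METADATA tablets are kept in memory, so no disk access is needed. Second, the Bigtable client caches location metadata and also prefetches it, which reduces round trips.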

The client library caches tablet locations.
further reduce this cost in the common case by having the client library prefetch tablet locations
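The three-level lookup with a client-side cache can be sketched as below. Everything named here (`TABLETS`, `CHUBBY_FILE`, `find`, `Client`, the server addresses) is invented for illustration; each level is a sorted list of (end row, location) pairs, rows are assumed to be non-empty strings, and prefetching is not modeled.

```python
import bisect
from typing import Dict, List, Tuple

Level = List[Tuple[str, str]]   # sorted (end_row, location) entries
LAST = "\uffff"                 # sentinel end key for the final range

CHUBBY_FILE = "root@ts0"        # level 1: the Chubby file names the root tablet
TABLETS: Dict[str, Level] = {
    "root@ts0": [("m", "meta-1@ts1"), (LAST, "meta-2@ts2")],       # level 2
    "meta-1@ts1": [("f", "user-1@ts3"), ("m", "user-2@ts4")],      # level 3
    "meta-2@ts2": [("t", "user-3@ts5"), (LAST, "user-4@ts3")],
}


def find(level: Level, row: str, lower: str = "") -> Tuple[str, str, str]:
    """Return (start, end, location) of the range (start, end] holding row."""
    ends = [end for end, _ in level]
    i = bisect.bisect_left(ends, row)
    start = level[i - 1][0] if i > 0 else lower
    return start, level[i][0], level[i][1]


class Client:
    def __init__(self) -> None:
        self.cache: List[Tuple[str, str, str]] = []
        self.round_trips = 0

    def locate(self, row: str) -> str:
        for start, end, location in self.cache:
            if start < row <= end:
                return location                 # cache hit, no round trips
        self.round_trips += 3                   # Chubby, root, METADATA
        root = TABLETS[CHUBBY_FILE]
        meta_start, _, meta_location = find(root, row)
        entry = find(TABLETS[meta_location], row, lower=meta_start)
        self.cache.append(entry)
        return entry[2]


client = Client()
```

The first `client.locate("g")` costs three simulated round trips; a later `client.locate("h")` falls in the same cached range and costs none, which is why the servers hosting METADATA tablets see little traffic.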

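The single master

As the architecture figure above shows, the master in Bigtable is a stateless single point. Stateless means the master has no data of its own that must be persisted. A single point, however, raises the problem of a single point of failure.

First, the master is lightly loaded. The main reason is that Bigtable clients do not interact with the master directly (thanks to the master not maintaining the system's metadata). Tablet management, such as creation and migration, is not a high-frequency operation in any case.

Second, even if the master fails (because of a crash or a network partition), the system starts a new master, which rebuilds the metadata in memory (the mapping from tablets to tablet servers, and the set of unassigned tablets). The steps are: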

The master grabs a unique master lock in Chubby, which prevents concurrent master instantiations.
The master scans the servers directory in Chubby to find the live servers.
The master communicates with every live tablet server to discover what tablets are already assigned to each server.
The master scans the METADATA table to learn the set of tablets. Whenever this scan encounters a tablet that is not already assigned, the master adds the tablet to the set of unassigned tablets, which makes the tablet eligible for tablet assignment.

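Note steps 3 and 4: the metadata comes both from the tablet servers and from the METADATA table. On the one hand, some tablets are not yet assigned (for example, tablets produced by migration or left behind by a tablet server failure), and these appear only in the METADATA table. On the other hand, what the tablet servers report is guaranteed to be accurate at that moment.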
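The four recovery steps can be sketched with in-memory stand-ins. `FakeChubby`, `recover_master`, and the tablet and server names are hypothetical; real Chubby locks, sessions, and RPCs to tablet servers are not modeled.

```python
from typing import Dict, Optional, Set, Tuple


class FakeChubby:
    """Stand-in for Chubby: one exclusive master lock plus a servers directory."""

    def __init__(self, live_servers: Set[str]) -> None:
        self.master_lock_holder: Optional[str] = None
        self.servers_dir = set(live_servers)

    def try_acquire_master_lock(self, who: str) -> bool:
        if self.master_lock_holder is None:
            self.master_lock_holder = who
        return self.master_lock_holder == who


def recover_master(name: str, chubby: FakeChubby,
                   tablets_on_server: Dict[str, Set[str]],
                   metadata_table: Set[str]
                   ) -> Tuple[Dict[str, str], Set[str]]:
    """Rebuild the master's in-memory state following the four steps."""
    # Step 1: grab the unique master lock.
    if not chubby.try_acquire_master_lock(name):
        raise RuntimeError("another master is active")
    # Step 2: scan the servers directory for live tablet servers.
    live = sorted(chubby.servers_dir)
    # Step 3: ask each live server which tablets it already serves.
    assignment = {t: s for s in live for t in tablets_on_server[s]}
    # Step 4: scan METADATA; tablets no live server reported are unassigned.
    unassigned = set(metadata_table) - set(assignment)
    return assignment, unassigned


chubby = FakeChubby(live_servers={"ts1", "ts2"})          # ts3 has died
tablets_on_server = {"ts1": {"t1", "t2"}, "ts2": {"t3"}, "ts3": {"t4"}}
metadata_table = {"t1", "t2", "t3", "t4", "t5"}           # t5 was never assigned
assignment, unassigned = recover_master(
    "master-b", chubby, tablets_on_server, metadata_table)
```

In this run the tablet held by the dead server (t4) and the never-assigned tablet (t5) both end up in `unassigned`, which shows why the METADATA scan is needed in addition to asking the live servers.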

Bigtable lessons

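As an epoch-making, pioneering, and widely used distributed system, Bigtable ran into many problems in its design, implementation, and use, and it offers much that is worth pondering and borrowing; other people's experience can sharpen our own work. Bigtable's own summary follows:

(1) Failures and anomalies nobody expected

Beyond the familiar network partition and fail-stop node failure models, Bigtable also encountered: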

memory and network corruption,
large clock skew,
hung machines,
extended and asymmetric network partitions,
bugs in other systems that we are using (Chubby for example),
overflow of GFS quotas,
and planned and unplanned hardware maintenance.

(2) Think twice before acting; do not over-engineer

Another lesson we learned is that it is important to delay adding new features until it is clear how the new features will be used.

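First understand the real problem that users want solved behind a requirement. Sometimes the stated requirement is misleading, and the underlying need has to be dug out first.

(3) The importance of monitoring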

the importance of proper system-level monitoring (i.e., monitoring both Bigtable itself, as well as the client processes using Bigtable).

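I could not agree more. Especially now that service-oriented architectures and microservices are all the rage, operating a system without thorough monitoring is miserable. For frameworks and engines in particular, thorough monitoring is indispensable.

(4) Keep the design simple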

The most important lesson we learned is the value of simple designs.

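All of Google's big three (MapReduce, GFS, Bigtable) make this point: a simple design means better maintainability and fewer edge cases. Frankly, though, it is hard to appreciate "Simple is Better Than Complex" without having worked on a complex system.

Miscellaneous

Transaction support

In distributed systems, distributed transactions hurt performance and availability, so most systems provide only single-row atomic operations. Bigtable is no exception: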

Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row)　　

locality group

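A client can group multiple column families into a locality group. Each locality group is stored in its own SSTable, and a locality group can also be pinned in memory, as is done for the METADATA tablets mentioned earlier.

Storing each group in a separate SSTable means Bigtable in effect uses column-based storage, which is very useful for batch jobs and OLAP.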

Bigtable locality groups realize similar compression and disk read performance benefits observed for other systems that organize data on disk using column-based rather than row-based storage
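As a rough sketch of why separate SSTables per locality group help, the example below routes each column family to a group and lays out one sorted table per group. The schema in `LOCALITY_GROUPS`, the function `build_sstables`, and the sample rows are assumptions made for this illustration.

```python
from typing import Dict, Tuple

# Hypothetical schema: column family -> locality group.
LOCALITY_GROUPS = {"contents": "page", "language": "meta", "checksum": "meta"}

Cell = Tuple[str, str]  # (row, column)


def build_sstables(rows: Dict[str, Dict[str, str]]) -> Dict[str, Dict[Cell, str]]:
    """Lay out each locality group as its own sorted table (a dict here)."""
    sstables: Dict[str, Dict[Cell, str]] = {}
    for row, cells in rows.items():
        for column, value in cells.items():
            family = column.split(":", 1)[0]
            group = LOCALITY_GROUPS[family]
            sstables.setdefault(group, {})[(row, column)] = value
    return {g: dict(sorted(t.items())) for g, t in sstables.items()}


rows = {
    "com.cnn.www": {"contents:": "<html>cnn", "language:": "EN", "checksum:": "1f"},
    "com.bbc.www": {"contents:": "<html>bbc", "language:": "EN", "checksum:": "2a"},
}
sstables = build_sstables(rows)
# A job that only needs page metadata reads the small "meta" table
# and never touches the bulky page contents.
languages = [v for (_, col), v in sstables["meta"].items() if col == "language:"]
```

Reading `languages` scans only the "meta" table, so the large values in the "page" group are never loaded, which is the column-oriented benefit described in the quotation.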

Merged commit log

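To reduce the load on GFS and speed up commit log writes, a tablet server does not keep a separate commit log for each tablet; instead, all tablets on a tablet server share a single commit log file.

A shared commit log is less convenient when tablets must be recovered, however. Suppose a tablet server fails: the many tablets it hosted are reassigned to other tablet servers, and each of those target servers needs to read this commit log file to recover the state of its tablets. Having every one of them read the whole file is clearly impractical, so Bigtable first sorts the commit log with a parallel merge sort, which brings related entries together.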

We avoid duplicating log reads by first sorting the commit log entries in order of the keys <table; row name; log sequence number>.
In the sorted output, all mutations for a particular tablet are contiguous and can therefore be read efficiently with one disk seek followed by a sequential read.
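The effect of sorting the shared log can be sketched as follows. `LogEntry`, `sort_commit_log`, and `tablet_slice` are hypothetical names, the sort here is a single in-memory `sorted` call rather than a parallel one, and a tablet is modeled as a row range (start exclusive, end inclusive) of one table.

```python
import bisect
from typing import List, NamedTuple


class LogEntry(NamedTuple):
    table: str
    row: str
    seq: int        # log sequence number
    mutation: str


def sort_commit_log(entries: List[LogEntry]) -> List[LogEntry]:
    """Order entries by (table, row name, log sequence number)."""
    return sorted(entries, key=lambda e: (e.table, e.row, e.seq))


def tablet_slice(sorted_log: List[LogEntry], table: str,
                 start_row: str, end_row: str) -> List[LogEntry]:
    """Contiguous run of mutations for rows in (start_row, end_row] of table."""
    keys = [(e.table, e.row) for e in sorted_log]
    lo = bisect.bisect_right(keys, (table, start_row))
    hi = bisect.bisect_right(keys, (table, end_row))
    return sorted_log[lo:hi]


shared_log = [
    LogEntry("users", "bob", 1, "set a"),
    LogEntry("pages", "com.cnn", 2, "set b"),
    LogEntry("users", "amy", 3, "set c"),
    LogEntry("pages", "com.cnn", 4, "set d"),
    LogEntry("users", "bob", 5, "set e"),
]
sorted_log = sort_commit_log(shared_log)
bob_tablet = tablet_slice(sorted_log, "users", "b", "\uffff")
```

After sorting, the mutations for the tablet covering rows after "b" in the users table form one contiguous slice in sequence-number order, so a recovering server can replay them with a single sequential read instead of scanning the whole log.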