ZooKeeper: A Distributed Coordination Service for Distributed Applications
ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher-level services for synchronization, configuration maintenance, and groups and naming. It is designed to be easy to program to, and uses a data model styled after the familiar directory tree structure of file systems. It runs in Java and has bindings for both Java and C.
Coordination services are notoriously hard to get right. They are especially prone to errors such as race conditions and deadlock. The motivation behind ZooKeeper is to relieve distributed applications of the responsibility of implementing coordination services from scratch.
Design Goals
ZooKeeper is simple. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system. The name space consists of data registers - called znodes, in ZooKeeper parlance - and these are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in-memory, which means ZooKeeper can achieve high throughput and low latency numbers.
The ZooKeeper implementation puts a premium on high performance, highly available, strictly ordered access. The performance aspects of ZooKeeper mean it can be used in large, distributed systems. The reliability aspects keep it from being a single point of failure. The strict ordering means that sophisticated synchronization primitives can be implemented at the client.
ZooKeeper is replicated. Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble.
The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.
Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heartbeats. If the TCP connection to the server breaks, the client will connect to a different server.
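A minimal connection sketch using the standard Java client. The hostnames, port, timeout, and class name are placeholders: the constructor takes the whole ensemble as a connect string, the client picks one server from the list, and the default watcher is notified when the session is established.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class Connect {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper(
                "host1:2181,host2:2181,host3:2181", // the whole ensemble, not one server
                15000,                              // session timeout in milliseconds
                (WatchedEvent e) -> {               // default watcher sees connection events
                    if (e.getState() == Watcher.Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                });
        connected.await(); // block until the session is established
        System.out.println("connected, session 0x" + Long.toHexString(zk.getSessionId()));
        zk.close();
    }
}
```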
ZooKeeper is ordered. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions. Subsequent operations can use the order to implement higher-level abstractions, such as synchronization primitives.
ZooKeeper is fast. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.
The service ZooKeeper provides is built from three parts: the data structure (znodes), the primitives (the operations on that data structure), and the watcher mechanism.
The name space provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash (/). Every node in ZooKeeper's name space is identified by a unique path (a Unicode string).
Unlike standard file systems, each node in a ZooKeeper namespace can have data associated with it as well as children. It is like having a file system that allows a file to also be a directory. (ZooKeeper was designed to store coordination data: status information, configuration, location information, etc., so the data stored at each node is usually small, in the byte to kilobyte range.) We use the term znode to make it clear that we are talking about ZooKeeper data nodes.
Znodes maintain a stat structure that includes version numbers for data changes, ACL changes, and timestamps, to allow cache validations and coordinated updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data it also receives the version of the data.
The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.
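A sketch of how the version number supports optimistic concurrency, assuming a connected handle and a hypothetical znode /app1/config (class and method names are illustrative): the read returns the current version alongside the data, and the conditional write fails if another client got there first.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class VersionedUpdate {
    // "/app1/config" is a hypothetical znode used for illustration.
    static void updateIfUnchanged(ZooKeeper zk, byte[] newBytes) throws Exception {
        Stat stat = new Stat();
        byte[] current = zk.getData("/app1/config", false, stat); // read fills in stat
        try {
            // the write succeeds only if the version still matches, i.e. no
            // other client has modified the znode since our read
            zk.setData("/app1/config", newBytes, stat.getVersion());
        } catch (KeeperException.BadVersionException e) {
            // lost the race: re-read and retry, or surface the conflict
        }
    }
}
```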
ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created them is active; when the session ends, the znode is deleted. Although each ephemeral node is bound to a single client session, it remains visible to all clients, and ephemeral nodes may not have children. Ephemeral nodes are useful when you want to track which clients are currently alive, as in the group-membership sketch below.
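A minimal group-membership sketch, assuming a hypothetical /members parent znode: the entry vanishes automatically when the process dies or its session expires.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class GroupMember {
    // Registers this process under a hypothetical /members group node. The
    // ephemeral znode lives exactly as long as this client's session.
    static void join(ZooKeeper zk, String memberName) throws Exception {
        zk.create("/members/" + memberName,
                  new byte[0],                 // no payload needed for presence
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, // fully open ACL, for illustration only
                  CreateMode.EPHEMERAL);
    }
}
```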
A znode's type is fixed when it is created and cannot be changed afterwards.
A persistent node's lifetime does not depend on any session; it is removed only when a client deletes it explicitly.
A sequential node has a monotonically increasing counter appended to the requested path at creation time. The counter is unique among the children of the node's parent and is formatted as "%010d" (ten digits, zero-padded).
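A sketch of sequential creation, assuming a hypothetical /queue parent znode; create() returns the actual path carrying the counter ZooKeeper appended.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SequentialNodes {
    // Each create() appends a zero-padded 10-digit counter that is unique
    // among the children of the (hypothetical) /queue parent node.
    static void enqueueTwo(ZooKeeper zk) throws Exception {
        String a = zk.create("/queue/task-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        String b = zk.create("/queue/task-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        // e.g. a = "/queue/task-0000000000", b = "/queue/task-0000000001"
        System.out.println(a + ", " + b);
    }
}
```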
☛ Each znode consists of three parts: a stat structure (version and permission metadata), the data associated with the node, and its list of children.
ZooKeeper supports the concept of watches. Clients can set a watch on a znode. A watch will be triggered and removed when the znode changes. When a watch is triggered the client receives a packet saying that the znode has changed. And if the connection between the client and one of the ZooKeeper servers is broken, the client will receive a local notification. These can be used to [tbd].
A watch can be set by any of the read operations (exists, getChildren, getData). A client receives the watch event for a znode before it sees that znode's changed state.
Watches are maintained locally at the ZooKeeper server the client is connected to, which keeps them lightweight to set, maintain, and dispatch. Watches come in two kinds, data watches and child watches; the table below shows which event triggers the watch set by each read operation:
| Watch set by | create (znode) | create (child) | delete (znode) | delete (child) | setData (znode) |
|---|---|---|---|---|---|
| exists | NodeCreated | | NodeDeleted | | NodeDataChanged |
| getData | | | NodeDeleted | | NodeDataChanged |
| getChildren | | NodeChildrenChanged | NodeDeleted | NodeChildrenChanged | |
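A sketch of the usual pattern for working with one-shot watches, assuming a connected handle and a hypothetical config znode (the class name is illustrative): because a watch is removed once it fires, the handler re-registers it on every event.

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConfigWatcher implements Watcher {
    private final ZooKeeper zk;
    private final String path; // hypothetical config znode, e.g. "/app1/config"

    public ConfigWatcher(ZooKeeper zk, String path) throws Exception {
        this.zk = zk;
        this.path = path;
        zk.getData(path, this, new Stat()); // sets the initial data watch
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                // passing `this` as the watcher re-arms the one-shot watch
                byte[] data = zk.getData(path, this, new Stat());
                System.out.println("config changed: " + new String(data));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
```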
ZooKeeper is very fast and very simple. Since its goal, though, is to be a basis for the construction of more complicated services, such as synchronization, it provides a set of guarantees. These are:
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients' view of the system is guaranteed to be up-to-date within a certain time bound.
Each znode's stat structure contains the following fields:

| Field | Description |
|---|---|
| czxid | zxid of the transaction that created the node |
| mzxid | zxid of the transaction that last modified the node |
| ctime | time the node was created |
| mtime | time the node was last modified |
| version | version number of the node's data |
| cversion | version number of the node's children |
| aversion | version number of the node's ACL |
| ephemeralOwner | session id of the owner if the node is ephemeral, otherwise 0 |
| dataLength | length of the node's data, in bytes |
| numChildren | number of children the node has |
| pzxid | zxid of the most recent create or delete among the node's children |
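A sketch of reading these fields with the Java client; exists() returns the stat structure, or null if the znode is absent (the path and class name are placeholders).

```java
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class StatExample {
    static void printStat(ZooKeeper zk, String path) throws Exception {
        Stat s = zk.exists(path, false); // null if the znode does not exist
        if (s != null) {
            System.out.println("czxid=" + s.getCzxid()
                    + " mzxid=" + s.getMzxid()
                    + " version=" + s.getVersion()
                    + " cversion=" + s.getCversion()
                    + " ephemeralOwner=0x" + Long.toHexString(s.getEphemeralOwner())
                    + " dataLength=" + s.getDataLength()
                    + " numChildren=" + s.getNumChildren());
        }
    }
}
```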
One of the design goals of ZooKeeper is to provide a very simple programming interface. As a result, it supports only these operations:
| Operation | Description |
|---|---|
| create | creates a node at a location in the tree (the parent node must already exist) |
| delete | deletes a node (the znode must have no children) |
| exists | tests whether a node exists at a location and retrieves its metadata |
| get data | reads data from a node (getData, getACL, getChildren) |
| set data | writes data to a node (setData, setACL) |
| get children | retrieves a list of the children of a node |
| sync | waits for data to be propagated (synchronized across the servers) |
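A sketch of sync in use, assuming a connected handle and a placeholder path: because reads are served from the local replica of the connected server (as described below), a client that must observe the latest write can issue sync before reading.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.ZooKeeper;

public class SyncThenRead {
    // sync() brings the connected server's replica up to date with the
    // leader before the subsequent read; it is asynchronous, so we wait
    // for its callback. The path is hypothetical.
    static byte[] readLatest(ZooKeeper zk, String path) throws Exception {
        CountDownLatch synced = new CountDownLatch(1);
        zk.sync(path, (rc, p, ctx) -> synced.countDown(), null);
        synced.await();
        return zk.getData(path, false, null); // stat not needed here
    }
}
```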
ZooKeeper Components shows the high-level components of the ZooKeeper service. With the exception of the request processor, each of the servers that make up the ZooKeeper service replicates its own copy of each of the components.
The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.
Every ZooKeeper server services clients. Clients connect to exactly one server to submit requests. Read requests are serviced from the local replica of each server database. Requests that change the state of the service, write requests, are processed by an agreement protocol.
As part of the agreement protocol all write requests from clients are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery. The messaging layer takes care of replacing leaders on failures and syncing followers with leaders.
ZooKeeper uses a custom atomic messaging protocol. Since the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a write request, it calculates what the state of the system is when the write is to be applied and transforms this into a transaction that captures this new state.
Performance
Reliability