ZooKeeper: A Distributed Coordination Service for Distributed Applications
ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher-level services for synchronization, configuration maintenance, and groups and naming. It is designed to be easy to program to, and uses a data model styled after the familiar directory tree structure of file systems. It runs in Java and has bindings for both Java and C.
Coordination services are notoriously hard to get right. They are especially prone to errors such as race conditions and deadlock. The motivation behind ZooKeeper is to relieve distributed applications of the responsibility of implementing coordination services from scratch.
Design Goals
ZooKeeper is simple. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system. The name space consists of data registers - called znodes, in ZooKeeper parlance - and these are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in-memory, which means ZooKeeper can achieve high throughput and low latency numbers.
The ZooKeeper implementation puts a premium on high performance, highly available, strictly ordered access. The performance aspects of ZooKeeper mean it can be used in large, distributed systems. The reliability aspects keep it from being a single point of failure. The strict ordering means that sophisticated synchronization primitives can be implemented at the client.
ZooKeeper is replicated. Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble.
The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.
Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heartbeats. If the TCP connection to the server breaks, the client will connect to a different server.
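A minimal connection sketch using the standard Java client. The hostnames, port, timeout, and class name are placeholders: the constructor takes the whole ensemble as a connect string, the client picks one server from the list, and the default watcher is notified when the session is established.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class Connect {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper(
                "host1:2181,host2:2181,host3:2181", // the whole ensemble, not one server
                15000,                              // session timeout in milliseconds
                (WatchedEvent e) -> {               // default watcher sees connection events
                    if (e.getState() == Watcher.Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                });
        connected.await(); // block until the session is established
        System.out.println("connected, session 0x" + Long.toHexString(zk.getSessionId()));
        zk.close();
    }
}
```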
ZooKeeper is ordered. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions. Subsequent operations can use the order to implement higher-level abstractions, such as synchronization primitives.
ZooKeeper is fast. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.
The service ZooKeeper provides is built from three parts: the data structure (znodes), the primitives (the operations on that data structure), and the watcher mechanism.
The name space provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash (/). Every node in ZooKeeper's name space is identified by a unique path (a Unicode string).
Unlike standard file systems, each node in a ZooKeeper namespace can have data associated with it as well as children. It is like having a file system that allows a file to also be a directory. (ZooKeeper was designed to store coordination data: status information, configuration, location information, etc., so the data stored at each node is usually small, in the byte to kilobyte range.) We use the term znode to make it clear that we are talking about ZooKeeper data nodes.
Znodes maintain a stat structure that includes version numbers for data changes, ACL changes, and timestamps, to allow cache validations and coordinated updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data it also receives the version of the data.
The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.
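A sketch of how the version number supports optimistic concurrency, assuming a connected handle and a hypothetical znode /app1/config (class and method names are illustrative): the read returns the current version alongside the data, and the conditional write fails if another client got there first.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class VersionedUpdate {
    // "/app1/config" is a hypothetical znode used for illustration.
    static void updateIfUnchanged(ZooKeeper zk, byte[] newBytes) throws Exception {
        Stat stat = new Stat();
        byte[] current = zk.getData("/app1/config", false, stat); // read fills in stat
        try {
            // the write succeeds only if the version still matches, i.e. no
            // other client has modified the znode since our read
            zk.setData("/app1/config", newBytes, stat.getVersion());
        } catch (KeeperException.BadVersionException e) {
            // lost the race: re-read and retry, or surface the conflict
        }
    }
}
```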
ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created them is active; when the session ends, the znode is deleted. Although each ephemeral node is bound to a single client session, it remains visible to all clients, and ephemeral nodes may not have children. Ephemeral nodes are useful when you want to track which clients are currently alive, as in the group-membership sketch below.
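A minimal group-membership sketch, assuming a hypothetical /members parent znode: the entry vanishes automatically when the process dies or its session expires.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class GroupMember {
    // Registers this process under a hypothetical /members group node. The
    // ephemeral znode lives exactly as long as this client's session.
    static void join(ZooKeeper zk, String memberName) throws Exception {
        zk.create("/members/" + memberName,
                  new byte[0],                 // no payload needed for presence
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, // fully open ACL, for illustration only
                  CreateMode.EPHEMERAL);
    }
}
```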
A znode's type is fixed when it is created and cannot be changed afterwards.
A persistent node's lifetime does not depend on any session; it is removed only when a client deletes it explicitly.
A sequential node has a monotonically increasing counter appended to the requested path at creation time. The counter is unique among the children of the node's parent and is formatted as "%010d" (ten digits, zero-padded).
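A sketch of sequential creation, assuming a hypothetical /queue parent znode; create() returns the actual path carrying the counter ZooKeeper appended.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SequentialNodes {
    // Each create() appends a zero-padded 10-digit counter that is unique
    // among the children of the (hypothetical) /queue parent node.
    static void enqueueTwo(ZooKeeper zk) throws Exception {
        String a = zk.create("/queue/task-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        String b = zk.create("/queue/task-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        // e.g. a = "/queue/task-0000000000", b = "/queue/task-0000000001"
        System.out.println(a + ", " + b);
    }
}
```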
☛ Each znode consists of three parts: a stat structure (version and permission metadata), the data associated with the node, and its list of children.
ZooKeeper supports the concept of watches. Clients can set a watch on a znode. A watch will be triggered and removed when the znode changes. When a watch is triggered the client receives a packet saying that the znode has changed. And if the connection between the client and one of the ZooKeeper servers is broken, the client will receive a local notification. These can be used to [tbd].
A watch can be set by any of the read operations (exists, getChildren, getData). A client receives the watch event for a znode before it sees that znode's changed state.
Watches are maintained locally at the ZooKeeper server the client is connected to, which keeps them lightweight to set, maintain, and dispatch. Watches come in two kinds, data watches and child watches; the table below shows which event triggers the watch set by each read operation:
| Watch set by | create (znode) | create (child) | delete (znode) | delete (child) | setData (znode) |
|---|---|---|---|---|---|
| exists | NodeCreated | | NodeDeleted | | NodeDataChanged |
| getData | | | NodeDeleted | | NodeDataChanged |
| getChildren | | NodeChildrenChanged | NodeDeleted | NodeChildrenChanged | |
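A sketch of the usual pattern for working with one-shot watches, assuming a connected handle and a hypothetical config znode (the class name is illustrative): because a watch is removed once it fires, the handler re-registers it on every event.

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConfigWatcher implements Watcher {
    private final ZooKeeper zk;
    private final String path; // hypothetical config znode, e.g. "/app1/config"

    public ConfigWatcher(ZooKeeper zk, String path) throws Exception {
        this.zk = zk;
        this.path = path;
        zk.getData(path, this, new Stat()); // sets the initial data watch
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                // passing `this` as the watcher re-arms the one-shot watch
                byte[] data = zk.getData(path, this, new Stat());
                System.out.println("config changed: " + new String(data));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
```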
ZooKeeper is very fast and very simple. Since its goal, though, is to be a basis for the construction of more complicated services, such as synchronization, it provides a set of guarantees. These are:
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients' view of the system is guaranteed to be up-to-date within a certain time bound.
Each znode's stat structure contains the following fields:

| Field | Description |
|---|---|
| czxid | zxid of the transaction that created the node |
| mzxid | zxid of the transaction that last modified the node |
| ctime | time the node was created |
| mtime | time the node was last modified |
| version | version number of the node's data |
| cversion | version number of the node's children |
| aversion | version number of the node's ACL |
| ephemeralOwner | session id of the owner if the node is ephemeral, otherwise 0 |
| dataLength | length of the node's data, in bytes |
| numChildren | number of children the node has |
| pzxid | zxid of the most recent create or delete among the node's children |
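A sketch of reading these fields with the Java client; exists() returns the stat structure, or null if the znode is absent (the path and class name are placeholders).

```java
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class StatExample {
    static void printStat(ZooKeeper zk, String path) throws Exception {
        Stat s = zk.exists(path, false); // null if the znode does not exist
        if (s != null) {
            System.out.println("czxid=" + s.getCzxid()
                    + " mzxid=" + s.getMzxid()
                    + " version=" + s.getVersion()
                    + " cversion=" + s.getCversion()
                    + " ephemeralOwner=0x" + Long.toHexString(s.getEphemeralOwner())
                    + " dataLength=" + s.getDataLength()
                    + " numChildren=" + s.getNumChildren());
        }
    }
}
```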
One of the design goals of ZooKeeper is to provide a very simple programming interface. As a result, it supports only these operations:
| Operation | Description |
|---|---|
| create | creates a node at a location in the tree (the parent node must already exist) |
| delete | deletes a node (the znode must have no children) |
| exists | tests whether a node exists at a location and retrieves its metadata |
| get data | reads data from a node (getData, getACL, getChildren) |
| set data | writes data to a node (setData, setACL) |
| get children | retrieves a list of the children of a node |
| sync | waits for data to be propagated (synchronized across the servers) |
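A sketch of sync in use, assuming a connected handle and a placeholder path: because reads are served from the local replica of the connected server (as described below), a client that must observe the latest write can issue sync before reading.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.ZooKeeper;

public class SyncThenRead {
    // sync() brings the connected server's replica up to date with the
    // leader before the subsequent read; it is asynchronous, so we wait
    // for its callback. The path is hypothetical.
    static byte[] readLatest(ZooKeeper zk, String path) throws Exception {
        CountDownLatch synced = new CountDownLatch(1);
        zk.sync(path, (rc, p, ctx) -> synced.countDown(), null);
        synced.await();
        return zk.getData(path, false, null); // stat not needed here
    }
}
```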
ZooKeeper Components shows the high-level components of the ZooKeeper service. With the exception of the request processor, each of the servers that make up the ZooKeeper service replicates its own copy of each of the components.
The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.
Every ZooKeeper server services clients. Clients connect to exactly one server to submit requests. Read requests are serviced from the local replica of each server database. Requests that change the state of the service, write requests, are processed by an agreement protocol.
As part of the agreement protocol all write requests from clients are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery. The messaging layer takes care of replacing leaders on failures and syncing followers with leaders.
ZooKeeper uses a custom atomic messaging protocol. Since the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a write request, it calculates what the state of the system is when the write is to be applied and transforms this into a transaction that captures this new state.
Performance
Reliability