ZooKeeper: A Basic Introduction

Date: 2019-11-07


What is ZooKeeper?

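ZooKeeper is a coordination service designed for distributed applications. It is high-performance (usable in large distributed systems), highly available (no single point of failure), and strictly ordered in access (so clients can implement sophisticated synchronization primitives).

The services ZooKeeper provides include maintaining configuration information, naming, providing distributed synchronization, and providing group services.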

Design Goals

ZooKeeper is simple. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system.

ZooKeeper is replicated. Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts called an ensemble.

ZooKeeper is ordered. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions.

ZooKeeper is fast. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.

The ZooKeeper Data Model

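ZooKeeper maintains in memory a hierarchical tree of znodes. A znode can be compared to a file or a directory. The data structure each znode maintains includes (1) version numbers for data changes, (2) an ACL that restricts who can do what, and (3) timestamps.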

The version number, together with the timestamp, allows ZooKeeper to validate the cache and to coordinate updates. Each time a znode's data changes, the version number increases. For instance, whenever a client retrieves data, it also receives the version of the data. And when a client performs an update or a delete, it must supply the version of the data of the znode it is changing. If the version it supplies doesn't match the actual version of the data, the update will fail. A version number of -1 matches any version of the znode.
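The version check described above can be sketched with a small in-memory model. This is only an illustration under stated assumptions: VersionedNode is a hypothetical class, not part of the ZooKeeper client, and it reports a mismatch with a boolean, whereas the real Java client signals it with KeeperException.BadVersionException.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical in-memory stand-in for a single znode; not the ZooKeeper client API.
final class VersionedNode {
    private byte[] data;
    private int version = 0; // incremented on every successful write

    VersionedNode(byte[] initial) {
        this.data = initial.clone();
    }

    synchronized int version() {
        return version;
    }

    synchronized byte[] data() {
        return data.clone();
    }

    // Replaces the data only if expectedVersion matches the current version,
    // or is -1 (match any version). Returns false on a version mismatch.
    synchronized boolean setData(byte[] newData, int expectedVersion) {
        if (expectedVersion != -1 && expectedVersion != version) {
            return false;
        }
        data = newData.clone();
        version++;
        return true;
    }

    public static void main(String[] args) {
        VersionedNode node = new VersionedNode("a".getBytes(StandardCharsets.UTF_8));
        int seen = node.version(); // version observed at read time
        System.out.println(node.setData("b".getBytes(StandardCharsets.UTF_8), seen)); // true
        System.out.println(node.setData("c".getBytes(StandardCharsets.UTF_8), seen)); // false, stale version
        System.out.println(node.setData("d".getBytes(StandardCharsets.UTF_8), -1));   // true, -1 matches any
    }
}
```

The second write fails because another update happened after the version was read; a client would typically re-read the data and retry.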

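(1) Data access is atomic. A read of a znode either returns all of its data or fails. A write either replaces all of the znode's data or fails. There are no partial reads or partial writes (ZooKeeper does not support an append operation).

(2) A znode is accessed by its absolute path, which is represented as a java.lang.String.

(3) Each znode can store at most 1 MB of data, but the data should be much less than that on average. ZooKeeper was not designed to be a general database or large object store. Instead, it manages coordination data (configuration, status information, rendezvous). If large data must be stored, it is usually placed in NFS or HDFS, with a pointer to it stored in ZooKeeper.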

Ephemeral ZNodes & Persistent ZNodes

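ZooKeeper has two types of znode: ephemeral and persistent. An ephemeral znode is deleted when the session that created it ends; a persistent znode remains until it is explicitly deleted. The create method of a ZooKeeper instance takes a CreateMode that specifies the type of the znode being created.

The modes that CreateMode can specify include PERSISTENT, EPHEMERAL, PERSISTENT_SEQUENTIAL, and EPHEMERAL_SEQUENTIAL. The sequential variants append a monotonically increasing sequence number to the znode name.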

Sequence numbers can be used to impose a global ordering on events in a distributed system and may be used by the client to infer the ordering.
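How a client might infer ordering from sequence numbers can be sketched as follows. SequentialNames is a hypothetical in-memory model, not the ZooKeeper client; it assumes the suffix is a 10-digit zero-padded counter, which is the format the ZooKeeper documentation describes for sequential znodes, and the path /locks/lock- is made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical model of sequential znode naming; not the ZooKeeper client API.
final class SequentialNames {
    private int counter = 0; // one counter per parent znode in this model

    // Appends a 10-digit, zero-padded sequence number to the requested prefix.
    String create(String prefix) {
        return String.format(Locale.ROOT, "%s%010d", prefix, counter++);
    }

    // Extracts the trailing 10-digit sequence number from a name.
    static int sequenceOf(String name) {
        return Integer.parseInt(name.substring(name.length() - 10));
    }

    // Returns a copy of the names ordered by sequence number, oldest first.
    static List<String> inCreationOrder(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.comparingInt(SequentialNames::sequenceOf));
        return sorted;
    }

    public static void main(String[] args) {
        SequentialNames parent = new SequentialNames();
        String first = parent.create("/locks/lock-");
        String second = parent.create("/locks/lock-");
        // Children may be listed in any order; sorting recovers creation order.
        System.out.println(inCreationOrder(Arrays.asList(second, first)));
    }
}
```

Sorting by the numeric suffix rather than by the whole name keeps the ordering correct even when children were created with different prefixes.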

ZooKeeper Watches

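The ZooKeeper read operations getData(), getChildren(), and exists() can optionally set a watch; the write operations create(), delete(), and setData() can trigger watches. ACL operations do not interact with watches.

A ZooKeeper Watcher object serves two purposes: (1) it is notified of changes in the ZooKeeper connection state, and (2) it is notified of changes to znodes.

ZooKeeper's definition of a watch: a watch event is a one-time trigger, sent to the client that set the watch, which occurs when the data for which the watch was set changes.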

One-time trigger

One watch event will be sent to the client when the data has changed. For example, if a client does a getData("/znode1", true) and later the data for /znode1 is changed or deleted, the client will get a watch event for /znode1. If /znode1 changes again, no watch event will be sent unless the client has done another read that sets a new watch.

Sent to the client

This implies that an event is on the way to the client, but may not reach the client before the successful return code to the change operation reaches the client that initiated the change. Watches are sent asynchronously to watchers. ZooKeeper provides an ordering guarantee: a client will never see a change for which it has set a watch until it first sees the watch event. Network delays or other factors may cause different clients to see watches and return codes from updates at different times. The key point is that everything seen by the different clients will have a consistent order.

The data for which the watch was set

This refers to the different ways a node can change. It helps to think of ZooKeeper as maintaining two lists of watches: data watches and child watches. getData() and exists() set data watches. getChildren() sets child watches. Alternatively, it may help to think of watches being set according to the kind of data returned. getData() and exists() return information about the data of the node, whereas getChildren() returns a list of children. Thus, setData() will trigger data watches for the znode being set (assuming the set is successful). A successful create() will trigger a data watch for the znode being created and a child watch for the parent znode. A successful delete() will trigger both a data watch and a child watch (since there can be no more children) for a znode being deleted as well as a child watch for the parent znode.

Watch Triggers

The details are as follows:

• A watch set on an exists operation will be triggered when the znode being watched is created, deleted, or has its data updated.

• A watch set on a getData operation will be triggered when the znode being watched is deleted or has its data updated. No trigger can occur on creation because the znode must already exist for the getData operation to succeed.

• A watch set on a getChildren operation will be triggered when a child of the znode being watched is created or deleted, or when the znode itself is deleted. You can tell whether the znode or its child was deleted by looking at the watch event type: NodeDeleted shows the znode was deleted, and NodeChildrenChanged indicates that it was a child that was deleted.
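The trigger rules above, together with the idea of separate data watches and child watches, can be sketched with a small in-memory model. WatchModel and its method names are hypothetical, events are plain strings rather than WatchedEvent objects, and the model skips checks a real server makes (for example, that a parent exists or that a deleted znode has no children).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical in-memory model of one-time watches; not the ZooKeeper client API.
final class WatchModel {
    private final Set<String> nodes = new HashSet<>();
    private final Map<String, List<Consumer<String>>> dataWatches = new HashMap<>();
    private final Map<String, List<Consumer<String>>> childWatches = new HashMap<>();

    WatchModel() {
        nodes.add("/");
    }

    // Like exists(): sets a data watch even when the znode is absent.
    boolean exists(String path, Consumer<String> watcher) {
        dataWatches.computeIfAbsent(path, k -> new ArrayList<>()).add(watcher);
        return nodes.contains(path);
    }

    // Like getChildren(): sets a child watch on an existing znode.
    void watchChildren(String path, Consumer<String> watcher) {
        requireNode(path);
        childWatches.computeIfAbsent(path, k -> new ArrayList<>()).add(watcher);
    }

    void create(String path) {
        if (!nodes.add(path)) {
            throw new IllegalStateException("node exists: " + path);
        }
        fire(dataWatches, path, "NodeCreated " + path);
        fire(childWatches, parent(path), "NodeChildrenChanged " + parent(path));
    }

    void setData(String path) {
        requireNode(path);
        fire(dataWatches, path, "NodeDataChanged " + path);
    }

    void delete(String path) {
        requireNode(path);
        nodes.remove(path);
        fire(dataWatches, path, "NodeDeleted " + path);
        fire(childWatches, path, "NodeDeleted " + path);
        fire(childWatches, parent(path), "NodeChildrenChanged " + parent(path));
    }

    private void requireNode(String path) {
        if (!nodes.contains(path)) {
            throw new IllegalStateException("no node: " + path);
        }
    }

    // Removing the watchers before notifying them is what makes each watch one-time.
    private static void fire(Map<String, List<Consumer<String>>> watches, String path, String event) {
        List<Consumer<String>> watchers = watches.remove(path);
        if (watchers != null) {
            watchers.forEach(w -> w.accept(event));
        }
    }

    private static String parent(String path) {
        int i = path.lastIndexOf('/');
        return i <= 0 ? "/" : path.substring(0, i);
    }

    public static void main(String[] args) {
        WatchModel zk = new WatchModel();
        List<String> events = new ArrayList<>();
        zk.exists("/a", events::add); // data watch on a znode that does not exist yet
        zk.create("/a");              // fires NodeCreated
        zk.setData("/a");             // no event: the watch was already consumed
        System.out.println(events);
    }
}
```

In the usage above, the second change goes unnoticed because the watch was not set again, which is the behavior the one-time trigger rule describes.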

What ZooKeeper Guarantees about Watches

With regard to watches, ZooKeeper maintains these guarantees:

Watches are ordered with respect to other events, other watches, and asynchronous replies. The ZooKeeper client libraries ensure that everything is dispatched in order.

A client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode.

The order of watch events from ZooKeeper corresponds to the order of the updates as seen by the ZooKeeper service.

Things to Remember about Watches

Watches are one-time triggers; if you get a watch event and you want to get notified of future changes, you must set another watch.

Because watches are one-time triggers and there is latency between getting the event and sending a new request to get a watch, you cannot reliably see every change that happens to a node in ZooKeeper. Be prepared to handle the case where the znode changes multiple times between getting the event and setting the watch again. (You may not care, but at least realize it may happen.)

A watch object, or function/context pair, will only be triggered once for a given notification. For example, if the same watch object is registered for an exists and a getData call for the same file and that file is then deleted, the watch object would only be invoked once with the deletion notification for the file.

When you disconnect from a server (for example, when the server fails), you will not get any watches until the connection is reestablished. For this reason session events are sent to all outstanding watch handlers. Use session events to go into a safe mode: you will not be receiving events while disconnected, so your process should act conservatively in that mode.

Operations

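delete and setData must be given the version number of the znode (-1 matches any version; the version number can be obtained from the return value of exists). If the version does not match, the operation fails. Updates are non-blocking.

A read in ZooKeeper may not return the most recent data; a client that calls sync first can obtain the latest data.

The multi operation batches together multiple primitive operations into a single unit that either succeeds or fails in its entirety.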
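The all-or-nothing behavior of multi can be sketched with an in-memory model. MultiModel and SetOp are hypothetical names, not the ZooKeeper client API, and the model covers only version-checked writes; it applies the batch to a working copy and commits that copy only if every operation passes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of an all-or-nothing batch; not the ZooKeeper client API.
final class MultiModel {
    // One primitive operation in a batch: a version-checked write.
    static final class SetOp {
        final String path;
        final String value;
        final int expectedVersion; // -1 matches any version

        SetOp(String path, String value, int expectedVersion) {
            this.path = path;
            this.value = value;
            this.expectedVersion = expectedVersion;
        }
    }

    private Map<String, String> values = new HashMap<>();
    private Map<String, Integer> versions = new HashMap<>();

    void create(String path, String value) {
        values.put(path, value);
        versions.put(path, 0);
    }

    String get(String path) {
        return values.get(path);
    }

    // Applies every operation to a working copy; commits only if all of them pass.
    boolean multi(List<SetOp> ops) {
        Map<String, String> newValues = new HashMap<>(values);
        Map<String, Integer> newVersions = new HashMap<>(versions);
        for (SetOp op : ops) {
            Integer current = newVersions.get(op.path);
            if (current == null) {
                return false; // no such node: the whole batch fails
            }
            if (op.expectedVersion != -1 && op.expectedVersion != current) {
                return false; // version mismatch: the whole batch fails
            }
            newValues.put(op.path, op.value);
            newVersions.put(op.path, current + 1);
        }
        values = newValues;
        versions = newVersions;
        return true;
    }

    public static void main(String[] args) {
        MultiModel zk = new MultiModel();
        zk.create("/x", "1");
        zk.create("/y", "1");
        boolean ok = zk.multi(Arrays.asList(new SetOp("/x", "2", 0), new SetOp("/y", "2", 7)));
        System.out.println(ok + " " + zk.get("/x")); // false 1: /x was not changed either
    }
}
```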

APIs

ZooKeeper provides language bindings for Java and C, and offers both synchronous and asynchronous APIs.

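Synchronous exists:

public Stat exists(String path, Watcher watcher) throws KeeperException, InterruptedException

Asynchronous exists:

public void exists(String path, Watcher watcher, StatCallback cb, Object ctx)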

ACLs

A znode is created with a list of ACLs, which determine who can perform certain operations on it.

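The authentication schemes ZooKeeper provides include digest (username and password), sasl (Kerberos), and ip (client IP address).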

Clients may authenticate themselves after establishing a ZooKeeper session. Authentication is optional, although a znode’s ACL may require an authenticated client, in which case the client must authenticate itself to access the znode. Here is an example of using the digest scheme to authenticate with a username and password:

zk.addAuthInfo("digest", "tom:secret".getBytes());

An ACL is the combination of an authentication scheme, an identity for that scheme, and a set of permissions. For example, if we wanted to give a client with the IP address 10.0.0.1 read access to a znode, we would set an ACL on the znode with the ip scheme, an ID of 10.0.0.1, and READ permission. In Java, we would create the ACL object as follows:

new ACL(Perms.READ,
    new Id("ip", "10.0.0.1"));

Implementation

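ZooKeeper has two modes:

Standalone Mode: a single ZooKeeper server, used in test environments; it provides neither high availability nor reliability.

Replicated Mode: high availability is achieved through replication. As long as more than half of the servers in the ensemble are available, the ZooKeeper service as a whole is available. With N servers, at least N/2 + 1 (integer division) must be alive. For N = 5, 5/2 + 1 = 3, so two servers may fail. For N = 6, 6/2 + 1 = 4, so again only two servers may fail. An ensemble should therefore consist of an odd number of servers (at least 3). ZooKeeper runs in replicated mode on a cluster of machines called an ensemble.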
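The majority arithmetic above can be checked with a few lines of Java. QuorumMath is a hypothetical helper written for this note, not part of ZooKeeper; it relies on Java's integer division to compute N/2 + 1.

```java
// Hypothetical helper for the quorum arithmetic; not part of ZooKeeper.
final class QuorumMath {
    // Smallest number of live servers that is a strict majority of n.
    static int majority(int n) {
        return n / 2 + 1; // integer division
    }

    // Number of servers that may fail while a majority remains.
    static int toleratedFailures(int n) {
        return n - majority(n);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 7; n++) {
            System.out.println("N=" + n + " majority=" + majority(n)
                    + " tolerated=" + toleratedFailures(n));
        }
    }
}
```

Going from 5 to 6 servers raises the majority from 3 to 4 without raising the number of tolerated failures, which is why an even-sized ensemble gains nothing over the odd size below it.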

ZooKeeper's core idea:

All it has to do is ensure that every modification to the tree of znodes is replicated to a majority of the ensemble. If a minority of the machines fail, then a minimum of one machine will survive with the latest state. The other remaining replicas will eventually catch up with this state.

ZooKeeper implements this idea with the Zab protocol, which runs in two phases: leader election and atomic broadcast.

If the leader fails, the remaining machines hold another leader election and continue as before with the new leader. If the old leader later recovers, it then starts as a follower. Leader election is very fast, around 200 ms according to one published result, so performance does not noticeably degrade during an election. All machines in the ensemble write updates to disk before updating their in-memory copies of the znode tree. Read requests may be serviced from any machine, and because they involve only a lookup from memory, they are very fast.

Consistency

A follower may lag the leader by a number of updates (which is why naming the servers in a ZooKeeper ensemble Leader and Follower is fitting). This is a consequence of the fact that only a majority and not all members of the ensemble need to have persisted a change before it is committed.

Sessions

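A ZooKeeper client holds a list of the servers in the ensemble and, on startup, connects to one of the servers in the list. Once the client is connected, the server creates a session for it. The client keeps the session alive by sending ping requests (heartbeats) to the server.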

Time

The basic unit of time in ZooKeeper is the tick time; other time parameters (for example, the session timeout) are set in terms of it.
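As one illustration of a parameter derived from the tick time, the sketch below clamps a requested session timeout to a range expressed in ticks. It assumes the server defaults documented in the ZooKeeper administrator's guide (minimum 2 ticks, maximum 20 ticks); both bounds are configurable, and SessionTimeouts is a hypothetical helper, not ZooKeeper code.

```java
// Hypothetical helper; assumes the default bounds of 2 and 20 ticks.
final class SessionTimeouts {
    // Clamps the timeout a client asks for into [2 * tickTime, 20 * tickTime].
    static int negotiate(int requestedMs, int tickTimeMs) {
        int min = 2 * tickTimeMs;
        int max = 20 * tickTimeMs;
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        int tickTimeMs = 2000;
        System.out.println(negotiate(1000, tickTimeMs));  // 4000: raised to the minimum
        System.out.println(negotiate(30000, tickTimeMs)); // 30000: already within range
        System.out.println(negotiate(60000, tickTimeMs)); // 40000: lowered to the maximum
    }
}
```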

States

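A ZooKeeper object passes through different states during its lifecycle. A newly created ZooKeeper instance is in the CONNECTING state and moves to CONNECTED once a connection to a server is established. If close is called on the instance, or the session times out, it moves to CLOSED. The current state is available through the getState method.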

The most performance-critical part of ZooKeeper is the transaction log.

partial failure: when we don’t even know if an operation failed

References:

https://zookeeper.apache.org/doc/current/index.html

Hadoop: The Definitive Guide, 4th Edition

Articles I found useful:

https://blog.csdn.net/u012152619/article/details/53053634

https://blog.csdn.net/liu857279611/article/details/70495413