ZooKeeper 3.4.11
http://zookeeper.apache.org/
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.
ZooKeeper is a centralized service that can be used to: 1) maintain configuration information; 2) provide naming; 3) provide distributed synchronization; 4) provide group services.
ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system. The namespace consists of data registers - called znodes, in ZooKeeper parlance - and these are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in-memory, which means ZooKeeper can achieve high throughput and low latency numbers.
ZooKeeper provides a shared namespace that looks very much like a file system (made up of files and directories); the namespace consists of znodes. Unlike a file system, ZooKeeper keeps its data in memory, and a ZooKeeper "directory" (a znode) can itself hold data.
Every server in a ZooKeeper ensemble knows the address of every other server, and any two servers are connected to each other. Besides the in-memory data, each server also keeps transaction logs and snapshots on persistent storage. As long as a majority of the servers in the ensemble are alive (n+1 out of 2n+1), the ensemble can keep serving requests.
A ZooKeeper client connects to only one of the ZooKeeper servers. The client maintains a long-lived TCP connection to that server, through which it sends requests, receives responses, gets watch events, and sends heartbeats. If the current connection breaks, the client tries to connect to another ZooKeeper server.
The namespace ZooKeeper provides looks very much like a file system, with a few differences: 1) a "directory" can also hold data (ZooKeeper has no separate notions of directories and files; every node is a znode, which can both hold data and have children); 2) it supports ephemeral nodes, whose lifetime is tied to the client's session; 3) it supports watches, i.e. subscriptions: when a znode changes, the corresponding watch event is triggered.
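For example, with the bundled zkCli.sh shell (a minimal sketch; /demo and its data are made-up names, and the -e flag requests an ephemeral node):
[zkshell: 1] create -e /demo mydata
[zkshell: 2] get /demo watch
[zkshell: 3] set /demo newdata
The first command creates an ephemeral znode that is deleted automatically when the creating session ends; the second reads it and registers a watch; the third changes the data, which fires a NodeDataChanged watch event on the client that set the watch.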
Every ZooKeeper server can serve client connections, and each client connects to exactly one server. A client's read requests are answered directly from the local data of the server it is connected to, while all write requests are forwarded to the server acting as the leader. Servers other than the leader are called followers (there is also a third role, the observer, covered later).
When establishing a connection, a client first enters the CONNECTING state, moves to CONNECTED once the connection succeeds, and enters CLOSED after the connection is closed.
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The client's view of the system is guaranteed to be up-to-date within a certain time bound.
In short, the guarantees are: sequential consistency, atomicity, single system image, reliability, and timeliness.
$ZOOKEEPER_HOME/conf/zoo.cfg
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
tickTime the basic time unit in milliseconds used by ZooKeeper. It is used to do heartbeats and the minimum session timeout will be twice the tickTime.
dataDir the location to store the in-memory database snapshots and, unless specified otherwise, the transaction log of updates to the database.
clientPort the port to listen for client connections
initLimit the timeout, in ticks, that limits how long the ZooKeeper servers in the quorum have to connect and sync to a leader.
syncLimit the timeout, in ticks, that limits how far out of date a server can be from a leader.
The five options above are the core of the ZooKeeper configuration. With tickTime=2000, for example, initLimit=5 gives followers 10 seconds to connect and sync to the leader, and syncLimit=2 allows them to fall at most 4 seconds behind.
The entries of the form server.X list the servers that make up the ZooKeeper service. When the server starts up, it knows which server it is by looking for the file myid in the data directory. That file contains the server number, in ASCII.
Peers use the former port to connect to other peers. Such a connection is necessary so that peers can communicate, for example, to agree upon the order of updates. More specifically, a ZooKeeper server uses this port to connect followers to the leader. When a new leader arises, a follower opens a TCP connection to the leader using this port. Because the default leader election also uses TCP, we currently require another port for leader election. This is the second port in the server entry.
All of the servers in the ZooKeeper ensemble are listed as server.X entries. When a server starts, it reads its own id from the myid file in its data directory and then connects to the other servers in the ensemble; there are two connections between each pair of servers, one used for leader election and one for ordinary communication.
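As an illustration, to match the zoo.cfg above each server would be given its own id in dataDir/myid (a minimal sketch; run the matching line on the corresponding host):
$ echo 1 > /var/lib/zookeeper/myid   # on zoo1
$ echo 2 > /var/lib/zookeeper/myid   # on zoo2
$ echo 3 > /var/lib/zookeeper/myid   # on zoo3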
File details:
The snapshot files stored in the data directory are fuzzy snapshots in the sense that during the time the ZooKeeper server is taking the snapshot, updates are occurring to the data tree. The suffix of the snapshot file names is the zxid, the ZooKeeper transaction id, of the last committed transaction at the start of the snapshot. Thus, the snapshot includes a subset of the updates to the data tree that occurred while the snapshot was in process. The snapshot, then, may not correspond to any data tree that actually existed, and for this reason we refer to it as a fuzzy snapshot. Still, ZooKeeper can recover using this snapshot because it takes advantage of the idempotent nature of its updates. By replaying the transaction log against fuzzy snapshots ZooKeeper gets the state of the system at the end of the log.
The Log Directory contains the ZooKeeper transaction logs. Before any update takes place, ZooKeeper ensures that the transaction that represents the update is written to non-volatile storage. A new log file is started when the number of transactions written to the current log file reaches a (variable) threshold. The threshold is computed using the same parameter which influences the frequency of snapshotting (see snapCount above). The log file's suffix is the first zxid written to that log.
The ZooKeeper data directory therefore holds two kinds of files: snapshots and transaction logs. Every write is first recorded in the transaction log, and the in-memory data is periodically dumped to a snapshot file; together these two files ensure that no data is lost and that a server can recover its state quickly after a restart.
Snapshot files are named snapshot.$last_zxid, where the suffix is the zxid of the last committed transaction at the time the snapshot began, e.g. snapshot.3e0d57d547.
Transaction log files are named log.$first_zxid, where the suffix is the zxid of the first transaction written to that log, e.g. log.3e0d4bffd7.
bin/zkServer.sh start
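To verify that the server came up, and to see whether it is currently the leader or a follower, zkServer.sh also has a status subcommand (the output below is only a sketch and varies by version and configuration):
$ bin/zkServer.sh status
Mode: follower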
Process:
hadoop 150952 4.9 1.6 22934232 1103420 ? Sl Mar14 11954:12 /$JAVA_HOME/bin/java -Dzookeeper.log.dir=/$ZOOKEEPER_HOME -Dzookeeper.root.logger=INFO,CONSOLE -cp /$ZOOKEEPER_HOME/bin/../build/classes:/$ZOOKEEPER_HOME/bin/../build/lib/*.jar:/$ZOOKEEPER_HOME/bin/../lib/slf4j-log4j12-1.6.1.jar:/$ZOOKEEPER_HOME/bin/../lib/slf4j-api-1.6.1.jar:/$ZOOKEEPER_HOME/bin/../lib/netty-3.7.0.Final.jar:/$ZOOKEEPER_HOME/bin/../lib/log4j-1.2.16.jar:/$ZOOKEEPER_HOME/bin/../lib/jline-0.9.94.jar:/$ZOOKEEPER_HOME/bin/../zookeeper-3.4.6.jar:/$ZOOKEEPER_HOME/bin/../src/java/lib/*.jar:/$ZOOKEEPER_HOME/bin/../conf: -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain /$ZOOKEEPER_HOME/bin/../conf/zoo.cfg
Data directory layout:
[user@zk_server_1 version-2]$ ls /$ZOOKEEPER_DATA_DIR/
myid version-2 zookeeper_server.pid
[user@zk_server_1 version-2]$ ls /$ZOOKEEPER_DATA_DIR/version-2/
acceptedEpoch log.3e0d4bffd7 log.3e0d57d549 log.3e0d6255f7 log.3e0d6df7e9 log.3e0d790349 snapshot.3e0d4bffd5 snapshot.3e0d57d547 snapshot.3e0d6255f2 snapshot.3e0d6df7e7 snapshot.3e0d790347
currentEpoch log.3e0d4d1a04 log.3e0d58a803 log.3e0d63629b log.3e0d6f0d5c log.3e0d7a4523 snapshot.3e0d4d1a02 snapshot.3e0d58a801 snapshot.3e0d636299 snapshot.3e0d6f0d5a snapshot.3e0d7a4521
log.3e0d430aed log.3e0d4e724a log.3e0d59a04c log.3e0d64cc1f log.3e0d6ff7ba log.3e0d7bc17d snapshot.3e0d4e7248 snapshot.3e0d59a04a snapshot.3e0d64cc1d snapshot.3e0d6ff7b8 snapshot.3e0d7bc17b
log.3e0d43dadd log.3e0d4ff8ce log.3e0d5ad0a0 log.3e0d65fd11 log.3e0d714d93 snapshot.3e0d43dadb snapshot.3e0d4ff8cc snapshot.3e0d5ad09d snapshot.3e0d65fd0f snapshot.3e0d714d91
log.3e0d454827 log.3e0d5150a7 log.3e0d5bb95f log.3e0d67015a log.3e0d724f8d snapshot.3e0d454825 snapshot.3e0d5150a5 snapshot.3e0d5bb95b snapshot.3e0d670158 snapshot.3e0d724f8b
log.3e0d465a2e log.3e0d52156c log.3e0d5ce712 log.3e0d67fa44 log.3e0d731ca5 snapshot.3e0d465a2c snapshot.3e0d52156a snapshot.3e0d5ce710 snapshot.3e0d67fa42 snapshot.3e0d731ca3
log.3e0d47ac00 log.3e0d531018 log.3e0d5dae9f log.3e0d6932ab log.3e0d748b2d snapshot.3e0d47abfe snapshot.3e0d531016 snapshot.3e0d5dae9d snapshot.3e0d6932ad snapshot.3e0d748b2b
log.3e0d48746a log.3e0d548417 log.3e0d5f18d5 log.3e0d6a0150 log.3e0d75de64 snapshot.3e0d487468 snapshot.3e0d548415 snapshot.3e0d5f18d8 snapshot.3e0d6a014e snapshot.3e0d75de62
log.3e0d49eff7 log.3e0d55620a log.3e0d5fdcc3 log.3e0d6b5212 log.3e0d76e90e snapshot.3e0d49eff5 snapshot.3e0d556208 snapshot.3e0d5fdcc1 snapshot.3e0d6b5210 snapshot.3e0d76e90c
log.3e0d4af170 log.3e0d56bd2c log.3e0d60d48a log.3e0d6cc57c log.3e0d780e44 snapshot.3e0d4af16e snapshot.3e0d56bd2a snapshot.3e0d60d485 snapshot.3e0d6cc57a snapshot.3e0d780e42
$ echo mntr | nc localhost 2185
Example:
-bash-4.1$ echo stat|nc $zk_server_1 2181
Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT
Clients:
/zk_client_1:59407[1](queued=0,recved=55,sent=55)
/zk_client_2:40094[0](queued=0,recved=1,sent=0)
/zk_client_3:60926[1](queued=0,recved=115,sent=115)
/zk_client_4:59288[1](queued=0,recved=56,sent=56)
/zk_client_5:14155[1](queued=0,recved=115,sent=115)
/zk_client_6:18602[1](queued=0,recved=115,sent=115)
Latency min/avg/max: 0/0/146
Received: 6294
Sent: 6376
Connections: 6
Outstanding: 0
Zxid: 0x3e0d1cccd4
Mode: follower
Node count: 26394
Zxid Every change to the ZooKeeper state receives a stamp in the form of a zxid (ZooKeeper Transaction Id). This exposes the total ordering of all changes to ZooKeeper. Each change will have a unique zxid and if zxid1 is smaller than zxid2 then zxid1 happened before zxid2.
The ZooKeeper service can be monitored in one of two primary ways; 1) the command port through the use of 4 letter words and 2) JMX.
Every change to ZooKeeper data receives a unique zxid, and zxids are totally ordered: a change that happened earlier always has a smaller zxid than one that happened later.
ZooKeeper can be monitored in two ways: the four letter word commands on the client port, and JMX.
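Besides stat and mntr, other four letter words such as ruok (liveness check), conf (dump the server configuration) and cons (list connection details) can be sent the same way, for example:
$ echo ruok | nc localhost 2181
imok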
$ bin/zkCli.sh -server 127.0.0.1:2181
[zkshell: 0] help
ZooKeeper host:port cmd args
        get path [watch]
        ls path [watch]
        set path data [version]
        delquota [-n|-b] path
        quit
        printwatches on|off
        create path data acl
        stat path [watch]
        listquota path
        history
        setAcl path acl
        getAcl path
        sync path
        redo cmdno
        addauth scheme auth
        delete path [version]
        setquota -n|-b val path
Example:
[zkshell: 12] get /zk_test
my_data
cZxid = 5
ctime = Fri Jun 05 13:57:06 PDT 2009
mZxid = 5
mtime = Fri Jun 05 13:57:06 PDT 2009
pZxid = 5
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0
dataLength = 7
numChildren = 0
czxid
The zxid of the change that caused this znode to be created.
mzxid
The zxid of the change that last modified this znode.
pzxid
The zxid of the change that last modified children of this znode.
ctime
The time in milliseconds from epoch when this znode was created.
mtime
The time in milliseconds from epoch when this znode was last modified.
version
The number of changes to the data of this znode.
cversion
The number of changes to the children of this znode.
aversion
The number of changes to the ACL of this znode.
ephemeralOwner
The session id of the owner of this znode if the znode is an ephemeral node. If it is not an ephemeral node, it will be zero.
dataLength
The length of the data field of this znode.
numChildren
The number of children of this znode.
These are the attributes carried by every znode.
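The same metadata can also be read without the data payload, using the stat command in zkCli.sh (continuing the /zk_test example above):
[zkshell: 13] stat /zk_test
This prints the same cZxid, mZxid, ctime and related fields shown above, just without the my_data content.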
The Java client API entry point is the org.apache.zookeeper.ZooKeeper class.
The format of snapshot and log files does not change between standalone ZooKeeper servers and different configurations of replicated ZooKeeper servers. Therefore, you can pull these files from a running replicated ZooKeeper server to a development machine with a stand-alone ZooKeeper server for troubleshooting.
Using older log and snapshot files, you can look at the previous state of ZooKeeper servers and even restore that state. The LogFormatter class allows an administrator to look at the transactions in a log.
The format of the snapshot and log files is the same for standalone and replicated servers, so the data files from a production ensemble can easily be copied to a local machine to investigate problems or reproduce an error.
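For example, a transaction log copied from a server can be dumped in human-readable form with the LogFormatter class mentioned above (a sketch; the jar versions and the log file name are taken from the listings earlier and will differ on other installations):
$ java -cp zookeeper-3.4.6.jar:lib/slf4j-api-1.6.1.jar:lib/slf4j-log4j12-1.6.1.jar:lib/log4j-1.2.16.jar \
    org.apache.zookeeper.server.LogFormatter version-2/log.3e0d4bffd7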
A server might not be able to read its database and fail to come up because of some file corruption in the transaction logs of the ZooKeeper server. You will see some IOException on loading ZooKeeper database. In such a case, make sure all the other servers in your ensemble are up and working. Use "stat" command on the command port to see if they are in good health. After you have verified that all the other servers of the ensemble are up, you can go ahead and clean the database of the corrupt server. Delete all the files in datadir/version-2 and datalogdir/version-2/. Restart the server.
If some files in a ZooKeeper server's data directory become corrupted, the server may fail to start because it cannot read its database. In that case, first check that the other servers (a majority of the ensemble) are healthy; if they are, you can clear the data directory of the corrupted server and restart it, and it will automatically sync the latest data from the ensemble.
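A minimal sketch of that procedure, assuming the host names and dataDir from the sample configuration above:
$ echo stat | nc zoo2 2181             # confirm the healthy servers respond and report a Mode
$ echo stat | nc zoo3 2181
$ rm /var/lib/zookeeper/version-2/*    # on the corrupted server only
$ bin/zkServer.sh restart              # the server re-syncs its data from the current leader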
Although ZooKeeper performs very well by having clients connect directly to voting members of the ensemble, this architecture makes it hard to scale out to huge numbers of clients. The problem is that as we add more voting members, the write performance drops. This is due to the fact that a write operation requires the agreement of (in general) at least half the nodes in an ensemble and therefore the cost of a vote can increase significantly as more voters are added.
We have introduced a new type of ZooKeeper node called an Observer which helps address this problem and further improves ZooKeeper's scalability. Observers are non-voting members of an ensemble which only hear the results of votes, not the agreement protocol that leads up to them. Other than this simple distinction, Observers function exactly the same as Followers - clients may connect to them and send read and write requests to them. Observers forward these requests to the Leader like Followers do, but they then simply wait to hear the result of the vote. Because of this, we can increase the number of Observers as much as we like without harming the performance of votes.
Because writes in ZooKeeper require a vote among the servers, the cost of each vote grows as the ensemble grows. To address this, ZooKeeper introduced a new node type, the observer, which does not vote and only syncs data and serves client requests.
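Making a server an observer is a zoo.cfg change (a sketch based on the sample ensemble above, with zoo4 as a hypothetical observer host): the observer itself sets peerType=observer, and every server's configuration tags the observer's entry with :observer:
peerType=observer
server.4=zoo4:2888:3888:observer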
1. Why does a ZooKeeper ensemble consist of an odd number (2n+1) of nodes?
There are two reasons. The first is to avoid split brain: the cluster only works while at least n+1 nodes are alive, so out of 2n+1 nodes at most one partition can ever form a working cluster. The second is data reliability: once a write has been acknowledged by n+1 nodes it cannot be lost, which can be shown by contradiction: if such a write were lost, all n+1 of the nodes that acknowledged it would have to be unavailable, leaving at most n live nodes, in which case the cluster could not be serving at all, contradicting the assumption. An odd size is preferred because an even ensemble buys nothing extra: 5 servers (n = 2) tolerate 2 failures, while 6 servers still tolerate only 2 failures but add one more voter to every write.