ZooKeeper 3.4.11
http://zookeeper.apache.org/
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.
ZooKeeper is a centralized service that can be used to: 1) maintain configuration information; 2) provide naming; 3) provide distributed synchronization; 4) provide group services.
ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system. The namespace consists of data registers - called znodes, in ZooKeeper parlance - and these are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in-memory, which means ZooKeeper can achieve high throughput and low latency numbers.
ZooKeeper provides a shared namespace that looks very much like a file system (made up of files and directories); the namespace consists of znodes. Unlike a file system, ZooKeeper keeps its data in memory, and a ZooKeeper "directory" (a znode) can itself hold data.
Every server in a ZooKeeper ensemble knows the address of every other server, and any two servers are connected to each other. Besides the in-memory data, each server also keeps transaction logs and snapshots on persistent storage. As long as a majority of the servers in the ensemble are alive (n+1 out of 2n+1), the ensemble can keep serving requests.
A ZooKeeper client connects to only one of the ZooKeeper servers. The client maintains a long-lived TCP connection to that server, through which it sends requests, receives responses, gets watch events, and sends heartbeats. If the current connection breaks, the client tries to connect to another ZooKeeper server.
The namespace ZooKeeper provides looks very much like a file system, with a few differences: 1) a "directory" can also hold data (ZooKeeper has no separate notions of directories and files; every node is a znode, which can both hold data and have children); 2) it supports ephemeral nodes, whose lifetime is tied to the client's session; 3) it supports watches, i.e. subscriptions: when a znode changes, the corresponding watch event is triggered.
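For example, with the bundled zkCli.sh shell (a minimal sketch; /demo and its data are made-up names, and the -e flag requests an ephemeral node):
[zkshell: 1] create -e /demo mydata
[zkshell: 2] get /demo watch
[zkshell: 3] set /demo newdata
The first command creates an ephemeral znode that is deleted automatically when the creating session ends; the second reads it and registers a watch; the third changes the data, which fires a NodeDataChanged watch event on the client that set the watch.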
Every ZooKeeper server can serve client connections, and each client connects to exactly one server. A client's read requests are answered directly from the local data of the server it is connected to, while all write requests are forwarded to the server acting as the leader. Servers other than the leader are called followers (there is also a third role, the observer, covered later).
When establishing a connection, a client first enters the CONNECTING state, moves to CONNECTED once the connection succeeds, and enters CLOSED after the connection is closed.
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The client's view of the system is guaranteed to be up-to-date within a certain time bound.
In short, the guarantees are: sequential consistency, atomicity, single system image, reliability, and timeliness.
$ZOOKEEPER_HOME/conf/zoo.cfg
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888
tickTime the basic time unit in milliseconds used by ZooKeeper. It is used to do heartbeats and the minimum session timeout will be twice the tickTime.
dataDir the location to store the in-memory database snapshots and, unless specified otherwise, the transaction log of updates to the database.
clientPort the port to listen for client connections
initLimit the timeout, in ticks, that limits how long the ZooKeeper servers in the quorum have to connect and sync to a leader.
syncLimit the timeout, in ticks, that limits how far out of date a server can be from a leader.
The five options above are the core of the ZooKeeper configuration. With tickTime=2000, for example, initLimit=5 gives followers 10 seconds to connect and sync to the leader, and syncLimit=2 allows them to fall at most 4 seconds behind.
The entries of the form server.X list the servers that make up the ZooKeeper service. When the server starts up, it knows which server it is by looking for the file myid in the data directory. That file contains the server number, in ASCII.
Peers use the former port to connect to other peers. Such a connection is necessary so that peers can communicate, for example, to agree upon the order of updates. More specifically, a ZooKeeper server uses this port to connect followers to the leader. When a new leader arises, a follower opens a TCP connection to the leader using this port. Because the default leader election also uses TCP, we currently require another port for leader election. This is the second port in the server entry.
All of the servers in the ZooKeeper ensemble are listed as server.X entries. When a server starts, it reads its own id from the myid file in its data directory and then connects to the other servers in the ensemble; there are two connections between each pair of servers, one used for leader election and one for ordinary communication.
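As an illustration, to match the zoo.cfg above each server would be given its own id in dataDir/myid (a minimal sketch; run the matching line on the corresponding host):
$ echo 1 > /var/lib/zookeeper/myid   # on zoo1
$ echo 2 > /var/lib/zookeeper/myid   # on zoo2
$ echo 3 > /var/lib/zookeeper/myid   # on zoo3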
File details:
The snapshot files stored in the data directory are fuzzy snapshots in the sense that during the time the ZooKeeper server is taking the snapshot, updates are occurring to the data tree. The suffix of the snapshot file names is the zxid, the ZooKeeper transaction id, of the last committed transaction at the start of the snapshot. Thus, the snapshot includes a subset of the updates to the data tree that occurred while the snapshot was in process. The snapshot, then, may not correspond to any data tree that actually existed, and for this reason we refer to it as a fuzzy snapshot. Still, ZooKeeper can recover using this snapshot because it takes advantage of the idempotent nature of its updates. By replaying the transaction log against fuzzy snapshots ZooKeeper gets the state of the system at the end of the log.
The Log Directory contains the ZooKeeper transaction logs. Before any update takes place, ZooKeeper ensures that the transaction that represents the update is written to non-volatile storage. A new log file is started when the number of transactions written to the current log file reaches a (variable) threshold. The threshold is computed using the same parameter which influences the frequency of snapshotting (see snapCount above). The log file's suffix is the first zxid written to that log.
The ZooKeeper data directory therefore holds two kinds of files: snapshots and transaction logs. Every write is first recorded in the transaction log, and the in-memory data is periodically dumped to a snapshot file; together these two files ensure that no data is lost and that a server can recover its state quickly after a restart.
Snapshot files are named snapshot.$last_zxid, where the suffix is the zxid of the last committed transaction at the time the snapshot began, e.g. snapshot.3e0d57d547.
Transaction log files are named log.$first_zxid, where the suffix is the zxid of the first transaction written to that log, e.g. log.3e0d4bffd7.
bin/zkServer.sh start
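To verify that the server came up, and to see whether it is currently the leader or a follower, zkServer.sh also has a status subcommand (the output below is only a sketch and varies by version and configuration):
$ bin/zkServer.sh status
Mode: follower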
Process:
hadoop 150952 4.9 1.6 22934232 1103420 ? Sl Mar14 11954:12 /$JAVA_HOME/bin/java -Dzookeeper.log.dir=/$ZOOKEEPER_HOME -Dzookeeper.root.logger=INFO,CONSOLE -cp /$ZOOKEEPER_HOME/bin/../build/classes:/$ZOOKEEPER_HOME/bin/../build/lib/*.jar:/$ZOOKEEPER_HOME/bin/../lib/slf4j-log4j12-1.6.1.jar:/$ZOOKEEPER_HOME/bin/../lib/slf4j-api-1.6.1.jar:/$ZOOKEEPER_HOME/bin/../lib/netty-3.7.0.Final.jar:/$ZOOKEEPER_HOME/bin/../lib/log4j-1.2.16.jar:/$ZOOKEEPER_HOME/bin/../lib/jline-0.9.94.jar:/$ZOOKEEPER_HOME/bin/../zookeeper-3.4.6.jar:/$ZOOKEEPER_HOME/bin/../src/java/lib/*.jar:/$ZOOKEEPER_HOME/bin/../conf: -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain /$ZOOKEEPER_HOME/bin/../conf/zoo.cfg
Data directory layout:
[user@zk_server_1 version-2]$ ls /$ZOOKEEPER_DATA_DIR/
myid version-2 zookeeper_server.pid
[user@zk_server_1 version-2]$ ls /$ZOOKEEPER_DATA_DIR/version-2/
acceptedEpoch log.3e0d4bffd7 log.3e0d57d549 log.3e0d6255f7 log.3e0d6df7e9 log.3e0d790349 snapshot.3e0d4bffd5 snapshot.3e0d57d547 snapshot.3e0d6255f2 snapshot.3e0d6df7e7 snapshot.3e0d790347
currentEpoch log.3e0d4d1a04 log.3e0d58a803 log.3e0d63629b log.3e0d6f0d5c log.3e0d7a4523 snapshot.3e0d4d1a02 snapshot.3e0d58a801 snapshot.3e0d636299 snapshot.3e0d6f0d5a snapshot.3e0d7a4521
log.3e0d430aed log.3e0d4e724a log.3e0d59a04c log.3e0d64cc1f log.3e0d6ff7ba log.3e0d7bc17d snapshot.3e0d4e7248 snapshot.3e0d59a04a snapshot.3e0d64cc1d snapshot.3e0d6ff7b8 snapshot.3e0d7bc17b
log.3e0d43dadd log.3e0d4ff8ce log.3e0d5ad0a0 log.3e0d65fd11 log.3e0d714d93 snapshot.3e0d43dadb snapshot.3e0d4ff8cc snapshot.3e0d5ad09d snapshot.3e0d65fd0f snapshot.3e0d714d91
log.3e0d454827 log.3e0d5150a7 log.3e0d5bb95f log.3e0d67015a log.3e0d724f8d snapshot.3e0d454825 snapshot.3e0d5150a5 snapshot.3e0d5bb95b snapshot.3e0d670158 snapshot.3e0d724f8b
log.3e0d465a2e log.3e0d52156c log.3e0d5ce712 log.3e0d67fa44 log.3e0d731ca5 snapshot.3e0d465a2c snapshot.3e0d52156a snapshot.3e0d5ce710 snapshot.3e0d67fa42 snapshot.3e0d731ca3
log.3e0d47ac00 log.3e0d531018 log.3e0d5dae9f log.3e0d6932ab log.3e0d748b2d snapshot.3e0d47abfe snapshot.3e0d531016 snapshot.3e0d5dae9d snapshot.3e0d6932ad snapshot.3e0d748b2b
log.3e0d48746a log.3e0d548417 log.3e0d5f18d5 log.3e0d6a0150 log.3e0d75de64 snapshot.3e0d487468 snapshot.3e0d548415 snapshot.3e0d5f18d8 snapshot.3e0d6a014e snapshot.3e0d75de62
log.3e0d49eff7 log.3e0d55620a log.3e0d5fdcc3 log.3e0d6b5212 log.3e0d76e90e snapshot.3e0d49eff5 snapshot.3e0d556208 snapshot.3e0d5fdcc1 snapshot.3e0d6b5210 snapshot.3e0d76e90c
log.3e0d4af170 log.3e0d56bd2c log.3e0d60d48a log.3e0d6cc57c log.3e0d780e44 snapshot.3e0d4af16e snapshot.3e0d56bd2a snapshot.3e0d60d485 snapshot.3e0d6cc57a snapshot.3e0d780e42
$ echo mntr | nc localhost 2185
Example:
-bash-4.1$ echo stat|nc $zk_server_1 2181
Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT
Clients:
/zk_client_1:59407[1](queued=0,recved=55,sent=55)
/zk_client_2:40094[0](queued=0,recved=1,sent=0)
/zk_client_3:60926[1](queued=0,recved=115,sent=115)
/zk_client_4:59288[1](queued=0,recved=56,sent=56)
/zk_client_5:14155[1](queued=0,recved=115,sent=115)
/zk_client_6:18602[1](queued=0,recved=115,sent=115)
Latency min/avg/max: 0/0/146
Received: 6294
Sent: 6376
Connections: 6
Outstanding: 0
Zxid: 0x3e0d1cccd4
Mode: follower
Node count: 26394
Zxid Every change to the ZooKeeper state receives a stamp in the form of a zxid (ZooKeeper Transaction Id). This exposes the total ordering of all changes to ZooKeeper. Each change will have a unique zxid and if zxid1 is smaller than zxid2 then zxid1 happened before zxid2.
The ZooKeeper service can be monitored in one of two primary ways; 1) the command port through the use of 4 letter words and 2) JMX.
Every change to ZooKeeper data receives a unique zxid, and zxids are totally ordered: a change that happened earlier always has a smaller zxid than one that happened later.
ZooKeeper can be monitored in two ways: the four letter word commands on the client port, and JMX.
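Besides stat and mntr, other four letter words such as ruok (liveness check), conf (dump the server configuration) and cons (list connection details) can be sent the same way, for example:
$ echo ruok | nc localhost 2181
imok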
$ bin/zkCli.sh -server 127.0.0.1:2181
[zkshell: 0] help
ZooKeeper host:port cmd args
        get path [watch]
        ls path [watch]
        set path data [version]
        delquota [-n|-b] path
        quit
        printwatches on|off
        create path data acl
        stat path [watch]
        listquota path
        history
        setAcl path acl
        getAcl path
        sync path
        redo cmdno
        addauth scheme auth
        delete path [version]
        setquota -n|-b val path
Example:
[zkshell: 12] get /zk_test
my_data
cZxid = 5
ctime = Fri Jun 05 13:57:06 PDT 2009
mZxid = 5
mtime = Fri Jun 05 13:57:06 PDT 2009
pZxid = 5
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0
dataLength = 7
numChildren = 0
czxid
The zxid of the change that caused this znode to be created.
mzxid
The zxid of the change that last modified this znode.
pzxid
The zxid of the change that last modified children of this znode.
ctime
The time in milliseconds from epoch when this znode was created.
mtime
The time in milliseconds from epoch when this znode was last modified.
version
The number of changes to the data of this znode.
cversion
The number of changes to the children of this znode.
aversion
The number of changes to the ACL of this znode.
ephemeralOwner
The session id of the owner of this znode if the znode is an ephemeral node. If it is not an ephemeral node, it will be zero.
dataLength
The length of the data field of this znode.
numChildren
The number of children of this znode.
These are the attributes carried by every znode.
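The same metadata can also be read without the data payload, using the stat command in zkCli.sh (continuing the /zk_test example above):
[zkshell: 13] stat /zk_test
This prints the same cZxid, mZxid, ctime and related fields shown above, just without the my_data content.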
The Java client API entry point is the org.apache.zookeeper.ZooKeeper class.
The format of snapshot and log files does not change between standalone ZooKeeper servers and different configurations of replicated ZooKeeper servers. Therefore, you can pull these files from a running replicated ZooKeeper server to a development machine with a stand-alone ZooKeeper server for troubleshooting.
Using older log and snapshot files, you can look at the previous state of ZooKeeper servers and even restore that state. The LogFormatter class allows an administrator to look at the transactions in a log.
The format of the snapshot and log files is the same for standalone and replicated servers, so the data files from a production ensemble can easily be copied to a local machine to investigate problems or reproduce an error.
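For example, a transaction log copied from a server can be dumped in human-readable form with the LogFormatter class mentioned above (a sketch; the jar versions and the log file name are taken from the listings earlier and will differ on other installations):
$ java -cp zookeeper-3.4.6.jar:lib/slf4j-api-1.6.1.jar:lib/slf4j-log4j12-1.6.1.jar:lib/log4j-1.2.16.jar \
    org.apache.zookeeper.server.LogFormatter version-2/log.3e0d4bffd7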
A server might not be able to read its database and fail to come up because of some file corruption in the transaction logs of the ZooKeeper server. You will see some IOException on loading ZooKeeper database. In such a case, make sure all the other servers in your ensemble are up and working. Use "stat" command on the command port to see if they are in good health. After you have verified that all the other servers of the ensemble are up, you can go ahead and clean the database of the corrupt server. Delete all the files in datadir/version-2 and datalogdir/version-2/. Restart the server.
If some files in a ZooKeeper server's data directory become corrupted, the server may fail to start because it cannot read its database. In that case, first check that the other servers (a majority of the ensemble) are healthy; if they are, you can clear the data directory of the corrupted server and restart it, and it will automatically sync the latest data from the ensemble.
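A minimal sketch of that procedure, assuming the host names and dataDir from the sample configuration above:
$ echo stat | nc zoo2 2181             # confirm the healthy servers respond and report a Mode
$ echo stat | nc zoo3 2181
$ rm /var/lib/zookeeper/version-2/*    # on the corrupted server only
$ bin/zkServer.sh restart              # the server re-syncs its data from the current leader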
Although ZooKeeper performs very well by having clients connect directly to voting members of the ensemble, this architecture makes it hard to scale out to huge numbers of clients. The problem is that as we add more voting members, the write performance drops. This is due to the fact that a write operation requires the agreement of (in general) at least half the nodes in an ensemble and therefore the cost of a vote can increase significantly as more voters are added.
We have introduced a new type of ZooKeeper node called an Observer which helps address this problem and further improves ZooKeeper's scalability. Observers are non-voting members of an ensemble which only hear the results of votes, not the agreement protocol that leads up to them. Other than this simple distinction, Observers function exactly the same as Followers - clients may connect to them and send read and write requests to them. Observers forward these requests to the Leader like Followers do, but they then simply wait to hear the result of the vote. Because of this, we can increase the number of Observers as much as we like without harming the performance of votes.
Because writes in ZooKeeper require a vote among the servers, the cost of each vote grows as the ensemble grows. To address this, ZooKeeper introduced a new node type, the observer, which does not vote and only syncs data and serves client requests.
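Making a server an observer is a zoo.cfg change (a sketch based on the sample ensemble above, with zoo4 as a hypothetical observer host): the observer itself sets peerType=observer, and every server's configuration tags the observer's entry with :observer:
peerType=observer
server.4=zoo4:2888:3888:observer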
1. Why does a ZooKeeper ensemble consist of an odd number (2n+1) of nodes?
There are two reasons. The first is to avoid split brain: the cluster only works while at least n+1 nodes are alive, so out of 2n+1 nodes at most one partition can ever form a working cluster. The second is data reliability: once a write has been acknowledged by n+1 nodes it cannot be lost, which can be shown by contradiction: if such a write were lost, all n+1 of the nodes that acknowledged it would have to be unavailable, leaving at most n live nodes, in which case the cluster could not be serving at all, contradicting the assumption. An odd size is preferred because an even ensemble buys nothing extra: 5 servers (n = 2) tolerate 2 failures, while 6 servers still tolerate only 2 failures but add one more voter to every write.