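A while ago I looked at the S4 stream-computing engine, which uses ZooKeeper for cluster management, so I spent some time studying ZooKeeper. The aim was not to read all of the source code, but to understand its implementation mechanisms and principles and to be clear about basic usage. It is also groundwork for looking at distributed-computing products such as Hadoop and GridGain later on.
The first step was to collect study notes and summaries written by others, to get up to speed quickly.
Here are a few good articles: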
http://www.ibm.com/developerworks/cn/opensource/os-cn-zookeeper/ (what ZooKeeper can be used for)
http://zookeeper.apache.org/doc/r3.3.2/zookeeperOver.html (official documentation, a general overview of ZooKeeper)
After reading these two articles you should have an intuitive sense of what ZooKeeper is.
ZooKeeper features:
Unified namespace (Name Service)
Configuration push (Watch)
Cluster management (Group membership)
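ZooKeeper implements a data structure that resembles a file system, with paths such as /zookeeper/status. Each node in the tree is a znode.
The znode data model: each znode carries a stat structure with the following fields: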
czxid
The zxid of the change that caused this znode to be created.
mzxid
The zxid of the change that last modified this znode.
ctime
The time in milliseconds from epoch when this znode was created.
mtime
The time in milliseconds from epoch when this znode was last modified.
version
The number of changes to the data of this znode.
cversion
The number of changes to the children of this znode.
aversion
The number of changes to the ACL of this znode.
ephemeralOwner
The session id of the owner of this znode if the znode is an ephemeral node. If it is not an ephemeral node, it will be zero.
dataLength
The length of the data field of this znode.
numChildren
The number of children of this znode.
Note: zxid is the ZooKeeper Transaction Id. Every change to ZooKeeper state is stamped with a unique zxid, and if zxid a < zxid b, then a is guaranteed to have happened before b.
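How these counters move is easiest to see on a toy model. The sketch below is an in-memory illustration only, not ZooKeeper code: the class ZnodeStatSketch and its methods are hypothetical, and it tracks just a few of the fields listed above.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory toy model of znode bookkeeping; NOT the ZooKeeper implementation. */
final class ZnodeStatSketch {
    /** A subset of the stat fields listed above. */
    static final class Stat {
        long czxid, mzxid;
        int version, cversion, numChildren;
    }

    private final Map<String, Stat> nodes = new HashMap<>();
    private long lastZxid = 0; // every change is stamped with the next, larger zxid

    ZnodeStatSketch() {
        nodes.put("/", new Stat()); // the root always exists
    }

    /** Creates a znode; the parent must already exist. Returns the new node's stat. */
    Stat create(String path) {
        Stat parent = nodes.get(parentOf(path));
        if (parent == null || nodes.containsKey(path)) {
            throw new IllegalStateException("missing parent or node exists: " + path);
        }
        Stat s = new Stat();
        s.czxid = s.mzxid = ++lastZxid;
        nodes.put(path, s);
        parent.cversion++;    // the parent's child list changed
        parent.numChildren++;
        return s;
    }

    /** Changes a znode's data: bumps version and mzxid, leaves czxid untouched. */
    Stat setData(String path) {
        Stat s = nodes.get(path);
        if (s == null) {
            throw new IllegalStateException("no such node: " + path);
        }
        s.version++;
        s.mzxid = ++lastZxid;
        return s;
    }

    Stat stat(String path) {
        return nodes.get(path);
    }

    static String parentOf(String path) {
        int i = path.lastIndexOf('/');
        return i == 0 ? "/" : path.substring(0, i);
    }
}
```

Creating /app, then /app/config, then writing to /app/config leaves /app with cversion 1 and numChildren 1, while /app/config keeps the czxid it was created with and gets a larger mzxid and version 1.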
To work with this tree structure, the client uses the following API:
String create(String path, byte data[], List<ACL> acl, CreateMode createMode)
void create(String path, byte data[], List<ACL> acl, CreateMode createMode, StringCallback cb, Object ctx)
void delete(String path, int version)
void delete(String path, int version, VoidCallback cb, Object ctx)
Stat setData(String path, byte data[], int version)
void setData(String path, byte data[], int version, StatCallback cb, Object ctx)
Stat setACL(String path, List<ACL> acl, int version)
void setACL(String path, List<ACL> acl, int version, StatCallback cb, Object ctx)
Stat exists(String path, Watcher watcher)
Stat exists(String path, boolean watch)
void exists(String path, Watcher watcher, StatCallback cb, Object ctx)
void exists(String path, boolean watch, StatCallback cb, Object ctx)
byte[] getData(String path, Watcher watcher, Stat stat)
byte[] getData(String path, boolean watch, Stat stat)
void getData(String path, Watcher watcher, DataCallback cb, Object ctx)
void getData(String path, boolean watch, DataCallback cb, Object ctx)
List<String> getChildren(String path, Watcher watcher)
List<String> getChildren(String path, boolean watch)
void getChildren(String path, Watcher watcher, ChildrenCallback cb, Object ctx)
void getChildren(String path, boolean watch, ChildrenCallback cb, Object ctx)
List<String> getChildren(String path, Watcher watcher, Stat stat)
List<String> getChildren(String path, boolean watch, Stat stat)
void getChildren(String path, Watcher watcher, Children2Callback cb, Object ctx)
void getChildren(String path, boolean watch, Children2Callback cb, Object ctx)
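Note: every operation has a synchronous and an asynchronous form, and the read operations (exists, getData, getChildren) additionally take either an explicit watcher or the default watcher, which gives four variants each. The default watcher is specified when the client is created: ZooKeeper zk = new ZooKeeper(serverList, sessionTimeout, watcher). If a read method that takes boolean watch is passed true, the default watcher is registered as the watch for the event of interest; if it is passed false, no watch is registered.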
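The relationship between the boolean and the Watcher variants can be pictured with a tiny model. This is a sketch under assumptions, not the real client: WatchFlagSketch is a hypothetical class, and a java.util.function.Consumer stands in for the Watcher interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy model of how a read call picks its watcher; NOT the real client. */
final class WatchFlagSketch {
    private final Consumer<String> defaultWatcher;                 // given at construction time
    final List<Consumer<String>> registered = new ArrayList<>();   // watches set so far

    WatchFlagSketch(Consumer<String> defaultWatcher) {
        this.defaultWatcher = defaultWatcher;
    }

    /** Explicit-watcher variant: null means "register nothing". */
    void exists(String path, Consumer<String> watcher) {
        if (watcher != null) {
            registered.add(watcher);
        }
    }

    /** Boolean variant: true selects the default watcher, false selects none. */
    void exists(String path, boolean watch) {
        exists(path, watch ? defaultWatcher : null);
    }
}
```

In this model the boolean form is only a shorthand: passing true is the same as passing the watcher that was handed to the constructor.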
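CreateMode has the following values:
PERSISTENT (persistent; unlike EPHEMERAL, the znode does not disappear when the client session is closed or expires)
PERSISTENT_SEQUENTIAL
EPHEMERAL (short-lived; its lifetime is tied to the client session, and the znode disappears once the session is closed or expires)
EPHEMERAL_SEQUENTIAL (SEQUENTIAL means a monotonically increasing sequence number is appended to the node name)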
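The effect of the two flags can be sketched with an in-memory model. This is an illustration, not ZooKeeper code: CreateModeSketch is a hypothetical class, it uses two booleans instead of the CreateMode enum, and it keeps a single sequence counter, whereas ZooKeeper numbers the children of each parent separately. The 10-digit zero-padded suffix mirrors the naming ZooKeeper uses for sequential nodes.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the four create modes; NOT the ZooKeeper implementation. */
final class CreateModeSketch {
    // path -> owning session id; 0 marks a persistent node (compare ephemeralOwner).
    // Assumption: real session ids are never 0.
    private final Map<String, Long> owners = new TreeMap<>();
    private int nextSeq = 0; // simplification: one counter instead of one per parent

    /** Returns the actual path, which differs from the requested one for sequential modes. */
    String create(String path, long sessionId, boolean ephemeral, boolean sequential) {
        String actual = sequential ? path + String.format("%010d", nextSeq++) : path;
        owners.put(actual, ephemeral ? sessionId : 0L);
        return actual;
    }

    /** Closing or expiring a session removes every ephemeral node it owns. */
    void closeSession(long sessionId) {
        owners.values().removeIf(owner -> owner == sessionId);
    }

    boolean exists(String path) {
        return owners.containsKey(path);
    }
}
```

Two ephemeral sequential creates of /lock- by sessions 7 and 8 yield /lock-0000000000 and /lock-0000000001; closing session 7 removes only the first, and a persistent /cfg created by session 7 survives.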
AsyncCallback is the asynchronous callback interface; it comes in several kinds, depending on the type of operation:
StringCallback
VoidCallback
StatCallback
DataCallback (getData requests)
ChildrenCallback
Children2Callback
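For ACLs, this article is a good introduction: http://rdc.taobao.com/team/jm/archives/947
To keep clients consistent with the data, ZooKeeper offers the Watcher asynchronous callback interface: changes to znodes on the server are delivered to the client as events. This is essentially a reverse-push mechanism that lets the client react promptly, for example by refreshing its list of available backend servers.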
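One detail worth keeping in mind is that a watch fires only once and carries no data, so the client has to read again and register a new watch each time. The following in-memory sketch illustrates that pattern for the server-list case; it is not ZooKeeper code, and WatchPushSketch and ServerListCache are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model of one-shot child watches; NOT the ZooKeeper implementation. */
final class WatchPushSketch {
    private final TreeSet<String> children = new TreeSet<>();
    private final List<Runnable> childWatches = new ArrayList<>();

    /** Reads the children and, if a watch is given, arms it for the next change only. */
    List<String> getChildren(Runnable watch) {
        if (watch != null) {
            childWatches.add(watch);
        }
        return new ArrayList<>(children);
    }

    void addChild(String name) {
        if (children.add(name)) fire();
    }

    void removeChild(String name) {
        if (children.remove(name)) fire();
    }

    /** Fires each armed watch once, then forgets it (watches are one-shot). */
    private void fire() {
        List<Runnable> armed = new ArrayList<>(childWatches);
        childWatches.clear();
        armed.forEach(Runnable::run);
    }

    /** Client side: caches the server list and re-arms the watch on every event. */
    static final class ServerListCache {
        private final WatchPushSketch zk;
        volatile List<String> servers;
        int refreshes = 0;

        ServerListCache(WatchPushSketch zk) {
            this.zk = zk;
            refresh();
        }

        private void refresh() {
            refreshes++;
            servers = zk.getChildren(this::refresh); // read and re-register in one call
        }
    }
}
```

The cache never polls: it reads once at start-up and once per change, which is the saving the push model is meant to provide.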
These articles describe Watcher/Callback in some detail and are worth a look:
http://luzengyi.blog.163.com/blog/static/529188201064113744373/
http://luzengyi.blog.163.com/blog/static/529188201061155444869/
To get a better feel for where Watcher is used, look at how distributed Barrier, Queue, and Lock synchronization can be built on the Watcher mechanism.
Barrier example:
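import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class Barrier implements Watcher {

    private static final String addr = "10.20.156.49:2181";
    // Created before the ZooKeeper client so that process() never sees a null lock.
    private final Object mutex = new Object();
    private ZooKeeper zk = null;
    private int size = 0;
    private String root;

    public Barrier(String root, int size) {
        this.root = root;
        this.size = size;
        try {
            zk = new ZooKeeper(addr, 10 * 1000, this);
            // The parent of root (for example /test) must already exist.
            Stat s = zk.exists(root, false);
            if (s == null) {
                zk.create(root, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Any event wakes the waiting thread, which then re-reads the children.
    public void process(WatchedEvent event) {
        synchronized (mutex) {
            mutex.notifyAll();
        }
    }

    // Registers this participant and blocks until `size` participants are present.
    public boolean enter(String name) throws Exception {
        zk.create(root + "/" + name, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        while (true) {
            synchronized (mutex) {
                // watch=true registers the default watcher (this) for child changes
                List<String> list = zk.getChildren(root, true);
                if (list.size() < size) {
                    mutex.wait();
                } else {
                    return true;
                }
            }
        }
    }

    // Removes this participant and blocks until every participant has left.
    public boolean leave(String name) throws KeeperException, InterruptedException {
        // Version 0 matches because the node's data was never modified.
        zk.delete(root + "/" + name, 0);
        while (true) {
            synchronized (mutex) {
                List<String> list = zk.getChildren(root, true);
                if (list.size() > 0) {
                    mutex.wait();
                } else {
                    return true;
                }
            }
        }
    }
}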
Test code:
import java.util.Random;

public class BarrierTest {

    public static void main(String args[]) throws Exception {
        for (int i = 0; i < 3; i++) {
            Process p = new Process("Thread-" + i, new Barrier("/test/barrier", 3));
            p.start();
        }
    }
}

// Note: within this file the name Process hides java.lang.Process.
class Process extends Thread {

    private String name;
    private Barrier barrier;

    public Process(String name, Barrier barrier) {
        this.name = name;
        this.barrier = barrier;
    }

    @Override
    public void run() {
        try {
            barrier.enter(name);
            System.out.println(name + " enter");
            Thread.sleep(1000 + new Random().nextInt(2000));
            barrier.leave(name);
            System.out.println(name + " leave");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
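With this Barrier, different tasks can be synchronized with one another. It relies mainly on the reverse push of the Watcher mechanism, which spares the client a polling loop: it only has to respond once for each event.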
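The check-then-wait loop used by enter and leave can also be tried without a ZooKeeper server. The following is a minimal single-JVM sketch, not ZooKeeper code: a HashSet stands in for the children of the barrier znode and notifyAll stands in for the watch event. The class LocalDoubleBarrier and the helper runDemo are made up for illustration. One deliberate difference from the example above: entry is released by a ready flag, an idea borrowed from the double-barrier recipe on the official recipes page, so that a slow thread cannot miss the moment at which everyone had arrived.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

/** Single-use, in-process double barrier; a stand-in for the znode-based one. */
final class LocalDoubleBarrier {
    private final Set<String> members = new HashSet<>(); // plays the role of the child znodes
    private final int size;
    private boolean ready = false; // plays the role of a "ready" znode

    LocalDoubleBarrier(int size) {
        this.size = size;
    }

    /** Blocks until `size` participants have entered. */
    synchronized void enter(String name) throws InterruptedException {
        members.add(name);
        if (members.size() >= size) {
            ready = true;
            notifyAll(); // plays the role of the watch event
        }
        while (!ready) {
            wait();
        }
    }

    /** Blocks until every participant has left. */
    synchronized void leave(String name) throws InterruptedException {
        members.remove(name);
        if (members.isEmpty()) {
            notifyAll();
        }
        while (!members.isEmpty()) {
            wait();
        }
    }

    /** Runs n threads through enter and leave; returns how many completed both. */
    static int runDemo(int n) {
        LocalDoubleBarrier barrier = new LocalDoubleBarrier(n);
        AtomicInteger done = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String name = "worker-" + i;
            Thread t = new Thread(() -> {
                try {
                    barrier.enter(name);
                    barrier.leave(name);
                    done.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return done.get();
    }
}
```

Calling LocalDoubleBarrier.runDemo(3) starts three workers; none gets past enter until all three have arrived, and none returns from leave until all three have left.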
I won't go into detail here; Taobao has several articles that already cover this thoroughly.
http://rdc.taobao.com/blog/cs/?p=162 (Paxos implementation)
http://rdc.taobao.com/blog/cs/?p=261 (introduction to the Paxos algorithm, continued)
http://rdc.taobao.com/team/jm/archives/448 (ZooKeeper code walkthrough)
A ZooKeeper cluster classifies its servers into the following roles:
Leader
Follower
Observer
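Notes:
1. The Leader and the Followers are determined by an election algorithm. For a related technique, see the Leader Election section of http://zookeeper.apache.org/doc/r3.3.2/recipes.html, which describes electing a leader among clients using znodes.
2. Observers exist mainly to improve ZooKeeper's performance. The main difference between an observer and a follower is that an observer does not take part in the leader agreement vote. It serves read requests itself and forwards writes to the leader, much like a slave handling reads in a master/slave setup. (http://zookeeper.apache.org/doc/r3.3.2/zookeeperObservers.html)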
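An observer is configured by setting peerType=observer in its own configuration file and by appending :observer to its server line in every server's configuration file:
server.1=localhost:2181:3181:observer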
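Why this helps performance comes down to how the majority is counted: only the leader and the followers vote, so adding observers adds read capacity without raising the number of acknowledgements a write needs. A minimal sketch of that arithmetic, with the hypothetical helper class QuorumSketch:

```java
/** Majority arithmetic for the voting servers; observers are deliberately not counted. */
final class QuorumSketch {
    /** Smallest number of votes that forms a majority of the voting servers. */
    static int majority(int votingServers) {
        return votingServers / 2 + 1;
    }

    /** True if a proposal acknowledged by `acks` voters has reached a majority. */
    static boolean committed(int acks, int leaderAndFollowers) {
        return acks >= majority(leaderAndFollowers);
    }
}
```

With three voting servers a write needs two acknowledgements, and it still needs only two if any number of observers is added to the ensemble.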
The role a server is currently playing can be checked with the stat four-letter command:
[ljh@ccbu-156-49 bin]$ echo stat | nc localhost 2181
Zookeeper version: 3.3.3--1, built on 06/24/2011 13:12 GMT
Clients:
 /10.16.4.30:34760[1](queued=0,recved=632,sent=632)
 /127.0.0.1:43626[0](queued=0,recved=1,sent=0)
 /10.16.4.30:34797[1](queued=0,recved=2917,sent=2917)

Latency min/avg/max: 0/0/33
Received: 3552
Sent: 3551
Outstanding: 0
Zxid: 0x200000003
Mode: follower   ## current mode
Node count: 8
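The official documentation lists several application scenarios that use ZooKeeper to provide distributed locking and thereby achieve distributed consistency.
Some typical scenarios: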
Barrier
Queue
Lock
2PC
See: http://zookeeper.apache.org/doc/r3.3.2/recipes.html
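As one concrete case, the Lock recipe on that page has each client create an ephemeral sequential child under the lock node; the client whose child has the lowest sequence number holds the lock, and every other client watches only the child just ahead of its own. The decision step can be sketched as a pure function. LockOrderSketch and nodeToWatch are hypothetical names, and the sketch assumes all children share the same name prefix so that the zero-padded suffixes sort correctly as strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Decision step of the lock recipe, without any ZooKeeper calls. */
final class LockOrderSketch {
    /**
     * Returns null if `mine` holds the lock (lowest sequence number),
     * otherwise the child just ahead of it, which is the one to watch.
     */
    static String nodeToWatch(List<String> children, String mine) {
        List<String> sorted = new ArrayList<>(children);
        Collections.sort(sorted); // same prefix + zero padding: string order equals sequence order
        int i = sorted.indexOf(mine);
        if (i < 0) {
            throw new IllegalArgumentException(mine + " is not among the children");
        }
        return i == 0 ? null : sorted.get(i - 1);
    }
}
```

Watching only the predecessor, rather than the lock node itself, means that releasing the lock wakes a single waiter instead of all of them.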
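ZooKeeper znodes are mostly manipulated through the API and the console, and there is no particularly convenient user interface. I came across a tool written by 伯巖 at Taobao that makes it easier to browse the information stored in ZooKeeper.
The tool is written mainly in node.js (quite popular lately), which advertises a non-blocking API. It is built on Google's V8 (the JavaScript engine of Chrome, written in C++): node.js programs are written in JavaScript, and V8 compiles them to native machine code.
As for the advertised non-blocking I/O, as you might guess it rests on I/O multiplexing, that is, the select/poll family of models on Linux (epoll in practice). If you are interested, visit the node.js website and download it to try out.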
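The readiness-based model behind select/poll can be shown in a few lines. The sketch below uses Java NIO's Selector as an analogue, not node.js itself: a channel is registered, the program blocks in select() until the channel is readable, and only then reads. SelectLoopSketch and roundTrip are hypothetical names, and the sketch assumes messages shorter than 256 bytes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.charset.StandardCharsets;

/** Minimal select-style loop over an in-process pipe. */
final class SelectLoopSketch {
    /** Writes `message` into a pipe and reads it back once the selector reports readiness. */
    static String roundTrip(String message) {
        try (Selector selector = Selector.open()) {
            Pipe pipe = Pipe.open();
            try (Pipe.SinkChannel sink = pipe.sink();
                 Pipe.SourceChannel source = pipe.source()) {
                source.configureBlocking(false); // required before registering with a selector
                source.register(selector, SelectionKey.OP_READ);

                sink.write(ByteBuffer.wrap(message.getBytes(StandardCharsets.UTF_8)));

                selector.select(); // blocks until at least one registered channel is ready
                StringBuilder out = new StringBuilder();
                for (SelectionKey key : selector.selectedKeys()) {
                    if (key.isReadable()) {
                        ByteBuffer buf = ByteBuffer.allocate(256);
                        source.read(buf);
                        buf.flip();
                        out.append(StandardCharsets.UTF_8.decode(buf));
                    }
                }
                return out.toString();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A single thread can register many channels with one selector in the same way, which is how an event loop serves many connections without blocking on any one of them.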
Documentation: http://www.blogjava.net/killme2008/archive/2011/06/06/351793.html
Code: https://github.com/killme2008/node-zk-browser
After fetching the source with git, install the node.js modules it needs: express, express-namespace, and zookeeper. node.js comes with a convenient package manager, npm, similar to rpm on Red Hat or apt-get on Ubuntu.
Installing a module:
npm install -g express
A few screenshots of the interface: