5105 pa2 Distributed Hash Table based on Chord

 

 


 

1 Design document

1.1 System overviewhtml

We implemented a Book Finder System using a distributed hash table (DHT) based on the Chord protocol. Using this system, the client can set and get the book’s information (title and genre for simplicity) using the DHT. The whole system is composed of 1 supernode, 1 client (or more clients), and several compute nodes.java

 

The client will deal with Set and Get request. It will get a node address from the supernode first. Then it sends the request to the node via Thrift.node

The supernode could listen to requests from the client, and record the information of nodes. When a node wants to join the DHT, it contacts the SuperNode using Thrift. The SuperNode will then return one of the nodes information. The client will contact the SuperNode, and SuperNode will return node information randomly chosen.算法

The nodes will store the book information and keep the DHT table. Each node will maintain a predecessor pointer, a successor pointer, and a FingerTable for fast searching in chord ring. The FingerTable will be updated when a new node is added into the system. Also, the data will be stored on different node based on the hash value of BookTitle. When client do a get/set operation, the operation will be forwarded to appropriate node based on searching result in FingerTable.數據結構

 

 

pa2要求實現一個分佈式key-value hash table,存儲圖書名和對應的類別供用戶查詢(就set、get兩種操做)。DHT基於Chord協議。Chord的原理能夠參考https://program-think.blogspot.com/2017/09/Introduction-DHT-Kademlia-Chord.html,這裏也介紹一下:併發

DHT是去中心化的,節點數量也是會動態變化的(但做業裏不考慮這個)。在一致性哈希中,咱們把哈希值惟一肯定,以後映射到一個圓環中。 那麼,當分佈式出現崩潰或新增節點的狀況時,只會影響圓環中的一部分。app

在Chord協議中,全部的node(每臺node有一個unique node ID)和數據(<key=hash(DAT), val=DAT>)都用m位的哈希函數映射到圓環上(圓環表示chord空間。設M位的哈希函數,那麼環上的取值範圍就是0 ~ (2^m)-1)。每一個數據都存儲到它順時針方向的第一個node上。每一個node都有指針指向上一個/下一個的ip地址(相似雙向循環鏈表)。如圖所示:dom

按照目前這個思路,設client鏈接到了某個node,想set或get一條數據,就直接順着這個環遍歷node就能夠了。可是一個一個遍歷太慢啦。有什麼方法能夠跳過一些確定找不到所需數據的node呢?分佈式

FingerTable就是這樣一個路由算法,能夠實如今log(N)的時間內找到目標節點。「Finger Table」是一個列表,最多包含 m 項(m 就是散列值的比特數),每一項都是節點 ID。假設當前節點的 ID 是 n,那麼表中第 i 項的值是:successor( (n + 2^i) mod 2^m ) 。在查找的時候,當收到請求(key)時,就到「Finger Table」中找到【最大的且不超過 key】的那一項,而後把 key 轉發給這一項對應的節點。以下圖就是m=7時的一個DHT(範圍0-127),以及Node #80的Finger Table:函數

  

下面來看看具體實現叭

整個系統由若干client,一個supernode,若干個node組成。You will need to implement multi-threaded server for a node as there can be multiple clients in the system at the same time.

  • SuperNode負責存儲全部node的信息和generate node ID,在建立DHT時全部node都要聯繫SuperNode來得到node ID等信息。
  • 每一個node維護本身的FingerTable,存儲本身負責的數據,而且有successor, predecessor兩個指針。node還負責與client進行交互。若client要找的數據不在本地,node會查FingerTable表,並把請求forward到對應的node。
  • client進行操做時,首先聯繫SuperNode,由SuperNode返回一個node。而後client鏈接這個node進行後續的操做。

 

 

1.2 Assumptions

We made these assumptions in this system:

  • Each Node can either run on same or different machine on its own port.
  • More than 2 Thrift Interface files are needed (for SuperNode and Nodes).
  • The Nodes will act as client of SuperNode for joining phase for forming DHT and will act as a server for handling requests from the client.
  • SuperNode should not maintain any state about the DHT (only the list of nodes)
  • The genre will be updated if a client sets a different genre for a book title.
  • The system does not need to be persistent in this project.
  •  The number of nodes for DHT can be set when the SuperNode starts as a parameter. (This will let the SuperNode know the DHT is ready)
  • No node failures or nodes leaving the DHT after they've joined.

 

1.3 Component design

1.3.1 Common part

In the common part, we defined a Thrift struct Address, which is used to describe a node (IP, port, NodeID).

 

1.3.2 Client

The client accepts a task from users.

client啓動時rpc到SuperNode,得到一個node,以後用戶的set / get操做都rpc到這個node上進行。

To initiate it, input  「java -cp ".:/usr/local/Thrift/*" Client <serverIP> <serverPort>」, for example,  「java -cp ".:/usr/local/Thrift/*" Client csel-kh1250-01 9090」.

To set a book with the genre, input Set 「Book_title」 「Genre」 (「」 are required), for example, Set 「Harry Potter」 「Magic」. The server will return the trace and the node who stores this book.

To reset a book with a genre, just input Set 「Book_title」 「Genre」 (「」 are required) again, for example, Set 「Harry Potter」 「Magic and Children」.

To set books and genres with a file. Input Set 「filename」, for example, Set 「../shakespeares.txt」.

To get books’ genres, input Get 「Book_title」, for example, Get 「Harry Potter」. If the book exists, the server will return trace with the node who stores this book, and its genre. If the book does not exist, the server will return an error message.

To quit the program, type in Quit.

 

1.3.3 Supernode

To initiate the supernode, input java -cp ".:/usr/local/Thrift/*" SuperNode Port NodeNumber, for example, 「java -cp ".:/usr/local/Thrift/*" SuperNode 9090 5」.

Also, we support user-defined chord length. The command-line will be java -cp ".:/usr/local/Thrift/*" SuperNode Port NodeNumber ChordLen. Then the range of chord ring will be 0-(2^ChordLen). ChordLen will be 7 if it is not defined explicitly.

SuperNode上維護瞭如下數據結構:

  • (CopyOnWriteArrayList至關於一個併發性能更好的vector http://www.javashuo.com/article/p-uesorcoi-gb.html。Vector是增刪改查方法都加了synchronized,保證同步,可是每一個方法執行的時候都要去得到鎖,性能就會大大降低。而CopyOnWriteArrayList 只是在增刪改上加鎖,可是讀不加鎖,在讀方面的性能就好於Vector,CopyOnWriteArrayList支持讀多寫少的併發狀況。)
  • CopyOnWriteArrayList  CurrentNodes:存儲全部的node列表
  • CopyOnWriteArrayList  PossibleKeys:存儲全部可用的node ID
  • ConcurrentHashMap  NodeKeys:存儲全部的node和對應的ID

The supernode implements the following methods:

1. Join(IP, Port): When a node wants to join the DHT, the SuperNode will then return one of nodes information to implement updatedht process. If the SuperNode is busy in join process of another node, it will return a 「NACK」 to the requesting node and let the node wait.        Node加入時會rpc調用SuperNode的Join()。Join()會返回以前加入過的node中最新的那個,新node就加入到這個node的後面(若是新node是第一個進來的,返回"Empty")。

2. PostJoin(IP, Port): After the node is done to join the DHT, it should notify the SuperNode about it. The supernode will then add the node information to the nodelist and allow other nodes to join the DHT.       (node在Join()以後會進行一些操做來加入chord,加入完成後rpc調用SuperNode的PostJoin()。SuperNode會記下這個節點的ip地址、id等信息,而後返回true結束加入過程。)

另外,當一個node執行了Join()後,SuperNode上會加鎖,其餘node試圖Join()時就只能忙等(不停收到"NACK")。PostJoin()以後纔會解鎖。這樣保證同時只有一個節點在加入。

3. GetNode(): The client will contact the SuperNode and it will return node information randomly chosen. The client may contact SuperNode only once when the client is running or every time when the client sends a request for testing purpose.       Client經過rpc調用SuperNode的GetNode()來獲取一個node

4. GetPossibleKey(), after the node call join, it can call the GetPossibleKey() to get the key that can be assigned to it.       node在Join()以後,調用該rpc函數來獲取本身的node ID

5. GetHashPara(), the node can get the parameter of hash function from it.       

 

1.3.4 Node

To initiate the node, input  「java -cp ".:/usr/local/Thrift/*" Node <serverIP> <serverPort> <nodePort>」, for example,  「java -cp ".:/usr/local/Thrift/*" Node csel-kh1250-01 9090 9092」.

To make the supernode ready for client, it should get NodeNumber Nodes (For example, 5 postjoin calls).

Node上維護瞭如下數據結構:

  • Address  Successor:指向該node的後繼node
  • Address  Predecessor:指向該node的前驅node
  • ConcurrentHashMap Dict:本地的Hash Table
  • FingerItem[] FingerTable:本地的FingerTable

其中FingerItem類包括以下信息:

public class FingerItem{    //For FingerTable[i],
    Address Node;           // Node = Successor(FingerTable[i].start)
    long ivx; long ivy;     // [ivx, ivy) = [FingerTable[i].start, FingerTable[i+1].start)
    long start;             //  start = (NodeInfo.ID+(long)(Math.pow(2,i-1))) % (long)(Math.pow(2,ChordLen)))
    public FingerItem(Address _Node, long _ivx, long _ivy, long _start) {
        Node=_Node;
        ivx=_ivx;
        ivy=_ivy;
        start=_start;
    }
    public static long FingerCalc(long _n, int _k, int _m) {
        if (_k == _m + 1)
            return (_n);
        else
            return ((_n + (long) (Math.pow(2, _k - 1))) % (long) (Math.pow(2, _m)));
    }
    public void PrintItem(int i){
        System.out.println("FingerItem"+i+" "+Node+" "+ivx+"_"+ivy+" : "+start);
    }
}

The Node will contain an interface for the client and other nodes. Following calls to the Node are implemented:

1. Set(Book_title, Genre, chain): When a client wants to set a book title and a genre, it contacts the Node using Thrift. The node will check whether it needs to store information locally or not. If it is not the node for the book title, it will forward the request to other nodes recursively. The chain is used for tracing involved nodes.        首先對key進行hash。若是該key存在本地,則直接set本地。不然調用FindSuccessor(hash_of_key)找到負責該hash值的node(也就是該點的successor,順時針方向第一個),並rpc調用該node上的Set()

2. Get(Book_title, chain): When a client wants to know a genre with a book title, it contacts the Node using Thrift. The Node will check whether it is the node for the book title. If not, it will forward the request to other nodes recursively. The chain is used for tracing involved nodes.        首先對key進行hash。若是該key存在本地,則直接get本地。不然調用FindSuccessor(hash_of_key)找到負責該hash值的node(也就是該點的successor,順時針方向第一個),並rpc調用該node上的Get()

3. UpdateDHT(Address NN, int IDX, List<Long> chain): On the new node, The IDX item in FingerTable will be updated to node NN. Then the new node will contact the involved nodes in the finger table to let them update DHT. When the node finishes calling to all nodes, it will let the SuperNode know that it is done to join by calling PostJoin().        

4. HashKey(String key, int MODbit): calculate the hash value of key string.        

5. FindSuccessor(ID): Find successor of an arbitrary point ID in chord ring.        調用FindPredecessor(ID)找到ID在chord環上的Predecessor,而後rpc調用該Predecessor上的GetSuccessor(),這樣也就找到了ID點的successor。

6. FindPredecessor(MID): Find predecessor of an arbitrary point ID in chord ring.        找到MID在chord環上的Predecessor。若當前node不負責MID所在的這一區域(不知足node.ID < MID < node.Successor.ID),則須要FindClosetPrecedingFinger(MID)找到MID的前驅,而後在這個node上再找。

7. FindClosetPrecedingFinger(MID): Find the closet predecessor of an arbitrary point #MID on chord ring, by finding in Finger Table        找到在當前node的FingerTable中,距離MID最近的Predecessor(node.ID < FingerTable[i].ID < MID)。以下圖:

8. InSet(long _x, long _i, long _y, int ll, int rr): Check whether _i is in the (_x, _y) set in chord ring clockwise order.        檢查在Chord環上是否知足tx<ti<ty

9. InitNode(): Initialize all the variables in the new node. Find its predecessor and successor, and call UpdateDHT().        調用SuperNode的Join()通知當前node的加入,等SuperNode通知能夠加入後,SuperNode會返回一個node的信息,表示當前node加到這個node的後面。而後加入當前node:初始化Predecessor、successor等信息(相似雙向循環鏈表中加入節點),初始化本身的FingerTable(根據自身的前驅的FingerTable信息),而後rpc調用前驅節點的UpdateDHT()來更新它們的FingerTable。最後調用PostJoin()通知SuperNode本身加入完成了。

10. GetSuccessor(), GetPredecessor(): 返回當前節點的Successor/Predecessor.        

11. SetSuccessor(), SetPredecessor(): 設置當前節點的Successor/Predecessor.       


2 User document

2.1 How to compile

We have written a make script to compile the whole project.

cd pa2/src

./make.sh

2.2 How to run the project

1.    Run supernode
cd pa2/src/
java -cp ".:/usr/local/Thrift/*" SuperNode Port NodeNumber
        <Port>: The port of super
        <NodePort>: The port of node
            <NodeNumber>: Number of nodes
    Eg. java -cp ".:/usr/local/Thrift/*" SuperNode 9090 5
      2.  Run compute node
    Start compute node on 5 different machines.
    cd pa2/src/
    java -cp ".:/usr/local/Thrift/*" Node <serverIP> <serverPort> <nodePort>
        <ServerIP>: The ip address of server
        <ServerPort>: The port of server
        <NodePort>: The port of node
    Eg:    java -cp ".:/usr/local/Thrift/*" Node csel-kh1250-01 9090 9092
      3. Run client
    cd pa2/src/
    java -cp ".:/usr/local/Thrift/*" Client <serverIP> <serverPort>
<ServerIP>: The ip address of supernode
        <ServerPort>: The port of supernode
E.g. java -cp ".:/usr/local/Thrift/*" Client csel-kh1250-01 9090
Set "Harry Potter" "magic"
Get 「Harry Potter」
Get 「Harry」
Set 「Harry Potter」 「child」
Set 「../shakespeares.txt」

2.3 What will happen after running

            The results and log(node searching trace) will be output on the screen. You will be asked to input the next command.


3 Testing Document

3.1 Testing Environment

Machines:

We use 7 machines to perform the test, including 1 supernode machine (csel-kh1250-01), 5 computeNode machines (csel-kh4250-03, csel-kh4250-01, csel-kh4250-22, csel-kh4250-25, csel-kh4250-34), 1 client machine (csel-kh1250-03).

 

Test Set:

We use a test set (../shakespeares.txt) including 42 items, totally 1.1 kB. The data uses a shared directory via NSF.

 

Logging:

Trace Logging is output on the window.

 

Testing Settings:

We test positive test case – where the book title is present in the DHT and negative test case – where the book title is not present in the DHT. And also other tests.

3.2 Join and PostJoin

The successfully joined node will print its ID, successor, predecessor, and FingerTable as follows:

NODE INFO: Successfully joined SuperNode. My ID is 100, and ChordLen=7
Returned JointRes==csel-kh4250-25.cselabs.umn.edu : 9092
My Successor is Address(ip:csel-kh4250-03.cselabs.umn.edu, port:9092, ID:0) . My Predecessor is Address(ip:csel-kh4250-25.cselabs.umn.edu, port:9092, ID:75)
__________________________________________________
| Print Finger Table                             |
__________________________________________________
|| 1 || Address(ip:csel-kh4250-03.cselabs.umn.edu, port:9092, ID:0) || 101 ||
|| 2 || Address(ip:csel-kh4250-03.cselabs.umn.edu, port:9092, ID:0) || 102 ||
|| 3 || Address(ip:csel-kh4250-03.cselabs.umn.edu, port:9092, ID:0) || 104 ||
|| 4 || Address(ip:csel-kh4250-03.cselabs.umn.edu, port:9092, ID:0) || 108 ||
|| 5 || Address(ip:csel-kh4250-03.cselabs.umn.edu, port:9092, ID:0) || 116 ||
|| 6 || Address(ip:csel-kh4250-01.cselabs.umn.edu, port:9092, ID:25) || 4 ||
|| 7 || Address(ip:csel-kh4250-22.cselabs.umn.edu, port:9092, ID:50) || 36 ||
__________________________________________________

 

When another new node is added and the Finger Table of this node is updated, it will print its updated Finger Table again.

Supernode output:

Postjoin csel-kh4250-03.cselabs.umn.edu:9092
Postjoin csel-kh4250-01.cselabs.umn.edu:9092
Postjoin csel-kh4250-22.cselabs.umn.edu:9092
Postjoin csel-kh4250-25.cselabs.umn.edu:9092
Postjoin csel-kh4250-34.cselabs.umn.edu:9092

 

Stability Test: Join multiple nodes at the same time:

Node output:

NODE INFO: Try joining SuperNode…
NODE INFO: Try joining SuperNode…
NODE INFO: Try joining SuperNode…
NODE INFO: Try joining SuperNode…

In this case, only one node will join first, while other nodes will wait for the change for joining.

 

 

3.2 Set Book and Genre

1) With Book and Genre as input

Set "Harry Potter" "magic"

[Harry Potter is set on machine 0 with hash value= 120]Trace(75, 100, 0, )
Succeed to set: Harry Potter magic

When doing Set/Get task, the client command line will return its result, and all the nodes involved (including the node directly communicate with client, the nodes passed when forwarding, and the node physically saving this record). All the visited nodes will be successively printed in Trace() field.

 

2) With Filename

Set "../shakespeares.txt"

[All's Well That Ends Well is set on machine 50 with hash value= 35]Trace(75, 25, 50, )
Succeed to set: All's Well That Ends Well Comedies
[As You Like It is set on machine 25 with hash value= 10]Trace(75, 0, 25, )
Succeed to set: As You Like It Comedies
[The Comedy of Errors is set on machine 50 with hash value= 41]Trace(75, 25, 50, )
Succeed to set: The Comedy of Errors Comedies
[Love's Labor's Lost is set on machine 25 with hash value= 14]Trace(75, 0, 25, )
Succeed to set: Love's Labor's Lost Comedies
[Measure for Measure is set on machine 25 with hash value= 17]Trace(75, 0, 25, )
Succeed to set: Measure for Measure Comedies
[The Merchant of Venice is set on machine 0 with hash value= 102]Trace(75, 100, 0, )
Succeed to set: The Merchant of Venice Comedies
…...

 

 3) Reset book and genre
Set "Harry Potter" "magic and children"

[Harry Potter is set on machine 0 with hash value= 120]Trace(75, 100, 0, )
Succeed to set: Harry Potter magic and children

In this sample, the key is 「Harry Potter」 and the value is 「magic and children」. It is set on machine 0 because the hash value of its key is 120. When setting this key, it visited node 75, 100, 0 successively.

 

 

 3.3 Get

1)  Positive: File exists

Get "Harry Potter"

Succeed to get: Harry Potter.
[Harry Potter:magic and children is get on machine 0 with hash value= 120]Trace(75, 100, 0, )

In this sample, 「Harry Potter:magic and children」 means the key is 「Harry Potter」 and the value is 「magic and children」. It is found on machine 0 because the hash value of its key is 120. When getting this key, it visited node 75, 100, 0 successively.

 

 

2)  Negative: key does not exist

Get "Filename in Dream"

ERROR[Filename in Dream is NOT FOUND on machine 100 with hash value= 89]Trace(75, 100, )
Failed to find: Filename in Dream

In this sample, the key 「Filename in Dream」 should be on machine 100 because the hash value of its key is 89. When trying to get this key, it visited node 75, 100 successively. But the key does not exist.

 

3.4 Printing Trace Log

After each Set() and Get() operation, the system will print all nodes it passed when executing, in the trace() field.

Note: According to the algorithm in the reference paper, the find_processor() function will always pass the predecessor of the end point in the path, and then check whether the next point should end this finding process.

 

3.5 Invalid Command

When typing invalid command, the system will show error message.

WrongCommand "Book"

Wrong Command: please input again

Input Command: Set Book_title Genre|Set Filename|Get Book_title|Quit

 

 


下面是寫做業過程當中的一點筆記:

 

task: 基於Chord實現一個Hash Table

我負責寫Node,隊友寫SuperNode和Client。整體參考paper[Stoica et al., 2001]上的僞代碼

FindSuccessor(key):對chord環上的任意一個key,返回他的successor

FindPredecessor(key):對chord環上的任意一個key,返回他的Predecessor

n.Closet_Preceding_Finger(key):對chord環上的任意一個key,在node n的Finger table中查找它的最近的Predecessor

注意幾個坑點:

1. 在UpdateDHT時,由於這個函數是被遞歸調用的,因此有可能會出現若干個node造成infinite loop的狀況。解決方法就是在updateDHT函數入口設置一個List來記錄visited過的節點,若是下次遇到重複的就退出。

2. 在FindSuccesor / FindPredecessor裏,假設這個書名就該在本身node上 那findsuccessor裏就不須要rpc,直接訪問本身的的successor就好了。這個須要特別判斷一下,否則本身rpc本身是會崩的

3. 在UpdateDHT的僞代碼中,if(s屬於[n, finger[i].node)) 這裏在測試中發現有問題。咱們改爲了if(finger[i].start屬於[finger[i].node, s])解決了問題。它表示對於finger[i]這一項,新節點s比該項的當前值finger[i].node靠後,且是該節點能觸到的位置finger[i].start的successor,因此要更新。    (因此頂會論文也會出錯麼?

 

FingerTable的實現:

 1 public class FingerItem{    //For FingerTable[i],
 2     Address Node;           //  Node = Successor(FingerTable[i].start)
 3     long ivx; long ivy;     //  [ivx, ivy) = [FingerTable[i].start, FingerTable[i+1].start)
 4     long start;             //  start = (NodeInfo.ID+(long)(Math.pow(2,i-1))) % (long)(Math.pow(2,ChordLen)))
 5     public FingerItem(Address _Node, long _ivx, long _ivy, long _start) {
 6         Node=_Node;
 7         ivx=_ivx;
 8         ivy=_ivy;
 9         start=_start;
10     }
11     public static long FingerCalc(long _n, int _k, int _m) {
12         if (_k == _m + 1)
13             return (_n);
14         else
15             return ((_n + (long) (Math.pow(2, _k - 1))) % (long) (Math.pow(2, _m)));
16     }
17     public void PrintItem(int i){
18         System.out.println("FingerItem"+i+" "+Node+" "+ivx+"_"+ivy+" : "+start);
19     }
20 }
21 
22 FingerItem[ChordLen] FingerTable;
相關文章
相關標籤/搜索