[Engineering] Consistent Hashing

Date: 2019-11-10


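Introduction

Consistent hashing was first proposed by David Karger et al. in the paper "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web". It is a caching protocol designed to reduce or eliminate hot spots in distributed networks.

The paper describes four properties of a consistent hash function: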

Balance: the hash results should be spread across all caches as evenly as possible, so that all of the cache space is put to use. "The balance property is what is prized about standard hash functions: they distribute items among buckets in a balanced fashion."
Monotonicity: suppose some items have already been assigned to buckets by hashing, and new buckets are then added to the system. The hash results should guarantee that every previously assigned item maps either to its original bucket or to one of the new buckets, and never to a different bucket in the old set. "This property says that if items are initially assigned to a set of buckets V1 and then some new buckets are added to form V2, then an item may move from an old bucket to a new bucket, but not from one old bucket to another. This reflects one intuition about consistency: when the set of usable buckets changes, items should only move if necessary to preserve an even distribution."
Spread: in a distributed environment a client may not see all of the buckets, only a subset of them. When clients map content to buckets by hashing, different clients may see different sets of buckets, so their hash results can disagree and the same content ends up mapped to different buckets by different clients. This should clearly be avoided, because storing the same content in several buckets lowers the storage efficiency of the system. Spread measures how severe this situation is. A good hash algorithm should avoid such inconsistency as far as possible, that is, keep the spread low. "The idea behind spread is that there are V people, each of whom can see at least a constant fraction (1/t) of the buckets that are visible to anyone. Each person tries to assign an item i to a bucket using a consistent hash function. The property says that across the entire group, there are at most σ(i) different opinions about which bucket should contain the item. Clearly, a good consistent hash function should have low spread over all items."
Load: load looks at the spread problem from the other side. Since different clients may map the same content to different buckets, a particular bucket may also be assigned different content by different clients. Like spread, this should be avoided, so a good hash algorithm should keep the load on each bucket as low as possible. "The load property is similar to spread. The same V people are back, but this time we consider a particular bucket b instead of an item. The property says that there are at most λ(b) distinct items that at least one person thinks belongs in the bucket. A good consistent hash function should also have low load."

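Common uses

Some well-known scenarios:

Distributed cache access in memcached
Load balancing, for example dubbo's ConsistentHashLoadBalance
Distributed hash tables (DHT, Distributed Hash Table), which implement a (key, value) mapping across a group of nodes. Distributed systems such as Cassandra use a DHT

In practice, consistent hashing is used most often for load balancing.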

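The problem

In distributed system design, when we want certain user requests, or the data for certain cities, to be routed to one specific machine, the usual approach is to hash the key and take the result modulo the number of machines: hash(key) % N.
Suppose we have three machines A, B, and C, and later add a new machine D, with each machine addressed by its index.

Take keys whose hash codes are 1 through 10 and apply the modulo operation before (% 3) and after (% 4) adding D.

Comparing the two sets of results shows that adding the new machine D causes most keys to miss: only the keys with hash codes 1 and 2 still map to the same index, so the hit rate is just 20%. This is a simple example, but it is enough to show that when machines are added or removed, this algorithm leaves a large share of the data unreachable at its expected location. If this is a cache architecture, every cache miss sends the request to the DB and puts pressure on it. If it is a distributed service call, the machine originally accessed may hold configuration data or cached context, so a miss means the call fails.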
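The 20% figure can be reproduced with a few lines of Java. This is a minimal sketch; the class name ModuloRehashDemo and the assumption that index 0 maps to A, 1 to B, 2 to C, and 3 to D are illustrative and not taken from the original post.

```java
class ModuloRehashDemo {

    // Fraction of hash codes 1..maxHash that map to the same index
    // under both "hash % oldCount" and "hash % newCount".
    static double hitRate(int oldCount, int newCount, int maxHash) {
        int unchanged = 0;
        for (int hash = 1; hash <= maxHash; hash++) {
            if (hash % oldCount == hash % newCount) {
                unchanged++;
            }
        }
        return (double) unchanged / maxHash;
    }

    public static void main(String[] args) {
        // Assumed index-to-machine mapping: 0 -> A, 1 -> B, 2 -> C, 3 -> D
        String machines = "ABCD";
        for (int hash = 1; hash <= 10; hash++) {
            System.out.println("hash=" + hash
                    + " before=" + machines.charAt(hash % 3)
                    + " after=" + machines.charAt(hash % 4));
        }
        // Only hash codes 1 and 2 keep their machine.
        System.out.println("hit rate = " + hitRate(3, 4, 10)); // prints "hit rate = 0.2"
    }
}
```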

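In the case above, the algorithm itself violates the monotonicity property.

Monotonicity means that if some items have already been assigned to buckets by hashing and new buckets are then added to the system, every previously assigned item must map either to its original bucket or to one of the new buckets, and never to a different bucket in the old set.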

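The consistent hashing algorithm

First construct an integer ring of length 2^32 (called the consistent hash ring). Place each cache server node on the ring according to the hash of its node name (distributed over [0, 2^32-1]). Then compute the hash of the key of the data to be cached (also distributed over [0, 2^32-1]) and search clockwise on the ring for the server node nearest to that hash. This completes the mapping from key to server.

Using a given hash function h(x):
(1) Compute the hash of each node and place the node on the consistent hash ring:

Hash values of the nodes:
h(Node1)=K1
h(Node2)=K2
h(Node3)=K3

(2) Compute the hash of each object, then move clockwise from that position and store the object on the first machine encountered.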

h(object1)=key1
h(object2)=key2
h(object3)=key3
h(object4)=key4

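When a machine node is added or removed:

(1) A node is added: new node Node4
Compute h(Node4)=K4 and place it on the consistent hash ring.

Following the clockwise rule, object3 is migrated to Node4, and all other objects keep their original storage location.

(2) A node is removed: Node2 is deleted

Following the clockwise rule, object2 is migrated to Node3, and all other objects keep their original storage location.

This analysis of adding and removing nodes shows that consistent hashing preserves monotonicity while also keeping data migration to a minimum. Such an algorithm suits distributed clusters well: it avoids large-scale data migration and reduces the pressure on the servers.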
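The two scenarios above can be replayed on a toy ring. The following is a minimal sketch, not the original author's code: the ring positions (Node1=10, Node2=40, Node3=70, Node4=55) and the object positions are hand-picked, hypothetical values chosen so that the outcome matches the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class RingMigrationDemo {

    // Clockwise lookup: the first node at or after the position, wrapping to the start.
    static String owner(TreeMap<Integer, String> ring, int position) {
        Map.Entry<Integer, String> entry = ring.ceilingEntry(position);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // Hypothetical node positions on a small ring.
    static TreeMap<Integer, String> baseRing() {
        TreeMap<Integer, String> ring = new TreeMap<>();
        ring.put(10, "Node1");
        ring.put(40, "Node2");
        ring.put(70, "Node3");
        return ring;
    }

    // Hypothetical object positions.
    static Map<String, Integer> objects() {
        Map<String, Integer> objects = new LinkedHashMap<>();
        objects.put("object1", 5);
        objects.put("object2", 30);
        objects.put("object3", 50);
        objects.put("object4", 80);
        return objects;
    }

    static void printOwners(String title, TreeMap<Integer, String> ring) {
        System.out.println(title);
        for (Map.Entry<String, Integer> object : objects().entrySet()) {
            System.out.println("  " + object.getKey() + " -> " + owner(ring, object.getValue()));
        }
    }

    public static void main(String[] args) {
        printOwners("initial ring", baseRing());

        // Adding Node4 between object3 and Node3 moves only object3.
        TreeMap<Integer, String> withNode4 = baseRing();
        withNode4.put(55, "Node4");
        printOwners("after adding Node4", withNode4);

        // Removing Node2 moves only object2, to the next node clockwise (Node3).
        TreeMap<Integer, String> withoutNode2 = baseRing();
        withoutNode2.remove(40);
        printOwners("after removing Node2", withoutNode2);
    }
}
```

In both cases only the objects on the affected arc change owner, which is the monotonicity property in action.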

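Implementing the algorithm

Following the description above, the nodes are placed in order of their hash values on the ring [0, 2^32-1]. Given the hash of an object, the lookup is:
a. If a node has exactly the same hash, return that node.
b. Otherwise, return the first node whose hash is greater than the object's hash.
c. If no node has a greater hash, wrap around and return the first node on the ring.

1) Choose a suitable data structure:

The paper suggests that the ring can be implemented with a balanced binary search tree, such as an AVL tree or a red-black tree.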
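In Java, java.util.TreeMap is a red-black tree, so it is a natural fit for the ring. The sketch below is one possible shape for the lookup, under some assumptions: the class name RingLookupSketch and the node positions are made up for illustration, and Long keys are used because a Java int cannot hold the full unsigned range [0, 2^32-1].

```java
import java.util.Map;
import java.util.TreeMap;

class RingLookupSketch {

    // Ring position -> node name, kept sorted by the red-black tree.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(long position, String node) {
        ring.put(position, node);
    }

    // O(log n) lookup: exact match or next greater position, else wrap around.
    String locate(long keyHash) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(keyHash);
        if (entry == null) {
            entry = ring.firstEntry();
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        RingLookupSketch ring = new RingLookupSketch();
        // Hypothetical positions in [0, 2^32-1]
        ring.addNode(1_000L, "Node1");
        ring.addNode(2_000_000_000L, "Node2");
        ring.addNode(4_000_000_000L, "Node3");

        System.out.println(ring.locate(1_000L));         // equal hash: Node1
        System.out.println(ring.locate(1_001L));         // next greater: Node2
        System.out.println(ring.locate(4_294_967_295L)); // past the last node, wraps: Node1
    }
}
```

ceilingEntry covers cases a and b in a single call, and the null check handles the wrap-around.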

2) Choose a suitable hash function, one that scatters values well.

First, look at the hashCode of a Java String:

public class StringHashCodeDemo {
    public static void main(String[] args) {
        // Print String.hashCode() for four node addresses
        System.out.println("192.168.0.1:1111".hashCode());
        System.out.println("192.168.0.2:1111".hashCode());
        System.out.println("192.168.0.3:1111".hashCode());
        System.out.println("192.168.0.4:1111".hashCode());
    }
}
Hash values:
1874499238
1903128389
1931757540
1960386691

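2^32 = 4294967296, so the ring covers 0 to 4294967295.
If we place the four machine nodes above on the ring [0, 2^32-1], their hash values all fall within one very narrow interval (about 2% of the ring). With uniformly distributed keys, well over 90% of requests (about 98%) would land on Node1. Such a distribution is far too skewed.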
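To make the skew concrete, the share of the ring owned by each node can be computed from the four hash values printed above. This is a rough sketch that assumes keys are spread uniformly over the ring; the class name ArcShareDemo and its helper are illustrative.

```java
class ArcShareDemo {

    static final long RING_SIZE = 1L << 32;

    // Share of the ring owned by each node: the clockwise arc from the
    // previous node (exclusive) to the node itself (inclusive).
    // The positions must be sorted in ascending order.
    static double[] shares(long[] sortedPositions) {
        int n = sortedPositions.length;
        double[] result = new double[n];
        for (int i = 0; i < n; i++) {
            long previous = sortedPositions[(i + n - 1) % n];
            // A single node owns the whole ring; otherwise measure the arc modulo 2^32.
            long arc = (n == 1)
                    ? RING_SIZE
                    : Math.floorMod(sortedPositions[i] - previous, RING_SIZE);
            result[i] = (double) arc / RING_SIZE;
        }
        return result;
    }

    public static void main(String[] args) {
        // String.hashCode() values reported in the text for the four nodes
        long[] positions = {1874499238L, 1903128389L, 1931757540L, 1960386691L};
        double[] shares = shares(positions);
        for (int i = 0; i < shares.length; i++) {
            System.out.printf("Node%d owns %.2f%% of the ring%n", i + 1, shares[i] * 100);
        }
    }
}
```

Node1 owns everything from just after Node4 around through zero up to its own position, which is why it ends up with almost the entire ring.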

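We therefore need a hash function with few collisions and a well-scattered distribution. Candidates include CRC32_HASH, FNV1_32_HASH, KETAMA_HASH, and MYSQL_HASH. A comparison of these hash algorithms can be found online (unverified, taken from the web).

Judging roughly from that comparison, FNV1_32_HASH performs well, and KETAMA_HASH is the consistent hashing algorithm recommended for memcached.

Code implementation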

import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashingWithoutVirtualNode {

    /**
     * Key: the hash of a server; value: the name of the server.
     */
    private static SortedMap<Integer, String> sortedMap =
            new TreeMap<Integer, String>();

    /**
     * Computes the hash with the FNV1_32_HASH algorithm plus extra bit mixing.
     * This is used instead of overriding hashCode; the end effect is the same.
     * The result is always non-negative, so it covers [0, 2^31-1].
     */
    private static int getFNV1_32_HASHHash(String str) {
        final int p = 16777619;
        int hash = (int) 2166136261L;
        for (int i = 0; i < str.length(); i++)
            hash = (hash ^ str.charAt(i)) * p;
        hash += hash << 13;
        hash ^= hash >> 7;
        hash += hash << 3;
        hash ^= hash >> 17;
        hash += hash << 5;

        // Fold negative values into the non-negative range. Math.abs alone
        // returns Integer.MIN_VALUE unchanged, so handle that case explicitly.
        if (hash == Integer.MIN_VALUE)
            return 0;
        return Math.abs(hash);
    }

    /**
     * Servers to be added to the hash ring.
     */
    private static String[] servers = {"192.168.0.1:111", "192.168.0.2:111",
            "192.168.0.3:111", "192.168.0.4:111"};

    /**
     * Initialization: put every server into sortedMap.
     */
    static {
        for (int i = 0; i < servers.length; i++) {
            int hash = getFNV1_32_HASHHash(servers[i]);
            System.out.println("[" + servers[i] + "] added to the ring, hash = " + hash);
            sortedMap.put(hash, servers[i]);
        }
        System.out.println();
    }

    /**
     * Returns the server that the given key should be routed to.
     */
    private static String getServer(String node) {
        // Hash of the key to be routed
        int hash = getFNV1_32_HASHHash(node);
        // Sub-map of all servers whose hash is greater than or equal to the key's hash
        SortedMap<Integer, String> tailMap = sortedMap.tailMap(hash);
        if (!tailMap.isEmpty()) {
            // The first key is the nearest server clockwise (or an exact match)
            return tailMap.get(tailMap.firstKey());
        }
        // Nothing at or after the hash: wrap around to the first server on the ring
        return sortedMap.get(sortedMap.firstKey());
    }


    public static void main(String[] args) {
        String[] nodes = {"hello1", "hello2", "hello3"};
        for (int i = 0; i < nodes.length; i++)
            System.out.println("hash of [" + nodes[i] + "] is " +
                    getFNV1_32_HASHHash(nodes[i]) + ", routed to node [" + getServer(nodes[i]) + "]");
    }
}

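Shortcomings of the algorithm

Consistent hashing satisfies the monotonicity and load properties, as well as the spread property of ordinary hash algorithms, but it does not satisfy "balance".

Balance means that the hash results should be spread across all caches as evenly as possible, so that all of the cache space is put to use.

In this algorithm the hash function cannot guarantee balance. As analyzed above, when a node is added to the cluster it takes over part of the data access, but only from a single neighbor. When a node P is removed, all the data it was responsible for falls onto the next node Q, which inevitably increases the burden on Q. This is the imbalance.

Solution

Introduce virtual nodes. A virtual node is a replica of a real node placed on the ring.
For example, suppose the cluster now has two nodes, Node1 and Node3 (the situation after Node2 was removed above).

Give each node two replicas: Node1-1, Node1-2, Node3-1, and Node3-2.

Introducing virtual nodes in this way makes the distribution of objects more even. The mapping between physical and virtual nodes is then: Node1-1 and Node1-2 belong to Node1, and Node3-1 and Node3-2 belong to Node3.

At this point the improvement to the algorithm is complete, but a few questions remain before it can be used in practice:

How many virtual nodes should one real node map to?
How do we find the real node that corresponds to a virtual node?

Answers

1) In theory, the fewer physical nodes there are, the more virtual nodes are needed. According to the description of the ketama algorithm:

ketama uses 160 virtual nodes per physical node by default.

2) The hash of a virtual node can be computed from the IP address of the corresponding node plus a numeric suffix. For example, for "192.168.0.0:111" the two replicas are "192.168.0.0:111-VN1" and "192.168.0.0:111-VN2".
Tip: when placing virtual nodes on the consistent hash ring during initialization, store the real node directly as the value, i.e. h(192.168.0.0:111-VN2) -> "192.168.0.0:111".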
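The "-VN" suffix scheme and the tip above can be sketched as follows. This is an illustrative sketch rather than a production implementation: the class name VirtualNodeRing is hypothetical, and CRC32 is used only because it ships with the JDK and yields a value in [0, 2^32-1]; any of the hash functions listed earlier could be swapped in.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

class VirtualNodeRing {

    // Ring position of a virtual node -> the REAL node it stands for.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int replicas;

    VirtualNodeRing(List<String> servers, int replicas) {
        this.replicas = replicas;
        for (String server : servers) {
            addServer(server);
        }
    }

    // Assumption: CRC32 as the ring hash; its value lies in [0, 2^32-1].
    static long hash(String text) {
        CRC32 crc = new CRC32();
        crc.update(text.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    void addServer(String server) {
        for (int i = 1; i <= replicas; i++) {
            // h("ip:port-VNi") -> "ip:port", so a lookup returns the real node directly
            ring.put(hash(server + "-VN" + i), server);
        }
    }

    void removeServer(String server) {
        for (int i = 1; i <= replicas; i++) {
            // Two-argument remove: only delete the entry if it still belongs to this server
            ring.remove(hash(server + "-VN" + i), server);
        }
    }

    String locate(String key) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        if (entry == null) {
            entry = ring.firstEntry();
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        List<String> servers = Arrays.asList(
                "192.168.0.0:111", "192.168.0.1:111", "192.168.0.2:111");
        VirtualNodeRing ring = new VirtualNodeRing(servers, 160);

        // Count how many of 10000 sample keys land on each real node.
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < 10000; i++) {
            counts.merge(ring.locate("key" + i), 1, Integer::sum);
        }
        System.out.println(counts);
    }
}
```

Because the ring stores the real node as the value, no second lookup from virtual node to real node is needed.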

Ketama algorithm implementation

Below is the implementation of the setKetamaNodes() method in net.spy.memcached.KetamaNodeLocator.java:

protected void setKetamaNodes(List<MemcachedNode> nodes) {
    TreeMap<Long, MemcachedNode> newNodeMap =
            new TreeMap<Long, MemcachedNode>();
    int numReps = config.getNodeRepetitions();
    int nodeCount = nodes.size();
    int totalWeight = 0;

    if (isWeightedKetama) {
        for (MemcachedNode node : nodes) {
            totalWeight += weights.get(node.getSocketAddress());
        }
    }

    for (MemcachedNode node : nodes) {
      if (isWeightedKetama) {

          int thisWeight = weights.get(node.getSocketAddress());
          float percent = (float)thisWeight / (float)totalWeight;
          int pointerPerServer = (int)((Math.floor((float)(percent * (float)config.getNodeRepetitions() / 4 * (float)nodeCount + 0.0000000001))) * 4);
          for (int i = 0; i < pointerPerServer / 4; i++) {
              for(long position : ketamaNodePositionsAtIteration(node, i)) {
                  newNodeMap.put(position, node);
                  getLogger().debug("Adding node %s with weight %s in position %d", node, thisWeight, position);
              }
          }
      } else {
          // Ketama does some special work with md5 where it reuses chunks.
          // Check to be backwards compatible, the hash algorithm does not
          // matter for Ketama, just the placement should always be done using
          // MD5
          if (hashAlg == DefaultHashAlgorithm.KETAMA_HASH) {
              for (int i = 0; i < numReps / 4; i++) {
                  for(long position : ketamaNodePositionsAtIteration(node, i)) {
                    newNodeMap.put(position, node);
                    getLogger().debug("Adding node %s in position %d", node, position);
                  }
              }
          } else {
              for (int i = 0; i < numReps; i++) {
                  newNodeMap.put(hashAlg.hash(config.getKeyForNode(node, i)), node);
              }
          }
      }
    }
    assert newNodeMap.size() == numReps * nodes.size();
    ketamaNodes = newNodeMap;
  }
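The numReps / 4 loop bound in the method above makes sense once you note that an MD5 digest is 16 bytes, enough for four 32-bit ring positions, so 160 points need only 40 digests. The sketch below shows one way such a split can work. It is an assumption about the general technique, not a copy of the library's ketamaNodePositionsAtIteration: the class name, the little-endian byte order, and the "address-iteration" key format are all illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class KetamaPositionsSketch {

    // Splits one 16-byte MD5 digest into four unsigned 32-bit ring positions.
    static long[] positions(String nodeKey) {
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("MD5")
                    .digest(nodeKey.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
        long[] result = new long[4];
        for (int h = 0; h < 4; h++) {
            // Assumed byte order: each 4-byte chunk is read as a little-endian integer
            result[h] = ((long) (digest[3 + h * 4] & 0xFF) << 24)
                    | ((long) (digest[2 + h * 4] & 0xFF) << 16)
                    | ((long) (digest[1 + h * 4] & 0xFF) << 8)
                    | (digest[h * 4] & 0xFF);
        }
        return result;
    }

    public static void main(String[] args) {
        int numReps = 160;                // points per node, as in the text
        String node = "192.168.0.0:111";  // hypothetical node address
        int total = 0;
        for (int i = 0; i < numReps / 4; i++) {
            // Hypothetical key format: "<address>-<iteration>"
            long[] fourPositions = positions(node + "-" + i);
            total += fourPositions.length;
        }
        System.out.println("positions generated for " + node + ": " + total); // 40 digests x 4 = 160
    }
}
```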

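A detailed implementation and analysis of the algorithm is covered in a separate article (linked in the original post).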