Hash Functions, Collision Resolution, and Rehashing in HashMap

Date: 2019-12-13


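1. Overview

Hash tables have two main implementation approaches: open hashing (separate chaining) and closed hashing (open addressing). HashMap uses open hashing.

In HashMap, key-value pairs are stored internally as instances of Entry (a static nested class of HashMap). The hash table, table, is an array of Entry that holds these instances.

As for collisions: in open hashing, if several entries compute the same hash address (obtained via indexFor(hash(key.hashCode()), length)), those entries are organized into a linked list whose head pointer is table[i].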

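The data structure of a HashMap can be pictured roughly as in the figure below (using a HashMap<String,String> instance as the example):

[Figure: a HashMap<String,String> table with colliding entries chained from table[i]]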
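As a rough illustration of this layout, here is a minimal sketch of a chained table. The names ChainedTableSketch, Node, bucketOf, prepend, and chainLength are hypothetical, and the index is a plain mask of hashCode() rather than HashMap's real hash(); the only point is to show table[i] acting as the head pointer of a bucket's list.

```java
// Minimal sketch of an open-hashing (separate chaining) layout.
// All names here are illustrative, not taken from the JDK.
class ChainedTableSketch {
    // One node of a bucket's singly linked list.
    static final class Node {
        final String key;
        String value;
        Node next;

        Node(String key, String value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    // table[i] is the head pointer of bucket i; null means the bucket is empty.
    final Node[] table = new Node[16];

    // Simplified index: mask the raw hashCode (HashMap additionally applies hash()).
    int bucketOf(String key) {
        return key.hashCode() & (table.length - 1);
    }

    // Link a new node in front of whatever is already in the bucket.
    void prepend(String key, String value) {
        int i = bucketOf(key);
        table[i] = new Node(key, value, table[i]);
    }

    // Number of nodes chained in bucket i.
    int chainLength(int i) {
        int n = 0;
        for (Node e = table[i]; e != null; e = e.next) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        ChainedTableSketch t = new ChainedTableSketch();
        t.prepend("a", "1"); // 'a' = 97,  97 & 15 = 1
        t.prepend("q", "2"); // 'q' = 113, 113 & 15 = 1, same bucket as "a"
        t.prepend("b", "3"); // 'b' = 98,  98 & 15 = 2
        System.out.println(t.chainLength(1)); // 2
        System.out.println(t.chainLength(2)); // 1
    }
}
```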

2. Hash function

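HashMap uses simple division hashing. With k as the key's hash value and m as the table size, the hash formula can be written as:

    h(k) = k mod m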

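Generally speaking, with division hashing, m should avoid certain special values. For example, m should not be a power of 2.

If m = 2^p, then h(k) is simply the p lowest-order bits of k. The result then depends only on those bits, which makes collisions more likely and does not guarantee that the hash values are evenly distributed over [0...m-1]. So unless all low-order p-bit patterns are known to be equally likely, m should be chosen so that the hash depends on all bits of the key.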
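A small sketch of the problem, assuming m = 16 (p = 4) and a few made-up keys that differ only above their low 4 bits. The class name, the method name, and the prime 13 used for contrast are illustrative choices, not anything taken from HashMap.

```java
// Sketch: with m = 2^p, k mod m keeps only the p lowest bits of k.
class LowBitsDemo {
    // Division hashing h(k) = k mod m, for non-negative k.
    static int divisionHash(int k, int m) {
        return k % m;
    }

    public static void main(String[] args) {
        int[] keys = {0x10, 0x120, 0x7FF0}; // all end in four zero bits
        for (int k : keys) {
            // m = 16 (a power of 2): every key lands in slot 0.
            // m = 13 (a prime, chosen only for contrast): the higher bits matter.
            System.out.println(k + " -> mod 16: " + divisionHash(k, 16)
                    + ", mod 13: " + divisionHash(k, 13));
        }
    }
}
```

With m = 16 all three keys collide in slot 0, while with m = 13 they land in slots 3, 2, and 5.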

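HashMap, however, provides the hash(int h) function, which takes key.hashCode() as its argument and processes it further. That extra step deals with the problem above fairly well and roughly guarantees that each hashCode has a bounded number of collisions. (Exactly how the shifts and XORs achieve this, I have not dug into; anyone interested is welcome to discuss it.)

As a result, the hash address of a key is actually computed as: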

indexFor(hash(key.hashCode()),length)

Here the result of hash(key.hashCode()) plays the role of k in the hash formula above, and length plays the role of m.

The source of hash(int h) and indexFor(int h, int length) below makes this clearer:

    /**
     * Applies a supplemental hash function to a given hashCode, which
     * defends against poor quality hash functions.  This is critical
     * because HashMap uses power-of-two length hash tables, that
     * otherwise encounter collisions for hashCodes that do not differ
     * in lower bits. Note: Null keys always map to hash 0, thus index 0.
     */
    static int hash(int h) {
        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    /**
     * Returns index for hash code h.
     */
    static int indexFor(int h, int length) {
        return h & (length-1);
    }
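To get a feel for what the supplemental hash buys, the sketch below copies the two method bodies quoted above and feeds them hashCodes that differ only in their high bits. The class name, the sample values, and the table length of 16 are assumptions made for the demonstration.

```java
// Sketch: effect of the supplemental hash on hashCodes that differ only in high bits.
class SupplementalHashDemo {
    // Same body as the hash(int) quoted above.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Same body as the indexFor(int, int) quoted above.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;
        int[] hashCodes = {0x10000, 0x20000, 0x30000}; // low 16 bits are all zero
        for (int hc : hashCodes) {
            System.out.println("0x" + Integer.toHexString(hc)
                    + " raw index: " + indexFor(hc, length)
                    + ", index after hash(): " + indexFor(hash(hc), length));
        }
    }
}
```

Masking the raw values puts all three in index 0. After hash() the high bits have been folded into the low ones, and the three values land in indices 1, 2, and 3.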

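Note how indexFor(int h, int length) works:

When length is a power of 2, h & (length-1) is equivalent to h % length for non-negative h. For negative h the mask still yields an index in [0, length-1], whereas % would return a negative result. Here length is the length of table, and HashMap guarantees that it is a power of 2 both at initialization and across later resize operations.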
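The relationship between the mask and the remainder can be checked with a short sketch. The class name and sample values are made up; Math.floorMod is the standard library method (Java 8 and later) that always returns a non-negative result for a positive modulus.

```java
// Sketch: h & (length - 1) versus h % length when length is a power of 2.
class MaskVsModuloDemo {
    // Same body as HashMap's indexFor.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;
        int[] samples = {0, 5, 37, Integer.MAX_VALUE, -1, -37};
        for (int h : samples) {
            // For h >= 0 all three columns agree; for h < 0 only mask and floorMod agree.
            System.out.println(h + ": mask=" + indexFor(h, length)
                    + " %=" + (h % length)
                    + " floorMod=" + Math.floorMod(h, length));
        }
    }
}
```

For example, -37 gives mask=11 and floorMod=11, but -37 % 16 is -5, which could not be used as an array index.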

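3. Collision resolution

When a <key,value> pair needs to be inserted (internally, an Entry instance), the put operation is executed.

Two equal keys necessarily compute the same hash address (the same indexFor(hash, table.length) result). HashMap does not store duplicate keys, so a put on an existing key actually overwrites its value.

Two different keys may still compute the same hash address (for example, the keys "d" and "u" in the earlier example), and in that case a collision occurs.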
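The "d" and "u" collision can be reproduced by hand. The sketch below reuses the hash and indexFor bodies from the JDK source quoted in this article and assumes the default table length of 16; the class name and the bucketOf helper are hypothetical.

```java
// Sketch: two different keys that end up in the same bucket.
class CollisionDemo {
    // Same body as HashMap's hash(int).
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Same body as HashMap's indexFor(int, int).
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // Hypothetical helper: the full address computation for a non-null key.
    static int bucketOf(Object key, int length) {
        return indexFor(hash(key.hashCode()), length);
    }

    public static void main(String[] args) {
        // "d".hashCode() is 100 and "u".hashCode() is 117 (the character codes).
        System.out.println(bucketOf("d", 16)); // 2
        System.out.println(bucketOf("u", 16)); // 2
        System.out.println("d".equals("u"));   // false
    }
}
```

Both keys map to index 2 even though they are not equal, so they have to share a bucket.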

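Collision resolution in HashMap is fairly simple: colliding entry nodes are chained as a linked list hanging off table[i]. On insertion, a new Entry instance is created with the arguments (hash, key, value, e), where e is the first entry of the list at table[i]. e becomes the next element of the new entry, so the new entry is inserted at the head of the list and becomes the new head node.

At the source level, put calls addEntry(), which in turn calls the constructor of HashMap's static nested class Entry<K,V>.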

    public V put(K key, V value) {
        if (key == null)
            return putForNullKey(value);
        int hash = hash(key.hashCode());
        int i = indexFor(hash, table.length);
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }

        modCount++;
        addEntry(hash, key, value, i);
        return null;
    }

    void addEntry(int hash, K key, V value, int bucketIndex) {
        Entry<K,V> e = table[bucketIndex];
        table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
        if (size++ >= threshold)
            resize(2 * table.length);
    }

    static class Entry<K,V> implements Map.Entry<K,V> {
        ...
        /**
         * Creates new entry.
         */
        Entry(int h, K k, V v, Entry<K,V> n) {
            value = v;
            next = n;
            key = k;
            hash = h;
        }
        ...
    }
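To watch both behaviours in isolation (overwriting an existing key, and head insertion on a collision), here is a stripped-down sketch modeled on the code above. MiniMap and its headKeyAt helper are hypothetical; null keys, modCount, recordAccess, and resizing are deliberately left out, so this is not a substitute for the real class.

```java
// Sketch: put() with overwrite and head insertion, modeled on the listing above.
class MiniMap<K, V> {
    static final class Entry<K, V> {
        final int hash;
        final K key;
        V value;
        Entry<K, V> next;

        Entry(int h, K k, V v, Entry<K, V> n) {
            hash = h;
            key = k;
            value = v;
            next = n;
        }
    }

    @SuppressWarnings("unchecked")
    private final Entry<K, V>[] table = (Entry<K, V>[]) new Entry[16];

    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // Returns the previous value for key, or null if the key was absent.
    V put(K key, V value) {
        int hash = hash(key.hashCode());
        int i = indexFor(hash, table.length);
        for (Entry<K, V> e = table[i]; e != null; e = e.next) {
            if (e.hash == hash && (e.key == key || key.equals(e.key))) {
                V old = e.value;
                e.value = value; // existing key: overwrite in place
                return old;
            }
        }
        // New key: the old head becomes the new entry's next.
        table[i] = new Entry<>(hash, key, value, table[i]);
        return null;
    }

    // Hypothetical inspection helper: key of the head entry in bucket i.
    K headKeyAt(int i) {
        return table[i] == null ? null : table[i].key;
    }

    public static void main(String[] args) {
        MiniMap<String, String> m = new MiniMap<>();
        m.put("d", "1");
        m.put("u", "2");                     // collides with "d" in bucket 2
        System.out.println(m.headKeyAt(2));  // u (the newer entry is the head)
        System.out.println(m.put("d", "3")); // 1 (old value returned, no new entry)
        System.out.println(m.headKeyAt(2));  // u
    }
}
```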

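4. Rehash

When the number of key-value pairs reaches the configured threshold (capacity * load factor, with the load factor defaulting to 0.75), HashMap rehashes in order to keep its performance up.

In HashMap, rehashing has two main steps: (1) grow the table, and (2) transfer the entries from the old table to the new one.

The table length is doubled each time, up to a maximum length of 2^30.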
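The growth pattern can be tabulated with a few lines. This is only a sketch: the class and method names are hypothetical, the load factor is assumed to be the default 0.75, and the threshold formula mirrors the (int)(newCapacity * loadFactor) expression used in resize().

```java
// Sketch: capacity doubling and the matching thresholds.
class GrowthDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30; // 2^30
    static final float LOAD_FACTOR = 0.75f;      // assumed default

    // Threshold for a given capacity: (int)(capacity * loadFactor).
    static int thresholdFor(int capacity) {
        return (int) (capacity * LOAD_FACTOR);
    }

    // Number of doublings needed to grow from `from` to MAXIMUM_CAPACITY.
    static int doublingsToMax(int from) {
        int n = 0;
        for (int c = from; c < MAXIMUM_CAPACITY; c <<= 1) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        for (int capacity = 16; capacity <= 128; capacity <<= 1) {
            System.out.println("capacity " + capacity
                    + " -> threshold " + thresholdFor(capacity));
        }
        System.out.println(doublingsToMax(16)); // 26
    }
}
```

Starting from the default capacity of 16, the thresholds are 12, 24, 48, and 96, and the table can double 26 times before it reaches 2^30.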

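The entry transfer is the rehash proper. Each entry's bucket index is recomputed against the new table length (the hash stored in the entry is reused; only indexFor is evaluated again), so entries that shared a bucket in the old table may well end up in different buckets of the new one. This is a direct consequence of the change in table length.

In the source code this mainly involves two methods, resize() and transfer().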

    void resize(int newCapacity) {
        Entry[] oldTable = table;
        int oldCapacity = oldTable.length;
        if (oldCapacity == MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return;
        }

        Entry[] newTable = new Entry[newCapacity];
        transfer(newTable);
        table = newTable;
        threshold = (int)(newCapacity * loadFactor);
    }

    /**
     * Transfers all entries from current table to newTable.
     */
    void transfer(Entry[] newTable) {
        Entry[] src = table;
        int newCapacity = newTable.length;
        for (int j = 0; j < src.length; j++) {
            Entry<K,V> e = src[j];
            if (e != null) {
                src[j] = null;
                do {
                    Entry<K,V> next = e.next;
                    int i = indexFor(e.hash, newCapacity);
                    e.next = newTable[i];
                    newTable[i] = e;
                    e = next;
                } while (e != null);
            }
        }
    }
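Because the capacity doubles, exactly one more bit of the stored hash joins the mask, so an entry from old bucket j can only land at j or at j + oldCapacity in the new table. A minimal sketch with made-up hash values (the class name is hypothetical):

```java
// Sketch: where entries of one old bucket go after the table doubles.
class RehashSplitDemo {
    // Same body as HashMap's indexFor.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int oldCapacity = 16;
        int newCapacity = 32;
        int[] hashes = {0x02, 0x12, 0x22, 0x32}; // all map to index 2 when length is 16
        for (int h : hashes) {
            System.out.println("hash 0x" + Integer.toHexString(h)
                    + ": old index " + indexFor(h, oldCapacity)
                    + ", new index " + indexFor(h, newCapacity));
        }
    }
}
```

The four hashes share index 2 in the 16-slot table. In the 32-slot table, 0x02 and 0x22 stay at index 2 while 0x12 and 0x32 move to index 18, which is 2 + 16.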

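5. Summary

1. The capacity (the length of the table array) must be a power of 2, and the default initial capacity is 16. Even when an instance is constructed with an explicit initialCapacity argument (HashMap(int initialCapacity, float loadFactor)), the constructor rounds the capacity up to the smallest power of 2 that is not less than initialCapacity.

2. The load factor, loadFactor, defaults to 0.75.

3. A null key is always hashed to the bucket at table[0], and that remains true across a rehash. Non-null keys can also be hashed to table[0], for example key="f" in the figure above. The same key may also be hashed to different positions at different times, which is a consequence of rehashing.

4. HashMap resolves collisions with linked lists. When a key-value pair is inserted (the put operation), the new entry is inserted at the head of the list, that is, at table[i].

5. As with other collection classes, iterators are fail-fast. While traversing with an Iterator, structural modifications should go through the iterator's own methods (for example remove); modifying the map's contents by other means during iteration should be avoided.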
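The last point can be demonstrated with java.util.HashMap itself. In this sketch the class and method names are hypothetical; note also that fail-fast behaviour is documented as best-effort, so the exception should be treated as a debugging aid rather than something to rely on for correctness.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Sketch: safe removal through the iterator versus a fail-fast failure.
class FailFastDemo {
    // Removes every entry whose key starts with prefix, using Iterator.remove().
    static void removeByPrefix(Map<String, String> map, String prefix) {
        Iterator<Map.Entry<String, String>> it = map.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getKey().startsWith(prefix)) {
                it.remove(); // keeps the iterator and the map in sync
            }
        }
    }

    // Returns true if removing through the map during iteration was detected.
    static boolean unsafeRemovalDetected(Map<String, String> map) {
        try {
            for (String key : map.keySet()) {
                map.remove(key); // structural change behind the iterator's back
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("a1", "x");
        map.put("a2", "y");
        map.put("b1", "z");
        removeByPrefix(map, "a");
        System.out.println(map.keySet()); // [b1]

        Map<String, String> other = new HashMap<>();
        other.put("k1", "v");
        other.put("k2", "v");
        other.put("k3", "v");
        System.out.println(unsafeRemovalDetected(other)); // true
    }
}
```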