Java Basics Series (3) - An In-Depth Analysis of HashMap

Date: 2019-11-17

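This time I want to analyze how HashMap works. Why pick this topic? The reason is simple: when I interviewed candidates, I would occasionally ask about HashMap. 99% of programmers know HashMap and nearly all of them can use it, from fresh graduates to programmers with 5 or even 10 years of experience. But HashMap involves far more than put and get. I hope this analysis at least helps interviewees cope with the interviewer's questions.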

1. First, a look back at my interviews

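Q: "You have used HashMap. Can you tell me about it?"

A: "Of course I have. HashMap is a <key,value> storage structure: you can quickly store data under a key with put and then quickly retrieve it with get." The candidate goes on: "HashMap is not thread-safe; Hashtable is thread-safe, implemented with synchronized. Lookups in HashMap are very fast," and so on. At this point the candidate has shown that they are fluent in using HashMap as a tool.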

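Q: "Do you know how HashMap works internally during put and get?"

A: "HashMap computes a hash value from the key and maps that hash value to the object's reference. On get, it first computes the key's hash value and then finds the object." By now the candidate already sounds less confident.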

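Q: "Why are strings so commonly used as HashMap keys? Can other objects, or custom objects, be used? Why?"

A: "I haven't looked into that. I usually just use String out of habit."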

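Q: "You mentioned that HashMap is not thread-safe. What do you understand by thread safety? What is the underlying principle? What are some ways to avoid thread-safety problems?"

A: "Thread safety means that when multiple threads access an object, they can produce results that are not expected; generally you need locking to be thread-safe."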

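In fact, after these questions I can pretty much judge this programmer's fundamentals: an average technical level, and there is no need to ask anything further.

From my personal point of view, HashMap interview questions can probe a candidate's understanding of threading, the Java memory model, visibility and immutability across threads, hash computation, linked lists, and binary operators such as &, |, <<, and >>. So a single HashMap is enough to test a person's technical foundation.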
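On the question above about custom objects as keys, the following is a minimal sketch rather than anything taken from the interviews. `Point` and `CustomKeyDemo` are hypothetical names. The sketch assumes the usual contract: a key class should override `equals()` and `hashCode()` consistently and is best kept immutable. String is a popular key largely because it is immutable and already meets that contract.

```java
import java.util.HashMap;
import java.util.Map;

public class CustomKeyDemo {
    // Hypothetical immutable key class, used only for illustration.
    static final class Point {
        private final int x;
        private final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return x == p.x && y == p.y;
        }

        @Override
        public int hashCode() {
            // Equal points must produce equal hash codes.
            return 31 * x + y;
        }
    }

    static String lookup(int x, int y) {
        Map<Point, String> map = new HashMap<Point, String>();
        map.put(new Point(1, 2), "a");
        // A different but equal instance finds the entry because
        // hashCode() and equals() agree with each other.
        return map.get(new Point(x, y));
    }

    public static void main(String[] args) {
        System.out.println(lookup(1, 2)); // value stored under an equal key
        System.out.println(lookup(2, 1)); // null: no such key
    }
}
```

If `hashCode()` were left out, two equal `Point` instances would usually land in different buckets and the lookup would miss.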

2. Concept analysis

I. HashMap class diagram

The class diagram here was drawn from JDK 1.6, as shown in Figure 1:

[Figure 1: HashMap class diagram]

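II. HashMap storage structure

HashMap is so easy to use, which raises the question: how does it store data, and what does its storage structure look like? Many programmers don't know. In fact, if you step just one level deeper when calling put and get, you will see what it really is. Put simply, HashMap's storage structure is built from an array and linked lists together, as in the figure:

[Figure: HashMap storage structure, an array with a linked list hanging off each slot]

In the figure, the Y axis of the HashMap is the array and the X axis is the linked-list storage. As everyone knows, an array is stored contiguously in memory and has a fixed size; once allocated, its memory cannot be taken by other references. Its strength is fast lookup by index, O(1), while insertion and deletion are slower, O(n). A linked list is stored non-contiguously and has no fixed size; its characteristics are the opposite: fast insertion and deletion, slow lookup. HashMap can be seen as a compromise between the two.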

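III. Basic principle of HashMap

(1) First check whether the key is null. If it is null, look directly in Entry[0]; otherwise compute the key's hashCode and run it through a second hash to obtain the hash value, which is an int.

(2) Use the hash value to find the corresponding array slot: take it modulo the length of Entry[] to get the index into the Entry array.

(3) Finding the array slot means finding the linked list; insertion, deletion and lookup of the value then follow the usual linked-list operations.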
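The three steps above can be sketched as a tiny chained hash table. This is an illustration under simplifying assumptions, not the JDK implementation: `TinyChainedMap` and `Node` are hypothetical names, keys are Strings, values are ints, the table has a fixed length of 16, and there is no second hash and no resizing.

```java
public class TinyChainedMap {
    // One node of a bucket's singly linked list.
    static final class Node {
        final String key;
        int value;
        Node next;

        Node(String key, int value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    // Fixed-size bucket array; the length is a power of two so that
    // (hash & (length - 1)) selects a bucket.
    private final Node[] table = new Node[16];

    private int indexFor(String key) {
        // A null key goes to bucket 0, as in step (1).
        return key == null ? 0 : (key.hashCode() & (table.length - 1));
    }

    private static boolean sameKey(String a, String b) {
        return a == null ? b == null : a.equals(b);
    }

    public void put(String key, int value) {
        int i = indexFor(key);
        for (Node n = table[i]; n != null; n = n.next) {
            if (sameKey(n.key, key)) {
                n.value = value; // key already present: overwrite
                return;
            }
        }
        table[i] = new Node(key, value, table[i]); // insert at the head of the chain
    }

    public Integer get(String key) {
        for (Node n = table[indexFor(key)]; n != null; n = n.next) {
            if (sameKey(n.key, key)) {
                return n.value;
            }
        }
        return null; // not found
    }

    public static void main(String[] args) {
        TinyChainedMap m = new TinyChainedMap();
        m.put("a", 1);
        m.put("a", 2);
        m.put(null, 7);
        System.out.println(m.get("a"));
        System.out.println(m.get(null));
        System.out.println(m.get("missing"));
    }
}
```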

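IV. HashMap terminology

Variable	Term	Description
size	Size	Number of entries stored in the HashMap.
threshold	Threshold	When the size reaches this value, the HashMap must be resized.
loadFactor	Load factor	Load factor of the HashMap; the default is 0.75 (75%).
modCount	Modification count	Total number of modifications to the HashMap (such as adding or removing entries).
Entry	Entry	The actual object stored in the HashMap; consists of key, value, hash and next.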

V. HashMap initialization

By default, most people initialize with new HashMap(). Here I analyze the constructor new HashMap(int initialCapacity, float loadFactor). The code is as follows:

public HashMap(int initialCapacity, float loadFactor) {
        // initialCapacity is the initial capacity of the HashMap; its maximum is MAXIMUM_CAPACITY = 1 << 30.
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal initial capacity: " +
                                               initialCapacity);
        if (initialCapacity > MAXIMUM_CAPACITY)
            initialCapacity = MAXIMUM_CAPACITY;

        // loadFactor is the load factor, DEFAULT_LOAD_FACTOR = 0.75 by default; it is used to compute the threshold.
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal load factor: " +
                                               loadFactor);

        // Find a power of 2 >= initialCapacity
        int capacity = 1;
        while (capacity < initialCapacity)
            capacity <<= 1;

        this.loadFactor = loadFactor;
        threshold = (int)(capacity * loadFactor);
        table = new Entry[capacity];
        init();
    }

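As the code above shows, initialization needs the initial capacity, and the requested value is rounded up to the next power of two. The reason is that the Entry array index is later computed with a bitwise AND on the hash, which requires the Entry array length to be a power of two (2^N).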
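As a small worked example of the constructor logic above, the sketch below repeats the rounding loop and the threshold formula in isolation. `CapacityDemo`, `roundUpToPowerOfTwo` and `threshold` are hypothetical helper names for illustration; they are not methods of HashMap.

```java
public class CapacityDemo {
    // Same loop as in the constructor: smallest power of two >= initialCapacity.
    static int roundUpToPowerOfTwo(int initialCapacity) {
        int capacity = 1;
        while (capacity < initialCapacity) {
            capacity <<= 1;
        }
        return capacity;
    }

    // Same formula as in the constructor: threshold = (int)(capacity * loadFactor).
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        int capacity = roundUpToPowerOfTwo(10);
        System.out.println(capacity);                   // 16
        System.out.println(threshold(capacity, 0.75f)); // 12
    }
}
```

So a request for 10 slots yields a table of 16, and with the default load factor the map grows once it holds more than 12 entries.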

VI. Hash computation and collisions in HashMap

HashMap computes its hash by first calling hashCode() and then applying a second hash. The code is as follows:

// Compute the second hash
int hash = hash(key.hashCode());

// Use the hash to find the array index
int i = indexFor(hash, table.length);

Before diving into HashMap's hash algorithm, let's look at the hash algorithm of the JDK's String. The code is as follows:

    /**
     * Returns a hash code for this string. The hash code for a
     * <code>String</code> object is computed as
     * <blockquote><pre>
     * s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
     * </pre></blockquote>
     * using <code>int</code> arithmetic, where <code>s[i]</code> is the
     * <i>i</i>th character of the string, <code>n</code> is the length of
     * the string, and <code>^</code> indicates exponentiation.
     * (The hash value of the empty string is zero.)
     *
     * @return a hash code value for this object.
     */
    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;
            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }

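As the JDK API shows, the formula is s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], where s[i] is the character at index i and n is the length of the string. Why the fixed constant 31? There has been a lot of discussion about it; basically it is a number chosen for optimization. The main reference is Joshua Bloch's Effective Java, quoted below: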

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

In other words: an even multiplier would lose information on overflow, since multiplying by 2 is just a shift; the case for a prime is mostly tradition; and 31*i can be computed as (i<<5)-i, an optimization that modern virtual machines apply automatically.
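To make the formula concrete, the sketch below recomputes a String hash by hand and checks the shift identity from the quote. `StringHashDemo` and `manualHash` are hypothetical names introduced only for this illustration.

```java
public class StringHashDemo {
    // Re-implements the documented formula s[0]*31^(n-1) + ... + s[n-1]
    // using int arithmetic (overflow wraps around, as in the JDK).
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // 'a' = 97, 'b' = 98, 'c' = 99: 97*31*31 + 98*31 + 99 = 96354
        System.out.println(manualHash("abc"));                     // 96354
        System.out.println(manualHash("abc") == "abc".hashCode()); // true

        int i = 12345;
        System.out.println(31 * i == (i << 5) - i);                // true
    }
}
```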

Now to the main topic: why does HashMap apply a second hash? The code is as follows:

    static int hash(int h) {
        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

Before answering that, let's first look at how HashMap uses the hash to find the array index.

    /**
     * Returns index for hash code h.
     */
    static int indexFor(int h, int length) {
        return h & (length-1);
    }

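Here h is the hash value and length is the length of the array. Because length is always a power of two, this bitwise AND gives the same result as the remainder h % length for non-negative h, only faster. When is this kind of computation used? Typically for grouping, for example distributing 100 numbers into 16 groups. It is very widely used.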
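The equivalence with the remainder holds only under the conditions just stated. The sketch below, with the hypothetical class name `IndexForDemo`, compares the mask with `%` and with `Math.floorMod` (available since Java 8) for a negative hash and for a length that is not a power of two.

```java
public class IndexForDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;
        // Non-negative hash, power-of-two length: mask and remainder agree.
        System.out.println(indexFor(47, length) + " " + (47 % length)); // 15 15

        // Negative hash: % gives a negative remainder, the mask does not.
        System.out.println(-1 % length);               // -1
        System.out.println(indexFor(-1, length));      // 15
        System.out.println(Math.floorMod(-1, length)); // 15

        // Length 12 is not a power of two: the mask no longer matches the remainder.
        System.out.println(indexFor(4, 12) + " vs " + (4 % 12)); // 0 vs 4
    }
}
```

This is why the constructor insists on a power-of-two table length: only then does `length - 1` have all of its low bits set.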

Now that we know how the grouping works, let's look at a few examples. The code is as follows:

        int h=15,length=16;
        System.out.println(h & (length-1));
        h=15+16;
        System.out.println(h & (length-1));
        h=15+16+16;
        System.out.println(h & (length-1));
        h=15+16+16+16;
        System.out.println(h & (length-1));

Every result is 15. Why? Let's convert to binary and take a look.

System.out.println(Integer.parseInt("0001111", 2) & Integer.parseInt("0001111", 2));

System.out.println(Integer.parseInt("0011111", 2) & Integer.parseInt("0001111", 2));

System.out.println(Integer.parseInt("0111111", 2) & Integer.parseInt("0001111", 2));

System.out.println(Integer.parseInt("1111111", 2) & Integer.parseInt("0001111", 2));

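Here you can see that in the bitwise AND only the low bits take part in the result; the high bits do not, because the high bits of length-1 are all 0. As a consequence, hashes with the same low bits produce the same result no matter what their high bits are. If HashMap relied on this alone, such hashes would always collide in the same array slot, the linked list starting at that slot would grow without bound, and lookups would become very slow; that could hardly be called high performance. So HashMap has to solve this problem and spread keys as evenly as possible across the array to avoid hash clustering.

Back to the main topic: how does HashMap deal with this, and how does the second hash work?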

    static int hash(int h) {
        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

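This function reduces hash collisions by mixing the high bits of the hash code into the low bits that indexFor actually uses. Collisions can still occur, and there are several standard ways to resolve them:

(1) Open addressing (linear probing, quadratic probing, pseudo-random probing)

(2) Rehashing

(3) Separate chaining

(4) A shared overflow area

HashMap uses separate chaining. These methods will be covered separately in later blog posts, so they are not discussed here.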
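To see what the second hash buys, the sketch below feeds eight hash codes that differ only in their high bits (1<<16 through 8<<16) into indexFor, once directly and once through hash(). The `hash` and `indexFor` bodies are copied from the JDK 1.6 code shown earlier; `HashSpreadDemo` and `buckets` are hypothetical names for this illustration.

```java
import java.util.Set;
import java.util.TreeSet;

public class HashSpreadDemo {
    // Supplemental hash, copied from the JDK 1.6 source quoted in this article.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // Distinct bucket indexes for the hash codes 1<<16, 2<<16, ..., 8<<16,
    // which all have zero in their low 16 bits.
    static Set<Integer> buckets(boolean spread, int length) {
        Set<Integer> result = new TreeSet<Integer>();
        for (int i = 1; i <= 8; i++) {
            int h = i << 16;
            result.add(indexFor(spread ? hash(h) : h, length));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("raw:    " + buckets(false, 16)); // all in one bucket
        System.out.println("spread: " + buckets(true, 16));  // eight different buckets
    }
}
```

Without the second hash all eight keys pile up in bucket 0; with it they land in eight different buckets of a 16-slot table.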

VII. HashMap put() explained

That covers the basic concepts. Now to the main subject: how does HashMap store an object? The code is as follows:

 /**
     * Associates the specified value with the specified key in this map.
     * If the map previously contained a mapping for the key, the old
     * value is replaced.
     *
     * @param key key with which the specified value is to be associated
     * @param value value to be associated with the specified key
     * @return the previous value associated with <tt>key</tt>, or
     *         <tt>null</tt> if there was no mapping for <tt>key</tt>.
     *         (A <tt>null</tt> return can also indicate that the map
     *         previously associated <tt>null</tt> with <tt>key</tt>.)
     */
    public V put(K key, V value) {
        if (key == null)
            return putForNullKey(value);
        int hash = hash(key.hashCode());
        int i = indexFor(hash, table.length);
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }

        modCount++;
        addEntry(hash, key, value, i);
        return null;
    }

As the code shows, the steps are as follows:

(1) First check whether the key is null. If it is, putForNullKey(value) handles it separately. The code is as follows:

    /**
     * Offloaded version of put for null keys
     */
    private V putForNullKey(V value) {
        for (Entry<K,V> e = table[0]; e != null; e = e.next) {
            if (e.key == null) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }
        modCount++;
        addEntry(0, null, value, 0);
        return null;
    }

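As the code shows, an entry with a null key is stored in the linked list that starts at table[0]. The method walks every Entry node in that list; if it finds a node whose key is null, it replaces the value with the new one and returns the old value. If no node with a null key is found, it adds a new node.

(2) Compute the key's hashCode, apply the second hash to the result, and use indexFor(hash, table.length) to find the index i in the Entry array.

(3) Then walk the linked list whose head is table[i]. If a node with the same hash and the same key is found, replace its value with the new one and return the old value.

(4) What is modCount for? Let me explain. As everyone knows, HashMap is not thread-safe. In applications that tolerate faults reasonably well, you may not want to pay Hashtable's synchronization overhead just because of a 1% chance of trouble, so HashMap uses a fail-fast mechanism to deal with this. In the source, modCount is declared like this: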

    transient volatile int modCount;

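modCount is declared volatile, which means that when it is accessed from multiple threads, a thread reading it sees the latest value written by another thread, as the Java memory model guarantees for volatile fields. In HashMap, modCount really only plays a key role during iteration.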

private abstract class HashIterator<E> implements Iterator<E> {
        Entry<K,V> next;    // next entry to return
        int expectedModCount;    // For fast-fail
        int index;        // current slot
        Entry<K,V> current;    // current entry

        HashIterator() {
           expectedModCount = modCount;
            if (size > 0) { // advance to first entry
                Entry[] t = table;
                while (index < t.length && (next = t[index++]) == null)
                    ;
            }
        }

        public final boolean hasNext() {
            return next != null;
        }

        final Entry<K,V> nextEntry() {
            // This check is the key point
            if (modCount != expectedModCount)
                throw new ConcurrentModificationException();
            Entry<K,V> e = next;
            if (e == null)
                throw new NoSuchElementException();

            if ((next = e.next) == null) {
                Entry[] t = table;
                while (index < t.length && (next = t[index++]) == null)
                    ;
            }
        current = e;
            return e;
        }

        public void remove() {
            if (current == null)
                throw new IllegalStateException();
            if (modCount != expectedModCount)
                throw new ConcurrentModificationException();
            Object k = current.key;
            current = null;
            HashMap.this.removeEntryForKey(k);
          expectedModCount = modCount;
        }

    }

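When iteration starts, the Iterator copies modCount into expectedModCount. During iteration it compares the two on every step to detect whether the HashMap has been modified, either by the iterating thread itself or by another thread. If modCount and expectedModCount differ, the structure of the HashMap was changed outside the iterator, and a ConcurrentModificationException is thrown.

That is why put, remove and similar operations in HashMap all include modCount++.

(5) If no node with the same hash and key is found, a new node is added with addEntry(). The code is as follows: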

  void addEntry(int hash, K key, V value, int bucketIndex) {
    Entry<K,V> e = table[bucketIndex];
        table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
        if (size++ >= threshold)
            resize(2 * table.length);
    }

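Adding a node uses a shortcut here: every new node is inserted at the head of the bucket, and the new head's next points to the old head.

(6) If the size of the HashMap exceeds the threshold, it must be resized (expanded); see the resize() section below.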
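The fail-fast check from step (4) can be observed even in a single thread. The sketch below is an illustration using the public java.util.HashMap API; `FailFastDemo` and its method names are hypothetical. It adds an entry while iterating, which trips the modCount check, and then removes entries through the iterator, which keeps expectedModCount in sync.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class FailFastDemo {
    static Map<String, Integer> sample() {
        Map<String, Integer> map = new HashMap<String, Integer>();
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);
        return map;
    }

    // Structurally modifies the map while iterating; returns true if the
    // iterator detected the change and threw.
    static boolean modifyDuringIteration() {
        Map<String, Integer> map = sample();
        try {
            for (String key : map.keySet()) {
                map.put(key + "-copy", 0); // new key, so modCount changes
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Removing through the iterator is allowed: remove() resets expectedModCount.
    static int removeOddValuesViaIterator() {
        Map<String, Integer> map = sample();
        Iterator<Map.Entry<String, Integer>> it = map.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() % 2 == 1) {
                it.remove();
            }
        }
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println(modifyDuringIteration());      // true
        System.out.println(removeOddValuesViaIterator()); // 1
    }
}
```

Note that fail-fast is a best-effort bug detector, not a synchronization mechanism; it does not make concurrent use of HashMap safe.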

VIII. HashMap get() explained

Once you understand put above, get is easy to follow. The code is as follows:

    public V get(Object key) {
        if (key == null)
            return getForNullKey();
        int hash = hash(key.hashCode());
        for (Entry<K,V> e = table[indexFor(hash, table.length)];
             e != null;
             e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
                return e.value;
        }
        return null;
    }

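Do not be fooled by how simple this code looks; the problems it can cause are huge. Always remember that HashMap is not thread-safe, so this loop can turn into an infinite loop. Why? If several threads modify the map concurrently, for example when two threads trigger resize() at the same time, transfer() can leave a bucket's linked list with a cycle, so that following next never reaches null. A later get() that walks that bucket for a key that is not in it spins forever, and the symptom is very poor throughput with very high CPU usage. Be sure to remember this.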
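For the thread-safety question raised in the interview, two standard library options are ConcurrentHashMap and Collections.synchronizedMap. The sketch below is one possible illustration, assuming Java 8 or later (it uses lambdas and `merge`); `SafeMapDemo`, `concurrentCount`, `synchronizedCount` and `runAll` are hypothetical names. Several threads increment the same counter key, and the final count is exact in both variants.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SafeMapDemo {
    // Option 1: ConcurrentHashMap with an atomic read-modify-write.
    static int concurrentCount(int threads, int perThread) {
        ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
        runAll(threads, () -> {
            for (int i = 0; i < perThread; i++) {
                map.merge("hits", 1, Integer::sum);
            }
        });
        return map.get("hits");
    }

    // Option 2: a synchronized wrapper around a plain HashMap.
    static int synchronizedCount(int threads, int perThread) {
        Map<String, Integer> map = Collections.synchronizedMap(new HashMap<>());
        runAll(threads, () -> {
            for (int i = 0; i < perThread; i++) {
                // Each call is synchronized on its own, but get-then-put is a
                // compound action, so it needs the wrapper's lock as a whole.
                synchronized (map) {
                    Integer old = map.get("hits");
                    map.put("hits", old == null ? 1 : old + 1);
                }
            }
        });
        return map.get("hits");
    }

    // Starts the given number of threads on the same task and waits for all of them.
    private static void runAll(int threads, Runnable task) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(task);
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(concurrentCount(4, 1000));   // 4000
        System.out.println(synchronizedCount(4, 1000)); // 4000
    }
}
```

With a plain HashMap and no locking, the same workload could lose updates or corrupt the table.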

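IX. HashMap size() explained

HashMap's size is simple: it is not computed on demand. Instead, size is incremented every time an Entry is added and decremented on removal, trading a little space for time. Because HashMap is not thread-safe, it can do this without any synchronization, which is efficient.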

X. HashMap resize() explained

When the size of the HashMap exceeds the threshold, its capacity has to be expanded. The code is as follows:

    void resize(int newCapacity) {
        Entry[] oldTable = table;
        int oldCapacity = oldTable.length;
        if (oldCapacity == MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return;
        }

        Entry[] newTable = new Entry[newCapacity];
        transfer(newTable);
        table = newTable;
        threshold = (int)(newCapacity * loadFactor);
    }

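As the code shows, if the old capacity has already reached MAXIMUM_CAPACITY, the threshold is set to Integer.MAX_VALUE and the method returns without growing. Otherwise a new Entry array is created, twice the length of the old one, and the entries of the old Entry[] are moved into the new Entry[]. The code is as follows: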

    void transfer(Entry[] newTable) {
        Entry[] src = table;
        int newCapacity = newTable.length;
        for (int j = 0; j < src.length; j++) {
            Entry<K,V> e = src[j];
            if (e != null) {
                src[j] = null;
                do {
                    Entry<K,V> next = e.next;
                    int i = indexFor(e.hash, newCapacity);
                    e.next = newTable[i];
                    newTable[i] = e;
                    e = next;
                } while (e != null);
            }
        }
    }
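One consequence of doubling a power-of-two table is worth spelling out: when transfer() recomputes indexFor(e.hash, newCapacity), each entry either stays at its old index or moves to old index + old capacity, depending on one extra bit of the hash. The sketch below illustrates this; `ResizeSplitDemo` and `staysOrMovesByOldCapacity` are hypothetical names, and only `indexFor` mirrors the JDK code shown earlier.

```java
public class ResizeSplitDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // True if, after doubling, the hash lands either in its old bucket
    // or in (old bucket + oldCapacity).
    static boolean staysOrMovesByOldCapacity(int h, int oldCapacity) {
        int oldIndex = indexFor(h, oldCapacity);
        int newIndex = indexFor(h, oldCapacity * 2);
        return newIndex == oldIndex || newIndex == oldIndex + oldCapacity;
    }

    public static void main(String[] args) {
        int oldCapacity = 16;
        // These four hashes all share bucket 15 in a table of 16.
        int[] hashes = {15, 31, 47, 63};
        for (int h : hashes) {
            System.out.println(h + ": " + indexFor(h, oldCapacity)
                    + " -> " + indexFor(h, oldCapacity * 2));
        }
        // 15: 15 -> 15
        // 31: 15 -> 31
        // 47: 15 -> 15
        // 63: 15 -> 31
    }
}
```

Because transfer() inserts each entry at the head of its new bucket, entries that end up in the same bucket appear in reverse order compared with the old list.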