Hash functions and the xor-shift scheme, HashCollections, BloomFilter

Date: 2019-11-17

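By definition, hashCode exists to help equals decide more quickly whether two objects are the same. It is also widely used by hash-based collections. This article briefly looks at how hashCode is implemented and how it is used.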

1. HashCode

1.1 Thoughts prompted by hashCode

/**
     * Returns a hash code value for the object. This method is
     * supported for the benefit of hash tables such as those provided by
     * {@link java.util.HashMap}.
     * <p>
     * The general contract of {@code hashCode} is:
     * <ul>
     * <li>Whenever it is invoked on the same object more than once during
     *     an execution of a Java application, the {@code hashCode} method
     *     must consistently return the same integer, provided no information
     *     used in {@code equals} comparisons on the object is modified.
     *     This integer need not remain consistent from one execution of an
     *     application to another execution of the same application.
     * <li>If two objects are equal according to the {@code equals(Object)}
     *     method, then calling the {@code hashCode} method on each of
     *     the two objects must produce the same integer result.
     * <li>It is <em>not</em> required that if two objects are unequal
     *     according to the {@link java.lang.Object#equals(java.lang.Object)}
     *     method, then calling the {@code hashCode} method on each of the
     *     two objects must produce distinct integer results.  However, the
     *     programmer should be aware that producing distinct integer results
     *     for unequal objects may improve the performance of hash tables.
     * </ul>
     * <p>
     * As much as is reasonably practical, the hashCode method defined by
     * class {@code Object} does return distinct integers for distinct
     * objects. (This is typically implemented by converting the internal
     * address of the object into an integer, but this implementation
     * technique is not required by the
     * Java<font size="-2"><sup>TM</sup></font> programming language.)
     *
     * @return  a hash code value for this object.
     * @see     java.lang.Object#equals(java.lang.Object)
     * @see     java.lang.System#identityHashCode
     */
    public native int hashCode();

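The comment is so clear that I have nothing to add:

Calling hashCode several times on the same object should return the same value.
If o1.equals(o2), then o1.hashCode() == o2.hashCode().
(Not required, but best effort) if !o1.equals(o2), o1.hashCode() may still equal o2.hashCode(), but this should be avoided where possible.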
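
The three rules above can be tried out with a small value class. This is only a sketch: Point and HashContractDemo are hypothetical names invented for illustration, and Objects.hash is just one convenient way to derive a hash from the fields that equals compares.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class HashContractDemo {
    // Hypothetical value class, used only for illustration.
    static final class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public boolean equals(Object other) {
            if (this == other) return true;
            if (!(other instanceof Point)) return false;
            Point p = (Point) other;
            return x == p.x && y == p.y;
        }

        // Built from the same fields as equals, so equal points share a hash code.
        @Override
        public int hashCode() {
            return Objects.hash(x, y);
        }
    }

    public static void main(String[] args) {
        Point a = new Point(1, 2);
        Point b = new Point(1, 2);

        // Rule 1: repeated calls on an unchanged object agree.
        System.out.println(a.hashCode() == a.hashCode());
        // Rule 2: equal objects have equal hash codes.
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode());

        // A hash-based collection relies on both rules to find b via a.
        Set<Point> set = new HashSet<>();
        set.add(a);
        System.out.println(set.contains(b));
    }
}
```

If hashCode were left out of Point, the last lookup would most likely fail, because a and b would land in unrelated buckets.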

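But there were a few questions I had never quite figured out:

Question 1. Is the implementation different for simple (boxed) types and for other types?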

Integer i1 = 1;
Integer i2 = 1;
// i1.hashCode() == i2.hashCode() ?

Entry o1 = new Entry(1, 2);
Entry o2 = new Entry(1, 2);
// o1.hashCode() == o2.hashCode() ?

Then I looked at the Integer.hashCode method...

public int hashCode() {
        return value;
    }
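
To make the difference concrete, here is a minimal sketch. PlainEntry is a hypothetical class that deliberately does not override hashCode, so it falls back to the identity hash inherited from Object, while Integer returns its wrapped value.

```java
class BoxedHashDemo {
    // Hypothetical class that does NOT override hashCode or equals.
    static final class PlainEntry {
        final int key;
        final int value;

        PlainEntry(int key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    public static void main(String[] args) {
        Integer i1 = 1000;
        Integer i2 = 1000;
        // Integer.hashCode() returns the wrapped value, so equal values agree.
        System.out.println(i1.hashCode() == i2.hashCode());

        PlainEntry o1 = new PlainEntry(1, 2);
        PlainEntry o2 = new PlainEntry(1, 2);
        // Without an override these are identity hash codes; they usually
        // differ even though the fields match (a collision is possible).
        System.out.println(o1.hashCode() == o2.hashCode());
        // The inherited hashCode is the value System.identityHashCode reports.
        System.out.println(o1.hashCode() == System.identityHashCode(o1));
    }
}
```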

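Question 2. Does the hash code change when an object's references or fields change?

This question came up when I was hashing an element of a very long linked list, and I had to wonder whether every hash computation would walk the entire list.

The answer is a bit of an anticlimax: if you do not override hashCode, the hash value does not change, even when you change a reference or a field value.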

TmpOB o1 = new TmpOB(1, "aaa");
System.out.println(o1.hashCode());
o1.a = 3; // mutate the same object whose hash code is printed
System.out.println(o1.hashCode());
// ***************
// result (the number itself varies between runs):
// 727129599
// 727129599
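
A self-contained version of the same experiment might look like the sketch below. The field names of TmpOB are guesses based on the snippet above, and TmpOBWithHash is a hypothetical variant added to show the contrast: once hashCode is computed from the fields, mutating a field does change the result.

```java
import java.util.Objects;

class MutationHashDemo {
    // Mirrors the TmpOB used above (field names are assumptions); no override.
    static class TmpOB {
        int a;
        String s;

        TmpOB(int a, String s) {
            this.a = a;
            this.s = s;
        }
    }

    // Hypothetical variant whose hash code is derived from its fields.
    static class TmpOBWithHash extends TmpOB {
        TmpOBWithHash(int a, String s) {
            super(a, s);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof TmpOBWithHash)) return false;
            TmpOBWithHash other = (TmpOBWithHash) o;
            return a == other.a && Objects.equals(s, other.s);
        }

        @Override
        public int hashCode() {
            return Objects.hash(a, s);
        }
    }

    public static void main(String[] args) {
        TmpOB plain = new TmpOB(1, "aaa");
        int before = plain.hashCode();
        plain.a = 3;
        // Identity hash: unaffected by the field change.
        System.out.println(before == plain.hashCode());

        TmpOBWithHash hashed = new TmpOBWithHash(1, "aaa");
        int hashedBefore = hashed.hashCode();
        hashed.a = 3;
        // Field-based hash: follows the field change.
        System.out.println(hashedBefore == hashed.hashCode());
    }
}
```

This is also why mutating an object that is already a key in a HashMap is risky when its hashCode depends on the mutated field.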

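1.2 How hashCode is implemented

Because Object.hashCode is a native method, I turned to articles written by others. The gist is that it is implemented in C++ inside the JVM. The code is roughly as follows, and its return value is the hash code.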

static inline intptr_t get_next_hash(Thread * Self, oop obj) {
  intptr_t value = 0 ;
  if (hashCode == 0) {
     // This form uses an unguarded global Park-Miller RNG,
     // so it's possible for two threads to race and generate the same RNG.
     // On MP system we'll have lots of RW access to a global, so the
     // mechanism induces lots of coherency traffic.
     value = os::random() ;
  } else
  if (hashCode == 1) {
     // This variation has the property of being stable (idempotent)
     // between STW operations.  This can be useful in some of the 1-0
     // synchronization schemes.
     intptr_t addrBits = intptr_t(obj) >> 3 ;
     value = addrBits ^ (addrBits >> 5) ^ GVars.stwRandom ;
  } else
  if (hashCode == 2) {
     value = 1 ;            // for sensitivity testing
  } else
  if (hashCode == 3) {
     value = ++GVars.hcSequence ;
  } else
  if (hashCode == 4) {
     value = intptr_t(obj) ;
  } else {
     // Marsaglia's xor-shift scheme with thread-specific state
     // This is probably the best overall implementation -- we'll
     // likely make this the default in future releases.
     unsigned t = Self->_hashStateX ;
     t ^= (t << 11) ;
     Self->_hashStateX = Self->_hashStateY ;
     Self->_hashStateY = Self->_hashStateZ ;
     Self->_hashStateZ = Self->_hashStateW ;
     unsigned v = Self->_hashStateW ;
     v = (v ^ (v >> 19)) ^ (t ^ (t >> 8)) ;
     Self->_hashStateW = v ;
     value = v ;
  }
 
  value &= markOopDesc::hash_mask;
  if (value == 0) value = 0xBAD ;
  assert (value != markOopDesc::no_hash, "invariant") ;
  TEVENT (hashCode: GENERATE) ;
  return value;
}

Forgive my weak C++; I had no idea what this code was doing. Luckily I found some discussions by people who know better:

Discussion on ITEYE

Discussion on Stack Overflow

The hashCode() method is often used for identifying an object. I think the Object implementation returns the pointer (not a real pointer but a unique id or something like that) of the object. But most classes override the method. Like the String class. Two String objects have not the same pointer but they are equal:

new String("a").hashCode() == new String("a").hashCode()
I think the most common use for hashCode() is in Hashtable, HashSet, etc..

Java API Object hashCode()

Edit: (due to a recent downvote and based on an article I read about JVM parameters)

With the JVM parameter -XX:hashCode you can change the way how the hashCode is calculated (see the Issue 222 of the Java Specialists' Newsletter).

HashCode==0: Simply returns random numbers with no relation to where in memory the object is found. As far as I can make out, the global read-write of the seed is not optimal for systems with lots of processors.

HashCode==1: Counts up the hash code values, not sure at what value they start, but it seems quite high.

HashCode==2: Always returns the exact same identity hash code of 1. This can be used to test code that relies on object identity. The reason why JavaChampionTest returned Kirk's URL in the example above is that all objects were returning the same hash code.

HashCode==3: Counts up the hash code values, starting from zero. It does not look to be thread safe, so multiple threads could generate objects with the same hash code.

HashCode==4: This seems to have some relation to the memory location at which the object was created.

HashCode>=5: This is the default algorithm for Java 8 and has a per-thread seed. It uses Marsaglia's xor-shift scheme to produce pseudo-random numbers.

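So, according to the quote, HashCode>=5 is the default implementation. But what on earth is this xor-shift?

Thanks to the material on Wikipedia.

It explains things clearly: George Marsaglia, a professor at Florida State University, proposed a method for generating random numbers using shifts and XOR operations, and published it as a paper in a statistics journal.

The simplest implementation looks like this: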

#include <stdint.h>

/* These state variables must be initialized so that they are not all zero. */
uint32_t x, y, z, w;

uint32_t xorshift128(void) {
    uint32_t t = x ^ (x << 11);
    x = y; y = z; z = w;
    return w = w ^ (w >> 19) ^ t ^ (t >> 8);
}
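
For readers more at home in Java, the C listing above can be ported roughly as follows. This is a sketch rather than anything taken from the JDK: the class name XorShift128 is made up, the seed constants in main are illustrative values often quoted alongside this algorithm, and >>> is used because Java's int is signed.

```java
final class XorShift128 {
    // The four state words; they must not all be zero.
    private int x;
    private int y;
    private int z;
    private int w;

    XorShift128(int x, int y, int z, int w) {
        if ((x | y | z | w) == 0) {
            throw new IllegalArgumentException("state must not be all zero");
        }
        this.x = x;
        this.y = y;
        this.z = z;
        this.w = w;
    }

    // Returns the next 32-bit output, widened to a long so it prints unsigned.
    long next() {
        int t = x ^ (x << 11);
        x = y;
        y = z;
        z = w;
        // >>> is the unsigned right shift, matching uint32_t >> in C.
        w = w ^ (w >>> 19) ^ t ^ (t >>> 8);
        return w & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        // Illustrative seed values (an assumption, not taken from the JVM).
        XorShift128 rng = new XorShift128(123456789, 362436069, 521288629, 88675123);
        for (int i = 0; i < 3; i++) {
            System.out.println(rng.next());
        }
    }
}
```

The same seed always yields the same sequence, which is the defining property of a pseudorandom generator.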

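Ha, this finally matches the C++ code above. Setting the why aside for a moment, let's see what else the wiki discusses.

1.3 Random numbers

Random numbers are a foundation of cryptography, so generating, testing, and comparing them is a deep rabbit hole. I will not go into much detail here.

Random numbers can be roughly divided into true random numbers produced by physical processes and pseudorandom numbers produced by functions. Some Intel chips include a hardware mechanism for generating random numbers physically.

From hcwang's notes on random numbers: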

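True random numbers

True random numbers can only be produced by certain random physical processes, for example radioactive decay, thermal noise in electronic devices, or the arrival times of cosmic rays. Using such a process to produce random numbers for Monte Carlo calculations poses no problem in theory, but in practice it is very hard to build a physical generator that is both fast and accurate. The Intel 810 RNG works roughly as follows: thermal noise (caused by the thermal vibration of electrons in a conductor) is amplified and used to influence a voltage-controlled oscillator, and a second, high-frequency oscillator is used to collect the data.

Pseudorandom numbers

The random numbers used in practice are usually pseudorandom numbers computed from mathematical formulas. In the mathematical sense such numbers are not random at all. However, as long as they pass a series of statistical tests for randomness, we can use them with confidence as if they were truly random. This lets us produce random numbers cheaply and reproducibly. In theory a pseudorandom number generator should have the following properties: good statistical distribution, efficient generation, a long cycle period, a portable generating program, and reproducible output. Of these, good statistical properties matter most.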

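Some ideas for testing random number generation algorithms, excerpted from Zhihu:

From Zhihu: "How do you judge the quality of a pseudorandom number generation algorithm?"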

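Frequency test: checks whether the numbers of 0s and 1s in the binary sequence are approximately equal. If so, the sequence is considered random.

Frequency test within a block: determines whether the numbers of 0s and 1s inside every non-overlapping M-bit block of the sequence are randomly distributed. If so, the sequence is considered random.

Runs test: determines whether the number of runs of 0s and 1s of each length is what a truly random sequence would be expected to show. If so, the sequence is considered random.

Longest run of ones in a block test: determines whether the length of the longest run of 1s in the sequence is close to that of a truly random sequence. If so, the sequence is considered random.

Matrix rank test: checks for linear dependence among fixed-length subsequences. If the linear dependence is low, the sequence is considered random.

Discrete Fourier transform test: examines the periodic features of the sequence and compares them with those of a truly random sequence. If the deviation is small, the sequence is considered random.

Non-overlapping template matching test: checks whether subsequences match too many aperiodic templates. Too many matches means the sequence is not random.

Overlapping template matching test: counts the runs of consecutive 1s of a given length and checks whether the count deviates too much from that of a truly random sequence. Too large a deviation means the sequence is not random.

Universal statistical test: checks whether the sequence can be compressed significantly without losing information. A sequence that cannot be compressed significantly is considered random.

Compression test: determines how far the sequence can be compressed. If it can be compressed significantly, it is not random.

Linear complexity test: determines whether the sequence is complex enough. If so, the sequence is considered random.

Serial test: determines whether every possible m-bit pattern occurs about as often as it would in a truly random sequence. If so, the sequence is considered random.

Approximate entropy test: compares how often m-bit and (m-1)-bit patterns occur in the sequence, and then compares that with the situation in a normally distributed sequence, in order to judge randomness.

Cumulative sums test: determines whether the partial sums of the sequence are too large or too small. Either extreme means the sequence is not random.

Random excursions test: determines whether the number of visits to a particular state within a random walk far exceeds what a truly random sequence would show. If so, the sequence is not random.

Random excursions variant test: measures how far the number of occurrences of a particular state within a random walk deviates from that of a truly random sequence. If the deviation is large, the sequence is not random.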
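
As a taste of what such tests look like in code, here is a much simplified version of the first one, the frequency test. It is only a sketch: MonobitCheck is a made-up name, the threshold of 3.0 is an illustrative assumption (roughly three standard deviations), and a real test suite would turn the statistic into a p-value instead.

```java
import java.util.Random;

class MonobitCheck {
    // Returns |#ones - #zeros| / sqrt(n), where n is the total number of bits.
    // For a balanced random sequence this statistic should stay small.
    static double monobitStatistic(int[] words) {
        long ones = 0;
        for (int word : words) {
            ones += Integer.bitCount(word);
        }
        long n = 32L * words.length;
        return Math.abs(2 * ones - n) / Math.sqrt(n);
    }

    // Illustrative pass/fail rule; the cutoff is an assumption, not a standard.
    static boolean looksBalanced(int[] words) {
        return monobitStatistic(words) < 3.0;
    }

    public static void main(String[] args) {
        int[] sample = new Random(42).ints(1024).toArray();
        System.out.println("statistic = " + monobitStatistic(sample));
        System.out.println("balanced  = " + looksBalanced(sample));

        int[] allZero = new int[1024];
        System.out.println("all-zero balanced = " + looksBalanced(allZero));
    }
}
```

Passing this check says very little on its own: the perfectly regular pattern 0101... also has equal numbers of 0s and 1s, which is why the list above contains so many other tests.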

1.4 xor-shift scheme

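OK, let's get back to the xor-shift scheme. According to the Wikipedia description, the method is extremely fast on computers and fairly good, but it fails a few statistical tests. Combined with a non-linear function, however, it is said to pass all of them and to beat Mersenne Twister and WELL comfortably.

Of course the method is not free of problems: the initial values of the state variables x, y, z, and w have to be chosen with care. People who work on cryptography probably care about these issues more than we do; we are only worried about hash collisions!

O(∩∩)O We are supposed to be writing about hashing, right? How did we drift so far? O(∩∩)O

1.5 Summary

In short, the first time hashCode is computed for an Object, the value is by default produced by a pseudorandom function based on the xor-shift scheme.

What about the second call? Once the hash code has been generated, it is written to memory and stored together with the object (in the object header). Later calls simply read it back from the object, as if it were a property.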
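
The "generate once, then read it back" behaviour can be modelled in a few lines of Java. This is a toy model, not the JVM's code: Header is a hypothetical stand-in for the object header, ThreadLocalRandom replaces the xor-shift generator, the 31-bit mask is an illustrative choice, and unlike the real thing the sketch is not thread-safe.

```java
import java.util.concurrent.ThreadLocalRandom;

class LazyHashModel {
    // Toy stand-in for an object header; 0 means "no hash generated yet".
    static final class Header {
        private int hash;

        int identityHash() {
            if (hash == 0) {
                // First call: generate a value and store it in the header.
                int value = ThreadLocalRandom.current().nextInt() & 0x7FFFFFFF;
                if (value == 0) {
                    value = 0xBAD; // same fallback as in the C++ listing above
                }
                hash = value;
            }
            // Later calls: just read the stored value.
            return hash;
        }
    }

    public static void main(String[] args) {
        Header header = new Header();
        int first = header.identityHash();
        int second = header.identityHash();
        System.out.println(first == second);
    }
}
```

Because the value is stored rather than recomputed, it no longer matters where the object lives in memory afterwards.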

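Of course, some older implementations (in old JDK versions) derive the value from the address at which the object is stored. That raises a question: even if the value is generated from the object's current address the first time and simply read back afterwards, objects do get moved when you use ParallelGC or SerialGC. Once an object has moved, won't a new object created at its old address easily collide with it?

So I think that an address-based implementation should at least combine the address with the GC count through some bitwise operation to reduce collisions. The hashCode == 1 branch in the C++ code above appears to do something along these lines by XORing the address bits with GVars.stwRandom.

That is how an Object's hash code is generated.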

2. HashMap && HashSet

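HashSet is implemented on top of HashMap, so there is not much to say about it.

2.1 The hash function in HashMap

The one thing worth discussing is the HashMap.hash method (hash(int) in older JDKs; the version shown below takes the key object). It takes key.hashCode() and applies a number of bitwise operations before returning the hash value that HashMap actually uses for the key.

Why?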

final int hash(Object k) {
    int h = hashSeed;
    if (0 != h && k instanceof String) {
        return sun.misc.Hashing.stringHash32((String) k);
    }

    h ^= k.hashCode();

    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

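Unfortunately I still could not work it out on my own, and only understood after reading someone else's blog.

The basic idea is simple, and it is closely tied to how HashMap itself works.

First, HashMap stores keys in buckets and resolves collisions by chaining. Each resize doubles the capacity, so the number of buckets is always of the form 1<<k.

Now suppose that at some point a HashMap has a table size of 1<<11 and two keys are about to be inserted, one with hash value M and the other with M+(1<<13). Let's compute the index function: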

static int indexFor(int h, int length) {
        // assert Integer.bitCount(length) == 1 : "length must be a non-zero power of 2";
        return h & (length-1);
    }

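The index function simply keeps the low 11 bits, so clearly M & ((1<<11)-1) == (M+(1<<13)) & ((1<<11)-1),
because the low 11 bits of the two values were identical to begin with!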
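
The collision is easy to reproduce. The sketch below copies indexFor and the bit mixing from the hash method shown earlier, assuming hashSeed is 0; the class name and the sample value 0x1234 for M are arbitrary choices made for illustration.

```java
class IndexCollisionDemo {
    // Same bit mixing as the hash method above, assuming hashSeed == 0.
    static int spread(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Same as HashMap.indexFor: length must be a power of two.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 1 << 11;
        int m = 0x1234;              // arbitrary example value for M
        int other = m + (1 << 13);   // differs from M only above bit 10

        // Raw hash codes: both keys fall into the same bucket.
        System.out.println(indexFor(m, length) == indexFor(other, length));

        // After mixing, the high-bit difference reaches the low bits.
        System.out.println(indexFor(spread(m), length) == indexFor(spread(other), length));
    }
}
```

With the raw values the two indices are equal; after spread they differ, because the shifted copies of bit 13 are XORed into the low 11 bits.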

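See the problem? The hash function maps every object into a space of size 1<<32 (or 1<<64), and spreads the values out well. But as long as the HashMap's size is not 1<<32, the original hash code has to be mapped onto a smaller set, and that mapping is exactly what the hash step inside HashMap is for.

This step makes sure that a change in any high bit (such as the difference between M and M+(1<<13) in the example above) has a corresponding effect on the low bits. Simply cutting off the low bits of the hash code would make such collisions unavoidable.

See the figure for the details; it is very clear. Thanks to marystone.

In the position-by-position notation below for h^(h>>>7)^(h>>>4), h>>>7 is replaced by h>>>8 to make the pattern easier to follow.

That is, after the final h^(h>>>8)^(h>>>4) step, each position of the hash value is composed as follows: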

8=8

7=7^8

6=6^7^8

5=5^8^7^6

4=4^7^6^5^8

3=3^8^6^5^8^4^7

2=2^7^5^4^7^3^8^6

1=1^6^4^3^8^6^2^7^5

Positions 1, 2, and 3 of the result contain repeated terms in the XOR:

3=3^8^6^5^8^4^7 -> 3^6^5^4^7

2=2^7^5^4^7^3^8^6 -> 2^5^4^3^8^6

1=1^6^4^3^8^6^2^7^5 -> 1^4^3^8^2^7^5

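The algorithm uses (h>>>7) rather than (h>>>8), presumably because of the repeated XOR terms in positions 1, 2, and 3. This way all 8 positions of the original hashCode take part in the XOR for the lowest position. With table.length at its default of 16, a change in almost any bit of the hashCode is therefore reflected in the final table index; the one exception named here is the top bit of position 3 of the original hashCode, whose change does not show up in the result.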
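
One way to check this claim is to flip each bit of the hash code in turn and see whether the bucket index can change. The sketch below does that, assuming hashSeed is 0; silentBits is a hypothetical helper name. Since the mixing uses only shifts and XOR, flipping bit b of any h changes the result by exactly spread(1 << b), so a single pass over the 32 bits is enough.

```java
import java.util.ArrayList;
import java.util.List;

class BitSensitivityDemo {
    // Same bit mixing as the hash method above, assuming hashSeed == 0.
    static int spread(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Returns the bit positions (0 = least significant) whose flip never
    // changes the bucket index for a table of the given power-of-two length.
    static List<Integer> silentBits(int tableLength) {
        List<Integer> silent = new ArrayList<>();
        for (int bit = 0; bit < 32; bit++) {
            // Effect of flipping this bit, restricted to the index bits.
            int delta = spread(1 << bit) & (tableLength - 1);
            if (delta == 0) {
                silent.add(bit);
            }
        }
        return silent;
    }

    public static void main(String[] args) {
        System.out.println("length 16:   " + silentBits(16));
        System.out.println("length 2048: " + silentBits(1 << 11));
    }
}
```

Under these assumptions the sketch reports bit 11 for a table of length 16, which is the top bit of the third 4-bit group counted from the low end, and it also reports the sign bit, bit 31.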

HashMap, ConcurrentHashMap, and BloomFilter deserve a chapter of their own.
