ThreadLocal's hash algorithm (on 0x61c88647)

Preface

This article started as a topic in my study group. I put a fair amount of effort into it and learned a few things, so I am writing them down here for future reference.

Main Text

Recently I read an article by a classmate that mentioned ThreadLocal. I had never read its source code, so I took the opportunity to do so, and found that the ThreadLocalMap inside ThreadLocal uses a hash function said to be Fibonacci hashing (I am not fully sure of the name). Roughly, it works like this:

  • Each new hash code is produced by repeatedly adding 0x61c88647. (This corresponds to the first of the two usual steps of a hash function.)
// The constant can be derived as ~(2^32 / φ) + 0b1, where ^ denotes exponentiation
// and φ is the golden ratio. Note that the truncated value 1.618 is not precise
// enough to land exactly on 0x61c88647; a few more digits of φ are needed.
BigDecimal goldenRatioResult = BigDecimal.valueOf(Math.pow(2, 32))
        .divide(new BigDecimal("1.6180339887"), 0, RoundingMode.HALF_UP);
// ~x + 0b1 is two's-complement negation: intValue() has overflowed to a negative
// number, and negating it yields 1640531527, i.e. 0x61c88647.
int hashIncrement = ~goldenRatioResult.intValue() + 0b1;
  • The resulting hash is then ANDed with the ThreadLocalMap's capacity − 1. (This corresponds to the second usual step, reducing the hash modulo the table capacity. The capacity of a ThreadLocalMap must be a power of two precisely so that this works: when y = 2^n, x % y can be replaced by the bitwise operation x & (y − 1), which is faster, because a modulo is essentially a division and division is a comparatively slow operation.)
    /**
     * The difference between successively generated hash codes - turns
     * implicit sequential thread-local IDs into near-optimally spread
     * multiplicative hash values for power-of-two-sized tables.
     */
    private static final int HASH_INCREMENT = 0x61c88647;

    /**
     * Returns the next hash code.
     */
    private static int nextHashCode() {
        return nextHashCode.getAndAdd(HASH_INCREMENT);
    }
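Putting the two pieces together, the index computation can be imitated in a few lines (a minimal sketch; the class name `ThreadLocalHashDemo` and the helper `indexFor` are mine, but the constant and the masking step are exactly what the JDK does):

```java
import java.util.concurrent.atomic.AtomicInteger;

// A minimal imitation of how ThreadLocalMap derives a slot index:
// accumulate HASH_INCREMENT per ThreadLocal, then mask with (capacity - 1).
public class ThreadLocalHashDemo {
    private static final int HASH_INCREMENT = 0x61c88647;
    private static final AtomicInteger nextHashCode = new AtomicInteger();

    static int nextHashCode() {
        return nextHashCode.getAndAdd(HASH_INCREMENT);
    }

    // Index into a power-of-two table, as in the JDK's
    // `int i = key.threadLocalHashCode & (table.length - 1);`
    static int indexFor(int hash, int capacity) {
        return hash & (capacity - 1);
    }

    public static void main(String[] args) {
        int capacity = 16; // ThreadLocalMap's INITIAL_CAPACITY
        for (int i = 0; i < capacity; i++) {
            System.out.print(indexFor(nextHashCode(), capacity) + " ");
        }
        // prints a permutation of 0..15: 0 7 14 5 12 3 10 1 8 15 6 13 4 11 2 9
    }
}
```

The low 4 bits of 0x61c88647 are 7, and gcd(7, 16) = 1, so the first 16 hash codes visit every slot of a 16-entry table exactly once.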
Note that 1.618 is the golden ratio (it is sometimes given as 0.618 instead; 0.618 and 1.618 are reciprocals of each other, 1/1.618 = 0.618), and the golden ratio can in turn be derived from the Fibonacci sequence:
Fibonacci numbers: F(n) = F(n-1) + F(n-2) with F(0) = 0 and F(1) = 1. 
  
n		a(n)
0		0
1		1
2		1
3		2
4		3
5		5
6		8
7		13
8		21
9		34
10		55
11		89
12		144
13		233
14		377
15		610
16		987
17		1597
18		2584
19		4181
20		6765
21		10946
22		17711
23		28657
24		46368
25		75025
26		121393
27		196418
28		317811
29		514229
30		832040
31		1346269
32		2178309
33		3524578
34		5702887
35		9227465
36		14930352
37		24157817
38		39088169
39		63245986
40		102334155

102334155 / 63245986 = 1.61803
As the table shows, for large n, a(n+1) / a(n) ≈ 1.618.

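The convergence shown in the table is easy to check in code (a throwaway sketch; class and method names are mine):

```java
public class FibRatioDemo {
    // Ratio of consecutive Fibonacci numbers: F(n+1) / F(n).
    static double ratio(int n) {
        long a = 0, b = 1; // F(0), F(1)
        for (int i = 0; i < n; i++) {
            long next = a + b;
            a = b;
            b = next;
        }
        return (double) b / a;
    }

    public static void main(String[] args) {
        // Approaches (1 + sqrt(5)) / 2 = 1.6180339887... as n grows.
        System.out.println(ratio(40));
    }
}
```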

Golden ratio
Fibonacci number

That raises some questions:

  • Why must the golden ratio be the number we accumulate by? A video I watched claimed that the golden ratio spreads the hash values out well (I have not verified this), but wouldn't some other number spread them just as well?
  • ThreadLocalMap also masks off the high bits of the hash with a bitwise AND. Why do the resulting values remain well spread after the high bits are discarded? (I am not sure they always do.)

Concepts and Theory

The usual classification of hash functions

  • Division-based hash functions: h(x) = x % M
  • Bit-shifting hash functions using the middle-square method
  • Multiplicative hash functions

Fibonacci hashing is one implementation of the multiplicative hash functions:

In the multiplicative method, the multiplier A must be carefully chosen to scatter the numbers in the table. A very good choice for the constant A is φ⁻¹W, where φ, known as the golden ratio, is the positive root of the polynomial x² − x − 1. This is because x is a solution to the polynomial iff x² − x − 1 = 0, iff x² − x = 1, iff x − 1 = 1/x. In other words, a solution x has the property that its inverse is x − 1. The solution φ is called the golden ratio because it arises in nature in an astonishing number of places and because the ancient Greeks used this ratio in the design of many of their buildings, considering it a divine proportion. Thus φ and φ⁻¹ = φ − 1 are both roots of x² − x − 1. φ is also the value to which fₙ/fₙ₋₁ converges as n → ∞, where fₙ is the nth Fibonacci number. Since φ⁻¹ = φ − 1, it is approximately 0.6180339887498948482045868343656. (Well, "approximately" depends on your notion of approximation, doesn't it?) When we let A = φ⁻¹W as the multiplicative constant, the multiplicative method is called Fibonacci hashing. The constant has many remarkable and astonishing mathematical properties, but the property that makes it a good factor in the above hash function is the following fact. First observe that the sequence of values {φ⁻¹}, {2φ⁻¹}, {3φ⁻¹}, ... lies entirely in the interval (0, 1). Remember that the curly braces mean "the fractional part of"; so, for example, 2φ⁻¹ ≈ 1.236067977 and {2φ⁻¹} ≈ 0.236067977. The first value, {φ⁻¹}, divides the interval (0, 1) into two segments whose lengths are in the golden ratio.

The usual two steps of a hash function

Main article: Hash function. The idea of hashing is to distribute the entries (key/value pairs) across an array of buckets. Given a key, the algorithm computes an index that suggests where the entry can be found: index = f(key, array_size). Often this is done in two steps: hash = hashfunc(key); index = hash % array_size. In this method, the hash is independent of the array size, and it is then reduced to an index (a number between 0 and array_size − 1) using the modulo operator (%). In the case that the array size is a power of two, the remainder operation is reduced to masking, which improves speed, but can increase problems with a poor hash function.[5]

Hash table

Modulo operations might be implemented such that a division with a remainder is calculated each time. For special cases, on some hardware, faster alternatives exist. For example, the modulo of powers of 2 can alternatively be expressed as a bitwise AND operation: x % 2^n == x & (2^n - 1). Examples (assuming x is a positive integer): x % 2 == x & 1, x % 4 == x & 3, x % 8 == x & 7. In devices and software that implement bitwise operations more efficiently than modulo, these alternative forms can result in faster calculations.[11] Optimizing compilers may recognize expressions of the form expression % constant where constant is a power of two and automatically implement them as expression & (constant-1). This can allow writing clearer code without compromising performance. This optimization is not possible for languages in which the result of the modulo operation has the sign of the dividend (including C), unless the dividend is of an unsigned integer type. This is because, if the dividend is negative, the modulo will be negative, whereas expression & (constant-1) will always be positive.
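The equivalence, and the caveat about negative dividends in languages like Java and C, can be seen directly (a small sketch, names mine):

```java
public class ModMaskDemo {
    // For non-negative x and power-of-two m, x % m == x & (m - 1).
    // For negative x the two differ in Java: % keeps the sign of the
    // dividend, while the masked result is always non-negative.
    public static void main(String[] args) {
        int m = 16;
        for (int x = 0; x < 1000; x++) {
            assert x % m == (x & (m - 1));
        }
        System.out.println(-5 % m);        // -5  (sign of the dividend)
        System.out.println(-5 & (m - 1));  // 11  (always non-negative)
    }
}
```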

Modulo operation

What makes a good hash function

Properties of a Good Hash Function. To hash means to chop up or make a mess of things, like hashed potatoes. A hash function is supposed to chop up its argument and reconstruct a value out of the chopped up little pieces. Good hash functions
  • make the original value intractably hard to reconstruct from the computed hash value,
  • are easy to compute (for speed), and
  • randomly disperse keys evenly throughout the table, making sure that no two keys map to the same index.

Choosing a hash function. A good hash function and implementation algorithm are essential for good hash table performance, but may be difficult to achieve.[6] A basic requirement is that the function should provide a uniform distribution of hash values. A non-uniform distribution increases the number of collisions and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using statistical tests, e.g., a Pearson's chi-squared test for discrete uniform distributions.[7][8]

The distribution needs to be uniform only for table sizes that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of the table size, then the hash function needs to be uniform only when the size is a power of two. Here the index can be computed as some range of bits of the hash function. On the other hand, some hashing algorithms prefer to have the size be a prime number.[9] The modulus operation may provide some additional mixing; this is especially useful with a poor hash function.

For open addressing schemes, the hash function should also avoid clustering, the mapping of two or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent. The popular multiplicative hash[3] is claimed to have particularly poor clustering behavior.[9]

A good hash function should produce a "discrete uniform distribution" of values. Fibonacci hashing achieves a very good discrete uniform distribution; an arbitrarily chosen constant may or may not (some do, some don't), which can be observed experimentally. Below is a demo I wrote; the red dashed box marks Fibonacci hashing, and the other panels use other constants for comparison (distribution within 0 – 1):

[Figures 1 and 2: scatter plots comparing the distributions in (0, 1)]

Below is the distribution produced by ThreadLocalMap's own hash function. One can see that with 1.618 the values are uniformly spread; some other constants also spread reasonably well, but some do not. (Again, I have not verified this rigorously.)

[Figures 3, 4 and 5: distributions of ThreadLocalMap's hash for different constants]
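For readers who want a textual version of the experiment, the sketch below (constant choices and names are mine) counts how many distinct slots a run of consecutive hash codes occupies in a power-of-two table; the golden-ratio increment touches every slot, while an even increment reaches only half of them:

```java
import java.util.HashSet;
import java.util.Set;

public class SpreadDemo {
    // Number of distinct slots hit by `capacity` consecutive hash codes
    // generated by repeatedly adding `increment`, in a table of size `capacity`.
    static int distinctSlots(int increment, int capacity) {
        Set<Integer> slots = new HashSet<>();
        int hash = 0;
        for (int i = 0; i < capacity; i++) {
            slots.add(hash & (capacity - 1));
            hash += increment; // int overflow is intentional, as in ThreadLocal
        }
        return slots.size();
    }

    public static void main(String[] args) {
        int capacity = 32;
        System.out.println(distinctSlots(0x61c88647, capacity)); // 32: every slot used
        System.out.println(distinctSlots(0x61c88646, capacity)); // 16: even increment, half the slots
    }
}
```

Only the increment's low bits matter after masking: 0x61c88647 mod 32 is 7, which is coprime to 32, so every slot is visited; 0x61c88646 mod 32 is 6, which shares a factor of 2 with 32, so half the slots are never reached.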

Re: Golden Ratio & Hash Tables
Posted: Nov 2, 2002 2:58 AM
"Alexei A. Frounze" alexfru@chat.ru wrote in message news:apop0i$3ib1k$2@ID-57378.news.dfncis.de...

Can anybody enlighten me why golden ratio is so widely used when doing hash
tables? Is it the only possible and effective value? I've not found anything
about this particular choice in Sedgewick's book, just the fact that
it's used. Any simple explanation? I believe Knuth has an explanation, but I
don't have the book handy and don't know when I'll be able to read it. Can
you explain why (sqrt(5)-1)/2, but not, let's say, (sqrt(3)-1)/2? Does this
really have anything to do with Fibonacci numbers, packing of seeds and the
like? TIA

The usual explanation is that it provides the best dispersion of
consecutive keys in a multiplicative hash table: that was Knuth's
reason for advocating it. For a hash table of size 2^N, an index for each key K can be
constructed by multiplying K * M, where M is a prime number close to
2^N * R, R being the inverse golden ratio (sqrt(5)-1)/2. Keep the N
most significant bits as the hash index. Knuth's finding was that the
dispersion of indexes for a sequence of consecutive keys is maximized
when M is chosen this way, thus a multiplicative hash table with a
dense set of keys will have the fewest possible collisions when M
approx= 2^N * R. --
Chris Green
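Knuth's scheme as described above, keeping the N most significant bits of the 32-bit product, can be sketched as follows (names are mine; 0x9E3779B9 is 2^32 times the inverse golden ratio, the value whose two's-complement negation is 0x61c88647):

```java
public class MultiplicativeHashDemo {
    // 2^32 * (sqrt(5) - 1) / 2, rounded: the 32-bit "inverse golden ratio".
    private static final int PHI_32 = 0x9E3779B9;

    // Fibonacci hashing for a table of size 2^n: multiply, let the
    // product wrap mod 2^32, and keep the n most significant bits.
    static int fibHash(int key, int n) {
        return (key * PHI_32) >>> (32 - n);
    }

    public static void main(String[] args) {
        for (int key = 0; key < 16; key++) {
            System.out.print(fibHash(key, 4) + " ");
        }
        // Consecutive keys scatter across the table instead of clustering.
    }
}
```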

Topic: Golden Ratio & Hash Tables

x mod 2^k is not a desirable function because it does not depend on all the bits in the binary representation of x.

Similarly, it is often tempting to let M be a power of two, e.g., M = 2^k for some integer k > 1. In this case, the hash function h(x) = x mod 2^k simply extracts the bottom k bits of the binary representation of x. While this hash function is quite easy to compute, it is not a desirable function because it does not depend on all the bits in the binary representation of x.
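The complaint is easy to demonstrate: keys that differ only in their high bits all map to the same bottom-k-bits index (a tiny sketch, names mine):

```java
public class LowBitsDemo {
    public static void main(String[] args) {
        int mask = 15; // table of size 2^4, keep the bottom 4 bits
        // 5, 21 and 1029 differ only above bit 4, so they all collide:
        System.out.println((5 & mask) + " " + (21 & mask) + " " + (1029 & mask)); // 5 5 5
    }
}
```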

Division Method

Collision handling in hash tables

  • separate chaining
  • open addressing

Open Addressing. In open addressing, there are no separate lists attached to the table. All values are in the table itself. When a collision occurs, the cells of the hash table itself are searched until an empty one is found. Which cells are searched depends upon the specific method of open addressing. All variations can be described generically by a sequence of functions

h_0(x) = (h(x) + f(0, x)) % M
h_1(x) = (h(x) + f(1, x)) % M
...
h_k(x) = (h(x) + f(k, x)) % M

where h_i(x) is the i-th location tested and f(i, x) is a function that returns some integer value based on the values of i and x. The idea is that the hash function h(x) is first used to find a location in the hash table for x. If we are trying to insert x into the table, and the index h(x) is empty, we insert it there. Otherwise we need to search for another place in the table into which we can store x. The function f(i, x), called the collision resolution function, serves that purpose. We search the locations (h(x) + f(0, x)) % M, (h(x) + f(1, x)) % M, (h(x) + f(2, x)) % M, ..., (h(x) + f(k, x)) % M until either an empty cell is found or the search returns to a cell previously visited in the sequence. The function f(i, x) need not depend on both i and x. Soon, we will look at a few different collision resolution functions.

To search for an item, the same collision resolution function is used. The hash function is applied to find the first index. If the key is there, the search stops. Otherwise, the table is searched until either the item is found or an empty cell is reached. If an empty cell is reached, it implies that the item is not in the table. This raises a question about how to delete items. If an item is deleted, then there will be no way to tell whether the search should stop when it reaches an empty cell, or just jump over the hole. The way around this problem is lazy deletion. In lazy deletion, the cell is marked DELETED. Only when it is needed again is it re-used.

Every cell is marked as either ACTIVE (it has an item in it), EMPTY (it has no item in it), or DELETED (it had an item that was deleted; it can be re-used). These three constants are supposed to be defined for any hash table implementation in the ANSI standard. There are several different methods of collision resolution using open addressing: linear probing, quadratic probing, double hashing, and hopscotch hashing.
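The scheme above, specialized to linear probing (f(i, x) = i) with lazy deletion, can be sketched as a toy table for int keys (all names are mine; the toy assumes the table never fills up):

```java
public class ProbingDemo {
    enum State { EMPTY, ACTIVE, DELETED }

    static final int M = 16;                 // table size, a power of two
    static final int[] keys = new int[M];
    static final State[] states = new State[M];
    static { java.util.Arrays.fill(states, State.EMPTY); }

    static int hash(int x) { return x & (M - 1); }

    // Linear probing: try h(x), h(x)+1, h(x)+2, ... mod M.
    static void insert(int x) {
        int i = hash(x);
        while (states[i] == State.ACTIVE) i = (i + 1) % M;
        keys[i] = x;
        states[i] = State.ACTIVE;            // re-uses EMPTY or DELETED cells
    }

    static boolean contains(int x) {
        int i = hash(x);
        while (states[i] != State.EMPTY) {   // stop only at a truly empty cell
            if (states[i] == State.ACTIVE && keys[i] == x) return true;
            i = (i + 1) % M;
        }
        return false;
    }

    // Lazy deletion: mark the cell DELETED so later probes jump over it.
    static void remove(int x) {
        int i = hash(x);
        while (states[i] != State.EMPTY) {
            if (states[i] == State.ACTIVE && keys[i] == x) {
                states[i] = State.DELETED;
                return;
            }
            i = (i + 1) % M;
        }
    }

    public static void main(String[] args) {
        insert(1); insert(17); insert(33);   // all hash to slot 1; probe to 1, 2, 3
        remove(17);
        System.out.println(contains(33));    // true: the DELETED cell at slot 2 is jumped over
    }
}
```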

References


Handling integer overflow in Java — "divide 2^32 by 1.618 and accumulate by the result each time": this accumulation overflows the int range very quickly.

On two's complement: converting a negative number to positive (or vice versa) uses the two's-complement operation, namely: invert all the bits, then add binary 1.

Fibonacci Hashing

Fibonacci Hashing & Fastest Hashtable

Hashing, Hash Tables and Scatter Tables

Fibonacci Hashing: The Optimization that the World Forgot (or: a Better Alternative to Integer Modulo) — this one I recommend most; see also its discussion on Hacker News.

Fibonacci Hashing

Why 0x61c88647 is used

Why 0x61c88647?

A Brief Discussion of Algorithms and Data Structures, Part 11: Hash Tables
