Hash Tables: Introduction to Algorithms (13)

Date: 2019-11-13


1. Introduction

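Many applications need a dynamic set that supports at least the dictionary operations insert, search, and delete. A hash table is an effective data structure for implementing these dictionary operations.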

2. Direct-Address Tables

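Before introducing hash tables, we first look at direct-address tables.

Direct addressing is a simple and effective technique when the universe U of keys (the range of possible keys) is reasonably small. Suppose an application needs a dynamic set in which each element has a key drawn from the universe U = {0, 1, …, m-1}, where m is not too large. Assume further that no two elements have the same key.

To represent the dynamic set, we use an array, called a direct-address table, denoted T[0..m-1]. Each position, or slot, corresponds to one key in the universe U: slot k points to the element of the set with key k, and if the set contains no element with key k, then T[k] = NIL.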

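The dictionary operations are trivial to implement: search(T, k) returns T[k], insert(T, x) sets T[x.key] = x, and delete(T, x) sets T[x.key] = NIL.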

Each of these operations takes O(1) time.
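The three operations can be sketched in Java as follows. This is a minimal illustration, not code from the original article: the `Element` type with its `data` field and the class name `DirectAddressTable` are hypothetical, and `null` plays the role of NIL.

```java
/** Hypothetical element type: an integer key in 0..m-1 plus satellite data. */
final class Element {
    final int key;
    final String data;

    Element(int key, String data) {
        this.key = key;
        this.data = data;
    }
}

/** Direct-address table T[0..m-1]; null plays the role of NIL. */
final class DirectAddressTable {
    private final Element[] slots;

    DirectAddressTable(int m) {
        slots = new Element[m]; // every slot starts out as null (NIL)
    }

    /** Returns T[k], or null if no element with key k is stored. */
    Element search(int k) {
        return slots[k];
    }

    /** T[x.key] = x */
    void insert(Element x) {
        slots[x.key] = x;
    }

    /** T[x.key] = NIL */
    void delete(Element x) {
        slots[x.key] = null;
    }

    public static void main(String[] args) {
        DirectAddressTable t = new DirectAddressTable(10);
        Element e = new Element(3, "three");
        t.insert(e);
        System.out.println(t.search(3).data); // prints "three"
        t.delete(e);
        System.out.println(t.search(3));      // prints "null"
    }
}
```

Each method is a single array access, which is where the O(1) bound comes from.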

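In some applications, the elements themselves can be stored directly in the slots of the table, rather than storing in each slot a pointer to a separate object as in the figure above, which saves space.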

3. Hash Tables

(1) Drawbacks of direct addressing

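Direct addressing has some obvious drawbacks. If the universe U is large, the table T needs a very long block of memory, and the allocation may well fail. And when the universe is large but the keys actually stored are sparse, this layout wastes a great deal of space.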

(2) Hash functions

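To overcome the drawbacks of direct addressing while keeping its fast dictionary operations, we can use a hash function

h: U → {0, 1, 2, …, m-1}

to compute the slot for a key k. Put simply, a hash function h(k) maps keys from a large range into a much smaller set. We say that an element with key k hashes to slot h(k), or that h(k) is the hash value of key k.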

[Figure: schematic of a hash function h mapping keys to the slots of the table T]

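This raises a problem: two keys may map to the same slot, which we call a collision. No matter how h(k) is tuned, collisions cannot be ruled out: because |U| > m, at least two keys in U must share a slot.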

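We therefore face two questions: first, how to resolve collisions when they occur; second, how to find a hash function h(k) that keeps collisions to a minimum.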

(3) Resolving collisions by chaining

We tackle the first question first.

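The solution is to link all the elements that hash to the same slot into a linked list, and to store in the slot a pointer to the head of that list. [Figure: elements that hash to the same slot are chained into a linked list]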

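With this scheme, the dictionary operations work as follows: insert(T, x) places x at the head of the list T[h(x.key)]; search(T, k) looks for an element with key k in the list T[h(k)]; delete(T, x) removes x from the list T[h(x.key)].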
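A minimal Java sketch of a chained hash table is shown below. Several details are assumptions made for illustration and do not come from the original article: keys are ints with String values, the hash function is h(k) = k mod m, the class and method names are hypothetical, and delete takes a key rather than a pointer to the element, so it has to search the chain first.

```java
import java.util.LinkedList;

/** Hash table with chaining; int keys and String values are assumed for illustration. */
final class ChainedHashTable {
    /** One list entry: a key and its satellite data. */
    private static final class Entry {
        final int key;
        final String value;

        Entry(int key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final LinkedList<Entry>[] table;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int m) {
        table = new LinkedList[m];
        for (int j = 0; j < m; j++) {
            table[j] = new LinkedList<>(); // java.util.LinkedList is doubly linked
        }
    }

    /** Assumed hash function: h(k) = k mod m, kept non-negative by floorMod. */
    private int h(int key) {
        return Math.floorMod(key, table.length);
    }

    /** Inserts at the head of the chain T[h(key)]; assumes key is not already present. */
    void insert(int key, String value) {
        table[h(key)].addFirst(new Entry(key, value));
    }

    /** Walks the chain T[h(key)]; returns the value, or null if the key is absent. */
    String search(int key) {
        for (Entry e : table[h(key)]) {
            if (e.key == key) {
                return e.value;
            }
        }
        return null;
    }

    /** Removes the entry with this key from its chain; returns true if one was removed. */
    boolean delete(int key) {
        return table[h(key)].removeIf(e -> e.key == key);
    }
}
```

For example, with m = 4 the keys 1 and 5 land in the same slot, and both remain retrievable because they sit in the same chain.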

We now analyze the performance of these operations.

First, insertion clearly takes O(1) time.

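Next, deletion costs as much as removing an element from a linked list: if the list T[h(k)] is doubly linked, it takes O(1) time; if it is singly linked, it has the same asymptotic running time as search.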

We now look more closely at the running time of search.

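First, we assume that any given element is equally likely to hash into any slot of the table T, independently of where any other element has hashed to. We call this assumption simple uniform hashing.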

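Suppose the hash table T has m slots and stores n elements, so that each slot holds α = n/m elements on average; we call α the load factor of T. Denote the list at slot j by T[j] (j = 0, 1, …, m-1) and let n_j be the length of T[j]. Then

n = n_0 + n_1 + … + n_{m-1},

and E[n_j] = α = n/m.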

We now consider unsuccessful and successful searches separately.

① Unsuccessful search

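In an unsuccessful search, we must traverse every element of the list T[j], whose expected length is α, which takes expected time O(α). Adding the O(1) time needed to compute h(k) and index into T[j], the total expected time is Θ(1 + α).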

② Successful search

In a successful search, we cannot know in advance where in the list T[j] the traversal will stop, so we can only analyze the average case.

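Let x_i be the i-th element inserted into the table T (that is, we number the n elements 1 to n in order of insertion), and let k_i = x_i.key for i = 1, 2, …, n. Define the indicator random variable X_ij = I{h(k_i) = h(k_j)}, that is, X_ij = 1 if h(k_i) = h(k_j) and X_ij = 0 otherwise.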

Under simple uniform hashing, for i ≠ j,

P{h(k_i) = h(k_j)} = 1/m,

E[X_ij] = 1/m.

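Assuming new elements are inserted at the head of a list, the elements examined before x_i is found are exactly those inserted after x_i that hash to the same slot. The expected number of elements examined, averaged over the n elements, is therefore

E[(1/n) Σ_{i=1..n} (1 + Σ_{j=i+1..n} X_ij)] = 1 + (1/(nm)) Σ_{i=1..n} (n - i) = 1 + (n - 1)/(2m) = 1 + α/2 - α/(2n).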

Therefore, including the time to compute the hash function, a successful search takes time Θ(2 + α/2 - α/(2n)) = Θ(1 + α).
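As a sanity check on the algebra, the expected number of elements examined in a successful search, (1/n) Σ_{i=1..n} (1 + Σ_{j=i+1..n} 1/m), can be evaluated directly and compared with the closed form 1 + α/2 - α/(2n). The sketch below does only that arithmetic; the class and method names are hypothetical, and it does not simulate an actual hash table.

```java
/** Numeric check of the expected cost of a successful search under chaining. */
final class SuccessfulSearchCost {
    /** Closed form 1 + α/2 - α/(2n), with α = n/m. */
    static double closedForm(int n, int m) {
        double alpha = (double) n / m;
        return 1 + alpha / 2 - alpha / (2.0 * n);
    }

    /** Direct evaluation of (1/n) * sum over i of (1 + (n - i)/m), using E[X_ij] = 1/m. */
    static double bySummation(int n, int m) {
        double total = 0;
        for (int i = 1; i <= n; i++) {
            total += 1 + (double) (n - i) / m;
        }
        return total / n;
    }

    public static void main(String[] args) {
        int n = 1000;
        int m = 500; // load factor α = 2
        System.out.println(closedForm(n, m));
        System.out.println(bySummation(n, m));
    }
}
```

With n = 1000 and m = 500 (α = 2), both methods give approximately 1.999, so on average about two elements are examined even though every chain holds two elements on average.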

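Putting the analysis together: provided the number of slots is at least proportional to the number of elements, so that n = O(m) and α = O(1), every dictionary operation takes O(1) time on average.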

4. Hash Functions

We now turn to the second question: how to construct a good hash function.

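A good hash function should (approximately) satisfy simple uniform hashing: each key is equally likely to hash to any of the slots, independently of where any other key hashes to. Unfortunately, we usually have no way to check whether this condition holds.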

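In practice, heuristics can often be used to construct a hash function that performs well, and useful information about the distribution of the keys can be exploited in the design. A good approach derives hash values that are, to some degree, independent of any patterns that might exist in the data.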

Two basic methods for constructing hash functions follow.

(1) The division method

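The division method is simple: divide the key k by a number m and take the remainder, which maps k to one of m slots. That is, the hash function is

h(k) = k mod m.

Since it requires only a single division, the method is very fast. However, certain values of m should be avoided. For example, m should not be a power of 2, because if m = 2^p, then h(k) is just the p lowest-order bits of k. Unless we know that all patterns of the low-order p bits are equally likely, we should choose m with care. A prime not too close to an exact power of 2 is often a good choice.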
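The effect of choosing m = 2^p can be seen in a small Java sketch. The class name and the sample keys are made up for illustration: the three keys differ in their high bits but share the same low 4 bits, so they all collide when m = 16 and spread out when m is the prime 13.

```java
/** Division-method hashing, h(k) = k mod m, for non-negative keys. */
final class DivisionHash {
    static int hash(int k, int m) {
        return Math.floorMod(k, m);
    }

    public static void main(String[] args) {
        // Sample keys 165, 117 and 245: different high bits, identical low 4 bits (0101).
        int[] keys = {0b1010_0101, 0b0111_0101, 0b1111_0101};
        for (int k : keys) {
            System.out.println(k + ": m=16 -> " + hash(k, 16) + ", m=13 -> " + hash(k, 13));
        }
    }
}
```

With m = 16 all three keys hash to slot 5, while with m = 13 they hash to slots 9, 0, and 11.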

(2) The multiplication method

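The method has two steps. First, multiply the key k by a constant A (0 < A < 1) and extract the fractional part of kA. Second, multiply this value by m and take the floor. That is, the hash function is

h(k) = ⌊m (kA mod 1)⌋,

where "kA mod 1" means the fractional part of kA, that is, kA - ⌊kA⌋.

An advantage of the multiplication method is that the choice of m is not critical; we typically choose it to be a power of 2. Although the method works with any value of A, Knuth suggests that A ≈ (√5 - 1)/2 = 0.618033988… is a good choice.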
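When m = 2^p, the multiplication method can be computed with integer arithmetic alone. The sketch below rests on assumptions that go beyond the text: a 32-bit word size, non-negative int keys, and A approximated by s / 2^32 with s = 2654435769, the integer for which s / 2^32 is closest to (√5 - 1)/2. The class and method names are hypothetical.

```java
/** Multiplication-method hashing with m = 2^p, using 32-bit fixed-point arithmetic. */
final class MultiplicationHash {
    /** Assumed constant: s such that s / 2^32 approximates A = (sqrt(5) - 1) / 2. */
    static final long S = 2654435769L;

    /**
     * Returns floor(m * (k*A mod 1)) for m = 2^p.
     * Assumes 0 <= k < 2^31 and 1 <= p <= 31.
     */
    static int hash(int k, int p) {
        long product = (long) k * S;           // k * s, fits in a signed 64-bit long
        long fraction = product & 0xFFFFFFFFL; // low 32 bits = (kA mod 1) * 2^32
        return (int) (fraction >>> (32 - p));  // the top p bits of the fraction
    }

    public static void main(String[] args) {
        System.out.println(hash(123456, 14)); // m = 2^14 = 16384; prints 67
    }
}
```

Keeping only the low 32 bits of the product discards the integer part of kA, and the shift multiplies the remaining fraction by m = 2^p and takes the floor, so no floating-point arithmetic is needed.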

5. Bloom Filters

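A Bloom filter is a data structure commonly used to test whether an element belongs to a set. As this suggests, the set only needs to keep a "fingerprint" of each element rather than the element itself. A Bloom filter consists of a long bit vector and a series of random mapping functions. Compared with other approaches, it offers high space efficiency and fast lookups.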

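Before describing Bloom filters in detail, consider the following scenario. A company wants to help users who are plagued by nuisance calls. It plans to build a blacklist of nuisance phone numbers by storing all such numbers in a hash table. When a user receives a call from an unknown number, a server immediately checks the number against the blacklist and blocks the call if it matches.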

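Of course, they would not store the raw phone numbers in the hash table. Instead, each number is compressed by some algorithm into an 8-byte fingerprint, which is then stored in the table. Even so, a problem remains: since the space utilization of a hash table is only about 50%, storing one number effectively costs 16 bytes. At that rate, storing 100 million numbers takes about 1.6 GB, and storing several billion numbers could take on the order of a hundred GB. Is there a better way?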

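This is where the Bloom filter comes in. Suppose we need to record 100 million nuisance numbers. We first create a vector of 200 million bytes (that is, 1.6 billion bits, which we number from 1 to 1.6 billion) and set every bit to 0. To insert a phone number x, we use some algorithm to map x to 8 of the 1.6 billion bit positions and set those 8 bits to 1. (The algorithm is assumed to map to every bit with equal probability, with each mapping independent of the others.) To look up a number, we map it to 8 bit positions in the same way: if all 8 bits are 1, the number is reported as being in the blacklist; otherwise it is definitely not in it.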

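The Bloom filter's approach closely resembles the way a hash function maps keys into a hash table, so a Bloom filter also suffers from collisions. A collision can cause a "good" number to be misjudged as a nuisance number (although a nuisance number is never misjudged as a "good" one). The calculation below shows that in most situations this false-positive rate is tolerable.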

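Suppose a Bloom filter has m slots (bits) in total and we add n numbers to it, each of which is mapped to k slots. Adding the n numbers therefore produces kn mappings. Each of the m slots is equally likely to be chosen by a mapping, so:

In a single mapping, the probability that a given slot is chosen (and hence set to 1) is

1/m.

The probability that the slot remains 0 is

1 - 1/m.

After kn mappings, the probability that the slot is still 0 is

(1 - 1/m)^(kn),

and the probability that it is 1 is

1 - (1 - 1/m)^(kn).

So the probability of a false positive (all k slots being 1) is

p = (1 - (1 - 1/m)^(kn))^k.

Using the approximation (1 - 1/m)^m ≈ e^(-1) for large m, this becomes

p ≈ (1 - e^(-kn/m))^k = (1 - e^(-kα))^k, where α = n/m.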

Note that when k = 1 this reduces to the hash table case, where p ≈ 1 - e^(-α) is the probability of a collision.

Depending on which quantity is treated as the independent variable, we consider two cases:

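① Treat the false-positive rate p as a function of the load factor α, with k held constant. From the graph of this function we can draw the following conclusions: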

As the load factor α (α = n/m) grows, the false-positive rate (or the probability of a collision) also grows, but at a gradually decreasing rate.

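To keep the false-positive rate below 0.5, the load factor must be below about 0.7 (for k = 1, 1 - e^(-α) < 0.5 gives α < ln 2 ≈ 0.693). This goes some way toward explaining why the default load factor of the JDK's HashMap is 0.75.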

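② Treat the false-positive rate p as a function of k, with α held constant. Differentiating p with respect to k shows that p is minimized when k = ln 2 / α. At that point p = 2^(-k) (equivalently, k = -ln p / ln 2), which lets us compute the most suitable value of k from the false-positive rate we are willing to tolerate.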
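To make these formulas concrete, the sketch below evaluates them for the blacklist example: n = 100 million numbers, m = 1.6 billion bits, and k = 8 bits per number, so α = 1/16. The class and method names are hypothetical, and the code only evaluates the formulas; it does not build a filter.

```java
/** Evaluates the Bloom filter false-positive formulas for given n, m and k. */
final class BloomMath {
    /** Approximation p = (1 - e^(-k*α))^k, with α = n/m. */
    static double falsePositiveRate(int k, double alpha) {
        return Math.pow(1 - Math.exp(-k * alpha), k);
    }

    /** Unapproximated form p = (1 - (1 - 1/m)^(k*n))^k. */
    static double falsePositiveRateExact(int k, long n, long m) {
        return Math.pow(1 - Math.pow(1 - 1.0 / m, (double) k * n), k);
    }

    /** Real-valued k that minimizes p for a given load factor: k = ln 2 / α. */
    static double optimalK(double alpha) {
        return Math.log(2) / alpha;
    }

    public static void main(String[] args) {
        long n = 100_000_000L;   // numbers stored
        long m = 1_600_000_000L; // bits in the vector
        double alpha = (double) n / m;
        System.out.println("alpha      = " + alpha);
        System.out.println("p (k = 8)  = " + falsePositiveRate(8, alpha));
        System.out.println("p exact    = " + falsePositiveRateExact(8, n, m));
        System.out.println("optimal k  = " + optimalK(alpha));
        System.out.println("p (k = 11) = " + falsePositiveRate(11, alpha));
    }
}
```

For this example the false-positive rate with k = 8 comes out at roughly 5.7 × 10^-4, and the optimal k is about 11.1, so rounding to k = 11 lowers the rate slightly further.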

Below is a Java implementation of a Bloom filter (from https://github.com/MagnusS/Java-BloomFilter, with the variable and method names changed to match those used above):

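import java.io.Serializable;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
import java.util.Collection;

public class BloomFilter<E> implements Serializable {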
    private static final long serialVersionUID = -9077350041930475408L;

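    private BitSet bitset; // the bit vector
    private int slotSize; // total number of bits (slots) in the bit vector (m in the text)
    private double loadFactor; // load factor (α = n / m in the text)
    private int capacity; // number of elements the filter is sized for (n in the text)
    private int size; // number of elements added so far
    private int k; // number of bits each element maps to (k in the text)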

    static final Charset charset = Charset.forName("UTF-8");
    static final String hashName = "MD5"; // MD5 by default; SHA-1 can be used instead
    static final MessageDigest digestFunction;

    static {
        MessageDigest tmp;
        try {
            tmp = java.security.MessageDigest.getInstance(hashName);
        } catch (NoSuchAlgorithmException e) {
            tmp = null;
        }
        digestFunction = tmp;
    }

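    /**
     * Creates an empty filter with slotSize bits (m), sized for capacity
     * expected elements (n); k is chosen as round((m / n) * ln 2).
     */
    public BloomFilter(int slotSize, int capacity) {
        this.size = 0;
        this.slotSize = slotSize;
        this.capacity = capacity;
        this.loadFactor = capacity / (double) slotSize; // α = n / m
        this.k = (int) Math.round((slotSize / (double) capacity) * Math.log(2.0)); // k = ln 2 / α
        this.bitset = new BitSet(slotSize);
    }

    /**
     * Creates an empty filter for capacity expected elements (n) with the
     * given target false-positive probability p.
     */
    public BloomFilter(double falsePositiveProbability, int capacity) {
        this(Math.log(2) / (Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2)))), // loadFactor α = ln 2 / k
                capacity, //
                (int) Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2)))); // k = ceil(-ln p / ln 2)
    }

    /** Re-creates a filter from existing bit data and an element count. */
    public BloomFilter(int slotSize, int capacity, int size, BitSet filterData) {
        this(slotSize, capacity);
        this.bitset = filterData;
        this.size = size;
    }

    /**
     * Creates an empty filter from a load factor α = n / m, the expected
     * number of elements n, and the number of bits per element k.
     */
    public BloomFilter(double loadFactor, int capacity, int k) {
        size = 0;
        this.loadFactor = loadFactor;
        this.capacity = capacity;
        this.k = k;
        // α = n / m, so the number of bits is m = n / α
        this.slotSize = (int) Math.ceil(capacity / loadFactor);
        this.bitset = new BitSet(slotSize);
    }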

    public static int createHash(String val, Charset charset) {
        return createHash(val.getBytes(charset));
    }

    public static int createHash(String val) {
        return createHash(val, charset);
    }

    public static int createHash(byte[] data) {
        return createHashes(data, 1)[0];
    }

    public static int[] createHashes(byte[] data, int hashes) {
        int[] result = new int[hashes];

        int k = 0;
        byte salt = 0;
        while (k < hashes) {
            byte[] digest;
            synchronized (digestFunction) {
                digestFunction.update(salt);
                salt++;
                digest = digestFunction.digest(data);
            }

            for (int i = 0; i < digest.length / 4 && k < hashes; i++) {
                int h = 0;
                for (int j = (i * 4); j < (i * 4) + 4; j++) {
                    h <<= 8;
                    h |= ((int) digest[j]) & 0xFF;
                }
                result[k] = h;
                k++;
            }
        }
        return result;
    }

    /**
     * Compares the contents of two instances to see if they are equal.
     *
     * @param obj
     *            is the object to compare to.
     * @return True if the contents of the objects are equal.
     */
    @Override
    public boolean equals(Object obj) {
        if (obj == null) {
            return false;
        }
        if (getClass() != obj.getClass()) {
            return false;
        }
        final BloomFilter<E> other = (BloomFilter<E>) obj;
        if (this.capacity != other.capacity) {
            return false;
        }
        if (this.k != other.k) {
            return false;
        }
        if (this.slotSize != other.slotSize) {
            return false;
        }
        if (this.bitset != other.bitset && (this.bitset == null || !this.bitset.equals(other.bitset))) {
            return false;
        }
        return true;
    }

    /**
     * Calculates a hash code for this class.
     * 
     * @return hash code representing the contents of an instance of this class.
     */
    @Override
    public int hashCode() {
        int hash = 7;
        hash = 61 * hash + (this.bitset != null ? this.bitset.hashCode() : 0);
        hash = 61 * hash + this.capacity;
        hash = 61 * hash + this.slotSize;
        hash = 61 * hash + this.k;
        return hash;
    }

    /**
     * Calculates the expected probability of false positives based on the
     * number of expected filter elements and the size of the Bloom filter.
     * <br />
     * <br />
     * The value returned by this method is the <i>expected</i> rate of false
     * positives, assuming the number of inserted elements equals the number of
     * expected elements. If the number of elements in the Bloom filter is less
     * than the expected value, the true probability of false positives will be
     * lower.
     *
     * @return expected probability of false positives.
     */
    public double expectedFalsePositiveProbability() {
        return getFalsePositiveProbability(capacity);
    }

    /**
     * Calculate the probability of a false positive given the specified number
     * of inserted elements.
     *
     * @param numberOfElements
     *            number of inserted elements.
     * @return probability of a false positive.
     */
    public double getFalsePositiveProbability(double numberOfElements) {
        // (1 - e^(-k * n / m)) ^ k
        return Math.pow((1 - Math.exp(-k * (double) numberOfElements / (double) slotSize)), k);

    }

    /**
     * Get the current probability of a false positive. The probability is
     * calculated from the size of the Bloom filter and the current number of
     * elements added to it.
     *
     * @return probability of false positives.
     */
    public double getFalsePositiveProbability() {
        return getFalsePositiveProbability(size);
    }

    /**
     * Returns the value chosen for K.<br />
     * <br />
     * K is the optimal number of hash functions based on the size of the Bloom
     * filter and the expected number of inserted elements.
     *
     * @return optimal k.
     */
    public int getK() {
        return k;
    }

    /**
     * Sets all bits to false in the Bloom filter.
     */
    public void clear() {
        bitset.clear();
        size = 0;
    }

    /**
     * Adds an object to the Bloom filter. The output from the object's
     * toString() method is used as input to the hash functions.
     *
     * @param element
     *            is an element to register in the Bloom filter.
     */
    public void add(E element) {
        add(element.toString().getBytes(charset));
    }

    /**
     * Adds an array of bytes to the Bloom filter.
     *
     * @param bytes
     *            array of bytes to add to the Bloom filter.
     */
    public void add(byte[] bytes) {
        int[] hashes = createHashes(bytes, k);
        for (int hash : hashes)
            bitset.set(Math.abs(hash % slotSize));
        size++;
    }

    /**
     * Adds all elements from a Collection to the Bloom filter.
     * 
     * @param c
     *            Collection of elements.
     */
    public void addAll(Collection<? extends E> c) {
        for (E element : c)
            add(element);
    }

    /**
     * Returns true if the element could have been inserted into the Bloom
     * filter. Use getFalsePositiveProbability() to calculate the probability of
     * this being correct.
     *
     * @param element
     *            element to check.
     * @return true if the element could have been inserted into the Bloom
     *         filter.
     */
    public boolean contains(E element) {
        return contains(element.toString().getBytes(charset));
    }

    /**
     * Returns true if the array of bytes could have been inserted into the
     * Bloom filter. Use getFalsePositiveProbability() to calculate the
     * probability of this being correct.
     *
     * @param bytes
     *            array of bytes to check.
     * @return true if the array could have been inserted into the Bloom filter.
     */
    public boolean contains(byte[] bytes) {
        int[] hashes = createHashes(bytes, k);
        for (int hash : hashes) {
            if (!bitset.get(Math.abs(hash % slotSize))) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns true if all the elements of a Collection could have been inserted
     * into the Bloom filter. Use getFalsePositiveProbability() to calculate the
     * probability of this being correct.
     * 
     * @param c
     *            elements to check.
     * @return true if all the elements in c could have been inserted into the
     *         Bloom filter.
     */
    public boolean containsAll(Collection<? extends E> c) {
        for (E element : c)
            if (!contains(element))
                return false;
        return true;
    }

    /**
     * Read a single bit from the Bloom filter.
     * 
     * @param bit
     *            the bit to read.
     * @return true if the bit is set, false if it is not.
     */
    public boolean getBit(int bit) {
        return bitset.get(bit);
    }

    /**
     * Set a single bit in the Bloom filter.
     * 
     * @param bit
     *            is the bit to set.
     * @param value
     *            If true, the bit is set. If false, the bit is cleared.
     */
    public void setBit(int bit, boolean value) {
        bitset.set(bit, value);
    }

    /**
     * Return the bit set used to store the Bloom filter.
     * 
     * @return bit set representing the Bloom filter.
     */
    public BitSet getBitSet() {
        return bitset;
    }

    /**
     * Returns the number of bits in the Bloom filter. Use size() to retrieve
     * the number of inserted elements.
     *
     * @return the size of the bitset used by the Bloom filter.
     */
    public int slotSize() {
        return slotSize;
    }

    /**
     * Returns the number of elements added to the Bloom filter after it was
     * constructed or after clear() was called.
     *
     * @return number of elements added to the Bloom filter.
     */
    public int size() {
        return size;
    }

    /**
     * Returns the expected number of elements to be inserted into the filter.
     * This value is the same value as the one passed to the constructor.
     *
     * @return expected number of elements.
     */
    public int capacity() {
        return capacity;
    }

    /**
     * Get the load factor α = n / m that the Bloom filter was configured
     * with, that is, the expected number of elements divided by the number
     * of bits. This value is set by the constructor when the Bloom filter
     * is created.
     *
     * @return the configured load factor.
     */
    public double getLoadFactor() {
        return this.loadFactor;
    }
}