BitSet and Bloom Filter

Date: 2019-11-09

Tags: bitset, filter, bloom filter

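Bloom Filter

A Bloom filter is a binary-vector data structure proposed by Burton Howard Bloom in 1970. It is very efficient in both space and time, and it is used to test whether an element is a member of a set. If the test answers yes, the element is not necessarily in the set; if the test answers no, the element is definitely not in the set. A Bloom filter therefore has 100% recall. Every query returns one of two results: "in the set (possibly wrong)" or "not in the set (certainly not in the set)". In other words, a Bloom filter gives up some accuracy in exchange for saving space.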

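Bloom filters do have a drawback, namely false positives: for a filter of fixed size, the false positive rate rises as more elements are added. One mitigation is to keep a separate list of the values that are known to be misjudged.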
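
To make the growth of the error rate concrete, the sketch below evaluates the usual textbook approximation p ≈ (1 - e^(-kn/m))^k for a filter with m bits, k hash functions and n inserted elements. The formula is not stated in the text above, and the class name, method name and example sizes are made up for illustration.

```java
public class BloomFalsePositiveRate {

    // Textbook approximation of the false positive rate of a Bloom filter
    // with m bits, k hash functions and n inserted elements.
    public static double estimate(long m, int k, long n) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        long m = 1L << 20; // 1,048,576 bits (128 KB), an arbitrary example size
        int k = 4;         // arbitrary example number of hash functions
        // Doubling n while m and k stay fixed drives the estimate upwards.
        for (long n = 50_000; n <= 400_000; n *= 2) {
            System.out.printf("n=%d -> p=%.6f%n", n, estimate(m, k, n));
        }
    }
}
```

With m and k held fixed, each doubling of n raises the estimate, which is the effect described above.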

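The defining property of a Bloom filter is that it never produces a false negative: if contains() returns false, the element is definitely not in the set. It does produce some false positives: if contains() returns true, the element is only possibly in the set.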
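
The one-sided guarantee can be seen in a minimal Bloom filter built on java.util.BitSet. This is only a sketch under assumptions: the class SimpleBloomFilter, the method name mightContain and the FNV-1a style hash with two arbitrary seeds are invented here for illustration and are not taken from any of the libraries mentioned in this article.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public SimpleBloomFilter(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // FNV-1a style hash over the UTF-8 bytes; the seed makes two variants.
    private static int hash(String value, int seed) {
        int h = seed;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    // i-th probe position, derived from two hashes: (h1 + i * h2) mod numBits.
    private int position(String value, int i) {
        int h1 = hash(value, 0x811C9DC5);
        int h2 = hash(value, 0x9747B28C);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    public void add(String value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(value, i));
        }
    }

    // false: definitely never added. true: probably added, may be a false positive.
    public boolean mightContain(String value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(value, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        filter.add("apple");
        filter.add("banana");
        System.out.println(filter.mightContain("apple"));  // true
        System.out.println(filter.mightContain("cherry")); // usually false, but true is possible
    }
}
```

Because add() only ever sets bits and nothing clears them, every added value keeps all of its probe bits set, so a false negative cannot occur. A value that was never added may still find all of its probe bits set by other values, which is where false positives come from.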

Many open-source frameworks ship a Bloom filter implementation, for example:

Elasticsearch: org.elasticsearch.common.util.BloomFilter

Guava: com.google.common.hash.BloomFilter

Hadoop: org.apache.hadoop.util.bloom.BloomFilter (implemented on top of BitSet)

The source code is worth reading if you are interested.

How BitSet Works

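Finally, a look at how BitSet works. A BitSet is a vector of bits, each of which is either 0 or 1. Internally it is backed by a long array. With the default constructor that array initially holds a single long, so the smallest size of a BitSet is 64 bits. When more data is stored than the current array can hold, the BitSet grows the array automatically, so the bits end up stored in N longs. This growth works much like that of the List, Set and Map implementations and is transparent to the caller.
1 GB of memory holds 8*1024*1024*1024 = 8589934592 bits, so it can flag about 8.59 billion distinct numbers.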
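
The figures above can be checked with a few lines of Java. The class and method names below are made up for illustration, and the initial size of 64 is what OpenJDK's default constructor produces. Note also that java.util.BitSet is indexed by int, so a single instance can address at most 2^31 bits; covering a full gigabyte of bits would take several instances or a custom long array.

```java
import java.util.BitSet;

public class BitSetBasics {

    // Marks every value as "seen" by setting the bit at that index.
    public static BitSet markSeen(int[] values) {
        BitSet seen = new BitSet();
        for (int v : values) {
            seen.set(v);
        }
        return seen;
    }

    public static void main(String[] args) {
        // Bits available in 1 GB; the long literal avoids int overflow.
        long bitsPerGigabyte = 8L * 1024 * 1024 * 1024;
        System.out.println(bitsPerGigabyte);       // 8589934592

        System.out.println(new BitSet().size());   // 64 on OpenJDK: one backing long

        BitSet seen = markSeen(new int[] {3, 17, 3, 42, 17});
        System.out.println(seen.get(42));          // true
        System.out.println(seen.get(5));           // false
        System.out.println(seen.cardinality());    // 3, duplicates collapse into one bit
        System.out.println(seen);                  // {3, 17, 42}
    }
}
```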

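BitSet uses one bit to record whether a value has appeared: 0 means it has not, 1 means it has. One element of the long array can track 64 values, because a Java long occupies 8 bytes = 64 bits. The details are visible in the JDK source.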

First, the implementation of the set method:

public void set(int bitIndex) {
    if (bitIndex < 0)   // the bit index must not be negative
        throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);

    // wordIndex() shifts bitIndex right by 6 (divides by 64), so every run of
    // 64 consecutive indices shares one slot in the long array.
    int wordIndex = wordIndex(bitIndex);
    expandTo(wordIndex);

    words[wordIndex] |= (1L << bitIndex); // Restores invariants
    checkInvariants();
}
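
The two pieces of arithmetic in set can be tried in isolation. In the sketch below the class BitAddressing and the helper bitMask are hypothetical names; wordIndex mirrors the shift described in the comment above. The key detail is that Java only uses the low 6 bits of the shift distance when the left operand is a long, so 1L << bitIndex already selects the bit at position bitIndex % 64 inside its word.

```java
public class BitAddressing {

    // Index of the long that holds the given bit: bitIndex / 64.
    public static int wordIndex(int bitIndex) {
        return bitIndex >> 6;
    }

    // Mask selecting the bit inside its long. For long shifts Java uses only
    // the low 6 bits of the distance, so this equals 1L << (bitIndex % 64).
    public static long bitMask(int bitIndex) {
        return 1L << bitIndex;
    }

    public static void main(String[] args) {
        int bitIndex = 70;
        System.out.println(wordIndex(bitIndex));                    // 1
        System.out.println(bitMask(bitIndex) == (1L << 6));         // true
        System.out.println(Long.toBinaryString(bitMask(bitIndex))); // 1000000
    }
}
```

Bit 70 therefore lives in the second long (index 1) at position 6.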

The get method:

public boolean get(int bitIndex) {
    if (bitIndex < 0)
        throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);

    checkInvariants();

    // As in set(), locate the long that holds this bit.
    int wordIndex = wordIndex(bitIndex);
    // Test the bit inside that long; an index past the words in use reads as false.
    return (wordIndex < wordsInUse)
        && ((words[wordIndex] & (1L << bitIndex)) != 0);
}
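
One consequence of the wordIndex < wordsInUse check is that reading a bit far beyond anything that was set simply returns false and never allocates memory. The small demo below illustrates this; the class name is invented, and the size of 64 assumes OpenJDK's default constructor.

```java
import java.util.BitSet;

public class BitSetGetDemo {
    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(5);
        System.out.println(bits.get(5));    // true
        System.out.println(bits.get(1000)); // false: index lies beyond the words in use
        System.out.println(bits.size());    // 64 on OpenJDK: get() did not grow the array
    }
}
```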

Dynamic capacity growth:

private void ensureCapacity(int wordsRequired) {
    if (words.length < wordsRequired) {
        // Allocate larger of doubled size or required size
        // By default the capacity doubles; if the requirement exceeds twice
        // the current length, the required length is used instead.
        // wordsRequired = (requested bit index >> 6) + 1
        int request = Math.max(2 * words.length, wordsRequired);
        words = Arrays.copyOf(words, request);
        sizeIsSticky = false;
    }
}
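
The growth rule can be replayed next to a real BitSet to see both branches of Math.max at work. In this sketch grownLength is a hypothetical helper that mirrors the rule in ensureCapacity; the "actual" numbers come from size() and assume the OpenJDK implementation shown above.

```java
import java.util.BitSet;

public class BitSetGrowthDemo {

    // Hypothetical helper mirroring ensureCapacity: grow only when needed,
    // to the larger of double the current length and the required length.
    public static int grownLength(int currentWords, int wordsRequired) {
        if (currentWords >= wordsRequired) {
            return currentWords;
        }
        return Math.max(2 * currentWords, wordsRequired);
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(); // starts with one long (64 bits) on OpenJDK
        int words = 1;
        for (int bitIndex : new int[] {64, 128, 1000}) {
            bits.set(bitIndex);
            words = grownLength(words, (bitIndex >> 6) + 1);
            System.out.println("after set(" + bitIndex + "): predicted "
                    + words * 64 + " bits, actual " + bits.size());
        }
        // after set(64):   predicted 128 bits, actual 128
        // after set(128):  predicted 256 bits, actual 256
        // after set(1000): predicted 1024 bits, actual 1024
    }
}
```

Setting bit 128 needs only 3 words, but doubling from 2 gives 4, so doubling wins. Setting bit 1000 needs 16 words, which is more than double the current 4, so the required length wins.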