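1. HyperLogLog
HyperLogLog is used for cardinality counting.
It can count many kinds of things with very little memory, for example the number of registered IPs, the number of IPs visiting per day, real-time page UV (PV can simply be tracked with a plain string counter), or the number of online users, in scenarios where exact accuracy is not essential.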
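The advantage of HyperLogLog:
Even when the number or the size of the input elements is very large, the space needed to compute the cardinality is always fixed, and small.
The drawback of HyperLogLog:
It is an algorithm that estimates the cardinality, so the result carries some error; the standard error is 0.81%.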
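Each HyperLogLog key needs only about 12 KB of memory and can estimate the cardinality of close to 2^64 distinct elements. This is in sharp contrast to a set, which uses more memory the more elements it holds.
However, because HyperLogLog only uses the input elements to compute the cardinality and does not store the elements themselves, it cannot return the individual elements the way a set can. There is no way to see what exactly was counted.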
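The 12 KB and 0.81% figures can be checked with a little arithmetic. The sketch below assumes 16384 registers of 6 bits each (the layout described in the Redis source comment quoted in section 4) and the usual HyperLogLog standard-error approximation 1.04/sqrt(m), which comes from the HyperLogLog literature rather than from this article.

```python
import math

REGISTERS = 16384      # registers per key (2^14), as stated in the Redis source comment
BITS_PER_REGISTER = 6  # each register is a 6-bit counter

# Memory taken by the registers in the dense layout, in bytes.
dense_bytes = REGISTERS * BITS_PER_REGISTER // 8

# Usual approximation of the HyperLogLog standard error: 1.04 / sqrt(m).
standard_error = 1.04 / math.sqrt(REGISTERS)

print(dense_bytes)              # 12288 bytes, i.e. 12 KB
print(f"{standard_error:.2%}")  # 0.81%
```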
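2. Cardinality and estimated value
(1) Cardinality
The cardinality is the number of distinct elements in a set.
For example, for the data set {1, 3, 5, 7, 5, 7, 8}, the set of distinct elements is {1, 3, 5, 7, 8}, and the cardinality (the number of non-repeated elements) is 5.
Cardinality estimation means computing the cardinality quickly, within an acceptable margin of error.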
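For comparison, here is the exact way to compute the cardinality of the example data set, using a Python set. It is a small illustration only; the point is that a set has to keep every distinct element, so its memory grows with the cardinality.

```python
data = [1, 3, 5, 7, 5, 7, 8]

# A set keeps one copy of every distinct element.
distinct = set(data)
cardinality = len(distinct)

print(sorted(distinct))  # [1, 3, 5, 7, 8]
print(cardinality)       # 5
```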
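(2) Estimated value
The cardinality returned by the algorithm is not exact. It may be slightly higher or slightly lower than the true value, but it stays within a reasonable range.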
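To make "slightly higher or lower" concrete, the following is a toy HyperLogLog in Python. It is a sketch under assumptions, not the Redis implementation: the class name `ToyHyperLogLog` is hypothetical, SHA-1 is used as a stand-in 64-bit hash, and the estimator is the classic one from the HyperLogLog paper with linear counting for small sets.

```python
import hashlib
import math

P = 14       # index bits (assumed, to match 16384 registers)
M = 1 << P   # number of registers


def hash64(value: str) -> int:
    """64-bit hash taken from SHA-1 (a stand-in, not the hash Redis uses)."""
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")


class ToyHyperLogLog:
    def __init__(self):
        self.registers = [0] * M

    def add(self, value: str) -> bool:
        """Record one element; return True if a register changed."""
        h = hash64(value)
        index = h & (M - 1)  # low P bits select the register
        rest = h >> P        # remaining 50 bits
        # rank = 1-based position of the lowest set bit in `rest`
        rank = (rest & -rest).bit_length() if rest else 64 - P + 1
        if rank > self.registers[index]:
            self.registers[index] = rank
            return True
        return False

    def count(self) -> int:
        """Estimate the number of distinct elements added so far."""
        alpha = 0.7213 / (1 + 1.079 / M)
        raw = alpha * M * M / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * M and zeros:
            # Small cardinalities: linear counting is more accurate.
            return round(M * math.log(M / zeros))
        return round(raw)


hll = ToyHyperLogLog()
n = 100_000
for i in range(n):
    hll.add(f"user-{i}")
for i in range(0, n, 10):  # duplicates do not change the estimate
    hll.add(f"user-{i}")

estimate = hll.count()
error = abs(estimate - n) / n
print(estimate, f"{error:.2%}")
```

The printed estimate should land close to 100000 without matching it exactly, while the structure only ever holds 16384 small counters.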
3. Basic HyperLogLog commands
The basic Redis HyperLogLog commands are:
1 PFADD key [element [element ...]]
Adds the specified elements to a HyperLogLog.
2 PFCOUNT key [key ...]
Returns the estimated cardinality of the given HyperLogLog(s).
3 PFMERGE destkey sourcekey [sourcekey ...]
Merges multiple HyperLogLogs into one HyperLogLog.
PFADD
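Adds any number of elements to the specified HyperLogLog. Running the command may update the internal structure of the HyperLogLog, and the return value reports whether it did:
if the estimated cardinality of the HyperLogLog changed after the command ran, it returns 1; otherwise (the elements are treated as already present) it returns 0.
One curious feature of this command is that it can be called with only a key and no elements, which simply creates an empty HyperLogLog without adding anything.
If the key already exists, nothing happens and 0 is returned; if it does not exist, it is created and 1 is returned.
The time complexity is O(1) for each element added, so it can be used freely.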
PFCOUNT
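When called with a single key, the command returns the estimated cardinality of that key. If the key does not exist, it returns 0.
When PFCOUNT is called with multiple keys, it returns the approximate cardinality of the union of all the given HyperLogLogs, computed by merging them into a temporary HyperLogLog.
With a single key, the time complexity is O(1), with a very small average constant time. With N keys it is O(N), with much larger constant times, because the keys have to be merged first.
The cardinality returned for the observed set is not exact; it is an approximation with a standard error of 0.81%.
For example, to record how many distinct search queries are run in a day, a program can call PFADD each time a search query is executed, and call PFCOUNT to get the approximate result.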
PFMERGE
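Merges multiple HyperLogLogs into one. The cardinality of the merged HyperLogLog is close to that of the union of the observed sets of all the input HyperLogLogs.
The first argument is the destination key; the remaining arguments are the HyperLogLogs to merge. The result is stored in destkey. If that key does not exist, the command first creates an empty HyperLogLog for it; if it already exists, it is treated as one of the sources.
The time complexity is O(N), where N is the number of HyperLogLogs to merge, although the constant times are relatively high.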
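Merging HyperLogLogs generally comes down to taking, for each register position, the maximum value across the inputs. The sketch below illustrates that idea on made-up 8-register arrays (a real Redis key has 16384 registers); the function name `merge_registers` is hypothetical.

```python
def merge_registers(*register_sets):
    """Merge HyperLogLog register arrays by taking the per-register maximum."""
    return [max(values) for values in zip(*register_sets)]


# Made-up register contents for two small HyperLogLogs.
a = [0, 3, 1, 0, 2, 0, 0, 5]
b = [1, 2, 1, 0, 4, 0, 0, 0]

merged = merge_registers(a, b)
print(merged)  # [1, 3, 1, 0, 4, 0, 0, 5]
```

Because the maximum is order-independent and idempotent, merging the same HyperLogLog twice, or merging in a different order, gives the same result.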
redis> PFADD ip:20170626 "192.168.0.10" "192.168.0.20" "192.168.0.30"
(integer) 1
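redis> PFADD ip:20170626 "192.168.0.20" "192.168.0.40" "192.168.0.50" # existing elements are skipped, only new ones are added
(integer) 1
redis> PFCOUNT ip:20170626 # estimated number of distinct elements
(integer) 5
redis> PFADD ip:20170626 "192.168.0.20" # already present, so nothing changes
(integer) 0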
(integer) 0
redis> PFMERGE ip:20170626 ip:20170627 ip:20170628
OK
redis> PFCOUNT ip:20170626
(integer) 5
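4. HyperLogLog as described in the source
Because HyperLogLog is not a data structure you meet often in real-world use, it is not discussed in more detail here.
Let's look at how the hyperloglog.c file describes HyperLogLog: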
/* The Redis HyperLogLog implementation is based on the following ideas:
*
* * The use of a 64 bit hash function as proposed in [1], in order to don't
* limited to cardinalities up to 10^9, at the cost of just 1 additional
* bit per register.
* * The use of 16384 6-bit registers for a great level of accuracy, using
* a total of 12k per key.
* * The use of the Redis string data type. No new type is introduced.
* * No attempt is made to compress the data structure as in [1]. Also the
* algorithm used is the original HyperLogLog Algorithm as in [2], with
* the only difference that a 64 bit hash function is used, so no correction
* is performed for values near 2^32 as in [1].
*
* [1] Heule, Nunkesser, Hall: HyperLogLog in Practice: Algorithmic
* Engineering of a State of The Art Cardinality Estimation Algorithm.
*
* [2] P. Flajolet, Éric Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The
* analysis of a near-optimal cardinality estimation algorithm.
*
* Redis uses two representations:
*
* 1) A "dense" representation where every entry is represented by
* a 6-bit integer.
* 2) A "sparse" representation using run length compression suitable
* for representing HyperLogLogs with many registers set to 0 in
* a memory efficient way.
*
*
* HLL header
* ===
*
* Both the dense and sparse representation have a 16 byte header as follows:
*
* +------+---+-----+----------+
* | HYLL | E | N/U | Cardin. |
* +------+---+-----+----------+
*
* The first 4 bytes are a magic string set to the bytes "HYLL".
* "E" is one byte encoding, currently set to HLL_DENSE or
* HLL_SPARSE. N/U are three not used bytes.
*
* The "Cardin." field is a 64 bit integer stored in little endian format
* with the latest cardinality computed that can be reused if the data
* structure was not modified since the last computation (this is useful
* because there are high probabilities that HLLADD operations don't
* modify the actual data structure and hence the approximated cardinality).
*
* When the most significant bit in the most significant byte of the cached
* cardinality is set, it means that the data structure was modified and
* we can't reuse the cached value that must be recomputed.
*
* Dense representation
* ===
*
* The dense representation used by Redis is the following:
*
* +--------+--------+--------+------// //--+
* |11000000|22221111|33333322|55444444 .... |
* +--------+--------+--------+------// //--+
*
* The 6 bits counters are encoded one after the other starting from the
* LSB to the MSB, and using the next bytes as needed.
*
* Sparse representation
* ===
*
* The sparse representation encodes registers using a run length
* encoding composed of three opcodes, two using one byte, and one using
* of two bytes. The opcodes are called ZERO, XZERO and VAL.
*
* ZERO opcode is represented as 00xxxxxx. The 6-bit integer represented
* by the six bits 'xxxxxx', plus 1, means that there are N registers set
* to 0. This opcode can represent from 1 to 64 contiguous registers set
* to the value of 0.
*
* XZERO opcode is represented by two bytes 01xxxxxx yyyyyyyy. The 14-bit
* integer represented by the bits 'xxxxxx' as most significant bits and
* 'yyyyyyyy' as least significant bits, plus 1, means that there are N
* registers set to 0. This opcode can represent from 0 to 16384 contiguous
* registers set to the value of 0.
*
* VAL opcode is represented as 1vvvvvxx. It contains a 5-bit integer
* representing the value of a register, and a 2-bit integer representing
* the number of contiguous registers set to that value 'vvvvv'.
* To obtain the value and run length, the integers vvvvv and xx must be
* incremented by one. This opcode can represent values from 1 to 32,
* repeated from 1 to 4 times.
*
* The sparse representation can't represent registers with a value greater
* than 32, however it is very unlikely that we find such a register in an
* HLL with a cardinality where the sparse representation is still more
* memory efficient than the dense representation. When this happens the
* HLL is converted to the dense representation.
*
* The sparse representation is purely positional. For example a sparse
* representation of an empty HLL is just: XZERO:16384.
*
* An HLL having only 3 non-zero registers at position 1000, 1020, 1021
* respectively set to 2, 3, 3, is represented by the following three
* opcodes:
*
* XZERO:1000 (Registers 0-999 are set to 0)
* VAL:2,1 (1 register set to value 2, that is register 1000)
* ZERO:19 (Registers 1001-1019 set to 0)
* VAL:3,2 (2 registers set to value 3, that is registers 1020,1021)
* XZERO:15362 (Registers 1022-16383 set to 0)
*
* In the example the sparse representation used just 7 bytes instead
* of 12k in order to represent the HLL registers. In general for low
* cardinality there is a big win in terms of space efficiency, traded
* with CPU time since the sparse representation is slower to access:
*
* The following table shows average cardinality vs bytes used, 100
* samples per cardinality (when the set was not representable because
* of registers with too big value, the dense representation size was used
* as a sample).
*
* 100 267
* 200 485
* 300 678
* 400 859
* 500 1033
* 600 1205
* 700 1375
* 800 1544
* 900 1713
* 1000 1882
* 2000 3480
* 3000 4879
* 4000 6089
* 5000 7138
* 6000 8042
* 7000 8823
* 8000 9500
* 9000 10088
* 10000 10591
*
* The dense representation uses 12288 bytes, so there is a big win up to
* a cardinality of ~2000-3000. For bigger cardinalities the constant times
* involved in updating the sparse representation is not justified by the
* memory savings. The exact maximum length of the sparse representation
* when this implementation switches to the dense representation is
* configured via the define server.hll_sparse_max_bytes.
*/
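The sparse encoding described in the comment can be tried out with a small Python sketch. It follows the ZERO, XZERO and VAL bit layouts as written above and reproduces the three-register example; the function names `encode_sparse` and `decode_sparse` are hypothetical, and this is an illustration rather than a port of the Redis code.

```python
def encode_sparse(registers):
    """Encode register values with the ZERO, XZERO and VAL opcodes."""
    out = bytearray()
    i, n = 0, len(registers)
    while i < n:
        value = registers[i]
        run = 1
        while i + run < n and registers[i + run] == value:
            run += 1
        remaining = run
        while remaining > 0:
            if value == 0 and remaining <= 64:
                out.append(remaining - 1)              # ZERO:  00xxxxxx
                chunk = remaining
            elif value == 0:
                chunk = min(remaining, 16384)
                out.append(0x40 | ((chunk - 1) >> 8))  # XZERO: 01xxxxxx
                out.append((chunk - 1) & 0xFF)         #        yyyyyyyy
            else:
                if value > 32:
                    raise ValueError("sparse encoding cannot hold values > 32")
                chunk = min(remaining, 4)
                out.append(0x80 | ((value - 1) << 2) | (chunk - 1))  # VAL: 1vvvvvxx
            remaining -= chunk
        i += run
    return bytes(out)


def decode_sparse(data):
    """Decode sparse bytes into a list of (opcode, ...) tuples."""
    ops, i = [], 0
    while i < len(data):
        b = data[i]
        if b & 0x80:
            ops.append(("VAL", ((b >> 2) & 0x1F) + 1, (b & 0x03) + 1))
            i += 1
        elif b & 0x40:
            ops.append(("XZERO", (((b & 0x3F) << 8) | data[i + 1]) + 1))
            i += 2
        else:
            ops.append(("ZERO", (b & 0x3F) + 1))
            i += 1
    return ops


# The example from the comment: registers 1000, 1020, 1021 set to 2, 3, 3.
registers = [0] * 16384
registers[1000] = 2
registers[1020] = 3
registers[1021] = 3

encoded = encode_sparse(registers)
print(len(encoded))            # 7
print(decode_sparse(encoded))
# [('XZERO', 1000), ('VAL', 2, 1), ('ZERO', 19), ('VAL', 3, 2), ('XZERO', 15362)]
```

The seven bytes match the figure given in the comment, compared with 12288 bytes for the dense layout.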