Redis Data Structures: HyperLogLog

Date: 2019-12-13


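1. HyperLogLog

HyperLogLog is used for cardinality estimation.

It can count many kinds of things while using very little memory, for example registered IPs, daily visiting IPs, real-time page UV (PV can simply be handled with a plain string counter), and online users, in scenarios where exact accuracy is not critical.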

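The advantage of HyperLogLog:

Even when the number or the size of the input elements is extremely large, the space needed to compute the cardinality is always fixed, and very small.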

The disadvantage of HyperLogLog:

It is an algorithm that estimates the cardinality, so the result carries some error: a standard error of 0.81%.

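Each HyperLogLog key takes only 12 KB of memory and can count the cardinality of close to 2^64 distinct elements. This is in sharp contrast to a set, which consumes more memory the more elements it holds.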
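The 12 KB and 0.81% figures can be checked with a little arithmetic. The sketch below assumes the 16384 six-bit registers mentioned in the hyperloglog.c comment quoted in section 4, and the standard error formula 1.04 / sqrt(m) from the original HyperLogLog paper; the constant names are only illustrative.

```python
import math

REGISTERS = 16384          # number of registers (m), per the hyperloglog.c comment
BITS_PER_REGISTER = 6      # each register is a 6-bit counter

# Size of the dense register array in bytes.
dense_bytes = REGISTERS * BITS_PER_REGISTER // 8

# Standard error of the HyperLogLog estimator: 1.04 / sqrt(m).
std_error = 1.04 / math.sqrt(REGISTERS)

print(dense_bytes)           # 12288
print(f"{std_error:.2%}")    # 0.81%
```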

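However, because HyperLogLog only uses the input elements to compute the cardinality and does not store the elements themselves, it cannot return the individual elements the way a set can, so there is no way to see the details of what was counted.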

2. Cardinality and estimates

(1) Cardinality

Cardinality is the number of distinct elements in a set.

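For example, for the dataset {1, 3, 5, 7, 5, 7, 8}, the set of distinct elements is {1, 3, 5, 7, 8}, so the cardinality (the number of non-repeated elements) is 5.

Cardinality estimation means computing the cardinality quickly, within an acceptable margin of error.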
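For comparison, the exact cardinality of the dataset above can be computed with an ordinary set. This is a small illustration in Python (not part of Redis); it is exact, but its memory grows with the number of distinct elements, which is the cost HyperLogLog avoids.

```python
# The example dataset from the text, with duplicates.
data = [1, 3, 5, 7, 5, 7, 8]

# A set keeps one copy of each distinct element.
distinct = set(data)
cardinality = len(distinct)

print(sorted(distinct), cardinality)  # [1, 3, 5, 7, 8] 5
```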

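(2) Estimate

The cardinality given by the algorithm is not exact. It may be slightly higher or slightly lower than the true value, but the error stays within a reasonable range.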
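To make the idea of an estimate concrete, here is a toy HyperLogLog in Python. It is only a sketch under several assumptions: the class name ToyHyperLogLog is made up, SHA-1 is an arbitrary choice of hash, and the small-range correction uses linear counting as in the original paper. It is not Redis's implementation.

```python
import hashlib
import math


class ToyHyperLogLog:
    """Illustrative HyperLogLog sketch; NOT how Redis implements it."""

    def __init__(self, p=14):
        self.p = p                       # index bits (assumes p >= 7)
        self.m = 1 << p                  # number of registers, 16384 for p=14
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash: first 8 bytes of SHA-1 (an arbitrary choice here).
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h & (self.m - 1)         # low p bits pick a register
        rest = h >> self.p               # remaining 64 - p bits
        # Rank: 1-based position of the lowest set bit in `rest`.
        rank = (rest & -rest).bit_length() if rest else 64 - self.p + 1
        if rank > self.registers[index]:
            self.registers[index] = rank
            return True                  # a register changed
        return False

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: linear counting.
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


hll = ToyHyperLogLog()
for i in range(10000):
    hll.add(f"user-{i}")
print("estimate for 10000 distinct items:", hll.count())
```

The printed value should land close to 10000 without matching it exactly, and adding the same items again leaves it unchanged, because no register can grow.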

3. Basic HyperLogLog commands

The basic Redis HyperLogLog commands are:

1 PFADD key [element [element ...]]

Adds the specified elements to a HyperLogLog.

2 PFCOUNT key [key ...]

Returns the estimated cardinality of the given HyperLogLog.

3 PFMERGE destkey sourcekey [sourcekey ...]

Merges multiple HyperLogLogs into one HyperLogLog.

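PFADD

Adds any number of elements to the specified HyperLogLog. After the command runs, the internal structure of the HyperLogLog is updated as needed, and the return value reports what happened:

if the HyperLogLog's internal cardinality estimate changed after the command, it returns 1; otherwise (the elements are considered already present) it returns 0.

A neat feature of this command is that it can be called with only a key and no elements, which just creates an empty HyperLogLog without adding anything.

In that case, if the key already exists, nothing is done and 0 is returned; if it does not exist, it is created and 1 is returned.

The time complexity is O(1) for each element added, so it can be used freely.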

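PFCOUNT

When called with a single key, it returns the estimated cardinality of that key. If the key does not exist, it returns 0.

When called with multiple keys, PFCOUNT returns the approximate cardinality of the union of all the given HyperLogLogs, computed by merging them into a temporary HyperLogLog.

With a single key, the time complexity is O(1) with a very small average constant time; with N keys, it is O(N), with a much larger constant time.

The cardinality of the observed set returned by the command is not exact; it is an approximation with a standard error of 0.81%.

For example, to record how many distinct search queries are executed in a day, a program can call PFADD each time a search query runs and call PFCOUNT to get the approximate result.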
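The search-query example could look roughly like the Python sketch below. The helper functions expect a client with redis-py style pfadd and pfcount methods. Everything else is an assumption made for illustration: the searches:<day> key naming, the helper names, and the InMemoryStandIn class, which uses exact sets so that the example runs without a Redis server.

```python
class InMemoryStandIn:
    """Hypothetical stand-in for a Redis client; uses exact sets, not HyperLogLog."""

    def __init__(self):
        self._sets = {}

    def pfadd(self, key, *values):
        created = key not in self._sets
        members = self._sets.setdefault(key, set())
        before = len(members)
        members.update(values)
        # 1 if the key was created or something new was added, else 0.
        return int(created or len(members) > before)

    def pfcount(self, *keys):
        union = set()
        for key in keys:
            union |= self._sets.get(key, set())
        return len(union)


def record_search(client, day, query):
    """Call once per executed search query (key name is an assumed convention)."""
    return client.pfadd(f"searches:{day}", query)


def distinct_searches(client, day):
    """Approximate number of distinct queries recorded for the day."""
    return client.pfcount(f"searches:{day}")


# With a real server and redis-py installed, this would instead be:
#   import redis; client = redis.Redis()
client = InMemoryStandIn()
for query in ["redis hll", "pfadd", "redis hll"]:
    record_search(client, "2019-12-13", query)
print(distinct_searches(client, "2019-12-13"))  # 2
```

Against a real Redis server the count would be an estimate rather than an exact value, although for such small inputs the two normally agree.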

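PFMERGE

Merges multiple HyperLogLogs into one. The cardinality of the merged HyperLogLog approximates that of the union of the observed sets of all the input HyperLogLogs.

The first argument is the destination key and the remaining arguments are the HyperLogLogs to merge. The result is stored in destkey; if that key does not exist, the command first creates an empty HyperLogLog for it and then performs the merge.

The time complexity is O(N), where N is the number of HyperLogLogs to merge, although the constant time of this command is relatively high.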
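Conceptually, two HyperLogLogs that use the same hash function and the same number of registers are merged by taking the larger value of each register pair. The Python sketch below shows that idea on made-up 4-register arrays (Redis uses 16384 registers); merge_registers is a hypothetical helper, not a Redis API.

```python
def merge_registers(*register_arrays):
    """Merge HyperLogLog register arrays by taking the per-register maximum."""
    return [max(column) for column in zip(*register_arrays)]


# Toy register arrays for two days (values invented for illustration).
day1 = [0, 3, 1, 0]
day2 = [2, 1, 1, 0]

merged = merge_registers(day1, day2)
print(merged)  # [2, 3, 1, 0]
```

Because each register only ever keeps its maximum, merging gives the same registers as adding every element of both inputs to a single HyperLogLog, which is why the result approximates the union.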

redis> PFADD ip:20170626 "192.168.0.10" "192.168.0.20" "192.168.0.30"
(integer) 1
redis> PFADD ip:20170626 "192.168.0.20" "192.168.0.40" "192.168.0.50" # existing elements are skipped, only new ones are added
(integer) 1
redis> PFCOUNT ip:20170626 # the estimated count is now 5
(integer) 5
redis> PFADD ip:20170626 "192.168.0.20" # already present, so nothing is added
(integer) 0
redis> PFMERGE ip:201706 ip:20170626 ip:20170627 ip:20170628
OK
redis> PFCOUNT ip:201706
(integer) 5

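4. HyperLogLog description

Because HyperLogLog is not used very often in real-world applications, it is not discussed in further detail here.

Let's look at how the hyperloglog.c file describes HyperLogLog: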

/* The Redis HyperLogLog implementation is based on the following ideas:
 *
 * * The use of a 64 bit hash function as proposed in [1], in order to don't
 *   limited to cardinalities up to 10^9, at the cost of just 1 additional
 *   bit per register.
 * * The use of 16384 6-bit registers for a great level of accuracy, using
 *   a total of 12k per key.
 * * The use of the Redis string data type. No new type is introduced.
 * * No attempt is made to compress the data structure as in [1]. Also the
 *   algorithm used is the original HyperLogLog Algorithm as in [2], with
 *   the only difference that a 64 bit hash function is used, so no correction
 *   is performed for values near 2^32 as in [1].
 *
 * [1] Heule, Nunkesser, Hall: HyperLogLog in Practice: Algorithmic
 *     Engineering of a State of The Art Cardinality Estimation Algorithm.
 *
 * [2] P. Flajolet, Éric Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The
 *     analysis of a near-optimal cardinality estimation algorithm.
 *
 * Redis uses two representations:
 *
 * 1) A "dense" representation where every entry is represented by
 *    a 6-bit integer.
 * 2) A "sparse" representation using run length compression suitable
 *    for representing HyperLogLogs with many registers set to 0 in
 *    a memory efficient way.
 *
 *
 * HLL header
 * ===
 *
 * Both the dense and sparse representation have a 16 byte header as follows:
 *
 * +------+---+-----+----------+
 * | HYLL | E | N/U | Cardin.  |
 * +------+---+-----+----------+
 *
 * The first 4 bytes are a magic string set to the bytes "HYLL".
 * "E" is one byte encoding, currently set to HLL_DENSE or
 * HLL_SPARSE. N/U are three not used bytes.
 *
 * The "Cardin." field is a 64 bit integer stored in little endian format
 * with the latest cardinality computed that can be reused if the data
 * structure was not modified since the last computation (this is useful
 * because there are high probabilities that HLLADD operations don't
 * modify the actual data structure and hence the approximated cardinality).
 *
 * When the most significant bit in the most significant byte of the cached
 * cardinality is set, it means that the data structure was modified and
 * we can't reuse the cached value that must be recomputed.
 *
 * Dense representation
 * ===
 *
 * The dense representation used by Redis is the following:
 *
 * +--------+--------+--------+------//      //--+
 * |11000000|22221111|33333322|55444444 ....     |
 * +--------+--------+--------+------//      //--+
 *
 * The 6 bits counters are encoded one after the other starting from the
 * LSB to the MSB, and using the next bytes as needed.
 *
 * Sparse representation
 * ===
 *
 * The sparse representation encodes registers using a run length
 * encoding composed of three opcodes, two using one byte, and one using
 * of two bytes. The opcodes are called ZERO, XZERO and VAL.
 *
 * ZERO opcode is represented as 00xxxxxx. The 6-bit integer represented
 * by the six bits 'xxxxxx', plus 1, means that there are N registers set
 * to 0. This opcode can represent from 1 to 64 contiguous registers set
 * to the value of 0.
 *
 * XZERO opcode is represented by two bytes 01xxxxxx yyyyyyyy. The 14-bit
 * integer represented by the bits 'xxxxxx' as most significant bits and
 * 'yyyyyyyy' as least significant bits, plus 1, means that there are N
 * registers set to 0. This opcode can represent from 0 to 16384 contiguous
 * registers set to the value of 0.
 *
 * VAL opcode is represented as 1vvvvvxx. It contains a 5-bit integer
 * representing the value of a register, and a 2-bit integer representing
 * the number of contiguous registers set to that value 'vvvvv'.
 * To obtain the value and run length, the integers vvvvv and xx must be
 * incremented by one. This opcode can represent values from 1 to 32,
 * repeated from 1 to 4 times.
 *
 * The sparse representation can't represent registers with a value greater
 * than 32, however it is very unlikely that we find such a register in an
 * HLL with a cardinality where the sparse representation is still more
 * memory efficient than the dense representation. When this happens the
 * HLL is converted to the dense representation.
 *
 * The sparse representation is purely positional. For example a sparse
 * representation of an empty HLL is just: XZERO:16384.
 *
 * An HLL having only 3 non-zero registers at position 1000, 1020, 1021
 * respectively set to 2, 3, 3, is represented by the following three
 * opcodes:
 *
 * XZERO:1000 (Registers 0-999 are set to 0)
 * VAL:2,1    (1 register set to value 2, that is register 1000)
 * ZERO:19    (Registers 1001-1019 set to 0)
 * VAL:3,2    (2 registers set to value 3, that is registers 1020,1021)
 * XZERO:15362 (Registers 1022-16383 set to 0)
 *
 * In the example the sparse representation used just 7 bytes instead
 * of 12k in order to represent the HLL registers. In general for low
 * cardinality there is a big win in terms of space efficiency, traded
 * with CPU time since the sparse representation is slower to access:
 *
 * The following table shows average cardinality vs bytes used, 100
 * samples per cardinality (when the set was not representable because
 * of registers with too big value, the dense representation size was used
 * as a sample).
 *
 * 100 267
 * 200 485
 * 300 678
 * 400 859
 * 500 1033
 * 600 1205
 * 700 1375
 * 800 1544
 * 900 1713
 * 1000 1882
 * 2000 3480
 * 3000 4879
 * 4000 6089
 * 5000 7138
 * 6000 8042
 * 7000 8823
 * 8000 9500
 * 9000 10088
 * 10000 10591
 *
 * The dense representation uses 12288 bytes, so there is a big win up to
 * a cardinality of ~2000-3000. For bigger cardinalities the constant times
 * involved in updating the sparse representation is not justified by the
 * memory savings. The exact maximum length of the sparse representation
 * when this implementation switches to the dense representation is
 * configured via the define server.hll_sparse_max_bytes.
 */
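The sparse opcodes described in the comment can be encoded and decoded with a few bit operations. The Python sketch below follows the bit layouts given there (ZERO 00xxxxxx, XZERO 01xxxxxx yyyyyyyy, VAL 1vvvvvxx) and rebuilds the worked example with registers 1000, 1020 and 1021. The function names are hypothetical, the 16-byte header is left out, and Redis's real code is written in C.

```python
def zero(run):
    """ZERO opcode: 00xxxxxx, run of 1..64 zero registers."""
    return bytes([run - 1])


def xzero(run):
    """XZERO opcode: 01xxxxxx yyyyyyyy, run of up to 16384 zero registers."""
    n = run - 1
    return bytes([0x40 | (n >> 8), n & 0xFF])


def val(value, run):
    """VAL opcode: 1vvvvvxx, value 1..32 repeated 1..4 times."""
    return bytes([0x80 | ((value - 1) << 2) | (run - 1)])


def decode_sparse(data):
    """Expand sparse opcodes (without the HLL header) into register values."""
    registers = []
    i = 0
    while i < len(data):
        b = data[i]
        if b & 0x80:                       # VAL
            value = ((b >> 2) & 0x1F) + 1
            run = (b & 0x03) + 1
            registers.extend([value] * run)
            i += 1
        elif b & 0x40:                     # XZERO (two bytes)
            run = (((b & 0x3F) << 8) | data[i + 1]) + 1
            registers.extend([0] * run)
            i += 2
        else:                              # ZERO
            run = (b & 0x3F) + 1
            registers.extend([0] * run)
            i += 1
    return registers


# The example from the comment: registers 1000, 1020, 1021 set to 2, 3, 3.
encoded = xzero(1000) + val(2, 1) + zero(19) + val(3, 2) + xzero(15362)
regs = decode_sparse(encoded)
print(len(encoded), len(regs), regs[1000], regs[1020], regs[1021])
# 7 16384 2 3 3
```

The five opcodes take 7 bytes in total and expand to all 16384 registers, which matches the space saving the comment describes for low cardinalities.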