Redis Data Structures: HyperLogLog

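1. HyperLogLog

HyperLogLog is used for cardinality estimation.

It can count many kinds of things with very little memory, for example the number of registered IPs, daily visiting IPs, real-time page UV (PV can be handled with a plain string counter), or online users, in scenarios where exact accuracy is not critical.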

 

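Advantages of HyperLogLog:

Even when the number or size of the input elements is very large, the space needed to compute the cardinality is fixed and very small.

Disadvantages of HyperLogLog:

It is an algorithm that estimates the cardinality, so the result carries a standard error of about 0.81%.

Each HyperLogLog key takes only 12 KB of memory and can estimate the cardinality of close to 2^64 distinct elements. This is in sharp contrast to a set, which consumes more memory the more elements it holds.

However, because HyperLogLog only computes the cardinality from the input elements and does not store the elements themselves, it cannot return the individual elements the way a set can, so the details of what was counted cannot be recovered.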
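The two figures above (12 KB per key, roughly 0.81% error) can be checked with a little arithmetic. The sketch below assumes 16384 registers of 6 bits each, as stated in the hyperloglog.c comment quoted in section 4, and the standard HyperLogLog error formula 1.04/sqrt(m). The variable names are illustrative only.

```python
import math

# Assumed parameters, taken from the hyperloglog.c comment quoted in section 4.
registers = 16384
bits_per_register = 6

# Size of the register array in bytes (the 16-byte header is not included).
dense_bytes = registers * bits_per_register // 8

# Standard error of the HyperLogLog estimator with m registers: 1.04 / sqrt(m).
standard_error = 1.04 / math.sqrt(registers)

print(dense_bytes)                 # 12288 bytes, i.e. 12 KB
print(round(standard_error, 4))    # about 0.0081, i.e. roughly 0.81%
```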

 

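2. Cardinality and estimates

(1) Cardinality

The cardinality is the number of distinct elements in a set.

For example, for the data set {1, 3, 5, 7, 5, 7, 8}, the distinct elements are {1, 3, 5, 7, 8}, so the cardinality (the number of non-repeated elements) is 5.

Cardinality estimation means computing the cardinality quickly, within an acceptable margin of error.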
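For comparison, the exact cardinality of the example data set can be computed with an ordinary set. This is a minimal illustration, not something Redis does internally: the set has to keep every distinct element, which is exactly the memory cost HyperLogLog avoids.

```python
data = [1, 3, 5, 7, 5, 7, 8]

# A set keeps one copy of every distinct element, so memory grows with the data.
distinct = set(data)
cardinality = len(distinct)

print(sorted(distinct))  # [1, 3, 5, 7, 8]
print(cardinality)       # 5
```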

 

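(2) Estimates

The cardinality returned by the algorithm is not exact; it may be slightly higher or lower than the true value, but it stays within a reasonable range.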
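To make "estimate" concrete, here is a toy HyperLogLog in Python. It is a sketch of the textbook algorithm under stated assumptions, not the Redis implementation: the class name ToyHyperLogLog, the use of SHA-1 as a 64-bit hash, and the simple small-range correction are all choices made for this example, and Redis uses its own hash function and estimator.

```python
import hashlib
import math


class ToyHyperLogLog:
    """Minimal HyperLogLog for illustration; not the Redis implementation."""

    def __init__(self, p=14):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers (16384 for p=14)
        self.registers = [0] * self.m

    def add(self, item):
        """Add a string; return True if a register changed."""
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")   # 64-bit hash (assumed choice)
        index = h & (self.m - 1)                # low p bits select a register
        rest = h >> self.p                      # remaining 64 - p bits
        rank = 1                                # position of the first 1 bit
        while rank <= 64 - self.p and not rest & 1:
            rest >>= 1
            rank += 1
        if rank > self.registers[index]:
            self.registers[index] = rank
            return True
        return False

    def count(self):
        """Return the estimated number of distinct items added so far."""
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)        # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: linear counting on the empty registers.
            return round(m * math.log(m / zeros))
        return round(raw)


hll = ToyHyperLogLog()
for i in range(10000):
    hll.add(f"user:{i}")

estimate = hll.count()
print(estimate)  # close to 10000, but not necessarily equal to it
```

The structure never stores the strings themselves, only one small counter per register, which is why the memory stays fixed no matter how many items are added.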

 

3. Basic HyperLogLog commands

The basic Redis HyperLogLog commands are:

1 PFADD key element [element ...]

Adds the specified elements to a HyperLogLog.

2 PFCOUNT key [key ...]

Returns the estimated cardinality of the given HyperLogLog(s).

3 PFMERGE destkey sourcekey [sourcekey ...]

Merges multiple HyperLogLogs into one HyperLogLog.

 

PFADD

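Adds any number of elements to the specified HyperLogLog. As a side effect, the internal structure of the HyperLogLog may be updated, and the reply reports whether that happened.

If the estimated cardinality changed after the command ran, it returns 1; otherwise (the elements are considered already present) it returns 0.

A neat feature of this command is that it can be called with only a key and no elements, which simply creates an empty HyperLogLog without adding anything.

In that case, if the key already exists nothing is done and 0 is returned; if it does not exist, it is created and 1 is returned.

The time complexity is O(1) for each element added, so it is cheap to use.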

 

PFCOUNT

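When called with a single key, returns the estimated cardinality of that key. If the key does not exist, returns 0.

When called with multiple keys, returns the approximate cardinality of the union of all the given HyperLogLogs, computed by merging them into a temporary HyperLogLog.

With a single key the time complexity is O(1), with a very small average constant time. With N keys it is O(N), with much larger constant times, because the HyperLogLogs have to be merged first.

The cardinality returned for the observed set is not exact; it is an approximation with a standard error of 0.81%.

For example, to track how many distinct search queries are run in a day, a program can call PFADD each time a query is executed and then call PFCOUNT to get the approximate result.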
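The search-query example could look roughly like this from application code. The sketch assumes a client object that exposes pfadd(name, *values) and pfcount(*sources), which is how the redis-py client names these commands; the helper names and the "searches:<day>" key layout are hypothetical.

```python
def record_search(client, day, query):
    """Record one search query for the given day.

    Returns the PFADD reply: 1 if the estimate changed, 0 otherwise.
    The key layout "searches:<day>" is an assumption made for this example.
    """
    return client.pfadd(f"searches:{day}", query)


def distinct_searches(client, *days):
    """Return the approximate number of distinct queries over the given days.

    With several days, PFCOUNT estimates the cardinality of the union.
    """
    keys = [f"searches:{day}" for day in days]
    return client.pfcount(*keys)
```

With redis-py installed and a server running, client would typically be created with redis.Redis(); calling record_search(client, "20170626", "hyperloglog") on every query and distinct_searches(client, "20170626") at the end of the day gives the approximate count.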

 

PFMERGE

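Merges multiple HyperLogLogs into one. The cardinality of the merged HyperLogLog approximates that of the union of the observed sets of all the input HyperLogLogs.

The merged HyperLogLog is stored in destkey. If that key does not exist, the command first creates an empty HyperLogLog for it.

The first argument is the destination key; the remaining arguments are the HyperLogLogs to merge.

The time complexity is O(N), where N is the number of HyperLogLogs to merge, but the constant time is fairly high.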
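Conceptually, merging HyperLogLogs means taking, for each register position, the maximum value across all inputs. The sketch below shows that idea on tiny hand-written register arrays; merge_registers is a hypothetical helper, and real Redis HyperLogLogs have 16384 registers rather than 4.

```python
def merge_registers(*register_arrays):
    """Merge HyperLogLog register arrays by taking the per-register maximum."""
    return [max(column) for column in zip(*register_arrays)]


# Tiny made-up register arrays standing in for three daily HyperLogLogs.
day1 = [0, 3, 1, 0]
day2 = [2, 1, 1, 0]
day3 = [0, 0, 4, 0]

merged = merge_registers(day1, day2, day3)
print(merged)  # [2, 3, 4, 0]
```

Because the maximum is taken register by register, an element seen on several days affects the merged result only once, which is why the merged estimate approximates the union rather than the sum.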

 

redis> PFADD  ip:20170626  "192.168.0.10"  "192.168.0.20"  "192.168.0.30"

(integer) 1

redis> PFADD  ip:20170626 "192.168.0.20"  "192.168.0.40"  "192.168.0.50"  # 192.168.0.20 already exists; only the new elements are added

(integer) 1

redis> PFCOUNT ip:20170626  # estimated number of distinct elements

(integer) 5

redis> PFADD  ip:20170626 "192.168.0.20"  # already present, so nothing changes

(integer) 0

redis> PFMERGE ip:20170626   ip:20170627   ip:20170628

OK

redis> PFCOUNT  ip:20170626

(integer) 5

 

4. HyperLogLog description

Since HyperLogLog is not used very often in practice, it is not discussed in further detail here.

Let's look at how the hyperloglog.c file describes HyperLogLog:

/* The Redis HyperLogLog implementation is based on the following ideas:

 *

 * * The use of a 64 bit hash function as proposed in [1], in order to don't

 *   limited to cardinalities up to 10^9, at the cost of just 1 additional

 *   bit per register.

 * * The use of 16384 6-bit registers for a great level of accuracy, using

 *   a total of 12k per key.

 * * The use of the Redis string data type. No new type is introduced.

 * * No attempt is made to compress the data structure as in [1]. Also the

 *   algorithm used is the original HyperLogLog Algorithm as in [2], with

 *   the only difference that a 64 bit hash function is used, so no correction

 *   is performed for values near 2^32 as in [1].

 *

 * [1] Heule, Nunkesser, Hall: HyperLogLog in Practice: Algorithmic

 *     Engineering of a State of The Art Cardinality Estimation Algorithm.

 *

 * [2] P. Flajolet, éric Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The

 *     analysis of a near-optimal cardinality estimation algorithm.

 *

 * Redis uses two representations:

 *

 * 1) A "dense" representation where every entry is represented by

 *    a 6-bit integer.

 * 2) A "sparse" representation using run length compression suitable

 *    for representing HyperLogLogs with many registers set to 0 in

 *    a memory efficient way.

 *

 *

 * HLL header

 * ===

 *

 * Both the dense and sparse representation have a 16 byte header as follows:

 *

 * +------+---+-----+----------+

 * | HYLL | E | N/U | Cardin.  |

 * +------+---+-----+----------+

 *

 * The first 4 bytes are a magic string set to the bytes "HYLL".

 * "E" is one byte encoding, currently set to HLL_DENSE or

 * HLL_SPARSE. N/U are three not used bytes.

 *

 * The "Cardin." field is a 64 bit integer stored in little endian format

 * with the latest cardinality computed that can be reused if the data

 * structure was not modified since the last computation (this is useful

 * because there are high probabilities that HLLADD operations don't

 * modify the actual data structure and hence the approximated cardinality).

 *

 * When the most significant bit in the most significant byte of the cached

 * cardinality is set, it means that the data structure was modified and

 * we can't reuse the cached value that must be recomputed.

 *

 * Dense representation

 * ===

 *

 * The dense representation used by Redis is the following:

 *

 * +--------+--------+--------+------//      //--+

 * |11000000|22221111|33333322|55444444 ....     |

 * +--------+--------+--------+------//      //--+

 *

 * The 6 bits counters are encoded one after the other starting from the

 * LSB to the MSB, and using the next bytes as needed.

 *

 * Sparse representation

 * ===

 *

 * The sparse representation encodes registers using a run length

 * encoding composed of three opcodes, two using one byte, and one using

 * of two bytes. The opcodes are called ZERO, XZERO and VAL.

 *

 * ZERO opcode is represented as 00xxxxxx. The 6-bit integer represented

 * by the six bits 'xxxxxx', plus 1, means that there are N registers set

 * to 0. This opcode can represent from 1 to 64 contiguous registers set

 * to the value of 0.

 *

 * XZERO opcode is represented by two bytes 01xxxxxx yyyyyyyy. The 14-bit

 * integer represented by the bits 'xxxxxx' as most significant bits and

 * 'yyyyyyyy' as least significant bits, plus 1, means that there are N

 * registers set to 0. This opcode can represent from 0 to 16384 contiguous

 * registers set to the value of 0.

 *

 * VAL opcode is represented as 1vvvvvxx. It contains a 5-bit integer

 * representing the value of a register, and a 2-bit integer representing

 * the number of contiguous registers set to that value 'vvvvv'.

 * To obtain the value and run length, the integers vvvvv and xx must be

 * incremented by one. This opcode can represent values from 1 to 32,

 * repeated from 1 to 4 times.

 *

 * The sparse representation can't represent registers with a value greater

 * than 32, however it is very unlikely that we find such a register in an

 * HLL with a cardinality where the sparse representation is still more

 * memory efficient than the dense representation. When this happens the

 * HLL is converted to the dense representation.

 *

 * The sparse representation is purely positional. For example a sparse

 * representation of an empty HLL is just: XZERO:16384.

 *

 * An HLL having only 3 non-zero registers at position 1000, 1020, 1021

 * respectively set to 2, 3, 3, is represented by the following three

 * opcodes:

 *

 * XZERO:1000 (Registers 0-999 are set to 0)

 * VAL:2,1    (1 register set to value 2, that is register 1000)

 * ZERO:19    (Registers 1001-1019 set to 0)

 * VAL:3,2    (2 registers set to value 3, that is registers 1020,1021)

 * XZERO:15362 (Registers 1022-16383 set to 0)

 *

 * In the example the sparse representation used just 7 bytes instead

 * of 12k in order to represent the HLL registers. In general for low

 * cardinality there is a big win in terms of space efficiency, traded

 * with CPU time since the sparse representation is slower to access:

 *

 * The following table shows average cardinality vs bytes used, 100

 * samples per cardinality (when the set was not representable because

 * of registers with too big value, the dense representation size was used

 * as a sample).

 *

 * 100 267

 * 200 485

 * 300 678

 * 400 859

 * 500 1033

 * 600 1205

 * 700 1375

 * 800 1544

 * 900 1713

 * 1000 1882

 * 2000 3480

 * 3000 4879

 * 4000 6089

 * 5000 7138

 * 6000 8042

 * 7000 8823

 * 8000 9500

 * 9000 10088

 * 10000 10591

 *

 * The dense representation uses 12288 bytes, so there is a big win up to

 * a cardinality of ~2000-3000. For bigger cardinalities the constant times

 * involved in updating the sparse representation is not justified by the

 * memory savings. The exact maximum length of the sparse representation

 * when this implementation switches to the dense representation is

 * configured via the define server.hll_sparse_max_bytes.

 */
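The two encodings described in the comment can be tried out in a few lines of Python. This is a sketch written from the comment's description only, not a port of the Redis C code: the function names are hypothetical, the header is ignored, and no attempt is made to match Redis byte for byte beyond the layouts quoted above.

```python
HLL_REGISTERS = 16384  # number of registers, as stated in the comment


def encode_sparse(ops):
    """Encode ZERO / XZERO / VAL opcodes into bytes."""
    out = bytearray()
    for op in ops:
        if op[0] == "ZERO":              # 00xxxxxx, run length 1..64
            out.append(op[1] - 1)
        elif op[0] == "XZERO":           # 01xxxxxx yyyyyyyy, run length up to 16384
            n = op[1] - 1
            out += bytes([0x40 | (n >> 8), n & 0xFF])
        else:                            # VAL: 1vvvvvxx, value 1..32, run 1..4
            _, value, run = op
            out.append(0x80 | ((value - 1) << 2) | (run - 1))
    return bytes(out)


def decode_sparse(data):
    """Expand sparse bytes into a flat list of register values."""
    regs, i = [], 0
    while i < len(data):
        b = data[i]
        if b & 0x80:                     # VAL
            regs += [((b >> 2) & 0x1F) + 1] * ((b & 0x03) + 1)
            i += 1
        elif b & 0x40:                   # XZERO
            regs += [0] * ((((b & 0x3F) << 8) | data[i + 1]) + 1)
            i += 2
        else:                            # ZERO
            regs += [0] * ((b & 0x3F) + 1)
            i += 1
    return regs


def dense_get(buf, index):
    """Read the 6-bit register at `index` from a dense byte buffer."""
    byte, shift = divmod(index * 6, 8)
    word = int.from_bytes(buf[byte:byte + 2], "little")
    return (word >> shift) & 0x3F


def dense_set(buf, index, value):
    """Write a 6-bit value into the register at `index`."""
    byte, shift = divmod(index * 6, 8)
    chunk = buf[byte:byte + 2]           # 1 byte only for the very last register
    word = int.from_bytes(chunk, "little")
    word = (word & ~(0x3F << shift)) | ((value & 0x3F) << shift)
    buf[byte:byte + len(chunk)] = word.to_bytes(len(chunk), "little")


# The worked example from the comment: registers 1000, 1020, 1021 set to 2, 3, 3.
ops = [("XZERO", 1000), ("VAL", 2, 1), ("ZERO", 19), ("VAL", 3, 2), ("XZERO", 15362)]
sparse = encode_sparse(ops)
registers = decode_sparse(sparse)
print(len(sparse), len(registers))       # 7 16384

# The same registers stored densely take 16384 * 6 / 8 = 12288 bytes.
dense = bytearray(HLL_REGISTERS * 6 // 8)
for idx, val in enumerate(registers):
    if val:
        dense_set(dense, idx, val)
print(len(dense), dense_get(dense, 1000), dense_get(dense, 1020))  # 12288 2 3
```

The sparse form of this example is 7 bytes, matching the figure in the comment, while the dense form is always 12288 bytes regardless of how many registers are non-zero.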
