The Underlying Implementation of Redis Data Structures (Part 2): dict, skiplist, intset

Date: 2019-11-18

1. REDIS_ENCODING_HT (dict, hash table)

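A dict is a data structure that maintains a mapping from keys to values. All key-to-value mappings in one Redis database are kept in a single dict. It is used in many other places in Redis as well: for example, a Redis hash is stored as a dict once it has many fields, and Redis combines a dict with a skiplist to maintain a zset.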

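In Redis, dict is built on a hash table. Like a conventional hash table, it uses a hash function to compute a slot in the table from the key, resolves collisions by separate chaining, and rehashes once the number of elements exceeds the load factor. The most distinctive feature of the Redis hash table is how it rehashes: it uses incremental rehashing, so that when the table has to grow it does not rehash every element in one pass, which would block normal insert, update, and delete operations.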

The dict data structures

typedef struct dictEntry {
    void *key;
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
        double d;
    } v;
    struct dictEntry *next;
} dictEntry;

typedef struct dictType {
    unsigned int (*hashFunction)(const void *key);
    void *(*keyDup)(void *privdata, const void *key);
    void *(*valDup)(void *privdata, const void *obj);
    int (*keyCompare)(void *privdata, const void *key1, const void *key2);
    void (*keyDestructor)(void *privdata, void *key);
    void (*valDestructor)(void *privdata, void *obj);
} dictType;

/* This is our hash table structure. Every dictionary has two of this as we
 * implement incremental rehashing, for the old to the new table. */
typedef struct dictht {
    dictEntry **table;
    unsigned long size;
    unsigned long sizemask;
    unsigned long used;
} dictht;

typedef struct dict {
    dictType *type;
    void *privdata;
    dictht ht[2];
    long rehashidx; /* rehashing not in progress if rehashidx == -1 */
    int iterators; /* number of iterators currently running */
} dict;

The overall layout is shown in the following figure.

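dict

dictType *type: a pointer to a dictType structure. Its function pointers let the dict store keys and values of any data type.
void *privdata: a private data pointer, passed in by the caller when the dict is created.
dictht ht[2]: two hash tables (dictht). ht[1] is only in use while a rehash is in progress; normally only ht[0] is in use. The figure above shows the state partway through a rehash.
long rehashidx: the rehash index. It is -1 when no rehash is in progress; otherwise a rehash is in progress and it holds the index of the next ht[0] bucket to migrate.
int iterators: the number of iterators currently running.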

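dictType contains several function pointers that customize the operations on the keys and values the dict stores.

hashFunction: the hash function applied to a key.
keyDup and valDup: copy functions for keys and values (deep copy).
keyCompare: defines how two keys are compared.
keyDestructor and valDestructor: destructors for keys and values.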

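dictht represents a single hash table.

table: an array of dictEntry pointers that holds the keys and values. The hash of a key maps to one slot of the array; entries that collide are chained in a linked list.
size: the length of the dictEntry array, always 2^x.
sizemask: equal to size - 1. It replaces the modulo operation: hashcode & sizemask is very fast.
used: the number of entries currently stored in the dictht. As it grows, collisions become more frequent; once it exceeds size * load factor, a rehash is triggered.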
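To make the role of sizemask concrete, here is a minimal sketch of the index computation. The helper name bucket_index is hypothetical and not part of Redis; it only illustrates why keeping size a power of two lets a bitwise AND stand in for the modulo.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of Redis: maps a hash value to a bucket
 * index for a table whose size is a power of two. */
static unsigned long bucket_index(uint64_t hash, unsigned long size) {
    assert(size != 0 && (size & (size - 1)) == 0); /* size must be 2^x */
    unsigned long sizemask = size - 1;
    return (unsigned long)(hash & sizemask);       /* same as hash % size */
}
```

For a table of size 8 the mask is 7 (binary 111), so only the three lowest bits of the hash select the bucket.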

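dictEntry represents one key-value pair.

key: a pointer to the key, usually an sds.
v (value): a union. A uint64_t, int64_t, or double value is stored inline with no extra allocation; void * can point to data of any type.
next: a pointer to the next dictEntry in the same bucket, used for chaining.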

Incremental rehashing

static void _dictRehashStep(dict *d) {
    if (d->iterators == 0) dictRehash(d,1);
}

int dictRehash(dict *d, int n) {
    int empty_visits = n*10; /* Max number of empty buckets to visit. */
    if (!dictIsRehashing(d)) return 0;

    while(n-- && d->ht[0].used != 0) {
        dictEntry *de, *nextde;

        /* Note that rehashidx can't overflow as we are sure there are more
         * elements because ht[0].used != 0 */
        assert(d->ht[0].size > (unsigned long)d->rehashidx);
        while(d->ht[0].table[d->rehashidx] == NULL) {
            d->rehashidx++;
            if (--empty_visits == 0) return 1;
        }
        de = d->ht[0].table[d->rehashidx];
        /* Move all the keys in this bucket from the old to the new hash HT */
        while(de) {
            unsigned int h;

            nextde = de->next;
            /* Get the index in the new hash table */
            h = dictHashKey(d, de->key) & d->ht[1].sizemask;
            de->next = d->ht[1].table[h];
            d->ht[1].table[h] = de;
            d->ht[0].used--;
            d->ht[1].used++;
            de = nextde;
        }
        d->ht[0].table[d->rehashidx] = NULL;
        d->rehashidx++;
    }

    /* Check if we already rehashed the whole table... */
    if (d->ht[0].used == 0) {
        zfree(d->ht[0].table);
        d->ht[0] = d->ht[1];
        _dictReset(&d->ht[1]);
        d->rehashidx = -1;
        return 0;
    }

    /* More to rehash... */
    return 1;
}

As noted above, incremental rehashing is not done in one pass; it is carried out gradually, interleaved with the regular operations on the hash table.

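1. Allocate space for ht[1], so that the dict holds two hash tables, ht[0] and ht[1], at the same time.

2. Set rehashidx in the dict to 0, meaning that the next rehash step will process ht[0].table[0]. The rehash has now started.

3. While the rehash is in progress, every create, read, update, or delete operation on the dict first calls _dictRehashStep(dict *d) to perform a single rehash step, and then carries out the requested operation.

4. The argument n of dictRehash is the number of non-empty buckets to migrate per call (_dictRehashStep passes 1); while looking for a non-empty bucket it skips at most 10*n empty ones. When it finds a non-empty bucket, it moves every entry in that bucket's chain to ht[1].table[] at the index given by the entry's hash & ht[1].sizemask. When ht[1] is twice the size of ht[0], the entries of one old bucket are therefore split between two new buckets, depending on the value of the next higher bit of the hash. rehashidx is advanced as the buckets are scanned, and the function returns 1 to indicate that more rehashing is needed.

5. As rehash steps accumulate, some operation eventually migrates the last entry. The rehash is then complete: rehashidx is set to -1 and dictRehash returns 0.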

During a rehash, delete, find, and update look in both ht[0] and ht[1], while add only inserts into ht[1].
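The interplay between the rehash step, add, and find can be sketched with a toy model. Everything below is an illustrative assumption rather than Redis code: the toy* names are made up, keys are plain ints hashed by identity, duplicate keys are not checked, and allocation failures are handled with assert only.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of incremental rehashing (hypothetical names, int keys). */
typedef struct toyEntry { int key; struct toyEntry *next; } toyEntry;
typedef struct toyTable { toyEntry **table; unsigned long size, used; } toyTable;
typedef struct toyDict { toyTable ht[2]; long rehashidx; } toyDict;

static void toy_reset(toyTable *t) { t->table = NULL; t->size = 0; t->used = 0; }

static void toy_alloc(toyTable *t, unsigned long size) {
    assert(size != 0 && (size & (size - 1)) == 0);   /* power of two */
    t->table = calloc(size, sizeof(toyEntry *));
    assert(t->table != NULL);
    t->size = size;
    t->used = 0;
}

static void toy_init(toyDict *d, unsigned long size) {
    toy_alloc(&d->ht[0], size);
    toy_reset(&d->ht[1]);
    d->rehashidx = -1;
}

/* Steps 1 and 2: allocate ht[1] and point rehashidx at bucket 0. */
static void toy_start_rehash(toyDict *d, unsigned long newsize) {
    toy_alloc(&d->ht[1], newsize);
    d->rehashidx = 0;
}

/* Steps 4 and 5: migrate one non-empty bucket, finish when ht[0] is empty. */
static void toy_rehash_step(toyDict *d) {
    if (d->rehashidx == -1) return;
    if (d->ht[0].used != 0) {
        while (d->ht[0].table[d->rehashidx] == NULL) d->rehashidx++;
        toyEntry *e = d->ht[0].table[d->rehashidx];
        while (e) {
            toyEntry *next = e->next;
            unsigned long h = (unsigned long)e->key & (d->ht[1].size - 1);
            e->next = d->ht[1].table[h];
            d->ht[1].table[h] = e;
            d->ht[0].used--;
            d->ht[1].used++;
            e = next;
        }
        d->ht[0].table[d->rehashidx++] = NULL;
    }
    if (d->ht[0].used == 0) {            /* done: ht[1] becomes ht[0] */
        free(d->ht[0].table);
        d->ht[0] = d->ht[1];
        toy_reset(&d->ht[1]);
        d->rehashidx = -1;
    }
}

/* Step 3: run one rehash step first; new keys go to ht[1] while rehashing. */
static void toy_add(toyDict *d, int key) {
    toy_rehash_step(d);
    toyTable *t = (d->rehashidx != -1) ? &d->ht[1] : &d->ht[0];
    unsigned long h = (unsigned long)key & (t->size - 1);
    toyEntry *e = malloc(sizeof(*e));
    assert(e != NULL);
    e->key = key;
    e->next = t->table[h];
    t->table[h] = e;
    t->used++;
}

/* Lookups check ht[0] and, only while rehashing, ht[1] as well. */
static int toy_find(toyDict *d, int key) {
    toy_rehash_step(d);
    for (int i = 0; i < 2; i++) {
        toyTable *t = &d->ht[i];
        if (t->size != 0) {
            toyEntry *e = t->table[(unsigned long)key & (t->size - 1)];
            for (; e; e = e->next)
                if (e->key == key) return 1;
        }
        if (d->rehashidx == -1) break;
    }
    return 0;
}
```

With four keys in a table of size 4 and a rehash started towards size 8, each subsequent toy_add or toy_find call migrates exactly one bucket, so the fourth call after toy_start_rehash completes the rehash and resets rehashidx to -1.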

2. REDIS_ENCODING_SKIPLIST (skip list)

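Redis uses the skiplist to implement zset, one of its public data structures. A zset offers a rich set of operations and fits many business scenarios, which also means that its implementation is comparatively complex.

A brief introduction to the skiplist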

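As the figure shows, the bottom level of a skip list is a sorted linked list. Every second node carries a pointer on the level above that points to the node two positions ahead, and the same rule is applied recursively level by level. The resulting tree-like structure lets a lookup in the linked list reach the time complexity of a binary search.

With this construction, each level holds half as many nodes as the level below it, and this causes a serious problem on insertion: a new node breaks the strict 2:1 ratio between adjacent levels. Preserving the strict ratio would require rebuilding the whole skip list, which degrades insertion and deletion back to O(n).

To solve this, the skiplist does not require a strict ratio between adjacent levels. Instead, it picks a random level for each node. For example, if a node's random level is 3, it is linked into levels 1 through 3, as shown in the figure below.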

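This raises a question: if the level is picked at random on every insertion, does the skip list still perform well?

The random level is not produced by a single draw; it is computed by a well-defined procedure:

1. Every node has a pointer on the lowest level (level 1).

2. If a node has a pointer on level i, it also has a pointer on level i+1 with probability p.

3. A node's level never exceeds MaxLevel.

Note that the level of each node is chosen at random and does not depend on any other node, so the resulting shape of the skiplist is independent of the order in which nodes are inserted.

Lookups in a skiplist built this way take O(log n) time on average.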
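The three rules for picking a level translate almost directly into a loop. The sketch below is an assumption-laden illustration, not the Redis implementation: random_level is a made-up name, p and max_level are passed in by the caller, and the C library rand() serves as the random source.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative sketch of the level rules; not taken from Redis. */
static int random_level(double p, int max_level) {
    assert(max_level >= 1);
    int level = 1;                        /* rule 1: every node has level 1 */
    /* rule 2: move up one more level with probability p */
    while (level < max_level &&
           (double)rand() / ((double)RAND_MAX + 1.0) < p)
        level++;
    return level;                         /* rule 3: capped at max_level */
}
```

Because each promotion succeeds with probability p, the expected number of levels per node is 1/(1-p) when the cap is ignored, which is about 1.33 for p = 0.25.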

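The skiplist in Redis

When a zset holds little data, it is implemented as a ziplist.
When it holds more data, it is implemented by a dict together with a skiplist: the dict maps each member to its score, and the skiplist finds members by score.

To support rank queries, Redis extends the skiplist so that a member can be found quickly by rank, and so that the rank is easy to obtain once a member has been found by score. Both operations are O(log n).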

typedef struct zset{
     // skip list
     zskiplist *zsl;
     // dictionary
     dict *dict;
} zset;

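The dict's key holds the member and its value holds the member's score; in a skiplist node, obj holds the member and score holds the corresponding score. The two structures share the same robj and score through pointers instead of keeping separate copies.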

Definition of the skiplist data structures

//server.h
#define ZSKIPLIST_MAXLEVEL 32
#define ZSKIPLIST_P 0.25
typedef struct zskiplistNode {
    robj *obj;
    double score;
    struct zskiplistNode *backward;
    struct zskiplistLevel {
        struct zskiplistNode *forward;
        unsigned int span;
    } level[];
} zskiplistNode;
 
typedef struct zskiplist {
    struct zskiplistNode *header, *tail;
    unsigned long length;
    int level;
} zskiplist;

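The two constants defined at the top, ZSKIPLIST_MAXLEVEL and ZSKIPLIST_P, are the MaxLevel and p mentioned above.

zskiplistNode is the node structure of the skiplist.

obj: holds the node's data, a string robj.
score: the node's score.
backward: a pointer to the previous node. A node has only this one backward pointer, so only the bottom level is a doubly linked list.
level[]: the forward pointers for each level. Each element contains forward, which points to the next node on that level, and span, the number of nodes this pointer skips over, used to compute ranks. (level is a flexible array member, so its memory is not counted in sizeof(zskiplistNode); the space for the levels has to be added explicitly when the node is allocated.)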
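The note about the flexible array member is easier to follow with an allocation example. This is a simplified sketch under stated assumptions: the node type stores a plain C string instead of an robj, node_create is a hypothetical helper, and malloc stands in for whatever allocator the real code uses.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified node for illustration only. */
typedef struct node {
    const char *obj;
    double score;
    struct node *backward;
    struct nodeLevel {
        struct node *forward;
        unsigned int span;
    } level[];                  /* flexible array member */
} node;

static node *node_create(int level, double score, const char *obj) {
    assert(level >= 1);
    /* One allocation covers the fixed fields plus `level` level slots. */
    node *n = malloc(sizeof(*n) + (size_t)level * sizeof(struct nodeLevel));
    if (n == NULL) return NULL;
    n->obj = obj;
    n->score = score;
    n->backward = NULL;
    for (int i = 0; i < level; i++) {
        n->level[i].forward = NULL;
        n->level[i].span = 0;
    }
    return n;
}
```

A node with three levels therefore occupies sizeof(node) plus three times sizeof(struct nodeLevel) bytes in a single block, and a single free releases all of it.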

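zskiplist describes the skiplist as a whole. It contains:

header and tail pointers
length: the length of the list
level: the highest level currently in use in the skip list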

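The figure above shows one possible layout of a skiplist in Redis. The numbers in parentheses are the span values in the level array, that is, the number of nodes each pointer skips over.

Suppose we look up the element with score=89 in this skiplist. Summing the span values of all level pointers followed along the search path gives the element's rank. Conversely, to look up by rank, keep following pointers and accumulating span as long as the total does not exceed the target rank; the node at which the total equals the rank is the answer.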
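Both directions of the span bookkeeping can be tried on a small hand-built list. The sketch below is illustrative only: the snode type has a fixed number of levels, the example list (scores 10 to 50) is made up, ties between equal scores are ignored, and none of the function names come from Redis.

```c
#include <assert.h>
#include <stddef.h>

#define LEVELS 3
typedef struct snode {
    double score;
    struct { struct snode *forward; unsigned int span; } level[LEVELS];
} snode;

/* Links a -> b on level i; span counts the bottom-level steps skipped. */
static void link_nodes(snode *a, snode *b, int i, unsigned int span) {
    assert(i >= 0 && i < LEVELS);
    a->level[i].forward = b;
    a->level[i].span = span;
}

/* Fixed example: scores 10,20,30,40,50 with heights 1,2,1,3,1.
 * n must point to 6 zero-initialized nodes; n[0] is the header. */
static snode *build_example(snode n[6]) {
    for (int k = 1; k <= 5; k++) {
        n[k].score = 10.0 * k;
        link_nodes(&n[k - 1], &n[k], 0, 1);  /* bottom level: every node */
    }
    link_nodes(&n[0], &n[2], 1, 2);          /* level 1: header -> 20 -> 40 */
    link_nodes(&n[2], &n[4], 1, 2);
    link_nodes(&n[0], &n[4], 2, 4);          /* level 2: header -> 40 */
    return &n[0];
}

/* 1-based rank of the node with this score, or 0 if it is absent. */
static unsigned long rank_of(snode *header, double score) {
    unsigned long rank = 0;
    snode *x = header;
    for (int i = LEVELS - 1; i >= 0; i--) {
        while (x->level[i].forward && x->level[i].forward->score <= score) {
            rank += x->level[i].span;        /* sum spans along the path */
            x = x->level[i].forward;
        }
    }
    return (x != header && x->score == score) ? rank : 0;
}

/* Node at the given 1-based rank, or NULL if the rank is out of range. */
static snode *node_at_rank(snode *header, unsigned long rank) {
    unsigned long traversed = 0;
    snode *x = header;
    for (int i = LEVELS - 1; i >= 0; i--) {
        while (x->level[i].forward && traversed + x->level[i].span <= rank) {
            traversed += x->level[i].span;   /* never overshoot the rank */
            x = x->level[i].forward;
        }
        if (traversed == rank && x != header) return x;
    }
    return NULL;
}
```

In this example, looking up score 40 follows a single level-2 pointer with span 4, so the rank is 4 after one step; looking up rank 3 skips the level-2 pointer (span 4 would overshoot) and descends through levels 1 and 0.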

3. REDIS_ENCODING_INTSET

Redis uses an intset to implement a set that contains only a small number of integers.

set-max-intset-entries 512

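An intset is in fact an ordered set of integers. To make lookups fast the array is kept sorted, and binary search decides whether an element belongs to the set. Its memory layout resembles that of a ziplist: one contiguous block of memory holds the array elements, and large and small integers use different encodings.

The structure is as follows: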

//intset.h
typedef struct intset {
    uint32_t encoding;
    uint32_t length;
    int8_t contents[];
} intset;
 
#define INTSET_ENC_INT16 (sizeof(int16_t))
#define INTSET_ENC_INT32 (sizeof(int32_t))
#define INTSET_ENC_INT64 (sizeof(int64_t))

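encoding: the data encoding, i.e. the number of bytes used to store each element of the intset. (INTSET_ENC_INT16 uses two bytes, that is, two slots of the contents array; INTSET_ENC_INT32 uses 4 bytes; INTSET_ENC_INT64 uses 8 bytes.)

length: the number of elements in the intset.

contents: a flexible array member that holds the actual data. Its length in bytes is encoding * length.

In addition, an intset may change its encoding as data is added. A newly created intset uses the INTSET_ENC_INT16 encoding.

As the figure above shows, the intset is stored in little-endian order.

The insertion logic: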
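How an encoding is picked for a value, and how large contents becomes, can be sketched as follows. This is an assumption-based illustration: value_encoding and contents_bytes are made-up names, and the ENC_* macros merely mirror the INTSET_ENC_* definitions shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ENC_INT16 (sizeof(int16_t))
#define ENC_INT32 (sizeof(int32_t))
#define ENC_INT64 (sizeof(int64_t))

/* Smallest encoding (bytes per element) that can represent v. */
static uint8_t value_encoding(int64_t v) {
    if (v < INT32_MIN || v > INT32_MAX) return ENC_INT64;
    if (v < INT16_MIN || v > INT16_MAX) return ENC_INT32;
    return ENC_INT16;
}

/* Bytes occupied by contents for a given encoding and element count. */
static size_t contents_bytes(uint8_t encoding, uint32_t length) {
    assert(encoding == ENC_INT16 || encoding == ENC_INT32 ||
           encoding == ENC_INT64);
    return (size_t)encoding * length;
}
```

A set holding 1, 2, and 70000 needs the 4-byte encoding for every element, because 70000 does not fit in an int16_t, so its contents array takes 12 bytes.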

intset *intsetAdd(intset *is, int64_t value, uint8_t *success) {
    uint8_t valenc = _intsetValueEncoding(value);
    uint32_t pos;
    if (success) *success = 1;
 
    /* Upgrade encoding if necessary. If we need to upgrade, we know that
     * this value should be either appended (if > 0) or prepended (if < 0),
     * because it lies outside the range of existing values. */
    if (valenc > intrev32ifbe(is->encoding)) {
        /* This always succeeds, so we don't need to curry *success. */
        return intsetUpgradeAndAdd(is,value);
    } else {
        /* Abort if the value is already present in the set.
         * This call will populate "pos" with the right position to insert
         * the value when it cannot be found. */
        if (intsetSearch(is,value,&pos)) {
            if (success) *success = 0;
            return is;
        }
 
        is = intsetResize(is,intrev32ifbe(is->length)+1);
        if (pos < intrev32ifbe(is->length)) intsetMoveTail(is,pos,pos+1);
    }
 
    _intsetSet(is,pos,value);
    is->length = intrev32ifbe(intrev32ifbe(is->length)+1);
    return is;
}

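intsetAdd adds a new element value to the intset. If value is already present, it is not added again, and success is set to 0.

If the encoding required by the new element is larger than the current encoding of the intset, intsetUpgradeAndAdd is called to upgrade the encoding of the intset and then insert the element.

Otherwise intsetSearch is called. If the value is found, nothing is added. If it is not found, intsetResize grows the intset (realloc), and intsetMoveTail shifts all elements after the insertion position back by one slot. The return value is a new intset pointer that replaces the original one. The overall time complexity is O(n).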
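The search, resize, and move-tail sequence can be reproduced on a plain sorted array. The sketch below is a simplified stand-in, not the Redis code: toyset and its functions are hypothetical names, every element is an int64_t (no encoding upgrade), and the set is passed by pointer instead of returning a new pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy sorted set with a single fixed element width. */
typedef struct toyset { uint32_t length; int64_t *contents; } toyset;

/* Binary search. Returns 1 if found; *pos is the match or insert position. */
static int toyset_search(const toyset *s, int64_t value, uint32_t *pos) {
    uint32_t lo = 0, hi = s->length;            /* search in [lo, hi) */
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        if (s->contents[mid] < value) lo = mid + 1;
        else hi = mid;
    }
    *pos = lo;
    return lo < s->length && s->contents[lo] == value;
}

/* Returns 1 if value was inserted, 0 if it was present or memory ran out. */
static int toyset_add(toyset *s, int64_t value) {
    assert(s != NULL);
    uint32_t pos;
    if (toyset_search(s, value, &pos)) return 0;    /* no duplicates */
    int64_t *p = realloc(s->contents,
                         ((size_t)s->length + 1) * sizeof(int64_t));
    if (p == NULL) return 0;
    s->contents = p;
    /* Shift the tail one slot to the right to open a gap at pos. */
    memmove(p + pos + 1, p + pos,
            (size_t)(s->length - pos) * sizeof(int64_t));
    p[pos] = value;
    s->length++;
    return 1;
}
```

The binary search costs O(log n), but the memmove of the tail is O(n) in the worst case, which is where the overall O(n) bound for insertion comes from.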