Redis Source Code Analysis - Basic Data Structures - sds (simple dynamic string)

Date: 2019-12-08


TL;DR

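Internally, Redis uses sds as its default string type; an sds is simply a C string with extra information (an sdshdr header) placed in front of it.

The len field in sdshdr records the string length, so the length can be read in constant time.

sds uses the len field in sdshdr, not a terminator byte, to decide where the string ends, which makes it binary safe.

Functions that modify an sds allocate memory on demand, which guards against buffer overflows.

When an sds grows, it preallocates extra space to reduce the number of memory allocations. Preallocation policy: if the required length (len + addlen) is below 1 MB, allocate twice that length; otherwise allocate that length plus 1 MB.

sds works with some of the C standard library string functions, because a '\0' is deliberately appended at the end.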

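This analysis is based on Redis 5.0.0; the source files involved are sds.c and sds.h.

Redis does not use raw C strings (character arrays) directly in its implementation. Instead it wraps them in an abstract type named sds (simple dynamic string), which serves as the default string type.

Definition of sds: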

// sds definition
typedef char *sds;

// sds creation function
sds sdsnewlen(const void *init, size_t initlen) {
    void *sh;
    sds s;
    char type = sdsReqType(initlen);
    // ...
    int hdrlen = sdsHdrSize(type);
    unsigned char *fp; /* flags pointer. */
    // Allocate header length + string length + 1 (for the trailing '\0')
    // sh points to the start of the whole allocation
    sh = s_malloc(hdrlen+initlen+1);
    // ...
    // s is really a char* pointing to the start of the raw string
    s = (char*)sh+hdrlen;
    fp = ((unsigned char*)s)-1;
    switch(type) {
        // ...
        case SDS_TYPE_8: {
            SDS_HDR_VAR(8,s);
            sh->len = initlen;
            sh->alloc = initlen;
            *fp = type;
            break;
        }
        // ...
    }
    if (initlen && init)
        memcpy(s, init, initlen);
    // The trailing '\0' is added so that some standard library functions can be used
    s[initlen] = '\0';
    // Note: the returned pointer points to the start of the raw string, not the header
    return s;
}

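As the definition shows, sds is just a typedef for char*. The creation function shows that an sds is simply a raw string with some header information placed in front of it. For example:

Layout of the sds string hello (using sdshdr8):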

Header (sdshdr)	String	\0
len = 5, alloc = 5, flags = 1	h e l l o	\0

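The creation function returns, and the rest of the code uses, a pointer to the raw string. That is why the code often uses s[-1] (syntactic sugar for *(s - 1)) to read the flags field that holds the header type.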

The header itself is obtained by moving the s pointer back to the start of the header, defined as follows:

#define SDS_HDR(T,s) ((struct sdshdr##T *)((s)-(sizeof(struct sdshdr##T))))
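
To make the pointer arithmetic concrete, here is a minimal sketch of the same "header in front of the buffer" idea with a single 8-bit header type. The names toy_hdr8, TOY_HDR, toy_new, toy_len and toy_free are hypothetical and are not part of Redis; the sketch assumes a GCC-compatible compiler for the packed attribute.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy header modelled on sdshdr8; not the Redis definition. */
struct __attribute__ ((__packed__)) toy_hdr8 {
    uint8_t len;          /* bytes in use */
    uint8_t alloc;        /* bytes allocated, excluding header and '\0' */
    unsigned char flags;  /* header type */
    char buf[];           /* string data starts here */
};

#define TOY_TYPE_8 1
/* Step back from the string pointer to the start of the header. */
#define TOY_HDR(s) ((struct toy_hdr8 *)((s) - sizeof(struct toy_hdr8)))

/* Allocate header + data + '\0' and return a pointer to the data. */
char *toy_new(const char *init, size_t initlen) {
    if (initlen > UINT8_MAX) return NULL;   /* the toy supports one header size only */
    struct toy_hdr8 *sh = malloc(sizeof(*sh) + initlen + 1);
    if (sh == NULL) return NULL;
    sh->len = (uint8_t)initlen;
    sh->alloc = (uint8_t)initlen;
    sh->flags = TOY_TYPE_8;
    if (initlen && init) memcpy(sh->buf, init, initlen);
    sh->buf[initlen] = '\0';
    return sh->buf;
}

/* Constant-time length: read it from the header instead of scanning. */
size_t toy_len(const char *s) { return TOY_HDR(s)->len; }

/* The allocation starts at the header, so that is what must be freed. */
void toy_free(char *s) { if (s) free(TOY_HDR(s)); }
```

With this sketch, toy_new("hello", 5) returns a pointer that works as an ordinary C string, while s[-1] reads the flags byte and toy_len(s) reads the stored length.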

Now that we have seen the definition of sds, let's look at what this header actually contains.

Definition of the sds header (sdshdr)

/* Note: sdshdr5 is never used, we just access the flags byte directly.
 * However is here to document the layout of type 5 SDS strings. */
struct __attribute__ ((__packed__)) sdshdr5 {
    unsigned char flags; /* 3 lsb of type, and 5 msb of string length */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr8 {
    uint8_t len; /* used */
    uint8_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr16 {
    uint16_t len; /* used */
    uint16_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr32 {
    uint32_t len; /* used */
    uint32_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
struct __attribute__ ((__packed__)) sdshdr64 {
    uint64_t len; /* used */
    uint64_t alloc; /* excluding the header and null terminator */
    unsigned char flags; /* 3 lsb of type, 5 unused bits */
    char buf[];
};
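
The flags comment on sdshdr5 ("3 lsb of type, and 5 msb of string length") can be illustrated with a small sketch of the bit packing. The TOY_ constants mirror the values of SDS_TYPE_5, SDS_TYPE_MASK and SDS_TYPE_BITS in sds.h, but the function names are hypothetical.

```c
#include <stddef.h>

#define TOY_TYPE_5    0   /* header type stored in the low 3 bits */
#define TOY_TYPE_MASK 7   /* 0b111 */
#define TOY_TYPE_BITS 3

/* Pack a length (0..31) and the type into one flags byte. */
unsigned char toy_type5_encode(size_t len) {
    return (unsigned char)(TOY_TYPE_5 | (len << TOY_TYPE_BITS));
}

/* The length lives in the upper 5 bits. */
size_t toy_type5_len(unsigned char flags) {
    return flags >> TOY_TYPE_BITS;
}

/* The type lives in the lower 3 bits. */
int toy_type(unsigned char flags) {
    return flags & TOY_TYPE_MASK;
}
```

Five bits hold values up to 31, which is why this header type only fits strings shorter than 32 bytes.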

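There are five sds header types, and one is chosen according to the string length. sdshdr5 has no room for an alloc field (the length is stored in the same byte as the type), so it is only used for short strings (under 32 bytes) that are not expected to change, such as keys shorter than 32 bytes. __attribute__ ((__packed__)) tells gcc not to insert alignment padding. For example: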

struct sdshdr_padded {char flag; int  len;};
// sizeof(struct sdshdr_padded) == 8 on typical platforms with a 4-byte int
struct __attribute__ ((__packed__)) sdshdr_packed {char flag; int  len;};
// sizeof(struct sdshdr_packed) == 5

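All of this is ultimately aimed at shrinking the header. To summarize, the sds header can be thought of as the following simplified definition (the earliest sds header was close to this, with len and free fields and no flags):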

struct sdshdr {
    unsigned int len; /* string length */
    unsigned int alloc; /* allocated buffer length, excluding header and null terminator */
    unsigned char flags; /* sds type */
    char buf[]; /* the actual data */
};

As stated at the start, the difference between an sds and a raw string is this header. So what does all this extra information buy us?

Problems solved by introducing sdshdr

1. Binary safety

Wikipedia: Binary-safe is a computer programming term mainly used in connection with string manipulating functions. A binary-safe function is essentially one that treats its input as a raw stream of data without any specific format. It should thus work with all 256 possible values that a character can take (assuming 8-bit characters).

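C implements strings as character arrays and uses '\0' to mark the end. Because of this "magic" value, raw C strings are not binary safe.

sds uses the len field in sdshdr to decide where the string ends, which makes it binary safe (binary data may contain \0, which would truncate a raw C string).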
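
The truncation problem is easy to reproduce. The sketch below is an illustration only: toy_view and its helpers are hypothetical names, not Redis code. It compares what the C library sees with what a length-carrying view sees when the data contains an embedded '\0'.

```c
#include <stddef.h>
#include <string.h>

/* Length as a C string sees it: scanning stops at the first '\0'. */
size_t toy_cstr_len(const char *p) { return strlen(p); }

/* Hypothetical length-carrying view: the length travels with the data. */
struct toy_view {
    size_t len;
    const char *data;
};

/* Compare all len bytes, including any embedded '\0'. Returns 1 if equal. */
int toy_view_equal(struct toy_view a, struct toy_view b) {
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

For the five bytes "ab\0cd", toy_cstr_len reports 2, and strcmp cannot tell "ab\0cd" from "ab\0xy", whereas toy_view_equal with len set to 5 compares every byte and reports them as different.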

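2. Buffer overflow

A buffer overflow occurs when data exceeds the bounds of the buffer a program set aside for it, leading to abnormal behavior.

With C strings, the memory behind a string pointer is managed entirely by the programmer, who must make sure enough has been allocated. If the allocation is too small, a buffer overflow occurs. The sds modification functions check the available memory before modifying the string and allocate more when needed, which rules out overflows through these functions. For example, the sds copy function: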

/* Destructively modify the sds string 's' to hold the specified binary
 * safe string pointed by 't' of length 'len' bytes. */
sds sdscpylen(sds s, const char *t, size_t len) {
    // If the memory already allocated to the target sds is smaller than the length to copy
    if (sdsalloc(s) < len) {
        // then make room by reallocating
        s = sdsMakeRoomFor(s,len-sdslen(s));
        if (s == NULL) return NULL;
    }
    memcpy(s, t, len);
    // The trailing '\0' allows some standard library functions to be reused
    s[len] = '\0';
    sdssetlen(s, len);
    return s;
}
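
The same "check capacity, grow, then copy" pattern can be sketched without the sds header machinery. This is a simplified illustration under assumed names (toy_buf and toy_buf_copy are hypothetical), using plain realloc rather than the Redis allocator wrappers.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer: length and capacity are stored next to the data. */
struct toy_buf {
    size_t len;    /* bytes in use */
    size_t alloc;  /* capacity, excluding the trailing '\0' */
    char *data;
};

/* Overwrite the buffer with n bytes from src, growing it first if needed.
 * Returns 0 on success and -1 if memory could not be allocated. */
int toy_buf_copy(struct toy_buf *b, const char *src, size_t n) {
    if (b->data == NULL || b->alloc < n) {
        char *p = realloc(b->data, n + 1);  /* +1 for the trailing '\0' */
        if (p == NULL) return -1;           /* old buffer is left untouched */
        b->data = p;
        b->alloc = n;
    }
    memcpy(b->data, src, n);
    b->data[n] = '\0';
    b->len = n;
    return 0;
}
```

Starting from an empty buffer such as struct toy_buf b = {0, 0, NULL}, any number of calls with inputs of varying length stay within the allocated memory, because the capacity check always runs before memcpy. The caller releases the memory with free(b.data).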

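3. Frequent memory allocation

If a C string always occupies memory exactly matching its length, then every change in length requires a reallocation. Reallocation involves complex algorithms and may require a system call, so it can be expensive. As an in-memory database, Redis inevitably faces frequently changing data, and raw C strings would trigger frequent reallocations, which is clearly unacceptable. So whenever an sds has to grow, it preallocates extra memory to reduce the number of reallocations caused by later modifications. The policy: if the required length (len + addlen) is below 1 MB (1024 * 1024), allocate twice that length; otherwise allocate that length plus 1 MB.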

/* Enlarge the free space at the end of the sds string so that the caller
 * is sure that after calling this function can overwrite up to addlen
 * bytes after the end of the string, plus one more byte for nul term.
 *
 * Note: this does not change the *length* of the sds string as returned
 * by sdslen(), but only the free buffer space we have. */
sds sdsMakeRoomFor(sds s, size_t addlen) {
    void *sh, *newsh;
    // free space currently available in the sds: allocated - used
    size_t avail = sdsavail(s);
    size_t len, newlen;
    char type, oldtype = s[-1] & SDS_TYPE_MASK;
    int hdrlen;

    /* Return ASAP if there is enough space left. */
    if (avail >= addlen) return s;

    len = sdslen(s);
    sh = (char*)s-sdsHdrSize(oldtype);
    newlen = (len+addlen);
    // #define SDS_MAX_PREALLOC (1024*1024)
    if (newlen < SDS_MAX_PREALLOC)
        // below 1 MB: allocate twice the required length
        newlen *= 2;
    else
        // 1 MB or more: allocate the required length plus 1 MB
        newlen += SDS_MAX_PREALLOC;

    type = sdsReqType(newlen);

    /* Don't use type 5: the user is appending to the string and type 5 is
     * not able to remember empty space, so sdsMakeRoomFor() must be called
     * at every appending operation. */
    if (type == SDS_TYPE_5) type = SDS_TYPE_8;

    hdrlen = sdsHdrSize(type);
    if (oldtype==type) {
        newsh = s_realloc(sh, hdrlen+newlen+1);
        if (newsh == NULL) return NULL;
        s = (char*)newsh+hdrlen;
    } else {
        /* Since the header size changes, need to move the string forward,
         * and can't use realloc */
        newsh = s_malloc(hdrlen+newlen+1);
        if (newsh == NULL) return NULL;
        memcpy((char*)newsh+hdrlen, s, len+1);
        s_free(sh);
        s = (char*)newsh+hdrlen;
        s[-1] = type;
        sdssetlen(s, len);
    }
    sdssetalloc(s, newlen);
    return s;
}
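
To see what the preallocation rule buys, the sizing logic can be pulled out and simulated on its own. The sketch below is a simplified model, not Redis code: toy_prealloc and toy_count_reallocs are hypothetical names, and the simulation ignores header type changes.

```c
#include <stddef.h>

#define TOY_MAX_PREALLOC (1024*1024)

/* Capacity requested when len bytes are in use and addlen more are needed,
 * following the rule used by sdsMakeRoomFor. */
size_t toy_prealloc(size_t len, size_t addlen) {
    size_t newlen = len + addlen;
    if (newlen < TOY_MAX_PREALLOC)
        newlen *= 2;                  /* below 1 MB: double the required length */
    else
        newlen += TOY_MAX_PREALLOC;   /* 1 MB or more: add a fixed 1 MB */
    return newlen;
}

/* Count how many reallocations happen when one byte is appended
 * `appends` times to an initially empty string. */
size_t toy_count_reallocs(size_t appends) {
    size_t len = 0, alloc = 0, count = 0;
    for (size_t i = 0; i < appends; i++) {
        if (alloc - len < 1) {        /* no free space left for one more byte */
            alloc = toy_prealloc(len, 1);
            count++;
        }
        len += 1;
    }
    return count;
}
```

In this model, 1000 single-byte appends trigger only 9 reallocations, compared with 1000 if the buffer always matched the string length exactly.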

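4. O(n) length lookup

Getting the length of a raw C string means scanning it until '\0' is found, an O(n) operation. The len field in the sds header records the string length, so it can be read in constant time. Every function that modifies the string keeps this field up to date through sdssetlen: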

static inline void sdssetlen(sds s, size_t newlen) {
    unsigned char flags = s[-1];
    switch(flags&SDS_TYPE_MASK) {
        case SDS_TYPE_5:
            {
                // sdshdr5 has no len field, so the length is stored together with the header type
                unsigned char *fp = ((unsigned char*)s)-1;
                *fp = SDS_TYPE_5 | (newlen << SDS_TYPE_BITS);
            }
            break;
        case SDS_TYPE_8:
            SDS_HDR(8,s)->len = newlen;
            break;
        case SDS_TYPE_16:
            SDS_HDR(16,s)->len = newlen;
            break;
        case SDS_TYPE_32:
            SDS_HDR(32,s)->len = newlen;
            break;
        case SDS_TYPE_64:
            SDS_HDR(64,s)->len = newlen;
            break;
    }
}

/* Append the specified binary-safe string pointed by 't' of 'len' bytes to the
 * end of the specified sds string 's'.
 *
 * After the call, the passed sds string is no longer valid and all the
 * references must be substituted with the new pointer returned by the call. */
sds sdscatlen(sds s, const void *t, size_t len) {
    size_t curlen = sdslen(s);

    s = sdsMakeRoomFor(s,len);
    if (s == NULL) return NULL;
    memcpy(s+curlen, t, len);
    // every function that changes the string updates the length after the change
    sdssetlen(s, curlen+len);
    s[curlen+len] = '\0';
    return s;
}