Learning Memcached (Part 4): The Hash Table

Date: 2019-12-08

Tags: memcached, learning, hashtable. Category: Memcached

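Data structure

Memcached implements a high-performance hash table that resolves collisions by separate chaining. When the number of items stored in the table exceeds 1.5 times the number of buckets, a maintenance thread is signaled to expand the table. To keep reads and writes fast during expansion, that thread migrates only one bucket at a time by default. A variable records the migration progress, and every read or write consults it to decide whether to operate on old_hashtable or primary_hashtable.

Before looking at the individual hash table operations, let us first look at the basic structure Memcached uses to store data.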

// The item structure
typedef struct _stritem {
    // Next item; used mainly by the LRU list and the free list
    struct _stritem *next;
    // Previous item; used mainly by the LRU list and the free list
    struct _stritem *prev;
    // Next item in the same hash table bucket chain
    struct _stritem *h_next;
    // Last access time. When a flush command runs, this time is compared
    // with the time of the flush to decide whether the item is invalidated.
    rel_time_t      time;       /* least recent access */
    // Expiration time of the cached entry. 0 means it never expires on its own.
    // If Memcached cannot allocate a new item, items with exptime 0 may
    // still be evicted by the LRU.
    rel_time_t      exptime;    /* expire time */
    // Size of the value data
    int             nbytes;     /* size of data */
    // Reference count. It tells whether another thread is currently working
    // on this item, and whether the item may be removed: that is only
    // allowed once decrementing refcount brings it to 0.
    unsigned short  refcount;
    uint8_t         nsuffix;    /* length of flags-and-length string */
    uint8_t         it_flags;   /* ITEM_* above */
    // ID of the slab class
    uint8_t         slabs_clsid;/* which slab class we're in */
    uint8_t         nkey;       /* key length, w/terminating null and padding */
    /* this odd type prevents type-punning issues when we do
     * the little shuffle to save space when not using CAS. */
    // Data storage area
    union {
        uint64_t cas;
        char end;
    } data[];
} item;

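Notes:

Every element stored in Memcached has an item structure.
The item structure mainly records the item's links within the hash table, the slab class that holds its data, and basic information such as the key, the stored data and their lengths.
Item blocks are allocated from a slabclass.
The main purpose of the hash table is to look up cached data quickly by key.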

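Hash table structure diagram

Notes:

At startup Memcached initializes a hash table whose default length is 65536.
Each element of the table is called a bucket, and each bucket is a singly linked list of item structures.
Memcached hashes the key into a uint32_t value held in a variable named hv.
A bitwise AND of hv with the bucket count minus one, hv & hashmask(hashpower), gives the bucket the key falls into.
The item is then linked into that bucket's list, which is chained through the h_next field of the item structure.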
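The bucket computation described above can be sketched in a few lines. The two macros are modeled on the ones in memcached's assoc.c (the uint32_t cast is an assumption of this sketch), and bucket_index is a hypothetical helper introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Modeled on memcached's macros; the exact integer type is assumed here. */
#define hashsize(n) ((uint32_t)1 << (n))   /* number of buckets: 2^n   */
#define hashmask(n) (hashsize(n) - 1)      /* low n bits set: 2^n - 1  */

/* Hypothetical helper: map a 32-bit hash value to a bucket index.
 * Because the bucket count is a power of two, the AND is equivalent
 * to hv % hashsize(hashpower) but avoids a division. */
static inline uint32_t bucket_index(uint32_t hv, unsigned int hashpower) {
    return hv & hashmask(hashpower);
}
```

With the default hashpower of 16, bucket_index keeps only the low 16 bits of hv, so the result always lies in the range 0 to 65535.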

Initializing the hash table

hashpower defaults to 16, and 1 << 16 yields 65536 buckets. A user-specified value must be between 12 and 64.

// Initialize the hash table
void assoc_init(const int hashtable_init) {
    // hashtable_init must be between 12 and 64.
    // If hashtable_init is not set, hashpower keeps its default value of 16.
    if (hashtable_init) {
        hashpower = hashtable_init;
    }
    // primary_hashtable holds the hash table itself.
    // hashsize() returns the number of buckets; with hashpower = 16 that is 65536.
    primary_hashtable = calloc(hashsize(hashpower), sizeof(void *));
    if (! primary_hashtable) {
        fprintf(stderr, "Failed to init hashtable.\n");
        exit(EXIT_FAILURE);
    }
    STATS_LOCK();
    stats.hash_power_level = hashpower;
    stats.hash_bytes = hashsize(hashpower) * sizeof(void *);
    STATS_UNLOCK();
}

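Lookup

A lookup first uses the hash value to locate the bucket, then walks the bucket's linked list until it finds the given key. The point to note is that the hash table may be in the middle of an expansion, so the lookup must first decide whether to search old_hashtable or primary_hashtable.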

// Find an item
item *assoc_find(const char *key, const size_t nkey, const uint32_t hv) {
    item *it;
    unsigned int oldbucket;

    // If an expansion is in progress and this key's old bucket has not
    // been migrated yet, search the old table.
    if (expanding &&
        (oldbucket = (hv & hashmask(hashpower - 1))) >= expand_bucket)
    {
        it = old_hashtable[oldbucket];
    } else {
        // Otherwise take the head of the bucket in the primary table
        it = primary_hashtable[hv & hashmask(hashpower)];
    }

    item *ret = NULL;
    int depth = 0; // how far down the chain we walked
    while (it) {
        // Walk the bucket's list looking for the item
        if ((nkey == it->nkey) && (memcmp(key, ITEM_key(it), nkey) == 0)) {
            ret = it;
            break;
        }
        it = it->h_next;
        ++depth;
    }
    MEMCACHED_ASSOC_FIND(key, nkey, depth);
    return ret;
}
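The table-selection test at the top of assoc_find can be isolated as a small predicate. This is only a sketch: expand_state and use_old_table are hypothetical names that bundle the globals expanding, hashpower and expand_bucket into one value so the rule can be checked on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define hashsize(n) ((uint32_t)1 << (n))
#define hashmask(n) (hashsize(n) - 1)

/* Hypothetical snapshot of the expansion state (globals in memcached). */
typedef struct {
    bool expanding;
    unsigned int hashpower;      /* power of the new, larger primary table */
    unsigned int expand_bucket;  /* old buckets below this index are done  */
} expand_state;

/* True when the key still lives in the old table: an expansion is running
 * and the key's bucket in the old table has not been migrated yet. */
static inline bool use_old_table(const expand_state *s, uint32_t hv) {
    uint32_t oldbucket = hv & hashmask(s->hashpower - 1);
    return s->expanding && oldbucket >= s->expand_bucket;
}
```

Note that the old bucket index is computed with hashpower - 1, because hashpower has already been incremented by the time the migration starts.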

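Insertion

Insertion locates the target bucket, makes the new node the head of that bucket's linked list, and updates the bucket slot to point at it. After inserting, the function checks whether the table needs to grow, which is the case once the item count exceeds (number of buckets (default: 65536) * 3) / 2; the expansion itself runs in a separate thread.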

// Insert an item
int assoc_insert(item *it, const uint32_t hv) {
    unsigned int oldbucket;

    assert(assoc_find(ITEM_key(it), it->nkey, hv) == 0);  /* shouldn't have duplicately named things defined */

    // If an expansion is in progress and the old bucket has not been
    // migrated yet, insert into the old table so the item stays reachable.
    if (expanding &&
        (oldbucket = (hv & hashmask(hashpower - 1))) >= expand_bucket)
    {
        it->h_next = old_hashtable[oldbucket];
        old_hashtable[oldbucket] = it;
    } else {
        // hv & hashmask(hashpower) selects the bucket.
        // Point the new item's h_next at the current head of the bucket...
        it->h_next = primary_hashtable[hv & hashmask(hashpower)];
        // ...then make the new item the head of the bucket.
        primary_hashtable[hv & hashmask(hashpower)] = it;
    }

    hash_items++; // one more item in the table
    // Once hash_items exceeds (number of buckets (default: 65536) * 3) / 2,
    // the table has to grow. The table is already large, so an expansion
    // can take a while and is handled in a separate thread.
    if (! expanding && hash_items > (hashsize(hashpower) * 3) / 2) {
        assoc_start_expand();
    }

    MEMCACHED_ASSOC_INSERT(ITEM_key(it), it->nkey, hash_items);
    return 1;
}
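Head insertion and the matching chain walk can be tried out on a toy table. The sketch below is not memcached code: node stands in for item, NBUCKETS is an arbitrary small power of two, and chain_insert and chain_find are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy node standing in for memcached's item; only the chain link matters. */
typedef struct node {
    const char *key;
    struct node *h_next;
} node;

#define NBUCKETS 8u  /* power of two, so hv & (NBUCKETS - 1) picks a bucket */

/* Push n onto the front of its bucket's chain: O(1), no traversal. */
static inline void chain_insert(node **table, node *n, unsigned int hv) {
    node **head = &table[hv & (NBUCKETS - 1)];
    n->h_next = *head;  /* new node points at the previous head */
    *head = n;          /* bucket slot now points at the new node */
}

/* Walk the chain comparing keys; returns NULL when the key is absent. */
static inline node *chain_find(node **table, const char *key, unsigned int hv) {
    node *n = table[hv & (NBUCKETS - 1)];
    while (n != NULL && strcmp(n->key, key) != 0) {
        n = n->h_next;
    }
    return n;
}
```

Inserting at the head keeps the write path constant-time regardless of chain length; the cost of long chains is paid only by lookups, which is what the 1.5x expansion threshold keeps in check.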

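Deletion

Deletion uses _hashitem_before to find the pointer that refers to the item being deleted (the previous item's h_next, or the bucket slot itself when the item is the head), then redirects that pointer to the item that follows the deleted one.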

// Finds the address of the pointer that refers to the item with the given key
static item** _hashitem_before (const char *key, const size_t nkey, const uint32_t hv) {
    item **pos;
    unsigned int oldbucket;

    // Check whether an expansion is in progress
    if (expanding &&
        (oldbucket = (hv & hashmask(hashpower - 1))) >= expand_bucket)
    {
        pos = &old_hashtable[oldbucket];
    } else {
        // Address of the bucket slot in the primary table
        pos = &primary_hashtable[hv & hashmask(hashpower)];
    }

    // Walk the bucket's list until the key matches or the list ends
    while (*pos && ((nkey != (*pos)->nkey) || memcmp(key, ITEM_key(*pos), nkey))) {
        pos = &(*pos)->h_next;
    }
    return pos;
}
// Delete an item from its bucket
void assoc_delete(const char *key, const size_t nkey, const uint32_t hv) {
    item **before = _hashitem_before(key, nkey, hv); // locate the item, if present

    // If the item exists, make the pointer that referred to it
    // refer to the next item instead.
    if (*before) {
        item *nxt;
        hash_items--; // one item fewer
        /* The DTrace probe cannot be triggered as the last instruction
         * due to possible tail-optimization by the compiler
         */
        MEMCACHED_ASSOC_DELETE(key, nkey, hash_items);
        nxt = (*before)->h_next;
        (*before)->h_next = 0;   /* probably pointless, but whatever. */
        *before = nxt;
        return;
    }
    /* Note:  we never actually get here.  the callers don't delete things
       they can't find. */
    assert(*before != 0);
}
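The pointer-to-pointer technique behind _hashitem_before is easier to see on a bare linked list. In this sketch node, link_before and chain_remove are hypothetical stand-ins; keys are plain C strings rather than memcached's length-prefixed keys.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct node {
    const char *key;
    struct node *h_next;
} node;

/* Return the address of the link that points at the matching node,
 * or the address of the terminating NULL link if there is no match. */
static inline node **link_before(node **head, const char *key) {
    node **pos = head;
    while (*pos != NULL && strcmp((*pos)->key, key) != 0) {
        pos = &(*pos)->h_next;
    }
    return pos;
}

/* Unlink the node with the given key; returns it, or NULL if absent. */
static inline node *chain_remove(node **head, const char *key) {
    node **before = link_before(head, key);
    node *victim = *before;
    if (victim != NULL) {
        *before = victim->h_next;  /* same code path for head and interior nodes */
        victim->h_next = NULL;
    }
    return victim;
}
```

Because the search returns the address of a link rather than a node, removing the head of the list needs no special case: the link being rewritten is simply the bucket slot itself.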

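_hashitem_before

This function returns the address of the pointer that refers to the given item, that is, the link held by its predecessor. It is called from the delete path.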

static item** _hashitem_before (const char *key, const size_t nkey, const uint32_t hv) {
    item **pos;
    unsigned int oldbucket;
    // As before, decide between old_hashtable and primary_hashtable
    if (expanding &&
        (oldbucket = (hv & hashmask(hashpower - 1))) >= expand_bucket)
    {
        pos = &old_hashtable[oldbucket];
    } else {
        pos = &primary_hashtable[hv & hashmask(hashpower)];
    }
    // Starting at the head, walk the singly linked list node by node
    while (*pos && ((nkey != (*pos)->nkey) || memcmp(key, ITEM_key(*pos), nkey))) {
        pos = &(*pos)->h_next;
    }
    return pos;
}

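assoc_expand

This function starts a hash table expansion. It makes the current primary_hashtable the old_hashtable, allocates a new primary_hashtable twice the size of old_hashtable, sets the boolean expanding (which marks an expansion in progress) to true, and resets expand_bucket (which tracks migration progress) to 0.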

/* grows the hashtable to the next power of 2. */
static void assoc_expand(void) {
    old_hashtable = primary_hashtable;

    primary_hashtable = calloc(hashsize(hashpower + 1), sizeof(void *));
    if (primary_hashtable) {
        if (settings.verbose > 1)
            fprintf(stderr, "Hash table expansion starting\n");
        hashpower++;
        expanding = true;
        expand_bucket = 0;
        STATS_LOCK();
        stats_state.hash_power_level = hashpower;
        stats_state.hash_bytes += hashsize(hashpower) * sizeof(void *);
        stats_state.hash_is_expanding = true;
        STATS_UNLOCK();
    } else {
        primary_hashtable = old_hashtable;
        /* Bad news, but we can keep running. */
    }
}

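assoc_start_expand

This function decides whether an expansion should start. The threshold is an item count greater than 1.5 times the number of hash buckets; once it is crossed, the expansion thread is signaled. Note that the assoc_insert listing above appears to come from an older release, in which the threshold check sits in assoc_insert and assoc_start_expand takes no argument; in the version shown here the check has moved into this function.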

void assoc_start_expand(uint64_t curr_items) {
    if (started_expanding)
        return;

    if (curr_items > (hashsize(hashpower) * 3) / 2 &&
          hashpower < HASHPOWER_MAX) {
        started_expanding = true;
        pthread_cond_signal(&maintenance_cond);
    }
}
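The threshold arithmetic can be checked in isolation. In this sketch should_expand is a hypothetical helper, and HASHPOWER_MAX_SKETCH is an assumed cap of 32 chosen for illustration; the real HASHPOWER_MAX depends on the memcached version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define hashsize(n) ((uint64_t)1 << (n))
#define HASHPOWER_MAX_SKETCH 32  /* assumed cap, for this sketch only */

/* True when the load factor has passed 1.5 and the table may still grow.
 * Integer math: (buckets * 3) / 2 is exact because buckets is even. */
static inline bool should_expand(uint64_t curr_items, unsigned int hashpower) {
    return curr_items > (hashsize(hashpower) * 3) / 2 &&
           hashpower < HASHPOWER_MAX_SKETCH;
}
```

With the default hashpower of 16 the threshold is 65536 * 3 / 2 = 98304, so the 98305th item is the first one that requests an expansion.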

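start_assoc_maintenance_thread

This function creates the hash expansion thread. The number of buckets migrated per pass can be set through the MEMCACHED_HASH_BULK_MOVE environment variable; if it is not set, only one bucket is migrated per pass.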

int start_assoc_maintenance_thread() {
    int ret;
    char *env = getenv("MEMCACHED_HASH_BULK_MOVE");
    if (env != NULL) {
        hash_bulk_move = atoi(env);
        if (hash_bulk_move == 0) {
            hash_bulk_move = DEFAULT_HASH_BULK_MOVE;
        }
    }
    pthread_mutex_init(&maintenance_lock, NULL);
    if ((ret = pthread_create(&maintenance_tid, NULL,
                              assoc_maintenance_thread, NULL)) != 0) {
        fprintf(stderr, "Can't create thread: %s\n", strerror(ret));
        return -1;
    }
    return 0;
}

assoc_maintenance_thread

This function performs the actual bucket migration. See the comments for details.

static void *assoc_maintenance_thread(void *arg) {

    mutex_lock(&maintenance_lock);
    while (do_run_maintenance_thread) {
        int ii = 0;

        /* There is only one expansion thread, so no need to global lock. */
        // Migrate up to hash_bulk_move buckets in this pass
        for (ii = 0; ii < hash_bulk_move && expanding; ++ii) {
            item *it, *next;
            unsigned int bucket;
            void *item_lock = NULL;

            /* bucket = hv & hashmask(hashpower) =>the bucket of hash table
             * is the lowest N bits of the hv, and the bucket of item_locks is
             *  also the lowest M bits of hv, and N is greater than M.
             *  So we can process expanding with only one item_lock. cool!
             */
            /* expand_bucket must be protected by a lock. All items in one
             * bucket share the same low N bits of hv, and the item_lock slot
             * is chosen by the low M bits of hv. The item_lock array is
             * smaller than the bucket array, so M < N, which means every item
             * in a bucket maps to the same item_lock. No further item_lock
             * has to be taken while walking the bucket. A very neat design. */
            if ((item_lock = item_trylock(expand_bucket))) {
                // Move every item in the bucket to its bucket in primary_hashtable
                for (it = old_hashtable[expand_bucket]; NULL != it; it = next) {
                    next = it->h_next;
                    bucket = hash(ITEM_key(it), it->nkey) & hashmask(hashpower);
                    it->h_next = primary_hashtable[bucket];
                    primary_hashtable[bucket] = it;
                }
                // Clear the bucket in old_hashtable
                old_hashtable[expand_bucket] = NULL;
                // Record the migration progress
                expand_bucket++;
                /* Once every bucket has been migrated, set expanding to
                 * false and free old_hashtable. */
                if (expand_bucket == hashsize(hashpower - 1)) {
                    expanding = false;
                    free(old_hashtable);
                    STATS_LOCK();
                    stats_state.hash_bytes -= hashsize(hashpower - 1) * sizeof(void *);
                    stats_state.hash_is_expanding = false;
                    STATS_UNLOCK();
                    if (settings.verbose > 1)
                        fprintf(stderr, "Hash table expansion done\n");
                }

            } else {
                usleep(10*1000);
            }
            // Release the lock if we took it
            if (item_lock) {
                item_trylock_unlock(item_lock);
                item_lock = NULL;
            }
        }
        // If no expansion is in progress, wait on the condition variable
        // until an expansion is requested
        if (!expanding) {
            /* We are done expanding.. just wait for next invocation */
            started_expanding = false;
            pthread_cond_wait(&maintenance_cond, &maintenance_lock);
            /* assoc_expand() swaps out the hash table entirely, so we need
             * all threads to not hold any references related to the hash
             * table while this happens.
             * This is instead of a more complex, possibly slower algorithm to
             * allow dynamic hash table expansion without causing significant
             * wait times.
             */
            pause_threads(PAUSE_ALL_THREADS);
            assoc_expand();
            pause_threads(RESUME_ALL_THREADS);
        }
    }
    return NULL;
}
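One consequence of doubling a power-of-two table is worth spelling out: every item in old bucket b lands either in bucket b or in bucket b + 2^old_power of the new table, depending on a single extra bit of hv. The sketch below checks that property; bucket_of and lands_in_split_pair are hypothetical helpers, not memcached functions.

```c
#include <assert.h>
#include <stdint.h>

#define hashsize(n) ((uint32_t)1 << (n))
#define hashmask(n) (hashsize(n) - 1)

/* Bucket index of hv in a table of 2^power buckets. */
static inline uint32_t bucket_of(uint32_t hv, unsigned int power) {
    return hv & hashmask(power);
}

/* Returns 1 if, after doubling, hv maps to one of the two candidate
 * buckets: the old index b, or b + 2^old_power. */
static inline int lands_in_split_pair(uint32_t hv, unsigned int old_power) {
    uint32_t b = bucket_of(hv, old_power);
    uint32_t nb = bucket_of(hv, old_power + 1);
    return nb == b || nb == b + hashsize(old_power);
}
```

This is why migrating one old bucket touches at most two buckets of the new table, and why a bucket that has not been migrated yet can still be served from the old table without looking anywhere else.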

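Thread safety

memcached makes the hash table thread-safe with striped (segmented) locks. Striping avoids having every item in the table share a single lock, which would hurt read and write performance. The code below is part of the logic memcached uses to initialize the lock array.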

    if (nthreads < 3) {
        power = 10;
    } else if (nthreads < 4) {
        power = 11;
    } else if (nthreads < 5) {
        power = 12;
    } else if (nthreads <= 10) {
        power = 13;
    } else if (nthreads <= 20) {
        power = 14;
    } else {
        /* 32k buckets. just under the hashpower default. */
        power = 15;
    }
    /* Ensure there are fewer striped locks than hash table buckets. One
     * benefit is that, during expansion, all items in a bucket map to the
     * same item_lock. */
    if (power >= hashpower) {
        fprintf(stderr, "Hash table power size (%d) cannot be equal to or less than item lock table (%d)\n", hashpower, power);
        fprintf(stderr, "Item lock table grows with `-t N` (worker threadcount)\n");
        fprintf(stderr, "Hash table grows with `-o hashpower=N` \n");
        exit(1);
    }

    item_lock_count = hashsize(power);
    item_lock_hashpower = power;
    // Allocate the array of striped locks
    item_locks = calloc(item_lock_count, sizeof(pthread_mutex_t));
    if (! item_locks) {
        perror("Can't allocate item locks");
        exit(1);
    }

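When multiple threads read or write the hash table, each one first computes hv with the hash function, then takes the item_lock selected by hv, and only then performs the read or write. This also helps explain why memcached migrates only one bucket per pass by default: the migration has to hold the item_lock, so moving many buckets in one pass would hurt read and write performance.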
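The relationship between bucket index and lock index can be sketched as follows. The helper names bucket_index and lock_index are hypothetical; the sketch assumes, as the source comments describe, that both indices are taken from the low bits of hv and that the lock table power M is smaller than the hash table power N.

```c
#include <assert.h>
#include <stdint.h>

#define hashsize(n) ((uint32_t)1 << (n))
#define hashmask(n) (hashsize(n) - 1)

/* Low N bits of hv select the bucket. */
static inline uint32_t bucket_index(uint32_t hv, unsigned int hashpower) {
    return hv & hashmask(hashpower);
}

/* Low M bits of hv select the lock stripe, with M < N. */
static inline uint32_t lock_index(uint32_t hv, unsigned int lock_power) {
    return hv & hashmask(lock_power);
}
```

Two hash values that share a bucket agree on their low N bits, so they also agree on their low M bits and therefore on the lock stripe. The reverse does not hold: one lock typically guards 2^(N-M) buckets.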

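Summary

memcached's hash table is a typical separate-chaining hash table, and the implementation is short and easy to read. A dedicated thread expands the table so that a growing item count does not cause a surge in collisions and degrade read and write performance. In addition, striped locks keep concurrent reads and writes safe while performing better than a single global lock. Making hashsize a power of two is a neat design choice: first, it turns the modulo operation that finds the bucket index into a bitwise AND with (hashsize - 1); second, combined with keeping the number of striped locks smaller than hashsize, it guarantees that all items in a bucket map to the same lock, so migrating an entire bucket requires acquiring only one lock.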
