Memory Optimization of dict in Python 3

Date: 2019-11-12


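As is well known, Python 3.6 substantially reworked the implementation of dict, especially with respect to memory usage, so I thought it worthwhile to study the source code of the new dict.

I spent a little over a week, on and off, reading mainly the dict code and the code that creates instance objects, and I am recording what I learned here.

It is worth noting that the new dict uses the same algorithms as before, for example the hash computation and the collision resolution strategy (open addressing). That part is therefore not my focus; what I care about is how the new dict reduces memory usage.

By the way, the analysis in this article is based on Python 3.6.1.

Without further ado, here is the definition of the PyDictObject structure: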

typedef struct _dictkeysobject PyDictKeysObject;

/* The ma_values pointer is NULL for a combined table
 * or points to an array of PyObject* for a split table
 */
typedef struct {
    PyObject_HEAD

    /* Number of items in the dictionary */
    Py_ssize_t ma_used;

    /* Dictionary version: globally unique, value change each time
       the dictionary is modified */
    uint64_t ma_version_tag;

    PyDictKeysObject *ma_keys;

    /* If ma_values is NULL, the table is "combined": keys and values
       are stored in ma_keys.

       If ma_values is not NULL, the table is splitted:
       keys are stored in ma_keys and values are stored in ma_values */
    PyObject **ma_values;
} PyDictObject;

Next comes the PyDictKeysObject object, which is defined as follows:

/* See dictobject.c for actual layout of DictKeysObject */
struct _dictkeysobject {
    Py_ssize_t dk_refcnt;

    /* Size of the hash table (dk_indices). It must be a power of 2. */
    Py_ssize_t dk_size;

    /* Function to lookup in the hash table (dk_indices):

       - lookdict(): general-purpose, and may return DKIX_ERROR if (and
         only if) a comparison raises an exception.

       - lookdict_unicode(): specialized to Unicode string keys, comparison of
         which can never raise an exception; that function can never return
         DKIX_ERROR.

       - lookdict_unicode_nodummy(): similar to lookdict_unicode() but further
         specialized for Unicode string keys that cannot be the <dummy> value.

       - lookdict_split(): Version of lookdict() for split tables. */
    dict_lookup_func dk_lookup;

    /* Number of usable entries in dk_entries. */
    Py_ssize_t dk_usable;

    /* Number of used entries in dk_entries. */
    Py_ssize_t dk_nentries;

    /* Actual hash table of dk_size entries. It holds indices in dk_entries,
       or DKIX_EMPTY(-1) or DKIX_DUMMY(-2).

       Indices must be: 0 <= indice < USABLE_FRACTION(dk_size).

       The size in bytes of an indice depends on dk_size:

       - 1 byte if dk_size <= 0xff (char*)
       - 2 bytes if dk_size <= 0xffff (int16_t*)
       - 4 bytes if dk_size <= 0xffffffff (int32_t*)
       - 8 bytes otherwise (int64_t*)

       Dynamically sized, 8 is minimum. */
    union {
        int8_t as_1[8];
        int16_t as_2[4];
        int32_t as_4[2];
#if SIZEOF_VOID_P > 4
        int64_t as_8[1];
#endif
    } dk_indices;

    /* "PyDictKeyEntry dk_entries[dk_usable];" array follows:
       see the DK_ENTRIES() macro */
};

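The memory layout of the new dict differs considerably from the old one. One key change is that the hash index table is now stored separately from the key/value entries. For the design rationale, see: More compact dictionaries with faster iteration

Another point worth noting is that the new dict comes in two forms, combined and split. The latter is mainly used to optimize the dictionary that stores an object's attributes (the instance __dict__), which is discussed later.

In the old hash table, every slot stored a PyDictKeyEntry object (a triple of hash, key, and value), which consumed more memory than necessary. A slot in the EMPTY state might actually be stored as (0, NULL, NULL), yet all of that data is redundant.

The new hash table optimizes this. A slot (that is, an element of dk_indices) no longer stores a PyDictKeyEntry but an index into an array, and that array holds only the PyDictKeyEntry objects that are actually needed. Slots in the EMPTY or DUMMY state only need a negative number, which distinguishes them from valid indices (which are greater than or equal to 0).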
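The layout described above can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not CPython's implementation: the class name CompactDict is hypothetical, the table has a fixed size with no resizing or deletion, the probing recurrence follows the one described in the comments of dictobject.c (the exact ordering of the shift may differ between versions), and the two-thirds capacity limit is assumed from CPython 3.6.

```python
EMPTY = -1          # plays the role of DKIX_EMPTY
DUMMY = -2          # plays the role of DKIX_DUMMY (unused here: no deletion)
PERTURB_SHIFT = 5   # assumption: same constant as CPython


class CompactDict:
    """Toy model of the compact layout: a sparse index table plus dense entries."""

    def __init__(self, size=8):
        assert size & (size - 1) == 0, "size must be a power of 2"
        self.indices = [EMPTY] * size      # models dk_indices
        self.entries = []                  # models dk_entries: (hash, key, value)
        self.usable = (size << 1) // 3     # assumed load limit of two thirds

    def _probe(self, h):
        """Yield candidate slots for hash h (open addressing)."""
        mask = len(self.indices) - 1
        perturb = h & 0xFFFFFFFFFFFFFFFF   # treat the hash as unsigned
        i = perturb & mask
        while True:
            yield i
            perturb >>= PERTURB_SHIFT
            i = (i * 5 + perturb + 1) & mask

    def __setitem__(self, key, value):
        h = hash(key)
        for slot in self._probe(h):
            ix = self.indices[slot]
            if ix == EMPTY:
                if len(self.entries) >= self.usable:
                    raise RuntimeError("toy model does not resize")
                # The slot stores only a small index; the entry goes at the end.
                self.indices[slot] = len(self.entries)
                self.entries.append((h, key, value))
                return
            if ix >= 0:
                entry_hash, entry_key, _ = self.entries[ix]
                if entry_hash == h and entry_key == key:
                    self.entries[ix] = (h, key, value)   # overwrite in place
                    return

    def __getitem__(self, key):
        h = hash(key)
        for slot in self._probe(h):
            ix = self.indices[slot]
            if ix == EMPTY:
                raise KeyError(key)
            if ix >= 0:
                entry_hash, entry_key, entry_value = self.entries[ix]
                if entry_hash == h and entry_key == key:
                    return entry_value


d = CompactDict()
d["a"] = 1
d["b"] = 2
d["a"] = 10
print(d["a"], d["b"])
print(d.indices)   # mostly EMPTY markers, with two small indices
print(d.entries)   # dense, in insertion order
```

Because the entries array is dense and append-only in this sketch, iterating over it yields keys in insertion order, while each unused slot costs only one small integer instead of a full (hash, key, value) triple.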

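The optimization goes further: the integer type used for an index is chosen dynamically according to how many PyDictKeyEntry objects need to be indexed. For example, if no index ever needs to exceed 127, a one-byte signed integer (int8_t) is enough. Note that an index value can be negative (EMPTY, DUMMY, ERROR), so a signed integer is required. For details, see the function new_keys_object, which is called when a dict is created: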

 1 PyObject *
 2 PyDict_New(void)
 3 {
 4     PyDictKeysObject *keys = new_keys_object(PyDict_MINSIZE);
 5     if (keys == NULL)
 6         return NULL;
 7     return new_dict(keys, NULL);
 8 }
 9 
10 static PyDictKeysObject *new_keys_object(Py_ssize_t size)
11 {
12     PyDictKeysObject *dk;
13     Py_ssize_t es, usable;
14 
15     assert(size >= PyDict_MINSIZE);
16     assert(IS_POWER_OF_2(size));
17 
18     usable = USABLE_FRACTION(size);
19     if (size <= 0xff) {
20         es = 1;
21     }
22     else if (size <= 0xffff) {
23         es = 2;
24     }
25 #if SIZEOF_VOID_P > 4
26     else if (size <= 0xffffffff) {
27         es = 4;
28     }
29 #endif
30     else {
31         es = sizeof(Py_ssize_t);
32     }
33 
34     if (size == PyDict_MINSIZE && numfreekeys > 0) {
35         dk = keys_free_list[--numfreekeys];
36     }
37     else {
38         dk = PyObject_MALLOC(sizeof(PyDictKeysObject)
39                              - Py_MEMBER_SIZE(PyDictKeysObject, dk_indices)
40                              + es * size
41                              + sizeof(PyDictKeyEntry) * usable);
42         if (dk == NULL) {
43             PyErr_NoMemory();
44             return NULL;
45         }
46     }
47     DK_DEBUG_INCREF dk->dk_refcnt = 1;
48     dk->dk_size = size;
49     dk->dk_usable = usable;
50     dk->dk_lookup = lookdict_unicode_nodummy;
51     dk->dk_nentries = 0;
52     memset(&dk->dk_indices.as_1[0], 0xff, es * size);
53     memset(DK_ENTRIES(dk), 0, sizeof(PyDictKeyEntry) * usable);
54     return dk;
55 }

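A few points need explanation:

(1) Because of the load factor, the size of a hash table determines the maximum number of live entries it can hold (line 18 of the code above), so the length of the array of PyDictKeyEntry objects is fixed from the start. The dk_usable field of PyDictKeysObject records how many more entries the hash table can still store; once it is less than or equal to 0, inserting another element requires a rehash.

(2) The size passed in must be a power of 2, so if size <= 0xff (255) holds, then size <= 128, and one byte is enough to represent an index.

(3) Caching strategies appear everywhere in the CPython code, and keys_free_list is one of them; its purpose is to reduce the number of actual malloc calls.

(4) When allocating memory, the computation of the space a PyDictKeysObject actually needs must subtract the default size of the dk_indices member, which is 8 bytes. The memory for that part is instead determined dynamically from size.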
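The arithmetic behind points (1), (2), and (4) can be reproduced with a short Python sketch. The helper names below are hypothetical, and several things are assumptions rather than facts taken from the listing: a 64-bit build where every header field and every entry field is pointer-sized with no padding, and USABLE_FRACTION being two thirds of the table size as in CPython 3.6.

```python
PYDICT_MINSIZE = 8  # smallest table size


def usable_fraction(size):
    # Assumption: mirrors USABLE_FRACTION in CPython 3.6 (two thirds of size).
    return (size << 1) // 3


def index_width(size, pointer_size=8):
    """Bytes per index, following the size checks in new_keys_object."""
    if size <= 0xFF:
        return 1
    if size <= 0xFFFF:
        return 2
    if pointer_size > 4 and size <= 0xFFFFFFFF:
        return 4
    return pointer_size  # stands in for sizeof(Py_ssize_t)


def keys_object_bytes(size, pointer_size=8):
    """Estimated allocation size of one keys object for a table of `size` slots."""
    # Header without dk_indices: dk_refcnt, dk_size, dk_lookup, dk_usable, dk_nentries.
    header = 5 * pointer_size
    # One entry is a (hash, key, value) triple.
    entry = 3 * pointer_size
    return (header
            + index_width(size, pointer_size) * size
            + entry * usable_fraction(size))


print(index_width(128), index_width(256))   # 1 2
print(usable_fraction(PYDICT_MINSIZE))      # 5
print(keys_object_bytes(PYDICT_MINSIZE))    # 168 under the assumptions above
```

For the minimum table size the index table accounts for only 8 of the estimated 168 bytes; the entries array dominates, and it is sized for the usable count rather than for every slot.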

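Now for the split form of dict mentioned earlier. The keys of such a dictionary are shared, and a reference counter, dk_refcnt, tracks how many references currently exist. The split form was designed because it was observed that the Python virtual machine holds a large number of dictionaries with identical keys but different values. The specific case is the dictionary that stores attributes on instance objects, the instance __dict__!

The split form of dict therefore exists mainly to optimize attribute storage on instance objects. The design is described here: PEP 412 -- Key-Sharing Dictionary

As we all know, Python uses a dict to store an object's attributes. Consider the following scenario:

(1) A class creates many objects.

(2) The attributes of these objects can be determined at the start, and none are added or deleted afterwards.

If both conditions hold, we can store the attributes in a more efficient and more memory-friendly way: all objects of a class share a single copy of the attribute dictionary's keys, while the values are stored as an array on each object. The benefit is obvious: previously each object had to keep its own copy of the attribute keys, whereas now one copy serves all objects, and the attribute values are also laid out more compactly in memory. The design of the new dict makes this key-sharing strategy simpler to implement!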
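The key-sharing idea can be illustrated with a small Python model. The classes SharedKeys and SplitDict are hypothetical stand-ins for a shared PyDictKeysObject and a split-table dict; the sketch assumes the key set is fixed up front and ignores hashing, resizing, and the conversion back to a combined table.

```python
_NULL = object()  # stands in for a NULL pointer in the values array


class SharedKeys:
    """Hypothetical stand-in for a shared keys object: maps key -> slot index."""

    def __init__(self, names):
        self.slot_of = {name: i for i, name in enumerate(names)}
        self.refcount = 0  # models dk_refcnt


class SplitDict:
    """Models a split table: keys are shared, values are private."""

    def __init__(self, keys):
        self.keys = keys
        keys.refcount += 1
        self.values = [_NULL] * len(keys.slot_of)  # models ma_values

    def __setitem__(self, name, value):
        # An unknown name raises KeyError here; a real dict would have to
        # leave the shared layout instead.
        self.values[self.keys.slot_of[name]] = value

    def __getitem__(self, name):
        value = self.values[self.keys.slot_of[name]]
        if value is _NULL:
            raise KeyError(name)
        return value


keys = SharedKeys(["x", "y"])
p = SplitDict(keys)
q = SplitDict(keys)
p["x"], p["y"] = 1, 2
q["x"], q["y"] = 3, 4

print(p.keys is q.keys, keys.refcount)  # one keys object, referenced twice
print(p.values, q.values)               # each instance stores only its values
```

In this model each additional instance costs one values list, while the mapping from attribute name to slot is stored once per class.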

Here is the actual code:

 1 int
 2 _PyObjectDict_SetItem(PyTypeObject *tp, PyObject **dictptr,
 3                       PyObject *key, PyObject *value)
 4 {
 5     PyObject *dict;
 6     int res;
 7     PyDictKeysObject *cached;
 8 
 9     assert(dictptr != NULL);
10     if ((tp->tp_flags & Py_TPFLAGS_HEAPTYPE) && (cached = CACHED_KEYS(tp))) {
11         assert(dictptr != NULL);
12         dict = *dictptr;
13         if (dict == NULL) {
14             DK_INCREF(cached);
15             dict = new_dict_with_shared_keys(cached); // importance!!!
16             if (dict == NULL)
17                 return -1;
18             *dictptr = dict;
19         }
20         if (value == NULL) {
21             res = PyDict_DelItem(dict, key);
22             // Since key sharing dict doesn't allow deletion, PyDict_DelItem()
23             // always converts dict to combined form.
24             if ((cached = CACHED_KEYS(tp)) != NULL) {
25                 CACHED_KEYS(tp) = NULL;
26                 DK_DECREF(cached);
27             }
28         }
29         else {
30             int was_shared = (cached == ((PyDictObject *)dict)->ma_keys);
31             res = PyDict_SetItem(dict, key, value);
32             if (was_shared &&
33                     (cached = CACHED_KEYS(tp)) != NULL &&
34                     cached != ((PyDictObject *)dict)->ma_keys) {
35                 /* PyDict_SetItem() may call dictresize and convert split table
36                  * into combined table.  In such case, convert it to split
37                  * table again and update type's shared key only when this is
38                  * the only dict sharing key with the type.
39                  *
40                  * This is to allow using shared key in class like this:
41                  *
42                  *     class C:
43                  *         def __init__(self):
44                  *             # one dict resize happens
45                  *             self.a, self.b, self.c = 1, 2, 3
46                  *             self.d, self.e, self.f = 4, 5, 6
47                  *     a = C()
48                  */
49                 if (cached->dk_refcnt == 1) {
50                     CACHED_KEYS(tp) = make_keys_shared(dict);
51                 }
52                 else {
53                     CACHED_KEYS(tp) = NULL;
54                 }
55                 DK_DECREF(cached);
56                 if (CACHED_KEYS(tp) == NULL && PyErr_Occurred())
57                     return -1;
58             }
59         }
60     } else {
61         dict = *dictptr;
62         if (dict == NULL) {
63             dict = PyDict_New();
64             if (dict == NULL)
65                 return -1;
66             *dictptr = dict;
67         }
68         if (value == NULL) {
69             res = PyDict_DelItem(dict, key);
70         } else {
71             res = PyDict_SetItem(dict, key, value);
72         }
73     }
74     return res;
75 }

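When we initialize an object's attribute in the class's __init__ method via self.a = v, the call eventually reaches the function _PyObjectDict_SetItem. This function initializes the object's attribute dictionary (its __dict__). Line 15 of the code above shows that, under certain conditions, the attribute dictionary is initialized as a split dict with shared keys, which confirms the earlier analysis.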
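One rough way to look at this from the Python side is to compare the reported size of an instance's attribute dictionary with an ordinary dict holding the same items. The Point class below is a hypothetical example, and whether key sharing is visible through sys.getsizeof depends on the interpreter version, so the numbers are printed rather than stated.

```python
import sys


class Point:
    def __init__(self, x, y):
        # In CPython 3.6 each of these assignments ends up storing into the
        # instance's attribute dictionary.
        self.x = x
        self.y = y


a = Point(1, 2)
b = Point(3, 4)

# Both instances expose the same attribute names in the same order.
print(list(a.__dict__), list(b.__dict__))

# dict(...) builds an ordinary dict with the same items for comparison.
# The two sizes are interpreter-specific, so no expected values are given.
instance_size = sys.getsizeof(a.__dict__)
plain_size = sys.getsizeof(dict(a.__dict__))
print(instance_size, plain_size)
```

If the instance dictionary uses shared keys, its reported size would be expected to be smaller than that of the ordinary copy, but this is an observation to check on a given interpreter rather than a guarantee.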