Notes: How NSSet and NSDictionary Are Implemented Under the Hood

Date: 2019-11-07


Prerequisites

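The Foundation framework provides many high-level data structures, and many of them have a counterpart in Core Foundation: NSSet corresponds to CFSet (struct __CFSet), and NSDictionary corresponds to CFDictionary (struct __CFDictionary). The code excerpts below are taken from the Core Foundation source.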

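Before looking at how NSSet and NSDictionary are implemented, if you are not familiar with the hash table data structure, it is worth reading up on it first:
Notes: Data Structures, Hash (a rough implementation in OC)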

hash

The hash discussed here is not the hash table mentioned above but a method. Why is a hash method needed at all?

The answer starts from the hash table data structure. First, consider how a member is looked up in an array:

Iterate over the members of the array
Compare each value with the target; if they are equal, return that member
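The two steps above can be sketched in C as follows. This is only an illustration: linear_find is a hypothetical helper and plain int values stand in for array members.

```c
#include <stddef.h>

/* Linear search: returns the index of the first element equal to target,
   or -1 if no element matches. Every element may have to be visited,
   so the cost is O(n) for an unsorted array. */
static long linear_find(const int *items, size_t count, int target) {
    for (size_t i = 0; i < count; i++) {
        if (items[i] == target) {
            return (long)i;
        }
    }
    return -1;
}
```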

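When the array is unsorted, a lookup takes O(n) time, where n is the length of the array. Hash tables exist to speed this up: when a member is added to a hash table, a hash value is computed for it, and that hash value modulo the array length gives the member's position in the array.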

With that position, lookup time drops to O(1), provided no collision occurs. The hash value is computed by the hash method, and the values it returns should ideally be unique.
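As a minimal sketch of the "hash modulo array length" step, assuming a hypothetical helper named bucket_index:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: map a hash value onto one of bucket_count slots.
   bucket_count must be non-zero. */
static size_t bucket_index(uintptr_t hash, size_t bucket_count) {
    return (size_t)(hash % bucket_count);
}
```

With 7 slots, the hash values 10 and 17 both map to slot 3. That is exactly the kind of collision the O(1) claim has to exclude.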

Compared with an array, looking up a member in a hash table indexed by hash value works as follows:

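Use the hash value to go straight to the position of the target value
If several members share that hash value at the position, continue the search using the table's collision-resolution strategy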

The advantage is clear, and even in the worst case the cost is about the same as searching an array.

When is the hash method called?

Let's look at a few examples first:

Person *person1 = [Person personWithName:kName1 birthday:self.date1];
Person *person2 = [Person personWithName:kName2 birthday:self.date2];

NSMutableArray *array1 = [NSMutableArray array];
[array1 addObject:person1];
NSMutableArray *array2 = [NSMutableArray array];
[array2 addObject:person2];
NSLog(@"array end -------------------------------");

NSMutableSet *set1 = [NSMutableSet set];
[set1 addObject:person1];
NSMutableSet *set2 = [NSMutableSet set];
[set2 addObject:person2];
NSLog(@"set end -------------------------------");

NSMutableDictionary *dictionaryValue1 = [NSMutableDictionary dictionary];
[dictionaryValue1 setObject:person1 forKey:kKey1];
NSMutableDictionary *dictionaryValue2 = [NSMutableDictionary dictionary];
[dictionaryValue2 setObject:person2 forKey:kKey2];
NSLog(@"dictionary value end -------------------------------");

NSMutableDictionary *dictionaryKey1 = [NSMutableDictionary dictionary];
[dictionaryKey1 setObject:kValue1 forKey:person1];
NSMutableDictionary *dictionaryKey2 = [NSMutableDictionary dictionary];
[dictionaryKey2 setObject:kValue2 forKey:person2];
NSLog(@"dictionary key end -------------------------------");

Override the hash method to make it easy to see whether it gets called:

- (NSUInteger)hash {
    NSUInteger hash = [super hash];
    NSLog(@"hash called");
    return hash;
}

Output:

array end -------------------------------
hash called
hash called
hash called
hash called
set end -------------------------------
dictionary value end -------------------------------
hash called
hash called
hash called
hash called
dictionary key end -------------------------------

From this we can see that the hash method is only called when an object is added to an NSSet or used as a key in an NSDictionary.

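When a new member is added to an NSSet, the hash value is used to locate members quickly and check whether the set already contains that member.
When an NSDictionary looks up a key, it likewise uses the key's hash value to make the lookup faster.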

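For more detail on the points above, see iOS開發之不要告訴我你真的懂isEqual與hash! ("iOS Development: Don't Tell Me You Really Understand isEqual and hash!")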

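This leads to the following conclusion:
Equal objects always produce the same hash value, while unequal objects may still produce the same hash value.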
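To make that rule concrete, here is a rough C sketch. person_t, person_equal and person_hash are hypothetical stand-ins for the Person class and its isEqual:/hash pair, not code from the test above. The point is that the hash is built only from the fields that equality compares.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the Person class used in the test above. */
typedef struct {
    const char *name;
    long birthday;   /* e.g. days since some epoch */
} person_t;

/* Two people are equal when both fields match. */
static bool person_equal(const person_t *a, const person_t *b) {
    return a->birthday == b->birthday && strcmp(a->name, b->name) == 0;
}

/* The hash uses exactly the fields person_equal compares, so equal
   values always hash the same. Unequal values may still collide. */
static size_t person_hash(const person_t *p) {
    size_t h = (size_t)p->birthday;
    for (const char *c = p->name; *c != '\0'; c++) {
        h = h * 31u + (unsigned char)*c;
    }
    return h;
}
```

If the hash also mixed in something equality ignores (the object's address, say), two equal objects could land in different buckets and a set would end up storing both.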

NSSet

struct __CFSet {
    CFRuntimeBase _base;
    CFIndex _count;		/* number of values */
    CFIndex _capacity;		/* maximum number of values */
    CFIndex _bucketsNum;	/* number of slots */
    uintptr_t _marker;
    void *_context;		/* private */
    CFIndex _deletes;
    CFOptionFlags _xflags;      /* bits for GC */
    const void **_keys;		/* can be NULL if not allocated yet */
};

The struct shows that a set stores its keys in an array of pointers, and the source shows that this array is a single contiguous allocation.

How the hash value is computed depends on how the set was initialized. Simplifying the source gives:

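static CFIndex __CFSetFindBucketsXX(CFSetRef set, const void *key) {
    // Variant without a hash callback: the pointer value itself is the hash
    // CFHashCode keyHash = (CFHashCode)key;
    
    // Variant with a hash callback: call it, falling back to the pointer value
    const CFSetCallBacks *cb = __CFSetGetKeyCallBacks(set);
    CFHashCode keyHash = cb->hash ? (CFHashCode)INVOKE_CALLBACK2(((CFHashCode (*)(const void *, void *))cb->hash), key, set->_context) : (CFHashCode)key;
    
    const void **keys = set->_keys;
    // Starting slot: the hash value modulo the number of buckets
    CFIndex probe = keyHash % set->_bucketsNum;
    ...
}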
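The choice between the two hash sources can be shown in isolation. The following is a simplified sketch, not the Core Foundation code: hash_callback, string_hash and key_hash are names made up for illustration, and the callback takes no context argument.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type; the real CF callback also takes a context. */
typedef uintptr_t (*hash_callback)(const void *key);

/* Hypothetical callback: hash a NUL-terminated string by its content. */
static uintptr_t string_hash(const void *key) {
    uintptr_t h = 5381;
    for (const unsigned char *c = key; *c != '\0'; c++) {
        h = h * 33 + *c;
    }
    return h;
}

/* With a callback, the callback decides the hash.
   Without one, the pointer value itself serves as the hash. */
static uintptr_t key_hash(const void *key, hash_callback hash) {
    return hash ? hash(key) : (uintptr_t)key;
}
```

With no callback, two distinct buffers holding the same text hash differently because their addresses differ. With string_hash they hash the same.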

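Collisions are bound to occur in this process. In Notes: Data Structures, Hash (a rough implementation in OC), I described two ways to resolve them: open addressing and chaining.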

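When the array is short, the chains produced by chaining become very long and every lookup needs a second traversal along the chain, so matching is just as expensive and nothing has been gained. Apple states that lookup is close to O(1), so it is presumably not chaining, which leaves open addressing.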

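Open addressing can grow the array dynamically so that a full table never blocks insertion, and this is also consistent with O(1) lookup.
The implementation of AddValue confirms this. The code below has unrelated logic removed: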

void CFSetAddValue(CFMutableSetRef set, const void *key) {
    // Use match and nomatch to determine whether the set already contains key
    CFIndex match, nomatch;
    
    __CFSetFindBuckets2(set, key, &match, &nomatch);
    if (kCFNotFound != match) {
        // The key already exists, so do nothing
    } else {
        // The key does not exist, so add it to the set
        CF_OBJC_KVO_WILLCHANGE(set, key);
	    CF_WRITE_BARRIER_ASSIGN(keysAllocator, set->_keys[nomatch], newKey);
	    set->_count++;
	    CF_OBJC_KVO_DIDCHANGE(set, key);
    }
}

static void __CFSetFindBuckets2(CFSetRef set, const void *key, CFIndex *match, CFIndex *nomatch) {
    const CFSetCallBacks *cb = __CFSetGetKeyCallBacks(set);
    // Compute the hash value
    CFHashCode keyHash = cb->hash ? (CFHashCode)INVOKE_CALLBACK2(((CFHashCode (*)(const void *, void *))cb->hash), key, set->_context) : (CFHashCode)key;
    const void **keys = set->_keys;
    uintptr_t marker = set->_marker;
    CFIndex probe = keyHash % set->_bucketsNum;
    CFIndex probeskip = 1;	// See RemoveValue() for notes before changing this value
    CFIndex start = probe;
    *match = kCFNotFound;
    *nomatch = kCFNotFound;
    for (;;) {
	uintptr_t currKey = (uintptr_t)keys[probe];
	// If the probed slot is empty, record it in nomatch and return: the key is not present
	if (marker == currKey) {		/* empty */
	    if (nomatch) *nomatch = probe;
	    return;
	} else if (~marker == currKey) {	/* deleted */
	    if (nomatch) {
		*nomatch = probe;
		nomatch = NULL;
	    }
	} else if (currKey == (uintptr_t)key || (cb->equal && INVOKE_CALLBACK3((Boolean (*)(const void *, const void *, void*))cb->equal, (void *)currKey, key, set->_context))) {
	    // Record the slot in match and return: the key is present
	    *match = probe;
	    return;
	}
	// No match here, so a collision occurred: advance the index until an empty slot is found
	probe = probe + probeskip;
    
	if (set->_bucketsNum <= probe) {
	    probe -= set->_bucketsNum;
	}
	if (start == probe) {
	    return;
	}
    }
}
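The probing loop is easier to follow with the callbacks and macros stripped away. Below is a simplified sketch under stated assumptions: keys are plain non-zero integers whose value doubles as the hash, 0 marks an empty slot and all-ones marks a deleted slot. SLOT_EMPTY, SLOT_DELETED, NOT_FOUND and find_buckets are names invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define SLOT_EMPTY   ((uintptr_t)0)     /* slot that was never used */
#define SLOT_DELETED (~(uintptr_t)0)    /* tombstone left by a removal */
#define NOT_FOUND    ((size_t)-1)

/* Linear probing over slots[0..nbuckets), nbuckets must be non-zero.
   On return, *match is the index of key, or NOT_FOUND if it is absent.
   *nomatch is the first reusable slot seen on the probe path (a tombstone
   if one was passed, otherwise the empty slot that ended the search). */
static void find_buckets(const uintptr_t *slots, size_t nbuckets,
                         uintptr_t key, size_t *match, size_t *nomatch) {
    size_t start = key % nbuckets;
    size_t probe = start;
    *match = NOT_FOUND;
    *nomatch = NOT_FOUND;
    do {
        uintptr_t curr = slots[probe];
        if (curr == SLOT_EMPTY) {
            /* An empty slot ends the probe sequence: key is not stored. */
            if (*nomatch == NOT_FOUND) *nomatch = probe;
            return;
        } else if (curr == SLOT_DELETED) {
            /* Remember the first tombstone but keep searching past it. */
            if (*nomatch == NOT_FOUND) *nomatch = probe;
        } else if (curr == key) {
            *match = probe;
            return;
        }
        probe = (probe + 1) % nbuckets;   /* collision: try the next slot */
    } while (probe != start);
}
```

The tombstone is what allows a removal to leave a gap without cutting off keys that were placed beyond it by earlier collisions.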

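The growth step involved here is something I also implemented concretely in OC in Notes: Data Structures, Hash (a rough implementation in OC). That post implemented chaining, but a careful reader will see that the principle is the same. CFSet's internal struct also has a _capacity field, which is the current growth threshold of the array: once count reaches it, the array is grown. Here is the source, with unrelated logic removed: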

// Checked whenever an element is added
void CFSetAddValue(CFMutableSetRef set, const void *key) {
        ...
	if (set->_count == set->_capacity || NULL == set->_keys) {
	    // Grow the storage
	    __CFSetGrow(set, 1);
	}
	...
}

// Growing
static void __CFSetGrow(CFMutableSetRef set, CFIndex numNewValues) {
    // Keep the old keys array
    const void **oldkeys = set->_keys;
    CFIndex idx, oldnbuckets = set->_bucketsNum;
    CFIndex oldCount = set->_count;
    CFAllocatorRef allocator = __CFGetAllocator(set), keysAllocator;
    void *keysBase;
    set->_capacity = __CFSetRoundUpCapacity(oldCount + numNewValues);
    set->_bucketsNum = __CFSetNumBucketsForCapacity(set->_capacity);
    set->_deletes = 0;
    void *buckets = _CFAllocatorAllocateGC(allocator, set->_bucketsNum * sizeof(const void *), (set->_xflags & __kCFSetWeakKeys) ? AUTO_MEMORY_UNSCANNED : AUTO_MEMORY_SCANNED);
    // Install the new, larger keys array
    CF_WRITE_BARRIER_BASE_ASSIGN(allocator, set, set->_keys, buckets);
    keysAllocator = allocator;
    keysBase = set->_keys;
    if (NULL == set->_keys) HALT;
    if (__CFOASafe) __CFSetLastAllocationEventName(set->_keys, "CFSet (store)");
    
    // Mark every new bucket empty, then rehash the old keys into the new array
    for (idx = set->_bucketsNum; idx--;) {
        set->_keys[idx] = (const void *)set->_marker;
    }
    if (NULL == oldkeys) return;
    for (idx = 0; idx < oldnbuckets; idx++) {
        if (set->_marker != (uintptr_t)oldkeys[idx] && ~set->_marker != (uintptr_t)oldkeys[idx]) {
            CFIndex match, nomatch;
            __CFSetFindBuckets2(set, oldkeys[idx], &match, &nomatch);
            CFAssert3(kCFNotFound == match, __kCFLogAssertion, "%s(): two values (%p, %p) now hash to the same slot; mutable value changed while in table or hash value is not immutable", __PRETTY_FUNCTION__, oldkeys[idx], set->_keys[match]);
            if (kCFNotFound != nomatch) {
                CF_WRITE_BARRIER_BASE_ASSIGN(keysAllocator, keysBase, set->_keys[nomatch], oldkeys[idx]);
            }
        }
    }
    CFAssert1(set->_count == oldCount, __kCFLogAssertion, "%s(): set count differs after rehashing; error", __PRETTY_FUNCTION__);
    _CFAllocatorDeallocateGC(allocator, oldkeys);
}


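So when a key is added to an NSSet, a hash function computes its hash value, and that value determines the index at which the key is stored. If that index is already occupied, open addressing moves forward to the next free slot. When the array reaches its threshold it is grown and every key is rehashed into the new array. This keeps lookup close to O(1), comparable to indexing into contiguous storage.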
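Putting the pieces together, here is a compact sketch of an open-addressing set that grows and rehashes. It is not the CFSet implementation: tiny_set and its functions are hypothetical, keys are non-zero integers that act as their own hash, removal is left out, and the doubling policy with a load factor of 1/2 is an assumption made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uintptr_t *keys;   /* 0 marks an empty slot */
    size_t count;      /* number of stored keys */
    size_t capacity;   /* grow once count reaches this threshold */
    size_t nbuckets;   /* number of slots, always larger than capacity */
} tiny_set;

/* Linear probing: returns the slot holding key, or the empty slot where
   it belongs. Terminates because the table is never completely full. */
static size_t tiny_set_slot(const tiny_set *s, uintptr_t key) {
    size_t probe = key % s->nbuckets;
    while (s->keys[probe] != 0 && s->keys[probe] != key) {
        probe = (probe + 1) % s->nbuckets;
    }
    return probe;
}

/* Allocate a larger array and rehash every old key into it. */
static int tiny_set_grow(tiny_set *s) {
    uintptr_t *old = s->keys;
    size_t old_n = s->nbuckets;
    size_t new_cap = s->capacity ? s->capacity * 2 : 4;
    size_t new_n = new_cap * 2;           /* assumed load factor of 1/2 */
    uintptr_t *fresh = calloc(new_n, sizeof *fresh);
    if (fresh == NULL) return -1;
    s->keys = fresh;
    s->capacity = new_cap;
    s->nbuckets = new_n;
    for (size_t i = 0; i < old_n; i++) {
        if (old[i] != 0) {
            /* The slot depends on nbuckets, so it must be recomputed. */
            s->keys[tiny_set_slot(s, old[i])] = old[i];
        }
    }
    free(old);
    return 0;
}

/* Returns 1 if key was inserted, 0 if it was already present,
   -1 if memory could not be allocated. key must be non-zero. */
static int tiny_set_add(tiny_set *s, uintptr_t key) {
    if (s->keys == NULL || s->count == s->capacity) {
        if (tiny_set_grow(s) != 0) return -1;
    }
    size_t slot = tiny_set_slot(s, key);
    if (s->keys[slot] == key) return 0;   /* duplicate: do nothing */
    s->keys[slot] = key;
    s->count++;
    return 1;
}
```

A zero-initialized tiny_set is a valid empty set, and free(s.keys) releases it. As in the excerpt above, the threshold check comes before the probe, so the table can grow even when the key turns out to be a duplicate.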

NSDictionary

Without further ado, here is the dictionary's data structure:

struct __CFDictionary {
    CFRuntimeBase _base;
    CFIndex _count;		/* number of values */
    CFIndex _capacity;		/* maximum number of values */
    CFIndex _bucketsNum;	/* number of slots */
    uintptr_t _marker;
    void *_context;		/* private */
    CFIndex _deletes;
    CFOptionFlags _xflags;      /* bits for GC */
    const void **_keys;		/* can be NULL if not allocated yet */
    const void **_values;	/* can be NULL if not allocated yet */
};

It should look very familiar: compared with NSSet above, there is one additional pointer array, values.

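Comparing the sources of NSSet and NSDictionary shows that the two are implemented in nearly the same way. The dictionary uses two arrays, keys and values, so keys and values are stored separately.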

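It likewise uses open addressing and grows its arrays dynamically so that a full array never blocks insertion. The implementation of SetValue confirms this. The code below has unrelated logic removed: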

void CFDictionarySetValue(CFMutableDictionaryRef dict, const void *key, const void *value) {
    // Use match and nomatch to determine whether key already exists
    CFIndex match, nomatch;
    __CFDictionaryFindBuckets2(dict, key, &match, &nomatch);
    ...
    if (kCFNotFound != match) {
        // The key already exists, so overwrite its value with newValue
	CF_OBJC_KVO_WILLCHANGE(dict, key);
	CF_WRITE_BARRIER_ASSIGN(valuesAllocator, dict->_values[match], newValue);
	CF_OBJC_KVO_DIDCHANGE(dict, key);
    } else {
        // The key does not exist, so add the new key and value
	CF_OBJC_KVO_WILLCHANGE(dict, key);
	CF_WRITE_BARRIER_ASSIGN(keysAllocator, dict->_keys[nomatch], newKey);
	CF_WRITE_BARRIER_ASSIGN(valuesAllocator, dict->_values[nomatch], newValue);
	dict->_count++;
	CF_OBJC_KVO_DIDCHANGE(dict, key);
    }
}

// Find the slot where key is stored
static void __CFDictionaryFindBuckets2(CFDictionaryRef dict, const void *key, CFIndex *match, CFIndex *nomatch) {
    const CFDictionaryKeyCallBacks *cb = __CFDictionaryGetKeyCallBacks(dict);
    // Compute the hash value
    CFHashCode keyHash = cb->hash ? (CFHashCode)INVOKE_CALLBACK2(((CFHashCode (*)(const void *, void *))cb->hash), key, dict->_context) : (CFHashCode)key;
    const void **keys = dict->_keys;
    uintptr_t marker = dict->_marker;
    CFIndex probe = keyHash % dict->_bucketsNum;
    CFIndex probeskip = 1;	// See RemoveValue() for notes before changing this value
    CFIndex start = probe;
    *match = kCFNotFound;
    *nomatch = kCFNotFound;
    for (;;) {
	uintptr_t currKey = (uintptr_t)keys[probe];
	// Empty bucket: record it in nomatch and return, no match
	if (marker == currKey) {		/* empty */
	    if (nomatch) *nomatch = probe;
	    return;
	} else if (~marker == currKey) {	/* deleted */
	    if (nomatch) {
		*nomatch = probe;
		nomatch = NULL;
	    }
	} else if (currKey == (uintptr_t)key || (cb->equal && INVOKE_CALLBACK3((Boolean (*)(const void *, const void *, void*))cb->equal, (void *)currKey, key, dict->_context))) {
	    // Match found: record it in match and return
	    *match = probe;
	    return;
	}
	
	// No match, a collision occurred: advance the index until an empty slot is found
	probe = probe + probeskip;
	
	if (dict->_bucketsNum <= probe) {
	    probe -= dict->_bucketsNum;
	}
	if (start == probe) {
	    return;
	}
    }
}

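The source shows that inserting a duplicate key into an NSDictionary overwrites the old value, whereas NSSet does nothing, which guarantees that its elements are never duplicated.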

A dictionary's key-value pairs are in one-to-one correspondence, yet the struct shows keys and values stored in two different arrays. How are a key and its value tied together?

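First the hash function computes the key's hash value, which is taken modulo the array length to get a starting index. After any collision handling, the index where the key ends up in keys is used unchanged as the index into values, and that is where the matching value lives. For this to work, keys and values must always have the same length, so when the dictionary grows, both arrays must be grown together.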
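The shared-index idea can be sketched with two parallel arrays. This is an illustration under simplifying assumptions, not the CFDictionary code: tiny_dict and its functions are hypothetical, the table has a fixed size and never grows, keys are non-zero integers that act as their own hash, and values are C strings.

```c
#include <stddef.h>
#include <stdint.h>

#define DICT_BUCKETS 8   /* fixed size for this sketch; no growth */

typedef struct {
    uintptr_t keys[DICT_BUCKETS];      /* 0 marks an empty slot */
    const char *values[DICT_BUCKETS];  /* values[i] belongs to keys[i] */
    size_t count;
} tiny_dict;

/* Probe the keys array only. Returns the slot holding key, or the empty
   slot where it would go, or DICT_BUCKETS if the table is full and the
   key is absent. */
static size_t tiny_dict_slot(const tiny_dict *d, uintptr_t key) {
    size_t start = key % DICT_BUCKETS;
    size_t probe = start;
    do {
        if (d->keys[probe] == 0 || d->keys[probe] == key) return probe;
        probe = (probe + 1) % DICT_BUCKETS;
    } while (probe != start);
    return DICT_BUCKETS;
}

/* Returns 1 if a new pair was added, 0 if an existing value was
   overwritten, -1 if the table is full. */
static int tiny_dict_set(tiny_dict *d, uintptr_t key, const char *value) {
    size_t slot = tiny_dict_slot(d, key);
    if (slot == DICT_BUCKETS) return -1;
    int is_new = (d->keys[slot] == 0);
    d->keys[slot] = key;
    d->values[slot] = value;   /* same index in both arrays */
    if (is_new) d->count++;
    return is_new;
}

/* Returns the value stored for key, or NULL if key is absent. */
static const char *tiny_dict_get(const tiny_dict *d, uintptr_t key) {
    size_t slot = tiny_dict_slot(d, key);
    if (slot == DICT_BUCKETS || d->keys[slot] == 0) return NULL;
    return d->values[slot];
}
```

Only the keys array is ever searched. The values array is never hashed or probed on its own, it is simply read at whatever index the key search produced.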

// Checked in SetValue
void CFDictionarySetValue(CFMutableDictionaryRef dict, const void *key, const void *value) {
    ...
    if (dict->_count == dict->_capacity || NULL == dict->_keys) {
         __CFDictionaryGrow(dict, 1);
    }
    ...
}

// Growing
static void __CFDictionaryGrow(CFMutableDictionaryRef dict, CFIndex numNewValues) {
    // Keep the old arrays
    const void **oldkeys = dict->_keys;
    const void **oldvalues = dict->_values;
    CFIndex idx, oldnbuckets = dict->_bucketsNum;
    CFIndex oldCount = dict->_count;
    CFAllocatorRef allocator = __CFGetAllocator(dict), keysAllocator, valuesAllocator;
    void *keysBase, *valuesBase;
    dict->_capacity = __CFDictionaryRoundUpCapacity(oldCount + numNewValues);
    dict->_bucketsNum = __CFDictionaryNumBucketsForCapacity(dict->_capacity);
    dict->_deletes = 0;
    ...
    CF_WRITE_BARRIER_BASE_ASSIGN(allocator, dict, dict->_keys, _CFAllocatorAllocateGC(allocator, 2 * dict->_bucketsNum * sizeof(const void *), AUTO_MEMORY_SCANNED));
        dict->_values = (const void **)(dict->_keys + dict->_bucketsNum);
        keysAllocator = valuesAllocator = allocator;
        keysBase = valuesBase = dict->_keys;
    if (NULL == dict->_keys || NULL == dict->_values) HALT;
    ...
    // Mark every new bucket empty, then rehash the old keys into the new arrays
    for (idx = dict->_bucketsNum; idx--;) {
        dict->_keys[idx] = (const void *)dict->_marker;
        dict->_values[idx] = 0;
    }
    if (NULL == oldkeys) return;
    for (idx = 0; idx < oldnbuckets; idx++) {
        if (dict->_marker != (uintptr_t)oldkeys[idx] && ~dict->_marker != (uintptr_t)oldkeys[idx]) {
            CFIndex match, nomatch;
            __CFDictionaryFindBuckets2(dict, oldkeys[idx], &match, &nomatch);
            CFAssert3(kCFNotFound == match, __kCFLogAssertion, "%s(): two values (%p, %p) now hash to the same slot; mutable value changed while in table or hash value is not immutable", __PRETTY_FUNCTION__, oldkeys[idx], dict->_keys[match]);
            if (kCFNotFound != nomatch) {
                CF_WRITE_BARRIER_BASE_ASSIGN(keysAllocator, keysBase, dict->_keys[nomatch], oldkeys[idx]);
                CF_WRITE_BARRIER_BASE_ASSIGN(valuesAllocator, valuesBase, dict->_values[nomatch], oldvalues[idx]);
            }
        }
    }
    ...
}
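One detail in the growth code is worth isolating: in this excerpt the two arrays come from a single allocation of 2 * _bucketsNum pointers, with _values pointing at the second half. A minimal sketch of that layout, using a hypothetical helper alloc_parallel and plain calloc in place of the CF allocator:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n key slots followed by n value slots in
   one zeroed block. n must be non-zero. On success returns 0 and sets
   *keys to the start of the block and *values to n slots past it.
   Release both arrays with a single free(*keys). */
static int alloc_parallel(size_t n, const void ***keys, const void ***values) {
    const void **block = calloc(2 * n, sizeof *block);
    if (block == NULL) return -1;
    *keys = block;
    *values = block + n;
    return 0;
}
```

Because both halves live in one block, the two arrays cannot end up with different lengths, which is the invariant the shared index depends on.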

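As shown above, the dictionary maps unordered and potentially large data onto a hash table, so later lookups cost close to O(1). The drawback is the space consumed by repeated growth, so open addressing is best suited to data that is needed only temporarily and can be released soon.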

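For a key and value set on an NSDictionary, a hash function computes the key's hash value. There are as many key slots as value slots, and the hash value modulo the array length gives the index. If that index is already occupied, open addressing moves forward to the next free slot; if the array reaches its threshold, it is grown and everything is rehashed. This mechanism places unrelated key-value pairs into a hash table that ties each key to its value.
On lookup, the key's hash value and the array length give the index, and that index is used to access both keys and values directly, so lookup is close to O(1), comparable to indexing into contiguous linear storage.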