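The previous source-analysis post looked at the swift-ring-builder file. Its most complex and most important operation is the rebalance method. It regenerates the ring file after you modify the builder file (for example, by adding or removing devices) so that partitions are evenly distributed across the system (after a rebalance, of course, the system's services need to be restarted). Consistent hashing, replicas, zones and weights are all realized through it.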
Source excerpt:
The rebalance method in swift-ring-builder:
def rebalance():
    """
swift-ring-builder <builder_file> rebalance
    Attempts to rebalance the ring by reassigning partitions that haven't been
    recently reassigned.
    """
    # devs_changed records whether the builder's devices have changed. It
    # defaults to False and is set to True by add_dev, set_dev_weight and
    # remove_dev.
    devs_changed = builder.devs_changed
    try:
        # get_balance returns the ring's balance, e.g. 0.83 (percent).
        last_balance = builder.get_balance()
        # The main rebalance call: returns the number of partitions
        # reassigned and the new balance.
        parts, balance = builder.rebalance()
    except exceptions.RingBuilderError, e:
        print '-' * 79
        print ("An error has occurred during ring validation. Common\n"
               "causes of failure are rings that are empty or do not\n"
               "have enough devices to accommodate the replica count.\n"
               "Original exception message:\n %s" % e.message
               )
        print '-' * 79
        exit(EXIT_ERROR)
    if not parts:
        print 'No partitions could be reassigned.'
        print 'Either none need to be or none can be due to ' \
            'min_part_hours [%s].' % builder.min_part_hours
        exit(EXIT_WARNING)
    if not devs_changed and abs(last_balance - balance) < 1:
        print 'Cowardly refusing to save rebalance as it did not change ' \
            'at least 1%.'
        exit(EXIT_WARNING)
    try:
        # Safety check that catches bugs: makes sure partitions are assigned
        # to real devices, are not assigned twice, and so on.
        builder.validate()
    except exceptions.RingValidationError, e:
        print '-' * 79
        print ("An error has occurred during ring validation. Common\n"
               "causes of failure are rings that are empty or do not\n"
               "have enough devices to accommodate the replica count.\n"
               "Original exception message:\n %s" % e.message
               )
        print '-' * 79
        exit(EXIT_ERROR)
    # Print the result of the rebalance.
    print 'Reassigned %d (%.02f%%) partitions. Balance is now %.02f.' % \
        (parts, 100.0 * parts / builder.parts, balance)
    status = EXIT_SUCCESS
    if balance > 5:
        # A balance above 5 triggers a note that includes min_part_hours,
        # the minimum wait before rebalancing again.
        print '-' * 79
        print 'NOTE: Balance of %.02f indicates you should push this ' % \
            balance
        print '      ring, wait at least %d hours, and rebalance/repush.' \
            % builder.min_part_hours
        print '-' * 79
        status = EXIT_WARNING
    ts = time()  # Take the current timestamp.
    # Save the newly generated ring and builder files, first as timestamped
    # backups and then in place.
    builder.get_ring().save(
        pathjoin(backup_dir, '%d.' % ts + basename(ring_file)))
    pickle.dump(builder.to_dict(), open(pathjoin(backup_dir,
        '%d.' % ts + basename(argv[1])), 'wb'), protocol=2)
    builder.get_ring().save(ring_file)
    pickle.dump(builder.to_dict(), open(argv[1], 'wb'), protocol=2)
    exit(status)
I have added some comments of my own to make the code easier to follow. Under the hood, it calls the rebalance method in builder.py.
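The checks that the command-line wrapper performs after builder.rebalance() returns can be condensed into one small function. This is only a sketch in modern Python: the name rebalance_outcome is hypothetical, and the numeric values of the exit codes are assumed for illustration.

```python
EXIT_SUCCESS, EXIT_WARNING = 0, 1  # assumed values, for illustration only


def rebalance_outcome(parts, devs_changed, last_balance, balance):
    """Return (save_ring, exit_status), following the checks in the listing."""
    if not parts:
        # Nothing was reassigned, so there is nothing new to save.
        return False, EXIT_WARNING
    if not devs_changed and abs(last_balance - balance) < 1:
        # Devices are unchanged and the balance moved by less than 1.
        return False, EXIT_WARNING
    if balance > 5:
        # The ring is saved, but the operator is told to rebalance again.
        return True, EXIT_WARNING
    return True, EXIT_SUCCESS


# Devices changed and the ring ended up well balanced: save and succeed.
print(rebalance_outcome(100, True, 3.0, 2.5))
```

Note that the "refusing to save" branch only applies when no device was added, removed or re-weighted. A changed device list always gets saved as long as at least one partition moved.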
The rebalance method in builder.py:
def rebalance(self):
    """
    Rebalance the ring.

    This is the main work function of the builder, as it will assign and
    reassign partitions to devices in the ring based on weights, distinct
    zones, recent reassignments, etc.

    The process doesn't always perfectly assign partitions (that'd take a
    lot more analysis and therefore a lot more time -- I had code that did
    that before). Because of this, it keeps rebalancing until the device
    skew (number of partitions a device wants compared to what it has) gets
    below 1% or doesn't change by more than 1% (only happens with ring that
    can't be balanced no matter what -- like with 3 zones of differing
    weights with replicas set to 3).

    :returns: (number_of_partitions_altered, resulting_balance)
    """
    self._ring = None  # Clear the ring cached on this instance.
    if self._last_part_moves_epoch is None:
        # First rebalance: do the initial assignment.
        self._initial_balance()
        self.devs_changed = False
        return self.parts, self.get_balance()
    retval = 0
    self._update_last_part_moves()  # Update the time since parts last moved.
    last_balance = 0
    while True:
        # A list of (part, replicas) pairs that need to be reassigned.
        reassign_parts = self._gather_reassign_parts()
        self._reassign_parts(reassign_parts)  # The actual reassignment.
        retval += len(reassign_parts)
        while self._remove_devs:
            # Drop the devices that were marked for removal.
            self.devs[self._remove_devs.pop()['id']] = None
        balance = self.get_balance()  # The new balance.
        if balance < 1 or abs(last_balance - balance) < 1 or \
                retval == self.parts:
            break
        last_balance = balance
    self.devs_changed = False
    self.version += 1
    return retval, balance
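The path taken depends on whether _last_part_moves_epoch is None. If it is None (meaning this is the first rebalance), the method calls _initial_balance() and returns the result. What that does is broadly the same as the path taken when _last_part_moves_epoch is not None, except that _initial_balance also performs some initialization. The method that actually carries out the rebalance is _reassign_parts.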
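The stopping rule of the while loop is easier to see in isolation. The sketch below, in modern Python, replays a hypothetical sequence of passes, each given as (partitions moved, balance afterwards), and applies the same three exit conditions. The function name and the sample numbers are made up for illustration.

```python
def run_rebalance_loop(total_parts, passes):
    """Replay passes of (parts_moved, balance_after) until the loop would stop."""
    moved = 0
    last_balance = 0
    balance = None
    rounds = 0
    for parts_moved, balance in passes:
        rounds += 1
        moved += parts_moved
        # Same exit test as in the listing: balanced enough, no longer
        # improving, or every partition has already been moved.
        if balance < 1 or abs(last_balance - balance) < 1 \
                or moved == total_parts:
            break
        last_balance = balance
    return rounds, moved, balance


# Converges on the third pass because the balance drops below 1.
converging = run_rebalance_loop(1024, [(40, 12.0), (20, 4.0), (5, 0.5), (1, 0.4)])
# Stops on the second pass because the balance barely changed.
stuck = run_rebalance_loop(1024, [(40, 12.0), (10, 11.5), (10, 11.4)])
print(converging, stuck)
```

The second case corresponds to the situation the docstring mentions: a ring that cannot be balanced no matter what, where further passes would only move partitions without improving anything.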
The _reassign_parts method in builder.py, which does the actual partition assignment:
def _reassign_parts(self, reassign_parts):
    """
    For an existing ring data set, partitions are reassigned similarly to
    the initial assignment. The devices are ordered by how many partitions
    they still want and kept in that order throughout the process. The
    gathered partitions are iterated through, assigning them to devices
    according to the "most wanted" while keeping the replicas as "far
    apart" as possible. Two different zones are considered the
    farthest-apart things, followed by different ip/port pairs within a
    zone; the least-far-apart things are different devices with the same
    ip/port pair in the same zone.

    If you want more replicas than devices, you won't get all your
    replicas.

    :param reassign_parts: An iterable of (part, replicas_to_replace)
                           pairs. replicas_to_replace is an iterable of the
                           replica (an int) to replace for that partition.
                           replicas_to_replace may be shared for multiple
                           partitions, so be sure you do not modify it.
    """
    for dev in self._iter_devs():
        dev['sort_key'] = self._sort_key_for(dev)  # Set each dev's sort_key.
    # Collect the usable devices (non-zero weight), sorted by sort_key.
    available_devs = \
        sorted((d for d in self._iter_devs() if d['weight']),
               key=lambda x: x['sort_key'])

    tier2children = build_tier_tree(available_devs)  # tier -> child tiers

    tier2devs = defaultdict(list)  # tier -> devices
    tier2sort_key = defaultdict(list)  # tier -> sort keys
    tiers_by_depth = defaultdict(set)  # depth -> tiers
    for dev in available_devs:  # Index the devices by each of their tiers.
        for tier in tiers_for_dev(dev):
            tier2devs[tier].append(dev)  # <-- starts out sorted!
            tier2sort_key[tier].append(dev['sort_key'])
            tiers_by_depth[len(tier)].add(tier)

    for part, replace_replicas in reassign_parts:
        # Gather up what other tiers (zones, ip_ports, and devices) the
        # replicas not-to-be-moved are in for this part.
        # Keyed by tier: zone, ip_port and device id.
        other_replicas = defaultdict(lambda: 0)
        for replica in xrange(self.replicas):
            if replica not in replace_replicas:
                dev = self.devs[self._replica2part2dev[replica][part]]
                for tier in tiers_for_dev(dev):
                    # Count the replicas that stay where they are.
                    other_replicas[tier] += 1

        def find_home_for_replica(tier=(), depth=1):
            # Order the tiers by how many replicas of this
            # partition they already have. Then, of the ones
            # with the smallest number of replicas, pick the
            # tier with the hungriest drive and then continue
            # searching in that subtree.
            #
            # There are other strategies we could use here,
            # such as hungriest-tier (i.e. biggest
            # sum-of-parts-wanted) or picking one at random.
            # However, hungriest-drive is what was used here
            # before, and it worked pretty well in practice.
            #
            # Note that this allocator will balance things as
            # evenly as possible at each level of the device
            # layout. If your layout is extremely unbalanced,
            # this may produce poor results.
            # Descend tier by tier, looking for the fewest replicas.
            candidate_tiers = tier2children[tier]
            min_count = min(other_replicas[t] for t in candidate_tiers)
            candidate_tiers = [t for t in candidate_tiers
                               if other_replicas[t] == min_count]
            candidate_tiers.sort(
                key=lambda t: tier2sort_key[t][-1])

            if depth == max(tiers_by_depth.keys()):
                return tier2devs[candidate_tiers[-1]][-1]

            return find_home_for_replica(tier=candidate_tiers[-1],
                                         depth=depth + 1)

        for replica in replace_replicas:  # Place each replica that must move.
            dev = find_home_for_replica()
            dev['parts_wanted'] -= 1
            dev['parts'] += 1
            old_sort_key = dev['sort_key']
            new_sort_key = dev['sort_key'] = self._sort_key_for(dev)
            for tier in tiers_for_dev(dev):
                other_replicas[tier] += 1

                index = bisect.bisect_left(tier2sort_key[tier],
                                           old_sort_key)
                tier2devs[tier].pop(index)
                tier2sort_key[tier].pop(index)

                new_index = bisect.bisect_left(tier2sort_key[tier],
                                               new_sort_key)
                tier2devs[tier].insert(new_index, dev)
                tier2sort_key[tier].insert(new_index, new_sort_key)

            # Assign this replica of this part to dev['id'].
            self._replica2part2dev[replica][part] = dev['id']

    # Just to save memory and keep from accidental reuse.
    for dev in self._iter_devs():
        del dev['sort_key']
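The bookkeeping at the end of the placement loop keeps two parallel lists sorted while a device's appetite changes. A minimal sketch of that idea in modern Python follows. The helper names and the shape of the sort key, (parts_wanted, id), are assumptions made for this example and are not the key that _sort_key_for actually produces.

```python
import bisect


def sort_key_for(dev):
    # Assumed key for this sketch: hungrier devices sort later, and the id
    # breaks ties so that every key is unique.
    return (dev['parts_wanted'], dev['id'])


def take_partition(devs, sort_keys, dev):
    """Give one partition to dev and move it to its new sorted position."""
    old_key = sort_key_for(dev)
    dev['parts_wanted'] -= 1
    dev['parts'] += 1
    new_key = sort_key_for(dev)
    # Locate the device through its old key, then remove it from both lists.
    index = bisect.bisect_left(sort_keys, old_key)
    devs.pop(index)
    sort_keys.pop(index)
    # Re-insert it where the new key belongs.
    new_index = bisect.bisect_left(sort_keys, new_key)
    devs.insert(new_index, dev)
    sort_keys.insert(new_index, new_key)


devs = [
    {'id': 0, 'parts': 5, 'parts_wanted': 1},
    {'id': 1, 'parts': 4, 'parts_wanted': 2},
    {'id': 2, 'parts': 2, 'parts_wanted': 4},
]
sort_keys = [sort_key_for(d) for d in devs]  # already in ascending order

for _ in range(3):
    take_partition(devs, sort_keys, devs[-1])  # the hungriest device is last

order = [d['id'] for d in devs]
print(order)
```

Because the lists stay sorted after every assignment, the hungriest device is always the last element, which is why the listing can simply index with [-1].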
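The _reassign_parts function implements the reassignment. The key concept in it is the three-tier structure, which comes from utils.py: given a single device (tiers_for_dev) or a list of devices (build_tier_tree), it returns the tiers, in the latter case as a dictionary.
The source gives us an example:
Example: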
zone 1 -+---- 192.168.1.1:6000 -+---- device id 0
        |                       |
        |                       +---- device id 1
        |                       |
        |                       +---- device id 2
        |
        +---- 192.168.1.2:6000 -+---- device id 3
                                |
                                +---- device id 4
                                |
                                +---- device id 5
zone 2 -+---- 192.168.2.1:6000 -+---- device id 6
        |                       |
        |                       +---- device id 7
        |                       |
        |                       +---- device id 8
        |
        +---- 192.168.2.2:6000 -+---- device id 9
                                |
                                +---- device id 10
                                |
                                +---- device id 11
The tier tree would look like:
{
  (): [(1,), (2,)],
  (1,): [(1, 192.168.1.1:6000),
         (1, 192.168.1.2:6000)],
  (2,): [(2, 192.168.2.1:6000),
         (2, 192.168.2.2:6000)],
  (1, 192.168.1.1:6000): [(1, 192.168.1.1:6000, 0),
                          (1, 192.168.1.1:6000, 1),
                          (1, 192.168.1.1:6000, 2)],
  (1, 192.168.1.2:6000): [(1, 192.168.1.2:6000, 3),
                          (1, 192.168.1.2:6000, 4),
                          (1, 192.168.1.2:6000, 5)],
  (2, 192.168.2.1:6000): [(2, 192.168.2.1:6000, 6),
                          (2, 192.168.2.1:6000, 7),
                          (2, 192.168.2.1:6000, 8)],
  (2, 192.168.2.2:6000): [(2, 192.168.2.2:6000, 9),
                          (2, 192.168.2.2:6000, 10),
                          (2, 192.168.2.2:6000, 11)],
}
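Devices are split into three tiers by zone, ip_port and device_id, and the later steps work through these tiers level by level (this is where concepts such as zones and replicas are realized).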
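To make the tier structure concrete, here is a small sketch in modern Python that builds the tree for the twelve devices of the example and then walks it to place one replica. tiers_for_dev and build_tier_tree follow the behaviour described above but are re-written here for illustration. find_home is a simplified, hypothetical stand-in for find_home_for_replica that uses parts_wanted directly instead of sort keys.

```python
from collections import defaultdict


def tiers_for_dev(dev):
    """Return the (zone,), (zone, ip:port) and (zone, ip:port, id) tiers."""
    zone = dev['zone']
    ip_port = '%s:%s' % (dev['ip'], dev['port'])
    return ((zone,), (zone, ip_port), (zone, ip_port, dev['id']))


def build_tier_tree(devs):
    """Map every tier to the set of tiers directly beneath it."""
    tier2children = defaultdict(set)
    for dev in devs:
        for tier in tiers_for_dev(dev):
            tier2children[tier[:-1]].add(tier)  # the parent of (zone,) is ()
    return tier2children


def find_home(devs, tier2children, replicas_in_tier):
    """Descend from the root, preferring tiers that hold the fewest replicas
    and, among those, the tier containing the hungriest device."""
    hunger = defaultdict(lambda: float('-inf'))
    for dev in devs:
        for tier in tiers_for_dev(dev):
            hunger[tier] = max(hunger[tier], dev['parts_wanted'])
    tier = ()
    while tier in tier2children:
        children = tier2children[tier]
        fewest = min(replicas_in_tier[c] for c in children)
        candidates = [c for c in children if replicas_in_tier[c] == fewest]
        tier = max(candidates, key=lambda c: hunger[c])
    return tier[-1]  # the last element of a device tier is the device id


# The twelve devices of the example; parts_wanted values are made up.
devs = []
for dev_id in range(12):
    zone = 1 if dev_id < 6 else 2
    host = 1 if dev_id % 6 < 3 else 2
    devs.append({'id': dev_id, 'zone': zone, 'port': 6000,
                 'ip': '192.168.%d.%d' % (zone, host),
                 'parts_wanted': dev_id})

tree = build_tier_tree(devs)

# Suppose one replica of the partition already lives on device 0 (zone 1).
replicas_in_tier = defaultdict(int)
for tier in tiers_for_dev(devs[0]):
    replicas_in_tier[tier] += 1

home = find_home(devs, tree, replicas_in_tier)
print(home)
```

Zone 1 already holds a replica, so the search goes into zone 2 and, within it, follows the hungriest device. That is how the zone and replica separation falls out of the tier walk.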
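That completes one ring rebalance. At the end, the new builder file and ring file are saved. The ring file is produced from the generated builder by calling methods of the RingData class, which is fairly simple, so I will not analyze it here.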
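The save step writes every file twice: once as a timestamped copy in the backup directory and once in place. A minimal sketch of that pattern in modern Python, using a plain dictionary in place of builder.to_dict(), might look like the following. The function name, the file name and the dictionary contents are hypothetical.

```python
import os
import pickle
import tempfile
from time import time


def save_with_backup(state, builder_path, backup_dir):
    """Pickle state to a timestamped backup and then to builder_path."""
    os.makedirs(backup_dir, exist_ok=True)
    backup_path = os.path.join(
        backup_dir, '%d.%s' % (time(), os.path.basename(builder_path)))
    for path in (backup_path, builder_path):
        with open(path, 'wb') as handle:
            pickle.dump(state, handle, protocol=2)
    return backup_path


with tempfile.TemporaryDirectory() as workdir:
    builder_path = os.path.join(workdir, 'object.builder')
    state = {'parts': 1024, 'replicas': 3, 'version': 1}  # made-up contents
    backup_path = save_with_backup(
        state, builder_path, os.path.join(workdir, 'backups'))
    with open(builder_path, 'rb') as handle:
        restored = pickle.load(handle)
    backup_name = os.path.basename(backup_path)

print(backup_name, restored)
```

Keeping the timestamp as a prefix means the backups of one builder sort chronologically in a directory listing.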
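That roughly covers swift-ring-builder and the files under /swift/common/ring/. For the details of what individual functions do and how they are implemented, see the source. In the next post I will look at swift-init, using its start method to explain how services are started.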