"Disk File Types and Storage Modes of OpenStack Virtual Machines"
"Libvirt Live Migration and the Pre-Copy Implementation"
"OpenStack Virtual Machine Cold/Hot Migration: Practice and Workflow Analysis"
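With the groundwork laid by the articles above, we finally arrive at the code. Analyzing the implementation helps us see through to the essence of OpenStack virtual machine migration.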
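NOTE: block_device_info does not hold only OpenStack block device (Volume) information. It holds the virtual machine's block device information, that is, its disk information (both image and volume). Without a clear picture of this, it is easy to get confused when reading the code.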
MariaDB [nova]> select device_name,destination_type,device_type,source_type,image_id from block_device_mapping where instance_uuid="1935fcf7-ba9b-437c-a7d3-5d54c6d0d6d3";
+-------------+------------------+-------------+-------------+--------------------------------------+
| device_name | destination_type | device_type | source_type | image_id                             |
+-------------+------------------+-------------+-------------+--------------------------------------+
| /dev/vda    | local            | disk        | image       | 0aff2888-47f8-4133-928a-9c54414b3afb |
+-------------+------------------+-------------+-------------+--------------------------------------+
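To make the NOTE concrete, here is a minimal sketch that splits block device mapping rows into local disks and Cinder volumes. The dictionaries mirror the columns of the query above; the second row is a made-up example of an attached volume, and the helper names `is_volume` and `split_disks` are hypothetical, not part of Nova.

```python
# Rows shaped like the block_device_mapping query result above.
# The /dev/vdb row is an invented example of an attached Cinder volume.
bdms = [
    {"device_name": "/dev/vda", "source_type": "image",
     "destination_type": "local", "device_type": "disk"},
    {"device_name": "/dev/vdb", "source_type": "volume",
     "destination_type": "volume", "device_type": "disk"},
]


def is_volume(bdm):
    """Treat a mapping as volume-backed when its destination is a volume."""
    return bdm["destination_type"] == "volume"


def split_disks(rows):
    """Return (local_disks, volumes) as two lists of device names."""
    local = [r["device_name"] for r in rows if not is_volume(r)]
    volumes = [r["device_name"] for r in rows if is_volume(r)]
    return local, volumes


local_disks, volumes = split_disks(bdms)
print("local disks:", local_disks)
print("volumes:", volumes)
```

Both kinds of entries describe disks of the same virtual machine, which is why the migration code has to handle them together.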
# nova/nova/api/openstack/compute/migrate_server.py

def _migrate(self, req, id, body):
    """Permit admins to migrate a server to a new host."""
    ...
    # Check whether the user is authorized to perform the migrate action
    context.can(ms_policies.POLICY_ROOT % 'migrate')

    # Fetch the instance resource model object
    instance = common.get_instance(self.compute_api, context, id)
    try:
        # The call actually goes to the instance resize interface
        self.compute_api.resize(req.environ['nova.context'], instance)
    ...


# nova/nova/compute/api.py

def resize(self, context, instance, flavor_id=None, clean_shutdown=True,
           **extra_instance_updates):
    """Resize (ie, migrate) a running instance.

    If flavor_id is None, the process is considered a migration, keeping
    the original flavor_id. If flavor_id is not None, the instance should
    be migrated to a new host and resized to the new flavor_id.
    """
    # As the docstring shows, whether this is a migrate or a resize
    # depends on whether a new flavor is passed in
    ...
    # Get the current flavor of the virtual machine
    current_instance_type = instance.get_flavor()

    # If flavor_id is not provided, only migrate the instance.
    if not flavor_id:
        LOG.debug("flavor_id is None. Assuming migration.",
                  instance=instance)
        # Make sure the flavor does not change across the migration
        new_instance_type = current_instance_type
    ...
    filter_properties = {'ignore_hosts': []}
    # The allow_resize_to_same_host option decides whether the instance
    # may be resized onto the same compute node.
    # In practice, when a migrate lands on the same compute node,
    # nova-compute raises UnableToMigrateToSelf and the scheduler retries
    # until a suitable compute node is found or the request fails,
    # provided that nova-scheduler has RetryFilter enabled.
    if not CONF.allow_resize_to_same_host:
        filter_properties['ignore_hosts'].append(instance.host)
    ...
    scheduler_hint = {'filter_properties': filter_properties}
    self.compute_task_api.resize_instance(context, instance,
            extra_instance_updates, scheduler_hint=scheduler_hint,
            flavor=new_instance_type,
            reservations=quotas.reservations or [],
            clean_shutdown=clean_shutdown,
            request_spec=request_spec)


# nova/compute/manager.py

def resize_instance(self, context, instance, image,
                    reservations, migration, instance_type,
                    clean_shutdown):
    """Starts the migration of a running instance to another host."""
    ...
    # Get the network information of the virtual machine
    network_info = self.network_api.get_instance_nw_info(context, instance)
    ...
    # Get the disk information of the virtual machine
    bdms = objects.BlockDeviceMappingList.get_by_instance_uuid(
            context, instance.uuid)
    block_device_info = self._get_instance_block_device_info(
            context, instance, bdms=bdms)

    # Get the power-off timeout and retry interval
    timeout, retry_interval = self._get_power_off_values(context,
            instance, clean_shutdown)

    # Power off the virtual machine and migrate its disk files
    disk_info = self.driver.migrate_disk_and_power_off(
            context, instance, migration.dest_host,
            instance_type, network_info,
            block_device_info,
            timeout, retry_interval)

    # Terminate the connections to shared block devices (volumes)
    self._terminate_volume_connections(context, instance, bdms)

    # Migrate the virtual machine's network
    migration_p = obj_base.obj_to_primitive(migration)
    self.network_api.migrate_instance_start(context,
                                            instance,
                                            migration_p)
    ...
    # Update the host record of the virtual machine
    instance.host = migration.dest_compute
    instance.node = migration.dest_node
    instance.task_state = task_states.RESIZE_MIGRATED
    instance.save(expected_task_state=task_states.RESIZE_MIGRATING)
    ...


# nova/nova/virt/libvirt/driver.py

def migrate_disk_and_power_off(self, context, instance, dest,
                               flavor, network_info,
                               block_device_info=None,
                               timeout=0, retry_interval=0):
    # Get the ephemeral disk information
    ephemerals = driver.block_device_info_get_ephemerals(block_device_info)

    # Check whether the disks would have to be shrunk
    # Checks if the migration needs a disk resize down.
    root_down = flavor.root_gb < instance.flavor.root_gb
    ephemeral_down = flavor.ephemeral_gb < eph_size

    # Check whether the virtual machine was booted from a volume
    booted_from_volume = self._is_booted_from_volume(block_device_info)

    # Local disk files cannot be resized down
    if (root_down and not booted_from_volume) or ephemeral_down:
        reason = _("Unable to resize disk down.")
        raise exception.InstanceFaultRollback(
            exception.ResizeError(reason=reason))

    # Instances whose local disks use the LVM image backend
    # (images_type = lvm) and that are not booted from a volume
    # cannot be migrated
    # NOTE(dgenin): Migration is not implemented for LVM backed instances.
    if CONF.libvirt.images_type == 'lvm' and not booted_from_volume:
        reason = _("Migration is not supported for LVM backed instances")
        raise exception.InstanceFaultRollback(
            exception.MigrationPreCheckError(reason=reason))

    # copy disks to destination
    # rename instance dir to +_resize at first for using
    # shared storage for instance dir (eg. NFS).
    inst_base = libvirt_utils.get_instance_path(instance)
    inst_base_resize = inst_base + "_resize"

    # Determine whether the instance directory is on shared storage
    shared_storage = self._is_storage_shared_with(dest, inst_base)

    # try to create the directory on the remote compute node
    # if this fails we pass the exception up the stack so we can catch
    # failures here earlier
    if not shared_storage:
        try:
            # Non-shared storage: create the instance directory on the
            # destination host over SSH
            self._remotefs.create_dir(dest, inst_base)
        except processutils.ProcessExecutionError as e:
            reason = _("not able to execute ssh command: %s") % e
            raise exception.InstanceFaultRollback(
                exception.ResizeError(reason=reason))

    # Power off the virtual machine
    self.power_off(instance, timeout, retry_interval)

    # Disconnect the shared block devices (volumes)
    block_device_mapping = driver.block_device_info_get_mapping(
        block_device_info)
    for vol in block_device_mapping:
        connection_info = vol['connection_info']
        disk_dev = vol['mount_device'].rpartition("/")[2]
        self._disconnect_volume(connection_info, disk_dev, instance)

    # Read the contents of the disk.info file, which records the file
    # paths of the root disk, ephemeral disks and swap disk
    disk_info_text = self.get_instance_disk_info(
        instance, block_device_info=block_device_info)
    disk_info = jsonutils.loads(disk_info_text)

    try:
        # Rename the instance directory to <inst_base>_resize
        utils.execute('mv', inst_base, inst_base_resize)
        # if we are migrating the instance with shared storage then
        # create the directory. If it is a remote node the directory
        # has already been created
        if shared_storage:
            # Shared storage: treat the destination as the local host
            dest = None
            # Shared storage: create the instance directory directly
            # on the local file system
            utils.execute('mkdir', '-p', inst_base)
        ...
        active_flavor = instance.get_flavor()
        # Copy the local disk files of the virtual machine
        for info in disk_info:
            # assume inst_base == dirname(info['path'])
            img_path = info['path']
            fname = os.path.basename(img_path)
            from_path = os.path.join(inst_base_resize, fname)
            ...
            # We will not copy over the swap disk here, and rely on
            # finish_migration/_create_image to re-create it for us.
            if not (fname == 'disk.swap' and
                    active_flavor.get('swap', 0) != flavor.get('swap', 0)):
                # Whether to enable compression
                compression = info['type'] not in NO_COMPRESSION_TYPES
                # Non-shared storage: remote copy with scp
                # Shared storage: local copy with cp
                libvirt_utils.copy_image(from_path, img_path, host=dest,
                                         on_execute=on_execute,
                                         on_completion=on_completion,
                                         compression=compression)

        # Ensure disk.info is written to the new path to avoid disks being
        # reinspected and potentially changing format.
        # Copy the disk.info file
        src_disk_info_path = os.path.join(inst_base_resize, 'disk.info')
        if os.path.exists(src_disk_info_path):
            dst_disk_info_path = os.path.join(inst_base, 'disk.info')
            libvirt_utils.copy_image(src_disk_info_path,
                                     dst_disk_info_path,
                                     host=dest, on_execute=on_execute,
                                     on_completion=on_completion)
    except Exception:
        with excutils.save_and_reraise_exception():
            self._cleanup_remote_migration(dest, inst_base,
                                           inst_base_resize,
                                           shared_storage)

    return disk_info_text
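The two pre-checks at the top of `migrate_disk_and_power_off` are easy to lose in the long listing, so here is a minimal standalone sketch of the same decision. The function `check_cold_migration` and the exception `MigrationPreCheckFailed` are hypothetical stand-ins for illustration; only the conditions themselves are taken from the code above.

```python
class MigrationPreCheckFailed(Exception):
    """Hypothetical error type standing in for Nova's exceptions."""


def check_cold_migration(old_root_gb, new_root_gb, old_eph_gb, new_eph_gb,
                         booted_from_volume, images_type):
    """Raise if the cold migration/resize must be refused, else return True."""
    root_down = new_root_gb < old_root_gb
    ephemeral_down = new_eph_gb < old_eph_gb

    # A local root disk or an ephemeral disk cannot be shrunk.
    if (root_down and not booted_from_volume) or ephemeral_down:
        raise MigrationPreCheckFailed("Unable to resize disk down.")

    # Local disks on the LVM image backend are not migratable.
    if images_type == "lvm" and not booted_from_volume:
        raise MigrationPreCheckFailed(
            "Migration is not supported for LVM backed instances")
    return True


# A plain migrate keeps the flavor, so nothing shrinks.
print(check_cold_migration(20, 20, 0, 0, False, "qcow2"))

# Shrinking a local root disk is refused.
try:
    check_cold_migration(40, 20, 0, 0, False, "qcow2")
except MigrationPreCheckFailed as exc:
    print("refused:", exc)
```

Note that a root disk shrink is tolerated only when the instance boots from a volume, because in that case the root disk is not a local file that Nova would have to resize.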
In "Libvirt Live Migration and the Pre-Copy Implementation" we covered how Libvirt Live Migration works and how KVM Pre-Copy Live Migration works. In short, the process can be divided into three stages.
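As you might expect, the most critical stage is Stage 2, that is, the implementation of the exit condition. Early versions of Libvirt provided a set of native exit conditions.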
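Nova, however, uses a dynamically configured max downtime as its exit condition. In every iteration, Libvirt Pre-Copy Live Migration recalculates the guest's newly dirtied memory and the time the iteration took, and uses them to estimate the bandwidth. From the bandwidth and the number of dirty pages in the current iteration it then computes the time needed to transfer the remaining data. That time is the downtime. If the downtime falls within the Live Migration Max Downtime configured by the administrator, the loop exits and Stage 3 begins.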
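NOTE: Live Migration Max Downtime (in ms) is the longest period for which the virtual machine may stay paused during the final switchover. It describes the tolerable window of service interruption and is normally small enough to be negligible. It is set with the nova.conf option CONF.libvirt.live_migration_downtime.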
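The exit condition described above is simple arithmetic. The following is a minimal sketch of it, not the actual QEMU or Libvirt implementation; the function names and the byte counts in the example are assumptions chosen for illustration.

```python
def estimate_downtime_ms(remaining_bytes, transferred_bytes, iteration_secs):
    """Estimate how long it would take to send the remaining dirty data.

    Bandwidth is estimated from what the last iteration managed to send.
    """
    bandwidth = transferred_bytes / float(iteration_secs)  # bytes per second
    return remaining_bytes * 1000.0 / bandwidth


def can_enter_stage3(remaining_bytes, transferred_bytes, iteration_secs,
                     max_downtime_ms):
    """Exit the iterative copy once the estimate fits within max downtime."""
    estimate = estimate_downtime_ms(remaining_bytes, transferred_bytes,
                                    iteration_secs)
    return estimate <= max_downtime_ms


# Last iteration sent 1000 bytes in 1 second; 400 dirty bytes remain.
print(estimate_downtime_ms(400, 1000, 1))    # estimated downtime in ms
print(can_enter_stage3(400, 1000, 1, 500))   # fits within a 500 ms budget
print(can_enter_stage3(400, 1000, 1, 100))   # does not fit within 100 ms
```

A guest that dirties memory faster than the link can carry it keeps `remaining_bytes` high, so the estimate never drops below the budget.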
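Note that the dynamically configured downtime exit condition has a problem. If the virtual machine stays under heavy load and keeps producing new dirty memory, every iteration has a large amount of data to transfer and the downtime never falls into the exit range. So be prepared: a live migration can take a very long time. To address this, Libvirt introduced some new features.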
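Besides the Pre-Copy mode, Libvirt also supports a Post-Copy mode. Pre-Copy requires all data to be copied before the virtual machine switches to the destination host. Post-Copy instead gives priority to switching to the destination host as early as possible and copies the memory data afterwards. In Post-Copy mode, the device state and a portion (10%) of the dirty memory are first transferred to the destination host, and then the virtual machine switches over and runs there. When the GuestOS accesses a memory page that is not yet present, a remote page fault is triggered, which in turn pulls that page from the source host. Post-Copy clearly has problems of its own: if either host goes down or fails, or the network between them breaks, the whole virtual machine is corrupted. Post-Copy is therefore not the recommended Live Migration method for critical workloads. Whether it is permitted can be set with the nova.conf option live_migration_permit_post_copy.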
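When Post-Copy is permitted, something has to decide at which point to give up on Pre-Copy. Below is a minimal sketch of one plausible rule: switch when an iteration reduced the remaining data by less than a threshold. The 10% threshold and the parameter names are assumptions made for this example, not values taken from the text above.

```python
def should_switch_to_postcopy(iteration, current_remaining,
                              previous_remaining, min_progress_pct=10):
    """Decide whether pre-copy has stalled enough to switch to post-copy.

    iteration: number of memory copy iterations completed so far
    current_remaining / previous_remaining: bytes left now and one poll ago
    min_progress_pct: assumed threshold for "still making progress"
    """
    # The first pass copies all memory, so it says nothing about convergence.
    if iteration <= 1 or previous_remaining <= 0:
        return False
    progress_pct = ((previous_remaining - current_remaining) * 100.0
                    / previous_remaining)
    return progress_pct < min_progress_pct


print(should_switch_to_postcopy(2, 950, 1000))  # only 5% progress
print(should_switch_to_postcopy(2, 500, 1000))  # 50% progress
```

The trade-off is the one described above: switching early shortens the migration but exposes the guest to a failure of either host until all pages have arrived.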
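In addition, the Libvirt Live Migration control model that Nova uses is "direct client control", so Nova, acting as the Libvirt client, has to poll libvirtd for the status of the data migration and use it as the basis for controlling the migration. Nova therefore also needs to implement its own mechanism for monitoring the data migration.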
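A polling monitor of this kind needs, at the very least, a rule for giving up. The sketch below assumes two timeouts, one for "no progress" and one for "total time", and replays a list of samples instead of talking to libvirtd. The names `should_abort` and `monitor` and the sample data are illustrative only.

```python
def should_abort(now, progress_time, progress_timeout,
                 elapsed, completion_timeout):
    """Return True when either timeout has expired (0 disables a timeout)."""
    if progress_timeout != 0 and (now - progress_time) > progress_timeout:
        return True
    if completion_timeout != 0 and elapsed > completion_timeout:
        return True
    return False


def monitor(samples, progress_timeout, completion_timeout):
    """Replay (timestamp_secs, data_remaining_bytes) samples from a poller."""
    start = None
    watermark = None       # lowest data_remaining seen so far
    progress_time = None   # when the watermark last dropped
    for now, remaining in samples:
        if start is None:
            start = progress_time = now
        if remaining == 0:
            return "completed"
        if watermark is None or remaining < watermark:
            watermark, progress_time = remaining, now
        if should_abort(now, progress_time, progress_timeout,
                        now - start, completion_timeout):
            return "aborted"
    return "running"


print(monitor([(0, 100), (1, 50), (2, 0)], 150, 800))
print(monitor([(0, 100), (10, 100), (200, 100)], 150, 0))
```

The first run finishes normally; the second never lowers the watermark and is aborted once 150 seconds pass without progress.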
In short, Nova's implementation of Libvirt Live Migration comes down to two main parts.
# nova/api/openstack/compute/migrate_server.py

def _migrate_live(self, req, id, body):
    """Permit admins to (live) migrate a server to a new host."""
    ...
    # Whether to perform a block migration
    block_migration = body["os-migrateLive"]["block_migration"]
    ...
    # Whether to execute asynchronously
    async = api_version_request.is_supported(req, min_version='2.34')
    ...
    # Whether to force the migration
    force = self._get_force_param_for_live_migration(body, host)
    ...
    # Whether disk over-commit is allowed
    disk_over_commit = body["os-migrateLive"]["disk_over_commit"]
    ...
    self.compute_api.live_migrate(context, instance, block_migration,
                                  disk_over_commit, host, force, async)
    ...


# nova/nova/compute/api.py

def live_migrate(self, context, instance, block_migration,
                 disk_over_commit, host_name, force=None, async=False):
    """Migrate a server lively to a new host."""
    ...
    # NOTE(sbauza): Force is a boolean by the new related API version
    if force is False and host_name:
        ...
        # Not forced: set the requested destination host
        destination = objects.Destination(
            host=target.host,
            node=target.hypervisor_hostname
        )
        request_spec.requested_destination = destination
    ...
    self.compute_task_api.live_migrate_instance(context, instance,
        host_name, block_migration=block_migration,
        disk_over_commit=disk_over_commit,
        request_spec=request_spec, async=async)


# nova/nova/conductor/manager.py

def _live_migrate(self, context, instance, scheduler_hint,
                  block_migration, disk_over_commit, request_spec):
    # Get the destination host
    destination = scheduler_hint.get("host")
    ...
    task = self._build_live_migrate_task(context, instance, destination,
                                         block_migration, disk_over_commit,
                                         migration, request_spec)
    ...
    task.execute()
    ...


# nova/nova/conductor/tasks/live_migrate.py

class LiveMigrationTask(base.TaskBase):
    ...
    def _execute(self):
        # Check that the virtual machine is running normally
        self._check_instance_is_active()
        # Check that the compute service on the source host is up
        self._check_host_is_up(self.source)

        # If no destination host was specified, let the scheduler pick one
        if not self.destination:
            self.destination = self._find_destination()
            self.migration.dest_compute = self.destination
            self.migration.save()
        else:
            # Check that destination and source are not the same host
            # Check that the compute service on the destination is up
            # Check that the destination has enough free memory
            # Check that destination and source use the same hypervisor
            # Check that the destination can accept a live migration
            self._check_requested_destination()

        # TODO(johngarbutt) need to move complexity out of compute manager
        # TODO(johngarbutt) disk_over_commit?
        return self.compute_rpcapi.live_migration(self.context,
                host=self.source,
                instance=self.instance,
                dest=self.destination,
                block_migration=self.block_migration,
                migration=self.migration,
                migrate_data=self.migrate_data)


# nova/compute/manager.py

def live_migration(self, context, dest, instance, block_migration,
                   migration, migrate_data):
    ...
    # Set the migration status to 'queued'
    self._set_migration_status(migration, 'queued')

    def dispatch_live_migration(*args, **kwargs):
        with self._live_migration_semaphore:
            self._do_live_migration(*args, **kwargs)

    # Spawn a queued live migration task
    utils.spawn_n(dispatch_live_migration,
                  context, dest, instance,
                  block_migration, migration,
                  migrate_data)


def _do_live_migration(self, context, dest, instance, block_migration,
                       migration, migrate_data):
    ...
    # Set the migration status to 'preparing'
    self._set_migration_status(migration, 'preparing')

    got_migrate_data_object = isinstance(migrate_data,
                                         migrate_data_obj.LiveMigrateData)
    if not got_migrate_data_object:
        migrate_data = \
            migrate_data_obj.LiveMigrateData.detect_implementation(
                migrate_data)

    try:
        if ('block_migration' in migrate_data and
                migrate_data.block_migration):
            # Block migration: read the local disk file information
            # recorded in disk.info
            block_device_info = self._get_instance_block_device_info(
                context, instance)
            disk = self.driver.get_instance_disk_info(
                instance, block_device_info=block_device_info)
        else:
            disk = None

        # Ask the destination host to prepare for the live migration
        migrate_data = self.compute_rpcapi.pre_live_migration(
            context, instance,
            block_migration, disk, dest, migrate_data)
        ...
    # Set the migration status to 'running'
    self._set_migration_status(migration, 'running')
    ...
    self.driver.live_migration(context, instance, dest,
                               self._post_live_migration,
                               self._rollback_live_migration,
                               block_migration, migrate_data)
    ...


# nova/nova/virt/libvirt/driver.py

def _live_migration(self, context, instance, dest, post_method,
                    recover_method, block_migration,
                    migrate_data):
    ...
    # nova.virt.libvirt.guest.Guest object
    guest = self._host.get_guest(instance)

    disk_paths = []
    device_names = []
    if migrate_data.block_migration:
        # Block migration: get the paths of the local disk files.
        # Without block migration only the memory data is migrated.
        # e.g. /var/lib/nova/instances/bf6824e9-1dac-466c-ab53-69f82d8adf73/disk
        disk_paths, device_names = self._live_migration_copy_disk_paths(
            context, instance, guest)

    # Spawn the function that performs the live migration
    opthread = utils.spawn(self._live_migration_operation,
                           context, instance, dest,
                           block_migration,
                           migrate_data, guest,
                           device_names)
    ...
    # Monitor the progress of the data migration in libvirtd
    self._live_migration_monitor(context, instance, guest, dest,
                                 post_method, recover_method,
                                 block_migration, migrate_data,
                                 finish_event, disk_paths)
    ...


def _live_migration_operation(self, context, instance, dest,
                              block_migration, migrate_data, guest,
                              device_names):
    ...
    # Get the live migration URI
    migrate_uri = None
    if ('target_connect_addr' in migrate_data and
            migrate_data.target_connect_addr is not None):
        dest = migrate_data.target_connect_addr
        if (migration_flags &
                libvirt.VIR_MIGRATE_TUNNELLED == 0):
            migrate_uri = self._migrate_uri(dest)

    # Get the GuestOS XML
    new_xml_str = None
    params = None
    if (self._host.is_migratable_xml_flag() and (
            listen_addrs or migrate_data.bdms)):
        new_xml_str = libvirt_migrate.get_updated_guest_xml(
            # TODO(sahid): It's not a really well idea to pass
            # the method _get_volume_config and we should to find
            # a way to avoid this in future.
            guest, migrate_data, self._get_volume_config)
    ...
    # Call the wrapper around libvirt.virDomain.migrate to send the
    # Live Migration command to libvirtd
    guest.migrate(self._live_migration_uri(dest),
                  migrate_uri=migrate_uri,
                  flags=migration_flags,
                  params=params,
                  domain_xml=new_xml_str,
                  bandwidth=CONF.libvirt.live_migration_bandwidth)
    ...
The prototype of the migration function in the Libvirt Python client is libvirt.virDomain.migrate.
migrate(self, dconn, flags, dname, uri, bandwidth) method of libvirt.virDomain instance
    Migrate the domain object from its current host to the destination host given by dconn (a connection to the destination host).
The Nova Libvirt driver wraps libvirt.virDomain.migrate:
# nova/virt/libvirt/guest.py

def migrate(self, destination, migrate_uri=None, params=None, flags=0,
            domain_xml=None, bandwidth=0):
    """Migrate guest object from its current host to the destination
    """
    if domain_xml is None:
        self._domain.migrateToURI(
            destination, flags=flags, bandwidth=bandwidth)
    else:
        if params:
            ...
            if migrate_uri:
                # In migrateToURI3 this parameter is searched in
                # the `params` dict
                params['migrate_uri'] = migrate_uri
            params['bandwidth'] = bandwidth
            self._domain.migrateToURI3(
                destination, params=params, flags=flags)
        else:
            self._domain.migrateToURI2(
                destination, miguri=migrate_uri, dxml=domain_xml,
                flags=flags, bandwidth=bandwidth)
The details of a Libvirt migration are configured through flags.
These flags are defined by the nova.conf option live_migration_flag, e.g.
live_migration_flag=VIR_MIGRATE_UNDEFINE_SOURCE, VIR_MIGRATE_PEER2PEER, VIR_MIGRATE_LIVE, VIR_MIGRATE_TUNNELLED
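A flag list like the one above ends up as a single integer bitmask handed to libvirtd. Here is a minimal sketch of that conversion. The numeric values are copied by hand from libvirt's virDomainMigrateFlags enum so that the example runs without the libvirt module; real code should use the constants from `libvirt` itself. The function name `parse_migration_flags` is hypothetical.

```python
# Bit values as defined in libvirt's virDomainMigrateFlags enum
# (hand-copied subset; prefer libvirt.VIR_MIGRATE_* in real code).
MIGRATE_FLAGS = {
    "VIR_MIGRATE_LIVE": 1,
    "VIR_MIGRATE_PEER2PEER": 2,
    "VIR_MIGRATE_TUNNELLED": 4,
    "VIR_MIGRATE_PERSIST_DEST": 8,
    "VIR_MIGRATE_UNDEFINE_SOURCE": 16,
}


def parse_migration_flags(value):
    """Turn a comma-separated flag list into one bitmask."""
    mask = 0
    for name in value.split(","):
        name = name.strip()
        if name:
            mask |= MIGRATE_FLAGS[name]
    return mask


flags = parse_migration_flags(
    "VIR_MIGRATE_UNDEFINE_SOURCE, VIR_MIGRATE_PEER2PEER, "
    "VIR_MIGRATE_LIVE, VIR_MIGRATE_TUNNELLED")
print(flags)
# Testing one bit works the same way as the TUNNELLED check in the driver.
print(bool(flags & MIGRATE_FLAGS["VIR_MIGRATE_TUNNELLED"]))
```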
# nova/nova/virt/libvirt/driver.py

def _live_migration_monitor(self, context, instance, guest,
                            dest, post_method,
                            recover_method, block_migration,
                            migrate_data, finish_event,
                            disk_paths):
    # Get the total amount of data to be live migrated, covering RAM and
    # local disk files
    # data_gb: total GB of RAM and disk to transfer
    data_gb = self._live_migration_data_gb(instance, disk_paths)

    # e.g. downtime_steps = [(0, 46), (300, 47), (600, 48), (900, 51),
    #                        (1200, 57), (1500, 66), (1800, 84),
    #                        (2100, 117), (2400, 179), (2700, 291),
    #                        (3000, 500)]
    # downtime_steps is produced by an algorithm whose inputs are:
    #   data_gb
    #   CONF.libvirt.live_migration_downtime
    #   CONF.libvirt.live_migration_downtime_steps
    #   CONF.libvirt.live_migration_downtime_delay
    # Meaning of downtime_steps:
    #   Each tuple is one step; the max downtime is handed to libvirtd
    #   in steps + 1 installments.
    #   (delay, downtime) means: (elapsed time at which this step
    #   applies, max downtime value to set).
    #   The last tuple is (delay * data_gb * steps,
    #   CONF.libvirt.live_migration_downtime).
    #   If the downtime computed by libvirtd in an iteration is within
    #   the max downtime currently set, the exit condition is met.
    # NOTE: the max downtime grows with every step until it reaches the
    # maximum tolerable downtime actually set by the user. Nova keeps
    # probing for the smallest max downtime that works, so that the
    # migration can exit as early as possible.
    downtime_steps = list(self._migration_downtime_steps(data_gb))
    ...
    # Number of polls
    n = 0
    # Time at which monitoring started
    start = time.time()
    progress_time = start
    # progress_watermark records the smallest amount of remaining data
    # seen so far; while data is being migrated, this watermark keeps
    # going down
    progress_watermark = None
    # Whether the Post-Copy mode is enabled
    is_post_copy_enabled = self._is_post_copy_enabled(migration_flags)

    while True:
        # Get the information of the Live Migration job
        info = guest.get_job_info()
        ...
        elif info.type == libvirt.VIR_DOMAIN_JOB_UNBOUNDED:
            # Migration is still running
            #
            # This is where we wire up calls to change live
            # migration status. eg change max downtime, cancel
            # the operation, change max bandwidth
            libvirt_migrate.run_tasks(guest, instance,
                                      self.active_migrations,
                                      on_migration_failure,
                                      migration,
                                      is_post_copy_enabled)

            now = time.time()
            elapsed = now - start

            if ((progress_watermark is None) or
                (progress_watermark == 0) or
                (progress_watermark > info.data_remaining)):
                progress_watermark = info.data_remaining
                progress_time = now

            # progress_timeout guards against a data migration that is
            # stuck because libvirtd misbehaves: once no progress has
            # been made for this long, the migration is aborted
            progress_timeout = CONF.libvirt.live_migration_progress_timeout
            # completion_timeout keeps libvirtd from staying in the
            # migrating state for too long, for example because network
            # bandwidth is too low, which could congest the management
            # network. It is counted from the first poll; if the
            # migration has not completed when it expires, the migration
            # is aborted
            completion_timeout = int(
                CONF.libvirt.live_migration_completion_timeout * data_gb)

            # Decide whether the migration should be aborted
            if libvirt_migrate.should_abort(instance, now, progress_time,
                                            progress_timeout, elapsed,
                                            completion_timeout,
                                            migration.status):
                try:
                    guest.abort_job()
                except libvirt.libvirtError as e:
                    LOG.warning(_LW("Failed to abort migration %s"),
                                e, instance=instance)
                    self._clear_empty_migration(instance)
                    raise

            # Decide whether to switch to Post-Copy mode
            if (is_post_copy_enabled and
                libvirt_migrate.should_switch_to_postcopy(
                    info.memory_iteration, info.data_remaining,
                    previous_data_remaining, migration.status)):
                # Switch to Post-Copy
                libvirt_migrate.trigger_postcopy_switch(guest,
                                                        instance,
                                                        migration)
            previous_data_remaining = info.data_remaining

            # Hand the next max downtime step to libvirtd when it is due
            curdowntime = libvirt_migrate.update_downtime(
                guest, instance, curdowntime,
                downtime_steps, elapsed)

            if (n % 10) == 0:
                remaining = 100
                if info.memory_total != 0:
                    # Compute the percentage of memory still to migrate
                    remaining = round(info.memory_remaining *
                                      100 / info.memory_total)

                libvirt_migrate.save_stats(instance, migration,
                                           info, remaining)

                # Log at info level every 60 polls
                # Log at debug level every 10 polls
                lg = LOG.debug
                if (n % 60) == 0:
                    lg = LOG.info

                # Log the seconds elapsed, the memory remaining and the
                # migration progress
                lg(_LI("Migration running for %(secs)d secs, "
                       "memory %(remaining)d%% remaining; "
                       "(bytes processed=%(processed_memory)d, "
                       "remaining=%(remaining_memory)d, "
                       "total=%(total_memory)d)"),
                   {"secs": n / 2, "remaining": remaining,
                    "processed_memory": info.memory_processed,
                    "remaining_memory": info.memory_remaining,
                    "total_memory": info.memory_total},
                   instance=instance)
                if info.data_remaining > progress_watermark:
                    lg(_LI("Data remaining %(remaining)d bytes, "
                           "low watermark %(watermark)d bytes "
                           "%(last)d seconds ago"),
                       {"remaining": info.data_remaining,
                        "watermark": progress_watermark,
                        "last": (now - progress_time)},
                       instance=instance)

            n = n + 1
        # Migration completed
        elif info.type == libvirt.VIR_DOMAIN_JOB_COMPLETED:
            # Migration is all done
            LOG.info(_LI("Migration operation has completed"),
                     instance=instance)
            post_method(context, instance, dest, block_migration,
                        migrate_data)
            break
        # Migration failed
        elif info.type == libvirt.VIR_DOMAIN_JOB_FAILED:
            # Migration did not succeed
            LOG.error(_LE("Migration operation has aborted"),
                      instance=instance)
            libvirt_migrate.run_recover_tasks(self._host, guest, instance,
                                              on_migration_failure)
            recover_method(context, instance, dest, block_migration,
                           migrate_data)
            break
        # Migration cancelled
        elif info.type == libvirt.VIR_DOMAIN_JOB_CANCELLED:
            # Migration was stopped by admin
            LOG.warning(_LW("Migration operation was cancelled"),
                        instance=instance)
            libvirt_migrate.run_recover_tasks(self._host, guest, instance,
                                              on_migration_failure)
            recover_method(context, instance, dest, block_migration,
                           migrate_data, migration_status='cancelled')
            break
        else:
            LOG.warning(_LW("Unexpected migration job type: %d"),
                        info.type, instance=instance)

        time.sleep(0.5)
    self._clear_empty_migration(instance)


def _live_migration_data_gb(self, instance, disk_paths):
    '''Calculate total amount of data to be transferred

    :param instance: the nova.objects.Instance being migrated
    :param disk_paths: list of disk paths that are being migrated
    with instance

    Calculates the total amount of data that needs to be
    transferred during the live migration. The actual
    amount copied will be larger than this, due to the
    guest OS continuing to dirty RAM while the migration
    is taking place. So this value represents the minimal
    data size possible.

    :returns: data size to be copied in GB
    '''
    ram_gb = instance.flavor.memory_mb * units.Mi / units.Gi
    if ram_gb < 2:
        ram_gb = 2

    disk_gb = 0
    for path in disk_paths:
        try:
            size = os.stat(path).st_size
            size_gb = (size / units.Gi)
            if size_gb < 2:
                size_gb = 2
            disk_gb += size_gb
        except OSError as e:
            LOG.warning(_LW("Unable to stat %(disk)s: %(ex)s"),
                        {'disk': path, 'ex': e})
            # Ignore error since we don't want to break
            # the migration monitoring thread operation

    # Return the total amount of RAM and disk data
    return ram_gb + disk_gb


def _migration_downtime_steps(data_gb):
    '''Calculate downtime value steps and time between increases.

    :param data_gb: total GB of RAM and disk to transfer

    This looks at the total downtime steps and upper bound
    downtime value and uses an exponential backoff. So initially
    max downtime is increased by small amounts, and as time goes
    by it is increased by ever larger amounts

    For example, with 10 steps, 30 second step delay, 3 GB
    of RAM and 400ms target maximum downtime, the downtime will
    be increased every 90 seconds in the following progression:

    -   0 seconds -> set downtime to  37ms
    -  90 seconds -> set downtime to  38ms
    - 180 seconds -> set downtime to  39ms
    - 270 seconds -> set downtime to  42ms
    - 360 seconds -> set downtime to  46ms
    - 450 seconds -> set downtime to  55ms
    - 540 seconds -> set downtime to  70ms
    - 630 seconds -> set downtime to  98ms
    - 720 seconds -> set downtime to 148ms
    - 810 seconds -> set downtime to 238ms
    - 900 seconds -> set downtime to 400ms

    This allows the guest a good chance to complete migration
    with a small downtime value.
    '''
    # Configuration options control the details of the Live Migration
    downtime = CONF.libvirt.live_migration_downtime
    steps = CONF.libvirt.live_migration_downtime_steps
    delay = CONF.libvirt.live_migration_downtime_delay

    # TODO(hieulq): Need to move min/max value into the config option,
    # currently oslo_config will raise ValueError instead of setting
    # option value to its min/max.
    if downtime < nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_MIN:
        downtime = nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_MIN
    if steps < nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_STEPS_MIN:
        steps = nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_STEPS_MIN
    if delay < nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_DELAY_MIN:
        delay = nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_DELAY_MIN
    delay = int(delay * data_gb)

    offset = downtime / float(steps + 1)
    base = (downtime - offset) ** (1 / float(steps))

    for i in range(steps + 1):
        yield (int(delay * i), int(offset + base ** i))


# nova/nova/virt/libvirt/migration.py

def update_downtime(guest, instance,
                    olddowntime,
                    downtime_steps, elapsed):
    """Update max downtime if needed

    :param guest: a nova.virt.libvirt.guest.Guest to set downtime for
    :param instance: a nova.objects.Instance
    :param olddowntime: current set downtime, or None
    :param downtime_steps: list of downtime steps
    :param elapsed: total time of migration in secs

    Determine if the maximum downtime needs to be increased
    based on the downtime steps. Each element in the downtime
    steps list should be a 2 element tuple. The first element
    contains a time marker and the second element contains
    the downtime value to set when the marker is hit.

    The guest object will be used to change the current
    downtime value on the instance.

    Any errors hit when updating downtime will be ignored

    :returns: the new downtime value
    """
    LOG.debug("Current %(dt)s elapsed %(elapsed)d steps %(steps)s",
              {"dt": olddowntime, "elapsed": elapsed,
               "steps": downtime_steps}, instance=instance)
    thisstep = None
    for step in downtime_steps:
        # elapsed is how long the migration has been running
        if elapsed > step[0]:
            # The last step whose time marker has already been passed
            # is the current step
            thisstep = step

    if thisstep is None:
        LOG.debug("No current step", instance=instance)
        return olddowntime

    if thisstep[1] == olddowntime:
        LOG.debug("Downtime does not need to change",
                  instance=instance)
        return olddowntime

    LOG.info(_LI("Increasing downtime to %(downtime)d ms "
                 "after %(waittime)d sec elapsed time"),
             {"downtime": thisstep[1],
              "waittime": thisstep[0]},
             instance=instance)

    try:
        # Pass the current max downtime to libvirtd
        guest.migrate_configure_max_downtime(thisstep[1])
    except libvirt.libvirtError as e:
        LOG.warning(_LW("Unable to increase max downtime to %(time)d"
                        "ms: %(e)s"),
                    {"time": thisstep[1], "e": e}, instance=instance)
    return thisstep[1]
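The downtime step calculation is easier to follow when it can be run on its own. The sketch below restates the formula from `_migration_downtime_steps` with the configuration values passed in as plain arguments, plus the step selection loop from `update_downtime`. The argument values (500 ms, 10 steps, a delay of 75 and 4 GB of data) are assumptions picked so that the result lines up with the example list in the comments above; `current_step` is a hypothetical helper name.

```python
def downtime_steps(data_gb, downtime=500, steps=10, delay=75):
    """Yield (time_marker_secs, max_downtime_ms) tuples.

    Same arithmetic as the listing above: the time markers are evenly
    spaced, while the downtime values grow exponentially up to `downtime`.
    """
    delay = int(delay * data_gb)
    offset = downtime / float(steps + 1)
    base = (downtime - offset) ** (1 / float(steps))
    for i in range(steps + 1):
        yield (int(delay * i), int(offset + base ** i))


def current_step(elapsed, steps):
    """Return the last step whose time marker has been passed, or None."""
    chosen = None
    for step in steps:
        if elapsed > step[0]:
            chosen = step
    return chosen


schedule = list(downtime_steps(data_gb=4))
for marker, value in schedule:
    print("after %4d s -> max downtime %3d ms" % (marker, value))

# 650 seconds into the migration, the step with marker 600 is in force.
print(current_step(650, schedule))
```

Because the early steps differ by only a millisecond or two, a guest that converges quickly is cut over with a downtime far below the configured maximum, and only a stubborn guest ever gets the full budget.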
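In "OpenStack Virtual Machine Cold/Hot Migration: Practice and Workflow Analysis" we tried migrating a virtual machine with NUMA affinity and CPU pinning, and the virtual machine kept those properties after the migration. Here we run a more extreme test: migrating a virtual machine with NUMA affinity and dedicated CPU pinning to a destination host whose NUMA and CPU resources are already exhausted.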
[stack@undercloud (overcloudrc) ~]$ openstack server show VM1
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | AUTO |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | overcloud-ovscompute-1.localdomain |
| OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-ovscompute-1.localdomain |
| OS-EXT-SRV-ATTR:instance_name | instance-000000d6 |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2019-03-20T10:45:55.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | net1=10.0.1.11, 10.0.1.8, 10.0.1.16, 10.0.1.10, 10.0.1.18, 10.0.1.19 |
| config_drive | |
| created | 2019-03-20T10:44:52Z |
| flavor | Flavor1 (2ff09ec5-19e4-40b9-a52e-6026652c0788) |
| hostId | 9f1230901ddf3fe0e1a41e1c650a784c122b791f89fdf66a40cff3d6 |
| id | a17ddcbf-d936-4c77-9ea6-2e684c41cc39 |
| image | CentOS-7-x86_64-GenericCloud (0aff2888-47f8-4133-928a-9c54414b3afb) |
| key_name | stack |
| name | VM1 |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| project_id | a6c78435075246f3aa5ab946b87086c5 |
| properties | |
| security_groups | [{u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}] |
| status | ACTIVE |
| updated | 2019-03-20T10:45:56Z |
| user_id | 4fe574569664493bbd660abfe762a630 |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+

[stack@undercloud (overcloudrc) ~]$ openstack server migrate --block-migration --live overcloud-ovscompute-0.localdomain --wait VM1
Complete

[stack@undercloud (overcloudrc) ~]$ openstack server show VM1
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | AUTO |
| OS-EXT-AZ:availability_zone | ovs |
| OS-EXT-SRV-ATTR:host | overcloud-ovscompute-0.localdomain |
| OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-ovscompute-0.localdomain |
| OS-EXT-SRV-ATTR:instance_name | instance-000000d6 |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2019-03-20T10:45:55.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | net1=10.0.1.11, 10.0.1.8, 10.0.1.16, 10.0.1.10, 10.0.1.18, 10.0.1.19 |
| config_drive | |
| created | 2019-03-20T10:44:52Z |
| flavor | Flavor1 (2ff09ec5-19e4-40b9-a52e-6026652c0788) |
| hostId | 0f2ec590cd73fe0e9522f1ba715dae7a7d4b884e15aa8254defe85d0 |
| id | a17ddcbf-d936-4c77-9ea6-2e684c41cc39 |
| image | CentOS-7-x86_64-GenericCloud (0aff2888-47f8-4133-928a-9c54414b3afb) |
| key_name | stack |
| name | VM1 |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| project_id | a6c78435075246f3aa5ab946b87086c5 |
| properties | |
| security_groups | [{u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}] |
| status | ACTIVE |
| updated | 2019-03-20T10:51:47Z |
| user_id | 4fe574569664493bbd660abfe762a630 |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
Error raised during the migration:
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager [req-566373ae-5282-4378-9678-d8d08e121cdb - - - - -] Error updating resources for node overcloud-ovscompute-0.localdomain.
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager Traceback (most recent call last):
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6590, in update_available_resource_for_node
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     rt.update_available_resource(context)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 536, in update_available_resource
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     self._update_available_resource(context, resources)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     return f(*args, **kwargs)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 896, in _update_available_resource
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     self._update_usage_from_instances(context, instances)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 1393, in _update_usage_from_instances
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     self._update_usage_from_instance(context, instance)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 1273, in _update_usage_from_instance
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     sign, is_periodic)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 1119, in _update_usage
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     self.compute_node, usage, free)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1574, in get_host_numa_usage_from_instance
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     host_numa_topology, instance_numa_topology, free=free))
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1447, in numa_usage_from_instances
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     newcell.pin_cpus(pinned_cpus)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/objects/numa.py", line 86, in pin_cpus
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     self.pinned_cpus))
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager CPUPinningInvalid: CPU set to pin [0, 1] must be a subset of free CPU set [8]
NUMA affinity and CPU pinning after the migration:
# The migrated virtual machine
[root@overcloud-ovscompute-0 ~]# virsh vcpupin instance-000000d6
VCPU: CPU Affinity
----------------------------------
   0: 0
   1: 1

# The virtual machine that already existed on the host
[root@overcloud-ovscompute-0 ~]# virsh vcpupin instance-000000d0
VCPU: CPU Affinity
----------------------------------
   0: 0
   1: 1
   2: 2
   3: 3
   4: 4
   5: 5
   6: 6
   7: 7
Excerpt from the XML of the migrated virtual machine:
<cpu mode='custom' match='exact' check='full'>
  <model fallback='forbid'>IvyBridge</model>
  <topology sockets='1' cores='2' threads='1'/>
  <feature policy='require' name='hypervisor'/>
  <feature policy='require' name='arat'/>
  <feature policy='require' name='xsaveopt'/>
  <numa>
    <cell id='0' cpus='0-1' memory='1048576' unit='KiB'/>
  </numa>
</cpu>
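Conclusion: the virtual machine migrates successfully and keeps its original NUMA and CPU settings. The dedicated CPU policy is a Nova-level concept, yet the code analysis above shows that Nova's live migration path is completely NUMA-unaware. The hypervisor layer cares even less about these parameters. It follows the XML description faithfully: if the XML says to use pCPUs 0 and 1, the hypervisor places the guest there even when pCPUs 0 and 1 are already occupied by another virtual machine. From Nova's point of view this is of course a bug, and someone in the community has already described the problem and proposed a blueprint: "NUMA-aware live migration".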
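The `CPUPinningInvalid` error in the log comes from a subset check that runs when the resource tracker recomputes host usage, that is, after the guest has already landed. The following is a minimal sketch of such a check. The class `NUMACell` here is a simplified stand-in written for this example, not Nova's object; the host layout (pCPUs 0 to 8, with 0 to 7 already pinned) is inferred from the error message and the `virsh vcpupin` output above.

```python
class CPUPinningInvalid(Exception):
    """Raised when the requested pCPUs are not all free."""


class NUMACell(object):
    """Simplified stand-in for a host NUMA cell that tracks pinned pCPUs."""

    def __init__(self, cpuset, pinned_cpus=()):
        self.cpuset = set(cpuset)
        self.pinned_cpus = set(pinned_cpus)

    @property
    def free_cpus(self):
        return self.cpuset - self.pinned_cpus

    def pin_cpus(self, cpus):
        cpus = set(cpus)
        # Every requested pCPU must still be free on this cell.
        if not cpus <= self.free_cpus:
            raise CPUPinningInvalid(
                "CPU set to pin %s must be a subset of free CPU set %s"
                % (sorted(cpus), sorted(self.free_cpus)))
        self.pinned_cpus |= cpus


# The existing guest holds pCPUs 0-7, so only pCPU 8 is free.
cell = NUMACell(cpuset=range(9), pinned_cpus=range(8))
try:
    cell.pin_cpus([0, 1])  # what the migrated guest's XML asks for
except CPUPinningInvalid as exc:
    print(exc)
```

Running a check like this before the migration starts, rather than in the periodic resource update, is essentially what a NUMA-aware live migration would have to add.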
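As for SR-IOV, the official Nova documentation states explicitly that Live Migration of SR-IOV virtual machines is not supported. As I analyzed in "Enabling SR-IOV to Resolve Neutron Network I/O Performance Bottlenecks", an SR-IOV VF device is, for a KVM virtual machine, just a block of XML, e.g.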
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address bus='0x81' slot='0x10' function='0x2'/>
  </source>
</hostdev>
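As long as a VF device matching this XML block can be found on the destination compute node, the SR-IOV NIC can be migrated. The catch is that, in principle, the XML of a virtual machine under Live Migration should not be modified. In practice, changing a single VF element is probably harmless; what matters is having a rollback plan for failed migrations and making Nova SR-IOV-aware (able to detect and manage these devices). Writing this, I hope more and more that OpenStack Placement matures quickly, because Nova's "black box" management of resources such as NUMA and SR-IOV is painful.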
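The analysis of the principles and code behind OpenStack virtual machine cold and hot migration shows that Nova merely wraps and orchestrates traditional migration methods and the migration features of the underlying hypervisor software, so that cold and hot migration can meet the requirements of an enterprise cloud platform. As with other OpenStack projects, the main technical value lies in the underlying technology.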