From nova's point of view, deploying a physical machine and deploying a virtual machine are the same concept: both are triggered through nova's instance-creation path, and only the underlying nova-scheduler and nova-compute drivers differ. For virtual machines the underlying driver uses libvirt virtualization; for physical machines Ironic is used. Ironic can be regarded as a collection of hypervisor APIs whose role is similar to libvirt's.
Below we introduce each concept in turn:
Why is initrd needed?
In early Linux systems, only floppy disks or hard disks were used as the root filesystem, so it was easy to build the drivers for these devices into the kernel. Today, however, the root filesystem may live on many kinds of storage devices, including SCSI, SATA, and USB drives. Compiling all of these drivers into the kernel is inconvenient and goes against the spirit of a minimal "kernel". Linux's module auto-loading mechanism can use udevd to load kernel modules automatically, so we would like the root filesystem's device drivers to be auto-loaded as well. But there is a contradiction: udevd is an executable file, and it cannot run before the root filesystem is mounted; yet if udevd is not running, the drivers for the root filesystem's device cannot be auto-loaded, and the corresponding device nodes cannot be created under /dev.
To resolve this contradiction, initrd (boot loader initialized RAM disk) was introduced. An initrd is a small compressed root directory tree containing the driver modules, executables, and startup scripts needed during the boot phase, including the udevd mentioned above. When the system boots, the boot loader reads the kernel and the initrd file into memory and passes the initrd's start address to the kernel. The kernel decompresses the initrd, mounts it as the root directory, and executes the /init script inside it. That script can start the udevd contained in the initrd, which auto-loads device drivers and creates the necessary device nodes under /dev. Once udevd has loaded the disk drivers, the real root filesystem can be mounted and the system switches to it.
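Concretely, an initrd/initramfs is just a (usually gzip-compressed) cpio archive in "newc" format. As a rough illustration of what the boot loader hands to the kernel, here is a minimal sketch that builds and lists such an archive using only the Python standard library (the file names and contents are invented for the example; real initrds carry many more attributes, symlinks, and directories):

```python
import gzip
import io

def _pad4(n):
    """Bytes of zero padding needed to reach the next 4-byte boundary."""
    return (4 - n % 4) % 4

def make_initrd(files):
    """Pack a {name: bytes} dict into a gzip-compressed 'newc' cpio archive."""
    buf = io.BytesIO()
    ino = 1

    def entry(name, data, mode):
        nonlocal ino
        name_bytes = name.encode() + b"\0"
        # 13 fields of 8 ASCII hex digits follow the "070701" magic:
        # ino, mode, uid, gid, nlink, mtime, filesize, devmajor,
        # devminor, rdevmajor, rdevminor, namesize, check
        fields = (ino, mode, 0, 0, 1, 0, len(data), 0, 0, 0, 0,
                  len(name_bytes), 0)
        buf.write(b"070701" + b"".join(b"%08X" % f for f in fields))
        buf.write(name_bytes + b"\0" * _pad4(110 + len(name_bytes)))
        buf.write(data + b"\0" * _pad4(len(data)))
        ino += 1

    for name, data in files.items():
        entry(name, data, 0o100755)
    entry("TRAILER!!!", b"", 0)  # archive terminator record
    return gzip.compress(buf.getvalue())

def list_initrd(blob):
    """Return the member names of a gzip-compressed newc cpio archive."""
    raw = gzip.decompress(blob)
    names, off = [], 0
    while True:
        assert raw[off:off + 6] == b"070701", "not a newc cpio archive"
        filesize = int(raw[off + 54:off + 62], 16)
        namesize = int(raw[off + 94:off + 102], 16)
        name = raw[off + 110:off + 110 + namesize - 1].decode()
        if name == "TRAILER!!!":
            return names
        names.append(name)
        off += 110 + namesize
        off += _pad4(off)   # the name field is padded to 4 bytes
        off += filesize
        off += _pad4(off)   # the file data is padded to 4 bytes
```

A real initrd produced by `mkinitramfs` or diskimage-builder can be unpacked the same way (`zcat initrd.img | cpio -id`); the sketch only shows the container format, not the boot logic inside it.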
Does Linux always need an initrd to boot?
If all required functionality is compiled directly into the kernel (not as modules), a single kernel file is enough. An initrd keeps the kernel image small while adding flexibility: if your kernel supports some filesystem (e.g. ext3 or UFS) only as a module, and the driver modules needed at boot live on such a filesystem, the kernel cannot read that filesystem by itself and can only load those modules from the initrd's virtual filesystem. Some will ask: if the kernel cannot read the filesystem at this point, how is the kernel file itself loaded into memory? The answer is simple: GRUB is file-system aware and can recognize common filesystems. The general installation flow is as follows:
The PXE protocol has a client side and a server side. The PXE client lives in the NIC's ROM; when the computer boots, the BIOS loads the PXE client into memory, runs it, and displays a menu. After the user makes a choice, the PXE client downloads the operating system stored on a remote server over the network and runs it locally.
The installation flow is as follows:
Flow summary:
The client broadcasts a DHCP request → the server answers and a connection is established → the DHCP and TFTP configuration supplies an IP address and the location of the boot loader → the client downloads the boot loader and runs it → the boot loader fetches the system image → the operating system is installed.
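To make the "location of the boot loader" step concrete: with PXELINUX, the TFTP server serves a pxelinux.cfg file naming the kernel and initrd the client should fetch next. The sketch below renders such a config from a simplified, hypothetical template (Ironic's real templates are Jinja2 files with many more options):

```python
# A deliberately minimal pxelinux.cfg template; field names are
# illustrative, not Ironic's actual template variables.
PXE_CONFIG_TEMPLATE = """default deploy

label deploy
kernel {kernel}
append initrd={ramdisk} ip=dhcp root=/dev/ram0 text
"""

def render_pxe_config(kernel, ramdisk):
    """Render a minimal pxelinux.cfg for the given TFTP-relative paths."""
    return PXE_CONFIG_TEMPLATE.format(kernel=kernel, ramdisk=ramdisk)

cfg = render_pxe_config("deploy_kernel", "deploy_ramdisk")
```

The PXE firmware downloads pxelinux.0 via TFTP, which then reads this config and fetches the listed kernel and ramdisk, completing the chain described above.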
Relevant file locations and contents:
Reference: the PXE network OS installation process
This figure is the official bare-metal deployment diagram for the Liberty release; the deployment process is described as follows:
Below we analyze Ironic's deployment flow through the code.
Modify the manager and driver in /etc/nova/nova.conf, for example:
```ini
[DEFAULT]
scheduler_host_manager = nova.scheduler.ironic_host_manager.IronicHostManager
compute_driver = nova.virt.ironic.driver.IronicDriver
compute_manager = ironic.nova.compute.manager.ClusteredComputeManager

[ironic]
admin_username = ironic
admin_password = unset
admin_url = http://127.0.0.1:35357/v2.0
admin_tenant_name = service
```
The compute_manager implementation lives in the ironic project.
Step 1: nova-api receives the nova boot request and passes it via the message queue to nova-scheduler.
Step 2: nova-scheduler receives the request and handles it in scheduler_host_manager. nova-scheduler uses the flavor's extra properties (extra_specs), such as cpu_arch, baremetal:deploy_kernel_id, and baremetal:deploy_ramdisk_id, as filter conditions to find a matching physical node, then sends an RPC message to nova-compute.
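As a rough sketch of that filtering (not the actual IronicHostManager code; the node and flavor fields are simplified and the node list is invented), matching a flavor's extra_specs against node properties might look like:

```python
def node_matches_flavor(node_props, extra_specs):
    """Return True if every capability-style extra_specs entry matches.

    Keys such as 'baremetal:deploy_kernel_id' only name the deploy
    ramdisk images, so they are skipped rather than compared against
    node properties.
    """
    for key, wanted in extra_specs.items():
        if key.startswith('baremetal:'):
            continue  # deploy image hints, not scheduling constraints
        if node_props.get(key) != wanted:
            return False
    return True

# Hypothetical registered nodes and flavor extra_specs:
nodes = [
    {'cpu_arch': 'x86_64', 'memory_mb': 65536},
    {'cpu_arch': 'aarch64', 'memory_mb': 32768},
]
specs = {'cpu_arch': 'x86_64',
         'baremetal:deploy_kernel_id': 'some-glance-uuid'}
matches = [n for n in nodes if node_matches_flavor(n, specs)]
```

Only nodes whose properties satisfy every constraint survive the filter; the scheduler then picks one of the survivors and RPCs nova-compute.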
Step 3: nova-compute picks up the message and calls the configured driver's spawn method to deploy, i.e. nova.virt.ironic.driver.IronicDriver.spawn(). What does this method do? Let's analyze the code (only the main calls are kept below).
```python
def spawn(self, context, instance, image_meta, injected_files,
          admin_password, network_info=None, block_device_info=None):
    # Get the image information
    image_meta = objects.ImageMeta.from_dict(image_meta)
    ...
    # Call Ironic's node.get to query the node details, lock the physical
    # machine, and get its flavor information
    node = self.ironicclient.call("node.get", node_uuid)
    flavor = instance.flavor
    # Copy baremetal:deploy_kernel_id and baremetal:deploy_ramdisk_id from
    # the flavor into driver_info; copy image_source, root_gb, swap_mb,
    # ephemeral_gb, ephemeral_format and preserve_ephemeral into
    # instance_info; then update both onto the Ironic node.
    self._add_driver_fields(node, instance, image_meta, flavor)
    ...
    # Validate that deployment can proceed: both the deploy and power
    # interfaces must be ready
    validate_chk = self.ironicclient.call("node.validate", node_uuid)
    ...
    # Prepare for deployment
    try:
        # Wire the node's virtual network interfaces to its physical NICs
        # and update them through the Ironic API so neutron can connect
        self._plug_vifs(node, instance, network_info)
        self._start_firewall(instance, network_info)
    except Exception:
        ...
    # Build the config drive
    configdrive_value = self._generate_configdrive(
        instance, node, network_info, extra_md=extra_md,
        files=injected_files)
    # Trigger the deploy request
    try:
        # Call the Ironic API to set provision_state to ACTIVE
        self.ironicclient.call("node.set_provision_state", node_uuid,
                               ironic_states.ACTIVE,
                               configdrive=configdrive_value)
    except Exception as e:
        ...
    # Wait until the node's provision_state becomes ACTIVE
    timer = loopingcall.FixedIntervalLoopingCall(self._wait_for_active,
                                                 self.ironicclient,
                                                 instance)
    try:
        timer.start(interval=CONF.ironic.api_retry_interval).wait()
    except Exception:
        ...
```
The steps in nova-compute's spawn are:
Step 4
ironic-api receives the provision_state request and returns 202 for the asynchronous call. So what does Ironic do next?
First, setting the Ironic node's provision_state to ACTIVE amounts to issuing PUT /v1/nodes/(node_uuid)/states/provision. In OpenStack's WSGI framework, the class registered as the app, ironic.api.app.VersionSelectorApplication, is Ironic's request-handling entry point, and PUT /v1/nodes/(node_uuid)/states/provision is handled by the provision method of ironic.api.controllers.v1.node.NodeStatesController.
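In other words, what nova's driver triggers is equivalent to the following HTTP request. The sketch only builds the method/URL/body without sending anything; authentication headers and the API endpoint are omitted, and the node UUID is made up:

```python
import json

def build_provision_request(api_url, node_uuid, target, configdrive=None):
    """Build (method, url, body) for an Ironic provision-state change."""
    body = {'target': target}
    if configdrive is not None:
        body['configdrive'] = configdrive
    url = '%s/v1/nodes/%s/states/provision' % (api_url, node_uuid)
    return 'PUT', url, json.dumps(body)

method, url, body = build_provision_request(
    'http://127.0.0.1:6385', '1be26c0b-03f2-4d2e-ae87-c02d7f33c123',
    'active')
```

Because the API answers 202 immediately, the caller must poll the node's provision_state afterwards, which is exactly what nova's `_wait_for_active` loop does.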
```python
@expose.expose(None, types.uuid_or_name, wtypes.text, wtypes.text,
               status_code=http_client.ACCEPTED)
def provision(self, node_ident, target, configdrive=None):
    ...
    if target == ir_states.ACTIVE:
        # RPC call to do_node_deploy
        pecan.request.rpcapi.do_node_deploy(pecan.request.context,
                                            rpc_node.uuid, False,
                                            configdrive, topic)
    ...
```
This makes an RPC call to ironic.conductor.manager.ConductorManager.do_node_deploy, which first checks the power and deploy information. The deploy check verifies that the node's properties satisfy the driver's requirements, including the boot settings, that the image is not larger than memory, and root-device parsing. After the checks, it calls ironic.conductor.manager.do_node_deploy.
```python
def do_node_deploy(task, conductor_id, configdrive=None):
    """Prepare the environment and deploy a node."""
    node = task.node
    ...
    try:
        try:
            if configdrive:
                _store_configdrive(node, configdrive)
        except exception.SwiftOperationError as e:
            with excutils.save_and_reraise_exception():
                handle_failure(
                    e, task,
                    _LE('Error while uploading the configdrive for '
                        '%(node)s to Swift'),
                    _('Failed to upload the configdrive to Swift. '
                      'Error: %s'))
        try:
            # Call the deploy interface's prepare method; the behavior
            # differs per driver:
            # 1. pxe_* drivers use iscsi_deploy.ISCSIDeploy.prepare, which
            #    calls pxe.PXEBoot.prepare_ramdisk() to set up the deploy
            #    ramdisk and environment: cache images, update DHCP,
            #    switch pxe_config, set_boot_device, etc.
            #    - cache images: fetch images from glance into the
            #      conductor's local cache
            #    - update DHCP: point the bootfile location at the conductor
            #    - switch pxe_config: set deploy mode to service mode
            #    - set_boot_device: set the node to PXE boot
            # 2. agent_* drivers generate the image's swift_tmp_url, add it
            #    to the node's instance_info, then call
            #    pxe.PXEBoot.prepare_ramdisk() to prepare the deploy image
            #    and environment
            task.driver.deploy.prepare(task)
        except Exception as e:
            ...
        try:
            # Call the driver's deploy method; again the behavior differs:
            # 1. pxe_* drivers call iscsi_deploy.ISCSIDeploy.deploy() to
            #    pull the user image, then reboot the physical machine
            # 2. agent_* drivers simply reboot
            new_state = task.driver.deploy.deploy(task)
        except Exception as e:
            ...
        # NOTE(deva): Some drivers may return states.DEPLOYWAIT
        # eg. if they are waiting for a callback
        if new_state == states.DEPLOYDONE:
            task.process_event('done')
        elif new_state == states.DEPLOYWAIT:
            task.process_event('wait')
    finally:
        node.save()
```
At this point ironic-conductor's work is done, and it waits for the physical machine to power on.
It is worth noting that task is a task_manager.TaskManager object, which sets self.driver at initialization time: self.driver = driver_factory.get_driver(driver_name or self.node.driver)
driver_name is a passed-in argument that defaults to empty; self.node.driver is the driver the physical machine uses, which may differ from machine to machine and is specified when the machine is registered.
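Conceptually, driver_factory.get_driver is a registry lookup keyed by the driver name stored on the node. Ironic actually loads drivers through stevedore entry points; the minimal sketch below substitutes a plain dict, and the driver classes are placeholders:

```python
class DriverFactory:
    """A toy stand-in for Ironic's stevedore-based driver factory."""
    _registry = {}

    @classmethod
    def register(cls, name):
        def wrapper(driver_cls):
            cls._registry[name] = driver_cls
            return driver_cls
        return wrapper

    @classmethod
    def get_driver(cls, name):
        # Instantiate the driver registered under this name
        return cls._registry[name]()

@DriverFactory.register('pxe_ipmitool')
class PXEIPMIToolDriver:
    deploy = 'iscsi_deploy.ISCSIDeploy'   # deploy interface used

@DriverFactory.register('agent_ipmitool')
class AgentIPMIToolDriver:
    deploy = 'agent.AgentDeploy'

# Each node records its own driver name at enrollment time:
driver = DriverFactory.get_driver('pxe_ipmitool')
```

This is why two nodes in the same deployment can take entirely different code paths in do_node_deploy: task.driver resolves per node.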
Step 5
The previous step configured the boot method and network information and powered the machine on; the next step is for the machine to boot and be deployed. Below we describe the PXE and agent deployment methods separately.
Case 1: deploying with a PXE driver
We know the generic OS installation flow: first the BIOS starts and selects the OS boot (installation) mode (at this point memory is empty); then, based on that mode, it locates the OS boot loader (different modes use different boot loaders, stored in different places); the boot loader loads the filesystem initialization image (initrd) and the initial kernel image (vmlinuz), completing the initialization that precedes OS installation; finally, the OS installs the relevant system and applications.
The PXE boot process is:
After booting, the init startup script runs. So what does this init script look like?
First, note that the deploy-ironic image is currently built with the diskimage-builder command; see the diskimage-builder/elements/deploy-ironic element. The most important piece is the init.d/80-deploy-ironic script, which essentially performs the following steps:
```shell
# Install the bootloader
function install_bootloader {
    # Much omitted here
    ...
}

# Send a request to the Ironic conductor, then open socket port 10000
# and wait for notification that the PXE phase is finished
function do_vendor_passthru_and_wait {
    local data=$1
    local vendor_passthru_name=$2
    eval curl -i -X POST \
        "$TOKEN_HEADER" \
        "-H 'Accept: application/json'" \
        "-H 'Content-Type: application/json'" \
        -d "$data" \
        "$IRONIC_API_URL/v1/nodes/$DEPLOYMENT_ID/vendor_passthru/$vendor_passthru_name"
    echo "Waiting for notice of complete"
    nc -l -p 10000
}

readonly IRONIC_API_URL=$(get_kernel_parameter ironic_api_url)
readonly IRONIC_BOOT_OPTION=$(get_kernel_parameter boot_option)
readonly IRONIC_BOOT_MODE=$(get_kernel_parameter boot_mode)
readonly ROOT_DEVICE=$(get_kernel_parameter root_device)

if [ -z "$ISCSI_TARGET_IQN" ]; then
    err_msg "iscsi_target_iqn is not defined"
    troubleshoot
fi

# Find the current machine's local disk
target_disk=
if [[ $ROOT_DEVICE ]]; then
    target_disk="$(get_root_device)"
else
    t=0
    while ! target_disk=$(find_disk "$DISK"); do
        if [ $t -eq 60 ]; then
            break
        fi
        t=$(($t + 1))
        sleep 1
    done
fi

if [ -z "$target_disk" ]; then
    err_msg "Could not find disk to use."
    troubleshoot
fi

# Export the local disk found above as an iSCSI target exposed to the
# Ironic conductor
echo "start iSCSI target on $target_disk"
start_iscsi_target "$ISCSI_TARGET_IQN" "$target_disk" ALL
if [ $? -ne 0 ]; then
    err_msg "Failed to start iscsi target."
    troubleshoot
fi

# Fetch the token file from the TFTP server; Ironic generated it
# during the prepare phase
if [ "$BOOT_METHOD" = "$VMEDIA_BOOT_TAG" ]; then
    TOKEN_FILE="$VMEDIA_DIR/token"
    if [ -f "$TOKEN_FILE" ]; then
        TOKEN_HEADER="-H 'X-Auth-Token: $(cat $TOKEN_FILE)'"
    else
        TOKEN_HEADER=""
    fi
else
    TOKEN_FILE=token-$DEPLOYMENT_ID
    # Allow multiple versions of the tftp client
    if tftp -r $TOKEN_FILE -g $BOOT_SERVER || tftp $BOOT_SERVER -c get $TOKEN_FILE; then
        TOKEN_HEADER="-H 'X-Auth-Token: $(cat $TOKEN_FILE)'"
    else
        TOKEN_HEADER=""
    fi
fi

# Ask Ironic to deploy the image: POST the node's
# /vendor_passthru/pass_deploy_info endpoint
echo "Requesting Ironic API to deploy image"
deploy_data="'{\"address\":\"$BOOT_IP_ADDRESS\",\"key\":\"$DEPLOYMENT_KEY\",\"iqn\":\"$ISCSI_TARGET_IQN\",\"error\":\"$FIRST_ERR_MSG\"}'"
do_vendor_passthru_and_wait "$deploy_data" "pass_deploy_info"

# Image download finished; stop the iSCSI target
echo "Stopping iSCSI target on $target_disk"
stop_iscsi_target

# If localboot is set, install a bootloader
if [ "$IRONIC_BOOT_OPTION" = "local" ]; then
    echo "Installing bootloader"
    error_msg=$(install_bootloader)
    if [ $? -eq 0 ]; then
        status=SUCCEEDED
    else
        status=FAILED
    fi
    echo "Requesting Ironic API to complete the deploy"
    bootloader_install_data="'{\"address\":\"$BOOT_IP_ADDRESS\",\"status\":\"$status\",\"key\":\"$DEPLOYMENT_KEY\",\"error\":\"$error_msg\"}'"
    do_vendor_passthru_and_wait "$bootloader_install_data" "pass_bootloader_install_info"
fi
```
Now let's look at what the node's /vendor_passthru/pass_deploy_info does. After ironic-api receives the request, it is handled by ironic.api.controllers.v1.node.NodeVendorPassthruController._default(), which forwards the call to ironic.conductor.manager.ConductorManager.vendor_passthru(), which in turn dispatches to the driver's task.driver.vendor.pass_deploy_info(). This differs per driver, as the source shows; for example, with the pxe_ipmitool driver it is handled by ironic.drivers.modules.iscsi_deploy.VendorPassthru.pass_deploy_info(), whose code is:
```python
@base.passthru(['POST'])
@task_manager.require_exclusive_lock
def pass_deploy_info(self, task, **kwargs):
    """Continues the deployment of baremetal node over iSCSI.

    This method continues the deployment of the baremetal node over
    iSCSI from where the deployment ramdisk has left off.

    :param task: a TaskManager instance containing the node to act on.
    :param kwargs: kwargs for performing iscsi deployment.
    :raises: InvalidState
    """
    node = task.node
    LOG.warning(_LW("The node %s is using the bash deploy ramdisk for "
                    "its deployment. This deploy ramdisk has been "
                    "deprecated. Please use the ironic-python-agent "
                    "(IPA) ramdisk instead."), node.uuid)
    task.process_event('resume')  # Update the task state
    LOG.debug('Continuing the deployment on node %s', node.uuid)
    is_whole_disk_image = node.driver_internal_info['is_whole_disk_image']
    # Continue the deployment: attach the iSCSI device, write the user
    # image onto it, detach and delete the iSCSI device, then delete
    # the image file on the conductor
    uuid_dict_returned = continue_deploy(task, **kwargs)
    root_uuid_or_disk_id = uuid_dict_returned.get(
        'root uuid', uuid_dict_returned.get('disk identifier'))
    # save the node's root disk UUID so that another conductor could
    # rebuild the PXE config file. Due to a shortcoming in Nova objects,
    # we have to assign to node.driver_internal_info so the node knows it
    # has changed.
    driver_internal_info = node.driver_internal_info
    driver_internal_info['root_uuid_or_disk_id'] = root_uuid_or_disk_id
    node.driver_internal_info = driver_internal_info
    node.save()
    try:
        # Set up PXE boot once more, in preparation for booting into the
        # user system
        task.driver.boot.prepare_instance(task)
        if deploy_utils.get_boot_option(node) == "local":
            if not is_whole_disk_image:
                LOG.debug('Installing the bootloader on node %s',
                          node.uuid)
                deploy_utils.notify_ramdisk_to_proceed(kwargs['address'])
                task.process_event('wait')
                return
    except Exception as e:
        LOG.error(_LE('Deploy failed for instance %(instance)s. '
                      'Error: %(error)s'),
                  {'instance': node.instance_uuid, 'error': e})
        msg = _('Failed to continue iSCSI deployment.')
        deploy_utils.set_failed_state(task, msg)
    else:
        # Finish the deployment: notify the ramdisk to reboot and set
        # the node to active
        finish_deploy(task, kwargs.get('address'))
```
In continue_deploy, the iSCSI deployment information is parsed first, then the disk is partitioned and formatted and the image is written to it. prepare_instance then sets up the PXE environment once more, to prepare for booting into the system. Recall that instance_info carries three images: ramdisk, kernel, and image_source, i.e. the initramfs, the kernel, and the disk image. Here the ramdisk and kernel are configured (the disk image has already been written to disk), and switch_pxe_config is called to make the ramdisk and kernel the boot entry for the installed operating system. Finally, a 'done' message is sent to port 10000 on the node, telling it to stop the iSCSI device; the node then reboots into the user OS, and the deployment is complete.
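The effect of switch_pxe_config can be sketched as rewriting the `default` label of the node's pxelinux.cfg, so that the next PXE boot selects the instance's kernel/ramdisk instead of the deploy ramdisk. This is a simplification: the real helper in ironic.drivers.modules.deploy_utils also handles whole-disk images, local boot, and UEFI, and the config text below is invented for the example:

```python
def switch_pxe_config(config_text, new_default):
    """Point the pxelinux 'default' directive at a different label."""
    lines = []
    for line in config_text.splitlines():
        if line.startswith('default '):
            line = 'default ' + new_default
        lines.append(line)
    return '\n'.join(lines) + '\n'

# A hypothetical config written during the prepare phase:
deploy_cfg = ('default deploy\n\n'
              'label deploy\nkernel deploy_kernel\n\n'
              'label boot_partition\nkernel kernel\n'
              'append initrd=ramdisk root={{ ROOT }}\n')
boot_cfg = switch_pxe_config(deploy_cfg, 'boot_partition')
```

After this switch the deploy label still exists in the file, but the node's next PXE boot follows the boot_partition label into the freshly written user image.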
Throughout the deployment, node and driver information is stored in the Ironic database for later management.
Case 2: deploying with an agent driver
The prepare phase is the same as with PXE, but because the ramdisk that gets built is different, the deployment method differs: with PXE the machine runs an init script at boot, while with the agent driver it runs the IPA (ironic-python-agent).
After power-on, the ramdisk runs in memory and starts the IPA. The entry point is cmd.agent.run(), which calls ironic-python-agent's agent.run(); its code is:
```python
def run(self):
    """Run the Ironic Python Agent."""
    # Get the UUID so we can heartbeat to Ironic. Raises LookupNodeError
    # if there is an issue (uncaught, restart agent)
    self.started_at = _time()

    # Load the hardware managers
    # Cached hw managers at runtime, not load time. See bug 1490008.
    hardware.load_managers()

    if not self.standalone:
        # Inspection should be started before call to lookup, otherwise
        # lookup will fail due to unknown MAC.
        uuid = inspector.inspect()
        # Send a lookup() request to the conductor through the Ironic
        # API to obtain the node UUID -- effectively self-discovery
        content = self.api_client.lookup_node(
            hardware_info=hardware.dispatch_to_managers(
                'list_hardware_info'),
            timeout=self.lookup_timeout,
            starting_interval=self.lookup_interval,
            node_uuid=uuid)
        self.node = content['node']
        self.heartbeat_timeout = content['heartbeat_timeout']

    wsgi = simple_server.make_server(
        self.listen_address[0],
        self.listen_address[1],
        self.api,
        server_class=simple_server.WSGIServer)

    # Start sending heartbeats
    if not self.standalone:
        # Don't start heartbeating until the server is listening
        self.heartbeater.start()
    try:
        wsgi.serve_forever()
    except BaseException:
        self.log.exception('shutting down')

    # Stop the heartbeats once deployment is finished
    if not self.standalone:
        self.heartbeater.stop()
```
self.api_client.lookup_node ends up in ironic-python-agent's _do_lookup(), which sends GET /{api_version}/drivers/{driver}/vendor_passthru/lookup. On receiving the lookup request, the conductor calls the corresponding driver's lookup function and returns the node UUID.
Once the IPA has the UUID, it sends heartbeat requests to ironic-api (/{api_version}/nodes/{uuid}/vendor_passthru/heartbeat), and ironic-api routes each one to the node driver's heartbeat function. This function is executed periodically: on every heartbeat it checks whether the IPA's current deployment step has finished and, if so, moves on to the next action. The agent_* drivers currently use the interface implemented in ironic.drivers.modules.agent.AgentVendorInterface; the code is below.
```python
@base.passthru(['POST'])
def heartbeat(self, task, **kwargs):
    """Method for agent to periodically check in.

    The agent should be sending its agent_url (so Ironic can talk back)
    as a kwarg. kwargs should have the following format::

        {
            'agent_url': 'http://AGENT_HOST:AGENT_PORT'
        }

    AGENT_PORT defaults to 9999.
    """
    node = task.node
    driver_internal_info = node.driver_internal_info
    LOG.debug(
        'Heartbeat from %(node)s, last heartbeat at %(heartbeat)s.',
        {'node': node.uuid,
         'heartbeat': driver_internal_info.get('agent_last_heartbeat')})
    driver_internal_info['agent_last_heartbeat'] = int(_time())
    try:
        driver_internal_info['agent_url'] = kwargs['agent_url']
    except KeyError:
        raise exception.MissingParameterValue(_('For heartbeat operation, '
                                                '"agent_url" must be '
                                                'specified.'))

    node.driver_internal_info = driver_internal_info
    node.save()

    # Async call backs don't set error state on their own
    # TODO(jimrollenhagen) improve error messages here
    msg = _('Failed checking if deploy is done.')
    try:
        if node.maintenance:
            # this shouldn't happen often, but skip the rest if it does.
            LOG.debug('Heartbeat from node %(node)s in maintenance mode; '
                      'not taking any action.', {'node': node.uuid})
            return
        elif (node.provision_state == states.DEPLOYWAIT and
                not self.deploy_has_started(task)):
            msg = _('Node failed to get image for deploy.')
            # Call continue_deploy to start downloading the image
            self.continue_deploy(task, **kwargs)
        elif (node.provision_state == states.DEPLOYWAIT and
                self.deploy_is_done(task)):
            # The IPA has finished downloading the image: deployment is
            # complete, so set boot-from-disk and reboot into the user
            # system
            msg = _('Node failed to move to active state.')
            self.reboot_to_instance(task, **kwargs)
        elif (node.provision_state == states.DEPLOYWAIT and
                self.deploy_has_started(task)):
            # Update the database to record the node as alive
            node.touch_provisioning()
        # TODO(lucasagomes): CLEANING here for backwards compat
        # with previous code, otherwise nodes in CLEANING when this
        # is deployed would fail. Should be removed once the Mitaka
        # release starts.
        elif node.provision_state in (states.CLEANWAIT, states.CLEANING):
            node.touch_provisioning()
            if not node.clean_step:
                LOG.debug('Node %s just booted to start cleaning.',
                          node.uuid)
                msg = _('Node failed to start the next cleaning step.')
                manager.set_node_cleaning_steps(task)
                self._notify_conductor_resume_clean(task)
            else:
                msg = _('Node failed to check cleaning progress.')
                self.continue_cleaning(task, **kwargs)
    except Exception as e:
        err_info = {'node': node.uuid, 'msg': msg, 'e': e}
        last_error = _('Asynchronous exception for node %(node)s: '
                       '%(msg)s exception: %(e)s') % err_info
        LOG.exception(last_error)
        if node.provision_state in (states.CLEANING, states.CLEANWAIT):
            manager.cleaning_error_handler(task, last_error)
        elif node.provision_state in (states.DEPLOYING, states.DEPLOYWAIT):
            deploy_utils.set_failed_state(task, last_error)
```
Per the heartbeat function above: when the node's provision_state is DEPLOYWAIT and the deployment has not started, continue_deploy() is called to begin the deployment.
```python
@task_manager.require_exclusive_lock
def continue_deploy(self, task, **kwargs):
    task.process_event('resume')
    node = task.node
    image_source = node.instance_info.get('image_source')
    LOG.debug('Continuing deploy for node %(node)s with image %(img)s',
              {'node': node.uuid, 'img': image_source})

    image_info = {
        'id': image_source.split('/')[-1],
        'urls': [node.instance_info['image_url']],
        'checksum': node.instance_info['image_checksum'],
        # NOTE(comstud): Older versions of ironic do not set
        # 'disk_format' nor 'container_format', so we use .get()
        # to maintain backwards compatibility in case code was
        # upgraded in the middle of a build request.
        'disk_format': node.instance_info.get('image_disk_format'),
        'container_format': node.instance_info.get(
            'image_container_format')
    }

    # Tell the IPA to download the image from swift and write it to the
    # local disk
    # Tell the client to download and write the image with the given args
    self._client.prepare_image(node, image_info)

    task.process_event('wait')
```
The conductor then calls, in order:
This concludes the agent-based OS deployment process. Let's review the deployment with two diagrams:
The figure below is the state transition diagram for the Liberty release.