Extended resources and device plugins in k8s

Date: 2019-11-07



extended-resources

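extended-resources is a stable feature in k8s 1.9. It can be summed up in one sentence: by sending a PATCH node request to the apiserver, you add a custom resource type to that node, which is then used for quota accounting of that resource and for the corresponding QoS configuration.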

The PATCH node request:

Example:

curl --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data '[{"op": "add", "path": "/status/capacity/example.com~1dongle", "value": "4"}]' \
http://localhost:8001/api/v1/nodes/10.123.123.123/status

As shown above, we added a resource, example.com/dongle, to the node 10.123.123.123 (the ~1 in the command is the JSON Pointer escape for /). The node's capacity/allocatable now shows 4 example.com/dongle resources:

"capacity": {
  "alpha.kubernetes.io/nvidia-gpu": "0",
  "cpu": "2",
  "memory": "2049008Ki",
  "example.com/dongle": "4",

To remove this resource, use:

curl --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data '[{"op": "remove", "path": "/status/capacity/example.com~1dongle"}]' \
http://localhost:8001/api/v1/nodes/<your-node-name>/status
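The two request bodies above can also be built programmatically. The following is a minimal sketch, not part of the original article: the helper names `escapePointer` and `capacityPatch` are hypothetical, and the only thing it relies on is the JSON Pointer rule (RFC 6901) that `/` inside a path segment is written as `~1` and `~` as `~0`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// patchOp is one JSON Patch (RFC 6902) operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value,omitempty"`
}

// escapePointer escapes one JSON Pointer (RFC 6901) path segment:
// "~" becomes "~0" and "/" becomes "~1".
func escapePointer(segment string) string {
	return strings.NewReplacer("~", "~0", "/", "~1").Replace(segment)
}

// capacityPatch (hypothetical helper) builds the patch body that adds an
// extended resource to status.capacity, or removes it when quantity is "".
func capacityPatch(resource, quantity string) (string, error) {
	op := patchOp{Op: "remove", Path: "/status/capacity/" + escapePointer(resource)}
	if quantity != "" {
		op.Op, op.Value = "add", quantity
	}
	body, err := json.Marshal([]patchOp{op})
	return string(body), err
}

func main() {
	add, _ := capacityPatch("example.com/dongle", "4")
	remove, _ := capacityPatch("example.com/dongle", "")
	fmt.Println(add)
	fmt.Println(remove)
}
```

The printed bodies can be passed to curl's --data option exactly as in the commands above.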

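QoS configuration:

If you are not familiar with what QoS means, see my earlier article.

Assume for now that 10.123.123.123 is the only node in the whole k8s cluster that we have modified. When creating a pod, we can set the following in spec.containers[].resources.requests/limits: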

"example.com/dongle": "2"

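The pod is then scheduled onto 10.123.123.123 and consumes 2 of its example.com/dongle resources. Like cpu and memory, this resource is tracked by the scheduler and taken into account by the pod scheduling algorithm. Once the example.com/dongle resources on the node are exhausted, pods of this kind can no longer be scheduled.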
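The accounting described here can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the `node` type and its `fits` and `schedule` methods are hypothetical and are not the scheduler's real data structures. It shows a node with 4 example.com/dongle accepting two pods that request 2 each and rejecting the third.

```go
package main

import "fmt"

// node tracks allocatable and already-requested amounts per resource name.
// Hypothetical type for illustration only.
type node struct {
	allocatable map[string]int64
	requested   map[string]int64
}

// fits reports whether every requested resource still has enough free amount.
func (n *node) fits(requests map[string]int64) bool {
	for res, want := range requests {
		if n.allocatable[res]-n.requested[res] < want {
			return false
		}
	}
	return true
}

// schedule reserves the requests on the node if they fit.
func (n *node) schedule(requests map[string]int64) bool {
	if !n.fits(requests) {
		return false
	}
	for res, want := range requests {
		n.requested[res] += want
	}
	return true
}

// newDongleNode returns a node advertising 4 example.com/dongle.
func newDongleNode() *node {
	return &node{
		allocatable: map[string]int64{"example.com/dongle": 4},
		requested:   map[string]int64{},
	}
}

func main() {
	n := newDongleNode()
	pod := map[string]int64{"example.com/dongle": 2}
	for i := 1; i <= 3; i++ {
		fmt.Printf("pod %d scheduled: %v\n", i, n.schedule(pod))
	}
}
```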

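device-plugin

Device plugins were introduced in version 1.8 and are still an alpha feature as of 1.9. Their purpose is to provide k8s with accounting information for various resources, and to do the preparation work for using them, without changing k8s code. The resources in question include GPUs, high-performance NICs, FPGAs, InfiniBand and others.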

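Registration and operation of a device-plugin

The device-plugin feature is controlled by the DevicePlugins feature gate, which is disabled by default. Once it is enabled, the kubelet exposes a Register gRPC service. A device-plugin registers itself with the kubelet through this service, and on registration it must tell the kubelet:

The name of this device-plugin's Unix socket, which the kubelet uses as a gRPC client to send requests to the device-plugin;
The API version of this device-plugin;
The resource name this device-plugin exposes. The name must follow a fixed format, for example: nvidia.com/gpu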
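The three pieces of registration data can be pictured as a small struct. The sketch below is an assumption-laden illustration: `registerRequest` is a local stand-in rather than the generated gRPC type, the values in `main` (`v1alpha`, `dongle.sock`) are made up, and `validate` is a simplified check of the vendor-domain/resource shape, not the kubelet's actual validation.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// registerRequest is a local stand-in for what a device-plugin sends to the
// kubelet's Register service: API version, socket name and resource name.
type registerRequest struct {
	Version      string
	Endpoint     string // Unix socket name of the device-plugin
	ResourceName string // for example nvidia.com/gpu
}

// validate is a simplified, hypothetical check: all fields must be set and
// the resource name must have exactly one "/" with text on both sides.
func validate(r registerRequest) error {
	if r.Version == "" || r.Endpoint == "" {
		return errors.New("version and endpoint are required")
	}
	domain, name, ok := strings.Cut(r.ResourceName, "/")
	if !ok || domain == "" || name == "" || strings.Contains(name, "/") {
		return fmt.Errorf("resource name %q must look like vendor-domain/resource", r.ResourceName)
	}
	return nil
}

func main() {
	// Illustrative values only.
	good := registerRequest{Version: "v1alpha", Endpoint: "dongle.sock", ResourceName: "example.com/dongle"}
	bad := registerRequest{Version: "v1alpha", Endpoint: "dongle.sock", ResourceName: "dongle"}
	fmt.Println(validate(good))
	fmt.Println(validate(bad))
}
```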

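After a successful registration, the kubelet calls the device-plugin's ListAndWatch method to obtain the device list, which is returned as an array of descriptions (id, health state) of all devices of that resource. The kubelet records the resource and its device count in node.status.capacity/allocatable and reports it to the apiserver. ListAndWatch keeps checking in a loop, and as soon as a device becomes unhealthy or is unplugged from the machine, it returns the latest device list to the kubelet.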
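The streaming behaviour of ListAndWatch can be approximated with Go channels. This is a sketch only: the real method is a gRPC server stream, whereas here `listAndWatch`, `observe` and the `device` type are hypothetical, and the "kubelet" side simply counts healthy devices after each update, which is a simplification.

```go
package main

import "fmt"

const (
	healthy   = "Healthy"
	unhealthy = "Unhealthy"
)

// device is a simplified device description: id plus health state.
type device struct {
	ID     string
	Health string
}

// listAndWatch sends the full device list once, then again after every
// health event, until the events channel is closed.
func listAndWatch(initial []device, events <-chan device, out chan<- []device) {
	defer close(out)
	state := append([]device(nil), initial...)
	out <- append([]device(nil), state...)
	for ev := range events {
		for i := range state {
			if state[i].ID == ev.ID {
				state[i].Health = ev.Health
			}
		}
		out <- append([]device(nil), state...)
	}
}

// healthyCount is the amount this sketch's "kubelet" would report.
func healthyCount(devs []device) int {
	n := 0
	for _, d := range devs {
		if d.Health == healthy {
			n++
		}
	}
	return n
}

// observe replays a fixed sequence of health changes through listAndWatch
// and returns the healthy count seen after each list update.
func observe(initial, changes []device) []int {
	events := make(chan device)
	out := make(chan []device)
	go listAndWatch(initial, events, out)
	go func() {
		for _, c := range changes {
			events <- c
		}
		close(events)
	}()
	var counts []int
	for devs := range out {
		counts = append(counts, healthyCount(devs))
	}
	return counts
}

func main() {
	initial := []device{{"dongle-0", healthy}, {"dongle-1", healthy}}
	changes := []device{{"dongle-1", unhealthy}}
	fmt.Println(observe(initial, changes))
}
```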

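With this in place, a pod's spec.containers[].resources.limits/requests can carry a field such as "nvidia.com/gpu" : 2 to tell k8s to schedule the pod onto a node with at least 2 nvidia.com/gpu resources free (the same principle as the QoS configuration for extended-resources above). When the pod is about to run on the node, the kubelet calls the device-plugin's Allocate method, where the device-plugin may do some initialization such as GPU cleanup or QRNG initialization. If initialization succeeds, the method returns how the devices assigned to the pod must be configured when the container is created. This configuration is passed to the container runtime and used as parameters when running the container.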

The complete flow is shown in the figure below (image source: https://github.com/kubernetes...)

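Code walkthrough of device-plugin usage

Let us follow the code step by step through the whole pod creation flow:

Create a pod that requests a special resource device;
The scheduler picks a node that satisfies the request from its cache;
The node receives ADD POD and runs the admit method on the pod to decide whether it can run.

When the kubelet initializes, it adds an admitHandler: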

klet.admitHandlers.AddPodAdmitHandler(lifecycle.NewPredicateAdmitHandler(klet.getNodeAnyWay, criticalPodAdmissionHandler, klet.containerManager.UpdatePluginResources))

This includes the klet.containerManager.UpdatePluginResources method, which calls the Allocate method of the devicePluginManager:

func (cm *containerManagerImpl) UpdatePluginResources(node *schedulercache.NodeInfo, attrs *lifecycle.PodAdmitAttributes) error {
      return cm.devicePluginManager.Allocate(node, attrs)
}

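The Allocate method above first checks and computes against the available resource amounts that the kubelet caches itself;
It then picks the devices to use and sends an Allocate call to the device-plugin. The device-plugin checks whether the device ids in the request are usable and returns to the kubelet the parameters needed to use those devices, in this format: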

type AllocateResponse struct {
 // List of environment variable to be set in the container to access one of more devices.
 Envs map[string]string
 // Mounts for the container.
 Mounts []*Mount
 // Devices for the container.
 Devices []*DeviceSpec
}
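To make the plugin side of this call concrete, here is a minimal sketch of an Allocate handler that fills in such a response. Everything beyond the AllocateResponse field names is an assumption for illustration: the `Mount` and `DeviceSpec` field layouts are local stand-ins, the `/dev/<id>` path mapping and the `DONGLE_VISIBLE_DEVICES` variable are invented, and the real handler is a gRPC method rather than a plain function.

```go
package main

import (
	"fmt"
	"strings"
)

// Local stand-ins for the response types; field names are assumptions.
type Mount struct {
	ContainerPath string
	HostPath      string
	ReadOnly      bool
}

type DeviceSpec struct {
	ContainerPath string
	HostPath      string
	Permissions   string
}

type AllocateResponse struct {
	Envs    map[string]string
	Mounts  []*Mount
	Devices []*DeviceSpec
}

// allocate checks that every requested device id is known and healthy, then
// describes how the container should be given access to those devices.
func allocate(healthy map[string]bool, ids []string) (*AllocateResponse, error) {
	resp := &AllocateResponse{Envs: map[string]string{}}
	for _, id := range ids {
		if !healthy[id] {
			return nil, fmt.Errorf("device %q is unknown or unhealthy", id)
		}
		path := "/dev/" + id // hypothetical host path for this device
		resp.Devices = append(resp.Devices, &DeviceSpec{
			ContainerPath: path,
			HostPath:      path,
			Permissions:   "rw",
		})
	}
	// Hypothetical variable telling the workload which devices it owns.
	resp.Envs["DONGLE_VISIBLE_DEVICES"] = strings.Join(ids, ",")
	return resp, nil
}

func main() {
	healthy := map[string]bool{"dongle-0": true, "dongle-1": true, "dongle-2": false}
	resp, err := allocate(healthy, []string{"dongle-0", "dongle-1"})
	fmt.Println(len(resp.Devices), resp.Envs["DONGLE_VISIBLE_DEVICES"], err)
	_, err = allocate(healthy, []string{"dongle-2"})
	fmt.Println(err)
}
```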

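Finally, the resource devices this pod will use (the device ids plus the usage parameters returned by the device-plugin) are recorded in podDevices. podDevices is a mapping from pods to detailed device information, implemented as a multi-level map.

When the kubelet is about to create the pod's containers, it calls GenerateRunContainerOptions to build the parameters the container runtime needs. That method first calls: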
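A multi-level map of this kind might look as follows. This is a simplified sketch, not the kubelet's actual definition: the level order (pod UID, container name, resource name) and the `deviceAllocation` type with its fields are assumptions chosen for illustration.

```go
package main

import "fmt"

// deviceAllocation holds what was decided for one resource of one container.
// Hypothetical type for illustration.
type deviceAllocation struct {
	deviceIDs []string
	envs      map[string]string
}

// podDevices maps pod UID -> container name -> resource name -> allocation.
type podDevices map[string]map[string]map[string]deviceAllocation

// insert records an allocation, creating the intermediate maps on demand.
func (p podDevices) insert(podUID, container, resource string, a deviceAllocation) {
	if p[podUID] == nil {
		p[podUID] = map[string]map[string]deviceAllocation{}
	}
	if p[podUID][container] == nil {
		p[podUID][container] = map[string]deviceAllocation{}
	}
	p[podUID][container][resource] = a
}

// containerDevices returns the device ids of one resource for one container.
// Reading through missing levels is safe in Go and yields a nil slice.
func (p podDevices) containerDevices(podUID, container, resource string) []string {
	return p[podUID][container][resource].deviceIDs
}

func main() {
	pd := podDevices{}
	pd.insert("pod-uid-1", "worker", "example.com/dongle", deviceAllocation{
		deviceIDs: []string{"dongle-0", "dongle-1"},
	})
	fmt.Println(pd.containerDevices("pod-uid-1", "worker", "example.com/dongle"))
	fmt.Println(len(pd.containerDevices("pod-uid-2", "worker", "example.com/dongle")))
}
```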

opts, err := kl.containerManager.GetResources(pod, container)

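GetResources in the containerManager calls GetDeviceRunContainerOptions in the devicePluginManager, which ends up in deviceRunContainerOptions. This fetches from podDevices the devices that the given container of the pod needs and assembles them into opts, the container runtime options object, which is used when the container is finally run. For a GPU container, for example, devices entries are added to opts so that the container is created with the required devices.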
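The final assembly step can be sketched as merging the recorded allocations of one container into a single options object. The types and the function `buildRunOptions` below are hypothetical simplifications and do not reproduce the kubelet's real RunContainerOptions; environment entries are sorted only to make the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// allocation is a simplified record of what a device-plugin returned for
// one resource of a container. Hypothetical type.
type allocation struct {
	Envs    map[string]string
	Devices []string // host device paths
}

// runOptions is a simplified stand-in for the container runtime options.
type runOptions struct {
	Envs    []string // KEY=value entries
	Devices []string
}

// buildRunOptions merges all allocations of a container into one object.
func buildRunOptions(allocs []allocation) runOptions {
	var opts runOptions
	for _, a := range allocs {
		for k, v := range a.Envs {
			opts.Envs = append(opts.Envs, k+"="+v)
		}
		opts.Devices = append(opts.Devices, a.Devices...)
	}
	sort.Strings(opts.Envs) // map iteration order is random, so sort
	return opts
}

func main() {
	opts := buildRunOptions([]allocation{
		{Envs: map[string]string{"DONGLE_IDS": "dongle-0"}, Devices: []string{"/dev/dongle-0"}},
		{Envs: map[string]string{"ACCEL_IDS": "accel-3"}, Devices: []string{"/dev/accel-3"}},
	})
	fmt.Println(opts.Envs)
	fmt.Println(opts.Devices)
}
```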