Kubernetes: Installing the GPU Support Plugin

  • Kubernetes 1.10.x can schedule and run GPU containers directly once this plugin is installed.
  • The approach described here is based on the NVIDIA device plugin and supports only NVIDIA GPUs and Tesla compute cards.
  • Main steps:
    1. Install the NVIDIA drivers for the graphics card.
    2. Install the nvidia-docker2 container runtime.
    3. Set nvidia-docker2 as the container engine's default runtime.
    4. Enable GPU acceleration support in the Docker service.
    5. Install the NVIDIA device plugin.
    6. Enable GPU support in Kubernetes.
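
Before starting, it helps to confirm that each node actually has an NVIDIA GPU visible to the operating system. A minimal check (assuming a standard Ubuntu install):

# Verify that an NVIDIA GPU is present on the PCI bus
lspci | grep -i nvidia
# After the driver is installed, the nvidia kernel module should be loaded
lsmod | grep nvidia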

1. Installing the NVIDIA-Enabled Docker Engine

With the NVIDIA-enabled Docker engine installed, containers can use the GPU. The steps are as follows:

# If you have nvidia-docker 1.0 installed: we need to remove it and all existing GPU containers
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo apt-get purge -y nvidia-docker

# Add the package repositories
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

# Install nvidia-docker2 and reload the Docker daemon configuration
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

# Test nvidia-smi with the latest official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

Note that running Docker as shown above now supports GPUs directly; there is no longer any need to run a separate nvidia-docker command. This greatly improves compatibility with the various container orchestration systems, and Kubernetes can now run GPU workloads in Docker containers.

The current version depends on Docker 18.03. If another version is already installed, you can install the required version explicitly, as follows:

sudo apt install docker-ce=18.03.1~ce-0~ubuntu
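
To see which docker-ce versions are available from the configured repositories, and to keep apt from silently upgrading the pinned package later, you can additionally run:

# List the docker-ce versions available from the configured repositories
apt-cache madison docker-ce
# Prevent apt from upgrading the pinned version automatically
sudo apt-mark hold docker-ce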

2. Installing NVIDIA's Kubernetes GPU Plugin

The NVIDIA device plugin for Kubernetes is a DaemonSet that automatically:

  • Exposes the number of GPUs on each node of the cluster.
  • Keeps track of the health of the GPUs.
  • Enables running GPU-enabled containers in Kubernetes.

The Kubernetes device plugin project contains NVIDIA's official implementation.

2.1 Requirements

The requirements for running the NVIDIA device plugin are as follows (per the project's README for v1.10):

  • NVIDIA drivers ~= 361.93
  • nvidia-docker version > 2.0
  • Docker configured with nvidia as the default runtime
  • Kubernetes version 1.10

2.2 Preparing the GPU Nodes

The following steps need to be executed on all GPU nodes. Before proceeding, the NVIDIA drivers and nvidia-docker must already have been installed successfully.

First, on each node, enable the nvidia runtime as the default container runtime by editing the Docker daemon config file at /etc/docker/daemon.json:

{
    "exec-opts": ["native.cgroupdriver=systemd"],
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

The line "exec-opts": ["native.cgroupdriver=systemd"] above is required on Ubuntu 16.04 with Docker CE; otherwise the kubelet cannot start successfully (see http://www.javashuo.com/article/p-rqehlpup-bz.html).

If the runtimes entry is missing, refer to the nvidia-docker documentation and install it first.
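
After editing daemon.json, restart Docker and confirm that nvidia is now the default runtime:

sudo systemctl restart docker
# "Default Runtime: nvidia" should appear in the output
docker info | grep -i runtime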

Second, enable the DevicePlugins feature gate. This must be set on every GPU node.

If your Kubernetes cluster was deployed with kubeadm and the nodes are running systemd, open the kubeadm systemd unit file at /etc/systemd/system/kubelet.service.d/10-kubeadm.conf and add the following parameter as an environment variable:

Environment="KUBELET_GPU_ARGS=--feature-gates=DevicePlugins=true"

If you spot the Accelerators feature gate, you should remove it, as it might interfere with the DevicePlugins feature gate.

The complete file then looks as follows (/etc/systemd/system/kubelet.service.d/10-kubeadm.conf):

[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local"
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
Environment="KUBELET_CADVISOR_ARGS=--cadvisor-port=0"
Environment="KUBELET_CERTIFICATE_ARGS=--rotate-certificates=true --cert-dir=/var/lib/kubelet/pki"

Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=systemd"
Environment="KUBELET_EXTRA_ARGS=--fail-swap-on=false"
Environment="KUBELET_GPU_ARGS=--feature-gates=DevicePlugins=true"

ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_CERTIFICATE_ARGS $KUBELET_CGROUP_ARGS $KUBELET_EXTRA_ARGS $KUBELET_GPU_ARGS

Reload the configuration files and restart the service:

$ sudo systemctl daemon-reload
$ sudo systemctl restart kubelet

In this guide we used kubeadm and kubectl to set up and administer the Kubernetes cluster, but there are many ways to deploy one. If you are not using the kubeadm + systemd configuration, you will need to make sure that the arguments passed to the kubelet include --feature-gates=DevicePlugins=true to enable the DevicePlugins feature gate.
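
Whichever deployment method you use, you can verify that the running kubelet actually picked up the flag:

# Show the feature-gates argument of the running kubelet process
$ ps -ef | grep kubelet | grep -o 'feature-gates=[^ ]*'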

2.3 Enabling GPU Support in Kubernetes

Once the options above are enabled on all GPU nodes, GPU support can be enabled in Kubernetes by installing the DaemonSet provided by NVIDIA:

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.10/nvidia-device-plugin.yml
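
After the DaemonSet pods are running, every GPU node should advertise the nvidia.com/gpu resource. A quick check:

# The plugin pods run in the kube-system namespace
$ kubectl get pods -n kube-system | grep nvidia-device-plugin
# Each GPU node should now report nvidia.com/gpu under Capacity/Allocatable
$ kubectl describe nodes | grep nvidia.com/gpu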

2.4 Running GPU Workloads

NVIDIA GPUs can be consumed via container-level resource requests, using the resource name nvidia.com/gpu:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs
    - name: digits-container
      image: nvidia/digits:6.0
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs

⚠️ Note: if you do not specify a GPU resource request when using the device plugin with NVIDIA images, all of the GPUs on the machine will be exposed inside the container.
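
Assuming the manifest above is saved as gpu-pod.yml (the filename is arbitrary), you can create the pod and confirm that the GPUs were actually assigned:

$ kubectl create -f gpu-pod.yml
# nvidia-smi inside the container should list only the GPUs assigned to it
$ kubectl exec gpu-pod -c cuda-container -- nvidia-smi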

2.5 Documentation

⚠️ Please note:

  • The device plugin feature is still alpha, which is why it requires the feature gate to be enabled.
  • The NVIDIA device plugin is still considered alpha and is missing:
    • security features
    • more comprehensive GPU health checking features
    • GPU cleanup features
    • ...
  • Support will only be provided for the official NVIDIA device plugin.

The following sections focus on how to build and run the device plugin.

2.6 With Docker

Build

Option 1: pull the prebuilt container image from Docker Hub:

$ docker pull nvidia/k8s-device-plugin:1.10

Option 2: build the container image without cloning the repository:

$ docker build -t nvidia/k8s-device-plugin:1.10 https://github.com/NVIDIA/k8s-device-plugin.git#v1.10

Option 3: clone the repository and build it yourself, which allows local modifications:

$ git clone https://github.com/NVIDIA/k8s-device-plugin.git && cd k8s-device-plugin
$ docker build -t nvidia/k8s-device-plugin:1.10 .

Run locally

$ docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.10

Deploy as a DaemonSet:

$ kubectl create -f nvidia-device-plugin.yml

2.7 Without Docker

Build

$ C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build

Run locally

$ ./k8s-device-plugin
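
The plugin registers itself with the kubelet over a Unix socket in the kubelet's device-plugin directory, so it typically needs to run as root on the node. Once it is running, its socket should appear there:

# kubelet.sock plus one socket per registered device plugin
$ ls /var/lib/kubelet/device-plugins/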

2.8 Issues and Contributing

Bug reports and contributions are handled through the project's GitHub repository.

3. Approaches to GPU Support in Spark

Once containers can use GPUs, running Spark inside them is no longer a problem. Specific approaches can be found in the referenced articles.
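
As a simple illustration (the image name below is a placeholder, not an official Spark image), a pod running a Spark component can request a GPU in exactly the same way as any other container:

# A minimal sketch: your-registry/spark-cuda:latest is a hypothetical image
# bundling Spark with the CUDA libraries; replace it with your own build.
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: spark-gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: spark-worker
      image: your-registry/spark-cuda:latest
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF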
