Once the NVIDIA-enabled Docker engine (nvidia-docker2) is installed, containers can use the GPU. The steps are as follows:
# If you have nvidia-docker 1.0 installed: we need to remove it and all existing GPU containers
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo apt-get purge -y nvidia-docker

# Add the package repositories
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

# Install nvidia-docker2 and reload the Docker daemon configuration
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

# Test nvidia-smi with the latest official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
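The `distribution` variable in the steps above decides which repository list gets downloaded, so it is worth checking before adding the apt source. The following is a minimal sketch of that step; the function name `detect_distribution` and its optional file argument are illustrative assumptions, not part of nvidia-docker.

```shell
# Sketch only: detect_distribution is a hypothetical helper, not part of nvidia-docker.
# It mirrors `distribution=$(. /etc/os-release;echo $ID$VERSION_ID)` from the steps
# above, but accepts the os-release path as an optional argument so it can be
# tried against a copy of the file.
detect_distribution() {
  local os_release="${1:-/etc/os-release}"
  # Source the file in a subshell so ID and VERSION_ID do not leak into the caller.
  ( . "$os_release" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Example: print the repository list URL that would be fetched on this machine.
#   echo "https://nvidia.github.io/nvidia-docker/$(detect_distribution)/nvidia-docker.list"
```

If the printed URL does not exist for your distribution, the repository step would download an error page instead of a package list, so checking it first can save some debugging.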
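Note that Docker, when run as shown above, now supports GPUs directly, so the separate nvidia-docker command is no longer needed. This greatly improves compatibility with container orchestration systems, and Kubernetes can now run GPU workloads in Docker containers as well.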
The current release depends on Docker 18.03. If a different version is already installed, you can specify the version to install, as follows:
sudo apt install docker-ce=18.03.1~ce-0~ubuntu
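To confirm that the pinned version is actually the one running, a small check along these lines may help. It is only a sketch: `is_docker_18_03` is a hypothetical helper name, and the check simply matches the version string against the 18.03 series mentioned above.

```shell
# Sketch only: is_docker_18_03 is a hypothetical helper.
# It succeeds when the given Docker version string belongs to the 18.03 series.
is_docker_18_03() {
  case "$1" in
    18.03|18.03.*) return 0 ;;
    *) return 1 ;;
  esac
}

# On a node:
#   apt-cache madison docker-ce     # list the versions apt can install
#   is_docker_18_03 "$(docker version --format '{{.Server.Version}}')" \
#     || echo "unexpected Docker version"
```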
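The NVIDIA device plugin for Kubernetes is a DaemonSet that automatically exposes the number of GPUs on each node, keeps track of GPU health, and lets you run GPU-enabled containers in the cluster.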
The Kubernetes device plugin project contains NVIDIA's official implementation.
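The requirements for running the NVIDIA device plugin include having the DevicePlugins feature gate enabled.
The following steps must be performed on all GPU nodes. In addition, the NVIDIA drivers and nvidia-docker must already be installed successfully.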
First, check every node and set the nvidia runtime as the default container runtime. To do this, edit the Docker daemon config file, located at /etc/docker/daemon.json:
{
    "exec-opts": ["native.cgroupdriver=systemd"],
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
The line "exec-opts": ["native.cgroupdriver=systemd"] above is required on Ubuntu 16.04 with Docker CE; without it, kubelet fails to start (see http://www.javashuo.com/article/p-rqehlpup-bz.html).
If the runtimes entry is missing, refer to the nvidia-docker documentation and install it first.
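After editing /etc/docker/daemon.json, it can be useful to confirm that the default runtime really is set before restarting Docker. The sketch below does a plain text match rather than a full JSON parse; `has_nvidia_default_runtime` is a hypothetical helper name introduced only for illustration.

```shell
# Sketch only: has_nvidia_default_runtime is a hypothetical helper.
# It succeeds when the given Docker daemon config sets "default-runtime" to "nvidia".
# This is a text match, not a JSON parser, so it assumes a conventionally
# formatted file like the one shown above.
has_nvidia_default_runtime() {
  local config="${1:-/etc/docker/daemon.json}"
  # Collapse the file to one line so the check also works on pretty-printed JSON.
  tr -d '\n' < "$config" \
    | grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"'
}

# Typical use on a GPU node after editing the file:
#   has_nvidia_default_runtime && sudo systemctl restart docker
#   docker info | grep -i 'default runtime'
```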
Second, enable the DevicePlugins feature gate. This must be done on every GPU node.
If your Kubernetes cluster was deployed with kubeadm and the nodes run systemd, open the kubeadm systemd unit file, located at /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
Then add the following argument as an environment variable:
Environment="KUBELET_GPU_ARGS=--feature-gates=DevicePlugins=true"
If you spot the Accelerators feature gate, you should remove it, as it might interfere with the DevicePlugins feature gate.
The complete file (/etc/systemd/system/kubelet.service.d/10-kubeadm.conf) looks like this:
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local"
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
Environment="KUBELET_CADVISOR_ARGS=--cadvisor-port=0"
Environment="KUBELET_CERTIFICATE_ARGS=--rotate-certificates=true --cert-dir=/var/lib/kubelet/pki"
Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=systemd"
Environment="KUBELET_EXTRA_ARGS=--fail-swap-on=false"
Environment="KUBELET_GPU_ARGS=--feature-gates=DevicePlugins=true"
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_CERTIFICATE_ARGS $KUBELET_CGROUP_ARGS $KUBELET_EXTRA_ARGS $KUBELET_GPU_ARGS
Reload the configuration, then restart the service:
$ sudo systemctl daemon-reload
$ sudo systemctl restart kubelet
In this guide we used kubeadm and kubectl as the method for setting up and administering the Kubernetes cluster, but there are many ways to deploy a Kubernetes cluster. To enable the DevicePlugins feature gate if you are not using the kubeadm + systemd configuration, you will need to make sure that the arguments that are passed to Kubelet include the following: --feature-gates=DevicePlugins=true.
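Whichever way the kubelet is started, one way to confirm that the flag took effect is to inspect the command line of the running process. The sketch below assumes the flag is written in the `--feature-gates=...` form used in this guide; `has_device_plugins_gate` is a hypothetical helper name.

```shell
# Sketch only: has_device_plugins_gate is a hypothetical helper.
# It reads a kubelet command line on stdin and succeeds when the
# --feature-gates flag (written as --feature-gates=...) includes DevicePlugins=true,
# either alone or as one entry of a comma-separated list.
has_device_plugins_gate() {
  grep -Eq -- '--feature-gates=([^[:space:]]*,)?DevicePlugins=true(,|[[:space:]]|$)'
}

# On a node, feed it the arguments of the running kubelet:
#   ps -C kubelet -o args= | has_device_plugins_gate && echo "DevicePlugins enabled"
```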
Once the option has been enabled on all GPU nodes, you can enable GPU support in Kubernetes by deploying the DaemonSet provided by NVIDIA:
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.10/nvidia-device-plugin.yml
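Once the DaemonSet is running, each GPU node should advertise an `nvidia.com/gpu` resource. The sketch below extracts that value from `kubectl describe node` output; it assumes the usual "name: value" layout of the Capacity section, and `gpu_count_from_describe` is a hypothetical helper name.

```shell
# Sketch only: gpu_count_from_describe is a hypothetical helper.
# It reads `kubectl describe node <name>` output on stdin and prints the first
# nvidia.com/gpu value it finds (the Capacity entry); it prints nothing if the
# node does not advertise the resource.
gpu_count_from_describe() {
  awk '$1 == "nvidia.com/gpu:" { print $2; exit }'
}

# On a machine with cluster access (NODE is a placeholder for a GPU node name):
#   kubectl describe node "$NODE" | gpu_count_from_describe
```

An empty result usually suggests that the plugin pod is not running on that node or that the default runtime is not set to nvidia.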
NVIDIA GPUs can be requested through container-level resource requirements, using the resource name nvidia.com/gpu:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs
    - name: digits-container
      image: nvidia/digits:6.0
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs
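For a quick smoke test, a smaller single-container pod that just runs nvidia-smi is often enough. The sketch below generates such a manifest; `render_gpu_pod`, the pod name, and the container name are placeholders chosen for illustration, and the image is the one used in the example above.

```shell
# Sketch only: render_gpu_pod is a hypothetical helper; the pod and container
# names are placeholders. It prints a single-container Pod manifest that
# requests the given number of GPUs through the nvidia.com/gpu resource
# and runs nvidia-smi once.
render_gpu_pod() {
  local name="$1" image="$2" gpus="$3"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  restartPolicy: OnFailure
  containers:
    - name: ${name}-container
      image: ${image}
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: ${gpus}
EOF
}

# With cluster access, pipe the manifest to kubectl and read the container log:
#   render_gpu_pod gpu-test nvidia/cuda:9.0-devel 1 | kubectl create -f -
#   kubectl logs gpu-test
```

Setting the limit explicitly keeps the test pod from seeing more GPUs than it asked for.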
⚠️ Warning: if no GPU resource request is specified when using the device plugin with NVIDIA images, all GPUs on the machine will be exposed inside the container.
⚠️ Please note:
The following sections focus on how to build and run the device plugin.
Option 1: pull the prebuilt container image from Docker Hub:
$ docker pull nvidia/k8s-device-plugin:1.10
Option 2: build the container image yourself without cloning the repository:
$ docker build -t nvidia/k8s-device-plugin:1.10 https://github.com/NVIDIA/k8s-device-plugin.git#v1.10
Option 3: clone the repository and build the image yourself, which lets you make modifications:
$ git clone https://github.com/NVIDIA/k8s-device-plugin.git && cd k8s-device-plugin
$ docker build -t nvidia/k8s-device-plugin:1.10 .
To run the image locally:
$ docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.10
To deploy it as a DaemonSet:
$ kubectl create -f nvidia-device-plugin.yml
To build without Docker:
$ C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build
Then run the resulting binary locally:
$ ./k8s-device-plugin
As long as the container supports GPUs, running Spark inside it is straightforward. For specific approaches, see: