Scheduling GPU Resources in Kubernetes

Date: 2019-11-07

Tags: kubernetes, scheduling, gpu, resources

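Kubernetes includes experimental support for scheduling AMD and NVIDIA GPUs across nodes. Support for NVIDIA GPUs was added in v1.6 and has since gone through several backward-incompatible iterations. Support for AMD GPUs was added in v1.9 and is provided through a device plugin.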

To install the GPU drivers on Ubuntu, see:
To install the NVIDIA Docker engine (nvidia-docker), see:

This article describes how to consume GPUs in different Kubernetes versions, along with the current limitations.

v1.8 and later

From 1.8 onwards, the recommended way to consume GPUs is to use device plugins.

To enable GPU support before 1.10, the DevicePlugins feature gate has to be explicitly enabled across the system: --feature-gates="DevicePlugins=true". From 1.10 onwards this setting is no longer required.

You also have to install GPU drivers on every GPU node. Both the drivers and the device plugin are provided by the corresponding GPU vendor (AMD, NVIDIA).

When the above conditions are met, Kubernetes exposes nvidia.com/gpu or amd.com/gpu as a schedulable resource.
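One way to confirm that a device plugin has registered its resource is to look at each node's allocatable resources, for example in the output of `kubectl get nodes -o json`. The following is a minimal sketch that assumes this JSON has already been obtained; the helper name `gpu_allocatable`, the node names, and the sample data are hypothetical and only illustrate the shape of the data.

```python
import json


def gpu_allocatable(node_list, resource_name="nvidia.com/gpu"):
    """Map each node name to its allocatable count of resource_name (0 if absent)."""
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Resource quantities are serialized as strings, e.g. "2".
        counts[name] = int(allocatable.get(resource_name, "0"))
    return counts


# Hypothetical sample shaped like the output of `kubectl get nodes -o json`.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "gpu-node-1"},
     "status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "2"}}},
    {"metadata": {"name": "cpu-node-1"},
     "status": {"allocatable": {"cpu": "4"}}}
  ]
}
""")

print(gpu_allocatable(sample))
```

A node that reports 0 here either has no GPUs or has no working driver and device plugin yet.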

You can consume these GPUs from your containers by requesting <vendor>.com/gpu just like you request cpu or memory. However, there are some limitations in how you specify the resource requirements when using GPUs:

- GPUs are only supposed to be specified in the limits section, which means:
  - You can specify GPU limits without specifying requests because Kubernetes will use the limit as the request value by default.
  - You can specify GPU in both limits and requests but these two values must be equal.
  - You cannot specify GPU requests without specifying limits.
- Containers (and pods) do not share GPUs. There’s no overcommitting of GPUs.
- Each container can request one or more GPUs. It is not possible to request a fraction of a GPU.
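The rules above can be expressed as a small check. This is a sketch only, not Kubernetes' own validation code: `effective_gpu_request` is a hypothetical helper, and it assumes the container's resources section has been loaded into a dict with integer GPU quantities.

```python
def effective_gpu_request(resources, resource_name="nvidia.com/gpu"):
    """Return the GPU count implied by a resources dict, or raise ValueError.

    Hypothetical helper that mirrors the rules listed above.
    """
    limit = resources.get("limits", {}).get(resource_name)
    request = resources.get("requests", {}).get(resource_name)

    if limit is None:
        if request is not None:
            # A GPU request without a limit is not allowed.
            raise ValueError("GPU requests require a matching limit")
        return 0

    for value in (limit, request):
        if value is None:
            continue
        # Fractions of a GPU cannot be requested.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError("GPU quantities must be whole numbers")

    if request is not None and request != limit:
        raise ValueError("GPU request and limit must be equal")

    # With only a limit set, the limit is also used as the request.
    return limit


print(effective_gpu_request({"limits": {"nvidia.com/gpu": 1}}))
```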

Here’s an example:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU

Deploying AMD GPU device plugin

The official AMD GPU device plugin has the following requirements:

Kubernetes nodes have to be pre-installed with AMD GPU Linux driver.

To deploy the AMD device plugin once your cluster is running and the above requirements are satisfied:

# For Kubernetes v1.9
kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/r1.9/k8s-ds-amdgpu-dp.yaml

# For Kubernetes v1.10
kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/r1.10/k8s-ds-amdgpu-dp.yaml

Report issues with this device plugin to RadeonOpenCompute/k8s-device-plugin.

Deploying NVIDIA GPU device plugin

There are currently two device plugin implementations for NVIDIA GPUs:

Official NVIDIA GPU device plugin

The official NVIDIA GPU device plugin has the following requirements:

- Kubernetes nodes have to be pre-installed with NVIDIA drivers.
- Kubernetes nodes have to be pre-installed with nvidia-docker 2.0.
- nvidia-container-runtime must be configured as the default runtime for docker instead of runc.
- NVIDIA drivers ~= 361.93
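The default-runtime requirement is usually met through Docker's daemon configuration file, typically /etc/docker/daemon.json. The sketch below shows the shape of that configuration under stated assumptions: the helper `with_nvidia_default_runtime` is hypothetical, and the binary path /usr/bin/nvidia-container-runtime is an assumed install location that may differ on your nodes.

```python
import json


def with_nvidia_default_runtime(config, runtime_path="/usr/bin/nvidia-container-runtime"):
    """Return a copy of a Docker daemon config with nvidia as the default runtime.

    runtime_path is an assumed install location; adjust it for your nodes.
    """
    updated = dict(config)
    runtimes = dict(updated.get("runtimes", {}))
    runtimes["nvidia"] = {"path": runtime_path, "runtimeArgs": []}
    updated["runtimes"] = runtimes
    updated["default-runtime"] = "nvidia"
    return updated


# Start from an empty config; on a real node, load the existing file first
# so that unrelated settings are preserved.
print(json.dumps(with_nvidia_default_runtime({}), indent=2))
```

The printed JSON is what would go into the daemon configuration file; Docker has to be restarted before the change takes effect.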

To deploy the NVIDIA device plugin once your cluster is running and the above requirements are satisfied:

# For Kubernetes v1.8
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.8/nvidia-device-plugin.yml

# For Kubernetes v1.9
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.9/nvidia-device-plugin.yml

Report issues with this device plugin to NVIDIA/k8s-device-plugin.

NVIDIA GPU device plugin used by GCE

The NVIDIA GPU device plugin used by GCE doesn’t require using nvidia-docker and should work with any container runtime that is compatible with the Kubernetes Container Runtime Interface (CRI). It’s tested on Container-Optimized OS and has experimental code for Ubuntu from 1.9 onwards.

On your 1.12 cluster, you can use the following commands to install the NVIDIA drivers and device plugin:

# Install NVIDIA drivers on Container-Optimized OS:
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml

# Install NVIDIA drivers on Ubuntu (experimental):
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/ubuntu/daemonset.yaml

# Install the device plugin:
kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.12/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml

Report issues with this device plugin and installation method to GoogleCloudPlatform/container-engine-accelerators.

Instructions for using NVIDIA GPUs on GKE are available in the Google Kubernetes Engine documentation.

Clusters containing different types of NVIDIA GPUs

If different nodes in your cluster have different types of NVIDIA GPUs, then you can use Node Labels and Node Selectors to schedule pods to appropriate nodes.

For example:

# Label your nodes with the accelerator type they have.
kubectl label nodes <node-with-k80> accelerator=nvidia-tesla-k80
kubectl label nodes <node-with-p100> accelerator=nvidia-tesla-p100

Specify the GPU type in the pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100 # or nvidia-tesla-k80 etc.
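A nodeSelector matches a node only when the node carries every listed label with exactly the same value. The sketch below illustrates that matching rule in Python; it is not the scheduler's implementation, and the function names and node names are hypothetical. The real scheduler additionally checks that the node has enough free nvidia.com/gpu capacity.

```python
def matches_node_selector(node_labels, node_selector):
    """True if the node has every selector key with an identical value."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


def eligible_nodes(nodes, node_selector):
    """Return the sorted names of nodes whose labels satisfy the selector."""
    return sorted(
        name for name, labels in nodes.items()
        if matches_node_selector(labels, node_selector)
    )


# Hypothetical cluster: two labeled GPU nodes and one unlabeled node.
nodes = {
    "node-a": {"accelerator": "nvidia-tesla-k80"},
    "node-b": {"accelerator": "nvidia-tesla-p100"},
    "node-c": {},
}

print(eligible_nodes(nodes, {"accelerator": "nvidia-tesla-p100"}))
```

With these labels, only node-b is eligible for a pod that selects accelerator: nvidia-tesla-p100.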