Serverless for AI Computing: Alibaba Cloud ACK Serverless/ECI Releases GPU Container Instances

Date: 2019-12-06

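ACK Serverless (Serverless Kubernetes) recently launched official support for GPU container instances, built on ECI (Elastic Container Instance). Users can now run AI computing tasks quickly in a serverless fashion, which greatly reduces the operational burden of an AI platform and significantly improves overall computing efficiency.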

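It is an industry consensus that AI computing depends on GPUs. Building a GPU cluster from scratch, however, is a relatively complex job: it involves choosing and purchasing GPU specifications, preparing machines, installing drivers, and installing the container environment. Delivering GPU resources in a serverless way demonstrates the core advantage of serverless: it offers users standardized, out-of-the-box resource provisioning. Users need not buy machines or log in to nodes to install GPU drivers, which greatly reduces the deployment complexity of an AI platform and lets customers focus on their AI models and applications rather than on building and maintaining infrastructure. Using GPU/CPU resources becomes as simple as turning on a tap, and pay-as-you-go billing lets customers pay per computing task, avoiding the high cost and wasted resources of yearly or monthly subscriptions.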

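Creating a pod with a GPU attached in ACK Serverless is also simple: specify the required GPU type in an annotation and the number of GPUs in resources.limits (an instance type can be specified instead). Each pod has exclusive use of its GPUs; vGPU is not yet supported. GPU instances are billed at the same rate as the corresponding ECS GPU instance types, with no additional fees. Alibaba Cloud ECI currently offers the following GPU specifications (for details, see https://help.aliyun.com/document_detail/114581.html):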

vCPU	Memory (GiB)	GPU type	GPU count
2	8.0	P4	1
4	16.0	P4	1
8	32.0	P4	1
16	64.0	P4	1
32	128.0	P4	2
56	224.0	P4	4
8	32.0	V100	1
32	128.0	V100	4
64	256.0	V100	8
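
As a rough illustration of how to read the table above, the following Python sketch encodes its rows and picks the smallest row that satisfies a requested GPU type and count. The names `GpuSpec`, `ECI_GPU_SPECS`, and `smallest_spec` are hypothetical helpers invented for this example; the article does not describe how ECI itself maps a request to a specification, so this is only a lookup aid.

```python
from typing import NamedTuple, Optional


class GpuSpec(NamedTuple):
    vcpu: int
    memory_gib: float
    gpu_type: str
    gpu_count: int


# Rows copied from the specification table above.
ECI_GPU_SPECS = [
    GpuSpec(2, 8.0, "P4", 1),
    GpuSpec(4, 16.0, "P4", 1),
    GpuSpec(8, 32.0, "P4", 1),
    GpuSpec(16, 64.0, "P4", 1),
    GpuSpec(32, 128.0, "P4", 2),
    GpuSpec(56, 224.0, "P4", 4),
    GpuSpec(8, 32.0, "V100", 1),
    GpuSpec(32, 128.0, "V100", 4),
    GpuSpec(64, 256.0, "V100", 8),
]


def smallest_spec(gpu_type: str, gpu_count: int) -> Optional[GpuSpec]:
    """Return the row with the fewest GPUs, then the fewest vCPUs, that has
    the requested GPU type and at least gpu_count GPUs; None if no row fits."""
    candidates = [
        spec
        for spec in ECI_GPU_SPECS
        if spec.gpu_type == gpu_type and spec.gpu_count >= gpu_count
    ]
    return min(candidates, key=lambda s: (s.gpu_count, s.vcpu), default=None)


if __name__ == "__main__":
    print(smallest_spec("P4", 1))
    print(smallest_spec("V100", 2))
```

Note that the table has no V100 row with exactly two GPUs, so a request for two V100s resolves to the four-GPU row in this sketch.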

Below, a simple image recognition example shows how to run a deep learning job quickly in ACK Serverless.

Create a Serverless Kubernetes cluster

Image recognition with TensorFlow

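For humans, recognizing this image could hardly be easier. For a machine it is no light task: it depends on large amounts of input data and on training a model. Below we use an existing TensorFlow model to recognize the image above.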

Here we use TensorFlow's introductory example.
The image registry-vpc.cn-hangzhou.aliyuncs.com/ack-serverless/tensorflow is built on the official TensorFlow image tensorflow/tensorflow:1.13.1-gpu-py3 and already contains the models repository required by the example: https://github.com/tensorflow/models

Create the pod from a template in the serverless cluster console, or deploy the following YAML file with kubectl. The pod specifies a GPU type of P4 and a GPU count of 1.

apiVersion: v1
kind: Pod
metadata:
  name: tensorflow
  annotations:
    k8s.aliyun.com/eci-gpu-type: "P4"
spec:
  containers:
  - image: registry-vpc.cn-hangzhou.aliyuncs.com/ack-serverless/tensorflow
    name: tensorflow
    command:
    - "sh"
    - "-c"
    - "python models/tutorials/image/imagenet/classify_image.py"
    resources:
      limits:
        nvidia.com/gpu: "1"
  restartPolicy: OnFailure
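
When many such jobs need to be submitted with different GPU types or counts, the manifest can also be generated programmatically. The sketch below is a minimal illustration that builds the same pod as a Python dictionary and prints it as JSON, which kubectl accepts in place of YAML. The function name `build_gpu_pod` and its parameters are hypothetical and not part of any Alibaba Cloud SDK; only the annotation key, the resource name, and the image come from the manifest above.

```python
import json


def build_gpu_pod(name, image, command, gpu_type="P4", gpu_count=1):
    """Build a Pod manifest equivalent to the YAML above (hypothetical helper)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # GPU type is requested through the ECI annotation.
            "annotations": {"k8s.aliyun.com/eci-gpu-type": gpu_type},
        },
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "command": ["sh", "-c", command],
                    # GPU count is requested through resources.limits.
                    "resources": {"limits": {"nvidia.com/gpu": str(gpu_count)}},
                }
            ],
            "restartPolicy": "OnFailure",
        },
    }


if __name__ == "__main__":
    pod = build_gpu_pod(
        name="tensorflow",
        image="registry-vpc.cn-hangzhou.aliyuncs.com/ack-serverless/tensorflow",
        command="python models/tutorials/image/imagenet/classify_image.py",
    )
    print(json.dumps(pod, indent=2))
```

Assuming the script is saved as build_pod.py (a file name chosen for this example), its output can be piped to the cluster with `python build_pod.py | kubectl apply -f -`.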

Create the pod, wait for it to finish, then view its logs:

# kubectl get pod -a
NAME         READY     STATUS      RESTARTS   AGE
tensorflow   0/1       Completed   0          6m


# kubectl logs tensorflow
>> Downloading inception-2015-12-05.WARNING:tensorflow:From models/tutorials/image/imagenet/classify_image.py:141: __init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.gfile.GFile.
2019-05-05 09:43:30.591730: W tensorflow/core/framework/op_def_util.cc:355] Op BatchNormWithGlobalNormalization is deprecated. It will cease to work in GraphDef version 9. Use tf.nn.batch_normalization().
2019-05-05 09:43:30.806869: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-05 09:43:31.075142: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-05 09:43:31.075725: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4525ce0 executing computations on platform CUDA. Devices:
2019-05-05 09:43:31.075785: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla P4, Compute Capability 6.1
2019-05-05 09:43:31.078667: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2494220000 Hz
2019-05-05 09:43:31.078953: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4ad0660 executing computations on platform Host. Devices:
2019-05-05 09:43:31.078980: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-05-05 09:43:31.079294: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla P4 major: 6 minor: 1 memoryClockRate(GHz): 1.1135
pciBusID: 0000:00:08.0
totalMemory: 7.43GiB freeMemory: 7.31GiB
2019-05-05 09:43:31.079327: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-05 09:43:31.081074: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-05 09:43:31.081104: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2019-05-05 09:43:31.081116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2019-05-05 09:43:31.081379: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7116 MB memory) -> physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:00:08.0, compute capability: 6.1)
2019-05-05 09:43:32.200163: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
>> Downloading inception-2015-12-05.tgz 100.0%
Successfully downloaded inception-2015-12-05.tgz 88931400 bytes.
giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca (score = 0.89107)
indri, indris, Indri indri, Indri brevicaudatus (score = 0.00779)
lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens (score = 0.00296)
custard apple (score = 0.00147)
earthstar (score = 0.00117)
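
If the classification results need to be consumed by another program, the prediction lines can be picked out of the log text. The following sketch assumes the `label (score = value)` line format seen in the output above; `parse_predictions` is a hypothetical helper written for this example, and the log text would in practice come from `kubectl logs tensorflow`.

```python
import re
from typing import List, Tuple

# Matches lines such as "custard apple (score = 0.00147)".
PREDICTION_RE = re.compile(r"^(?P<labels>.+) \(score = (?P<score>\d+\.\d+)\)$")


def parse_predictions(log_text: str) -> List[Tuple[str, float]]:
    """Extract (labels, score) pairs from classify_image.py log output."""
    results = []
    for line in log_text.splitlines():
        match = PREDICTION_RE.match(line.strip())
        if match:
            results.append((match.group("labels"), float(match.group("score"))))
    return results


if __name__ == "__main__":
    # Sample lines copied from the log above.
    sample = """Successfully downloaded inception-2015-12-05.tgz 88931400 bytes.
giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca (score = 0.89107)
indri, indris, Indri indri, Indri brevicaudatus (score = 0.00779)
custard apple (score = 0.00147)"""
    predictions = parse_predictions(sample)
    best_labels, best_score = max(predictions, key=lambda p: p[1])
    print(best_labels, best_score)
```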

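The pod logs show that the model successfully identified the image as a panda. Throughout this machine learning computation we ran only a single pod, and the task was complete once the pod reached the terminated state. There was no ECS environment to prepare, no GPU machine to buy, no NVIDIA GPU driver to install, and no Docker software to install; computing power is consumed on demand, like water and electricity.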

Finally

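Virtual nodes in ACK also support GPUs through ECI, and they are used the same way as in ACK Serverless (except that the pod must be scheduled onto the virtual node, or created in a namespace labeled virtual-node-affinity-injection=enabled). The virtual node approach can support a wider range of deep learning frameworks more flexibly, such as Kubeflow, Arena, or other custom CRDs.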

An example follows:

apiVersion: v1
kind: Pod
metadata:
  name: tensorflow
  annotations:
    k8s.aliyun.com/eci-gpu-type: "P4"
spec:
  containers:
  - image: registry-vpc.cn-hangzhou.aliyuncs.com/ack-serverless/tensorflow
    name: tensorflow
    command:
    - "sh"
    - "-c"
    - "python models/tutorials/image/imagenet/classify_image.py"
    resources:
      limits:
        nvidia.com/gpu: "1"
  restartPolicy: OnFailure
  nodeName: virtual-kubelet
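
The alternative mentioned above, creating pods in a namespace that carries the virtual-node-affinity-injection=enabled label, can be sketched the same way. The snippet below builds such a Namespace manifest as JSON; the helper name `build_injection_namespace` and the namespace name `gpu-jobs` are assumptions made for this example, and only the label key and value come from the article.

```python
import json


def build_injection_namespace(name):
    """Build a Namespace manifest carrying the label named in the article
    (hypothetical helper)."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {"virtual-node-affinity-injection": "enabled"},
        },
    }


if __name__ == "__main__":
    # "gpu-jobs" is a placeholder namespace name.
    print(json.dumps(build_injection_namespace("gpu-jobs"), indent=2))
```

A pod created in such a namespace would then not need the nodeName field shown in the manifest above, according to the article's description.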

Author: 賢維

This article is original content from the Yunqi Community and may not be reproduced without permission.