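ACK Serverless (Serverless Kubernetes) recently launched official support for GPU container instances, built on ECI (Elastic Container Instance). Users can run AI computing tasks quickly in a serverless fashion, which greatly reduces the operations burden of an AI platform and significantly improves overall computing efficiency.
It is an industry consensus that AI computing depends on GPUs, but building a GPU cluster from scratch is a relatively complex task: choosing and purchasing GPU specifications, preparing machines, installing drivers, installing the container environment, and so on. Delivering GPU resources in a serverless way fully demonstrates the core advantage of serverless: it gives users standardized, out-of-the-box resource provisioning. Users do not need to buy machines or log in to nodes to install GPU drivers, which greatly reduces the deployment complexity of an AI platform and lets customers focus on the AI models and applications themselves rather than on building and maintaining infrastructure. Using GPU/CPU resources becomes as simple as turning on a tap, and pay-as-you-go billing lets customers pay per computing task, avoiding the high cost and wasted resources of yearly or monthly subscriptions.
Creating a pod with a GPU attached in ACK Serverless is also very simple: specify the required GPU type through an annotation and the number of GPUs in resources.limits (you can also specify an instance-type instead). Each pod has exclusive use of its GPUs; vGPU is not yet supported. GPU instances are billed at the same rates as ECS GPU instance types, with no additional charge. Alibaba Cloud ECI currently offers the following GPU specifications: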
vCPU | Memory (GiB) | GPU type | GPU count |
2 | 8.0 | P4 | 1 |
4 | 16.0 | P4 | 1 |
8 | 32.0 | P4 | 1 |
16 | 64.0 | P4 | 1 |
32 | 128.0 | P4 | 2 |
56 | 224.0 | P4 | 4 |
8 | 32.0 | V100 | 1 |
32 | 128.0 | V100 | 4 |
64 | 256.0 | V100 | 8 |
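To make the table easier to use, here is a minimal Python sketch that picks the smallest listed specification satisfying a request. The rows are copied from the table above; the `GpuSpec` type and the `smallest_spec` helper are hypothetical names introduced only for illustration and are not part of any Alibaba Cloud SDK.

```python
from typing import NamedTuple, Optional


class GpuSpec(NamedTuple):
    vcpu: int
    memory_gib: float
    gpu_type: str
    gpu_count: int


# Rows copied from the specification table above.
ECI_GPU_SPECS = [
    GpuSpec(2, 8.0, "P4", 1),
    GpuSpec(4, 16.0, "P4", 1),
    GpuSpec(8, 32.0, "P4", 1),
    GpuSpec(16, 64.0, "P4", 1),
    GpuSpec(32, 128.0, "P4", 2),
    GpuSpec(56, 224.0, "P4", 4),
    GpuSpec(8, 32.0, "V100", 1),
    GpuSpec(32, 128.0, "V100", 4),
    GpuSpec(64, 256.0, "V100", 8),
]


def smallest_spec(
    gpu_type: str,
    gpu_count: int,
    min_vcpu: int = 0,
    min_memory_gib: float = 0.0,
) -> Optional[GpuSpec]:
    """Return the smallest listed spec that meets the request, or None."""
    candidates = [
        spec
        for spec in ECI_GPU_SPECS
        if spec.gpu_type == gpu_type
        and spec.gpu_count >= gpu_count
        and spec.vcpu >= min_vcpu
        and spec.memory_gib >= min_memory_gib
    ]
    # Prefer fewer GPUs first, then fewer vCPUs, then less memory.
    return min(
        candidates,
        key=lambda spec: (spec.gpu_count, spec.vcpu, spec.memory_gib),
        default=None,
    )


if __name__ == "__main__":
    print(smallest_spec("P4", 1))
    print(smallest_spec("V100", 2))
```

Note that a request for two V100 GPUs resolves to the 4-GPU row, because the table lists no 2-GPU V100 specification.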
The simple image-recognition example below shows how to run a deep learning task quickly in ACK Serverless.
apiVersion: v1
kind: Pod
metadata:
  name: tensorflow
  annotations:
    : "P4"
spec:
  containers:
  - image:
    name: tensorflow
    command:
    - "sh"
    - "-c"
    - "python models/tutorials/image/imagenet/"
    resources:
      limits: "1"
  restartPolicy: OnFailure
# kubectl get pod -a
NAME         READY   STATUS      RESTARTS   AGE
tensorflow   0/1     Completed   0          6m
# kubectl logs tensorflow
>> Downloading inception-2015-12-05.WARNING:tensorflow:From models/tutorials/image/imagenet/ __init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.gfile.GFile.
2019-05-05 09:43:30.591730: W tensorflow/core/framework/] Op BatchNormWithGlobalNormalization is deprecated. It will cease to work in GraphDef version 9. Use tf.nn.batch_normalization().
2019-05-05 09:43:30.806869: I tensorflow/core/platform/] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-05 09:43:31.075142: I tensorflow/stream_executor/cuda/] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-05 09:43:31.075725: I tensorflow/compiler/xla/service/] XLA service 0x4525ce0 executing computations on platform CUDA. Devices:
2019-05-05 09:43:31.075785: I tensorflow/compiler/xla/service/] StreamExecutor device (0): Tesla P4, Compute Capability 6.1
2019-05-05 09:43:31.078667: I tensorflow/core/platform/profile_utils/] CPU Frequency: 2494220000 Hz
2019-05-05 09:43:31.078953: I tensorflow/compiler/xla/service/] XLA service 0x4ad0660 executing computations on platform Host. Devices:
2019-05-05 09:43:31.078980: I tensorflow/compiler/xla/service/] StreamExecutor device (0): <undefined>, <undefined>
2019-05-05 09:43:31.079294: I tensorflow/core/common_runtime/gpu/] Found device 0 with properties:
name: Tesla P4 major: 6 minor: 1 memoryClockRate(GHz): 1.1135
pciBusID: 0000:00:08.0
totalMemory: 7.43GiB freeMemory: 7.31GiB
2019-05-05 09:43:31.079327: I tensorflow/core/common_runtime/gpu/] Adding visible gpu devices: 0
2019-05-05 09:43:31.081074: I tensorflow/core/common_runtime/gpu/] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-05 09:43:31.081104: I tensorflow/core/common_runtime/gpu/] 0
2019-05-05 09:43:31.081116: I tensorflow/core/common_runtime/gpu/] 0: N
2019-05-05 09:43:31.081379: I tensorflow/core/common_runtime/gpu/] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7116 MB memory) -> physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:00:08.0, compute capability: 6.1)
2019-05-05 09:43:32.200163: I tensorflow/stream_executor/] successfully opened CUDA library locally
>> Downloading inception-2015-12-05.tgz 100.0%
Successfully downloaded inception-2015-12-05.tgz 88931400 bytes.
giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca (score = 0.89107)
indri, indris, Indri indri, Indri brevicaudatus (score = 0.00779)
lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens (score = 0.00296)
custard apple (score = 0.00147)
earthstar (score = 0.00117)
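The pod log shows that the model successfully identified the image as a panda. Throughout the whole machine learning computation we only ran a single pod, and the task was complete once the pod reached the terminated state. There was no ECS environment to prepare, no GPU machine to purchase, no NVIDIA GPU driver to install, and no Docker software to install. Computing power is consumed on demand, like water and electricity.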
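If the classification result needs to be consumed by another program rather than read by eye, the prediction lines can be extracted from the pod log. The following is a minimal sketch that assumes each prediction is printed on its own line in the form `labels (score = x.xxxxx)`, as in the log for this example; `parse_predictions` is a hypothetical helper, and `SAMPLE_LOG` is an excerpt of that log.

```python
import re
from typing import List, Tuple

# Matches lines such as "custard apple (score = 0.00147)".
PREDICTION_RE = re.compile(r"^(?P<labels>.+) \(score = (?P<score>\d+\.\d+)\)$")


def parse_predictions(log_text: str) -> List[Tuple[str, float]]:
    """Return (labels, score) pairs for every prediction line in the log."""
    predictions = []
    for line in log_text.splitlines():
        match = PREDICTION_RE.match(line.strip())
        if match:
            predictions.append((match.group("labels"), float(match.group("score"))))
    return predictions


# Excerpt of the pod log from this example.
SAMPLE_LOG = """\
Successfully downloaded inception-2015-12-05.tgz 88931400 bytes.
giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca (score = 0.89107)
indri, indris, Indri indri, Indri brevicaudatus (score = 0.00779)
custard apple (score = 0.00147)
"""

if __name__ == "__main__":
    # Pick the prediction with the highest score.
    top_label, top_score = max(parse_predictions(SAMPLE_LOG), key=lambda p: p[1])
    print(top_label, top_score)
```

The TensorFlow diagnostic lines do not end in a `(score = ...)` suffix, so they are skipped without any extra filtering.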
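Virtual nodes in ACK also support GPUs through ECI, and usage is the same as in ACK Serverless, except that the pod must either be scheduled onto the virtual node explicitly or be created in a namespace that carries the virtual-node-affinity-injection=enabled label. The virtual-node approach supports a wider range of deep learning frameworks more flexibly, such as Kubeflow, Arena, or other custom CRDs.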
apiVersion: v1
kind: Pod
metadata:
  name: tensorflow
  annotations:
    : "P4"
spec:
  containers:
  - image:
    name: tensorflow
    command:
    - "sh"
    - "-c"
    - "python models/tutorials/image/imagenet/"
    resources:
      limits: "1"
  restartPolicy: OnFailure
  nodeName: virtual-kubelet
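The two ways of sending a pod to the virtual node can be modelled locally as a simple check. This is only a sketch of the rule described above, not a call into Kubernetes: the cluster makes the real scheduling decision. The helpers `pin_to_virtual_node` and `targets_virtual_node` are hypothetical, and the pod dictionary is deliberately minimal, leaving out the image and GPU fields.

```python
import copy

# nodeName value taken from the manifest above.
VIRTUAL_NODE_NAME = "virtual-kubelet"
# Namespace label named in the text above.
INJECTION_LABEL_KEY = "virtual-node-affinity-injection"
INJECTION_LABEL_VALUE = "enabled"


def pin_to_virtual_node(pod: dict) -> dict:
    """Return a copy of the pod manifest with spec.nodeName set explicitly."""
    pinned = copy.deepcopy(pod)
    pinned.setdefault("spec", {})["nodeName"] = VIRTUAL_NODE_NAME
    return pinned


def targets_virtual_node(pod: dict, namespace_labels: dict) -> bool:
    """Local model of the rule: explicit nodeName, or a labelled namespace."""
    explicit = pod.get("spec", {}).get("nodeName") == VIRTUAL_NODE_NAME
    injected = namespace_labels.get(INJECTION_LABEL_KEY) == INJECTION_LABEL_VALUE
    return explicit or injected


if __name__ == "__main__":
    pod = {"metadata": {"name": "tensorflow"}, "spec": {"restartPolicy": "OnFailure"}}
    print(targets_virtual_node(pod, {}))
    print(targets_virtual_node(pin_to_virtual_node(pod), {}))
    print(targets_virtual_node(pod, {INJECTION_LABEL_KEY: INJECTION_LABEL_VALUE}))
```

Setting nodeName keeps the choice visible in each manifest, while the namespace label applies to every pod created in that namespace without editing individual manifests.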
This article is original content from the Yunqi Community and may not be reproduced without permission.