When you use GPU ECS instances with Alibaba Cloud Container Service to build a Kubernetes cluster for AI training, you often need to know how each Pod is using its GPUs: per-GPU memory usage, GPU utilization, GPU temperature, and so on. This article describes how to quickly build a GPU monitoring solution on Alibaba Cloud based on Prometheus + Grafana.
Prometheus
Prometheus is an open-source service monitoring system and time-series database. Development started in 2012, and since the project was open-sourced on GitHub in 2015 it has attracted more than 9k stars. In 2016 Prometheus became the second project, after Kubernetes, to join the CNCF (Cloud Native Computing Foundation), and it graduated from the CNCF in August 2018.
As a new-generation open-source monitoring solution, many of its design ideas align closely with Google's SRE approach to operations.
Prerequisite: you have already created a Kubernetes cluster with GPU ECS instances through Alibaba Cloud Container Service. For detailed steps, see: "Trying out Alibaba Cloud Container Service Kubernetes 1.9: a new way to embrace GPUs".
Log in to the Container Service console, choose [Container Service - Kubernetes], then click [Applications --> Deployments --> Create from Template]:
Select your GPU cluster and a namespace (kube-system is a reasonable choice), then paste the YAML for deploying Prometheus and the GPU exporter into the template field below.
Deploy Prometheus
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-env
data:
  storage-retention: 360h
  storage-memory-chunks: '1048576'
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: ["", "extensions", "apps"]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  - deployments
  - services
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system  # change this if you deploy into another namespace
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      name: prometheus
      labels:
        app: prometheus
    spec:
      serviceAccount: prometheus
      serviceAccountName: prometheus
      containers:
      - name: prometheus
        image: registry.cn-hangzhou.aliyuncs.com/acs/prometheus:1.1.1
        args:
        - '-storage.local.retention=$(STORAGE_RETENTION)'
        - '-storage.local.memory-chunks=1048576'
        - '-config.file=/etc/prometheus/prometheus.yml'
        ports:
        - name: web
          containerPort: 9090
        env:
        - name: STORAGE_RETENTION
          valueFrom:
            configMapKeyRef:
              name: prometheus-env
              key: storage-retention
        - name: STORAGE_MEMORY_CHUNKS
          valueFrom:
            configMapKeyRef:
              name: prometheus-env
              key: storage-memory-chunks
        volumeMounts:
        - name: config-volume
          mountPath: /etc/prometheus
        - name: prometheus-data
          mountPath: /prometheus
      volumes:
      - name: config-volume
        configMap:
          name: prometheus-configmap
      - name: prometheus-data
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: prometheus-svc
    kubernetes.io/name: "Prometheus"
  name: prometheus-svc
spec:
  type: LoadBalancer
  selector:
    app: prometheus
  ports:
  - name: prometheus
    protocol: TCP
    port: 9090
    targetPort: 9090
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-configmap
data:
  prometheus.yml: |-
    rule_files:
    - "/etc/prometheus-rules/*.rules"
    scrape_configs:
    - job_name: kubernetes-service-endpoints
      honor_labels: false
      kubernetes_sd_configs:
      - api_servers:
        - 'https://kubernetes.default.svc'
        in_cluster: true
        role: endpoint
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: (.+)(?::\d+);(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_service_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
```
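The key step in this scrape config is the `__address__` relabel rule: it swaps the port in a discovered target's address for the port declared in the service's `prometheus.io/port` annotation. Prometheus joins the `source_labels` with `;` before matching the (anchored) regex. A minimal Python sketch of what that rule does, using the same regex and a sample address:

```python
import re

# Prometheus joins source_labels with ';':
#   "<__address__>;<prometheus.io/port annotation>"
joined = "10.0.0.1:10255;9445"

# Same regex as the relabel rule; Prometheus's "$1:$2" replacement
# corresponds to r"\1:\2" in Python. It matches the whole joined string here.
pattern = re.compile(r"(.+)(?::\d+);(\d+)")
new_address = pattern.sub(r"\1:\2", joined)
print(new_address)  # -> 10.0.0.1:9445
```

So a service annotated with `prometheus.io/port: "9445"` is scraped on that port regardless of the port the target address was discovered with.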
If you choose a namespace other than kube-system, you need to modify the namespace of the ServiceAccount subject bound by the ClusterRoleBinding in the YAML:
```yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system  # change this if you deploy into another namespace
```
Deploy the Prometheus GPU exporter
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-gpu-exporter
spec:
  selector:
    matchLabels:
      app: node-gpu-exporter
  template:
    metadata:
      labels:
        app: node-gpu-exporter
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: aliyun.accelerator/nvidia_count
                operator: Exists
      hostPID: true
      containers:
      - name: node-gpu-exporter
        image: registry.cn-hangzhou.aliyuncs.com/acs/gpu-prometheus-exporter:0.1-f48bc3c
        imagePullPolicy: Always
        env:
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: MY_NODE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: EXCLUDE_PODS
          value: $(MY_POD_NAME),nvidia-device-plugin-$(MY_NODE_NAME),nvidia-device-plugin-ctr
        - name: CADVISOR_URL
          value: http://$(MY_NODE_IP):10255
        ports:
        - containerPort: 9445
          hostPort: 9445
        resources:
          requests:
            memory: 30Mi
            cpu: 100m
          limits:
            memory: 50Mi
            cpu: 200m
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  name: node-gpu-exporter
  labels:
    app: node-gpu-exporter
    k8s-app: node-gpu-exporter
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 9445
    protocol: TCP
  selector:
    app: node-gpu-exporter
```
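The exporter serves metrics on port 9445 in the Prometheus text exposition format, which Prometheus scrapes via the annotated headless service above. To illustrate what a scrape looks like, here is a minimal parser sketch over a made-up sample; the metric and label names below are hypothetical, not necessarily the exporter's actual output:

```python
import re

# Hypothetical sample in Prometheus text exposition format.
sample = """\
# HELP gpu_utilization GPU duty cycle in percent
# TYPE gpu_utilization gauge
gpu_utilization{node_name="gpu-node-1",gpu_index="0"} 87
gpu_memory_used_bytes{node_name="gpu-node-1",gpu_index="0"} 7516192768
"""

# <metric_name>{<labels>} <value>
line_re = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')

metrics = {}
for line in sample.splitlines():
    if line.startswith("#") or not line.strip():
        continue  # skip HELP/TYPE comments and blank lines
    m = line_re.match(line)
    if m:
        name, labels, value = m.groups()
        metrics[(name, labels)] = float(value)

print(metrics)
```

Each scraped sample becomes a time series keyed by the metric name plus its label set, which is what lets Grafana later break GPU usage down per node, per card, and per Pod.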
Deploy Grafana
```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: monitoring-grafana
spec:
  replicas: 1
  template:
    metadata:
      labels:
        task: monitoring
        k8s-app: grafana
    spec:
      containers:
      - name: grafana
        image: registry.cn-hangzhou.aliyuncs.com/acs/grafana:5.0.4-gpu-monitoring
        ports:
        - containerPort: 3000
          protocol: TCP
      volumes:
      - name: grafana-storage
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: monitoring-grafana
spec:
  ports:
  - port: 80
    targetPort: 3000
  type: LoadBalancer
  selector:
    k8s-app: grafana
```
Node GPU monitoring
Pod GPU monitoring
If you are already using Arena (see "Arena: the right way to use Kubeflow"), you can submit a training job directly with arena.
```shell
arena submit tf --name=style-transfer \
    --gpus=1 \
    --workers=1 \
    --workerImage=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/neural-style:gpu \
    --workingDir=/neural-style \
    --ps=1 \
    --psImage=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/style-transfer:ps \
    "python neural_style.py --styles /neural-style/examples/1-style.jpg --iterations 1000000"

NAME:   style-transfer
LAST DEPLOYED: Thu Sep 20 14:34:55 2018
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1alpha2/TFJob
NAME                  AGE
style-transfer-tfjob  0s
```
Once the job is submitted successfully, the monitoring page shows the Pod's GPU metrics: you can see the Pods deployed through Arena and the GPU resources each Pod consumes.
At the node level you can also see the load on each node and GPU card; on the GPU node monitoring page you can select a specific node and GPU card.
Author: Xiao Yuan
This article is original content of the Yunqi Community (Alibaba Cloud) and may not be reproduced without permission.