When you use GPU ECS instances with Alibaba Cloud Container Service to build a Kubernetes cluster for AI training, you often need to know how each Pod is using its GPUs: per-GPU memory usage, GPU utilization, GPU temperature, and so on. This article describes how to quickly build a GPU monitoring solution on Alibaba Cloud based on Prometheus + Grafana.
Prometheus
Prometheus is an open-source service monitoring system and time-series database. Development started in 2012, and since the project was open-sourced on GitHub in 2015 it has attracted more than 9k stars. In 2016 Prometheus became the second project, after Kubernetes, to join the CNCF (Cloud Native Computing Foundation), and it graduated from the CNCF in August 2018.
As a new-generation open-source monitoring solution, many of its design ideas align closely with Google's SRE approach to operations.
Prerequisite: you have already created a Kubernetes cluster with GPU ECS instances through Alibaba Cloud Container Service. For detailed steps, see: "Trying out Alibaba Cloud Container Service Kubernetes 1.9: a new way to embrace GPUs".
Log in to the Container Service console, choose [Container Service - Kubernetes], then click [Applications --> Deployments --> Create from Template]:
Select your GPU cluster and a namespace (kube-system is a reasonable choice), then paste the YAML for deploying Prometheus and the GPU exporter into the template field below.
Deploy Prometheus
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-env
data:
  storage-retention: 360h
  storage-memory-chunks: '1048576'
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: ["", "extensions", "apps"]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  - deployments
  - services
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system  # change this if you deploy into another namespace
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      name: prometheus
      labels:
        app: prometheus
    spec:
      serviceAccount: prometheus
      serviceAccountName: prometheus
      containers:
      - name: prometheus
        image: registry.cn-hangzhou.aliyuncs.com/acs/prometheus:1.1.1
        args:
        - '-storage.local.retention=$(STORAGE_RETENTION)'
        - '-storage.local.memory-chunks=1048576'
        - '-config.file=/etc/prometheus/prometheus.yml'
        ports:
        - name: web
          containerPort: 9090
        env:
        - name: STORAGE_RETENTION
          valueFrom:
            configMapKeyRef:
              name: prometheus-env
              key: storage-retention
        - name: STORAGE_MEMORY_CHUNKS
          valueFrom:
            configMapKeyRef:
              name: prometheus-env
              key: storage-memory-chunks
        volumeMounts:
        - name: config-volume
          mountPath: /etc/prometheus
        - name: prometheus-data
          mountPath: /prometheus
      volumes:
      - name: config-volume
        configMap:
          name: prometheus-configmap
      - name: prometheus-data
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: prometheus-svc
    kubernetes.io/name: "Prometheus"
  name: prometheus-svc
spec:
  type: LoadBalancer
  selector:
    app: prometheus
  ports:
  - name: prometheus
    protocol: TCP
    port: 9090
    targetPort: 9090
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-configmap
data:
  prometheus.yml: |-
    rule_files:
    - "/etc/prometheus-rules/*.rules"
    scrape_configs:
    - job_name: kubernetes-service-endpoints
      honor_labels: false
      kubernetes_sd_configs:
      - api_servers:
        - 'https://kubernetes.default.svc'
        in_cluster: true
        role: endpoint
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: (.+)(?::\d+);(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_service_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
```
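The key step in this scrape config is the `__address__` relabel rule: it swaps the port in a discovered target's address for the port declared in the service's `prometheus.io/port` annotation. Prometheus joins the `source_labels` with `;` before matching the (anchored) regex. A minimal Python sketch of what that rule does, using the same regex and a sample address:

```python
import re

# Prometheus joins source_labels with ';':
#   "<__address__>;<prometheus.io/port annotation>"
joined = "10.0.0.1:10255;9445"

# Same regex as the relabel rule; Prometheus's "$1:$2" replacement
# corresponds to r"\1:\2" in Python. It matches the whole joined string here.
pattern = re.compile(r"(.+)(?::\d+);(\d+)")
new_address = pattern.sub(r"\1:\2", joined)
print(new_address)  # -> 10.0.0.1:9445
```

So a service annotated with `prometheus.io/port: "9445"` is scraped on that port regardless of the port the target address was discovered with.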
If you choose a namespace other than kube-system, you need to modify the namespace of the ServiceAccount subject bound by the ClusterRoleBinding in the YAML:
```yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system  # change this if you deploy into another namespace
```
Deploy the Prometheus GPU exporter
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-gpu-exporter
spec:
  selector:
    matchLabels:
      app: node-gpu-exporter
  template:
    metadata:
      labels:
        app: node-gpu-exporter
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: aliyun.accelerator/nvidia_count
                operator: Exists
      hostPID: true
      containers:
      - name: node-gpu-exporter
        image: registry.cn-hangzhou.aliyuncs.com/acs/gpu-prometheus-exporter:0.1-f48bc3c
        imagePullPolicy: Always
        env:
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: MY_NODE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: EXCLUDE_PODS
          value: $(MY_POD_NAME),nvidia-device-plugin-$(MY_NODE_NAME),nvidia-device-plugin-ctr
        - name: CADVISOR_URL
          value: http://$(MY_NODE_IP):10255
        ports:
        - containerPort: 9445
          hostPort: 9445
        resources:
          requests:
            memory: 30Mi
            cpu: 100m
          limits:
            memory: 50Mi
            cpu: 200m
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  name: node-gpu-exporter
  labels:
    app: node-gpu-exporter
    k8s-app: node-gpu-exporter
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: http-metrics
    port: 9445
    protocol: TCP
  selector:
    app: node-gpu-exporter
```
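The exporter serves metrics on port 9445 in the Prometheus text exposition format, which Prometheus scrapes via the annotated headless service above. To illustrate what a scrape looks like, here is a minimal parser sketch over a made-up sample; the metric and label names below are hypothetical, not necessarily the exporter's actual output:

```python
import re

# Hypothetical sample in Prometheus text exposition format.
sample = """\
# HELP gpu_utilization GPU duty cycle in percent
# TYPE gpu_utilization gauge
gpu_utilization{node_name="gpu-node-1",gpu_index="0"} 87
gpu_memory_used_bytes{node_name="gpu-node-1",gpu_index="0"} 7516192768
"""

# <metric_name>{<labels>} <value>
line_re = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')

metrics = {}
for line in sample.splitlines():
    if line.startswith("#") or not line.strip():
        continue  # skip HELP/TYPE comments and blank lines
    m = line_re.match(line)
    if m:
        name, labels, value = m.groups()
        metrics[(name, labels)] = float(value)

print(metrics)
```

Each scraped sample becomes a time series keyed by the metric name plus its label set, which is what lets Grafana later break GPU usage down per node, per card, and per Pod.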
Deploy Grafana
```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: monitoring-grafana
spec:
  replicas: 1
  template:
    metadata:
      labels:
        task: monitoring
        k8s-app: grafana
    spec:
      containers:
      - name: grafana
        image: registry.cn-hangzhou.aliyuncs.com/acs/grafana:5.0.4-gpu-monitoring
        ports:
        - containerPort: 3000
          protocol: TCP
      volumes:
      - name: grafana-storage
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: monitoring-grafana
spec:
  ports:
  - port: 80
    targetPort: 3000
  type: LoadBalancer
  selector:
    k8s-app: grafana
```
Node GPU monitoring
Pod GPU monitoring
If you are already using Arena (see "Arena: the right way to use Kubeflow"), you can submit a training job directly with arena.
```shell
arena submit tf --name=style-transfer \
    --gpus=1 \
    --workers=1 \
    --workerImage=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/neural-style:gpu \
    --workingDir=/neural-style \
    --ps=1 \
    --psImage=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/style-transfer:ps \
    "python neural_style.py --styles /neural-style/examples/1-style.jpg --iterations 1000000"

NAME:   style-transfer
LAST DEPLOYED: Thu Sep 20 14:34:55 2018
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1alpha2/TFJob
NAME                  AGE
style-transfer-tfjob  0s
```
Once the job is submitted successfully, the monitoring page shows the Pod's GPU metrics: you can see the Pods deployed through Arena and the GPU resources each Pod consumes.
At the node level you can also see the load on each node and GPU card; on the GPU node monitoring page you can select a specific node and GPU card.
Author: Xiao Yuan
This article is original content of the Yunqi Community (Alibaba Cloud) and may not be reproduced without permission.