編譯 gpushare-device-plugin

編譯 gpushare-device-plugin

手動編譯可執行文件(探索)

gpushare-device-plugin/Dockerfile 中明確填了編譯的命令html

# 第一個鏡像(編譯)
FROM golang:1.10-stretch as build

WORKDIR /go/src/github.com/AliyunContainerService/gpushare-device-plugin
COPY . .

# 編譯 gpushare-device-plugin-v2
RUN export CGO_LDFLAGS_ALLOW='-Wl,--unresolved-symbols=ignore-in-object-files' && \
    go build -ldflags="-s -w" -o /go/bin/gpushare-device-plugin-v2 cmd/nvidia/main.go

# 編譯 kubectl-inspect-gpushare-v2
RUN go build -o /go/bin/kubectl-inspect-gpushare-v2 cmd/inspect/*.go

# 第二個鏡像(製做)
FROM debian:stretch-slim

# https://www.cnblogs.com/oolo/p/11679733.html#%E4%BD%BF%E7%94%A8-nvidia-docker2
# Device plugin 鏡像可以探測到全部的 GPU 設備
ENV NVIDIA_VISIBLE_DEVICES=all
# 只加載 NV Driver 中 nvidia-smi,MPS 等 cli 和 lib,其餘的不映射進去
ENV NVIDIA_DRIVER_CAPABILITIES=utility

# 將編譯的代碼放入容器中
COPY --from=build /go/bin/gpushare-device-plugin-v2 /usr/bin/gpushare-device-plugin-v2

COPY --from=build /go/bin/kubectl-inspect-gpushare-v2 /usr/bin/kubectl-inspect-gpushare-v2

# Device plugin 的入口函數
CMD ["gpushare-device-plugin-v2","-logtostderr"]

編譯 gpushare-device-plugin-v2node

(nni) root@ubuntu:/home/lihao/codes/gpushare-device-plugin# export CGO_LDFLAGS_ALLOW='-Wl,--unresolved-symbols=ignore-in-object-files' && \
>     go build -ldflags="-s -w" -o /go/bin/gpushare-device-plugin-v2 cmd/nvidia/main.go
cmd/nvidia/main.go:6:2: cannot find package "github.com/AliyunContainerService/gpushare-device-plugin/pkg/gpu/nvidia" in any of:
    /usr/local/go/src/github.com/AliyunContainerService/gpushare-device-plugin/pkg/gpu/nvidia (from $GOROOT)
    /opt/ml/bin/work/src/github.com/AliyunContainerService/gpushare-device-plugin/pkg/gpu/nvidia (from $GOPATH)
    /home/lihao/codes/jarvis-http-server/src/github.com/AliyunContainerService/gpushare-device-plugin/pkg/gpu/nvidia

由此可知,咱們須要將代碼放置到有 github.com/AliyunContainerService 結構的目錄下才能編譯,因而:linux

# 建立目錄結構
(nni) root@ubuntu:/home/lihao/codes# mkdir -p /opt/ml/bin/work/src/github.com/AliyunContainerService/

# 創建軟鏈接
(nni) root@ubuntu:/home/lihao/codes# ln -s /home/lihao/codes/gpushare-device-plugin /opt/ml/bin/work/src/github.com/AliyunContainerService/

# 再次編譯
(nni) root@ubuntu:/home/lihao/codes/gpushare-device-plugin# export CGO_LDFLAGS_ALLOW='-Wl,--unresolved-symbols=ignore-in-object-files' &&     go build -ldflags="-s -w" -o /go/bin/gpushare-device-plugin-v2 cmd/nvidia/main.go
(nni) root@ubuntu:/home/lihao/codes/gpushare-device-plugin# ll /go/bin/gpushare-device-plugin-v2
-rwxr-xr-x 1 root root 28353928 10月 16 10:09 /go/bin/gpushare-device-plugin-v2*

編譯 kubectl-inspect-gpushare-v2git

(nni) root@ubuntu:/home/lihao/codes/gpushare-device-plugin# go build -o /go/bin/kubectl-inspect-gpushare-v2 cmd/inspect/*.go
cmd/inspect/display.go:11:2: cannot find package "k8s.io/api/core/v1" in any of:
    /usr/local/go/src/k8s.io/api/core/v1 (from $GOROOT)
    /opt/ml/bin/work/src/k8s.io/api/core/v1 (from $GOPATH)
    /home/lihao/codes/jarvis-http-server/src/k8s.io/api/core/v1
cmd/inspect/nodeinfo.go:8:2: cannot find package "k8s.io/apimachinery/pkg/apis/meta/v1" in any of:
    /usr/local/go/src/k8s.io/apimachinery/pkg/apis/meta/v1 (from $GOROOT)
    /opt/ml/bin/work/src/k8s.io/apimachinery/pkg/apis/meta/v1 (from $GOPATH)
    /home/lihao/codes/jarvis-http-server/src/k8s.io/apimachinery/pkg/apis/meta/v1
cmd/inspect/podinfo.go:13:2: cannot find package "k8s.io/apimachinery/pkg/fields" in any of:
    /usr/local/go/src/k8s.io/apimachinery/pkg/fields (from $GOROOT)
    /opt/ml/bin/work/src/k8s.io/apimachinery/pkg/fields (from $GOPATH)
    /home/lihao/codes/jarvis-http-server/src/k8s.io/apimachinery/pkg/fields
cmd/inspect/podinfo.go:14:2: cannot find package "k8s.io/apimachinery/pkg/labels" in any of:
    /usr/local/go/src/k8s.io/apimachinery/pkg/labels (from $GOROOT)
    /opt/ml/bin/work/src/k8s.io/apimachinery/pkg/labels (from $GOPATH)
    /home/lihao/codes/jarvis-http-server/src/k8s.io/apimachinery/pkg/labels
cmd/inspect/podinfo.go:15:2: cannot find package "k8s.io/client-go/kubernetes" in any of:
    /usr/local/go/src/k8s.io/client-go/kubernetes (from $GOROOT)
    /opt/ml/bin/work/src/k8s.io/client-go/kubernetes (from $GOPATH)
    /home/lihao/codes/jarvis-http-server/src/k8s.io/client-go/kubernetes
cmd/inspect/podinfo.go:16:2: cannot find package "k8s.io/client-go/rest" in any of:
    /usr/local/go/src/k8s.io/client-go/rest (from $GOROOT)
    /opt/ml/bin/work/src/k8s.io/client-go/rest (from $GOPATH)
    /home/lihao/codes/jarvis-http-server/src/k8s.io/client-go/rest
cmd/inspect/podinfo.go:17:2: cannot find package "k8s.io/client-go/tools/clientcmd" in any of:
    /usr/local/go/src/k8s.io/client-go/tools/clientcmd (from $GOROOT)
    /opt/ml/bin/work/src/k8s.io/client-go/tools/clientcmd (from $GOPATH)
    /home/lihao/codes/jarvis-http-server/src/k8s.io/client-go/tools/clientcmd

kubectl-inspect-gpushare-v2 依賴的包並不在 GOPATH 中,所以報錯,咱們能夠用 go get XXX 來解決依賴github

(nni) root@ubuntu:/home/lihao/codes/gpushare-device-plugin# go get k8s.io/api/core/v1
package k8s.io/api/core/v1: unrecognized import path "k8s.io/api/core/v1" (https fetch: Get https://k8s.io/api/core/v1?go-get=1: dial tcp 35.201.71.162:443: connect: connection refused)

k8s.io 被牆了,沒法鏈接,建議用戶搭建 http/https 代理,我這裏就直接使用了:golang

(nni) root@ubuntu:/home/lihao/codes/gpushare-device-plugin# export http_proxy=http://lihao:lihao-root@10.10.67.135:3128/
(nni) root@ubuntu:/home/lihao/codes/gpushare-device-plugin# export https_proxy=http://lihao:lihao-root@10.10.67.135:3128/

(nni) root@ubuntu:/home/lihao/codes/gpushare-device-plugin# go get k8s.io/api/core/v1
(nni) root@ubuntu:/home/lihao/codes/gpushare-device-plugin# go get k8s.io/client-go/kubernetes
# k8s.io/client-go/transport
/opt/ml/bin/work/src/k8s.io/client-go/transport/round_trippers.go:70:11: cannot convert klog.V(9) (type klog.Verbose) to type bool
/opt/ml/bin/work/src/k8s.io/client-go/transport/round_trippers.go:72:11: cannot convert klog.V(8) (type klog.Verbose) to type bool
/opt/ml/bin/work/src/k8s.io/client-go/transport/round_trippers.go:74:11: cannot convert klog.V(7) (type klog.Verbose) to type bool
/opt/ml/bin/work/src/k8s.io/client-go/transport/round_trippers.go:76:11: cannot convert klog.V(6) (type klog.Verbose) to type bool
(nni) root@ubuntu:/home/lihao/codes/gpushare-device-plugin# go get k8s.io/client-go/rest
# k8s.io/client-go/transport
/opt/ml/bin/work/src/k8s.io/client-go/transport/round_trippers.go:70:11: cannot convert klog.V(9) (type klog.Verbose) to type bool
/opt/ml/bin/work/src/k8s.io/client-go/transport/round_trippers.go:72:11: cannot convert klog.V(8) (type klog.Verbose) to type bool
/opt/ml/bin/work/src/k8s.io/client-go/transport/round_trippers.go:74:11: cannot convert klog.V(7) (type klog.Verbose) to type bool
/opt/ml/bin/work/src/k8s.io/client-go/transport/round_trippers.go:76:11: cannot convert klog.V(6) (type klog.Verbose) to type bool
(nni) root@ubuntu:/home/lihao/codes/gpushare-device-plugin# go get k8s.io/client-go/tools/clientcmd
# k8s.io/client-go/transport
/opt/ml/bin/work/src/k8s.io/client-go/transport/round_trippers.go:70:11: cannot convert klog.V(9) (type klog.Verbose) to type bool
/opt/ml/bin/work/src/k8s.io/client-go/transport/round_trippers.go:72:11: cannot convert klog.V(8) (type klog.Verbose) to type bool
/opt/ml/bin/work/src/k8s.io/client-go/transport/round_trippers.go:74:11: cannot convert klog.V(7) (type klog.Verbose) to type bool
/opt/ml/bin/work/src/k8s.io/client-go/transport/round_trippers.go:76:11: cannot convert klog.V(6) (type klog.Verbose) to type bool

遇到了 cannot convert klog.V(9) (type klog.Verbose) to type bool 的錯誤,只能 google 之,告知的結論是他們依賴的 klog 應該是 0.4.0 版本 而是最新的 1.0.0,因而進入 klog 的路徑,進行以下操做:docker

# 使用 klog 0.4.0 或許有更好的辦法,好比 godep,gomodule 暫時先不考慮
(nni) root@ubuntu:/home/lihao/codes/gpushare-device-plugin# cd /opt/ml/bin/work/src/k8s.io/klog
(nni) root@ubuntu:/opt/ml/bin/work/src/k8s.io/klog# git checkout -b v0.4.0 v0.4.0
Switched to a new branch 'v0.4.0'

# 從新下載依賴
(nni) root@ubuntu:/# go get k8s.io/client-go/rest
(nni) root@ubuntu:/# go get k8s.io/client-go/tools/clientcmd
(nni) root@ubuntu:/# go get k8s.io/client-go/kubernetes
(nni) root@ubuntu:/# go get k8s.io/api/core/v1

# 編譯
(nni) root@ubuntu:/home/lihao/codes/gpushare-device-plugin# go build -o /go/bin/kubectl-inspect-gpushare-v2 cmd/inspect/*.go
(nni) root@ubuntu:/home/lihao/codes/gpushare-device-plugin# /go/bin/kubectl-inspect-gpushare-v2 --help
Usage of /go/bin/kubectl-inspect-gpushare-v2:
  -alsologtostderr
        log to standard error as well as files
  -d    details
  -log_backtrace_at value
        when logging hits line file:N, emit a stack trace
  -log_dir string
        If non-empty, write log files in this directory
  -logtostderr
        log to standard error instead of files
  -stderrthreshold value
        logs at or above this threshold go to stderr
  -v value
        log level for V logs
  -vmodule value
        comma-separated list of pattern=N settings for file-filtered logging

使用 vendor 來編譯(推薦)

技術在突飛猛進的發展,上述的編譯過程實在是漏洞百出,最核心的一點是,go get XXX 下載的是最新的包,該項目開發的時候用的依賴包版本和咱們編譯時候的包的版本極大可能會不一致,致使不可知的風險,所以上述的編譯方法簡直是個災難。json

當我發現 gpushare-device-plugin/vendor 下有該項目完整的依賴時,我就知道,我必須利用 vendor 的依賴來編譯,這樣才能保證是沒有問題的。ubuntu

爲了保持環境的整潔,我重構編譯環境,找一臺新的沒有裝過 go 的機器:api

[root@localhost ~]# wget https://mirrors.nju.edu.cn/golang/go1.12.1.linux-amd64.tar.gz -O /root/lihao04/
[root@localhost ~]# cd /root/lihao04
[root@localhost lihao04]# tar -zxvf go1.12.1.linux-amd64.tar.gz
[root@localhost lihao04]# export PATH=$PATH:/root/lihao04/go/bin/
[root@localhost lihao04]# go version
go version go1.12.1 linux/amd64

[root@localhost lihao04]# git clone https://github.com/AliyunContainerService/gpushare-device-plugin.git
Cloning into 'gpushare-device-plugin'...
remote: Enumerating objects: 2227, done.
remote: Total 2227 (delta 0), reused 0 (delta 0), pack-reused 2227
Receiving objects: 100% (2227/2227), 4.59 MiB | 447.00 KiB/s, done.
Resolving deltas: 100% (765/765), done.

[root@localhost lihao04]# cd gpushare-device-plugin
[root@localhost gpushare-device-plugin]# go build -o /go/bin/kubectl-inspect-gpushare-v2 cmd/inspect/*.go
cmd/inspect/display.go:10:2: cannot find package "github.com/golang/glog" in any of:
    /root/lihao04/go/src/github.com/golang/glog (from $GOROOT)
    /root/go/src/github.com/golang/glog (from $GOPATH)
cmd/inspect/display.go:11:2: cannot find package "k8s.io/api/core/v1" in any of:
    /root/lihao04/go/src/k8s.io/api/core/v1 (from $GOROOT)
    /root/go/src/k8s.io/api/core/v1 (from $GOPATH)
cmd/inspect/nodeinfo.go:8:2: cannot find package "k8s.io/apimachinery/pkg/apis/meta/v1" in any of:
    /root/lihao04/go/src/k8s.io/apimachinery/pkg/apis/meta/v1 (from $GOROOT)
    /root/go/src/k8s.io/apimachinery/pkg/apis/meta/v1 (from $GOPATH)
cmd/inspect/podinfo.go:13:2: cannot find package "k8s.io/apimachinery/pkg/fields" in any of:
    /root/lihao04/go/src/k8s.io/apimachinery/pkg/fields (from $GOROOT)
    /root/go/src/k8s.io/apimachinery/pkg/fields (from $GOPATH)
cmd/inspect/podinfo.go:14:2: cannot find package "k8s.io/apimachinery/pkg/labels" in any of:
    /root/lihao04/go/src/k8s.io/apimachinery/pkg/labels (from $GOROOT)
    /root/go/src/k8s.io/apimachinery/pkg/labels (from $GOPATH)
cmd/inspect/podinfo.go:15:2: cannot find package "k8s.io/client-go/kubernetes" in any of:
    /root/lihao04/go/src/k8s.io/client-go/kubernetes (from $GOROOT)
    /root/go/src/k8s.io/client-go/kubernetes (from $GOPATH)
cmd/inspect/podinfo.go:16:2: cannot find package "k8s.io/client-go/rest" in any of:
    /root/lihao04/go/src/k8s.io/client-go/rest (from $GOROOT)
    /root/go/src/k8s.io/client-go/rest (from $GOPATH)
cmd/inspect/podinfo.go:17:2: cannot find package "k8s.io/client-go/tools/clientcmd" in any of:
    /root/lihao04/go/src/k8s.io/client-go/tools/clientcmd (from $GOROOT)
    /root/go/src/k8s.io/client-go/tools/clientcmd (from $GOPATH)

很明顯,從 GOPATH 中找不到依賴,搜索了下,貌似能夠經過 export GO111MODULE=on 啓用 go modules,這樣就再也不須要強依賴 GOPATH

[root@localhost gpushare-device-plugin]# export GO111MODULE=on
[root@localhost gpushare-device-plugin]# go build -o /go/bin/kubectl-inspect-gpushare-v2 cmd/inspect/*.go
go: creating new go.mod: module github.com/AliyunContainerService/gpushare-device-plugin
go: copying requirements from Gopkg.lock
go: converting Gopkg.lock: stat google.golang.org/genproto@42f80515abfed431577ebeded4ab86390ce0a5cd: unrecognized import path "google.golang.org/genproto" (https fetch: Get https://google.golang.org/genproto?go-get=1: dial tcp 216.239.37.1:443: connect: connection timed out)
go: converting Gopkg.lock: stat k8s.io/apiserver@13cfe3978170675900fbed4994382716c5bd293b: unrecognized import path "k8s.io/apiserver" (https fetch: Get https://k8s.io/apiserver?go-get=1: dial tcp 35.201.71.162:443: connect: connection timed out)
go: converting Gopkg.lock: stat k8s.io/client-go@v8.0.0: unrecognized import path "k8s.io/client-go" (https fetch: Get https://k8s.io/client-go?go-get=1: dial tcp 35.201.71.162:443: connect: connection timed out)
go: converting Gopkg.lock: stat k8s.io/kube-openapi@f442ecb314a3679150c272e2b9713d8deed5955d: unrecognized import path "k8s.io/kube-openapi" (https fetch: Get https://k8s.io/kube-openapi?go-get=1: dial tcp 35.201.71.162:443: connect: connection timed out)
go: converting Gopkg.lock: stat k8s.io/kubernetes@v1.12.3: unrecognized import path "k8s.io/kubernetes" (https fetch: Get https://k8s.io/kubernetes?go-get=1: dial tcp 35.201.71.162:443: connect: connection timed out)
^C

啓用了 export GO111MODULE=on 確實有變化,編譯中缺乏的包會直接去嘗試下載,可是這不是咱們期待的,能夠經過增長 -mod=vendor 來明確使用 vendor 下的依賴:

# 取消 GO111MODULE 設置
[root@localhost gpushare-device-plugin]# unset GO111MODULE

# 編譯失敗,錯誤提示告訴咱們,使用 vendor 必須啓用 GO111MODULE!
[root@localhost gpushare-device-plugin]# go build -mod=vendor -o /go/bin/kubectl-inspect-gpushare-v2 cmd/inspect/*.go
build flag -mod=vendor only valid when using modules

# 從新打開 GO111MODULE=on
[root@localhost gpushare-device-plugin]# export GO111MODULE=on

[root@localhost gpushare-device-plugin]# go build -mod=vendor -o /go/bin/kubectl-inspect-gpushare-v2 cmd/inspect/*.go
go: creating new go.mod: module github.com/AliyunContainerService/gpushare-device-plugin
go: copying requirements from Gopkg.lock
go: converting Gopkg.lock: stat github.com/gogo/protobuf@v1.1.1: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat github.com/json-iterator/go@v1.1.5: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat github.com/NVIDIA/nvidia-docker@v1.0.1: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat github.com/golang/glog@23def4e6c14b4da8ac2ed8007337bc5eb5007998: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat google.golang.org/genproto@42f80515abfed431577ebeded4ab86390ce0a5cd: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat github.com/hashicorp/golang-lru@v0.5.0: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat github.com/davecgh/go-spew@v1.1.1: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat github.com/fsnotify/fsnotify@v1.4.7: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat k8s.io/apiserver@13cfe3978170675900fbed4994382716c5bd293b: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat k8s.io/api@2d6f90ab1293a1fb871cf149423ebb72aa7423aa: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat golang.org/x/time@f51c12702a4d776e4c1fa9b0fabab841babae631: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat golang.org/x/crypto@56440b844dfe139a8ac053f4ecac0b20b79058f4: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat google.golang.org/grpc@v1.16.0: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat github.com/modern-go/concurrent@bacd9c7ef1dd9b15be4a9909b8ac7a4e313eec94: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat k8s.io/apiextensions-apiserver@f584b16eb23bd2a3fd292a027d698d95db427c5d: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat github.com/peterbourgon/diskv@v2.0.1: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat golang.org/x/net@0ed95abb35c445290478a5348a7b38bb154135fd: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat k8s.io/client-go@v8.0.0: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat github.com/NVIDIA/gpu-monitoring-tools@86f2a9fac6c5b597dc494420005144b8ef7ec9fb: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat github.com/golang/protobuf@v1.2.0: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat k8s.io/kube-openapi@f442ecb314a3679150c272e2b9713d8deed5955d: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat github.com/gregjones/httpcache@787624de3eb7bd915c329cba748687a3b22666a6: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat golang.org/x/text@v0.3.0: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat gopkg.in/inf.v0@v0.9.1: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat github.com/google/gofuzz@24818f796faf91cd76ec7bddd72458fbced7a6c1: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat k8s.io/kubernetes@v1.12.3: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat github.com/googleapis/gnostic@v0.2.0: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat github.com/ghodss/yaml@v1.0.0: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat github.com/google/btree@4030bb1f1f0c35b30ca7009e9ebd06849dd45306: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat github.com/petar/GoLLRB@53be0d36a84c2a886ca057d34b6aa4468df9ccb4: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat github.com/modern-go/reflect2@4b7aa43c6742a2c18fdef89dd197aaae7dac7ccd: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat k8s.io/apimachinery@70adfbae261eebb795b76321790745ad0e3c523f: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat gopkg.in/yaml.v2@v2.2.2: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat golang.org/x/sys@98c5dad5d1a0e8a73845ecc8897d0bd56586511d: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat github.com/imdario/mergo@v0.3.6: repo version lookup disabled by -mod=vendor
go: converting Gopkg.lock: stat github.com/spf13/pflag@v1.0.3: repo version lookup disabled by -mod=vendor

# kubectl-inspect-gpushare-v2 存在
[root@localhost gpushare-device-plugin]# ll /go/bin/kubectl-inspect-gpushare-v2
-rwxr-xr-x 1 root root 32535344 Oct 16 14:35 /go/bin/kubectl-inspect-gpushare-v2

# 編譯 gpushare-device-plugin 也能夠成功,注意咱們額外增長了 -mod=vendor
[root@localhost gpushare-device-plugin]# export CGO_LDFLAGS_ALLOW='-Wl,--unresolved-symbols=ignore-in-object-files' &&     go build -ldflags="-s -w" -mod=vendor  -o /go/bin/gpushare-device-plugin-v2 cmd/nvidia/main.go

# gpushare-device-plugin 存在
[root@localhost gpushare-device-plugin]# ll /go/bin/gpushare-device-plugin-v2
-rwxr-xr-x 1 root root 26825912 Oct 16 14:38 /go/bin/gpushare-device-plugin-v2

經過使用 gpushare-device-plugin/vendor 咱們指定 -mod=vendor 參數,根本不用咱們本身處理依賴,項目的開發人員早就幫咱們弄好了,不得不說,太方便了!並且保證必定是正確的!

可是這引起了個人思考:

  1. vendor 目錄如何生成?
  2. 若是修改源碼,新增依賴項目,應該怎麼辦?

vendor 目錄如何生成?

vendor 是 v1.5 開始引入依賴管理的設計,只是用來存儲本地依賴,我暫時的理解是,開發者把全部的依賴放到項目的 vendor 文件夾下,再設置環境變量 enable 該 feature,編譯時依賴就走這個 vendor 文件夾,這能彌補多個項目必須經過設置各自的 GOPATH 來解決使用同一依賴但不一樣版本的狀況。

可是如何把全部依賴都挪到 vendor 目錄中,GO 語言有一些方案,同時也有一些第三方的方案。

咱們查看 gpushare-device-plugin 項目,發現有 Gopkg.lockGopkg.toml 這兩個文件:

[root@localhost gpushare-device-plugin]# cat Gopkg.toml

[[constraint]]
  revision = "86f2a9fac6c5b597dc494420005144b8ef7ec9fb"
  name = "github.com/NVIDIA/gpu-monitoring-tools"

[[constraint]]
  name = "k8s.io/kubernetes"
  version = "v1.11.2"

[[constraint]]
  name = "k8s.io/apimachinery"
  branch = "release-1.11"

[[constraint]]
  name = "k8s.io/client-go"
  version = "~v8.0.0"

[[override]]
  name = "k8s.io/api"
  version = "kubernetes-1.11.2"

[[override]]
  name = "github.com/gregjones/httpcache"
  revision = "787624de3eb7bd915c329cba748687a3b22666a6"

[[override]]
  name = "golang.org/x/time"
  revision = "f51c12702a4d776e4c1fa9b0fabab841babae631"

[[override]]
  name = "github.com/docker/docker"
  revision = "4f3616fb1c112e206b88cb7a9922bf49067a7756"

[[override]]
  name = "github.com/docker/distribution"
  revision = "edc3ab29cdff8694dd6feb85cfeb4b5f1b38ed9c"

[prune]
  go-tests = true
  unused-packages = true

咱們仔細觀察,這裏面記錄的內容不就是依賴的版本信息麼?這個文件必定有特別之處;搜索之,果真有收穫。

原來 Gopkg.lockGopkg.toml 這兩個文件是 GO Dep 項目的兩個主要配置文件,簡單講 Gopkg.toml 是清單文件,Gopkg.lock 是校驗描述文件。儘可能不要修改,避免形成兩個文件不一樣步的錯誤。

GO Dep 項目正是官方對依賴高效、簡潔地管理提出的解決工具。也就是說,使用 Dep 應該能夠生成 vendor 目錄?

首先咱們須要安裝 Dep 工具:

# 安裝 go dep 工具
[root@localhost gpushare-device-plugin]# curl https://raw.githubusercontent.com/golang/dep/master/install.sh | sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5230  100  5230    0     0   9254      0 --:--:-- --:--:-- --:--:--  9289
ARCH = amd64
OS = linux
Will install into /root/go/bin
Fetching https://github.com/golang/dep/releases/latest..
Release Tag = v0.5.4
Fetching https://github.com/golang/dep/releases/tag/v0.5.4..
Fetching https://github.com/golang/dep/releases/download/v0.5.4/dep-linux-amd64..
Setting executable permissions.
Moving executable to /root/go/bin/dep

# 將 /root/go/bin/ 加入 PATH
[root@localhost gpushare-device-plugin]# export PATH=$PATH:/root/go/bin/

# 測試,安裝成功
[root@localhost gpushare-device-plugin]# dep
Dep is a tool for managing dependencies for Go projects

Usage: "dep [command]"

Commands:

  init     Set up a new Go project, or migrate an existing one
  status   Report the status of the project's dependencies
  ensure   Ensure a dependency is safely vendored in the project
  version  Show the dep version information
  check    Check if imports, Gopkg.toml, and Gopkg.lock are in sync

Examples:
  dep init                               set up a new project
  dep ensure                             install the project's dependencies
  dep ensure -update                     update the locked versions of all dependencies
  dep ensure -add github.com/pkg/errors  add a dependency to the project

Use "dep help [command]" for more information about a command.

安裝好 Dep 後,爲了更好的測試,咱們將 Gopkg.lockGopkg.tomlvendor 先刪除,而後執行 dep init:

# 移除已經作好的依賴
[root@localhost gpushare-device-plugin]# mv Gopkg.lock Gopkg.toml /home/
[root@localhost gpushare-device-plugin]# mv Gopkg.lock Gopkg.toml /home/
[root@localhost gpushare-device-plugin]# mv vendor /home/

# 初始化
[root@localhost gpushare-device-plugin]# dep init
init failed: unable to detect the containing GOPATH: /root/lihao04/gpushare-device-plugin is not within a known GOPATH/src

報錯的意思是 dep init 所在的文件夾必須在 GOPATH/src 下,所以須要作調整:

[root@localhost gpushare-device-plugin]# mkdir -p /opt/ml/bin/work/src/github.com/AliyunContainerService/
[root@localhost gpushare-device-plugin]# export GOPATH=/opt/ml/bin/work
[root@localhost gpushare-device-plugin]# ln -s /root/lihao04/gpushare-device-plugin /opt/ml/bin/work/src/github.com/AliyunContainerService/
[root@localhost gpushare-device-plugin]# cd /opt/ml/bin/work/src/github.com/AliyunContainerService/gpushare-device-plugin
[root@localhost gpushare-device-plugin]# dep init
init failed: unable to determine direct dependencies: unable to deduce repository and source type for "golang.org/x/net/context": unable to read metadata: unable to fetch raw metadata: failed HTTP request to URL "http://golang.org/x/net/context?go-get=1": Get http://golang.org/x/net/context?go-get=1: dial tcp 216.239.37.1:80: connect: connection timed out

報錯的意思是被牆了,設置代理,並從新執行:

[root@localhost gpushare-device-plugin]# export http_proxy=http://jarvis:jarvis-root@10.191.67.135:3128/
[root@localhost gpushare-device-plugin]# export https_proxy=http://jarvis:jarvis-root@10.191.67.135:3128/

[root@localhost gpushare-device-plugin]# dep init
Importing configuration from godep. These are only initial constraints, and are further refined during the solve process.
Importing configuration from godep. These are only initial constraints, and are further refined during the solve process.
Importing configuration from godep. These are only initial constraints, and are further refined during the solve process.
Importing configuration from godep. These are only initial constraints, and are further refined during the solve process.
  Using master as constraint for direct dep github.com/NVIDIA/gpu-monitoring-tools
  Locking in master (7a750c7) for direct dep github.com/NVIDIA/gpu-monitoring-tools
  Locking in  (0c5108395e2d) for transitive dep github.com/googleapis/gnostic
  Locking in master (548a555) for transitive dep google.golang.org/genproto
  Locking in v0.3.1 (89e63fd) for transitive dep k8s.io/klog
  Locking in  (24818f796faf) for transitive dep github.com/google/gofuzz
  Locking in  (c2654d5206da) for transitive dep k8s.io/utils
  Locking in v1.0.1 (583c0c0) for transitive dep github.com/spf13/pflag
  Locking in v1.1.1 (8991bc2) for transitive dep github.com/davecgh/go-spew
  Locking in v1.3.2 (6c65a55) for transitive dep github.com/golang/protobuf
  Locking in v1.1.0 (fd68e98) for transitive dep sigs.k8s.io/yaml
  Locking in  (ab8a2e0c74be) for transitive dep github.com/json-iterator/go
  Locking in  (b3a7cee44a30) for transitive dep k8s.io/kube-openapi
  Using master as constraint for direct dep github.com/golang/glog
  Locking in master (23def4e) for direct dep github.com/golang/glog
  Locking in  (9f3314589c9a) for transitive dep golang.org/x/oauth2
  Locking in  (1799e75a0719) for direct dep k8s.io/apimachinery
  Using master as constraint for direct dep golang.org/x/net
  Locking in master (da9a3fd) for direct dep golang.org/x/net
  Using ^1.4.7 as constraint for direct dep github.com/fsnotify/fsnotify
  Locking in v1.4.7 (c282820) for direct dep github.com/fsnotify/fsnotify
  Locking in  (f51c12702a4d) for transitive dep golang.org/x/time
  Locking in v2.2.4 (f221b84) for transitive dep gopkg.in/yaml.v2
  Locking in v0.3.2 (342b2e1) for transitive dep golang.org/x/text
  Locking in master (b09406a) for transitive dep golang.org/x/sys
  Locking in  (e84da0312774) for transitive dep golang.org/x/crypto
  Locking in  (7cf5895f2711) for direct dep k8s.io/api
  Locking in 1.0.3 (bacd9c7) for transitive dep github.com/modern-go/concurrent
  Using ^12.0.0 as constraint for direct dep k8s.io/client-go
  Locking in v12.0.0 (78d2af7) for direct dep k8s.io/client-go
  Locking in v0.9.0 (3887ee9) for transitive dep gopkg.in/inf.v0
  Locking in v1.3.1 (5628607) for transitive dep github.com/gogo/protobuf
  Using ^1.24.0 as constraint for direct dep google.golang.org/grpc
  Locking in v1.24.0 (f6d0f9e) for direct dep google.golang.org/grpc
  Using ^1.16.2 as constraint for direct dep k8s.io/kubernetes
  Locking in v1.16.2 (c97fe50) for direct dep k8s.io/kubernetes
  Locking in v1.6.5 (971852b) for transitive dep google.golang.org/appengine
  Locking in v0.3.5 (9316a62) for transitive dep github.com/imdario/mergo
  Locking in v1.0.1 (94122c3) for transitive dep github.com/modern-go/reflect2

看起來執行成功了,咱們來看下有什麼變化:

先挑選一個貌似已經安裝的依賴包 go-spew,看看它被保存在什麼位置

[root@localhost gpushare-device-plugin]# find / -type d -name go-spew
# verdor 目錄中有了!!!
/root/lihao04/gpushare-device-plugin/vendor/github.com/davecgh/go-spew
# 以前咱們備份的,請忽略
/home/vendor/github.com/davecgh/go-spew
# GOPATH 中也保存了一份,查看了下確實不是硬連接,是真實的拷貝
/opt/ml/bin/work/pkg/dep/sources/https---github.com-kubernetes-kubernetes/vendor/github.com/davecgh/go-spew

看下 gpushare-device-plugin 目錄下是否多出什麼內容

# 發現 gpushare-device-plugin 下多出了 Gopkg.lock 和 Gopkg.toml !!!
[root@localhost gpushare-device-plugin]# ll /root/lihao04/gpushare-device-plugin/
total 52
drwxr-xr-x 4 root root    45 Oct 16 14:27 cmd
drwxr-xr-x 3 root root    30 Oct 16 14:27 demo
-rw-r--r-- 1 root root  1522 Oct 16 14:27 device-plugin-ds.yaml
-rw-r--r-- 1 root root   883 Oct 16 14:27 device-plugin-rbac.yaml
-rw-r--r-- 1 root root   684 Oct 16 14:27 Dockerfile
-rw------- 1 root root    73 Oct 16 15:11 go.mod
-rw-r--r-- 1 root root 13719 Oct 16 15:20 Gopkg.lock
-rw-r--r-- 1 root root  1153 Oct 16 15:20 Gopkg.toml
-rw-r--r-- 1 root root 11357 Oct 16 14:27 LICENSE
drwxr-xr-x 3 root root    24 Oct 16 14:27 pkg
-rw-r--r-- 1 root root   964 Oct 16 14:27 README.md
drwxr-xr-x 8 root root   138 Oct 16 15:20 vendor

至此,真相大白。咱們經過實驗,得出結論 Gopkg.lockGopkg.tomlvendor 都是由 dep init 自動生成的。

咱們還能夠對比下新舊 Gopkg.toml

[root@localhost gpushare-device-plugin]# diff /opt/ml/bin/work/src/github.com/AliyunContainerService/gpushare-device-plugin /home/Gopkg.toml
3c3
< # Refer to https://golang.github.io/dep/docs/Gopkg.toml.html
---
> # Refer to https://github.com/golang/dep/blob/master/docs/Gopkg.toml.md
27d26
<
29c28
<   branch = "master"
---
>   revision = "86f2a9fac6c5b597dc494420005144b8ef7ec9fb"
33,42c32,33
<   name = "github.com/fsnotify/fsnotify"
<   version = "1.4.7"
<
< [[constraint]]
<   branch = "master"
<   name = "github.com/golang/glog"
<
< [[constraint]]
<   branch = "master"
<   name = "golang.org/x/net"
---
>   name = "k8s.io/kubernetes"
>   version = "v1.11.2"
45,46c36,37
<   name = "google.golang.org/grpc"
<   version = "1.24.0"
---
>   name = "k8s.io/apimachinery"
>   branch = "release-1.11"
50c41
<   version = "12.0.0"
---
>   version = "~v8.0.0"
52,54c43,61

...
...

從對比的結果上看,確實依賴庫的版本有出入,項目源碼中只是調用 import XXX 導入,可是確實沒有代表版本,所以在 dep init 默認就用適配的最新的代碼。有出入是正常的。

這裏有一個小疑問,gpushare-device-plugin 項目若是僅僅發佈 Gopkg.lockGopkg.toml 這兩個配置文件,而不發佈 vendor 咱們將怎麼下載依賴呢?

# 還原代碼
[root@localhost gpushare-device-plugin]# git reset --hard HEAD
HEAD is now at eb657d1 rename config.yml (#10)
[root@localhost gpushare-device-plugin]# git status
# On branch master
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#   go.mod
nothing added to commit but untracked files present (use "git add" to track)

# 刪除全部項目依賴包
[root@localhost gpushare-device-plugin]# rm -rf go.mod
[root@localhost gpushare-device-plugin]# rm -rf /opt/ml/bin/work/*
[root@localhost gpushare-device-plugin]# rm -rf vendor/

# 從新創建目錄
[root@localhost gpushare-device-plugin]# mkdir -p /opt/ml/bin/work/src/github.com/AliyunContainerService/
[root@localhost gpushare-device-plugin]# ln -s /root/lihao04/gpushare-device-plugin /opt/ml/bin/work/src/github.com/AliyunContainerService/
[root@localhost gpushare-device-plugin]# cd /opt/ml/bin/work/src/github.com/AliyunContainerService/gpushare-device-plugin

# 執行 dep ensure,耗費時間較長
[root@localhost gpushare-device-plugin]# dep ensure
[root@localhost gpushare-device-plugin]#

# 成功後,發現 vendor 目錄等所有存在,Gopkg.toml 不變,可是 Gopkg.lock 有變化
[root@localhost gpushare-device-plugin]# git diff Gopkg.toml
[root@localhost gpushare-device-plugin]# git diff Gopkg.lock
diff --git a/Gopkg.lock b/Gopkg.lock
index 61eb536..3374d0f 100644
--- a/Gopkg.lock
+++ b/Gopkg.lock
@@ -9,14 +9,6 @@
   revision = "86f2a9fac6c5b597dc494420005144b8ef7ec9fb"

 [[projects]]
-  digest = "1:179992ae7637e3aa978ab9f27e9a32b1d452dc4ae443037a34d5e08aa4ca23e0"
-  name = "github.com/NVIDIA/nvidia-docker"
-  packages = ["src/nvml"]
-  pruneopts = "UT"
-  revision = "01d2c9436620d7dde4672e414698afe6da4a282f"
-  version = "v1.0.1"
-
-[[projects]]
   digest = "1:ffe9824d294da03b391f44e1ae8281281b4afc1bdaa9588c9097785e3af10cec"
   name = "github.com/davecgh/go-spew"
   packages = ["spew"]
@@ -114,17 +106,6 @@
   revision = "787624de3eb7bd915c329cba748687a3b22666a6"

 [[projects]]
-  digest = "1:8ec8d88c248041a6df5f6574b87bc00e7e0b493881dad2e7ef47b11dc69093b5"
-  name = "github.com/hashicorp/golang-lru"
-  packages = [
-    ".",
-    "simplelru",
-  ]
-  pruneopts = "UT"
-  revision = "20f1fb78b0740ba8c3cb143a61e86ba5c8669768"
-  version = "v0.5.0"

我能給出的解釋是,Gopkg.toml 是用戶的配置,Gopkg.lock 是知足 Gopkg.toml 的生效的、最新的軟件包,好比:

[[constraint]]
  name = "k8s.io/kubernetes"
  version = "v1.11.2"

v1.11.2 可能有若干 bug fix 的小版本,這個仍是不斷地再升級的,所以會有必定的區別???

若是修改源碼,新增依賴項目,應該怎麼辦?

根據上面的操做,咱們使用 dep ensure -add 來增長依賴:

## 報錯信息說的很是好,代碼沒有 import 這個庫,可是咱們臨時加上了,可是調用 dep ensure 後,會刪除
[root@localhost gpushare-device-plugin]# dep ensure -add github.com/pkg/errors
Fetching sources...

"github.com/pkg/errors" is not imported by your project, and has been temporarily added to Gopkg.lock and vendor/.
If you run "dep ensure" again before actually importing it, it will disappear from Gopkg.lock and vendor/.

懶得給你們演示成功加入一個新的依賴庫的輸出,總之加入新庫後,能夠經過 dep status 查看目前的依賴及其版本觀察其變化:

[root@localhost gpushare-device-plugin]# dep status
PROJECT                                 CONSTRAINT     VERSION        REVISION  LATEST   PKGS USED
github.com/NVIDIA/gpu-monitoring-tools  branch master  branch master  7a750c7   7a750c7  1
github.com/davecgh/go-spew              v1.1.1         v1.1.1         8991bc2   v1.1.1   1
github.com/fsnotify/fsnotify            ^1.4.7         v1.4.7         c282820   v1.4.7   1
github.com/gogo/protobuf                v1.3.1         v1.3.1         5628607   v1.3.1   4
github.com/golang/glog                  branch master  branch master  23def4e   23def4e  1
github.com/golang/protobuf              v1.3.2         v1.3.2         6c65a55   v1.3.2   5
github.com/google/gofuzz                *                             24818f7            1
github.com/googleapis/gnostic           *                             0c51083            3
github.com/imdario/mergo                v0.3.5         v0.3.5         9316a62   v0.3.5   1
github.com/json-iterator/go             *                             ab8a2e0            1
github.com/modern-go/concurrent         1.0.3          1.0.3          bacd9c7   1.0.3    1
github.com/modern-go/reflect2           v1.0.1         v1.0.1         94122c3   1.0.1    1
github.com/spf13/pflag                  v1.0.1         v1.0.1         583c0c0   v1.0.1   1
golang.org/x/crypto                     *                             e84da03            1
golang.org/x/net                        branch master  branch master  da9a3fd   da9a3fd  8
golang.org/x/oauth2                     *                             9f33145            2
golang.org/x/sys                        branch master  branch master  b09406a   b09406a  2
golang.org/x/text                       v0.3.2         v0.3.2         342b2e1   v0.3.2   16
golang.org/x/time                       *                             f51c127            1
google.golang.org/appengine             v1.6.5         v1.6.5         971852b   v1.6.5   7
google.golang.org/genproto              branch master  branch master  548a555   548a555  1
google.golang.org/grpc                  ^1.24.0        v1.24.0        f6d0f9e   v1.24.0  33
gopkg.in/inf.v0                         v0.9.0         v0.9.0         3887ee9   v0.9.0   1
gopkg.in/yaml.v2                        v2.2.4         v2.2.4         f221b84   v2.2.4   1
k8s.io/api                              *                             7cf5895            36
k8s.io/apimachinery                     *                             1799e75            39
k8s.io/client-go                        ^12.0.0        v12.0.0        78d2af7   v12.0.0  59
k8s.io/klog                             v0.3.1         v0.3.1         89e63fd   v0.3.1   1
k8s.io/kube-openapi                     *                             b3a7cee            1
k8s.io/kubernetes                       ^1.16.2        v1.16.2        c97fe50   v1.16.2  2
k8s.io/utils                            *                             c2654d5            1
sigs.k8s.io/yaml                        v1.1.0         v1.1.0         fd68e98   v1.1.0   1

dep 的最佳實踐

  • dep init 可以自動分析 import 的關係,將依賴添加到 Gopkg.lockGopkg.tomlvendor
  • 源碼修改後,直接調用 dep ensure 自動更新 Gopkg.lockvendor
  • 若是對依賴包的版本有要求,調用 dep ensure -add github.com/BurntSushi/toml@v0.2.0 指定版本或者其餘
  • 使用 dep ensure update <pkg> 進行升級

具體請參考 寫Go代碼時遇到的那些問題

製做 device plugin 鏡像

當編譯出來 gpushare-device-plugin-v2kubectl-inspect-gpushare-v2 後,咱們修改官方的 Dockerfile 直接製做 docker:

(nni) root@ubuntu:/go/bin# cp /opt/ml/bin/work/src/github.com/AliyunContainerService/gpushare-device-plugin/Dockerfile /go/bin/
(nni) root@ubuntu:/go/bin# cd /go/bin/
(nni) root@ubuntu:/go/bin# docker build -f Dockerfile -t registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-plugin:v2-1.12-lihao-test .
Sending build context to Docker daemon  60.92MB
Step 1/6 : FROM debian:stretch-slim
 ---> 9634525e1209
Step 2/6 : ENV NVIDIA_VISIBLE_DEVICES=all
 ---> Using cache
 ---> c6f5ca103b63
Step 3/6 : ENV NVIDIA_DRIVER_CAPABILITIES=utility
 ---> Using cache
 ---> 5b389b7dbabe
Step 4/6 : COPY gpushare-device-plugin-v2 /usr/bin/gpushare-device-plugin-v2
 ---> Using cache
 ---> c8d9175fe5d7
Step 5/6 : COPY kubectl-inspect-gpushare-v2 /usr/bin/kubectl-inspect-gpushare-v2
 ---> e1a8a7b098fd
Step 6/6 : CMD ["gpushare-device-plugin-v2","-logtostderr"]
 ---> Running in b81830470fbd
Removing intermediate container b81830470fbd
 ---> 7fd0dfa09cbc
Successfully built 7fd0dfa09cbc
Successfully tagged registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-plugin:v2-1.12-lihao-test

修改後的 Dockerfile 爲:

FROM debian:stretch-slim

ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=utility

COPY gpushare-device-plugin-v2 /usr/bin/gpushare-device-plugin-v2

COPY kubectl-inspect-gpushare-v2 /usr/bin/kubectl-inspect-gpushare-v2

CMD ["gpushare-device-plugin-v2","-logtostderr"]

測試 device plugin 鏡像

登陸到裝有 gpushare-device-plugin 的 k8s 環境中,修改鏡像 v2-1.11-aff8a23 -> v2-1.12-lihao-test

[root@k8s-master kubernetes]# cat device-plugin-ds.yaml|grep image
      - image: registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-plugin:v2-1.12-lihao-test

查看升級前 device-plugin daemonset 使用的鏡像:

[root@k8s-master kubernetes]# kubectl describe pod gpushare-device-plugin-ds-59jzl -n kube-system|grep Image
    Image:         registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-plugin:v2-1.11-aff8a23
    Image ID:      docker-pullable://registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-plugin@sha256:76769d69f5a5b24cbe117f8ac83a0ff7409fda6108ca982c8f3b8f763e016100

經過 kubectl-inspect-gpushare 能夠查看 GPU 顯存使用狀況

[root@k8s-master kubernetes]# kubectl-inspect-gpushare
NAME                                             IPADDRESS     GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  GPU4(Allocated/Total)  GPU5(Allocated/Total)  GPU6(Allocated/Total)  GPU7(Allocated/Total)  GPU Memory(GiB)
mesos-gpu-v100-online020-bdwg.cloud.qiyi.domain  10.39.41.206  13/15                  14/15                  14/15                  14/15                  14/15                  10/15                  0/15                   0/15                   79/120
--------------------------------------------------------------------------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
79/120 (65%)

查看升級以前建立的 gpu pod

[root@k8s-master kubernetes]# kubectl get pod|grep binpack
binpack-1-0                 1/1     Running   12         7d1h
binpack-1-1                 1/1     Running   12         7d
binpack-1-2                 1/1     Running   12         7d
binpack-lihao-0             0/1     Pending   0          7d
binpack-lihao-1             1/1     Running   0          22h
binpack-lihao-2             1/1     Running   0          22h
binpack-lihao-3             1/1     Running   0          22h
binpack-lihao-4             1/1     Running   0          22h
binpack-lihao-5             1/1     Running   0          22h
binpack-lihao-6             1/1     Running   0          22h
binpack-lihao-7             1/1     Running   0          22h
binpack-lihao-8             1/1     Running   0          22h
binpack-lihao-9             1/1     Running   0          22h
binpack-test-0              1/1     Running   13         7d

開始升級:

[root@k8s-master kubernetes]# kubectl apply -f device-plugin-ds.yaml
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
daemonset.extensions/gpushare-device-plugin-ds configured

# 不知道爲何始終不見更新鏡像,直接刪除該 pod
[root@k8s-master kubernetes]# kubectl delete pod gpushare-device-plugin-ds-59jzl -n kube-system
pod "gpushare-device-plugin-ds-59jzl" deleted

# 新建的 pod 確實使用了指定的鏡像
[root@k8s-master kubernetes]# kubectl describe pod gpushare-device-plugin-ds-6b44p -n kube-system|grep Image
    Image:         registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-plugin:v2-1.12-lihao-test
    Image ID:      docker://sha256:08ebf3626df08cf6ed27cfc301a946821b34b8fe574afbb2c2650c826e5589e0

# 新建 gpu pod
[root@k8s-master kubernetes]# kubectl apply -f example-2.yaml
statefulset.apps/binpack-2 created

查看新建的 pod 狀態

# 狀態正常
[root@k8s-master kubernetes]# kubectl get pod|grep binpack-2
binpack-2-0                 1/1     Running   0          48s
binpack-2-1                 1/1     Running   0          50s

# 正常使用
[root@k8s-master kubernetes]# kubectl exec -it binpack-2-0 bash
root@binpack-2-0:/notebooks# nvidia-smi
Wed Oct 16 10:03:02 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:8C:00.0 Off |                    0 |
| N/A   45C    P0    63W / 300W |  15830MiB / 16130MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

查看存量 pod 的狀態

# 找了一個已經運行了 22h 的存量 GPU pod
[root@k8s-master kubernetes]# kubectl get pod |grep binpack-lihao-1
binpack-lihao-1             1/1     Running   0          22h

# 使用正常
[root@k8s-master kubernetes]# kubectl exec -it binpack-lihao-1 bash
root@binpack-lihao-1:/notebooks# nvidia-smi
Wed Oct 16 10:05:01 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   40C    P0    57W / 300W |   9300MiB / 16130MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
相關文章
相關標籤/搜索