Pods stuck in the Terminating state are a classic problem teams run into after containerizing their workloads, and the triggers vary. This article records how the NetEase Shufan Qingzhou Kubernetes enhancement team investigated step by step and discovered that an excessive number of Docker Volume directories was causing Terminating Pods, and presents the solution. We hope it helps readers troubleshoot and avoid similar problems.
Problem Background
Recently, a user's cluster once again had a Pod on one node stuck in the Terminating state for a long time. At first we assumed it was caused by several well-known Docker and Containerd issues in version 18.06.3, but after logging into the affected node we found the following environment:
| Component | Version |
| --- | --- |
| OS | Debian GNU/Linux 10 (buster) |
| Kernel | 4.19.87-netease |
| Docker | 18.09.9 |
| Containerd | 1.2.10 |
| Kubernetes | v1.13.12-netease |
The metadata of the Terminating Pod was as follows:
```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2020-09-17T06:59:28Z"
  deletionGracePeriodSeconds: 60
  deletionTimestamp: "2020-10-09T07:33:11Z"
  name: face-compare-for-music-4-achieve-848f89dfdf-4v7p6
  namespace: ai-cv-team
  # ......
spec:
  # ......
status:
  # ......
  containerStatuses:
  - containerID: docker://de6d3812bfc8b6bef915d663c3d0e0cf5b3f95a41c2a98c867a1376dbccd30d3
    lastState: {}
    name: docker-image
    ready: false
    restartCount: 0
    state:
      running:
        startedAt: "2020-09-17T06:59:30Z"
  hostIP: 10.194.172.224
  phase: Running
  podIP: 10.178.132.136
  qosClass: Guaranteed
  startTime: "2020-09-17T06:59:28Z"
```
On the node, the `docker` command showed that the business container `de6d3812bfc8` of the Terminating Pod had still not been removed:
```shell
$ docker ps -a | grep de6d3812bfc8
de6d3812bfc8    91f062eaa3a0    "java -Xms4096m -Xmx…"    3 weeks ago    Up 3 weeks    k8s_docker-image_face-compare-for-music-4-achieve-......
```
We then used the `ctr` command and found that the container's metadata still existed in Containerd:
```shell
$ ctr --address /var/run/containerd/containerd.sock --namespace moby container list | grep de6d3812bfc8
de6d3812bfc8b6bef915d663c3d0e0cf5b3f95a41c2a98c867a1376dbccd30d3    -    io.containerd.runtime.v1.linux
```
We suspected that Shim process reaping had gone wrong, so we used the `ps` command to look for the Shim process of container `de6d3812bfc8` in order to capture its stack for analysis. By then, however, neither the container's Shim process nor its business process could be found. The logs showed that Docker and Containerd had already handled the container's exit:
```
Oct  9 15:46:36 ai-data-k8s-3 dockerd[10017]: time="2020-10-09T15:46:36.824919188+08:00" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exit
Oct  9 15:46:36 ai-data-k8s-3 containerd[1965]: time="2020-10-09T15:46:36.863630606+08:00" level=info msg="shim reaped" id=de6d3812bfc8b6bef915d663c3d0e0cf5b3f95a41c2a98c867a1376dbccd30d3
Oct  9 15:46:36 ai-data-k8s-3 dockerd[10017]: time="2020-10-09T15:46:36.873487822+08:00" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/delete
Oct  9 15:46:36 ai-data-k8s-3 dockerd[10017]: time="2020-10-09T15:46:36.873531302+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
```
Meanwhile, several new business Pods were scheduled onto this node, and their containers stayed in the Created state. This behavior differs from the Docker and Containerd issues we already knew about:
```shell
$ docker ps -a | grep Created
03fed51454c2    c2e6abc00a12    "java -Xms8092m -Xmx…"    3 minutes ago    Created    k8s_docker-image_dynamic-bg-service-cpu-28-achieve-......
......
```
To summarize, the phenomena observed at this point were:
- The Kubelet's Pod deletion logic had been triggered.
- Docker had received and processed the Kubelet's request to remove the container.
- The container's Shim process and business process had already been cleaned up.
- For some reason, the container's metadata in Docker and Containerd could not be deleted.
- Containers created after this point stayed in the Created state.
Root Cause Analysis
Looking at the monitoring data, we found that when the problem occurred, the node's disk utilization was extremely high and its CPU load was abnormal:

磁盤利用率很是高api

CPU 負載異常緩存
Our initial guess was that the problem was related to the node's abnormal disk utilization.
Why containers of newly scheduled Pods stayed in the Created state
Containers of newly scheduled Pods being stuck in the Created state was a new symptom for us on Docker 18.09.9. Starting from this symptom, we examined the Docker daemon's goroutine dump (such a dump can be obtained, for example, by sending `SIGUSR1` to the `dockerd` process) and found multiple goroutines blocked in stacks containing the `github.com/docker/docker/daemon.(*Daemon).ContainerCreate` function, with `semacquire` as the blocking reason. One of these goroutines looked like this:
```
goroutine 19962397 [semacquire, 8 minutes]:
sync.runtime_SemacquireMutex(0xc000aee824, 0xc0026e4600)
        /usr/local/go/src/runtime/sema.go:71 +0x3f
sync.(*Mutex).Lock(0xc000aee820)
        /usr/local/go/src/sync/mutex.go:134 +0x101
github.com/docker/docker/volume/local.(*Root).Get(0xc000aee820, 0xc003d1e3c0, 0x40, 0x1, 0x0, 0x0, 0x0)
        /go/src/github.com/docker/docker/volume/local/local.go:237 +0x33
github.com/docker/docker/volume/service.(*VolumeStore).getVolume(0xc000aeec80, 0x5597cbf18f40, 0xc00013e038, 0xc003d1e3c0, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/docker/docker/volume/service/store.go:717 +0x611
github.com/docker/docker/volume/service.(*VolumeStore).create(0xc000aeec80, 0x5597cbf18f40, 0xc00013e038, 0xc003d1e3c0, 0x40, 0x0, 0x0, 0x0, 0x0, 0x203000, ...)
        /go/src/github.com/docker/docker/volume/service/store.go:582 +0x950
github.com/docker/docker/volume/service.(*VolumeStore).Create(0xc000aeec80, 0x5597cbf18f40, 0xc00013e038, 0xc003d1e3c0, 0x40, 0x0, 0x0, 0xc002b0f090, 0x1, 0x1, ...)
        /go/src/github.com/docker/docker/volume/service/store.go:468 +0x1c2
github.com/docker/docker/volume/service.(*VolumesService).Create(0xc000a96540, 0x5597cbf18f40, 0xc00013e038, 0xc003d1e3c0, 0x40, 0x0, 0x0, 0xc002b0f090, 0x1, 0x1, ...)
        /go/src/github.com/docker/docker/volume/service/service.go:61 +0xc6
github.com/docker/docker/daemon.(*Daemon).createContainerOSSpecificSettings(0xc0009741e0, 0xc000c3a6c0, 0xc0024f2b40, 0xc00225bc00, 0x0, 0x0)
        /go/src/github.com/docker/docker/daemon/create_unix.go:62 +0x364
github.com/docker/docker/daemon.(*Daemon).create(0xc0009741e0, 0xc000ef8ca3, 0x74, 0xc0024f2b40, 0xc00225bc00, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/docker/docker/daemon/create.go:177 +0x44f
github.com/docker/docker/daemon.(*Daemon).containerCreate(0xc0009741e0, 0xc000ef8ca3, 0x74, 0xc0024f2b40, 0xc00225bc00, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/docker/docker/daemon/create.go:72 +0x1c8
github.com/docker/docker/daemon.(*Daemon).ContainerCreate(0xc0009741e0, 0xc000ef8ca3, 0x74, 0xc0024f2b40, 0xc00225bc00, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/docker/docker/daemon/create.go:31 +0xa6
......
```
From this stack we can see that the goroutine is blocked on the Mutex at address `0xc000aee820`, and that this address is the same as the function receiver of `github.com/docker/docker/volume/local.(*Root).Get`. Let's look at the code to see what kind of data structure this `Root` is:
```go
// Store is an in-memory store for volume drivers
type Store struct {
	extensions   map[string]volume.Driver
	mu           sync.Mutex
	driverLock   *locker.Locker
	pluginGetter getter.PluginGetter
}

// Driver is for creating and removing volumes.
type Driver interface {
	// Name returns the name of the volume driver.
	Name() string
	// Create makes a new volume with the given name.
	Create(name string, opts map[string]string) (Volume, error)
	// Remove deletes the volume.
	Remove(vol Volume) (err error)
	// List lists all the volumes the driver has
	List() ([]Volume, error)
	// Get retrieves the volume with the requested name
	Get(name string) (Volume, error)
	// Scope returns the scope of the driver (e.g. `global` or `local`).
	// Scope determines how the driver is handled at a cluster level
	Scope() string
}

// Root implements the Driver interface for the volume package and
// manages the creation/removal of volumes. It uses only standard vfs
// commands to create/remove dirs within its provided scope.
type Root struct {
	m            sync.Mutex
	scope        string
	path         string
	volumes      map[string]*localVolume
	rootIdentity idtools.Identity
}
```
`Root` is the implementation of the local volume driver and manages the lifecycle of volumes. It caches all volumes, and the cached data is protected by a Mutex. `github.com/docker/docker/volume/local.(*Root).Get` is blocked at line 237 waiting for that Mutex, which is why newly created containers on the node stayed in the Created state:
```go
// Get looks up the volume for the given name and returns it if found
func (r *Root) Get(name string) (volume.Volume, error) {
	r.m.Lock() // Line 237
	v, exists := r.volumes[name]
	r.m.Unlock()
	if !exists {
		return nil, ErrNotFound
	}
	return v, nil
}
```
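To make the contention pattern concrete, here is a minimal, self-contained sketch. It is not Docker's code: the `fakeRoot` type and the sleep durations are invented for illustration, but the structure mirrors what the stacks show — a slow `Remove` holding the same mutex that every `Get` needs.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fakeRoot is a stand-in for Docker's Root: a map guarded by a single mutex.
type fakeRoot struct {
	m       sync.Mutex
	volumes map[string]string
}

// Remove holds the mutex for the whole (slow) deletion, as Root.Remove does.
func (r *fakeRoot) Remove(name string) {
	r.m.Lock()
	defer r.m.Unlock()
	time.Sleep(10 * time.Second) // stands in for a very slow os.RemoveAll
	delete(r.volumes, name)
}

// Get needs the same mutex, so it parks in semacquire until Remove returns.
func (r *fakeRoot) Get(name string) string {
	r.m.Lock()
	defer r.m.Unlock()
	return r.volumes[name]
}

func main() {
	r := &fakeRoot{volumes: map[string]string{"v1": "/var/lib/docker/volumes/v1/_data"}}
	go r.Remove("v1")
	time.Sleep(100 * time.Millisecond) // let Remove take the lock first
	start := time.Now()
	r.Get("v1")
	fmt.Println("Get blocked for", time.Since(start)) // roughly 10s
}
```

Running it prints a `Get` wait of roughly ten seconds; in the incident, the role of the sleep was played by an `os.RemoveAll` that could not finish.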
So containers stuck in the Created state were only a symptom. The question then becomes: who was holding the Mutex at address `0xc000aee820`?
Who was holding the Mutex that blocked the other goroutines
By searching the goroutine dump for the Mutex at address `0xc000aee820`, we found the goroutine that held it:
```
goroutine 19822190 [syscall]:
syscall.Syscall(0x107, 0xffffffffffffff9c, 0xc0026a5710, 0x200, 0x0, 0x0, 0x15)
        /usr/local/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.unlinkat(0xffffffffffffff9c, 0xc0026a55f0, 0x83, 0x200, 0x5597cbefb200, 0xc000c913e8)
        /usr/local/go/src/syscall/zsyscall_linux_amd64.go:126 +0x8d
syscall.Rmdir(0xc0026a55f0, 0x83, 0x5597cbefb200, 0xc000c913e8)
        /usr/local/go/src/syscall/syscall_linux.go:158 +0x49
os.Remove(0xc0026a55f0, 0x83, 0x24, 0x83)
        /usr/local/go/src/os/file_unix.go:310 +0x70
os.RemoveAll(0xc0026a55f0, 0x83, 0x5e, 0x5597cb0bebee)
        /usr/local/go/src/os/path.go:68 +0x4f
os.RemoveAll(0xc001b8b320, 0x5e, 0x2, 0xc000843620)
        /usr/local/go/src/os/path.go:109 +0x4f7
github.com/docker/docker/volume/local.removePath(0xc001b8b320, 0x5e, 0x5e, 0x1)
        /go/src/github.com/docker/docker/volume/local/local.go:226 +0x4f
github.com/docker/docker/volume/local.(*Root).Remove(0xc000aee820, 0x5597cbf258c0, 0xc001508e10, 0x0, 0x0)
        /go/src/github.com/docker/docker/volume/local/local.go:217 +0x1f8
github.com/docker/docker/volume/service.(*VolumeStore).Remove(0xc000aeec80, 0x5597cbf18f40, 0xc00013e038, 0x5597cbf25da0, 0xc002587b00, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/docker/docker/volume/service/store.go:796 +0x71f
github.com/docker/docker/volume/service.(*VolumesService).Remove(0xc000a96540, 0x5597cbf18f40, 0xc00013e038, 0xc002550c40, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/docker/docker/volume/service/service.go:135 +0x1e8
github.com/docker/docker/daemon.(*Daemon).removeMountPoints(0xc0009741e0, 0xc000bba6c0, 0x1, 0x0, 0x0)
        /go/src/github.com/docker/docker/daemon/mounts.go:40 +0x239
github.com/docker/docker/daemon.(*Daemon).cleanupContainer(0xc0009741e0, 0xc000bba6c0, 0x101, 0xc000bba6c0, 0x0)
        /go/src/github.com/docker/docker/daemon/delete.go:141 +0x827
github.com/docker/docker/daemon.(*Daemon).ContainerRm(0xc0009741e0, 0xc000bb8089, 0x40, 0xc000998066, 0x0, 0x0)
        /go/src/github.com/docker/docker/daemon/delete.go:45 +0x272
......
```
From this goroutine stack we can see that `github.com/docker/docker/volume/local.(*Root).Remove` holds the Mutex at address `0xc000aee820` and has progressed to line 217. This function calls `os.RemoveAll` to delete the specified volume and its data:
```go
// Remove removes the specified volume and all underlying data. If the
// given volume does not belong to this driver and an error is
// returned. The volume is reference counted, if all references are
// not released then the volume is not removed.
func (r *Root) Remove(v volume.Volume) error {
	r.m.Lock()
	defer r.m.Unlock()
	// ......
	if err := removePath(realPath); err != nil { // Line 217
		return err
	}

	delete(r.volumes, lv.name)
	return removePath(filepath.Dir(lv.path))
}
```
Looking at the goroutine stack again, `os.RemoveAll` appears twice. Reading the source shows that `os.RemoveAll` is implemented recursively, with the recursive call on line 109:
```go
// RemoveAll removes path and any children it contains.
// It removes everything it can but returns the first error
// it encounters. If the path does not exist, RemoveAll
// returns nil (no error).
func RemoveAll(path string) error {
	// Simple case: if Remove works, we're done.
	err := Remove(path) // Line 68
	if err == nil || IsNotExist(err) {
		return nil
	}
	// ......
	err = nil
	for {
		// ......
		names, err1 := fd.Readdirnames(request)

		// Removing files from the directory may have caused
		// the OS to reshuffle it. Simply calling Readdirnames
		// again may skip some entries. The only reliable way
		// to avoid this is to close and re-open the
		// directory. See issue 20841.
		fd.Close()

		for _, name := range names {
			err1 := RemoveAll(path + string(PathSeparator) + name) // Line 109
			if err == nil {
				err = err1
			}
		}
		// ......
	}
	// ......
}
```
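To get a feel for why this matters at scale, the rough experiment below (not taken from the incident; the directory count is scaled well down from the roughly five million seen in production) creates a large number of empty subdirectories and then times `os.RemoveAll` over them. Every empty subdirectory needs its own `Rmdir`/`unlinkat`, so the total time grows with the number of entries:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func main() {
	// Create a scratch directory full of empty subdirectories.
	root, err := os.MkdirTemp("", "volume-leak-demo")
	if err != nil {
		panic(err)
	}

	const n = 100000 // scaled down from the ~5 million directories seen in production
	for i := 0; i < n; i++ {
		if err := os.Mkdir(filepath.Join(root, fmt.Sprintf("dir-%d", i)), 0o755); err != nil {
			panic(err)
		}
	}

	// os.RemoveAll must issue one rmdir/unlinkat per entry, so the cost
	// grows linearly with the number of subdirectories.
	start := time.Now()
	if err := os.RemoveAll(root); err != nil {
		panic(err)
	}
	fmt.Printf("removed %d empty directories in %v\n", n, time.Since(start))
}
```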
The topmost frame of the goroutine stack is `syscall.unlinkat`, i.e. the container's filesystem directories are being removed via the `unlinkat` system call. We then noticed that the volume of one Terminating Pod's container looked abnormal:
```shell
$ ls -l /var/lib/docker/volumes/0789a0f8cbfdc59de30726a7ea21a76dd36fea0e4e832c9f806cdf39c29197c5/
total 4
drwxr-xr-x 1 root root 512378124 Aug 26  2020 _data
```
The `_data` directory itself was more than 500 MB in size, yet its link count was only 1. The ext4 documentation explains this:
```
dir_nlink
    Normally, ext4 allows an inode to have no more than 65,000 hard links.
    This applies to regular files as well as directories, which means that
    there can be no more than 64,998 subdirectories in a directory (because
    each of the '.' and '..' entries, as well as the directory entry for the
    directory in its parent directory counts as a hard link). This feature
    lifts this limit by causing ext4 to use a link count of 1 to indicate
    that the number of hard links to a directory is not known when the link
    count might exceed the maximum count limit.
```
In other words, when a directory on an ext4 filesystem contains more than 64,998 subdirectories, its link count is set to 1 to indicate that the hard-link count has exceeded the maximum limit. Walking the entries under this directory, we found more than five million empty directories, far beyond the 64,998 limit. This is why, once the Pod deletion logic was first triggered, the node's disk utilization stayed high and its CPU load was abnormal: deleting the volume's files was extremely slow, which blocked the deletion logic of every container of the same workload. The relevant code confirms that when the Kubelet removes a container, its volumes are reclaimed along with it:
```go
// RemoveContainer removes the container.
func (ds *dockerService) RemoveContainer(_ context.Context, r *runtimeapi.RemoveContainerRequest) (*runtimeapi.RemoveContainerResponse, error) {
	// Ideally, log lifecycle should be independent of container lifecycle.
	// However, docker will remove container log after container is removed,
	// we can't prevent that now, so we also clean up the symlink here.
	err := ds.removeContainerLogSymlink(r.ContainerId)
	if err != nil {
		return nil, err
	}
	err = ds.client.RemoveContainer(r.ContainerId, dockertypes.ContainerRemoveOptions{RemoveVolumes: true, Force: true}) // Line 280
	if err != nil {
		return nil, fmt.Errorf("failed to remove container %q: %v", r.ContainerId, err)
	}
	return &runtimeapi.RemoveContainerResponse{}, nil
}
```
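As a side note, a runaway volume directory of this kind can be spotted before it bites. The following sketch is a hypothetical, Linux-only helper — it is not part of Docker or the Kubelet — that walks the local volume driver's `_data` directories and flags any whose directory inode is unusually large or whose link count has collapsed to 1 under ext4's `dir_nlink` behavior:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

func main() {
	// Scan the local volume driver's data directories.
	matches, err := filepath.Glob("/var/lib/docker/volumes/*/_data")
	if err != nil {
		panic(err)
	}
	for _, dir := range matches {
		info, err := os.Stat(dir)
		if err != nil {
			continue
		}
		st, ok := info.Sys().(*syscall.Stat_t)
		if !ok {
			continue
		}
		// On ext4 with dir_nlink, a link count of 1 on a directory means the
		// subdirectory count is no longer tracked; a directory inode of several
		// hundred MB is another strong hint that millions of entries piled up.
		if st.Nlink == 1 || info.Size() > 4<<20 {
			fmt.Printf("suspicious volume dir: %s (size=%d bytes, nlink=%d)\n", dir, info.Size(), st.Nlink)
		}
	}
}
```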
Why the metadata in Containerd could not be deleted
One question remained: why could the `ctr` command still see the metadata of the container that was supposed to be deleted? We found another class of goroutines waiting on the same Mutex:
```
goroutine 19943568 [semacquire, 95 minutes]:
sync.runtime_SemacquireMutex(0xc000aee824, 0x5597c9ab7300)
        /usr/local/go/src/runtime/sema.go:71 +0x3f
sync.(*Mutex).Lock(0xc000aee820)
        /usr/local/go/src/sync/mutex.go:134 +0x101
github.com/docker/docker/volume/local.(*Root).Get(0xc000aee820, 0xc002b12180, 0x40, 0x5597cbf22080, 0xc000aee820, 0x0, 0x0)
        /go/src/github.com/docker/docker/volume/local/local.go:237 +0x33
github.com/docker/docker/volume/service.lookupVolume(0x5597cbf18f40, 0xc00013e038, 0xc000cc5500, 0xc000c914c8, 0x5, 0xc002b12180, 0x40, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/docker/docker/volume/service/store.go:744 +0xc7
github.com/docker/docker/volume/service.(*VolumeStore).getVolume(0xc000aeec80, 0x5597cbf18f40, 0xc00013e038, 0xc002b12180, 0x40, 0x5597cb0c1541, 0x5, 0x5597c9a85bb5, 0x0, 0xc003deb198, ...)
        /go/src/github.com/docker/docker/volume/service/store.go:688 +0x299
github.com/docker/docker/volume/service.(*VolumeStore).Get(0xc000aeec80, 0x5597cbf18f40, 0xc00013e038, 0xc002b12180, 0x40, 0xc003deb240, 0x1, 0x1, 0x0, 0x0, ...)
        /go/src/github.com/docker/docker/volume/service/store.go:636 +0x173
github.com/docker/docker/volume/service.(*VolumesService).Unmount(0xc000a96540, 0x5597cbf18f40, 0xc00013e038, 0xc0016f1ce0, 0xc00381b040, 0x40, 0x5597c9a5bfeb, 0xc0018ced50)
        /go/src/github.com/docker/docker/volume/service/service.go:105 +0xc6
github.com/docker/docker/daemon.(*volumeWrapper).Unmount(0xc003cdede0, 0xc00381b040, 0x40, 0x0, 0x0)
        /go/src/github.com/docker/docker/daemon/volumes.go:414 +0x6a
github.com/docker/docker/volume/mounts.(*MountPoint).Cleanup(0xc001d89680, 0xc0013cbc50, 0xc003deb3f8)
        /go/src/github.com/docker/docker/volume/mounts/mounts.go:83 +0x7a
github.com/docker/docker/container.(*Container).UnmountVolumes(0xc000c3ad80, 0xc003deb4e0, 0x60, 0x0)
        /go/src/github.com/docker/docker/container/container.go:475 +0x102
github.com/docker/docker/daemon.(*Daemon).Cleanup(0xc0009741e0, 0xc000c3ad80)
        /go/src/github.com/docker/docker/daemon/start.go:257 +0x4ae
......
```
This goroutine's stack contains the `github.com/docker/docker/daemon.(*Daemon).Cleanup` function, which has progressed to line 257. This function releases the container's network resources and unmounts the container's filesystem:
```go
// Cleanup releases any network resources allocated to the container along with any rules
// around how containers are linked together. It also unmounts the container's root filesystem.
func (daemon *Daemon) Cleanup(container *container.Container) {
	// ......
	if container.BaseFS != nil && container.BaseFS.Path() != "" {
		if err := container.UnmountVolumes(daemon.LogVolumeEvent); err != nil { // Line 257
			logrus.Warnf("%s cleanup: Failed to umount volumes: %v", container.ID, err)
		}
	}

	container.CancelAttachContext()

	if err := daemon.containerd.Delete(context.Background(), container.ID); err != nil {
		logrus.Errorf("%s cleanup: failed to delete container from containerd: %v", container.ID, err)
	}
}
```
The call into Containerd that deletes the metadata comes only after the `github.com/docker/docker/container.(*Container).UnmountVolumes` call on line 257, which explains why the `ctr` command still showed the container's metadata in Containerd.
The Truth Comes to Light
Where did these containers with 500 MB volume directories come from? After talking with the user we got the answer: the user did not understand the meaning and intended use cases of Docker Volumes and had declared them in the Dockerfile:
```dockerfile
# ......
VOLUME /app/images
VOLUME /app/logs
# ......
```
Each `VOLUME` instruction makes Docker create an anonymous local volume for every container started from the image, backed by a directory under `/var/lib/docker/volumes`. The application wrote to these volumes frequently without any effective garbage collection, so after a while a huge number of empty directories leaked, eventually triggering the Terminating Pod problem. At this point the cause was clear. The Terminating Pod problem unfolds as follows:
- A Pod of one workload frequently wrote data into a volume, eventually pushing the directory's hard-link count past the maximum limit.
- The user performed a rolling update, and the Pod deletion time was recorded in `.metadata.deletionTimestamp`.
- The Docker Volume removal logic was invoked; because the volume contained so many empty directories, the `unlinkat` system call was executed a huge number of times.
- While `os.RemoveAll` recursively deleted the volume directory, the flood of `unlinkat` system calls drove the node's disk utilization very high and made the CPU load abnormal.
- The first goroutine to execute the volume removal logic held the Mutex protecting the `Root` cache; since `os.RemoveAll` had to recurse over five million files and could not return, every subsequent volume operation on the node blocked waiting for that Mutex.
- Containers that used volumes could not be removed, and multiple Terminating Pods appeared in the cluster.

時序圖
Summary
For emergency recovery in the production environment, we took the node offline, reformatted its disk, and brought it back online. We also advised the user to abandon Docker Volumes as soon as possible in favor of Kubernetes local storage. Following our advice, the user modified the Dockerfile and the orchestration templates and optimized the application logic, eliminating this class of problem.
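For reference, one way to replace the `VOLUME` declarations in the Dockerfile is to mount Pod-scoped volumes in the workload template instead, for example an `emptyDir` (or a local PersistentVolume if the data must outlive the Pod). The manifest below is illustrative only: the Pod name and image are placeholders, and only the mount paths mirror the user's Dockerfile.

```yaml
# Illustrative sketch: emptyDir volumes mounted at the paths the Dockerfile
# previously declared with VOLUME, so the data lives and dies with the Pod.
apiVersion: v1
kind: Pod
metadata:
  name: face-compare-example   # hypothetical name
spec:
  containers:
  - name: docker-image
    image: example.com/face-compare:latest   # placeholder image
    volumeMounts:
    - name: images
      mountPath: /app/images
    - name: logs
      mountPath: /app/logs
  volumes:
  - name: images
    emptyDir: {}
  - name: logs
    emptyDir: {}
```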
The Qingzhou Kubernetes enhancement team also benefited greatly from this process. Solving problems from a technical angle is only one dimension of our work; the gap between users' understanding of cloud native technology and the standards the platform team promotes deserves even more attention. Although we have solved the user's basic problems with Kubernetes for now, we still have a long way to go in helping users address the real pain points of adopting cloud native, deepening their understanding of new technology, lowering their cost of use, and letting them truly enjoy the technology dividend.
About the Author
Huang Jiuyuan is a senior development engineer at NetEase Shufan, focusing on cloud native and distributed systems. He has taken part in the large-scale containerization of NetEase Cloud Music, NetEase Media, NetEase Yanxuan, Kaola, and other users, as well as the productization of the NetEase Qingzhou container platform. His main areas include cluster monitoring, intelligent operations systems, and maintenance of core Kubernetes and Docker components. He is currently responsible for the design, development, and commercialization of NetEase Qingzhou's cloud native automatic fault diagnosis system.