Pods stuck in the Terminating state are a classic problem teams run into after containerizing their workloads, and the triggers vary. This article records how the NetEase Shufan Qingzhou Kubernetes enhancement team investigated step by step and discovered that an excessive number of Docker Volume directories was causing Terminating Pods, and presents the solution. We hope it helps readers troubleshoot and avoid similar problems.
Problem Background
Recently, a user's cluster once again had a Pod on one node stuck in the Terminating state for a long time. At first we assumed it was caused by several well-known Docker and Containerd issues in version 18.06.3, but after logging into the affected node we found the following environment:
| Component | Version |
| --- | --- |
| OS | Debian GNU/Linux 10 (buster) |
| Kernel | 4.19.87-netease |
| Docker | 18.09.9 |
| Containerd | 1.2.10 |
| Kubernetes | v1.13.12-netease |
The metadata of the Terminating Pod was as follows:
```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2020-09-17T06:59:28Z"
  deletionGracePeriodSeconds: 60
  deletionTimestamp: "2020-10-09T07:33:11Z"
  name: face-compare-for-music-4-achieve-848f89dfdf-4v7p6
  namespace: ai-cv-team
  # ......
spec:
  # ......
status:
  # ......
  containerStatuses:
  - containerID: docker://de6d3812bfc8b6bef915d663c3d0e0cf5b3f95a41c2a98c867a1376dbccd30d3
    lastState: {}
    name: docker-image
    ready: false
    restartCount: 0
    state:
      running:
        startedAt: "2020-09-17T06:59:30Z"
  hostIP: 10.194.172.224
  phase: Running
  podIP: 10.178.132.136
  qosClass: Guaranteed
  startTime: "2020-09-17T06:59:28Z"
```
On the node, the `docker` command showed that the business container `de6d3812bfc8` of the Terminating Pod had still not been removed:
```shell
$ docker ps -a | grep de6d3812bfc8
de6d3812bfc8    91f062eaa3a0    "java -Xms4096m -Xmx…"    3 weeks ago    Up 3 weeks    k8s_docker-image_face-compare-for-music-4-achieve-......
```
We then used the `ctr` command and found that the container's metadata still existed in Containerd:
```shell
$ ctr --address /var/run/containerd/containerd.sock --namespace moby container list | grep de6d3812bfc8
de6d3812bfc8b6bef915d663c3d0e0cf5b3f95a41c2a98c867a1376dbccd30d3    -    io.containerd.runtime.v1.linux
```
We suspected that Shim process reaping had gone wrong, so we used the `ps` command to look for the Shim process of container `de6d3812bfc8` in order to capture its stack for analysis. By then, however, neither the container's Shim process nor its business process could be found. The logs showed that Docker and Containerd had already handled the container's exit:
```
Oct  9 15:46:36 ai-data-k8s-3 dockerd[10017]: time="2020-10-09T15:46:36.824919188+08:00" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exit
Oct  9 15:46:36 ai-data-k8s-3 containerd[1965]: time="2020-10-09T15:46:36.863630606+08:00" level=info msg="shim reaped" id=de6d3812bfc8b6bef915d663c3d0e0cf5b3f95a41c2a98c867a1376dbccd30d3
Oct  9 15:46:36 ai-data-k8s-3 dockerd[10017]: time="2020-10-09T15:46:36.873487822+08:00" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/delete
Oct  9 15:46:36 ai-data-k8s-3 dockerd[10017]: time="2020-10-09T15:46:36.873531302+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
```
Meanwhile, several new business Pods were scheduled onto this node, and their containers stayed in the Created state. This behavior differs from the Docker and Containerd issues we already knew about:
```shell
$ docker ps -a | grep Created
03fed51454c2    c2e6abc00a12    "java -Xms8092m -Xmx…"    3 minutes ago    Created    k8s_docker-image_dynamic-bg-service-cpu-28-achieve-......
......
```
To summarize, the phenomena observed at this point were:
- The Kubelet's Pod deletion logic had been triggered.
- Docker had received and processed the Kubelet's request to remove the container.
- The container's Shim process and business process had already been cleaned up.
- For some reason, the container's metadata in Docker and Containerd could not be deleted.
- Containers created after this point stayed in the Created state.
Root Cause Analysis
Looking at the monitoring data, we found that when the problem occurred, the node's disk utilization was extremely high and its CPU load was abnormal:

磁盤利用率很是高api

CPU 負載異常緩存
Our initial guess was that the problem was related to the node's abnormal disk utilization.
Why containers of newly scheduled Pods stayed in the Created state
Containers of newly scheduled Pods being stuck in the Created state was a new symptom for us on Docker 18.09.9. Starting from this symptom, we examined the Docker daemon's goroutine dump (such a dump can be obtained, for example, by sending `SIGUSR1` to the `dockerd` process) and found multiple goroutines blocked in stacks containing the `github.com/docker/docker/daemon.(*Daemon).ContainerCreate` function, with `semacquire` as the blocking reason. One of these goroutines looked like this:
```
goroutine 19962397 [semacquire, 8 minutes]:
sync.runtime_SemacquireMutex(0xc000aee824, 0xc0026e4600)
        /usr/local/go/src/runtime/sema.go:71 +0x3f
sync.(*Mutex).Lock(0xc000aee820)
        /usr/local/go/src/sync/mutex.go:134 +0x101
github.com/docker/docker/volume/local.(*Root).Get(0xc000aee820, 0xc003d1e3c0, 0x40, 0x1, 0x0, 0x0, 0x0)
        /go/src/github.com/docker/docker/volume/local/local.go:237 +0x33
github.com/docker/docker/volume/service.(*VolumeStore).getVolume(0xc000aeec80, 0x5597cbf18f40, 0xc00013e038, 0xc003d1e3c0, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/docker/docker/volume/service/store.go:717 +0x611
github.com/docker/docker/volume/service.(*VolumeStore).create(0xc000aeec80, 0x5597cbf18f40, 0xc00013e038, 0xc003d1e3c0, 0x40, 0x0, 0x0, 0x0, 0x0, 0x203000, ...)
        /go/src/github.com/docker/docker/volume/service/store.go:582 +0x950
github.com/docker/docker/volume/service.(*VolumeStore).Create(0xc000aeec80, 0x5597cbf18f40, 0xc00013e038, 0xc003d1e3c0, 0x40, 0x0, 0x0, 0xc002b0f090, 0x1, 0x1, ...)
        /go/src/github.com/docker/docker/volume/service/store.go:468 +0x1c2
github.com/docker/docker/volume/service.(*VolumesService).Create(0xc000a96540, 0x5597cbf18f40, 0xc00013e038, 0xc003d1e3c0, 0x40, 0x0, 0x0, 0xc002b0f090, 0x1, 0x1, ...)
        /go/src/github.com/docker/docker/volume/service/service.go:61 +0xc6
github.com/docker/docker/daemon.(*Daemon).createContainerOSSpecificSettings(0xc0009741e0, 0xc000c3a6c0, 0xc0024f2b40, 0xc00225bc00, 0x0, 0x0)
        /go/src/github.com/docker/docker/daemon/create_unix.go:62 +0x364
github.com/docker/docker/daemon.(*Daemon).create(0xc0009741e0, 0xc000ef8ca3, 0x74, 0xc0024f2b40, 0xc00225bc00, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/docker/docker/daemon/create.go:177 +0x44f
github.com/docker/docker/daemon.(*Daemon).containerCreate(0xc0009741e0, 0xc000ef8ca3, 0x74, 0xc0024f2b40, 0xc00225bc00, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/docker/docker/daemon/create.go:72 +0x1c8
github.com/docker/docker/daemon.(*Daemon).ContainerCreate(0xc0009741e0, 0xc000ef8ca3, 0x74, 0xc0024f2b40, 0xc00225bc00, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/docker/docker/daemon/create.go:31 +0xa6
......
```
From this stack we can see that the goroutine is blocked on the Mutex at address `0xc000aee820`, and that this address is the same as the function receiver of `github.com/docker/docker/volume/local.(*Root).Get`. Let's look at the code to see what kind of data structure this `Root` is:
```go
// Store is an in-memory store for volume drivers
type Store struct {
	extensions   map[string]volume.Driver
	mu           sync.Mutex
	driverLock   *locker.Locker
	pluginGetter getter.PluginGetter
}

// Driver is for creating and removing volumes.
type Driver interface {
	// Name returns the name of the volume driver.
	Name() string
	// Create makes a new volume with the given name.
	Create(name string, opts map[string]string) (Volume, error)
	// Remove deletes the volume.
	Remove(vol Volume) (err error)
	// List lists all the volumes the driver has
	List() ([]Volume, error)
	// Get retrieves the volume with the requested name
	Get(name string) (Volume, error)
	// Scope returns the scope of the driver (e.g. `global` or `local`).
	// Scope determines how the driver is handled at a cluster level
	Scope() string
}

// Root implements the Driver interface for the volume package and
// manages the creation/removal of volumes. It uses only standard vfs
// commands to create/remove dirs within its provided scope.
type Root struct {
	m            sync.Mutex
	scope        string
	path         string
	volumes      map[string]*localVolume
	rootIdentity idtools.Identity
}
```
`Root` is the implementation of the local volume driver and manages the lifecycle of volumes. It caches all volumes, and the cached data is protected by a Mutex. `github.com/docker/docker/volume/local.(*Root).Get` is blocked at line 237 waiting for that Mutex, which is why newly created containers on the node stayed in the Created state:
```go
// Get looks up the volume for the given name and returns it if found
func (r *Root) Get(name string) (volume.Volume, error) {
	r.m.Lock() // Line 237
	v, exists := r.volumes[name]
	r.m.Unlock()
	if !exists {
		return nil, ErrNotFound
	}
	return v, nil
}
```
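To make the contention pattern concrete, here is a minimal, self-contained sketch. It is not Docker's code: the `fakeRoot` type and the sleep durations are invented for illustration, but the structure mirrors what the stacks show — a slow `Remove` holding the same mutex that every `Get` needs.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fakeRoot is a stand-in for Docker's Root: a map guarded by a single mutex.
type fakeRoot struct {
	m       sync.Mutex
	volumes map[string]string
}

// Remove holds the mutex for the whole (slow) deletion, as Root.Remove does.
func (r *fakeRoot) Remove(name string) {
	r.m.Lock()
	defer r.m.Unlock()
	time.Sleep(10 * time.Second) // stands in for a very slow os.RemoveAll
	delete(r.volumes, name)
}

// Get needs the same mutex, so it parks in semacquire until Remove returns.
func (r *fakeRoot) Get(name string) string {
	r.m.Lock()
	defer r.m.Unlock()
	return r.volumes[name]
}

func main() {
	r := &fakeRoot{volumes: map[string]string{"v1": "/var/lib/docker/volumes/v1/_data"}}
	go r.Remove("v1")
	time.Sleep(100 * time.Millisecond) // let Remove take the lock first
	start := time.Now()
	r.Get("v1")
	fmt.Println("Get blocked for", time.Since(start)) // roughly 10s
}
```

Running it prints a `Get` wait of roughly ten seconds; in the incident, the role of the sleep was played by an `os.RemoveAll` that could not finish.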
So containers stuck in the Created state were only a symptom. The question then becomes: who was holding the Mutex at address `0xc000aee820`?
Who was holding the Mutex that blocked the other goroutines
By searching the goroutine dump for the Mutex at address `0xc000aee820`, we found the goroutine that held it:
```
goroutine 19822190 [syscall]:
syscall.Syscall(0x107, 0xffffffffffffff9c, 0xc0026a5710, 0x200, 0x0, 0x0, 0x15)
        /usr/local/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.unlinkat(0xffffffffffffff9c, 0xc0026a55f0, 0x83, 0x200, 0x5597cbefb200, 0xc000c913e8)
        /usr/local/go/src/syscall/zsyscall_linux_amd64.go:126 +0x8d
syscall.Rmdir(0xc0026a55f0, 0x83, 0x5597cbefb200, 0xc000c913e8)
        /usr/local/go/src/syscall/syscall_linux.go:158 +0x49
os.Remove(0xc0026a55f0, 0x83, 0x24, 0x83)
        /usr/local/go/src/os/file_unix.go:310 +0x70
os.RemoveAll(0xc0026a55f0, 0x83, 0x5e, 0x5597cb0bebee)
        /usr/local/go/src/os/path.go:68 +0x4f
os.RemoveAll(0xc001b8b320, 0x5e, 0x2, 0xc000843620)
        /usr/local/go/src/os/path.go:109 +0x4f7
github.com/docker/docker/volume/local.removePath(0xc001b8b320, 0x5e, 0x5e, 0x1)
        /go/src/github.com/docker/docker/volume/local/local.go:226 +0x4f
github.com/docker/docker/volume/local.(*Root).Remove(0xc000aee820, 0x5597cbf258c0, 0xc001508e10, 0x0, 0x0)
        /go/src/github.com/docker/docker/volume/local/local.go:217 +0x1f8
github.com/docker/docker/volume/service.(*VolumeStore).Remove(0xc000aeec80, 0x5597cbf18f40, 0xc00013e038, 0x5597cbf25da0, 0xc002587b00, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/docker/docker/volume/service/store.go:796 +0x71f
github.com/docker/docker/volume/service.(*VolumesService).Remove(0xc000a96540, 0x5597cbf18f40, 0xc00013e038, 0xc002550c40, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/docker/docker/volume/service/service.go:135 +0x1e8
github.com/docker/docker/daemon.(*Daemon).removeMountPoints(0xc0009741e0, 0xc000bba6c0, 0x1, 0x0, 0x0)
        /go/src/github.com/docker/docker/daemon/mounts.go:40 +0x239
github.com/docker/docker/daemon.(*Daemon).cleanupContainer(0xc0009741e0, 0xc000bba6c0, 0x101, 0xc000bba6c0, 0x0)
        /go/src/github.com/docker/docker/daemon/delete.go:141 +0x827
github.com/docker/docker/daemon.(*Daemon).ContainerRm(0xc0009741e0, 0xc000bb8089, 0x40, 0xc000998066, 0x0, 0x0)
        /go/src/github.com/docker/docker/daemon/delete.go:45 +0x272
......
```
From this goroutine stack we can see that `github.com/docker/docker/volume/local.(*Root).Remove` holds the Mutex at address `0xc000aee820` and has progressed to line 217. This function calls `os.RemoveAll` to delete the specified volume and its data:
```go
// Remove removes the specified volume and all underlying data. If the
// given volume does not belong to this driver and an error is
// returned. The volume is reference counted, if all references are
// not released then the volume is not removed.
func (r *Root) Remove(v volume.Volume) error {
	r.m.Lock()
	defer r.m.Unlock()
	// ......
	if err := removePath(realPath); err != nil { // Line 217
		return err
	}

	delete(r.volumes, lv.name)
	return removePath(filepath.Dir(lv.path))
}
```
Looking at the goroutine stack again, `os.RemoveAll` appears twice. Reading the source shows that `os.RemoveAll` is implemented recursively, with the recursive call on line 109:
```go
// RemoveAll removes path and any children it contains.
// It removes everything it can but returns the first error
// it encounters. If the path does not exist, RemoveAll
// returns nil (no error).
func RemoveAll(path string) error {
	// Simple case: if Remove works, we're done.
	err := Remove(path) // Line 68
	if err == nil || IsNotExist(err) {
		return nil
	}
	// ......
	err = nil
	for {
		// ......
		names, err1 := fd.Readdirnames(request)

		// Removing files from the directory may have caused
		// the OS to reshuffle it. Simply calling Readdirnames
		// again may skip some entries. The only reliable way
		// to avoid this is to close and re-open the
		// directory. See issue 20841.
		fd.Close()

		for _, name := range names {
			err1 := RemoveAll(path + string(PathSeparator) + name) // Line 109
			if err == nil {
				err = err1
			}
		}
		// ......
	}
	// ......
}
```
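To get a feel for why this matters at scale, the rough experiment below (not taken from the incident; the directory count is scaled well down from the roughly five million seen in production) creates a large number of empty subdirectories and then times `os.RemoveAll` over them. Every empty subdirectory needs its own `Rmdir`/`unlinkat`, so the total time grows with the number of entries:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func main() {
	// Create a scratch directory full of empty subdirectories.
	root, err := os.MkdirTemp("", "volume-leak-demo")
	if err != nil {
		panic(err)
	}

	const n = 100000 // scaled down from the ~5 million directories seen in production
	for i := 0; i < n; i++ {
		if err := os.Mkdir(filepath.Join(root, fmt.Sprintf("dir-%d", i)), 0o755); err != nil {
			panic(err)
		}
	}

	// os.RemoveAll must issue one rmdir/unlinkat per entry, so the cost
	// grows linearly with the number of subdirectories.
	start := time.Now()
	if err := os.RemoveAll(root); err != nil {
		panic(err)
	}
	fmt.Printf("removed %d empty directories in %v\n", n, time.Since(start))
}
```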
The topmost frame of the goroutine stack is `syscall.unlinkat`, i.e. the container's filesystem directories are being removed via the `unlinkat` system call. We then noticed that the volume of one Terminating Pod's container looked abnormal:
```shell
$ ls -l /var/lib/docker/volumes/0789a0f8cbfdc59de30726a7ea21a76dd36fea0e4e832c9f806cdf39c29197c5/
total 4
drwxr-xr-x 1 root root 512378124 Aug 26  2020 _data
```
The `_data` directory itself was more than 500 MB in size, yet its link count was only 1. The ext4 documentation explains this:
```
dir_nlink
    Normally, ext4 allows an inode to have no more than 65,000 hard links.
    This applies to regular files as well as directories, which means that
    there can be no more than 64,998 subdirectories in a directory (because
    each of the '.' and '..' entries, as well as the directory entry for the
    directory in its parent directory counts as a hard link). This feature
    lifts this limit by causing ext4 to use a link count of 1 to indicate
    that the number of hard links to a directory is not known when the link
    count might exceed the maximum count limit.
```
In other words, when a directory on an ext4 filesystem contains more than 64,998 subdirectories, its link count is set to 1 to indicate that the hard-link count has exceeded the maximum limit. Walking the entries under this directory, we found more than five million empty directories, far beyond the 64,998 limit. This is why, once the Pod deletion logic was first triggered, the node's disk utilization stayed high and its CPU load was abnormal: deleting the volume's files was extremely slow, which blocked the deletion logic of every container of the same workload. The relevant code confirms that when the Kubelet removes a container, its volumes are reclaimed along with it:
```go
// RemoveContainer removes the container.
func (ds *dockerService) RemoveContainer(_ context.Context, r *runtimeapi.RemoveContainerRequest) (*runtimeapi.RemoveContainerResponse, error) {
	// Ideally, log lifecycle should be independent of container lifecycle.
	// However, docker will remove container log after container is removed,
	// we can't prevent that now, so we also clean up the symlink here.
	err := ds.removeContainerLogSymlink(r.ContainerId)
	if err != nil {
		return nil, err
	}
	err = ds.client.RemoveContainer(r.ContainerId, dockertypes.ContainerRemoveOptions{RemoveVolumes: true, Force: true}) // Line 280
	if err != nil {
		return nil, fmt.Errorf("failed to remove container %q: %v", r.ContainerId, err)
	}
	return &runtimeapi.RemoveContainerResponse{}, nil
}
```
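As a side note, a runaway volume directory of this kind can be spotted before it bites. The following sketch is a hypothetical, Linux-only helper — it is not part of Docker or the Kubelet — that walks the local volume driver's `_data` directories and flags any whose directory inode is unusually large or whose link count has collapsed to 1 under ext4's `dir_nlink` behavior:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

func main() {
	// Scan the local volume driver's data directories.
	matches, err := filepath.Glob("/var/lib/docker/volumes/*/_data")
	if err != nil {
		panic(err)
	}
	for _, dir := range matches {
		info, err := os.Stat(dir)
		if err != nil {
			continue
		}
		st, ok := info.Sys().(*syscall.Stat_t)
		if !ok {
			continue
		}
		// On ext4 with dir_nlink, a link count of 1 on a directory means the
		// subdirectory count is no longer tracked; a directory inode of several
		// hundred MB is another strong hint that millions of entries piled up.
		if st.Nlink == 1 || info.Size() > 4<<20 {
			fmt.Printf("suspicious volume dir: %s (size=%d bytes, nlink=%d)\n", dir, info.Size(), st.Nlink)
		}
	}
}
```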
Why the metadata in Containerd could not be deleted
One question remained: why could the `ctr` command still see the metadata of the container that was supposed to be deleted? We found another class of goroutines waiting on the same Mutex:
```
goroutine 19943568 [semacquire, 95 minutes]:
sync.runtime_SemacquireMutex(0xc000aee824, 0x5597c9ab7300)
        /usr/local/go/src/runtime/sema.go:71 +0x3f
sync.(*Mutex).Lock(0xc000aee820)
        /usr/local/go/src/sync/mutex.go:134 +0x101
github.com/docker/docker/volume/local.(*Root).Get(0xc000aee820, 0xc002b12180, 0x40, 0x5597cbf22080, 0xc000aee820, 0x0, 0x0)
        /go/src/github.com/docker/docker/volume/local/local.go:237 +0x33
github.com/docker/docker/volume/service.lookupVolume(0x5597cbf18f40, 0xc00013e038, 0xc000cc5500, 0xc000c914c8, 0x5, 0xc002b12180, 0x40, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/docker/docker/volume/service/store.go:744 +0xc7
github.com/docker/docker/volume/service.(*VolumeStore).getVolume(0xc000aeec80, 0x5597cbf18f40, 0xc00013e038, 0xc002b12180, 0x40, 0x5597cb0c1541, 0x5, 0x5597c9a85bb5, 0x0, 0xc003deb198, ...)
        /go/src/github.com/docker/docker/volume/service/store.go:688 +0x299
github.com/docker/docker/volume/service.(*VolumeStore).Get(0xc000aeec80, 0x5597cbf18f40, 0xc00013e038, 0xc002b12180, 0x40, 0xc003deb240, 0x1, 0x1, 0x0, 0x0, ...)
        /go/src/github.com/docker/docker/volume/service/store.go:636 +0x173
github.com/docker/docker/volume/service.(*VolumesService).Unmount(0xc000a96540, 0x5597cbf18f40, 0xc00013e038, 0xc0016f1ce0, 0xc00381b040, 0x40, 0x5597c9a5bfeb, 0xc0018ced50)
        /go/src/github.com/docker/docker/volume/service/service.go:105 +0xc6
github.com/docker/docker/daemon.(*volumeWrapper).Unmount(0xc003cdede0, 0xc00381b040, 0x40, 0x0, 0x0)
        /go/src/github.com/docker/docker/daemon/volumes.go:414 +0x6a
github.com/docker/docker/volume/mounts.(*MountPoint).Cleanup(0xc001d89680, 0xc0013cbc50, 0xc003deb3f8)
        /go/src/github.com/docker/docker/volume/mounts/mounts.go:83 +0x7a
github.com/docker/docker/container.(*Container).UnmountVolumes(0xc000c3ad80, 0xc003deb4e0, 0x60, 0x0)
        /go/src/github.com/docker/docker/container/container.go:475 +0x102
github.com/docker/docker/daemon.(*Daemon).Cleanup(0xc0009741e0, 0xc000c3ad80)
        /go/src/github.com/docker/docker/daemon/start.go:257 +0x4ae
......
```
This goroutine's stack contains the `github.com/docker/docker/daemon.(*Daemon).Cleanup` function, which has progressed to line 257. This function releases the container's network resources and unmounts the container's filesystem:
```go
// Cleanup releases any network resources allocated to the container along with any rules
// around how containers are linked together. It also unmounts the container's root filesystem.
func (daemon *Daemon) Cleanup(container *container.Container) {
	// ......
	if container.BaseFS != nil && container.BaseFS.Path() != "" {
		if err := container.UnmountVolumes(daemon.LogVolumeEvent); err != nil { // Line 257
			logrus.Warnf("%s cleanup: Failed to umount volumes: %v", container.ID, err)
		}
	}

	container.CancelAttachContext()

	if err := daemon.containerd.Delete(context.Background(), container.ID); err != nil {
		logrus.Errorf("%s cleanup: failed to delete container from containerd: %v", container.ID, err)
	}
}
```
The call into Containerd that deletes the metadata comes only after the `github.com/docker/docker/container.(*Container).UnmountVolumes` call on line 257, which explains why the `ctr` command still showed the container's metadata in Containerd.
The Truth Comes to Light
Where did these containers with 500 MB volume directories come from? After talking with the user we got the answer: the user did not understand the meaning and intended use cases of Docker Volumes and had declared them in the Dockerfile:
```dockerfile
# ......
VOLUME /app/images
VOLUME /app/logs
# ......
```
Each `VOLUME` instruction makes Docker create an anonymous local volume for every container started from the image, backed by a directory under `/var/lib/docker/volumes`. The application wrote to these volumes frequently without any effective garbage collection, so after a while a huge number of empty directories leaked, eventually triggering the Terminating Pod problem. At this point the cause was clear. The Terminating Pod problem unfolds as follows:
- A Pod of one workload frequently wrote data into a volume, eventually pushing the directory's hard-link count past the maximum limit.
- The user performed a rolling update, and the Pod deletion time was recorded in `.metadata.deletionTimestamp`.
- The Docker Volume removal logic was invoked; because the volume contained so many empty directories, the `unlinkat` system call was executed a huge number of times.
- While `os.RemoveAll` recursively deleted the volume directory, the flood of `unlinkat` system calls drove the node's disk utilization very high and made the CPU load abnormal.
- The first goroutine to execute the volume removal logic held the Mutex protecting the `Root` cache; since `os.RemoveAll` had to recurse over five million files and could not return, every subsequent volume operation on the node blocked waiting for that Mutex.
- Containers that used volumes could not be removed, and multiple Terminating Pods appeared in the cluster.

時序圖
Summary
For emergency recovery in the production environment, we took the node offline, reformatted its disk, and brought it back online. We also advised the user to abandon Docker Volumes as soon as possible in favor of Kubernetes local storage. Following our advice, the user modified the Dockerfile and the orchestration templates and optimized the application logic, eliminating this class of problem.
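For reference, one way to replace the `VOLUME` declarations in the Dockerfile is to mount Pod-scoped volumes in the workload template instead, for example an `emptyDir` (or a local PersistentVolume if the data must outlive the Pod). The manifest below is illustrative only: the Pod name and image are placeholders, and only the mount paths mirror the user's Dockerfile.

```yaml
# Illustrative sketch: emptyDir volumes mounted at the paths the Dockerfile
# previously declared with VOLUME, so the data lives and dies with the Pod.
apiVersion: v1
kind: Pod
metadata:
  name: face-compare-example   # hypothetical name
spec:
  containers:
  - name: docker-image
    image: example.com/face-compare:latest   # placeholder image
    volumeMounts:
    - name: images
      mountPath: /app/images
    - name: logs
      mountPath: /app/logs
  volumes:
  - name: images
    emptyDir: {}
  - name: logs
    emptyDir: {}
```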
The Qingzhou Kubernetes enhancement team also benefited greatly from this process. Solving problems from a technical angle is only one dimension of our work; the gap between users' understanding of cloud native technology and the standards the platform team promotes deserves even more attention. Although we have solved the user's basic problems with Kubernetes for now, we still have a long way to go in helping users address the real pain points of adopting cloud native, deepening their understanding of new technology, lowering their cost of use, and letting them truly enjoy the technology dividend.
About the Author
Huang Jiuyuan is a senior development engineer at NetEase Shufan, focusing on cloud native and distributed systems. He has taken part in the large-scale containerization of NetEase Cloud Music, NetEase Media, NetEase Yanxuan, Kaola, and other users, as well as the productization of the NetEase Qingzhou container platform. His main areas include cluster monitoring, intelligent operations systems, and maintenance of core Kubernetes and Docker components. He is currently responsible for the design, development, and commercialization of NetEase Qingzhou's cloud native automatic fault diagnosis system.