This article comes from the OPPO Internet Infrastructure Technology Team. Please credit the author when reposting. You are also welcome to follow our official account, OPPO_tech, where we share OPPO's latest internet technologies and activities.
Background: this is an old problem we ran into a while ago; recently someone else in the Docker community reported the same issue, and it is still unresolved.
The issue link is: https://github.com/containerd...
Below is how we investigated and resolved this problem.
OPPO Cloud's intelligent monitoring found that the lxcfs service had been out of its working state for longer than the configured threshold:
# systemctl status lxcfs
● lxcfs.service - FUSE filesystem for LXC
   Loaded: loaded (/usr/lib/systemd/system/lxcfs.service; enabled; vendor preset: disabled)
   Active: activating (start-post) since Tue 2020-06-23 14:37:50 CST; 5min ago    ---- this case is from June; note the service is not in the active (running) state
     Docs: man:lxcfs(1)
  Process: 415455 ExecStopPost=/bin/sh -c if mount |grep "baymax\/lxcfs"; then fusermount -u /var/lib/baymax/lxcfs; fi (code=exited, status=0/SUCCESS)
 Main PID: 415526 (lxcfs);         : 415529 (lxcfs-remount-c)
    Tasks: 43
   Memory: 28.9M
   CGroup: /system.slice/lxcfs.service
           ├─415526 /usr/bin/lxcfs -o nonempty /var/lib/baymax/lxcfs/
           └─control
             ├─415529 /bin/sh /usr/bin/lxcfs-remount-containers
             ├─416923 /bin/sh /usr/bin/lxcfs-remount-containers
             └─419090 docker exec 1eb2f723b69e sh -c \ export f=/proc/cpuinfo && test -f /var/lib/baymax/lxcfs/$f && (umount $f; mount --bind...
Looking at the corresponding process tree, we found that the runc for container 1eb2f723b69e was blocked and had not returned.
# ps -ef |grep -i runc |grep -v shim
root     172169 138974  0 14:43 pts/2    00:00:00 grep --color -i runc
root     420924  70170  0 14:37 ?        00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/1eb2f723b69e2dba83bc490d3fab66922a13a4787be8bcb4cd486e97843ffef5/log.json --log-format json exec --process /tmp/runc-process904568476 --detach --pid-file /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/1eb2f723b69e2dba83bc490d3fab66922a13a4787be8bcb4cd486e97843ffef5/4dfeee72cd794ebec396fb8450f8944499cdde99d22054c950e5a80fb56f0968.pid 1eb2f723b69e2dba83bc490d3fab66922a13a4787be8bcb4cd486e97843ffef5
root     423656 420924  0 14:37 ?        00:00:00 runc init
Running cat /proc/423656/stack showed that it was blocked in a pipe write.
Then we looked at the detailed stack of 423656:
PID: 423656  TASK: ffffa0e872d56180  CPU: 28  COMMAND: "runc:[2:INIT]"
......
 #1 [ffffa13093eb3d08] schedule at ffffffffb6969f19
    ffffa13093eb3d10: ffffa13093eb3d60 ffffffffb644bd50
 #2 [ffffa13093eb3d18] pipe_wait at ffffffffb644bd50
    ffffa13093eb3d20: 0000000000000000 ffffa0e872d56180
    ffffa13093eb3d30: ffffffffb62c3f50 ffffa0f072d87108
    ffffa13093eb3d40: ffffa1032ad42030 00000000e3a7c164
    ffffa13093eb3d50: ffffa1032ad42000 0000000000000010    ---- analyzing the stack layout, the pipe's inode is pushed here
    ffffa13093eb3d60: ffffa13093eb3de8 ffffffffb644bff9
 #3 [ffffa13093eb3d68] pipe_write at ffffffffb644bff9
    ffffa13093eb3d70: ffffa1032ad42028 ffffa0e872d56180
    ffffa13093eb3d80: ffffa13093eb3df8 0000000000000000
    .... some frames omitted for brevity ....
    ffffa13093eb3f50: ffffffffb6976ddb
 #7 [ffffa13093eb3f50] system_call_fastpath at ffffffffb6976ddb
    RIP: 000000000045b8a5  RSP: 000000c000008be8  RFLAGS: 00010206
    RAX: 0000000000000001  RBX: 0000000000000000  RCX: 000000c000000000
    RDX: 0000000000000010  RSI: 000000c000008bf0  RDI: 0000000000000002
    RBP: 000000c000008b90   R8: 0000000000000001   R9: 00000000006c0fab
    R10: 0000000000000000  R11: 0000000000000202  R12: 0000000000000000
    R13: 0000000000000000  R14: 000000000086d0d8  R15: 0000000000000000
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b
It looks like it is stuck in a pipe write, so let's look at its open files and find the corresponding inode information:
PID: 423656  TASK: ffffa0e872d56180  CPU: 28  COMMAND: "runc:[2:INIT]"
ROOT: /rootfs    CWD: /rootfs
 FD       FILE            DENTRY           INODE       TYPE PATH
  0 ffffa0feca33cf00 ffffa1333b800240 ffffa1333f568850 CHR  /dev/null
  1 ffffa1031adba700 ffffa0f10efb83c0 ffffa0d78de55c80 FIFO
  2 ffffa12f3d7df300 ffffa0f10efb8a80 ffffa0d78de56f00 FIFO    ---- the pipe in question
  3 ffffa133026b9700 ffffa11382e2f080 ffffa12a2b07ad30 SOCK UNIX
Let's verify this pipe:
crash> struct file.private_data ffffa12f3d7df300
  private_data = 0xffffa1032ad42000
crash> pipe_inode_info 0xffffa1032ad42000    ---- matches the address we saw on the stack above
struct pipe_inode_info {
  mutex = {
    count = {
      counter = 1
    },
    .....
  },
  nrbufs = 1,    ---- only one buffer: when this pipe was created there were not enough pages, which is mainly governed by the default pipe-user-pages-soft setting and pipe-max-size
  curbuf = 0,
  buffers = 1,
  readers = 1,
  writers = 1,
  files = 2,
  waiting_writers = 1,
  r_counter = 1,
  w_counter = 1,
  tmp_page = 0x0,
  fasync_readers = 0x0,
  fasync_writers = 0x0,
  bufs = 0xffffa132c3a3d100,
  user = 0xffffffffb6e4d700
}
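As a side note, the same capacity check can be done from user space without the crash tool. The following Go sketch is our own illustration (it is not part of the original troubleshooting): it creates a pipe and queries its capacity with fcntl(F_GETPIPE_SZ). On a host where the user has already exceeded pipe-user-pages-soft, an unprivileged process will typically see 4096 bytes (one page) instead of the default 65536 bytes (16 pages):

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Create an ordinary pipe, just like the one runc init uses for its stderr.
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}
	defer r.Close()
	defer w.Close()

	// F_GETPIPE_SZ returns the pipe capacity in bytes.
	// With default limits this is 65536 (16 pages); once the per-user soft
	// limit is exceeded, new pipes of unprivileged users get only 4096 (1 page).
	size, err := unix.FcntlInt(w.Fd(), unix.F_GETPIPE_SZ, 0)
	if err != nil {
		panic(err)
	}
	fmt.Printf("pipe capacity: %d bytes\n", size)
}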
Let's look at the contents of the pipe:
crash> pipe_buffer 0xffffa132c3a3d100
struct pipe_buffer {
  page = 0xffffe392f992cb00,
  offset = 0,
  len = 4081,    ---- length of the buffered data
  ops = 0xffffffffb6a2e000,
  flags = 0,
  private = 0
}
crash> kmem -p |grep ffffe392f992cb00
ffffe392f992cb00 2e64b2c000                0        0  1 2fffff00000000
crash> rd -a -p 2e64b2c000 4081    ---- this 4081 is the length shown above
2e64b2c000:  runtime/cgo: pthread_create failed: Resource temporarily una
2e64b2c03c:  vailable
2e64b2c045:  SIGABRT: abort
2e64b2c054:  PC=0x6c0fab m=0 sigcode=18446744073709551610
2e64b2c082:  goroutine 0 [idle]:
2e64b2c096:  runtime: unknown pc 0x6c0fab
2e64b2c0b3:  stack: frame={sp:0x7ffc54fb5b18, fp:0x0} stack=[0x7ffc547b6f
2e64b2c0ef:  a8,0x7ffc54fb5fd0)
2e64b2c102:  00007ffc54fb5a18:  0000000000004000  0000000000000000
2e64b2c139:  00007ffc54fb5a28:  0000000000d0eb80  00007fe3c913f000
.... some output omitted for brevity ....
2e64b2cf70:  00007ffc54fb5bd8:  0000000000cd4603  0000000000a9d760
2e64b2cfa7:  00007ffc54fb5be8:  00000000006ea87b  0000000000cd4580
2e64b2cfde:  00007ffc54fb5bf8:
Clearly, the pipe write is blocked because there is no page space left in the pipe. The pipe is in the default blocking mode, so if the other end were reading in time, space would be freed up again.
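To make that blocking behavior concrete, here is a small self-contained Go sketch of our own (not code from runc or containerd): it shrinks a pipe to a single page with F_SETPIPE_SZ, the same one-page capacity the kernel falls back to once pipe-user-pages-soft is exceeded, and shows that a write larger than the capacity blocks until a reader drains the pipe:

package main

import (
	"fmt"
	"io"
	"os"
	"time"

	"golang.org/x/sys/unix"
)

func main() {
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}

	// Shrink the pipe to a single page (4096 bytes), mimicking what the kernel
	// does to new pipes once pipe-user-pages-soft is exceeded.
	if _, err := unix.FcntlInt(w.Fd(), unix.F_SETPIPE_SZ, 4096); err != nil {
		panic(err)
	}

	done := make(chan struct{})
	go func() {
		// 8 KiB does not fit into a 4 KiB pipe, so this write blocks until the
		// other end reads -- the same state runc init is stuck in above.
		w.Write(make([]byte, 8192))
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("unexpected: write finished without a reader")
	case <-time.After(time.Second):
		fmt.Println("writer is blocked, as expected")
	}

	// Draining the read end unblocks the writer.
	go io.Copy(io.Discard, r)
	<-done
	fmt.Println("writer finished after the reader drained the pipe")
}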
So next we need to look at why the other end never came to read:
crash> pipe_inode_info.wait 0xffffa1032ad42000
  wait = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffffa13093eb3d38,    ---- the __wait_queue task_list is linked here
      prev = 0xffffa0f072d87108
    }
  }
crash> __wait_queue 0xffffa13093eb3d20
struct __wait_queue {
  flags = 0,
  private = 0xffffa0e872d56180,    ---- this is 423656 itself waiting
  func = 0xffffffffb62c3f50,
  task_list = {
    next = 0xffffa0f072d87108,
    prev = 0xffffa1032ad42030
  }
}
From the information about the other end of the fd we can find its parent process, which is the containerd-shim process. And according to the following code:
func (r *Runc) Exec(context context.Context, id string, spec specs.Process, opts *ExecOpts) error {
	....
	cmd := r.command(context, append(args, id)...)
	....
	ec, err := Monitor.Start(cmd)
	....
	status, err := Monitor.Wait(cmd, ec)
}
Because the shim code waits for runc to exit before reading the pipe, while runc cannot exit because the pipe capacity is insufficient and its write is blocked, a deadlock forms.
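The same ordering problem can be reproduced outside of containerd in a few lines of Go. The sketch below is our own illustration (the shell payload and sizes are arbitrary, not containerd code): the parent hands the child a pipe for stderr, then calls Wait before reading the pipe, so as soon as the child writes more than the pipe capacity both sides are stuck:

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// The child writes 128 KiB to stderr, more than the default 64 KiB pipe
	// capacity (and far more than the 4 KiB a pipe gets once the user soft
	// limit is exceeded).
	cmd := exec.Command("sh", "-c", "yes | head -c 131072 >&2")

	// StderrPipe hands us the read end of an OS pipe; the write end becomes
	// the child's fd 2, just like runc init's stderr in the incident above.
	stderr, err := cmd.StderrPipe()
	if err != nil {
		panic(err)
	}
	_ = stderr // intentionally not read before Wait -- this is the bug

	if err := cmd.Start(); err != nil {
		panic(err)
	}

	// Same ordering as the shim: wait for the child to exit, read the pipe
	// afterwards. The child is blocked in write(2) because the pipe is full,
	// so Wait never returns -- a deadlock. The fix is to drain stderr
	// concurrently, e.g. go io.Copy(io.Discard, stderr), before calling Wait.
	if err := cmd.Wait(); err != nil {
		fmt.Println("wait:", err)
	}
	fmt.Println("never reached")
}

In the real incident the reader is containerd-shim and the writer is runc init, but the ordering is identical: the pipe is only drained after Wait returns, and Wait cannot return while the writer is blocked in pipe_write.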
Interested readers can also take a look at this: https://github.com/opencontai...
The problem can be reproduced as follows:

1.

# docker ps |grep busybox
8bebfd8a7b59        busybox      "sh"      6 days ago       Up 2 days
2.

# pwd
/sys/fs/cgroup/pids/docker/8bebfd8a7b59748da6bcb154ec2ce428d1f21376b16c3915d962ec4149484e5c
3.

# cat pids.current    ---- check the current number of threads
1
4.

# echo 3 > pids.max    ---- limit the thread count; runc creates multiple threads while executing, so we set it to 3 here

(runc init is a Go program, so with only 3 tasks allowed one of the Go runtime's pthread_create calls fails and the runtime aborts, writing a several-kilobyte crash report to its stderr pipe, which is exactly the data we found in the pipe earlier.)
5.

# cat pids.max
3
6. Check the default pipe configuration. pipe-user-pages-soft is the soft limit on the total number of pages an unprivileged user may have allocated to pipes; on this kernel, once a user exceeds it, every new pipe that user creates is capped at a single page:

# cat /proc/sys/fs/pipe-user-pages-soft
16384
7. Test whether it blocks:

# docker exec 8bebfd8 ls
OCI runtime exec failed: exec failed: container_linux.go:344: starting container process caused "read init-p: connection reset by peer": unknown

It did not block; the exec simply failed. Next we shrink the number of pages available to pipes:
8.
# echo 1 >/proc/sys/fs/pipe-user-pages-soft
Then run the same command again:

# docker exec 8bebfd8 ls

This time it blocks and nothing comes back. Let's see where it is blocked:
9.

# ps -ef |grep runc |grep -v shim|grep -v grep
root     122935 642935  0 09:10 ?        00:00:00 runc --root /var/run/docker/runtime-runc/moby --log /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/8bebfd8a7b59748da6bcb154ec2ce428d1f21376b16c3915d962ec4149484e5c/log.json --log-format json exec --process /tmp/runc-process773686128 --detach --pid-file /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/8bebfd8a7b59748da6bcb154ec2ce428d1f21376b16c3915d962ec4149484e5c/7d5955f67c29d2a66c77295f16d416c5baa5277635bf3f3edec381c27f30bafc.pid 8bebfd8a7b59748da6bcb154ec2ce428d1f21376b16c3915d962ec4149484e5c
root     122943 122935  0 09:10 ?        00:00:00 runc init    ---- blocked, has not returned
10. Look at the stack where it is actually blocked:

# ls /proc/122943/task/ |xargs -I file sh -c "echo file && cat /proc/file/stack"
122943
[<ffffffffadc4bd50>] pipe_wait+0x70/0xc0    ---- blocked in the pipe write
[<ffffffffadc4bff9>] pipe_write+0x1f9/0x540
[<ffffffffadc41c13>] do_sync_write+0x93/0xe0
[<ffffffffadc42700>] vfs_write+0xc0/0x1f0
[<ffffffffadc4351f>] SyS_write+0x7f/0xf0
[<ffffffffae176ddb>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff
122944
[<ffffffffadb0e286>] futex_wait_queue_me+0xc6/0x130
[<ffffffffadb0ef6b>] futex_wait+0x17b/0x280
[<ffffffffadb10cb6>] do_futex+0x106/0x5a0
[<ffffffffadb111d0>] SyS_futex+0x80/0x190
[<ffffffffae176ddb>] system_call_fastpath+0x22/0x27
[<ffffffffffffffff>] 0xffffffffffffffff
11. With the following stap probe statement we can confirm that the pipe capacity is too small (only the printf body is shown; judging from the $iocb argument, the probe is attached to the kernel's pipe write path):
printf(" %s,tid=%d,pipe=%d\n",execname(),tid(), @cast(@cast($iocb->ki_filp,"struct file")->private_data, "struct pipe_inode_info")->buffers)
Probe output:

runc:[2:INIT],tid=122943,pipe=1    ---- when capacity is sufficient this value is 16, the Linux kernel default
12. Look at the data in the pipe:

# cat /proc/122943/fd/2
runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x6c0fab m=0 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: unknown pc 0x6c0fab
stack: frame={sp:0x7fffcd06ae18, fp:0x0} stack=[0x7fffcc86c398,0x7fffcd06b3c0)
00007fffcd06ad18:  00000000ffffffff  00007f286a8ba000
00007fffcd06ad28:  00007fffcd06ad70  00007fffcd06ada8
00007fffcd06ad38:  0000000000404f61 <runtime.mmap+177>  00007fffcd06ad78
00007fffcd06ad48:  0000000000000000  0000000000000000
.... remaining data omitted for brevity ....
That is the complete reproduction procedure.
Our solution is:
PS: there are many cases of Docker getting hung. We will publish the corresponding cases later, so that everyone can avoid stepping into the same pits, and together we can push cloud native forward.