The previous two articles covered how we containerized a Lua/OpenResty project that had been running on cloud hosts. After it passed a stretch of validation in the test environment without incident, we started a gray release in production.
But the good times did not last. Not long into the gray release, checking with top pod showed the memory was maxed out. The first suspicion was that the k8s resources limit memory (2024Mi) was too small; after bumping it to 4096Mi and restarting the pod, it filled up again before long.
Next came the suspicion that the growing traffic was simply pushing the load too high, so we widened the HPA and restarted again. Good grief: in barely the time it takes two incense sticks to burn, memory was full once more.
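For the record, both of these mitigations were plain kubectl operations; a minimal sketch, assuming a Deployment named lua-gateway in namespace prod (both names are hypothetical, the numbers mirror the ones above):

# Raise the container memory limit from 2024Mi to 4096Mi (deployment/namespace names are made up)
kubectl -n prod set resources deployment/lua-gateway --limits=memory=4096Mi

# Scale out via HPA; if an HPA already exists, edit it instead of running autoscale
kubectl -n prod autoscale deployment/lua-gateway --cpu-percent=70 --min=4 --max=16

# Watch per-pod memory after each change
kubectl -n prod top pod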
At that point attention turned to the program inside the pod; the hunch was that something was leaking memory again.
Neither raising the pod's resources limit memory nor scaling out the HPA solved the problem. nginx -s reload would release the memory, but it filled up again shortly afterwards. Command: ps aux
USER       PID %CPU %MEM    VSZ    RSS TTY  STAT START  TIME COMMAND
root         1  0.0  0.0  252672   5844 ?   Ss   Jun11  0:00 nginx: master process /data/openresty/bin/openresty -g daemon off;
nobody     865 10.1  0.3  864328 590744 ?   S    14:56  7:14 nginx: worker process
nobody     866 13.0  0.3  860164 586748 ?   S    15:13  7:02 nginx: worker process
nobody     931 15.6  0.2  759944 486408 ?   R    15:31  5:37 nginx: worker process
nobody     938 13.3  0.1  507784 234384 ?   R    15:49  2:23 nginx: worker process
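A single ps snapshot only hints at the churn; watching the workers repeatedly makes the kill/respawn cycle obvious. A minimal sketch:

# Print worker PIDs, memory and start time once a minute;
# rising PIDs with short uptimes mean workers keep getting replaced
while true; do
  date
  ps aux | grep '[n]ginx: worker'
  sleep 60
done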
The worker PIDs were already approaching 1000, so workers were clearly being killed and respawned over and over. Who was doing the killing? The dmesg command answers that:
[36812300.604948] dljgo invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=999
[36812300.648057] Task in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode4ad18fa_3b8f_4600_a557_a2bc853e80d9.slice/docker-c888fefbafc14b39e42db5ad204b2e5fa7cbfdf20cbd621ecf15fdebcb692a61.scope killed as a result of limit of /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode4ad18fa_3b8f_4600_a557_a2bc853e80d9.slice/docker-c888fefbafc14b39e42db5ad204b2e5fa7cbfdf20cbd621ecf15fdebcb692a61.scope
[36812300.655582] memory: usage 500000kB, limit 500000kB, failcnt 132
[36812300.657244] memory+swap: usage 500000kB, limit 500000kB, failcnt 0
[36812300.658931] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
……
[36675871.040485] Memory cgroup out of memory: Kill process 16492 (openresty) score 1286 or sacrifice child
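If the ring buffer is busy, the OOM records can be filtered out directly; a minimal sketch:

# Human-readable timestamps, OOM-related lines only
dmesg -T | grep -iE 'oom|killed process'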
When a cgroup runs out of memory, the Linux kernel triggers a cgroup OOM and picks some processes to kill so it can reclaim memory and keep the system running as best it can.
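The usage, limit and failcnt values in that log can also be read straight from the memory cgroup files inside the container; a minimal sketch, assuming cgroup v1 (which the kubepods paths above suggest):

# Current usage vs. limit for this container's memory cgroup
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
# Number of times the limit has been hit (the failcnt in the dmesg output)
cat /sys/fs/cgroup/memory/memory.failcnt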
Killing the nginx worker processes does free the memory and buys a temporary reprieve, but the underlying problem is still there.
We copied the Lua code from the cloud hosts and compared it against the containerized version; the code itself was fine. So the cause had to be something else.
Since neither of the checks above pinpointed the problem, the only option left was to dig in with top, pmap, gdb and similar tools.
Use top to see which process is hogging memory:
  PID USER   PR NI   VIRT    RES   SHR S %CPU %MEM   TIME+   COMMAND
  942 nobody 20  0  618.9m 351.7m  4.2m S 18.0  0.2  4:05.72 openresty
  943 nobody 20  0  413.8m 146.7m  4.2m S 11.7  0.1  1:18.93 openresty
  940 nobody 20  0  792.0m 524.9m  4.2m S  7.0  0.3  6:25.81 openresty
  938 nobody 20  0  847.4m 580.2m  4.2m S  3.7  0.3  7:15.97 openresty
    1 root   20  0  246.8m   5.7m  3.9m S  0.0  0.0  0:00.24 openresty
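A quick way to get this ranking is to let top sort by resident memory itself (with procps-ng top); a minimal sketch:

# One batch iteration, sorted by %MEM, largest entries only
top -b -n 1 -o %MEM | head -n 15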
Use pmap -x pid to look at the process's memory map; there is one huge allocation starting at 0000000000af7000.
Address           Kbytes     RSS   Dirty Mode  Mapping
0000000000400000    1572     912       0 r-x-- nginx
0000000000788000       4       4       4 r---- nginx
0000000000789000     148     128     116 rw--- nginx
00000000007ae000     140      28      28 rw--- [ anon ]
0000000000a15000     904     900     900 rw--- [ anon ]
0000000000af7000  531080  530980  530980 rw--- [ anon ]
0000000040048000     128     124     124 rw--- [ anon ]
……
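With dozens of mappings it is easier to sort the pmap output by RSS and look only at the biggest ones; a minimal sketch, using worker PID 942 from the top output:

# RSS is the 3rd column of `pmap -x`; show the largest mappings first
pmap -x 942 | sort -k3 -n -r | head -n 10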
With the address of that large allocation in hand, cat /proc/pid/smaps shows exactly where the segment begins and ends:
00af7000-21412000 rw-p 00000000 00:00 0    [heap]
Size:             533612 kB
Rss:              533596 kB
Pss:              533596 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:    533596 kB
Referenced:       533596 kB
Anonymous:        533596 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me ac sd
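The same [heap] entry can be pulled out of smaps without scrolling through the whole file; a minimal sketch (00af7000 is the start address found with pmap, 942 the worker PID):

# Print the heap segment header plus its counters
grep -A 15 '^00af7000-' /proc/942/smaps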
Run gcore pid to produce a core.pid file.
Load the memory image into gdb with gdb core.pid:

sh-4.2$ gdb core.942
GNU gdb (GDB) Red Hat Enterprise Linux 7.*
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
[New LWP pid]
Core was generated by `nginx: worker process'.
#0  0x00007ff9435e1463 in ?? ()
"/tmp/core.942" is a core file.
Please specify an executable to debug.
(gdb)

That leaves us sitting at the (gdb) prompt above.
Use dump binary to export the contents of the leaking memory. The start and end addresses of the segment are already known from the smaps output, so we dump that range:
dump binary memory worker-pid.bin 0x00af7000 0x21412000
Check the size of the exported file:

sh-4.2$ du -sh worker-pid.bin
511M    worker-pid.bin
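The gcore / gdb / dump sequence can also be done in one shot against the live worker, which is handy when the OOM killer may strike before an interactive session finishes; a minimal sketch, assuming worker PID 942 and the heap range taken from smaps:

# Attach, dump the heap region, detach (the worker is paused while attached)
gdb -p 942 -batch -ex "dump binary memory worker-942.bin 0x00af7000 0x21412000"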
Opening the binary with a hex tool showed it was full of JSON objects. Analyzing those objects revealed that the .so encryption library inside the pod was an old build (it had never been updated in git), while the cloud hosts depended on the new one.
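Any hex or strings tool will do for this step; a minimal sketch of a first pass over the dump:

# Browse the raw bytes
xxd worker-pid.bin | less
# Count JSON-looking fragments and see which strings dominate the heap
strings worker-pid.bin | grep -c '{"'
strings worker-pid.bin | sort | uniq -c | sort -rn | head -n 20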
We swapped in the correct .so library, restarted the pods, and the problem was gone!
Memory leaks in production are always a headache. When you run into one, walking through the steps above is not a bad way to analyze it.