線上使用zabbix的host update來監測監控值是否完整(關於host update的實現請參考:java
http://caiguangguang.blog.51cto.com/1652935/1345789)node
一直髮現有機器過一段時間update值就會莫名其妙變低,以前一直沒有找到rc,只是簡單經過重啓agent來進行修復,最近同事細心地發現多是和sudo的bug有關係。linux
回過頭再來驗證下整個的排查過程。sql
1.經過zabbix 數據庫獲取丟失數據的item,拿出缺失的(20分鐘沒有更新的)值的item列表數據庫
select b.key_,b.lastvalue,from_unixtime(b.lastclock) from hosts a, items b where a.hostid=b.hostid and a.host='xxxxxx' and b.lastclock < (unix_timestamp() - 1200) limit 10;
好比這裏看agent.ping:api
觀察監控圖,發如今18點20分以後數據丟失:bash
2.分析zabbix agent端的日誌app
發如今18點24粉左右出現下面的日誌,沒有看到正常的獲取值和發送值的狀況,只有大量的update_cpustats狀態,同時發現有一行kill command 失敗的日誌:socket
27589:20141021:182442.143 In zbx_popen() command:'sudo hadoop_stats.sh nodemanager StopContainerAvgTime' 27589:20141021:182442.143 End of zbx_popen():5 48430:20141021:182442.143 zbx_popen(): executing script 27585:20141021:182442.284 In update_cpustats() 27585:20141021:182442.285 End of update_cpustats() 27585:20141021:182443.285 In update_cpustats() 27585:20141021:182443.286 End of update_cpustats() 27585:20141021:182444.286 In update_cpustats() 27585:20141021:182444.287 End of update_cpustats() 27585:20141021:182445.287 In update_cpustats() 27585:20141021:182445.287 End of update_cpustats() 27585:20141021:182446.288 In update_cpustats() 27585:20141021:182446.288 End of update_cpustats() .......... 27585:20141021:182508.305 In update_cpustats() 27585:20141021:182508.305 End of update_cpustats() 27585:20141021:182509.306 In update_cpustats() 27585:20141021:182509.306 End of update_cpustats() 27585:20141021:182510.306 In update_cpustats() 27585:20141021:182510.307 End of update_cpustats() 27585:20141021:182511.307 In update_cpustats() 27585:20141021:182511.308 End of update_cpustats() 27589:20141021:182512.154 failed to kill [sudo hadoop_stats.sh nodemanager StopContainerAvgTime]: [1] Operation not permitted 27589:20141021:182512.155 In zbx_waitpid() 27585:20141021:182512.308 In update_cpustats() 27585:20141021:182512.309 End of update_cpustats() 27585:20141021:182513.309 In update_cpustats() 27585:20141021:182513.309 End of update_cpustats()
對比正常的日誌:ide
27589:20141021:180054.376 In zbx_popen() command:'sudo hadoop_stats.sh nodemanager StopContainerAvgTime' 27589:20141021:180054.377 End of zbx_popen():5 18798:20141021:180054.377 zbx_popen(): executing script 27589:20141021:180054.384 In zbx_waitpid() 27589:20141021:180054.384 zbx_waitpid() exited, status:1 27589:20141021:180054.384 End of zbx_waitpid():18798 27589:20141021:180054.384 Run remote command [sudo hadoop_stats.sh nodemanager StopContainerAvgTime] Result [2] [-1]... 27589:20141021:180054.384 For key [hadoop_stats[nodemanager,StopContainerAvgTime]] received value [-1] 27589:20141021:180054.384 In process_value() key:'gd6g203s80-hadoop-datanode.idc.vipshop.com:hadoop_stats[nodemanager,StopContainerAvgTime]' value:'-1' 27589:20141021:180054.384 In send_buffer() host:'10.200.100.28' port:10051 values:37/50 27589:20141021:180054.384 Will not send now. Now 1413885654 lastsent 1413885654 < 1 27589:20141021:180054.385 End of send_buffer():SUCCEED 27589:20141021:180054.385 buffer: new element 37 27589:20141021:180054.385 End of process_value():SUCCEED
能夠看到正常狀況下腳本會有返回值,而出問題的時候,腳本是沒有返回值的,而且因爲是使用sudo 運行腳本,致使以普通用戶啓動的zabbix在超時時沒有辦法殺掉這個command(Operation not permitted 錯誤)
3.假設這裏啓動zabbix agent的普通用戶爲apps用戶,咱們看下這個腳本目前的狀態
ps -ef|grep hadoop_stats.sh root 34494 31429 0 12:54 pts/0 00:00:00 grep 48430 root 48430 27589 0 Oct21 ? 00:00:00 sudo hadoop_stats.sh nodemanager StopContainerAvgTime root 48431 48430 0 Oct21 ? 00:00:00 [hadoop_stats.sh] <defunct>
能夠看到,這裏產生了一個殭屍進程(關於殭屍進程能夠參考:http://en.wikipedia.org/wiki/Zombie_process)
殭屍進程是因爲子進程運行完畢以後,發送SIGCHLD到父進程,而父進程沒有正常處理這個信號致使。
You have killed the process, but a dead process doesn't disappear from the process table until its parent process performs a task called "reaping" (essentially calling wait(3) for that process to read its exit status). Dead processes that haven't been reaped are called "zombie processes." The parent process id you see for 31756 is process id 1, which always belongs to init. That process should reap its zombie processes periodically, but if it can't, they will remain zombies in the process table until you reboot.
正常的進程狀況下,咱們使用strace attach到父進程,而後殺掉子進程後能夠看到以下信息:
Process 3036 attached - interrupt to quit select(6, [5], [], NULL, NULL ) = ? ERESTARTNOHAND (To be restarted) --- SIGCHLD (Child exited) @ 0 (0) --- rt_sigreturn(0x11) = -1 EINTR (Interrupted system call) wait4(3037, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], WNOHANG|WSTOPPED, NULL) = 3037 exit_group(143) = ? Process 3036 detached
產生殭屍進程以後,能夠經過殺掉父進程把殭屍進程變成孤兒進程(父進程爲init進程)
可是這裏由於是用sudo啓動的腳本,致使啓動用戶都是root,apps用戶就沒有權限殺掉啓動的命令,進而致使子進程一直是殭屍進程的狀態存在
4.來看一下zabbix agent端啓動的相關進程狀況
ps -ef|grep zabbix apps 27583 1 0 Sep09 ? 00:00:00 /apps/svr/zabbix/sbin/zabbix_agentd -c /apps/conf/zabbix_agentd.conf apps 27585 27583 0 Sep09 ? 00:33:25 /apps/svr/zabbix/sbin/zabbix_agentd -c /apps/conf/zabbix_agentd.conf apps 27586 27583 0 Sep09 ? 00:00:14 /apps/svr/zabbix/sbin/zabbix_agentd -c /apps/conf/zabbix_agentd.conf apps 27587 27583 0 Sep09 ? 00:00:14 /apps/svr/zabbix/sbin/zabbix_agentd -c /apps/conf/zabbix_agentd.conf apps 27588 27583 0 Sep09 ? 00:00:14 /apps/svr/zabbix/sbin/zabbix_agentd -c /apps/conf/zabbix_agentd.conf apps 27589 27583 0 Sep09 ? 02:28:12 /apps/svr/zabbix/sbin/zabbix_agentd -c /apps/conf/zabbix_agentd.conf root 34207 31429 0 12:54 pts/0 00:00:00 grep zabbix root 48430 27589 0 Oct21 ? 00:00:00 sudo /apps/sh/zabbix_scripts/hadoop/hadoop_stats.sh nodemanager StopContainerAvgTime
經過strace咱們發現27589的進程一直在等待48430的進程
strace -p 27589 Process 27589 attached - interrupt to quit wait4(48430, ^C <unfinished ...> Process 27589 detached
而48430的進程即爲殭屍進程的父進程,經過strace attach上去,能夠看到在等待#5的fd
strace -p 48430 Process 48430 attached - interrupt to quit select(6, [5], [], NULL, NULL^C <unfinished ...> Process 48430 detached
經過lsof能夠看到#5的fd實際上是一個socket
lsof -p 48430 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME sudo 48430 root cwd DIR 8,2 4096 2 / sudo 48430 root rtd DIR 8,2 4096 2 / sudo 48430 root txt REG 8,2 212904 1578739 /usr/bin/sudo sudo 48430 root mem REG 8,2 65928 1441822 /lib64/libnss_files-2.12.so sudo 48430 root mem REG 8,2 99158704 1573509 /usr/lib/locale/locale-archive sudo 48430 root mem REG 8,2 91096 1441832 /lib64/libz.so.1.2.3 sudo 48430 root mem REG 8,2 141576 1442145 /lib64/libpthread-2.12.so sudo 48430 root mem REG 8,2 386040 1442172 /lib64/libfreebl3.so sudo 48430 root mem REG 8,2 108728 1575924 /usr/lib64/libsasl2.so.2.0.23 sudo 48430 root mem REG 8,2 243064 1441896 /lib64/libnspr4.so sudo 48430 root mem REG 8,2 21256 1442186 /lib64/libplc4.so sudo 48430 root mem REG 8,2 17096 1442187 /lib64/libplds4.so sudo 48430 root mem REG 8,2 128368 1577789 /usr/lib64/libnssutil3.so sudo 48430 root mem REG 8,2 1290648 1582418 /usr/lib64/libnss3.so sudo 48430 root mem REG 8,2 188072 1575925 /usr/lib64/libsmime3.so sudo 48430 root mem REG 8,2 220200 1587191 /usr/lib64/libssl3.so sudo 48430 root mem REG 8,2 113952 1442182 /lib64/libresolv-2.12.so sudo 48430 root mem REG 8,2 43392 1442173 /lib64/libcrypt-2.12.so sudo 48430 root mem REG 8,2 63304 1442180 /lib64/liblber-2.4.so.2.5.6 sudo 48430 root mem REG 8,2 1979000 1442169 /lib64/libc-2.12.so sudo 48430 root mem REG 8,2 308912 1442181 /lib64/libldap-2.4.so.2.5.6 sudo 48430 root mem REG 8,2 22536 1442171 /lib64/libdl-2.12.so sudo 48430 root mem REG 8,2 58480 1442174 /lib64/libpam.so.0.82.2 sudo 48430 root mem REG 8,2 17520 1441884 /lib64/libutil-2.12.so sudo 48430 root mem REG 8,2 124624 1441798 /lib64/libselinux.so.1 sudo 48430 root mem REG 8,2 99112 1442170 /lib64/libaudit.so.1.0.0 sudo 48430 root mem REG 8,2 156872 1442168 /lib64/ld-2.12.so sudo 48430 root 0r CHR 1,3 0t0 3916 /dev/null sudo 48430 root 1w FIFO 0,8 0t0 1429910151 pipe sudo 48430 root 2w REG 8,3 376639626 524292 /apps/logs/zabbix/zabbix_agentd.log sudo 48430 root 3u sock 0,6 0t0 1429910161 can't identify protocol sudo 48430 root 4r REG 8,2 764 2240617 /etc/group sudo 48430 root 5u unix 0xffff880179ee4680 0t0 1429910162 socket
這裏經過查看/proc/pid/fd下的文件描述符的狀態,發現這個fd實際上是已經關閉的。
這裏就有多是子進程已經運行完成,而父進程沒有正確處理子進程的返回信息致使父進程一直認爲子進程還在運行,最終產生了殭屍進程。
這實際上是sudo的一個bug,相關的bug id :
http://www.gratisoft.us/bugzilla/show_bug.cgi?id=447
關於bug的描述:
If the parent process gets re-scheduled after the 「if」 was executed, and at this very time the child process finishes and SIGCHLD is sent to the parent process, sudo gets in trouble. The SIGCHLD handler accounts in the variable 「recvsig[]」 that the signal was received, and then the parent process calls select(). This select will never be interrupted, as the author had it in mind. In 99% of the cases, the parent process will enter in the select() blocking state before the child process ended. The child would then send SIGCHLD, which will be accounted in the handler procedure, and will also interrupt select() which will return -1 in 「nready」, and 「errno」 will be set to EINTR.
問題出在sudo的代碼sudo/file/tip/src/exec.c,小於 1.7.5或1.8.0 以前的版本都有問題,當子進程剛好在select()這個系統調用前退出的時候,句柄已經被退出,因此sudo會卡在select這裏
patch:
http://www.sudo.ws/repos/sudo/rev/99adc5ea7f0a
Avoid a potential race condition if SIGCHLD is received immediately before we call select().
網上有人遇到了一樣地問題:
http://blog.famzah.net/2010/11/01/sudo-hangs-and-leaves-the-executed-program-as-zombie/
回過頭總結,這個問題實際上是多個潛在問題同時形成:
1.zabbix agent的自定義監控配置中使用了sudo,致使殭屍進程的父進程不能正常關閉(若是須要sudo,寫在腳本里面便可)
2.sudo的bug致使產生了僵死進程(升級sudo便可)
3.zabbix agent端的實現也有問題,某一個進程成爲僵死進程後會影響其餘的監控項獲取(zabbix agent的進程被阻塞致使)