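1. Incident description
A user reported that a production system was unreachable. A colleague investigated immediately and confirmed that the system could not be accessed and that the server did not respond to ping.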
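2. Incident handling
Because no client could ping the server, we had to go to the machine room. There, the server hardware showed no alarms and the system had not rebooted. After logging in, ifconfig showed that the IP address was gone (eth0 did not exist). In the network configuration directory /etc/sysconfig/network-scripts, the interface file ifcfg-eth0 was missing; only an earlier backup, ifcfg-eth0.bak, and a file named ifcfg-peth0 were present. Following the principle of restoring service first and diagnosing afterwards, we made a copy of the backup file to repair the configuration and restarted the network service, which resolved the outage.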
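The repair step can be sketched as a small shell function. This is a minimal sketch, not the exact commands typed during the incident: the name `restore_ifcfg` is hypothetical, and it assumes the backup sits next to the live file as `ifcfg-<iface>.bak`, as it did here.

```shell
# restore_ifcfg DIR IFACE
# Copy DIR/ifcfg-IFACE.bak to DIR/ifcfg-IFACE if the live file is missing.
# (Hypothetical helper; the name and argument order are assumptions.)
restore_ifcfg() {
    local dir="$1" iface="$2"
    local live="$dir/ifcfg-$iface"
    local backup="$dir/ifcfg-$iface.bak"

    if [ -e "$live" ]; then
        echo "$live already exists, nothing to do" >&2
        return 0
    fi
    if [ ! -f "$backup" ]; then
        echo "no backup found at $backup" >&2
        return 1
    fi
    # -p preserves the mode, ownership and timestamps of the backup.
    cp -p "$backup" "$live"
}

# Usage on the affected host (as root; restarting the network briefly
# interrupts connectivity):
#   restore_ifcfg /etc/sysconfig/network-scripts eth0 && service network restart
```

Keeping the restart out of the function makes it safe to try against a scratch directory first.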
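3. Fault analysis
3.1 We learned that, when the fault occurred, a colleague was logged in to the system checking the security baseline configuration. The colleague insisted that they had not run rm or mv against the interface file. The history command confirmed this: they had run only chkconfig --list, but had then accidentally pasted the output they meant to copy back into the terminal, where it was executed as commands. The history shows: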
883 chkconfig --list
884 NetworkManager 0:off 1:off 2:off 3:off 4:off 5:off 6:off
885 PowerIscsi 0:off 1:off 2:off 3:on 4:off 5:on 6:off
886 PowerMig 0:off 1:off 2:off 3:on 4:off 5:on 6:off
887 PowerMigRecoverAll 0:off 1:off 2:off 3:on 4:off 5:on 6:off
888 acpid 0:off 1:off 2:on 3:on 4:on 5:on 6:off
889 anacron 0:off 1:off 2:on 3:on 4:on 5:on 6:off
890 atd 0:off 1:off 2:off 3:on 4:on 5:on 6:off
891 auditd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
892 autofs 0:off 1:off 2:off 3:on 4:on 5:on 6:off
893 avahi-daemon 0:off 1:off 2:off 3:on 4:on 5:on 6:off
894 avahi-dnsconfd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
895 bluetooth 0:off 1:off 2:on 3:on 4:on 5:on 6:off
896 capi 0:off 1:off 2:off 3:off 4:off 5:off 6:off
897 conman 0:off 1:off 2:off 3:off 4:off 5:off 6:off
898 coremail 0:off 1:off 2:on 3:on 4:on 5:on 6:off
899 cpuspeed 0:off 1:on 2:on 3:on 4:on 5:on 6:off
900 crond 0:off 1:off 2:on 3:on 4:on 5:on 6:off
901 cups 0:off 1:off 2:on 3:on 4:on 5:on 6:off
902 dnsmasq 0:off 1:off 2:off 3:off 4:off 5:off 6:off
903 dund 0:off 1:off 2:off 3:off 4:off 5:off 6:off
904 ebtables 0:off 1:off 2:off 3:off 4:off 5:off 6:off
905 firstboot 0:off 1:off 2:off 3:on 4:off 5:on 6:off
906 gpm 0:off 1:off 2:on 3:on 4:on 5:on 6:off
907 haldaemon 0:off 1:off 2:off 3:on 4:on 5:on 6:off
908 hidd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
909 hplip 0:off 1:off 2:on 3:on 4:on 5:on 6:off
910 httpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
911 ip6tables 0:off 1:off 2:on 3:on 4:on 5:on 6:off
912 ipmi 0:off 1:off 2:off 3:off 4:off 5:off 6:off
913 iptables 0:off 1:off 2:off 3:off 4:off 5:off 6:off
914 irda 0:off 1:off 2:off 3:off 4:off 5:off 6:off
915 irqbalance 0:off 1:off 2:on 3:on 4:on 5:on 6:off
916 iscsi 0:off 1:off 2:off 3:on 4:on 5:on 6:off
917 iscsid 0:off 1:off 2:off 3:on 4:on 5:on 6:off
918 isdn 0:off 1:off 2:on 3:on 4:on 5:on 6:off
919 kdump 0:off 1:off 2:off 3:off 4:off 5:off 6:off
920 kudzu 0:off 1:off 2:off 3:on 4:on 5:on 6:off
921 libvirt-guests 0:off 1:off 2:off 3:on 4:on 5:on 6:off
922 libvirtd 0:off 1:off 2:off 3:on 4:on 5:on 6:off
923 lvm2-monitor 0:off 1:on 2:on 3:on 4:on 5:on 6:off
924 mcstrans 0:off 1:off 2:on 3:on 4:on 5:on 6:off
925 mdmonitor 0:off 1:off 2:on 3:on 4:on 5:on 6:off
926 mdmpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
927 messagebus 0:off 1:off 2:off 3:on 4:on 5:on 6:off
928 microcode_ctl 0:off 1:off 2:on 3:on 4:on 5:on 6:off
929 multipathd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
930 named 0:off 1:off 2:off 3:off 4:off 5:off 6:off
931 netbackup 0:off 1:off 2:on 3:on 4:off 5:on 6:off
932 netconsole 0:off 1:off 2:off 3:off 4:off 5:off 6:off
933 netfs 0:off 1:off 2:off 3:on 4:on 5:on 6:off
934 netplugd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
935 network 0:off 1:off 2:on 3:on 4:on 5:on 6:off
936 nfs 0:off 1:off 2:off 3:off 4:off 5:off 6:off
937 nfslock 0:off 1:off 2:off 3:on 4:on 5:on 6:off
938 nscd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
939 ntpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
940 pand 0:off 1:off 2:off 3:off 4:off 5:off 6:off
941 pcscd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
942 portmap 0:off 1:off 2:off 3:on 4:on 5:on 6:off
943 psacct 0:off 1:off 2:off 3:off 4:off 5:off 6:off
944 rawdevices 0:off 1:off 2:off 3:on 4:on 5:on 6:off
945 rdisc 0:off 1:off 2:off 3:off 4:off 5:off 6:off
946 readahead_early 0:off 1:off 2:on 3:on 4:on 5:on 6:off
947 readahead_later 0:off 1:off 2:off 3:off 4:off 5:on 6:off
948 restorecond 0:off 1:off 2:on 3:on 4:on 5:on 6:off
949 rhnsd 0:off 1:off 2:off 3:on 4:on 5:on 6:off
950 rpcgssd 0:off 1:off 2:off 3:on 4:on 5:on 6:off
951 rpcidmapd 0:off 1:off 2:off 3:on 4:on 5:on 6:off
952 rpcsvcgssd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
953 saslauthd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
954 sendmail 0:off 1:off 2:off 3:off 4:off 5:off 6:off
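On the surface, nothing in this record looks abnormal.
3.2 The system log messages, however, contained an entry reporting that ifcfg-eth0 had been removed, and its timestamp matched the time of the colleague's accidental paste: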
Mar 21 09:46:50 localhost nm-system-settings: ifcfg-rh: removed /etc/sysconfig/network-scripts/ifcfg-eth0.
Mar 21 09:46:50 localhost nm-system-settings: ifcfg-rh: parsing /etc/sysconfig/network-scripts/ifcfg-peth0 ...
Mar 21 09:46:50 localhost nm-system-settings: ifcfg-rh: read connection 'System peth0'
Mar 21 09:46:50 localhost nm-system-settings: ifcfg-rh: updating /etc/sysconfig/network-scripts/ifcfg-peth0
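A search along the following lines would surface such entries. It is only a sketch: the helper name `find_ifcfg_removals` is hypothetical, and the pattern simply matches the `ifcfg-rh: removed` text seen in the excerpt above.

```shell
# find_ifcfg_removals LOGFILE
# Print syslog lines in which the ifcfg-rh plugin reports a removed
# interface configuration file. (Hypothetical helper name.)
find_ifcfg_removals() {
    grep -E 'ifcfg-rh: removed .*ifcfg-' "$1"
}

# Usage (log path as on RHEL 5):
#   find_ifcfg_removals /var/log/messages
```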
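If the colleague did not delete the file, why does the log record its removal? Had the server been compromised, or was there some other cause?
3.3 The secure log and the output of last showed no logins from unexpected IP addresses, so we set aside the possibility of an intrusion for the time being and concentrated on finding which of the accidentally executed commands had caused the interface file to disappear.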
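The login check can be sketched as below. This is illustrative only: the helper name `accepted_login_sources` is hypothetical, and it assumes the usual sshd message form `Accepted <method> for <user> from <ip> port <n> ssh2` in the secure log.

```shell
# accepted_login_sources LOGFILE
# Print each distinct source address of accepted SSH logins, most
# frequent first, prefixed with its count. (Hypothetical helper name.)
accepted_login_sources() {
    awk '/sshd\[[0-9]+\]: Accepted / {
             for (i = 1; i < NF; i++)
                 if ($i == "from") { print $(i + 1); break }
         }' "$1" | sort | uniq -c | sort -rn
}

# Usage:
#   accepted_login_sources /var/log/secure
#   last -i | head -n 20      # recent logins with numeric addresses
```

Any address in the output that does not belong to a known administrator workstation would deserve a closer look.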
3.4 We went through the history from 3.1 once more. It still looked unremarkable: every line was simply output from chkconfig --list. We were stumped.
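3.5 When in doubt, read the logs. The only option was to comb through the messages log for clues. In the excerpt from 3.2, this line stood out:
Mar 21 09:46:50 localhost nm-system-settings: ifcfg-rh: parsing /etc/sysconfig/network-scripts/ifcfg-peth0 ...
The interface file ifcfg-peth0 looked suspicious: it should be related to Xen virtualization, yet this system did not use Xen.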
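3.6 Logging in confirmed that, although the system did not use virtualization, Xen had been installed when the system was first built; the kernel-xen kernel was loaded and the xend service was running: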
1) [root@~]# uname -r
2.6.18-238.el5xen
2) # /etc/init.d/xend status
xend is running
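The same check can be wrapped in a short function for use across several hosts. A minimal sketch: the name `is_xen_kernel` is hypothetical, and it relies only on the `xen` suffix visible in the kernel release string above.

```shell
# is_xen_kernel [RELEASE]
# Succeed if the kernel release string ends in "xen"
# (for example 2.6.18-238.el5xen). Defaults to the running kernel.
# (Hypothetical helper name.)
is_xen_kernel() {
    local release="${1:-$(uname -r)}"
    case "$release" in
        *xen) return 0 ;;
        *)    return 1 ;;
    esac
}

# Usage:
#   if is_xen_kernel; then
#       echo "Xen kernel booted: $(uname -r)"
#       /etc/init.d/xend status
#   fi
```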
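3.7 The creation/modification time of ifcfg-peth0 matched the time of the colleague's accidental paste, which strengthened the suspicion that this file was connected to the fault: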
# find . -type f -mtime 2|xargs ls -l
-rw-r--r-- 1 root root 303 Mar 21 09:46 ./etc/modprobe.conf
-rw-r--r-- 1 root root 23116 Mar 21 09:46 ./etc/sysconfig/hwconf
-rw-r--r-- 1 root root 122 Mar 21 09:46 ./etc/sysconfig/network-scripts/ifcfg-peth0
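Note that `-mtime 2` matches files modified between 48 and 72 hours ago, so it only works when the search is run about two days after the event. A sketch that targets an explicit time window instead, using reference files and `-newer`, is shown below. The function name `files_modified_between` is hypothetical, and the year in the usage comment is a placeholder because the source does not state it.

```shell
# files_modified_between DIR START END
# List regular files under DIR modified after START and not after END.
# START and END use the format of touch -t: [[CC]YY]MMDDhhmm[.ss]
# (Hypothetical helper name.)
files_modified_between() {
    local dir="$1" start="$2" end="$3"
    local tmp
    tmp=$(mktemp -d) || return 1
    # Reference files whose mtimes mark the two ends of the window.
    touch -t "$start" "$tmp/start"
    touch -t "$end" "$tmp/end"
    find "$dir" -type f -newer "$tmp/start" ! -newer "$tmp/end"
    rm -rf "$tmp"
}

# Usage (the year is a placeholder):
#   files_modified_between /etc 201303210940 201303210950
```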
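3.8 To investigate and reproduce the fault, we built a test environment matching the production system: RHEL 5.6 with Xen virtualization installed.
3.8.1 As on the production system, we made a backup copy named ifcfg-eth0.bak;
3.8.2 We executed the lines from the colleague's history one at a time. On reaching "kudzu 0:off 1:off 2:off 3:on 4:on 5:on 6:off", the problem reappeared: ifcfg-eth0 disappeared, ifcfg-peth0 was created, and the network went down, exactly as on the production system. See figure:
3.9 We built a second test environment: RHEL 5.6 without Xen virtualization. Running the same commands as in 3.8.2 did not reproduce the problem. See figure: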
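The line-by-line reproduction in 3.8.2 and 3.9 can be semi-automated by comparing the directory listing before and after each command. This is a minimal sketch intended for a disposable test machine only; `watch_dir_changes` is a hypothetical helper name, not a tool used in the original investigation.

```shell
# watch_dir_changes DIR COMMAND [ARGS...]
# Run COMMAND, then print the entries that disappeared from ("- name")
# or appeared in ("+ name") DIR. (Hypothetical helper name.)
watch_dir_changes() {
    local dir="$1"
    shift
    local tmp
    tmp=$(mktemp -d) || return 1
    ls -1 "$dir" | sort > "$tmp/before"
    # The exit status of the command under test is deliberately ignored.
    "$@" >/dev/null 2>&1 || true
    ls -1 "$dir" | sort > "$tmp/after"
    # comm expects sorted input: -23 keeps lines only in "before",
    # -13 keeps lines only in "after".
    comm -23 "$tmp/before" "$tmp/after" | sed 's/^/- /'
    comm -13 "$tmp/before" "$tmp/after" | sed 's/^/+ /'
    rm -rf "$tmp"
}

# Usage on a test machine:
#   watch_dir_changes /etc/sysconfig/network-scripts \
#       kudzu 0:off 1:off 2:off 3:on 4:on 5:on 6:off
```

On the Xen test environment described above, one would expect this to report ifcfg-eth0 as removed and ifcfg-peth0 as added; on the non-Xen environment it should print nothing.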
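4. Root cause
Reproducing the problem leads to this conclusion: on a system with the Xen virtualization environment installed, the colleague's accidental paste executed the line "kudzu 0:off 1:off 2:off 3:on 4:on 5:on 6:off" as a command. With both conditions met, ifcfg-eth0 was deleted and the network went down.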
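The underlying mechanism is that the shell treats the first word of each pasted line as a command name and the rest as its arguments, so any service name that happens to match an executable gets run. The sketch below lists which first words in a block of text would resolve to a command in the current shell; `runnable_first_words` is a hypothetical name, and the file path in the usage comment is only an example.

```shell
# runnable_first_words FILE
# Print the first word of each line in FILE that the current shell
# could execute (binary on PATH, builtin, function or alias).
# (Hypothetical helper name.)
runnable_first_words() {
    local first rest
    while read -r first rest; do
        [ -n "$first" ] || continue
        if command -v "$first" >/dev/null 2>&1; then
            printf '%s\n' "$first"
        fi
    done < "$1"
}

# Usage: save the text and check it before pasting anything.
#   chkconfig --list > /tmp/services.txt
#   runnable_first_words /tmp/services.txt
```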
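5. Background
From what we could find online, it is not fully clear why the kudzu command deletes the interface configuration file. As far as we currently understand, it is either a bug triggered under specific conditions (Xen virtualization installed) or a consequence of how kudzu itself works.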
Appendix:
1. Introduction to kudzu: http://blog.csdn.net/huyangg/article/details/7189743
2. kudzu-related bugs: https://bugzilla.redhat.com/show_bug.cgi?id=206910, https://bugzilla.redhat.com/show_bug.cgi?id=229579, http://linux.bigresource.com/Red-Hat-Prevent-kudzu-from-changing-ifcfg-ethX-file--wi29JYmpf.html