Today one of our production CentOS 6 machines could not be reached over Xshell, and after a while Xshell displayed:
Message from syslogd@GZxxx at Mar 29 14:13:14 ...
kernel:BUG: soft lockup - CPU#1 stuck for 68s! [events/1:36]
About 10 minutes later I was finally able to connect, so I checked the kernel log:
dmesg |grep stuck
BUG: soft lockup - CPU#2 stuck for 67s! [vmmemctl:894]
BUG: soft lockup - CPU#5 stuck for 67s! [bdi-default:49]
BUG: soft lockup - CPU#3 stuck for 67s! [irqbalance:1351]
BUG: soft lockup - CPU#4 stuck for 67s! [swapper:0]
BUG: soft lockup - CPU#6 stuck for 67s! [watchdog/6:30]
BUG: soft lockup - CPU#5 stuck for 67s! [vmmemctl:894]
BUG: soft lockup - CPU#0 stuck for 67s! [events/0:35]
BUG: soft lockup - CPU#7 stuck for 67s! [lldpad:1459]
BUG: soft lockup - CPU#6 stuck for 67s! [mpt_poll_0:376]
BUG: soft lockup - CPU#4 stuck for 67s! [ksoftirqd/4:21]
BUG: soft lockup - CPU#1 stuck for 67s! [events/1:36]
BUG: soft lockup - CPU#3 stuck for 62s! [rsyslogd:1325]
BUG: soft lockup - CPU#4 stuck for 72s! [events/4:39]
BUG: soft lockup - CPU#1 stuck for 70s! [automount:4252]
BUG: soft lockup - CPU#2 stuck for 73s! [hald:1685]
BUG: soft lockup - CPU#0 stuck for 61s! [automount:1776]
BUG: soft lockup - CPU#6 stuck for 67s! [events/6:41]
BUG: soft lockup - CPU#5 stuck for 67s! [vmmemctl:894]
BUG: soft lockup - CPU#7 stuck for 65s! [lldpad:1459]
BUG: soft lockup - CPU#3 stuck for 68s! [swapper:0]
BUG: soft lockup - CPU#2 stuck for 68s! [events/2:37]
BUG: soft lockup - CPU#0 stuck for 67s! [crond:1815]
BUG: soft lockup - CPU#7 stuck for 67s! [watchdog/7:34]
BUG: soft lockup - CPU#1 stuck for 68s! [events/1:36]
BUG: soft lockup - CPU#4 stuck for 67s! [watchdog/4:22]
BUG: soft lockup - CPU#5 stuck for 68s! [watchdog/5:26]
BUG: soft lockup - CPU#3 stuck for 66s! [swapper:0]
BUG: soft lockup - CPU#2 stuck for 66s! [ksoftirqd/2:13]
BUG: soft lockup - CPU#0 stuck for 67s! [watchdog/0:6]
BUG: soft lockup - CPU#5 stuck for 67s! [watchdog/5:26]
BUG: soft lockup - CPU#6 stuck for 62s! [fcoemon:1509]
BUG: soft lockup - CPU#4 stuck for 70s! [lldpad:1459]
BUG: soft lockup - CPU#7 stuck for 63s! [watchdog/7:34]
BUG: soft lockup - CPU#1 stuck for 63s! [sync_supers:48]
BUG: soft lockup - CPU#3 stuck for 63s! [irqbalance:1351]
BUG: soft lockup - CPU#2 stuck for 62s! [events/2:37]
BUG: soft lockup - CPU#0 stuck for 68s! [events/0:35]
BUG: soft lockup - CPU#2 stuck for 68s! [sa1:4687]
BUG: soft lockup - CPU#3 stuck for 78s! [flush-8:0:4618]
BUG: soft lockup - CPU#1 stuck for 78s! [events/1:36]
BUG: soft lockup - CPU#4 stuck for 63s! [lldpad:1459]
BUG: soft lockup - CPU#6 stuck for 64s! [fcoemon:1509]
BUG: soft lockup - CPU#5 stuck for 64s! [NetworkManager:1531]
BUG: soft lockup - CPU#0 stuck for 62s! [watchdog/0:6]
BUG: soft lockup - CPU#7 stuck for 68s! [watchdog/7:34]
BUG: soft lockup - CPU#4 stuck for 63s! [lldpad:1459]
BUG: soft lockup - CPU#1 stuck for 162s! [irqbalance:1351]
BUG: soft lockup - CPU#6 stuck for 128s! [hald:1685]
BUG: soft lockup - CPU#2 stuck for 130s! [sshd:4688]
BUG: soft lockup - CPU#5 stuck for 147s! [rsyslogd:1325]
BUG: soft lockup - CPU#3 stuck for 71s! [flush-8:0:4618]
BUG: soft lockup - CPU#6 stuck for 68s! [events/6:41]
BUG: soft lockup - CPU#2 stuck for 68s! [irqbalance:1351]
BUG: soft lockup - CPU#1 stuck for 68s! [su:4783]
BUG: soft lockup - CPU#7 stuck for 67s! [crond:1815]
BUG: soft lockup - CPU#5 stuck for 67s! [events/5:40]
BUG: soft lockup - CPU#0 stuck for 66s! [lldpad:1459]
BUG: soft lockup - CPU#4 stuck for 65s! [automount:4785]
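With this many lines it helps to count how often each CPU and each task appears. The following is a minimal sketch, not part of the original troubleshooting; the function name summarize_lockups is made up for illustration, and it assumes every message ends with a [name:pid] field as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: summarize "soft lockup" lines read from stdin.
# Prints one line per CPU and one line per task name with its count.
summarize_lockups() {
    awk '
        /soft lockup/ {
            # The CPU field looks like "CPU#3".
            for (i = 1; i <= NF; i++)
                if ($i ~ /^CPU#[0-9]+$/) cpu[$i]++
            # The last field looks like "[name:pid]"; drop brackets and pid.
            if ($NF ~ /^\[.*\]$/) {
                task = substr($NF, 2, length($NF) - 2)
                sub(/:[0-9]+$/, "", task)
                tasks[task]++
            }
        }
        END {
            for (c in cpu)   print "cpu", c, cpu[c]
            for (t in tasks) print "task", t, tasks[t]
        }
    ' | LC_ALL=C sort
}

# Usage (on the affected machine):
#   dmesg | summarize_lockups
```

If all CPUs and many unrelated tasks show up, as they do here, the stall is more likely system-wide than the fault of one particular process.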
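Every message is the same error: BUG: soft lockup - CPU#x stuck for xs
What on earth is this error?
A quick search on Baidu showed that this is a soft lockup.
Kernel soft lockup bug
What "soft lockup" means: a soft lockup is a bug that does not bring the system down completely, but leaves some processes (or kernel threads) stuck in a certain state, usually in kernel space. In many cases it is caused by a problem in how kernel locks are used.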
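The Linux kernel runs a monitoring thread for every CPU, known as the watchdog. You can see these threads with ps -eo ppid,pid,user,args |grep watchdog; their names look like watchdog/X, where X is the logical CPU number (0/1/2/3 and so on). Each thread runs about once a second and sleeps the rest of the time. When it runs, it records the current time in a kernel data structure that belongs to its own CPU. The kernel's timer interrupt handlers call the soft lockup check, which compares the current timestamp with the time saved for the corresponding CPU. If the current timestamp exceeds the saved time by more than the configured threshold, the kernel concludes that the watchdog thread has not been able to run for a considerable time.
Why does a CPU soft lockup happen, and how? If CPU scheduling in the Linux kernel is so carefully designed, how can a CPU get soft-locked? The usual answer is that the problem is introduced by user-developed or third-party software. On our server, the kernel panic was caused by the qmgr process (qmgr is a background mail message queue service process): an infinite loop keeps occupying a CPU's execution flow, and it runs with a certain priority.
The same applies to drivers. The CPU scheduler schedules a driver to run; if the driver is faulty and the fault goes undetected, the driver can occupy the CPU for a long time. As described above, the watchdog catches this and raises a soft lockup error. A soft lockup hangs the CPU and makes your system unusable.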
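If the problem is caused by a user-space process or thread, the backtrace will be empty; if it is caused by a kernel thread, the soft lockup message will include backtrace information.
In short: a faulty driver on the system prevents the watchdog from collecting the runtime data of each logical CPU, so the kernel raises a soft lockup error.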
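The timestamp comparison at the heart of this check can be modelled in a few lines. This is only a toy illustration, not kernel code: the function name check_lockup and its three arguments are invented here, and the 60-second threshold in the usage example is an assumption, chosen because the stuck times in the log sit just over a minute.

```shell
#!/bin/sh
# Toy model of the soft lockup check (illustration only, not kernel code).
# check_lockup NOW LAST_TOUCH THRESH   (all values in seconds)
#   NOW        - current timestamp seen by the timer interrupt
#   LAST_TOUCH - last time the watchdog thread updated its per-CPU timestamp
#   THRESH     - how long the watchdog may stay silent before a report
check_lockup() {
    now=$1
    last_touch=$2
    thresh=$3
    stuck=$((now - last_touch))
    if [ "$stuck" -gt "$thresh" ]; then
        echo "soft lockup: stuck for ${stuck}s"
    else
        echo "ok"
    fi
}

# Usage:
#   check_lockup 100 33 60    # watchdog silent for 67s with a 60s threshold
#   check_lockup 100 95 60    # watchdog ran 5s ago
```

The point of the model is that the report says how long the watchdog thread has been unable to run on that CPU; the task named in the message is simply whatever was current on the CPU at the time.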
The production server has 8 logical CPUs, so there are 8 watchdogs:
cat /proc/cpuinfo |grep processor
processor : 0
processor : 1
processor : 2
processor : 3
processor : 4
processor : 5
processor : 6
processor : 7
ps -eo ppid,pid,user,args |grep watchdog
2 6 root [watchdog/0]
2 10 root [watchdog/1]
2 14 root [watchdog/2]
2 18 root [watchdog/3]
2 22 root [watchdog/4]
2 26 root [watchdog/5]
2 30 root [watchdog/6]
2 34 root [watchdog/7]
4852 4883 root grep watchdog
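The two listings above can also be compared automatically. A small sketch, with the hypothetical function names count_processors and count_watchdogs; both read from stdin so they can be fed the same commands used above. Matching the literal brackets in [watchdog/N] also keeps the grep process itself out of the count.

```shell
#!/bin/sh
# Hypothetical helpers: count logical CPUs and per-CPU watchdog threads.

# Reads /proc/cpuinfo-style text on stdin, prints the number of processors.
count_processors() {
    grep -c '^processor'
}

# Reads ps output on stdin, prints the number of [watchdog/N] kernel threads.
count_watchdogs() {
    grep -c '\[watchdog/[0-9][0-9]*\]'
}

# Usage (on the affected machine):
#   cat /proc/cpuinfo | count_processors
#   ps -eo ppid,pid,user,args | count_watchdogs
```

On this server both commands should print 8.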
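I found the key information in /var/log/messages. Since this machine runs on the VMware ESXi platform, I suspect that one of the VMware ESXi hardware drivers is at fault, and I am about to contact VMware's engineers to get it resolved.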
less /var/log/messages
Mar 28 18:34:55 xxx kernel: UNSUPPORTED HARDWARE DEVICE: CPU family 6 model > 59
Mar 28 18:34:55 xxx kernel: ------------[ cut here ]------------
Mar 28 18:34:55 xxx kernel: WARNING: at kernel/rh_taint.c:13 mark_hardware_unsupported+0x39/0x40() (Not tainted)
Mar 28 18:34:55 xxx kernel: Hardware name: VMware Virtual Platform
Mar 28 18:34:55 xxx kernel: Your hardware is unsupported. Please do not report bugs, panics, oopses, etc., on this hardware.
Mar 28 18:34:55 xxx kernel: Modules linked in:
Mar 28 18:34:55 xxx kernel: Pid: 0, comm: swapper Not tainted 2.6.32-279.el6.x86_64 #1
Mar 28 18:34:55 xxx kernel: Call Trace:
Mar 28 18:34:55 xxx kernel: [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
Mar 28 18:34:55 xxx kernel: [<ffffffff8106b7df>] ? warn_slowpath_fmt_taint+0x3f/0x50
Mar 28 18:34:55 xxx kernel: [<ffffffff8109a869>] ? mark_hardware_unsupported+0x39/0x40
Mar 28 18:34:55 xxx kernel: [<ffffffff81c27b5d>] ? setup_arch+0xb1f/0xb42
Mar 28 18:34:55 xxx kernel: [<ffffffff814fd223>] ? printk+0x41/0x46
Mar 28 18:34:55 xxx kernel: [<ffffffff81c21c33>] ? start_kernel+0xdc/0x430
Mar 28 18:34:55 xxx kernel: [<ffffffff81c2133a>] ? x86_64_start_reservations+0x125/0x129
Mar 28 18:34:55 xxx kernel: [<ffffffff81c21438>] ? x86_64_start_kernel+0xfa/0x109
Mar 28 18:34:55 xxx kernel: ---[ end trace a7919e7f17c0a725 ]---
Mar 28 18:34:55 xxx kernel: NR_CPUS:4096 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
Mar 28 18:34:55 xxx kernel: PERCPU: Embedded 31 pages/cpu @ffff880028200000 s94424 r8192 d24360 u262144
Mar 28 18:34:55 xxx kernel: pcpu-alloc: s94424 r8192 d24360 u262144 alloc=1*2097152
Mar 28 18:34:55 xxx kernel: pcpu-alloc: [0] 0 1 2 3 4 5 6 7
Mar 28 18:34:55 xxx kernel: Built 1 zonelists in Zone order, mobility grouping on. Total pages: 2064657
Mar 28 18:34:55 xxx kernel: Policy zone: Normal
Mar 28 18:34:55 xxx kernel: Kernel command line: ro root=UUID=12b1eb92-e0a3-441c-98e0-6d75d9e510c2 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=128M KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet
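Paging through the whole file with less is slow, so a filter that keeps only the interesting kernel lines can help. This is a rough sketch: the function name filter_kernel_warnings is made up, and the patterns are simply the strings that appear in the logs quoted in this post, not an exhaustive list.

```shell
#!/bin/sh
# Hypothetical helper: keep only the kernel lines relevant to this incident.
# Reads syslog-style text on stdin; the patterns are taken from the logs above.
filter_kernel_warnings() {
    grep -E 'UNSUPPORTED HARDWARE DEVICE|WARNING: at |Hardware name:|soft lockup'
}

# Usage (on the affected machine, as root):
#   filter_kernel_warnings < /var/log/messages
```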
References
http://blog.jobbole.com/110581/
http://www.cnblogs.com/brucewoo/archive/2012/12/16/3226861.html
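If anything here is wrong, corrections are welcome o(∩_∩)o
The copyright of this article belongs to the author; do not reproduce it without the author's permission.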