[troubleshoot][daily][redhat] Troubleshooting a machine that keeps rebooting

A server was rebooting over and over, several times a day.

 

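I: Cause analysis and initial elimination.

1.  Hardware: the memory and the motherboard were swapped out one at a time, and in the end the whole machine except the hard disk was replaced. It still rebooted.

2.  Power: ruled out by changing the power cable and the wall socket. It still rebooted.

3.  That leaves three possibilities:

  A. A kernel problem: the kernel crashes. (Red Hat's stability is generally trustworthy, so this seemed unlikely.)

  B. A disk or filesystem fault. In essence, this also ends in a kernel crash.

  C. A program reboots the machine on purpose (one of our own programs calls reboot, or an intruder planted a reboot script. That would be one very bored hacker...).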
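One rough way to tell case C apart from A and B is to compare boot records with clean-shutdown records in `last -x`. The sketch below is only a heuristic and was not part of the original investigation: the helper name `count_unclean_boots` is made up for illustration, and the result can be off by one where the wtmp log starts or was rotated.

```shell
#!/bin/sh
# Count boots that have no matching clean-shutdown record.
# Input: the output of `last -x reboot shutdown` on stdin.
# Assumption: a program-initiated reboot goes through a normal shutdown and
# leaves a "shutdown" record, while a crash or power loss does not.
count_unclean_boots() {
    awk '
        $1 == "reboot"   { boots++ }   # one record per system boot
        $1 == "shutdown" { downs++ }   # one record per clean shutdown
        END {
            unclean = boots - downs
            if (unclean < 0) unclean = 0
            print unclean
        }
    '
}

# Typical use on the affected machine:
#   last -x reboot shutdown | count_unclean_boots
```

If the number is close to the number of unexpected reboots, a deliberate reboot by a program becomes less likely and the kernel and disk paths deserve attention first.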

 

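II: The easiest thing to rule out is the kernel, so start there.

  At the moment of a crash the kernel does notice that it is about to die, so it leaves some information behind to tell the user what happened. The catch is that filesystems are complex, and the filesystem may go down together with the kernel before that information reaches the disk.

  Checking the logs after a reboot confirmed it: nothing useful had been left behind.

  At this point there is another tool available: netconsole. It sends the kernel log out over UDP, building the IP packets itself instead of going through the network stack, and pushes them straight out of the NIC. The messages are meant to be collected by a remote syslog daemon.

Using netconsole:

1.  Edit the configuration file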

[root@S205 ~]# cat /etc/sysconfig/netconsole                                                                                                                                                 
# This is the configuration file for the netconsole service.  By starting                                                                                                                    
# this service you allow a remote syslog daemon to record console output                                                                                                                     
# from this system.

# The local port number that the netconsole module will use
LOCALPORT=6666

# The ethernet device to send console messages out of (only set this if it
# can't be automatically determined)
DEV=enp3s0

# The IP address of the remote syslog server to send messages to
SYSLOGADDR=192.168.10.214

# The listening port of the remote syslog daemon
SYSLOGPORT=514

# The MAC address of the remote syslog server (only set this if it can't
# be automatically determined)
SYSLOGMACADDR=40:8d:5c:22:53:18
[root@S205 ~]# 

2.  Start the service

[root@S205 ~]# systemctl start netconsole
[root@S205 ~]# systemctl enable netconsole
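For reference, the settings in the config file map onto a single netconsole module parameter of the form `netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]`. The sketch below only builds that string. The helper name `netconsole_param` is hypothetical, and the source address 192.168.10.205 is taken from the sender address that appears in the received log, not from the config file.

```shell
#!/bin/sh
# Build the netconsole module parameter from its six parts.
# Arguments: local port, local IP, device, remote port, remote IP, remote MAC.
netconsole_param() {
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Values from /etc/sysconfig/netconsole; the local IP is an assumption
# based on the address seen in the received log.
netconsole_param 6666 192.168.10.205 enp3s0 514 192.168.10.214 40:8d:5c:22:53:18

# Loading the module by hand (as root) would then look something like:
#   modprobe netconsole "$(netconsole_param 6666 192.168.10.205 enp3s0 514 192.168.10.214 40:8d:5c:22:53:18)"
```

The netconsole service does this assembly itself. Spelling it out is mainly useful for checking that the device, address and MAC really describe the path to the log server.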

 

Current OS and kernel version:

[root@S205 ~]# cat /etc/redhat-release 
CentOS Linux release 7.3.1611 (Core) 
[root@S205 ~]# uname -a
Linux S205 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[root@S205 ~]# 

 

The kernel crash log was received successfully:

Jul 25 08:14:54 192.168.10.205 [20239.422386] NMI watchdog: Watchdog detected hard LOCKUP on cpu 7
Jul 25 08:14:54 192.168.10.205  
Jul 25 08:14:54 192.168.10.205 [20239.422529] Kernel panic - not syncing: Hard LOCKUP
Jul 25 08:14:54 192.168.10.205 [20239.422543] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 3.10.0-514.el7.x86_64 #1
Jul 25 08:14:54 192.168.10.205 [20239.422561] Hardware name: LENOVO 10C0A038CD/ , BIOS FCKT73AUS 08/28/2015
Jul 25 08:14:54 192.168.10.205 [20239.422579]  ffffffff818d9784
Jul 25 08:14:54 192.168.10.205  90a5d9572fc8872b
Jul 25 08:14:54 192.168.10.205  ffff88041edc5b18
Jul 25 08:14:54 192.168.10.205  ffffffff81685fac
Jul 25 08:14:54 192.168.10.205  
Jul 25 08:14:54 192.168.10.205 [20239.422603]  ffff88041edc5b98
Jul 25 08:14:54 192.168.10.205  ffffffff8167f3b3
Jul 25 08:14:54 192.168.10.205  0000000000000010
Jul 25 08:14:54 192.168.10.205  ffff88041edc5ba8
Jul 25 08:14:54 192.168.10.205  
Jul 25 08:14:54 192.168.10.205 [20239.422627]  ffff88041edc5b48
Jul 25 08:14:54 192.168.10.205  90a5d9572fc8872b
Jul 25 08:14:54 192.168.10.205  ffff88041edc5ba8
Jul 25 08:14:54 192.168.10.205  ffffffff818d948a
Jul 25 08:14:54 192.168.10.205  
Jul 25 08:14:54 192.168.10.205 [20239.422651] Call Trace:
Jul 25 08:14:54 192.168.10.205 [20239.422658]  <NMI> 
Jul 25 08:14:54 192.168.10.205  [<ffffffff81685fac>] dump_stack+0x19/0x1b
Jul 25 08:14:54 192.168.10.205 [20239.422678]  [<ffffffff8167f3b3>] panic+0xe3/0x1f2
Jul 25 08:14:54 192.168.10.205 [20239.422692]  [<ffffffff8108562f>] nmi_panic+0x3f/0x40
Jul 25 08:14:54 192.168.10.205 [20239.422706]  [<ffffffff8112f0e6>] watchdog_overflow_callback+0xf6/0x100
Jul 25 08:14:54 192.168.10.205 [20239.422725]  [<ffffffff8117465e>] __perf_event_overflow+0x8e/0x1f0
Jul 25 08:14:54 192.168.10.205 [20239.422741]  [<ffffffff811752a4>] perf_event_overflow+0x14/0x20
Jul 25 08:14:54 192.168.10.205 [20239.422759]  [<ffffffff81009d88>] intel_pmu_handle_irq+0x1f8/0x4e0
Jul 25 08:14:54 192.168.10.205 [20239.422776]  [<ffffffff8168dbeb>] perf_event_nmi_handler+0x2b/0x50
Jul 25 08:14:54 192.168.10.205 [20239.422793]  [<ffffffff8168f019>] nmi_handle.isra.0+0x69/0xb0
Jul 25 08:14:54 192.168.10.205 [20239.422808]  [<ffffffff8168f193>] do_nmi+0x133/0x410
Jul 25 08:14:54 192.168.10.205 [20239.422822]  [<ffffffff8168e453>] end_repeat_nmi+0x1e/0x2e
Jul 25 08:14:54 192.168.10.205 [20239.422838]  [<ffffffff8168d9c7>] ? _raw_spin_lock_irqsave+0x47/0x60
Jul 25 08:14:54 192.168.10.205 [20239.422855]  [<ffffffff8168d9c7>] ? _raw_spin_lock_irqsave+0x47/0x60
Jul 25 08:14:54 192.168.10.205 [20239.422871]  [<ffffffff8168d9c7>] ? _raw_spin_lock_irqsave+0x47/0x60
Jul 25 08:14:54 192.168.10.205 [20239.422887]  <<EOE>> 
Jul 25 08:14:54 192.168.10.205  <IRQ> 
Jul 25 08:14:54 192.168.10.205  [<ffffffffa01fae23>] nvkm_fantog_update+0x43/0x110 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.422947]  [<ffffffffa01faf48>] nvkm_fantog_set+0x38/0x40 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.422976]  [<ffffffffa01fa378>] nvkm_fan_update+0xc8/0x210 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423005]  [<ffffffffa01fa519>] nvkm_therm_fan_set+0x19/0x20 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423035]  [<ffffffffa01f9b87>] nvkm_therm_update+0x97/0x310 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423064]  [<ffffffffa01f9e17>] nvkm_therm_alarm+0x17/0x20 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423106]  [<ffffffffa01fd2d3>] nvkm_timer_alarm_trigger+0x103/0x150 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423147]  [<ffffffffa01fd3d0>] nvkm_timer_alarm+0x60/0xb0 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423176]  [<ffffffffa01fb651>] alarm_timer_callback+0xd1/0xe0 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423207]  [<ffffffffa01fd2d3>] nvkm_timer_alarm_trigger+0x103/0x150 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423238]  [<ffffffffa01fd3d0>] nvkm_timer_alarm+0x60/0xb0 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423266]  [<ffffffffa01faeea>] nvkm_fantog_update+0x10a/0x110 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423295]  [<ffffffffa01faf0a>] nvkm_fantog_alarm+0x1a/0x20 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423324]  [<ffffffffa01fd2d3>] nvkm_timer_alarm_trigger+0x103/0x150 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423355]  [<ffffffffa01fd6fb>] nv04_timer_intr+0x6b/0xb0 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423384]  [<ffffffffa01fd174>] nvkm_timer_intr+0x14/0x20 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423419]  [<ffffffffa01ada87>] nvkm_subdev_intr+0x17/0x20 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423458]  [<ffffffffa01ef7f9>] nvkm_mc_intr+0x79/0x110 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423486]  [<ffffffffa01f4155>] nvkm_pci_intr+0x55/0xa0 [nouveau]
Jul 25 08:14:54 192.168.10.205 [20239.423503]  [<ffffffff8113015e>] handle_irq_event_percpu+0x3e/0x1e0
Jul 25 08:14:54 192.168.10.205 [20239.423521]  [<ffffffff8113033d>] handle_irq_event+0x3d/0x60
Jul 25 08:14:54 192.168.10.205 [20239.423536]  [<ffffffff81133007>] handle_edge_irq+0x77/0x130
Jul 25 08:14:54 192.168.10.205 [20239.424012]  [<ffffffff8102d26f>] handle_irq+0xbf/0x150
Jul 25 08:14:54 192.168.10.205 [20239.424491]  [<ffffffff810f3c8a>] ? tick_check_idle+0x8a/0xd0
Jul 25 08:14:54 192.168.10.205 [20239.424967]  [<ffffffff8169201a>] ? atomic_notifier_call_chain+0x1a/0x20
Jul 25 08:14:54 192.168.10.205 [20239.425445]  [<ffffffff81698bef>] do_IRQ+0x4f/0xf0
Jul 25 08:14:54 192.168.10.205 [20239.425921]  [<ffffffff8168dd6d>] common_interrupt+0x6d/0x6d
Jul 25 08:14:54 192.168.10.205 [20239.426389]  <EOI> 
Jul 25 08:14:54 192.168.10.205  [<ffffffff81514052>] ? cpuidle_enter_state+0x52/0xc0
Jul 25 08:14:54 192.168.10.205 [20239.426863]  [<ffffffff81514199>] cpuidle_idle_call+0xd9/0x210
Jul 25 08:14:54 192.168.10.205 [20239.427314]  [<ffffffff8103516e>] arch_cpu_idle+0xe/0x30
Jul 25 08:14:54 192.168.10.205 [20239.427793]  [<ffffffff810e7c95>] cpu_startup_entry+0x245/0x290
Jul 25 08:14:54 192.168.10.205 [20239.428222]  [<ffffffff8104f12a>] start_secondary+0x1ba/0x230
Jul 25 08:18:03 192.168.10.205 [    2.633081] nouveau 0000:01:00.0: priv: HUB0: 085014 ffffffff (1b70820b)
Jul 25 08:20:01 S214 systemd: Started Session 171 of user root.
Jul 25 08:20:01 S214 systemd: Starting Session 171 of user root.
Jul 25 08:30:01 S214 systemd: Started Session 172 of user root.
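A trace this long is easier to read once it is summarized by module. The sketch below counts the module tags (the `[name]` at the end of a stack frame line) in a saved copy of the log. The function name `trace_modules` is made up for illustration, and the log is assumed to be in the same line format as shown above.

```shell
#!/bin/sh
# Print "count module" for every module tag that ends a stack frame line,
# most frequent first. Frames from the core kernel carry no tag and are skipped.
trace_modules() {
    sed -n 's|.*+0x[0-9a-f]*/0x[0-9a-f]* \[\([A-Za-z0-9_]*\)] *$|\1|p' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}

# Typical use on the log server (the path is an assumption):
#   grep '192.168.10.205' /var/log/messages | trace_modules
```

On the trace above, every tagged frame belongs to nouveau, which points at the graphics driver rather than at the disk or the filesystem.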

 

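The right way to handle this is neither to dig deep into the root cause nor to start hacking on the kernel, but to try one of the following:

1. Upgrade to the latest stable kernel.

2. Roll back to the previous stable kernel.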

[root@S205 ~]# yum upgrade
Installing:
kernel                                                x86_64                              3.10.0-514.26.2.el7                                    updates                               37 M

 

Upgraded to the latest kernel; now waiting to observe:

[root@S205 ~]# cat /etc/redhat-release 
CentOS Linux release 7.3.1611 (Core) 
[root@S205 ~]# uname -a
Linux S205 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

 

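III: Still rebooting, with the same symptoms and the same error messages.

  Reference: https://stackoverflow.com/questions/44039958/kernel-panic-not-syncing-watchdog-detected-hard-lockup

  It looks like a problem with the NVIDIA graphics card: the frames at the top of the IRQ stack in the call trace above all belong to the nouveau driver.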

[root@S205 ~]# cat /etc/default/grub |grep CMDLINE
GRUB_CMDLINE_LINUX="crashkernel=auto rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rd.driver.blacklist=nouveau nomodeset rhgb quiet"
[root@S205 ~]# 

  Added the kernel parameters: rd.driver.blacklist=nouveau nomodeset

  Then observe again.
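Editing /etc/default/grub does not change anything by itself: the GRUB configuration has to be regenerated and the machine rebooted. The post does not show that step, so the sketch below is an assumed workflow for a BIOS-booted CentOS 7 host, together with a small hypothetical helper, `has_kernel_params`, for confirming that the parameters are active after the reboot.

```shell
#!/bin/sh
# Return 0 if every given parameter appears as a whole word in the
# kernel command line string passed as the first argument.
has_kernel_params() {
    cmdline=$1
    shift
    for param in "$@"; do
        case " $cmdline " in
            *" $param "*) ;;       # present, check the next one
            *) return 1 ;;         # missing
        esac
    done
    return 0
}

# Assumed workflow (run as root; the grub.cfg path differs on UEFI systems):
#   grub2-mkconfig -o /boot/grub2/grub.cfg
#   reboot
#   has_kernel_params "$(cat /proc/cmdline)" rd.driver.blacklist=nouveau nomodeset \
#       && echo "parameters active"
#   lsmod | grep nouveau || echo "nouveau not loaded"
```

Checking /proc/cmdline rather than /etc/default/grub matters here, because only the former reflects what the running kernel was actually booted with.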

 

IV: Fixed

No reboot for 24 consecutive hours.

The end.
