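Intel implemented the Machine Check Architecture (MCA) in the Pentium 4, Xeon, and P6 family processors. It provides a mechanism for detecting and reporting hardware (machine) errors such as system bus errors, ECC errors, parity errors, cache errors, and TLB errors. It consists of a set of MSRs (Model-Specific Registers) used to configure machine checking, plus additional banks of MSRs that record the errors.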
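When the processor detects an uncorrected machine-check error, it raises a machine-check exception (MCE). The machine-check architecture generally does not allow the processor to be restarted reliably after an MCE, but the MCE handler can collect information about the error from the machine-check MSRs. A typical kernel log looks like this: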
CPU 7: Machine Check Exception: 5 Bank 0: b200004010000400
RIP !INEXACT! 10:<ffffffff8010f16e> {mwait_idle+0x5e/0x90}
TSC 1952dbeebcc8
Kernel panic: Machine check
Reconfiguring memory bank information….
This may take a while….
done waiting: 3 cpus not responding
Warning: Non-empty request queue
I/O requests in flight at dump time
CPU 7: Machine Check Exception: 4 Bank 0: f200004040000400
RIP !INEXACT! 10:<ffffffff8011ef69>
Any kernel crash that prints "Machine Check Exception", or whose kernel stack trace contains do_machine_check(), is an MCE problem.
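The rule above can be sketched as a small log scanner. This is a minimal illustration, not a tool from the original post: the function names `looks_like_mce` and `find_mce_events` are hypothetical, and the regular expression assumes the summary-line layout shown in the log above, with both values printed in hexadecimal.

```python
import re

# Assumed layout of the kernel's summary line, taken from the log above:
# "CPU 7: Machine Check Exception: 5 Bank 0: b200004010000400"
MCE_LINE = re.compile(
    r"CPU (\d+): Machine Check Exception:\s*([0-9a-fA-F]+)\s+"
    r"Bank (\d+):\s*([0-9a-fA-F]+)"
)


def looks_like_mce(log_text):
    """Apply the rule of thumb: either marker string means an MCE."""
    return ("Machine Check Exception" in log_text
            or "do_machine_check" in log_text)


def find_mce_events(log_text):
    """Return one dict per MCE summary line found in log_text."""
    events = []
    for m in MCE_LINE.finditer(log_text):
        events.append({
            "cpu": int(m.group(1)),
            "mcg_status": int(m.group(2), 16),   # value after "Exception:"
            "bank": int(m.group(3)),
            "mci_status": int(m.group(4), 16),   # value after "Bank N:"
        })
    return events


if __name__ == "__main__":
    sample = (
        "CPU 7: Machine Check Exception: 5 Bank 0: b200004010000400\n"
        "RIP !INEXACT! 10:<ffffffff8010f16e> {mwait_idle+0x5e/0x90}\n"
        "CPU 7: Machine Check Exception: 4 Bank 0: f200004040000400\n"
    )
    print(looks_like_mce(sample))
    for event in find_mce_events(sample):
        print(event["cpu"], event["bank"], hex(event["mci_status"]))
```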
Possible causes include a damaged CPU cache or other CPU faults,
for example defects introduced during CPU manufacturing.
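Taking the MCE above as an example, the value printed after "Machine Check Exception" is the content of the IA32_MCG_STATUS MSR, and the value printed after "Bank 0" (or "Bank 5") is the content of the matching IA32_MCi_STATUS register.
The corresponding register values are:
IA32_MCG_STATUS MSR: 0000000000000004
IA32_MC0_STATUS MSR: f200000410000800
IA32_MC5_STATUS MSR: f200001044100e0f
With the MSR values in hand, checking them against the Intel Software Developer's Manual and other Intel documentation makes it relatively easy to find the cause of the MCE.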
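As a starting point for that lookup, the architectural flag bits of the two registers can be split out mechanically. The sketch below is not from the original post; it assumes the bit positions documented in the Intel SDM (RIPV, EIPV, MCIP in bits 0 to 2 of IA32_MCG_STATUS; VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63 down to 57 of IA32_MCi_STATUS; MCA error code in bits 15:0 and model-specific error code in bits 31:16). The function names are hypothetical.

```python
def bit(value, n):
    """Return bit n of value as 0 or 1."""
    return (value >> n) & 1


def decode_mcg_status(value):
    """Split IA32_MCG_STATUS into its three architectural flags."""
    return {
        "RIPV": bit(value, 0),   # restart IP valid
        "EIPV": bit(value, 1),   # error IP valid
        "MCIP": bit(value, 2),   # machine check in progress
    }


def decode_mci_status(value):
    """Split an IA32_MCi_STATUS value into flags and error codes."""
    return {
        "VAL": bit(value, 63),     # register contains valid data
        "OVER": bit(value, 62),    # an earlier error was overwritten
        "UC": bit(value, 61),      # uncorrected error
        "EN": bit(value, 60),      # error reporting was enabled
        "MISCV": bit(value, 59),   # IA32_MCi_MISC is valid
        "ADDRV": bit(value, 58),   # IA32_MCi_ADDR is valid
        "PCC": bit(value, 57),     # processor context corrupt
        "model_specific_code": (value >> 16) & 0xFFFF,
        "mca_error_code": value & 0xFFFF,
    }


if __name__ == "__main__":
    print(decode_mcg_status(0x0000000000000004))
    for raw in (0xF200000410000800, 0xF200001044100E0F):
        fields = decode_mci_status(raw)
        print(hex(raw), fields["UC"], fields["PCC"],
              hex(fields["mca_error_code"]))
```

For the IA32_MCG_STATUS value 4, only MCIP is set and RIPV is 0, meaning execution cannot be reliably restarted. The MCA error code itself still has to be looked up in the Intel manual for the specific processor model.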
A second example, this time a memory error. dmesg shows:
...
sbridge: HANDLING MCE MEMORY ERROR
CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093
TSC 0 ADDR 67081b300 MISC 2140040486
PROCESSOR 0:206d7 TIME 1441181676 SOCKET 0 APIC 0
EDAC MC0: CE row 2, channel 0, label "CPU_SrcID#0_Channel#3_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0093 (ch=3), addr= 0x67081b300 => socket=0, Channel=3(mask=8), rank=0
...
Save the four MCE log lines from dmesg to a file (/tmp/mlog in this example) and feed it to mcelog:
# mcelog --ascii < /tmp/mlog
WARNING: with --dmi mcelog --ascii must run on the same machine with
     the same BIOS/memory configuration as where the machine check occurred.
sbridge: HANDLING MCE MEMORY ERROR
CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010093
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
Wed Sep 2 16:14:36 2015
CPU 0 BANK 5 MISC 2140040486 ADDR 67081b300
STATUS 8c00004000010093 MCGSTATUS 0
CPUID Vendor Intel Family 6 Model 45
WARNING: SMBIOS data is often unreliable. Take with a grain of salt!
<24> DIMM 1333 Mhz Res13 Width 72 Data Width 64 Size 16 GB
Device Locator: Node0_Channel2_Dimm0
Bank Locator: Node0_Bank0
Manufacturer: Hynix Semiconducto
Serial Number: 40743B5A
Asset Tag: Dimm2_AssetTag
Part Number: HMT42GR7BFR4A-PB
TSC 0 ADDR 67081b300 MISC 2140040486
PROCESSOR 0:206d7 TIME 1441181676 SOCKET 0 APIC 0
EDAC MC0: CE row 2, channel 0, label "CPU_SrcID#0_Channel#3_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0093 (ch=3), addr = 0x67081b300 => socket=0, Channel=3(mask=8), rank=0
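The "memory read" and "ch=3" parts of the EDAC line can be cross-checked by hand from the low 16 bits of STATUS. The sketch below assumes the memory controller compound error code layout described in the Intel SDM, `000F 0000 1MMM CCCC`, where MMM is the transaction type and CCCC the channel (1111 meaning "channel not specified"). The function name is hypothetical and the code is only an illustration of that layout.

```python
# MMM field of a memory controller error code (assumed per the Intel SDM).
MEMORY_TRANSACTION_TYPES = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_memory_controller_code(mca_error_code):
    """Decode a 16-bit MCA error code of the form 000F 0000 1MMM CCCC.

    Returns None when the code is not a memory controller error.
    """
    # Every bit except F (bit 12), MMM (bits 6:4) and CCCC (bits 3:0)
    # is fixed: bit 7 must be 1 and the others must be 0.
    fixed_bits_mask = 0xEF80
    if (mca_error_code & fixed_bits_mask) != 0x0080:
        return None
    mmm = (mca_error_code >> 4) & 0x7
    cccc = mca_error_code & 0xF
    return {
        "transaction": MEMORY_TRANSACTION_TYPES.get(mmm, "reserved"),
        "channel": None if cccc == 0xF else cccc,
    }


if __name__ == "__main__":
    status = 0x8C00004000010093          # STATUS value from the log above
    print(decode_memory_controller_code(status & 0xFFFF))
    # prints {'transaction': 'memory read error', 'channel': 3}
```

The result agrees with the EDAC line (`Err=0001:0093 (ch=3)`, "memory read"), where 0001 is the model-specific code in bits 31:16 and 0093 is the MCA error code in bits 15:0.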
Using the values reported by mcelog,
Part Number: HMT42GR7BFR4A-PB
Serial Number: 40743B5A
find the matching hardware in the lshw output:
...
  *-memory:0
       description: System Memory
       physical id: 2d
       slot: System board or motherboard
     *-bank:0
          description: DIMM 1333 MHz (0.8 ns)
          product: HMT42GR7BFR4A-PB
          vendor: Hynix Semiconducto
          physical id: 0
          serial: 905D21AE
          slot: Node0_Channel1_Dimm0
          size: 16GiB
          width: 64 bits
          clock: 1333MHz (0.8ns)
     *-bank:1
          description: DIMM Synchronous [empty]
          product: A1_Dimm1_PartNumber
          vendor: Dimm1_Manufacturer
          physical id: 1
          serial: Dimm1_SerNum
          slot: Node0_Channel1_Dimm1
          width: 64 bits
     *-bank:2
          description: DIMM 1333 MHz (0.8 ns)
          product: HMT42GR7BFR4A-PB
          vendor: Hynix Semiconducto
          physical id: 2
          serial: 40743B5A
          slot: Node0_Channel2_Dimm0
          size: 16GiB
          width: 64 bits
          clock: 1333MHz (0.8ns)
...
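On a machine with many DIMMs, matching the serial number by eye is error-prone, so the lookup can be scripted. The following is a rough sketch, assuming lshw's text output has been saved to a string in the `*-bank:N` / `key: value` layout shown above; `find_bank_by_serial` is a hypothetical helper, not part of lshw.

```python
def find_bank_by_serial(lshw_text, serial):
    """Return the fields of the *-bank section whose serial matches.

    Returns None when no bank section carries that serial number.
    """
    current = None   # fields of the *-bank section being read, if any
    for raw in lshw_text.splitlines():
        line = raw.strip()
        if line.startswith("*-"):
            # A new section starts: check the one we just finished.
            if current is not None and current.get("serial") == serial:
                return current
            # Only *-bank sections describe individual DIMM slots.
            current = {"node": line[2:]} if line.startswith("*-bank") else None
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    if current is not None and current.get("serial") == serial:
        return current
    return None


if __name__ == "__main__":
    sample = """
      *-memory:0
           description: System Memory
         *-bank:0
              product: HMT42GR7BFR4A-PB
              serial: 905D21AE
              slot: Node0_Channel1_Dimm0
         *-bank:2
              product: HMT42GR7BFR4A-PB
              serial: 40743B5A
              slot: Node0_Channel2_Dimm0
    """
    bank = find_bank_by_serial(sample, "40743B5A")
    print(bank["node"], bank["slot"])
```

For the serial number 40743B5A this points at bank:2 in slot Node0_Channel2_Dimm0, the same DIMM that mcelog reported as Device Locator Node0_Channel2_Dimm0.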