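When modifying Linux kernel code or kernel modules, a mistake can easily hang or crash the system, and after such a crash /var/log/kern.log contains no record of what went wrong. This is where kdump comes in. There is plenty of Chinese and English material on kdump online, but most of it assumes the distribution's stock kernel. These notes record the problems I ran into when using kdump on a freshly compiled kernel.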
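1. First, a kernel compiled with Ubuntu's config file works with kdump.
2. After enabling the DEBUG_INFO build option on top of that config file and then compiling and installing the kernel, kdump stopped working: after a crash the machine simply froze with no response. I observed the following: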
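service --status-all shows that the kdump service started successfully.
kdump-config show reports that kdump is not ready.
Restarting the kdump service reports success, but /var/log/syslog contains the message "Could not find a free area of memory of xxxxx", meaning the memory reserved for kdump is insufficient. /proc/iomem shows that crash memory has in fact been reserved, but its size is very close to the size named in the syslog message.
I rebooted the laptop, edited the boot command at the GRUB command line, and set the crash memory to 256M (the GRUB configuration file previously had 128M). After the reboot the kdump service was OK, and kdump-config show also reported the ready state.
Triggering a kernel crash with echo c > /proc/sysrq-trigger showed kdump responding, but the laptop did not reboot and no crash information was recorded.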
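The /proc/iomem comparison described above can be scripted. The following is only a minimal sketch: the function name crash_region_mib is my own, and it assumes the reservation appears in /proc/iomem under the label "Crash kernel" with a hexadecimal start-end range. Reading the real /proc/iomem needs root, since unprivileged reads show zeroed addresses.

```shell
# Print the size in MiB of each "Crash kernel" region in an iomem-style file.
# crash_region_mib is a hypothetical helper name, not part of kdump-tools.
crash_region_mib() {
    local iomem="${1:-/proc/iomem}"
    local range start end
    # Each matching line looks like "  2c000000-33ffffff : Crash kernel".
    grep 'Crash kernel' "$iomem" | while read -r range _; do
        start="${range%-*}"
        end="${range#*-}"
        # The end address is inclusive, hence the + 1.
        echo $(( (0x$end - 0x$start + 1) / 1024 / 1024 ))
    done
}
```

Run as root with no argument, it reads /proc/iomem directly; the number it prints can then be compared with the size kexec complains about in syslog.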
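3. Since a kernel built without DEBUG_INFO can kdump while one built with DEBUG_INFO cannot, I tried stripping the debug information at install time:
objcopy --strip-debug ./vmlinux.o (back up vmlinux.o first)
make modules_install INSTALL_MOD_STRIP=1 install
The first command strips the debug information from the kernel; the second strips the debug info from the kernel modules as they are installed. I found the INSTALL_MOD_STRIP parameter in the kernel Makefile, where it controls whether --strip-debug is applied.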
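To check that the strip actually removed the debug sections, one option is to count the .debug_* entries in readelf -S output before and after. This is a sketch under assumptions: count_debug_sections is a hypothetical helper name, and it assumes binutils (readelf, objcopy) is installed.

```shell
# Count the lines of `readelf -S` output (read from stdin) that name a
# .debug_* section. count_debug_sections is a hypothetical helper name.
count_debug_sections() {
    # grep -c exits non-zero when the count is 0; keep the status clean.
    grep -c '\.debug_' || true
}

# Intended use (commented out because it needs a real vmlinux.o):
#   cp vmlinux.o vmlinux.o.bak                     # keep a backup first
#   readelf -S vmlinux.o | count_debug_sections    # before stripping
#   objcopy --strip-debug vmlinux.o
#   readelf -S vmlinux.o | count_debug_sections    # should now report 0
```

Relocation sections such as .rela.debug_info also match the pattern, so the count is a rough indicator rather than an exact section count.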
4. After stripping the debug information from the kernel and the kernel modules as above, I confirmed that kdump works normally:
root@Inspiron:/home/lybxin# crash doc/ubuntu-compile/vmlinux /var/crash/201610291718/dump.201610291718
crash 7.1.4
Copyright (C) 2002-2015 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
KERNEL: doc/ubuntu-compile/vmlinux
DUMPFILE: /var/crash/201610291718/dump.201610291718 [PARTIAL DUMP]
CPUS: 4
DATE: Sat Oct 29 17:17:44 2016
UPTIME: 00:03:39
LOAD AVERAGE: 0.69, 1.06, 0.50
TASKS: 582
NODENAME: Inspiron
RELEASE: 4.4.13+
VERSION: #35 SMP Fri Oct 28 23:13:30 CST 2016
MACHINE: x86_64 (2526 Mhz)
MEMORY: 3.9 GB
PANIC: "sysrq: SysRq : Trigger a crash"
PID: 2656
COMMAND: "bash"
TASK: ffff880081543e80 [THREAD_INFO: ffff880124db8000]
CPU: 0
STATE: TASK_RUNNING (SYSRQ)
crash> bt
PID: 2656 TASK: ffff880081543e80 CPU: 0 COMMAND: "bash"
#0 [ffff880124dbbaf0] machine_kexec at ffffffff8105ae6b
#1 [ffff880124dbbb50] crash_kexec at ffffffff8110cb12
#2 [ffff880124dbbc20] oops_end at ffffffff81030c29
#3 [ffff880124dbbc48] no_context at ffffffff81069c35
#4 [ffff880124dbbca8] __bad_area_nosemaphore at ffffffff81069f00
#5 [ffff880124dbbcf0] bad_area at ffffffff8106a0d3
#6 [ffff880124dbbd18] __do_page_fault at ffffffff8106a5eb
#7 [ffff880124dbbd70] do_page_fault at ffffffff8106a6b2
#8 [ffff880124dbbd90] page_fault at ffffffff81832178
    [exception RIP: sysrq_handle_crash+22]
    RIP: ffffffff814f42b6 RSP: ffff880124dbbe48 RFLAGS: 00010282
    RAX: 000000000000000f RBX: 0000000000000063 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffff880137c0dc78 RDI: 0000000000000063
    RBP: ffff880124dbbe48 R8: 0000000000000002 R9: 00000000000003ed
    R10: 0000000000000001 R11: 00000000000003ed R12: 0000000000000004
    R13: 0000000000000000 R14: ffffffff81ebada0 R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff880124dbbe50] __handle_sysrq at ffffffff814f4a8a
#10 [ffff880124dbbe80] write_sysrq_trigger at ffffffff814f4f0f
#11 [ffff880124dbbe98] proc_reg_write at ffffffff81279782
#12 [ffff880124dbbeb8] __vfs_write at ffffffff8120b438
#13 [ffff880124dbbec8] vfs_write at ffffffff8120bdc9
#14 [ffff880124dbbf08] sys_write at ffffffff8120ca85
#15 [ffff880124dbbf50] entry_SYSCALL_64_fastpath at ffffffff8182fff2
    RIP: 00007fe506368a10 RSP: 00007ffd28c06838 RFLAGS: 00000246
    RAX: ffffffffffffffda RBX: 00000000006f4378 RCX: 00007fe506368a10
    RDX: 0000000000000002 RSI: 00000000011bc408 RDI: 0000000000000001
    RBP: 00007ffd28c06770 R8: 00007fe506637780 R9: 00007fe506c6b700
    R10: 0000000000000001 R11: 0000000000000246 R12: 00007fe506c6d5d0
    R13: 0000000000000000 R14: 00007fe506c8f168 R15: 00007ffd28c06798
    ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
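Postscript:
Building with debug info actually produced two problems. The first was that the reserved crash memory was insufficient. The second was that, even with enough memory reserved and the kdump service in a healthy state, kdump still failed to start when the kernel crashed.
As for the first problem, insufficient reserved memory: I later read the kdump-tools and kexec-tools source code. It turns out that kdump-config show simply reads the value of /sys/kernel/kexec_crash_loaded; if kexec_crash_loaded is 0, kdump is not in the ready state. Looking further at the kernel source, the /sys/kernel/kexec_crash_loaded file corresponds to the kernel variable kexec_crash_image, which is modified in only two system calls, kexec_load and kexec_file_load. Since I had not touched any kexec-related code, the kernel itself was unlikely to be at fault.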
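The readiness check described above can be reproduced by hand. This is a rough sketch of the idea only, not the actual kdump-config code; the function name kdump_state and the "ready" / "not ready" strings are my own choices.

```shell
# Report whether a crash kernel is currently loaded, based on the sysfs flag.
# kdump_state is a hypothetical helper, not part of kdump-tools.
kdump_state() {
    local flag_file="${1:-/sys/kernel/kexec_crash_loaded}"
    # The file holds 1 once a crash kernel has been loaded, 0 otherwise.
    if [ "$(cat "$flag_file" 2>/dev/null)" = "1" ]; then
        echo "ready"
    else
        echo "not ready"
    fi
}
```

Calling kdump_state with no argument reads the real sysfs file; passing a path is only there to make the function easy to try against a test file.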
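Tracing the kdump service startup script showed that starting the service ultimately runs /bin/systemctl --no-pager start kdump-tools.service. systemctl talks to the init process through the /run/systemd/private socket, and init eventually runs the kexec program from kexec-tools. When kexec runs, it looks for the reserved crash memory: if the reservation is too small it prints the error "Could not find a free area of memory of xxxxx", and if it is large enough it calls the kexec_load system call to load the crash kernel.
As for the second problem, kdump still not working after the service had started: comparing the kernel built with debug info against the one built without, I found that the reserved crash memory sits at a different location. My guess is that this has to do with how the kernel's memory management interacts with kexec, but I do not understand this area well and did not dig deeper. Fortunately, stripping the debug info separately makes kdump work.
For how the crashkernel parameter in the GRUB boot entry is parsed, see parse_crashkernel in the kernel source.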
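To make the crashkernel=384M-:128M value easier to read, here is a small shell sketch of how the range form selects a size. It is my own approximation of the documented crashkernel syntax, not a port of parse_crashkernel: the helper names to_mib and crashkernel_mib are hypothetical, sizes are handled in whole MiB, and only the plain size[@offset] form and the range:size[,range:size...] form are covered (variants such as size,high are not).

```shell
# Convert a size such as 512K, 384M or 2G to MiB (K values round down).
to_mib() {
    case "$1" in
        *K) echo $(( ${1%K} / 1024 )) ;;
        *M) echo "${1%M}" ;;
        *G) echo $(( ${1%G} * 1024 )) ;;
        *)  echo $(( $1 / 1024 / 1024 )) ;;   # plain byte count
    esac
}

# Print the reservation in MiB that a crashkernel= value selects for a
# machine with the given RAM in MiB, or 0 if no range matches.
crashkernel_mib() {
    local value="${1%%@*}"      # drop any @offset suffix
    local ram_mib="$2"
    local entry range size start end
    case "$value" in
        *:*) ;;                               # range:size form, handled below
        *)   to_mib "$value"; return ;;       # plain crashkernel=size form
    esac
    for entry in $(echo "$value" | tr ',' ' '); do
        range="${entry%%:*}"
        size="${entry#*:}"
        start="$(to_mib "${range%%-*}")"
        end="${range#*-}"                     # empty means no upper bound
        # A range matches when start <= RAM < end.
        if [ "$ram_mib" -ge "$start" ]; then
            if [ -z "$end" ] || [ "$ram_mib" -lt "$(to_mib "$end")" ]; then
                to_mib "$size"
                return
            fi
        fi
    done
    echo 0
}
```

With the boot line from these notes, crashkernel_mib 384M-:128M 4096 selects 128, that is, any machine with at least 384M of RAM reserves 128M, while a machine below 384M reserves nothing.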
Command notes:
objcopy --only-keep-debug ./vmlinux.o vmlinux.debug
objcopy --strip-debug ./vmlinux.o
INSTALL_MOD_STRIP=1
strace -f -F -ff -o kdump /bin/systemctl --no-pager start kdump-tools.service
linux /boot/vmlinuz-4.4.13 root=UUID=1ba5f5c5-70c3-4936-b757-821899fe6264 ro quiet splash crashkernel=384M-:128M $vt_handoff