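Lab 2: Tracing and Analyzing the System Call Handling Path in Linux Kernel 5.0
Student ID 293. Original work; please credit the source when reposting: https://github.com/mengning/linuxkernel/linux
Ubuntu 18.04 LTS
gcc 7.3.0
Download link
You can download the kernel source directly and unpack it by hand,
or download and unpack it as follows: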
mkdir ~/LinuxKernel
cd ~/LinuxKernel
wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.0.1.tar.xz
xz -d linux-5.0.1.tar.xz
tar -xvf linux-5.0.1.tar
cd linux-5.0.1
Next, install the tools needed to build the kernel (consider switching apt to a mirror in China first):
sudo apt install build-essential flex bison libssl-dev libelf-dev libncurses-dev
Then run:
make menuconfig
Select Kernel hacking -> Compile-time checks and compiler options -> [*] Compile the kernel with debug info
Then build:
make -j8
The build is now complete. The resulting kernel image is at ./arch/x86/boot/bzImage
cd ~/LinuxKernel
mkdir rootfs
git clone https://github.com/mengning/menu.git
cd menu
# Without gcc-multilib the -m32 build reports missing files, because the installed gcc environment is incomplete
sudo apt install gcc-multilib
gcc -pthread -o init linktable.c menu.c test.c -m32 -static
cd ../rootfs
cp ../menu/init ./
find . | cpio -o -Hnewc | gzip -9 > ../rootfs.img
qemu-system-i386 -kernel linux-5.0.1/arch/x86/boot/bzImage -initrd rootfs.img
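At this point the following was displayed,
even though I had already installed qemu-system-i386.
Then:
So I followed the suggestion and installed qemu-system-x86. The underlying cause was that I had built a 64-bit kernel, so it could not be used here. Rerunning make i386_defconfig also solves this, but then the later gdb session cannot show the file and line number of a breakpoint. I therefore ran make menuconfig again, unchecked 64-bit kernel, and checked Kernel hacking -> Compile-time checks and compiler options -> Compile the kernel with debug info, so that the kernel is built as 32-bit and still carries the debug information needed to show source locations.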
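One way to confirm which kind of kernel was actually built is to look at the ELF header of the vmlinux file in the kernel tree (bzImage is a boot image rather than a plain ELF file). The sketch below is only an illustration and not part of the original lab: the function name elf_class is made up here, and it inspects nothing beyond the magic number and the class byte.

```c
#include <stdio.h>
#include <string.h>

/* Return 32 or 64 for an ELF file of that class, or -1 if the file
 * cannot be read or does not start with the ELF magic number.
 * Only the first five bytes of the file are inspected. */
int elf_class(const char *path)
{
    unsigned char ident[5];
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    size_t n = fread(ident, 1, sizeof ident, f);
    fclose(f);

    /* The literal is split so that 'E' is not parsed as part of \x7f. */
    if (n != sizeof ident || memcmp(ident, "\x7f" "ELF", 4) != 0)
        return -1;

    if (ident[4] == 1)      /* ELFCLASS32 */
        return 32;
    if (ident[4] == 2)      /* ELFCLASS64 */
        return 64;
    return -1;
}
```

Calling elf_class("linux-5.0.1/vmlinux") from ~/LinuxKernel should report 32 once the 64-bit kernel option has been unchecked and the kernel rebuilt, assuming the build left vmlinux in the top of the source tree.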
Build again:
make -j8
rootfs.img now has to be regenerated. For convenience, I edited the settings in the Makefile directly and then built:
cd ~/LinuxKernel/menu
make rootfs
This gives:
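cd ..
qemu-system-i386 -kernel linux-5.0.1/arch/x86/boot/bzImage -initrd rootfs.img -S -s -append nokaslr
Note: -S freezes the CPU at startup and -s opens a gdb server on TCP port 1234. The -append nokaslr option turns off kernel address space layout randomization so that the addresses gdb reads from vmlinux match the running kernel; see the Zhihu post for details. With the QEMU virtual machine running, open a new terminal window in the current directory and run:
cd linux-5.0.1
gdb vmlinux
Once inside gdb, connect to QEMU by entering
target remote:1234
After that, debugging works as usual.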
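Almost every kernel subsystem is initialized from start_kernel. It sets up the hardware and core services one by one, including page_address, tick, and so on, until the final call, rest_init, gets the system running.
In rest_init(), kernel_thread() is called to create the kernel thread kernel_init. This thread finishes kernel initialization, launches the user-space init process, and becomes process 1, which carries on with the rest of system initialization.
Next, rest_init() calls kernel_thread() again to create kthreadd, which manages all other kernel threads; its process ID is 2.
Finally, the boot thread itself turns into the idle process (process 0). It is not scheduled like an ordinary task: it loops forever in cpu_startup_entry(), soaking up idle CPU time, and never returns.
Reference: pianogirl123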
void __init __weak arch_call_rest_init(void)
{
	rest_init();
}

asmlinkage __visible void __init start_kernel(void)
{
	char *command_line;
	char *after_dashes;

	set_task_stack_end_magic(&init_task);
	smp_setup_processor_id();
	debug_objects_early_init();

	cgroup_init_early();

	local_irq_disable();
	early_boot_irqs_disabled = true;

	/*
	 * Interrupts are still disabled. Do necessary setups, then
	 * enable them.
	 */
	boot_cpu_init();
	page_address_init();
	pr_notice("%s", linux_banner);
	setup_arch(&command_line);
	/*
	 * Set up the the initial canary and entropy after arch
	 * and after adding latent and command line entropy.
	 */
	add_latent_entropy();
	add_device_randomness(command_line, strlen(command_line));
	boot_init_stack_canary();
	mm_init_cpumask(&init_mm);
	setup_command_line(command_line);
	setup_nr_cpu_ids();
	setup_per_cpu_areas();
	smp_prepare_boot_cpu();	/* arch-specific boot-cpu hooks */
	boot_cpu_hotplug_init();

	build_all_zonelists(NULL);
	page_alloc_init();

	pr_notice("Kernel command line: %s\n", boot_command_line);
	parse_early_param();
	after_dashes = parse_args("Booting kernel",
				  static_command_line, __start___param,
				  __stop___param - __start___param,
				  -1, -1, NULL, &unknown_bootoption);
	if (!IS_ERR_OR_NULL(after_dashes))
		parse_args("Setting init args", after_dashes, NULL, 0, -1, -1,
			   NULL, set_init_arg);

	jump_label_init();

	/*
	 * These use large bootmem allocations and must precede
	 * kmem_cache_init()
	 */
	setup_log_buf(0);
	vfs_caches_init_early();
	sort_main_extable();
	trap_init();
	mm_init();

	ftrace_init();

	/* trace_printk can be enabled here */
	early_trace_init();

	/*
	 * Set up the scheduler prior starting any interrupts (such as the
	 * timer interrupt). Full topology setup happens at smp_init()
	 * time - but meanwhile we still have a functioning scheduler.
	 */
	sched_init();
	/*
	 * Disable preemption - early bootup scheduling is extremely
	 * fragile until we cpu_idle() for the first time.
	 */
	preempt_disable();
	if (WARN(!irqs_disabled(),
		 "Interrupts were enabled *very* early, fixing it\n"))
		local_irq_disable();
	radix_tree_init();

	/*
	 * Set up housekeeping before setting up workqueues to allow the unbound
	 * workqueue to take non-housekeeping into account.
	 */
	housekeeping_init();

	/*
	 * Allow workqueue creation and work item queueing/cancelling
	 * early. Work item execution depends on kthreads and starts after
	 * workqueue_init().
	 */
	workqueue_init_early();

	rcu_init();

	/* Trace events are available after this */
	trace_init();

	if (initcall_debug)
		initcall_debug_enable();

	context_tracking_init();
	/* init some links before init_ISA_irqs() */
	early_irq_init();
	init_IRQ();
	tick_init();
	rcu_init_nohz();
	init_timers();
	hrtimers_init();
	softirq_init();
	timekeeping_init();
	time_init();
	printk_safe_init();
	perf_event_init();
	profile_init();
	call_function_init();
	WARN(!irqs_disabled(), "Interrupts were enabled early\n");

	early_boot_irqs_disabled = false;
	local_irq_enable();

	kmem_cache_init_late();

	/*
	 * HACK ALERT! This is early. We're enabling the console before
	 * we've done PCI setups etc, and console_init() must be aware of
	 * this. But we do want output early, in case something goes wrong.
	 */
	console_init();
	if (panic_later)
		panic("Too many boot %s vars at `%s'", panic_later,
		      panic_param);

	lockdep_init();

	/*
	 * Need to run this when irqs are enabled, because it wants
	 * to self-test [hard/soft]-irqs on/off lock inversion bugs
	 * too:
	 */
	locking_selftest();

	/*
	 * This needs to be called before any devices perform DMA
	 * operations that might use the SWIOTLB bounce buffers. It will
	 * mark the bounce buffers as decrypted so that their usage will
	 * not cause "plain-text" data to be decrypted when accessed.
	 */
	mem_encrypt_init();

#ifdef CONFIG_BLK_DEV_INITRD
	if (initrd_start && !initrd_below_start_ok &&
	    page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
		pr_crit("initrd overwritten (0x%08lx < 0x%08lx) - disabling it.\n",
		    page_to_pfn(virt_to_page((void *)initrd_start)),
		    min_low_pfn);
		initrd_start = 0;
	}
#endif
	kmemleak_init();
	setup_per_cpu_pageset();
	numa_policy_init();
	acpi_early_init();
	if (late_time_init)
		late_time_init();
	sched_clock_init();
	calibrate_delay();
	pid_idr_init();
	anon_vma_init();
#ifdef CONFIG_X86
	if (efi_enabled(EFI_RUNTIME_SERVICES))
		efi_enter_virtual_mode();
#endif
	thread_stack_cache_init();
	cred_init();
	fork_init();
	proc_caches_init();
	uts_ns_init();
	buffer_init();
	key_init();
	security_init();
	dbg_late_init();
	vfs_caches_init();
	pagecache_init();
	signals_init();
	seq_file_init();
	proc_root_init();
	nsfs_init();
	cpuset_init();
	cgroup_init();
	taskstats_init_early();
	delayacct_init();

	check_bugs();

	acpi_subsystem_init();
	arch_post_acpi_subsys_init();
	sfi_init_late();

	/* Do the rest non-__init'ed, we're now alive */
	arch_call_rest_init();	/* calls rest_init() */
}
The rest_init() function
void rest_init(void)
{
	int pid;
	………………
	kernel_thread(kernel_init, NULL, CLONE_FS);	/* becomes process 1 */
	numa_default_policy();
	pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);	/* becomes process 2 */
	rcu_read_lock();
	kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
	rcu_read_unlock();
	complete(&kthreadd_done);
	init_idle_bootup_task(current);
	schedule_preempt_disabled();
	cpu_startup_entry(CPUHP_ONLINE);	/* the boot thread becomes the idle loop */
}
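The PID assignment described above can also be observed from user space on a running Linux system. The following is a minimal sketch, assuming /proc is mounted; the helper name read_comm is hypothetical and not something the lab code defines.

```c
#include <stdio.h>
#include <string.h>

/* Copy the command name of process `pid`, as reported by
 * /proc/<pid>/comm, into buf. Returns 0 on success and -1 if the
 * file cannot be opened or read. Linux-only. */
int read_comm(int pid, char *buf, size_t len)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%d/comm", pid);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);

    buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */
    return 0;
}
```

On a typical non-containerized distribution, read_comm(1, ...) yields the name of the init program and read_comm(2, ...) yields kthreadd. The idle task (process 0) has no entry under /proc, so read_comm(0, ...) fails.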
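The last two digits of my student ID are 93. Looking that number up in /usr/include/asm/unistd_32.h gives #define __NR_ftruncate 93, so the system call to trace is ftruncate.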
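The same number can be read from the C headers instead of searching the file by hand. This is a small sketch, not part of the original lab; the function names are made up for illustration, and SYS_ftruncate comes from the standard <sys/syscall.h> header.

```c
#include <stdio.h>
#include <sys/syscall.h>

/* Return the ftruncate system call number for the architecture
 * this file is compiled for. */
long ftruncate_nr(void)
{
    return SYS_ftruncate;
}

/* Print the number together with the pointer width of the build,
 * since the value differs between 32-bit and 64-bit x86. */
void print_ftruncate_nr(void)
{
    printf("SYS_ftruncate = %ld (%zu-bit build)\n",
           ftruncate_nr(), sizeof(void *) * 8);
}
```

Built with gcc -m32 on x86, the value is 93, matching unistd_32.h. A native x86-64 build reports 77 instead, because the two ABIs use different system call tables.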
Writing the test: add two functions to test.c, and register them in main with the corresponding MenuConfig() calls.
#include <unistd.h>	/* ftruncate(); add at the top of test.c if it is not already included */

/* Set the size of "93temp" to 500 bytes through the libc wrapper. */
int update(int argc, char *argv[])
{
	FILE *out;
	char *file = "93temp";
	int res = -2;
	int fd;

	out = fopen(file, "w+");
	if (out == NULL) {
		printf("openFailed!!!!!\n");
		return res;	/* do not call fileno() on a NULL stream */
	}
	fd = fileno(out);
	res = ftruncate(fd, 500);
	fclose(out);
	if (res == 0) {
		printf("success!\n");
		out = fopen(file, "r");
		if (out != NULL) {
			fseek(out, 0L, SEEK_END);
			int size = ftell(out);
			printf("size %d\n", size);
			fclose(out);
		}
	} else {
		printf("fail\n");
	}
	return res;
}

/* Same operation with length 200, issued directly through int 0x80. */
int updateAsm(int argc, char *argv[])
{
	FILE *out;
	char *file = "93temp";
	int fd;
	int res = -2;

	out = fopen(file, "w+");
	if (out == NULL) {
		printf("openFailed!!!!!\n");
		return res;
	}
	fd = fileno(out);
	asm volatile(
		"mov $0x5D, %%eax\n\t"	/* 0x5D = 93 = __NR_ftruncate */
		"int $0x80\n\t"
		"mov %%eax, %0\n\t"	/* return value comes back in eax */
		: "=m"(res)
		: "b"(fd), "c"(200)	/* ebx = fd, ecx = length */
		: "eax", "memory"	/* eax is overwritten by the sequence */
	);
	fclose(out);
	printf("res: %d\n", res);
	if (res == 0) {
		printf("Success!\n");
		out = fopen(file, "r");
		if (out != NULL) {
			fseek(out, 0L, SEEK_END);
			int size = ftell(out);
			printf("size %d\n", size);
			fclose(out);	/* close only the stream opened here, avoiding a double fclose */
		}
	} else {
		printf("failed!\n");
	}
	return res;
}

int main()
{
	................
	MenuConfig("update", "updateFilesize", update);
	MenuConfig("updateAsm", "updateFilesizeAsm", updateAsm);
	ExecuteMenu();
}
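For comparison, the raw system call can also be issued without inline assembly through the libc syscall() function, which loads the number and arguments into whatever registers the current architecture expects. The sketch below is an alternative for experimenting on the host, not the lab's own test code; the name truncate_via_syscall and the file name used in the usage note are made up for illustration.

```c
#define _GNU_SOURCE          /* makes glibc declare syscall() under strict -std modes */
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Create (or reuse) `path`, set its length with a raw ftruncate
 * system call, and return the resulting size in bytes.
 * Returns -1 on any failure. */
long truncate_via_syscall(const char *path, long length)
{
    struct stat st;
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    /* Equivalent to the int 0x80 sequence on i386, but portable:
     * the first argument is the system call number. */
    if (syscall(SYS_ftruncate, fd, length) != 0 || fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }

    close(fd);
    return (long)st.st_size;
}
```

For example, truncate_via_syscall("93temp", 200) should return 200, the same size that updateAsm reports after its int 0x80 call succeeds.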
Rebuild the rootfs:
cd ~/LinuxKernel/menu
make rootfs
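The trace shows that after the int 0x80 interrupt is raised, the CPU runs the instructions in arch/x86/entry/entry_32.S.
Analysis of the entry_32.S code. The listing below follows the older system_call layout of the 32-bit entry code. In the 5.0 source the int 0x80 entry point is named entry_INT80_32 and dispatches through the C function do_int80_syscall_32(), but the overall flow of saving context, dispatching on the system call number, and restoring context is the same.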
# This code is the system call handling path; other interrupt paths are similar.
# A system call is a special kind of interrupt, so it also saves and restores context.
ENTRY(system_call)			# the next instruction executed after int 0x80
	RING0_INT_FRAME			# can't unwind into user space anyway
	ASM_CLAC
	pushl_cfi %eax			# save orig_eax
	SAVE_ALL			# save context
	GET_THREAD_INFO(%ebp)
					# system call tracing in operation / emulation
	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%ebp)
	jnz syscall_trace_entry
	cmpl $(NR_syscalls), %eax
	jae syscall_badsys
syscall_call:
	# call the system call handler, the actual service routine
	call *sys_call_table(,%eax,4)	# the system call table; eax holds the system call number (sys_time in the example)
syscall_after_call:
	movl %eax,PT_EAX(%esp)		# store the return value
syscall_exit:
	LOCKDEP_SYS_EXIT
	DISABLE_INTERRUPTS(CLBR_ANY)	# make sure we don't miss an interrupt
					# setting need_resched or sigpending
					# between sampling and the iret
	TRACE_IRQS_OFF
	movl TI_flags(%ebp), %ecx
	testl $_TIF_ALLWORK_MASK, %ecx	# current->work
	jne syscall_exit_work		# before exiting, go through syscall_exit_work
					# syscall_exit_work contains a process scheduling point

restore_all:
	TRACE_IRQS_IRET
restore_all_notrace:			# return to user mode
#ifdef CONFIG_X86_ESPFIX32
	movl PT_EFLAGS(%esp), %eax	# mix EFLAGS, SS and CS
	# Warning: PT_OLDSS(%esp) contains the wrong/random values if we
	# are returning to the kernel.
	# See comments in process.c:copy_thread() for details.
	movb PT_OLDSS(%esp), %ah
	movb PT_CS(%esp), %al
	andl $(X86_EFLAGS_VM | (SEGMENT_TI_MASK << 8) | SEGMENT_RPL_MASK), %eax
	cmpl $((SEGMENT_LDT << 8) | USER_RPL), %eax
	CFI_REMEMBER_STATE
	je ldt_ss			# returning to user-space with LDT SS
#endif
	RESTORE_REGS 4			# skip orig_eax/error_code
irq_return:
	INTERRUPT_RETURN		# iret (macro); the system call path ends here
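The principle is as follows. The process first fills the registers with the appropriate values and then executes a special instruction that jumps to a predefined location in the kernel. On Intel CPUs this is done with interrupt 0x80. The hardware knows that once execution reaches this location, the code is no longer running as a user in restricted mode but as the operating system kernel: the CPU has switched from user mode to kernel mode.
The kernel location the process jumps to is called system_call. This routine checks the system call number, which tells the kernel which service the process is requesting. It then looks in the system call table (sys_call_table) for the entry address of the kernel function to call, calls that function, and, once it returns, performs some system checks before returning to the process (or to a different process, if this one has used up its time slice).
The system call number is passed in the eax register, and the arguments are normally passed in ebx, ecx, edx, esi, edi, and ebp.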
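The dispatch step can be modelled in plain C to make the flow concrete. The following is a toy user-space sketch under stated assumptions, not kernel code: struct toy_regs, the two toy handlers, and toy_call_table are all invented for illustration, and only three argument registers are modelled.

```c
#include <errno.h>
#include <stddef.h>

/* Toy register file; field names mirror the i386 convention above. */
struct toy_regs {
    long eax;   /* system call number on entry, return value on exit */
    long ebx;   /* argument 1 */
    long ecx;   /* argument 2 */
    long edx;   /* argument 3 */
};

typedef long (*toy_syscall_fn)(long, long, long);

static long toy_add(long a, long b, long c) { (void)c; return a + b; }
static long toy_max(long a, long b, long c) { (void)c; return a > b ? a : b; }

/* Hypothetical two-entry table standing in for sys_call_table. */
static const toy_syscall_fn toy_call_table[] = { toy_add, toy_max };
#define TOY_NR_SYSCALLS (sizeof toy_call_table / sizeof toy_call_table[0])

/* Mirrors the listing: bounds-check eax, call through the table,
 * then store the result back into eax for the caller to read. */
void toy_system_call(struct toy_regs *regs)
{
    if (regs->eax < 0 || (size_t)regs->eax >= TOY_NR_SYSCALLS) {
        regs->eax = -ENOSYS;    /* corresponds to the syscall_badsys path */
        return;
    }
    regs->eax = toy_call_table[regs->eax](regs->ebx, regs->ecx, regs->edx);
}
```

Setting eax to 0 with ebx = 2 and ecx = 3 and then calling toy_system_call() leaves 5 in eax, while an out-of-range number leaves -ENOSYS there, which is how an unknown system call number is reported back to the caller.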