Lab 2: Tracing and Analyzing System Call Handling in the Linux 5.0 Kernel


Student ID 293. Original work; please credit the source when reposting: https://github.com/mengning/linuxkernel/linux

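Lab Requirements

Lab Environment

Ubuntu 18.04 LTS

gcc 7.3.0

Lab Steps

1. Download and compile the kernel source

Download link

You can download the archive directly and extract it manually.

Alternatively, download and extract it as follows: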

mkdir ~/LinuxKernel
cd ~/LinuxKernel
wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.0.1.tar.xz
xz -d linux-5.0.1.tar.xz
tar -xvf linux-5.0.1.tar
cd linux-5.0.1

Next, install the kernel build tools (consider switching apt to a mirror in China first; see the reference):

sudo apt install build-essential flex bison libssl-dev libelf-dev libncurses-dev

Then run:

make menuconfig

Select Kernel hacking -> Compile-time checks and compiler options -> [*] Compile the kernel with debug info.

Then build:

make -j8

The build is now complete; the resulting kernel image is at ./arch/x86/boot/bzImage

2. Build the root filesystem

cd ~/LinuxKernel
mkdir rootfs
git clone https://github.com/mengning/menu.git
cd menu
sudo apt install gcc-multilib   # without this, the -m32 build reports missing files because the 32-bit gcc environment is incomplete
gcc -pthread -o init linktable.c menu.c test.c -m32 -static
cd ../rootfs
cp ../menu/init ./
find . | cpio -o -Hnewc |gzip -9 > ../rootfs.img

3. Boot MenuOS

qemu-system-i386 -kernel linux-5.0.1/arch/x86/boot/bzImage -initrd rootfs.img

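At this point an error was displayed, even though I had already installed qemu-system-i386.

Following the suggestion in the output, I then installed qemu-system-x86. The real cause was that I had built a 64-bit kernel, which qemu-system-i386 cannot run. Re-running make i386_defconfig also solves this, but then gdb cannot show the file and line number of a breakpoint later on. So I ran make menuconfig again, unchecked 64-bit kernel, and checked Kernel hacking -> Compile-time checks and compiler options -> Compile the kernel with debug info. This produces a 32-bit kernel that still carries source location information.

Rebuild: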

make -j8

rootfs.img now has to be regenerated. For convenience I modified the settings in the Makefile directly and then built:

cd ~/LinuxKernel/menu
make rootfs

This produces the new rootfs.img.

4. Debug and trace the kernel boot

cd ..
qemu-system-i386 -kernel linux-5.0.1/arch/x86/boot/bzImage -initrd rootfs.img -S -s -append nokaslr

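Note: for an explanation of the -append nokaslr option, see the discussion on Zhihu. After starting the QEMU virtual machine, open a new terminal window in the current directory and run:

cd linux-5.0.1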
gdb vmlinux

Once inside gdb, connect to QEMU by entering:

target remote:1234

After that you can debug as usual.

5. Code analysis

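Almost every kernel subsystem is initialized in start_kernel. It initializes the hardware and the core facilities (page_address, tick, and so on) until it reaches its last call, rest_init, which gets the system running.

In rest_init(), kernel_thread() is called to create the kernel thread kernel_init. This thread becomes process 1: it continues the remaining system initialization and starts the user-space init process.

Next, rest_init() calls kernel_thread() again to create kthreadd, which manages all kernel threads and gets PID 2.

Finally, the boot thread itself becomes the idle process (process 0). It enters cpu_startup_entry(), which loops forever consuming idle CPU time and never returns.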

Reference: pianogirl123

void __init __weak arch_call_rest_init(void)
{
	rest_init();
}

asmlinkage __visible void __init start_kernel(void)
{
	char *command_line;
	char *after_dashes;

	set_task_stack_end_magic(&init_task);
	smp_setup_processor_id();
	debug_objects_early_init();

	cgroup_init_early();

	local_irq_disable();
	early_boot_irqs_disabled = true;

	/*
	 * Interrupts are still disabled. Do necessary setups, then
	 * enable them.
	 */
	boot_cpu_init();
	page_address_init();
	pr_notice("%s", linux_banner);
	setup_arch(&command_line);
	/*
	 * Set up the the initial canary and entropy after arch
	 * and after adding latent and command line entropy.
	 */
	add_latent_entropy();
	add_device_randomness(command_line, strlen(command_line));
	boot_init_stack_canary();
	mm_init_cpumask(&init_mm);
	setup_command_line(command_line);
	setup_nr_cpu_ids();
	setup_per_cpu_areas();
	smp_prepare_boot_cpu();	/* arch-specific boot-cpu hooks */
	boot_cpu_hotplug_init();

	build_all_zonelists(NULL);
	page_alloc_init();

	pr_notice("Kernel command line: %s\n", boot_command_line);
	parse_early_param();
	after_dashes = parse_args("Booting kernel",
				  static_command_line, __start___param,
				  __stop___param - __start___param,
				  -1, -1, NULL, &unknown_bootoption);
	if (!IS_ERR_OR_NULL(after_dashes))
		parse_args("Setting init args", after_dashes, NULL, 0, -1, -1,
			   NULL, set_init_arg);

	jump_label_init();

	/*
	 * These use large bootmem allocations and must precede
	 * kmem_cache_init()
	 */
	setup_log_buf(0);
	vfs_caches_init_early();
	sort_main_extable();
	trap_init();
	mm_init();

	ftrace_init();

	/* trace_printk can be enabled here */
	early_trace_init();

	/*
	 * Set up the scheduler prior starting any interrupts (such as the
	 * timer interrupt). Full topology setup happens at smp_init()
	 * time - but meanwhile we still have a functioning scheduler.
	 */
	sched_init();
	/*
	 * Disable preemption - early bootup scheduling is extremely
	 * fragile until we cpu_idle() for the first time.
	 */
	preempt_disable();
	if (WARN(!irqs_disabled(),
		 "Interrupts were enabled *very* early, fixing it\n"))
		local_irq_disable();
	radix_tree_init();

	/*
	 * Set up housekeeping before setting up workqueues to allow the unbound
	 * workqueue to take non-housekeeping into account.
	 */
	housekeeping_init();

	/*
	 * Allow workqueue creation and work item queueing/cancelling
	 * early.  Work item execution depends on kthreads and starts after
	 * workqueue_init().
	 */
	workqueue_init_early();

	rcu_init();

	/* Trace events are available after this */
	trace_init();

	if (initcall_debug)
		initcall_debug_enable();

	context_tracking_init();
	/* init some links before init_ISA_irqs() */
	early_irq_init();
	init_IRQ();
	tick_init();
	rcu_init_nohz();
	init_timers();
	hrtimers_init();
	softirq_init();
	timekeeping_init();
	time_init();
	printk_safe_init();
	perf_event_init();
	profile_init();
	call_function_init();
	WARN(!irqs_disabled(), "Interrupts were enabled early\n");

	early_boot_irqs_disabled = false;
	local_irq_enable();

	kmem_cache_init_late();

	/*
	 * HACK ALERT! This is early. We're enabling the console before
	 * we've done PCI setups etc, and console_init() must be aware of
	 * this. But we do want output early, in case something goes wrong.
	 */
	console_init();
	if (panic_later)
		panic("Too many boot %s vars at `%s'", panic_later,
		      panic_param);

	lockdep_init();

	/*
	 * Need to run this when irqs are enabled, because it wants
	 * to self-test [hard/soft]-irqs on/off lock inversion bugs
	 * too:
	 */
	locking_selftest();

	/*
	 * This needs to be called before any devices perform DMA
	 * operations that might use the SWIOTLB bounce buffers. It will
	 * mark the bounce buffers as decrypted so that their usage will
	 * not cause "plain-text" data to be decrypted when accessed.
	 */
	mem_encrypt_init();

#ifdef CONFIG_BLK_DEV_INITRD
	if (initrd_start && !initrd_below_start_ok &&
	    page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
		pr_crit("initrd overwritten (0x%08lx < 0x%08lx) - disabling it.\n",
		    page_to_pfn(virt_to_page((void *)initrd_start)),
		    min_low_pfn);
		initrd_start = 0;
	}
#endif
	kmemleak_init();
	setup_per_cpu_pageset();
	numa_policy_init();
	acpi_early_init();
	if (late_time_init)
		late_time_init();
	sched_clock_init();
	calibrate_delay();
	pid_idr_init();
	anon_vma_init();
#ifdef CONFIG_X86
	if (efi_enabled(EFI_RUNTIME_SERVICES))
		efi_enter_virtual_mode();
#endif
	thread_stack_cache_init();
	cred_init();
	fork_init();
	proc_caches_init();
	uts_ns_init();
	buffer_init();
	key_init();
	security_init();
	dbg_late_init();
	vfs_caches_init();
	pagecache_init();
	signals_init();
	seq_file_init();
	proc_root_init();
	nsfs_init();
	cpuset_init();
	cgroup_init();
	taskstats_init_early();
	delayacct_init();

	check_bugs();

	acpi_subsystem_init();
	arch_post_acpi_subsys_init();
	sfi_init_late();

	/* Do the rest non-__init'ed, we're now alive */
	arch_call_rest_init();   /* calls rest_init() */
}

The rest_init() function

void rest_init(void) 
{
    int pid; 
    ……………… 
    kernel_thread(kernel_init, NULL, CLONE_FS);
    numa_default_policy(); 
    pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
    rcu_read_lock(); 
    kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns); 
    rcu_read_unlock(); 
    complete(&kthreadd_done); 
    init_idle_bootup_task(current); 
    schedule_preempt_disabled(); 
    cpu_startup_entry(CPUHP_ONLINE); 
}

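6. Trace a system call

Add the system call test
  • Based on the last two digits of my student ID, 93, /usr/include/asm/unistd_32.h gives #define __NR_ftruncate 93

  • Write the test: add two functions to test.c and register them in main with the corresponding MenuConfig() calls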

/* Resize the file through the glibc ftruncate() wrapper; needs <stdio.h> and <unistd.h> */
int update(int argc, char *argv[]){
	FILE *out;
	char *file = "93temp";
	int res = -2;
	int fd;
	out = fopen(file,"w+");
	if(out == NULL){
		printf("openFailed!!!!!\n");
		return -1;	/* do not call fileno() on a NULL stream */
	}
	fd = fileno(out);
	res = ftruncate(fd, 500);
	fclose(out);
	if(res == 0){
		printf("success!\n");
		out = fopen(file,"r");
		if(out == NULL)
			return -1;
		fseek(out,0L,SEEK_END);
		long size=ftell(out);	/* ftell() returns long */
		printf("size %ld\n",size);
		fclose(out);
	}else{
		printf("fail\n");
	}
	return res;
}

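/* Resize the file by issuing system call 93 (0x5D, ftruncate) directly with int $0x80 */
int updateAsm(int argc, char *argv[]){
	FILE *out;
	char *file="93temp";
	int fd;
	int res = -2;
	out = fopen(file,"w+");
	if(out == NULL){
		printf("openFailed!!!!!\n");
		return -1;
	}
	fd = fileno(out);
	asm volatile(
		"mov $0x5D, %%eax\n\t"	/* eax = system call number */
		"int $0x80\n\t"		/* trap into the kernel */
		"mov %%eax, %0\n\t"	/* eax now holds the return value */
		:"=m"(res)
		:"b"(fd),"c"(200)	/* ebx = fd, ecx = new length */
		:"eax","memory"		/* eax is overwritten inside the asm */
	);
	fclose(out);
	printf("res: %d\n",res);
	if(res == 0){
		printf("Success!\n");
		out = fopen(file, "r");
		if(out == NULL)
			return -1;
		fseek(out,0L,SEEK_END);
		long size=ftell(out);
		printf("size %ld\n",size);
		fclose(out);	/* close only the stream reopened here */
	}else{
		printf("failed!\n");
	}
	return res;
}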

int main()
{
    ................
    MenuConfig("update","updateFilesize", update);
    MenuConfig("updateAsm","updateFilesizeAsm", updateAsm);
    ExecuteMenu();
}
  • Add the start-paused setting to the Makefile

Then rebuild with make rootfs:

cd ~/LinuxKernel/menu
make rootfs
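  • Trace with gdb

After the int 0x80 interrupt, the CPU can be seen executing the instructions in arch/x86/entry/entry_32.S

  • Analyze the entry_32.S code

    # This code is the system call handling path; other interrupt paths are similar
    # Note: this listing is the older system_call form of the entry code; in Linux 5.0 the int $0x80 entry point in arch/x86/entry/entry_32.S is named entry_INT80_32 and dispatches through do_int80_syscall_32()
    # A system call is a special kind of interrupt, so it also saves and restores the context
    ENTRY(system_call)          # first instruction executed after int $0x80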
        RING0_INT_FRAME         # can't unwind into user space anyway
        ASM_CLAC
        pushl_cfi %eax          # save orig_eax
        SAVE_ALL                 # save the context
        GET_THREAD_INFO(%ebp)
                        # system call tracing in operation / emulation
        testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%ebp)
        jnz syscall_trace_entry
        cmpl $(NR_syscalls), %eax
        jae syscall_badsys
    syscall_call:
        # call the system call handler, i.e. the actual service routine
        call *sys_call_table(,%eax,4)   # system call table; eax holds the system call number (93, i.e. sys_ftruncate, in this lab)
    syscall_after_call:
        movl %eax,PT_EAX(%esp)      # store the return value
    syscall_exit:
        LOCKDEP_SYS_EXIT
        DISABLE_INTERRUPTS(CLBR_ANY)    # make sure we don't miss an interrupt
                        # setting need_resched or sigpending
                        # between sampling and the iret
        TRACE_IRQS_OFF
        movl TI_flags(%ebp), %ecx
        testl $_TIF_ALLWORK_MASK, %ecx  # current->work
        jne syscall_exit_work          # pending work before exit: go to syscall_exit_work
        # syscall_exit_work contains a process scheduling point
    
    restore_all:
        TRACE_IRQS_IRET
    restore_all_notrace:        # return to user mode
    #ifdef CONFIG_X86_ESPFIX32
        movl PT_EFLAGS(%esp), %eax  # mix EFLAGS, SS and CS
        # Warning: PT_OLDSS(%esp) contains the wrong/random values if we
        # are returning to the kernel.
        # See comments in process.c:copy_thread() for details.
        movb PT_OLDSS(%esp), %ah
        movb PT_CS(%esp), %al
        andl $(X86_EFLAGS_VM | (SEGMENT_TI_MASK << 8) | SEGMENT_RPL_MASK), %eax
        cmpl $((SEGMENT_LDT << 8) | USER_RPL), %eax
        CFI_REMEMBER_STATE
        je ldt_ss           # returning to user-space with LDT SS
    #endif
        RESTORE_REGS 4          # skip orig_eax/error_code
    irq_return:
        INTERRUPT_RETURN      # iret (macro); system call handling ends here

Lab Summary

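The principle is as follows: the process first fills the registers with the appropriate values and then executes a special instruction that jumps to a predefined location in the kernel. On 32-bit Intel CPUs this is done with interrupt 0x80. The hardware knows that once you jump to this location you are no longer running as a user in restricted mode but as the operating system kernel, that is, the CPU switches from user mode to kernel mode.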

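The kernel location that the process jumps to is called system_call. This routine checks the system call number, which tells the kernel which service the process is requesting. It then looks in the system call table (sys_call_table) for the entry address of the kernel function to call and calls it. After the function returns, it performs some system checks and finally returns to the process (or to another process, if this one has used up its time slice).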

The system call number is stored in the eax register, and the arguments are normally passed in ebx, ecx, edx, esi, edi, and ebp.
