Table of Contents
1. Process-related data structures: 1) struct task_struct 2) struct cred 3) struct pid_link 4) struct pid 5) struct signal_struct 6) struct rlimit
2. Kernel queue/list objects: 1) singly-linked lists 2) singly-linked tail queues 3) doubly-linked lists 4) doubly-linked tail queues
3. Kernel-module data structures: 1) struct module
4. Filesystem data structures: 1) struct file 2) struct inode 3) struct stat 4) struct fs_struct 5) struct files_struct 6) struct fdtable 7) struct dentry 8) struct vfsmount 9) struct nameidata 10) struct super_block 11) struct file_system_type
5. Kernel-security data structures: 1) struct security_operations 2) struct kprobe 3) struct jprobe 4) struct kretprobe 5) struct kretprobe_instance 6) struct kretprobe_blackpoint, struct kprobe_blacklist_entry 7) struct linux_binprm 8) struct linux_binfmt
6. System network-state data structures: 1) struct ifconf 2) struct ifreq 3) struct socket 4) struct sock 5) struct proto_ops 6) struct inet_sock 7) struct sockaddr
7. System memory data structures: 1) struct mm_struct 2) struct vm_area_struct 3) struct pg_data_t 4) struct zone 5) struct page
8. Interrupt data structures: 1) struct irq_desc 2) struct irq_chip 3) struct irqaction
9. Inter-process communication (IPC) data structures: 1) struct ipc_namespace 2) struct ipc_ids 3) struct kern_ipc_perm 4) struct sysv_sem 5) struct sem_queue 6) struct msg_queue 7) struct msg_msg 8) struct msg_sender 9) struct msg_receiver 10) struct msqid_ds
10. Namespace data structures: 1) struct pid_namespace 2) struct pid, struct upid 3) struct nsproxy 4) struct mnt_namespace
1. Process-Related Data Structures
0x0: The current Macro
As we know, Windows describes the running state of a process with a PCB (Process Control Block); correspondingly, Linux stores process information in the task_struct structure.
task_struct is defined in linux/sched.h (this header must be included whenever the current macro is used).
Note that the current macro, commonly used in Linux kernel programming, gives very convenient access to a pointer to the current task_struct. The macro is architecture-specific; since most of us work on x86, its definition lives under arch/x86, and other architectures follow the same pattern.
The mainstream architectures today are x86, ARM, and MIPS. Before going further, let us briefly review what a CPU architecture is.
In computing, the term "architecture" describes an abstract machine rather than a concrete machine implementation. Generally speaking, a CPU architecture consists of an instruction set plus a set of registers; the terms "instruction set" and "architecture" are used as synonyms.
Architectures and characteristics of the x86, MIPS, and ARM CPUs:
1. x86

x86 uses the CISC instruction set. Of all CISC instructions, roughly 20% are used repeatedly and account for about 80% of program code, while the remaining 80% of instructions are rarely used and appear in only about 20% of programs.

1.1 Bus Interface Unit (BIU), consisting of:
1) four 16-bit segment registers (DS, ES, SS, CS)
2) a 16-bit instruction pointer register (IP)
3) a 20-bit physical address adder
4) a 6-byte instruction queue (4 bytes on the 8088)
5) bus control circuitry, responsible for data transfers with memory and I/O ports

1.2 Execution Unit (EU), whose task is to take instructions from the instruction queue, decode and execute them, and compute the 16-bit offset addresses of operands. It consists of:
1) the ALU
2) the register file (AX, BX, CX, DX, SI, DI, BP, SP)
3) the flags register (PSW)

1.3 Register structure
1) The data registers AX, BX, CX, DX are 16-bit registers, each of which splits into a high byte and a low byte: AH, BH, CH, DH and AL, BL, CL, DL can be used as independent 8-bit registers. Both the 16-bit and the 8-bit forms hold operands and intermediate results. A few instructions dedicate a particular register to a specific purpose; for example, string instructions use CX as the counter recording the number of elements in the string.
2) Segment registers: CS, DS, SS, ES. The 8086/8088 forms its 20-bit physical address inside the CPU by adding two parts:
2.1) the offset: SP, BP, SI, DI supply the low 16 bits of the 20-bit physical address
2.2) the segment: the segment registers supply the high bits of the 20-bit physical address; the four have dedicated roles and are not interchangeable:
2.2.1) CS identifies the current code segment
2.2.2) DS identifies the current data segment
2.2.3) SS identifies the current stack segment
2.2.4) ES identifies the current extra segment
Normally, the user program must initialize DS and ES itself.
3) Control registers
3.1) IP: the instruction pointer holds the offset of the instruction to execute next (the segment address comes from CS)
3.2) FLAG: the 16-bit flags register uses nine of its bits, in two groups:
3.2.1) status flags: six bits (CF, AF, OF, SF, PF, ZF) recording the result of the previous ALU operation; to the user they are read-only
3.2.2) control flags: three bits (the direction flag DF, the interrupt-enable flag IF, and the trap flag TF), which can be set via instructions

2. MIPS
1) All instructions are encoded in 32 bits.
2) Some instructions have 26 bits for the target address, others only 16; loading an arbitrary 32-bit value therefore takes two load instructions, and a 16-bit target address means a branch or call target must lie within 64K (±32K).
3) In principle, every operation must complete within one clock cycle: one action per stage.
4) There are 32 general-purpose registers, each 32 bits wide (on 32-bit machines) or 64 bits wide (on 64-bit machines).
5) MIPS has no flags register to assist conditional logic; the equivalent functionality is implemented by testing whether two registers are equal.
6) All arithmetic is 32-bit; there are no byte or half-word operations (in MIPS, a word is defined as 32 bits and a half-word as 16 bits).
7) There are no dedicated stack instructions; all stack operations are ordinary memory accesses, since push and pop are really compound operations consisting of a memory write plus a stack-pointer adjustment.
8) Because MIPS instructions have fixed length, compiled binaries and their memory footprint are larger than x86's (the average x86 instruction is just over 3 bytes, versus 4 bytes for MIPS).
9) Addressing: there is only one memory addressing mode, a base register plus a 16-bit offset.
10) Data accesses in memory must be strictly aligned (at least 4-byte alignment).
11) Jump instructions carry only a 26-bit target address; with the 2 alignment bits added, they can address a 28-bit region, i.e. 256 MB.
12) Conditional branch instructions carry only a 16-bit target; with the 2 alignment bits, that gives 18 bits of reach, i.e. 256 KB.
13) By default, MIPS does not store a function's return address (the instruction address back in the caller) on the stack; it stores it in register $31, which benefits leaf functions. Nested calls are handled by a separate convention.
14) Deep pipelining: the classic five-stage MIPS pipeline (every instruction passes through five stages):
14.1) stage 1: fetch the instruction from the instruction buffer; one clock cycle
14.2) stage 2: read the registers named by the (up to two) source-register fields of the instruction (each a number selecting one of $0–$31); half a clock cycle
14.3) stage 3: perform one arithmetic or logic operation; one clock cycle
14.4) stage 4: the memory-access stage for loads; on average about 3/4 of instructions do nothing here, but the stage preserves instruction ordering; one clock cycle
14.5) stage 5: write the result back to a register or memory; half a clock cycle
So one instruction occupies four clock cycles in total.

3. ARM

ARM is a 32-bit RISC processor architecture, widely used in embedded system designs.
1) RISC (Reduced Instruction Set Computer) architectures share these traits:
1.1) fixed-length, regular, simple instruction formats, with only 2–3 basic addressing modes
1.2) single-cycle instructions, convenient for pipelined execution
1.3) heavy use of registers: data-processing instructions operate only on registers, and only load/store instructions access memory, improving execution efficiency
2) ARM also adopts some special techniques to minimize die area and power consumption while preserving high performance:
2.1) every instruction can be conditionally executed based on the result of previous instructions, improving execution efficiency
2.2) load/store-multiple instructions transfer data in batches, improving data-transfer efficiency
3) Register structure: ARM processors have 37 registers, divided into banks, including:
3.1) 31 general-purpose registers, including the program counter (PC), all 32 bits wide
3.2) 6 status registers, also 32 bits wide, recording the CPU's working state and the program's running state
4) Instruction sets: newer ARM architectures support two instruction sets, ARM and Thumb. ARM instructions are 32 bits long, Thumb instructions 16 bits. Thumb is a functional subset of the ARM instruction set; compared with equivalent ARM code, it can save 30–40% or more of storage space while retaining all the advantages of 32-bit code.
Next, let's look at how the kernel code implements the current macro.
#ifndef __ASSEMBLY__
struct task_struct;

/* DECLARE_PER_CPU(type, name) declares a per-CPU variable at compile time,
   placed in a special section: one instance of `type` named `name` for each
   processor */
DECLARE_PER_CPU(struct task_struct *, current_task);

static __always_inline struct task_struct *get_current(void)
{
	return percpu_read_stable(current_task);
}

#define current get_current()

#endif /* __ASSEMBLY__ */
#endif /* _ASM_X86_CURRENT_H */
Next, trace into percpu_read_stable():
\linux-2.6.32.63\arch\x86\include\asm\percpu.h
#define percpu_read_stable(var) percpu_from_op("mov", per_cpu__##var, "p" (&per_cpu__##var))
And then into percpu_from_op():
/* percpu_from_op selects a branch by sizeof(var) and executes a different
   flow for each size; on 32-bit x86, sizeof(current_task) is 4. Each branch
   is a single inline-asm move in which __percpu_arg(1) expands to %%fs:%P1
   (x86) or %%gs:%P1 (x86_64), so after expansion the current-fetch boils
   down to:
   1. x86:    asm("movl %%fs:%P1, %0" : "=r" (ret__) : "p" (&(var)));
   2. x86_64: asm("movq %%gs:%P1, %0" : "=r" (ret__) : "p" (&(var)));
*/
#define percpu_from_op(op, var, constraint)		\
({							\
	typeof(var) ret__;				\
	switch (sizeof(var)) {				\
	case 1:						\
		asm(op "b "__percpu_arg(1)",%0"		\
		    : "=q" (ret__)			\
		    : constraint);			\
		break;					\
	case 2:						\
		asm(op "w "__percpu_arg(1)",%0"		\
		    : "=r" (ret__)			\
		    : constraint);			\
		break;					\
	case 4:						\
		asm(op "l "__percpu_arg(1)",%0"		\
		    : "=r" (ret__)			\
		    : constraint);			\
		break;					\
	case 8:						\
		asm(op "q "__percpu_arg(1)",%0"		\
		    : "=r" (ret__)			\
		    : constraint);			\
		break;					\
	default: __bad_percpu_size();			\
	}						\
	ret__;						\
})
That is, it moves the value at offset %P1 in the fs segment (x86) or gs segment (x86-64) into the output variable ret__.
Next, trace where per_cpu__current_task (along with the related per-CPU stack variables) is defined:
linux-2.6.32.63\arch\x86\kernel\cpu\common.c
/*
 * The following four percpu variables are hot. Align current_task to
 * cacheline size such that all four fall in the same cacheline.
 */
DEFINE_PER_CPU(struct task_struct *, current_task) ____cacheline_aligned = &init_task;
EXPORT_PER_CPU_SYMBOL(current_task);

DEFINE_PER_CPU(unsigned long, kernel_stack) =
	(unsigned long)&init_thread_union - KERNEL_STACK_OFFSET + THREAD_SIZE;
EXPORT_PER_CPU_SYMBOL(kernel_stack);

DEFINE_PER_CPU(char *, irq_stack_ptr) =
	init_per_cpu_var(irq_stack_union.irq_stack) + IRQ_STACK_SIZE - 64;

DEFINE_PER_CPU(unsigned int, irq_count) = -1;
Continuing with the key line of process kernel-stack initialization:

DEFINE_PER_CPU(unsigned long, kernel_stack) =
	(unsigned long)&init_thread_union - KERNEL_STACK_OFFSET + THREAD_SIZE;
//linux-2.6.32.63\arch\x86\kernel\init_task.c

/*
 * Initial task structure.
 *
 * All other task structs will be allocated on slabs in fork.c
 */
struct task_struct init_task = INIT_TASK(init_task);
EXPORT_SYMBOL(init_task);

/*
 * Initial thread structure.
 *
 * We need to make sure that this is THREAD_SIZE aligned due to the
 * way process stacks are handled. This is done by having a special
 * "init_task" linker map entry..
 */
union thread_union init_thread_union __init_task_data = {
	INIT_THREAD_INFO(init_task)
};
\linux-2.6.32.63\include\linux\init_task.h
/* * INIT_TASK is used to set up the first task table, touch at * your own risk!. Base=0, limit=0x1fffff (=2MB) */ #define INIT_TASK(tsk) \ { \ .state = 0, \ .stack = &init_thread_info, \ .usage = ATOMIC_INIT(2), \ .flags = PF_KTHREAD, \ .lock_depth = -1, \ .prio = MAX_PRIO-20, \ .static_prio = MAX_PRIO-20, \ .normal_prio = MAX_PRIO-20, \ .policy = SCHED_NORMAL, \ .cpus_allowed = CPU_MASK_ALL, \ .mm = NULL, \ .active_mm = &init_mm, \ .se = { \ .group_node = LIST_HEAD_INIT(tsk.se.group_node), \ }, \ .rt = { \ .run_list = LIST_HEAD_INIT(tsk.rt.run_list), \ .time_slice = HZ, \ .nr_cpus_allowed = NR_CPUS, \ }, \ .tasks = LIST_HEAD_INIT(tsk.tasks), \ .pushable_tasks = PLIST_NODE_INIT(tsk.pushable_tasks, MAX_PRIO), \ .ptraced = LIST_HEAD_INIT(tsk.ptraced), \ .ptrace_entry = LIST_HEAD_INIT(tsk.ptrace_entry), \ .real_parent = &tsk, \ .parent = &tsk, \ .children = LIST_HEAD_INIT(tsk.children), \ .sibling = LIST_HEAD_INIT(tsk.sibling), \ .group_leader = &tsk, \ .real_cred = &init_cred, \ .cred = &init_cred, \ .cred_guard_mutex = \ __MUTEX_INITIALIZER(tsk.cred_guard_mutex), \ .comm = "swapper", \ .thread = INIT_THREAD, \ .fs = &init_fs, \ .files = &init_files, \ .signal = &init_signals, \ .sighand = &init_sighand, \ .nsproxy = &init_nsproxy, \ .pending = { \ .list = LIST_HEAD_INIT(tsk.pending.list), \ .signal = {{0}}}, \ .blocked = {{0}}, \ .alloc_lock = __SPIN_LOCK_UNLOCKED(tsk.alloc_lock), \ .journal_info = NULL, \ .cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \ .fs_excl = ATOMIC_INIT(0), \ .pi_lock = __SPIN_LOCK_UNLOCKED(tsk.pi_lock), \ .timer_slack_ns = 50000, /* 50 usec default slack */ \ .pids = { \ [PIDTYPE_PID] = INIT_PID_LINK(PIDTYPE_PID), \ [PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID), \ [PIDTYPE_SID] = INIT_PID_LINK(PIDTYPE_SID), \ }, \ .dirties = INIT_PROP_LOCAL_SINGLE(dirties), \ INIT_IDS \ INIT_PERF_EVENTS(tsk) \ INIT_TRACE_IRQFLAGS \ INIT_LOCKDEP \ INIT_FTRACE_GRAPH \ INIT_TRACE_RECURSION \ INIT_TASK_RCU_PREEMPT(tsk) \ }
Next, let's look at the data structures closely related to process information.
\linux-2.6.32.63\include\linux\sched.h
/* THREAD_SIZE is usually defined as 4K on 32-bit platforms, so stack is
   4KB: all the space the initial task owns in the kernel. Subtracting the
   space taken by thread_info and KERNEL_STACK_OFFSET leaves the stack the
   task actually has in the kernel.
   KERNEL_STACK_OFFSET is defined as 5*8 (40 bytes), a small area reserved
   at the top of the stack for the running code's environment. */
union thread_union {
	struct thread_info thread_info;
	unsigned long stack[THREAD_SIZE/sizeof(long)];
};
At this point, let's summarize:
1. In Linux, each process is given its own region of kernel memory for its kernel stack, carved out like a slice from a shared cake.
2. The region each process gets is a thread_union, and it is split into two parts:
1) the low addresses hold the thread_info
2) the remaining high addresses hold the process's kernel stack (stack)
3. The struct thread_info member holds the current process's information, so fundamentally the current macro is nothing mysterious: it is essentially an address calculation on the kernel stack.
Relevant Link:
http://www.pagefault.info/?p=36 http://www.cnblogs.com/justinzhang/archive/2011/07/18/2109923.html
By merely examining the value of the kernel stack pointer, with essentially no memory access, the kernel can derive the address of the task_struct and then use it as if it were a global variable.
0x1: struct task_struct
struct task_struct { /* 1. state: 進程執行時,它會根據具體狀況改變狀態。進程狀態是進程調度和對換的依據。Linux中的進程主要有以下狀態: 1) TASK_RUNNING: 可運行 處於這種狀態的進程,只有兩種狀態: 1.1) 正在運行 正在運行的進程就是當前進程(由current所指向的進程) 1.2) 正準備運行 準備運行的進程只要獲得CPU就能夠當即投入運行,CPU是這些進程惟一等待的系統資源,系統中有一個運行隊列(run_queue),用來容納全部處於可運行狀態的進程,調度程序執行時,從中選擇一個進程投入運行 2) TASK_INTERRUPTIBLE: 可中斷的等待狀態,是針對等待某事件或其餘資源的睡眠進程設置的,在內核發送信號給該進程代表事件已經發生時,進程狀態變爲TASK_RUNNING,它只要調度器選中該進程便可恢復執行 3) TASK_UNINTERRUPTIBLE: 不可中斷的等待狀態 處於該狀態的進程正在等待某個事件(event)或某個資源,它確定位於系統中的某個等待隊列(wait_queue)中,處於不可中斷等待態的進程是由於硬件環境不能知足而等待,例如等待特定的系統資源,它任何狀況下都不能被打斷,只能用特定的方式來喚醒它,例如喚醒函數wake_up()等 它們不能由外部信號喚醒,只能由內核親自喚醒
4) TASK_ZOMBIE: zombie. The process has terminated, but for some reason its parent has not yet called the wait() system call, so the terminated process's information has not been reclaimed. As the name suggests, a process in this state is a dead process; it is effectively garbage in the system and must be processed so that the resources it occupies are released. 5) TASK_STOPPED: stopped. The process has temporarily stopped running in order to receive some special handling, usually after receiving SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU. A process being debugged, for example, is in this state.
6) TASK_TRACED
從本質上來講,這屬於TASK_STOPPED狀態,用於從中止的進程中,將當前被調試的進程與常規的進程區分開來
7) TASK_DEAD
Once the parent has issued the wait() system call and the child exits, the parent reclaims all of the child's remaining resources and the child enters the TASK_DEAD state.
8) TASK_SWAPPING: 換入/換出 */ volatile long state; /* 2. stack 進程內核棧,進程經過alloc_thread_info函數分配它的內核棧,經過free_thread_info函數釋放所分配的內核棧 */ void *stack; /* 3. usage 進程描述符使用計數,被置爲2時,表示進程描述符正在被使用並且其相應的進程處於活動狀態 */ atomic_t usage; /* 4. flags flags是進程當前的狀態標誌(注意和運行狀態區分) 1) #define PF_ALIGNWARN 0x00000001: 顯示內存地址未對齊警告 2) #define PF_PTRACED 0x00000010: 標識是不是否調用了ptrace 3) #define PF_TRACESYS 0x00000020: 跟蹤系統調用 4) #define PF_FORKNOEXEC 0x00000040: 已經完成fork,但尚未調用exec 5) #define PF_SUPERPRIV 0x00000100: 使用超級用戶(root)權限 6) #define PF_DUMPCORE 0x00000200: dumped core 7) #define PF_SIGNALED 0x00000400: 此進程因爲其餘進程發送相關信號而被殺死 8) #define PF_STARTING 0x00000002: 當前進程正在被建立 9) #define PF_EXITING 0x00000004: 當前進程正在關閉 10) #define PF_USEDFPU 0x00100000: Process used the FPU this quantum(SMP only) #define PF_DTRACE 0x00200000: delayed trace (used on m68k) */ unsigned int flags; /* 5. ptrace ptrace系統調用,成員ptrace被設置爲0時表示不須要被跟蹤,它的可能取值以下: linux-2.6.38.8/include/linux/ptrace.h 1) #define PT_PTRACED 0x00000001 2) #define PT_DTRACE 0x00000002: delayed trace (used on m68k, i386) 3) #define PT_TRACESYSGOOD 0x00000004 4) #define PT_PTRACE_CAP 0x00000008: ptracer can follow suid-exec 5) #define PT_TRACE_FORK 0x00000010 6) #define PT_TRACE_VFORK 0x00000020 7) #define PT_TRACE_CLONE 0x00000040 8) #define PT_TRACE_EXEC 0x00000080 9) #define PT_TRACE_VFORK_DONE 0x00000100 10) #define PT_TRACE_EXIT 0x00000200 */ unsigned int ptrace; unsigned long ptrace_message; siginfo_t *last_siginfo; /* 6. lock_depth 用於表示獲取大內核鎖的次數,若是進程未得到過鎖,則置爲-1 */ int lock_depth; /* 7. oncpu 在SMP上幫助實現無加鎖的進程切換(unlocked context switches) */ #ifdef CONFIG_SMP #ifdef __ARCH_WANT_UNLOCKED_CTXSW int oncpu; #endif #endif /* 8. 
進程調度 1) prio: 調度器考慮的優先級保存在prio,因爲在某些狀況下內核須要暫時提升進程的優先級,所以須要第三個成員來表示(除了static_prio、normal_prio以外),因爲這些改變不是持久的,所以靜態(static_prio)和普通(normal_prio)優先級不受影響 2) static_prio: 用於保存進程的"靜態優先級",靜態優先級是進程"啓動"時分配的優先級,它能夠用nice、sched_setscheduler系統調用修改,不然在進程運行期間會一直保持恆定 3) normal_prio: 表示基於進程的"靜態優先級"和"調度策略"計算出的優先級,所以,即便普通進程和實時進程具備相同的靜態優先級(static_prio),其普通優先級(normal_prio)也是不一樣的。進程分支時(fork),新建立的子進程會集成普通優先級 */ int prio, static_prio, normal_prio; /* 4) rt_priority: 表示實時進程的優先級,須要明白的是,"實時進程優先級"和"普通進程優先級"有兩個獨立的範疇,實時進程即便是最低優先級也高於普通進程,最低的實時優先級爲0,最高的優先級爲99,值越大,代表優先級越高 */ unsigned int rt_priority; /* 5) sched_class: 該進程所屬的調度類,目前內核中有實現如下四種: 5.1) static const struct sched_class fair_sched_class; 5.2) static const struct sched_class rt_sched_class; 5.3) static const struct sched_class idle_sched_class; 5.4) static const struct sched_class stop_sched_class; */ const struct sched_class *sched_class; /* 6) se: 用於普通進程的調用實體
調度器不限於調度進程,還能夠處理更大的實體,這能夠實現"組調度",可用的CPU時間能夠首先在通常的進程組(例如全部進程能夠按全部者分組)之間分配,接下來分配的時間在組內再次分配
這種通常性要求調度器不直接操做進程,而是處理"可調度實體",一個實體有sched_entity的一個實例標識
在最簡單的狀況下,調度在各個進程上執行,因爲調度器設計爲處理可調度的實體,在調度器看來各個進程也必須也像這樣的實體,所以se在task_struct中內嵌了一個sched_entity實例,調度器可據此操做各個task_struct */ struct sched_entity se; /* 7) rt: 用於實時進程的調用實體 */ struct sched_rt_entity rt; #ifdef CONFIG_PREEMPT_NOTIFIERS /* 9. preempt_notifier preempt_notifiers結構體鏈表 */ struct hlist_head preempt_notifiers; #endif /* 10. fpu_counter FPU使用計數 */ unsigned char fpu_counter; #ifdef CONFIG_BLK_DEV_IO_TRACE /* 11. btrace_seq blktrace是一個針對Linux內核中塊設備I/O層的跟蹤工具 */ unsigned int btrace_seq; #endif /* 12. policy policy表示進程的調度策略,目前主要有如下五種: 1) #define SCHED_NORMAL 0: 用於普通進程,它們經過徹底公平調度器來處理 2) #define SCHED_FIFO 1: 先來先服務調度,由實時調度類處理 3) #define SCHED_RR 2: 時間片輪轉調度,由實時調度類處理 4) #define SCHED_BATCH 3: 用於非交互、CPU使用密集的批處理進程,經過徹底公平調度器來處理,調度決策對此類進程給與"冷處理",它們毫不會搶佔CFS調度器處理的另外一個進程,所以不會干擾交互式進程,若是不打算用nice下降進程的靜態優先級,同時又不但願該進程影響系統的交互性,最適合用該調度策略 5) #define SCHED_IDLE 5: 可用於次要的進程,其相對權重老是最小的,也經過徹底公平調度器來處理。要注意的是,SCHED_IDLE不負責調度空閒進程,空閒進程由內核提供單獨的機制來處理 只有root用戶能經過sched_setscheduler()系統調用來改變調度策略 */ unsigned int policy; /* 13. cpus_allowed cpus_allowed是一個位域,在多處理器系統上使用,用於控制進程能夠在哪裏處理器上運行 */ cpumask_t cpus_allowed; /* 14. RCU同步原語 */ #ifdef CONFIG_TREE_PREEMPT_RCU int rcu_read_lock_nesting; char rcu_read_unlock_special; struct rcu_node *rcu_blocked_node; struct list_head rcu_node_entry; #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */ #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) /* 15. sched_info 用於調度器統計進程的運行信息 */ struct sched_info sched_info; #endif /* 16. tasks 經過list_head將當前進程的task_struct串聯進內核的進程列表中,構建;linux進程鏈表 */ struct list_head tasks; /* 17. pushable_tasks limit pushing to one attempt */ struct plist_node pushable_tasks; /* 18. 進程地址空間 1) mm: 指向進程所擁有的內存描述符 2) active_mm: active_mm指向進程運行時所使用的內存描述符 對於普通進程而言,這兩個指針變量的值相同。可是,內核線程不擁有任何內存描述符,因此它們的mm成員老是爲NULL。當內核線程得以運行時,它的active_mm成員被初始化爲前一個運行進程的active_mm值 */ struct mm_struct *mm, *active_mm; /* 19. exit_state 進程退出狀態碼 */ int exit_state; /* 20. 
判斷標誌 1) exit_code exit_code用於設置進程的終止代號,這個值要麼是_exit()或exit_group()系統調用參數(正常終止),要麼是由內核提供的一個錯誤代號(異常終止) 2) exit_signal exit_signal被置爲-1時表示是某個線程組中的一員。只有當線程組的最後一個成員終止時,纔會產生一個信號,以通知線程組的領頭進程的父進程 */ int exit_code, exit_signal; /* 3) pdeath_signal pdeath_signal用於判斷父進程終止時發送信號 */ int pdeath_signal; /* 4) personality用於處理不一樣的ABI,它的可能取值以下: enum { PER_LINUX = 0x0000, PER_LINUX_32BIT = 0x0000 | ADDR_LIMIT_32BIT, PER_LINUX_FDPIC = 0x0000 | FDPIC_FUNCPTRS, PER_SVR4 = 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO, PER_SVR3 = 0x0002 | STICKY_TIMEOUTS | SHORT_INODE, PER_SCOSVR3 = 0x0003 | STICKY_TIMEOUTS | WHOLE_SECONDS | SHORT_INODE, PER_OSR5 = 0x0003 | STICKY_TIMEOUTS | WHOLE_SECONDS, PER_WYSEV386 = 0x0004 | STICKY_TIMEOUTS | SHORT_INODE, PER_ISCR4 = 0x0005 | STICKY_TIMEOUTS, PER_BSD = 0x0006, PER_SUNOS = 0x0006 | STICKY_TIMEOUTS, PER_XENIX = 0x0007 | STICKY_TIMEOUTS | SHORT_INODE, PER_LINUX32 = 0x0008, PER_LINUX32_3GB = 0x0008 | ADDR_LIMIT_3GB, PER_IRIX32 = 0x0009 | STICKY_TIMEOUTS, PER_IRIXN32 = 0x000a | STICKY_TIMEOUTS, PER_IRIX64 = 0x000b | STICKY_TIMEOUTS, PER_RISCOS = 0x000c, PER_SOLARIS = 0x000d | STICKY_TIMEOUTS, PER_UW7 = 0x000e | STICKY_TIMEOUTS | MMAP_PAGE_ZERO, PER_OSF4 = 0x000f, PER_HPUX = 0x0010, PER_MASK = 0x00ff, }; */ unsigned int personality; /* 5) did_exec did_exec用於記錄進程代碼是否被execve()函數所執行 */ unsigned did_exec:1; /* 6) in_execve in_execve用於通知LSM是否被do_execve()函數所調用 */ unsigned in_execve:1; /* 7) in_iowait in_iowait用於判斷是否進行iowait計數 */ unsigned in_iowait:1; /* 8) sched_reset_on_fork sched_reset_on_fork用於判斷是否恢復默認的優先級或調度策略 */ unsigned sched_reset_on_fork:1; /* 21. 進程標識符(PID) 在CONFIG_BASE_SMALL配置爲0的狀況下,PID的取值範圍是0到32767,即系統中的進程數最大爲32768個 #define PID_MAX_DEFAULT (CONFIG_BASE_SMALL ? 0x1000 : 0x8000) 在Linux系統中,一個線程組中的全部線程使用和該線程組的領頭線程(該組中的第一個輕量級進程)相同的PID,並被存放在tgid成員中。只有線程組的領頭線程的pid成員纔會被設置爲與tgid相同的值。注意,getpid()系統調用
返回的是當前進程的tgid值而不是pid值。 */ pid_t pid; pid_t tgid; #ifdef CONFIG_CC_STACKPROTECTOR /* 22. stack_canary 防止內核堆棧溢出,在GCC編譯內核時,須要加上-fstack-protector選項 */ unsigned long stack_canary; #endif /* 23. 表示進程親屬關係的成員 1) real_parent: 指向其父進程,若是建立它的父進程再也不存在,則指向PID爲1的init進程 2) parent: 指向其父進程,當它終止時,必須向它的父進程發送信號。它的值一般與real_parent相同 */ struct task_struct *real_parent; struct task_struct *parent; /* 3) children: 表示鏈表的頭部,鏈表中的全部元素都是它的子進程(子進程鏈表) 4) sibling: 用於把當前進程插入到兄弟鏈表中(鏈接到父進程的子進程鏈表(兄弟鏈表)) 5) group_leader: 指向其所在進程組的領頭進程 */ struct list_head children; struct list_head sibling; struct task_struct *group_leader; struct list_head ptraced; struct list_head ptrace_entry; struct bts_context *bts; /* 24. pids PID散列表和鏈表 */ struct pid_link pids[PIDTYPE_MAX]; /* 25. thread_group 線程組中全部進程的鏈表 */ struct list_head thread_group; /* 26. do_fork函數 1) vfork_done 在執行do_fork()時,若是給定特別標誌,則vfork_done會指向一個特殊地址 2) set_child_tid、clear_child_tid 若是copy_process函數的clone_flags參數的值被置爲CLONE_CHILD_SETTID或CLONE_CHILD_CLEARTID,則會把child_tidptr參數的值分別複製到set_child_tid和clear_child_tid成員。這些標誌說明必須改變子
進程用戶態地址空間的child_tidptr所指向的變量的值。 */ struct completion *vfork_done; int __user *set_child_tid; int __user *clear_child_tid; /* 27. 記錄進程的I/O計數(時間) 1) utime 用於記錄進程在"用戶態"下所通過的節拍數(定時器) 2) stime 用於記錄進程在"內核態"下所通過的節拍數(定時器) 3) utimescaled 用於記錄進程在"用戶態"的運行時間,但它們以處理器的頻率爲刻度 4) stimescaled 用於記錄進程在"內核態"的運行時間,但它們以處理器的頻率爲刻度 */ cputime_t utime, stime, utimescaled, stimescaled; /* 5) gtime 以節拍計數的虛擬機運行時間(guest time) */ cputime_t gtime; /* 6) prev_utime、prev_stime是先前的運行時間 */ cputime_t prev_utime, prev_stime; /* 7) nvcsw 自願(voluntary)上下文切換計數 8) nivcsw 非自願(involuntary)上下文切換計數 */ unsigned long nvcsw, nivcsw; /* 9) start_time 進程建立時間 10) real_start_time 進程睡眠時間,還包含了進程睡眠時間,經常使用於/proc/pid/stat, */ struct timespec start_time; struct timespec real_start_time; /* 11) cputime_expires 用來統計進程或進程組被跟蹤的處理器時間,其中的三個成員對應着cpu_timers[3]的三個鏈表 */ struct task_cputime cputime_expires; struct list_head cpu_timers[3]; #ifdef CONFIG_DETECT_HUNG_TASK /* 12) last_switch_count nvcsw和nivcsw的總和 */ unsigned long last_switch_count; #endif struct task_io_accounting ioac; #if defined(CONFIG_TASK_XACCT) u64 acct_rss_mem1; u64 acct_vm_mem1; cputime_t acct_timexpd; #endif /* 28. 缺頁統計 */ unsigned long min_flt, maj_flt; /* 29. 進程權能 */ const struct cred *real_cred; const struct cred *cred; struct mutex cred_guard_mutex; struct cred *replacement_session_keyring; /* 30. comm[TASK_COMM_LEN] 相應的程序名 */ char comm[TASK_COMM_LEN]; /* 31. 文件 1) fs 用來表示進程與文件系統的聯繫,包括當前目錄和根目錄 2) files 表示進程當前打開的文件 */ int link_count, total_link_count; struct fs_struct *fs; struct files_struct *files; #ifdef CONFIG_SYSVIPC /* 32. sysvsem 進程通訊(SYSVIPC) */ struct sysv_sem sysvsem; #endif /* 33. 處理器特有數據 */ struct thread_struct thread; /* 34. nsproxy 命名空間 */ struct nsproxy *nsproxy; /* 35. 
信號處理 1) signal: 指向進程的信號描述符 2) sighand: 指向進程的信號處理程序描述符 */ struct signal_struct *signal; struct sighand_struct *sighand; /* 3) blocked: 表示被阻塞信號的掩碼 4) real_blocked: 表示臨時掩碼 */ sigset_t blocked, real_blocked; sigset_t saved_sigmask; /* 5) pending: 存放私有掛起信號的數據結構 */ struct sigpending pending; /* 6) sas_ss_sp: 信號處理程序備用堆棧的地址 7) sas_ss_size: 表示堆棧的大小 */ unsigned long sas_ss_sp; size_t sas_ss_size; /* 8) notifier 設備驅動程序經常使用notifier指向的函數來阻塞進程的某些信號 9) otifier_data 指的是notifier所指向的函數可能使用的數據。 10) otifier_mask 標識這些信號的位掩碼 */ int (*notifier)(void *priv); void *notifier_data; sigset_t *notifier_mask; /* 36. 進程審計 */ struct audit_context *audit_context; #ifdef CONFIG_AUDITSYSCALL uid_t loginuid; unsigned int sessionid; #endif /* 37. secure computing */ seccomp_t seccomp; /* 38. 用於copy_process函數使用CLONE_PARENT標記時 */ u32 parent_exec_id; u32 self_exec_id; /* 39. alloc_lock 用於保護資源分配或釋放的自旋鎖 */ spinlock_t alloc_lock; /* 40. 中斷 */ #ifdef CONFIG_GENERIC_HARDIRQS struct irqaction *irqaction; #endif #ifdef CONFIG_TRACE_IRQFLAGS unsigned int irq_events; int hardirqs_enabled; unsigned long hardirq_enable_ip; unsigned int hardirq_enable_event; unsigned long hardirq_disable_ip; unsigned int hardirq_disable_event; int softirqs_enabled; unsigned long softirq_disable_ip; unsigned int softirq_disable_event; unsigned long softirq_enable_ip; unsigned int softirq_enable_event; int hardirq_context; int softirq_context; #endif /* 41. pi_lock task_rq_lock函數所使用的鎖 */ spinlock_t pi_lock; #ifdef CONFIG_RT_MUTEXES /* 42. 基於PI協議的等待互斥鎖,其中PI指的是priority inheritance/9優先級繼承) */ struct plist_head pi_waiters; struct rt_mutex_waiter *pi_blocked_on; #endif #ifdef CONFIG_DEBUG_MUTEXES /* 43. blocked_on 死鎖檢測 */ struct mutex_waiter *blocked_on; #endif /* 44. lockdep, */ #ifdef CONFIG_LOCKDEP # define MAX_LOCK_DEPTH 48UL u64 curr_chain_key; int lockdep_depth; unsigned int lockdep_recursion; struct held_lock held_locks[MAX_LOCK_DEPTH]; gfp_t lockdep_reclaim_gfp; #endif /* 45. journal_info JFS文件系統 */ void *journal_info; /* 46. 
塊設備鏈表 */ struct bio *bio_list, **bio_tail; /* 47. reclaim_state 內存回收 */ struct reclaim_state *reclaim_state; /* 48. backing_dev_info 存放塊設備I/O數據流量信息 */ struct backing_dev_info *backing_dev_info; /* 49. io_context I/O調度器所使用的信息 */ struct io_context *io_context; /* 50. CPUSET功能 */ #ifdef CONFIG_CPUSETS nodemask_t mems_allowed; int cpuset_mem_spread_rotor; #endif /* 51. Control Groups */ #ifdef CONFIG_CGROUPS struct css_set *cgroups; struct list_head cg_list; #endif /* 52. robust_list futex同步機制 */ #ifdef CONFIG_FUTEX struct robust_list_head __user *robust_list; #ifdef CONFIG_COMPAT struct compat_robust_list_head __user *compat_robust_list; #endif struct list_head pi_state_list; struct futex_pi_state *pi_state_cache; #endif #ifdef CONFIG_PERF_EVENTS struct perf_event_context *perf_event_ctxp; struct mutex perf_event_mutex; struct list_head perf_event_list; #endif /* 53. 非一致內存訪問(NUMA Non-Uniform Memory Access) */ #ifdef CONFIG_NUMA struct mempolicy *mempolicy; /* Protected by alloc_lock */ short il_next; #endif /* 54. fs_excl 文件系統互斥資源 */ atomic_t fs_excl; /* 55. rcu RCU鏈表 */ struct rcu_head rcu; /* 56. splice_pipe 管道 */ struct pipe_inode_info *splice_pipe; /* 57. delays 延遲計數 */ #ifdef CONFIG_TASK_DELAY_ACCT struct task_delay_info *delays; #endif /* 58. make_it_fail fault injection */ #ifdef CONFIG_FAULT_INJECTION int make_it_fail; #endif /* 59. dirties FLoating proportions */ struct prop_local_single dirties; /* 60. Infrastructure for displayinglatency */ #ifdef CONFIG_LATENCYTOP int latency_record_count; struct latency_record latency_record[LT_SAVECOUNT]; #endif /* 61. time slack values,經常使用於poll和select函數 */ unsigned long timer_slack_ns; unsigned long default_timer_slack_ns; /* 62. scm_work_list socket控制消息(control message) */ struct list_head *scm_work_list; /* 63. 
ftrace跟蹤器 */ #ifdef CONFIG_FUNCTION_GRAPH_TRACER int curr_ret_stack; struct ftrace_ret_stack *ret_stack; unsigned long long ftrace_timestamp; atomic_t trace_overrun; atomic_t tracing_graph_pause; #endif #ifdef CONFIG_TRACING unsigned long trace; unsigned long trace_recursion; #endif };
Relevant Link:

http://oss.org.cn/kernel-book/ch04/4.3.htm http://www.eecs.harvard.edu/~margo/cs161/videos/sched.h.html http://memorymyann.iteye.com/blog/235363 http://blog.csdn.net/hongchangfirst/article/details/7075026 http://oss.org.cn/kernel-book/ch04/4.4.2.htm http://blog.csdn.net/npy_lp/article/details/7335187 http://blog.csdn.net/npy_lp/article/details/7292563
0x2: struct cred
\linux-2.6.32.63\include\linux\cred.h
//holds the credential (permission) information of the current process
struct cred {
	atomic_t	usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
	atomic_t	subscribers;	/* number of processes subscribed */
	void		*put_addr;
	unsigned	magic;
#define CRED_MAGIC	0x43736564
#define CRED_MAGIC_DEAD	0x44656144
#endif
	uid_t		uid;		/* real UID of the task */
	gid_t		gid;		/* real GID of the task */
	uid_t		suid;		/* saved UID of the task */
	gid_t		sgid;		/* saved GID of the task */
	uid_t		euid;		/* effective UID of the task */
	gid_t		egid;		/* effective GID of the task */
	uid_t		fsuid;		/* UID for VFS ops */
	gid_t		fsgid;		/* GID for VFS ops */
	unsigned	securebits;	/* SUID-less security management */
	kernel_cap_t	cap_inheritable; /* caps our children can inherit */
	kernel_cap_t	cap_permitted;	/* caps we're permitted */
	kernel_cap_t	cap_effective;	/* caps we can actually use */
	kernel_cap_t	cap_bset;	/* capability bounding set */
#ifdef CONFIG_KEYS
	unsigned char	jit_keyring;	/* default keyring to attach requested keys to */
	struct key	*thread_keyring; /* keyring private to this thread */
	struct key	*request_key_auth; /* assumed request_key authority */
	struct thread_group_cred *tgcred; /* thread-group shared credentials */
#endif
#ifdef CONFIG_SECURITY
	void		*security;	/* subjective LSM security */
#endif
	struct user_struct *user;	/* real user ID subscription */
	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
	struct rcu_head	rcu;		/* RCU deletion hook */
};
0x3: struct pid_link
/* PID/PID hash table linkage. */ struct pid_link pids[PIDTYPE_MAX];
/include/linux/pid.h
enum pid_type
{
	PIDTYPE_PID,
	PIDTYPE_PGID,
	PIDTYPE_SID,
	PIDTYPE_MAX
};
The structure is defined as follows:
struct pid_link
{
	struct hlist_node node;
	struct pid *pid;
};
/include/linux/types.h
struct hlist_node {
	struct hlist_node *next, **pprev;
};
0x4: struct pid
struct pid
{
	//1. reference count on this structure
	atomic_t count;
	/* 2. the depth of this pid in the pid_namespace hierarchy
	   1) level == 0 means the global namespace, i.e. the top level */
	unsigned int level;
	/* lists of tasks that use this pid */
	//3. each tasks[i] heads a hash chain; e.g. tasks[PIDTYPE_PID] chains tasks by PID
	struct hlist_head tasks[PIDTYPE_MAX];
	//4. RCU callback head
	struct rcu_head rcu;
	/* 5. numbers[] holds the upid for each pid_namespace this PID belongs to;
	   a PID can belong to several namespaces:
	   1) numbers[0] is the global namespace
	   2) numbers[i] is the i-th level namespace
	   3) a larger i means a deeper, lower-level namespace
	   It is declared with a single element (the flexible-array idiom); when
	   nested namespaces are in use, it is allocated with one upid per level. */
	struct upid numbers[1];
};
Relevant Link:
http://blog.csdn.net/zhanglei4214/article/details/6765913
0x5: struct signal_struct
/* NOTE! "signal_struct" does not have it's own locking, because a shared signal_struct always implies a shared sighand_struct, so locking sighand_struct is always a proper superset of the locking of signal_struct. */ struct signal_struct { atomic_t count; atomic_t live; /* for wait4() */ wait_queue_head_t wait_chldexit; /* current thread group signal load-balancing target: */ struct task_struct *curr_target; /* shared signal handling: */ struct sigpending shared_pending; /* thread group exit support */ int group_exit_code; /* overloaded: notify group_exit_task when ->count is equal to notify_count,everyone except group_exit_task is stopped during signal delivery of fatal signals, group_exit_task processes the signal. */ int notify_count; struct task_struct *group_exit_task; /* thread group stop support, overloads group_exit_code too */ int group_stop_count; unsigned int flags; /* see SIGNAL_* flags below */ /* POSIX.1b Interval Timers */ struct list_head posix_timers; /* ITIMER_REAL timer for the process */ struct hrtimer real_timer; struct pid *leader_pid; ktime_t it_real_incr; /* ITIMER_PROF and ITIMER_VIRTUAL timers for the process, we use CPUCLOCK_PROF and CPUCLOCK_VIRT for indexing array as these values are defined to 0 and 1 respectively */ struct cpu_itimer it[2]; /* Thread group totals for process CPU timers. See thread_group_cputimer(), et al, for details. */ struct thread_group_cputimer cputimer; /* Earliest-expiration cache. */ struct task_cputime cputime_expires; struct list_head cpu_timers[3]; struct pid *tty_old_pgrp; /* boolean value for session group leader */ int leader; struct tty_struct *tty; /* NULL if no tty */ /* Cumulative resource counters for dead threads in the group, and for reaped dead child processes forked by this group. Live threads maintain their own counters and add to these in __exit_signal, except for the group leader. 
*/ cputime_t utime, stime, cutime, cstime; cputime_t gtime; cputime_t cgtime; #ifndef CONFIG_VIRT_CPU_ACCOUNTING cputime_t prev_utime, prev_stime; #endif unsigned long nvcsw, nivcsw, cnvcsw, cnivcsw; unsigned long min_flt, maj_flt, cmin_flt, cmaj_flt; unsigned long inblock, oublock, cinblock, coublock; unsigned long maxrss, cmaxrss; struct task_io_accounting ioac; /* Cumulative ns of schedule CPU time fo dead threads in the group, not including a zombie group leader, (This only differs from jiffies_to_ns(utime + stime) if sched_clock uses something other than jiffies.) */ unsigned long long sum_sched_runtime; /* We don't bother to synchronize most readers of this at all, because there is no reader checking a limit that actually needs to get both rlim_cur and rlim_max atomically, and either one alone is a single word that can safely be read normally. getrlimit/setrlimit use task_lock(current->group_leader) to protect this instead of the siglock, because they really have no need to disable irqs. struct rlimit { rlim_t rlim_cur; //Soft limit(軟限制): 進程當前的資源限制 rlim_t rlim_max; //Hard limit(硬限制): 該限制的最大允許值(ceiling for rlim_cur) }; rlim是一個數組,其中每一項保存了一種類型的資源限制,RLIM_NLIMITS表示資源限制類型的類型數量 要說明的是,hard limit只針對非特權進程,也就是進程的有效用戶ID(effective user ID)不是0的進程 */ struct rlimit rlim[RLIM_NLIMITS]; #ifdef CONFIG_BSD_PROCESS_ACCT struct pacct_struct pacct; /* per-process accounting information */ #endif #ifdef CONFIG_TASKSTATS struct taskstats *stats; #endif #ifdef CONFIG_AUDIT unsigned audit_tty; struct tty_audit_buf *tty_audit_buf; #endif int oom_adj; /* OOM kill score adjustment (bit shift) */ };
Relevant Link:
http://blog.csdn.net/walkingman321/article/details/6167435
0x6: struct rlimit
\linux-2.6.32.63\include\linux\resource.h
struct rlimit {
	unsigned long rlim_cur;	//soft limit: the process's current resource limit
	unsigned long rlim_max;	//hard limit: the maximum allowed value (ceiling for rlim_cur)
};
Linux provides the resource-limit (rlimit) mechanism to cap a process's use of system resources; it is built on the rlim array (found in struct signal_struct in 2.6.32, having lived in task_struct in older kernels).
The position in the rlim array identifies the type of resource being limited, which is why the kernel defines preprocessor constants associating each resource with its slot. The constants and their meanings:
1. RLIMIT_CPU: CPU time in ms — the maximum amount of CPU time (in seconds); SIGXCPU is sent to the process when the soft limit is exceeded
2. RLIMIT_FSIZE: maximum file size — the largest file, in bytes, the process may create; SIGXFSZ is sent when the soft limit is exceeded
3. RLIMIT_DATA: maximum size of the data segment, in bytes
4. RLIMIT_STACK: maximum stack size
5. RLIMIT_CORE: maximum core file size — 0 disables core files; a nonzero value caps the size of any core file produced
6. RLIMIT_RSS: maximum resident set size (RSS) in bytes — if physical memory is scarce, the kernel reclaims pages from processes exceeding their RSS
7. RLIMIT_NPROC: maximum number of child processes per real user ID; changing this limit affects the value sysconf returns for _SC_CHILD_MAX
8. RLIMIT_NOFILE: maximum number of open files per process; changing this limit affects the value sysconf returns for _SC_OPEN_MAX
9. RLIMIT_MEMLOCK: maximum locked-in-memory address space — the maximum number of bytes of virtual memory that may be locked into RAM using mlock() and mlockall(), i.e. the maximum number of unswappable pages
10. RLIMIT_AS: maximum address-space size in bytes — the maximum size of the process's virtual memory (address space). This limit affects calls to brk(2), mmap(2), and mremap(2), which fail with ENOMEM upon exceeding it; automatic stack expansion will also fail (generating a SIGSEGV that kills the process when no alternate stack is available). Since the value is a long, on machines with a 32-bit long this limit is at most 2 GiB, or the resource is unlimited.
11. RLIMIT_LOCKS: maximum number of file locks held
12. RLIMIT_SIGPENDING: maximum number of pending signals
13. RLIMIT_MSGQUEUE: maximum number of bytes in POSIX message queues
14. RLIMIT_NICE: maximum nice priority a non-real-time process is allowed to raise to
15. RLIMIT_RTPRIO: maximum real-time priority
Because the limits touch many different parts of the kernel, the kernel must verify that each subsystem honors them. Note that when no limit is imposed on a resource class (the default for almost all resources), rlim_max is set to RLIM_INFINITY. The exceptions include:

1. The number of open files (RLIMIT_NOFILE): limited to 1024 by default.
2. The maximum number of processes per user (RLIMIT_NPROC): defined as max_threads / 2, where max_threads is a global variable specifying how many processes can be created when one eighth of available memory is devoted to thread-management information (a minimum memory footprint of 20 threads is assumed up front in the calculation).
init is a special process in Linux; its resource limits take effect as early as system boot:
\linux-2.6.32.63\include\asm-generic\resource.h
/*
 * boot-time rlimit defaults for the init task:
 */
#define INIT_RLIMITS                                            \
{                                                               \
    [RLIMIT_CPU]        = { RLIM_INFINITY, RLIM_INFINITY },     \
    [RLIMIT_FSIZE]      = { RLIM_INFINITY, RLIM_INFINITY },     \
    [RLIMIT_DATA]       = { RLIM_INFINITY, RLIM_INFINITY },     \
    [RLIMIT_STACK]      = { _STK_LIM, _STK_LIM_MAX },           \
    [RLIMIT_CORE]       = { 0, RLIM_INFINITY },                 \
    [RLIMIT_RSS]        = { RLIM_INFINITY, RLIM_INFINITY },     \
    [RLIMIT_NPROC]      = { 0, 0 },                             \
    [RLIMIT_NOFILE]     = { INR_OPEN, INR_OPEN },               \
    [RLIMIT_MEMLOCK]    = { MLOCK_LIMIT, MLOCK_LIMIT },         \
    [RLIMIT_AS]         = { RLIM_INFINITY, RLIM_INFINITY },     \
    [RLIMIT_LOCKS]      = { RLIM_INFINITY, RLIM_INFINITY },     \
    [RLIMIT_SIGPENDING] = { 0, 0 },                             \
    [RLIMIT_MSGQUEUE]   = { MQ_BYTES_MAX, MQ_BYTES_MAX },       \
    [RLIMIT_NICE]       = { 0, 0 },                             \
    [RLIMIT_RTPRIO]     = { 0, 0 },                             \
    [RLIMIT_RTTIME]     = { RLIM_INFINITY, RLIM_INFINITY },     \
}
In the proc filesystem, every process has a corresponding file from which its current rlimit values can be read:
cat /proc/self/limits
2. Queue/list objects in the kernel

1. singly-linked lists
2. singly-linked tail queues
3. doubly-linked lists
4. doubly-linked tail queues
1. 儘量的代碼重用,化大堆的鏈表設計爲單個鏈表 2. 在後面的學習中咱們會發現,內核中大部分都是"雙向循環鏈表",由於"雙向循環鏈表"的效率是最高的,找頭節點,尾節點,直接前驅,直接後繼時間複雜度都是O(1) ,而使用單鏈表,單向循環鏈表或其餘形式的鏈表是不能完成的。 3. 若是須要構造某類對象的特定列表,則在其結構中定義一個類型爲"list_head"指針的成員
linux-2.6.32.63\include\linux\list.h

struct list_head {
    struct list_head *next, *prev;
};

Objects of the type are chained together through this member to form the desired list, and are manipulated through generic list functions (the embedded list_head works like a hook that strings the host objects together). Under this design, kernel developers only need to write generic list functions to build and operate on lists of arbitrary objects, instead of a dedicated set of functions for every kind of list of every object type — achieving code reuse.
4. To create a list of a given type, embed a list_head variable in that type and use the list_head members and the corresponding helper functions to traverse the list.
Now that we know the basic list data structures, their design principles and how they fit together, the next question is: how does the kernel initialize and use these structures? How are the familiar kernel lists actually formed?
The Linux kernel pairs these list data structures with a matching set of helper macros and inline functions:
1. List initialization

1.1 LIST_HEAD_INIT

#define LIST_HEAD_INIT(name) { &(name), &(name) }

LIST_HEAD_INIT initializes the given list node by pointing both its head and tail pointers at itself.

1.2 LIST_HEAD

#define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)

As the code shows, LIST_HEAD defines the head of a doubly-linked list and calls LIST_HEAD_INIT to initialize it, pointing next and prev at itself. This is why Linux decides whether a list is empty by checking whether the head's next points back to the head itself.

1.3 INIT_LIST_HEAD(struct list_head *list)

Besides the compile-time static initialization done by the LIST_HEAD macro, the inline function INIT_LIST_HEAD(struct list_head *list) can initialize a list at runtime:

static inline void INIT_LIST_HEAD(struct list_head *list)
{
    list->next = list;
    list->prev = list;
}

Either way, a freshly created list head has both its next and prev pointers pointing at itself.

2. Testing whether a list is empty

2.1 list_empty(const struct list_head *head)

static inline int list_empty(const struct list_head *head)
{
    return head->next == head;
}

2.2 list_empty_careful(const struct list_head *head)

The difference from list_empty() is the check used: it tests whether both the node after the head and the node before it are the head itself, returning 1 only when both conditions hold. This is meant to cope with another CPU operating on the same list and leaving next and prev momentarily inconsistent. As the code comment admits, the safety it provides is limited: unless the only concurrent list operation on other CPUs is list_del_init(), it is still not safe — in other words, locking is still required.

static inline int list_empty_careful(const struct list_head *head)
{
    struct list_head *next = head->next;
    return (next == head) && (next == head->prev);
}

3. Insertion

3.1 list_add(struct list_head *new, struct list_head *head)

Inserts a new node between head and head->next, i.e. head insertion (last in, first out — usable to implement a stack):

static inline void list_add(struct list_head *new, struct list_head *head)
{
    __list_add(new, head, head->next);
}

3.2 list_add_tail(struct list_head *new, struct list_head *head)

Inserts a new node between head->prev (the last node of the circular list) and head, i.e. tail insertion (first in, first out — usable to implement a queue):

static inline void list_add_tail(struct list_head *new, struct list_head *head)
{
    __list_add(new, head->prev, head);
}

#ifndef CONFIG_DEBUG_LIST
static inline void __list_add(struct list_head *new,
                              struct list_head *prev,
                              struct list_head *next)
{
    next->prev = new;
    new->next = next;
    new->prev = prev;
    prev->next = new;
}
#else
extern void __list_add(struct list_head *new,
                       struct list_head *prev,
                       struct list_head *next);
#endif

4. Deletion

4.1 list_del(struct list_head *entry)

#ifndef CONFIG_DEBUG_LIST
static inline void list_del(struct list_head *entry)
{
    /* __list_del(entry->prev, entry->next) links entry's predecessor and
       successor directly to each other, cutting out the middle element */
    __list_del(entry->prev, entry->next);
    /* list_del() then sets the removed node's next and prev to the special
       values LIST_POISON1 and LIST_POISON2, so that a node no longer on any
       list cannot be accessed: dereferencing either poison value triggers
       a page fault */
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
}
#else
extern void list_del(struct list_head *entry);
#endif

4.2 list_del_init(struct list_head *entry)

/* list_del_init first removes entry from the doubly-linked list and then
   reinitializes entry as an empty list.
   Note the distinction: list_del(entry) and list_del_init(entry) differ only
   in how they leave entry — the former leaves it unusable (poisoned), the
   latter leaves it as the head of a fresh empty list. */
static inline void list_del_init(struct list_head *entry)
{
    __list_del(entry->prev, entry->next);
    INIT_LIST_HEAD(entry);
}

5. Node replacement

Replacement substitutes the new node for the old one.

5.1 list_replace(struct list_head *old, struct list_head *new)

list_replace() only rewires the pointer relationships between new and old; old itself is not freed:

static inline void list_replace(struct list_head *old, struct list_head *new)
{
    new->next = old->next;
    new->next->prev = new;
    new->prev = old->prev;
    new->prev->next = new;
}

5.2 list_replace_init(struct list_head *old, struct list_head *new)

static inline void list_replace_init(struct list_head *old, struct list_head *new)
{
    list_replace(old, new);
    INIT_LIST_HEAD(old);
}

6. Splitting a list

6.1 list_cut_position(struct list_head *list, struct list_head *head, struct list_head *entry)

Cuts all nodes from head (exclusive) up to and including entry out of the list and adds them to list; once the function completes there are two lists, head and list:

static inline void list_cut_position(struct list_head *list,
                                     struct list_head *head,
                                     struct list_head *entry)
{
    if (list_empty(head))
        return;
    if (list_is_singular(head) && (head->next != entry && head != entry))
        return;
    if (entry == head)
        INIT_LIST_HEAD(list);
    else
        __list_cut_position(list, head, entry);
}

static inline void __list_cut_position(struct list_head *list,
                                       struct list_head *head,
                                       struct list_head *entry)
{
    struct list_head *new_first = entry->next;
    list->next = head->next;
    list->next->prev = list;
    list->prev = entry;
    entry->next = list;
    head->next = new_first;
    new_first->prev = head;
}

7. List traversal (the important part)

7.1 list_entry

A Linux list stores only the address of the list_head member embedded in each data item. The list_entry macro recovers the base address of the owning structure from that member (think of structure member offsets: only once the base address of the structure is known can the offset be applied to reach members and continue the traversal).

Here ptr is the list's head node; the macro yields the base address of the structure containing that head node (note: the head node, not the first element — to reach the first element you still have to advance one step):

#define list_entry(ptr, type, member) container_of(ptr, type, member)

7.2 list_first_entry

Here ptr is the list's head node; the macro yields the base address of the structure containing the list's first element:

#define list_first_entry(ptr, type, member) list_entry((ptr)->next, type, member)

7.3 list_for_each(pos, head)

Once the base address of the first element is known, element-by-element traversal can begin:

#define list_for_each(pos, head) \
    /* prefetch() prefetches memory contents: the program tells the CPU which
       data it will likely need soon, so the CPU pulls the operands into the
       cache ahead of time — an optimization that speeds up execution */ \
    for (pos = (head)->next; prefetch(pos->next), pos != (head); \
         pos = pos->next)

7.4 __list_for_each(pos, head)

__list_for_each does not use prefetch:

#define __list_for_each(pos, head) \
    for (pos = (head)->next; pos != (head); pos = pos->next)

7.5 list_for_each_prev(pos, head)

Implemented like list_for_each, except that it walks via each node's predecessor, traversing the list backwards:

#define list_for_each_prev(pos, head) \
    for (pos = (head)->prev; prefetch(pos->prev), pos != (head); \
         pos = pos->prev)

7.6 list_for_each_entry(pos, head, member)

Traverses using the addresses of the containing structures rather than the raw list_head nodes:

#define list_for_each_entry(pos, head, member)                          \
    for (pos = list_entry((head)->next, typeof(*pos), member);          \
         prefetch(pos->member.next), &pos->member != (head);            \
         pos = list_entry(pos->member.next, typeof(*pos), member))
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/version.h>
#include <linux/list.h>

MODULE_LICENSE("Dual BSD/GPL");

struct module *m = &__this_module;

static void list_module_test(void)
{
    struct module *mod;

    list_for_each_entry(mod, m->list.prev, list)
        printk("%s\n", mod->name);
}

static int list_module_init(void)
{
    list_module_test();
    return 0;
}

static void list_module_exit(void)
{
    printk("unload listmodule.ko\n");
}

module_init(list_module_init);
module_exit(list_module_exit);
Makefile
#
# Variables needed to build the kernel module
#
name = mod_ls

obj-m += $(name).o

all: build

.PHONY: build install clean

build:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules CONFIG_DEBUG_SECTION_MISMATCH=y

install: build
	-mkdir -p /lib/modules/`uname -r`/kernel/arch/x86/kernel/
	cp $(name).ko /lib/modules/`uname -r`/kernel/arch/x86/kernel/
	depmod /lib/modules/`uname -r`/kernel/arch/x86/kernel/$(name).ko

clean:
	[ -d /lib/modules/$(shell uname -r)/build ] && \
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
Build and load the module, then run dmesg | tail: our kernel code has used list_for_each_entry to walk the doubly-linked list of LKMs and print every module currently loaded in the kernel.
#include <linux/module.h>
#include <linux/init.h>
#include <linux/list.h>
#include <linux/sched.h>
#include <linux/time.h>
#include <linux/fs.h>
#include <asm/uaccess.h>
#include <linux/mm.h>

MODULE_AUTHOR("Along");
MODULE_LICENSE("GPL");

struct task_struct *task = NULL, *p = NULL;
struct list_head *pos = NULL;
struct timeval start, end;
int count = 0;
/* function_use selects the test method:
 * 0: run all three methods
 * 1: list_for_each
 * 2: list_for_each_entry
 * 3: for_each_process
 */
int function_use = 0;
char *method;
char *filename = "testlog";

void print_message(void);
void writefile(char *filename, char *data);
void traversal_list_for_each(void);
void traversal_list_for_each_entry(void);
void traversal_for_each_process(void);

static int init_module_list(void)
{
    switch (function_use) {
    case 1:
        traversal_list_for_each();
        break;
    case 2:
        traversal_list_for_each_entry();
        break;
    case 3:
        traversal_for_each_process();
        break;
    default:
        traversal_list_for_each();
        traversal_list_for_each_entry();
        traversal_for_each_process();
        break;
    }
    return 0;
}

static void exit_module_list(void)
{
    printk(KERN_ALERT "GOOD BYE!!\n");
}

module_init(init_module_list);
module_exit(exit_module_list);
module_param(function_use, int, S_IRUGO);

void print_message(void)
{
    char *str1 = "the method is: ";
    char *str2 = "the system currently has ";
    char *str3 = " processes\n";
    char *str4 = "start time: ";
    char *str5 = "\nend time: ";
    char *str6 = "\nelapsed: ";
    char *str7 = ".";
    char *str8 = "ms";
    char data[1024];
    char tmp[50];
    int cost;

    printk("the system currently has %d processes!!\n", count);
    printk("the method is: %s\n", method);
    printk("start time: %10i.%06i\n", (int)start.tv_sec, (int)start.tv_usec);
    printk("end time: %10i.%06i\n", (int)end.tv_sec, (int)end.tv_usec);
    printk("elapsed: %10i\n", (int)end.tv_usec - (int)start.tv_usec);

    memset(data, 0, sizeof(data));
    memset(tmp, 0, sizeof(tmp));
    strcat(data, str1);
    strcat(data, method);
    strcat(data, str2);
    snprintf(tmp, sizeof(count), "%d", count);
    strcat(data, tmp);
    strcat(data, str3);
    strcat(data, str4);
    memset(tmp, 0, sizeof(tmp));
    /* Converting the seconds like this is wrong, because sizeof yields the
     * length of int while the actual second count has 10 digits, so only a
     * few digits would actually end up stored in tmp:
     * snprintf(tmp, sizeof((int)start.tv_sec), "%d", (int)start.tv_usec);
     */
    /* grab the seconds and microseconds of the start time */
    snprintf(tmp, 10, "%d", (int)start.tv_sec);
    strcat(data, tmp);
    snprintf(tmp, sizeof(str7), "%s", str7);
    strcat(data, tmp);
    snprintf(tmp, 6, "%d", (int)start.tv_usec);
    strcat(data, tmp);
    strcat(data, str5);
    /* grab the seconds and microseconds of the end time */
    snprintf(tmp, 10, "%d", (int)end.tv_sec);
    strcat(data, tmp);
    snprintf(tmp, sizeof(str7), "%s", str7);
    strcat(data, tmp);
    snprintf(tmp, 6, "%d", (int)end.tv_usec);
    strcat(data, tmp);
    /* Compute the time difference. Since this program is known to run on
     * the millisecond scale, only the microsecond fields are compared and
     * the seconds are ignored. */
    strcat(data, str6);
    cost = (int)end.tv_usec - (int)start.tv_usec;
    snprintf(tmp, sizeof(cost), "%d", cost);
    strcat(data, tmp);
    strcat(data, str8);
    strcat(data, "\n\n");
    writefile(filename, data);
    printk("%d\n", sizeof(data));
}

void writefile(char *filename, char *data)
{
    struct file *filp;
    mm_segment_t fs;

    filp = filp_open(filename, O_RDWR | O_APPEND | O_CREAT, 0644);
    if (IS_ERR(filp)) {
        printk("open file error...\n");
        return;
    }
    fs = get_fs();
    set_fs(KERNEL_DS);
    filp->f_op->write(filp, data, strlen(data), &filp->f_pos);
    set_fs(fs);
    filp_close(filp, NULL);
}

void traversal_list_for_each(void)
{
    task = &init_task;
    count = 0;
    method = "list_for_each\n";
    do_gettimeofday(&start);
    list_for_each(pos, &task->tasks) {
        p = list_entry(pos, struct task_struct, tasks);
        count++;
        printk(KERN_ALERT "%d\t%s\n", p->pid, p->comm);
    }
    do_gettimeofday(&end);
    print_message();
}

void traversal_list_for_each_entry(void)
{
    task = &init_task;
    count = 0;
    method = "list_for_each_entry\n";
    do_gettimeofday(&start);
    list_for_each_entry(p, &task->tasks, tasks) {
        count++;
        printk(KERN_ALERT "%d\t%s\n", p->pid, p->comm);
    }
    do_gettimeofday(&end);
    print_message();
}

void traversal_for_each_process(void)
{
    count = 0;
    method = "for_each_process\n";
    do_gettimeofday(&start);
    for_each_process(task) {
        count++;
        printk(KERN_ALERT "%d\t%s\n", task->pid, task->comm);
    }
    do_gettimeofday(&end);
    print_message();
}
Makefile
#
# Variables needed to build the kernel module
#
name = trave_process

obj-m += $(name).o

all: build

.PHONY: build install clean

build:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules CONFIG_DEBUG_SECTION_MISMATCH=y

install: build
	-mkdir -p /lib/modules/`uname -r`/kernel/arch/x86/kernel/
	cp $(name).ko /lib/modules/`uname -r`/kernel/arch/x86/kernel/
	depmod /lib/modules/`uname -r`/kernel/arch/x86/kernel/$(name).ko

clean:
	[ -d /lib/modules/$(shell uname -r)/build ] && \
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
After building, loading and running the module, it walks the process list (the task_struct list) and enumerates every process currently present in the kernel.
http://blog.csdn.net/tigerjibo/article/details/8299599 http://www.cnblogs.com/chengxuyuancc/p/3376627.html http://blog.csdn.net/tody_guo/article/details/5447402
3. Kernel-module-related data structures

0x0: The THIS_MODULE macro

Much like the current macro, the THIS_MODULE macro yields a pointer to the current module's struct module.
\linux-2.6.32.63\include\linux\module.h
#ifdef MODULE
#define MODULE_GENERIC_TABLE(gtype, name)            \
extern const struct gtype##_id __mod_##gtype##_table \
    __attribute__ ((unused, alias(__stringify(name))))

extern struct module __this_module;
#define THIS_MODULE (&__this_module)

#else  /* !MODULE */
#define MODULE_GENERIC_TABLE(gtype, name)
#define THIS_MODULE ((struct module *)0)
#endif
The __this_module symbol only comes into existence once the module is loaded into the kernel. When the insmod command runs, it ends up in the sys_init_module system call in kernel/module.c, which calls load_module to build a kernel module out of the module file passed in from userspace and returns a struct module; from that point on, this structure represents the module inside the kernel. THIS_MODULE is thus the module-world analogue of a process's current.
For an analysis of the kernel code behind the sys_init_module and load_module system calls, see this other article:
http://www.cnblogs.com/LittleHann/p/3920387.html
0x1: struct module

The structure struct module represents a kernel module inside the kernel. When a module we wrote is inserted with insmod (which actually performs the init_module system call), the module becomes associated with a struct module and becomes part of the kernel. In other words, the kernel represents a kernel module with this structure — very similar to the KPROCESS/KTHREAD concepts on Windows, which also shows that in kernel land, Windows and Linux often reach the same destination by different roads.
struct module { /* 1. enum module_state state enum module_state { MODULE_STATE_LIVE, //模塊當前正常使用中(存活狀態) MODULE_STATE_COMING, //模塊當前正在被加載 MODULE_STATE_GOING, //模塊當前正在被卸載 }; load_module函數中完成模塊的部分建立工做後,把狀態置爲 MODULE_STATE_COMING sys_init_module函數中完成模塊的所有初始化工做後(包括把模塊加入全局的模塊列表,調用模塊自己的初始化函數),把模塊狀態置爲MODULE_STATE_LIVE 使用rmmod工具卸載模塊時,會調用系統調用delete_module,會把模塊的狀態置爲MODULE_STATE_GOING 這是模塊內部維護的一個狀態 */ enum module_state state; /* 2. struct list_head list list是做爲一個列表的成員,全部的內核模塊都被維護在一個全局鏈表中,鏈表頭是一個全局變量struct module *modules。任何一個新建立的模塊,都會被加入到這個鏈表的頭部 struct list_head { struct list_head *next, *prev; }; 這裏,咱們須要再次理解一下,鏈表是內核中的一個重要的機制,包括進程、模塊在內的不少東西都被以鏈表的形式進行組織,由於是雙向循環鏈表,咱們能夠任何一個modules->next遍歷並獲取到當前內核中的任何鏈表元素,這 在不少的枚舉場景、隱藏、反隱藏的技術中得以應用 */ struct list_head list; /* 3. char name[MODULE_NAME_LEN] name是模塊的名字,通常會拿模塊文件的文件名做爲模塊名。它是這個模塊的一個標識 */ char name[MODULE_NAME_LEN]; /* 4. struct module_kobject mkobj 該成員是一個結構體類型,結構體的定義以下: struct module_kobject { /* 4.1 struct kobject kobj kobj是一個struct kobject結構體 kobject是組成設備模型的基本結構。設備模型是在2.6內核中出現的新的概念,由於隨着拓撲結構愈來愈複雜,以及要支持諸如電源管理等新特性的要求,向新版本的內核明確提出了這樣的要求:須要有一個對系統的通常性
抽象描述。設備模型提供了這樣的抽象 kobject最初只是被理解爲一個簡單的引用計數,但如今也有了不少成員,它所能處理的任務以及它所支持的代碼包括:對象的引用計數;sysfs表述;結構關聯;熱插拔事件處理。下面是kobject結構的定義: struct kobject { //k_name和name都是該內核對象的名稱,在內核模塊的內嵌kobject中,名稱即爲內核模塊的名稱 const char *k_name; char name[KOBJ_NAME_LEN]; /* kref是該kobject的引用計數,新建立的kobject被加入到kset時(調用kobject_init),引用計數被加1,而後kobject跟它的parent創建關聯時,引用計數被加1,因此一個新建立的kobject,其引用計數老是爲2 */ struct kref kref; //entry是做爲鏈表的節點,全部同一子系統下的全部相同類型的kobject被連接成一個鏈表組織在一塊兒 struct list_head entry; //parent指向該kobject所屬分層結構中的上一層節點,全部內核模塊的parent是module struct kobject *parent; /* 成員kset就是嵌入相同類型結構的kobject集合。下面是struct kset結構體的定義: struct kset { struct subsystem *subsys; struct kobj_type *ktype; struct list_head list; spinlock_t list_lock; struct kobject kobj; struct kset_uevent_ops * uevent_ops; }; */ struct kset *kset; //ktype則是模塊的屬性,這些屬性都會在kobject的sysfs目錄中顯示 struct kobj_type *ktype; //dentry則是文件系統相關的一個節點 struct dentry *dentry; }; */ struct kobject kobj; //mod指向包容它的struct module成員 struct module *mod; }; */ struct module_kobject mkobj; struct module_param_attrs *param_attrs; const char *version; const char *srcversion; /* Exported symbols */ const struct kernel_symbol *syms; unsigned int num_syms; const unsigned long *crcs; /* GPL-only exported symbols. */ const struct kernel_symbol *gpl_syms; unsigned int num_gpl_syms; const unsigned long *gpl_crcs; unsigned int num_exentries; const struct exception_table_entry *extable; int (*init)(void); /* 初始化相關 */ void *module_init; void *module_core; unsigned long init_size, core_size; unsigned long init_text_size, core_text_size; struct mod_arch_specific arch; int unsafe; int license_gplok; #ifdef CONFIG_MODULE_UNLOAD struct module_ref ref[NR_CPUS]; struct list_head modules_which_use_me; struct task_struct *waiter; void (*exit)(void); #endif #ifdef CONFIG_KALLSYMS Elf_Sym *symtab; unsigned long num_symtab; char *strtab; struct module_sect_attrs *sect_attrs; #endif void *percpu; char *args; };
As struct module shows, to enumerate the current module list from kernel mode we can use any of:

1. struct module->list
2. struct module->mkobj.kobj.entry
3. struct module->mkobj.kobj.kset

Each of the three leads to the linked list of kernel modules.
Relevant Link:
http://lxr.free-electrons.com/source/include/linux/module.h http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/include/linux/module.h http://blog.chinaunix.net/uid-9525959-id-2001630.html http://blog.csdn.net/linweig/article/details/5044722
0x2: struct module_use
source/include/linux/module.h
/* modules using other modules: kdb wants to see this. */
struct module_use {
    struct list_head source_list;
    struct list_head target_list;
    struct module *source, *target;
};
Together, struct module_use and the modules_which_use_me list in struct module record and enforce the dependency relationships between kernel modules.
If module B uses functions provided by module A, then A and B are related; the relationship can be viewed from two sides:

1. Module B depends on module A: B cannot be loaded unless A is already resident in kernel memory.
2. Module B references module A: A cannot be removed from the kernel until B has been removed. In the kernel this relationship is phrased as "module B uses module A".

For every module B that uses functions from module A, a module_use instance is created and added to the modules_which_use_me list in the module instance of A (the module being depended upon); modules_which_use_me points to B's module instance.
Once the data-structure representation of module dependencies is understood, enumerating the dependencies of all modules becomes straightforward.
4. Filesystem-related data structures

0x1: struct file

The file structure represents an open file; every open file on the system has an associated struct file in kernel space. The kernel creates it when the file is opened and passes it to every function that operates on the file. Once all instances of the file have been closed, the kernel releases this data structure.
struct file { /* * fu_list becomes invalid after file_free is called and queued via * fu_rcuhead for RCU freeing */ union { /* 定義在 linux/include/linux/list.h中 struct list_head { struct list_head *next, *prev; }; 用於通用文件對象鏈表的指針,全部打開的文件造成一個鏈表 */ struct list_head fu_list; /* 定義在linux/include/linux/rcupdate.h中 struct rcu_head { struct rcu_head *next; void (*func)(struct rcu_head *head); }; RCU(Read-Copy Update)是Linux 2.6內核中新的鎖機制 */ struct rcu_head fu_rcuhead; } f_u; /* 定義在linux/include/linux/namei.h中 struct path { /* struct vfsmount *mnt的做用是指出該文件的已安裝的文件系統,即指向VFS安裝點的指針 */ struct vfsmount *mnt; /* struct dentry *dentry是與文件相關的目錄項對象,指向相關目錄項的指針 */ struct dentry *dentry; }; */ struct path f_path; #define f_dentry f_path.dentry #define f_vfsmnt f_path.mnt /*
指向文件操做表的指針 定義在linux/include/linux/fs.h中,其中包含着與文件關聯的操做,例如 struct file_operations { struct module *owner; loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); int (*readdir) (struct file *, void *, filldir_t); unsigned int (*poll) (struct file *, struct poll_table_struct *); int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); int (*open) (struct inode *, struct file *); int (*flush) (struct file *, fl_owner_t id); int (*release) (struct inode *, struct file *); int (*fsync) (struct file *, struct dentry *, int datasync); int (*aio_fsync) (struct kiocb *, int datasync); int (*fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); int (*check_flags)(int); int (*flock) (struct file *, int, struct file_lock *); ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int); ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int); int (*setlease)(struct file *, long, struct file_lock **); }; 當打開一個文件時,內核就建立一個與該文件相關聯的struct file結構,其中的*f_op就指向的是具體對該文件進行操做的函數 例如用戶調用系統調用read來讀取該文件的內容時,那麼系統調用read最終會陷入內核調用sys_read函數,而sys_read最終會調用於該文件關聯的struct file結構中的f_op->read函數對文件內容進行讀取 */ const struct file_operations *f_op; spinlock_t f_lock; /* f_ep_links, f_flags, no IRQ */ /* typedef struct { 
volatile int counter; } atomic_t; volatile修飾字段告訴gcc不要對該類型的數據作優化處理,對它的訪問都是對內存的訪問,而不是對寄存器的訪問 f_count的做用是記錄對文件對象的引用計數,也即當前有多少個進程在使用該文件 */ atomic_long_t f_count; /* 當打開文件時指定的標誌,對應系統調用open的int flags參數。驅動程序爲了支持非阻塞型操做須要檢查這個標誌 */ unsigned int f_flags; /* 對文件的讀寫模式,對應系統調用open的mod_t mode參數。若是驅動程序須要這個值,能夠直接讀取這個字段。 mod_t被定義爲: typedef unsigned int __kernel_mode_t; typedef __kernel_mode_t mode_t; */ fmode_t f_mode; /* 當前的文件指針位置,即文件的讀寫位置 loff_t被定義爲: typedef long long __kernel_loff_t; typedef __kernel_loff_t loff_t; */ loff_t f_pos; /* struct fown_struct在linux/include/linux/fs.h被定義 struct fown_struct { rwlock_t lock; /* protects pid, uid, euid fields */ struct pid *pid; /* pid or -pgrp where SIGIO should be sent */ enum pid_type pid_type; /* Kind of process group SIGIO should be sent to */ uid_t uid, euid; /* uid/euid of process setting the owner */ int signum; /* posix.1b rt signal to be delivered on IO */ }; 該結構的做用是經過信號進行I/O時間通知的數據 */ struct fown_struct f_owner; const struct cred *f_cred; /* struct file_ra_state結構被定義在/linux/include/linux/fs.h中 struct file_ra_state { pgoff_t start; /* where readahead started */ unsigned long size; /* # of readahead pages */ unsigned long async_size; /* do asynchronous readahead when there are only # of pages ahead */ unsigned long ra_pages; /* Maximum readahead window */ unsigned long mmap_hit; /* Cache hit stat for mmap accesses */ unsigned long mmap_miss; /* Cache miss stat for mmap accesses */ unsigned long prev_index; /* Cache last read() position */ unsigned int prev_offset; /* Offset where last read() ended in a page */ }; 該結構標識了文件預讀狀態,文件預讀算法使用的主要數據結構,當打開一個文件時,f_ra中出了perv_page(默認爲-1)和ra_apges(對該文件容許的最大預讀量)這兩個字段外,其餘的全部西端都置爲0 */ struct file_ra_state f_ra; /* 記錄文件的版本號,每次使用後都自動遞增 */ u64 f_version; #ifdef CONFIG_SECURITY /* #ifdef CONFIG_SECURITY void *f_security; #endif 若是在編譯內核時配置了安全措施,那麼struct file結構中就會有void *f_security數據項,用來描述安全措施或者是記錄與安全有關的信息。 */ void *f_security; #endif /* 
系統在調用驅動程序的open方法前將這個指針置爲NULL。驅動程序能夠將這個字段用於任意目的,也能夠忽略這個字段。驅動程序能夠用這個字段指向已分配的數據,可是必定要在內核釋放file結構前的release方法中清除它 */ void *private_data; #ifdef CONFIG_EPOLL /* 被用在fs/eventpoll.c來連接全部鉤到這個文件上。其中 1) f_ep_links是文件的事件輪詢等待者鏈表的頭 2) f_ep_lock是保護f_ep_links鏈表的自旋鎖 */ struct list_head f_ep_links; struct list_head f_tfile_llink; #endif /* #ifdef CONFIG_EPOLL */ /* struct address_space被定義在/linux/include/linux/fs.h中,此處是指向文件地址空間的指針 */ struct address_space *f_mapping; #ifdef CONFIG_DEBUG_WRITECOUNT unsigned long f_mnt_write_state; #endif };
Every file object is always included in exactly one of the following doubly-linked circular lists:

1. The list of "unused" file objects. This list serves both as an in-memory cache of file objects and as a reserve for the superuser: it allows the superuser to open files even when the system's dynamic memory is exhausted. Since these objects are unused, their f_count field is NULL. The address of the list's first element is stored in the variable free_list. The kernel ensures the list always contains at least NR_RESERVED_FILES objects, usually 10.
2. The list of "in use" file objects. Every element of this list is used by at least one process, so the f_count field of each element is non-NULL. The address of the list's first element is stored in the variable anon_list.

When the VFS needs to allocate a new file object, it calls get_empty_filp(). That function checks whether the "unused" list holds more than NR_RESERVED_FILES elements; if so, one of them is used for the newly opened file, otherwise it falls back to a normal memory allocation (in other words, this is a caching mechanism).
Relevant Link:
http://linux.chinaunix.net/techdoc/system/2008/07/24/1020195.shtml http://blog.csdn.net/fantasyhujian/article/details/9166117
0x2: struct inode
As we know, the Linux kernel represents an open file descriptor with a file structure, and the underlying file itself with an inode structure.
struct inode { /* 哈希表 */ struct hlist_node i_hash; /* 索引節點鏈表(backing dev IO list) */ struct list_head i_list; struct list_head i_sb_list; /* 目錄項鍊表 */ struct list_head i_dentry; /* 節點號 */ unsigned long i_ino; /* 引用記數 */ atomic_t i_count; /* 硬連接數 */ unsigned int i_nlink; /* 使用者id */ uid_t i_uid; /* 使用者所在組id */ gid_t i_gid; /* 實設備標識符 */ dev_t i_rdev; /* 版本號 */ u64 i_version; /* 以字節爲單位的文件大小 */ loff_t i_size; #ifdef __NEED_I_SIZE_ORDERED seqcount_t i_size_seqcount; #endif /* 最後訪問時間 */ struct timespec i_atime; /* 最後修改(modify)時間 */ struct timespec i_mtime; /* 最後改變(change)時間 */ struct timespec i_ctime; /* 文件的塊數 */ blkcnt_t i_blocks; /* 以位爲單位的塊大小 */ unsigned int i_blkbits; /* 使用的字節數 */ unsigned short i_bytes; /* 訪問權限控制 */ umode_t i_mode; /* 自旋鎖 */ spinlock_t i_lock; struct mutex i_mutex; /* 索引節點信號量 */ struct rw_semaphore i_alloc_sem; /* 索引節點操做表 索引節點的操做inode_operations定義在linux/fs.h struct inode_operations { /* 1. VFS經過系統調用create()和open()來調用該函數,從而爲dentry對象建立一個新的索引節點。在建立時使用mode制定初始模式 */ int (*create) (struct inode *, struct dentry *,int); /* 2. 該函數在特定目錄中尋找索引節點,該索引節點要對應於dentry中給出的文件名 */ struct dentry * (*lookup) (struct inode *, struct dentry *); /* 3. 該函數被系統調用link()調用,用來建立硬鏈接。硬連接名稱由dentry參數指定,鏈接對象是dir目錄中ld_dentry目錄想所表明的文件 */ int (*link) (struct dentry *, struct inode *, struct dentry *); /* 4. 該函數被系統調用unlink()調用,從目錄dir中刪除由目錄項dentry制動的索引節點對象 */ int (*unlink) (struct inode *, struct dentry *); /* 5. 該函數被系統調用symlik()調用,建立符號鏈接,該符號鏈接名稱由symname指定,鏈接對象是dir目錄中的dentry目錄項 */ int (*symlink) (struct inode *, struct dentry *, const char *); /* 6. 該函數被mkdir()調用,建立一個新路徑。建立時使用mode制定的初始模式 */ int (*mkdir) (struct inode *, struct dentry *, int); /* 7. 該函數被系統調用rmdir()調用,刪除dir目錄中的dentry目錄項表明的文件 */ int (*rmdir) (struct inode *, struct dentry *); /* 8. 該函數被系統調用mknod()調用,建立特殊文件(設備文件、命名管道或套接字)。要建立的文件放在dir目錄中,其目錄項問dentry,關聯的設備爲rdev,初始權限由mode指定 */ int (*mknod) (struct inode *, struct dentry *, int, dev_t); /* 9. 
VFS調用該函數來移動文件。文件源路徑在old_dir目錄中,源文件由old_dentry目錄項所指定,目標路徑在new_dir目錄中,目標文件由new_dentry指定 */ int (*rename) (struct inode *, struct dentry *, struct inode *, struct dentry *); /* 10. 該函數被系統調用readlink()調用,拷貝數據到特定的緩衝buffer中。拷貝的數據來自dentry指定的符號連接,最大拷貝大小可達到buflen字節 */ int (*readlink) (struct dentry *, char *, int); /* 11. 該函數由VFS調用,從一個符號鏈接查找他指向的索引節點,由dentry指向的鏈接被解析 */ int (*follow_link) (struct dentry *, struct nameidata *); /* 12. 在follow_link()調用以後,該函數由vfs調用進行清楚工做 */ int (*put_link) (struct dentry *, struct nameidata *); /* 13. 該函數由VFS調用,修改文件的大小,在調用以前,索引節點的i_size項必須被設置成預期的大小 */ void (*truncate) (struct inode *); /* 該函數用來檢查inode所表明的文件是否容許特定的訪問模式,若是容許特定的訪問模式,返回0,不然返回負值的錯誤碼。多數文件系統都將此區域設置爲null,使用VFS提供的通用方法進行檢查,這種檢查操做僅僅比較索引
及誒但對象中的訪問模式位是否和mask一致,比較複雜的系統, 好比支持訪問控制鏈(ACL)的文件系統,須要使用特殊的permission()方法 */ int (*permission) (struct inode *, int); /* 該函數被notify_change調用,在修改索引節點以後,通知發生了改變事件 */ int (*setattr) (struct dentry *, struct iattr *); /* 在通知索引節點須要從磁盤中更新時,VFS會調用該函數 */ int (*getattr) (struct vfsmount *, struct dentry *, struct kstat *); /* 該函數由VFS調用,向dentry指定的文件設置擴展屬性,屬性名爲name,值爲value */ int (*setxattr) (struct dentry *, const char *, const void *, size_t, int); /* 該函數被VFS調用,向value中拷貝給定文件的擴展屬性name對應的數值 */ ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); /* 該函數將特定文件全部屬性別表拷貝到一個緩衝列表中 */ ssize_t (*listxattr) (struct dentry *, char *, size_t); /* 該函數從給定文件中刪除指定的屬性 */ int (*removexattr) (struct dentry *, const char *); }; */ const struct inode_operations *i_op; /* 默認的索引節點操做 former ->i_op->default_file_ops */ const struct file_operations *i_fop; /* 相關的超級塊 */ struct super_block *i_sb; /* 文件鎖鏈表 */ struct file_lock *i_flock; /* 相關的地址映射 */ struct address_space *i_mapping; /* 設備地址映射
address_space結構與文件的對應:一個具體的文件在打開後,內核會在內存中爲之創建一個struct inode結構,其中的i_mapping域指向一個address_space結構。這樣,一個文件就對應一個address_space結構,一個 address_space與一個偏移量可以肯定一個page cache 或swap cache中的一個頁面。所以,當要尋址某個數據時,很容易根據給定的文件及數據在文件內的偏移量而找到相應的頁面 */ struct address_space i_data; #ifdef CONFIG_QUOTA /* 節點的磁盤限額 */ struct dquot *i_dquot[MAXQUOTAS]; #endif /* 塊設備鏈表 */ struct list_head i_devices; union { //管道信息 struct pipe_inode_info *i_pipe; //塊設備驅動 struct block_device *i_bdev; struct cdev *i_cdev; }; /* 索引節點版本號 */ __u32 i_generation; #ifdef CONFIG_FSNOTIFY /* 目錄通知掩碼 all events this inode cares about */ __u32 i_fsnotify_mask; struct hlist_head i_fsnotify_mark_entries; /* fsnotify mark entries */ #endif #ifdef CONFIG_INOTIFY struct list_head inotify_watches; /* watches on this inode */ struct mutex inotify_mutex; /* protects the watches list */ #endif /* 狀態標誌 */ unsigned long i_state; /* 首次修改時間 jiffies of first dirtying */ unsigned long dirtied_when; /* 文件系統標誌 */ unsigned int i_flags; /* 寫者記數 */ atomic_t i_writecount; #ifdef CONFIG_SECURITY /* 安全模塊 */ void *i_security; #endif #ifdef CONFIG_FS_POSIX_ACL struct posix_acl *i_acl; struct posix_acl *i_default_acl; #endif void *i_private; /* fs or device private pointer */ };
0x3: struct stat
struct stat is a data structure we meet constantly when reading or writing file and directory attributes and when monitoring disk I/O status:

struct stat {
    dev_t     st_dev;     /* ID of device containing file */
    ino_t     st_ino;     /* inode number */
    mode_t    st_mode;    /* protection (file type and mode bits) */
    nlink_t   st_nlink;   /* number of hard links to this file */
    uid_t     st_uid;     /* user ID of owner */
    gid_t     st_gid;     /* group ID of owner */
    dev_t     st_rdev;    /* device ID (if special file) */
    off_t     st_size;    /* total size, in bytes */
    blksize_t st_blksize; /* block size for filesystem I/O */
    blkcnt_t  st_blocks;  /* number of blocks allocated to the file */
    time_t    st_atime;   /* time of last access */
    time_t    st_mtime;   /* time of last modification */
    time_t    st_ctime;   /* time of last status change */
};
Relevant Link:
http://blog.sina.com.cn/s/blog_7943319e01018m4h.html http://www.cnblogs.com/QJohnson/archive/2011/06/24/2089414.html http://blog.csdn.net/tianmohust/article/details/6609470
Each process on the system has its own list of open files, root filesystem, current working directory, mount points, and so on. Three data structures tie together the VFS layer and the processes on the system: the files_struct, fs_struct, and namespace structure.
The second process-related structure is fs_struct, which contains filesystem information related to a process and is pointed at by the fs field in the process descriptor. The structure is defined in <linux/fs_struct.h>. Here it is, with comments:
0x4: struct fs_struct
The structure holding a process's filesystem-related information:

struct fs_struct {
    atomic_t count;              /* number of processes sharing this table */
    rwlock_t lock;               /* read/write spinlock protecting the fields */
    int umask;                   /* bitmask applied to file permissions when opening files */
    struct dentry *root;         /* dentry of the root directory */
    struct dentry *pwd;          /* dentry of the current working directory */
    struct dentry *altroot;      /* dentry of the emulated root (always NULL on 80x86) */
    struct vfsmount *rootmnt;    /* filesystem object mounted at the root directory */
    struct vfsmount *pwdmnt;     /* filesystem object mounted at the current working directory */
    struct vfsmount *altrootmnt; /* filesystem object of the emulated root (always NULL on 80x86) */
};
0x5: struct files_struct
The files_struct is defined in <linux/file.h>. This table's address is pointed to by the files enTRy in the processor descriptor. All per-process information about open files and file descriptors is contained therein. Here it is, with comments:
It represents the files the process currently has open. The table's address is stored in the files field of the process descriptor task_struct; each process records its file-descriptor usage in a files_struct, known as the user open-file table, which is private data of the process.

struct files_struct {
    atomic_t count;              /* number of processes sharing this table */
    struct fdtable *fdt;         /* pointer to the fdtable structure */
    struct fdtable fdtab;        /* embedded fdtable structure */
    spinlock_t file_lock ____cacheline_aligned_in_smp;
    int next_fd;                 /* highest allocated file descriptor plus one */
    struct embedded_fd_set close_on_exec_init; /* descriptors to close on exec() */
    struct embedded_fd_set open_fds_init;      /* initial set of open file descriptors */
    struct file *fd_array[NR_OPEN_DEFAULT];    /* initial array of file object pointers */
};
0x6: struct fdtable
struct fdtable {
    unsigned int max_fds;
    int max_fdset;
    /*
     * fd points to an array of file-object pointers. Normally it points to the
     * fd_array field of files_struct, which holds 32 (NR_OPEN_DEFAULT) file
     * object pointers. If a process opens more files than that, the kernel
     * allocates a new, larger pointer array, stores its address in fd, and
     * updates max_fds accordingly.
     *
     * For every file in the fd array, the array index is the file descriptor.
     * By convention, the first element (index 0) is the process's standard
     * input, the second (index 1) its standard output, and the third
     * (index 2) its standard error.
     */
    struct file **fd;
    fd_set *close_on_exec;
    fd_set *open_fds;
    struct rcu_head rcu;
    struct files_struct *free_files;
    struct fdtable *next;
};

#define NR_OPEN_DEFAULT BITS_PER_LONG
#define BITS_PER_LONG 32 /* asm-i386 */
The relationship among task_struct, fs_struct, files_struct, fdtable and file can be summarized as: task_struct->fs points to the process's fs_struct, task_struct->files points to its files_struct, whose fdtable's fd array in turn points to the open file objects (diagram not reproduced here).
Relevant Link:
http://oss.org.cn/kernel-book/ch08/8.2.4.htm http://www.makelinux.net/books/lkd2/ch12lev1sec10
0x7: struct dentry
struct dentry { //目錄項引用計數器 atomic_t d_count; /* 目錄項標誌 protected by d_lock #define DCACHE_AUTOFS_PENDING 0x0001 // autofs: "under construction" #define DCACHE_NFSFS_RENAMED 0x0002 // this dentry has been "silly renamed" and has to be eleted on the last dput() #define DCACHE_DISCONNECTED 0x0004 //指定了一個dentry當前沒有鏈接到超級塊的dentry樹 #define DCACHE_REFERENCED 0x0008 //Recently used, don't discard. #define DCACHE_UNHASHED 0x0010 //該dentry實例沒有包含在任何inode的散列表中 #define DCACHE_INOTIFY_PARENT_WATCHED 0x0020 // Parent inode is watched by inotify #define DCACHE_COOKIE 0x0040 // For use by dcookie subsystem #define DCACHE_FSNOTIFY_PARENT_WATCHED 0x0080 // Parent inode is watched by some fsnotify listener */ unsigned int d_flags; //per dentry lock spinlock_t d_lock; //當前dentry對象表示一個裝載點,那麼d_mounted設置爲1,不然爲0 int d_mounted; /* 文件名所屬的inode,若是爲NULL,則表示不存在的文件名 若是dentry對象是一個不存在的文件名創建的,則d_inode爲NULL指針,這有助於加速查找不存在的文件名,一般狀況下,這與查找實際存在的文件名一樣耗時 */ struct inode *d_inode; /* The next three fields are touched by __d_lookup. Place them here so they all fit in a cache line. 
*/ //用於查找的散列表 lookup hash list struct hlist_node d_hash; /* 指向當前的dentry實例的父母了的dentry實例 parent directory 當前的dentry實例即位於父目錄的d_subdirs鏈表中,對於根目錄(沒有父目錄),d_parent指向其自身的dentry實例 */ struct dentry *d_parent; /* d_iname指定了文件的名稱,qstr是一個內核字符串的包裝器,它存儲了實際的char*字符串以及字符串長度和散列值,這使得更容易處理查找工做 要注意的是,這裏並不存儲絕對路徑,而是隻有路徑的最後一個份量,例如對/usr/bin/emacs只存儲emacs,由於在linux中,路徑信息隱含在了dentry層次鏈表結構中了 */ struct qstr d_name; //LRU list struct list_head d_lru; /* * d_child and d_rcu can share memory */ union { /* child of parent list */ struct list_head d_child; //鏈表元素,用於將dentry鏈接到inode的i_dentry鏈表中 struct rcu_head d_rcu; } d_u; //our children 子目錄/文件的目錄項鍊表 struct list_head d_subdirs; /* inode alias list 鏈表元素,用於將dentry鏈接到inode的i_dentry鏈表中 d_alias用做鏈表元素,以鏈接表示相同文件的各個dentry對象,在利用硬連接用兩個不一樣名稱表示同一文件時,會發生這種狀況,對應於文件的inode的i_dentry成員用做該鏈表的表頭,各個dentry對象經過d_alias鏈接到該鏈表中 */ struct list_head d_alias; //used by d_revalidate unsigned long d_time; /* d_op指向一個結構,其中包含了各類函數指針,提供對dentry對象的各類操做,這些操做必須由底層文件系統實現 struct dentry_operations { //在把目錄項對象轉換爲一個文件路徑名以前,斷定該目錄項對象是否依然有效 int (*d_revalidate)(struct dentry *, struct nameidata *); //生成一個散列值,用於目錄項散列表 int (*d_hash) (struct dentry *, struct qstr *); //比較兩個文件名 int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); //當對目錄項對象的最後一個引用被刪除,調用該方法 int (*d_delete)(struct dentry *); //當要釋放一個目錄項對象時,調用該方法 void (*d_release)(struct dentry *); //當一個目錄對象變爲負狀態時,調用該方法 void (*d_iput)(struct dentry *, struct inode *); char *(*d_dname)(struct dentry *, char *, int); }; */ const struct dentry_operations *d_op; //The root of the dentry tree dentry樹的根,超級塊 struct super_block *d_sb; //fs-specific data 特定文件系統的數據 void *d_fsdata; /* 短文件名small names存儲在這裏 若是文件名由少許字符組成,則只保存在d_iname中,而不是dnanme中,用於加速訪問 */ unsigned char d_iname[DNAME_INLINE_LEN_MIN]; };
Relevant Link:
http://blog.csdn.net/fudan_abc/article/details/1775313
0x8: struct vfsmount
struct vfsmount { struct list_head mnt_hash; //裝載點所在的父文件系統的vfsmount結構 fs we are mounted on,文件系統之間的父子關係就是這樣實現的 struct vfsmount *mnt_parent; //裝載點在父文件系統中的dentry(即裝載點自身對應的dentry) dentry of mountpoint struct dentry *mnt_mountpoint; //當前文件系統的相對根目錄的dentry root of the mounted tree struct dentry *mnt_root; /* 指向超級塊的指針 pointer to superblock mnt_sb指針創建了與相關的超級塊之間的關聯(對每一個裝載的文件系統而言,都有且只有一個超級塊實例) */ struct super_block *mnt_sb; //子文件系統鏈表 struct list_head mnt_mounts; //鏈表元素,用於父文件系統中的mnt_mounts鏈表 struct list_head mnt_child; /* #define MNT_NOSUID 0x01 (禁止setuid執行) #define MNT_NODEV 0x02 (裝載的文件系統是虛擬的,沒有物理後端設備) #define MNT_NOEXEC 0x04 #define MNT_NOATIME 0x08 #define MNT_NODIRATIME 0x10 #define MNT_RELATIME 0x20 #define MNT_READONLY 0x40 // does the user want this to be r/o? #define MNT_STRICTATIME 0x80 #define MNT_SHRINKABLE 0x100 (專用於NFS、AFS 用來標記子裝載,設置了該標記的裝載容許自動移除) #define MNT_WRITE_HOLD 0x200 #define MNT_SHARED 0x1000 // if the vfsmount is a shared mount (共享裝載) #define MNT_UNBINDABLE 0x2000 // if the vfsmount is a unbindable mount (不可綁定裝載) #define MNT_PNODE_MASK 0x3000 // propagation flag mask (傳播標誌掩碼) */ int mnt_flags; /* 4 bytes hole on 64bits arches */ //設備名稱,例如/dev/dsk/hda1 Name of device e.g. 
/dev/dsk/hda1 const char *mnt_devname; struct list_head mnt_list; //鏈表元素,用於特定於文件系統的到期鏈表中 link in fs-specific expiry list struct list_head mnt_expire; //鏈表元素,用於共享裝載的循環鏈表 circular list of shared mounts struct list_head mnt_share; //從屬裝載的鏈表 list of slave mounts struct list_head mnt_slave_list; //鏈表元素,用於從屬裝載的鏈表 slave list entry struct list_head mnt_slave; //指向主裝載,從屬裝載位於master->mnt_slave_list鏈表上 slave is on master->mnt_slave_list struct vfsmount *mnt_master; //所屬的命名空間 containing namespace struct mnt_namespace *mnt_ns; int mnt_id; /* mount identifier */ int mnt_group_id; /* peer group identifier */ /* mnt_count實現了一個使用計數器,每當一個vfsmount實例再也不須要時,都必須用mntput將計數器減1.mntget與mntput相對 We put mnt_count & mnt_expiry_mark at the end of struct vfsmount to let these frequently modified fields in a separate cache line (so that reads of mnt_flags wont ping-pong on SMP machines) 把mnt_count和mnt_expiry_mark防止在struct vfsmount的末尾,以便讓這些頻繁修改的字段與結構的主體處於兩個不一樣的緩存行中(這樣在SMP機器上讀取mnt_flags不會形成高速緩存的顛簸) */ atomic_t mnt_count; //若是標記爲到期,則其值爲true true if marked for expiry int mnt_expiry_mark; int mnt_pinned; int mnt_ghosts; #ifdef CONFIG_SMP int *mnt_writers; #else int mnt_writers; #endif };
Relevant Link:
http://www.cnblogs.com/Wandererzj/archive/2012/04/12/2444888.html
0x9: struct nameidata
Path lookup, i.e. taking a file name and finding the inode for that name, is one of the most important operations in the VFS, and also one of the most intricate, mainly because of:

1. Symbolic links: a file may refer to another file through a symlink; the lookup code must allow for this, recognize links, and break out of any cycles after handling them.
2. Filesystem mount points: mount points must be detected and the lookup redirected accordingly.
3. Permissions: every directory along the path to the target must be checked for access permission; if the process lacks it, the operation terminates with an error.
4. Special paths such as ".", ".." and "//" add further complexity.

Path lookup involves many function calls, and throughout them the nameidata structure plays a central role:

1. it passes parameters into the lookup functions, and
2. it stores the lookup results.

The inode is the basic indexing mechanism of Unix-like filesystems: each file corresponds to an inode, through which the file's actual data is reached, so resolving a path name to the concrete inode is a key processing step. The system caches a dentry structure for every file or directory it has looked up, and each dentry points to the corresponding inode; every time a file is opened, the lookup eventually lands on the file's inode, and the intermediate lookup process is called namei.

The structure is defined as follows:
struct nameidata {
    /*
     * identifies the current position of the lookup:
     * struct path {
     *     struct vfsmount *mnt;
     *     struct dentry *dentry;
     * };
     */
    struct path path;
    // the name to look up: a "quick string" holding, besides the path string
    // itself, the string length and a hash value
    struct qstr last;
    struct path root;
    unsigned int flags;
    int last_type;
    // current depth of nested symlink resolution
    unsigned depth;
    // the name keeps changing while symlinks are processed; the path names
    // seen during symlink handling are saved here
    char *saved_names[MAX_NESTED_LINKS + 1];
    /* intent data */
    union {
        struct open_intent open;
    } intent;
};
Relevant Link:
http://man7.org/linux/man-pages/man7/path_resolution.7.html http://blog.sina.com.cn/s/blog_4a2f24830100l2h4.html http://blog.csdn.net/kickxxx/article/details/9529961 http://blog.csdn.net/air_snake/article/details/2690554 http://losemyheaven.blog.163.com/blog/static/17071980920124593256317/
0x10: struct super_block
/source/include/linux/fs.h
struct super_block { /* Keep this first 指向超級塊鏈表的指針,用於將系統中全部的超級塊彙集到一個鏈表中,該鏈表的表頭是全局變量super_blocks */ struct list_head s_list; /* search index; _not_ kdev_t 設備標識符 */ dev_t s_dev; //以字節爲單位的塊大小 unsigned long s_blocksize; //以位爲單位的塊大小 unsigned char s_blocksize_bits; //修改髒標誌,若是以任何方式改變了超級塊,須要向磁盤迴寫,都會將s_dirt設置爲1,不然爲0 unsigned char s_dirt; //文件大小上限 Max file size loff_t s_maxbytes; //文件系統類型 struct file_system_type *s_type; /* struct super_operations { //給定的超級塊下建立和初始化一個新的索引節點對象; struct inode *(*alloc_inode)(struct super_block *sb); //用於釋放給定的索引節點; void (*destroy_inode)(struct inode *); //VFS在索引節點髒(被修改)時會調用此函數,日誌文件系統(如ext3,ext4)執行該函數進行日誌更新; void (*dirty_inode) (struct inode *); //用於將給定的索引節點寫入磁盤,wait參數指明寫操做是否須要同步; int (*write_inode) (struct inode *, struct writeback_control *wbc); //在最後一個指向索引節點的引用被釋放後,VFS會調用該函數,VFS只須要簡單地刪除這個索引節點後,普通Uinx文件系統就不會定義這個函數了; void (*drop_inode) (struct inode *); //用於從磁盤上刪除給定的索引節點; void (*delete_inode) (struct inode *); //在卸載文件系統時由VFS調用,用來釋放超級塊,調用者必須一直持有s_lock鎖; void (*put_super) (struct super_block *); //用給定的超級塊更新磁盤上的超級塊。VFS經過該函數對內存中的超級塊和磁盤中的超級塊進行同步。調用者必須一直持有s_lock鎖; void (*write_super) (struct super_block *); //使文件系統的數據元與磁盤上的文件系統同步。wait參數指定操做是否同步; int (*sync_fs)(struct super_block *sb, int wait); int (*freeze_fs) (struct super_block *); int (*unfreeze_fs) (struct super_block *); //VFS經過調用該函數獲取文件系統狀態。指定文件系統縣官的統計信息將放置在statfs中; int (*statfs) (struct dentry *, struct kstatfs *); //當指定新的安裝選項從新安裝文件系統時,VFS會調用該函數。調用者必須一直持有s_lock鎖; int (*remount_fs) (struct super_block *, int *, char *); //VFS調用該函數釋放索引節點,並清空包含相關數據的全部頁面; void (*clear_inode) (struct inode *); //VFS調用該函數中斷安裝操做。該函數被網絡文件系統使用,如NFS; void (*umount_begin) (struct super_block *); int (*show_options)(struct seq_file *, struct vfsmount *); int (*show_stats)(struct seq_file *, struct vfsmount *); #ifdef CONFIG_QUOTA ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t); ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); #endif int 
(*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t); }; */ const struct super_operations *s_op; //磁盤限額方法 const struct dquot_operations *dq_op; //磁盤限額方法 const struct quotactl_ops *s_qcop; //導出方法 const struct export_operations *s_export_op; //掛載標誌 unsigned long s_flags; //文件系統魔數 unsigned long s_magic; //目錄掛載點,s_root將超級塊與全局根目錄的dentry項關聯起來,只有一般可見的文件系統的超級塊,才指向/(根)目錄的dentry實例。具備特殊功能、不出如今一般的目錄層次結構中的文件系統(例如管道或套接字文件系統),指向專門的項,不能經過普通的文件命令訪問。處理文件系統對象的代碼常常須要檢查文件系統是否已經裝載,而s_root可用於該目的,若是它爲NULL,則該文件系統是一個僞文件系統,只在內核內部可見。不然,該文件系統在用戶空間中是可見的 struct dentry *s_root; //卸載信號量 struct rw_semaphore s_umount; //超級塊信號量 struct mutex s_lock; //引用計數 int s_count; //還沒有同步標誌 int s_need_sync; //活動引用計數 atomic_t s_active; #ifdef CONFIG_SECURITY //安全模塊 void *s_security; #endif struct xattr_handler **s_xattr; //all inodes struct list_head s_inodes; //匿名目錄項 anonymous dentries for (nfs) exporting struct hlist_head s_anon; //被分配文件鏈表,列出了該超級塊表示的文件系統上全部打開的文件。內核在卸載文件系統時將參考該列表,若是其中仍然包含爲寫入而打開的文件,則文件系統仍然處於使用中,卸載操做失敗,並將返回適當的錯誤信息 struct list_head s_files; /* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */ struct list_head s_dentry_lru; //unused dentry lru of dentry on lru int s_nr_dentry_unused; //指向了底層文件系統的數據所在的相關塊設備 struct block_device *s_bdev; struct backing_dev_info *s_bdi; struct mtd_info *s_mtd; //該類型文件系統 struct list_head s_instances; //限額相關選項 Diskquota specific options struct quota_info s_dquot; int s_frozen; wait_queue_head_t s_wait_unfrozen; //文本名字 Informational name char s_id[32]; //Filesystem private info void *s_fs_info; fmode_t s_mode; /* * The next field is for VFS *only*. No filesystems have any business * even looking at it. You had been warned. */ struct mutex s_vfs_rename_mutex; /* Kludge */ /* Granularity of c/m/atime in ns. Cannot be worse than a second 指定了文件系統支持的各類時間戳的最大可能的粒度 */ u32 s_time_gran; /* * Filesystem subtype. 
If non-empty the filesystem type field * in /proc/mounts will be "type.subtype" */ char *s_subtype; /* * Saved mount options for lazy filesystems using * generic_show_options() */ char *s_options; };
Relevant Link:
http://linux.chinaunix.net/techdoc/system/2008/09/06/1030468.shtml http://lxr.free-electrons.com/source/include/linux/fs.h
0x11: struct file_system_type
struct file_system_type {
    // name of the filesystem type, as a string (e.g. "reiserfs", "ext3")
    const char *name;
    /*
     * flags describing properties of the concrete filesystem, defined in fs.h:
     * #define FS_REQUIRES_DEV          1
     * #define FS_BINARY_MOUNTDATA      2
     * #define FS_HAS_SUBTYPE           4
     * #define FS_REVAL_DOT         16384  // Check the paths ".", ".." for staleness
     * #define FS_RENAME_DOES_D_MOVE 32768 // FS will handle d_move() during rename() internally
     */
    int fs_flags;
    /*
     * get_sb reads the superblock from the underlying storage medium; it is
     * crucial to the mount process. Logically the function depends on the
     * concrete filesystem and cannot be made abstract, nor can it be stored in
     * super_operations, because the superblock object and the pointer to that
     * structure are only created after get_sb has been called.
     */
    int (*get_sb)(struct file_system_type *, int, const char *, void *, struct vfsmount *);
    // kill_sb performs cleanup when the filesystem type is no longer needed
    void (*kill_sb)(struct super_block *);
    /*
     * 1. If the filesystem represented by this file_system_type is implemented
     *    as a loadable kernel module (LKM), owner points to the module
     *    structure representing that module.
     * 2. If the filesystem is linked statically into the kernel, this field is NULL.
     * In practice, setting it to the THIS_MODULE macro does the right thing
     * automatically.
     */
    struct module *owner;
    /*
     * next links all file_system_type structures into a singly linked list
     * headed by the variable file_systems. The list is a critical resource,
     * protected by the file_systems_lock read/write spinlock.
     */
    struct file_system_type *next;
    /*
     * For every mounted filesystem an in-memory superblock structure is
     * created, holding information about the filesystem itself and its mount
     * point. Since several filesystems of the same type can be mounted (e.g.
     * home and root partitions often share a type), one filesystem type may
     * correspond to several superblock structures, collected on a doubly
     * linked list; fs_supers is its head. This field was added after Linux
     * 2.4.10.
     */
    struct list_head fs_supers;
    struct lock_class_key s_lock_key;
    struct lock_class_key s_umount_key;
    struct lock_class_key i_lock_key;
    struct lock_class_key i_mutex_key;
    struct lock_class_key i_mutex_dir_key;
    struct lock_class_key i_alloc_sem_key;
};
Relevant Link:
http://oss.org.cn/kernel-book/ch08/8.4.1.htm
5. 內核安全相關數據結構
0x1: struct security_operations
This is a table of hook-function pointers; each member is an LSM security hook (SELinux being the best-known user). In kernels 2.6 and later, most security-sensitive system-call paths invoke the corresponding hook in this structure, which lets SELinux enforce access control at the level of the code execution flow.

The structure contains sub-groups of hooks organized by kernel object or kernel subsystem, plus some top-level hooks for system operations. Calls to the hook functions are easy to find in the kernel source: they are prefixed with security_ops->xxxx.
struct security_operations { char name[SECURITY_NAME_MAX + 1]; int (*ptrace_access_check) (struct task_struct *child, unsigned int mode); int (*ptrace_traceme) (struct task_struct *parent); int (*capget) (struct task_struct *target, kernel_cap_t *effective, kernel_cap_t *inheritable, kernel_cap_t *permitted); int (*capset) (struct cred *new, const struct cred *old, const kernel_cap_t *effective, const kernel_cap_t *inheritable, const kernel_cap_t *permitted); int (*capable) (struct task_struct *tsk, const struct cred *cred, int cap, int audit); int (*acct) (struct file *file); int (*sysctl) (struct ctl_table *table, int op); int (*quotactl) (int cmds, int type, int id, struct super_block *sb); int (*quota_on) (struct dentry *dentry); int (*syslog) (int type); int (*settime) (struct timespec *ts, struct timezone *tz); int (*vm_enough_memory) (struct mm_struct *mm, long pages); int (*bprm_set_creds) (struct linux_binprm *bprm); int (*bprm_check_security) (struct linux_binprm *bprm); int (*bprm_secureexec) (struct linux_binprm *bprm); void (*bprm_committing_creds) (struct linux_binprm *bprm); void (*bprm_committed_creds) (struct linux_binprm *bprm); int (*sb_alloc_security) (struct super_block *sb); void (*sb_free_security) (struct super_block *sb); int (*sb_copy_data) (char *orig, char *copy); int (*sb_kern_mount) (struct super_block *sb, int flags, void *data); int (*sb_show_options) (struct seq_file *m, struct super_block *sb); int (*sb_statfs) (struct dentry *dentry); int (*sb_mount) (char *dev_name, struct path *path, char *type, unsigned long flags, void *data); int (*sb_check_sb) (struct vfsmount *mnt, struct path *path); int (*sb_umount) (struct vfsmount *mnt, int flags); void (*sb_umount_close) (struct vfsmount *mnt); void (*sb_umount_busy) (struct vfsmount *mnt); void (*sb_post_remount) (struct vfsmount *mnt, unsigned long flags, void *data); void (*sb_post_addmount) (struct vfsmount *mnt, struct path *mountpoint); int (*sb_pivotroot) (struct path *old_path, 
struct path *new_path); void (*sb_post_pivotroot) (struct path *old_path, struct path *new_path); int (*sb_set_mnt_opts) (struct super_block *sb, struct security_mnt_opts *opts); void (*sb_clone_mnt_opts) (const struct super_block *oldsb, struct super_block *newsb); int (*sb_parse_opts_str) (char *options, struct security_mnt_opts *opts); #ifdef CONFIG_SECURITY_PATH int (*path_unlink) (struct path *dir, struct dentry *dentry); int (*path_mkdir) (struct path *dir, struct dentry *dentry, int mode); int (*path_rmdir) (struct path *dir, struct dentry *dentry); int (*path_mknod) (struct path *dir, struct dentry *dentry, int mode, unsigned int dev); int (*path_truncate) (struct path *path, loff_t length, unsigned int time_attrs); int (*path_symlink) (struct path *dir, struct dentry *dentry, const char *old_name); int (*path_link) (struct dentry *old_dentry, struct path *new_dir, struct dentry *new_dentry); int (*path_rename) (struct path *old_dir, struct dentry *old_dentry, struct path *new_dir, struct dentry *new_dentry); #endif int (*inode_alloc_security) (struct inode *inode); void (*inode_free_security) (struct inode *inode); int (*inode_init_security) (struct inode *inode, struct inode *dir, char **name, void **value, size_t *len); int (*inode_create) (struct inode *dir, struct dentry *dentry, int mode); int (*inode_link) (struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry); int (*inode_unlink) (struct inode *dir, struct dentry *dentry); int (*inode_symlink) (struct inode *dir, struct dentry *dentry, const char *old_name); int (*inode_mkdir) (struct inode *dir, struct dentry *dentry, int mode); int (*inode_rmdir) (struct inode *dir, struct dentry *dentry); int (*inode_mknod) (struct inode *dir, struct dentry *dentry, int mode, dev_t dev); int (*inode_rename) (struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry); int (*inode_readlink) (struct dentry *dentry); int (*inode_follow_link) (struct dentry 
*dentry, struct nameidata *nd); int (*inode_permission) (struct inode *inode, int mask); int (*inode_setattr) (struct dentry *dentry, struct iattr *attr); int (*inode_getattr) (struct vfsmount *mnt, struct dentry *dentry); void (*inode_delete) (struct inode *inode); int (*inode_setxattr) (struct dentry *dentry, const char *name, const void *value, size_t size, int flags); void (*inode_post_setxattr) (struct dentry *dentry, const char *name, const void *value, size_t size, int flags); int (*inode_getxattr) (struct dentry *dentry, const char *name); int (*inode_listxattr) (struct dentry *dentry); int (*inode_removexattr) (struct dentry *dentry, const char *name); int (*inode_need_killpriv) (struct dentry *dentry); int (*inode_killpriv) (struct dentry *dentry); int (*inode_getsecurity) (const struct inode *inode, const char *name, void **buffer, bool alloc); int (*inode_setsecurity) (struct inode *inode, const char *name, const void *value, size_t size, int flags); int (*inode_listsecurity) (struct inode *inode, char *buffer, size_t buffer_size); void (*inode_getsecid) (const struct inode *inode, u32 *secid); int (*file_permission) (struct file *file, int mask); int (*file_alloc_security) (struct file *file); void (*file_free_security) (struct file *file); int (*file_ioctl) (struct file *file, unsigned int cmd, unsigned long arg); int (*file_mmap) (struct file *file, unsigned long reqprot, unsigned long prot, unsigned long flags, unsigned long addr, unsigned long addr_only); int (*file_mprotect) (struct vm_area_struct *vma, unsigned long reqprot, unsigned long prot); int (*file_lock) (struct file *file, unsigned int cmd); int (*file_fcntl) (struct file *file, unsigned int cmd, unsigned long arg); int (*file_set_fowner) (struct file *file); int (*file_send_sigiotask) (struct task_struct *tsk, struct fown_struct *fown, int sig); int (*file_receive) (struct file *file); int (*dentry_open) (struct file *file, const struct cred *cred); int (*task_create) (unsigned long 
clone_flags); int (*cred_alloc_blank) (struct cred *cred, gfp_t gfp); void (*cred_free) (struct cred *cred); int (*cred_prepare)(struct cred *new, const struct cred *old, gfp_t gfp); void (*cred_commit)(struct cred *new, const struct cred *old); void (*cred_transfer)(struct cred *new, const struct cred *old); int (*kernel_act_as)(struct cred *new, u32 secid); int (*kernel_create_files_as)(struct cred *new, struct inode *inode); int (*kernel_module_request)(void); int (*task_setuid) (uid_t id0, uid_t id1, uid_t id2, int flags); int (*task_fix_setuid) (struct cred *new, const struct cred *old, int flags); int (*task_setgid) (gid_t id0, gid_t id1, gid_t id2, int flags); int (*task_setpgid) (struct task_struct *p, pid_t pgid); int (*task_getpgid) (struct task_struct *p); int (*task_getsid) (struct task_struct *p); void (*task_getsecid) (struct task_struct *p, u32 *secid); int (*task_setgroups) (struct group_info *group_info); int (*task_setnice) (struct task_struct *p, int nice); int (*task_setioprio) (struct task_struct *p, int ioprio); int (*task_getioprio) (struct task_struct *p); int (*task_setrlimit) (unsigned int resource, struct rlimit *new_rlim); int (*task_setscheduler) (struct task_struct *p, int policy, struct sched_param *lp); int (*task_getscheduler) (struct task_struct *p); int (*task_movememory) (struct task_struct *p); int (*task_kill) (struct task_struct *p, struct siginfo *info, int sig, u32 secid); int (*task_wait) (struct task_struct *p); int (*task_prctl) (int option, unsigned long arg2, unsigned long arg3, unsigned long arg4, unsigned long arg5); void (*task_to_inode) (struct task_struct *p, struct inode *inode); int (*ipc_permission) (struct kern_ipc_perm *ipcp, short flag); void (*ipc_getsecid) (struct kern_ipc_perm *ipcp, u32 *secid); int (*msg_msg_alloc_security) (struct msg_msg *msg); void (*msg_msg_free_security) (struct msg_msg *msg); int (*msg_queue_alloc_security) (struct msg_queue *msq); void (*msg_queue_free_security) (struct msg_queue 
*msq); int (*msg_queue_associate) (struct msg_queue *msq, int msqflg); int (*msg_queue_msgctl) (struct msg_queue *msq, int cmd); int (*msg_queue_msgsnd) (struct msg_queue *msq, struct msg_msg *msg, int msqflg); int (*msg_queue_msgrcv) (struct msg_queue *msq, struct msg_msg *msg, struct task_struct *target, long type, int mode); int (*shm_alloc_security) (struct shmid_kernel *shp); void (*shm_free_security) (struct shmid_kernel *shp); int (*shm_associate) (struct shmid_kernel *shp, int shmflg); int (*shm_shmctl) (struct shmid_kernel *shp, int cmd); int (*shm_shmat) (struct shmid_kernel *shp, char __user *shmaddr, int shmflg); int (*sem_alloc_security) (struct sem_array *sma); void (*sem_free_security) (struct sem_array *sma); int (*sem_associate) (struct sem_array *sma, int semflg); int (*sem_semctl) (struct sem_array *sma, int cmd); int (*sem_semop) (struct sem_array *sma, struct sembuf *sops, unsigned nsops, int alter); int (*netlink_send) (struct sock *sk, struct sk_buff *skb); int (*netlink_recv) (struct sk_buff *skb, int cap); void (*d_instantiate) (struct dentry *dentry, struct inode *inode); int (*getprocattr) (struct task_struct *p, char *name, char **value); int (*setprocattr) (struct task_struct *p, char *name, void *value, size_t size); int (*secid_to_secctx) (u32 secid, char **secdata, u32 *seclen); int (*secctx_to_secid) (const char *secdata, u32 seclen, u32 *secid); void (*release_secctx) (char *secdata, u32 seclen); int (*inode_notifysecctx)(struct inode *inode, void *ctx, u32 ctxlen); int (*inode_setsecctx)(struct dentry *dentry, void *ctx, u32 ctxlen); int (*inode_getsecctx)(struct inode *inode, void **ctx, u32 *ctxlen); #ifdef CONFIG_SECURITY_NETWORK int (*unix_stream_connect) (struct socket *sock, struct socket *other, struct sock *newsk); int (*unix_may_send) (struct socket *sock, struct socket *other); int (*socket_create) (int family, int type, int protocol, int kern); int (*socket_post_create) (struct socket *sock, int family, int type, int 
protocol, int kern); int (*socket_bind) (struct socket *sock, struct sockaddr *address, int addrlen); int (*socket_connect) (struct socket *sock, struct sockaddr *address, int addrlen); int (*socket_listen) (struct socket *sock, int backlog); int (*socket_accept) (struct socket *sock, struct socket *newsock); int (*socket_sendmsg) (struct socket *sock, struct msghdr *msg, int size); int (*socket_recvmsg) (struct socket *sock, struct msghdr *msg, int size, int flags); int (*socket_getsockname) (struct socket *sock); int (*socket_getpeername) (struct socket *sock); int (*socket_getsockopt) (struct socket *sock, int level, int optname); int (*socket_setsockopt) (struct socket *sock, int level, int optname); int (*socket_shutdown) (struct socket *sock, int how); int (*socket_sock_rcv_skb) (struct sock *sk, struct sk_buff *skb); int (*socket_getpeersec_stream) (struct socket *sock, char __user *optval, int __user *optlen, unsigned len); int (*socket_getpeersec_dgram) (struct socket *sock, struct sk_buff *skb, u32 *secid); int (*sk_alloc_security) (struct sock *sk, int family, gfp_t priority); void (*sk_free_security) (struct sock *sk); void (*sk_clone_security) (const struct sock *sk, struct sock *newsk); void (*sk_getsecid) (struct sock *sk, u32 *secid); void (*sock_graft) (struct sock *sk, struct socket *parent); int (*inet_conn_request) (struct sock *sk, struct sk_buff *skb, struct request_sock *req); void (*inet_csk_clone) (struct sock *newsk, const struct request_sock *req); void (*inet_conn_established) (struct sock *sk, struct sk_buff *skb); void (*req_classify_flow) (const struct request_sock *req, struct flowi *fl); int (*tun_dev_create)(void); void (*tun_dev_post_create)(struct sock *sk); int (*tun_dev_attach)(struct sock *sk); #endif /* CONFIG_SECURITY_NETWORK */ #ifdef CONFIG_SECURITY_NETWORK_XFRM int (*xfrm_policy_alloc_security) (struct xfrm_sec_ctx **ctxp, struct xfrm_user_sec_ctx *sec_ctx); int (*xfrm_policy_clone_security) (struct xfrm_sec_ctx *old_ctx, 
struct xfrm_sec_ctx **new_ctx); void (*xfrm_policy_free_security) (struct xfrm_sec_ctx *ctx); int (*xfrm_policy_delete_security) (struct xfrm_sec_ctx *ctx); int (*xfrm_state_alloc_security) (struct xfrm_state *x, struct xfrm_user_sec_ctx *sec_ctx, u32 secid); void (*xfrm_state_free_security) (struct xfrm_state *x); int (*xfrm_state_delete_security) (struct xfrm_state *x); int (*xfrm_policy_lookup) (struct xfrm_sec_ctx *ctx, u32 fl_secid, u8 dir); int (*xfrm_state_pol_flow_match) (struct xfrm_state *x, struct xfrm_policy *xp, struct flowi *fl); int (*xfrm_decode_session) (struct sk_buff *skb, u32 *secid, int ckall); #endif /* CONFIG_SECURITY_NETWORK_XFRM */ /* key management security hooks */ #ifdef CONFIG_KEYS int (*key_alloc) (struct key *key, const struct cred *cred, unsigned long flags); void (*key_free) (struct key *key); int (*key_permission) (key_ref_t key_ref, const struct cred *cred, key_perm_t perm); int (*key_getsecurity)(struct key *key, char **_buffer); int (*key_session_to_parent)(const struct cred *cred, const struct cred *parent_cred, struct key *key); #endif /* CONFIG_KEYS */ #ifdef CONFIG_AUDIT int (*audit_rule_init) (u32 field, u32 op, char *rulestr, void **lsmrule); int (*audit_rule_known) (struct audit_krule *krule); int (*audit_rule_match) (u32 secid, u32 field, u32 op, void *lsmrule, struct audit_context *actx); void (*audit_rule_free) (void *lsmrule); #endif /* CONFIG_AUDIT */ };
Relevant Link:
http://www.hep.by/gnu/kernel/lsm/framework.html http://blog.sina.com.cn/s/blog_858820890101eb3c.html http://mirror.linux.org.au/linux-mandocs/2.6.4-cset-20040312_2111/security_operations.html
0x2: struct kprobe
The basic structure describing a single probe point:
struct kprobe {
    // entry in the global hash table of kprobes, keyed by the probed address
    struct hlist_node hlist;
    /* list of kprobes for multi-handler support */
    // when several probe functions attach to the same probe point, they all
    // hang off this list
    struct list_head list;
    /* count the number of times this probe was temporarily disarmed */
    unsigned long nmissed;
    /* location of the probe point */
    // the probed target address; note that only one of addr and symbol_name
    // may be filled in -- setting both makes registration fail with an
    // invalid-symbol error
    kprobe_opcode_t *addr;
    /* Allow user to indicate symbol name of the probe point */
    // symbol_name lets the user give a function name instead of a concrete
    // address; the kernel resolves it with kallsyms_lookup_name()
    const char *symbol_name;
    /* Offset into the symbol */
    // if the probed point is an instruction inside a function, it is addressed
    // as addr + offset; this also shows that kprobes can hook anywhere in the kernel
    unsigned int offset;
    /* Called before addr is executed. */
    kprobe_pre_handler_t pre_handler;
    /* Called after addr is executed, unless... */
    kprobe_post_handler_t post_handler;
    /* ...called if executing addr causes a fault (eg. page fault).
       Return 1 if it handled fault, otherwise kernel will see it. */
    kprobe_fault_handler_t fault_handler;
    /* called if breakpoint trap occurs in probe handler.
       Return 1 if it handled break, otherwise kernel will see it. */
    kprobe_break_handler_t break_handler;
    /* opcode and ainsn save the replaced instruction bytes */
    /* Saved opcode (which has been replaced with breakpoint) */
    kprobe_opcode_t opcode;
    /* copy of the original instruction */
    struct arch_specific_insn ainsn;
    /* Indicates various status flags.
       Protected by kprobe_mutex after this kprobe is registered. */
    u32 flags;
};
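A minimal registration sketch, as a kernel module (not userspace-runnable; the probed symbol name is illustrative and must be a symbol that exists in your kernel version):

```c
#include <linux/module.h>
#include <linux/kprobes.h>

/* pre_handler: runs just before the probed instruction executes */
static int demo_pre(struct kprobe *p, struct pt_regs *regs)
{
    printk(KERN_INFO "kprobe hit: %s\n", p->symbol_name);
    return 0;
}

static struct kprobe kp = {
    .symbol_name = "do_fork",   /* illustrative; pick a valid symbol */
    .pre_handler = demo_pre,    /* leave addr unset: only one of addr/symbol_name */
};

static int __init kp_demo_init(void)
{
    return register_kprobe(&kp);
}

static void __exit kp_demo_exit(void)
{
    unregister_kprobe(&kp);
}

module_init(kp_demo_init);
module_exit(kp_demo_exit);
MODULE_LICENSE("GPL");
```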
0x3: struct jprobe
As noted earlier, jprobe is a functional wrapper around kprobes, which is also apparent from the data structure:
struct jprobe {
    struct kprobe kp;
    /*
     * the probe handler. Note that:
     * 1. the registered handler should have the same parameter list as the probed function
     * 2. the function pointer must be cast with (kprobe_opcode_t *) when assigned
     */
    void *entry;
};
0x4: struct kretprobe
This structure is passed when registering a kretprobe (register_kretprobe):
struct kretprobe {
    struct kprobe kp;
    // return handler, called when the probed function returns
    kretprobe_handler_t handler;
    // optional entry handler, analogous to pre_handler() in kprobes
    kretprobe_handler_t entry_handler;
    // maxactive is the maximum number of handler instances that may run
    // concurrently; it should be set appropriately, or some invocations of
    // the probe point may be missed
    int maxactive;
    int nmissed;
    // how much memory to reserve per instance for the handler's private data
    size_t data_size;
    struct hlist_head free_instances;
    raw_spinlock_t lock;
};
0x5: struct kretprobe_instance
In a kretprobe's registered handler (.handler) we receive this structure:
struct kretprobe_instance {
    struct hlist_node hlist;
    // points back to the owning kretprobe (the one passed to register_kretprobe)
    struct kretprobe *rp;
    // the saved return address
    kprobe_opcode_t *ret_addr;
    // the task_struct this instance belongs to
    struct task_struct *task;
    char data[0];
};
0x6: struct kretprobe_blackpoint 、struct kprobe_blacklist_entry
struct kretprobe_blackpoint {
    const char *name;
    void *addr;
};

struct kprobe_blacklist_entry {
    struct list_head list;
    unsigned long start_addr;
    unsigned long end_addr;
};
0x7: struct linux_binprm
In the Linux kernel, each binary format is represented by a struct linux_binfmt. The formats Linux supports are:
1. flat_format: the flat format, used on embedded CPUs without a memory management unit (MMU); to save space, the data in the executable can additionally be compressed (if the kernel provides zlib support)
2. script_format: a pseudo-format for running scripts that use the #! mechanism; the kernel inspects the first line of the file, determines the interpreter, and starts the appropriate application (e.g. #!/usr/bin/perl starts perl)
3. misc_format: a pseudo-format for starting applications that need an external interpreter; unlike the #! mechanism, the interpreter need not be named explicitly but can be inferred from particular file identifiers (suffix, file header, ...); this format is used, for example, to run Java bytecode or Windows programs under Wine
4. elf_format: a machine- and architecture-independent format usable on 32- and 64-bit systems; it is the standard format on Linux
5. elf_fdpic_format: an ELF variant providing special features for systems without an MMU
6. irix_format: an ELF variant providing Irix-specific features
7. som_format: used on PA-RISC machines, specific to HP-UX
8. aout_format: a.out was the standard Linux format before ELF was introduced
/source/include/linux/binfmts.h
/*
 * This structure is used to hold the arguments that are used when loading binaries.
 */
struct linux_binprm {
    // the first 128 bytes of the executable file
    char buf[BINPRM_BUF_SIZE];
#ifdef CONFIG_MMU
    struct vm_area_struct *vma;
    unsigned long vma_pages;
#else
# define MAX_ARG_PAGES 32
    struct page *page[MAX_ARG_PAGES];
#endif
    struct mm_struct *mm;
    /* current top of mem */
    unsigned long p;
    unsigned int cred_prepared:1, /* true if creds already prepared (multiple
                                   * preps happen for interpreters) */
                 cap_effective:1; /* true if has elevated effective capabilities,
                                   * false if not; except for init which inherits
                                   * its parent's caps anyway */
#ifdef __alpha__
    unsigned int taso:1;
#endif
    unsigned int recursion_depth;
    // the file to execute
    struct file *file;
    // new credentials
    struct cred *cred;
    int unsafe;             /* how unsafe this exec is (mask of LSM_UNSAFE_*) */
    unsigned int per_clear; /* bits to clear in current->personality */
    // number of command-line arguments and environment variables
    int argc, envc;
    /* name of the binary, as seen by procps */
    char *filename;
    /* name of the binary really executed; most of the time the same as
       filename, but it can differ for binfmt_{misc,script} */
    char *interp;
    unsigned interp_flags;
    unsigned interp_data;
    unsigned long loader, exec;
};
0x8: struct linux_binfmt
/source/include/linux/binfmts.h
/*
 * This structure defines the functions that are used to load the binary formats that
 * linux accepts.
 */
struct linux_binfmt {
    // list linkage
    struct list_head lh;
    struct module *module;
    // load the binary itself
    int (*load_binary)(struct linux_binprm *, struct pt_regs *regs);
    // load a shared library
    int (*load_shlib)(struct file *);
    int (*core_dump)(long signr, struct pt_regs *regs, struct file *file, unsigned long limit);
    unsigned long min_coredump; /* minimal dump size */
    int hasvdso;
};
6. Data structures related to system network state
0x1: struct ifconf
\linux-2.6.32.63\include\linux\if.h
/* Structure used in SIOCGIFCONF request.
   Used to retrieve interface configuration for machine
   (useful for programs which must know all networks accessible). */
struct ifconf {
    int ifc_len;                        /* Size of buffer. */
    union {
        __caddr_t ifcu_buf;
        /* array of structures holding the details of each interface */
        struct ifreq *ifcu_req;
    } ifc_ifcu;
};
#define ifc_buf ifc_ifcu.ifcu_buf       /* Buffer address. */
#define ifc_req ifc_ifcu.ifcu_req       /* Array of structures. */
#define _IOT_ifconf _IOT(_IOTS(struct ifconf),1,0,0,0,0)    /* not right */
0x2: struct ifreq
\linux-2.6.32.63\include\linux\if.h
/* * Interface request structure used for socket * ioctl's. All interface ioctl's must have parameter * definitions which begin with ifr_name. The * remainder may be interface specific. */ struct ifreq { #define IFHWADDRLEN 6 union { char ifrn_name[IFNAMSIZ]; /* if name, e.g. "en0" */ } ifr_ifrn; //描述套接口的地址結構 union { struct sockaddr ifru_addr; struct sockaddr ifru_dstaddr; struct sockaddr ifru_broadaddr; struct sockaddr ifru_netmask; struct sockaddr ifru_hwaddr; short ifru_flags; int ifru_ivalue; int ifru_mtu; struct ifmap ifru_map; char ifru_slave[IFNAMSIZ]; /* Just fits the size */ char ifru_newname[IFNAMSIZ]; void __user * ifru_data; struct if_settings ifru_settings; } ifr_ifru; }; #define ifr_name ifr_ifrn.ifrn_name /* interface name */ #define ifr_hwaddr ifr_ifru.ifru_hwaddr /* MAC address */ #define ifr_addr ifr_ifru.ifru_addr /* address */ #define ifr_dstaddr ifr_ifru.ifru_dstaddr /* other end of p-p lnk */ #define ifr_broadaddr ifr_ifru.ifru_broadaddr /* broadcast address */ #define ifr_netmask ifr_ifru.ifru_netmask /* interface net mask */ #define ifr_flags ifr_ifru.ifru_flags /* flags */ #define ifr_metric ifr_ifru.ifru_ivalue /* metric */ #define ifr_mtu ifr_ifru.ifru_mtu /* mtu */ #define ifr_map ifr_ifru.ifru_map /* device map */ #define ifr_slave ifr_ifru.ifru_slave /* slave device */ #define ifr_data ifr_ifru.ifru_data /* for use by interface */ #define ifr_ifindex ifr_ifru.ifru_ivalue /* interface index */ #define ifr_bandwidth ifr_ifru.ifru_ivalue /* link bandwidth */ #define ifr_qlen ifr_ifru.ifru_ivalue /* Queue length */ #define ifr_newname ifr_ifru.ifru_newname /* New name */ #define ifr_settings ifr_ifru.ifru_settings /* Device/proto settings*/
The following user-space example uses SIOCGIFCONF together with the per-interface SIOCGIFxxx ioctls to enumerate all network interfaces and print their status, addresses and MAC:
#include <arpa/inet.h>
#include <net/if.h>
#include <net/if_arp.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAXINTERFACES 16                /* maximum number of interfaces */

int fd;                                 /* socket */
int if_len;                             /* number of interfaces */
struct ifreq buf[MAXINTERFACES];        /* array of ifreq structures */
struct ifconf ifc;                      /* ifconf structure */

int main(void)
{
    /* create an IPv4 UDP socket fd */
    if ((fd = socket(AF_INET, SOCK_DGRAM, 0)) == -1)
    {
        perror("socket(AF_INET, SOCK_DGRAM, 0)");
        return -1;
    }

    /* initialize the ifconf structure */
    ifc.ifc_len = sizeof(buf);
    ifc.ifc_buf = (caddr_t) buf;

    /* fetch the interface list */
    if (ioctl(fd, SIOCGIFCONF, (char *) &ifc) == -1)
    {
        perror("SIOCGIFCONF ioctl");
        return -1;
    }

    if_len = ifc.ifc_len / sizeof(struct ifreq);    /* number of interfaces */
    printf("number of interfaces: %d\n\n", if_len);

    while (if_len-- > 0)        /* iterate over every interface */
    {
        printf("interface: %s\n", buf[if_len].ifr_name);    /* interface name */

        /* interface flags */
        if (!(ioctl(fd, SIOCGIFFLAGS, (char *) &buf[if_len])))
        {
            /* interface status */
            if (buf[if_len].ifr_flags & IFF_UP)
                printf("interface status: UP\n");
            else
                printf("interface status: DOWN\n");
        }
        else
        {
            char str[256];
            sprintf(str, "SIOCGIFFLAGS ioctl %s", buf[if_len].ifr_name);
            perror(str);
        }

        /* IP address */
        if (!(ioctl(fd, SIOCGIFADDR, (char *) &buf[if_len])))
        {
            printf("IP address: %s\n",
                   inet_ntoa(((struct sockaddr_in *) &buf[if_len].ifr_addr)->sin_addr));
        }
        else
        {
            char str[256];
            sprintf(str, "SIOCGIFADDR ioctl %s", buf[if_len].ifr_name);
            perror(str);
        }

        /* netmask */
        if (!(ioctl(fd, SIOCGIFNETMASK, (char *) &buf[if_len])))
        {
            printf("netmask: %s\n",
                   inet_ntoa(((struct sockaddr_in *) &buf[if_len].ifr_netmask)->sin_addr));
        }
        else
        {
            char str[256];
            sprintf(str, "SIOCGIFNETMASK ioctl %s", buf[if_len].ifr_name);
            perror(str);
        }

        /* broadcast address */
        if (!(ioctl(fd, SIOCGIFBRDADDR, (char *) &buf[if_len])))
        {
            printf("broadcast address: %s\n",
                   inet_ntoa(((struct sockaddr_in *) &buf[if_len].ifr_broadaddr)->sin_addr));
        }
        else
        {
            char str[256];
            sprintf(str, "SIOCGIFBRDADDR ioctl %s", buf[if_len].ifr_name);
            perror(str);
        }

        /* MAC address */
        if (!(ioctl(fd, SIOCGIFHWADDR, (char *) &buf[if_len])))
        {
            printf("MAC address: %02x:%02x:%02x:%02x:%02x:%02x\n\n",
                   (unsigned char) buf[if_len].ifr_hwaddr.sa_data[0],
                   (unsigned char) buf[if_len].ifr_hwaddr.sa_data[1],
                   (unsigned char) buf[if_len].ifr_hwaddr.sa_data[2],
                   (unsigned char) buf[if_len].ifr_hwaddr.sa_data[3],
                   (unsigned char) buf[if_len].ifr_hwaddr.sa_data[4],
                   (unsigned char) buf[if_len].ifr_hwaddr.sa_data[5]);
        }
        else
        {
            char str[256];
            sprintf(str, "SIOCGIFHWADDR ioctl %s", buf[if_len].ifr_name);
            perror(str);
        }
    } /* while end */

    /* close the socket */
    close(fd);
    return 0;
}
Relevant Link:
http://blog.csdn.net/jk110333/article/details/8832077 http://www.360doc.com/content/12/0314/15/5782959_194281431.shtml
0x3: struct socket
\linux-2.6.32.63\include\linux\net.h
struct socket { /* 1. state:socket狀態 typedef enum { SS_FREE = 0, //該socket還未分配 SS_UNCONNECTED, //未連向任何socket SS_CONNECTING, //正在鏈接過程當中 SS_CONNECTED, //已連向一個socket SS_DISCONNECTING //正在斷開鏈接的過程當中 }socket_state; */ socket_state state; kmemcheck_bitfield_begin(type); /* 2. type:socket類型 enum sock_type { SOCK_STREAM = 1, //stream (connection) socket SOCK_DGRAM = 2, //datagram (conn.less) socket SOCK_RAW = 3, //raw socket SOCK_RDM = 4, //reliably-delivered message SOCK_SEQPACKET = 5,//sequential packet socket SOCK_DCCP = 6, //Datagram Congestion Control Protocol socket SOCK_PACKET = 10, //linux specific way of getting packets at the dev level. }; */ short type; kmemcheck_bitfield_end(type); /* 3. flags:socket標誌 1) #define SOCK_ASYNC_NOSPACE 0 2) #define SOCK_ASYNC_WAITDATA 1 3) #define SOCK_NOSPACE 2 4) #define SOCK_PASSCRED 3 5) #define SOCK_PASSSEC 4 */ unsigned long flags; //fasync_list is used when processes have chosen asynchronous handling of this 'file' struct fasync_struct *fasync_list; //4. Not used by sockets in AF_INET wait_queue_head_t wait; //5. file holds a reference to the primary file structure associated with this socket struct file *file; /* 6. sock This is very important, as it contains most of the useful state associated with a socket. */ struct sock *sk; //7. ops:定義了當前socket的處理函數 const struct proto_ops *ops; };
0x4: struct sock
struct sock by itself does not yield the socket's IP address and port; it has to be converted with inet_sk() into a struct inet_sock before that information is available. struct sock does, however, hold a large amount of metadata describing the current socket.
\linux-2.6.32.63\include\net\sock.h
struct sock { /* * Now struct inet_timewait_sock also uses sock_common, so please just * don't add nothing before this first member (__sk_common) --acme */ //shared layout with inet_timewait_sock struct sock_common __sk_common; #define sk_node __sk_common.skc_node #define sk_nulls_node __sk_common.skc_nulls_node #define sk_refcnt __sk_common.skc_refcnt #define sk_copy_start __sk_common.skc_hash #define sk_hash __sk_common.skc_hash #define sk_family __sk_common.skc_family #define sk_state __sk_common.skc_state #define sk_reuse __sk_common.skc_reuse #define sk_bound_dev_if __sk_common.skc_bound_dev_if #define sk_bind_node __sk_common.skc_bind_node #define sk_prot __sk_common.skc_prot #define sk_net __sk_common.skc_net kmemcheck_bitfield_begin(flags); //mask of %SEND_SHUTDOWN and/or %RCV_SHUTDOWN unsigned int sk_shutdown : 2, //%SO_NO_CHECK setting, wether or not checkup packets sk_no_check : 2, //%SO_SNDBUF and %SO_RCVBUF settings sk_userlocks : 4, //which protocol this socket belongs in this network family sk_protocol : 8, //socket type (%SOCK_STREAM, etc) sk_type : 16; kmemcheck_bitfield_end(flags); //size of receive buffer in bytes int sk_rcvbuf; //synchronizer socket_lock_t sk_lock; /* * The backlog queue is special, it is always used with * the per-socket spinlock held and requires low latency * access. Therefore we special case it's implementation. 
*/ struct { struct sk_buff *head; struct sk_buff *tail; } sk_backlog; //sock wait queue wait_queue_head_t *sk_sleep; //destination cache struct dst_entry *sk_dst_cache; #ifdef CONFIG_XFRM //flow policy struct xfrm_policy *sk_policy[2]; #endif //destination cache lock rwlock_t sk_dst_lock; //receive queue bytes committed atomic_t sk_rmem_alloc; //transmit queue bytes committed atomic_t sk_wmem_alloc; //"o" is "option" or "other" atomic_t sk_omem_alloc; //size of send buffer in bytes int sk_sndbuf; //incoming packets struct sk_buff_head sk_receive_queue; //Packet sending queue struct sk_buff_head sk_write_queue; #ifdef CONFIG_NET_DMA //DMA copied packets struct sk_buff_head sk_async_wait_queue; #endif //persistent queue size int sk_wmem_queued; //space allocated forward int sk_forward_alloc; //allocation mode gfp_t sk_allocation; //route capabilities (e.g. %NETIF_F_TSO) int sk_route_caps; //GSO type (e.g. %SKB_GSO_TCPV4) int sk_gso_type; //Maximum GSO segment size to build unsigned int sk_gso_max_size; //%SO_RCVLOWAT setting int sk_rcvlowat; /* 1. %SO_LINGER (l_onoff) 2. %SO_BROADCAST 3. %SO_KEEPALIVE 4. %SO_OOBINLINE settings 5. 
%SO_TIMESTAMPING settings */ unsigned long sk_flags; //%SO_LINGER l_linger setting unsigned long sk_lingertime; //rarely used struct sk_buff_head sk_error_queue; //sk_prot of original sock creator (see ipv6_setsockopt, IPV6_ADDRFORM for instance) struct proto *sk_prot_creator; //used with the callbacks in the end of this struct rwlock_t sk_callback_lock; //last error int sk_err, //rrors that don't cause failure but are the cause of a persistent failure not just 'timed out' sk_err_soft; //raw/udp drops counter atomic_t sk_drops; //always used with the per-socket spinlock held //current listen backlog unsigned short sk_ack_backlog; //listen backlog set in listen() unsigned short sk_max_ack_backlog; //%SO_PRIORITY setting __u32 sk_priority; //%SO_PEERCRED setting struct ucred sk_peercred; //%SO_RCVTIMEO setting long sk_rcvtimeo; //%SO_SNDTIMEO setting long sk_sndtimeo; //socket filtering instructions struct sk_filter *sk_filter; //private area, net family specific, when not using slab void *sk_protinfo; //sock cleanup timer struct timer_list sk_timer; //time stamp of last packet received ktime_t sk_stamp; //Identd and reporting IO signals struct socket *sk_socket; //RPC layer private data void *sk_user_data; //cached page for sendmsg struct page *sk_sndmsg_page; //front of stuff to transmit struct sk_buff *sk_send_head; //cached offset for sendmsg __u32 sk_sndmsg_off; //a write to stream socket waits to start int sk_write_pending; #ifdef CONFIG_SECURITY //used by security modules void *sk_security; #endif //generic packet mark __u32 sk_mark; /* XXX 4 bytes hole on 64 bit */ //callback to indicate change in the state of the sock void (*sk_state_change)(struct sock *sk); //callback to indicate there is data to be processed void (*sk_data_ready)(struct sock *sk, int bytes); //callback to indicate there is bf sending space available void (*sk_write_space)(struct sock *sk); //callback to indicate errors (e.g. 
%MSG_ERRQUEUE) void (*sk_error_report)(struct sock *sk); //callback to process the backlog int (*sk_backlog_rcv)(struct sock *sk, struct sk_buff *skb); //called at sock freeing time, i.e. when all refcnt == 0 void (*sk_destruct)(struct sock *sk); }
0x5: struct proto_ops
\linux-2.6.32.63\include\linux\net.h
struct proto_ops { int family; struct module *owner; int (*release) (struct socket *sock); int (*bind) (struct socket *sock, struct sockaddr *myaddr, int sockaddr_len); int (*connect) (struct socket *sock, struct sockaddr *vaddr, int sockaddr_len, int flags); int (*socketpair)(struct socket *sock1, struct socket *sock2); int (*accept) (struct socket *sock, struct socket *newsock, int flags); int (*getname) (struct socket *sock, struct sockaddr *addr, int *sockaddr_len, int peer); unsigned int (*poll) (struct file *file, struct socket *sock, struct poll_table_struct *wait); int (*ioctl) (struct socket *sock, unsigned int cmd, unsigned long arg); int (*compat_ioctl) (struct socket *sock, unsigned int cmd, unsigned long arg); int (*listen) (struct socket *sock, int len); int (*shutdown) (struct socket *sock, int flags); int (*setsockopt)(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen); int (*getsockopt)(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen); int (*compat_setsockopt)(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen); int (*compat_getsockopt)(struct socket *sock, int level, int optname, char __user *optval, int __user *optlen); int (*sendmsg) (struct kiocb *iocb, struct socket *sock, struct msghdr *m, size_t total_len); /* Notes for implementing recvmsg: * =============================== * msg->msg_namelen should get updated by the recvmsg handlers * iff msg_name != NULL. It is by default 0 to prevent * returning uninitialized memory to user space. The recvfrom * handlers can assume that msg.msg_name is either NULL or has * a minimum size of sizeof(struct sockaddr_storage). 
*/ int (*recvmsg) (struct kiocb *iocb, struct socket *sock, struct msghdr *m, size_t total_len, int flags); int (*mmap) (struct file *file, struct socket *sock, struct vm_area_struct * vma); ssize_t (*sendpage) (struct socket *sock, struct page *page, int offset, size_t size, int flags); ssize_t (*splice_read)(struct socket *sock, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags); };
0x6: struct inet_sock
In practice, we need to use inet_sk() to cast a "struct sock" pointer to "struct inet_sock" before we can extract information such as the IP address and port from it.
\linux-2.6.32.63\include\net\inet_sock.h
static inline struct inet_sock *inet_sk(const struct sock *sk) { return (struct inet_sock *)sk; }
struct inet_sock is defined as follows
struct inet_sock { /* sk and pinet6 has to be the first two members of inet_sock */ //ancestor class struct sock sk; #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) //pointer to IPv6 control block struct ipv6_pinfo *pinet6; #endif /* Socket demultiplex comparisons on incoming packets. */ //Foreign IPv4 addr __be32 daddr; //Bound local IPv4 addr __be32 rcv_saddr; //Destination port __be16 dport; //Local port __u16 num; //Sending source __be32 saddr; //Unicast TTL __s16 uc_ttl; __u16 cmsg_flags; struct ip_options_rcu *inet_opt; //Source port __be16 sport; //ID counter for DF pkts __u16 id; //TOS __u8 tos; //Multicasting TTL __u8 mc_ttl; __u8 pmtudisc; __u8 recverr:1, //is this an inet_connection_sock? is_icsk:1, freebind:1, hdrincl:1, mc_loop:1, transparent:1, mc_all:1; //Multicast device index int mc_index; __be32 mc_addr; struct ip_mc_socklist *mc_list; //info to build ip hdr on each ip frag while socket is corked struct { unsigned int flags; unsigned int fragsize; struct ip_options *opt; struct dst_entry *dst; int length; /* Total length of all frames */ __be32 addr; struct flowi fl; } cork; };
0x7: struct sockaddr
struct sockaddr {
    /* address family, AF_xxx */
    unsigned short sa_family;
    /* 14 bytes of protocol address */
    char sa_data[14];
};

/* Structure describing an Internet (IP) socket address. */
#define __SOCK_SIZE__ 16        /* sizeof(struct sockaddr) */
struct sockaddr_in {
    /* Address family */
    sa_family_t sin_family;
    /* Port number */
    __be16 sin_port;
    /* Internet address */
    struct in_addr sin_addr;
    /* Pad to size of `struct sockaddr'. */
    unsigned char __pad[__SOCK_SIZE__ - sizeof(short int)
                        - sizeof(unsigned short int) - sizeof(struct in_addr)];
};
#define sin_zero __pad          /* for BSD UNIX comp. -FvK */

/* Internet address. */
struct in_addr {
    __be32 s_addr;
};
7. Data structures related to system memory
0x1: struct mm_struct
This is the memory descriptor owned by a process (task_struct->mm points to it); it stores the process's memory-management information
struct mm_struct { struct vm_area_struct * mmap; /* list of VMAs */ struct rb_root mm_rb; struct vm_area_struct * mmap_cache; /* last find_vma result */ unsigned long (*get_unmapped_area) (struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags); void (*unmap_area) (struct mm_struct *mm, unsigned long addr); unsigned long mmap_base; /* base of mmap area */ unsigned long task_size; /* size of task vm space */ unsigned long cached_hole_size; /* if non-zero, the largest hole below free_area_cache */ unsigned long free_area_cache; /* first hole of size cached_hole_size or larger */ pgd_t * pgd; atomic_t mm_users; /* How many users with user space? */ atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */ int map_count; /* number of VMAs */ struct rw_semaphore mmap_sem; spinlock_t page_table_lock; /* Protects page tables and some counters */ /* List of maybe swapped mm's. These are globally strung together off init_mm.mmlist, and are protected by mmlist_lock */ struct list_head mmlist; /* Special counters, in some configurations protected by the * page_table_lock, in other configurations by being atomic. */ mm_counter_t _file_rss; mm_counter_t _anon_rss; unsigned long hiwater_rss; /* High-watermark of RSS usage */ unsigned long hiwater_vm; /* High-water virtual memory usage */ unsigned long total_vm, locked_vm, shared_vm, exec_vm; unsigned long stack_vm, reserved_vm, def_flags, nr_ptes; unsigned long start_code, end_code, start_data, end_data; unsigned long start_brk, brk, start_stack; unsigned long arg_start, arg_end, env_start, env_end; unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */ struct linux_binfmt *binfmt; cpumask_t cpu_vm_mask; /* Architecture-specific MM context */ mm_context_t context; /* Swap token stuff */ /* * Last value of global fault stamp as seen by this process. 
* In other words, this value gives an indication of how long * it has been since this task got the token. * Look at mm/thrash.c */ unsigned int faultstamp; unsigned int token_priority; unsigned int last_interval; unsigned long flags; /* Must use atomic bitops to access the bits */ struct core_state *core_state; /* coredumping support */ #ifdef CONFIG_AIO spinlock_t ioctx_lock; struct hlist_head ioctx_list; #endif #ifdef CONFIG_MM_OWNER /* * "owner" points to a task that is regarded as the canonical * user/owner of this mm. All of the following must be true in * order for it to be changed: * * current == mm->owner * current->mm != mm * new_owner->mm == mm * new_owner->alloc_lock is held */ struct task_struct *owner; #endif #ifdef CONFIG_PROC_FS /* store ref to file /proc/<pid>/exe symlink points to */ struct file *exe_file; unsigned long num_exe_file_vmas; #endif #ifdef CONFIG_MMU_NOTIFIER struct mmu_notifier_mm *mmu_notifier_mm; #endif };
0x2: struct vm_area_struct
Each region of a process's virtual memory is represented by an instance of struct vm_area_struct
struct vm_area_struct { /* associated mm_struct vm_mm是一個反向指針,指向該區域所屬的mm_struct實例 */ struct mm_struct *vm_mm; /* VMA start, inclusive vm_mm內的起始地址 */ unsigned long vm_start; /* VMA end , exclusive 在vm_mm內結束地址以後的第一個字節的地址 */ unsigned long vm_end; /* list of VMA's 進程全部vm_area_struct實例的鏈表是經過vm_next實現的 各進程的虛擬內存區域鏈表,按地址排序 */ struct vm_area_struct *vm_next; /* access permissions 該虛擬內存區域的訪問權限 1) _PAGE_READ 2) _PAGE_WRITE 3) _PAGE_EXECUTE */ pgprot_t vm_page_prot; /* flags vm_flags是描述該區域的一組標誌,用於定義區域性質,這些都是在<mm.h>中聲明的預處理器常數 */ unsigned long vm_flags; struct rb_node vm_rb; /* VMA's node in the tree */ /* 對於有地址空間和後備存儲器的區域來講: shared鏈接到address_space->i_mmap優先樹 或鏈接到懸掛在優先樹結點以外、相似的一組虛擬內存區的鏈表 或鏈接到ddress_space->i_mmap_nonlinear鏈表中的虛擬內存區域 */ union { /* links to address_space->i_mmap or i_mmap_nonlinear */ struct { struct list_head list; void *parent; struct vm_area_struct *head; } vm_set; struct prio_tree_node prio_tree_node; } shared; /* 在文件的某一頁通過寫時複製以後,文件的MAP_PRIVATE虛擬內存區域可能同時在i_mmap樹和anon_vma鏈表中,MAP_SHARED虛擬內存區域只能在i_mmap樹中 匿名的MAP_PRIVATE、棧或brk虛擬內存區域(file指針爲NULL)只能處於anon_vma鏈表中 */ struct list_head anon_vma_node; /* anon_vma entry 對該成員的訪問經過anon_vma->lock串行化 */ struct anon_vma *anon_vma; /* anonymous VMA object 對該成員的訪問經過page_table_lock串行化 */ struct vm_operations_struct *vm_ops; /* associated ops 用於處理該結構的各個函數指針 */ unsigned long vm_pgoff; /* offset within file 後備存儲器的有關信息 */ struct file *vm_file; /* mapped file, if any 映射到的文件(多是NULL) */ void *vm_private_data; /* private data vm_pte(即共享內存) */ };
vm_flags is a set of flags describing the region; they define the region's properties and are declared as preprocessor constants in <mm.h>
\linux-2.6.32.63\include\linux\mm.h
#define VM_READ 0x00000001 /* currently active flags */ #define VM_WRITE 0x00000002 #define VM_EXEC 0x00000004 #define VM_SHARED 0x00000008 /* mprotect() hardcodes VM_MAYREAD >> 4 == VM_READ, and so for r/w/x bits. */ #define VM_MAYREAD 0x00000010 /* limits for mprotect() etc */ #define VM_MAYWRITE 0x00000020 #define VM_MAYEXEC 0x00000040 #define VM_MAYSHARE 0x00000080 /* VM_GROWSDOWN、VM_GROWSUP表示一個區域是否能夠向下、向上擴展 1. 因爲堆自下而上增加,其區域須要設置VM_GROWSUP 2. 棧自頂向下增加,對該區域設置VM_GROWSDOWN */ #define VM_GROWSDOWN 0x00000100 /* general info on the segment */ #if defined(CONFIG_STACK_GROWSUP) || defined(CONFIG_IA64) #define VM_GROWSUP 0x00000200 #else #define VM_GROWSUP 0x00000000 #endif #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */ #define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */ #define VM_EXECUTABLE 0x00001000 #define VM_LOCKED 0x00002000 #define VM_IO 0x00004000 /* Memory mapped I/O or similar */ /* Used by sys_madvise() 因爲區域極可能從頭至尾順序讀取,則設置VM_SEQ_READ。VM_RAND_READ指定了讀取多是隨機的 這兩個標誌用於"提示"內存管理子系統和塊設備層,以優化其性能,例如若是訪問是順序的,則啓用頁的預讀 */ #define VM_SEQ_READ 0x00008000 /* App will access data sequentially */ #define VM_RAND_READ 0x00010000 /* App will not benefit from clustered reads */ #define VM_DONTCOPY 0x00020000 /* Do not copy this vma on fork 相關的區域在fork系統調用執行時不復制 */ #define VM_DONTEXPAND 0x00040000 /* Cannot expand with mremap() 禁止區域經過mremap系統調用擴展 */ #define VM_RESERVED 0x00080000 /* Count as reserved_vm like IO */ #define VM_ACCOUNT 0x00100000 /* Is a VM accounted object VM_ACCOUNT指定區域是否被納入overcommit特性的計算中 */ #define VM_NORESERVE 0x00200000 /* should the VM suppress accounting */ #define VM_HUGETLB 0x00400000 /* Huge TLB Page VM 若是區域是基於某些體系結構支持的巨型頁,則設置VM_HUGETLB */ #define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */ #define VM_MAPPED_COPY 0x01000000 /* T if mapped copy of data (nommu mmap) */ #define VM_INSERTPAGE 0x02000000 /* The vma has had "vm_insert_page()" done on it */ #define VM_ALWAYSDUMP 
0x04000000 /* Always include in core dumps */ #define VM_CAN_NONLINEAR 0x08000000 /* Has ->fault & does nonlinear pages */ #define VM_MIXEDMAP 0x10000000 /* Can contain "struct page" and pure PFN pages */ #define VM_SAO 0x20000000 /* Strong Access Ordering (powerpc) */ #define VM_PFN_AT_MMAP 0x40000000 /* PFNMAP vma that is fully mapped at mmap time */ #define VM_MERGEABLE 0x80000000 /* KSM may merge identical pages */
These flags constrain memory allocation in a number of ways.
0x3: struct pg_data_t
\linux-2.6.32.63\include\linux\mmzone.h
On both NUMA and UMA systems, memory as a whole is divided into "nodes". Each node is associated with one of the system's processors and is represented in the kernel by an instance of pg_data_t; the memory nodes are kept on a singly linked list for the kernel to traverse
typedef struct pglist_data { //node_zones是一個數組,包含告終點中的管理區 struct zone node_zones[MAX_NR_ZONES]; //node_zonelists指定告終點及其內存域的列表,node_zonelist中zone的順序表明了分配內存的順序,前者分配內存失敗將會到後者的區域中分配內存,node_zonelist數組對每種可能的內存域類型都配置了一個獨立的數組項,包括類型爲zonelist的備用列表 struct zonelist node_zonelists[MAX_ZONELISTS]; //nr_zones保存結點中不一樣內存域的數目 int nr_zones; #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */ /* node_mem_map指向struct page實例數組的指針,用於描述結點的全部物理內存頁,它包含告終點中全部內存域的頁 每一個結點又劃分爲"內存域",是內存的進一步劃分,各個內存域都關聯了一個數組,用來組織屬於該內存域的物理內存頁(頁幀),對每一個頁幀,都分配一個struct page實例以及所需的管理數據 */ struct page *node_mem_map; #ifdef CONFIG_CGROUP_MEM_RES_CTLR struct page_cgroup *node_page_cgroup; #endif #endif //在系統啓動期間,內存管理子系統初始化以前,內核也須要使用內存(必須保留部份內存用於初始化內存管理子系統),爲了解決這個問題,內核使用了"自舉內存分配器(boot memory allocator)",bdata指向自舉內存分配器數據結構的實例 struct bootmem_data *bdata; #ifdef CONFIG_MEMORY_HOTPLUG /* * Must be held any time you expect node_start_pfn, node_present_pages * or node_spanned_pages stay constant. Holding this will also * guarantee that any pfn_valid() stays that way. * * Nests above zone->lock and zone->size_seqlock. */ spinlock_t node_size_lock; #endif /* node_start_pfn是該NUMA結點第一個頁幀的邏輯編號,系統中全部結點的頁幀是依次編號的,每一個頁幀的號碼都是全局惟一的(不僅僅是結點內惟一) node_start_pfn在UMA系統中老是0,由於其中只有一個結點,所以其第一個頁幀編號老是0 */ unsigned long node_start_pfn; /* total number of physical pages node_present_pages指定告終點中頁幀的總數目 */ unsigned long node_present_pages; /* total size of physical page range, including holes node_spanned_pages給出了該結點以頁幀爲單位計算的長度 node_present_pages、node_spanned_pages的值不必定相同,由於結點中可能有一些空洞,並不對應真正的頁幀 */ unsigned long node_spanned_pages; //node_id是全局結點ID,系統中的NUMA結點都是從0開始編號 int node_id; //kswapd是交換守護進程(swap deamon)的等待隊列,在將頁幀換出時會用到 wait_queue_head_t kswapd_wait; //kswapd指向負責該結點的交換守護進程的task_strcut struct task_struct *kswapd; //kswapd_max_order用於頁交換子系統的實現,用來定義須要釋放的區域的長度 int kswapd_max_order; } pg_data_t;
0x4: struct zone
Memory is divided into "nodes", each associated with a processor in the system; each node is in turn divided into "zones", a further subdivision of memory
\linux-2.6.32.63\include\linux\mmzone.h
struct zone { /* Fields commonly accessed by the page allocator 一般由頁分配器訪問的字段*/ /* zone watermarks, access with *_wmark_pages(zone) macros pages_min、pages_high、pages_low是頁換出時使用的"水印",若是內存不足,內核能夠將頁寫到硬盤,這3個成員會影響交換守護進程的行爲 1. 若是空閒頁多於pages_high: 則內存域的狀態是理想的 2. 若是空閒頁的數目低於pages_low: 則內核開始將頁換出到硬盤 3. 若是空閒頁的數目低於pages_min: 則頁回收工做的壓力已經很大了,由於內存域中急需空閒頁,內核中有一些機制用於處理這種緊急狀況 */ unsigned long watermark[NR_WMARK]; /* * When free pages are below this point, additional steps are taken * when reading the number of free pages to avoid per-cpu counter * drift allowing watermarks to be breached */ unsigned long percpu_drift_mark; /* We don't know if the memory that we're going to allocate will be freeable or/and it will be released eventually, so to avoid totally wasting several GB of ram we must reserve some of the lower zone memory (otherwise we risk to run OOM on the lower zones despite there's tons of freeable ram on the higher zones). This array is recalculated at runtime if the sysctl_lowmem_reserve_ratio sysctl changes. lowmem_reserve數組分別爲各類內存域指定了若干頁,用於一些不管如何都不能失敗的關鍵性內存分配,各個內存域的份額根據重要性肯定
lowmem_reserve的計算由setup_per_zone_lowmem_reserve完成,內核迭代系統的全部結點,對每一個結點的各個內存域分別計算預留內存最小值,具體的算法是
內存域中頁幀的總數 / sysctl_lowmem_reserve_ratio[zone]
除數(sysctl_lowmem_reserve_ratio[zone])的默認設置對低端內存域是256,對高端內存域是32 */ unsigned long lowmem_reserve[MAX_NR_ZONES]; #ifdef CONFIG_NUMA int node; /* * zone reclaim becomes active if more unmapped pages exist. */ unsigned long min_unmapped_pages; unsigned long min_slab_pages; struct per_cpu_pageset *pageset[NR_CPUS]; #else /* pageset是一個數組,用於實現每一個CPU的熱/冷頁幀列表,內核使用這些列表來保存可用於知足實現的"新鮮頁"。但冷熱幀對應的高速緩存狀態不一樣 1. 熱幀: 頁幀已經加載到高速緩存中,與在內存中的頁相比,所以能夠快速訪問,故稱之爲熱的
2. 冷幀: 未緩存的頁幀已經不在高速緩存中,故稱之爲冷的 */ struct per_cpu_pageset pageset[NR_CPUS]; #endif /* * free areas of different sizes */ spinlock_t lock; #ifdef CONFIG_MEMORY_HOTPLUG /* see spanned/present_pages for more description */ seqlock_t span_seqlock; #endif /* 不一樣長度的空閒區域 free_area是同名數據結構的數組,用於實現夥伴系統,每一個數組元素都表示某種固定長度的一些連續內存區,對於包含在每一個區域中的空閒內存頁的管理,free_area是一個起點 */ struct free_area free_area[MAX_ORDER]; #ifndef CONFIG_SPARSEMEM /* * Flags for a pageblock_nr_pages block. See pageblock-flags.h. * In SPARSEMEM, this map is stored in struct mem_section */ unsigned long *pageblock_flags; #endif /* CONFIG_SPARSEMEM */ ZONE_PADDING(_pad1_) /* Fields commonly accessed by the page reclaim scanner 一般由頁面回收掃描程序訪問的字段 */ spinlock_t lru_lock; struct zone_lru { struct list_head list; } lru[NR_LRU_LISTS]; struct zone_reclaim_stat reclaim_stat; /* since last reclaim 上一次回收以來掃描過的頁 */ unsigned long pages_scanned; /* zone flags 內存域標誌 */ unsigned long flags; /* Zone statistics 內存域統計量,vm_stat維護了大量有關該內存域的統計信息,內核中不少地方都會更新其中的信息 */ atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS]; /* prev_priority holds the scanning priority for this zone. It is defined as the scanning priority at which we achieved our reclaim target at the previous try_to_free_pages() or balance_pgdat() invokation. We use prev_priority as a measure of how much stress page reclaim is under - it drives the swappiness decision: whether to unmap mapped pages. Access to both this field is quite racy even on uniprocessor. But it is expected to average out OK. prev_priority存儲了上一次掃描操做掃描該內存域的優先級,掃描操做是由try_to_free_pages進行的,直至釋放足夠的頁幀,掃描會根據該值判斷是否換出映射的頁 */ int prev_priority; /* * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on * this zone's LRU. Maintained by the pageout code. */ unsigned int inactive_ratio; ZONE_PADDING(_pad2_) /* Rarely used or read-mostly fields 不多使用或大多數狀況下是隻讀的字段 */ /* 1. wait_table: the array holding the hash table 2. wait_table_hash_nr_entries: the size of the hash table array 3. 
wait_table_bits: wait_table_size == (1 << wait_table_bits) The purpose of all these is to keep track of the people waiting for a page to become available and make them runnable again when possible. The trouble is that this consumes a lot of space, especially when so few things wait on pages at a given time. So instead of using per-page waitqueues, we use a waitqueue hash table. The bucket discipline is to sleep on the same queue when colliding and wake all in that wait queue when removing. When something wakes, it must check to be sure its page is truly available, a la thundering herd. The cost of a collision is great, but given the expected load of the table, they should be so rare as to be outweighed by the benefits from the saved space. __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the primary users of these fields, and in mm/page_alloc.c free_area_init_core() performs the initialization of them. wait_table、wait_table_hash_nr_entries、wait_table_bits實現了一個等待隊列,可用於存儲等待某一頁變爲可用的等待進程,進程排成一個隊列,等待某些條件,在條件變爲真時,內核會通知進程恢復工做 */ wait_queue_head_t * wait_table; unsigned long wait_table_hash_nr_entries; unsigned long wait_table_bits; /* Discontig memory support fields. 支持不連續內存模型的字段,內存域和父節點之間的關聯由zone_pgdat創建,zone_pgdat指向對應的pg_list_data實例(內存結點) */ struct pglist_data *zone_pgdat; /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT zone_start_pfn是內存域第一個頁幀的索引 */ unsigned long zone_start_pfn; /* zone_start_pfn, spanned_pages and present_pages are all protected by span_seqlock. It is a seqlock because it has to be read outside of zone->lock, and it is done in the main allocator path. But, it is written quite infrequently. The lock is declared along with zone->lock because it is frequently read in proximity to zone->lock. It's good to give them a chance of being in the same cacheline. 
*/ unsigned long spanned_pages; /* total size, including holes 內存域中頁的總數,包含空洞*/ unsigned long present_pages; /* amount of memory (excluding holes) 內存域中頁的實際數量(除去空洞) */ /*rarely used fields:*/ /* name是一個字符串,保存該內存域的管用名稱,有3個選項可用 1. Normal 2. DMA 3. HighMem */ const char *name; } ____cacheline_internodealigned_in_smp;
A special aspect of this structure is that ZONE_PADDING splits it into several parts. The zone structure is accessed very frequently, and on multiprocessor systems different CPUs commonly try to access its members at the same time, so locks are used to keep them from interfering with each other and causing errors or inconsistencies. Because the kernel touches the structure so often, its two spinlocks, zone->lock and zone->lru_lock, are acquired constantly
Data is therefore processed faster when it is held in the CPU cache. The cache is divided into lines, each responsible for a different memory area; the kernel uses the ZONE_PADDING macro to generate "padding" fields in the structure so that each spinlock sits in its own cache line. The compiler keyword ____cacheline_internodealigned_in_smp is also used to achieve optimal cache-line alignment
This is an optimization the kernel makes based on a deep understanding of the underlying CPU hardware: a seemingly wasteful, "redundant" use of space raises the CPU's parallel efficiency and avoids the waiting costs that lock contention would otherwise cause
0x5: struct page
\linux-2.6.32.63\include\linux\mm_types.h
The layout of this structure is architecture-independent and does not depend on the CPU type in use; every page frame is described by one such instance
/* Each physical page in the system has a struct page associated with it to keep track of whatever it is we are using the page for at the moment. Note that we have no way to track which tasks are using a page, though if it is a pagecache page, rmap structures can tell us who is mapping it. */
struct page {
    /* Atomic flags, some possibly updated asynchronously.
       flags stores architecture-independent bits describing the state of the page; each bit represents one state, so at least 32 different states can be represented at once. They are defined in linux/page-flags.h:

       enum pageflags {
           PG_locked,        // Page is locked. Don't touch. While set, the rest of the kernel may not access the page; this prevents races in memory management, e.g. while reading data from disk into a page frame
           PG_error,         // set if an error occurs during an I/O operation involving this page
           PG_referenced,    // PG_referenced and PG_active track how actively the page is used; this information matters when the swap subsystem selects pages to evict
           PG_uptodate,      // the page's data has been read from the block device without error
           PG_dirty,         // set if the page's contents have changed relative to the data on disk. For performance reasons pages are not written back immediately after every modification; the kernel uses this flag to note that the page has changed and can be flushed later. A page with this flag set is called dirty (its in-memory data is out of sync with the backing store, e.g. the hard disk)
           PG_lru,           // supports page reclaim and swapping: the kernel uses two least-recently-used (LRU) lists to distinguish active and inactive pages, and sets this bit if the page is on one of them
           PG_active,
           PG_slab,          // set if the page is part of the slab allocator
           PG_owner_priv_1,  // Owner use. If pagecache, fs may use
           PG_arch_1,
           PG_reserved,
           PG_private,       // If pagecache, has fs-private data: must be set if page->private is non-empty; for pages used in I/O, this field can subdivide the page into multiple buffers
           PG_private_2,     // If pagecache, has fs aux data
           PG_writeback,     // Page is under writeback: set while the page's contents are being written back to the block device
       #ifdef CONFIG_PAGEFLAGS_EXTENDED
           PG_head,          // A head page
           PG_tail,          // A tail page
       #else
           PG_compound,      // A compound page: the page belongs to a larger compound page made up of multiple contiguous ordinary pages
       #endif
           PG_swapcache,     // Swap page: swp_entry_t in private: set if the page is in the swap cache; in that case, private holds an entry of type swp_entry_t
           PG_mappedtodisk,  // Has blocks allocated on-disk
           PG_reclaim,       // To be reclaimed asap: when free memory runs low, the kernel periodically tries to reclaim pages, i.e. evict inactive, unused ones; once the kernel decides to reclaim a particular page, it sets PG_reclaim to announce that
           PG_buddy,         // Page is free, on buddy lists: set if the page is free and held in the lists of the buddy system, the core of the page allocation mechanism
           PG_swapbacked,    // Page is backed by RAM/swap
           PG_unevictable,   // Page is "unevictable"
       #ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
           PG_mlocked,       // Page is vma mlocked
       #endif
       #ifdef CONFIG_ARCH_USES_PG_UNCACHED
           PG_uncached,      // Page has been mapped as uncached
       #endif
       #ifdef CONFIG_MEMORY_FAILURE
           PG_hwpoison,      // hardware poisoned page. Don't touch
       #endif
           __NR_PAGEFLAGS,

           PG_checked = PG_owner_priv_1,   // Filesystems
           PG_fscache = PG_private_2,      // page backed by cache
           // XEN
           PG_pinned = PG_owner_priv_1,
           PG_savepinned = PG_dirty,
           // SLOB
           PG_slob_free = PG_private,
           // SLUB
           PG_slub_frozen = PG_active,
           PG_slub_debug = PG_error,
       };

       The kernel defines standard macros to check whether a given bit is set, or to manipulate it. Their names follow a fixed pattern, and all of these operations are atomic:
       1. PageXXX(page): checks whether PG_XXX is set
       2. SetPageXXX: sets the bit if it was not set, and returns the previous value
       3. ClearPageXXX: unconditionally clears the bit
       4. TestClearPageXXX: clears the bit if it was set, and returns the previous value
    */
    unsigned long flags;

    /* Usage count, see below.
       _count records how many references to the page exist; it is a usage counter giving the number of users of the page inside the kernel:
       1. when it reaches 0, the kernel knows the page instance is currently unused and can therefore be deleted
       2. as long as it is greater than 0, the instance must never be removed from memory */
    atomic_t _count;

    union {
        /* Count of ptes mapped in mms, to show when page is mapped & limit reverse map searches.
           Number of page-table entries in the memory-management subsystem that point to this page; also used to limit reverse-mapping searches.
           The atomic_t type allows the value to be modified atomically, i.e. unaffected by concurrent access */
        atomic_t _mapcount;
        struct {
            /* SLUB: number of objects */
            u16 inuse;
            u16 objects;
        };
    };

    union {
        struct {
            /* Mapping-private opaque data:
               usually used for buffer_heads if PagePrivate set;
               used for swp_entry_t if PageSwapCache;
               indicates order in the buddy system if PG_buddy is set.
               private points to "private" data that the virtual memory manager ignores */
            unsigned long private;
            /* If low bit clear, points to inode address_space, or NULL.
               If page mapped as anonymous memory, low bit is set, and it points to an anon_vma object.
               mapping specifies the address space the page frame belongs to */
            struct address_space *mapping;
        };
#if USE_SPLIT_PTLOCKS
        spinlock_t ptl;
#endif
        /* SLUB: pointer to slab (used by the slab allocator) */
        struct kmem_cache *slab;
        /* Compound tail pages.
           The kernel can combine several contiguous pages into a larger compound page. The first page of the group is called the head page, all others are tail pages; in the page instances of all tail pages, first_page points to the head page */
        struct page *first_page;
    };

    union {
        /* Our offset within mapping: index is the page frame's offset within the mapping */
        pgoff_t index;
        void *freelist;     /* SLUB: freelist req. slab lock */
    };

    /* Pageout list, eg. active_list protected by zone->lru_lock */
    struct list_head lru;

    /*
     * On machines where all RAM is mapped into kernel address space,
     * we can simply calculate the virtual address. On machines with
     * highmem some memory is mapped into kernel virtual memory
     * dynamically, so we need a place to store that address.
     * Note that this field could be 16 bits on x86 ... ;)
     *
     * Architectures with slow multiplication can define
     * WANT_PAGE_VIRTUAL in asm/page.h
     */
#if defined(WANT_PAGE_VIRTUAL)
    /* Kernel virtual address (NULL if not kmapped, ie. highmem) */
    void *virtual;
#endif /* WANT_PAGE_VIRTUAL */
#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
    unsigned long debug_flags;  /* Use atomic bitops on this */
#endif
#ifdef CONFIG_KMEMCHECK
    /*
     * kmemcheck wants to track the status of each byte in a page; this
     * is a pointer to such a status block. NULL if not tracked.
     */
    void *shadow;
#endif
};
Quite often, code must wait for the state of a page to change before it can resume work. The kernel provides two helper functions for this:
\linux-2.6.32.63\include\linux\pagemap.h
static inline void wait_on_page_locked(struct page *page);
/* Suppose part of the kernel is waiting on a locked page until it is unlocked: wait_on_page_locked provides exactly that. Called while the page is locked, it puts the caller to sleep; once the page is unlocked, the sleeper is woken automatically and continues its work. */

static inline void wait_on_page_writeback(struct page *page);
/* works similarly: it waits until all pending writeback operations on the page have finished, i.e. until the data in the page has been synchronized to the block device (e.g. the hard disk). */
8. Interrupt-Related Data Structures
0x1: struct irq_desc
The structure representing an IRQ descriptor is defined in \linux-2.6.32.63\include\linux\irq.h:
struct irq_desc {
    //1. interrupt number for this descriptor
    unsigned int        irq;
    //2. irq stats per cpu
    unsigned int        *kstat_irqs;
#ifdef CONFIG_INTR_REMAP
    //3. iommu with this irq
    struct irq_2_iommu  *irq_2_iommu;
#endif
    //4. highlevel irq-events handler [if NULL, __do_IRQ()]
    irq_flow_handler_t  handle_irq;
    //5. low level interrupt hardware access
    struct irq_chip     *chip;
    //6. MSI descriptor
    struct msi_desc     *msi_desc;
    //7. per-IRQ data for the irq_chip methods
    void                *handler_data;
    //8. platform-specific per-chip private data for the chip methods, to allow shared chip implementations
    void                *chip_data;
    //9. the irq action chain
    struct irqaction    *action;        /* IRQ action list */
    //10. status information
    unsigned int        status;         /* IRQ status */
    //11. disable-depth, for nested irq_disable() calls
    unsigned int        depth;          /* nested irq disables */
    //12. enable depth, for multiple set_irq_wake() callers
    unsigned int        wake_depth;     /* nested wake enables */
    //13. stats field to detect stalled irqs
    unsigned int        irq_count;      /* For detecting broken IRQs */
    //14. aging timer for unhandled count
    unsigned long       last_unhandled;
    //15. stats field for spurious unhandled interrupts
    unsigned int        irqs_unhandled;
    //16. locking for SMP
    spinlock_t          lock;
#ifdef CONFIG_SMP
    //17. IRQ affinity on SMP
    cpumask_var_t       affinity;
    //18. node index useful for balancing
    unsigned int        node;
#ifdef CONFIG_GENERIC_PENDING_IRQ
    //19. pending rebalanced interrupts
    cpumask_var_t       pending_mask;
#endif
#endif
    //20. number of irqaction threads currently running
    atomic_t            threads_active;
    //21. wait queue for sync_irq to wait for threaded handlers
    wait_queue_head_t   wait_for_threads;
#ifdef CONFIG_PROC_FS
    //22. /proc/irq/ procfs entry
    struct proc_dir_entry *dir;
#endif
    //23. flow handler name for /proc/interrupts output
    const char          *name;
} ____cacheline_internodealigned_in_smp;
status describes the current state of the IRQ.
irq.h defines a number of constants describing the current state of the IRQ line. Each constant stands for one flag bit in the bit string, and several flags can be set at the same time:
/*
 * IRQ line status.
 *
 * Bits 0-7 are reserved for the IRQF_* bits in linux/interrupt.h
 *
 * IRQ types
 */
#define IRQ_TYPE_NONE         0x00000000  /* Default, unspecified type */
#define IRQ_TYPE_EDGE_RISING  0x00000001  /* Edge rising type */
#define IRQ_TYPE_EDGE_FALLING 0x00000002  /* Edge falling type */
#define IRQ_TYPE_EDGE_BOTH    (IRQ_TYPE_EDGE_FALLING | IRQ_TYPE_EDGE_RISING)
#define IRQ_TYPE_LEVEL_HIGH   0x00000004  /* Level high type */
#define IRQ_TYPE_LEVEL_LOW    0x00000008  /* Level low type */
#define IRQ_TYPE_SENSE_MASK   0x0000000f  /* Mask of the above */
#define IRQ_TYPE_PROBE        0x00000010  /* Probing in progress */

/* IRQ handler active - do not enter!
   the handler for this IRQ is currently executing; like IRQ_DISABLED, this keeps the rest of the kernel from entering the handler */
#define IRQ_INPROGRESS        0x00000100
/* IRQ disabled - do not enter!
   marks an IRQ line disabled by a device driver; the flag tells the kernel not to enter the handler */
#define IRQ_DISABLED          0x00000200
/* IRQ pending - replay on enable
   set when an interrupt has been raised but its handler has not yet run */
#define IRQ_PENDING           0x00000400
/* IRQ has been replayed but not acked yet
   means the IRQ has been disabled, but an earlier interrupt is still unacknowledged */
#define IRQ_REPLAY            0x00000800
#define IRQ_AUTODETECT        0x00001000  /* IRQ is being autodetected */
#define IRQ_WAITING           0x00002000  /* IRQ not yet seen - for autodetection */
/* IRQ level triggered
   used on Alpha and PowerPC systems to distinguish level-triggered from edge-triggered IRQs */
#define IRQ_LEVEL             0x00004000
/* IRQ masked - shouldn't be seen again
   needed to correctly handle interrupts that occur during interrupt handling */
#define IRQ_MASKED            0x00008000
/* IRQ is per CPU
   set when an IRQ can occur on only one CPU; on SMP systems this flag makes several protections against concurrent access redundant */
#define IRQ_PER_CPU           0x00010000
#define IRQ_NOPROBE           0x00020000  /* IRQ is not valid for probing */
#define IRQ_NOREQUEST         0x00040000  /* IRQ cannot be requested */
#define IRQ_NOAUTOEN          0x00080000  /* IRQ will not be enabled on request irq */
#define IRQ_WAKEUP            0x00100000  /* IRQ triggers system wakeup */
#define IRQ_MOVE_PENDING      0x00200000  /* need to re-target IRQ destination */
#define IRQ_NO_BALANCING      0x00400000  /* IRQ is excluded from balancing */
#define IRQ_SPURIOUS_DISABLED 0x00800000  /* IRQ was disabled by the spurious trap */
#define IRQ_MOVE_PCNTXT       0x01000000  /* IRQ migration from process context */
#define IRQ_AFFINITY_SET      0x02000000  /* IRQ affinity was set from userspace */
#define IRQ_SUSPENDED         0x04000000  /* IRQ has gone through suspend sequence */
#define IRQ_ONESHOT           0x08000000  /* IRQ is not unmasked after hardirq */
#define IRQ_NESTED_THREAD     0x10000000  /* IRQ is nested into another, no own handler thread */

#ifdef CONFIG_IRQ_PER_CPU
# define CHECK_IRQ_PER_CPU(var) ((var) & IRQ_PER_CPU)
# define IRQ_NO_BALANCING_MASK  (IRQ_PER_CPU | IRQ_NO_BALANCING)
#else
# define CHECK_IRQ_PER_CPU(var) 0
# define IRQ_NO_BALANCING_MASK  IRQ_NO_BALANCING
#endif
0x2: struct irq_chip
\linux-2.6.32.63\include\linux\irq.h
struct irq_chip {
    /* 1. name for /proc/interrupts
       a short string identifying the hardware controller, e.g.
       1) IA-32: XT-PIC
       2) AMD64: IO-APIC */
    const char *name;
    //2. start up the interrupt (defaults to ->enable if NULL); used when an IRQ is first initialized, startup usually just delegates to enable
    unsigned int (*startup)(unsigned int irq);
    //3. shut down the interrupt (defaults to ->disable if NULL)
    void (*shutdown)(unsigned int irq);
    //4. enable the interrupt (defaults to chip->unmask if NULL)
    void (*enable)(unsigned int irq);
    //5. disable the interrupt (defaults to chip->mask if NULL)
    void (*disable)(unsigned int irq);
    //6. start of a new interrupt
    void (*ack)(unsigned int irq);
    //7. mask an interrupt source
    void (*mask)(unsigned int irq);
    //8. ack and mask an interrupt source
    void (*mask_ack)(unsigned int irq);
    //9. unmask an interrupt source
    void (*unmask)(unsigned int irq);
    //10. end of interrupt - chip level
    void (*eoi)(unsigned int irq);
    //11. end of interrupt - flow level
    void (*end)(unsigned int irq);
    //12. set the CPU affinity on SMP machines
    int (*set_affinity)(unsigned int irq, const struct cpumask *dest);
    //13. resend an IRQ to the CPU
    int (*retrigger)(unsigned int irq);
    //14. set the flow type (IRQ_TYPE_LEVEL/etc.) of an IRQ
    int (*set_type)(unsigned int irq, unsigned int flow_type);
    //15. enable/disable power-management wake-on of an IRQ
    int (*set_wake)(unsigned int irq, unsigned int on);
    //16. function to lock access to slow bus (i2c) chips
    void (*bus_lock)(unsigned int irq);
    //17. function to sync and unlock slow bus (i2c) chips
    void (*bus_sync_unlock)(unsigned int irq);

    /* Currently used only by UML, might disappear one day. */
#ifdef CONFIG_IRQ_RELEASE_METHOD
    //18. release function solely used by UML
    void (*release)(unsigned int irq, void *dev_id);
#endif
    /*
     * For compatibility, ->typename is copied into ->name.
     * Will disappear.
     */
    //19. obsoleted by name, kept as migration helper
    const char *typename;
};
This structure must account for every feature of the various IRQ implementations found in the kernel. A specific instance of the structure therefore usually defines only a subset of the possible methods. The IO-APIC and the i8259A standard interrupt controller serve as examples below.
\linux-2.6.32.63\arch\x86\kernel\io_apic.c
static struct irq_chip ioapic_chip __read_mostly = {
    .name           = "IO-APIC",
    .startup        = startup_ioapic_irq,
    .mask           = mask_IO_APIC_irq,
    .unmask         = unmask_IO_APIC_irq,
    .ack            = ack_apic_edge,
    .eoi            = ack_apic_level,
#ifdef CONFIG_SMP
    .set_affinity   = set_ioapic_affinity_irq,
#endif
    .retrigger      = ioapic_retrigger_irq,
};
linux-2.6.32.63\arch\alpha\kernel\irq_i8259.c
struct irq_chip i8259a_irq_type = {
    .name       = "XT-PIC",
    .startup    = i8259a_startup_irq,
    .shutdown   = i8259a_disable_irq,
    .enable     = i8259a_enable_irq,
    .disable    = i8259a_disable_irq,
    .ack        = i8259a_mask_and_ack_irq,
    .end        = i8259a_end_irq,
};
As these examples show, driving such a device requires defining only a subset of the possible handler functions.
0x3: struct irqaction
struct irqaction is the member of struct irq_desc that carries the IRQ handler functions.
struct irqaction {
    //1. handler and dev_id together uniquely identify an interrupt handler
    irq_handler_t handler;
    void *dev_id;
    void __percpu *percpu_dev_id;
    //2. next is used to implement shared IRQ handlers
    struct irqaction *next;
    irq_handler_t thread_fn;
    struct task_struct *thread;
    unsigned int irq;
    //3. flags is a flag variable; as a bitmap it describes some properties of the IRQ (and the associated interrupt), and the individual bits can be accessed via predefined constants
    unsigned int flags;
    unsigned long thread_flags;
    unsigned long thread_mask;
    //4. name is a short string identifying the device
    const char *name;
    struct proc_dir_entry *dir;
} ____cacheline_internodealigned_in_smp;
Several irqaction instances are gathered into a linked list, and all elements of the list must handle the same IRQ number. When a shared interrupt occurs, the kernel scans the list to find out which device actually raised the interrupt.
9. Interprocess Communication (IPC) Data Structures
0x1: struct ipc_namespace
Since kernel version 2.6.19 the IPC mechanism is namespace-aware. Managing IPC namespaces is comparatively simple because they have no hierarchical relationship: a given process belongs to the namespace pointed to by task_struct->nsproxy->ipc_ns, and the initial default namespace is implemented by the static ipc_namespace instance init_ipc_ns. Each namespace contains the following structure:
source/include/linux/ipc_namespace.h
struct ipc_namespace {
    atomic_t count;
    /* one array element per IPC mechanism:
       1) ids[0]: semaphores
       2) ids[1]: message queues
       3) ids[2]: shared memory */
    struct ipc_ids ids[3];

    int sem_ctls[4];
    int used_sems;

    int msg_ctlmax;
    int msg_ctlmnb;
    int msg_ctlmni;
    atomic_t msg_bytes;
    atomic_t msg_hdrs;
    int auto_msgmni;

    size_t shm_ctlmax;
    size_t shm_ctlall;
    int shm_ctlmni;
    int shm_tot;

    struct notifier_block ipcns_nb;

    /* The kern_mount of the mqueuefs sb. We take a ref on it */
    struct vfsmount *mq_mnt;
    /* # queues in this ns, protected by mq_lock */
    unsigned int mq_queues_count;
    /* next fields are set through sysctl */
    unsigned int mq_queues_max;   /* initialized to DFLT_QUEUESMAX */
    unsigned int mq_msg_max;      /* initialized to DFLT_MSGMAX */
    unsigned int mq_msgsize_max;  /* initialized to DFLT_MSGSIZEMAX */
};
Relevant Link:
http://blog.csdn.net/bullbat/article/details/7781027 http://book.51cto.com/art/201005/200882.htm
0x2: struct ipc_ids
This structure holds general information about the state of IPC objects. One instance of struct ipc_ids exists per IPC mechanism: shared memory, semaphores, and message queues. To avoid having to look up the correct array index for each category, the kernel provides the helper functions msg_ids, shm_ids, and sem_ids.
source/include/linux/ipc_namespace.h
struct ipc_ids {
    //1. number of IPC objects currently in use
    int in_use;
    /* 2. used to generate consecutive userspace IPC IDs. Note that an ID is not the same as the sequence number: the kernel identifies IPC objects by ID, and IDs are managed per resource type, i.e. one numbering for message queues, one for semaphores, and one for shared memory objects.
       Each time a new IPC object is created, the sequence number is incremented by 1 (wrapping automatically to 0 at the maximum value).
       The user-visible ID = s * SEQ_MULTIPLIER + i, where s is the current sequence number, i is the kernel-internal ID, and SEQ_MULTIPLIER is set to the upper limit on the number of IPC objects.
       Even if an internal ID is reused, a different userspace ID results, because sequence numbers are not reused; when userspace passes in a stale ID, this minimizes the risk of accessing the wrong resource */
    unsigned short seq;
    unsigned short seq_max;
    //3. a kernel semaphore used when performing semaphore operations, to avoid race conditions with userspace; this mutex effectively protects the data structures containing the semaphore values
    struct rw_semaphore rw_mutex;
    //4. every IPC object is represented by an instance of kern_ipc_perm; ipcs_idr associates an ID with the pointer to the corresponding kern_ipc_perm instance
    struct idr ipcs_idr;
};
Every IPC object is represented by an instance of kern_ipc_perm, and each object has a kernel-internal ID; ipcs_idr maps such an ID to the pointer to the corresponding kern_ipc_perm instance.
0x3: struct kern_ipc_perm
This structure holds information such as the "owner" of the IPC object and its access permissions:
/source/include/linux/ipc.h
/* Obsolete, used only for backwards compatibility and libc5 compiles */
struct ipc_perm {
    //1. the magic key that user programs use to identify the semaphore
    __kernel_key_t  key;
    //2. UID of the current owner of the IPC object
    __kernel_uid_t  uid;
    //3. group ID of the current owner of the IPC object
    __kernel_gid_t  gid;
    //4. user ID of the process that created the semaphore
    __kernel_uid_t  cuid;
    //5. group ID of the process that created the semaphore
    __kernel_gid_t  cgid;
    //6. bitmask specifying the access permissions of the owner, the group, and other users
    __kernel_mode_t mode;
    //7. a sequence number, used when allocating the IPC object
    unsigned short  seq;
};
This structure is not sufficient to hold all the information required for semaphores. Each process's task_struct instance contains an IPC-related member:
struct task_struct {
    ...
#ifdef CONFIG_SYSVIPC
    struct sysv_sem sysvsem;
#endif
    ...
}
// the SysV-related code is only compiled into the kernel if CONFIG_SYSVIPC is set
0x4: struct sysv_sem
The struct sysv_sem data structure wraps a single member:
struct sysv_sem {
    // used to undo semaphore operations
    struct sem_undo_list *undo_list;
};
This mechanism is useful when a crashed process has modified semaphore state and, as a result, processes waiting on the semaphore can no longer be woken (a pseudo-deadlock). By using the information in the undo list to revert those operations at the appropriate time, the semaphore can be restored to a consistent state and the deadlock prevented.
0x5: struct sem_queue
The struct sem_queue data structure associates a semaphore with a sleeping process that wants to perform a semaphore operation but is currently not allowed to because of contention. In other words, each entry in a semaphore's list of pending operations is an instance of this structure:
/* One queue for each sleeping process in the system. */
struct sem_queue {
    /* queue of pending operations: a doubly linked list chained via next/prev */
    struct list_head    list;
    /* this process: the sleeping task */
    struct task_struct  *sleeper;
    /* undo structure: used to undo operations */
    struct sem_undo     *undo;
    /* process id of the requesting process */
    int                 pid;
    /* completion status of the operation */
    int                 status;
    /* array of pending operations */
    struct sembuf       *sops;
    /* number of operations */
    int                 nsops;
    /* does the operation alter the array? */
    int                 alter;
};
For each semaphore there is a queue managing all sleeping (pending) processes associated with it. This queue is not implemented with the kernel's standard facilities but manually, via next and prev pointers.
(Figure: the relationships among the semaphore data structures.)
0x6: struct msg_queue
Turning to the data structures related to message queues: struct msg_queue serves as the list head of a message queue and describes the queue's current state as well as its access permissions:
/source/include/linux/msg.h
/* one msq_queue structure for each present queue on the system */
struct msg_queue {
    struct kern_ipc_perm q_perm;
    /* last msgsnd time: time of the last msgsnd call */
    time_t q_stime;
    /* last msgrcv time: time of the last msgrcv call */
    time_t q_rtime;
    /* last change time */
    time_t q_ctime;
    /* current number of bytes on queue */
    unsigned long q_cbytes;
    /* number of messages in queue */
    unsigned long q_qnum;
    /* max number of bytes on queue */
    unsigned long q_qbytes;
    /* pid of last msgsnd */
    pid_t q_lspid;
    /* last receive pid */
    pid_t q_lrpid;

    struct list_head q_messages;
    struct list_head q_receivers;
    struct list_head q_senders;
};
The last three members of the structure deserve particular attention; they are three standard kernel lists:
1. struct list_head q_messages : the messages themselves
2. struct list_head q_receivers : sleeping receivers
3. struct list_head q_senders : sleeping senders
Each message on q_messages is wrapped in a msg_msg instance.
0x7: struct msg_msg
\linux-2.6.32.63\include\linux\msg.h
/* one msg_msg structure for each message */
struct msg_msg {
    // list element linking the individual messages
    struct list_head m_list;
    // message type
    long m_type;
    /* message text size: length of the message */
    int m_ts;
    /* next is needed if a message longer than one memory page is stored.
       Each message is allocated (at least) one memory page; the msg_msg instance sits at the start of the page, and the remaining space can hold the message text */
    struct msg_msgseg *next;
    void *security;
    /* the actual message follows immediately */
};
During message-queue communication, both the sending and the receiving process may go to sleep:
1. the sender sleeps when it tries to write into a queue that has already reached its maximum capacity
2. the receiver sleeps when it tries to fetch a message and none is available
In practice, to keep message senders (processes writing into the queue) from being forcibly put to sleep because the queue has hit its limit, a few measures are available:
1. raise the queue limits in /etc/sysctl.conf
2. call the msgsnd() API in non-blocking mode
/*
int ret = msgsnd(msgq_id, msg, msg_size, IPC_NOWAIT);
IPC_NOWAIT: when the message queue is full, msgsnd does not wait but returns immediately
*/
Relevant Link:
http://blog.csdn.net/guoping16/article/details/6584024
0x8: struct msg_sender
For message queues, sleeping senders are placed on the q_senders list of msg_queue; the list elements use the following structure:
/* one msg_sender for each sleeping sender */
struct msg_sender {
    // list element
    struct list_head    list;
    // pointer to the task_struct of the corresponding process
    struct task_struct  *tsk;
};
No additional information is needed here, because the sending process goes to sleep during the sys_msgsnd system call, or possibly during the sys_ipc system call (which automatically retries the send operation after being woken up).
0x9: struct msg_receiver
/*
 * one msg_receiver structure for each sleeping receiver:
 */
struct msg_receiver {
    struct list_head    r_list;
    struct task_struct  *r_tsk;

    int                 r_mode;
    // description of the expected message
    long                r_msgtype;
    long                r_maxsize;

    // pointer to a msg_msg instance; when a message becomes available, this pointer specifies the target address to which the data is copied
    struct msg_msg *volatile r_msg;
};
Each message queue also has an associated msqid_ds structure.
0x10: struct msqid_ds
\linux-2.6.32.63\include\linux\msg.h
/* Obsolete, used only for backwards compatibility and libc5 compiles */
struct msqid_ds {
    struct ipc_perm msg_perm;
    struct msg *msg_first;          /* first message on queue, unused */
    struct msg *msg_last;           /* last message in queue, unused */
    __kernel_time_t msg_stime;      /* last msgsnd time */
    __kernel_time_t msg_rtime;      /* last msgrcv time */
    __kernel_time_t msg_ctime;      /* last change time */
    unsigned long msg_lcbytes;      /* Reuse junk fields for 32 bit */
    unsigned long msg_lqbytes;      /* ditto */
    unsigned short msg_cbytes;      /* current number of bytes on queue */
    unsigned short msg_qnum;        /* number of messages in queue */
    unsigned short msg_qbytes;      /* max number of bytes on queue */
    __kernel_ipc_pid_t msg_lspid;   /* pid of last msgsnd */
    __kernel_ipc_pid_t msg_lrpid;   /* last receive pid */
};
(Figure: the relationships among the data structures involved in message queues.)
10. Namespace-Related Data Structures
Through the way these data structures link to one another, the Linux kernel implements namespaces, a lightweight virtualization concept.
0x1: struct pid_namespace
\linux-2.6.32.63\include\linux\pid_namespace.h
struct pid_namespace {
    struct kref kref;
    struct pidmap pidmap[PIDMAP_ENTRIES];
    int last_pid;
    /* 1. every PID namespace has a process that plays the role the global init process plays for the system as a whole. One purpose of init is to call wait4 for orphaned processes, and the namespace-local init variant must do the same; child_reaper holds a pointer to that process's task_struct */
    struct task_struct *child_reaper;
    struct kmem_cache *pid_cachep;
    /* 2. level is the depth of this namespace in the namespace hierarchy. The initial namespace has level 0, the next layer level 1, and so on. IDs in namespaces with a higher level are visible in namespaces with a lower level (i.e. a child namespace is visible to its parent). From a given level, the kernel can infer how many IDs a process is associated with (a process in a child namespace needs an ID for every namespace from its own up to the top level) */
    unsigned int level;
    /* 3. parent is a pointer to the parent namespace */
    struct pid_namespace *parent;
#ifdef CONFIG_PROC_FS
    struct vfsmount *proc_mnt;
#endif
#ifdef CONFIG_BSD_PROCESS_ACCT
    struct bsd_acct_struct *bacct;
#endif
};
0x2: struct pid、struct upid
PID management revolves around two data structures: struct pid is the kernel's internal representation of a PID, while struct upid holds the information visible in a particular namespace.
\linux-2.6.32.63\include\linux\pid.h
/* struct upid is used to get the id of the struct pid, as it is seen in particular namespace. Later the struct pid is found with find_pid_ns() using the int nr and struct pid_namespace *ns. */
struct upid {
    /* Try to keep pid_chain in the same cacheline as nr for find_vpid */
    //1. the numeric value of the ID
    int nr;
    //2. pointer to the namespace this ID belongs to
    struct pid_namespace *ns;
    /* 3. all upid instances are kept in a hash table; pid_chain implements the hash overflow list using the kernel's standard methods */
    struct hlist_node pid_chain;
};

struct pid {
    //1. reference counter
    atomic_t count;
    unsigned int level;
    /* lists of tasks that use this pid.
       tasks is an array in which each element is a hash list head corresponding to one ID type. Since an ID may be used by several processes, all task_struct instances sharing a given ID are linked on that list. PIDTYPE_MAX is the number of ID types:
       enum pid_type {
           PIDTYPE_PID,
           PIDTYPE_PGID,
           PIDTYPE_SID,
           PIDTYPE_MAX
       }; */
    struct hlist_head tasks[PIDTYPE_MAX];
    struct rcu_head rcu;
    struct upid numbers[1];
};
As can be seen, the kernel uses this N:N relationship between the data structures to implement a virtual, hierarchical namespace structure.
0x3: struct nsproxy
\linux-2.6.32.63\include\linux\nsproxy.h
/*
 * A structure to contain pointers to all per-process namespaces:
 * 1. fs (mount)
 * 2. uts
 * 3. network
 * 4. sysvipc
 * 5. etc
 *
 * 'count' is the number of tasks holding a reference.
 * The count for each namespace, then, will be the number of nsproxies
 * pointing to it, not the number of tasks.
 *
 * The nsproxy is shared by tasks which share all namespaces. As soon as
 * a single namespace is cloned or unshared, the nsproxy is copied.
 */
struct nsproxy {
    atomic_t count;
    /* 1. the UTS (UNIX Timesharing System) namespace contains the name of the running kernel, its version, the underlying architecture type, and similar information */
    struct uts_namespace *uts_ns;
    /* 2. all information related to interprocess communication (IPC), stored in struct ipc_namespace */
    struct ipc_namespace *ipc_ns;
    /* 3. the view of the mounted filesystems, given in struct mnt_namespace */
    struct mnt_namespace *mnt_ns;
    /* 4. information on process IDs, provided by struct pid_namespace */
    struct pid_namespace *pid_ns;
    /* 5. struct net holds all network-related namespace parameters */
    struct net *net_ns;
};
extern struct nsproxy init_nsproxy;
0x4: struct mnt_namespace
\linux-2.6.32.63\include\linux\mnt_namespace.h
struct mnt_namespace {
    // usage counter specifying the number of processes using this namespace
    atomic_t count;
    // vfsmount instance of the root directory
    struct vfsmount *root;
    // head of a doubly linked list holding the vfsmount instances of all filesystems in the VFS namespace; the list elements are the mnt_list members of vfsmount
    struct list_head list;
    wait_queue_head_t poll;
    int event;
};
Copyright (c) 2014 LittleHann All rights reserved