Allocating memory to kernel mode processes ==== >
Three ways for a kernel function to get dynamic memory:
1. __get_free_pages() or alloc_pages() to get pages from the zoned page frame allocator
2. kmem_cache_alloc() or kmalloc() to use the slab allocator
3. vmalloc() or vmalloc_32() to get a noncontiguous memory area
The kernel trusts itself. All kernel functions are assumed to be error-free, so the kernel does not need to insert any protection against programming errors.
Allocating memory to User Mode processes ==== >
1. The kernel tries to defer allocating dynamic memory to User Mode processes.
2. The kernel must be prepared to catch all addressing errors caused by User Mode processes.
Key concepts: memory region, address space management, Page Fault exception handler
The address space of a process consists of all linear addresses that the process is allowed to use. Because each process sees a different set of linear addresses, an address used by one process bears no relation to the same address used by another. (This implies that a process cannot use every linear address: an interval of addresses becomes usable only after it has been added to the process's address space.)
The kernel represents intervals of linear addresses by means of memory regions, each characterized by an initial linear address, a length, and some access rights.
Memory region creation and deletion
- Process creation and destruction
- Exec function
- Memory mapping
- Expand user mode stack
- Expand heap
- IPC shared memory
Even if an interval of linear addresses has been added to the process's address space, the page frames corresponding to those addresses may not have been allocated yet. The programmer or user usually does not need to care about this: the first access to such an address causes a Page Fault, and the Page Fault handler performs demand paging for us. This kind of page fault is different from those caused by programming errors.
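Demand paging can be observed from User Mode. The following is a minimal sketch, assuming a Linux system where mmap() with MAP_ANONYMOUS and mincore() are available; the function name demand_paging_demo() is made up for illustration. It maps a region, asks whether the first page is resident, touches the page, and asks again.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map npages anonymous pages and touch only the first one.
 * Stores the residency of the first page (as reported by mincore())
 * before and after the touch. Returns 0 on success, -1 on failure.
 * Hypothetical helper, not part of any library.
 */
int demand_paging_demo(size_t npages, int *resident_before, int *resident_after)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || npages == 0)
        return -1;
    size_t len = npages * (size_t)page;

    /* This only adds a region to the address space. */
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    unsigned char vec = 0;
    if (mincore(p, (size_t)page, &vec) != 0) {
        munmap(p, len);
        return -1;
    }
    *resident_before = vec & 1;

    /* First access: Page Fault, then the handler assigns a page frame. */
    p[0] = 42;

    if (mincore(p, (size_t)page, &vec) != 0) {
        munmap(p, len);
        return -1;
    }
    *resident_after = vec & 1;

    int ok = (p[0] == 42);
    munmap(p, len);
    return ok ? 0 : -1;
}
```

On a typical Linux kernel the page is expected to be non-resident before the write and resident after it, but the "before" value depends on the kernel and on any sandbox the program runs in, so treat it as an observation rather than a guarantee.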
Type: mm_struct
The memory descriptor object, referenced by the mm field of the process descriptor, contains all information related to the process's address space.
struct mm_struct {
    struct vm_area_struct * mmap; /* list of VMAs */
    struct rb_root mm_rb;
    struct vm_area_struct * mmap_cache; /* last find_vma result */
#ifdef CONFIG_MMU
    unsigned long (*get_unmapped_area) (struct file *filp,
            unsigned long addr, unsigned long len,
            unsigned long pgoff, unsigned long flags);
    void (*unmap_area) (struct mm_struct *mm, unsigned long addr);
#endif
    unsigned long mmap_base; /* base of mmap area */
    unsigned long task_size; /* size of task vm space */
    unsigned long cached_hole_size; /* if non-zero, the largest hole below free_area_cache */
    unsigned long free_area_cache; /* first hole of size cached_hole_size or larger */
    pgd_t * pgd;
    atomic_t mm_users; /* How many users with user space? */
    atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
    int map_count; /* number of VMAs */
    struct rw_semaphore mmap_sem;
    spinlock_t page_table_lock; /* Protects page tables and some counters */
    struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung
                              * together off init_mm.mmlist, and are protected
                              * by mmlist_lock
                              */
    unsigned long hiwater_rss; /* High-watermark of RSS usage */
    unsigned long hiwater_vm; /* High-water virtual memory usage */
    unsigned long total_vm, locked_vm, shared_vm, exec_vm;
    unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
    unsigned long start_code, end_code, start_data, end_data;
    unsigned long start_brk, brk, start_stack;
    unsigned long arg_start, arg_end, env_start, env_end;
    unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
    /*
     * Special counters, in some configurations protected by the
     * page_table_lock, in other configurations by being atomic.
     */
    struct mm_rss_stat rss_stat;
    struct linux_binfmt *binfmt;
    cpumask_t cpu_vm_mask;
    /* Architecture-specific MM context */
    mm_context_t context;
    /* Swap token stuff */
    /*
     * Last value of global fault stamp as seen by this process.
     * In other words, this value gives an indication of how long
     * it has been since this task got the token.
     * Look at mm/thrash.c
     */
    unsigned int faultstamp;
    unsigned int token_priority;
    unsigned int last_interval;
    unsigned long flags; /* Must use atomic bitops to access the bits */
    struct core_state *core_state; /* coredumping support */
#ifdef CONFIG_AIO
    spinlock_t ioctx_lock;
    struct hlist_head ioctx_list;
#endif
#ifdef CONFIG_MM_OWNER
    /*
     * "owner" points to a task that is regarded as the canonical
     * user/owner of this mm. All of the following must be true in
     * order for it to be changed:
     *
     * current == mm->owner
     * current->mm != mm
     * new_owner->mm == mm
     * new_owner->alloc_lock is held
     */
    struct task_struct *owner;
#endif
#ifdef CONFIG_PROC_FS
    /* store ref to file /proc/<pid>/exe symlink points to */
    struct file *exe_file;
    unsigned long num_exe_file_vmas;
#endif
#ifdef CONFIG_MMU_NOTIFIER
    struct mmu_notifier_mm *mmu_notifier_mm;
#endif
};
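Two reference counters in this structure are worth singling out:
atomic_t mm_users; /* How many users with user space? */
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
mm_users counts the lightweight processes that share the address space. mm_count counts references to the descriptor itself, and all the holders counted in mm_users together contribute only one unit to it. The descriptor is freed only when mm_count drops to zero.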
Type: vm_area_struct
Two adjacent memory regions merge if their access rights match.
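The merge rule can be sketched in a few lines of C. This is a toy model under stated assumptions: struct region, the R_* bits, can_merge() and try_merge() are hypothetical names, not kernel identifiers, and the real kernel checks more than this (for example, file-backed regions must also map contiguous parts of the same file).

```c
#include <stdbool.h>

/* Toy region covering [start, end) with access-right bits (hypothetical). */
struct region {
    unsigned long start;
    unsigned long end;
    unsigned int  rights;
};

enum { R_READ = 1, R_WRITE = 2, R_EXEC = 4 };

/* Two regions can merge when b starts where a ends and the rights match. */
bool can_merge(const struct region *a, const struct region *b)
{
    return a->end == b->start && a->rights == b->rights;
}

/* Extend a over b when allowed; returns true if a was extended. */
bool try_merge(struct region *a, const struct region *b)
{
    if (!can_merge(a, b))
        return false;
    a->end = b->end;
    return true;
}
```

For instance, a read/write region ending at 0x3000 absorbs a read/write region starting at 0x3000, but not a read-only one.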
Red-black tree
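About tree structures
The essence of a tree structure is that the children of a node can be reached in O(1) time (and sometimes the parent or a sibling as well).
There are two ways to reach a child in O(1) time: by array index, or through child pointers stored in the node. The latter is more widely used, but the former is also a powerful method. For example, a heap (a binary tree) can be implemented with an array; such a heap can be used to implement heapsort, or a simple priority queue.
In practice, suppose the instances of some data structure A need to be linked to one another, but A is fairly large, or instances are added and removed dynamically and their number varies widely. Then we cannot store A in an array. The usual solution is to chain all instances of A together in a list. This raises a problem: if the list has few elements, the cost of search, add, and delete is small and acceptable; but if it has many elements, say hundreds or thousands, search, add, and delete become expensive, taking O(n) time (the list is a sorted list). In that case, if the instances of A are organized into a tree in addition to the list, the cost of search, add, and delete drops sharply, to O(lg n) time. This solution is widely used in the Linux kernel. For example, the memory region descriptors of a memory descriptor are organized this way. It is a strategy that trades space for time, and a more complex data structure for a more efficient algorithm.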
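As a rough illustration of the tree side of this strategy, here is a sketch of an address lookup over regions stored in a binary search tree. It is loosely modeled on the kind of lookup find_vma performs, but everything in it is an assumption made for the example: struct node, tree_insert() and tree_find() are hypothetical, and the tree is a plain unbalanced BST rather than a red-black tree, so the O(lg n) bound holds only while the tree happens to stay balanced.

```c
#include <stddef.h>

/* Toy region node covering [start, end), ordered by address (hypothetical). */
struct node {
    unsigned long start, end;
    struct node *left, *right;
};

/* Insert n into the tree rooted at root; returns the new root.
 * No rebalancing is done here, unlike a red-black tree. */
struct node *tree_insert(struct node *root, struct node *n)
{
    if (root == NULL) {
        n->left = n->right = NULL;
        return n;
    }
    if (n->start < root->start)
        root->left = tree_insert(root->left, n);
    else
        root->right = tree_insert(root->right, n);
    return root;
}

/* Return the lowest region whose end lies above addr, or NULL if none.
 * The result contains addr only if its start is <= addr. */
struct node *tree_find(struct node *root, unsigned long addr)
{
    struct node *best = NULL;

    while (root != NULL) {
        if (root->end > addr) {
            best = root;              /* candidate; look for a lower one */
            if (root->start <= addr)
                break;                /* addr is inside this region */
            root = root->left;
        } else {
            root = root->right;       /* region lies entirely below addr */
        }
    }
    return best;
}
```

Each step of tree_find() discards one subtree, which is where the saving over walking a sorted list comes from.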
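Page Fault exception handler for the 80x86 architecture: do_page_fault()
As the handler's flow diagram shows, when a program hits a SIGSEGV error there are two possible causes:
1. The accessed address is not in the process address space.
2. The accessed address is in the process address space, but the access rights do not match.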
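The two SIGSEGV cases, together with the legal case that leads to demand paging, can be sketched as a small classifier. This is a simplified model, not the logic of do_page_fault(): struct area, the ACC_* bits, enum fault_result and classify_fault() are all hypothetical names, the regions are scanned linearly, and special cases such as growing the User Mode stack are ignored.

```c
#include <stddef.h>

enum { ACC_READ = 1, ACC_WRITE = 2, ACC_EXEC = 4 };

enum fault_result {
    FAULT_DEMAND_PAGE,  /* legal access: allocate a page frame and retry */
    FAULT_BAD_ADDRESS,  /* case 1: address outside the address space */
    FAULT_BAD_ACCESS    /* case 2: address valid, rights do not match */
};

/* Toy memory region covering [start, end) (hypothetical). */
struct area {
    unsigned long start, end;
    unsigned int  rights;
};

/* Decide what a fault at addr with the given access type means. */
enum fault_result classify_fault(const struct area *areas, size_t n,
                                 unsigned long addr, unsigned int access)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= areas[i].start && addr < areas[i].end) {
            if ((areas[i].rights & access) == access)
                return FAULT_DEMAND_PAGE;
            return FAULT_BAD_ACCESS;
        }
    }
    return FAULT_BAD_ADDRESS;
}
```

In this model, the two results other than FAULT_DEMAND_PAGE correspond to the two situations in which the process would receive SIGSEGV.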
clone() -> copy_mm()
If the CLONE_VM flag is set, copy_mm() gives the clone (tsk) the address space of its parent (current).
If the CLONE_VM flag is not set, copy_mm() creates a new address space, even though no memory is allocated within it until some address is requested. The function allocates a new memory descriptor, copies the contents of current->mm into tsk->mm, and then changes a few fields in the new descriptor.
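The share-or-copy decision can be modeled in User Mode C. This is only a toy under stated assumptions: struct toy_mm, TOY_CLONE_VM and toy_copy_mm() are hypothetical stand-ins, the descriptor holds two fields instead of the real contents, and the real function also duplicates the memory regions and page tables.

```c
#include <stdlib.h>

#define TOY_CLONE_VM 0x1UL  /* hypothetical flag standing in for CLONE_VM */

/* Toy memory descriptor (hypothetical, two representative fields). */
struct toy_mm {
    int users;          /* plays the role of mm_users */
    unsigned long brk;  /* any field that is copied as-is */
};

/* Return the descriptor the child should use, or NULL if allocation fails. */
struct toy_mm *toy_copy_mm(unsigned long flags, struct toy_mm *parent)
{
    if (flags & TOY_CLONE_VM) {
        parent->users++;      /* share: one more user of the same space */
        return parent;
    }

    struct toy_mm *child = malloc(sizeof(*child));
    if (child == NULL)
        return NULL;
    *child = *parent;         /* copy the parent's descriptor ... */
    child->users = 1;         /* ... then fix up the fields that differ */
    return child;
}
```

With the flag set, parent and child end up pointing at the same descriptor; without it, the child gets its own copy whose user count starts at 1.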
exit_mm() -> mm_release()
Heap: a specific memory region, start_brk -> brk, used to satisfy a process’s dynamic memory requests.
malloc()
calloc()
realloc()
free()
brk()
sbrk()
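The movement of the heap's upper boundary can be observed with sbrk(). The following is a minimal sketch, assuming a Unix-like system that still provides sbrk(); grow_heap() is a hypothetical helper name. sbrk(incr) moves the program break by incr bytes and returns the previous break, and sbrk(0) returns the current one.

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <unistd.h>

/*
 * Grow the heap by incr bytes.
 * On success returns 0 and stores the break before and after the call;
 * returns -1 if the break could not be moved. Hypothetical helper.
 */
int grow_heap(intptr_t incr, void **old_brk, void **new_brk)
{
    void *prev = sbrk(incr);   /* returns the previous program break */
    if (prev == (void *)-1)
        return -1;

    *old_brk = prev;
    *new_brk = sbrk(0);        /* query the current program break */
    return 0;
}
```

After a successful call, the bytes between the old and new break belong to the heap region and can be used directly. In ordinary programs malloc() does this bookkeeping, so calling sbrk() by hand is mainly useful for experiments like this one.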