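I previously wrote a short introduction to mmap()/munmap(), 《Linux內存管理 (9)mmap》, which was rather thin; this article goes through the topic in detail.
It first covers how the two functions are used from an ordinary user's point of view, then focuses on the kernel implementation, and finally runs some verification tests on mmap()/munmap().
The mmap system call was not designed purely for shared memory. It offers a way of accessing regular files that differs from the usual one: a process can operate on a regular file as if it were reading and writing memory.
mmap lets processes share memory by mapping the same regular file. Once a regular file is mapped into a process's address space, the process can access it like ordinary memory, without calling read()/write().
mmap does not allocate storage; it only maps the file into the calling process's address space (consuming virtual address space), after which operations such as memcpy() can be used. Changes made in memory are not written to the file immediately but after some delay; msync() can be used to synchronize explicitly.
A mapping is removed with munmap().
The figure illustrates an mmap memory mapping: the start address is the returned addr, and off and len correspond to the offset and length parameters.
Using mmap()/munmap() is fairly simple; most of the variety comes from the combinations of two parameters, prot and flags.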
#include <sys/mman.h>

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
int munmap(void *addr, size_t length);
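A brief description of the parameters:
- addr: if not NULL, the kernel treats it as a hint for where to create the mapping (it is only binding with MAP_FIXED); otherwise the kernel picks a suitable virtual address. In most cases there is little point in specifying an address, and the kernel is left to choose one and return it to user space.
- length: the size of the mapping in the process address space.
- prot: the read/write/execute protection of the region.
- flags: the mapping attributes: shared, private, anonymous, file-backed and so on.
- fd: the descriptor of the open file for a file mapping. A file mapping must supply fd; an anonymous mapping passes the special value -1.
- offset: for a file mapping, the offset from the start of the file, which must be a multiple of the page size; the returned address corresponds to this offset.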
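Ordinary file I/O goes through open, read and write: the data is first read from disk into the kernel's page cache and then copied into a user-space buffer, so two copies are involved.
mmap maps the file into user space instead. When the process reads the file, a page fault occurs, physical memory is allocated for the virtual range, and the data is paged in from disk into that physical memory. User space then reads the data directly, and the whole path involves only one copy.
When two processes map the same file, the same file region is mapped at different virtual addresses in each process. When one process touches the file, it first obtains physical memory through a page fault and then pages the file data into memory.
When the other process accesses the file and finds no physical page behind its virtual address, the filesystem's fault handler checks whether the data is already in the page cache; if so, it only has to set up the mapping. The two processes then share the same memory and can communicate through it.
This works because the kernel has already resolved fd to the file and associated that file with the vma.
The mapped length is fixed when the mapping is created, so the range beyond len cannot be accessed through the mapping.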
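Private/shared and file/anonymous give four combinations, described in turn below.
Private file mapping: multiple processes are initialized from the same physical pages, but each process's modifications are not shared with the others and are not written back to the file.
Linux .so shared libraries, for example, are mapped into each process's address space this way.
Private anonymous mapping: mmap creates a new mapping that is not shared between processes; it is mainly used to allocate memory (malloc uses mmap for large allocations).
Shared file mapping: multiple processes share the same physical memory through virtual memory, and modifications are reflected in the underlying file; this is also a form of inter-process communication.
Shared anonymous mapping: fork does not apply copy-on-write to such a mapping; parent and child share exactly the same physical pages, which allows parent-child communication.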
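The system call entry point is entry_SYSCALL_64_fastpath, which uses the system call number to look up the handler in sys_call_table.
mmap() and munmap() correspond to the system calls SyS_mmap() and SyS_munmap() respectively; their implementation is analysed below.
Before analysing the kernel implementation, a script is used to look at the mmap/munmap call path.
By adding functions to set_ftrace_filter and switching current_tracer to find each function's callers, the call path is filled in step by step.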
#!/bin/bash

DPATH="/sys/kernel/debug/tracing"
PID=$$

## Quick basic checks
[ "$(id -u)" -ne 0 ] && { echo "needs to be root"; exit 1; }              # check for root permissions
[ -z "$1" ] && { echo "needs process name as argument"; exit 1; }          # check for args to this script
mount | grep -i debugfs &> /dev/null
[ $? -ne 0 ] && { echo "debugfs not mounted, mount it first"; exit 1; }    # check for debugfs mount

# flush existing trace data
echo > $DPATH/trace
echo nop > $DPATH/current_tracer

# select the functions to trace
echo > $DPATH/set_ftrace_filter
echo "SyS_mmap SyS_mmap_pgoff SyS_munmap SyS_open SyS_read SyS_write SyS_close SyS_brk SyS_msync" >> $DPATH/set_ftrace_filter
echo "do_brk elf_map load_elf_binary" >> $DPATH/set_ftrace_filter
echo "do_mmap do_munmap get_unmapped_area mmap_region vm_mmap vm_munmap vm_mmap_pgoff" >> $DPATH/set_ftrace_filter
echo "__split_vma* unmap_region" >> $DPATH/set_ftrace_filter

# set function tracer
echo function_graph > $DPATH/current_tracer

# write current process id to set_ftrace_pid file
echo $PID > $DPATH/set_ftrace_pid

#echo "common_pid==$PID" > /sys/kernel/debug/tracing/events/syscalls/sys_enter_mmap/filter
#echo 1 > /sys/kernel/debug/tracing/events/syscalls/sys_enter_mmap/enable
#echo "common_pid==$PID" > /sys/kernel/debug/tracing/events/syscalls/sys_enter_munmap/filter
#echo 1 > /sys/kernel/debug/tracing/events/syscalls/sys_enter_munmap/enable

# start the tracing
echo 1 > $DPATH/tracing_on

# execute the process; exec keeps the same PID, so the pid filter still applies
exec "$@"

#sudo cat $DPATH/trace > /home/al/v4l2/trace.txt
Finally, the function_graph tracer shows the following call relationships:
 1)               |  SyS_mmap() {
 1)               |    SyS_mmap_pgoff() {
 1)               |      vm_mmap_pgoff() {
 1)               |        do_mmap() {
 1)   0.548 us    |          get_unmapped_area();
 1)   3.388 us    |          mmap_region();
 1)   4.598 us    |        }
 1)   5.286 us    |      }
 1)   5.756 us    |    }
 1)   6.058 us    |  }
 1)               |  SyS_munmap() {
 1)               |    vm_munmap() {
 1)               |      do_munmap() {
 1) + 99.985 us   |        unmap_region();
 1) ! 101.439 us  |      }
 1) ! 101.838 us  |    }
 1) ! 102.410 us  |  }
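The analysis below follows this path.
The core of the mmap() system call is do_mmap(), which can be divided into three parts.
The first part uses get_unmapped_area() to find a range of virtual addresses, [addr, addr+len).
A user process usually does not specify addr, so the kernel decides where this range starts.
Before calling get_unmapped_area(), do_mmap() (reached through do_mmap_pgoff()) pre-adjusts addr with round_hint_to_min() and then passes the adjusted addr to get_unmapped_area().
The second part determines the flags of the vma, which differ for file versus anonymous and private versus shared mappings.
The third part actually creates the vma, through mmap_region().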
asmlinkage unsigned long
sys_mmap (unsigned long addr, unsigned long len, int prot, int flags, int fd, long off)
{
    if (offset_in_page(off) != 0)
        return -EINVAL;

    addr = sys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
    if (!IS_ERR((void *) addr))
        force_successful_syscall_return();
    return addr;
}

SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
        unsigned long, prot, unsigned long, flags,
        unsigned long, fd, unsigned long, pgoff)
{
    struct file *file = NULL;
    unsigned long retval;

    if (!(flags & MAP_ANONYMOUS)) {
        // Checks for a non-anonymous (file) mapping: fd must resolve to a struct file.
        audit_mmap_fd(fd, flags);
        file = fget(fd);
        if (!file)
            return -EBADF;
        // file->f_op tells whether this is a hugepage file; if so, align len to the huge page size.
        if (is_file_hugepages(file))
            len = ALIGN(len, huge_page_size(hstate_file(file)));
        retval = -EINVAL;
        if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
            goto out_fput;
    } else if (flags & MAP_HUGETLB) {
        struct user_struct *user = NULL;
        struct hstate *hs;

        hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & SHM_HUGE_MASK);
        if (!hs)
            return -EINVAL;

        len = ALIGN(len, huge_page_size(hs));
        /*
         * VM_NORESERVE is used because the reservations will be
         * taken when vm_ops->mmap() is called
         * A dummy user value is used because we are not locking
         * memory so no accounting is necessary
         */
        file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
                VM_NORESERVE,
                &user, HUGETLB_ANONHUGE_INODE,
                (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
        if (IS_ERR(file))
            return PTR_ERR(file);
    }

    flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);

    retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
out_fput:
    if (file)
        fput(file);
    return retval;
}

unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
    unsigned long len, unsigned long prot,
    unsigned long flag, unsigned long pgoff)
{
    unsigned long ret;
    struct mm_struct *mm = current->mm;
    unsigned long populate;

    ret = security_mmap_file(file, prot, flag);
    if (!ret) {
        down_write(&mm->mmap_sem);
        ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff, &populate);
        up_write(&mm->mmap_sem);
        if (populate)
            mm_populate(ret, populate);
    }
    return ret;
}

unsigned long do_mmap(struct file *file, unsigned long addr,
            unsigned long len, unsigned long prot,
            unsigned long flags, vm_flags_t vm_flags,
            unsigned long pgoff, unsigned long *populate)
{
    struct mm_struct *mm = current->mm;

    *populate = 0;

    if (!len)
        return -EINVAL;

    if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
        if (!(file && path_noexec(&file->f_path)))
            prot |= PROT_EXEC;

    // Without MAP_FIXED, addr must not be below mmap_min_addr; if it is,
    // the page-aligned mmap_min_addr is used instead.
    if (!(flags & MAP_FIXED))
        addr = round_hint_to_min(addr);

    /* Careful about overflows.. */
    len = PAGE_ALIGN(len);
    // This does not test for a zero length (done above); it detects len overflowing during alignment.
    if (!len)
        return -ENOMEM;

    /* offset overflow? */
    // Check whether the offset overflows.
    if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
        return -EOVERFLOW;

    /* Too many mappings? */
    // Per-process limit on the number of mappings; exceeding it returns ENOMEM.
    if (mm->map_count > sysctl_max_map_count)
        return -ENOMEM;

    // Before a new vma is created, a large enough free range has to be found.
    // get_unmapped_area() looks for an unmapped range and returns its start address.
    addr = get_unmapped_area(file, addr, len, pgoff, flags);
    if (offset_in_page(addr))
        return addr;

    // Derive vm_flags from prot, flags and mm->def_flags.
    vm_flags |= calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
            mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

    if (flags & MAP_LOCKED)
        if (!can_do_mlock())
            return -EPERM;

    if (mlock_future_check(mm, vm_flags, len))
        return -EAGAIN;

    if (file) {
        // File mapping: mostly updates vm_flags.
        struct inode *inode = file_inode(file);

        if (!file_mmap_ok(file, inode, pgoff, len))
            return -EOVERFLOW;

        switch (flags & MAP_TYPE) {
        case MAP_SHARED:
            // Shared file mapping.
            if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
                return -EACCES;

            if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE))
                return -EACCES;

            if (locks_verify_locked(file))
                return -EAGAIN;

            vm_flags |= VM_SHARED | VM_MAYSHARE;
            if (!(file->f_mode & FMODE_WRITE))
                vm_flags &= ~(VM_MAYWRITE | VM_SHARED);

            /* fall through */
        case MAP_PRIVATE:
            // Private file mapping.
            if (!(file->f_mode & FMODE_READ))
                return -EACCES;
            if (path_noexec(&file->f_path)) {
                if (vm_flags & VM_EXEC)
                    return -EPERM;
                vm_flags &= ~VM_MAYEXEC;
            }

            if (!file->f_op->mmap)
                return -ENODEV;
            if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
                return -EINVAL;
            break;

        default:
            return -EINVAL;
        }
    } else {
        // Anonymous mapping.
        switch (flags & MAP_TYPE) {
        case MAP_SHARED:
            // Shared anonymous mapping.
            if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
                return -EINVAL;
            pgoff = 0;    // Why 0?
            vm_flags |= VM_SHARED | VM_MAYSHARE;
            break;
        case MAP_PRIVATE:
            // Private anonymous mapping.
            pgoff = addr >> PAGE_SHIFT;
            break;
        default:
            return -EINVAL;
        }
    }

    if (flags & MAP_NORESERVE) {
        /* We honor MAP_NORESERVE if allowed to overcommit */
        if (sysctl_overcommit_memory != OVERCOMMIT_NEVER)
            vm_flags |= VM_NORESERVE;

        /* hugetlb applies strict overcommit unless MAP_NORESERVE */
        if (file && is_file_hugepages(file))
            vm_flags |= VM_NORESERVE;
    }

    // Actually create the vma.
    addr = mmap_region(file, addr, len, vm_flags, pgoff);
    if (!IS_ERR_VALUE(addr) &&
        ((vm_flags & VM_LOCKED) ||
         (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
        *populate = len;
    return addr;
}
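get_unmapped_area() takes the input addr and the other parameters and, through get_area(), finds a virtual range that satisfies the request, returning the start address of that range.
get_area() is a function pointer with two possible targets: mm->get_unmapped_area() or file->f_op->get_unmapped_area().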
unsigned long
get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
        unsigned long pgoff, unsigned long flags)
{
    unsigned long (*get_area)(struct file *, unsigned long,
                  unsigned long, unsigned long, unsigned long);

    unsigned long error = arch_mmap_check(addr, len, flags);
    if (error)
        return error;

    /* Careful about overflows.. */
    if (len > TASK_SIZE)
        return -ENOMEM;

    // Default: the mm_struct->get_unmapped_area() method, i.e. arch_get_unmapped_area().
    get_area = current->mm->get_unmapped_area;
    // For a file mapping whose file_operations defines get_unmapped_area,
    // use that method to locate the virtual range instead.
    if (file && file->f_op->get_unmapped_area)
        get_area = file->f_op->get_unmapped_area;
    addr = get_area(file, addr, len, pgoff, flags);
    if (IS_ERR_VALUE(addr))
        return addr;

    if (addr > TASK_SIZE - len)
        return -ENOMEM;
    if (offset_in_page(addr))
        return -EINVAL;

    addr = arch_rebalance_pgtables(addr, len);
    error = security_mmap_addr(addr);
    return error ? error : addr;
}
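As the name arch_get_unmapped_area() suggests, architectures may provide their own implementation. The listing below appears to be the ARM variant (it uses cache_is_vipt_aliasing() and COLOUR_ALIGN()), but the overall flow is what matters for this analysis.
arch_get_unmapped_area() places new mappings from low addresses towards high addresses, while arch_get_unmapped_area_topdown() places them from high addresses towards low addresses.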
unsigned long
arch_get_unmapped_area(struct file *filp, unsigned long addr,
        unsigned long len, unsigned long pgoff, unsigned long flags)
{
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;
    int do_align = 0;
    int aliasing = cache_is_vipt_aliasing();
    struct vm_unmapped_area_info info;

    if (aliasing)
        do_align = filp || (flags & MAP_SHARED);

    // MAP_FIXED does not take part in address selection: the mapping is created at the given address.
    if (flags & MAP_FIXED) {
        if (aliasing && flags & MAP_SHARED &&
            (addr - (pgoff << PAGE_SHIFT)) & (SHMLBA - 1))
            return -EINVAL;
        return addr;
    }

    if (len > TASK_SIZE)
        return -ENOMEM;

    // A non-zero addr is a preferred address. The kernel checks whether that range
    // overlaps an existing area; find_vma() does the lookup.
    if (addr) {
        if (do_align)
            addr = COLOUR_ALIGN(addr, pgoff);
        else
            addr = PAGE_ALIGN(addr);

        vma = find_vma(mm, addr);
        if (TASK_SIZE - len >= addr &&
            (!vma || addr + len <= vm_start_gap(vma)))
            return addr;
    }

    info.flags = 0;
    info.length = len;
    info.low_limit = mm->mmap_base;
    info.high_limit = TASK_SIZE;
    info.align_mask = do_align ? (PAGE_MASK & (SHMLBA - 1)) : 0;
    info.align_offset = pgoff << PAGE_SHIFT;
    // When addr is 0 or the preferred address is unusable, the kernel has to walk the
    // process's areas to find a free range of suitable size; vm_unmapped_area() does the work.
    return vm_unmapped_area(&info);
}

static inline unsigned long
vm_unmapped_area(struct vm_unmapped_area_info *info)
{
    if (info->flags & VM_UNMAPPED_AREA_TOPDOWN)
        return unmapped_area_topdown(info);    // Place the mapping from high to low addresses.
    else
        return unmapped_area(info);            // Place the mapping from low to high addresses.
}

unsigned long unmapped_area(struct vm_unmapped_area_info *info)
{
    /*
     * We implement the search by looking for an rbtree node that
     * immediately follows a suitable gap. That is,
     * - gap_start = vma->vm_prev->vm_end <= info->high_limit - length;
     * - gap_end   = vma->vm_start        >= info->low_limit  + length;
     * - gap_end - gap_start >= length
     */

    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;
    unsigned long length, low_limit, high_limit, gap_start, gap_end;

    /* Adjust search length to account for worst case alignment overhead */
    length = info->length + info->align_mask;
    if (length < info->length)
        return -ENOMEM;

    /* Adjust search limits by the desired length */
    if (info->high_limit < length)
        return -ENOMEM;
    high_limit = info->high_limit - length;

    if (info->low_limit > high_limit)
        return -ENOMEM;
    low_limit = info->low_limit + length;

    /* Check if rbtree root looks promising */
    if (RB_EMPTY_ROOT(&mm->mm_rb))
        goto check_highest;
    vma = rb_entry(mm->mm_rb.rb_node, struct vm_area_struct, vm_rb);
    if (vma->rb_subtree_gap < length)
        goto check_highest;

    while (true) {
        /* Visit left subtree if it looks promising */
        gap_end = vm_start_gap(vma);    // Search from the low addresses first.
        if (gap_end >= low_limit && vma->vm_rb.rb_left) {
            struct vm_area_struct *left =
                rb_entry(vma->vm_rb.rb_left,
                     struct vm_area_struct, vm_rb);
            if (left->rb_subtree_gap >= length) {
                vma = left;
                continue;
            }
        }

        // At this point the current node is the last one whose rb_subtree_gap may satisfy the request.
        gap_start = vma->vm_prev ? vm_end_gap(vma->vm_prev) : 0;
check_current:
        /* Check if current node has a suitable gap */
        if (gap_start > high_limit)
            return -ENOMEM;
        if (gap_end >= low_limit &&
            gap_end > gap_start && gap_end - gap_start >= length)
            goto found;

        /* Visit right subtree if it looks promising */
        if (vma->vm_rb.rb_right) {
            struct vm_area_struct *right =
                rb_entry(vma->vm_rb.rb_right,
                     struct vm_area_struct, vm_rb);
            if (right->rb_subtree_gap >= length) {
                vma = right;
                continue;
            }
        }

        /* Go back up the rbtree to find next candidate node */
        while (true) {
            struct rb_node *prev = &vma->vm_rb;
            if (!rb_parent(prev))
                goto check_highest;
            vma = rb_entry(rb_parent(prev),
                       struct vm_area_struct, vm_rb);
            if (prev == vma->vm_rb.rb_left) {
                gap_start = vm_end_gap(vma->vm_prev);
                gap_end = vm_start_gap(vma);
                goto check_current;
            }
        }
    }

check_highest:
    /* Check highest gap, which does not precede any rbtree node */
    gap_start = mm->highest_vm_end;
    gap_end = ULONG_MAX;  /* Only for VM_BUG_ON below */
    if (gap_start > high_limit)
        return -ENOMEM;

found:
    /* We found a suitable gap. Clip it with the original low_limit. */
    if (gap_start < info->low_limit)
        gap_start = info->low_limit;

    /* Adjust gap address to the desired alignment */
    gap_start += (info->align_offset - gap_start) & info->align_mask;

    VM_BUG_ON(gap_start + info->length > info->high_limit);
    VM_BUG_ON(gap_start + info->length > gap_end);
    return gap_start;
}
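mmap_region() first calls find_vma_links() to check whether an existing vma overlaps the requested range; if one does, it calls do_munmap() to remove the old mapping.
Linux tries to avoid holes between vmas: as long as the flags of the new mapping match those of the adjacent vma before or after it, the kernel tries to merge them into one vma, which reduces slab cache usage and also avoids wasted holes.
If merging is not possible, a new vma is allocated and its relevant members are initialized; whether physical pages are allocated immediately depends on whether the vma has the page-lock flag (VM_LOCKED).
Finally the new vma is inserted into the process's vma red-black tree and addr is returned.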
unsigned long mmap_region(struct file *file, unsigned long addr,
        unsigned long len, vm_flags_t vm_flags, unsigned long pgoff)
{
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma, *prev;
    int error;
    struct rb_node **rb_link, *rb_parent;
    unsigned long charged = 0;

    /* Check against address space limit. */
    // Check whether total_vm + len would exceed RLIMIT_AS, to make sure the mapping is allowed.
    if (!may_expand_vm(mm, len >> PAGE_SHIFT)) {
        unsigned long nr_pages;

        if (!(vm_flags & MAP_FIXED))
            return -ENOMEM;

        nr_pages = count_vma_pages_range(mm, addr, addr + len);

        if (!may_expand_vm(mm, (len >> PAGE_SHIFT) - nr_pages))
            return -ENOMEM;
    }

    // Walk the process's vma red-black tree. find_vma_links() returns 0 when
    // [addr, addr + len) overlaps no existing vma and the insertion point has been
    // found; it returns -ENOMEM when the range overlaps an existing vma.
    while (find_vma_links(mm, addr, addr + len, &prev, &rb_link,
                  &rb_parent)) {
        // The range overlaps an existing area, so try to munmap it.
        // do_munmap() returns 0 on success; on failure mmap_region() fails.
        if (do_munmap(mm, addr, len))
            return -ENOMEM;
    }

    if (accountable_mapping(file, vm_flags)) {
        charged = len >> PAGE_SHIFT;
        if (security_vm_enough_memory_mm(mm, charged))
            return -ENOMEM;
        vm_flags |= VM_ACCOUNT;
    }

    // A suitable range is available at this point. See whether an existing vma can be
    // extended to cover the new mapping, which saves slab memory and avoids holes.
    vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
            NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
    if (vma)
        goto out;

    // No merge candidate: allocate a new vma.
    vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
    if (!vma) {
        error = -ENOMEM;
        goto unacct_error;
    }

    // Initialize the vma.
    vma->vm_mm = mm;
    vma->vm_start = addr;
    vma->vm_end = addr + len;
    vma->vm_flags = vm_flags;
    vma->vm_page_prot = vm_get_page_prot(vm_flags);
    vma->vm_pgoff = pgoff;
    INIT_LIST_HEAD(&vma->anon_vma_chain);

    if (file) {
        // File mapping.
        if (vm_flags & VM_DENYWRITE) {
            error = deny_write_access(file);
            if (error)
                goto free_vma;
        }
        if (vm_flags & VM_SHARED) {
            error = mapping_map_writable(file->f_mapping);
            if (error)
                goto allow_write_and_free_vma;
        }

        vma->vm_file = get_file(file);
        // Call the mmap member of the file's operations table.
        error = file->f_op->mmap(file, vma);
        if (error)
            goto unmap_and_free_vma;

        WARN_ON_ONCE(addr != vma->vm_start);

        addr = vma->vm_start;
        vm_flags = vma->vm_flags;
    } else if (vm_flags & VM_SHARED) {
        // Shared anonymous area.
        error = shmem_zero_setup(vma);
        if (error)
            goto free_vma;
    }

    // Insert the new vma into the address space's red-black tree and update the counters.
    vma_link(mm, vma, prev, rb_link, rb_parent);
    /* Once vma denies write, undo our temporary denial count */
    if (file) {
        if (vm_flags & VM_SHARED)
            mapping_unmap_writable(file->f_mapping);
        if (vm_flags & VM_DENYWRITE)
            allow_write_access(file);
    }
    file = vma->vm_file;
out:
    perf_event_mmap(vma);

    vm_stat_account(mm, vm_flags, file, len >> PAGE_SHIFT);
    if (vm_flags & VM_LOCKED) {
        if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
                    vma == get_gate_vma(current->mm)))
            mm->locked_vm += (len >> PAGE_SHIFT);
        else
            vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
    }

    if (file)
        uprobe_mmap(vma);

    vma->vm_flags |= VM_SOFTDIRTY;

    vma_set_page_prot(vma);

    return addr;

unmap_and_free_vma:
    vma->vm_file = NULL;
    fput(file);

    /* Undo any partial mapping done by a device driver. */
    unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
    charged = 0;
    if (vm_flags & VM_SHARED)
        mapping_unmap_writable(file->f_mapping);
allow_write_and_free_vma:
    if (vm_flags & VM_DENYWRITE)
        allow_write_access(file);
free_vma:
    kmem_cache_free(vm_area_cachep, vma);
unacct_error:
    if (charged)
        vm_unacct_memory(charged);
    return error;
}
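References: 《linux進程地址空間(3) 內存映射(1)mmap與do_mmap》, 《進程地址空間 get_unmmapped_area()》
mmap_region() checks whether the target range is already in use in the current process's address space; if it is, the old mapping has to be removed first, and if that fails mmap_region() returns -ENOMEM. The check is needed here because when MAP_FIXED is set, get_unmapped_area() returns the requested address without checking for overlap.
munmap() removes a memory mapping; its core function is do_munmap().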
SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
{
    profile_munmap(addr);
    return vm_munmap(addr, len);
}

int vm_munmap(unsigned long start, size_t len)
{
    int ret;
    struct mm_struct *mm = current->mm;

    down_write(&mm->mmap_sem);
    ret = do_munmap(mm, start, len);
    up_write(&mm->mmap_sem);
    return ret;
}

int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
{
    unsigned long end;
    struct vm_area_struct *vma, *prev, *last;

    if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
        return -EINVAL;

    len = PAGE_ALIGN(len);
    if (len == 0)
        return -EINVAL;

    /* Find the first overlapping VMA */
    // find_vma() returns the first vma whose end lies above start.
    vma = find_vma(mm, start);
    // Nothing found: nothing to unmap, return 0.
    if (!vma)
        return 0;
    prev = vma->vm_prev;

    end = start + len;
    // The range to release ends at or before the vma's start, so they do not overlap.
    if (vma->vm_start >= end)
        return 0;

    if (start > vma->vm_start) {
        int error;

        if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
            return -ENOMEM;

        // start lies inside the vma, so [vm_start, start) has to survive: split the vma at start.
        error = __split_vma(mm, vma, start, 0);
        if (error)
            return error;
        prev = vma;
    }

    // Find the vma that contains the end of the range to release.
    last = find_vma(mm, end);
    if (last && end > last->vm_start) {
        // end lies inside last, so [end, vm_end) has to survive: split last at end.
        int error = __split_vma(mm, last, end, 1);
        if (error)
            return error;
    }
    vma = prev ? prev->vm_next : mm->mmap;

    if (mm->locked_vm) {
        struct vm_area_struct *tmp = vma;
        while (tmp && tmp->vm_start < end) {
            if (tmp->vm_flags & VM_LOCKED) {
                mm->locked_vm -= vma_pages(tmp);
                munlock_vma_pages_all(tmp);    // A VM_LOCKED range has to be unlocked first.
            }
            tmp = tmp->vm_next;
        }
    }

    detach_vmas_to_be_unmapped(mm, vma, prev, end);
    unmap_region(mm, vma, prev, start, end);    // Release the pages actually in use.

    arch_unmap(mm, vma, start, end);

    /* Fix up all other VM information */
    remove_vma_list(mm, vma);                   // Remove the vmas from the mm_struct bookkeeping.

    return 0;
}

static void unmap_region(struct mm_struct *mm,
        struct vm_area_struct *vma, struct vm_area_struct *prev,
        unsigned long start, unsigned long end)
{
    struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
    struct mmu_gather tlb;

    lru_add_drain();
    tlb_gather_mmu(&tlb, mm, start, end);
    update_hiwater_rss(mm);
    // Walk all page table entries covering the range.
    unmap_vmas(&tlb, vma, start, end);
    // Free the page tables emptied by the previous step.
    free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
                 next ? next->vm_start : USER_PGTABLES_CEILING);
    // Flush the TLB; on multiprocessor systems, free_pages_and_swap_cache() is called to release the page frames.
    tlb_finish_mmu(&tlb, start, end);
}

void unmap_vmas(struct mmu_gather *tlb,
        struct vm_area_struct *vma, unsigned long start_addr,
        unsigned long end_addr)
{
    struct mm_struct *mm = vma->vm_mm;

    mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
    for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
        unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
    mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
}
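References: 《內存管理API之do_munmap》, 《釋放線性地址區間》.
Changes a process makes to mapped memory are not written back to disk immediately; the kernel writes dirty pages back at some later point, and munmap() by itself does not guarantee that the data has reached the disk.
msync() synchronizes the contents of the mapped memory to the file on disk.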
SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
{
    unsigned long end;
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;
    int unmapped_error = 0;
    int error = -EINVAL;

    if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
        goto out;
    if (offset_in_page(start))
        goto out;
    if ((flags & MS_ASYNC) && (flags & MS_SYNC))
        goto out;
    error = -ENOMEM;
    len = (len + ~PAGE_MASK) & PAGE_MASK;
    end = start + len;
    if (end < start)
        goto out;
    error = 0;
    if (end == start)
        goto out;
    /*
     * If the interval [start,end) covers some unmapped address ranges,
     * just ignore them, but return -ENOMEM at the end.
     */
    down_read(&mm->mmap_sem);
    vma = find_vma(mm, start);
    for (;;) {
        struct file *file;
        loff_t fstart, fend;

        /* Still start < end. */
        error = -ENOMEM;
        if (!vma)
            goto out_unlock;
        /* Here start < vma->vm_end. */
        if (start < vma->vm_start) {
            start = vma->vm_start;
            if (start >= end)
                goto out_unlock;
            unmapped_error = -ENOMEM;
        }
        /* Here vma->vm_start <= start < vma->vm_end. */
        if ((flags & MS_INVALIDATE) &&
                (vma->vm_flags & VM_LOCKED)) {
            error = -EBUSY;
            goto out_unlock;
        }
        file = vma->vm_file;
        fstart = (start - vma->vm_start) +
             ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
        fend = fstart + (min(end, vma->vm_end) - start) - 1;
        start = vma->vm_end;
        if ((flags & MS_SYNC) && file &&
                (vma->vm_flags & VM_SHARED)) {
            get_file(file);
            up_read(&mm->mmap_sem);
            error = vfs_fsync_range(file, fstart, fend, 1);
            fput(file);
            if (error || start >= end)
                goto out;
            down_read(&mm->mmap_sem);
            vma = find_vma(mm, start);
        } else {
            if (start >= end) {
                error = 0;
                goto out_unlock;
            }
            vma = vma->vm_next;
        }
    }
out_unlock:
    up_read(&mm->mmap_sem);
out:
    return error ? : unmapped_error;
}

int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
{
    struct inode *inode = file->f_mapping->host;

    if (!file->f_op->fsync)
        return -EINVAL;
    if (!datasync && (inode->i_state & I_DIRTY_TIME)) {
        spin_lock(&inode->i_lock);
        inode->i_state &= ~I_DIRTY_TIME;
        spin_unlock(&inode->i_lock);
        mark_inode_dirty_sync(inode);
    }
    return file->f_op->fsync(file, start, end, datasync);
}
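getconf PAGESIZE shows the page size of the current system, which here is 4096.
malloc() does not always obtain memory through brk(); once the request reaches 128 KB (glibc's default mmap threshold), the memory comes from mmap instead.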
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX (4096*31+4072)

int main(void)
{
    int i = 0;
    char *array = (char *)malloc(MAX);

    if (array == NULL) {
        perror("malloc");
        return 1;
    }

    /* Initialize the buffer, then touch every byte. */
    memset(array, 0, MAX);
    for (i = 0; i < MAX; ++i)
        ++array[i];

    free(array);
    return 0;
}
The following shows how different values of MAX affect malloc.
With MAX set to (4096*31+4072), strace shows:
...
brk(0x244c000) = 0x244c000
brk(0x242c000) = 0x242c000
exit_group(0) = ?
+++ exited with 0 +++
With MAX set to (4096*31+4073), strace shows:
...
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f12b88c9000
munmap(0x7f12b88c9000, 135168) = 0
exit_group(0) = ?
+++ exited with 0 +++
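So once the request gets close to 128 KB, malloc() switches to mmap. The mapping is 135168 bytes (33 pages, 132 KB): 128 KB plus one additional page, presumably because the allocator's per-chunk bookkeeping pushes the size past 128 KB and the total is then rounded up to a whole page.
As mentioned above, operating on memory after mmap() can be faster than ordinary read()/write(); here is a simple test.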
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <fcntl.h>
#include <sys/mman.h>

#define MAX (1024*128)

/* Elapsed time between two timevals in microseconds, including the seconds part. */
static long elapsed_us(const struct timeval *start, const struct timeval *end)
{
    return (end->tv_sec - start->tv_sec) * 1000000L + (end->tv_usec - start->tv_usec);
}

int main(void)
{
    int fd = 0;
    struct timeval tv1, tv2;
    char *array = (char *)malloc(MAX);

    if (array == NULL) {
        printf("malloc failed\n");
        return -1;
    }

    /* read/write */
    gettimeofday(&tv1, NULL);
    fd = open("./mmap_test", O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
    if (fd < 0) {
        printf("Open file failed\n");
        return -1;
    }
    if (MAX != read(fd, array, MAX)) {
        printf("Reading data failed...\n");
        return -1;
    }
    memset(array, 'a', MAX);
    lseek(fd, 0, SEEK_SET);
    if (MAX != write(fd, array, MAX)) {
        printf("Writing data failed...\n");
        return -1;
    }
    close(fd);
    gettimeofday(&tv2, NULL);
    free(array);
    printf("Time of read/write: %ldus\n", elapsed_us(&tv1, &tv2));

    /* mmap */
    gettimeofday(&tv1, NULL);
    fd = open("./mmap_test2", O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
    if (fd < 0) {
        printf("Open file failed\n");
        return -1;
    }
    array = mmap(NULL, MAX, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (array == MAP_FAILED) {
        printf("mmap failed\n");
        return -1;
    }
    memset(array, 'b', MAX);
    /* msync() must run while the range is still mapped; after munmap() it fails with ENOMEM. */
    msync(array, MAX, MS_SYNC);
    munmap(array, MAX);
    close(fd);
    gettimeofday(&tv2, NULL);
    printf("Time of mmap/munmap/msync: %ldus\n", elapsed_us(&tv1, &tv2));

    return 0;
}
First create two 128 KB files filled with zeros.
dd bs=1024 count=128 if=/dev/zero of=./mmap_test
dd bs=1024 count=128 if=/dev/zero of=./mmap_test2
The contents of the two files become 'a' and 'b' respectively, and in this run mmap is noticeably ahead:
Time of read/write: 134us
Time of mmap/munmap/msync: 91us
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    sleep(1000);
    return 0;
}
Running the program above under strace gives the following system call sequence.
execve("./sleep", ["./sleep"], [/* 77 vars */]) = 0
brk(NULL) = 0x1286000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=145720, ...}) = 0
mmap(NULL, 145720, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa2e0dec000    <-- [1] read-only private file mapping, released at [a]
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\t\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1868984, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa2e0deb000    <-- [2] one anonymous page, 0x7fa2e0deb000-0x7fa2e0dec000, read-write
mmap(NULL, 3971488, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fa2e0821000    <-- [3] private file mapping, readable and executable, 0x7fa2e0821000-0x7fa2e0beb000
mprotect(0x7fa2e09e1000, 2097152, PROT_NONE) = 0    <-- [4] change 0x7fa2e09e1000-0x7fa2e0be1000 to no read/write/execute access
mmap(0x7fa2e0be1000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1c0000) = 0x7fa2e0be1000    <-- [5] private file mapping at a fixed address, read-write, 0x7fa2e0be1000-0x7fa2e0be7000
mmap(0x7fa2e0be7000, 14752, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fa2e0be7000    <-- [6] private anonymous mapping at a fixed address, read-write, 0x7fa2e0be7000-0x7fa2e0beb000
close(3) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa2e0dea000    <-- [7] one anonymous page, 0x7fa2e0dea000-0x7fa2e0deb000, read-write
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa2e0de9000    <-- [8] one anonymous page, 0x7fa2e0de9000-0x7fa2e0dea000, read-write
arch_prctl(ARCH_SET_FS, 0x7fa2e0dea700) = 0
mprotect(0x7fa2e0be1000, 16384, PROT_READ) = 0    <-- [9] make 0x7fa2e0be1000-0x7fa2e0be5000 of the mapping created by [5] read-only
mprotect(0x600000, 4096, PROT_READ) = 0
mprotect(0x7fa2e0e10000, 4096, PROT_READ) = 0
munmap(0x7fa2e0dec000, 145720) = 0    <-- [a] release the mapping created by [1]
nanosleep({1000, 0}, 0x7ffef87e2c10) = 0    <-- at this point cat /proc/xxx/maps shows that the mapping created by [1] is gone
exit_group(0) = ?
+++ exited with 0 +++
The effect of each mmap()/munmap() call on the process's mappings is analysed below.
00400000-00401000 r-xp 00000000 08:08 3415949 /home/al/mmap/sleep
00600000-00601000 r--p 00000000 08:08 3415949 /home/al/mmap/sleep
00601000-00602000 rw-p 00001000 08:08 3415949 /home/al/mmap/sleep
7fa2e0821000-7fa2e09e1000 r-xp 00000000 08:08 3185985 /lib/x86_64-linux-gnu/libc-2.23.so    <-- private file mapping created by [3], readable and executable
7fa2e09e1000-7fa2e0be1000 ---p 001c0000 08:08 3185985 /lib/x86_64-linux-gnu/libc-2.23.so    <-- created by [3]; [4] changed it from readable/executable to no access
7fa2e0be1000-7fa2e0be5000 r--p 001c0000 08:08 3185985 /lib/x86_64-linux-gnu/libc-2.23.so    <-- created by [3]; [5] remapped it read-write; [9] made it read-only
7fa2e0be5000-7fa2e0be7000 rw-p 001c4000 08:08 3185985 /lib/x86_64-linux-gnu/libc-2.23.so    <-- created by [3]; [5] remapped it read-write
7fa2e0be7000-7fa2e0beb000 rw-p 00000000 00:00 0    <-- created by [3]; replaced by the private anonymous fixed-address mapping from [6], read-write
7fa2e0beb000-7fa2e0c11000 r-xp 00000000 08:08 3185983 /lib/x86_64-linux-gnu/ld-2.23.so
7fa2e0de9000-7fa2e0dec000 rw-p 00000000 00:00 0    <-- [2], [7] and [8] are all private anonymous read-write mappings, so their vmas were merged
7fa2e0e10000-7fa2e0e11000 r--p 00025000 08:08 3185983 /lib/x86_64-linux-gnu/ld-2.23.so
7fa2e0e11000-7fa2e0e12000 rw-p 00026000 08:08 3185983 /lib/x86_64-linux-gnu/ld-2.23.so
7fa2e0e12000-7fa2e0e13000 rw-p 00000000 00:00 0
7ffef87c3000-7ffef87e4000 rw-p 00000000 00:00 0 [stack]
7ffef87e4000-7ffef87e7000 r--p 00000000 00:00 0 [vvar]
7ffef87e7000-7ffef87e9000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
For an explanation, see the description in 《UNIX系統編程手冊》.
Reference: 《linux內存映射mmap原理分析》