A Source-Code Walkthrough of the Linux Kernel Copy-on-Write Mechanism

Date: 2020-09-14


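Copy-on-write (hereafter COW) is one of the more important mechanisms in the Linux kernel. As is well known, when a parent process forks a child, the child shares all of the parent's private writable pages in read-only form, and a COW page fault occurs as soon as either side attempts to write. So how exactly is COW triggered in the Linux kernel, and how is it handled? This article answers both questions by walking through the relevant source code, with the aim of understanding COW at the source level.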

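Note: the kernel source analysed here is linux-5.0 on the arm64 architecture. The kernel had already reached linux-5.8.x when this article was published, but if you look at that newer source you will find that the relevant code has not changed much. The article discusses copy-on-write from the following angles: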

What the kernel prepares for COW when forking a child
How COW is triggered
How the kernel handles the COW page fault
Reuse of anonymous pages

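1. Starting from fork

Processes are created with fork. When fork creates the child, the child shares resources with the parent, such as fs, files, and mm. Memory is shared along the following path: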

kernel/fork.c
_do_fork->copy_process->copy_mm

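This article is about COW, so it does not go into the sharing of other resources or the other parts of memory sharing (later articles will cover those). Broadly, copy_mm does the following: it allocates an mm_struct instance mm, copies the parent's old_mm into mm, creates the child's own page global directory (pgd), and then walks the parent's VMA list to build the child's VMA list (code segment, data segment, and so on). Next comes the key step, sharing the pages. For efficiency, the kernel does not copy the contents of all of the parent's physical pages; it shares them by copying the page tables. While copying the page tables, the kernel decides for each entry whether to copy it verbatim or to make it read-only in preparation for a COW fault.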

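Sharing the parent's memory ultimately comes down to copying page table entries one at a time.

Below we focus on copy_one_pte, the function that copies a single page table entry.

It first handles the case where the entry is not empty but the physical page is not in memory (the !pte_present(pte) branch), for example a page that has been swapped out to the swap partition. It then handles the case where the physical page is in memory: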

        /*
         * If it's a COW mapping, write protect it both
         * in the parent and the child
         */
        if (is_cow_mapping(vm_flags) && pte_write(pte)) {  // the VMA is private and may be written, and the pte is writable
                ptep_set_wrprotect(src_mm, addr, src_pte);  // make the parent's page table entry read-only
                pte = pte_wrprotect(pte);  // make the entry value destined for the child read-only
        }

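The code block above checks whether the VMA containing the page is a private, writable mapping and whether the parent's page table entry is writable. The VMA test is: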

static inline bool is_cow_mapping(vm_flags_t flags)
{
        return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}
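
As a rough user-space illustration of this test, the sketch below copies three flag values from include/linux/mm.h and evaluates the same expression for a few combinations. The helper check_is_cow_mapping is a hypothetical name added here for illustration; it is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values copied from include/linux/mm.h for illustration. */
#define VM_WRITE    0x00000002UL
#define VM_SHARED   0x00000008UL
#define VM_MAYWRITE 0x00000020UL

/* Same expression as the kernel helper, on a plain unsigned long. */
static inline bool is_cow_mapping(unsigned long flags)
{
        return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/* Hypothetical helper: checks the interesting flag combinations. */
static inline void check_is_cow_mapping(void)
{
        /* Private and writable: a COW mapping. */
        assert(is_cow_mapping(VM_WRITE | VM_MAYWRITE));
        /* Private, currently read-only but allowed to become writable: still COW. */
        assert(is_cow_mapping(VM_MAYWRITE));
        /* Shared and writable: writes go to the shared page, no COW. */
        assert(!is_cow_mapping(VM_WRITE | VM_MAYWRITE | VM_SHARED));
        /* Can never be written: nothing to copy on write. */
        assert(!is_cow_mapping(0));
}
```

The second case is worth noting: the test uses VM_MAYWRITE rather than VM_WRITE, so a private mapping counts as a COW mapping even while it is read-only.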

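If the check succeeds, this is a COW mapping, and the page table entries of both parent and child must be made read-only.

ptep_set_wrprotect(src_mm, addr, src_pte) makes the parent's page table entry read-only, and pte = pte_wrprotect(pte) makes the value about to be written into the child's entry read-only. (Note: before this change pte holds the parent's original entry value, and after it the child's value has still not been written into the child's page table entry.)

The core function that makes an entry read-only is: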

static inline pte_t pte_wrprotect(pte_t pte)
{
        pte = clear_pte_bit(pte, __pgprot(PTE_WRITE));  // clear the writable bit
        pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));  // set the read-only bit
        return pte;
}
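
The bit manipulation can be tried out in user space. The sketch below is a toy model, assuming the arm64 bit positions (PTE_RDONLY is bit 7, PTE_WRITE is bit 51) and using a plain 64-bit integer in place of pte_t; the toy_ names are hypothetical and not part of the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t toy_pteval_t;

/* Bit positions assumed from the arm64 page table definitions. */
#define PTE_RDONLY ((toy_pteval_t)1 << 7)
#define PTE_WRITE  ((toy_pteval_t)1 << 51)

/* Toy counterpart of pte_wrprotect: clear the writable bit, set read-only. */
static inline toy_pteval_t toy_pte_wrprotect(toy_pteval_t pte)
{
        pte &= ~PTE_WRITE;
        pte |= PTE_RDONLY;
        return pte;
}

/* Toy counterpart of pte_write: is the writable bit set? */
static inline bool toy_pte_write(toy_pteval_t pte)
{
        return (pte & PTE_WRITE) != 0;
}

/* Hypothetical helper: write-protecting must not disturb the other bits. */
static inline void check_toy_pte_wrprotect(void)
{
        toy_pteval_t pte = PTE_WRITE | 0x1000;  /* writable, plus an address bit */

        pte = toy_pte_wrprotect(pte);
        assert(!toy_pte_write(pte));
        assert((pte & PTE_RDONLY) != 0);
        assert((pte & 0x1000) == 0x1000);
}
```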

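Back in copy_one_pte, we continue:

So far the parent's page table entry has been modified and we have the value pte that will go into the child's entry. (It has not been written into the child's page table yet, because the value is not fully assembled.) The remaining code finishes assembling the child's entry value: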

        /*
         * If it's a shared mapping, mark it clean in
         * the child
         */
        if (vm_flags & VM_SHARED)  // the VMA is a shared mapping
                pte = pte_mkclean(pte);  // mark the entry value clean
        pte = pte_mkold(pte);  // mark the entry value as not accessed, i.e. clear PTE_AF

        page = vm_normal_page(vma, addr, pte);  // get the struct page for the pte (the page descriptor shared with the parent)
        if (page) {
                get_page(page);  // raise the page's reference count
                page_dup_rmap(page, false);  // note: this does not copy the rmap; it increments page->_mapcount (the map count)
                rss[mm_counter(page)]++;
        } else if (pte_devmap(pte)) {
                page = pte_page(pte);

                /*
                 * Cache coherent device memory behave like regular page and
                 * not like persistent memory page. For more informations see
                 * MEMORY_DEVICE_CACHE_COHERENT in memory_hotplug.h
                 */
                if (is_device_public_page(page)) {
                        get_page(page);
                        page_dup_rmap(page, false);
                        rss[mm_counter(page)]++;
                }
        }

out_set_pte:
        set_pte_at(dst_mm, addr, dst_pte, pte);  // write the assembled value into the child's page table entry
        return 0;

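This completes the fork-time work for pages that need copy-on-write: the page table entries of parent and child are rewritten as read-only (while the VMA itself remains writable), and both point to the same physical page. That is the page-table-level preparation for the COW page fault described next.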
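
The fork-time steps can be condensed into a small user-space model. This is only a sketch under simplifying assumptions: a page is reduced to two counters, a page table entry to a pointer plus a writable flag, and all toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for struct page and a page table entry. */
struct toy_page {
        int refcount;   /* models the page reference count */
        int mapcount;   /* models how many page tables map the page */
};

struct toy_pte {
        struct toy_page *page;
        bool writable;
};

/*
 * Toy version of the COW part of copy_one_pte: write-protect the parent
 * entry and the value for the child, take a reference, bump the map
 * count, then install the value in the child's entry.
 */
static inline void toy_copy_one_pte(struct toy_pte *dst, struct toy_pte *src,
                                    bool cow_mapping)
{
        struct toy_pte pte;

        assert(dst != NULL && src != NULL);
        pte = *src;

        if (cow_mapping && pte.writable) {
                src->writable = false;  /* like ptep_set_wrprotect on the parent */
                pte.writable = false;   /* like pte_wrprotect on the child's value */
        }
        if (pte.page) {
                pte.page->refcount++;   /* like get_page */
                pte.page->mapcount++;   /* like page_dup_rmap */
        }
        *dst = pte;                     /* like set_pte_at */
}
```

After the call, both entries point to the same page and neither is writable, which is the state the fault handler later recognises.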

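2. Conditions that trigger a COW page fault

As long as parent and child only read the COW-shared pages, each can reach the data directly through its own page table and nothing unusual happens. As soon as one side writes, a processor exception is raised, and the fault handling code determines that it is a COW page fault.

On arm64 the exception is routed through the page fault path down to handle_pte_fault, where we start the analysis: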

        if (vmf->flags & FAULT_FLAG_WRITE) {  // the fault was caused by a write access
                if (!pte_write(entry))  // the page table entry is read-only
                        return do_wp_page(vmf);  // handle COW
                entry = pte_mkdirty(entry);
        }

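Reaching this check means that the page table entry exists, the physical page is in memory, the VMA is writable, the access was a write, and the pte is read-only (exactly the state that fork prepared). These are the conditions that identify a COW page fault.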
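
The decision in that fragment can be written out as a tiny classifier. This is a sketch only: the enum and function names are hypothetical, and the two booleans stand in for the FAULT_FLAG_WRITE test and pte_write.

```c
#include <assert.h>
#include <stdbool.h>

/* What the fragment of handle_pte_fault does with the entry. */
enum toy_fault_action {
        TOY_NOTHING,     /* read fault: the write-specific code is skipped */
        TOY_MKDIRTY,     /* write fault on a writable entry: mark it dirty */
        TOY_DO_WP_PAGE   /* write fault on a read-only entry: COW handling */
};

static inline enum toy_fault_action toy_classify(bool write_fault,
                                                 bool pte_writable)
{
        if (write_fault) {
                if (!pte_writable)
                        return TOY_DO_WP_PAGE;
                return TOY_MKDIRTY;
        }
        return TOY_NOTHING;
}

/* Hypothetical helper: the state left behind by fork leads to do_wp_page. */
static inline void check_toy_classify(void)
{
        assert(toy_classify(true, false) == TOY_DO_WP_PAGE);
        assert(toy_classify(true, true) == TOY_MKDIRTY);
        assert(toy_classify(false, false) == TOY_NOTHING);
}
```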

3. Handling the COW page fault

Once the kernel has determined that the exception is a COW page fault, it calls do_wp_page to handle it:

2480 static vm_fault_t do_wp_page(struct vm_fault *vmf)
2481         __releases(vmf->ptl)
2482 {
2483         struct vm_area_struct *vma = vmf->vma;
2484
2485         vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);  // get the struct page for the faulting address
2486         if (!vmf->page) {
2487                 /*
2488                  * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
2489                  * VM_PFNMAP VMA.
2490                  *
2491                  * We should not cow pages in a shared writeable mapping.
2492                  * Just mark the pages writable and/or call ops->pfn_mkwrite.
2493                  */
2494                 if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
2495                                      (VM_WRITE|VM_SHARED))
2496                         return wp_pfn_shared(vmf);  // handle a shared writable mapping
2497
2498                 pte_unmap_unlock(vmf->pte, vmf->ptl);
2499                 return wp_page_copy(vmf);  // handle a private writable mapping
2500         }

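Line 2485 obtains the page structure for the faulting address. If there is no page structure, this is a special mapping that uses raw page frame numbers: wp_pfn_shared handles the shared writable case and wp_page_copy handles the private writable case. This is not the focus of our analysis.

Continuing:

The part to note is line 2522, which decides whether the page can be reused. We will come back to it later.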

2544          * Ok, we need to copy. Oh, well..
2545          */
2546         get_page(vmf->page);
2547
2548         pte_unmap_unlock(vmf->pte, vmf->ptl);
2549         return wp_page_copy(vmf);

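Line 2546 raises the reference count of the original page so that it cannot be freed.

Line 2548 releases the page table lock.

Line 2549 calls wp_page_copy, the core function of COW handling.

We now analyse wp_page_copy in detail: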

2233  * - Allocate a page, copy the content of the old page to the new one.
2234  * - Handle book keeping and accounting - cgroups, mmu-notifiers, etc.
2235  * - Take the PTL. If the pte changed, bail out and release the allocated page
2236  * - If the pte is still the way we remember it, update the page table and all
2237  *   relevant references. This includes dropping the reference the page-table
2238  *   held to the old page, as well as updating the rmap.
2239  * - In any case, unlock the PTL and drop the reference we took to the old page.
2240  */
2241 static vm_fault_t wp_page_copy(struct vm_fault *vmf)
2242 {
2243         struct vm_area_struct *vma = vmf->vma;
2244         struct mm_struct *mm = vma->vm_mm;
2245         struct page *old_page = vmf->page;
2246         struct page *new_page = NULL;
2247         pte_t entry;
2248         int page_copied = 0;
2249         struct mem_cgroup *memcg;
2250         struct mmu_notifier_range range;
2251
2252         if (unlikely(anon_vma_prepare(vma)))
2253                 goto oom;
2254
2255         if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
2256                 new_page = alloc_zeroed_user_highpage_movable(vma,
2257                                                               vmf->address);
2258                 if (!new_page)
2259                         goto oom;
2260         } else {
2261                 new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
2262                                 vmf->address);
2263                 if (!new_page)
2264                         goto oom;
2265                 cow_user_page(new_page, old_page, vmf->address, vma);
2266         }

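Line 2252 attaches an anon_vma instance to the VMA.

Lines 2255 to 2259: if the original page table entry maps the zero page, allocate a movable page (highmem allowed) and initialise it with zeros.

Lines 2261 to 2265: if it is not the zero page, allocate a movable page (highmem allowed) and copy the original page into the new one.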

2268         if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg, false))
2269                 goto oom_free_new;
2270
2271         __SetPageUptodate(new_page);
2272
2273         mmu_notifier_range_init(&range, mm, vmf->address & PAGE_MASK,
2274                                 (vmf->address & PAGE_MASK) + PAGE_SIZE);
2275         mmu_notifier_invalidate_range_start(&range);
2276
2277         /*
2278          * Re-check the pte - we dropped the lock
2279          */
2280         vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
2281         if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
2282                 if (old_page) {
2283                         if (!PageAnon(old_page)) {
2284                                 dec_mm_counter_fast(mm,
2285                                                 mm_counter_file(old_page));
2286                                 inc_mm_counter_fast(mm, MM_ANONPAGES);
2287                         }
2288                 } else {
2289                         inc_mm_counter_fast(mm, MM_ANONPAGES);
2290                 }
2291                 flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
2292                 entry = mk_pte(new_page, vma->vm_page_prot);
2293                 entry = maybe_mkwrite(pte_mkdirty(entry), vma);
2294                 /*
2295                  * Clear the pte entry and flush it first, before updating the
2296                  * pte with the new entry. This will avoid a race condition
2297                  * seen in the presence of one thread doing SMC and another
2298                  * thread doing COW.
2299                  */
2300                 ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
2301                 page_add_new_anon_rmap(new_page, vma, vmf->address, false);
2302                 mem_cgroup_commit_charge(new_page, memcg, false, false);
2303                 lru_cache_add_active_or_unevictable(new_page, vma);
2304                 /*
2305                  * We call the notify macro here because, when using secondary
2306                  * mmu page tables (such as kvm shadow page tables), we want the
2307                  * new page to be mapped directly into the secondary page table.
2308                  */
2309                 set_pte_at_notify(mm, vmf->address, vmf->pte, entry);
2310                 update_mmu_cache(vma, vmf->address, vmf->pte);
2311                 if (old_page) {
2312                         /*
2313                          * Only after switching the pte to the new page may
2314                          * we remove the mapcount here. Otherwise another
2315                          * process may come and find the rmap count decremented
2316                          * before the pte is switched to the new page, and
2317                          * "reuse" the old page writing into it while our pte
2318                          * here still points into it and can be read by other
2319                          * threads.
2320                          *
2321                          * The critical issue is to order this
2322                          * page_remove_rmap with the ptp_clear_flush above.
2323                          * Those stores are ordered by (if nothing else,)
2324                          * the barrier present in the atomic_add_negative
2325                          * in page_remove_rmap.
2326                          *
2327                          * Then the TLB flush in ptep_clear_flush ensures that
2328                          * no process can access the old page before the
2329                          * decremented mapcount is visible. And the old page
2330                          * cannot be reused until after the decremented
2331                          * mapcount is visible. So transitively, TLBs to
2332                          * old page will be flushed before it can be reused.
2333                          */
2334                         page_remove_rmap(old_page, false);
2335                 }
2336
2337                 /* Free the old page.. */
2338                 new_page = old_page;
2339                 page_copied = 1;
2340         } else {
2341                 mem_cgroup_cancel_charge(new_page, memcg, false);
2342         }

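Line 2271 sets the PageUptodate flag on the new page, indicating that it contains valid data.

Line 2280 takes the page table lock.

Lines 2281 to 2339 handle the case where the page table entry read at fault time is the same as the entry read now, after taking the lock.

Line 2341 handles the case where the two entries differ.

We focus on the case where they are the same:

Lines 2282 to 2290 update the mm's page counters.

Line 2291 flushes the page from the cache.

Line 2292 builds the page table entry value from the VMA's access permissions and the new page's descriptor.

Line 2293 marks the entry value dirty and, if the VMA is writable, writable. (This is where the entry that fork made read-only becomes writable again.)

Line 2300 clears the old value of the page table entry and then flushes the TLB entry for the faulting address. (This step is important.)

Line 2301 adds the new physical page to the anonymous reverse mapping of the VMA.

Line 2303 adds the new physical page to the active or unevictable LRU list.

Line 2309 writes the assembled value into the page table entry. Only now does the change take effect.

Line 2334 removes the reverse mapping from the old page to the virtual page. An important part of this is atomic_add_negative(-1, &page->_mapcount), which decrements the page's map count by one.

Lines 2344 to 2347 (not shown above) drop a reference on the old page and release the page table lock.

Lines 2353 to 2364 (not shown above): if the new physical page has been mapped and the old page was locked in memory (mlocked), the old page is unlocked.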

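That completes the copy-on-write path. In summary: allocate a new physical page, copy the contents of the original page into it, then update the page table entry to point to the new page and make it writable (provided the VMA is writable).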
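
The summary above maps onto a few lines of user-space C. The following is a minimal sketch, not the kernel's implementation: malloc stands in for the page allocator, there is no locking, TLB handling, or cgroup accounting, and every toy_ name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 64

struct toy_page {
        int mapcount;               /* how many page tables map this page */
        char data[TOY_PAGE_SIZE];
};

struct toy_pte {
        struct toy_page *page;
        bool writable;
};

/*
 * Toy version of wp_page_copy: allocate a new page, copy the old
 * contents, point the entry at the new page and make it writable,
 * then drop the old page's map count. Returns 0 on success and -1
 * if the allocation fails (the "oom" path in the kernel code).
 */
static inline int toy_wp_page_copy(struct toy_pte *pte)
{
        struct toy_page *old_page;
        struct toy_page *new_page;

        assert(pte != NULL && pte->page != NULL);
        old_page = pte->page;

        new_page = malloc(sizeof(*new_page));
        if (!new_page)
                return -1;

        memcpy(new_page->data, old_page->data, TOY_PAGE_SIZE);
        new_page->mapcount = 1;

        pte->page = new_page;       /* the entry now maps the private copy */
        pte->writable = true;
        old_page->mapcount--;       /* like page_remove_rmap on the old page */
        return 0;
}
```

The order mirrors the kernel code in one respect: the old page's map count is dropped only after the entry has been switched to the new page. The caller owns the new page and frees it with free when done.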

One question was left open earlier: the reuse_swap_page handling in do_wp_page, that is, the handling of an anonymous page that has a single owner.

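4. Reuse of anonymous pages

Suppose the following happens. Parent process P forks child process A, and a private writable anonymous page page1 is shared between them. The kernel maps this page into the virtual address space of both processes and makes both page table entries read-only, so the map count of page1 (tracked in _mapcount) is 2. Now suppose the child writes to page1. A COW fault occurs, and the handler allocates a new page page2 for child A, maps it at the faulting virtual page, and makes the child's page table entry writable. From then on the child can write to page2 freely without affecting the parent. As the analysis above showed, the map count of page1 is decremented to 1, which means page1 is now mapped only by the parent. What happens when the parent then writes to page1? Does another COW take place and allocate a new page?

Let us look for the answer in the source code.

Lines 2502 to 2541 of do_wp_page are the part to analyse: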

2502         /*
2503          * Take out anonymous pages first, anonymous shared vmas are
2504          * not dirty accountable.
2505          */
2506         if (PageAnon(vmf->page) && !PageKsm(vmf->page)) {
2507                 int total_map_swapcount;
2508                 if (!trylock_page(vmf->page)) {
2509                         get_page(vmf->page);
2510                         pte_unmap_unlock(vmf->pte, vmf->ptl);
2511                         lock_page(vmf->page);
2512                         vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
2513                                         vmf->address, &vmf->ptl);
2514                         if (!pte_same(*vmf->pte, vmf->orig_pte)) {
2515                                 unlock_page(vmf->page);
2516                                 pte_unmap_unlock(vmf->pte, vmf->ptl);
2517                                 put_page(vmf->page);
2518                                 return 0;
2519                         }
2520                         put_page(vmf->page);
2521                 }
2522                 if (reuse_swap_page(vmf->page, &total_map_swapcount)) {
2523                         if (total_map_swapcount == 1) {
2524                                 /*
2525                                  * The page is all ours. Move it to
2526                                  * our anon_vma so the rmap code will
2527                                  * not search our parent or siblings.
2528                                  * Protected against the rmap code by
2529                                  * the page lock.
2530                                  */
2531                                 page_move_anon_rmap(vmf->page, vma);
2532                         }
2533                         unlock_page(vmf->page);
2534                         wp_page_reuse(vmf);
2535                         return VM_FAULT_WRITE;
2536                 }
2537                 unlock_page(vmf->page);
2538         } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
2539                                         (VM_WRITE|VM_SHARED))) {
2540                 return wp_page_shared(vmf);
2541         }

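Line 2506 selects anonymous pages that are not KSM pages.

Line 2522 checks whether the page is owned by the current process alone (reuse_swap_page returns true when the page's combined map count and swap count is at most 1).

Line 2534 calls wp_page_reuse to handle it (this is the key part):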

2195 /*
2196  * Handle write page faults for pages that can be reused in the current vma
2197  *
2198  * This can happen either due to the mapping being with the VM_SHARED flag,
2199  * or due to us being the last reference standing to the page. In either
2200  * case, all we need to do here is to mark the page as writable and update
2201  * any related book-keeping.
2202  */
2203 static inline void wp_page_reuse(struct vm_fault *vmf)
2204         __releases(vmf->ptl)
2205 {
2206         struct vm_area_struct *vma = vmf->vma;
2207         struct page *page = vmf->page;
2208         pte_t entry;
2209         /*
2210          * Clear the pages cpupid information as the existing
2211          * information potentially belongs to a now completely
2212          * unrelated process.
2213          */
2214         if (page)
2215                 page_cpupid_xchg_last(page, (1 << LAST_CPUPID_SHIFT) - 1);
2216
2217         flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
2218         entry = pte_mkyoung(vmf->orig_pte);
2219         entry = maybe_mkwrite(pte_mkdirty(entry), vma);
2220         if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
2221                 update_mmu_cache(vma, vmf->address, vmf->pte);
2222         pte_unmap_unlock(vmf->pte, vmf->ptl);
2223 }

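The code shows this clearly:

Line 2218 marks the entry as accessed.

Line 2219 marks the entry dirty and, if the VMA containing the page is writable, makes the entry writable.

Line 2220 writes the assembled value into the page table entry (this is where the entry is actually updated). Note that on arm64 ptep_set_access_flags also flushes the TLB entry for the page.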
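
The scenario with P, A, page1, and page2 can be replayed in a toy model. The sketch below assumes a drastically simplified rule (reuse when the map count is 1, otherwise copy), ignores the swap count, page lock, and KSM checks, and copies into a caller-supplied spare page instead of allocating one. All toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 16

struct toy_page {
        int mapcount;
        char data[TOY_PAGE_SIZE];
};

struct toy_pte {
        struct toy_page *page;
        bool writable;
};

enum toy_wp_result { TOY_REUSED, TOY_COPIED };

/*
 * Toy version of the decision in do_wp_page. If the faulting entry is
 * the only mapping of the page, just make it writable (wp_page_reuse).
 * Otherwise copy the page into "spare" and remap (wp_page_copy).
 */
static inline enum toy_wp_result toy_do_wp_page(struct toy_pte *pte,
                                                struct toy_page *spare)
{
        struct toy_page *old_page;

        assert(pte != NULL && pte->page != NULL && spare != NULL);
        old_page = pte->page;

        if (old_page->mapcount == 1) {
                pte->writable = true;   /* sole owner: no copy needed */
                return TOY_REUSED;
        }

        memcpy(spare->data, old_page->data, TOY_PAGE_SIZE);
        spare->mapcount = 1;
        pte->page = spare;
        pte->writable = true;
        old_page->mapcount--;
        return TOY_COPIED;
}
```

With page1 mapped by both P and A (map count 2), a write by A returns TOY_COPIED and leaves page1 with a map count of 1; a later write by P returns TOY_REUSED and P keeps page1.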

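That covers the whole COW mechanism. The process involves countless technical details that are not all spelled out here; related topics may be discussed later when there is a chance.

5. Summary

To summarise the whole copy-on-write (COW) process: it begins when a parent forks a child. Parent and child share all private writable physical pages, and the corresponding page table entries in both are made read-only. (Sharing here does not refer to the usual distinction between shared and private mappings; it means the same page is mapped into each process's page table.) When either side tries to write to a shared physical page, the read-only entry causes a COW page fault. The fault handler allocates a new physical page for the writer, copies the contents of the shared page into it, and establishes the page table mapping for the new page. The writing process can then continue without affecting the other side, and from that point parent and child go their separate ways for that page. When a shared page is finally left with a single owner (that is, every other process that mapped it has taken a COW fault and received its own page), a write by the owner reuses the page instead of allocating a new one.

An experiment:

The program has a global variable num = 10 and prints its value. It then forks a child, which sets num = 100 and prints it. The parent sleeps for 1 s so that the child finishes first, then prints num again.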

 1 #include <stdio.h>
 2 #include <unistd.h>
 3 #include <sys/types.h>
 4
 5
 6 int num = 10;
 7
 8 int main(int argc,char **argv)
 9 {
10
11         pid_t pid;
12
13         printf("###%s:%d  pid=%d num=%d###\n", __func__, __LINE__,  getpid(), num);
14
15
16         pid = fork();
17         if (pid < 0) {
18                 printf("fail to fork\n");
19                 return -1;
20         } else if (pid == 0) { //child process
21                 num = 100;
22                 printf("### This is child process pid=%d  num=%d###\n", getpid(), num);
23                 _exit(0);
24         } else { //parent process
25                 sleep(1);
26                 printf("### This is parent process  pid=%d  num=%d###\n", getpid(), num);
27                 _exit(0);
28         }
29
30         return 0;
31 }

Consider: what value of num is printed at lines 13, 22, and 26?

Compile and run:

hanch@hanch-VirtualBox:~/test/COW$ gcc fork-cow-test.c -o fork-cow-test
hanch@hanch-VirtualBox:~/test/COW$ ./fork-cow-test 
###main:13  pid=26844 num=10###
### This is child process pid=26845  num=100###
### This is parent process  pid=26844  num=10###

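The parent's global num is 10, and after fork the child changes it so that its own num is 100. What actually happened is that fork made the page containing num read-only in both parent and child and shared it. When the child wrote to the variable, a COW page fault occurred. This is transparent to the application, but the kernel did a lot of work in the fault handler: it allocated a physical page for the child, copied the contents of the page containing the parent's num into it, and updated the page table entry for the child's virtual address to be writable and to map the newly allocated page. The fault handler then returned from kernel space to user space, and the processor re-executed the store instruction; only then was the child's num changed to 100. Keep in mind that the parent's page table entry for the page containing num is still read-only at this point, so a later write by the parent still takes a write-protection fault. As section 4 showed, if the parent is by then the page's only owner, the page is reused rather than copied.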
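
The same observation can be checked programmatically instead of by reading the printed output. The variant below is a sketch with a hypothetical function name, cow_demo: it replaces sleep with waitpid and passes the child's value back through the exit status, which works here only because 100 fits in the 8 bits an exit status carries.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int num = 10;

/*
 * Fork a child that overwrites num, then report the value seen by the
 * child (via its exit status) and the value still seen by the parent.
 * Returns 0 on success and -1 if fork or waitpid fails.
 */
static inline int cow_demo(int *child_num, int *parent_num)
{
        pid_t pid;
        int status;

        assert(child_num != NULL && parent_num != NULL);

        pid = fork();
        if (pid < 0)
                return -1;

        if (pid == 0) {
                num = 100;      /* the write lands in the child's private copy */
                _exit(num);     /* 100 fits in the 8-bit exit status */
        }

        if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
                return -1;

        *child_num = WEXITSTATUS(status);
        *parent_num = num;      /* the parent's page was never modified */
        return 0;
}
```

Calling cow_demo(&c, &p) from main should leave c at 100 and p at 10, matching the output above.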

Finally, a figure helps to picture the whole COW process: