Linux Memory Management (12): Reverse Mapping (RMAP)

Series: Linux Memory Management

Keywords: RMAP, VMA, AV, AVC

 

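Reverse mapping is named in contrast to the forward mapping from virtual addresses to physical addresses: it maps a physical page back to the virtual address ranges (VMAs) that map it.

RMAP works because struct anon_vma, struct anon_vma_chain and struct vm_area_struct are linked together, so the kernel can start from a physical page and look up the VMAs that map it.

While a process uses virtual memory, its page table entries (PTEs) record which physical page each virtual page maps to.

 

One physical page can be mapped by the virtual address spaces of several processes at the same time, but a virtual page maps to at most one physical page at any moment.

Different virtual pages end up mapping the same physical page because a child process clones its parent's VMAs, and because of the KSM mechanism.

 

When a page is to be reclaimed, the kernel has to find every process that uses the page and then tear down each virtual-to-physical mapping.

For anonymous pages the actual unmapping is driven by rmap_walk_anon(), which shows the chain from struct page to struct anon_vma, to struct anon_vma_chain, to struct vm_area_struct.

1. The parent process allocates an anonymous page

When the parent process allocates physical memory for a VMA in its address space, the result is usually an anonymous page.

do_anonymous_page() allocates anonymous pages; do_wp_page() also produces a new anonymous page when copy-on-write (COW) occurs.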

static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
        unsigned long address, pte_t *page_table, pmd_t *pmd,
        unsigned int flags)
{
...
    /* Allocate our own private page. */
    if (unlikely(anon_vma_prepare(vma)))    /* prepare the struct anon_vma and the anon_vma_chain list for this VMA */
        goto oom;
    page = alloc_zeroed_user_highpage_movable(vma, address);    /* allocate a zeroed, movable page, from highmem where available */
    if (!page)
        goto oom;
...
    inc_mm_counter_fast(mm, MM_ANONPAGES);
    page_add_new_anon_rmap(page, vma, address);    /* set up the reverse mapping for the new page */
    mem_cgroup_commit_charge(page, memcg, false);
    lru_cache_add_active_or_unevictable(page, vma);
...
}

 The RMAP subsystem has two key data structures: struct anon_vma, abbreviated AV, and struct anon_vma_chain, abbreviated AVC.

struct anon_vma {
    struct anon_vma *root;        /* Root of this anon_vma tree */
    struct rw_semaphore rwsem;    /* W: modification, R: walking the list */
    /*
     * The refcount is taken on an anon_vma when there is no
     * guarantee that the vma of page tables will exist for
     * the duration of the operation. A caller that takes
     * the reference is responsible for clearing up the
     * anon_vma if they are the last user on release
     */
    atomic_t refcount;    /* reference count on this anon_vma */

    /*
     * Count of child anon_vmas and VMAs which points to this anon_vma.
     *
     * This counter is used for making decision about reusing anon_vma
     * instead of forking new one. See comments in function anon_vma_clone.
     */
    unsigned degree;

    struct anon_vma *parent;    /* Parent of this anon_vma */

    /*
     * NOTE: the LSB of the rb_root.rb_node is set by
     * mm_take_all_locks() _after_ taking the above lock. So the
     * rb_root must only be read/written after taking the above lock
     * to be sure to see a valid next pointer. The LSB bit itself
     * is serialized by a system wide lock only visible to
     * mm_take_all_locks() (mm_all_locks_mutex).
     */
    struct rb_root rb_root;    /* Interval tree of private "related" vmas (red-black tree root) */
};

 

struct anon_vma_chain is the hub that links the parent and child processes:

struct anon_vma_chain {
    struct vm_area_struct *vma;    /* the VMA this AVC belongs to */
    struct anon_vma *anon_vma;    /* the anon_vma; may be the parent's or the child's */
    struct list_head same_vma;   /* locked by mmap_sem & page_table_lock; list node, normally linked into vma->anon_vma_chain */
    struct rb_node rb;            /* locked by anon_vma->rwsem; tree node, normally inserted into anon_vma->rb_root */
    unsigned long rb_subtree_last;
#ifdef CONFIG_DEBUG_VM_RB
    unsigned long cached_vma_start, cached_vma_last;
#endif
};

 

 The following shows how the AV, AVC and VMA are linked together:

int anon_vma_prepare(struct vm_area_struct *vma)
{
    struct anon_vma *anon_vma = vma->anon_vma;    /* vma->anon_vma points to the struct anon_vma */
    struct anon_vma_chain *avc;

    might_sleep();
    if (unlikely(!anon_vma)) {
        struct mm_struct *mm = vma->vm_mm;
        struct anon_vma *allocated;

        avc = anon_vma_chain_alloc(GFP_KERNEL);    /* allocate a struct anon_vma_chain */
        if (!avc)
            goto out_enomem;

        anon_vma = find_mergeable_anon_vma(vma);    /* can the anon_vma of an adjacent VMA be reused? */
        allocated = NULL;
        if (!anon_vma) {
            anon_vma = anon_vma_alloc();    /* nothing to reuse, so allocate a new anon_vma */
            if (unlikely(!anon_vma))
                goto out_enomem_free_avc;
            allocated = anon_vma;
        }

        anon_vma_lock_write(anon_vma);
        /* page_table_lock to protect against threads */
        spin_lock(&mm->page_table_lock);
        if (likely(!vma->anon_vma)) {
            vma->anon_vma = anon_vma;    /* link struct vm_area_struct to struct anon_vma */
            anon_vma_chain_link(vma, avc, anon_vma);    /* link the struct anon_vma_chain to the VMA and the anon_vma */
            /* vma reference or self-parent link for new root */
            anon_vma->degree++;
            allocated = NULL;
            avc = NULL;
        }
        spin_unlock(&mm->page_table_lock);
        anon_vma_unlock_write(anon_vma);

        if (unlikely(allocated))
            put_anon_vma(allocated);
        if (unlikely(avc))
            anon_vma_chain_free(avc);
    }
    return 0;

 out_enomem_free_avc:
    anon_vma_chain_free(avc);
 out_enomem:
    return -ENOMEM;
}

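At this point struct vm_area_struct, struct anon_vma and struct anon_vma_chain are linked together and inserted into the corresponding list and red-black tree.

From an AVC it is easy to reach both the VMA and the AV; from an AV, the red-black tree yields its AVCs and, through them, every VMA attached to that AV; from a VMA, the AV is reachable directly and its AVCs are reachable through the AVC list.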

 

 

static void anon_vma_chain_link(struct vm_area_struct *vma,
                struct anon_vma_chain *avc,
                struct anon_vma *anon_vma)
{
    avc->vma = vma;    /* link the anon_vma_chain to the vm_area_struct */
    avc->anon_vma = anon_vma;    /* link the anon_vma_chain to the anon_vma */
    list_add(&avc->same_vma, &vma->anon_vma_chain);    /* add the AVC to the vma->anon_vma_chain list */
    anon_vma_interval_tree_insert(avc, &anon_vma->rb_root);    /* add the AVC to the tree rooted at anon_vma->rb_root */
}
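
The linking step can be modelled in user space. What follows is a minimal sketch rather than kernel code: the `model_` types and functions are hypothetical, and a plain singly linked list stands in for both the kernel's `list_head` and the interval tree rooted at `anon_vma->rb_root`.

```c
#include <assert.h>
#include <stddef.h>

struct model_avc;

/* Simplified stand-in for struct anon_vma (AV). */
struct model_anon_vma {
    struct model_avc *tree;          /* stands in for rb_root: AVCs attached to this AV */
};

/* Simplified stand-in for struct vm_area_struct (VMA). */
struct model_vma {
    struct model_anon_vma *anon_vma;
    struct model_avc *chain;         /* stands in for vma->anon_vma_chain */
};

/* Simplified stand-in for struct anon_vma_chain (AVC). */
struct model_avc {
    struct model_vma *vma;
    struct model_anon_vma *anon_vma;
    struct model_avc *next_same_vma; /* next AVC on the same VMA */
    struct model_avc *next_same_av;  /* next AVC attached to the same AV */
};

/* Mirrors the four assignments made by anon_vma_chain_link(). */
static void model_chain_link(struct model_vma *vma, struct model_avc *avc,
                             struct model_anon_vma *av)
{
    assert(vma != NULL && avc != NULL && av != NULL);
    avc->vma = vma;
    avc->anon_vma = av;
    avc->next_same_vma = vma->chain; /* push onto the VMA's chain */
    vma->chain = avc;
    avc->next_same_av = av->tree;    /* push onto the AV's "tree" */
    av->tree = avc;
}

/* Count the VMAs reachable from an AV, as a reverse-map walk would visit them. */
static int model_count_vmas(const struct model_anon_vma *av)
{
    int n = 0;

    for (const struct model_avc *a = av->tree; a != NULL; a = a->next_same_av)
        n++;
    return n;
}
```

Linking two VMAs to the same AV and then calling `model_count_vmas()` visits both, which is the property the reverse-map walk relies on.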

 

 After alloc_zeroed_user_highpage_movable() has allocated the physical page, page_add_new_anon_rmap() sets up the reverse mapping for it.

 

void page_add_new_anon_rmap(struct page *page,
    struct vm_area_struct *vma, unsigned long address)
{
    VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
    SetPageSwapBacked(page);    /* set PG_swapbacked: the page can be swapped out to disk */
    atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
    if (PageTransHuge(page))
        __inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
    __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,    /* bump the anonymous page counter of the page's zone */
            hpage_nr_pages(page));
    __page_set_anon_rmap(page, vma, address, 1);    /* mark the page as anonymously mapped */
}

static void __page_set_anon_rmap(struct page *page,
    struct vm_area_struct *vma, unsigned long address, int exclusive)
{
    struct anon_vma *anon_vma = vma->anon_vma;

    BUG_ON(!anon_vma);

    if (PageAnon(page))    /* already anonymous (PAGE_MAPPING_ANON is set in page->mapping) */
        return;

    /*
     * If the page isn't exclusively mapped into this vma,
     * we must use the _oldest_ possible anon_vma for the
     * page mapping!
     */
    if (!exclusive)
        anon_vma = anon_vma->root;

    anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
    page->mapping = (struct address_space *) anon_vma;    /* for an anonymous page, mapping holds the struct anon_vma pointer with PAGE_MAPPING_ANON set */
    page->index = linear_page_index(vma, address);
}
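
The `anon_vma + PAGE_MAPPING_ANON` step stores a flag in the low bit of an aligned pointer. The following user-space sketch illustrates that tagging technique; the `model_` names are hypothetical, and `MODEL_MAPPING_ANON` merely mirrors the role of `PAGE_MAPPING_ANON`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAPPING_ANON ((uintptr_t)1)  /* low bit marks "anonymous" */

struct model_anon_vma { int unused; };      /* alignment keeps bit 0 of its address clear */
struct model_page { uintptr_t mapping; };   /* stands in for page->mapping */

/* Store the AV pointer with the anonymous bit set. */
static void model_set_anon(struct model_page *page, struct model_anon_vma *av)
{
    assert(((uintptr_t)av & MODEL_MAPPING_ANON) == 0);
    page->mapping = (uintptr_t)av | MODEL_MAPPING_ANON;
}

/* Equivalent in spirit to PageAnon(): test the tag bit. */
static bool model_page_anon(const struct model_page *page)
{
    return (page->mapping & MODEL_MAPPING_ANON) != 0;
}

/* Strip the tag bit to recover the AV pointer, or NULL if not anonymous. */
static struct model_anon_vma *model_page_anon_vma(const struct model_page *page)
{
    if (!model_page_anon(page))
        return NULL;
    return (struct model_anon_vma *)(page->mapping & ~MODEL_MAPPING_ANON);
}
```

Because the flag lives inside the pointer, one field can say both what kind of page this is and where its reverse-map root is.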

 

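 Together with the figure above, the relationships can be summarized as follows:

  • Each VMA of the parent process has an anon_vma structure, pointed to by vma->anon_vma.
  • page->mapping of the physical pages belonging to the VMA points to that anon_vma.
  • In the AVC, anon_vma_chain->vma points to the VMA and anon_vma_chain->anon_vma points to the AV.
  • The AVC is added to the vma->anon_vma_chain list.
  • The AVC is added to the red-black tree rooted at anon_vma->rb_root.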

  

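2. The parent process creates a child process

When the parent creates a child with the fork system call, the child copies the parent's VMA structures as its own address space and also copies the parent's PTEs into its own page tables, so parent and child initially map the same physical pages.

Virtual pages in several child processes can therefore map the same physical page at the same time; in addition, virtual pages of unrelated processes can map the same physical page through KSM.

The fork() system call is implemented in kernel/fork.c; dup_mmap() copies the parent's address space and the parent's PTEs: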

static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
{
    struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
    struct rb_node **rb_link, *rb_parent;
    int retval;
    unsigned long charge;

    uprobe_start_dup_mmap();
    down_write(&oldmm->mmap_sem);
    flush_cache_dup_mm(oldmm);
    uprobe_dup_mmap(oldmm, mm);
    /*
     * Not linked in yet - no deadlock potential:
     */
    down_write_nested(&mm->mmap_sem, SINGLE_DEPTH_NESTING);

    mm->total_vm = oldmm->total_vm;
    mm->shared_vm = oldmm->shared_vm;
    mm->exec_vm = oldmm->exec_vm;
    mm->stack_vm = oldmm->stack_vm;

    rb_link = &mm->mm_rb.rb_node;
    rb_parent = NULL;
    pprev = &mm->mmap;
    retval = ksm_fork(mm, oldmm);
    if (retval)
        goto out;
    retval = khugepaged_fork(mm, oldmm);
    if (retval)
        goto out;

    prev = NULL;
    for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {    /* walk every VMA of the parent's address space */
        struct file *file;

        if (mpnt->vm_flags & VM_DONTCOPY) {
            vm_stat_account(mm, mpnt->vm_flags, mpnt->vm_file,
                            -vma_pages(mpnt));
            continue;
        }
        charge = 0;
        if (mpnt->vm_flags & VM_ACCOUNT) {
            unsigned long len = vma_pages(mpnt);

            if (security_vm_enough_memory_mm(oldmm, len)) /* sic */
                goto fail_nomem;
            charge = len;
        }
        tmp = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
        if (!tmp)
            goto fail_nomem;
        *tmp = *mpnt;    /* copy the parent's VMA into the child's newly allocated VMA, tmp */
        INIT_LIST_HEAD(&tmp->anon_vma_chain);
        retval = vma_dup_policy(mpnt, tmp);
        if (retval)
            goto fail_nomem_policy;
        tmp->vm_mm = mm;
        if (anon_vma_fork(tmp, mpnt))    /* create the child's anon_vma structures */
            goto fail_nomem_anon_vma_fork;
        tmp->vm_flags &= ~VM_LOCKED;
        tmp->vm_next = tmp->vm_prev = NULL;
        file = tmp->vm_file;
...
        __vma_link_rb(mm, tmp, rb_link, rb_parent);    /* insert the VMA into the child's red-black tree */
        rb_link = &tmp->vm_rb.rb_right;
        rb_parent = &tmp->vm_rb;

        mm->map_count++;
        retval = copy_page_range(mm, oldmm, mpnt);    /* copy the parent's PTEs into the child's page tables */

        if (tmp->vm_ops && tmp->vm_ops->open)
            tmp->vm_ops->open(tmp);

        if (retval)
            goto out;
    }
...
}
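
The listing above calls anon_vma_fork() without showing its body. The sketch below models what that call is generally understood to do in kernels of this era, and should be read as an assumption rather than as the kernel's code: the child VMA is first attached to every AV the parent VMA is attached to, and then receives an AV of its own whose root and parent refer back to the parent's AV. All `model_` names are hypothetical, and fixed arrays replace the AVC list and the interval tree.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_LINKS 8

struct model_vma;

struct model_anon_vma {
    struct model_anon_vma *root;
    struct model_anon_vma *parent;
    struct model_vma *vmas[MODEL_MAX_LINKS];        /* stands in for rb_root plus AVCs */
    int nr_vmas;
};

struct model_vma {
    struct model_anon_vma *anon_vma;                /* AV used for this VMA's new pages */
    struct model_anon_vma *linked[MODEL_MAX_LINKS]; /* stands in for anon_vma_chain */
    int nr_linked;
};

/* One "AVC": record the VMA in the AV and the AV in the VMA. */
static void model_link(struct model_vma *vma, struct model_anon_vma *av)
{
    assert(av->nr_vmas < MODEL_MAX_LINKS && vma->nr_linked < MODEL_MAX_LINKS);
    av->vmas[av->nr_vmas++] = vma;
    vma->linked[vma->nr_linked++] = av;
}

/* First anonymous fault in a fresh VMA: the AV is its own root and parent. */
static void model_prepare(struct model_vma *vma, struct model_anon_vma *av)
{
    av->root = av;
    av->parent = av;
    vma->anon_vma = av;
    model_link(vma, av);
}

/* Assumed fork behaviour: clone the parent's links, then add a private AV. */
static void model_fork(struct model_vma *child, const struct model_vma *parent,
                       struct model_anon_vma *child_av)
{
    for (int i = 0; i < parent->nr_linked; i++)
        model_link(child, parent->linked[i]);
    child_av->root = parent->anon_vma->root;
    child_av->parent = parent->anon_vma;
    child->anon_vma = child_av;
    model_link(child, child_av);
}
```

Under this model, a walk that starts from the parent's AV reaches both the parent's and the child's VMA, which is what lets a page allocated before fork be unmapped from both processes.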

 

 

 

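3. COW in the child process

If a COW fault occurs in the child's VMA, the new page uses the anon_vma created for the child's VMA; that is, page->mapping points to the anon_vma that belongs to the child's VMA.

do_wp_page() handles this COW case: an anonymous page shared by parent and child is written through the child's VMA.

->page fault
    ->handle_pte_fault
        ->do_wp_page
            ->allocate a new anonymous page
                ->__page_set_anon_rmap sets page->mapping using the child's anon_vma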
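
Which AV ends up in page->mapping depends on the `exclusive` argument of __page_set_anon_rmap() shown earlier. Here is a minimal user-space sketch of that choice, with hypothetical `model_` types and an untagged pointer in place of the real tagged page->mapping:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_anon_vma { struct model_anon_vma *root; };
struct model_vma { struct model_anon_vma *anon_vma; };
struct model_page { struct model_anon_vma *mapping; }; /* NULL means "not anonymous yet" */

/* Mirrors the AV selection in __page_set_anon_rmap(). */
static void model_set_anon_rmap(struct model_page *page,
                                const struct model_vma *vma, bool exclusive)
{
    struct model_anon_vma *av = vma->anon_vma;

    assert(av != NULL);
    if (page->mapping != NULL)   /* already anonymous: keep the existing AV */
        return;
    if (!exclusive)              /* possibly shared: use the oldest AV */
        av = av->root;
    page->mapping = av;
}
```

A page created by COW in the child is exclusive to the child, so it is filed under the child's own AV; a page that may be shared is filed under the root AV so that every related process can be reached from it.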

 

 

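4. Uses of RMAP

Typical cases in which the kernel goes from a struct page to all the VMAs that map it:

  • When the kswapd kernel thread reclaims a page, it must remove every user PTE that maps the anonymous page.
  • During page migration, every user PTE that maps the anonymous page must be removed.

 

try_to_unmap() is the core reverse-mapping function; other kernel subsystems call it to remove all mappings of a page: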

/**
 * try_to_unmap - try to remove all page table mappings to a page
 * @page: the page to get unmapped
 * @flags: action and flags
 *
 * Tries to remove all the page table entries which are mapping this
 * page, used in the pageout path.  Caller must hold the page lock.
 * Return values are:
 *
 * SWAP_SUCCESS    - we succeeded in removing all mappings
 * SWAP_AGAIN    - we missed a mapping, try again later
 * SWAP_FAIL    - the page is unswappable
 * SWAP_MLOCK    - page is mlocked.
 */
int try_to_unmap(struct page *page, enum ttu_flags flags)
{
    int ret;
    struct rmap_walk_control rwc = {
        .rmap_one = try_to_unmap_one,    /* removes the PTE mapping the page in one particular VMA */
        .arg = (void *)flags,
        .done = page_not_mapped,    /* stop condition: the page has no mappings left */
        .anon_lock = page_lock_anon_vma_read,
    };

    VM_BUG_ON_PAGE(!PageHuge(page) && PageTransHuge(page), page);

    /*
     * During exec, a temporary VMA is setup and later moved.
     * The VMA is moved under the anon_vma lock but not the
     * page tables leading to a race where migration cannot
     * find the migration ptes. Rather than increasing the
     * locking requirements of exec(), migration skips
     * temporary VMAs until after exec() completes.
     */
    if ((flags & TTU_MIGRATION) && !PageKsm(page) && PageAnon(page))
        rwc.invalid_vma = invalid_migration_vma;

    ret = rmap_walk(page, &rwc);

    if (ret != SWAP_MLOCK && !page_mapped(page))
        ret = SWAP_SUCCESS;
    return ret;
}
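
The callback-driven structure of try_to_unmap() can be modelled in a few lines. This is a sketch under simplifying assumptions: the `model_` names are hypothetical, an array of VMAs replaces the interval tree, and `mapcount` is a plain count of mappings (the kernel's `_mapcount` starts at -1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { MODEL_SWAP_SUCCESS, MODEL_SWAP_AGAIN, MODEL_SWAP_FAIL, MODEL_SWAP_MLOCK };

struct model_page { int mapcount; };              /* number of PTEs mapping the page */
struct model_vma { bool maps_page; bool mlocked; };

struct model_walk_control {
    int (*rmap_one)(struct model_page *page, struct model_vma *vma);
    bool (*done)(const struct model_page *page);
};

/* Per-VMA step: drop this VMA's mapping unless the VMA is mlocked. */
static int model_unmap_one(struct model_page *page, struct model_vma *vma)
{
    if (!vma->maps_page)
        return MODEL_SWAP_AGAIN;                  /* nothing mapped here, keep walking */
    if (vma->mlocked)
        return MODEL_SWAP_MLOCK;
    vma->maps_page = false;
    page->mapcount--;
    return MODEL_SWAP_AGAIN;
}

static bool model_not_mapped(const struct model_page *page)
{
    return page->mapcount == 0;
}

/* Walk all candidate VMAs, stopping early on error or when done() holds. */
static int model_try_to_unmap(struct model_page *page,
                              struct model_vma *vmas, size_t n)
{
    struct model_walk_control rwc = {
        .rmap_one = model_unmap_one,
        .done = model_not_mapped,
    };
    int ret = MODEL_SWAP_AGAIN;

    for (size_t i = 0; i < n; i++) {
        ret = rwc.rmap_one(page, &vmas[i]);
        if (ret != MODEL_SWAP_AGAIN)
            break;
        if (rwc.done && rwc.done(page))
            break;
    }
    if (ret != MODEL_SWAP_MLOCK && page->mapcount == 0)
        ret = MODEL_SWAP_SUCCESS;
    return ret;
}
```

Keeping the per-VMA action and the stop condition in callbacks is what lets the same walk serve other users, such as migration, with different `rmap_one` functions.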

 

Three kinds of pages need the unmap operation in the kernel: KSM pages, anonymous pages and file-mapped pages:

int rmap_walk(struct page *page, struct rmap_walk_control *rwc)
{
    if (unlikely(PageKsm(page)))
        return rmap_walk_ksm(page, rwc);
    else if (PageAnon(page))
        return rmap_walk_anon(page, rwc);
    else
        return rmap_walk_file(page, rwc);
}

 

The following takes the unmapping of an anonymous page as the example:

static int rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc)
{
    struct anon_vma *anon_vma;
    pgoff_t pgoff;
    struct anon_vma_chain *avc;
    int ret = SWAP_AGAIN;

    anon_vma = rmap_walk_anon_lock(page, rwc);    /* get the anon_vma that page->mapping points to and take its read lock */
    if (!anon_vma)
        return ret;

    pgoff = page_to_pgoff(page);
    anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {    /* walk the AVCs in anon_vma->rb_root; each AVC yields a VMA */
        struct vm_area_struct *vma = avc->vma;
        unsigned long address = vma_address(page, vma);

        if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
            continue;

        ret = rwc->rmap_one(page, vma, address, rwc->arg);    /* actually remove the user PTE */
        if (ret != SWAP_AGAIN)
            break;
        if (rwc->done && rwc->done(page))
            break;
    }
    anon_vma_unlock_read(anon_vma);
    return ret;
}
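
The walk relies on two inverse computations: `linear_page_index()` turns a virtual address in a VMA into the page index stored in page->index, and `vma_address()` turns that index back into the virtual address inside a given VMA. Here is a minimal sketch of the arithmetic, assuming 4 KiB pages and using hypothetical `model_` names:

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SHIFT 12   /* assumption: 4 KiB pages */

struct model_vma {
    unsigned long vm_start;   /* first byte covered by the VMA */
    unsigned long vm_end;     /* first byte past the VMA */
    unsigned long vm_pgoff;   /* page index that corresponds to vm_start */
};

/* Virtual address -> page index, as stored in page->index. */
static unsigned long model_linear_page_index(const struct model_vma *vma,
                                             unsigned long address)
{
    assert(address >= vma->vm_start && address < vma->vm_end);
    return ((address - vma->vm_start) >> MODEL_PAGE_SHIFT) + vma->vm_pgoff;
}

/* Page index -> virtual address of that page inside this VMA. */
static unsigned long model_vma_address(const struct model_vma *vma,
                                       unsigned long pgoff)
{
    return vma->vm_start + ((pgoff - vma->vm_pgoff) << MODEL_PAGE_SHIFT);
}

/* Interval test: does this VMA cover the given page index? */
static bool model_vma_covers(const struct model_vma *vma, unsigned long pgoff)
{
    unsigned long pages = (vma->vm_end - vma->vm_start) >> MODEL_PAGE_SHIFT;

    return pgoff >= vma->vm_pgoff && pgoff < vma->vm_pgoff + pages;
}
```

For a four-page VMA starting at 0x400000 with `vm_pgoff` 0x400, the address 0x402000 maps to index 0x402 and back again; the interval tree lookup answers the `model_vma_covers()` question for many VMAs at once.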

 

The rmap_one callback in struct rmap_walk_control is implemented by try_to_unmap_one(), which ultimately calls page_remove_rmap() and page_cache_release() to tear down the PTE mapping.
