Linux Memory Management (13): Page Reclaim

Series: Linux memory management series

Keywords: LRU, active/inactive lists for file-cache/anonymous pages, Refault Distance

 

Page reclaim relies on the LRU lists to classify pages into five categories: inactive anonymous pages, active anonymous pages, inactive file-cache pages, active file-cache pages, and unevictable pages.

When memory is tight, file-cache pages are evicted first and anonymous pages only afterwards, because file-cache pages have backing storage while anonymous pages must first be written out to the swap area.

Page reclaim therefore works through three mechanisms: (1) unmodified file-cache pages can simply be dropped; (2) modified file-cache pages must be written back to the backing device; (3) rarely used anonymous pages are written out to the swap area to free physical memory, a mechanism known as swapping.
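
A minimal sketch of this policy in pseudo-C (illustrative only; these names are made up and are not kernel APIs):

enum reclaim_action { DROP, WRITEBACK_THEN_DROP, SWAP_OUT };

/* Illustrative only: the three reclaim mechanisms described above. */
static enum reclaim_action reclaim_action_for(bool file_backed, bool dirty)
{
    if (file_backed)
        /* clean page cache can simply be dropped; dirty page cache
         * must first be written back to its backing file */
        return dirty ? WRITEBACK_THEN_DROP : DROP;
    /* anonymous pages have no backing file: swap them out */
    return SWAP_OUT;
}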

 

The LRU lists are the foundation of page reclaim, and the kswapd kernel thread is its entry point: one "kswapd%d" kernel thread is created for each NUMA memory node.

The function analysis below follows the memory node --> zone --> LRU active/inactive hierarchy:

balance_pgdat() is the main page-reclaim function; it operates at the level of one memory node.

shrink_zone() scans all reclaimable pages in a zone; it operates at the level of one zone.

Then shrink_active_list() scans the active lists to see which active pages can be migrated to the inactive lists, and shrink_inactive_list() scans the inactive lists and reclaims pages.

Finally, the section on tracking LRU activity explains how the LRU lists learn that a page has been freed and update themselves, followed by the Refault Distance algorithm that optimizes reclaim of file-cache pages.

 

Listing kswapd's core activities shows its basic skeleton; the sections below walk through each step:

kswapd_init---------------------------------------initialization of the kswapd module
kswapd_run--------------------------------------creates the kswapd kernel thread
kswapd----------------------------------------main function of the kswapd kernel thread
kswapd_try_to_sleep-------------------------sleeps and yields the CPU, waiting for wakeup_kswapd() to wake it
balance_pgdat-------------------------------main page-reclaim function, covering multiple zones
kswapd_shrink_zone------------------------handles scanning and reclaim of a single zone
shrink_zone-----------------------------scans all reclaimable pages in a zone
shrink_lruvec-------------------------core function for scanning the LRU lists
shrink_list-------------------------dispatches on the individual LRU lists
shrink_active_list----------------checks which active pages can be moved to the inactive lists
isolate_lru_pages---------------isolates pages from an LRU list
shrink_inactive_list--------------scans the inactive LRU lists, tries to reclaim pages, and returns the number reclaimed
shrink_page_list----------------scans the pages on page_list and returns the number reclaimed
shrink_slab---------------------------invokes the shrinker interfaces registered with the memory-management system
pgdat_balanced----------------------------checks whether a memory node is balanced, i.e., above the high watermark
zone_balanced---------------------------checks whether a zone within the node is balanced

1. LRU Lists

LRU (Least Recently Used) means exactly that: the kernel assumes a page that has not been used recently will not be used frequently in the near future either.

When memory runs low, such pages become the preferred candidates for eviction.

1.1 LRU Lists

Each LRU list is a doubly linked list. Based on page type (anonymous vs. file) and activity (active vs. inactive), the kernel maintains five kinds of LRU lists:

#define LRU_BASE 0
#define LRU_ACTIVE 1
#define LRU_FILE 2

enum lru_list {
    LRU_INACTIVE_ANON = LRU_BASE,--------------------------inactive anonymous list; reclaimable only with a swap area
    LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,---------------active anonymous list
    LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,---------------inactive file-backed list; reclaimed first
    LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,----active file-backed list
    LRU_UNEVICTABLE,---------------------------------------unevictable list; never swapped out
    NR_LRU_LISTS
};


struct lruvec {
    struct list_head lists[NR_LRU_LISTS];
    struct zone_reclaim_stat reclaim_stat;
#ifdef CONFIG_MEMCG
    struct zone *zone;
#endif
};


struct zone {
...
    /* Fields commonly accessed by the page reclaim scanner */
    spinlock_t        lru_lock;
    struct lruvec lruvec;
...
}

All LRU lists can be reached from the zone through this member, so page reclaim is performed per zone; from Linux v4.8 onwards the LRU lists are maintained per node instead.

How do the LRU lists age pages?

The usual API for adding a page to the LRU lists is lru_cache_add().

lru_cache_add()-->__lru_cache_add()-->

 
 

/* 14 pointers + two long's align the pagevec structure to a power of two */
#define PAGEVEC_SIZE 14

struct pagevec {
    unsigned long nr;
    unsigned long cold;
    struct page *pages[PAGEVEC_SIZE];-------batch-processes 14 pages at a time
};



static void __lru_cache_add(struct page *page)
{
    struct pagevec *pvec = &get_cpu_var(lru_add_pvec);

    page_cache_get(page);
    if (!pagevec_space(pvec))-------------if the pagevec has no space left, first flush its pages onto the LRU lists via __pagevec_lru_add()
        __pagevec_lru_add(pvec);
    pagevec_add(pvec, page);--------------add the page to the struct pagevec
    put_cpu_var(lru_add_pvec);
}

 

 

void __pagevec_lru_add(struct pagevec *pvec)
{
    pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
}


static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
                 void *arg)
{
    int file = page_is_file_cache(page);
    int active = PageActive(page);
    enum lru_list lru = page_lru(page);-------------------------------------------determine the page's LRU type

    VM_BUG_ON_PAGE(PageLRU(page), page);

    SetPageLRU(page);
    add_page_to_lru_list(page, lruvec, lru);
    update_page_reclaim_stat(lruvec, file, active);
    trace_mm_lru_insertion(page, lru);
}
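
For reference, the page_lru() used above maps the page flags onto a list index; in kernels of this era (include/linux/mm_inline.h) it looks roughly like this:

static inline enum lru_list page_lru_base_type(struct page *page)
{
    if (page_is_file_cache(page))
        return LRU_INACTIVE_FILE;
    return LRU_INACTIVE_ANON;
}

static __always_inline enum lru_list page_lru(struct page *page)
{
    enum lru_list lru;

    if (PageUnevictable(page))
        lru = LRU_UNEVICTABLE;
    else {
        lru = page_lru_base_type(page);
        if (PageActive(page))
            lru += LRU_ACTIVE;  /* the enum arithmetic shown above */
    }
    return lru;
}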

 

add_page_to_lru_list() uses the computed lru index to add the page to lruvec->lists[lru].

lru_to_page() combined with list_del() removes a page from an LRU list. lru_to_page() takes the page at the tail of the list: the LRU list works FIFO, so the page that entered the list earliest has been on it longest and has aged the most.

The least used pages gradually drift toward the tail of the inactive lists and become the most suitable reclaim candidates.

 

lru_cache_add: adds a page to the LRU lists.

lru_to_page: fetches the page at the tail of an LRU list.

list_del: removes a page from an LRU list.
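
Put together, removing the oldest page looks like this (a sketch; in real code the caller must hold zone->lru_lock):

/* Sketch: pop the oldest page off the inactive file list. */
struct page *page = lru_to_page(&lruvec->lists[LRU_INACTIVE_FILE]);
list_del(&page->lru);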

(A figure illustrating these operations is missing here.)

1.2 The Second-Chance Algorithm

The second-chance algorithm exists to avoid evicting frequently used pages.

When picking a page to evict it behaves just like LRU: it selects the page that entered the list earliest, i.e., the page at the tail.

Second chance adds an access (reference) bit: if the bit is 0 the page is evicted; if it is 1 the page gets a second chance and the next page is considered instead.

A page given a second chance has its access bit cleared to 0; if the page is accessed again in the meantime, the bit is set back to 1.
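
In textbook form the algorithm looks like the sketch below (the helper test_and_clear_referenced() is made up for illustration and is not a kernel API):

/* Illustrative second-chance eviction loop. */
struct page *pick_victim(struct list_head *lru)
{
    for (;;) {
        struct page *page = lru_to_page(lru);     /* oldest page, at the tail */
        if (!test_and_clear_referenced(page))     /* made-up helper */
            return page;                          /* access bit was 0: evict */
        /* access bit was 1: it is now cleared, giving a second chance */
        list_move(&page->lru, lru);               /* rotate to the list head */
    }
}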

 

The Linux kernel implements the second-chance algorithm with two flags: PG_active and PG_referenced.

PG_active indicates the page is on an active LRU list; PG_referenced indicates whether the page has been referenced recently.

1.3 mark_page_accessed()

As the comment at the top of the function shows, PG_active and PG_referenced combine into three transition cases.

/*
 * Mark a page as having seen activity.
 *
 * inactive,unreferenced    ->    inactive,referenced-----1
 * inactive,referenced        ->    active,unreferenced---2
 * active,unreferenced        ->    active,referenced-----3
 *
 * When a newly allocated page is not yet visible, so safe for non-atomic ops,
 * __SetPageReferenced(page) may be substituted for mark_page_accessed(page).
 */
void mark_page_accessed(struct page *page)
{
    if (!PageActive(page) && !PageUnevictable(page) &&
            PageReferenced(page)) {-----------------------the inactive,referenced case: set active,unreferenced; this is case 2
        /*
         * If the page is on the LRU, queue it for activation via
         * activate_page_pvecs. Otherwise, assume the page is on a
         * pagevec, mark it active and it'll be moved to the active
         * LRU on the next drain.
         */
        if (PageLRU(page))
            activate_page(page);
        else
            __lru_cache_activate_page(page);
        ClearPageReferenced(page);
        if (page_is_file_cache(page))
            workingset_activation(page);
    } else if (!PageReferenced(page)) {--------------------the inactive,unreferenced and active,unreferenced cases: set inactive/active,referenced; these are cases 1 and 3
        SetPageReferenced(page);
    }
}

 

 

1.4 page_check_references()

page_check_references() is called while scanning the inactive LRU lists; it returns a value of the page_references enum.

enum page_references {
    PAGEREF_RECLAIM,-------------------the page may be reclaimed
    PAGEREF_RECLAIM_CLEAN,-------------the page is clean and may be reclaimed without writeback
    PAGEREF_KEEP,----------------------the page stays on the inactive list
    PAGEREF_ACTIVATE,------------------the page is migrated to the active list
};

static enum page_references page_check_references(struct page *page,
                          struct scan_control *sc)
{
    int referenced_ptes, referenced_page;
    unsigned long vm_flags;

    referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
                      &vm_flags);-------------------------------------------------count how many ptes reference this page, found via an rmap walk.
    referenced_page = TestClearPageReferenced(page);------------------------------was PG_referenced set? If so, the page gets a second chance.
    /*
     * Mlock lost the isolation race with us.  Let try_to_unmap()
     * move the page to the unevictable list.
     */
    if (vm_flags & VM_LOCKED)
        return PAGEREF_RECLAIM;

    if (referenced_ptes) {---------------------------------------------------------the page is referenced by ptes
        if (PageSwapBacked(page))
            return PAGEREF_ACTIVATE;-----------------------------------------------anonymous page: move to the active list
        /*
         * All mapped pages start out with page table
         * references from the instantiating fault, so we need
         * to look twice if a mapped file page is used more
         * than once.
         *
         * Mark it and spare it for another trip around the
         * inactive list.  Another page table reference will
         * lead to its activation.
         *
         * Note: the mark is set for activated pages as well
         * so that recently deactivated but used pages are
         * quickly recovered.
         */
        SetPageReferenced(page);

        if (referenced_page || referenced_ptes > 1)
            return PAGEREF_ACTIVATE;--------------------------------------------page cache referenced a second time recently, or shared page cache: move to the active list.

        /*
         * Activate file-backed executable pages after first usage.
         */
        if (vm_flags & VM_EXEC)
            return PAGEREF_ACTIVATE;--------------------------------------------page cache of an executable file: move to the active list.

        return PAGEREF_KEEP;----------------------------------------------------keep on the inactive list.
    }

    /* Reclaim if clean, defer dirty pages to writeback */
    if (referenced_page && !PageSwapBacked(page))-------------------------------page cache seen a second time but clean: it can be freed.
        return PAGEREF_RECLAIM_CLEAN;

    return PAGEREF_RECLAIM;-----------------------------------------------------the page is not referenced by any pte: it can be freed
}

 

  

1.5 page_referenced()

page_referenced() checks whether a page has been referenced and returns the number of referencing ptes, i.e., the number of user-space virtual pages that map and have accessed this page.

The core idea is to use the reverse-mapping (RMAP) system to count the ptes that reference the page.

page_referenced() mainly does the following:

  • Use the RMAP system to walk every pte that maps the page.
  • For each pte, if the L_PTE_YOUNG bit is set, the page was accessed, so increment the referenced count and clear L_PTE_YOUNG. On ARM32 the hardware page-table entry is also cleared, deliberately provoking a page fault; when the pte is accessed again, the fault handler sets L_PTE_YOUNG once more.
  • Return the referenced count, i.e., how many ptes referenced the page.

 

/**
 * page_referenced - test if the page was referenced
 * @page: the page to test
 * @is_locked: caller holds lock on the page
 * @memcg: target memory cgroup
 * @vm_flags: collect encountered vma->vm_flags who actually referenced the page
 *
 * Quick test_and_clear_referenced for all mappings to a page,
 * returns the number of ptes which referenced the page.
 */
int page_referenced(struct page *page,
            int is_locked,
            struct mem_cgroup *memcg,
            unsigned long *vm_flags)
{
    int ret;
    int we_locked = 0;
    struct page_referenced_arg pra = {
        .mapcount = page_mapcount(page),
        .memcg = memcg,
    };
    struct rmap_walk_control rwc = {
        .rmap_one = page_referenced_one,
        .arg = (void *)&pra,
        .anon_lock = page_lock_anon_vma_read,
    };

    *vm_flags = 0;
    if (!page_mapped(page))----------------------------------check whether the page->_mapcount reference count is >= 0.
        return 0;

    if (!page_rmapping(page))--------------------------------check whether page->mapping points to an address space.
        return 0;

    if (!is_locked && (!PageAnon(page) || PageKsm(page))) {
        we_locked = trylock_page(page);
        if (!we_locked)
            return 1;
    }

    /*
     * If we are reclaiming on behalf of a cgroup, skip
     * counting on behalf of references from different
     * cgroups
     */
    if (memcg) {
        rwc.invalid_vma = invalid_page_referenced_vma;
    }

    ret = rmap_walk(page, &rwc);---------------------------walk every pte that maps the page, calling the rmap_walk_control member rmap_one on each.
    *vm_flags = pra.vm_flags;

    if (we_locked)
        unlock_page(page);

    return pra.referenced;
}

page_referenced() calls page_referenced_one() to update the referenced and mapcount counters.

static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
            unsigned long address, void *arg)
{
    struct mm_struct *mm = vma->vm_mm;
    spinlock_t *ptl;
    int referenced = 0;
    struct page_referenced_arg *pra = arg;

    if (unlikely(PageTransHuge(page))) {
...
    } else {
        pte_t *pte;

        /*
         * rmap might return false positives; we must filter
         * these out using page_check_address().
         */
        pte = page_check_address(page, mm, address, &ptl, 0);--------------look up the pte from mm and address
        if (!pte)
            return SWAP_AGAIN;

        if (vma->vm_flags & VM_LOCKED) {
            pte_unmap_unlock(pte, ptl);
            pra->vm_flags |= VM_LOCKED;
            return SWAP_FAIL; /* To break the loop */
        }

        if (ptep_clear_flush_young_notify(vma, address, pte)) {-----------the pte has been accessed recently
            /*
             * Don't treat a reference through a sequentially read
             * mapping as such.  If the page has been used in
             * another mapping, we will catch it; if this other
             * mapping is already gone, the unmap path will have
             * set PG_referenced or activated the page.
             */
            if (likely(!(vma->vm_flags & VM_SEQ_READ)))-------------------sequentially read page cache is the best reclaim candidate; every other case counts as a pte reference and increments the count.
                referenced++;
        }
        pte_unmap_unlock(pte, ptl);
    }

    if (referenced) {
        pra->referenced++;-------------------------------------------------increment pra->referenced
        pra->vm_flags |= vma->vm_flags;
    }

    pra->mapcount--;-------------------------------------------------------decrement pra->mapcount
    if (!pra->mapcount)
        return SWAP_SUCCESS; /* To break the loop */

    return SWAP_AGAIN;
}

  

2. The kswapd Kernel Thread

kswapd is responsible for reclaiming pages when memory runs low; at initialization, one kernel thread named "kswapd%d" is created for each NUMA memory node in the system.

2.1 The kswapd_wait Wait Queue

setup_arch()-->paging_init()-->bootmem_init()-->zone_sizes_init()-->free_area_init_node()-->free_area_init_core(): the kswapd_wait queue is initialized in free_area_init_core(), one per memory node.

A wait queue lets a process wait for an event without polling: the process sleeps while waiting, and the kernel wakes it automatically when the event occurs.

The kswapd kernel thread sleeps in TASK_INTERRUPTIBLE state on the kswapd_wait queue.
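
The generic wait-queue pattern is the standard idiom below (wq and condition stand for the caller's wait_queue_head_t and wake-up condition); kswapd_try_to_sleep() further down is a concrete instance of it:

DEFINE_WAIT(wait);

prepare_to_wait(&wq, &wait, TASK_INTERRUPTIBLE); /* enqueue; mark ourselves sleeping */
if (!condition)
    schedule();                                  /* sleep until wake_up(&wq) */
finish_wait(&wq, &wait);                         /* dequeue; back to TASK_RUNNING */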

static void __paginginit free_area_init_core(struct pglist_data *pgdat,
        unsigned long node_start_pfn, unsigned long node_end_pfn,
        unsigned long *zones_size, unsigned long *zholes_size)
{
...
    init_waitqueue_head(&pgdat->kswapd_wait);
    init_waitqueue_head(&pgdat->pfmemalloc_wait);
    pgdat_page_ext_init(pgdat);

...
}

  

 

2.2 Creating the kswapd Kernel Thread

The kswapd kernel thread reclaims pages when memory is low; one is set up per NUMA memory node.

The kswapd() function is the thread's entry point.

static int __init kswapd_init(void)
{
    int nid;

    swap_setup();
    for_each_node_state(nid, N_MEMORY)-----------------------------------------create one kswapd kernel thread per memory node
         kswapd_run(nid);
    hotcpu_notifier(cpu_callback, 0);
    return 0;
}

int kswapd_run(int nid)
{
    pg_data_t *pgdat = NODE_DATA(nid);-----------------------------------------get the node's pg_data_t pointer
    int ret = 0;

    if (pgdat->kswapd)
        return 0;

    pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);---------------start the kswapd() function, passing pgdat as its argument.
    if (IS_ERR(pgdat->kswapd)) {
        /* failure at boot is fatal */
        BUG_ON(system_state == SYSTEM_BOOTING);
        pr_err("Failed to start kswapd on node %d\n", nid);
        ret = PTR_ERR(pgdat->kswapd);
        pgdat->kswapd = NULL;
    }
    return ret;
}

static int kswapd(void *p)
{
    unsigned long order, new_order;
    unsigned balanced_order;
    int classzone_idx, new_classzone_idx;
    int balanced_classzone_idx;
    pg_data_t *pgdat = (pg_data_t*)p;-----------------------------------------the node's pg_data_t, passed in from kswapd_run().
    struct task_struct *tsk = current;

    struct reclaim_state reclaim_state = {
        .reclaimed_slab = 0,
    };
    const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);

    lockdep_set_current_reclaim_state(GFP_KERNEL);

    if (!cpumask_empty(cpumask))
        set_cpus_allowed_ptr(tsk, cpumask);
    current->reclaim_state = &reclaim_state;

    /*
     * Tell the memory management that we're a "memory allocator",
     * and that if we need more memory we should get access to it
     * regardless (see "__alloc_pages()"). "kswapd" should
     * never get caught in the normal page freeing logic.
     *
     * (Kswapd normally doesn't need memory anyway, but sometimes
     * you need a small amount of memory in order to be able to
     * page out something else, and this flag essentially protects
     * us from recursively trying to free more memory as we're
     * trying to free the first piece of memory in the first place).
     */
    tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
    set_freezable();

    order = new_order = 0;
    balanced_order = 0;
    classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
    balanced_classzone_idx = classzone_idx;
    for ( ; ; ) {
        bool ret;

        /*
         * If the last balance_pgdat was unsuccessful it's unlikely a
         * new request of a similar or harder type will succeed soon
         * so consider going to sleep on the basis we reclaimed at
         */
        if (balanced_classzone_idx >= new_classzone_idx &&
                    balanced_order == new_order) {
            new_order = pgdat->kswapd_max_order;
            new_classzone_idx = pgdat->classzone_idx;
            pgdat->kswapd_max_order =  0;
            pgdat->classzone_idx = pgdat->nr_zones - 1;
        }

        if (order < new_order || classzone_idx > new_classzone_idx) {
            /*
             * Don't sleep if someone wants a larger 'order'
             * allocation or has tigher zone constraints
             */
            order = new_order;
            classzone_idx = new_classzone_idx;
        } else {
            kswapd_try_to_sleep(pgdat, balanced_order,---------------------------sleep here until woken by wakeup_kswapd().
                        balanced_classzone_idx);
            order = pgdat->kswapd_max_order;
            classzone_idx = pgdat->classzone_idx;--------------------------------pgdat->kswapd_max_order and pgdat->classzone_idx were already updated in wakeup_kswapd().
            new_order = order;
            new_classzone_idx = classzone_idx;
            pgdat->kswapd_max_order = 0;
            pgdat->classzone_idx = pgdat->nr_zones - 1;
        }

        ret = try_to_freeze();
        if (kthread_should_stop())
            break;

        /*
         * We can speed up thawing tasks if we don't call balance_pgdat
         * after returning from the refrigerator
         */
        if (!ret) {
            trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
            balanced_classzone_idx = classzone_idx;
            balanced_order = balance_pgdat(pgdat, order,------------------------the main page-reclaim function.
                        &balanced_classzone_idx);
        }
    }

    tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
    current->reclaim_state = NULL;
    lockdep_clear_current_reclaim_state();

    return 0;
}

 

 

static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
{
    long remaining = 0;
    DEFINE_WAIT(wait);

    if (freezing(current) || kthread_should_stop())
        return;

    prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);----------------------------------queue a wait on kswapd_wait and set the task state to TASK_INTERRUPTIBLE.

    /* Try to sleep for a short interval */
    if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {-------------------------------remaining is 0 here; check whether kswapd is ready to sleep.
        remaining = schedule_timeout(HZ/10);----------------------------------------------------------try a short 100 ms sleep; a nonzero return means we were woken before the 100 ms expired.
        finish_wait(&pgdat->kswapd_wait, &wait);
        prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
    }

    /*
     * After a short sleep, check if it was a premature sleep. If not, then
     * go fully to sleep until explicitly woken up.
     */
    if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {-------------------------------if the short sleep was interrupted by a wakeup there is no point in sleeping longer; if it was not interrupted, go fully to sleep.
        trace_mm_vmscan_kswapd_sleep(pgdat->node_id);

        /*
         * vmstat counters are not perfectly accurate and the estimated
         * value for counters such as NR_FREE_PAGES can deviate from the
         * true value by nr_online_cpus * threshold. To avoid the zone
         * watermarks being breached while under pressure, we reduce the
         * per-cpu vmstat threshold while kswapd is awake and restore
         * them before going back to sleep.
         */
        set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);

        /*
         * Compaction records what page blocks it recently failed to
         * isolate pages from and skips them in the future scanning.
         * When kswapd is going to sleep, it is reasonable to assume
         * that pages and compaction may succeed so reset the cache.
         */
        reset_isolation_suitable(pgdat);

        if (!kthread_should_stop())
            schedule();------------------------------------------------------------------------------yield the CPU.

        set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
    } else {
        if (remaining)
            count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
        else
            count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
    }
    finish_wait(&pgdat->kswapd_wait, &wait);---------------------------------------------------------set the task state back to TASK_RUNNING.
}

   

2.3 Waking kswapd to Reclaim Pages

Memory reclaim is triggered when, on the allocation path, an allocation fails because a zone is below its low watermark. wakeup_kswapd() is then called to wake the kswapd kernel thread to reclaim pages and free memory.

On NUMA systems, pg_data_t describes the physical-memory layout; its kswapd-related fields are:

typedef struct pglist_data {
...
    wait_queue_head_t kswapd_wait;----------------------------wait queue
    wait_queue_head_t pfmemalloc_wait;
    struct task_struct *kswapd;    /* Protected by
                       mem_hotplug_begin/end() */
    int kswapd_max_order;-------------------------------------largest allocation order requested by a waker
    enum zone_type classzone_idx;-----------------------------index of the most suitable zone for the allocation
...
} pg_data_t;

 

The two key fields are kswapd_max_order and classzone_idx; kswapd reads and uses them after it wakes up.

alloc_page()-->__alloc_pages_nodemask()-->__alloc_pages_slowpath()-->wake_all_kswapds()-->wakeup_kswapd().

void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
{
    pg_data_t *pgdat;

    if (!populated_zone(zone))
        return;

    if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
        return;
    pgdat = zone->zone_pgdat;
    if (pgdat->kswapd_max_order < order) {
        pgdat->kswapd_max_order = order;
        pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);-------------------------stash the node's classzone_idx and kswapd_max_order parameters for kswapd.
    }
    if (!waitqueue_active(&pgdat->kswapd_wait))
        return;
    if (zone_balanced(zone, order, 0, 0))
        return;

    trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
    wake_up_interruptible(&pgdat->kswapd_wait);---------------------------------------------------wake the TASK_INTERRUPTIBLE threads waiting on the kswapd_wait queue.
}

 

3. The balance_pgdat Function

balance_pgdat() is the main page-reclaim function. It is one big loop: it first searches from the highest zone downward for the first unbalanced zone (end_zone); it then reclaims from the lowest zone up to end_zone; within the loop it checks whether the zones from the lowest up to classzone_idx are balanced, raising the scan intensity each round.

static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
                            int *classzone_idx)
{
    int i;
    int end_zone = 0;    /* Inclusive.  0 = ZONE_DMA */
    unsigned long nr_soft_reclaimed;
    unsigned long nr_soft_scanned;
    struct scan_control sc = {
        .gfp_mask = GFP_KERNEL,
        .order = order,
        .priority = DEF_PRIORITY,-------------------------------------------------------------------initial scan priority; each pass scans total_size >> priority pages.
        .may_writepage = !laptop_mode,
        .may_unmap = 1,
        .may_swap = 1,
    };
    count_vm_event(PAGEOUTRUN);

    do {
        unsigned long nr_attempted = 0;
        bool raise_priority = true;
        bool pgdat_needs_compaction = (order > 0);

        sc.nr_reclaimed = 0;

        /*
         * Scan in the highmem->dma direction for the highest
         * zone which needs scanning
         */
        for (i = pgdat->nr_zones - 1; i >= 0; i--) {-----------------------------------------------search from ZONE_HIGHMEM toward ZONE_NORMAL for the first unbalanced zone, i.e., the first zone whose free pages are below the WMARK_HIGH watermark; record it as end_zone.
            struct zone *zone = pgdat->node_zones + i;

            if (!populated_zone(zone))
                continue;

            if (sc.priority != DEF_PRIORITY &&
                !zone_reclaimable(zone))
                continue;

            /*
             * Do some background aging of the anon list, to give
             * pages a chance to be referenced before reclaiming.
             */
            age_active_anon(zone, &sc);

            /*
             * If the number of buffer_heads in the machine
             * exceeds the maximum allowed level and this node
             * has a highmem zone, force kswapd to reclaim from
             * it to relieve lowmem pressure.
             */
            if (buffer_heads_over_limit && is_highmem_idx(i)) {
                end_zone = i;
                break;
            }

            if (!zone_balanced(zone, order, 0, 0)) {---------------------------------------------is this zone balanced? If not, record it in end_zone and break out of the for loop.
                end_zone = i;
                break;
            } else {
                /*
                 * If balanced, clear the dirty and congested
                 * flags
                 */
                clear_bit(ZONE_CONGESTED, &zone->flags);
                clear_bit(ZONE_DIRTY, &zone->flags);
            }
        }

        if (i < 0)
            goto out;

        for (i = 0; i <= end_zone; i++) {---------------------------------------------------------walk from the lowest zone up to end_zone in preparation for page reclaim.
            struct zone *zone = pgdat->node_zones + i;

            if (!populated_zone(zone))
                continue;

            /*
             * If any zone is currently balanced then kswapd will
             * not call compaction as it is expected that the
             * necessary pages are already available.
             */
            if (pgdat_needs_compaction &&
                    zone_watermark_ok(zone, order,
                        low_wmark_pages(zone),
                        *classzone_idx, 0))-------------------------------------------------------with order > 0, pgdat_needs_compaction starts out true; if this zone is above its WMARK_LOW watermark, no compaction is needed.
                pgdat_needs_compaction = false;
        }

        /*
         * If we're getting trouble reclaiming, start doing writepage
         * even in laptop mode.
         */
        if (sc.priority < DEF_PRIORITY - 2)
            sc.may_writepage = 1;

        /*
         * Now scan the zone in the dma->highmem direction, stopping
         * at the last zone which needs scanning.
         *
         * We do this because the page allocator works in the opposite
         * direction.  This prevents the page allocator from allocating
         * pages behind kswapd's direction of progress, which would
         * cause too much scanning of the lower zones.
         */
        for (i = 0; i <= end_zone; i++) {---------------------------------------------------------reclaim memory from the lowest zone up to end_zone.
            struct zone *zone = pgdat->node_zones + i;

            if (!populated_zone(zone))
                continue;

            if (sc.priority != DEF_PRIORITY &&
                !zone_reclaimable(zone))
                continue;

            sc.nr_scanned = 0;

            nr_soft_scanned = 0;
            /*
             * Call soft limit reclaim before calling shrink_zone.
             */
            nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
                            order, sc.gfp_mask,
                            &nr_soft_scanned);
            sc.nr_reclaimed += nr_soft_reclaimed;

            /*
             * There should be no need to raise the scanning
             * priority if enough pages are already being scanned
             * that that high watermark would be met at 100%
             * efficiency.
             */
            if (kswapd_shrink_zone(zone, end_zone,------------------------------------------------the real scan-and-reclaim function; its parameters and results live in struct scan_control. Returns true when the needed pages were reclaimed, so the scan priority need not be raised.
                        &sc, &nr_attempted))
                raise_priority = false;
        }

        /*
         * If the low watermark is met there is no need for processes
         * to be throttled on pfmemalloc_wait as they should not be
         * able to safely make forward progress. Wake them
         */
        if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
                pfmemalloc_watermark_ok(pgdat))
            wake_up_all(&pgdat->pfmemalloc_wait);

        /*
         * Fragmentation may mean that the system cannot be rebalanced
         * for high-order allocations in all zones. If twice the
         * allocation size has been reclaimed and the zones are still
         * not balanced then recheck the watermarks at order-0 to
         * prevent kswapd reclaiming excessively. Assume that a
         * process requested a high-order can direct reclaim/compact.
         */
        if (order && sc.nr_reclaimed >= 2UL << order)------------------------------------------if order is nonzero and sc.nr_reclaimed, the number of pages reclaimed so far, has reached twice the allocation size (2UL << order),
            order = sc.order = 0;--------------------------------------------------------------drop order back to 0 to keep kswapd from reclaiming pages too aggressively.

        /* Check if kswapd should be suspending */
        if (try_to_freeze() || kthread_should_stop())------------------------------------------check whether kswapd should freeze or stop; if so, exit the loop.
            break;

        /*
         * Compact if necessary and kswapd is reclaiming at least the
         * high watermark number of pages as requsted
         */
        if (pgdat_needs_compaction && sc.nr_reclaimed > nr_attempted)-------------------------decide whether memory compaction is needed to reduce fragmentation.
            compact_pgdat(pgdat, order);------------------------------------------------------see the memory-compaction chapter for compact_pgdat().

        /*
         * Raise priority if scanning rate is too low or there was no
         * progress in reclaiming pages
         */
        if (raise_priority || !sc.nr_reclaimed)
            sc.priority--;--------------------------------------------------------------------each pass scans total_size >> priority pages, so lowering priority scans progressively more. raise_priority is set to false only when kswapd_shrink_zone() actually reclaimed pages.
    } while (sc.priority >= 1 &&
         !pgdat_balanced(pgdat, order, *classzone_idx));

out:
    /*
     * Return the order we were reclaiming at so prepare_kswapd_sleep()
     * makes a decision on the order we were last reclaiming at. However,
     * if another caller entered the allocator slow path while kswapd
     * was awake, order will remain at the higher level
     */
    *classzone_idx = end_zone;
    return order;
}

pgdat_balanced() checks whether the physical pages of a memory node are balanced, walking the zones from the lowest up to classzone_idx.

classzone_idx is passed down from wake_all_kswapds().

static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
{
    unsigned long managed_pages = 0;
    unsigned long balanced_pages = 0;
    int i;

    /* Check the watermark levels */
    for (i = 0; i <= classzone_idx; i++) {-----------------------------------------------walk the zones from low to high
        struct zone *zone = pgdat->node_zones + i;

        if (!populated_zone(zone))
            continue;

        managed_pages += zone->managed_pages;

        /*
         * A special case here:
         *
         * balance_pgdat() skips over all_unreclaimable after
         * DEF_PRIORITY. Effectively, it considers them balanced so
         * they must be considered balanced here as well!
         */
        if (!zone_reclaimable(zone)) {
            balanced_pages += zone->managed_pages;
            continue;
        }

        if (zone_balanced(zone, order, 0, i))------------------------------------------if this zone's free pages are above the WMARK_HIGH watermark, all of its managed pages count as balanced_pages.
            balanced_pages += zone->managed_pages;
        else if (!order)---------------------------------------------------------------for an order-0 (single page) request, one zone below WMARK_HIGH makes the whole node unbalanced.
            return false;
    }

    if (order)
        return balanced_pages >= (managed_pages >> 2);---------------------------------for order > 0, the node counts as balanced once balanced_pages across the zones from the lowest up to classzone_idx reach 25% of managed_pages.
    else
        return true;-------------------------------------------------------------------every zone checked is balanced, so for an order-0 request the node is balanced.
}

zone_balanced() checks whether the zone's free pages would still be above the WMARK_HIGH watermark after allocating 2^order pages; returning true means the zone stays above WMARK_HIGH.

 

static bool zone_balanced(struct zone *zone, int order,
              unsigned long balance_gap, int classzone_idx)
{
    if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
                    balance_gap, classzone_idx, 0))
        return false;

    if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,
                order, 0, classzone_idx) == COMPACT_SKIPPED)
        return false;

    return true;
}

 

3.1 Allocation Path versus Reclaim Path

Why does page reclaim walk zones from ZONE_NORMAL toward end_zone?

Because the buddy allocator hands out pages in the ZONE_HIGHMEM-to-ZONE_NORMAL direction; page reclaim runs exactly the other way.

This helps reduce lock contention and improves efficiency, since page allocation and page reclaim may contend on the zone->lru_lock.

 

4. The shrink_zone Function

Before invoking shrink_zone(), kswapd_shrink_zone() runs a few checks to confirm that page reclaim is really necessary.

static bool kswapd_shrink_zone(struct zone *zone,
                   int classzone_idx,
                   struct scan_control *sc,
                   unsigned long *nr_attempted)
{
    int testorder = sc->order;
    unsigned long balance_gap;
    bool lowmem_pressure;

    /* Reclaim above the high watermark. */
    sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));-----------------------compute the maximum number of pages to reclaim in one round

    /*
     * Kswapd reclaims only single pages with compaction enabled. Trying
     * too hard to reclaim until contiguous free pages have become
     * available can hurt performance by evicting too much useful data
     * from memory. Do not reclaim more than needed for compaction.
     */
    if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
            compaction_suitable(zone, sc->order, 0, classzone_idx)
                            != COMPACT_SKIPPED)
        testorder = 0;

    /*
     * We put equal pressure on every zone, unless one zone has way too
     * many pages free already. The "too many pages" is defined as the
     * high wmark plus a "gap" where the gap is either the low
     * watermark or 1% of the zone, whichever is smaller.
     */
    balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
            zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));----------------------------balance_gap raises the bar for counting the watermark as balanced

    /*
     * If there is no low memory pressure or the zone is balanced then no
     * reclaim is necessary
     */
    lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
    if (!lowmem_pressure && zone_balanced(zone, testorder,
                        balance_gap, classzone_idx))----------------------------------------check whether the current watermark is above WMARK_HIGH + balance_gap; if so, return true right away, since no shrink_zone() reclaim is needed.
        return true;

    shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);

    /* Account for the number of pages attempted to reclaim */
    *nr_attempted += sc->nr_to_reclaim;

    clear_bit(ZONE_WRITEBACK, &zone->flags);

    /*
     * If a zone reaches its high watermark, consider it to be no longer
     * congested. It's possible there are dirty pages backed by congested
     * BDIs but as pressure is relieved, speculatively avoid congestion
     * waits.
     */
    if (zone_reclaimable(zone) &&
        zone_balanced(zone, testorder, 0, classzone_idx)) {
        clear_bit(ZONE_CONGESTED, &zone->flags);
        clear_bit(ZONE_DIRTY, &zone->flags);
    }

    return sc->nr_scanned >= sc->nr_to_reclaim;---------------------------------------------scanning at least as many pages as we wanted to reclaim means enough pages were scanned.
}

shrink_zone() scans all reclaimable pages in a zone. Its parameters:

static bool shrink_zone(struct zone *zone, struct scan_control *sc, bool is_classzone)

zone is the zone about to be scanned, sc holds the scan-control parameters, and is_classzone says whether this zone is the first unbalanced zone that balance_pgdat() identified at the start.

static bool shrink_zone(struct zone *zone, struct scan_control *sc,
            bool is_classzone)
{
    struct reclaim_state *reclaim_state = current->reclaim_state;
    unsigned long nr_reclaimed, nr_scanned;
    bool reclaimable = false;

    do {
        struct mem_cgroup *root = sc->target_mem_cgroup;
        struct mem_cgroup_reclaim_cookie reclaim = {
            .zone = zone,
            .priority = sc->priority,
        };
        unsigned long zone_lru_pages = 0;
        struct mem_cgroup *memcg;

        nr_reclaimed = sc->nr_reclaimed;
        nr_scanned = sc->nr_scanned;

        memcg = mem_cgroup_iter(root, NULL, &reclaim);
        do {
            unsigned long lru_pages;
            unsigned long scanned;
            struct lruvec *lruvec;
            int swappiness;

            if (mem_cgroup_low(root, memcg)) {
                if (!sc->may_thrash)
                    continue;
                mem_cgroup_events(memcg, MEMCG_LOW, 1);
            }

            lruvec = mem_cgroup_zone_lruvec(zone, memcg);---------------------------get the LRU list set (lruvec) of the current memory cgroup for this zone,
            swappiness = mem_cgroup_swappiness(memcg);------------------------------and the vm_swappiness parameter, which controls how aggressively swap is used.
            scanned = sc->nr_scanned;

            shrink_lruvec(lruvec, swappiness, sc, &lru_pages);----------------------core function: scans the LRU lists and reclaims pages.
            zone_lru_pages += lru_pages;

            if (memcg && is_classzone)
                shrink_slab(sc->gfp_mask, zone_to_nid(zone),------------------------call the shrinker interfaces; many subsystems register shrinkers to give back memory.
                        memcg, sc->nr_scanned - scanned,
                        lru_pages);

            /*
             * Direct reclaim and kswapd have to scan all memory
             * cgroups to fulfill the overall scan target for the
             * zone.
             *
             * Limit reclaim, on the other hand, only cares about
             * nr_to_reclaim pages to be reclaimed and it will
             * retry with decreasing priority if one round over the
             * whole hierarchy is not sufficient.
             */
            if (!global_reclaim(sc) &&
                    sc->nr_reclaimed >= sc->nr_to_reclaim) {
                mem_cgroup_iter_break(root, memcg);
                break;
            }
        } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));---------------------iterate over the memory cgroups.

        /*
         * Shrink the slab caches in the same proportion that
         * the eligible LRU pages were scanned.
         */
        if (global_reclaim(sc) && is_classzone)
            shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
                    sc->nr_scanned - nr_scanned,
                    zone_lru_pages);

        if (reclaim_state) {
            sc->nr_reclaimed += reclaim_state->reclaimed_slab;
            reclaim_state->reclaimed_slab = 0;
        }

        vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
               sc->nr_scanned - nr_scanned,
               sc->nr_reclaimed - nr_reclaimed);

        if (sc->nr_reclaimed - nr_reclaimed)
            reclaimable = true;

    } while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
                     sc->nr_scanned - nr_scanned, sc));----------------------------------use this round's reclaimed and scanned counts to decide whether scanning should continue.

    return reclaimable;
}

shrink_lruvec() decides which LRU lists to reclaim from and how many pages to take; its two key inputs are swappiness and sc->priority.

It first computes, via get_scan_count(), how many pages of each LRU list should be scanned, then loops over the page types in the LRU; the core is calling shrink_list() to scan each list and reclaim from it.
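
The heart of get_scan_count() is a priority shift; simplified from this era's mm/vmscan.c (a sketch only: the real function further scales the result by swappiness and recent rotation history):

for_each_evictable_lru(lru) {
    unsigned long size = get_lru_size(lruvec, lru);  /* pages on this list */
    unsigned long scan = size >> sc->priority;       /* DEF_PRIORITY == 12: the
                                                      * first pass scans 1/4096
                                                      * of the list; each lower
                                                      * priority doubles that */
    nr[lru] = scan;
}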

static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
              struct scan_control *sc, unsigned long *lru_pages)
{
    unsigned long nr[NR_LRU_LISTS];
    unsigned long targets[NR_LRU_LISTS];
    unsigned long nr_to_scan;
    enum lru_list lru;
    unsigned long nr_reclaimed = 0;
    unsigned long nr_to_reclaim = sc->nr_to_reclaim;
    struct blk_plug plug;
    bool scan_adjusted;

    get_scan_count(lruvec, swappiness, sc, nr, lru_pages);------------------------------computes, from swappiness and sc->priority, how many pages to scan on each of the four LRU lists; the results land in nr[].

    /* Record the original scan target for proportional adjustments later */
    memcpy(targets, nr, sizeof(nr));

    /*
     * Global reclaiming within direct reclaim at DEF_PRIORITY is a normal
     * event that can occur when there is little memory pressure e.g.
     * multiple streaming readers/writers. Hence, we do not abort scanning
     * when the requested number of pages are reclaimed when scanning at
     * DEF_PRIORITY on the assumption that the fact we are direct
     * reclaiming implies that kswapd is not keeping up and it is best to
     * do a batch of work at once. For memcg reclaim one check is made to
     * abort proportional reclaim if either the file or anon lru has already
     * dropped to zero at the first pass.
     */
    scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
             sc->priority == DEF_PRIORITY);

    blk_start_plug(&plug);
    while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
                    nr[LRU_INACTIVE_FILE]) {---------------------------------------------LRU_ACTIVE_ANON is skipped: active anonymous pages cannot be reclaimed directly; they must age onto the inactive anonymous list first.
        unsigned long nr_anon, nr_file, percentage;
        unsigned long nr_scanned;

        for_each_evictable_lru(lru) {----------------------------------------------------walk the evictable LRU lists in turn; shrink_list() handles each specific case.
            if (nr[lru]) {
                nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
                nr[lru] -= nr_to_scan;

                nr_reclaimed += shrink_list(lru, nr_to_scan,
                                lruvec, sc);
            }
        }

        if (nr_reclaimed < nr_to_reclaim || scan_adjusted)-------------------------------fewer pages reclaimed than requested: continue scanning the next LRU list.
            continue;

        /*
         * For kswapd and memcg, reclaim at least the number of pages
         * requested. Ensure that the anon and file LRUs are scanned
         * proportionally what was requested by get_scan_count(). We
         * stop reclaiming one LRU and reduce the amount scanning
         * proportional to the original scan target.
         */
        nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
        nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];

        /*
         * It's just vindictive to attack the larger once the smaller
         * has gone to zero.  And given the way we stop scanning the
         * smaller below, this makes sure that we only make one nudge
         * towards proportionality once we've got nr_to_reclaim.
         */
        if (!nr_file || !nr_anon)--------------------------------------------------------either the anonymous or the file pages have been fully scanned: exit the loop.
            break;
...
    }
...
}

shrink_slab() shrinks the slab caches; it walks the shrinker_list.

Many kernel subsystems register a shrinker: shrinker->count_objects returns how many objects the slab cache could free, and shrinker->scan_objects scans and frees them.
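
For context, a subsystem registers a shrinker roughly as follows (a sketch against the post-v3.12 shrinker interface; the demo_* names are hypothetical):

static unsigned long demo_count_objects(struct shrinker *s,
                                        struct shrink_control *sc)
{
    return demo_nr_cached;                 /* hypothetical: number of freeable objects */
}

static unsigned long demo_scan_objects(struct shrinker *s,
                                       struct shrink_control *sc)
{
    return demo_free(sc->nr_to_scan);      /* hypothetical: frees objects, returns count */
}

static struct shrinker demo_shrinker = {
    .count_objects = demo_count_objects,
    .scan_objects  = demo_scan_objects,
    .seeks         = DEFAULT_SEEKS,
};

/* called once at subsystem init: register_shrinker(&demo_shrinker); */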

 

static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
                 struct mem_cgroup *memcg,
                 unsigned long nr_scanned,
                 unsigned long nr_eligible)
{
    struct shrinker *shrinker;
    unsigned long freed = 0;

    if (memcg && !memcg_kmem_is_active(memcg))
        return 0;
...
    list_for_each_entry(shrinker, &shrinker_list, list) {------------------walk the shrinker_list, taking each shrinker in turn.
        struct shrink_control sc = {
            .gfp_mask = gfp_mask,
            .nid = nid,
            .memcg = memcg,
        };

        if (memcg && !(shrinker->flags & SHRINKER_MEMCG_AWARE))
            continue;

        if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
            sc.nid = 0;

        freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);---shrink the slab cache, with shrink_control and the shrinker as parameters.
    }

    up_read(&shrinker_rwsem);
out:
    cond_resched();
    return freed;
}

shrink_list() below treats the different page types differently, also taking into account whether a swap partition is enabled. When an inactive list is low, the corresponding active list is shrunk.

static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
                 struct lruvec *lruvec, struct scan_control *sc)
{
    if (is_active_lru(lru)) {-----------------------------------------lru is LRU_ACTIVE_ANON or LRU_ACTIVE_FILE
        if (inactive_list_is_low(lruvec, lru))------------------------if the inactive file or anonymous list is low, shrink the active list.
            shrink_active_list(nr_to_scan, lruvec, sc, lru);
        return 0;
    }

    return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);---------shrink the LRU_INACTIVE_ANON and LRU_INACTIVE_FILE lists
}

The next two sections cover shrinking the active and the inactive lists.

 

5. The shrink_active_list Function

shrink_active_list() checks which active pages can be migrated to the inactive lists.

static void shrink_active_list(unsigned long nr_to_scan,
                   struct lruvec *lruvec,
                   struct scan_control *sc,
                   enum lru_list lru)
{
    unsigned long nr_taken;
    unsigned long nr_scanned;
    unsigned long vm_flags;
    LIST_HEAD(l_hold);    /* The pages which were snipped off */
    LIST_HEAD(l_active);
    LIST_HEAD(l_inactive);----------------------------------------------define three temporary lists: l_hold, l_active, and l_inactive.
    struct page *page;
    struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
    unsigned long nr_rotated = 0;
    isolate_mode_t isolate_mode = 0;
    int file = is_file_lru(lru);
    struct zone *zone = lruvec_zone(lruvec);

    lru_add_drain();

    if (!sc->may_unmap)
        isolate_mode |= ISOLATE_UNMAPPED;
    if (!sc->may_writepage)
        isolate_mode |= ISOLATE_CLEAN;---------------------------------set up isolate_mode

    spin_lock_irq(&zone->lru_lock);

    nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
                     &nr_scanned, sc, isolate_mode, lru);--------------isolate_mode restricts which pages get moved from the LRU list onto l_hold.
    if (global_reclaim(sc))
        __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);

    reclaim_stat->recent_scanned[file] += nr_taken;

    __count_zone_vm_events(PGREFILL, zone, nr_scanned);
    __mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
    __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
    spin_unlock_irq(&zone->lru_lock);

    while (!list_empty(&l_hold)) {------------------------------------scan the temporary l_hold list: some pages will go to l_active, some to l_inactive, and the remainder can be freed directly.
        cond_resched();
        page = lru_to_page(&l_hold);
        list_del(&page->lru);-----------------------------------------remove the page from the temporary l_hold list
        if (unlikely(!page_evictable(page))) {------------------------if the page is unevictable, put it back on the unevictable LRU list and move on to the next page.
            putback_lru_page(page);
            continue;
        }

        if (unlikely(buffer_heads_over_limit)) {
            if (page_has_private(page) && trylock_page(page)) {
                if (page_has_private(page))
                    try_to_release_page(page, 0);
                unlock_page(page);
            }
        }

        if (page_referenced(page, 0, sc->target_mem_cgroup,
                    &vm_flags)) {
            nr_rotated += hpage_nr_pages(page);
            /*
             * Identify referenced, file-backed active pages and
             * give them one more trip around the active list. So
             * that executable code get better chances to stay in
             * memory under moderate memory pressure.  Anon pages
             * are not likely to be evicted by use-once streaming
             * IO, plus JVM can create lots of anon VM_EXEC pages,
             * so we ignore them here.
             */
            if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {---executable page cache stays on the active list.
                list_add(&page->lru, &l_active);----------------------move the page onto l_active.
                continue;
            }
        }

        ClearPageActive(page);    /* we are de-activating */
        list_add(&page->lru, &l_inactive);---------------------------move the page onto l_inactive.
    }

    /*
     * Move pages back to the lru list.
     */
    spin_lock_irq(&zone->lru_lock);
    /*
     * Count referenced pages from currently used mappings as rotated,
     * even though only some of them are actually re-activated.  This
     * helps balance scan pressure between file and anonymous pages in
     * get_scan_count.
     */
    reclaim_stat->recent_rotated[file] += nr_rotated;

    move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
    move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);----splice l_active and l_inactive back into their LRU lists.
    __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
    spin_unlock_irq(&zone->lru_lock);

    mem_cgroup_uncharge_list(&l_hold);
    free_hot_cold_page_list(&l_hold, true);---------------------------------l_hold now holds whatever was left after extracting l_active and l_inactive; free those pages.
}

isolate_lru_pages() isolates pages from an LRU list.

nr_to_scan is the number of pages to scan on this list, lruvec is the set of LRU lists, dst is the temporary destination list, and nr_scanned returns how many pages were scanned.

 

static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
        struct lruvec *lruvec, struct list_head *dst,
        unsigned long *nr_scanned, struct scan_control *sc,
        isolate_mode_t mode, enum lru_list lru)
{
    struct list_head *src = &lruvec->lists[lru];
    unsigned long nr_taken = 0;
    unsigned long scan;

    for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
        struct page *page;
        int nr_pages;

        page = lru_to_page(src);
        prefetchw_prev_lru_page(page, src, flags);

        VM_BUG_ON_PAGE(!PageLRU(page), page);

        switch (__isolate_lru_page(page, mode)) {----------------isolate a single page; 0 means success, and the page is moved onto the dst temporary list.
        case 0:
            nr_pages = hpage_nr_pages(page);
            mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
            list_move(&page->lru, dst);
            nr_taken += nr_pages;
            break;

        case -EBUSY:
            /* else it is being freed elsewhere */
            list_move(&page->lru, src);
            continue;

        default:
            BUG();
        }
    }

    *nr_scanned = scan;
    trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
                    nr_taken, mode, is_file_lru(lru));
    return nr_taken;
}

 

 

6. The shrink_inactive_list Function

shrink_inactive_list() scans the inactive lists and reclaims pages.

static noinline_for_stack unsigned long
shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
             struct scan_control *sc, enum lru_list lru)
{
    LIST_HEAD(page_list);
    unsigned long nr_scanned;
    unsigned long nr_reclaimed = 0;
    unsigned long nr_taken;
    unsigned long nr_dirty = 0;
    unsigned long nr_congested = 0;
    unsigned long nr_unqueued_dirty = 0;
    unsigned long nr_writeback = 0;
    unsigned long nr_immediate = 0;
    isolate_mode_t isolate_mode = 0;
    int file = is_file_lru(lru);
    struct zone *zone = lruvec_zone(lruvec);
    struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;

    while (unlikely(too_many_isolated(zone, file, sc))) {
        congestion_wait(BLK_RW_ASYNC, HZ/10);

        /* We are about to die and free our memory. Return now. */
        if (fatal_signal_pending(current))
            return SWAP_CLUSTER_MAX;
    }

    lru_add_drain();

    if (!sc->may_unmap)
        isolate_mode |= ISOLATE_UNMAPPED;
    if (!sc->may_writepage)
        isolate_mode |= ISOLATE_CLEAN;

    spin_lock_irq(&zone->lru_lock);

    nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
                     &nr_scanned, sc, isolate_mode, lru);--------------isolate inactive pages onto the temporary page_list, subject to isolate_mode.

    __mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
    __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);

    if (global_reclaim(sc)) {
        __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
        if (current_is_kswapd())
            __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
        else
            __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
    }
    spin_unlock_irq(&zone->lru_lock);

    if (nr_taken == 0)
        return 0;

    nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
                &nr_dirty, &nr_unqueued_dirty, &nr_congested,
                &nr_writeback, &nr_immediate,
                false);------------------------------------------------scan the pages on page_list and return the number of pages reclaimed.

    spin_lock_irq(&zone->lru_lock);

    reclaim_stat->recent_scanned[file] += nr_taken;

    if (global_reclaim(sc)) {
        if (current_is_kswapd())
            __count_zone_vm_events(PGSTEAL_KSWAPD, zone,
                           nr_reclaimed);
        else
            __count_zone_vm_events(PGSTEAL_DIRECT, zone,
                           nr_reclaimed);
    }

    putback_inactive_pages(lruvec, &page_list);-----------------------put pages that must be kept back onto the LRU lists

    __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);

    spin_unlock_irq(&zone->lru_lock);

    mem_cgroup_uncharge_list(&page_list);
    free_hot_cold_page_list(&page_list, true);------------------------free the pages remaining on page_list.

    /*
     * If reclaim is isolating dirty pages under writeback, it implies
     * that the long-lived page allocation rate is exceeding the page
     * laundering rate. Either the global limits are not being effective
     * at throttling processes due to the page distribution throughout
     * zones or there is heavy usage of a slow backing device. The
     * only option is to throttle from reclaim context which is not ideal
     * as there is no guarantee the dirtying process is throttled in the
     * same way balance_dirty_pages() manages.
     *
     * Once a zone is flagged ZONE_WRITEBACK, kswapd will count the number
     * of pages under pages flagged for immediate reclaim and stall if any
     * are encountered in the nr_immediate check below.
     */
    if (nr_writeback && nr_writeback == nr_taken)
        set_bit(ZONE_WRITEBACK, &zone->flags);

    /*
     * memcg will stall in page writeback so only consider forcibly
     * stalling for global reclaim
     */
    if (global_reclaim(sc)) {
        /*
         * Tag a zone as congested if all the dirty pages scanned were
         * backed by a congested BDI and wait_iff_congested will stall.
         */
        if (nr_dirty && nr_dirty == nr_congested)
            set_bit(ZONE_CONGESTED, &zone->flags);

        /*
         * If dirty pages are scanned that are not queued for IO, it
         * implies that flushers are not keeping up. In this case, flag
         * the zone ZONE_DIRTY and kswapd will start writing pages from
         * reclaim context.
         */
        if (nr_unqueued_dirty == nr_taken)
            set_bit(ZONE_DIRTY, &zone->flags);

        /*
         * If kswapd scans pages marked marked for immediate
         * reclaim and under writeback (nr_immediate), it implies
         * that pages are cycling through the LRU faster than
         * they are written so also forcibly stall.
         */
        if (nr_immediate && current_may_throttle())
            congestion_wait(BLK_RW_ASYNC, HZ/10);
    }

    /*
     * Stall direct reclaim for IO completions if underlying BDIs or zone
     * is congested. Allow kswapd to continue until it starts encountering
     * unqueued dirty pages or cycling through the LRU too quickly.
     */
    if (!sc->hibernation_mode && !current_is_kswapd() &&
        current_may_throttle())
        wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);

    trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
        zone_idx(zone),
        nr_scanned, nr_reclaimed,
        sc->priority,
        trace_shrink_flags(file));
    return nr_reclaimed;
}

  

7. Tracking LRU Activity

If a page on an LRU list is freed by another process, how does the LRU list learn that the page is gone?

The LRU lists are doubly linked lists; protecting their members from being freed by other kernel paths is a concurrency problem that the page-reclaim design has to address.

The _count reference counter in struct page plays the key role here.

Take shrink_active_list() isolating pages onto the temporary l_hold list as an example.

shrink_active_list()

  ->isolate_lru_pages()

    ->page = lru_to_page()

    ->get_page_unless_zero(page)

    ->ClearPageLRU(page)

So whenever a page is taken off an LRU list, a reference is acquired first: get_page_unless_zero() increments page->_count.

Putting the isolated pages back onto the LRU lists looks like this:

shrink_active_list()

  ->move_active_pages_to_lru()

    ->list_move(&page->lru, &lruvec->lists[lru])

    ->put_page_testzero(page)

Here page->_count is decremented; if it reaches 0, the page has already been released by another path, so PG_LRU is cleared and the page is removed from the LRU list.
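
The two helpers are, roughly, the following in this era's include/linux/mm.h:

static inline int get_page_unless_zero(struct page *page)
{
    /* take a reference only if the page is not already free */
    return atomic_inc_not_zero(&page->_count);
}

static inline int put_page_testzero(struct page *page)
{
    VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0, page);
    /* drop our reference; true means we dropped the last one */
    return atomic_dec_and_test(&page->_count);
}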

 

8. The Refault Distance Algorithm
