Linux Memory Management (13): Page Reclaim

Series: Linux memory management series

Keywords: LRU, active/inactive lists for file-cache/anonymous pages, Refault Distance

 

Page reclaim relies on the LRU lists to classify pages into five categories: inactive anonymous pages, active anonymous pages, inactive file-cache pages, active file-cache pages, and unevictable pages.

When memory is tight, file-cache pages are evicted first and anonymous pages only afterwards, because file-cache pages have backing storage while anonymous pages must first be written out to the swap area.

Page reclaim therefore works through three mechanisms: (1) unmodified file-cache pages can simply be dropped; (2) modified file-cache pages must be written back to the backing device; (3) rarely used anonymous pages are written out to the swap area to free physical memory, a mechanism known as swapping.
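
A minimal sketch of this policy in pseudo-C (illustrative only; these names are made up and are not kernel APIs):

enum reclaim_action { DROP, WRITEBACK_THEN_DROP, SWAP_OUT };

/* Illustrative only: the three reclaim mechanisms described above. */
static enum reclaim_action reclaim_action_for(bool file_backed, bool dirty)
{
    if (file_backed)
        /* clean page cache can simply be dropped; dirty page cache
         * must first be written back to its backing file */
        return dirty ? WRITEBACK_THEN_DROP : DROP;
    /* anonymous pages have no backing file: swap them out */
    return SWAP_OUT;
}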

 

The LRU lists are the foundation of page reclaim, and the kswapd kernel thread is its entry point: one "kswapd%d" kernel thread is created for each NUMA memory node.

The function analysis below follows the memory node --> zone --> LRU active/inactive hierarchy:

balance_pgdat() is the main page-reclaim function; it operates at the level of one memory node.

shrink_zone() scans all reclaimable pages in a zone; it operates at the level of one zone.

Then shrink_active_list() scans the active lists to see which active pages can be migrated to the inactive lists, and shrink_inactive_list() scans the inactive lists and reclaims pages.

Finally, the section on tracking LRU activity explains how the LRU lists learn that a page has been freed and update themselves, followed by the Refault Distance algorithm that optimizes reclaim of file-cache pages.

 

Listing kswapd's core activities shows its basic skeleton; the sections below walk through each step:

kswapd_init---------------------------------------initialization of the kswapd module
kswapd_run--------------------------------------creates the kswapd kernel thread
kswapd----------------------------------------main function of the kswapd kernel thread
kswapd_try_to_sleep-------------------------sleeps and yields the CPU, waiting for wakeup_kswapd() to wake it
balance_pgdat-------------------------------main page-reclaim function, covering multiple zones
kswapd_shrink_zone------------------------handles scanning and reclaim of a single zone
shrink_zone-----------------------------scans all reclaimable pages in a zone
shrink_lruvec-------------------------core function for scanning the LRU lists
shrink_list-------------------------dispatches on the individual LRU lists
shrink_active_list----------------checks which active pages can be moved to the inactive lists
isolate_lru_pages---------------isolates pages from an LRU list
shrink_inactive_list--------------scans the inactive LRU lists, tries to reclaim pages, and returns the number reclaimed
shrink_page_list----------------scans the pages on page_list and returns the number reclaimed
shrink_slab---------------------------invokes the shrinker interfaces registered with the memory-management system
pgdat_balanced----------------------------checks whether a memory node is balanced, i.e., above the high watermark
zone_balanced---------------------------checks whether a zone within the node is balanced

1. LRU Lists

LRU (Least Recently Used) means exactly that: the kernel assumes a page that has not been used recently will not be used frequently in the near future either.

When memory runs low, such pages become the preferred candidates for eviction.

1.1 LRU Lists

Each LRU list is a doubly linked list. Based on page type (anonymous vs. file) and activity (active vs. inactive), the kernel maintains five kinds of LRU lists:

#define LRU_BASE 0
#define LRU_ACTIVE 1
#define LRU_FILE 2

enum lru_list {
    LRU_INACTIVE_ANON = LRU_BASE,--------------------------inactive anonymous list; reclaimable only with a swap area
    LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,---------------active anonymous list
    LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,---------------inactive file-backed list; reclaimed first
    LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,----active file-backed list
    LRU_UNEVICTABLE,---------------------------------------unevictable list; never swapped out
    NR_LRU_LISTS
};


struct lruvec {
    struct list_head lists[NR_LRU_LISTS];
    struct zone_reclaim_stat reclaim_stat;
#ifdef CONFIG_MEMCG
    struct zone *zone;
#endif
};


struct zone {
...
    /* Fields commonly accessed by the page reclaim scanner */
    spinlock_t        lru_lock;
    struct lruvec lruvec;
...
}

All LRU lists can be reached from the zone through this member, so page reclaim is performed per zone; from Linux v4.8 onwards the LRU lists are maintained per node instead.

How do the LRU lists age pages?

The usual API for adding a page to the LRU lists is lru_cache_add().

lru_cache_add()-->__lru_cache_add()-->

 
 

/* 14 pointers + two long's align the pagevec structure to a power of two */
#define PAGEVEC_SIZE 14

struct pagevec {
    unsigned long nr;
    unsigned long cold;
    struct page *pages[PAGEVEC_SIZE];-------batch-processes 14 pages at a time
};



static void __lru_cache_add(struct page *page)
{
    struct pagevec *pvec = &get_cpu_var(lru_add_pvec);

    page_cache_get(page);
    if (!pagevec_space(pvec))-------------if the pagevec has no space left, first flush its pages onto the LRU lists via __pagevec_lru_add()
        __pagevec_lru_add(pvec);
    pagevec_add(pvec, page);--------------add the page to the struct pagevec
    put_cpu_var(lru_add_pvec);
}

 

 

void __pagevec_lru_add(struct pagevec *pvec)
{
    pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
}


static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
                 void *arg)
{
    int file = page_is_file_cache(page);
    int active = PageActive(page);
    enum lru_list lru = page_lru(page);-------------------------------------------determine the page's LRU type

    VM_BUG_ON_PAGE(PageLRU(page), page);

    SetPageLRU(page);
    add_page_to_lru_list(page, lruvec, lru);
    update_page_reclaim_stat(lruvec, file, active);
    trace_mm_lru_insertion(page, lru);
}
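
For reference, the page_lru() used above maps the page flags onto a list index; in kernels of this era (include/linux/mm_inline.h) it looks roughly like this:

static inline enum lru_list page_lru_base_type(struct page *page)
{
    if (page_is_file_cache(page))
        return LRU_INACTIVE_FILE;
    return LRU_INACTIVE_ANON;
}

static __always_inline enum lru_list page_lru(struct page *page)
{
    enum lru_list lru;

    if (PageUnevictable(page))
        lru = LRU_UNEVICTABLE;
    else {
        lru = page_lru_base_type(page);
        if (PageActive(page))
            lru += LRU_ACTIVE;  /* the enum arithmetic shown above */
    }
    return lru;
}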

 

add_page_to_lru_list() uses the computed lru index to add the page to lruvec->lists[lru].

lru_to_page() combined with list_del() removes a page from an LRU list. lru_to_page() takes the page at the tail of the list: the LRU list works FIFO, so the page that entered the list earliest has been on it longest and has aged the most.

The least used pages gradually drift toward the tail of the inactive lists and become the most suitable reclaim candidates.

 

lru_cache_add: adds a page to the LRU lists.

lru_to_page: fetches the page at the tail of an LRU list.

list_del: removes a page from an LRU list.
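
Put together, removing the oldest page looks like this (a sketch; in real code the caller must hold zone->lru_lock):

/* Sketch: pop the oldest page off the inactive file list. */
struct page *page = lru_to_page(&lruvec->lists[LRU_INACTIVE_FILE]);
list_del(&page->lru);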

(A figure illustrating these operations is missing here.)

1.2 The Second-Chance Algorithm

The second-chance algorithm exists to avoid evicting frequently used pages.

When picking a page to evict it behaves just like LRU: it selects the page that entered the list earliest, i.e., the page at the tail.

Second chance adds an access (reference) bit: if the bit is 0 the page is evicted; if it is 1 the page gets a second chance and the next page is considered instead.

A page given a second chance has its access bit cleared to 0; if the page is accessed again in the meantime, the bit is set back to 1.
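
In textbook form the algorithm looks like the sketch below (the helper test_and_clear_referenced() is made up for illustration and is not a kernel API):

/* Illustrative second-chance eviction loop. */
struct page *pick_victim(struct list_head *lru)
{
    for (;;) {
        struct page *page = lru_to_page(lru);     /* oldest page, at the tail */
        if (!test_and_clear_referenced(page))     /* made-up helper */
            return page;                          /* access bit was 0: evict */
        /* access bit was 1: it is now cleared, giving a second chance */
        list_move(&page->lru, lru);               /* rotate to the list head */
    }
}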

 

The Linux kernel implements the second-chance algorithm with two flags: PG_active and PG_referenced.

PG_active indicates the page is on an active LRU list; PG_referenced indicates whether the page has been referenced recently.

1.3 mark_page_accessed()

As the comment at the top of the function shows, PG_active and PG_referenced combine into three transition cases.

/*
 * Mark a page as having seen activity.
 *
 * inactive,unreferenced    ->    inactive,referenced-----1
 * inactive,referenced        ->    active,unreferenced---2
 * active,unreferenced        ->    active,referenced-----3
 *
 * When a newly allocated page is not yet visible, so safe for non-atomic ops,
 * __SetPageReferenced(page) may be substituted for mark_page_accessed(page).
 */
void mark_page_accessed(struct page *page)
{
    if (!PageActive(page) && !PageUnevictable(page) &&
            PageReferenced(page)) {-----------------------the inactive,referenced case: set active,unreferenced; this is case 2
        /*
         * If the page is on the LRU, queue it for activation via
         * activate_page_pvecs. Otherwise, assume the page is on a
         * pagevec, mark it active and it'll be moved to the active
         * LRU on the next drain.
         */
        if (PageLRU(page))
            activate_page(page);
        else
            __lru_cache_activate_page(page);
        ClearPageReferenced(page);
        if (page_is_file_cache(page))
            workingset_activation(page);
    } else if (!PageReferenced(page)) {--------------------the inactive,unreferenced and active,unreferenced cases: set inactive/active,referenced; these are cases 1 and 3
        SetPageReferenced(page);
    }
}

 

 

1.4 page_check_references()

page_check_references() is called while scanning the inactive LRU lists; it returns a value of the page_references enum.

enum page_references {
    PAGEREF_RECLAIM,-------------------the page may be reclaimed
    PAGEREF_RECLAIM_CLEAN,-------------the page is clean and may be reclaimed without writeback
    PAGEREF_KEEP,----------------------the page stays on the inactive list
    PAGEREF_ACTIVATE,------------------the page is migrated to the active list
};

static enum page_references page_check_references(struct page *page,
                          struct scan_control *sc)
{
    int referenced_ptes, referenced_page;
    unsigned long vm_flags;

    referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
                      &vm_flags);-------------------------------------------------count how many ptes reference this page, found via an rmap walk.
    referenced_page = TestClearPageReferenced(page);------------------------------was PG_referenced set? If so, the page gets a second chance.
    /*
     * Mlock lost the isolation race with us.  Let try_to_unmap()
     * move the page to the unevictable list.
     */
    if (vm_flags & VM_LOCKED)
        return PAGEREF_RECLAIM;

    if (referenced_ptes) {---------------------------------------------------------the page is referenced by ptes
        if (PageSwapBacked(page))
            return PAGEREF_ACTIVATE;-----------------------------------------------anonymous page: move to the active list
        /*
         * All mapped pages start out with page table
         * references from the instantiating fault, so we need
         * to look twice if a mapped file page is used more
         * than once.
         *
         * Mark it and spare it for another trip around the
         * inactive list.  Another page table reference will
         * lead to its activation.
         *
         * Note: the mark is set for activated pages as well
         * so that recently deactivated but used pages are
         * quickly recovered.
         */
        SetPageReferenced(page);

        if (referenced_page || referenced_ptes > 1)
            return PAGEREF_ACTIVATE;--------------------------------------------page cache referenced a second time recently, or shared page cache: move to the active list.

        /*
         * Activate file-backed executable pages after first usage.
         */
        if (vm_flags & VM_EXEC)
            return PAGEREF_ACTIVATE;--------------------------------------------page cache of an executable file: move to the active list.

        return PAGEREF_KEEP;----------------------------------------------------keep on the inactive list.
    }

    /* Reclaim if clean, defer dirty pages to writeback */
    if (referenced_page && !PageSwapBacked(page))-------------------------------page cache seen a second time but clean: it can be freed.
        return PAGEREF_RECLAIM_CLEAN;

    return PAGEREF_RECLAIM;-----------------------------------------------------the page is not referenced by any pte: it can be freed
}

 

  

1.5 page_referenced()

page_referenced() checks whether a page has been referenced and returns the number of referencing ptes, i.e., the number of user-space virtual pages that map and have accessed this page.

The core idea is to use the reverse-mapping (RMAP) system to count the ptes that reference the page.

page_referenced() mainly does the following:

  • Use the RMAP system to walk every pte that maps the page.
  • For each pte, if the L_PTE_YOUNG bit is set, the page was accessed, so increment the referenced count and clear L_PTE_YOUNG. On ARM32 the hardware page-table entry is also cleared, deliberately provoking a page fault; when the pte is accessed again, the fault handler sets L_PTE_YOUNG once more.
  • Return the referenced count, i.e., how many ptes referenced the page.

 

/**
 * page_referenced - test if the page was referenced
 * @page: the page to test
 * @is_locked: caller holds lock on the page
 * @memcg: target memory cgroup
 * @vm_flags: collect encountered vma->vm_flags who actually referenced the page
 *
 * Quick test_and_clear_referenced for all mappings to a page,
 * returns the number of ptes which referenced the page.
 */
int page_referenced(struct page *page,
            int is_locked,
            struct mem_cgroup *memcg,
            unsigned long *vm_flags)
{
    int ret;
    int we_locked = 0;
    struct page_referenced_arg pra = {
        .mapcount = page_mapcount(page),
        .memcg = memcg,
    };
    struct rmap_walk_control rwc = {
        .rmap_one = page_referenced_one,
        .arg = (void *)&pra,
        .anon_lock = page_lock_anon_vma_read,
    };

    *vm_flags = 0;
    if (!page_mapped(page))----------------------------------check whether the page->_mapcount reference count is >= 0.
        return 0;

    if (!page_rmapping(page))--------------------------------check whether page->mapping points to an address space.
        return 0;

    if (!is_locked && (!PageAnon(page) || PageKsm(page))) {
        we_locked = trylock_page(page);
        if (!we_locked)
            return 1;
    }

    /*
     * If we are reclaiming on behalf of a cgroup, skip
     * counting on behalf of references from different
     * cgroups
     */
    if (memcg) {
        rwc.invalid_vma = invalid_page_referenced_vma;
    }

    ret = rmap_walk(page, &rwc);---------------------------walk every pte that maps the page, calling the rmap_walk_control member rmap_one on each.
    *vm_flags = pra.vm_flags;

    if (we_locked)
        unlock_page(page);

    return pra.referenced;
}

page_referenced() calls page_referenced_one() to update the referenced and mapcount counters.

static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
            unsigned long address, void *arg)
{
    struct mm_struct *mm = vma->vm_mm;
    spinlock_t *ptl;
    int referenced = 0;
    struct page_referenced_arg *pra = arg;

    if (unlikely(PageTransHuge(page))) {
...
    } else {
        pte_t *pte;

        /*
         * rmap might return false positives; we must filter
         * these out using page_check_address().
         */
        pte = page_check_address(page, mm, address, &ptl, 0);--------------look up the pte from mm and address
        if (!pte)
            return SWAP_AGAIN;

        if (vma->vm_flags & VM_LOCKED) {
            pte_unmap_unlock(pte, ptl);
            pra->vm_flags |= VM_LOCKED;
            return SWAP_FAIL; /* To break the loop */
        }

        if (ptep_clear_flush_young_notify(vma, address, pte)) {-----------the pte has been accessed recently
            /*
             * Don't treat a reference through a sequentially read
             * mapping as such.  If the page has been used in
             * another mapping, we will catch it; if this other
             * mapping is already gone, the unmap path will have
             * set PG_referenced or activated the page.
             */
            if (likely(!(vma->vm_flags & VM_SEQ_READ)))-------------------sequentially read page cache is the best reclaim candidate; every other case counts as a pte reference and increments the count.
                referenced++;
        }
        pte_unmap_unlock(pte, ptl);
    }

    if (referenced) {
        pra->referenced++;-------------------------------------------------increment pra->referenced
        pra->vm_flags |= vma->vm_flags;
    }

    pra->mapcount--;-------------------------------------------------------decrement pra->mapcount
    if (!pra->mapcount)
        return SWAP_SUCCESS; /* To break the loop */

    return SWAP_AGAIN;
}

  

2. The kswapd Kernel Thread

kswapd is responsible for reclaiming pages when memory runs low; at initialization, one kernel thread named "kswapd%d" is created for each NUMA memory node in the system.

2.1 The kswapd_wait Wait Queue

setup_arch()-->paging_init()-->bootmem_init()-->zone_sizes_init()-->free_area_init_node()-->free_area_init_core(): the kswapd_wait queue is initialized in free_area_init_core(), one per memory node.

A wait queue lets a process wait for an event without polling: the process sleeps while waiting, and the kernel wakes it automatically when the event occurs.

The kswapd kernel thread sleeps in TASK_INTERRUPTIBLE state on the kswapd_wait queue.
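
The generic wait-queue pattern is the standard idiom below (wq and condition stand for the caller's wait_queue_head_t and wake-up condition); kswapd_try_to_sleep() further down is a concrete instance of it:

DEFINE_WAIT(wait);

prepare_to_wait(&wq, &wait, TASK_INTERRUPTIBLE); /* enqueue; mark ourselves sleeping */
if (!condition)
    schedule();                                  /* sleep until wake_up(&wq) */
finish_wait(&wq, &wait);                         /* dequeue; back to TASK_RUNNING */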

static void __paginginit free_area_init_core(struct pglist_data *pgdat,
        unsigned long node_start_pfn, unsigned long node_end_pfn,
        unsigned long *zones_size, unsigned long *zholes_size)
{
...
    init_waitqueue_head(&pgdat->kswapd_wait);
    init_waitqueue_head(&pgdat->pfmemalloc_wait);
    pgdat_page_ext_init(pgdat);

...
}

  

 

2.2 Creating the kswapd Kernel Thread

The kswapd kernel thread reclaims pages when memory is low; one is set up per NUMA memory node.

The kswapd() function is the thread's entry point.

static int __init kswapd_init(void)
{
    int nid;

    swap_setup();
    for_each_node_state(nid, N_MEMORY)-----------------------------------------create one kswapd kernel thread per memory node
         kswapd_run(nid);
    hotcpu_notifier(cpu_callback, 0);
    return 0;
}

int kswapd_run(int nid)
{
    pg_data_t *pgdat = NODE_DATA(nid);-----------------------------------------get the node's pg_data_t pointer
    int ret = 0;

    if (pgdat->kswapd)
        return 0;

    pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);---------------start the kswapd() function, passing pgdat as its argument.
    if (IS_ERR(pgdat->kswapd)) {
        /* failure at boot is fatal */
        BUG_ON(system_state == SYSTEM_BOOTING);
        pr_err("Failed to start kswapd on node %d\n", nid);
        ret = PTR_ERR(pgdat->kswapd);
        pgdat->kswapd = NULL;
    }
    return ret;
}

static int kswapd(void *p)
{
    unsigned long order, new_order;
    unsigned balanced_order;
    int classzone_idx, new_classzone_idx;
    int balanced_classzone_idx;
    pg_data_t *pgdat = (pg_data_t*)p;-----------------------------------------the node's pg_data_t, passed in from kswapd_run().
    struct task_struct *tsk = current;

    struct reclaim_state reclaim_state = {
        .reclaimed_slab = 0,
    };
    const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);

    lockdep_set_current_reclaim_state(GFP_KERNEL);

    if (!cpumask_empty(cpumask))
        set_cpus_allowed_ptr(tsk, cpumask);
    current->reclaim_state = &reclaim_state;

    /*
     * Tell the memory management that we're a "memory allocator",
     * and that if we need more memory we should get access to it
     * regardless (see "__alloc_pages()"). "kswapd" should
     * never get caught in the normal page freeing logic.
     *
     * (Kswapd normally doesn't need memory anyway, but sometimes
     * you need a small amount of memory in order to be able to
     * page out something else, and this flag essentially protects
     * us from recursively trying to free more memory as we're
     * trying to free the first piece of memory in the first place).
     */
    tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
    set_freezable();

    order = new_order = 0;
    balanced_order = 0;
    classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
    balanced_classzone_idx = classzone_idx;
    for ( ; ; ) {
        bool ret;

        /*
         * If the last balance_pgdat was unsuccessful it's unlikely a
         * new request of a similar or harder type will succeed soon
         * so consider going to sleep on the basis we reclaimed at
         */
        if (balanced_classzone_idx >= new_classzone_idx &&
                    balanced_order == new_order) {
            new_order = pgdat->kswapd_max_order;
            new_classzone_idx = pgdat->classzone_idx;
            pgdat->kswapd_max_order =  0;
            pgdat->classzone_idx = pgdat->nr_zones - 1;
        }

        if (order < new_order || classzone_idx > new_classzone_idx) {
            /*
             * Don't sleep if someone wants a larger 'order'
             * allocation or has tigher zone constraints
             */
            order = new_order;
            classzone_idx = new_classzone_idx;
        } else {
            kswapd_try_to_sleep(pgdat, balanced_order,---------------------------sleep here until woken by wakeup_kswapd().
                        balanced_classzone_idx);
            order = pgdat->kswapd_max_order;
            classzone_idx = pgdat->classzone_idx;--------------------------------pgdat->kswapd_max_order and pgdat->classzone_idx were already updated in wakeup_kswapd().
            new_order = order;
            new_classzone_idx = classzone_idx;
            pgdat->kswapd_max_order = 0;
            pgdat->classzone_idx = pgdat->nr_zones - 1;
        }

        ret = try_to_freeze();
        if (kthread_should_stop())
            break;

        /*
         * We can speed up thawing tasks if we don't call balance_pgdat
         * after returning from the refrigerator
         */
        if (!ret) {
            trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
            balanced_classzone_idx = classzone_idx;
            balanced_order = balance_pgdat(pgdat, order,------------------------the main page-reclaim function.
                        &balanced_classzone_idx);
        }
    }

    tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
    current->reclaim_state = NULL;
    lockdep_clear_current_reclaim_state();

    return 0;
}

 

 

static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
{
    long remaining = 0;
    DEFINE_WAIT(wait);

    if (freezing(current) || kthread_should_stop())
        return;

    prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);----------------------------------queue a wait on kswapd_wait and set the task state to TASK_INTERRUPTIBLE.

    /* Try to sleep for a short interval */
    if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {-------------------------------remaining is 0 here; check whether kswapd is ready to sleep.
        remaining = schedule_timeout(HZ/10);----------------------------------------------------------try a short 100 ms sleep; a nonzero return means we were woken before the 100 ms expired.
        finish_wait(&pgdat->kswapd_wait, &wait);
        prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
    }

    /*
     * After a short sleep, check if it was a premature sleep. If not, then
     * go fully to sleep until explicitly woken up.
     */
    if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {-------------------------------if the short sleep was interrupted by a wakeup there is no point in sleeping longer; if it was not interrupted, go fully to sleep.
        trace_mm_vmscan_kswapd_sleep(pgdat->node_id);

        /*
         * vmstat counters are not perfectly accurate and the estimated
         * value for counters such as NR_FREE_PAGES can deviate from the
         * true value by nr_online_cpus * threshold. To avoid the zone
         * watermarks being breached while under pressure, we reduce the
         * per-cpu vmstat threshold while kswapd is awake and restore
         * them before going back to sleep.
         */
        set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);

        /*
         * Compaction records what page blocks it recently failed to
         * isolate pages from and skips them in the future scanning.
         * When kswapd is going to sleep, it is reasonable to assume
         * that pages and compaction may succeed so reset the cache.
         */
        reset_isolation_suitable(pgdat);

        if (!kthread_should_stop())
            schedule();------------------------------------------------------------------------------yield the CPU.

        set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
    } else {
        if (remaining)
            count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
        else
            count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
    }
    finish_wait(&pgdat->kswapd_wait, &wait);---------------------------------------------------------set the task state back to TASK_RUNNING.
}

   

2.3 Waking kswapd to Reclaim Pages

Memory reclaim is triggered when, on the allocation path, an allocation fails because a zone is below its low watermark. wakeup_kswapd() is then called to wake the kswapd kernel thread to reclaim pages and free memory.

On NUMA systems, pg_data_t describes the physical-memory layout; its kswapd-related fields are:

typedef struct pglist_data {
...
    wait_queue_head_t kswapd_wait;----------------------------wait queue
    wait_queue_head_t pfmemalloc_wait;
    struct task_struct *kswapd;    /* Protected by
                       mem_hotplug_begin/end() */
    int kswapd_max_order;-------------------------------------largest allocation order requested by a waker
    enum zone_type classzone_idx;-----------------------------index of the most suitable zone for the allocation
...
} pg_data_t;

 

The two key fields are kswapd_max_order and classzone_idx; kswapd reads and uses them after it wakes up.

alloc_page()-->__alloc_pages_nodemask()-->__alloc_pages_slowpath()-->wake_all_kswapds()-->wakeup_kswapd().

void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
{
    pg_data_t *pgdat;

    if (!populated_zone(zone))
        return;

    if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
        return;
    pgdat = zone->zone_pgdat;
    if (pgdat->kswapd_max_order < order) {
        pgdat->kswapd_max_order = order;
        pgdat->classzone_idx = min(pgdat->classzone_idx, classzone_idx);-------------------------stash the node's classzone_idx and kswapd_max_order parameters for kswapd.
    }
    if (!waitqueue_active(&pgdat->kswapd_wait))
        return;
    if (zone_balanced(zone, order, 0, 0))
        return;

    trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
    wake_up_interruptible(&pgdat->kswapd_wait);---------------------------------------------------wake the TASK_INTERRUPTIBLE threads waiting on the kswapd_wait queue.
}

 

3. The balance_pgdat Function

balance_pgdat() is the main page-reclaim function. It is one big loop: it first searches from the highest zone downward for the first unbalanced zone (end_zone); it then reclaims from the lowest zone up to end_zone; within the loop it checks whether the zones from the lowest up to classzone_idx are balanced, raising the scan intensity each round.

static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
                            int *classzone_idx)
{
    int i;
    int end_zone = 0;    /* Inclusive.  0 = ZONE_DMA */
    unsigned long nr_soft_reclaimed;
    unsigned long nr_soft_scanned;
    struct scan_control sc = {
        .gfp_mask = GFP_KERNEL,
        .order = order,
        .priority = DEF_PRIORITY,-------------------------------------------------------------------initial scan priority; each pass scans total_size >> priority pages.
        .may_writepage = !laptop_mode,
        .may_unmap = 1,
        .may_swap = 1,
    };
    count_vm_event(PAGEOUTRUN);

    do {
        unsigned long nr_attempted = 0;
        bool raise_priority = true;
        bool pgdat_needs_compaction = (order > 0);

        sc.nr_reclaimed = 0;

        /*
         * Scan in the highmem->dma direction for the highest
         * zone which needs scanning
         */
        for (i = pgdat->nr_zones - 1; i >= 0; i--) {-----------------------------------------------search from ZONE_HIGHMEM toward ZONE_NORMAL for the first unbalanced zone, i.e., the first zone whose free pages are below the WMARK_HIGH watermark; record it as end_zone.
            struct zone *zone = pgdat->node_zones + i;

            if (!populated_zone(zone))
                continue;

            if (sc.priority != DEF_PRIORITY &&
                !zone_reclaimable(zone))
                continue;

            /*
             * Do some background aging of the anon list, to give
             * pages a chance to be referenced before reclaiming.
             */
            age_active_anon(zone, &sc);

            /*
             * If the number of buffer_heads in the machine
             * exceeds the maximum allowed level and this node
             * has a highmem zone, force kswapd to reclaim from
             * it to relieve lowmem pressure.
             */
            if (buffer_heads_over_limit && is_highmem_idx(i)) {
                end_zone = i;
                break;
            }

            if (!zone_balanced(zone, order, 0, 0)) {---------------------------------------------is this zone balanced? If not, record it in end_zone and break out of the for loop.
                end_zone = i;
                break;
            } else {
                /*
                 * If balanced, clear the dirty and congested
                 * flags
                 */
                clear_bit(ZONE_CONGESTED, &zone->flags);
                clear_bit(ZONE_DIRTY, &zone->flags);
            }
        }

        if (i < 0)
            goto out;

        for (i = 0; i <= end_zone; i++) {---------------------------------------------------------walk from the lowest zone up to end_zone in preparation for page reclaim.
            struct zone *zone = pgdat->node_zones + i;

            if (!populated_zone(zone))
                continue;

            /*
             * If any zone is currently balanced then kswapd will
             * not call compaction as it is expected that the
             * necessary pages are already available.
             */
            if (pgdat_needs_compaction &&
                    zone_watermark_ok(zone, order,
                        low_wmark_pages(zone),
                        *classzone_idx, 0))-------------------------------------------------------with order > 0, pgdat_needs_compaction starts out true; if this zone is above its WMARK_LOW watermark, no compaction is needed.
                pgdat_needs_compaction = false;
        }

        /*
         * If we're getting trouble reclaiming, start doing writepage
         * even in laptop mode.
         */
        if (sc.priority < DEF_PRIORITY - 2)
            sc.may_writepage = 1;

        /*
         * Now scan the zone in the dma->highmem direction, stopping
         * at the last zone which needs scanning.
         *
         * We do this because the page allocator works in the opposite
         * direction.  This prevents the page allocator from allocating
         * pages behind kswapd's direction of progress, which would
         * cause too much scanning of the lower zones.
         */
        for (i = 0; i <= end_zone; i++) {---------------------------------------------------------reclaim memory from the lowest zone up to end_zone.
            struct zone *zone = pgdat->node_zones + i;

            if (!populated_zone(zone))
                continue;

            if (sc.priority != DEF_PRIORITY &&
                !zone_reclaimable(zone))
                continue;

            sc.nr_scanned = 0;

            nr_soft_scanned = 0;
            /*
             * Call soft limit reclaim before calling shrink_zone.
             */
            nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
                            order, sc.gfp_mask,
                            &nr_soft_scanned);
            sc.nr_reclaimed += nr_soft_reclaimed;

            /*
             * There should be no need to raise the scanning
             * priority if enough pages are already being scanned
             * that that high watermark would be met at 100%
             * efficiency.
             */
            if (kswapd_shrink_zone(zone, end_zone,------------------------------------------------the real scan-and-reclaim function; its parameters and results live in struct scan_control. Returns true when the needed pages were reclaimed, so the scan priority need not be raised.
                        &sc, &nr_attempted))
                raise_priority = false;
        }

        /*
         * If the low watermark is met there is no need for processes
         * to be throttled on pfmemalloc_wait as they should not be
         * able to safely make forward progress. Wake them
         */
        if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
                pfmemalloc_watermark_ok(pgdat))
            wake_up_all(&pgdat->pfmemalloc_wait);

        /*
         * Fragmentation may mean that the system cannot be rebalanced
         * for high-order allocations in all zones. If twice the
         * allocation size has been reclaimed and the zones are still
         * not balanced then recheck the watermarks at order-0 to
         * prevent kswapd reclaiming excessively. Assume that a
         * process requested a high-order can direct reclaim/compact.
         */
        if (order && sc.nr_reclaimed >= 2UL << order)------------------------------------------if order is nonzero and sc.nr_reclaimed, the number of pages reclaimed so far, has reached twice the allocation size (2UL << order),
            order = sc.order = 0;--------------------------------------------------------------drop order back to 0 to keep kswapd from reclaiming pages too aggressively.

        /* Check if kswapd should be suspending */
        if (try_to_freeze() || kthread_should_stop())------------------------------------------check whether kswapd should freeze or stop; if so, exit the loop.
            break;

        /*
         * Compact if necessary and kswapd is reclaiming at least the
         * high watermark number of pages as requsted
         */
        if (pgdat_needs_compaction && sc.nr_reclaimed > nr_attempted)-------------------------decide whether memory compaction is needed to reduce fragmentation.
            compact_pgdat(pgdat, order);------------------------------------------------------see the memory-compaction chapter for compact_pgdat().

        /*
         * Raise priority if scanning rate is too low or there was no
         * progress in reclaiming pages
         */
        if (raise_priority || !sc.nr_reclaimed)
            sc.priority--;--------------------------------------------------------------------each pass scans total_size >> priority pages, so lowering priority scans progressively more. raise_priority is set to false only when kswapd_shrink_zone() actually reclaimed pages.
    } while (sc.priority >= 1 &&
         !pgdat_balanced(pgdat, order, *classzone_idx));

out:
    /*
     * Return the order we were reclaiming at so prepare_kswapd_sleep()
     * makes a decision on the order we were last reclaiming at. However,
     * if another caller entered the allocator slow path while kswapd
     * was awake, order will remain at the higher level
     */
    *classzone_idx = end_zone;
    return order;
}

pgdat_balanced() checks whether the physical pages of a memory node are balanced, walking the zones from the lowest up to classzone_idx.

classzone_idx is passed down from wake_all_kswapds().

static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
{
    unsigned long managed_pages = 0;
    unsigned long balanced_pages = 0;
    int i;

    /* Check the watermark levels */
    for (i = 0; i <= classzone_idx; i++) {-----------------------------------------------walk the zones from low to high
        struct zone *zone = pgdat->node_zones + i;

        if (!populated_zone(zone))
            continue;

        managed_pages += zone->managed_pages;

        /*
         * A special case here:
         *
         * balance_pgdat() skips over all_unreclaimable after
         * DEF_PRIORITY. Effectively, it considers them balanced so
         * they must be considered balanced here as well!
         */
        if (!zone_reclaimable(zone)) {
            balanced_pages += zone->managed_pages;
            continue;
        }

        if (zone_balanced(zone, order, 0, i))------------------------------------------if this zone's free pages are above the WMARK_HIGH watermark, all of its managed pages count as balanced_pages.
            balanced_pages += zone->managed_pages;
        else if (!order)---------------------------------------------------------------for an order-0 (single page) request, one zone below WMARK_HIGH makes the whole node unbalanced.
            return false;
    }

    if (order)
        return balanced_pages >= (managed_pages >> 2);---------------------------------for order > 0, the node counts as balanced once balanced_pages across the zones from the lowest up to classzone_idx reach 25% of managed_pages.
    else
        return true;-------------------------------------------------------------------every zone checked is balanced, so for an order-0 request the node is balanced.
}

zone_balanced() checks whether the zone's free pages would still be above the WMARK_HIGH watermark after allocating 2^order pages; returning true means the zone stays above WMARK_HIGH.

 

static bool zone_balanced(struct zone *zone, int order,
              unsigned long balance_gap, int classzone_idx)
{
    if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
                    balance_gap, classzone_idx, 0))
        return false;

    if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,
                order, 0, classzone_idx) == COMPACT_SKIPPED)
        return false;

    return true;
}

 

3.1 Allocation Path versus Reclaim Path

Why does page reclaim walk zones from ZONE_NORMAL toward end_zone?

Because the buddy allocator hands out pages in the ZONE_HIGHMEM-to-ZONE_NORMAL direction; page reclaim runs exactly the other way.

This helps reduce lock contention and improves efficiency, since page allocation and page reclaim may contend on the zone->lru_lock.

 

4. The shrink_zone Function

Before invoking shrink_zone(), kswapd_shrink_zone() runs a few checks to confirm that page reclaim is really necessary.

static bool kswapd_shrink_zone(struct zone *zone,
                   int classzone_idx,
                   struct scan_control *sc,
                   unsigned long *nr_attempted)
{
    int testorder = sc->order;
    unsigned long balance_gap;
    bool lowmem_pressure;

    /* Reclaim above the high watermark. */
    sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));-----------------------compute the maximum number of pages to reclaim in one round

    /*
     * Kswapd reclaims only single pages with compaction enabled. Trying
     * too hard to reclaim until contiguous free pages have become
     * available can hurt performance by evicting too much useful data
     * from memory. Do not reclaim more than needed for compaction.
     */
    if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
            compaction_suitable(zone, sc->order, 0, classzone_idx)
                            != COMPACT_SKIPPED)
        testorder = 0;

    /*
     * We put equal pressure on every zone, unless one zone has way too
     * many pages free already. The "too many pages" is defined as the
     * high wmark plus a "gap" where the gap is either the low
     * watermark or 1% of the zone, whichever is smaller.
     */
    balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
            zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));----------------------------balance_gap raises the bar for counting the watermark as balanced

    /*
     * If there is no low memory pressure or the zone is balanced then no
     * reclaim is necessary
     */
    lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
    if (!lowmem_pressure && zone_balanced(zone, testorder,
                        balance_gap, classzone_idx))----------------------------------------check whether the current watermark is above WMARK_HIGH + balance_gap; if so, return true right away, since no shrink_zone() reclaim is needed.
        return true;

    shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);

    /* Account for the number of pages attempted to reclaim */
    *nr_attempted += sc->nr_to_reclaim;

    clear_bit(ZONE_WRITEBACK, &zone->flags);

    /*
     * If a zone reaches its high watermark, consider it to be no longer
     * congested. It's possible there are dirty pages backed by congested
     * BDIs but as pressure is relieved, speculatively avoid congestion
     * waits.
     */
    if (zone_reclaimable(zone) &&
        zone_balanced(zone, testorder, 0, classzone_idx)) {
        clear_bit(ZONE_CONGESTED, &zone->flags);
        clear_bit(ZONE_DIRTY, &zone->flags);
    }

    return sc->nr_scanned >= sc->nr_to_reclaim;---------------------------------------------scanning at least as many pages as we wanted to reclaim means enough pages were scanned.
}

shrink_zone() scans all reclaimable pages in a zone. Its parameters:

static bool shrink_zone(struct zone *zone, struct scan_control *sc, bool is_classzone)

zone is the zone about to be scanned, sc holds the scan-control parameters, and is_classzone says whether this zone is the first unbalanced zone that balance_pgdat() identified at the start.

static bool shrink_zone(struct zone *zone, struct scan_control *sc,
            bool is_classzone)
{
    struct reclaim_state *reclaim_state = current->reclaim_state;
    unsigned long nr_reclaimed, nr_scanned;
    bool reclaimable = false;

    do {
        struct mem_cgroup *root = sc->target_mem_cgroup;
        struct mem_cgroup_reclaim_cookie reclaim = {
            .zone = zone,
            .priority = sc->priority,
        };
        unsigned long zone_lru_pages = 0;
        struct mem_cgroup *memcg;

        nr_reclaimed = sc->nr_reclaimed;
        nr_scanned = sc->nr_scanned;

        memcg = mem_cgroup_iter(root, NULL, &reclaim);
        do {
            unsigned long lru_pages;
            unsigned long scanned;
            struct lruvec *lruvec;
            int swappiness;

            if (mem_cgroup_low(root, memcg)) {
                if (!sc->may_thrash)
                    continue;
                mem_cgroup_events(memcg, MEMCG_LOW, 1);
            }

            lruvec = mem_cgroup_zone_lruvec(zone, memcg);---------------------------get the LRU list set (lruvec) of the current memory cgroup for this zone,
            swappiness = mem_cgroup_swappiness(memcg);------------------------------and the vm_swappiness parameter, which controls how aggressively swap is used.
            scanned = sc->nr_scanned;

            shrink_lruvec(lruvec, swappiness, sc, &lru_pages);----------------------core function: scans the LRU lists and reclaims pages.
            zone_lru_pages += lru_pages;

            if (memcg && is_classzone)
                shrink_slab(sc->gfp_mask, zone_to_nid(zone),------------------------call the shrinker interfaces; many subsystems register shrinkers to give back memory.
                        memcg, sc->nr_scanned - scanned,
                        lru_pages);

            /*
             * Direct reclaim and kswapd have to scan all memory
             * cgroups to fulfill the overall scan target for the
             * zone.
             *
             * Limit reclaim, on the other hand, only cares about
             * nr_to_reclaim pages to be reclaimed and it will
             * retry with decreasing priority if one round over the
             * whole hierarchy is not sufficient.
             */
            if (!global_reclaim(sc) &&
                    sc->nr_reclaimed >= sc->nr_to_reclaim) {
                mem_cgroup_iter_break(root, memcg);
                break;
            }
        } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));---------------------iterate over the memory cgroups.

        /*
         * Shrink the slab caches in the same proportion that
         * the eligible LRU pages were scanned.
         */
        if (global_reclaim(sc) && is_classzone)
            shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
                    sc->nr_scanned - nr_scanned,
                    zone_lru_pages);

        if (reclaim_state) {
            sc->nr_reclaimed += reclaim_state->reclaimed_slab;
            reclaim_state->reclaimed_slab = 0;
        }

        vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
               sc->nr_scanned - nr_scanned,
               sc->nr_reclaimed - nr_reclaimed);

        if (sc->nr_reclaimed - nr_reclaimed)
            reclaimable = true;

    } while (should_continue_reclaim(zone, sc->nr_reclaimed - nr_reclaimed,
                     sc->nr_scanned - nr_scanned, sc));----------------------------------use this round's reclaimed and scanned counts to decide whether scanning should continue.

    return reclaimable;
}

shrink_lruvec() decides which LRU lists to reclaim from and how many pages to take; its two key inputs are swappiness and sc->priority.

It first computes, via get_scan_count(), how many pages of each LRU list should be scanned, then loops over the page types in the LRU; the core is calling shrink_list() to scan each list and reclaim from it.
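
The heart of get_scan_count() is a priority shift; simplified from this era's mm/vmscan.c (a sketch only: the real function further scales the result by swappiness and recent rotation history):

for_each_evictable_lru(lru) {
    unsigned long size = get_lru_size(lruvec, lru);  /* pages on this list */
    unsigned long scan = size >> sc->priority;       /* DEF_PRIORITY == 12: the
                                                      * first pass scans 1/4096
                                                      * of the list; each lower
                                                      * priority doubles that */
    nr[lru] = scan;
}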

static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
              struct scan_control *sc, unsigned long *lru_pages)
{
    unsigned long nr[NR_LRU_LISTS];
    unsigned long targets[NR_LRU_LISTS];
    unsigned long nr_to_scan;
    enum lru_list lru;
    unsigned long nr_reclaimed = 0;
    unsigned long nr_to_reclaim = sc->nr_to_reclaim;
    struct blk_plug plug;
    bool scan_adjusted;

    get_scan_count(lruvec, swappiness, sc, nr, lru_pages);------------------------------computes, from swappiness and sc->priority, how many pages to scan on each of the four LRU lists; the results land in nr[].

    /* Record the original scan target for proportional adjustments later */
    memcpy(targets, nr, sizeof(nr));

    /*
     * Global reclaiming within direct reclaim at DEF_PRIORITY is a normal
     * event that can occur when there is little memory pressure e.g.
     * multiple streaming readers/writers. Hence, we do not abort scanning
     * when the requested number of pages are reclaimed when scanning at
     * DEF_PRIORITY on the assumption that the fact we are direct
     * reclaiming implies that kswapd is not keeping up and it is best to
     * do a batch of work at once. For memcg reclaim one check is made to
     * abort proportional reclaim if either the file or anon lru has already
     * dropped to zero at the first pass.
     */
    scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
             sc->priority == DEF_PRIORITY);

    blk_start_plug(&plug);
    while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
                    nr[LRU_INACTIVE_FILE]) {---------------------------------------------LRU_ACTIVE_ANON is skipped: active anonymous pages cannot be reclaimed directly; they must age onto the inactive anonymous list first.
        unsigned long nr_anon, nr_file, percentage;
        unsigned long nr_scanned;

        for_each_evictable_lru(lru) {----------------------------------------------------walk the evictable LRU lists in turn; shrink_list() handles each specific case.
            if (nr[lru]) {
                nr_to_scan = min(nr[lru], SWAP_CLUSTER_MAX);
                nr[lru] -= nr_to_scan;

                nr_reclaimed += shrink_list(lru, nr_to_scan,
                                lruvec, sc);
            }
        }

        if (nr_reclaimed < nr_to_reclaim || scan_adjusted)-------------------------------fewer pages reclaimed than requested: continue scanning the next LRU list.
            continue;

        /*
         * For kswapd and memcg, reclaim at least the number of pages
         * requested. Ensure that the anon and file LRUs are scanned
         * proportionally what was requested by get_scan_count(). We
         * stop reclaiming one LRU and reduce the amount scanning
         * proportional to the original scan target.
         */
        nr_file = nr[LRU_INACTIVE_FILE] + nr[LRU_ACTIVE_FILE];
        nr_anon = nr[LRU_INACTIVE_ANON] + nr[LRU_ACTIVE_ANON];

        /*
         * It's just vindictive to attack the larger once the smaller
         * has gone to zero.  And given the way we stop scanning the
         * smaller below, this makes sure that we only make one nudge
         * towards proportionality once we've got nr_to_reclaim.
         */
        if (!nr_file || !nr_anon)--------------------------------------------------------either the anonymous or the file pages have been fully scanned: exit the loop.
            break;
...
    }
...
}

shrink_slab() shrinks the slab caches; it walks the shrinker_list.

Many kernel subsystems register a shrinker: shrinker->count_objects returns how many objects the slab cache could free, and shrinker->scan_objects scans and frees them.
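
For context, a subsystem registers a shrinker roughly as follows (a sketch against the post-v3.12 shrinker interface; the demo_* names are hypothetical):

static unsigned long demo_count_objects(struct shrinker *s,
                                        struct shrink_control *sc)
{
    return demo_nr_cached;                 /* hypothetical: number of freeable objects */
}

static unsigned long demo_scan_objects(struct shrinker *s,
                                       struct shrink_control *sc)
{
    return demo_free(sc->nr_to_scan);      /* hypothetical: frees objects, returns count */
}

static struct shrinker demo_shrinker = {
    .count_objects = demo_count_objects,
    .scan_objects  = demo_scan_objects,
    .seeks         = DEFAULT_SEEKS,
};

/* called once at subsystem init: register_shrinker(&demo_shrinker); */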

 

static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
                 struct mem_cgroup *memcg,
                 unsigned long nr_scanned,
                 unsigned long nr_eligible)
{
    struct shrinker *shrinker;
    unsigned long freed = 0;

    if (memcg && !memcg_kmem_is_active(memcg))
        return 0;
...
    list_for_each_entry(shrinker, &shrinker_list, list) {------------------walk the shrinker_list, taking each shrinker in turn.
        struct shrink_control sc = {
            .gfp_mask = gfp_mask,
            .nid = nid,
            .memcg = memcg,
        };

        if (memcg && !(shrinker->flags & SHRINKER_MEMCG_AWARE))
            continue;

        if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
            sc.nid = 0;

        freed += do_shrink_slab(&sc, shrinker, nr_scanned, nr_eligible);---shrink the slab cache, with shrink_control and the shrinker as parameters.
    }

    up_read(&shrinker_rwsem);
out:
    cond_resched();
    return freed;
}

shrink_list() below treats the different page types differently, also taking into account whether a swap partition is enabled. When an inactive list is low, the corresponding active list is shrunk.

static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
                 struct lruvec *lruvec, struct scan_control *sc)
{
    if (is_active_lru(lru)) {-----------------------------------------lru is LRU_ACTIVE_ANON or LRU_ACTIVE_FILE
        if (inactive_list_is_low(lruvec, lru))------------------------if the inactive file or anonymous list is low, shrink the active list.
            shrink_active_list(nr_to_scan, lruvec, sc, lru);
        return 0;
    }

    return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);---------shrink the LRU_INACTIVE_ANON and LRU_INACTIVE_FILE lists
}

The next two sections cover shrinking the active and the inactive lists.

 

5. The shrink_active_list Function

shrink_active_list() checks which active pages can be migrated to the inactive lists.

static void shrink_active_list(unsigned long nr_to_scan,
                   struct lruvec *lruvec,
                   struct scan_control *sc,
                   enum lru_list lru)
{
    unsigned long nr_taken;
    unsigned long nr_scanned;
    unsigned long vm_flags;
    LIST_HEAD(l_hold);    /* The pages which were snipped off */
    LIST_HEAD(l_active);
    LIST_HEAD(l_inactive);----------------------------------------------define three temporary lists: l_hold, l_active, and l_inactive.
    struct page *page;
    struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
    unsigned long nr_rotated = 0;
    isolate_mode_t isolate_mode = 0;
    int file = is_file_lru(lru);
    struct zone *zone = lruvec_zone(lruvec);

    lru_add_drain();

    if (!sc->may_unmap)
        isolate_mode |= ISOLATE_UNMAPPED;
    if (!sc->may_writepage)
        isolate_mode |= ISOLATE_CLEAN;---------------------------------set up isolate_mode

    spin_lock_irq(&zone->lru_lock);

    nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
                     &nr_scanned, sc, isolate_mode, lru);--------------isolate_mode restricts which pages get moved from the LRU list onto l_hold.
    if (global_reclaim(sc))
        __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);

    reclaim_stat->recent_scanned[file] += nr_taken;

    __count_zone_vm_events(PGREFILL, zone, nr_scanned);
    __mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
    __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
    spin_unlock_irq(&zone->lru_lock);

    while (!list_empty(&l_hold)) {------------------------------------scan the temporary l_hold list: some pages will go to l_active, some to l_inactive, and the remainder can be freed directly.
        cond_resched();
        page = lru_to_page(&l_hold);
        list_del(&page->lru);-----------------------------------------remove the page from the temporary l_hold list
        if (unlikely(!page_evictable(page))) {------------------------if the page is unevictable, put it back on the unevictable LRU list and move on to the next page.
            putback_lru_page(page);
            continue;
        }

        if (unlikely(buffer_heads_over_limit)) {
            if (page_has_private(page) && trylock_page(page)) {
                if (page_has_private(page))
                    try_to_release_page(page, 0);
                unlock_page(page);
            }
        }

        if (page_referenced(page, 0, sc->target_mem_cgroup,
                    &vm_flags)) {
            nr_rotated += hpage_nr_pages(page);
            /*
             * Identify referenced, file-backed active pages and
             * give them one more trip around the active list. So
             * that executable code get better chances to stay in
             * memory under moderate memory pressure.  Anon pages
             * are not likely to be evicted by use-once streaming
             * IO, plus JVM can create lots of anon VM_EXEC pages,
             * so we ignore them here.
             */
            if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {---executable page cache stays on the active list.
                list_add(&page->lru, &l_active);----------------------move the page onto l_active.
                continue;
            }
        }

        ClearPageActive(page);    /* we are de-activating */
        list_add(&page->lru, &l_inactive);---------------------------move the page onto l_inactive.
    }

    /*
     * Move pages back to the lru list.
     */
    spin_lock_irq(&zone->lru_lock);
    /*
     * Count referenced pages from currently used mappings as rotated,
     * even though only some of them are actually re-activated.  This
     * helps balance scan pressure between file and anonymous pages in
     * get_scan_count.
     */
    reclaim_stat->recent_rotated[file] += nr_rotated;

    move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
    move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);----splice l_active and l_inactive back into their LRU lists.
    __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
    spin_unlock_irq(&zone->lru_lock);

    mem_cgroup_uncharge_list(&l_hold);
    free_hot_cold_page_list(&l_hold, true);---------------------------------l_hold now holds whatever was left after extracting l_active and l_inactive; free those pages.
}

isolate_lru_pages() isolates pages from an LRU list.

nr_to_scan is the number of pages to scan on this list, lruvec is the set of LRU lists, dst is the temporary destination list, and nr_scanned returns how many pages were scanned.

 

static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
        struct lruvec *lruvec, struct list_head *dst,
        unsigned long *nr_scanned, struct scan_control *sc,
        isolate_mode_t mode, enum lru_list lru)
{
    struct list_head *src = &lruvec->lists[lru];
    unsigned long nr_taken = 0;
    unsigned long scan;

    for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
        struct page *page;
        int nr_pages;

        page = lru_to_page(src);
        prefetchw_prev_lru_page(page, src, flags);

        VM_BUG_ON_PAGE(!PageLRU(page), page);

        switch (__isolate_lru_page(page, mode)) {----------------isolate a single page; 0 means success, and the page is moved onto the dst temporary list.
        case 0:
            nr_pages = hpage_nr_pages(page);
            mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
            list_move(&page->lru, dst);
            nr_taken += nr_pages;
            break;

        case -EBUSY:
            /* else it is being freed elsewhere */
            list_move(&page->lru, src);
            continue;

        default:
            BUG();
        }
    }

    *nr_scanned = scan;
    trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
                    nr_taken, mode, is_file_lru(lru));
    return nr_taken;
}

 

 

6. The shrink_inactive_list Function

shrink_inactive_list() scans the inactive lists and reclaims pages.

static noinline_for_stack unsigned long
shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
             struct scan_control *sc, enum lru_list lru)
{
    LIST_HEAD(page_list);
    unsigned long nr_scanned;
    unsigned long nr_reclaimed = 0;
    unsigned long nr_taken;
    unsigned long nr_dirty = 0;
    unsigned long nr_congested = 0;
    unsigned long nr_unqueued_dirty = 0;
    unsigned long nr_writeback = 0;
    unsigned long nr_immediate = 0;
    isolate_mode_t isolate_mode = 0;
    int file = is_file_lru(lru);
    struct zone *zone = lruvec_zone(lruvec);
    struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;

    while (unlikely(too_many_isolated(zone, file, sc))) {
        congestion_wait(BLK_RW_ASYNC, HZ/10);

        /* We are about to die and free our memory. Return now. */
        if (fatal_signal_pending(current))
            return SWAP_CLUSTER_MAX;
    }

    lru_add_drain();

    if (!sc->may_unmap)
        isolate_mode |= ISOLATE_UNMAPPED;
    if (!sc->may_writepage)
        isolate_mode |= ISOLATE_CLEAN;

    spin_lock_irq(&zone->lru_lock);

    nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
                     &nr_scanned, sc, isolate_mode, lru);--------------isolate inactive pages onto the temporary page_list, subject to isolate_mode.

    __mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
    __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);

    if (global_reclaim(sc)) {
        __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
        if (current_is_kswapd())
            __count_zone_vm_events(PGSCAN_KSWAPD, zone, nr_scanned);
        else
            __count_zone_vm_events(PGSCAN_DIRECT, zone, nr_scanned);
    }
    spin_unlock_irq(&zone->lru_lock);

    if (nr_taken == 0)
        return 0;

    nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
                &nr_dirty, &nr_unqueued_dirty, &nr_congested,
                &nr_writeback, &nr_immediate,
                false);------------------------------------------------scan the pages on page_list and return the number of pages reclaimed.

    spin_lock_irq(&zone->lru_lock);

    reclaim_stat->recent_scanned[file] += nr_taken;

    if (global_reclaim(sc)) {
        if (current_is_kswapd())
            __count_zone_vm_events(PGSTEAL_KSWAPD, zone,
                           nr_reclaimed);
        else
            __count_zone_vm_events(PGSTEAL_DIRECT, zone,
                           nr_reclaimed);
    }

    putback_inactive_pages(lruvec, &page_list);-----------------------put pages that must be kept back onto the LRU lists

    __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);

    spin_unlock_irq(&zone->lru_lock);

    mem_cgroup_uncharge_list(&page_list);
    free_hot_cold_page_list(&page_list, true);------------------------free the pages remaining on page_list.

    /*
     * If reclaim is isolating dirty pages under writeback, it implies
     * that the long-lived page allocation rate is exceeding the page
     * laundering rate. Either the global limits are not being effective
     * at throttling processes due to the page distribution throughout
     * zones or there is heavy usage of a slow backing device. The
     * only option is to throttle from reclaim context which is not ideal
     * as there is no guarantee the dirtying process is throttled in the
     * same way balance_dirty_pages() manages.
     *
     * Once a zone is flagged ZONE_WRITEBACK, kswapd will count the number
     * of pages under pages flagged for immediate reclaim and stall if any
     * are encountered in the nr_immediate check below.
     */
    if (nr_writeback && nr_writeback == nr_taken)
        set_bit(ZONE_WRITEBACK, &zone->flags);

    /*
     * memcg will stall in page writeback so only consider forcibly
     * stalling for global reclaim
     */
    if (global_reclaim(sc)) {
        /*
         * Tag a zone as congested if all the dirty pages scanned were
         * backed by a congested BDI and wait_iff_congested will stall.
         */
        if (nr_dirty && nr_dirty == nr_congested)
            set_bit(ZONE_CONGESTED, &zone->flags);

        /*
         * If dirty pages are scanned that are not queued for IO, it
         * implies that flushers are not keeping up. In this case, flag
         * the zone ZONE_DIRTY and kswapd will start writing pages from
         * reclaim context.
         */
        if (nr_unqueued_dirty == nr_taken)
            set_bit(ZONE_DIRTY, &zone->flags);

        /*
         * If kswapd scans pages marked marked for immediate
         * reclaim and under writeback (nr_immediate), it implies
         * that pages are cycling through the LRU faster than
         * they are written so also forcibly stall.
         */
        if (nr_immediate && current_may_throttle())
            congestion_wait(BLK_RW_ASYNC, HZ/10);
    }

    /*
     * Stall direct reclaim for IO completions if underlying BDIs or zone
     * is congested. Allow kswapd to continue until it starts encountering
     * unqueued dirty pages or cycling through the LRU too quickly.
     */
    if (!sc->hibernation_mode && !current_is_kswapd() &&
        current_may_throttle())
        wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);

    trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
        zone_idx(zone),
        nr_scanned, nr_reclaimed,
        sc->priority,
        trace_shrink_flags(file));
    return nr_reclaimed;
}

  

7. Tracking LRU Activity

If a page on an LRU list is freed by another process, how does the LRU list learn that the page is gone?

The LRU lists are doubly linked lists; protecting their members from being freed by other kernel paths is a concurrency problem that the page-reclaim design has to address.

The _count reference counter in struct page plays the key role here.

Take shrink_active_list() isolating pages onto the temporary l_hold list as an example.

shrink_active_list()

  ->isolate_lru_pages()

    ->page = lru_to_page()

    ->get_page_unless_zero(page)

    ->ClearPageLRU(page)

So whenever a page is taken off an LRU list, a reference is acquired first: get_page_unless_zero() increments page->_count.

Putting the isolated pages back onto the LRU lists looks like this:

shrink_active_list()

  ->move_active_pages_to_lru()

    ->list_move(&page->lru, &lruvec->lists[lru])

    ->put_page_testzero(page)

Here page->_count is decremented; if it reaches 0, the page has already been released by another path, so PG_LRU is cleared and the page is removed from the LRU list.
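
The two helpers are, roughly, the following in this era's include/linux/mm.h:

static inline int get_page_unless_zero(struct page *page)
{
    /* take a reference only if the page is not already free */
    return atomic_inc_not_zero(&page->_count);
}

static inline int put_page_testzero(struct page *page)
{
    VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0, page);
    /* drop our reference; true means we dropped the last one */
    return atomic_dec_and_test(&page->_count);
}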

 

8. The Refault Distance Algorithm
