High-Memory Mapping: Allocating Noncontiguous Pages with vmalloc -- Linux Memory Management (19)

Date: 2019-11-08

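1 Allocating noncontiguous pages in memory

As the previous article explained, physically contiguous mappings are the best choice for the kernel, but they cannot always be obtained. When a large block of memory is requested, the kernel may fail to find a contiguous block however hard it tries.

This is not a problem in user space, because ordinary processes are designed to use the processor's paging mechanism, although this costs some speed and occupies TLB entries.

The same technique can be used in the kernel. The kernel reserves part of its virtual address space for setting up virtually contiguous mappings.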

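On IA-32 systems, the first 16 MB of physical memory belongs to the DMA zone, and the range from there up to 896 MB forms the directly mapped NORMAL zone. Directly after this direct mapping of the first 896 MB of physical memory, and after an inserted 8 MB safety gap, lies an area used to manage noncontiguous memory. This area has all the properties of a linear address space, yet the pages assigned to it may be located anywhere in physical memory. This is achieved by modifying the kernel page tables responsible for the area.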
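The position of that area can be worked out with a little arithmetic. The following is a minimal user-space sketch, assuming the classic 3 GB/1 GB split (a kernel virtual base of 0xC0000000) and assuming that the vmalloc area starts 8 MB above the end of the direct mapping; the helper name `vmalloc_start` is hypothetical, and the exact kernel macro may differ between kernel versions.

```c
#include <assert.h>
#include <stdint.h>

/* 8 MB safety gap between the direct mapping and the vmalloc area. */
#define VMALLOC_OFFSET (8u * 1024u * 1024u)

/*
 * Hypothetical helper: given the virtual address at which the direct
 * mapping ends (high_memory), return where the vmalloc area would start.
 * 32-bit arithmetic is used on purpose to mirror an IA-32 system.
 */
static uint32_t vmalloc_start(uint32_t high_memory)
{
    /* Guard against wrap-around at the top of the address space. */
    assert(high_memory <= UINT32_MAX - VMALLOC_OFFSET);
    return high_memory + VMALLOC_OFFSET;
}
```

With 896 MB mapped directly above 0xC0000000, the direct mapping ends at 0xF8000000, so the sketch places the start of the vmalloc area at 0xF8800000.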

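The persistent-mapping and fixmap address spaces are both fairly small, so we ignore them here. That leaves the direct mapping and the VMALLOC area, and the split between the two is best seen as a balance between two requirements:

Make the DMA and Normal zones, that is, the directly mapped address space, as large as possible. Current mainstream platforms usually have more than 512 MB of memory, and many ship with 1 GB as standard, so part of the memory inevitably cannot be mapped linearly.
Reserve a certain amount of VMALLOC space. The right value depends on the target platform: if a driver on that platform needs to allocate a very large address range with vmalloc, it is best to reserve more vmalloc address space by specifying the vmalloc size in the kernel parameters.
It is not true that having no Highmem, or as little as possible, is best. This is my personal understanding, for the following reason: high memory acts like a dustbin and a buffer that keeps mappings from user space or vmalloc from destroying the contiguity of the Normal and DMA zones and fragmenting them. The larger this dustbin is, the smaller the chance of polluting the Normal and DMA zones.

In this way the kernel's virtual address space is divided into several distinct regions.

The figure below shows the internal layout of the VMALLOC address space.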

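2 Allocating memory with vmalloc

vmalloc is the interface function that kernel code uses to allocate memory that is contiguous in virtual memory but not necessarily contiguous in physical memory.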

//  http://lxr.free-electrons.com/source/include/linux/vmalloc.h?v=4.7#L70
void *vmalloc(unsigned long size);

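The function takes a single parameter that specifies the length of the required memory area. Unlike the functions discussed earlier, the unit of length is bytes rather than pages, as is common in user-space programming.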
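Because the request is given in bytes while the memory is handed out in whole pages, the size has to be rounded up. The sketch below shows that rounding in user space, assuming 4 KiB pages; `page_align` and `pages_for` are illustrative names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12                      /* assumption: 4 KiB pages */
#define PAGE_SIZE  ((size_t)1 << PAGE_SHIFT)

/* Round a byte count up to the next multiple of the page size. */
static size_t page_align(size_t size)
{
    /* Reject sizes for which the rounding itself would overflow. */
    assert(size <= (size_t)-1 - (PAGE_SIZE - 1));
    return (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
}

/* Number of whole pages needed to back a request of `size` bytes. */
static size_t pages_for(size_t size)
{
    return page_align(size) >> PAGE_SHIFT;
}
```

A request for a single byte therefore still consumes one full page, and a request for 4097 bytes consumes two.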

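The best-known user of vmalloc is the kernel's module implementation. Because modules can be loaded at any time, there is no guarantee, particularly on a system that has been running for a long time, that enough contiguous memory is available for a module with a lot of data.

vmalloc sidesteps this problem as long as enough memory can be pieced together from smaller blocks.

The kernel calls vmalloc in roughly 400 other places, particularly in device and sound drivers.

Because the pages used for vmalloc must in any case be mapped explicitly into the kernel address space, it is preferable to take them from ZONE_HIGHMEM rather than from other zones. This lets the kernel conserve the more valuable lower zones without incurring any additional drawbacks. vmalloc and the related mapping functions are therefore one of the few cases in which the kernel uses high-memory pages for its own purposes (and not on behalf of user-space applications).

All data structures and API declarations related to vmalloc are in include/linux/vmalloc.h

Header	NON-MMU implementation	MMU implementation
include/linux/vmalloc.h	mm/nommu.c	mm/vmalloc.c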

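2.1 Data structures

When managing the vmalloc area of virtual memory, the kernel must keep track of which subregions are in use and which are free. For this purpose it defines the data structure vm_struct, whose instances describe the used parts and can be chained into a linked list. The structure is defined in include/linux/vmalloc.h?v=4.7, line 32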

// http://lxr.free-electrons.com/source/include/linux/vmalloc.h?v=4.7#L32
struct vm_struct {
    struct vm_struct    *next;
    void            *addr;
    unsigned long       size;
    unsigned long       flags;
    struct page         **pages;
    unsigned int        nr_pages;
    phys_addr_t         phys_addr;
    const void          *caller;
};

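Note that the kernel uses another important data structure, vm_area_struct, to manage the contents of the virtual address space of user-space processes. Although the names and purposes are similar, and both describe virtual address space mappings, the two structures must not be confused.

vm_struct describes mappings in the kernel virtual address space, whereas vm_area_struct describes mappings in the virtual address space of a process.

Pages behind a vm_struct are allocated and mapped at allocation time rather than on a page fault, whereas regions described by vm_area_struct are usually not populated in advance: their pages are allocated by the page-fault handler only when they are first accessed.

Each subregion allocated with vmalloc corresponds to one instance of this structure in kernel memory. The meaning of its members is as follows

Field	Description
next	Lets the kernel keep all subregions of the vmalloc area on a singly linked list
addr	Start address of the allocated subregion in the virtual address space
size	Length of the subregion; together with addr it can be used to sketch the complete allocation layout of the vmalloc area
flags	The almost inevitable set of flags associated with the memory area; it is used only to specify the type of the area
pages	Pointer to an array of page pointers; each element represents the page instance of a physical page mapped into the virtual address space
nr_pages	Number of entries in pages, that is, the number of physical pages involved
phys_addr	Needed only when a physical memory region described by a physical address has been mapped with ioremap; that address is stored here
caller	Return address of the caller that requested the area (the allocation functions below pass __builtin_return_address(0) or an explicit caller argument)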

flags is used only to specify the type of the memory area. All possible flags are defined as macros in include/linux/vmalloc.h?v=4.7, line 14

//  http://lxr.free-electrons.com/source/include/linux/vmalloc.h?v=4.7#L14
/* bits in flags of vmalloc's vm_struct below */
#define VM_IOREMAP              0x00000001      /* ioremap() and friends */
#define VM_ALLOC                0x00000002      /* vmalloc() */
#define VM_MAP                  0x00000004      /* vmap()ed pages */
#define VM_USERMAP              0x00000008      /* suitable for remap_vmalloc_range */
#define VM_UNINITIALIZED        0x00000020      /* vm_struct is not fully initialized */
#define VM_NO_GUARD             0x00000040      /* don't add guard page */
#define VM_KASAN                0x00000080      /* has allocated kasan shadow memory */
/* bits [20..32] reserved for arch specific ioremap internals */

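Flag	Description
VM_IOREMAP	An almost arbitrary physical memory region has been mapped into the vmalloc area (ioremap() and friends); this is an architecture-specific operation
VM_ALLOC	Subregion created by vmalloc
VM_MAP	An existing set of pages has been mapped into a contiguous virtual address range (vmap)
VM_USERMAP	Suitable for remap_vmalloc_range
VM_UNINITIALIZED	The vm_struct is not fully initialized yet
VM_NO_GUARD	Do not add a guard page
VM_KASAN	KASAN shadow memory has been allocated for the area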
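Since the flags are plain bit masks, they are combined with `|` and tested with `&`. The following user-space sketch copies the values from the flag list above and shows the life cycle of the flags of a vmalloc area; the three helper functions are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the flag definitions quoted above. */
#define VM_IOREMAP        0x00000001UL
#define VM_ALLOC          0x00000002UL
#define VM_MAP            0x00000004UL
#define VM_UNINITIALIZED  0x00000020UL
#define VM_NO_GUARD       0x00000040UL

/* Flags of a freshly reserved vmalloc area plus caller-supplied extras. */
static unsigned long new_vmalloc_flags(unsigned long extra)
{
    /* Sketch-only sanity check: a vmalloc area is not an ioremap/vmap area. */
    assert((extra & (VM_IOREMAP | VM_MAP)) == 0);
    return VM_ALLOC | VM_UNINITIALIZED | extra;
}

/* Clear the "still being set up" bit once the area is complete. */
static unsigned long mark_initialized(unsigned long flags)
{
    return flags & ~VM_UNINITIALIZED;
}

/* True if the area came from vmalloc and has been fully set up. */
static bool area_ready_vmalloc(unsigned long flags)
{
    return (flags & VM_ALLOC) && !(flags & VM_UNINITIALIZED);
}
```

This mirrors what the allocation path shown later does: the area is created with VM_ALLOC | VM_UNINITIALIZED, and the second bit is cleared after the pages have been mapped.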

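The figure below shows an example of how the structure is used. Three (hypothetical) physical pages, located at positions 1023, 725 and 7311 in physical memory, are mapped one after another. In the virtual vmalloc area the kernel sees them as one contiguous region of 3*PAGE_SIZE bytes of kernel address space starting at VMALLOC_START + 100.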
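The example can be replayed in user space. The sketch below is a cut-down model, not the kernel structure: `struct sim_area` and `sim_vaddr_to_pfn` are hypothetical names, physical frame numbers stand in for the `struct page` pointers, and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12UL                    /* assumption: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Cut-down stand-in for vm_struct: frame numbers replace struct page *. */
struct sim_area {
    unsigned long        addr;     /* start of the virtual range */
    unsigned long        size;     /* length in bytes, guard page excluded */
    const unsigned long *pfns;     /* one physical frame number per page */
    unsigned int         nr_pages;
};

/* Return the physical frame backing `vaddr`, or -1 if it is outside the area. */
static long sim_vaddr_to_pfn(const struct sim_area *area, unsigned long vaddr)
{
    unsigned long index;

    assert(area != NULL);
    if (vaddr < area->addr || vaddr - area->addr >= area->size)
        return -1;
    index = (vaddr - area->addr) >> PAGE_SHIFT;
    return (long)area->pfns[index];
}
```

Consecutive virtual pages of the area resolve to frames 1023, 725 and 7311 in that order, although the frames are scattered in physical memory.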

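2.2 Creating a vm_area

Because most architectures have an MMU, we only consider the MMU case here. Without MMU support vmalloc cannot map noncontiguous physical addresses into a contiguous kernel address range, and it degenerates into a kmalloc-based implementation (mm/nommu.c).

2.2.1 The global vmlist list

Before a new virtual memory area can be created, a suitable location for it must be found. A list of vm_struct instances records the subregions already set up in the vmalloc area, and the global variable vmlist in mm/vmalloc.c is its head. Note the __initdata annotation: in this kernel version the variable only exists during initialization. It is defined in mm/vmalloc.c?v=4.7, line 1170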

// http://lxr.free-electrons.com/source/mm/vmalloc.c?v=4.7#L1170
static struct vm_struct *vmlist __initdata;

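2.2.2 Allocation functions

In mm/vmalloc.c the kernel provides the helper functions get_vm_area and __get_vm_area. They only prepare the parameters and leave the actual allocation to the low-level function __get_vm_area_node. These functions are defined in mm/vmalloc.c?v=4.7, line 1388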

struct vm_struct *__get_vm_area(unsigned long size, unsigned long flags,
                unsigned long start, unsigned long end)
{
    return __get_vm_area_node(size, 1, flags, start, end, NUMA_NO_NODE,
                  GFP_KERNEL, __builtin_return_address(0));
}
EXPORT_SYMBOL_GPL(__get_vm_area);

struct vm_struct *__get_vm_area_caller(unsigned long size, unsigned long flags,
                       unsigned long start, unsigned long end,
                       const void *caller)
{
    return __get_vm_area_node(size, 1, flags, start, end, NUMA_NO_NODE,
                  GFP_KERNEL, caller);
}

/**
 *      get_vm_area  -  reserve a contiguous kernel virtual area
 *      @size:      size of the area
 *      @flags:     %VM_IOREMAP for I/O mappings or VM_ALLOC
 *
 *      Search an area of @size in the kernel virtual mapping area,
 *      and reserved it for out purposes.  Returns the area descriptor
 *      on success or %NULL on failure.
 */
struct vm_struct *get_vm_area(unsigned long size, unsigned long flags)
{
    return __get_vm_area_node(size, 1, flags, VMALLOC_START, VMALLOC_END,
                  NUMA_NO_NODE, GFP_KERNEL,
                  __builtin_return_address(0));
}

struct vm_struct *get_vm_area_caller(unsigned long size, unsigned long flags,
                const void *caller)
{
    return __get_vm_area_node(size, 1, flags, VMALLOC_START, VMALLOC_END,
                  NUMA_NO_NODE, GFP_KERNEL, caller);
}

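These functions are front ends for __get_vm_area_node, which does the actual work. Based on the length of the subregion, __get_vm_area_node tries to find a suitable location in the virtual vmalloc space. The function is defined in mm/vmalloc.c?v=4.7, line 1354

Because one page (the guard page) must be inserted between neighbouring vmalloc subregions as a safety gap, the kernel first increases the length to be reserved accordingly, unless VM_NO_GUARD is set.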

static struct vm_struct *__get_vm_area_node(unsigned long size,
        unsigned long align, unsigned long flags, unsigned long start,
        unsigned long end, int node, gfp_t gfp_mask, const void *caller)
{
    struct vmap_area *va;
    struct vm_struct *area;

    BUG_ON(in_interrupt());
    if (flags & VM_IOREMAP)
        align = 1ul << clamp_t(int, fls_long(size),
                       PAGE_SHIFT, IOREMAP_MAX_ORDER);

    size = PAGE_ALIGN(size);
    if (unlikely(!size))
        return NULL;

    area = kzalloc_node(sizeof(*area), gfp_mask & GFP_RECLAIM_MASK, node);
    if (unlikely(!area))
        return NULL;

    if (!(flags & VM_NO_GUARD))
        size += PAGE_SIZE;

    va = alloc_vmap_area(size, align, start, end, node, gfp_mask);
    if (IS_ERR(va)) {
        kfree(area);
        return NULL;
    }

    setup_vmalloc_vm(area, va, flags, caller);

    return area;
}
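The size handling in this function can be isolated into a few lines. The following is a user-space sketch under the assumption of 4 KiB pages; `reserved_bytes` is a hypothetical helper that only reproduces the arithmetic (page alignment, early exit for an empty request, one extra guard page), not the allocation itself.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE   4096UL                 /* assumption: 4 KiB pages */
#define VM_NO_GUARD 0x00000040UL           /* value from the flag list */

/*
 * Bytes of virtual address space that a request occupies:
 * the page-aligned size plus one guard page unless VM_NO_GUARD is set.
 * Returns 0 for an empty request, mirroring the early NULL return.
 */
static unsigned long reserved_bytes(unsigned long size, unsigned long flags)
{
    /* Keep the rounding and the guard page from overflowing. */
    assert(size <= (unsigned long)-1 - 2 * PAGE_SIZE);
    size = (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
    if (size == 0)
        return 0;
    if (!(flags & VM_NO_GUARD))
        size += PAGE_SIZE;
    return size;
}
```

A one-byte request thus reserves two pages of address space by default, of which only the first is ever backed by memory.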

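The start and end parameters are set by the caller; get_vm_area and get_vm_area_caller, for example, pass VMALLOC_START and VMALLOC_END. The search for a free range between these bounds is delegated to alloc_vmap_area, which in this kernel version works on vmap_area descriptors rather than walking vmlist element by element. Once a suitable range has been found, setup_vmalloc_vm ties the vm_struct to it.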
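The idea of searching for a free range between two bounds can be illustrated with a simple first-fit scan. This is a deliberately simplified stand-in, not the kernel's algorithm: it uses a sorted array instead of the kernel's own bookkeeping, ignores alignment, and the names `sim_range` and `sim_find_hole` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One occupied range [start, end); the array must be sorted by start. */
struct sim_range {
    unsigned long start;
    unsigned long end;
};

/*
 * First-fit search: return the lowest address in [vstart, vend) where
 * `size` bytes fit without touching any used range, or 0 if none exists.
 */
static unsigned long sim_find_hole(const struct sim_range *used, size_t n,
                                   unsigned long size,
                                   unsigned long vstart, unsigned long vend)
{
    unsigned long addr = vstart;
    size_t i;

    assert(size > 0 && vstart < vend);
    for (i = 0; i < n; i++) {
        if (used[i].end <= addr)
            continue;                 /* range lies entirely below addr */
        if (used[i].start >= addr && used[i].start - addr >= size)
            break;                    /* the hole before this range fits */
        addr = used[i].end;           /* skip past the occupied range */
    }
    if (addr >= vend || vend - addr < size)
        return 0;
    return addr;
}
```

If the caller passes the guard-page-inflated size, the gap between two neighbouring areas falls out of the search automatically.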

2.2.3 Release function

remove_vm_area removes an existing subregion from the vmalloc address space.

The function is declared in include/linux/vmalloc.h?v=4.7, line 121

//  http://lxr.free-electrons.com/source/include/linux/vmalloc.h?v=4.7#L121
struct vm_struct *remove_vm_area(const void *addr);

The function is defined in mm/vmalloc.c?v=4.7, line 1454

//  http://lxr.free-electrons.com/source/mm/vmalloc.c?v=4.7#L1446
/**
 *      remove_vm_area  -  find and remove a continuous kernel virtual area
 *      @addr:      base address
 *
 *      Search for the kernel VM area starting at @addr, and remove it.
 *      This function returns the found VM area, but using it is NOT safe
 *      on SMP machines, except for its size or flags.
 */
struct vm_struct *remove_vm_area(const void *addr)
{
    struct vmap_area *va;

    va = find_vmap_area((unsigned long)addr);
    if (va && va->flags & VM_VM_AREA) {
        struct vm_struct *vm = va->vm;

        spin_lock(&vmap_area_lock);
        va->vm = NULL;
        va->flags &= ~VM_VM_AREA;
        spin_unlock(&vmap_area_lock);

        vmap_debug_free_range(va->va_start, va->va_end);
        kasan_free_shadow(vm);
        free_unmap_vmap_area(va);

        return vm;
    }
    return NULL;
}

2.3 Allocating a memory area with vmalloc

vmalloc initiates the allocation of a noncontiguous memory area. The function itself is only a front end: it passes its arguments to __vmalloc_node_flags, which in turn is a front end for __vmalloc_node, and that function finally hands the actual work to __vmalloc_node_range.

vmalloc is defined in mm/vmalloc.c?v=4.7, line 1754 and leaves the actual work to __vmalloc_node_flags.

//  http://lxr.free-electrons.com/source/mm/vmalloc.c?v=4.7#L1754
/**
 *      vmalloc  -  allocate virtually contiguous memory
 *      @size:      allocation size
 *      Allocate enough pages to cover @size from the page level
 *      allocator and map them into contiguous kernel virtual space.
 *
 *      For tight control over page level allocator and protection flags
 *      use __vmalloc() instead.
 */
void *vmalloc(unsigned long size)
{
    return __vmalloc_node_flags(size, NUMA_NO_NODE,
                    GFP_KERNEL | __GFP_HIGHMEM);
}
EXPORT_SYMBOL(vmalloc);

__vmalloc_node_flags is defined in mm/vmalloc.c?v=4.7, line 1747 and does its work through __vmalloc_node.

//  http://lxr.free-electrons.com/source/mm/vmalloc.c?v=4.7#L1747
static inline void *__vmalloc_node_flags(unsigned long size,
                    int node, gfp_t flags)
{
    return __vmalloc_node(size, 1, flags, PAGE_KERNEL,
                    node, __builtin_return_address(0));
}

__vmalloc_node is defined in mm/vmalloc.c?v=4.7, line 1719 and does its work through __vmalloc_node_range.

//  http://lxr.free-electrons.com/source/mm/vmalloc.c?v=4.7#L1719
/**
 *      __vmalloc_node  -  allocate virtually contiguous memory
 *      @size:      allocation size
 *      @align:     desired alignment
 *      @gfp_mask:      flags for the page level allocator
 *      @prot:      protection mask for the allocated pages
 *      @node:      node to use for allocation or NUMA_NO_NODE
 *      @caller:    caller's return address
 *
 *      Allocate enough pages to cover @size from the page level
 *      allocator with @gfp_mask flags.  Map them into contiguous
 *      kernel virtual space, using a pagetable protection of @prot.
 */
static void *__vmalloc_node(unsigned long size, unsigned long align,
                gfp_t gfp_mask, pgprot_t prot,
                int node, const void *caller)
{
    return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END,
                gfp_mask, prot, 0, node, caller);
}

__vmalloc_node_range finally performs the allocation of the memory area.

//  http://lxr.free-electrons.com/source/mm/vmalloc.c?v=4.7#L1658
/**
 *      __vmalloc_node_range  -  allocate virtually contiguous memory
 *      @size:      allocation size
 *      @align:     desired alignment
 *      @start:     vm area range start
 *      @end:       vm area range end
 *      @gfp_mask:      flags for the page level allocator
 *      @prot:      protection mask for the allocated pages
 *      @vm_flags:      additional vm area flags (e.g. %VM_NO_GUARD)
 *      @node:      node to use for allocation or NUMA_NO_NODE
 *      @caller:    caller's return address
 *
 *      Allocate enough pages to cover @size from the page level
 *      allocator with @gfp_mask flags.  Map them into contiguous
 *      kernel virtual space, using a pagetable protection of @prot.
 */
void *__vmalloc_node_range(unsigned long size, unsigned long align,
            unsigned long start, unsigned long end, gfp_t gfp_mask,
            pgprot_t prot, unsigned long vm_flags, int node,
            const void *caller)
{
    struct vm_struct *area;
    void *addr;
    unsigned long real_size = size;

    size = PAGE_ALIGN(size);
    if (!size || (size >> PAGE_SHIFT) > totalram_pages)
        goto fail;

    area = __get_vm_area_node(size, align, VM_ALLOC | VM_UNINITIALIZED |
                vm_flags, start, end, node, gfp_mask, caller);
    if (!area)
        goto fail;

    addr = __vmalloc_area_node(area, gfp_mask, prot, node);
    if (!addr)
        return NULL;

    /*
     * In this function, newly allocated vm_struct has VM_UNINITIALIZED
     * flag. It means that vm_struct is not fully initialized.
     * Now, it is fully initialized, so remove this flag here.
     */
    clear_vm_uninitialized_flag(area);

    /*
     * A ref_count = 2 is needed because vm_struct allocated in
     * __get_vm_area_node() contains a reference to the virtual address of
     * the vmalloc'ed block.
     */
    kmemleak_alloc(addr, real_size, 2, gfp_mask);

    return addr;

fail:
    warn_alloc_failed(gfp_mask, 0,
              "vmalloc: allocation failure: %lu bytes\n",
              real_size);
    return NULL;
}

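The implementation consists of three parts

First, __get_vm_area_node finds a suitable region in the vmalloc address space.
Next, the individual pages are allocated from physical memory.
Finally, these pages are mapped contiguously into the vmalloc region, which completes the virtual memory allocation. The second and third step are carried out by __vmalloc_area_node.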
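The three steps can be imitated in user space to make the control flow concrete. The sketch below is only an analogy: `malloc` stands in for the page allocator, an array of pointers stands in for the page-table mapping, and `sim_area`, `sim_vmalloc` and `sim_vfree` are hypothetical names. The `total_pages` parameter plays the role of the sanity check against the total amount of RAM.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL                   /* assumption: 4 KiB pages */

/* Hypothetical user-space stand-in for a vmalloc area. */
struct sim_area {
    size_t  nr_pages;
    void  **pages;      /* each entry is a separately allocated "page" */
};

/* Undo a (possibly partial) allocation. */
static void sim_vfree(struct sim_area *area)
{
    size_t i;

    if (!area)
        return;
    for (i = 0; i < area->nr_pages; i++)
        free(area->pages[i]);
    free(area->pages);
    free(area);
}

/* Step 1: set up the descriptor; steps 2 and 3: get pages and record them. */
static struct sim_area *sim_vmalloc(size_t size, size_t total_pages)
{
    struct sim_area *area;
    size_t want;

    assert(size <= (size_t)-1 - PAGE_SIZE);
    want = (size + PAGE_SIZE - 1) / PAGE_SIZE;
    if (want == 0 || want > total_pages)
        return NULL;                       /* empty or impossibly large */
    area = calloc(1, sizeof(*area));
    if (!area)
        return NULL;
    area->pages = calloc(want, sizeof(*area->pages));
    if (!area->pages) {
        free(area);
        return NULL;
    }
    for (; area->nr_pages < want; area->nr_pages++) {
        area->pages[area->nr_pages] = malloc(PAGE_SIZE);
        if (!area->pages[area->nr_pages]) {
            sim_vfree(area);               /* frees the pages obtained so far */
            return NULL;
        }
    }
    return area;
}
```

The pages are obtained one at a time, which is exactly why no large contiguous block is needed, and a failure halfway through releases everything acquired up to that point.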

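If a node for the page frames has been specified explicitly, the kernel calls alloc_pages_node; otherwise it uses alloc_page to take the page frames from the current node.

The allocated pages are removed from the buddy system of the node concerned. vmalloc sets gfp_mask to GFP_KERNEL | __GFP_HIGHMEM when making the call, which tells the memory management subsystem to take the page frames from ZONE_HIGHMEM whenever possible. The reason was given above: page frames in the lower zones are more valuable and should not be wasted on vmalloc allocations, so high-memory page frames are used here where available.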

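3 Alternative mapping methods

Besides vmalloc there are other ways to create virtually contiguous mappings. All of them are based on the __vmalloc function mentioned above or use a very similar mechanism

vmalloc_32 works in the same way as vmalloc but ensures that the physical memory used can always be addressed with ordinary 32-bit pointers. This guarantee matters when an architecture can address more memory than its word length suggests, as is the case on IA-32 systems with PAE enabled.
vmap takes an array of page instances as its starting point and creates a virtually contiguous memory area from them. In contrast to vmalloc, the physical memory is not allocated implicitly; it must be allocated beforehand and passed in as a parameter. Mappings of this kind can be recognized by the VM_MAP flag in their vm_struct instance.
Unlike all the mapping methods above, ioremap is a processor-specific function that must be implemented on every architecture. It maps a block of memory taken from the physical address space, and used by the system buses for I/O operations, into the address space of the kernel.

The function is used heavily in device drivers, where it makes the address ranges used for communicating with peripherals available to the rest of the kernel (and, of course, to the driver itself).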
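The guarantee given by vmalloc_32 can be expressed as a condition on physical frame numbers. The sketch below only checks that condition in user space, assuming 4 KiB pages; it does not show how the kernel selects such pages, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12                      /* assumption: 4 KiB pages */

/* True if the whole frame `pfn` lies below the 4 GiB physical boundary. */
static bool pfn_is_32bit_addressable(uint64_t pfn)
{
    return pfn < ((uint64_t)1 << (32 - PAGE_SHIFT));
}

/* True if every frame in the array can be reached with a 32-bit address. */
static bool all_pages_32bit(const uint64_t *pfns, size_t n)
{
    size_t i;

    assert(pfns != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (!pfn_is_32bit_addressable(pfns[i]))
            return false;
    return true;
}
```

With PAE, frame numbers of 0x100000 and above exist, and a single such frame in the set would violate the guarantee.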

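4 Freeing memory

Two functions return memory to the kernel: vfree releases areas allocated with vmalloc and vmalloc_32, and vunmap releases mappings created with vmap (drivers release ioremap mappings through the matching iounmap). Both vfree and vunmap come down to __vunmap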

void __vunmap(void *addr, int deallocate_pages)

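addr is the start address of the area to be released, and deallocate_pages specifies whether the physical pages associated with the area are returned to the buddy system. vfree sets the latter parameter to 1, whereas vunmap sets it to 0, because in that case only the mapping is removed and the physical pages are not handed back to the buddy system. Figure 3-40 shows the code flow diagram of __vunmap

The length of the area to be released need not be given explicitly, because it can be derived from the information the kernel keeps about the area. The first task of __vunmap is therefore to look up the matching entry. As the remove_vm_area listing above shows, the lookup is done by find_vmap_area on the start address, and the vm_struct is detached while vmap_area_lock is held.

With the descriptor found, the entries that are no longer needed are removed from the page tables (unmap_vm_area in the original description; in the listing above this step is reached through free_unmap_vmap_area). As with allocation, all levels of the page tables must be walked, but this time the entries are deleted. The CPU caches are updated as well.

If the deallocate_pages argument of __vunmap is set to 1 (as in vfree), the kernel iterates over all elements of area->pages, that is, the pointers to the page instances of the physical pages involved, and calls __free_page on each one to return the page to the buddy system.

Finally, the kernel data structures used to manage the memory area must be released.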
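The difference between the two entry points comes down to a single flag, which a small user-space model makes visible. This is a sketch under simplifying assumptions: `free` stands in for returning pages to the buddy system, the descriptor is the only "mapping" that exists, and all names prefixed with `sim_` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space model of a mapped area. */
struct sim_area {
    size_t  nr_pages;
    void  **pages;
};

static size_t sim_pages_freed;             /* counter so the effect is visible */

/*
 * Model of the common release path: the descriptor always goes away;
 * the pages are released only when deallocate_pages is set.
 */
static void sim_vunmap_common(struct sim_area *area, int deallocate_pages)
{
    size_t i;

    assert(area != NULL);
    if (deallocate_pages) {
        for (i = 0; i < area->nr_pages; i++) {
            free(area->pages[i]);
            sim_pages_freed++;
        }
        free(area->pages);                 /* the array was owned by the area */
    }
    free(area);                            /* management structure goes last */
}

/* vfree-like: pages were allocated for the area, so release them too. */
static void sim_vfree(struct sim_area *area)  { sim_vunmap_common(area, 1); }

/* vunmap-like: pages belong to the caller, so only drop the descriptor. */
static void sim_vunmap(struct sim_area *area) { sim_vunmap_common(area, 0); }
```

After `sim_vunmap` the caller still owns the pages and the array holding them, which matches the vmap contract that the memory is allocated beforehand and passed in.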
