Linux Memory Management: the Slab Allocator

What is the slab allocator?
Reference: http://blog.csdn.net/vanbreaker/article/details/7664296
The slab allocator is an important and fairly complex part of Linux memory management. It targets objects that are frequently allocated and freed, such as process descriptors. These objects are usually small; allocating and freeing them directly through the buddy system would not only produce a lot of internal fragmentation but would also be too slow. The slab allocator instead manages memory by object: objects of the same type are grouped into one class (process descriptors, for instance, form one class). Whenever such an object is requested, the slab allocator hands out a unit of that size from a slab list, and when the object is freed it is put back on that list rather than returned to the buddy system. When allocating, the slab allocator reuses the most recently freed object's memory block, which is therefore more likely to still be resident in the CPU cache.

What problems does the slab allocator mainly solve?
Reference: Linux Kernel Development, 3rd Edition
Allocating and freeing data structures is one of the most common operations inside any kernel. To facilitate frequent allocations and deallocations of data, programmers often introduce free lists.
A free list contains a block of available, already allocated, data structures. When code requires a new instance of a data structure, it can grab one of the structures off the free list rather than allocate the sufficient amount of memory and set it up for the data structure. Later, when the data structure is no longer needed, it is returned to the free list instead of deallocated. In this sense, the free list acts as an object cache, caching a frequently used type of object.

One of the main problems with free lists in the kernel is that there exists no global control. When available memory is low, there is no way for the kernel to communicate to every free list that it should shrink the sizes of its cache to free up memory. The kernel has no understanding of the random free lists at all.
To remedy this, and to consolidate code, the Linux kernel provides the slab layer (also called the slab allocator). The slab layer acts as a generic data structure-caching layer.
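To make the free-list idea concrete, here is a minimal user-space sketch of such an object cache. The names (struct obj, obj_alloc, obj_free) are made up for illustration; note that nothing outside this code knows the list exists, which is exactly the "no global control" problem described above.

#include <stdlib.h>

/* A frequently allocated object; the payload size is arbitrary. */
struct obj {
    struct obj *next;   /* link used only while the object sits on the free list */
    char payload[120];
};

/* The private free list: only this code knows about it, so the rest of the
 * system cannot ask it to shrink when memory runs low. */
static struct obj *free_list;

static struct obj *obj_alloc(void)
{
    struct obj *o = free_list;

    if (o) {                        /* reuse a previously freed object (LIFO) */
        free_list = o->next;
        return o;
    }
    return malloc(sizeof(*o));      /* otherwise fall back to the real allocator */
}

static void obj_free(struct obj *o)
{
    o->next = free_list;            /* push back onto the free list instead of free() */
    free_list = o;
}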

In short: the kernel needs such object caches, those caches need to be under global control, and the slab layer also addresses the memory-fragmentation problem.

In contrast with the buddy system: Linux uses the buddy system to deal with external fragmentation, and the slab allocator to deal with internal fragmentation.

How is the slab layer designed?
A simple way to picture it: cache > slab > object.
One cache corresponds to one type of object. Each cache can consist of several slabs, and a slab consists of one or more physically contiguous pages, usually a single page. The objects can be task_struct, inode and so on, and the cache itself also needs a structure to describe it.
Figure: the relationship between caches, slabs and objects.

What are the key data structures in the slab layer, and how do they relate to each other?
Reference: linux-2.6.26\mm\slab.c. Only the parts of the source we care about are shown here; pay attention to the comments in the source.

/*
 * struct slab
 *
 * Manages the objs in a slab. Placed either at the beginning of mem allocated
 * for a slab, or allocated from an general cache.
 * Slabs are chained into three list: fully used, partial, fully free slabs.
 */
struct slab {
    struct list_head list;      /* linkage into one of kmem_list3's three lists */
    unsigned long colouroff;    /* offset of the first object within the slab's pages */
    void *s_mem;        /* including colour offset */
    unsigned int inuse; /* num of objs active in slab */
    kmem_bufctl_t free; /* index of the next free object in this slab */
    unsigned short nodeid;      /* NUMA node the slab's pages were allocated on */
};

//------------------------------------------------------
/*
 * struct array_cache
 *
 * Purpose:
 * - LIFO ordering, to hand out cache-warm objects from _alloc
 * - reduce the number of linked list operations
 * - reduce spinlock operations
 *
 * The limit is stored in the per-cpu structure to reduce the data cache
 * footprint.
 *
 */
struct array_cache {
    unsigned int avail;         /* number of objects currently available in entry[] */
    unsigned int limit;         /* maximum number of free objects to keep per CPU */
    unsigned int batchcount;    /* how many objects to move at once on refill/drain */
    unsigned int touched;       /* set on allocation; checked by the cache reaper */
    spinlock_t lock;
    void *entry[];  /*
                     * Must have this definition in here for the proper
                     * alignment of array_cache. Also simplifies accessing
                     * the entries.
                     */
};

//------------------------------------------------------
/*
 * The slab lists for all objects.
 */
struct kmem_list3 {
    struct list_head slabs_partial; /* partial list first, better asm code */
    struct list_head slabs_full;
    struct list_head slabs_free;
    unsigned long free_objects;
    unsigned int free_limit;
    unsigned int colour_next;   /* Per-node cache coloring */
    spinlock_t list_lock;
    struct array_cache *shared; /* shared per node */
    struct array_cache **alien; /* on other nodes */
    unsigned long next_reap;    /* updated without locking */
    int free_touched;       /* updated without locking */
};

//------------------------------------------------------
/*
 * struct kmem_cache
 *
 * manages a cache.
 */

struct kmem_cache {
/* 1) per-cpu data, touched during every alloc/free */
    struct array_cache *array[NR_CPUS];
/* 2) Cache tunables. Protected by cache_chain_mutex */
    unsigned int batchcount;
    unsigned int limit;
    unsigned int shared;

    unsigned int buffer_size;       /* size of each object, including padding/alignment */
    u32 reciprocal_buffer_size;     /* reciprocal of buffer_size, avoids divisions when
                                     * computing an object's index within a slab */
    
    ******
    
    /*
     * We put nodelists[] at the end of kmem_cache, because we want to size
     * this array to nr_node_ids slots instead of MAX_NUMNODES
     * (see kmem_cache_init())
     * We still use [MAX_NUMNODES] and not [1] or [0] because cache_cache
     * is statically defined, so we reserve the max number of nodes.
     */

    ******
};

Each cache is described by a struct kmem_cache.
The array_cache structure inside kmem_cache means that on an SMP system every CPU has its own local cache for the object type; for example, CPU1 through CPU4 each hold a per-CPU cache of inode objects, while kmem_cache->nodelists[x]->shared shows that a cache can also be shared per node.
The kmem_list3 structure inside kmem_cache contains three lists: slabs_full, slabs_partial and slabs_free.
Together these lists hold all of the cache's slabs.
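To make these relationships concrete, here is a hypothetical helper, not taken from the kernel source, that walks the structures above for one CPU and one node. It assumes the nodelists[] member that is elided from the kmem_cache listing above (struct kmem_list3 *nodelists[...], mentioned in the comment that was kept).

static void walk_cache(struct kmem_cache *cachep, int cpu, int node)
{
    /* Per-CPU front end: a LIFO array of recently freed, cache-warm objects. */
    struct array_cache *ac = cachep->array[cpu];

    /* Per-node back end: the three slab lists plus the node-shared array. */
    struct kmem_list3 *l3 = cachep->nodelists[node];
    struct slab *slabp;

    list_for_each_entry(slabp, &l3->slabs_partial, list)
        printk(KERN_INFO "partially used slab at %p: %u objects in use\n",
               slabp->s_mem, slabp->inuse);

    printk(KERN_INFO "cpu%d local array: %u of at most %u objects available\n",
           cpu, ac->avail, ac->limit);
}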

How does the slab layer work?
1. Initialization
2. Creating a cache
Reference: http://guojing.me/linux-kernel-architecture/posts/create-slab/

3. Allocating an object
Reference: http://blog.csdn.net/vanbreaker/article/details/7671211
1) Based on the type of object required (e.g. task_struct), check whether the corresponding cache has a free object.
2) If the local per-CPU cache holds no free object, take free objects from the slab lists in kmem_list3 to refill the local cache, then allocate.
3) If none of the slabs has a free object left, create a new slab first and then allocate (see the sketch below).
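These three steps correspond to the fast and slow paths in mm/slab.c. Below is a minimal sketch modeled on ____cache_alloc() in linux-2.6.26, simplified (statistics and NUMA handling dropped); slab_alloc_sketch is a made-up name, while cache_alloc_refill() is the real slow-path helper that refills the per-CPU array from kmem_list3 and grows the cache with a new slab when every list is empty.

static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags);

static inline void *slab_alloc_sketch(struct kmem_cache *cachep, gfp_t flags)
{
    void *objp;
    struct array_cache *ac = cachep->array[smp_processor_id()];

    if (likely(ac->avail)) {
        /* Step 1: the per-CPU array has a free object; hand out the most
         * recently freed one (LIFO), which is likely still cache-warm. */
        ac->touched = 1;
        objp = ac->entry[--ac->avail];
    } else {
        /* Steps 2 and 3: refill the per-CPU array from the node's slab
         * lists, creating a new slab from the buddy system if necessary. */
        objp = cache_alloc_refill(cachep, flags);
    }
    return objp;
}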

4. The allocator interface
Creating a cache
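The prototype, lightly annotated, as declared in linux-2.6.26's include/linux/slab.h (parameter names added here for readability; the ctor callback's signature changed in later kernels):

struct kmem_cache *kmem_cache_create(const char *name, size_t size,
            size_t align, unsigned long flags,
            void (*ctor)(struct kmem_cache *, void *));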

Destroying a cache
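Its counterpart, as declared in linux-2.6.26, frees every slab in the cache and then the cache itself; the caller must make sure no objects from the cache are still in use:

void kmem_cache_destroy(struct kmem_cache *cachep);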

Allocating an object
void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
Freeing an object
void kmem_cache_free(struct kmem_cache *cachep, void *objp)
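Putting the four calls together, here is a hedged end-to-end sketch; struct my_thing, my_thing_cachep and the init/exit functions are made up for illustration.

#include <linux/module.h>
#include <linux/slab.h>

struct my_thing {
    int id;
    char name[32];
};

static struct kmem_cache *my_thing_cachep;

static void use_one(void)
{
    struct my_thing *t = kmem_cache_alloc(my_thing_cachep, GFP_KERNEL);

    if (!t)
        return;
    t->id = 1;
    /* ... use the object ... */
    kmem_cache_free(my_thing_cachep, t);    /* back to the cache, not the buddy system */
}

static int __init my_thing_init(void)
{
    /* One cache for all my_thing objects; SLAB_PANIC makes a failed
     * creation panic, so no NULL check is needed afterwards. */
    my_thing_cachep = kmem_cache_create("my_thing", sizeof(struct my_thing),
                        0, SLAB_PANIC, NULL);
    use_one();
    return 0;
}

static void __exit my_thing_exit(void)
{
    kmem_cache_destroy(my_thing_cachep);
}

module_init(my_thing_init);
module_exit(my_thing_exit);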

How is a task_struct object allocated?
Reference: linux-2.6.26\kernel\fork.c

The kernel keeps a pointer to the task_struct cache in a global variable; the source is as follows:

#ifndef __HAVE_ARCH_TASK_STRUCT_ALLOCATOR
# define alloc_task_struct()    kmem_cache_alloc(task_struct_cachep, GFP_KERNEL)
# define free_task_struct(tsk)  kmem_cache_free(task_struct_cachep, (tsk))
static struct kmem_cache *task_struct_cachep;
#endif

During kernel initialization, the cache is created in fork_init(); the source is as follows:

#ifndef __HAVE_ARCH_TASK_STRUCT_ALLOCATOR
#ifndef ARCH_MIN_TASKALIGN
#define ARCH_MIN_TASKALIGN  L1_CACHE_BYTES
#endif
    /* create a slab on which task_structs can be allocated */
    task_struct_cachep =
        kmem_cache_create("task_struct", sizeof(struct task_struct),
            ARCH_MIN_TASKALIGN, SLAB_PANIC, NULL);
#endif

Whenever a process calls fork(), a new process descriptor has to be created. The call chain is do_fork() -> copy_process() -> dup_task_struct() -> alloc_task_struct(): each function calls the next, and alloc_task_struct() is expanded by the macro above into kmem_cache_alloc(task_struct_cachep, GFP_KERNEL), which finally allocates the process descriptor.
Later, after the task dies, if it has no children waiting on it, its process descriptor is freed and returned to task_struct_cachep through free_task_struct(), which the macro above expands into kmem_cache_free(task_struct_cachep, tsk); dup_task_struct() also calls free_task_struct() on its error paths, as the code below shows.
The source follows; only the parts we are interested in are shown.

/*
 *  Ok, this is the main fork-routine.
 *
 * It copies the process, and if successful kick-starts
 * it and waits for it to finish using the VM if required.
 */
long do_fork(unsigned long clone_flags,
          unsigned long stack_start,
          struct pt_regs *regs,
          unsigned long stack_size,
          int __user *parent_tidptr,
          int __user *child_tidptr)
{
    struct task_struct *p;
    int trace = 0;
    long nr;

    /*
     * We hope to recycle these flags after 2.6.26
     */
    if (unlikely(clone_flags & CLONE_STOPPED)) {
        static int __read_mostly count = 100;

        if (count > 0 && printk_ratelimit()) {
            char comm[TASK_COMM_LEN];

            count--;
            printk(KERN_INFO "fork(): process `%s' used deprecated "
                    "clone flags 0x%lx\n",
                get_task_comm(comm, current),
                clone_flags & CLONE_STOPPED);
        }
    }

    if (unlikely(current->ptrace)) {
        trace = fork_traceflag (clone_flags);
        if (trace)
            clone_flags |= CLONE_PTRACE;
    }

    p = copy_process(clone_flags, stack_start, regs, stack_size,
            child_tidptr, NULL);
    /*
     * Do this prior waking up the new thread - the thread pointer
     * might get invalid after that point, if the thread exits quickly.
     */
    if (!IS_ERR(p)) {
        struct completion vfork;

        nr = task_pid_vnr(p);

        if (clone_flags & CLONE_PARENT_SETTID)
            put_user(nr, parent_tidptr);

        if (clone_flags & CLONE_VFORK) {
            p->vfork_done = &vfork;
            init_completion(&vfork);
        }

        if ((p->ptrace & PT_PTRACED) || (clone_flags & CLONE_STOPPED)) {
            /*
             * We'll start up with an immediate SIGSTOP.
             */
            sigaddset(&p->pending.signal, SIGSTOP);
            set_tsk_thread_flag(p, TIF_SIGPENDING);
        }

        if (!(clone_flags & CLONE_STOPPED))
            wake_up_new_task(p, clone_flags);
        else
            __set_task_state(p, TASK_STOPPED);

        if (unlikely (trace)) {
            current->ptrace_message = nr;
            ptrace_notify ((trace << 8) | SIGTRAP);
        }

        if (clone_flags & CLONE_VFORK) {
            freezer_do_not_count();
            wait_for_completion(&vfork);
            freezer_count();
            if (unlikely (current->ptrace & PT_TRACE_VFORK_DONE)) {
                current->ptrace_message = nr;
                ptrace_notify ((PTRACE_EVENT_VFORK_DONE << 8) | SIGTRAP);
            }
        }
    } else {
        nr = PTR_ERR(p);
    }
    return nr;
}

//------------------------------------------------------------
/*
 * This creates a new process as a copy of the old one,
 * but does not actually start it yet.
 *
 * It copies the registers, and all the appropriate
 * parts of the process environment (as per the clone
 * flags). The actual kick-off is left to the caller.
 */
static struct task_struct *copy_process(unsigned long clone_flags,
                    unsigned long stack_start,
                    struct pt_regs *regs,
                    unsigned long stack_size,
                    int __user *child_tidptr,
                    struct pid *pid)
{
    int retval;
    struct task_struct *p;
    int cgroup_callbacks_done = 0;

    if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
        return ERR_PTR(-EINVAL);

    /*
     * Thread groups must share signals as well, and detached threads
     * can only be started up within the thread group.
     */
    if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
        return ERR_PTR(-EINVAL);

    /*
     * Shared signal handlers imply shared VM. By way of the above,
     * thread groups also imply shared VM. Blocking this case allows
     * for various simplifications in other code.
     */
    if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
        return ERR_PTR(-EINVAL);

    retval = security_task_create(clone_flags);
    if (retval)
        goto fork_out;

    retval = -ENOMEM;
    p = dup_task_struct(current);
    if (!p)
        goto fork_out;

    rt_mutex_init_task(p);
    
    ******
fork_out:
    return ERR_PTR(retval);
}

//------------------------------------------------------------
static struct task_struct *dup_task_struct(struct task_struct *orig)
{
    struct task_struct *tsk;
    struct thread_info *ti;
    int err;

    prepare_to_copy(orig);

    tsk = alloc_task_struct();
    if (!tsk)
        return NULL;

    ti = alloc_thread_info(tsk);
    if (!ti) {
        free_task_struct(tsk);
        return NULL;
    }

    err = arch_dup_task_struct(tsk, orig);
    if (err)
        goto out;

    tsk->stack = ti;

    err = prop_local_init_single(&tsk->dirties);
    if (err)
        goto out;

    setup_thread_stack(tsk, orig);

#ifdef CONFIG_CC_STACKPROTECTOR
    tsk->stack_canary = get_random_int();
#endif

    /* One for us, one for whoever does the "release_task()" (usually parent) */
    atomic_set(&tsk->usage,2);
    atomic_set(&tsk->fs_excl, 0);
#ifdef CONFIG_BLK_DEV_IO_TRACE
    tsk->btrace_seq = 0;
#endif
    tsk->splice_pipe = NULL;
    return tsk;

out:
    free_thread_info(ti);
    free_task_struct(tsk);
    return NULL;
}