Reading the Linux Kernel Source: Shared Memory

Introduction

I am reading the linux-4.2.3 source. I also consulted Chapter 16 of 《邊幹邊學——Linux內核指導》 (a rather goofy title, roughly "Learn While Doing: A Linux Kernel Guide"), which uses the 2.6.15 kernel source.

There are currently two ways to use shared memory on Linux:

  1. POSIX: shm_open() opens a file under /dev/shm/, then mmap() maps it into the process's own address space.

  2. System V: shmget() obtains the id of a shared memory object, then shmat() maps it into the process's own address space.

The POSIX implementation is built on tmpfs and its functions live in libc, so there is not much to say about it; the interesting part is the System V implementation. In System V, shared memory belongs to the IPC subsystem. IPC stands for InterProcess Communication; System V added three IPC mechanisms on top of earlier Unix: shared memory, message queues, and semaphores, collectively called IPC. The main code is in the following files:

  • ipc/shm.c

  • include/linux/shm.h

  • ipc/util.c

  • ipc/util.h

  • include/linux/ipc.h

The same shared memory segment has at least three identifiers inside the kernel:

  1. The IPC object id (the IPC object is the data structure that holds the IPC information).

  2. The inode of the file in the process's virtual memory. In each process the shared memory also exists as a file, just not explicitly; it can be reached through some vm_area_struct->vm_file->f_dentry->d_inode->i_ino.

  3. The IPC object's key. Passing the same key to shmget() yields the same shared memory segment. But since the key is chosen by the user it may collide, and programs rarely agree on a key beforehand, so this approach is not very common; System V shared memory is usually used between processes with a parent-child relationship. Alternatively, ftok() can generate a key from a pathname.

First, let's look at the data structure that represents a shared memory segment in the kernel, in include/linux/shm.h
/* */ comments are from the kernel source; // comments are mine

struct shmid_kernel /* private to the kernel */
{    
    struct kern_ipc_perm    shm_perm; // permissions; this struct also holds other important fields, covered below
    struct file        *shm_file;        // kernel file backing this segment; the file contents are the shared memory contents
    unsigned long        shm_nattch;   // number of processes attached to this segment
    unsigned long        shm_segsz;    // size in bytes
    time_t            shm_atim;         // last attach time
    time_t            shm_dtim;         // last detach time
    time_t            shm_ctim;         // last change time
    pid_t            shm_cprid;        // pid of the creating process
    pid_t            shm_lprid;        // pid of the last operating process
    struct user_struct    *mlock_user;

    /* The task created the shm object.  NULL if the task is dead. */
    struct task_struct    *shm_creator;
    struct list_head    shm_clist;    /* list by creator */
};

Next, shm_perm, the member of struct shmid_kernel that stores the permission information, in include/linux/ipc.h

/* used by in-kernel data structures */
struct kern_ipc_perm
{
    spinlock_t    lock;
    bool        deleted;
    int        id;           // IPC object id
    key_t        key;      // IPC object key, i.e. what the user specified when creating the segment
    kuid_t        uid;      // owner uid of the IPC object
    kgid_t        gid;      // owner gid
    kuid_t        cuid;     // creator uid
    kgid_t        cgid;     // creator gid
    umode_t        mode; 
    unsigned long    seq;
    void        *security;
};

Why does this struct exist? Because permissions, id, and key are attributes common to all IPC objects; for example struct semid_kernel, which represents a semaphore, contains the same struct kern_ipc_perm. When an IPC object is passed around, what gets passed is a pointer to its struct kern_ipc_perm, and a macro like container_of recovers the enclosing struct. That way the same function can operate on all three kinds of IPC objects, which gives good code reuse.

Next, the shared-memory-related functions. They are all system calls; the corresponding user APIs in libc take the same parameters and only do the routine work a system call requires (saving and restoring context and the like), so we can look at the system calls directly.

They are declared in include/linux/syscalls.h

asmlinkage long sys_shmat(int shmid, char __user *shmaddr, int shmflg);
asmlinkage long sys_shmget(key_t key, size_t size, int flag);
asmlinkage long sys_shmdt(char __user *shmaddr);
asmlinkage long sys_shmctl(int shmid, int cmd, struct shmid_ds __user *buf);

and defined in ipc/shm.c

shmget

SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
{
    struct ipc_namespace *ns;
    static const struct ipc_ops shm_ops = {
        .getnew = newseg,
        .associate = shm_security,
        .more_checks = shm_more_checks,
    };
    struct ipc_params shm_params;

    ns = current->nsproxy->ipc_ns;

    shm_params.key = key;
    shm_params.flg = shmflg;
    shm_params.u.size = size;

    return ipcget(ns, &shm_ids(ns), &shm_ops, &shm_params);
}

The definition may look strange at first, but the SYSCALL_DEFINE3 macro ultimately expands to the same form as the declaration in the .h file, i.e. long sys_shmget(key_t key, size_t size, int flag). The macro exists to fix a bug; it is pure black magic and I won't go into it here.

The function actually called here is ipcget(). Great pains were taken to unify the IPC interface: all three object types (shared memory, semaphores, message queues) go through this function at creation time, but the creation logic is not here; it lives in the three functions of shm_ops.

namespace

A quick note on current->nsproxy->ipc_ns, whose type is struct ipc_namespace. What is it? As we know, IPC data structures such as shared memory are global, but sometimes they need to be isolated: a group of processes may want to share these objects only within the group, unaware of other processes' shared memory, and thus free of conflicts with them. So, again with great pains, a namespace was added to the kernel. Passing the CLONE_NEWIPC flag to clone() creates a new IPC namespace.

So how does this IPC namespace relate to our shared memory data structures? Look at the structs:

struct ipc_ids {
    int in_use;
    unsigned short seq;
    struct rw_semaphore rwsem;
    struct idr ipcs_idr;
    int next_id;
};

struct ipc_namespace {
    atomic_t    count;
    struct ipc_ids    ids[3];
    ...
};

The important member is ids, which stores the ids of all IPC objects; shared memory objects all live in ids[2]. Within ids[2] the real data management is done by ipcs_idr, another painstakingly built kernel mechanism for id management that maps an id to a unique object; just think of it as an array. Their relationship looks roughly like the figure below.

[0] struct kern_ipc_perm <==> struct shmid_kernel
struct ipc_namespace => struct ipc_ids => struct idr => [1] struct kern_ipc_perm <==> struct shmid_kernel
                                                        [2] struct kern_ipc_perm <==> struct shmid_kernel

Back to shmget

OK, back to what shmget() actually does. First, ipcget():

int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
            const struct ipc_ops *ops, struct ipc_params *params)
{
    if (params->key == IPC_PRIVATE)
        return ipcget_new(ns, ids, ops, params);
    else
        return ipcget_public(ns, ids, ops, params);
}

If the key passed in is IPC_PRIVATE (the macro's value is 0), a new segment is created no matter what the mode is. If it is nonzero, existing segments are searched for the key: if found, that segment is returned; if not, a new one is created.

First, the creation function, newseg():

static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
{
    key_t key = params->key;
    int shmflg = params->flg;
    size_t size = params->u.size;
    int error;
    struct shmid_kernel *shp;
    size_t numpages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
    struct file *file;
    char name[13];
    int id;
    vm_flags_t acctflag = 0;

    if (size < SHMMIN || size > ns->shm_ctlmax)
        return -EINVAL;

    if (numpages << PAGE_SHIFT < size)
        return -ENOSPC;

    if (ns->shm_tot + numpages < ns->shm_tot ||
            ns->shm_tot + numpages > ns->shm_ctlall)
        return -ENOSPC;

    shp = ipc_rcu_alloc(sizeof(*shp));
    if (!shp)
        return -ENOMEM;

    shp->shm_perm.key = key;
    shp->shm_perm.mode = (shmflg & S_IRWXUGO);
    shp->mlock_user = NULL;

    shp->shm_perm.security = NULL;
    error = security_shm_alloc(shp);
    if (error) {
        ipc_rcu_putref(shp, ipc_rcu_free);
        return error;
    }

    sprintf(name, "SYSV%08x", key);
    if (shmflg & SHM_HUGETLB) {
        struct hstate *hs;
        size_t hugesize;

        hs = hstate_sizelog((shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
        if (!hs) {
            error = -EINVAL;
            goto no_file;
        }
        hugesize = ALIGN(size, huge_page_size(hs));

        /* hugetlb_file_setup applies strict accounting */
        if (shmflg & SHM_NORESERVE)
            acctflag = VM_NORESERVE;
        file = hugetlb_file_setup(name, hugesize, acctflag,
                  &shp->mlock_user, HUGETLB_SHMFS_INODE,
                (shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
    } else {
        /*
         * Do not allow no accounting for OVERCOMMIT_NEVER, even
         * if it's asked for.
         */
        if  ((shmflg & SHM_NORESERVE) &&
                sysctl_overcommit_memory != OVERCOMMIT_NEVER)
            acctflag = VM_NORESERVE;
        file = shmem_kernel_file_setup(name, size, acctflag);
    }
    error = PTR_ERR(file);
    if (IS_ERR(file))
        goto no_file;

    id = ipc_addid(&shm_ids(ns), &shp->shm_perm, ns->shm_ctlmni);
    if (id < 0) {
        error = id;
        goto no_id;
    }

    shp->shm_cprid = task_tgid_vnr(current);
    shp->shm_lprid = 0;
    shp->shm_atim = shp->shm_dtim = 0;
    shp->shm_ctim = get_seconds();
    shp->shm_segsz = size;
    shp->shm_nattch = 0;
    shp->shm_file = file;
    shp->shm_creator = current;
    list_add(&shp->shm_clist, &current->sysvshm.shm_clist);

    /*
     * shmid gets reported as "inode#" in /proc/pid/maps.
     * proc-ps tools use this. Changing this will break them.
     */
    file_inode(file)->i_ino = shp->shm_perm.id;

    ns->shm_tot += numpages;
    error = shp->shm_perm.id;

    ipc_unlock_object(&shp->shm_perm);
    rcu_read_unlock();
    return error;

no_id:
    if (is_file_hugepages(file) && shp->mlock_user)
        user_shm_unlock(size, shp->mlock_user);
    fput(file);
no_file:
    ipc_rcu_putref(shp, shm_rcu_free);
    return error;
}

The function first runs a few if checks: is size a valid argument, and are enough pages available? It then calls ipc_rcu_alloc() to allocate space for the shared-memory structure shp, and writes some parameters into shp's shm_perm member. The big if-else below the sprintf allocates the file that represents the segment's contents. Next, ipc_addid() is a fairly important function: it adds the pointer to the newly built structure into the namespace's ids (imagine adding it to an array) and obtains an id by which it can be found. This id is not exactly the array index; to avoid reuse there is a simple scheme that keeps generated ids nearly unique: ids has a seq variable that is incremented as new objects are added, and the real id is computed as SEQ_MULTIPLIER * seq + id. The function then initializes some members and adds the structure's pointer to a list in the current task, and with that its work is basically done.

Next, the logic when an existing key is passed at creation time, i.e. ipcget_public():

static int ipcget_public(struct ipc_namespace *ns, struct ipc_ids *ids,
        const struct ipc_ops *ops, struct ipc_params *params)
{
    struct kern_ipc_perm *ipcp;
    int flg = params->flg;
    int err;

    /*
     * Take the lock as a writer since we are potentially going to add
     * a new entry + read locks are not "upgradable"
     */
    down_write(&ids->rwsem);
    ipcp = ipc_findkey(ids, params->key);
    if (ipcp == NULL) {
        /* key not used */
        if (!(flg & IPC_CREAT))
            err = -ENOENT;
        else
            err = ops->getnew(ns, params);
    } else {
        /* ipc object has been locked by ipc_findkey() */

        if (flg & IPC_CREAT && flg & IPC_EXCL)
            err = -EEXIST;
        else {
            err = 0;
            if (ops->more_checks)
                err = ops->more_checks(ipcp, params);
            if (!err)
                /*
                 * ipc_check_perms returns the IPC id on
                 * success
                 */
                err = ipc_check_perms(ns, ipcp, ops, params);
        }
        ipc_unlock(ipcp);
    }
    up_write(&ids->rwsem);

    return err;
}

The logic is very simple: first look up the key. If it is absent, create a new segment; note that ops->getnew() is the newseg() we just saw. If it is found, check the permissions, and if they are fine return the IPC id directly.

Take a look at ipc_findkey() as well:

static struct kern_ipc_perm *ipc_findkey(struct ipc_ids *ids, key_t key)
{
    struct kern_ipc_perm *ipc;
    int next_id;
    int total;

    for (total = 0, next_id = 0; total < ids->in_use; next_id++) {
        ipc = idr_find(&ids->ipcs_idr, next_id);

        if (ipc == NULL)
            continue;

        if (ipc->key != key) {
            total++;
            continue;
        }

        rcu_read_lock();
        ipc_lock_object(ipc);
        return ipc;
    }

    return NULL;
}

The logic here is simple too. Note that ids->ipcs_idr is the integer ID management (idr) mechanism mentioned earlier; it stores the one-to-one mapping between shmid and object. ids->in_use is the number of shared memory segments; since some in the middle may have been deleted, total is only incremented when a non-NULL entry is found. Also note that duplicate keys get no special handling at all here, which is one more reason not to hard-code some agreed number as the key when programming.

shmat

Next up is shmat(). All of its logic lives in do_shmat(), so we look at that function directly.

long do_shmat(int shmid, char __user *shmaddr, int shmflg, ulong *raddr,
          unsigned long shmlba)
{
    struct shmid_kernel *shp;
    unsigned long addr;
    unsigned long size;
    struct file *file;
    int    err;
    unsigned long flags;
    unsigned long prot;
    int acc_mode;
    struct ipc_namespace *ns;
    struct shm_file_data *sfd;
    struct path path;
    fmode_t f_mode;
    unsigned long populate = 0;

    err = -EINVAL;
    if (shmid < 0)
        goto out;
    else if ((addr = (ulong)shmaddr)) {
        if (addr & (shmlba - 1)) {
            if (shmflg & SHM_RND)
                addr &= ~(shmlba - 1);       /* round down */
            else
#ifndef __ARCH_FORCE_SHMLBA
                if (addr & ~PAGE_MASK)
#endif
                    goto out;
        }
        flags = MAP_SHARED | MAP_FIXED;
    } else {
        if ((shmflg & SHM_REMAP))
            goto out;

        flags = MAP_SHARED;
    }

    if (shmflg & SHM_RDONLY) {
        prot = PROT_READ;
        acc_mode = S_IRUGO;
        f_mode = FMODE_READ;
    } else {
        prot = PROT_READ | PROT_WRITE;
        acc_mode = S_IRUGO | S_IWUGO;
        f_mode = FMODE_READ | FMODE_WRITE;
    }
    if (shmflg & SHM_EXEC) {
        prot |= PROT_EXEC;
        acc_mode |= S_IXUGO;
    }

    /*
     * We cannot rely on the fs check since SYSV IPC does have an
     * additional creator id...
     */
    ns = current->nsproxy->ipc_ns;
    rcu_read_lock();
    shp = shm_obtain_object_check(ns, shmid);
    if (IS_ERR(shp)) {
        err = PTR_ERR(shp);
        goto out_unlock;
    }

    err = -EACCES;
    if (ipcperms(ns, &shp->shm_perm, acc_mode))
        goto out_unlock;

    err = security_shm_shmat(shp, shmaddr, shmflg);
    if (err)
        goto out_unlock;

    ipc_lock_object(&shp->shm_perm);

    /* check if shm_destroy() is tearing down shp */
    if (!ipc_valid_object(&shp->shm_perm)) {
        ipc_unlock_object(&shp->shm_perm);
        err = -EIDRM;
        goto out_unlock;
    }

    path = shp->shm_file->f_path;
    path_get(&path);
    shp->shm_nattch++;
    size = i_size_read(d_inode(path.dentry));
    ipc_unlock_object(&shp->shm_perm);
    rcu_read_unlock();

    err = -ENOMEM;
    sfd = kzalloc(sizeof(*sfd), GFP_KERNEL);
    if (!sfd) {
        path_put(&path);
        goto out_nattch;
    }

    file = alloc_file(&path, f_mode,
              is_file_hugepages(shp->shm_file) ?
                &shm_file_operations_huge :
                &shm_file_operations);
    err = PTR_ERR(file);
    if (IS_ERR(file)) {
        kfree(sfd);
        path_put(&path);
        goto out_nattch;
    }

    file->private_data = sfd;
    file->f_mapping = shp->shm_file->f_mapping;
    sfd->id = shp->shm_perm.id;
    sfd->ns = get_ipc_ns(ns);
    sfd->file = shp->shm_file;
    sfd->vm_ops = NULL;

    err = security_mmap_file(file, prot, flags);
    if (err)
        goto out_fput;

    down_write(&current->mm->mmap_sem);
    if (addr && !(shmflg & SHM_REMAP)) {
        err = -EINVAL;
        if (addr + size < addr)
            goto invalid;

        if (find_vma_intersection(current->mm, addr, addr + size))
            goto invalid;
    }

    addr = do_mmap_pgoff(file, addr, size, prot, flags, 0, &populate);
    *raddr = addr;
    err = 0;
    if (IS_ERR_VALUE(addr))
        err = (long)addr;
invalid:
    up_write(&current->mm->mmap_sem);
    if (populate)
        mm_populate(addr, populate);

out_fput:
    fput(file);

out_nattch:
    down_write(&shm_ids(ns).rwsem);
    shp = shm_lock(ns, shmid);
    shp->shm_nattch--;
    if (shm_may_destroy(ns, shp))
        shm_destroy(ns, shp);
    else
        shm_unlock(shp);
    up_write(&shm_ids(ns).rwsem);
    return err;

out_unlock:
    rcu_read_unlock();
out:
    return err;
}

The function first validates shmaddr and aligns it, i.e. rounds it down to a multiple of shmlba. If the addr passed in is 0, the checking part only adds the MAP_SHARED flag, because the later mmap will pick an address automatically. From the two-line comment onward, the function looks up the shared memory object by shmid and checks permissions. It then updates some data in shp, for example incrementing the attach count. Next, alloc_file() creates the file that will actually be mmap()ed. Before the mmap, the address space is checked: does the range overlap another mapping, and is it usable? The actual mapping work is then done in do_mmap_pgoff().

shmdt

SYSCALL_DEFINE1(shmdt, char __user *, shmaddr)
{
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;
    unsigned long addr = (unsigned long)shmaddr;
    int retval = -EINVAL;
#ifdef CONFIG_MMU
    loff_t size = 0;
    struct file *file;
    struct vm_area_struct *next;
#endif

    if (addr & ~PAGE_MASK)
        return retval;

    down_write(&mm->mmap_sem);

    /*
     * This function tries to be smart and unmap shm segments that
     * were modified by partial mlock or munmap calls:
     * - It first determines the size of the shm segment that should be
     *   unmapped: It searches for a vma that is backed by shm and that
     *   started at address shmaddr. It records it's size and then unmaps
     *   it.
     * - Then it unmaps all shm vmas that started at shmaddr and that
     *   are within the initially determined size and that are from the
     *   same shm segment from which we determined the size.
     * Errors from do_munmap are ignored: the function only fails if
     * it's called with invalid parameters or if it's called to unmap
     * a part of a vma. Both calls in this function are for full vmas,
     * the parameters are directly copied from the vma itself and always
     * valid - therefore do_munmap cannot fail. (famous last words?)
     */
    /*
     * If it had been mremap()'d, the starting address would not
     * match the usual checks anyway. So assume all vma's are
     * above the starting address given.
     */
    vma = find_vma(mm, addr);

#ifdef CONFIG_MMU
    while (vma) {
        next = vma->vm_next;

        /*
         * Check if the starting address would match, i.e. it's
         * a fragment created by mprotect() and/or munmap(), or it
         * otherwise it starts at this address with no hassles.
         */
        if ((vma->vm_ops == &shm_vm_ops) &&
            (vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) {

            /*
             * Record the file of the shm segment being
             * unmapped.  With mremap(), someone could place
             * page from another segment but with equal offsets
             * in the range we are unmapping.
             */
            file = vma->vm_file;
            size = i_size_read(file_inode(vma->vm_file));
            do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start);
            /*
             * We discovered the size of the shm segment, so
             * break out of here and fall through to the next
             * loop that uses the size information to stop
             * searching for matching vma's.
             */
            retval = 0;
            vma = next;
            break;
        }
        vma = next;
    }

    /*
     * We need look no further than the maximum address a fragment
     * could possibly have landed at. Also cast things to loff_t to
     * prevent overflows and make comparisons vs. equal-width types.
     */
    size = PAGE_ALIGN(size);
    while (vma && (loff_t)(vma->vm_end - addr) <= size) {
        next = vma->vm_next;

        /* finding a matching vma now does not alter retval */
        if ((vma->vm_ops == &shm_vm_ops) &&
            ((vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) &&
            (vma->vm_file == file))
            do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start);
        vma = next;
    }

#else /* CONFIG_MMU */
    /* under NOMMU conditions, the exact address to be destroyed must be
     * given */
    if (vma && vma->vm_start == addr && vma->vm_ops == &shm_vm_ops) {
        do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start);
        retval = 0;
    }

#endif

    up_write(&mm->mmap_sem);
    return retval;
}

Next is shmdt(). The function is quite simple: it finds the vma corresponding to the shmaddr passed in, checks that its address is correct, and calls do_munmap() to detach from the shared memory. Note that this operation does not destroy the segment, not even when no process remains attached to it; it is only destroyed by an explicit shmctl(id, IPC_RMID, NULL).

shmctl() is, on the whole, just one big switch; most of the cases read information or set flags, so I won't go through them here.
