I am reading the linux-4.2.3 source. I also referred to chapter 16 of 《邊幹邊學——Linux內核指導》 ("Learn by Doing: A Linux Kernel Guide", an amusingly odd title), which uses the 2.6.15 kernel.
Linux today offers two ways to use shared memory:

- POSIX: shm_open() opens a file under /dev/shm/, then mmap() maps it into the process's own address space.
- System V: shmget() obtains the id of a shared-memory object, then shmat() maps it into the process's own address space.
The POSIX implementation is built on tmpfs and the functions live in libc, so there is not much to say about it; the interesting part is the System V implementation. In System V, shared memory belongs to the IPC subsystem. IPC stands for InterProcess Communication; System V added three IPC mechanisms on top of earlier Unix: shared memory, message queues, and semaphores. The main code is in the following files:

ipc/shm.c
include/linux/shm.h
ipc/util.c
ipc/util.h
include/linux/ipc.h
The same shared-memory segment has at least three identifiers inside the kernel:

- The IPC object id (the IPC object being the data structure that holds the IPC information).
- The inode of the file in the process's virtual memory: in every process the shared memory also exists as a file, just not an explicit one. It can be reached through some vm_area_struct->vm_file->f_dentry->d_inode->i_ino.
- The IPC object key. Passing the same key to shmget() returns the same segment. But keys are chosen by the user and can collide, and programs rarely agree on a key in advance, so this is not used much; System V shared memory is typically used between related (parent/child) processes. Alternatively, ftok() can derive a key from a pathname.
First, the data structure that represents a shared-memory segment in the kernel, in include/linux/shm.h. The /* */ comments are from the kernel source; the // comments are mine:
struct shmid_kernel /* private to the kernel */
{
	struct kern_ipc_perm	shm_perm;   // permissions; this struct holds more important fields, covered below
	struct file		*shm_file;  // kernel file representing this segment; its contents are the shared memory
	unsigned long		shm_nattch; // number of processes attached to this segment
	unsigned long		shm_segsz;  // size in bytes
	time_t			shm_atim;   // last attach time
	time_t			shm_dtim;   // last detach time
	time_t			shm_ctim;   // last change time
	pid_t			shm_cprid;  // creator pid
	pid_t			shm_lprid;  // pid of the last operation
	struct user_struct	*mlock_user;

	/* The task created the shm object.  NULL if the task is dead. */
	struct task_struct	*shm_creator;
	struct list_head	shm_clist;  /* list by creator */
};
Next, the shm_perm member of struct shmid_kernel, which stores the permission information, in include/linux/ipc.h:
/* used by in-kernel data structures */
struct kern_ipc_perm {
	spinlock_t	lock;
	bool		deleted;
	int		id;        // IPC object id
	key_t		key;       // IPC key, supplied by the user when the segment is created
	kuid_t		uid;       // owner uid of the IPC object
	kgid_t		gid;       // owner gid
	kuid_t		cuid;      // creator uid
	kgid_t		cgid;      // creator gid
	umode_t		mode;
	unsigned long	seq;
	void		*security;
};
Why a separate struct? Because permissions, id, and key are properties every IPC object has, so for example struct semid_kernel, which represents a semaphore, embeds a struct kern_ipc_perm as well. When an IPC object is passed around, what gets passed is a struct kern_ipc_perm pointer, and the enclosing struct is recovered with a macro like container_of. That way the same function can operate on all three kinds of IPC objects, which gives good code reuse.
Now for the shared-memory functions. They are all system calls; the corresponding user APIs in libc take the same parameters and only do the housekeeping a syscall requires (saving and restoring state and so on), so we can read the syscalls directly. They are declared in include/linux/syscalls.h:
asmlinkage long sys_shmat(int shmid, char __user *shmaddr, int shmflg);
asmlinkage long sys_shmget(key_t key, size_t size, int flag);
asmlinkage long sys_shmdt(char __user *shmaddr);
asmlinkage long sys_shmctl(int shmid, int cmd, struct shmid_ds __user *buf);
and defined in ipc/shm.c:
SYSCALL_DEFINE3(shmget, key_t, key, size_t, size, int, shmflg)
{
	struct ipc_namespace *ns;
	static const struct ipc_ops shm_ops = {
		.getnew = newseg,
		.associate = shm_security,
		.more_checks = shm_more_checks,
	};
	struct ipc_params shm_params;

	ns = current->nsproxy->ipc_ns;

	shm_params.key = key;
	shm_params.flg = shmflg;
	shm_params.u.size = size;

	return ipcget(ns, &shm_ids(ns), &shm_ops, &shm_params);
}
The definition may look odd at first, but the SYSCALL_DEFINE3 macro expands to exactly the form declared in the header, i.e. long sys_shmget(key_t key, size_t size, int flag). The macro exists to work around a bug (reportedly the syscall-argument sign-extension issue on some 64-bit architectures, CVE-2009-0029); it is pure black magic and I won't go into it here.
The function actually called here is ipcget(). Great pains were taken to unify the IPC interface: shared memory, semaphores, and message queues all go through this function at creation time. The real creation logic is not in it, though; it lives in the three functions of shm_ops.
A word about current->nsproxy->ipc_ns, whose type is struct ipc_namespace. What is it? The IPC data structures are global, but sometimes they need to be isolated: a group of processes may want to share these objects only among themselves, without seeing other processes' shared memory and without colliding with it. So, at considerable effort, the kernel gained IPC namespaces: passing the CLONE_NEWIPC flag to clone() creates a new one.
How does the IPC namespace relate to our shared-memory data structures? Look at these structs:
struct ipc_ids {
	int in_use;
	unsigned short seq;
	struct rw_semaphore rwsem;
	struct idr ipcs_idr;
	int next_id;
};

struct ipc_namespace {
	atomic_t	count;
	struct ipc_ids	ids[3];
	...
};
The important member is ids, which holds the ids of all IPC objects; shared-memory segments all live in ids[2]. Within ids[2], the real bookkeeping is done by ipcs_idr, the kernel's (also painstakingly built) integer ID management facility, which maps an id to an arbitrary uniquely determined object; think of it as an array. The relationships look roughly like this:
struct ipc_namespace => struct ipc_ids => struct idr => [0] struct kern_ipc_perm <==> struct shmid_kernel
                                                        [1] struct kern_ipc_perm <==> struct shmid_kernel
                                                        [2] struct kern_ipc_perm <==> struct shmid_kernel
Good. Back to what shmget() actually does; first, ipcget():
int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
			const struct ipc_ops *ops, struct ipc_params *params)
{
	if (params->key == IPC_PRIVATE)
		return ipcget_new(ns, ids, ops, params);
	else
		return ipcget_public(ns, ids, ops, params);
}
If the key passed in is IPC_PRIVATE (the macro's value is 0), a new segment is created no matter what the mode is. If it is nonzero, the existing segments are searched for that key: if one is found it is returned, otherwise a new one is created.
First, the creation function, newseg():
static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
{
	key_t key = params->key;
	int shmflg = params->flg;
	size_t size = params->u.size;
	int error;
	struct shmid_kernel *shp;
	size_t numpages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
	struct file *file;
	char name[13];
	int id;
	vm_flags_t acctflag = 0;

	if (size < SHMMIN || size > ns->shm_ctlmax)
		return -EINVAL;

	if (numpages << PAGE_SHIFT < size)
		return -ENOSPC;

	if (ns->shm_tot + numpages < ns->shm_tot ||
			ns->shm_tot + numpages > ns->shm_ctlall)
		return -ENOSPC;

	shp = ipc_rcu_alloc(sizeof(*shp));
	if (!shp)
		return -ENOMEM;

	shp->shm_perm.key = key;
	shp->shm_perm.mode = (shmflg & S_IRWXUGO);
	shp->mlock_user = NULL;

	shp->shm_perm.security = NULL;
	error = security_shm_alloc(shp);
	if (error) {
		ipc_rcu_putref(shp, ipc_rcu_free);
		return error;
	}

	sprintf(name, "SYSV%08x", key);
	if (shmflg & SHM_HUGETLB) {
		struct hstate *hs;
		size_t hugesize;

		hs = hstate_sizelog((shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
		if (!hs) {
			error = -EINVAL;
			goto no_file;
		}
		hugesize = ALIGN(size, huge_page_size(hs));

		/* hugetlb_file_setup applies strict accounting */
		if (shmflg & SHM_NORESERVE)
			acctflag = VM_NORESERVE;
		file = hugetlb_file_setup(name, hugesize, acctflag,
				  &shp->mlock_user, HUGETLB_SHMFS_INODE,
				(shmflg >> SHM_HUGE_SHIFT) & SHM_HUGE_MASK);
	} else {
		/*
		 * Do not allow no accounting for OVERCOMMIT_NEVER, even
		 * if it's asked for.
		 */
		if ((shmflg & SHM_NORESERVE) &&
				sysctl_overcommit_memory != OVERCOMMIT_NEVER)
			acctflag = VM_NORESERVE;
		file = shmem_kernel_file_setup(name, size, acctflag);
	}
	error = PTR_ERR(file);
	if (IS_ERR(file))
		goto no_file;

	id = ipc_addid(&shm_ids(ns), &shp->shm_perm, ns->shm_ctlmni);
	if (id < 0) {
		error = id;
		goto no_id;
	}

	shp->shm_cprid = task_tgid_vnr(current);
	shp->shm_lprid = 0;
	shp->shm_atim = shp->shm_dtim = 0;
	shp->shm_ctim = get_seconds();
	shp->shm_segsz = size;
	shp->shm_nattch = 0;
	shp->shm_file = file;
	shp->shm_creator = current;
	list_add(&shp->shm_clist, &current->sysvshm.shm_clist);

	/*
	 * shmid gets reported as "inode#" in /proc/pid/maps. proc-ps tools
	 * use this. Changing this will break them.
	 */
	file_inode(file)->i_ino = shp->shm_perm.id;

	ns->shm_tot += numpages;
	error = shp->shm_perm.id;

	ipc_unlock_object(&shp->shm_perm);
	rcu_read_unlock();
	return error;

no_id:
	if (is_file_hugepages(file) && shp->mlock_user)
		user_shm_unlock(size, shp->mlock_user);
	fput(file);
no_file:
	ipc_rcu_putref(shp, shm_rcu_free);
	return error;
}
The function starts with a few if checks: is size a legal argument, and are there enough pages available. Then ipc_rcu_alloc() allocates space for the shared-memory descriptor shp, and some of the parameters are written into its shm_perm member. The big if-else after the sprintf sets up the file that represents the segment's contents. Then comes the important ipc_addid(): it inserts the pointer to the newly built descriptor into the namespace's ids (think of it as inserting into the array) and obtains an id by which the descriptor can be found. This id is not simply the array index: to avoid reuse, a simple scheme keeps generated ids almost unique. ids has a seq counter that is incremented every time a new object is added, and the visible id is computed as SEQ_MULTIPLIER * seq + id. After that a few members are initialized, the descriptor is added to a list in the current task, and the function's work is essentially done.
Next, the case where an existing key is passed at creation time, i.e. the logic of ipcget_public():
static int ipcget_public(struct ipc_namespace *ns, struct ipc_ids *ids,
		const struct ipc_ops *ops, struct ipc_params *params)
{
	struct kern_ipc_perm *ipcp;
	int flg = params->flg;
	int err;

	/*
	 * Take the lock as a writer since we are potentially going to add
	 * a new entry + read locks are not "upgradable"
	 */
	down_write(&ids->rwsem);
	ipcp = ipc_findkey(ids, params->key);
	if (ipcp == NULL) {
		/* key not used */
		if (!(flg & IPC_CREAT))
			err = -ENOENT;
		else
			err = ops->getnew(ns, params);
	} else {
		/* ipc object has been locked by ipc_findkey() */

		if (flg & IPC_CREAT && flg & IPC_EXCL)
			err = -EEXIST;
		else {
			err = 0;
			if (ops->more_checks)
				err = ops->more_checks(ipcp, params);
			if (!err)
				/*
				 * ipc_check_perms returns the IPC id on
				 * success
				 */
				err = ipc_check_perms(ns, ipcp, ops, params);
		}
		ipc_unlock(ipcp);
	}
	up_write(&ids->rwsem);

	return err;
}
The logic is very simple: first look up the key. If it is absent, a new segment is still created; note that ops->getnew() is the newseg() we just saw. If it is found, the permissions are checked, and if they are fine the IPC id is returned directly.
It is worth a look at ipc_findkey() as well:
static struct kern_ipc_perm *ipc_findkey(struct ipc_ids *ids, key_t key)
{
	struct kern_ipc_perm *ipc;
	int next_id;
	int total;

	for (total = 0, next_id = 0; total < ids->in_use; next_id++) {
		ipc = idr_find(&ids->ipcs_idr, next_id);

		if (ipc == NULL)
			continue;

		if (ipc->key != key) {
			total++;
			continue;
		}

		rcu_read_lock();
		ipc_lock_object(ipc);
		return ipc;
	}

	return NULL;
}
Also simple. Note that ids->ipcs_idr is the integer ID management (idr) facility mentioned earlier; it stores the one-to-one mapping between shmid and object. ids->in_use is the number of live segments; since some slots in the middle may have been deleted, total is incremented only when a non-empty segment is found. Also note that duplicate keys get no special handling here, another reason why programs should avoid hard-coding some agreed-upon number as the key.
Next, shmat(). All of its logic is in do_shmat(), so we read that function directly.
long do_shmat(int shmid, char __user *shmaddr, int shmflg,
	      ulong *raddr, unsigned long shmlba)
{
	struct shmid_kernel *shp;
	unsigned long addr;
	unsigned long size;
	struct file *file;
	int err;
	unsigned long flags;
	unsigned long prot;
	int acc_mode;
	struct ipc_namespace *ns;
	struct shm_file_data *sfd;
	struct path path;
	fmode_t f_mode;
	unsigned long populate = 0;

	err = -EINVAL;
	if (shmid < 0)
		goto out;
	else if ((addr = (ulong)shmaddr)) {
		if (addr & (shmlba - 1)) {
			if (shmflg & SHM_RND)
				addr &= ~(shmlba - 1);	   /* round down */
			else
#ifndef __ARCH_FORCE_SHMLBA
				if (addr & ~PAGE_MASK)
#endif
					goto out;
		}
		flags = MAP_SHARED | MAP_FIXED;
	} else {
		if ((shmflg & SHM_REMAP))
			goto out;

		flags = MAP_SHARED;
	}

	if (shmflg & SHM_RDONLY) {
		prot = PROT_READ;
		acc_mode = S_IRUGO;
		f_mode = FMODE_READ;
	} else {
		prot = PROT_READ | PROT_WRITE;
		acc_mode = S_IRUGO | S_IWUGO;
		f_mode = FMODE_READ | FMODE_WRITE;
	}
	if (shmflg & SHM_EXEC) {
		prot |= PROT_EXEC;
		acc_mode |= S_IXUGO;
	}

	/*
	 * We cannot rely on the fs check since SYSV IPC does have an
	 * additional creator id...
	 */
	ns = current->nsproxy->ipc_ns;
	rcu_read_lock();
	shp = shm_obtain_object_check(ns, shmid);
	if (IS_ERR(shp)) {
		err = PTR_ERR(shp);
		goto out_unlock;
	}

	err = -EACCES;
	if (ipcperms(ns, &shp->shm_perm, acc_mode))
		goto out_unlock;

	err = security_shm_shmat(shp, shmaddr, shmflg);
	if (err)
		goto out_unlock;

	ipc_lock_object(&shp->shm_perm);

	/* check if shm_destroy() is tearing down shp */
	if (!ipc_valid_object(&shp->shm_perm)) {
		ipc_unlock_object(&shp->shm_perm);
		err = -EIDRM;
		goto out_unlock;
	}

	path = shp->shm_file->f_path;
	path_get(&path);
	shp->shm_nattch++;
	size = i_size_read(d_inode(path.dentry));
	ipc_unlock_object(&shp->shm_perm);
	rcu_read_unlock();

	err = -ENOMEM;
	sfd = kzalloc(sizeof(*sfd), GFP_KERNEL);
	if (!sfd) {
		path_put(&path);
		goto out_nattch;
	}

	file = alloc_file(&path, f_mode,
			  is_file_hugepages(shp->shm_file) ?
				&shm_file_operations_huge :
				&shm_file_operations);
	err = PTR_ERR(file);
	if (IS_ERR(file)) {
		kfree(sfd);
		path_put(&path);
		goto out_nattch;
	}

	file->private_data = sfd;
	file->f_mapping = shp->shm_file->f_mapping;
	sfd->id = shp->shm_perm.id;
	sfd->ns = get_ipc_ns(ns);
	sfd->file = shp->shm_file;
	sfd->vm_ops = NULL;

	err = security_mmap_file(file, prot, flags);
	if (err)
		goto out_fput;

	down_write(&current->mm->mmap_sem);
	if (addr && !(shmflg & SHM_REMAP)) {
		err = -EINVAL;
		if (addr + size < addr)
			goto invalid;

		if (find_vma_intersection(current->mm, addr, addr + size))
			goto invalid;
	}

	addr = do_mmap_pgoff(file, addr, size, prot, flags, 0, &populate);
	*raddr = addr;
	err = 0;
	if (IS_ERR_VALUE(addr))
		err = (long)addr;
invalid:
	up_write(&current->mm->mmap_sem);
	if (populate)
		mm_populate(addr, populate);

out_fput:
	fput(file);

out_nattch:
	down_write(&shm_ids(ns).rwsem);
	shp = shm_lock(ns, shmid);
	shp->shm_nattch--;
	if (shm_may_destroy(ns, shp))
		shm_destroy(ns, shp);
	else
		shm_unlock(shp);
	up_write(&shm_ids(ns).rwsem);
	return err;

out_unlock:
	rcu_read_unlock();
out:
	return err;
}
The function first validates shmaddr and aligns it, i.e. rounds it down to a multiple of shmlba. If the addr passed in is 0, this checking section only adds the MAP_SHARED flag, because the later mmap will pick an address by itself. From the two-line comment onward, the function looks the shared-memory object up by shmid and checks permissions, then updates some fields of shp, for example incrementing the attach count. Then alloc_file() builds the file that will actually be mmapped. Before the mmap, the address space is checked: does the range overlap an existing mapping, and is there enough room. The actual mapping work is done in do_mmap_pgoff().
SYSCALL_DEFINE1(shmdt, char __user *, shmaddr)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma;
	unsigned long addr = (unsigned long)shmaddr;
	int retval = -EINVAL;
#ifdef CONFIG_MMU
	loff_t size = 0;
	struct file *file;
	struct vm_area_struct *next;
#endif

	if (addr & ~PAGE_MASK)
		return retval;

	down_write(&mm->mmap_sem);

	/*
	 * This function tries to be smart and unmap shm segments that
	 * were modified by partial mlock or munmap calls:
	 * - It first determines the size of the shm segment that should be
	 *   unmapped: It searches for a vma that is backed by shm and that
	 *   started at address shmaddr. It records it's size and then unmaps
	 *   it.
	 * - Then it unmaps all shm vmas that started at shmaddr and that
	 *   are within the initially determined size and that are from the
	 *   same shm segment from which we determined the size.
	 * Errors from do_munmap are ignored: the function only fails if
	 * it's called with invalid parameters or if it's called to unmap
	 * a part of a vma. Both calls in this function are for full vmas,
	 * the parameters are directly copied from the vma itself and always
	 * valid - therefore do_munmap cannot fail. (famous last words?)
	 */

	/*
	 * If it had been mremap()'d, the starting address would not
	 * match the usual checks anyway. So assume all vma's are
	 * above the starting address given.
	 */
	vma = find_vma(mm, addr);

#ifdef CONFIG_MMU
	while (vma) {
		next = vma->vm_next;

		/*
		 * Check if the starting address would match, i.e. it's
		 * a fragment created by mprotect() and/or munmap(), or it
		 * otherwise it starts at this address with no hassles.
		 */
		if ((vma->vm_ops == &shm_vm_ops) &&
			(vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) {

			/*
			 * Record the file of the shm segment being
			 * unmapped.  With mremap(), someone could place
			 * page from another segment but with equal offsets
			 * in the range we are unmapping.
			 */
			file = vma->vm_file;
			size = i_size_read(file_inode(vma->vm_file));
			do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start);
			/*
			 * We discovered the size of the shm segment, so
			 * break out of here and fall through to the next
			 * loop that uses the size information to stop
			 * searching for matching vma's.
			 */
			retval = 0;
			vma = next;
			break;
		}
		vma = next;
	}

	/*
	 * We need look no further than the maximum address a fragment
	 * could possibly have landed at. Also cast things to loff_t to
	 * prevent overflows and make comparisons vs. equal-width types.
	 */
	size = PAGE_ALIGN(size);
	while (vma && (loff_t)(vma->vm_end - addr) <= size) {
		next = vma->vm_next;

		/* finding a matching vma now does not alter retval */
		if ((vma->vm_ops == &shm_vm_ops) &&
		    ((vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) &&
		    (vma->vm_file == file))
			do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start);
		vma = next;
	}

#else	/* CONFIG_MMU */
	/* under NOMMU conditions, the exact address to be destroyed must be
	 * given
	 */
	if (vma && vma->vm_start == addr && vma->vm_ops == &shm_vm_ops) {
		do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start);
		retval = 0;
	}

#endif

	up_write(&mm->mmap_sem);
	return retval;
}
Next is shmdt(). This function is very simple: find the vma (the virtual-memory data structure) corresponding to the shmaddr passed in, check that its address is correct, and call do_munmap() to detach from the shared memory. Note that this does not destroy the segment, not even when no process is attached to it; only an explicit shmctl(id, IPC_RMID, NULL) destroys it.
shmctl() is, on the whole, one big switch statement; most of the commands read information or set flags, so I won't go through them here.