Linux Namespaces Primer Series: The Namespace API

Date: 2020-03-30


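Linux namespaces are a kernel-level mechanism for isolating environments. In the words of the official documentation, a namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the resource. The technology drew little attention at first; it was the rise of container technology that brought it back into the spotlight.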

This article covers the following six types of Linux namespaces:

Type	System call flag	Kernel version
Mount namespaces	CLONE_NEWNS	Linux 2.4.19
UTS namespaces	CLONE_NEWUTS	Linux 2.6.19
IPC namespaces	CLONE_NEWIPC	Linux 2.6.19
PID namespaces	CLONE_NEWPID	Linux 2.6.24
Network namespaces	CLONE_NEWNET	Started in Linux 2.6.24, completed in Linux 2.6.29
User namespaces	CLONE_NEWUSER	Started in Linux 2.6.23, completed in Linux 3.8

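The namespace API consists of three system calls and a set of /proc files, all of which this article covers in detail. The namespace type to operate on is selected by passing CLONE_NEW* constants in the flags of the system call (CLONE_NEWIPC, CLONE_NEWNS, CLONE_NEWNET, CLONE_NEWPID, CLONE_NEWUSER, and CLONE_NEWUTS). Several constants can be combined with | (bitwise OR).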

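Briefly, the three system calls do the following:

clone() : The system call on which thread creation is built. It creates a new process, and the flags listed above place that process in new namespaces.
unshare() : Detaches the calling process from a namespace by moving it into a newly created one.
setns() : Joins the calling process to an existing namespace.

The details follow.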

1. clone()

The prototype of clone() is:

int clone(int (*child_func)(void *), void *child_stack, int flags, void *arg);

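child_func : The main function that the child process runs.
child_stack : The stack space for the child process.
flags : The CLONE_* flag bits to use.
arg : A user argument passed to child_func.

clone() resembles fork() in that both effectively duplicate the calling process, but clone() gives finer-grained control (through flags) over which resources are shared with the child, including virtual memory, open file descriptors, and signal dispositions. When a CLONE_NEW* flag is specified, a new namespace of the corresponding type is created and the new process becomes a member of it.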

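The clone() prototype above is not the raw system call but a library wrapper. Inside the kernel, the work was done at the time by the function do_fork(), which looks like this: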

long do_fork(unsigned long clone_flags,
	      unsigned long stack_start,
	      unsigned long stack_size,
	      int __user *parent_tidptr,
	      int __user *child_tidptr)

Here clone_flags can be set to the flags mentioned above.

Here is an example:

/* demo_uts_namespaces.c

   Copyright 2013, Michael Kerrisk
   Licensed under GNU General Public License v2 or later

   Demonstrate the operation of UTS namespaces.
*/
#define _GNU_SOURCE
#include <sys/wait.h>
#include <sys/utsname.h>
#include <sched.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* A simple error-handling function: print an error message based
   on the value in 'errno' and terminate the calling process */

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

static int              /* Start function for cloned child */
childFunc(void *arg)
{
    struct utsname uts;

    /* Change the hostname in the new UTS namespace */

    if (sethostname(arg, strlen(arg)) == -1)
        errExit("sethostname");

    /* Retrieve and display the hostname */

    if (uname(&uts) == -1)
        errExit("uname");
    printf("uts.nodename in child:  %s\n", uts.nodename);

    /* Keep the namespace open for a while, by sleeping.
       This allows some experimentation--for example, another
       process might join the namespace. */
     
    sleep(100);

    return 0;           /* Terminates child */
}

/* Stack for the cloned child; 1 MiB in size */
#define STACK_SIZE (1024 * 1024) 

static char child_stack[STACK_SIZE];

int
main(int argc, char *argv[])
{
    pid_t child_pid;
    struct utsname uts;

    if (argc < 2) {
        fprintf(stderr, "Usage: %s <child-hostname>\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    /* Call clone() to create the child in a new UTS namespace. We pass a
       function and a stack; the child starts executing in childFunc(). */

    child_pid = clone(childFunc,
                    child_stack + STACK_SIZE,   /* The stack grows downward,
                                                   so pass its top address */
                    CLONE_NEWUTS | SIGCHLD, argv[1]);
    if (child_pid == -1)
        errExit("clone");
    printf("PID of child created by clone() is %ld\n", (long) child_pid);

    /* Parent falls through to here */

    sleep(1);           /* Give the child time to change its hostname */

    /* Display the hostname in the parent's UTS namespace; it differs
       from the hostname in the child's UTS namespace */

    if (uname(&uts) == -1)
        errExit("uname");
    printf("uts.nodename in parent: %s\n", uts.nodename);

    if (waitpid(child_pid, NULL, 0) == -1)      /* Wait for the child to terminate */
        errExit("waitpid");
    printf("child has terminated\n");

    exit(EXIT_SUCCESS);
}

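This program calls clone() with the CLONE_NEWUTS flag to create a UTS namespace. UTS namespaces isolate two system identifiers, the hostname and the NIS domain name, which are set with the sethostname() and setdomainname() system calls and retrieved with uname().

The key parts of the program are explained below (for brevity, error checking is omitted from the excerpts).

The program takes one command-line argument. It creates a child process that runs in a new UTS namespace and, inside that namespace, changes the hostname to the value given on the command line.

The first key part of the main program is the clone() call that creates the child: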

child_pid = clone(childFunc, 
                  child_stack + STACK_SIZE,   /* Points to start of 
                                                 downwardly growing stack */ 
                  CLONE_NEWUTS | SIGCHLD, argv[1]);

printf("PID of child created by clone() is %ld\n", (long) child_pid);

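The child begins execution in the user-defined function childFunc(), which receives the last clone() argument (argv[1]) as its own argument. Because the flags include CLONE_NEWUTS, the child runs in a newly created UTS namespace.

The parent then sleeps briefly to give the child time to change the hostname in its UTS namespace, after which it calls uname() to retrieve the hostname in its own UTS namespace and prints it: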

sleep(1);           /* Give child time to change its hostname */

uname(&uts);
printf("uts.nodename in parent: %s\n", uts.nodename);

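Meanwhile, childFunc(), executed by the child that clone() created, first changes the hostname to the value supplied on the command line, then retrieves and prints the new hostname:

sethostname(arg, strlen(arg));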
    
uname(&uts);
printf("uts.nodename in child:  %s\n", uts.nodename);

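Before exiting, the child also sleeps for a while. This keeps the new UTS namespace alive and gives us a chance to run the follow-up experiments.

Run the program and check that the parent and child are in different UTS namespaces:

$ su                   # Privilege is required to create a UTS namespace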
Password: 
# uname -n
antero
# ./demo_uts_namespaces bizarro
PID of child created by clone() is 27514
uts.nodename in child:  bizarro
uts.nodename in parent: antero

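Creating any namespace other than a user namespace requires privilege, more precisely the CAP_SYS_ADMIN Linux capability. This prevents set-user-ID (SUID) programs from doing something foolish because the hostname is not what they expect. If you are not familiar with Linux capabilities, see my earlier article: Linux Capabilities Primer: Concepts.

2. The /proc files

Every process has a /proc/PID/ns directory containing one file per namespace type; for example, user represents the user namespace. Since kernel 3.8, each file in this directory is a special symbolic link whose target has the form $namespace:[$namespace-inode-number]. The first part is the namespace type and the number is the inode number that identifies the namespace. The link also serves as a handle for performing certain operations on the namespace the process belongs to.

$ ls -l /proc/$$/ns         # $$ is the PID of the current shell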
total 0
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 ipc -> ipc:[4026531839]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 mnt -> mnt:[4026531840]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 net -> net:[4026531956]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 pid -> pid:[4026531836]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 user -> user:[4026531837]
lrwxrwxrwx. 1 mtk mtk 0 Jan  8 04:12 uts -> uts:[4026531838]

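One use of these symbolic links is to determine whether two processes are in the same namespace. If the links of both processes show the same namespace inode number, the processes are in the same namespace; otherwise they are in different ones. The files these links point to are special and cannot be accessed directly. They live in a filesystem called nsfs that is not visible to users. The inode number can be obtained with the stat() system call, in the st_ino field of the returned structure. From a shell, the stat command (which calls stat() under the hood) shows the inode information of the target file: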

$ stat -L /proc/$$/ns/net
  File: /proc/3232/ns/net
  Size: 0         	Blocks: 0          IO Block: 4096   regular empty file
Device: 4h/4d	Inode: 4026531956  Links: 1
Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-01-17 15:45:23.783304900 +0800
Modify: 2020-01-17 15:45:23.783304900 +0800
Change: 2020-01-17 15:45:23.783304900 +0800
 Birth: -

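These symbolic links have another use. If we open one of these files, the namespace continues to exist for as long as the file descriptor stays open, even after all processes in the namespace have terminated. Bind mounting the symbolic link somewhere else in the filesystem has the same effect: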

$ touch ~/uts
$ mount --bind /proc/27514/ns/uts ~/uts

3. setns()

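A process can join an existing namespace with the setns() system call. Its prototype is:

int setns(int fd, int nstype);

More precisely, setns() disassociates the calling process from one instance of a particular namespace type and reassociates it with another instance of the same type.

fd is a file descriptor referring to the namespace to join. It can be obtained by opening one of the symbolic links in /proc/PID/ns, or by opening a file onto which one of those links has been bind mounted.
nstype lets the caller check the type of the namespace that fd refers to. It can be set to one of the CLONE_NEW* constants mentioned earlier, in which case the call fails unless fd refers to a namespace of that type. A value of 0 disables the check, which is appropriate when the caller already knows the namespace type or does not care about it.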

Combining setns() with execve() gives a simple but very useful tool: a program that joins a specified namespace and then executes a command inside it. Here is the example:

/* ns_exec.c 

   Copyright 2013, Michael Kerrisk
   Licensed under GNU General Public License v2 or later

   Join a namespace and execute a command in the namespace
*/
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

/* A simple error-handling function: print an error message based
   on the value in 'errno' and terminate the calling process */

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

int
main(int argc, char *argv[])
{
    int fd;

    if (argc < 3) {
        fprintf(stderr, "%s /proc/PID/ns/FILE cmd [arg...]\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    fd = open(argv[1], O_RDONLY);   /* Get a descriptor for the namespace to join */
    if (fd == -1)
        errExit("open");

    if (setns(fd, 0) == -1)         /* Join that namespace */
        errExit("setns");

    execvp(argv[2], &argv[2]);      /* Execute the command in the joined namespace */
    errExit("execvp");
}

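The program takes two or more command-line arguments. The first is the path of a namespace symbolic link (or of a file onto which such a link has been bind mounted). The remaining arguments are the name of the program to run inside that namespace and the command-line arguments to pass to it. The key steps are:

fd = open(argv[1], O_RDONLY);   /* Get a descriptor for the namespace to join */

setns(fd, 0);                   /* Join that namespace */

execvp(argv[2], &argv[2]);      /* Execute the command in the joined namespace */

Recall that we bind mounted the UTS namespace created by demo_uts_namespaces onto ~/uts. Combining that with this program lets us start a shell in that UTS namespace:

$ ./ns_exec ~/uts /bin/bash     # ~/uts is bind mounted to /proc/27514/ns/uts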
    My PID is: 28788

Verify that the new shell is in the same UTS namespace as the child created by demo_uts_namespaces:

$ hostname
bizarro
$ readlink /proc/27514/ns/uts
uts:[4026532338]
$ readlink /proc/$$/ns/uts      # $$ is the PID of the current shell
uts:[4026532338]

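In early kernel versions, setns() could not be used to join mount, PID, or user namespaces. Since kernel 3.8, setns() can join every type of namespace.

The util-linux package provides the nsenter command, which runs a newly created process in specified namespaces. Its implementation is simple: the -t option names the PID of the target process whose namespaces are to be entered, nsenter opens the corresponding symbolic links under /proc/PID/ns, uses setns() to move the current process into those namespaces, and then calls clone() to run the specified executable. We can watch this with strace: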

# strace nsenter -t 27242 -i -m -n -p -u /bin/bash
execve("/usr/bin/nsenter", ["nsenter", "-t", "27242", "-i", "-m", "-n", "-p", "-u", "/bin/bash"], [/* 21 vars */]) = 0
…………
…………
open("/proc/27242/ns/ipc", O_RDONLY)    = 3
open("/proc/27242/ns/uts", O_RDONLY)    = 4
open("/proc/27242/ns/net", O_RDONLY)    = 5
open("/proc/27242/ns/pid", O_RDONLY)    = 6
open("/proc/27242/ns/mnt", O_RDONLY)    = 7
setns(3, CLONE_NEWIPC)                  = 0
close(3)                                = 0
setns(4, CLONE_NEWUTS)                  = 0
close(4)                                = 0
setns(5, CLONE_NEWNET)                  = 0
close(5)                                = 0
setns(6, CLONE_NEWPID)                  = 0
close(6)                                = 0
setns(7, CLONE_NEWNS)                   = 0
close(7)                                = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f4deb1faad0) = 4968

4. unshare()

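The last system call to cover is unshare(). Its prototype is:

int unshare(int flags);

unshare() is similar to clone(), except that it operates on the calling process and does not create a new one: it creates the new namespaces selected by the CLONE_NEW* bits in flags and makes the caller a member of them. The net effect is that the caller leaves its current namespace and joins a new one.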

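The unshare command that ships with Linux distributions is built on the unshare() system call. It is used like this:

$ unshare [options] program [arguments]

options specifies the types of namespaces to create.

The core of the unshare command is:

/* Initialize 'flags' from the supplied command-line options */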

unshare(flags);

/* Now execute 'program' with 'arguments'; 'optind' is the index
   of the next command-line argument after options */

execvp(argv[optind], &argv[optind]);

A complete implementation of the unshare command follows:

/* unshare.c 

   Copyright 2013, Michael Kerrisk
   Licensed under GNU General Public License v2 or later

   A simple implementation of the unshare(1) command: unshare
   namespaces and execute a command.
*/

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

/* A simple error-handling function: print an error message based
   on the value in 'errno' and terminate the calling process */

#define errExit(msg)    do { perror(msg); exit(EXIT_FAILURE); \
                        } while (0)

static void
usage(char *pname)
{
    fprintf(stderr, "Usage: %s [options] program [arg...]\n", pname);
    fprintf(stderr, "Options can be:\n");
    fprintf(stderr, "    -i   unshare IPC namespace\n");
    fprintf(stderr, "    -m   unshare mount namespace\n");
    fprintf(stderr, "    -n   unshare network namespace\n");
    fprintf(stderr, "    -p   unshare PID namespace\n");
    fprintf(stderr, "    -u   unshare UTS namespace\n");
    fprintf(stderr, "    -U   unshare user namespace\n");
    exit(EXIT_FAILURE);
}

int
main(int argc, char *argv[])
{
    int flags, opt;

    flags = 0;

    while ((opt = getopt(argc, argv, "imnpuU")) != -1) {
        switch (opt) {
        case 'i': flags |= CLONE_NEWIPC;        break;
        case 'm': flags |= CLONE_NEWNS;         break;
        case 'n': flags |= CLONE_NEWNET;        break;
        case 'p': flags |= CLONE_NEWPID;        break;
        case 'u': flags |= CLONE_NEWUTS;        break;
        case 'U': flags |= CLONE_NEWUSER;       break;
        default:  usage(argv[0]);
        }
    }

    if (optind >= argc)
        usage(argv[0]);

    if (unshare(flags) == -1)
        errExit("unshare");

    execvp(argv[optind], &argv[optind]);  
    errExit("execvp");
}

Now we use the unshare.c program to run a shell in a new mount namespace:

$ echo $$                             # Show the PID of the current shell
8490
$ cat /proc/8490/mounts | grep mq     # Show a mount point in the current namespace
mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0
$ readlink /proc/8490/ns/mnt          # Show the ID of the current namespace
mnt:[4026531840]
$ ./unshare -m /bin/bash              # Run a new shell in a newly created mount namespace
$ readlink /proc/$$/ns/mnt            # Show the ID of the new namespace
mnt:[4026532325]

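Comparing the output of the two readlink commands shows that the two shells are in different mount namespaces. Next, change a mount point in the new namespace and see whether the mount points of the two namespaces differ:

$ umount /dev/mqueue                  # Remove the mount point in the new namespace
$ cat /proc/$$/mounts | grep mq       # Check that it took effect
$ cat /proc/8490/mounts | grep mq     # Is the mount point still present in the original namespace?
mqueue /dev/mqueue mqueue rw,seclabel,relatime 0 0

As shown, the /dev/mqueue mount point has disappeared from the new namespace but still exists in the original one.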