In-Depth Source Code Analysis of Docker Namespace Resource Isolation - Docker in Commercial Environments in Practice

Date: 2019-11-07

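This blog series focuses on demystifying the core technologies of big data and container clouds, and can offer full-stack consulting on big data and cloud-native platforms, so please keep following it. Feel free to get in touch for any technical exchange. For more content, follow the "數據雲技術社區" (Data Cloud Technology Community) official account.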

1 Namespace Overview

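A namespace wraps a global kernel resource so that each namespace has its own independent copy of it. Processes in different namespaces can therefore use the same kind of resource without interfering with one another. In fact, the main reason the Linux kernel implements namespaces is to support lightweight virtualization (container) services.
Processes in the same namespace can see each other's changes but know nothing about processes outside it. This gives the processes in a container the illusion of running in a standalone system environment, which is how independence and isolation are achieved.
The namespace API consists of clone(), setns(), and unshare(), plus some files under /proc. To specify which kinds of namespace to isolate, you usually pass one or more of the following six constants to these APIs, combined with | (bitwise OR): CLONE_NEWIPC, CLONE_NEWNS, CLONE_NEWNET, CLONE_NEWPID, CLONE_NEWUSER, and CLONE_NEWUTS.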
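
As a small illustration of combining these constants, the sketch below maps namespace names to their CLONE_NEW* flags and ORs several of them together. The helper names ns_flag_for() and ns_flags_for() are invented for this example and are not part of any library; the short names follow the entries found under /proc/[pid]/ns.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup table: namespace short name -> CLONE_NEW* flag. */
struct ns_entry {
    const char *name;
    int flag;
};

static const struct ns_entry ns_table[] = {
    { "ipc",  CLONE_NEWIPC  },
    { "mnt",  CLONE_NEWNS   },
    { "net",  CLONE_NEWNET  },
    { "pid",  CLONE_NEWPID  },
    { "user", CLONE_NEWUSER },
    { "uts",  CLONE_NEWUTS  },
};

/* Returns the flag for one namespace name, or -1 if the name is unknown. */
int ns_flag_for(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(ns_table) / sizeof(ns_table[0]); i++) {
        if (strcmp(name, ns_table[i].name) == 0)
            return ns_table[i].flag;
    }
    return -1;
}

/* ORs the flags of several namespaces together; -1 if any name is unknown. */
int ns_flags_for(const char *const names[], int count)
{
    int flags = 0;
    int i;

    for (i = 0; i < count; i++) {
        int flag = ns_flag_for(names[i]);

        if (flag == -1)
            return -1;
        flags |= flag;
    }
    return flags;
}
```

Passing the names "pid" and "net" yields CLONE_NEWPID | CLONE_NEWNET, a value that could then be used as the flags argument of clone() or unshare().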
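Creating a namespace together with a new process via clone()

The six flags isolate the following resources:

IPC (CLONE_NEWIPC): isolates System V IPC and POSIX message queues.
Network (CLONE_NEWNET): isolates network resources.
Mount (CLONE_NEWNS): isolates filesystem mount points. Each container can see a different filesystem hierarchy.
PID (CLONE_NEWPID): isolates process IDs.
UTS (CLONE_NEWUTS): isolates the hostname and domain name.
User (CLONE_NEWUSER): isolates user IDs and group IDs.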

int clone(int (*child_func)(void *), void *child_stack, int flags, void *arg);

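child_func is the main function that the child process runs.
child_stack is the stack space used by the child process.
flags specifies which CLONE_* flag bits are used.
arg can be used to pass user-defined arguments to child_func.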

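clone() is in effect a more general form of the traditional UNIX fork() system call: its flags control how much is shared with or separated from the parent. There are more than twenty CLONE_* flags covering every aspect of the cloned process (for example, whether it shares virtual memory with its parent).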
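
To make the signature concrete, here is a minimal sketch that starts a child with clone() and, when CLONE_NEWUTS is requested, gives it a hostname of its own. The names spawn_with_flags(), child_main(), and struct child_args are invented for this example. Creating a UTS namespace normally requires CAP_SYS_ADMIN, so for an unprivileged user the CLONE_NEWUTS call is expected to fail with EPERM, which the helper reports as a negative errno value.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

struct child_args {
    int new_uts;            /* non-zero if the child has its own UTS namespace */
    const char *hostname;   /* hostname to set inside that namespace */
};

/* Entry point of the child, passed to clone() as child_func. */
static int child_main(void *arg)
{
    struct child_args *a = arg;
    char name[256];

    /* Only touch the hostname when it is private to this namespace. */
    if (a->new_uts && sethostname(a->hostname, strlen(a->hostname)) == -1)
        return 1;
    if (gethostname(name, sizeof(name)) == -1)
        return 1;
    printf("child sees hostname: %s\n", name);
    fflush(stdout);         /* the child exits without flushing stdio */
    return 0;
}

/* Returns the child's exit status, or a negative errno value on failure. */
int spawn_with_flags(int ns_flags, const char *hostname)
{
    struct child_args args = { (ns_flags & CLONE_NEWUTS) != 0, hostname };
    char *stack = malloc(STACK_SIZE);
    int status, err;
    pid_t pid;

    if (stack == NULL)
        return -ENOMEM;

    /* The stack grows downwards, so pass the top of the buffer. */
    pid = clone(child_main, stack + STACK_SIZE, ns_flags | SIGCHLD, &args);
    if (pid == -1) {
        err = errno;
        free(stack);
        return -err;
    }
    if (waitpid(pid, &status, 0) == -1) {
        err = errno;
        free(stack);
        return -err;
    }
    free(stack);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -ECHILD;
}
```

Calling spawn_with_flags(0, "unused") behaves much like fork(), while spawn_with_flags(CLONE_NEWUTS, "demo-container") run as root prints the new hostname without changing the host's own.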

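Joining an existing namespace via setns()

Even after all of its processes have exited, a namespace can be kept alive by bind-mounting its /proc/[pid]/ns file. The point of keeping it is to let other processes join it later. With the setns() system call, the calling process leaves its original namespace and joins the one prepared in advance. It is used as follows: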

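int setns(int fd, int nstype);
fd is a file descriptor for the namespace to join. It refers to one of the links under /proc/[pid]/ns, and can be obtained either by opening that link directly or by opening a file onto which that link has been bind-mounted.

nstype lets the caller check that the namespace referred to by fd is of the expected type. Passing 0 skips the check.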
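
A minimal sketch of that sequence, assuming the target is addressed by PID and namespace name: open the link under /proc/[pid]/ns, then hand the descriptor to setns(). The function name join_namespace() is hypothetical. setns() needs sufficient privileges (typically CAP_SYS_ADMIN), so an unprivileged caller should expect -EPERM here.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Joins the namespace `ns_name` ("uts", "net", ...) of process `pid`.
 * `nstype` is the matching CLONE_NEW* constant, or 0 to skip the type check.
 * Returns 0 on success or a negative errno value.
 */
int join_namespace(pid_t pid, const char *ns_name, int nstype)
{
    char path[64];
    int fd, err;

    snprintf(path, sizeof(path), "/proc/%d/ns/%s", (int)pid, ns_name);

    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -errno;

    if (setns(fd, nstype) == -1) {
        err = errno;
        close(fd);
        return -err;
    }

    /* The process stays in the namespace after the descriptor is closed. */
    close(fd);
    return 0;
}
```

For instance, join_namespace(init_pid, "net", CLONE_NEWNET) would move the caller into the network namespace of the process init_pid, where init_pid stands for a container's init process.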

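Isolating namespaces in the current process via unshare()

The last system call to cover is unshare(). It is very similar to clone(); the difference is that unshare() operates on the calling process and does not start a new one. It is used as follows: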

int unshare(int flags);
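The main purpose of unshare() is to get isolation without starting a new process: in effect, the process steps out of its original namespace and carries on in a new one. This lets you perform operations that need isolation from within the original process. The unshare command that ships with Linux is implemented on top of the unshare() system call.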
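
A minimal sketch of the same idea in code: the process detaches itself into a new UTS namespace and only then changes the hostname, so the host's hostname is left alone. The function name isolate_hostname() is invented for this example. The call normally needs CAP_SYS_ADMIN and otherwise fails with EPERM.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <string.h>
#include <unistd.h>

/*
 * Moves the calling process into a fresh UTS namespace and sets `name`
 * as the hostname there. Returns 0 on success or a negative errno value.
 */
int isolate_hostname(const char *name)
{
    /* No new process is created; the caller itself changes namespace. */
    if (unshare(CLONE_NEWUTS) == -1)
        return -errno;

    /* Reached only inside the new namespace, so the host is unaffected. */
    if (sethostname(name, strlen(name)) == -1)
        return -errno;

    return 0;
}
```

This is roughly what running `unshare --uts` followed by `hostname sandbox` does from a shell.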

The Docker source code below shows how namespaces are created.

2 Namespace Source Code Execution Flow

2.1 Container Object Creation Phase

For the detailed flow, see "Docker Source Code Analysis" (《Docker源碼解析》).

startContainer() => createContainer() => loadFactory() => libcontainer.New() 

2.2 Container Object Run Phase (nsexec)

The overall flow is as follows:

startContainer() => runner.run() => newProcess() => runner.container.Run(process)
=> linuxContainer.start() => linuxContainer.newParentProcess(process)
=> linuxContainer.commandTemplate() => linuxContainer.newInitProcess() => parent.start()
=> initProcess.start()

linuxContainer.start()
It first calls newParentProcess() to produce an initProcess, which carries the namespace-related settings for the container's process, such as uid/gid, pid, uts, ns, and cgroup.

func (c *linuxContainer) newParentProcess(p *Process, doInit bool) (parentProcess, error) {
	parentPipe, childPipe, err := newPipe()
	if err != nil {
		return nil, newSystemError(err)
	}
	cmd, err := c.commandTemplate(p, childPipe)
	if err != nil {
		return nil, newSystemError(err)
	}
	if !doInit {
		return c.newSetnsProcess(p, cmd, parentPipe, childPipe), nil
	}
	return c.newInitProcess(p, cmd, parentPipe, childPipe)
}
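
newParentProcess() creates a pipe pair and hands the child end to the command it launches. The C sketch below imitates such a parent/child channel with socketpair() and fork(): the child reports a PID in the same JSON shape that nsexec() writes in section 3, and the parent parses it. This is only an illustration; the excerpt does not show how newPipe() is built, so the use of socketpair() is an assumption, and pid_handshake_ok() is a made-up name. Unlike nsexec(), the child here simply reports its own PID.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the PID received over the pipe matches the forked child. */
int pid_handshake_ok(void)
{
    int sv[2];              /* sv[0]: parent end, sv[1]: child end */
    char buf[64];
    int len, reported = -1;
    ssize_t n;
    pid_t child;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return 0;

    child = fork();
    if (child == -1) {
        close(sv[0]);
        close(sv[1]);
        return 0;
    }

    if (child == 0) {
        /* Child: send a small JSON message through its end of the pipe. */
        close(sv[0]);
        len = snprintf(buf, sizeof(buf), "{ \"pid\" : %d }\n", (int)getpid());
        if (write(sv[1], buf, len) != len)
            _exit(1);
        _exit(0);
    }

    /* Parent: read the message from its own end and parse the PID. */
    close(sv[1]);
    n = read(sv[0], buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        if (sscanf(buf, "{ \"pid\" : %d }", &reported) != 1)
            reported = -1;
    }
    close(sv[0]);
    waitpid(child, NULL, 0);

    return reported == (int)child;
}
```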

Namespace-related configuration when creating the container's init process (newInitProcess)
initProcess.start()
initProcess.start() performs the container's initial setup; the cmd.Start() call here actually executes the runC init command.
Init() completes the container's initialization (network/routes, rootfs, SELinux, console, hostname, AppArmor, sysctl, seccomp, capabilities, and so on).

func (l *LinuxFactory) StartInitialization() (err error) {
	//...
	i, err := newContainerInit(it, pipe, consoleSocket, fifofd)
	//...
	// Call Init() on the initer returned by newContainerInit(); for a standard container this is linuxStandardInit.Init()
	return i.Init()
}

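func (l *linuxStandardInit) Init() error {
	//...
	// Configure the network
	// Configure routes
	// Configure SELinux
	// Prepare the rootfs
	// Configure the console
	// Finalize the rootfs
	// Set the hostname
	// Apply the AppArmor profile
	// Apply sysctl settings
	// Configure read-only paths
	// Tell the runC process that our initialization work is done
	// Set the process label
	// Configure seccomp
	// Set the correct capabilities, user, and working directory
	// Resolve the path of the user-specified container process within the container filesystem
	// Close the pipe to tell the runC process that our initialization work is done
	// Before exec'ing the user process, wait for the other end of the exec.fifo pipe to be opened
	// We open it via /proc/self/fd/$fd
	// ......
	// Write to the exec.fifo pipe; the init process stays blocked until the user
	// runs `runc start`, which reads the data from the pipe.
	// Only then does the code below continue.
	if _, err := unix.Write(fd, []byte("0")); err != nil {
		return newSystemErrorWithCause(err, "write 0 exec fifo")
	}
	// Close the fifofd pipe (fix for CVE-2016-9962)
	// Initialize the seccomp configuration
	// Call the exec() system call to run the entrypoint
	if err := syscall.Exec(name, l.config.Args[0:], os.Environ()); err != nil {
		return newSystemErrorWithCause(err, "exec user process")
	}
	return nil
}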
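
The exec.fifo step in Init() can be pictured with an ordinary FIFO: the writing side cannot proceed until some other process opens the FIFO for reading. The C sketch below is a simplified model and not runC's code: fork() stands in for the separate `runc start` invocation, and the /tmp path and the name fifo_handshake() are assumptions made for the example.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the byte received through the FIFO ('0'), or -1 on error. */
int fifo_handshake(void)
{
    char dir[] = "/tmp/fifo-demo-XXXXXX";
    char path[64];
    char byte = 0;
    int fd, result = -1;
    pid_t pid;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(path, sizeof(path), "%s/exec.fifo", dir);
    if (mkfifo(path, 0600) == -1) {
        rmdir(dir);
        return -1;
    }

    pid = fork();
    if (pid == 0) {
        /* "init" side: open() blocks until a reader opens the FIFO. */
        int wfd = open(path, O_WRONLY);

        if (wfd == -1 || write(wfd, "0", 1) != 1)
            _exit(1);
        close(wfd);
        /* A real init process would now exec the user's entrypoint. */
        _exit(0);
    }

    if (pid > 0) {
        /* "start" side: opening and reading the FIFO releases the writer. */
        fd = open(path, O_RDONLY);
        if (fd == -1) {
            kill(pid, SIGKILL);     /* do not leave the writer blocked */
        } else {
            if (read(fd, &byte, 1) == 1)
                result = byte;
            close(fd);
        }
        waitpid(pid, NULL, 0);
    }

    unlink(path);
    rmdir(dir);
    return result;
}
```

The design gives the creating side a well-defined pause: everything is set up and the process waits at the FIFO until another process explicitly releases it.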



3 nsenter Source Code Execution Flow

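nsexec() holds the main logic of nsenter, and all namespace setup is completed inside this function. It joins each namespace of the init process with setns() and then calls clone_parent(), the fork on which namespace entry ultimately rests: the forked child is the process that actually ends up inside the PID namespace.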

void nsexec()
{
	char *namespaces[] = { "ipc", "uts", "net", "pid", "mnt" };
	const int num = sizeof(namespaces) / sizeof(char *);
	jmp_buf env;
	char buf[PATH_MAX], *val;
	int i, tfd, child, len, pipenum, consolefd = -1;
	pid_t pid;
	char *console;

	val = getenv("_LIBCONTAINER_INITPID");
	if (val == NULL)
		return;

	pid = atoi(val);
	snprintf(buf, sizeof(buf), "%d", pid);
	if (strcmp(val, buf)) {
		pr_perror("Unable to parse _LIBCONTAINER_INITPID");
		exit(1);
	}

	val = getenv("_LIBCONTAINER_INITPIPE");
	if (val == NULL) {
		pr_perror("Child pipe not found");
		exit(1);
	}

	pipenum = atoi(val);
	snprintf(buf, sizeof(buf), "%d", pipenum);
	if (strcmp(val, buf)) {
		pr_perror("Unable to parse _LIBCONTAINER_INITPIPE");
		exit(1);
	}

	console = getenv("_LIBCONTAINER_CONSOLE_PATH");
	if (console != NULL) {
		consolefd = open(console, O_RDWR);
		if (consolefd < 0) {
			pr_perror("Failed to open console %s", console);
			exit(1);
		}
	}

	/* Check that the specified process exists */
	snprintf(buf, PATH_MAX - 1, "/proc/%d/ns", pid);
	tfd = open(buf, O_DIRECTORY | O_RDONLY);
	if (tfd == -1) {
		pr_perror("Failed to open \"%s\"", buf);
		exit(1);
	}

	for (i = 0; i < num; i++) {
		struct stat st;
		int fd;

		/* Symlinks on all namespaces exist for dead processes, but they can't be opened */
		if (fstatat(tfd, namespaces[i], &st, AT_SYMLINK_NOFOLLOW) == -1) {
			// Ignore nonexistent namespaces.
			if (errno == ENOENT)
				continue;
		}

		fd = openat(tfd, namespaces[i], O_RDONLY);
		if (fd == -1) {
			pr_perror("Failed to open ns file %s for ns %s", buf, namespaces[i]);
			exit(1);
		}
		// Set the namespace.
		if (setns(fd, 0) == -1) {
			pr_perror("Failed to setns for %s", namespaces[i]);
			exit(1);
		}
		close(fd);
	}

	if (setjmp(env) == 1) {
		// Child
		if (setsid() == -1) {
			pr_perror("setsid failed");
			exit(1);
		}
		if (consolefd != -1) {
			if (ioctl(consolefd, TIOCSCTTY, 0) == -1) {
				pr_perror("ioctl TIOCSCTTY failed");
				exit(1);
			}
			if (dup3(consolefd, STDIN_FILENO, 0) != STDIN_FILENO) {
				pr_perror("Failed to dup 0");
				exit(1);
			}
			if (dup3(consolefd, STDOUT_FILENO, 0) != STDOUT_FILENO) {
				pr_perror("Failed to dup 1");
				exit(1);
			}
			if (dup3(consolefd, STDERR_FILENO, 0) != STDERR_FILENO) {
				pr_perror("Failed to dup 2");
				exit(1);
			}
		}
		// Finish executing, let the Go runtime take over.
		return;
	}

	// Parent
	// We must fork to actually enter the PID namespace, use CLONE_PARENT
	// so the child can have the right parent, and we don't need to forward
	// the child's exit code or resend its death signal.
	child = clone_parent(&env);
	if (child < 0) {
		pr_perror("Unable to fork");
		exit(1);
	}

	len = snprintf(buf, sizeof(buf), "{ \"pid\" : %d }\n", child);
	if (write(pipenum, buf, len) != len) {
		pr_perror("Unable to send a child pid");
		kill(child, SIGKILL);
		exit(1);
	}

	exit(0);
}
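
nsexec() validates _LIBCONTAINER_INITPID and _LIBCONTAINER_INITPIPE with the same small trick: convert the string with atoi(), print the result back with snprintf(), and reject the value unless the two strings are identical. The sketch below isolates that check in a helper; the name parse_canonical_int() is invented for this example and does not appear in the original source.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parses `val` as a decimal integer using the round-trip check from nsexec().
 * Returns 0 and stores the number in *out on success, or -1 if `val` is NULL
 * or is not the canonical spelling of an integer ("42abc", "007", "").
 */
int parse_canonical_int(const char *val, int *out)
{
    char buf[32];
    int n;

    if (val == NULL)
        return -1;

    n = atoi(val);
    snprintf(buf, sizeof(buf), "%d", n);

    /* Any trailing garbage or leading zeros make the strings differ. */
    if (strcmp(val, buf) != 0)
        return -1;

    *out = n;
    return 0;
}
```

A caller would pass the result of getenv() directly, for example parse_canonical_int(getenv("_LIBCONTAINER_INITPID"), &pid), and treat -1 as either "not set" or "malformed".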