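To understand Docker, it helps to look at namespaces, cgroups, union file systems, the runtime (runC), and networking. We will spend some time on each of these in turn.
Namespaces provide isolation; cgroups enforce resource limits; union file systems handle layered image storage and management; runC is the runtime, which follows the OCI specification and is generally built on libcontainer. Networking covers Docker's single-host networking and its multi-host communication modes.
RunC is a lightweight tool for running containers. It does that one thing and aims to do it well. You can think of it as a small command-line tool that runs containers directly, without going through the Docker engine. In fact, runC is a product of standardization: it creates and runs containers according to the OCI standard. The OCI (Open Container Initiative) aims to define open industry standards around container formats and runtimes.
OCI was founded in 2015 by Docker, CoreOS, and other container companies. It currently maintains two main specification documents: the container runtime specification (runtime spec) and the container image specification (image spec).
runC is written in Go and built on the libcontainer library. It has been part of the Docker architecture since Docker 1.11.
[Figure: Docker architecture from version 1.11 onward]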
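runc currently supports Linux on various architectures. It must be built with Go 1.6 or later for some features to work properly.
To enable seccomp support, you need libseccomp installed on your platform,
e.g. libseccomp-devel for CentOS, or libseccomp-dev for Ubuntu.
Otherwise, if you do not want to build runc with seccomp support, add BUILDTAGS="" when running make.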
# create a 'github.com/opencontainers' in your GOPATH/src
cd github.com/opencontainers
git clone https://github.com/opencontainers/runc
cd runc
make
sudo make install
runc supports optional build tags that compile in support for various features. To pass build tags to make, set the BUILDTAGS variable.
make BUILDTAGS='seccomp apparmor'
Build Tag | Feature | Dependency |
---|---|---|
seccomp | Syscall filtering | libseccomp |
selinux | selinux process and mount labeling | <none> |
apparmor | apparmor profile support | <none> |
ambient | ambient capability support | kernel 4.3 |
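To use runc, your container must be in the format of an OCI bundle. If you have Docker installed, you can use its export command to obtain a root filesystem from an existing Docker container.
# create the top most bundle directory
mkdir /mycontainer
cd /mycontainer
# create the rootfs directory
mkdir rootfs
# export busybox via Docker into the rootfs directory
docker export $(docker create busybox) | tar -C rootfs -xvf -
runc provides a spec command that generates a basic template specification, which you can then edit.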
runc spec
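First, prepare a working directory, for example mycontainer. All of the operations below are carried out inside it:
# mkdir mycontainer
# cd mycontainer
Next, prepare the filesystem for the container image. Here we extract it from a Docker image:
# mkdir rootfs
# docker export $(docker create busybox) | tar -C rootfs -xvf -
# ls rootfs
bin dev etc home proc root sys tmp usr var
With rootfs in place, the OCI standard also requires a configuration file, config.json, that describes how to run the container: the command to run, permissions, environment variables, and so on. runc provides a command that generates it for us:
# runc spec
# ls
config.json rootfs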
This gives us the contents of an OCI runtime bundle. The bundle is very simple and consists of just the two items above: the config.json file and the rootfs filesystem. config.json is long, so it is not reproduced here; we will not modify it and simply use the generated default. With this information, runc knows how to run the container. Let's start with the simplest approach, runc run (this command requires root privileges). It is similar to docker run: it creates and starts a container:
# runc run simplebusybox
/ # ls
bin dev etc home proc root sys tmp usr var
/ # hostname
runc
/ # whoami
root
/ # pwd
/
/ # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
/ # ps aux
PID   USER     TIME   COMMAND
    1 root       0:00 sh
   11 root       0:00 ps aux
Now open another terminal to inspect the running container:
# runc list
ID              PID     STATUS    BUNDLE                                    CREATED                          OWNER
simplebusybox   18073   running   /home/cizixs/Workspace/runc/mycontainer   2017-11-02T06:54:52.023379345Z   root
Overall, the runC code is fairly simple. It relies mainly on the github.com/urfave/cli library to implement a set of commands:
app.Commands = []cli.Command{
	checkpointCommand,
	createCommand,
	deleteCommand,
	eventsCommand,
	execCommand,
	initCommand,
	killCommand,
	listCommand,
	pauseCommand,
	psCommand,
	restoreCommand,
	resumeCommand,
	runCommand,
	specCommand,
	startCommand,
	stateCommand,
	updateCommand,
}
Anyone who knows the docker commands will find these familiar.
Underneath, these commands call the libcontainer library to do the actual work.
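The registration pattern in app.Commands can be sketched with the standard library alone. The command type and dispatch function below are simplified stand-ins invented for this illustration; they are not the urfave/cli API, which also handles flags, help text, and argument parsing.

```go
package main

import (
	"fmt"
	"os"
)

// command is a simplified stand-in for cli.Command (an assumption made for
// this sketch): a name, a usage string, and an action to run.
type command struct {
	Name   string
	Usage  string
	Action func(args []string) error
}

// commands mirrors the shape of app.Commands with two toy entries.
var commands = []command{
	{
		Name:  "create",
		Usage: "create a container",
		Action: func(args []string) error {
			// Like the real create command, insist on exactly one argument.
			if len(args) != 1 {
				return fmt.Errorf("create: expected exactly 1 argument, got %d", len(args))
			}
			fmt.Println("would create container", args[0])
			return nil
		},
	},
	{
		Name:  "list",
		Usage: "list containers",
		Action: func(args []string) error {
			fmt.Println("would list containers")
			return nil
		},
	},
}

// dispatch looks up a command by name and runs its action.
func dispatch(name string, args []string) error {
	for _, c := range commands {
		if c.Name == name {
			return c.Action(args)
		}
	}
	return fmt.Errorf("unknown command %q", name)
}

func main() {
	if len(os.Args) < 2 {
		for _, c := range commands {
			fmt.Printf("%-8s %s\n", c.Name, c.Usage)
		}
		return
	}
	if err := dispatch(os.Args[1], os.Args[2:]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Keeping each command as a value in a slice means adding a subcommand is a one-line change to the table, which is roughly what the runC listing above does with its seventeen entries.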
Take the create command as an example:
var createCommand = cli.Command{
	Name:  "create",
	Usage: "create a container",
	ArgsUsage: `<container-id>

Where "<container-id>" is your name for the instance of the container that you
are starting. The name you provide for the container instance must be unique on
your host.`,
	Description: `The create command creates an instance of a container for a bundle. The bundle
is a directory with a specification file named "` + specConfig + `" and a root
filesystem.

The specification file includes an args parameter. The args parameter is used
to specify command(s) that get run when the container is started. To change the
command(s) that get executed on start, edit the args parameter of the spec. See
"runc spec --help" for more explanation.`,
	Flags: []cli.Flag{
		cli.StringFlag{
			Name:  "bundle, b",
			Value: "",
			Usage: `path to the root of the bundle directory, defaults to the current directory`,
		},
		cli.StringFlag{
			Name:  "console-socket",
			Value: "",
			Usage: "path to an AF_UNIX socket which will receive a file descriptor referencing the master end of the console's pseudoterminal",
		},
		cli.StringFlag{
			Name:  "pid-file",
			Value: "",
			Usage: "specify the file to write the process id to",
		},
		cli.BoolFlag{
			Name:  "no-pivot",
			Usage: "do not use pivot root to jail process inside rootfs. This should be used whenever the rootfs is on top of a ramdisk",
		},
		cli.BoolFlag{
			Name:  "no-new-keyring",
			Usage: "do not create a new session keyring for the container. This will cause the container to inherit the calling processes session key",
		},
		cli.IntFlag{
			Name:  "preserve-fds",
			Usage: "Pass N additional file descriptors to the container (stdio + $LISTEN_FDS + N in total)",
		},
	},
	Action: func(context *cli.Context) error {
		if err := checkArgs(context, 1, exactArgs); err != nil {
			return err
		}
		if err := revisePidFile(context); err != nil {
			return err
		}
		spec, err := setupSpec(context)
		if err != nil {
			return err
		}
		status, err := startContainer(context, spec, CT_ACT_CREATE, nil)
		if err != nil {
			return err
		}
		// exit with the container's exit status so any external supervisor is
		// notified of the exit with the correct exit status.
		os.Exit(status)
		return nil
	},
}
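For a deeper understanding, most of what you need to study is in libcontainer.
A few important files are worth understanding, among them factory.go, process.go, and init_linux.go, which come up again below.
We will dissect these files by walking through how a container is created.
First, spec, err := setupSpec(context) loads the contents of the configuration file config.json. This is where the OCI bundle described earlier comes in.
spec, err := setupSpec(context)
if err != nil {
	return err
}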
The result is a Spec object, defined as follows:
// Spec is the base configuration for the container.
type Spec struct {
	// Version of the Open Container Runtime Specification with which the bundle complies.
	Version string `json:"ociVersion"`
	// Process configures the container process.
	Process *Process `json:"process,omitempty"`
	// Root configures the container's root filesystem.
	Root *Root `json:"root,omitempty"`
	// Hostname configures the container's hostname.
	Hostname string `json:"hostname,omitempty"`
	// Mounts configures additional mounts (on top of Root).
	Mounts []Mount `json:"mounts,omitempty"`
	// Hooks configures callbacks for container lifecycle events.
	Hooks *Hooks `json:"hooks,omitempty" platform:"linux,solaris"`
	// Annotations contains arbitrary metadata for the container.
	Annotations map[string]string `json:"annotations,omitempty"`

	// Linux is platform-specific configuration for Linux based containers.
	Linux *Linux `json:"linux,omitempty" platform:"linux"`
	// Solaris is platform-specific configuration for Solaris based containers.
	Solaris *Solaris `json:"solaris,omitempty" platform:"solaris"`
	// Windows is platform-specific configuration for Windows based containers.
	Windows *Windows `json:"windows,omitempty" platform:"windows"`
}
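To make the mapping from config.json to the Spec struct concrete, here is a minimal sketch that decodes a small JSON fragment with encoding/json. The miniSpec, miniProcess, and miniRoot types and the loadSpec function are cut-down, hypothetical versions written for this example; they reuse a few JSON field names from the struct above but are not the runtime-spec types, and sampleConfig is an illustrative fragment rather than the real output of runc spec.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// miniSpec is a hypothetical subset of Spec, just enough for the sketch.
type miniSpec struct {
	Version  string       `json:"ociVersion"`
	Process  *miniProcess `json:"process,omitempty"`
	Root     *miniRoot    `json:"root,omitempty"`
	Hostname string       `json:"hostname,omitempty"`
}

// miniProcess keeps only the command line and working directory.
type miniProcess struct {
	Args []string `json:"args"`
	Cwd  string   `json:"cwd"`
}

// miniRoot keeps only the path of the root filesystem.
type miniRoot struct {
	Path string `json:"path"`
}

// loadSpec decodes the JSON and applies two sanity checks chosen for this
// sketch: there must be a root section and a non-empty command line.
func loadSpec(data []byte) (*miniSpec, error) {
	var s miniSpec
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, fmt.Errorf("parse config: %w", err)
	}
	if s.Root == nil {
		return nil, errors.New("config has no root section")
	}
	if s.Process == nil || len(s.Process.Args) == 0 {
		return nil, errors.New("config has no process args")
	}
	return &s, nil
}

// sampleConfig is an illustrative fragment, not a complete config.json.
const sampleConfig = `{
  "ociVersion": "1.0.0",
  "process": {"args": ["sh"], "cwd": "/"},
  "root": {"path": "rootfs"},
  "hostname": "runc"
}`

func main() {
	s, err := loadSpec([]byte(sampleConfig))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("host %s runs %v from %s\n", s.Hostname, s.Process.Args, s.Root.Path)
}
```

The pointer fields with omitempty let the decoder distinguish a missing section (nil) from an empty one, which is why the sketch can reject a config with no root or process section.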
Next, status, err := startContainer(context, spec, CT_ACT_CREATE, nil) is called to create the container. CT_ACT_CREATE indicates a create operation; it is one value of the CtAct enumeration:
type CtAct uint8

const (
	CT_ACT_CREATE CtAct = iota + 1
	CT_ACT_RUN
	CT_ACT_RESTORE
)

status, err := startContainer(context, spec, CT_ACT_CREATE, nil)
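The iota + 1 idiom is worth a closer look. The sketch below repeats the CtAct declaration and adds a String method, which is an addition for this example and not something the runC code shown here defines. It prints the numeric consequences: the three actions get the values 1, 2, and 3, so the zero value of CtAct does not correspond to any action. Whether that was the authors' motivation is an assumption, but it is a common reason for starting an enumeration at 1.

```go
package main

import "fmt"

// CtAct and its constants are copied from the listing above.
type CtAct uint8

const (
	CT_ACT_CREATE  CtAct = iota + 1 // 1
	CT_ACT_RUN                      // 2
	CT_ACT_RESTORE                  // 3
)

// String is added for this sketch only, to make the values readable.
func (a CtAct) String() string {
	switch a {
	case CT_ACT_CREATE:
		return "create"
	case CT_ACT_RUN:
		return "run"
	case CT_ACT_RESTORE:
		return "restore"
	default:
		return fmt.Sprintf("unknown(%d)", uint8(a))
	}
}

func main() {
	var zero CtAct // an uninitialized action is not a valid one
	fmt.Println(zero, CT_ACT_CREATE, CT_ACT_RUN, CT_ACT_RESTORE)
	// prints "unknown(0) create run restore"
}
```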
startContainer itself is implemented as follows:
func startContainer(context *cli.Context, spec *specs.Spec, action CtAct, criuOpts *libcontainer.CriuOpts) (int, error) {
	id := context.Args().First()
	if id == "" {
		return -1, errEmptyID
	}

	notifySocket := newNotifySocket(context, os.Getenv("NOTIFY_SOCKET"), id)
	if notifySocket != nil {
		notifySocket.setupSpec(context, spec)
	}

	container, err := createContainer(context, id, spec)
	if err != nil {
		return -1, err
	}

	if notifySocket != nil {
		err := notifySocket.setupSocket()
		if err != nil {
			return -1, err
		}
	}

	// Support on-demand socket activation by passing file descriptors into the container init process.
	listenFDs := []*os.File{}
	if os.Getenv("LISTEN_FDS") != "" {
		listenFDs = activation.Files(false)
	}
	r := &runner{
		enableSubreaper: !context.Bool("no-subreaper"),
		shouldDestroy:   true,
		container:       container,
		listenFDs:       listenFDs,
		notifySocket:    notifySocket,
		consoleSocket:   context.String("console-socket"),
		detach:          context.Bool("detach"),
		pidFile:         context.String("pid-file"),
		preserveFDs:     context.Int("preserve-fds"),
		action:          action,
		criuOpts:        criuOpts,
		init:            true,
	}
	return r.run(spec.Process)
}
It first calls container, err := createContainer(context, id, spec) to create the container, and then fills in the runner struct r.
func createContainer(context *cli.Context, id string, spec *specs.Spec) (libcontainer.Container, error) {
	rootless, err := isRootless(context)
	if err != nil {
		return nil, err
	}
	config, err := specconv.CreateLibcontainerConfig(&specconv.CreateOpts{
		CgroupName:       id,
		UseSystemdCgroup: context.GlobalBool("systemd-cgroup"),
		NoPivotRoot:      context.Bool("no-pivot"),
		NoNewKeyring:     context.Bool("no-new-keyring"),
		Spec:             spec,
		Rootless:         rootless,
	})
	if err != nil {
		return nil, err
	}

	factory, err := loadFactory(context)
	if err != nil {
		return nil, err
	}
	return factory.Create(id, config)
}
Note factory, err := loadFactory(context) and factory.Create(id, config): these correspond to the factory.go file mentioned above. The factory creates the concrete container from the configuration config.
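The factory idea can be sketched in a few lines. The Container, Config, and Factory types below are heavily simplified, hypothetical versions written for this illustration, and memFactory keeps containers in a map instead of on disk; the real libcontainer interfaces are larger. The one behavior carried over from the text is that a container ID must be unique on the host.

```go
package main

import (
	"errors"
	"fmt"
)

// Container is a hypothetical, minimal container handle.
type Container interface {
	ID() string
}

// Config is a hypothetical stand-in for the converted container config.
type Config struct {
	Rootfs   string
	Hostname string
}

// Factory creates containers from a config, as factory.Create(id, config) does.
type Factory interface {
	Create(id string, config *Config) (Container, error)
}

type memContainer struct {
	id     string
	config *Config
}

func (c *memContainer) ID() string { return c.id }

// memFactory is an in-memory factory used only for this sketch.
type memFactory struct {
	containers map[string]*memContainer
}

func newMemFactory() Factory {
	return &memFactory{containers: make(map[string]*memContainer)}
}

// Create validates its inputs and refuses to reuse an existing ID.
func (f *memFactory) Create(id string, config *Config) (Container, error) {
	if id == "" {
		return nil, errors.New("container id must not be empty")
	}
	if config == nil {
		return nil, errors.New("config must not be nil")
	}
	if _, exists := f.containers[id]; exists {
		return nil, fmt.Errorf("container %q already exists", id)
	}
	c := &memContainer{id: id, config: config}
	f.containers[id] = c
	return c, nil
}

func main() {
	factory := newMemFactory()
	c, err := factory.Create("simplebusybox", &Config{Rootfs: "rootfs", Hostname: "runc"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("created", c.ID())

	// A second create with the same ID is rejected.
	if _, err := factory.Create("simplebusybox", &Config{Rootfs: "rootfs"}); err != nil {
		fmt.Println("error:", err)
	}
}
```

Hiding construction behind an interface lets the caller (here, createContainer) stay unaware of how and where container state is stored.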
Finally, the run method is called. It is passed a process object (spec.Process) that describes the process inside the container; newProcess later converts it into the libcontainer.Process type from the process.go file mentioned above. The spec-side definition is:
// Process contains information to start a specific application inside the container.
type Process struct {
	// Terminal creates an interactive terminal for the container.
	Terminal bool `json:"terminal,omitempty"`
	// ConsoleSize specifies the size of the console.
	ConsoleSize *Box `json:"consoleSize,omitempty"`
	// User specifies user information for the process.
	User User `json:"user"`
	// Args specifies the binary and arguments for the application to execute.
	Args []string `json:"args"`
	// Env populates the process environment for the process.
	Env []string `json:"env,omitempty"`
	// Cwd is the current working directory for the process and must be
	// relative to the container's root.
	Cwd string `json:"cwd"`
	// Capabilities are Linux capabilities that are kept for the process.
	Capabilities *LinuxCapabilities `json:"capabilities,omitempty" platform:"linux"`
	// Rlimits specifies rlimit options to apply to the process.
	Rlimits []POSIXRlimit `json:"rlimits,omitempty" platform:"linux,solaris"`
	// NoNewPrivileges controls whether additional privileges could be gained by processes in the container.
	NoNewPrivileges bool `json:"noNewPrivileges,omitempty" platform:"linux"`
	// ApparmorProfile specifies the apparmor profile for the container.
	ApparmorProfile string `json:"apparmorProfile,omitempty" platform:"linux"`
	// Specify an oom_score_adj for the container.
	OOMScoreAdj *int `json:"oomScoreAdj,omitempty" platform:"linux"`
	// SelinuxLabel specifies the selinux context that the container process is run as.
	SelinuxLabel string `json:"selinuxLabel,omitempty" platform:"linux"`
}
The run method mainly calls newProcess:
process, err := newProcess(*config, r.init)
newProcess fills in a libcontainer.Process struct, including the arguments, environment variables, user, working directory, capabilities, resource limits, and so on.
What happens next depends on the action:
switch r.action {
case CT_ACT_CREATE:
	err = r.container.Start(process)
case CT_ACT_RESTORE:
	err = r.container.Restore(process, r.criuOpts)
case CT_ACT_RUN:
	err = r.container.Run(process)
default:
	panic("Unknown action")
}
For a create, the container is started through container.Start(process), which ends up in the internal start method:
func (c *linuxContainer) start(process *Process) error {
	parent, err := c.newParentProcess(process)
	if err != nil {
		return newSystemErrorWithCause(err, "creating new parent process")
	}
	if err := parent.start(); err != nil {
		// terminate the process to ensure that it properly is reaped.
		if err := ignoreTerminateErrors(parent.terminate()); err != nil {
			logrus.Warn(err)
		}
		return newSystemErrorWithCause(err, "starting container process")
	}
	// generate a timestamp indicating when the container was started
	c.created = time.Now().UTC()
	if process.Init {
		c.state = &createdState{
			c: c,
		}
		state, err := c.updateState(parent)
		if err != nil {
			return err
		}
		c.initProcessStartTime = state.InitProcessStartTime

		if c.config.Hooks != nil {
			bundle, annotations := utils.Annotations(c.config.Labels)
			s := configs.HookState{
				Version:     c.config.Version,
				ID:          c.id,
				Pid:         parent.pid(),
				Bundle:      bundle,
				Annotations: annotations,
			}
			for i, hook := range c.config.Hooks.Poststart {
				if err := hook.Run(s); err != nil {
					if err := ignoreTerminateErrors(parent.terminate()); err != nil {
						logrus.Warn(err)
					}
					return newSystemErrorWithCausef(err, "running poststart hook %d", i)
				}
			}
		}
	}
	return nil
}
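c.newParentProcess(process), called at the top of start, does the following:
1. Creates a pair of pipes, parentPipe and childPipe, which serve as the communication channel between the runc process and the init process inside the container.
2. Creates a command template that is used to launch the parent process.
3. Calls newInitProcess to wrap everything in an initProcess. This adds the init-type environment variable and uses the bootstrapData function to package the namespaces, uid/gid mappings, and related information into an io.Reader. The data is encoded in netlink message format so it can be handed to the init process, and newInitProcess returns an initProcess struct.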
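The hand-off of bootstrap data through a pipe can be sketched as follows. This is only an illustration of the idea under loud assumptions: the bootstrap struct and its JSON encoding are invented for the example (runc encodes its bootstrap data differently, as described in step 3), os.Pipe is used in place of whatever pipe or socket pair runc creates, and a goroutine stands in for the child init process.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// bootstrap is a hypothetical payload for this sketch; it is not runc's
// bootstrap data format.
type bootstrap struct {
	InitType   string   `json:"init_type"`
	Namespaces []string `json:"namespaces"`
}

// roundTrip writes the payload into the parent end of a pipe and decodes it
// from the child end. The goroutine plays the role of the init process.
func roundTrip(in bootstrap) (bootstrap, error) {
	childEnd, parentEnd, err := os.Pipe() // read end, write end
	if err != nil {
		return bootstrap{}, err
	}
	defer childEnd.Close()

	type result struct {
		data bootstrap
		err  error
	}
	done := make(chan result, 1)
	go func() {
		// "Child": read the bootstrap data handed over by the parent.
		var got bootstrap
		err := json.NewDecoder(childEnd).Decode(&got)
		done <- result{got, err}
	}()

	// "Parent": send the data, then close its end to signal completion.
	if err := json.NewEncoder(parentEnd).Encode(in); err != nil {
		parentEnd.Close()
		return bootstrap{}, err
	}
	if err := parentEnd.Close(); err != nil {
		return bootstrap{}, err
	}

	r := <-done
	return r.data, r.err
}

func main() {
	got, err := roundTrip(bootstrap{
		InitType:   "standard",
		Namespaces: []string{"pid", "mount", "uts"},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("child received init type %q with namespaces %v\n", got.InitType, got.Namespaces)
}
```

In a real parent and child process pair the child end would be inherited as a file descriptor across exec rather than shared in memory, but the ordering (parent writes, child reads, parent closes) is the same.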
Finally, inside the container's init process, func (l *linuxStandardInit) Init() error is called. This is where the init_linux.go file mentioned above comes in (the method itself is defined in standard_init_linux.go).
func (l *linuxStandardInit) Init() error {
	if !l.config.Config.NoNewKeyring {
		ringname, keepperms, newperms := l.getSessionRingParams()

		// Do not inherit the parent's session keyring.
		sessKeyId, err := keys.JoinSessionKeyring(ringname)
		if err != nil {
			return errors.Wrap(err, "join session keyring")
		}
		// Make session keyring searcheable.
		if err := keys.ModKeyringPerm(sessKeyId, keepperms, newperms); err != nil {
			return errors.Wrap(err, "mod keyring permissions")
		}
	}

	if err := setupNetwork(l.config); err != nil {
		return err
	}
	if err := setupRoute(l.config.Config); err != nil {
		return err
	}

	label.Init()
	if err := prepareRootfs(l.pipe, l.config); err != nil {
		return err
	}
	// Set up the console. This has to be done *before* we finalize the rootfs,
	// but *after* we've given the user the chance to set up all of the mounts
	// they wanted.
	if l.config.CreateConsole {
		if err := setupConsole(l.consoleSocket, l.config, true); err != nil {
			return err
		}
		if err := system.Setctty(); err != nil {
			return errors.Wrap(err, "setctty")
		}
	}

	// Finish the rootfs setup.
	if l.config.Config.Namespaces.Contains(configs.NEWNS) {
		if err := finalizeRootfs(l.config.Config); err != nil {
			return err
		}
	}

	if hostname := l.config.Config.Hostname; hostname != "" {
		if err := unix.Sethostname([]byte(hostname)); err != nil {
			return errors.Wrap(err, "sethostname")
		}
	}
	if err := apparmor.ApplyProfile(l.config.AppArmorProfile); err != nil {
		return errors.Wrap(err, "apply apparmor profile")
	}
	if err := label.SetProcessLabel(l.config.ProcessLabel); err != nil {
		return errors.Wrap(err, "set process label")
	}

	for key, value := range l.config.Config.Sysctl {
		if err := writeSystemProperty(key, value); err != nil {
			return errors.Wrapf(err, "write sysctl key %s", key)
		}
	}
	for _, path := range l.config.Config.ReadonlyPaths {
		if err := readonlyPath(path); err != nil {
			return errors.Wrapf(err, "readonly path %s", path)
		}
	}
	for _, path := range l.config.Config.MaskPaths {
		if err := maskPath(path, l.config.Config.MountLabel); err != nil {
			return errors.Wrapf(err, "mask path %s", path)
		}
	}
	pdeath, err := system.GetParentDeathSignal()
	if err != nil {
		return errors.Wrap(err, "get pdeath signal")
	}
	if l.config.NoNewPrivileges {
		if err := unix.Prctl(unix.PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); err != nil {
			return errors.Wrap(err, "set nonewprivileges")
		}
	}
	// Tell our parent that we're ready to Execv. This must be done before the
	// Seccomp rules have been applied, because we need to be able to read and
	// write to a socket.
	if err := syncParentReady(l.pipe); err != nil {
		return errors.Wrap(err, "sync ready")
	}
	// Without NoNewPrivileges seccomp is a privileged operation, so we need to
	// do this before dropping capabilities; otherwise do it as late as possible
	// just before execve so as few syscalls take place after it as possible.
	if l.config.Config.Seccomp != nil && !l.config.NoNewPrivileges {
		if err := seccomp.InitSeccomp(l.config.Config.Seccomp); err != nil {
			return err
		}
	}
	if err := finalizeNamespace(l.config); err != nil {
		return err
	}
	// finalizeNamespace can change user/group which clears the parent death
	// signal, so we restore it here.
	if err := pdeath.Restore(); err != nil {
		return errors.Wrap(err, "restore pdeath signal")
	}
	// Compare the parent from the initial start of the init process and make
	// sure that it did not change. if the parent changes that means it died
	// and we were reparented to something else so we should just kill ourself
	// and not cause problems for someone else.
	if unix.Getppid() != l.parentPid {
		return unix.Kill(unix.Getpid(), unix.SIGKILL)
	}
	// Check for the arg before waiting to make sure it exists and it is
	// returned as a create time error.
	name, err := exec.LookPath(l.config.Args[0])
	if err != nil {
		return err
	}
	// Close the pipe to signal that we have completed our init.
	l.pipe.Close()
	// Wait for the FIFO to be opened on the other side before exec-ing the
	// user process. We open it through /proc/self/fd/$fd, because the fd that
	// was given to us was an O_PATH fd to the fifo itself. Linux allows us to
	// re-open an O_PATH fd through /proc.
	fd, err := unix.Open(fmt.Sprintf("/proc/self/fd/%d", l.fifoFd), unix.O_WRONLY|unix.O_CLOEXEC, 0)
	if err != nil {
		return newSystemErrorWithCause(err, "open exec fifo")
	}
	if _, err := unix.Write(fd, []byte("0")); err != nil {
		return newSystemErrorWithCause(err, "write 0 exec fifo")
	}
	// Close the O_PATH fifofd fd before exec because the kernel resets
	// dumpable in the wrong order. This has been fixed in newer kernels, but
	// we keep this to ensure CVE-2016-9962 doesn't re-emerge on older kernels.
	// N.B. the core issue itself (passing dirfds to the host filesystem) has
	// since been resolved.
	// https://github.com/torvalds/linux/blob/v4.9/fs/exec.c#L1290-L1318
	unix.Close(l.fifoFd)
	// Set seccomp as close to execve as possible, so as few syscalls take
	// place afterward (reducing the amount of syscalls that users need to
	// enable in their seccomp profiles).
	if l.config.Config.Seccomp != nil && l.config.NoNewPrivileges {
		if err := seccomp.InitSeccomp(l.config.Config.Seccomp); err != nil {
			return newSystemErrorWithCause(err, "init seccomp")
		}
	}
	if err := syscall.Exec(name, l.config.Args[0:], os.Environ()); err != nil {
		return newSystemErrorWithCause(err, "exec user process")
	}
	return nil
}
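(1) The function first handles l.config.Config.NoNewKeyring (joining a new session keyring unless it is disabled), then calls setupNetwork, setupRoute, label.Init(), and prepareRootfs(l.pipe, l.config). If l.config.CreateConsole is set, it also sets up the console.
(2) if l.config.Config.Namespaces.Contains(configs.NEWNS) -> finalizeRootfs(l.config.Config)
(3) Sets the hostname, calls apparmor.ApplyProfile(...) and label.SetProcessLabel(...), and writes the entries of l.config.Config.Sysctl.
(4) Calls readonlyPath(path) for each entry in ReadonlyPaths; in the configuration file these are /proc/asound, /proc/bus, /proc/fs, and so on.
(5) Calls maskPath(path, l.config.Config.MountLabel) for each entry in MaskPaths, records pdeath, err := system.GetParentDeathSignal(), and handles l.config.NoNewPrivileges.
(6) Calls syncParentReady(l.pipe) to tell the parent process that the container is ready to Execv. From the parent's point of view, create is now complete.
(7) If l.config.Config.Seccomp is set and l.config.NoNewPrivileges is not, applies the seccomp filter at this point. It then calls finalizeNamespace(l.config) and pdeath.Restore(), checks that unix.Getppid() still equals l.parentPid (killing itself if not), resolves name, err := exec.LookPath(l.config.Args[0]), and finally calls l.pipe.Close(). Init is complete, and create has now finished in the child process as well.
(8) fd, err := unix.Open(fmt.Sprintf("/proc/self/fd/%d", l.fifoFd), unix.O_WRONLY|unix.O_CLOEXEC, 0) waits for the FIFO to be opened on the other side before exec'ing the user process; in effect, this is where init waits for the start command. It then writes one byte to fd for synchronization, unix.Write(fd, []byte("0")), closes l.fifoFd, and, if both Seccomp and NoNewPrivileges are set, applies the seccomp filter now.
(9) Calls syscall.Exec(name, l.config.Args[0:], os.Environ()) to execute the container command.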