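To understand Docker, look at it from several angles: namespaces, cgroups, union file systems, the runtime (runC), and networking. We will spend some time introducing each of them in turn.
Namespaces provide isolation; cgroups provide resource limits; union file systems handle the layered storage and management of images; runC is the runtime, which follows the OCI interface and is generally built on libcontainer. Networking covers Docker's single-host networking and its multi-host communication modes.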
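Cgroup is short for control group. It is a Linux kernel feature that limits and isolates the system resources used by a group of processes, in other words resource QoS. These resources mainly include CPU, memory, block I/O, and network bandwidth. Cgroups entered the mainline kernel in 2.6.24, and all major distributions now enable them by default.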
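Cgroups provide the following four main functions:
The subsystems implemented in cgroups and their roles are as follows:
Each subsystem's directory contains more detailed settings, for example: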
cpu
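CPU resources can be controlled under two scheduling policies. The first is the Completely Fair Scheduler (CFS), which offers two control methods: a hard quota and proportional sharing. The second is the Real-Time Scheduler, which allots real-time processes a fixed amount of run time per period. All time values are configured in microseconds (µs), written as us in the file names.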
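To make the quota method concrete, here is a minimal Go sketch that turns a quota/period pair into a CPU count. It assumes the values come from the cgroup v1 files cpu.cfs_quota_us and cpu.cfs_period_us, and that a quota of -1 means no limit; the function name effectiveCPUs is made up for this illustration.

```go
package main

import "fmt"

// effectiveCPUs converts a CFS quota/period pair (both in microseconds)
// into the number of CPUs' worth of time the group may use per period.
// ok is false when no usable limit is set (for example a quota of -1).
func effectiveCPUs(quotaUs, periodUs int64) (cpus float64, ok bool) {
	if quotaUs <= 0 || periodUs <= 0 {
		return 0, false
	}
	return float64(quotaUs) / float64(periodUs), true
}

func main() {
	// 50ms of CPU time in every 100ms period: half of one CPU.
	if cpus, ok := effectiveCPUs(50000, 100000); ok {
		fmt.Printf("limit: %.2f CPUs\n", cpus)
	}
	// -1 is treated as "unlimited" here.
	if _, ok := effectiveCPUs(-1, 100000); !ok {
		fmt.Println("limit: none")
	}
}
```

A quota larger than the period simply yields a value above 1, which is how a group is allowed more than one CPU.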
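cpuset (CPU binding):
Besides limiting how much CPU is used, cgroups can also bind tasks to specific CPUs so that they run only on those CPUs; this is what the cpuset subsystem does. In addition to CPUs, it can also bind tasks to memory nodes.
Before adding a task to a cpuset's tasks file, the user must set the cpuset.cpus and cpuset.mems parameters.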
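The value written to cpuset.cpus is a comma-separated list of CPU numbers and ranges, such as 0-2,7. The following Go sketch expands such a list; parseCPUSet is a hypothetical helper written for this article, not a function from runc or the kernel tooling.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUSet expands a cpuset list such as "0-2,7" into individual CPU ids.
func parseCPUSet(s string) ([]int, error) {
	var cpus []int
	s = strings.TrimSpace(s)
	if s == "" {
		return cpus, nil
	}
	for _, part := range strings.Split(s, ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := strconv.Atoi(strings.TrimSpace(lo))
		if err != nil {
			return nil, fmt.Errorf("bad cpu %q: %w", part, err)
		}
		last := first
		if isRange {
			if last, err = strconv.Atoi(strings.TrimSpace(hi)); err != nil {
				return nil, fmt.Errorf("bad range %q: %w", part, err)
			}
		}
		if last < first {
			return nil, fmt.Errorf("descending range %q", part)
		}
		for c := first; c <= last; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

func main() {
	cpus, err := parseCPUSet("0-2,7")
	fmt.Println(cpus, err) // prints "[0 1 2 7] <nil>"
}
```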
memory:
This part focuses on the parameters related to monitoring and statistics, such as those collected by cAdvisor.
    # Run a container that will spawn 300 processes.
    docker run cirocosta/stress pid -n 300
    Starting to spawn 300 blocking children
    [1] Waiting for SIGINT

    # Open another window and see that we have 300
    # PIDS
    docker stats
    CONTAINER     …  MEM USAGE / LIMIT     PIDS
    a730051832    …  21.02MiB / 1.951GiB   300
    # let's get the ID of the container. Docker uses that ID
    # to name things in the host so we can probably use it to
    # find the cgroup created for the container
    # under the parent docker cgroup
    docker ps
    CONTAINER ID        IMAGE                   COMMAND
    a730051832e7        cirocosta/stress        "pid -n 300"

    # Having the prefix in hand, let's search for it under the
    # mountpoint for cgroups in our system
    find /sys/fs/cgroup/ -name "a730051832e7*"
    /sys/fs/cgroup/cpu,cpuacct/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
    /sys/fs/cgroup/cpuset/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
    /sys/fs/cgroup/devices/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
    /sys/fs/cgroup/pids/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
    /sys/fs/cgroup/freezer/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
    /sys/fs/cgroup/perf_event/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
    /sys/fs/cgroup/blkio/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
    /sys/fs/cgroup/memory/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
    /sys/fs/cgroup/net_cls,net_prio/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
    /sys/fs/cgroup/hugetlb/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959
    /sys/fs/cgroup/systemd/docker/a730051832e7d776442b2e969e057660ad108a7d6e6a30569398ec660a75a959

    # There they are! Docker creates a control group with the name
    # being the exact ID of the container under all the subsystems.
    # What can we discover from this inspection? We can look at the
    # subsystem that we want to place constraints on (PIDs), for instance:
    tree /sys/fs/cgroup/pids/docker/a7300518327d...
    /sys/fs/cgroup/pids/docker/a73005183...
    ├── cgroup.clone_children
    ├── cgroup.procs
    ├── notify_on_release
    ├── pids.current
    ├── pids.events
    ├── pids.max
    └── tasks

    # Which means that, if we want to know how many PIDs are in use right
    # now we can look at 'pids.current', to know the limits, 'pids.max' and
    # to know which processes have been assigned to this control group,
    # look at tasks. Let's do it:
    cat /sys/fs/cgroup/pids/docker/a730...c660a75a959/tasks
    5329
    5371
    5372
    5373
    5374
    5375
    5376
    5377
    (...)
    # continues until the 300th entry - as we have 300 processes in this container

    # 300 pids
    cat /sys/fs/cgroup/pids/docker/a730051832e7d7764...9/pids.current
    300

    # no max set
    cat /sys/fs/cgroup/pids/docker/a730051832e7d77.../pids.max
    max
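A monitoring agent has to interpret these two files rather than just print them. As a rough sketch of that step, the Go code below parses the contents of pids.max, which is either the literal max or a number, and computes the remaining headroom. The helper names parsePidsMax and headroom are made up for this example, and the strings passed in stand for file contents read elsewhere.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePidsMax interprets the contents of pids.max: either the literal
// "max" (no limit) or a decimal number.
func parsePidsMax(raw string) (limit int64, unlimited bool, err error) {
	v := strings.TrimSpace(raw)
	if v == "max" {
		return 0, true, nil
	}
	limit, err = strconv.ParseInt(v, 10, 64)
	return limit, false, err
}

// headroom reports how many more tasks fit under the limit.
func headroom(current, limit int64) int64 {
	if limit < current {
		return 0
	}
	return limit - current
}

func main() {
	// Values as they might be read from pids.max and pids.current.
	for _, raw := range []string{"max\n", "512\n"} {
		limit, unlimited, err := parsePidsMax(raw)
		switch {
		case err != nil:
			fmt.Println("parse error:", err)
		case unlimited:
			fmt.Println("no pids limit set")
		default:
			fmt.Println("headroom:", headroom(300, limit))
		}
	}
}
```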
When installing Kubernetes, you will often run into the following error:
create kubelet: misconfiguration: kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd"
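The message is already quite clear: Docker and the kubelet are configured with different cgroup drivers.
Docker supports two cgroup drivers, systemd and cgroupfs. The runc code makes this easier to see.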
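The check behind that error can be sketched in a few lines of Go. This is only an illustration: driverFromExecOpts and checkDrivers are hypothetical helpers, not kubelet or Docker code. The one real detail assumed here is that dockerd accepts the driver through an exec-opts entry of the form native.cgroupdriver=systemd; the fallback is left to the caller because the default driver depends on the Docker version and host setup.

```go
package main

import (
	"fmt"
	"strings"
)

// driverFromExecOpts extracts the value of native.cgroupdriver from a list
// of dockerd exec-opts. fallback is returned when the option is absent.
func driverFromExecOpts(opts []string, fallback string) string {
	for _, o := range opts {
		if k, v, ok := strings.Cut(o, "="); ok && k == "native.cgroupdriver" {
			return v
		}
	}
	return fallback
}

// checkDrivers mimics the sanity check that produces the error quoted above.
func checkDrivers(kubelet, docker string) error {
	if kubelet != docker {
		return fmt.Errorf("kubelet cgroup driver: %q is different from docker cgroup driver: %q", kubelet, docker)
	}
	return nil
}

func main() {
	docker := driverFromExecOpts([]string{"native.cgroupdriver=systemd"}, "cgroupfs")
	if err := checkDrivers("cgroupfs", docker); err != nil {
		fmt.Println("misconfiguration:", err)
	}
}
```

Whichever side is changed, the point is that both components must name the same driver.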
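The cgroups part of the runc code is worth reading in detail; here we only give an overview.
First, a container is created by the factory's create method. For cgroups, the factory offers option functions that, depending on the cgroup driver setting in the configuration, install the function used to build a new CgroupsManager. There are two implementations, systemd and cgroupfs: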
    // SystemdCgroups is an options func to configure a LinuxFactory to return
    // containers that use systemd to create and manage cgroups.
    func SystemdCgroups(l *LinuxFactory) error {
        l.NewCgroupsManager = func(config *configs.Cgroup, paths map[string]string) cgroups.Manager {
            return &systemd.Manager{
                Cgroups: config,
                Paths:   paths,
            }
        }
        return nil
    }

    // Cgroupfs is an options func to configure a LinuxFactory to return containers
    // that use the native cgroups filesystem implementation to create and manage
    // cgroups.
    func Cgroupfs(l *LinuxFactory) error {
        l.NewCgroupsManager = func(config *configs.Cgroup, paths map[string]string) cgroups.Manager {
            return &fs.Manager{
                Cgroups: config,
                Paths:   paths,
            }
        }
        return nil
    }
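Both functions follow Go's functional options pattern: each option is a function that mutates the factory, and the constructor applies them in order. The self-contained sketch below shows the pattern in isolation. Every type and name in it (Manager, Factory, WithCgroupfs, WithSystemd, NewFactory) is a simplified stand-in invented for this example, and defaulting to cgroupfs is an assumption of the sketch, not a statement about runc.

```go
package main

import "fmt"

// Manager is a heavily reduced, hypothetical stand-in for cgroups.Manager.
type Manager interface{ Driver() string }

type fsManager struct{}

func (fsManager) Driver() string { return "cgroupfs" }

type systemdManager struct{}

func (systemdManager) Driver() string { return "systemd" }

// Factory is a hypothetical stand-in for LinuxFactory.
type Factory struct {
	NewManager func() Manager
}

// Option configures a Factory, in the same style as SystemdCgroups/Cgroupfs.
type Option func(*Factory) error

func WithCgroupfs(f *Factory) error {
	f.NewManager = func() Manager { return fsManager{} }
	return nil
}

func WithSystemd(f *Factory) error {
	f.NewManager = func() Manager { return systemdManager{} }
	return nil
}

// NewFactory applies a default first, then each option in order, so a later
// option overrides an earlier one.
func NewFactory(opts ...Option) (*Factory, error) {
	f := &Factory{}
	if err := WithCgroupfs(f); err != nil { // default chosen for this sketch
		return nil, err
	}
	for _, opt := range opts {
		if err := opt(f); err != nil {
			return nil, err
		}
	}
	return f, nil
}

func main() {
	f, err := NewFactory(WithSystemd)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(f.NewManager().Driver()) // prints "systemd"
}
```

The benefit is that the code creating containers only ever calls NewManager and never needs to know which driver was selected.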
The cgroup manager is abstracted as an interface, shown below:
    type Manager interface {
        // Applies cgroup configuration to the process with the specified pid
        Apply(pid int) error

        // Returns the PIDs inside the cgroup set
        GetPids() ([]int, error)

        // Returns the PIDs inside the cgroup set & all sub-cgroups
        GetAllPids() ([]int, error)

        // Returns statistics for the cgroup set
        GetStats() (*Stats, error)

        // Toggles the freezer cgroup according with specified state
        Freeze(state configs.FreezerState) error

        // Destroys the cgroup set
        Destroy() error

        // The option func SystemdCgroups() and Cgroupfs() require following attributes:
        //   Paths   map[string]string
        //   Cgroups *configs.Cgroup
        // Paths maps cgroup subsystem to path at which it is mounted.
        // Cgroups specifies specific cgroup settings for the various subsystems

        // Returns cgroup paths to save in a state file and to be able to
        // restore the object later.
        GetPaths() map[string]string

        // Sets the cgroup as configured.
        Set(container *configs.Config) error
    }
The container code calls the methods of this interface while creating and managing a container. For example,
in container_linux.go:
    func (c *linuxContainer) Set(config configs.Config) error {
        c.m.Lock()
        defer c.m.Unlock()
        status, err := c.currentStatus()
        if err != nil {
            return err
        }
        ...
        if err := c.cgroupManager.Set(&config); err != nil {
            // Set configs back
            if err2 := c.cgroupManager.Set(c.config); err2 != nil {
                logrus.Warnf("Setting back cgroup configs failed due to error: %v, your state.json and actual configs might be inconsistent.", err2)
            }
            return err
        }
        ...
    }
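The interesting part of this method is the rollback: if applying the new configuration fails, the previous one is applied again so that the cgroup is not left half-updated. The sketch below reproduces that idea with made-up types (limits, fakeManager, container); the fake manager deliberately writes one field before failing on the next, which is an assumption chosen to show why the rollback matters.

```go
package main

import (
	"errors"
	"fmt"
)

// limits is a hypothetical two-field stand-in for configs.Config.
type limits struct {
	MemoryBytes int64
	PidsMax     int64
}

// fakeManager is a hypothetical cgroup manager that applies fields one by
// one, so a failure can leave the group half-updated.
type fakeManager struct{ applied limits }

func (m *fakeManager) Set(l limits) error {
	m.applied.MemoryBytes = l.MemoryBytes // first write succeeds
	if l.PidsMax < 0 {
		return errors.New("invalid pids limit") // second write fails
	}
	m.applied.PidsMax = l.PidsMax
	return nil
}

type container struct {
	config limits
	mgr    *fakeManager
}

// Set tries the new limits and re-applies the previous ones on failure.
func (c *container) Set(next limits) error {
	if err := c.mgr.Set(next); err != nil {
		if err2 := c.mgr.Set(c.config); err2 != nil {
			return fmt.Errorf("set failed: %v; rollback failed: %w", err, err2)
		}
		return err
	}
	c.config = next
	return nil
}

func main() {
	old := limits{MemoryBytes: 512 << 20, PidsMax: 100}
	c := &container{config: old, mgr: &fakeManager{applied: old}}
	err := c.Set(limits{MemoryBytes: 1 << 30, PidsMax: -1})
	fmt.Println("error:", err)
	fmt.Println("rolled back:", c.mgr.applied == old)
}
```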
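Next, we focus on the fs implementation.
In fs, nearly every subsystem has its own source file, as shown in the figure above.
We concentrate on memory.go, the memory subsystem; the other subsystems are similar.
Key methods: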
    func (s *MemoryGroup) Apply(d *cgroupData) (err error) {
        path, err := d.path("memory")
        if err != nil && !cgroups.IsNotFound(err) {
            return err
        } else if path == "" {
            return nil
        }
        if memoryAssigned(d.config) {
            if _, err := os.Stat(path); os.IsNotExist(err) {
                if err := os.MkdirAll(path, 0755); err != nil {
                    return err
                }
                // Only enable kernel memory accouting when this cgroup
                // is created by libcontainer, otherwise we might get
                // error when people use `cgroupsPath` to join an existed
                // cgroup whose kernel memory is not initialized.
                if err := EnableKernelMemoryAccounting(path); err != nil {
                    return err
                }
            }
        }
        defer func() {
            if err != nil {
                os.RemoveAll(path)
            }
        }()

        // We need to join memory cgroup after set memory limits, because
        // kmem.limit_in_bytes can only be set when the cgroup is empty.
        _, err = d.join("memory")
        if err != nil && !cgroups.IsNotFound(err) {
            return err
        }
        return nil
    }
    func (raw *cgroupData) path(subsystem string) (string, error) {
        mnt, err := cgroups.FindCgroupMountpoint(subsystem)
        // If we didn't mount the subsystem, there is no point we make the path.
        if err != nil {
            return "", err
        }

        // If the cgroup name/path is absolute do not look relative to the cgroup of the init process.
        if filepath.IsAbs(raw.innerPath) {
            // Sometimes subsystems can be mounted together as 'cpu,cpuacct'.
            return filepath.Join(raw.root, filepath.Base(mnt), raw.innerPath), nil
        }

        // Use GetOwnCgroupPath instead of GetInitCgroupPath, because the creating
        // process could in container and shared pid namespace with host, and
        // /proc/1/cgroup could point to whole other world of cgroups.
        parentPath, err := cgroups.GetOwnCgroupPath(subsystem)
        if err != nil {
            return "", err
        }

        return filepath.Join(parentPath, raw.innerPath), nil
    }
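The two branches of path are easier to follow with concrete values. The sketch below isolates just the path arithmetic as a pure function. cgroupPath and its parameters are hypothetical: the mountpoint and parent path, which the code above looks up from the running system, are passed in as plain strings, and the example values assume a Linux host with cgroups mounted under /sys/fs/cgroup.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// cgroupPath mirrors the two branches of path above:
//   - an absolute inner path is placed directly under <root>/<subsystem dir>
//   - a relative inner path is resolved against the caller's own cgroup
func cgroupPath(root, mountpoint, parentPath, inner string) string {
	if filepath.IsAbs(inner) {
		// filepath.Base keeps combined mounts such as "cpu,cpuacct" intact.
		return filepath.Join(root, filepath.Base(mountpoint), inner)
	}
	return filepath.Join(parentPath, inner)
}

func main() {
	// Absolute inner path, combined cpu,cpuacct mount.
	fmt.Println(cgroupPath("/sys/fs/cgroup", "/sys/fs/cgroup/cpu,cpuacct", "", "/docker/abc"))

	// Relative inner path, resolved under the creating process's own cgroup.
	fmt.Println(cgroupPath("/sys/fs/cgroup", "/sys/fs/cgroup/memory",
		"/sys/fs/cgroup/memory/user.slice", "docker/abc"))
}
```

The absolute case corresponds to the /sys/fs/cgroup/<subsystem>/docker/<container ID> directories found earlier with find.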
    func (raw *cgroupData) join(subsystem string) (string, error) {
        path, err := raw.path(subsystem)
        if err != nil {
            return "", err
        }
        if err := os.MkdirAll(path, 0755); err != nil {
            return "", err
        }
        if err := cgroups.WriteCgroupProc(path, raw.pid); err != nil {
            return "", err
        }
        return path, nil
    }
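Joining a cgroup therefore comes down to two file system operations: create the directory, then write the PID into it. The sketch below imitates that against a temporary directory so it can run without privileges. It assumes the PID is written to a file named cgroup.procs, as listed in the tree output earlier; joinCgroup and roundTrip are hypothetical helpers, and on a real cgroup mount the kernel creates that file itself and the write moves the process into the group.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// joinCgroup creates dir if needed and writes pid into dir/cgroup.procs.
func joinCgroup(dir string, pid int) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	procs := filepath.Join(dir, "cgroup.procs")
	return os.WriteFile(procs, []byte(strconv.Itoa(pid)), 0o644)
}

// roundTrip joins a throwaway "cgroup" under a temp directory (standing in
// for /sys/fs/cgroup) and returns what ended up in cgroup.procs.
func roundTrip(pid int) (string, error) {
	root, err := os.MkdirTemp("", "cgroup-sketch-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)

	dir := filepath.Join(root, "memory", "docker", "demo")
	if err := joinCgroup(dir, pid); err != nil {
		return "", err
	}
	data, err := os.ReadFile(filepath.Join(dir, "cgroup.procs"))
	return string(data), err
}

func main() {
	got, err := roundTrip(os.Getpid())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("cgroup.procs contains", got)
}
```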