kubernetes version: v1.3.0
When analyzing the kubelet startup flow you keep running into various GC components, so this post pulls them out for a more detailed look.
kubelet's garbage collection consists of two parts:
containerGC: removes containers that have already exited, according to the configured container GC policy
imageManager: manages the lifecycle of all images known to k8s; it too relies on cAdvisor
The imageManager's GC policy struct looks like this:
type ImageGCPolicy struct {
	// Any usage above this threshold will always trigger garbage collection.
	// This is the highest usage we will allow.
	HighThresholdPercent int

	// Any usage below this threshold will never trigger garbage collection.
	// This is the lowest threshold we will try to garbage collect to.
	LowThresholdPercent int

	// Minimum age at which a image can be garbage collected.
	MinAge time.Duration
}
The factory defaults for this struct are set in UnsecuredKubeletConfig() in cmd/kubelet/app/server.go.
func UnsecuredKubeletConfig(s *options.KubeletServer) (*KubeletConfig, error) {
	...
	imageGCPolicy := kubelet.ImageGCPolicy{
		MinAge:               s.ImageMinimumGCAge.Duration,
		HighThresholdPercent: int(s.ImageGCHighThresholdPercent),
		LowThresholdPercent:  int(s.ImageGCLowThresholdPercent),
	}
	...
}
The KubeletServer fields used in those assignments are initialized in NewKubeletServer() in cmd/kubelet/app/options/options.go:
func NewKubeletServer() *KubeletServer {
	return &KubeletServer{
		...
		ImageMinimumGCAge:           unversioned.Duration{Duration: 2 * time.Minute},
		ImageGCHighThresholdPercent: 90,
		ImageGCLowThresholdPercent:  80,
		...
	}
}
From this initialization we can conclude the following (a small sketch of the threshold arithmetic follows the list):
when disk usage is at or above 90%, image GC keeps being triggered
when disk usage is below 80%, image GC is never triggered
image GC tries to delete the least recently used images first, but an image first detected less than 2 minutes ago will not be deleted
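To make the two thresholds concrete, here is a minimal sketch of how a high/low pair turns disk stats into a trigger decision and a reclaim target. imageGCTarget is a hypothetical helper written for this post, not kubelet code; it mirrors the GarbageCollect() arithmetic shown later.

package main

import "fmt"

// imageGCTarget is a hypothetical helper (not kubelet code): given disk
// capacity/availability in bytes and the two policy thresholds, it reports
// whether GC should run and how many bytes to free to get usage back down
// to lowPercent.
func imageGCTarget(capacity, available int64, highPercent, lowPercent int) (trigger bool, bytesToFree int64) {
	usagePercent := 100 - int(available*100/capacity)
	if usagePercent < highPercent {
		return false, 0
	}
	// Free enough that only lowPercent of the disk stays in use.
	return true, capacity*int64(100-lowPercent)/100 - available
}

func main() {
	// A 100 GiB image filesystem with 8 GiB free is 92% used: over the 90%
	// threshold, so GC tries to free 12 GiB to get back to 80% usage.
	const gib = int64(1) << 30
	trigger, toFree := imageGCTarget(100*gib, 8*gib, 90, 80)
	fmt.Println(trigger, toFree/gib) // true 12
}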
That covers the initialization of the imageManager's GC policy; now on to the imageManager itself.
Defined in: pkg/kubelet/image_manager.go
The type looks like this:
type imageManager interface {
	// Applies the garbage collection policy. Errors include being unable to free
	// enough space as per the garbage collection policy.
	GarbageCollect() error

	// Start async garbage collection of images.
	Start() error

	GetImageList() ([]kubecontainer.Image, error)

	// TODO(vmarmol): Have this subsume pulls as well.
}
As you can see, imageManager is an interface; the concrete struct behind it is realImageManager:
type realImageManager struct {
	// Container runtime
	runtime container.Runtime

	// Records of images and their use.
	imageRecords     map[string]*imageRecord
	imageRecordsLock sync.Mutex

	// The image garbage collection policy in use.
	policy ImageGCPolicy

	// cAdvisor instance.
	cadvisor cadvisor.Interface

	// Recorder for Kubernetes events.
	recorder record.EventRecorder

	// Reference to this node.
	nodeRef *api.ObjectReference

	// Track initialization
	initialized bool
}
To see how it is initialized we go back to NewMainKubelet() in pkg/kubelet/kubelet.go:
func NewMainKubelet(
	hostname string,
	nodeName string,
	...
) (*Kubelet, error) {
	...
	// setup containerGC
	containerGC, err := kubecontainer.NewContainerGC(klet.containerRuntime, containerGCPolicy)
	if err != nil {
		return nil, err
	}
	klet.containerGC = containerGC

	// setup imageManager
	imageManager, err := newImageManager(klet.containerRuntime, cadvisorInterface, recorder, nodeRef, imageGCPolicy)
	if err != nil {
		return nil, fmt.Errorf("failed to initialize image manager: %v", err)
	}
	klet.imageManager = imageManager
	...
}
This function initializes both containerGC and imageManager. We cover imageManager first and leave containerGC for later.
newImageManager() looks like this:
func newImageManager(runtime container.Runtime, cadvisorInterface cadvisor.Interface, recorder record.EventRecorder, nodeRef *api.ObjectReference, policy ImageGCPolicy) (imageManager, error) {
	// Validate the policy parameters.
	if policy.HighThresholdPercent < 0 || policy.HighThresholdPercent > 100 {
		return nil, fmt.Errorf("invalid HighThresholdPercent %d, must be in range [0-100]", policy.HighThresholdPercent)
	}
	if policy.LowThresholdPercent < 0 || policy.LowThresholdPercent > 100 {
		return nil, fmt.Errorf("invalid LowThresholdPercent %d, must be in range [0-100]", policy.LowThresholdPercent)
	}
	if policy.LowThresholdPercent > policy.HighThresholdPercent {
		return nil, fmt.Errorf("LowThresholdPercent %d can not be higher than HighThresholdPercent %d", policy.LowThresholdPercent, policy.HighThresholdPercent)
	}

	// Initialize the realImageManager struct.
	im := &realImageManager{
		runtime:      runtime,
		policy:       policy,
		imageRecords: make(map[string]*imageRecord),
		cadvisor:     cadvisorInterface,
		recorder:     recorder,
		nodeRef:      nodeRef,
		initialized:  false,
	}

	return im, nil
}
From this constructor we can see that imageManager is tied to the container runtime, cAdvisor, an EventRecorder, a nodeRef, and the Policy.
Some educated guesses:
runtime is used to actually remove images
cAdvisor is used to query how much disk the images occupy
EventRecorder is used to emit the concrete GC events
Policy is the GC policy itself
What is nodeRef for? Hard to guess here, so let's keep reading the source (it turns out to be the object the recorder attaches GC events to).
With all parameters initialized, we can move on to how GC is actually started, which again means looking at CreateAndInitKubelet().
File: cmd/kubelet/app/server.go
Call chain: main -> app.Run -> run -> RunKubelet -> CreateAndInitKubelet
The function:
func CreateAndInitKubelet(kc *KubeletConfig) (k KubeletBootstrap, pc *config.PodConfig, err error) {
	...
	k.StartGarbageCollection()

	return k, pc, nil
}
It calls StartGarbageCollection(), which starts the GC loops; the implementation is:
func (kl *Kubelet) StartGarbageCollection() {
	go wait.Until(func() {
		if err := kl.containerGC.GarbageCollect(kl.sourcesReady.AllReady()); err != nil {
			glog.Errorf("Container garbage collection failed: %v", err)
		}
	}, ContainerGCPeriod, wait.NeverStop)

	go wait.Until(func() {
		if err := kl.imageManager.GarbageCollect(); err != nil {
			glog.Errorf("Image garbage collection failed: %v", err)
		}
	}, ImageGCPeriod, wait.NeverStop)
}
This starts one goroutine for containerGC and one for imageManager: containerGC is triggered every 1 minute and imageManager every 5 minutes.
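Those two periods come from package-level constants in pkg/kubelet/kubelet.go; in v1.3 they are defined roughly as follows (matching the 1-minute/5-minute cadence above):

const (
	// ContainerGCPeriod is the period for performing container garbage collection.
	ContainerGCPeriod = time.Minute
	// ImageGCPeriod is the period for performing image garbage collection.
	ImageGCPeriod = 5 * time.Minute
)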
The GarbageCollect() in question is a method on the realImageManager struct we initialized earlier, so let's step into kl.imageManager.GarbageCollect() and take a closer look:
func (im *realImageManager) GarbageCollect() error {
	// Get disk usage of the images filesystem on this node from cAdvisor.
	fsInfo, err := im.cadvisor.ImagesFsInfo()
	if err != nil {
		return err
	}
	// Total capacity and available space.
	capacity := int64(fsInfo.Capacity)
	available := int64(fsInfo.Available)
	if available > capacity {
		glog.Warningf("available %d is larger than capacity %d", available, capacity)
		available = capacity
	}
	// Check valid capacity.
	if capacity == 0 {
		err := fmt.Errorf("invalid capacity %d on device %q at mount point %q", capacity, fsInfo.Device, fsInfo.Mountpoint)
		im.recorder.Eventf(im.nodeRef, api.EventTypeWarning, container.InvalidDiskCapacity, err.Error())
		return err
	}

	// Check whether image disk usage is at or above HighThresholdPercent.
	usagePercent := 100 - int(available*100/capacity)
	if usagePercent >= im.policy.HighThresholdPercent {
		// Try to reclaim enough space to bring usage below LowThresholdPercent.
		amountToFree := capacity*int64(100-im.policy.LowThresholdPercent)/100 - available
		glog.Infof("[ImageManager]: Disk usage on %q (%s) is at %d%% which is over the high threshold (%d%%). Trying to free %d bytes", fsInfo.Device, fsInfo.Mountpoint, usagePercent, im.policy.HighThresholdPercent, amountToFree)
		// The call that actually reclaims space.
		freed, err := im.freeSpace(amountToFree, time.Now())
		if err != nil {
			return err
		}

		if freed < amountToFree {
			err := fmt.Errorf("failed to garbage collect required amount of images. Wanted to free %d, but freed %d", amountToFree, freed)
			im.recorder.Eventf(im.nodeRef, api.EventTypeWarning, container.FreeDiskSpaceFailed, err.Error())
			return err
		}
	}

	return nil
}
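A quick worked example with the default policy: on a 200 GiB image filesystem with 18 GiB available, usagePercent is 100 - 18*100/200 = 91, over the 90% high threshold, so amountToFree = 200 GiB * (100-80)/100 - 18 GiB = 22 GiB, and freeSpace() is asked to reclaim 22 GiB to bring usage back down to 80%.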
The key call here is im.freeSpace(), which is where the actual reclamation happens.
It takes two arguments: the amount of space the caller wants to reclaim, and the current time at which reclamation was requested.
Let's step into it and look at the details:
func (im *realImageManager) freeSpace(bytesToFree int64, freeTime time.Time) (int64, error) {
	// Walk all images known to im.runtime and refresh im.imageRecords, used below.
	err := im.detectImages(freeTime)
	if err != nil {
		return 0, err
	}

	// Lock protecting imageRecords.
	im.imageRecordsLock.Lock()
	defer im.imageRecordsLock.Unlock()

	// Collect all images.
	images := make([]evictionInfo, 0, len(im.imageRecords))
	for image, record := range im.imageRecords {
		images = append(images, evictionInfo{
			id:          image,
			imageRecord: *record,
		})
	}
	sort.Sort(byLastUsedAndDetected(images))

	// The loop below deletes images until enough space has been freed.
	var lastErr error
	spaceFreed := int64(0)
	for _, image := range images {
		glog.V(5).Infof("Evaluating image ID %s for possible garbage collection", image.id)
		// Images that are currently in use were given a newer lastUsed.
		if image.lastUsed.Equal(freeTime) || image.lastUsed.After(freeTime) {
			glog.V(5).Infof("Image ID %s has lastUsed=%v which is >= freeTime=%v, not eligible for garbage collection", image.id, image.lastUsed, freeTime)
			break
		}

		// Avoid garbage collect the image if the image is not old enough.
		// In such a case, the image may have just been pulled down, and will be used by a container right away.
		// Check whether the image has been around long enough; skip it if not.
		// This minimum age is configured in the GC policy.
		if freeTime.Sub(image.firstDetected) < im.policy.MinAge {
			glog.V(5).Infof("Image ID %s has age %v which is less than the policy's minAge of %v, not eligible for garbage collection", image.id, freeTime.Sub(image.firstDetected), im.policy.MinAge)
			continue
		}

		// Call the runtime (i.e. Docker) to remove the image.
		glog.Infof("[ImageManager]: Removing image %q to free %d bytes", image.id, image.size)
		err := im.runtime.RemoveImage(container.ImageSpec{Image: image.id})
		if err != nil {
			lastErr = err
			continue
		}
		// Drop the deleted image from imageRecords; this is why we took the lock above.
		delete(im.imageRecords, image.id)
		// Add the deleted image's size to the running total.
		spaceFreed += image.size

		// Stop once enough space has been freed.
		if spaceFreed >= bytesToFree {
			break
		}
	}

	return spaceFreed, lastErr
}
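The eviction order matters: sort.Sort(byLastUsedAndDetected(images)) puts the least recently used images first, breaking ties by which image was detected earlier. Here is a self-contained sketch of that comparator; the evictionInfo stand-in is simplified to just the two fields the sort needs (the real struct embeds an imageRecord):

package main

import (
	"fmt"
	"sort"
	"time"
)

// Simplified stand-in for kubelet's evictionInfo.
type evictionInfo struct {
	id            string
	lastUsed      time.Time
	firstDetected time.Time
}

type byLastUsedAndDetected []evictionInfo

func (ev byLastUsedAndDetected) Len() int      { return len(ev) }
func (ev byLastUsedAndDetected) Swap(i, j int) { ev[i], ev[j] = ev[j], ev[i] }
func (ev byLastUsedAndDetected) Less(i, j int) bool {
	// Older lastUsed sorts first; break ties by detection time.
	if ev[i].lastUsed.Equal(ev[j].lastUsed) {
		return ev[i].firstDetected.Before(ev[j].firstDetected)
	}
	return ev[i].lastUsed.Before(ev[j].lastUsed)
}

func main() {
	now := time.Now()
	images := []evictionInfo{
		{id: "busy", lastUsed: now, firstDetected: now.Add(-time.Hour)},
		{id: "idle-old", lastUsed: now.Add(-2 * time.Hour), firstDetected: now.Add(-3 * time.Hour)},
		{id: "idle-new", lastUsed: now.Add(-1 * time.Hour), firstDetected: now.Add(-time.Hour)},
	}
	sort.Sort(byLastUsedAndDetected(images))
	for _, img := range images {
		fmt.Println(img.id) // idle-old, idle-new, busy
	}
}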
And that is essentially the imageManager flow; from here you could dig further into the cAdvisor and Docker runtime interfaces.
The containerGC policy struct looks like this:
type ContainerGCPolicy struct {
	// Minimum age at which a container can be garbage collected, zero for no limit.
	MinAge time.Duration

	// Max number of dead containers any single pod (UID, container name) pair is
	// allowed to have, less than zero for no limit.
	MaxPerPodContainer int

	// Max number of total dead containers, less than zero for no limit.
	MaxContainers int
}
This struct is initialized in CreateAndInitKubelet() in cmd/kubelet/app/server.go.
Call chain: main -> app.Run -> run -> RunKubelet -> CreateAndInitKubelet
func CreateAndInitKubelet(kc *KubeletConfig) (k KubeletBootstrap, pc *config.PodConfig, err error) {
	var kubeClient clientset.Interface
	if kc.KubeClient != nil {
		kubeClient = kc.KubeClient
		// TODO: remove this when we've refactored kubelet to only use clientset.
	}

	// Initialize the containerGC policy.
	gcPolicy := kubecontainer.ContainerGCPolicy{
		MinAge:             kc.MinimumGCAge,
		MaxPerPodContainer: kc.MaxPerPodContainerCount,
		MaxContainers:      kc.MaxContainerCount,
	}
	...
}
The actual values come from the kc struct, which is populated in UnsecuredKubeletConfig() in cmd/kubelet/app/server.go.
Call chain: main -> app.Run -> UnsecuredKubeletConfig
func UnsecuredKubeletConfig(s *options.KubeletServer) (*KubeletConfig, error) {
	...
	MaxContainerCount:       int(s.MaxContainerCount),
	MaxPerPodContainerCount: int(s.MaxPerPodContainerCount),
	MinimumGCAge:            s.MinimumGCAge.Duration,
	...
}
These values in turn originate from the KubeletConfiguration struct embedded in KubeletServer; the relevant fields are:
type KubeletConfiguration struct {
	...
	// containerGC reclaims containers that have exited, but a container only
	// becomes eligible once it has been dead for longer than MinimumGCAge.
	// default: 1min
	MinimumGCAge unversioned.Duration `json:"minimumGCAge"`
	// Maximum number of dead containers to retain per (pod UID, container
	// name) pair. default: 2
	MaxPerPodContainerCount int32 `json:"maxPerPodContainerCount"`
	// Maximum number of dead containers to retain on the node.
	MaxContainerCount int32 `json:"maxContainerCount"`
}
And the defaults are once again set in NewKubeletServer() in cmd/kubelet/app/options/options.go:
func NewKubeletServer() *KubeletServer {
	...
	MaxContainerCount:       240,
	MaxPerPodContainerCount: 2,
	MinimumGCAge:            unversioned.Duration{Duration: 1 * time.Minute},
	...
}
From these defaults we can see the following (a worked example follows the list):
the node retains at most 240 dead containers in total (a cap on retained dead containers, not on how many containers the node can run)
each (pod, container name) pair retains at most 2 dead containers
a container becomes eligible for containerGC only after it has been dead for at least 1 minute
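For example, a node running 100 pods with one container each can accumulate up to 2 dead instances per container, i.e. 200 dead containers, still under the node-wide cap of 240; with 150 such pods the cap kicks in and the surplus is evicted oldest-first, as the GarbageCollect() walkthrough below shows.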
So the basic containerGC policy is clear.
Once the policy struct is initialized, the final step is initializing the containerGC object itself, which happens in NewMainKubelet() in pkg/kubelet/kubelet.go:
func NewMainKubelet(...) {
	...
	// setup containerGC
	containerGC, err := kubecontainer.NewContainerGC(klet.containerRuntime, containerGCPolicy)
	if err != nil {
		return nil, err
	}
	klet.containerGC = containerGC
	...
}
Continuing into NewContainerGC(), found in pkg/kubelet/container/container_gc.go, let's see what it does:
func NewContainerGC(runtime Runtime, policy ContainerGCPolicy) (ContainerGC, error) {
	if policy.MinAge < 0 {
		return nil, fmt.Errorf("invalid minimum garbage collection age: %v", policy.MinAge)
	}

	return &realContainerGC{
		runtime: runtime,
		policy:  policy,
	}, nil
}
The function is simple: it wraps the policy struct from before into a realContainerGC. This object is now fairly complete: reclaiming containers necessarily involves runtime operations (inspecting container state, removing containers, and so on), so carrying the actual runtime in the struct is expected.
Keep an eye on the methods this object supports; we will use them below.
With all parameters initialized, we reach the actual GC start-up. This was already covered in the imageManager section above, so let's get straight to the point.
containerGC is started by the same StartGarbageCollection(); the implementation again:
func (kl *Kubelet) StartGarbageCollection() {
	go wait.Until(func() {
		if err := kl.containerGC.GarbageCollect(kl.sourcesReady.AllReady()); err != nil {
			glog.Errorf("Container garbage collection failed: %v", err)
		}
	}, ContainerGCPeriod, wait.NeverStop)

	go wait.Until(func() {
		if err := kl.imageManager.GarbageCollect(); err != nil {
			glog.Errorf("Image garbage collection failed: %v", err)
		}
	}, ImageGCPeriod, wait.NeverStop)
}
Next let's look at containerGC's GarbageCollect(). To find it we have to go back to where containerGC was initialized.
What the initialization actually returned was a realContainerGC, so GarbageCollect() is a method on that struct:
func (cgc *realContainerGC) GarbageCollect(allSourcesReady bool) error {
	return cgc.runtime.GarbageCollect(cgc.policy, allSourcesReady)
}
At this point it is clear that containerGC follows exactly the same pattern as imageManager: one recipe serves both.
The runtime we use is Docker, so we need Docker's GarbageCollect() implementation. The runtime initialization was covered in the earlier post <Kubelet源碼分析(二) DockerClient>, so we skip it here and go straight to the point.
Docker's GarbageCollect() lives in pkg/kubelet/dockertools/container_gc.go:
func (cgc *containerGC) GarbageCollect(gcPolicy kubecontainer.ContainerGCPolicy, allSourcesReady bool) error {
	// Separate the containers eligible for eviction from the rest.
	// evictUnits: identifiable containers that are dead and older than the
	// policy's minAge.
	// unidentifiedContainers: containers that cannot be identified.
	evictUnits, unidentifiedContainers, err := cgc.evictableContainers(gcPolicy.MinAge)
	if err != nil {
		return err
	}

	// Remove the unidentified containers first.
	for _, container := range unidentifiedContainers {
		glog.Infof("Removing unidentified dead container %q with ID %q", container.name, container.id)
		err = cgc.client.RemoveContainer(container.id, dockertypes.ContainerRemoveOptions{RemoveVolumes: true})
		if err != nil {
			glog.Warningf("Failed to remove unidentified dead container %q: %v", container.name, err)
		}
	}

	// Once all sources are ready, remove dead containers belonging to deleted pods.
	if allSourcesReady {
		for key, unit := range evictUnits {
			if cgc.isPodDeleted(key.uid) {
				cgc.removeOldestN(unit, len(unit)) // Remove all.
				delete(evictUnits, key)
			}
		}
	}

	// Walk all evictUnits and trim each pod's surplus dead containers.
	if gcPolicy.MaxPerPodContainer >= 0 {
		cgc.enforceMaxContainersPerEvictUnit(evictUnits, gcPolicy.MaxPerPodContainer)
	}

	// Enforce the node-wide maximum number of dead containers.
	// If the node holds more dead containers than the limit, remove the
	// surplus, evicting the oldest containers first.
	if gcPolicy.MaxContainers >= 0 && evictUnits.NumContainers() > gcPolicy.MaxContainers {
		// Compute how many dead containers each evict unit may keep.
		numContainersPerEvictUnit := gcPolicy.MaxContainers / evictUnits.NumEvictUnits()
		if numContainersPerEvictUnit < 1 {
			numContainersPerEvictUnit = 1
		}
		cgc.enforceMaxContainersPerEvictUnit(evictUnits, numContainersPerEvictUnit)

		// If containers still need to be removed, evict the oldest first.
		numContainers := evictUnits.NumContainers()
		if numContainers > gcPolicy.MaxContainers {
			flattened := make([]containerGCInfo, 0, numContainers)
			for uid := range evictUnits {
				// Flatten all remaining containers into one slice.
				flattened = append(flattened, evictUnits[uid]...)
			}
			sort.Sort(byCreated(flattened))

			// Remove the numContainers-gcPolicy.MaxContainers oldest containers.
			cgc.removeOldestN(flattened, numContainers-gcPolicy.MaxContainers)
		}
	}

	// After removing containers, clean up the dangling log symlinks.
	logSymlinks, _ := filepath.Glob(path.Join(cgc.containerLogsDir, fmt.Sprintf("*.%s", LogSuffix)))
	for _, logSymlink := range logSymlinks {
		if _, err = os.Stat(logSymlink); os.IsNotExist(err) {
			err = os.Remove(logSymlink)
			if err != nil {
				glog.Warningf("Failed to remove container log dead symlink %q: %v", logSymlink, err)
			}
		}
	}

	return nil
}
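The node-wide enforcement step is worth a closer look: when the total number of dead containers exceeds MaxContainers, each evict unit is first trimmed to an equal share (never less than 1), and only then are the oldest containers removed globally. A minimal sketch of that share computation; perUnitShare is a hypothetical helper mirroring the arithmetic above, not kubelet code:

package main

import "fmt"

// perUnitShare mirrors the arithmetic in GarbageCollect(): the node-wide cap
// divided across evict units, never going below one dead container per unit.
func perUnitShare(maxContainers, numEvictUnits int) int {
	share := maxContainers / numEvictUnits
	if share < 1 {
		share = 1
	}
	return share
}

func main() {
	// With the default cap of 240 dead containers:
	fmt.Println(perUnitShare(240, 60))  // 4: each of 60 units may keep 4
	fmt.Println(perUnitShare(240, 500)) // 1: more units than the cap allows,
	// so each keeps 1 and the oldest-first pass removes the global surplus.
}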
The sections above walked through the imageManager and containerGC implementations via the source, including the GC policy settings involved; the defaults can also be changed by hand through kubelet flags:
image-gc-high-threshold: the disk usage percentage at which image garbage collection is triggered. Default: 90%
image-gc-low-threshold: the disk usage percentage that image GC tries to reclaim down to; GC is not triggered if usage is already below this value. Default: 80%
minimum-container-ttl-duration: how long a container must have been stopped before it can be reclaimed by GC. Default: 1min
maximum-dead-containers-per-container: maximum number of dead instances to retain per (pod, container name) pair. Default: 2
maximum-dead-containers: maximum number of dead containers to retain on the kubelet's node. Default: 240
Containers can be garbage-collected once they stop, but we usually also want to retain some of them: a container may have exited abnormally, and a retained container still holds logs and other useful data for developers to diagnose problems. maximum-dead-containers-per-container and maximum-dead-containers give us a good way to balance these two goals.
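For example (hypothetical values), a debugging-friendly node could run the kubelet with --maximum-dead-containers-per-container=5 --maximum-dead-containers=500 --minimum-container-ttl-duration=5m to keep more exited containers around for inspection, at the cost of extra disk usage.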