An In-Depth Look at Kubernetes Critical Pods (Part 2)

Part 1 of this series covered how the scheduler handles Critical Pods. This part looks at how the kubelet eviction manager handles them: does the kubelet protect Critical Pods when it evicts pods and, if so, how?

Kubelet Eviction Manager Admit

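In syncLoop, the kubelet calls syncLoopIteration over and over; the sync channel ticks every 1s. Each call takes one event from the config change channel, the PLEG channel, the sync channel, the housekeeping channel, or the liveness manager's update channel, and dispatches it to the matching event handler.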

  • configCh: dispatch the pods for the config change to the appropriate handler callback for the event type
  • plegCh: update the runtime cache; sync pod
  • syncCh: sync all pods waiting for sync
  • houseKeepingCh: trigger cleanup of pods
  • liveness manager's update channel: sync pods that have failed or in which one or more containers have failed liveness checks

Note that the housekeeping channel delivers an event once per housekeeping period (10s). Each event triggers HandlePodCleanups, which performs the following cleanup:

  • Stop the workers for pods that no longer exist (each pod has its own worker, i.e. a goroutine).
  • Kill unwanted pods.
  • Remove the volumes of pods that should not be running and that have no containers running.
  • Remove any orphaned mirror pods.
  • Remove any cgroups in the hierarchy for pods that are no longer running.

pkg/kubelet/kubelet.go:1753

func (kl *Kubelet) syncLoopIteration(configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
	syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
	select {
	case u, open := <-configCh:
		
		if !open {
			glog.Errorf("Update channel is closed. Exiting the sync loop.")
			return false
		}

		switch u.Op {
		case kubetypes.ADD:
			
			handler.HandlePodAdditions(u.Pods)
		...
		case kubetypes.RESTORE:
			glog.V(2).Infof("SyncLoop (RESTORE, %q): %q", u.Source, format.Pods(u.Pods))
			// These are pods restored from the checkpoint. Treat them as new
			// pods.
			handler.HandlePodAdditions(u.Pods)
		...
		}

		if u.Op != kubetypes.RESTORE {
			...
		}
	case e := <-plegCh:
		...
	case <-syncCh:
		...
	case update := <-kl.livenessManager.Updates():
		...
	case <-housekeepingCh:
		...
	}
	return true
}

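syncLoopIteration also determines what happens after the kubelet restarts, for example following a configuration change: the kubelet runs admission again for the pods already running on the node, and the result may be that a pod is rejected by this node.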

HandlePodAdditions is the handler for the ADD and RESTORE events received on the kubelet's configCh.

// HandlePodAdditions is the callback in SyncHandler for pods being added from a config source.
func (kl *Kubelet) HandlePodAdditions(pods []*v1.Pod) {
	start := kl.clock.Now()
	sort.Sort(sliceutils.PodsByCreationTime(pods))
	for _, pod := range pods {
		...

		if !kl.podIsTerminated(pod) {
			...
			// Check if we can admit the pod; if not, reject it.
			if ok, reason, message := kl.canAdmitPod(activePods, pod); !ok {
				kl.rejectPod(pod, reason, message)
				continue
			}
		}
		...
	}
}

If the pod's status is not Terminated, canAdmitPod runs the admission check on it. If the check rejects the pod, the pod's phase is set to Failed.

pkg/kubelet/kubelet.go:1643

func (kl *Kubelet) canAdmitPod(pods []*v1.Pod, pod *v1.Pod) (bool, string, string) {
	// the kubelet will invoke each pod admit handler in sequence
	// if any handler rejects, the pod is rejected.
	// TODO: move out of disk check into a pod admitter
	// TODO: out of resource eviction should have a pod admitter call-out
	attrs := &lifecycle.PodAdmitAttributes{Pod: pod, OtherPods: pods}
	for _, podAdmitHandler := range kl.admitHandlers {
		if result := podAdmitHandler.Admit(attrs); !result.Admit {
			return false, result.Reason, result.Message
		}
	}

	return true, "", ""
}

canAdmitPod calls each of the admitHandlers registered at kubelet startup to run the admission check on the pod. One of them is the kubelet eviction manager's admit handler.
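
The handler chain is a plain "first rejection wins" loop. The sketch below uses hypothetical, simplified types (`pod`, `admitResult`, `admitHandler`) in place of the lifecycle package, and the reason and message strings are illustrative only.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the lifecycle package types.
type pod struct{ Name string }

type admitResult struct {
	Admit   bool
	Reason  string
	Message string
}

type admitHandler interface {
	Admit(p pod) admitResult
}

// admitHandlerFunc lets a plain function act as an admitHandler.
type admitHandlerFunc func(p pod) admitResult

func (f admitHandlerFunc) Admit(p pod) admitResult { return f(p) }

// canAdmit runs the handlers in registration order and stops at the
// first rejection, returning that handler's reason and message.
func canAdmit(handlers []admitHandler, p pod) (bool, string, string) {
	for _, h := range handlers {
		if r := h.Admit(p); !r.Admit {
			return false, r.Reason, r.Message
		}
	}
	return true, "", ""
}

func main() {
	alwaysAdmit := admitHandlerFunc(func(_ pod) admitResult {
		return admitResult{Admit: true}
	})
	// Illustrative rejection; the strings are not the kubelet's real ones.
	underPressure := admitHandlerFunc(func(_ pod) admitResult {
		return admitResult{Reason: "NodeUnderPressure", Message: "node has conditions"}
	})

	ok, reason, msg := canAdmit([]admitHandler{alwaysAdmit, underPressure}, pod{Name: "web"})
	fmt.Println(ok, reason, msg)
}
```

With this shape, a component such as the eviction manager only needs to implement `Admit` to take part in admission. The order of registration matters only for which rejection reason is reported.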

pkg/kubelet/eviction/eviction_manager.go:123

// Admit rejects a pod if its not safe to admit for node stability.
func (m *managerImpl) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
	m.RLock()
	defer m.RUnlock()
	if len(m.nodeConditions) == 0 {
		return lifecycle.PodAdmitResult{Admit: true}
	}

	// Critical pods are admitted regardless of the node's conditions.
	if utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) && kubelettypes.IsCriticalPod(attrs.Pod) {
		return lifecycle.PodAdmitResult{Admit: true}
	}

	// Under memory pressure, admit the pod unless it is best-effort.
	if hasNodeCondition(m.nodeConditions, v1.NodeMemoryPressure) {
		notBestEffort := v1.PodQOSBestEffort != v1qos.GetPodQOS(attrs.Pod)
		if notBestEffort {
			return lifecycle.PodAdmitResult{Admit: true}
		}
	}

	// Everything else is rejected.
	return lifecycle.PodAdmitResult{
		Admit:   false,
		Reason:  reason,
		Message: fmt.Sprintf(message, m.nodeConditions),
	}
}

The eviction manager's Admit logic is as follows:

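  • If the node has no conditions, the pod is admitted.
  • If the ExperimentalCriticalPodAnnotation feature gate is enabled and the pod is a Critical Pod (it carries the critical annotation, or its priority is not lower than SystemCriticalPriority), the pod is admitted.
    • SystemCriticalPriority is 2 billion.
  • If the node has the MemoryPressure condition and the pod's QoS class is not BestEffort, the pod is admitted.
  • In all other cases admission fails, and the pod is not allowed to run on this node.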
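
The four rules can be written as one small decision function. This is a sketch under simplifying assumptions: `pod`, `isCritical`, and `admit` are hypothetical names, node conditions are plain strings, and the feature gate is passed in as a boolean instead of being read from a feature-gate registry.

```go
package main

import "fmt"

// 2 billion, the SystemCriticalPriority value given in the list above.
const systemCriticalPriority int32 = 2000000000

// pod is a hypothetical, simplified stand-in for v1.Pod.
type pod struct {
	Name               string
	CriticalAnnotation bool   // pod carries the critical-pod annotation
	Priority           *int32 // nil when no priority is set
	QoS                string // "Guaranteed", "Burstable" or "BestEffort"
}

// isCritical follows the rule described above: the critical annotation,
// or a priority that is not lower than SystemCriticalPriority.
func isCritical(p pod) bool {
	if p.CriticalAnnotation {
		return true
	}
	return p.Priority != nil && *p.Priority >= systemCriticalPriority
}

func hasCondition(conditions []string, want string) bool {
	for _, c := range conditions {
		if c == want {
			return true
		}
	}
	return false
}

// admit applies the four rules in the order listed above.
func admit(conditions []string, criticalGateEnabled bool, p pod) bool {
	if len(conditions) == 0 {
		return true // rule 1: node has no conditions
	}
	if criticalGateEnabled && isCritical(p) {
		return true // rule 2: critical pod with the feature gate on
	}
	if hasCondition(conditions, "MemoryPressure") && p.QoS != "BestEffort" {
		return true // rule 3: memory pressure, pod is not best-effort
	}
	return false // rule 4: everything else is rejected
}

func main() {
	prio := systemCriticalPriority
	critical := pod{Name: "critical", Priority: &prio, QoS: "BestEffort"}
	burstable := pod{Name: "burstable", QoS: "Burstable"}
	bestEffort := pod{Name: "best-effort", QoS: "BestEffort"}

	disk := []string{"DiskPressure"}
	mem := []string{"MemoryPressure"}

	fmt.Println(admit(disk, true, critical))  // critical pod, gate on
	fmt.Println(admit(disk, true, burstable)) // non-critical pod, disk pressure
	fmt.Println(admit(mem, true, burstable))  // non-best-effort, memory pressure
	fmt.Println(admit(mem, true, bestEffort)) // best-effort, memory pressure
}
```

One consequence of the rule order is that a non-critical pod is rejected whenever the node reports a condition other than MemoryPressure on its own, such as DiskPressure, whatever the pod's QoS class.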

Kubelet Eviction Manager SyncLoop

The kubelet eviction manager's sync loop also gives Critical Pods special treatment, as the following code shows.

pkg/kubelet/eviction/eviction_manager.go:226

// synchronize is the main control loop that enforces eviction thresholds.
// Returns the pod that was killed, or nil if no pod was killed.
func (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {
	...

	// we kill at most a single pod during each eviction interval
	for i := range activePods {
		pod := activePods[i]
		
		// With the feature gate on, never evict a pod that is both critical and static.
		if utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
			kubelettypes.IsCriticalPod(pod) && kubepod.IsStaticPod(pod) {
			continue
		}
		...
		return []*v1.Pod{pod}
	}
	glog.Infof("eviction manager: unable to evict any pods from the node")
	return nil
}

When a kubelet eviction is triggered, a pod that meets all of the following conditions will not be killed by the kubelet eviction manager:

  • the pod's status is not Terminated;
  • the ExperimentalCriticalPodAnnotation feature gate is enabled;
  • the pod is a Critical Pod;
  • the pod is a Static Pod.
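
The selection rule in the eviction loop can be sketched as follows. The types and the `pickVictim` name are hypothetical. The sketch assumes the input is already the ranked list of active (non-terminated) pods, and it leaves out the actual kill step that is elided in the excerpt above.

```go
package main

import "fmt"

// pod is a hypothetical, simplified stand-in for v1.Pod.
type pod struct {
	Name     string
	Critical bool // what IsCriticalPod would report
	Static   bool // what IsStaticPod would report
}

// pickVictim walks the ranked pods and returns the first one that may be
// evicted, or nil if every pod is protected. At most one pod is chosen.
func pickVictim(ranked []*pod, criticalGateEnabled bool) *pod {
	for _, p := range ranked {
		if criticalGateEnabled && p.Critical && p.Static {
			continue // protected: critical and static, gate on
		}
		return p
	}
	return nil
}

func main() {
	ranked := []*pod{
		{Name: "kube-apiserver", Critical: true, Static: true},
		{Name: "kube-dns", Critical: true, Static: false},
		{Name: "web", Critical: false, Static: false},
	}
	if victim := pickVictim(ranked, true); victim != nil {
		fmt.Println("evict:", victim.Name)
	}
}
```

Note that being critical is not enough on its own: in the demo the critical but non-static `kube-dns` pod is the one selected, because the protection applies only to pods that are both critical and static.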

Summary

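From the analysis above, the key points of how the kubelet eviction manager handles Critical Pods are:

  • After the kubelet restarts, the eviction manager's Admit step treats Critical Pods specially: if the ExperimentalCriticalPodAnnotation feature gate is enabled, a Critical Pod is admitted to the node regardless of the node's conditions.

  • When a kubelet eviction is triggered, a Critical Pod that meets all of the following conditions will not be killed by the kubelet eviction manager:

    • the pod's status is not Terminated;
    • the ExperimentalCriticalPodAnnotation feature gate is enabled;
    • the pod is a Critical Pod;
    • the pod is a Static Pod.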