In-Depth Analysis of Kubernetes Critical Pod (Part 4)

Date: 2019-11-13


Abstract: This article analyzes the special handling that critical pods receive from the DaemonSet controller and from PriorityClass validation.

Special handling of critical pods in the DaemonSet controller

The In-Depth Analysis of Kubernetes Critical Pod series: Part 1, Part 2, Part 3, Part 4 (this article).

When the DaemonSet controller decides whether a given DaemonSet should run on a given node, it calls DaemonSetsController.simulate to collect the PredicateFailureReasons.

pkg/controller/daemon/daemon_controller.go:1206

func (dsc *DaemonSetsController) simulate(newPod *v1.Pod, node *v1.Node, ds *apps.DaemonSet) ([]algorithm.PredicateFailureReason, *schedulercache.NodeInfo, error) {
	// DaemonSet pods shouldn't be deleted by NodeController in case of node problems.
	// Add infinite toleration for taint notReady:NoExecute here
	// to survive taint-based eviction enforced by NodeController
	// when node turns not ready.
	v1helper.AddOrUpdateTolerationInPod(newPod, &v1.Toleration{
		Key:      algorithm.TaintNodeNotReady,
		Operator: v1.TolerationOpExists,
		Effect:   v1.TaintEffectNoExecute,
	})

	// DaemonSet pods shouldn't be deleted by NodeController in case of node problems.
	// Add infinite toleration for taint unreachable:NoExecute here
	// to survive taint-based eviction enforced by NodeController
	// when node turns unreachable.
	v1helper.AddOrUpdateTolerationInPod(newPod, &v1.Toleration{
		Key:      algorithm.TaintNodeUnreachable,
		Operator: v1.TolerationOpExists,
		Effect:   v1.TaintEffectNoExecute,
	})

	// According to TaintNodesByCondition, all DaemonSet pods should tolerate
	// MemoryPressure and DisPressure taints, and the critical pods should tolerate
	// OutOfDisk taint additional.
	v1helper.AddOrUpdateTolerationInPod(newPod, &v1.Toleration{
		Key:      algorithm.TaintNodeDiskPressure,
		Operator: v1.TolerationOpExists,
		Effect:   v1.TaintEffectNoSchedule,
	})

	v1helper.AddOrUpdateTolerationInPod(newPod, &v1.Toleration{
		Key:      algorithm.TaintNodeMemoryPressure,
		Operator: v1.TolerationOpExists,
		Effect:   v1.TaintEffectNoSchedule,
	})

	// TODO(#48843) OutOfDisk taints will be removed in 1.10
	if utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
		kubelettypes.IsCriticalPod(newPod) {
		v1helper.AddOrUpdateTolerationInPod(newPod, &v1.Toleration{
			Key:      algorithm.TaintNodeOutOfDisk,
			Operator: v1.TolerationOpExists,
			Effect:   v1.TaintEffectNoSchedule,
		})
	}

	...

	_, reasons, err := Predicates(newPod, nodeInfo)
	return reasons, nodeInfo, err
}

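The DaemonSet controller adds the following tolerations to the pod. The NoExecute tolerations keep the pod from being killed by the Node Controller's taint-based eviction when the node reports the corresponding conditions, and the NoSchedule tolerations allow the pod to run on a node that is under pressure:
- NotReady:NoExecute
- Unreachable:NoExecute
- MemoryPressure:NoSchedule
- DiskPressure:NoSchedule
When the ExperimentalCriticalPodAnnotation feature gate is enabled and the pod is a critical pod, the OutOfDisk:NoSchedule toleration is added as well.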
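
The toleration logic above can be sketched in isolation. This is only a minimal illustration, not the controller's code: the Toleration type, the addOrUpdate helper, the taint key strings, and the critical flag (standing for "feature gate enabled and pod is critical") are all hypothetical stand-ins, and the real v1helper.AddOrUpdateTolerationInPod may match existing tolerations by a different rule.

```go
package main

import "fmt"

// Toleration is a simplified stand-in for v1.Toleration (illustration only).
type Toleration struct {
	Key      string
	Operator string
	Effect   string
}

// addOrUpdate replaces a toleration with the same key and effect,
// or appends the new one if no such toleration exists.
// The matching rule is an assumption made for this sketch.
func addOrUpdate(ts []Toleration, t Toleration) []Toleration {
	for i := range ts {
		if ts[i].Key == t.Key && ts[i].Effect == t.Effect {
			ts[i] = t
			return ts
		}
	}
	return append(ts, t)
}

// daemonTolerations returns the tolerations a DaemonSet pod ends up with.
// The key strings are placeholders, not the real taint key constants.
func daemonTolerations(existing []Toleration, critical bool) []Toleration {
	ts := existing
	for _, t := range []Toleration{
		{"not-ready", "Exists", "NoExecute"},
		{"unreachable", "Exists", "NoExecute"},
		{"disk-pressure", "Exists", "NoSchedule"},
		{"memory-pressure", "Exists", "NoSchedule"},
	} {
		ts = addOrUpdate(ts, t)
	}
	// Only critical pods additionally tolerate the out-of-disk taint.
	if critical {
		ts = addOrUpdate(ts, Toleration{"out-of-disk", "Exists", "NoSchedule"})
	}
	return ts
}

func main() {
	fmt.Println(len(daemonTolerations(nil, false)))
	fmt.Println(len(daemonTolerations(nil, true)))
}
```

A critical pod ends up with one more toleration than an ordinary DaemonSet pod, which is the only difference this step introduces.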

Inside simulate, predicates are also evaluated, much as the scheduler does. The predicate step treats critical pods differently as well.

pkg/controller/daemon/daemon_controller.go:1413

// Predicates checks if a DaemonSet's pod can be scheduled on a node using GeneralPredicates
// and PodToleratesNodeTaints predicate
func Predicates(pod *v1.Pod, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	var predicateFails []algorithm.PredicateFailureReason

	// If ScheduleDaemonSetPods is enabled, only check nodeSelector and nodeAffinity.
	if false /*disabled for 1.10*/ && utilfeature.DefaultFeatureGate.Enabled(features.ScheduleDaemonSetPods) {
		fit, reasons, err := nodeSelectionPredicates(pod, nil, nodeInfo)
		if err != nil {
			return false, predicateFails, err
		}
		if !fit {
			predicateFails = append(predicateFails, reasons...)
		}

		return len(predicateFails) == 0, predicateFails, nil
	}

	critical := utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
		kubelettypes.IsCriticalPod(pod)

	fit, reasons, err := predicates.PodToleratesNodeTaints(pod, nil, nodeInfo)
	if err != nil {
		return false, predicateFails, err
	}
	if !fit {
		predicateFails = append(predicateFails, reasons...)
	}
	if critical {
		// If the pod is marked as critical and support for critical pod annotations is enabled,
		// check predicates for critical pods only.
		fit, reasons, err = predicates.EssentialPredicates(pod, nil, nodeInfo)
	} else {
		fit, reasons, err = predicates.GeneralPredicates(pod, nil, nodeInfo)
	}
	if err != nil {
		return false, predicateFails, err
	}
	if !fit {
		predicateFails = append(predicateFails, reasons...)
	}

	return len(predicateFails) == 0, predicateFails, nil
}

For a critical pod, predicates.EssentialPredicates is called; otherwise predicates.GeneralPredicates is called.
How do GeneralPredicates and EssentialPredicates differ? GeneralPredicates is simply EssentialPredicates plus noncriticalPredicates, and noncriticalPredicates is the scheduler's PodFitsResources predicate.

pkg/scheduler/algorithm/predicates/predicates.go:1076

// noncriticalPredicates are the predicates that only non-critical pods need
func noncriticalPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	var predicateFails []algorithm.PredicateFailureReason
	fit, reasons, err := PodFitsResources(pod, meta, nodeInfo)
	if err != nil {
		return false, predicateFails, err
	}
	if !fit {
		predicateFails = append(predicateFails, reasons...)
	}

	return len(predicateFails) == 0, predicateFails, nil
}

Therefore, for a critical pod, the DaemonSet controller does not run the PodFitsResources check during predicate evaluation.
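
The effect of skipping the resource check can be seen in a small model. The sketch below is an assumption-laden simplification: the pod and node types and all function names are hypothetical, a single host-name check stands in for the whole set of essential checks, and CPU is the only resource considered.

```go
package main

import "fmt"

// node and pod are hypothetical, heavily simplified stand-ins.
type node struct {
	name         string
	freeMilliCPU int64
}

type pod struct {
	nodeName string // empty means "any node"
	milliCPU int64
	critical bool
}

// essentialReasons models the checks every pod must pass.
func essentialReasons(p pod, n node) []string {
	if p.nodeName != "" && p.nodeName != n.name {
		return []string{"HostName"}
	}
	return nil
}

// noncriticalReasons models the resource-fit check that only
// non-critical pods go through.
func noncriticalReasons(p pod, n node) []string {
	if p.milliCPU > n.freeMilliCPU {
		return []string{"InsufficientCPU"}
	}
	return nil
}

// generalReasons is the non-critical check plus the essential checks.
func generalReasons(p pod, n node) []string {
	return append(noncriticalReasons(p, n), essentialReasons(p, n)...)
}

// daemonReasons picks the predicate set based on criticality.
func daemonReasons(p pod, n node) []string {
	if p.critical {
		return essentialReasons(p, n)
	}
	return generalReasons(p, n)
}

func main() {
	n := node{name: "n1", freeMilliCPU: 100}
	fmt.Println(daemonReasons(pod{milliCPU: 500, critical: true}, n))
	fmt.Println(daemonReasons(pod{milliCPU: 500}, n))
}
```

In this model the critical pod is accepted on a node that lacks free CPU, while the same pod without the critical flag is rejected for insufficient resources.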

Special handling of critical pods in PriorityClass validation

An important change in Kubernetes 1.11 is that Priority and Preemption were promoted from alpha to beta and are enabled by default.

Kubernetes Version	Priority and Preemption State	Enabled by default
1.8	alpha	no
1.9	alpha	no
1.10	alpha	no
1.11	beta	yes

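PriorityClass belongs to the scheduling.k8s.io/v1alpha1 GroupVersion. After a client submits a request to create a PriorityClass, and before the object is written to etcd, it is validated. This validation treats the two PriorityClasses SystemClusterCritical and SystemNodeCritical specially.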

pkg/apis/scheduling/validation/validation.go:30

// ValidatePriorityClass tests whether required fields in the PriorityClass are
// set correctly.
func ValidatePriorityClass(pc *scheduling.PriorityClass) field.ErrorList {
	...
	// If the priorityClass starts with a system prefix, it must be one of the
	// predefined system priority classes.
	if strings.HasPrefix(pc.Name, scheduling.SystemPriorityClassPrefix) {
		if is, err := scheduling.IsKnownSystemPriorityClass(pc); !is {
			allErrs = append(allErrs, field.Forbidden(field.NewPath("metadata", "name"), "priority class names with '"+scheduling.SystemPriorityClassPrefix+"' prefix are reserved for system use only. error: "+err.Error()))
		}
	} 
	...
	return allErrs
}

// IsKnownSystemPriorityClass checks that "pc" is equal to one of the system PriorityClasses.
// It ignores "description", labels, annotations, etc. of the PriorityClass.
func IsKnownSystemPriorityClass(pc *PriorityClass) (bool, error) {
	for _, spc := range systemPriorityClasses {
		if spc.Name == pc.Name {
			if spc.Value != pc.Value {
				return false, fmt.Errorf("value of %v PriorityClass must be %v", spc.Name, spc.Value)
			}
			if spc.GlobalDefault != pc.GlobalDefault {
				return false, fmt.Errorf("globalDefault of %v PriorityClass must be %v", spc.Name, spc.GlobalDefault)
			}
			return true, nil
		}
	}
	return false, fmt.Errorf("%v is not a known system priority class", pc.Name)
}

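During PriorityClass validation, if the PriorityClass name starts with the "system-" prefix, it must be either system-cluster-critical or system-node-critical. Otherwise validation fails and the request is rejected.
If the submitted PriorityClass is named system-cluster-critical or system-node-critical, its value must equal the predefined value and its globalDefault must be false. In other words, system-cluster-critical and system-node-critical cannot be the global default PriorityClass.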
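
The name rule above can be exercised with a small standalone sketch. It is not the upstream implementation: the priorityClass type and validateName function are hypothetical, only the system-prefix branch is modeled, and the numeric value of systemCriticalPriority is an assumption chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	systemPrefix = "system-"
	// Assumed value for this sketch; check the Kubernetes source for the real constant.
	systemCriticalPriority int32 = 2000000000
)

// priorityClass is a simplified stand-in for scheduling.PriorityClass.
type priorityClass struct {
	Name          string
	Value         int32
	GlobalDefault bool
}

// systemClasses mirrors the two predefined system priority classes.
var systemClasses = []priorityClass{
	{Name: "system-node-critical", Value: systemCriticalPriority + 1000},
	{Name: "system-cluster-critical", Value: systemCriticalPriority},
}

// validateName rejects names with the system prefix unless they match
// a predefined class in name, value, and globalDefault.
func validateName(pc priorityClass) error {
	if !strings.HasPrefix(pc.Name, systemPrefix) {
		return nil // user-defined names are not restricted by this rule
	}
	for _, spc := range systemClasses {
		if spc.Name != pc.Name {
			continue
		}
		if spc.Value != pc.Value {
			return fmt.Errorf("value of %v must be %v", spc.Name, spc.Value)
		}
		if spc.GlobalDefault != pc.GlobalDefault {
			return fmt.Errorf("globalDefault of %v must be %v", spc.Name, spc.GlobalDefault)
		}
		return nil
	}
	return fmt.Errorf("%v is not a known system priority class", pc.Name)
}

func main() {
	for _, pc := range []priorityClass{
		{Name: "my-high-priority", Value: 1000},
		{Name: "system-foo", Value: 1000},
		{Name: "system-cluster-critical", Value: systemCriticalPriority},
		{Name: "system-cluster-critical", Value: systemCriticalPriority, GlobalDefault: true},
	} {
		fmt.Printf("%s: %v\n", pc.Name, validateName(pc))
	}
}
```

Only the first and third objects pass: an arbitrary "system-" name is refused outright, and a predefined name is refused as soon as its value or globalDefault deviates from the built-in definition.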

In addition, when a PriorityClass is updated, changing its Name or Value is currently not allowed. Only Description and globalDefault can be updated.

pkg/apis/scheduling/helpers.go:27

// SystemPriorityClasses define system priority classes that are auto-created at cluster bootstrapping.
// Our API validation logic ensures that any priority class that has a system prefix or its value
// is higher than HighestUserDefinablePriority is equal to one of these SystemPriorityClasses.
var systemPriorityClasses = []*PriorityClass{
	{
		ObjectMeta: metav1.ObjectMeta{
			Name: SystemNodeCritical,
		},
		Value:       SystemCriticalPriority + 1000,
		Description: "Used for system critical pods that must not be moved from their current node.",
	},
	{
		ObjectMeta: metav1.ObjectMeta{
			Name: SystemClusterCritical,
		},
		Value:       SystemCriticalPriority,
		Description: "Used for system critical pods that must run in the cluster, but can be moved to another node if necessary.",
	},
}

Summary

The special handling of critical pods by the DaemonSet controller and by PriorityClass validation can be summarized as follows:

- The DaemonSet controller adds the OutOfDisk:NoSchedule toleration to critical pods.
- The DaemonSet controller does not run the PodFitsResources check when evaluating predicates for a critical pod.
- During PriorityClass validation, if the PriorityClass name starts with the "system-" prefix, it must be either system-cluster-critical or system-node-critical. Otherwise validation fails and the request is rejected.
- If the submitted PriorityClass is named system-cluster-critical or system-node-critical, its globalDefault must be false. In other words, these two classes cannot be the global default PriorityClass.
