Abstract: This article analyzes the special handling of critical pods in DaemonSetController and during PriorityClass validation.
The "Deep Dive into Kubernetes Critical Pods" series: Part 1 · Part 2 · Part 3 · Part 4
When DaemonSetController decides whether a given node should run a pod of a given DaemonSet, it calls DaemonSetsController.simulate to collect the PredicateFailureReasons.
pkg/controller/daemon/daemon_controller.go:1206

```go
func (dsc *DaemonSetsController) simulate(newPod *v1.Pod, node *v1.Node, ds *apps.DaemonSet) ([]algorithm.PredicateFailureReason, *schedulercache.NodeInfo, error) {
	// DaemonSet pods shouldn't be deleted by NodeController in case of node problems.
	// Add infinite toleration for taint notReady:NoExecute here
	// to survive taint-based eviction enforced by NodeController
	// when node turns not ready.
	v1helper.AddOrUpdateTolerationInPod(newPod, &v1.Toleration{
		Key:      algorithm.TaintNodeNotReady,
		Operator: v1.TolerationOpExists,
		Effect:   v1.TaintEffectNoExecute,
	})

	// DaemonSet pods shouldn't be deleted by NodeController in case of node problems.
	// Add infinite toleration for taint unreachable:NoExecute here
	// to survive taint-based eviction enforced by NodeController
	// when node turns unreachable.
	v1helper.AddOrUpdateTolerationInPod(newPod, &v1.Toleration{
		Key:      algorithm.TaintNodeUnreachable,
		Operator: v1.TolerationOpExists,
		Effect:   v1.TaintEffectNoExecute,
	})

	// According to TaintNodesByCondition, all DaemonSet pods should tolerate
	// MemoryPressure and DisPressure taints, and the critical pods should tolerate
	// OutOfDisk taint additional.
	v1helper.AddOrUpdateTolerationInPod(newPod, &v1.Toleration{
		Key:      algorithm.TaintNodeDiskPressure,
		Operator: v1.TolerationOpExists,
		Effect:   v1.TaintEffectNoSchedule,
	})
	v1helper.AddOrUpdateTolerationInPod(newPod, &v1.Toleration{
		Key:      algorithm.TaintNodeMemoryPressure,
		Operator: v1.TolerationOpExists,
		Effect:   v1.TaintEffectNoSchedule,
	})

	// TODO(#48843) OutOfDisk taints will be removed in 1.10
	if utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
		kubelettypes.IsCriticalPod(newPod) {
		v1helper.AddOrUpdateTolerationInPod(newPod, &v1.Toleration{
			Key:      algorithm.TaintNodeOutOfDisk,
			Operator: v1.TolerationOpExists,
			Effect:   v1.TaintEffectNoSchedule,
		})
	}

	...

	_, reasons, err := Predicates(newPod, nodeInfo)
	return reasons, nodeInfo, err
}
```
As the code above shows, when the ExperimentalCriticalPodAnnotation feature gate is enabled and the pod is a critical pod, an additional `OutOfDisk:NoSchedule` toleration is added. Within simulate, a Predicates pass similar to the scheduler's is then performed, and that pass also treats critical pods differently.
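To make the toleration-injection step above concrete, here is a minimal sketch of the semantics of `v1helper.AddOrUpdateTolerationInPod`: replace an existing toleration with the same key and effect, otherwise append. The `Pod` and `Toleration` types below are simplified local stand-ins, not the real `k8s.io/api/core/v1` types.

```go
package main

import "fmt"

// Toleration mirrors only the fields used in the snippet above
// (simplified; not the real v1.Toleration).
type Toleration struct {
	Key      string
	Operator string
	Effect   string
}

type Pod struct {
	Tolerations []Toleration
}

// addOrUpdateToleration sketches AddOrUpdateTolerationInPod's behavior:
// update in place if a toleration with the same key/effect exists,
// otherwise append, so repeated calls never create duplicates.
func addOrUpdateToleration(pod *Pod, t Toleration) {
	for i, existing := range pod.Tolerations {
		if existing.Key == t.Key && existing.Effect == t.Effect {
			pod.Tolerations[i] = t
			return
		}
	}
	pod.Tolerations = append(pod.Tolerations, t)
}

func main() {
	pod := &Pod{}
	// The simulate() path adds these unconditionally for every DaemonSet pod.
	addOrUpdateToleration(pod, Toleration{Key: "node.kubernetes.io/not-ready", Operator: "Exists", Effect: "NoExecute"})
	addOrUpdateToleration(pod, Toleration{Key: "node.kubernetes.io/unreachable", Operator: "Exists", Effect: "NoExecute"})
	// Adding the same key/effect again updates in place instead of duplicating.
	addOrUpdateToleration(pod, Toleration{Key: "node.kubernetes.io/not-ready", Operator: "Exists", Effect: "NoExecute"})
	fmt.Println(len(pod.Tolerations)) // 2
}
```

This idempotency matters because simulate is called repeatedly for each node during every sync loop.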
pkg/controller/daemon/daemon_controller.go:1413

```go
// Predicates checks if a DaemonSet's pod can be scheduled on a node using GeneralPredicates
// and PodToleratesNodeTaints predicate
func Predicates(pod *v1.Pod, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	var predicateFails []algorithm.PredicateFailureReason

	// If ScheduleDaemonSetPods is enabled, only check nodeSelector and nodeAffinity.
	if false /*disabled for 1.10*/ && utilfeature.DefaultFeatureGate.Enabled(features.ScheduleDaemonSetPods) {
		fit, reasons, err := nodeSelectionPredicates(pod, nil, nodeInfo)
		if err != nil {
			return false, predicateFails, err
		}
		if !fit {
			predicateFails = append(predicateFails, reasons...)
		}
		return len(predicateFails) == 0, predicateFails, nil
	}

	critical := utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
		kubelettypes.IsCriticalPod(pod)

	fit, reasons, err := predicates.PodToleratesNodeTaints(pod, nil, nodeInfo)
	if err != nil {
		return false, predicateFails, err
	}
	if !fit {
		predicateFails = append(predicateFails, reasons...)
	}
	if critical {
		// If the pod is marked as critical and support for critical pod annotations is enabled,
		// check predicates for critical pods only.
		fit, reasons, err = predicates.EssentialPredicates(pod, nil, nodeInfo)
	} else {
		fit, reasons, err = predicates.GeneralPredicates(pod, nil, nodeInfo)
	}
	if err != nil {
		return false, predicateFails, err
	}
	if !fit {
		predicateFails = append(predicateFails, reasons...)
	}

	return len(predicateFails) == 0, predicateFails, nil
}
```
pkg/scheduler/algorithm/predicates/predicates.go:1076

```go
// noncriticalPredicates are the predicates that only non-critical pods need
func noncriticalPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	var predicateFails []algorithm.PredicateFailureReason
	fit, reasons, err := PodFitsResources(pod, meta, nodeInfo)
	if err != nil {
		return false, predicateFails, err
	}
	if !fit {
		predicateFails = append(predicateFails, reasons...)
	}
	return len(predicateFails) == 0, predicateFails, nil
}
```
Therefore, for a critical pod, DaemonSetController's Predicates check skips PodFitsResources: only EssentialPredicates run, while PodFitsResources belongs to the noncriticalPredicates that non-critical pods must additionally pass.
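How does the code recognize a critical pod in the first place? The following is a simplified sketch of the annotation-based check behind `kubelettypes.IsCriticalPod` in the 1.10/1.11 timeframe: the pod must live in kube-system and carry the `scheduler.alpha.kubernetes.io/critical-pod` annotation. The `Pod` type here is a local stand-in, and the sketch omits details of the real helper (for example, the priority-based path added later).

```go
package main

import "fmt"

// criticalPodAnnotation is the alpha annotation consulted by the
// ExperimentalCriticalPodAnnotation feature.
const criticalPodAnnotation = "scheduler.alpha.kubernetes.io/critical-pod"

// Pod is a simplified local stand-in for v1.Pod.
type Pod struct {
	Namespace   string
	Annotations map[string]string
}

// isCriticalPod sketches the check: kube-system namespace plus the
// critical-pod annotation (its value is ignored; presence is enough).
func isCriticalPod(pod *Pod) bool {
	if pod.Namespace != "kube-system" {
		return false
	}
	_, ok := pod.Annotations[criticalPodAnnotation]
	return ok
}

func main() {
	p := &Pod{
		Namespace:   "kube-system",
		Annotations: map[string]string{criticalPodAnnotation: ""},
	}
	fmt.Println(isCriticalPod(p))                      // true
	fmt.Println(isCriticalPod(&Pod{Namespace: "default"})) // false
}
```

Note that the annotation alone is not sufficient: a pod in any other namespace is never treated as critical.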
One important change in Kubernetes 1.11 is that Priority and Preemption graduated from alpha to beta, and are now enabled by default.
| Kubernetes Version | Priority and Preemption State | Enabled by default |
|---|---|---|
| 1.8 | alpha | no |
| 1.9 | alpha | no |
| 1.10 | alpha | no |
| 1.11 | beta | yes |
PriorityClass belongs to the `scheduling.k8s.io/v1alpha1` GroupVersion. After a client submits a request to create a PriorityClass, and before the object is written to etcd, it goes through validation (Validate). This validation treats the two system PriorityClasses, SystemClusterCritical and SystemNodeCritical, specially.
pkg/apis/scheduling/validation/validation.go:30

```go
// ValidatePriorityClass tests whether required fields in the PriorityClass are
// set correctly.
func ValidatePriorityClass(pc *scheduling.PriorityClass) field.ErrorList {
	...
	// If the priorityClass starts with a system prefix, it must be one of the
	// predefined system priority classes.
	if strings.HasPrefix(pc.Name, scheduling.SystemPriorityClassPrefix) {
		if is, err := scheduling.IsKnownSystemPriorityClass(pc); !is {
			allErrs = append(allErrs, field.Forbidden(field.NewPath("metadata", "name"),
				"priority class names with '"+scheduling.SystemPriorityClassPrefix+"' prefix are reserved for system use only. error: "+err.Error()))
		}
	}
	...
	return allErrs
}

// IsKnownSystemPriorityClass checks that "pc" is equal to one of the system PriorityClasses.
// It ignores "description", labels, annotations, etc. of the PriorityClass.
func IsKnownSystemPriorityClass(pc *PriorityClass) (bool, error) {
	for _, spc := range systemPriorityClasses {
		if spc.Name == pc.Name {
			if spc.Value != pc.Value {
				return false, fmt.Errorf("value of %v PriorityClass must be %v", spc.Name, spc.Value)
			}
			if spc.GlobalDefault != pc.GlobalDefault {
				return false, fmt.Errorf("globalDefault of %v PriorityClass must be %v", spc.Name, spc.GlobalDefault)
			}
			return true, nil
		}
	}
	return false, fmt.Errorf("%v is not a known system priority class", pc.Name)
}
```
Therefore, if a PriorityClass name carries the `system-` prefix, it must be either `system-cluster-critical` or `system-node-critical`; otherwise validation fails and the request is rejected. Moreover, for `system-cluster-critical` and `system-node-critical`, `globalDefault` must be false; that is, neither may be the cluster-wide default PriorityClass. In addition, when updating an existing PriorityClass, changing its Name or Value is currently not allowed; only Description and globalDefault can be updated.
pkg/apis/scheduling/helpers.go:27

```go
// SystemPriorityClasses define system priority classes that are auto-created at cluster bootstrapping.
// Our API validation logic ensures that any priority class that has a system prefix or its value
// is higher than HighestUserDefinablePriority is equal to one of these SystemPriorityClasses.
var systemPriorityClasses = []*PriorityClass{
	{
		ObjectMeta: metav1.ObjectMeta{
			Name: SystemNodeCritical,
		},
		Value:       SystemCriticalPriority + 1000,
		Description: "Used for system critical pods that must not be moved from their current node.",
	},
	{
		ObjectMeta: metav1.ObjectMeta{
			Name: SystemClusterCritical,
		},
		Value:       SystemCriticalPriority,
		Description: "Used for system critical pods that must run in the cluster, but can be moved to another node if necessary.",
	},
}
```
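Putting the pieces together, here is a self-contained sketch of the validation rule for `system-`-prefixed names, using local stand-in types rather than the real `pkg/apis/scheduling` package. The numeric values assume the 1.11-era constants, where HighestUserDefinablePriority is 1000000000 and SystemCriticalPriority is twice that, so `system-cluster-critical` is 2000000000 and `system-node-critical` is 2000001000.

```go
package main

import (
	"fmt"
	"strings"
)

// PriorityClass is a simplified local stand-in for scheduling.PriorityClass.
type PriorityClass struct {
	Name          string
	Value         int32
	GlobalDefault bool
}

// Assumed 1.11-era constants: SystemCriticalPriority = 2 * HighestUserDefinablePriority.
const (
	highestUserDefinablePriority = int32(1000000000)
	systemCriticalPriority       = 2 * highestUserDefinablePriority
)

var systemPriorityClasses = []PriorityClass{
	{Name: "system-node-critical", Value: systemCriticalPriority + 1000},
	{Name: "system-cluster-critical", Value: systemCriticalPriority},
}

// validateSystemPrefix sketches the Validate rule shown above: a name with the
// "system-" prefix must match one of the predefined classes exactly, with the
// same Value and with GlobalDefault false.
func validateSystemPrefix(pc PriorityClass) error {
	if !strings.HasPrefix(pc.Name, "system-") {
		return nil // user-defined classes are not constrained by this rule
	}
	for _, spc := range systemPriorityClasses {
		if spc.Name == pc.Name {
			if spc.Value != pc.Value {
				return fmt.Errorf("value of %v must be %v", spc.Name, spc.Value)
			}
			if pc.GlobalDefault {
				return fmt.Errorf("globalDefault of %v must be false", spc.Name)
			}
			return nil
		}
	}
	return fmt.Errorf("%v is not a known system priority class", pc.Name)
}

func main() {
	// Accepted: matches the predefined class exactly.
	fmt.Println(validateSystemPrefix(PriorityClass{Name: "system-cluster-critical", Value: systemCriticalPriority}))
	// Rejected: reserved prefix with an unknown name.
	fmt.Println(validateSystemPrefix(PriorityClass{Name: "system-foo", Value: 10}))
}
```

Running this illustrates both outcomes: the first call returns no error, while the reserved-prefix name `system-foo` is rejected.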
To summarize, the special handling of critical pods in DaemonSetController and during PriorityClass validation is as follows:
- When the ExperimentalCriticalPodAnnotation feature gate is enabled, DaemonSetController adds an `OutOfDisk:NoSchedule` toleration to critical pods, and its Predicates check skips PodFitsResources for them.
- A PriorityClass whose name carries the `system-` prefix must be either `system-cluster-critical` or `system-node-critical`; otherwise validation fails and the request is rejected.
- For `system-cluster-critical` and `system-node-critical`, `globalDefault` must be false; that is, neither may be the cluster-wide default PriorityClass.