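When you deploy core components in a Kubernetes cluster, you often use Critical Pods. But do you know what actually makes a Critical Pod special? Understanding it fully is not that simple: it touches scheduling, the Kubelet Eviction Manager, the DaemonSet Controller, Kubelet Preemption, and more. I will cover it in a four-part series. This first part describes how Critical Pods behave in the scheduler's predicate phase, and how that differs from the behavior users expect.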
The Rescheduler has been officially announced as deprecated as of Kubernetes 1.10 and will be removed in version 1.12, so this article does not discuss how the Rescheduler handles Critical Pods.
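Rule 1:
The ExperimentalCriticalPodAnnotation feature gate is enabled;
the Pod belongs to the kube-system namespace;
the Pod carries the annotation scheduler.alpha.kubernetes.io/critical-pod="".
Rule 2:
The ExperimentalCriticalPodAnnotation and PodPriority feature gates are enabled;
the Pod's Priority is set and is not less than 2 * 10^9.
For reference, the built-in priority classes have these values:
system-node-critical priority = 2 * 10^9 + 1000;
system-cluster-critical priority = 2 * 10^9.
A Pod that satisfies either Rule 1 or Rule 2 is considered a Critical Pod.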
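The two rules can be sketched as a small self-contained Go program. This is only an illustration of the rules as stated above, not the Kubernetes implementation: the `pod` and `featureGates` types and all function names are hypothetical stand-ins, and only the annotation key, the namespace, and the 2 * 10^9 threshold come from the text.

```go
package main

import "fmt"

const (
	criticalPodAnnotation  = "scheduler.alpha.kubernetes.io/critical-pod"
	systemNamespace        = "kube-system"
	systemCriticalPriority = int32(2000000000) // 2 * 10^9
)

// pod is a hypothetical, stripped-down stand-in for the real Pod object.
type pod struct {
	Namespace   string
	Annotations map[string]string
	Priority    *int32 // nil when no priority has been resolved
}

// featureGates is a hypothetical stand-in for the cluster's feature gates.
type featureGates struct {
	ExperimentalCriticalPodAnnotation bool
	PodPriority                       bool
}

// criticalByAnnotation models Rule 1: gate on, kube-system namespace,
// and the critical-pod annotation present with an empty value.
func criticalByAnnotation(p pod, g featureGates) bool {
	if !g.ExperimentalCriticalPodAnnotation || p.Namespace != systemNamespace {
		return false
	}
	v, ok := p.Annotations[criticalPodAnnotation]
	return ok && v == ""
}

// criticalByPriority models Rule 2: both gates on and a priority
// that is set and not less than 2 * 10^9.
func criticalByPriority(p pod, g featureGates) bool {
	if !g.ExperimentalCriticalPodAnnotation || !g.PodPriority {
		return false
	}
	return p.Priority != nil && *p.Priority >= systemCriticalPriority
}

// isCriticalPod returns true when either rule is satisfied.
func isCriticalPod(p pod, g featureGates) bool {
	return criticalByAnnotation(p, g) || criticalByPriority(p, g)
}

func main() {
	gates := featureGates{ExperimentalCriticalPodAnnotation: true, PodPriority: true}
	nodeCritical := systemCriticalPriority + 1000

	annotated := pod{
		Namespace:   systemNamespace,
		Annotations: map[string]string{criticalPodAnnotation: ""},
	}
	prioritized := pod{Namespace: "default", Priority: &nodeCritical}
	ordinary := pod{Namespace: "default"}

	fmt.Println("annotated:", isCriticalPod(annotated, gates))
	fmt.Println("prioritized:", isCriticalPod(prioritized, gates))
	fmt.Println("ordinary:", isCriticalPod(ordinary, gates))
}
```

Keeping the two rules in separate functions makes it easy to see that Rule 1 depends on namespace and annotation only, while Rule 2 depends on the resolved priority only.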
In the predicate phase of the default scheduler, GeneralPredicates is registered as one of the default predicates. The scheduler never checks whether a Pod is critical in order to run only EssentialPredicates for it. What does that mean in practice?
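The relationship between GeneralPredicates and EssentialPredicates makes it clear. GeneralPredicates first calls noncriticalPredicates and then calls EssentialPredicates. So if you mark a Deployment, StatefulSet, or similar workload (DaemonSet excepted) as critical, the scheduler still runs the full GeneralPredicates flow, including noncriticalPredicates, whereas you would expect it to go straight to EssentialPredicates.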
    // GeneralPredicates checks whether noncriticalPredicates and EssentialPredicates pass. noncriticalPredicates are the predicates
    // that only non-critical pods need and EssentialPredicates are the predicates that all pods, including critical pods, need
    func GeneralPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
        var predicateFails []algorithm.PredicateFailureReason
        fit, reasons, err := noncriticalPredicates(pod, meta, nodeInfo)
        if err != nil {
            return false, predicateFails, err
        }
        if !fit {
            predicateFails = append(predicateFails, reasons...)
        }

        fit, reasons, err = EssentialPredicates(pod, meta, nodeInfo)
        if err != nil {
            return false, predicateFails, err
        }
        if !fit {
            predicateFails = append(predicateFails, reasons...)
        }

        return len(predicateFails) == 0, predicateFails, nil
    }
noncriticalPredicates was intended to hold the extra predicate logic that applies only to non-critical Pods. That logic is the PodFitsResources check.
pkg/scheduler/algorithm/predicates/predicates.go:1076

    // noncriticalPredicates are the predicates that only non-critical pods need
    func noncriticalPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
        var predicateFails []algorithm.PredicateFailureReason
        fit, reasons, err := PodFitsResources(pod, meta, nodeInfo)
        if err != nil {
            return false, predicateFails, err
        }
        if !fit {
            predicateFails = append(predicateFails, reasons...)
        }

        return len(predicateFails) == 0, predicateFails, nil
    }
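PodFitsResources checks whether the node has enough of the following resources: Allowed Pod Number, CPU, Memory, EphemeralStorage, and Extended Resources.
In other words, if you mark a Deployment, StatefulSet, or similar workload (DaemonSet excepted) as critical, scheduling its Pods still checks whether all of these resources are sufficient. If they are not, the predicate phase fails. The Preempt phase then runs only the normal preemption logic based on the Pod's PriorityClass, with no special handling for Critical Pods. As a result, the scheduler may find no Node that satisfies the resource requirements, and the Critical Pod fails to schedule and stays in the Pending state.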
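To make the consequence concrete, here is a minimal Go model of the unmodified flow. It is a sketch under heavy simplification, not scheduler code: the `node` and `pod` types, the single CPU dimension, and the function names are all hypothetical, and the node-name check merely stands in for the kind of checks that belong to EssentialPredicates.

```go
package main

import "fmt"

// node is a hypothetical model with a single resource dimension (CPU millicores).
type node struct {
	Name           string
	AllocatableCPU int64
	RequestedCPU   int64 // already requested by pods on the node
}

// pod is a hypothetical model of a pending pod.
type pod struct {
	Name       string
	NodeName   string // optional pin to a node; "" means any node
	CPURequest int64
	Critical   bool
}

// noncritical plays the role of noncriticalPredicates: a resource-fit check.
func noncritical(p pod, n node) []string {
	if n.RequestedCPU+p.CPURequest > n.AllocatableCPU {
		return []string{"Insufficient cpu"}
	}
	return nil
}

// essential plays the role of EssentialPredicates: checks every pod needs.
func essential(p pod, n node) []string {
	if p.NodeName != "" && p.NodeName != n.Name {
		return []string{"node name mismatch"}
	}
	return nil
}

// generalPredicates mirrors the unmodified flow: both groups always run,
// and p.Critical is never consulted.
func generalPredicates(p pod, n node) (bool, []string) {
	var fails []string
	fails = append(fails, noncritical(p, n)...)
	fails = append(fails, essential(p, n)...)
	return len(fails) == 0, fails
}

func main() {
	almostFull := node{Name: "node-1", AllocatableCPU: 4000, RequestedCPU: 3900}
	critical := pod{Name: "critical-service", CPURequest: 500, Critical: true}

	fit, reasons := generalPredicates(critical, almostFull)
	fmt.Println("fit:", fit, "reasons:", reasons)
}
```

Even though the pod is flagged as critical, the resource check rejects the node, which is the situation that leaves a Critical Pod Pending when no node has spare capacity and preemption does not free enough.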
Users who mark a Pod as critical, however, do not want scheduling to fail because of insufficient resources. So what if you do want to deploy a key service as a Critical Pod through a Deployment, StatefulSet, or similar workload (DaemonSet excepted)? There are two options:
Method 1: give the Pod the system-cluster-critical or system-node-critical Priority Class, so that it obtains resources through the scheduler's normal Preempt flow and completes scheduling.
Method 2: modify GeneralPredicates as shown below. The code checks whether the Pod is a Critical Pod and, if it is, skips noncriticalPredicates. In other words, the predicate phase no longer checks Allowed Pod Number, CPU, Memory, EphemeralStorage, or Extended Resources for Critical Pods.

    // Requires importing the utilfeature, features and kubelettypes packages in predicates.go.
    func GeneralPredicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
        var predicateFails, reasons []algorithm.PredicateFailureReason
        var fit bool
        var err error

        // **Modify**: check whether the pod is a Critical Pod; if it is, skip noncriticalPredicates.
        isCriticalPod := utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
            kubelettypes.IsCriticalPod(pod)
        if !isCriticalPod {
            fit, reasons, err = noncriticalPredicates(pod, meta, nodeInfo)
            if err != nil {
                return false, predicateFails, err
            }
            // Record resource failures only when the check actually ran.
            if !fit {
                predicateFails = append(predicateFails, reasons...)
            }
        }

        fit, reasons, err = EssentialPredicates(pod, meta, nodeInfo)
        if err != nil {
            return false, predicateFails, err
        }
        if !fit {
            predicateFails = append(predicateFails, reasons...)
        }

        return len(predicateFails) == 0, predicateFails, nil
    }
As for Method 1, Kubernetes already does this for you during the Priority admission check.
    // admitPod makes sure a new pod does not set spec.Priority field. It also makes sure that the PriorityClassName exists
    // if it is provided and resolves the pod priority from the PriorityClassName.
    func (p *priorityPlugin) admitPod(a admission.Attributes) error {
        ...
        if utilfeature.DefaultFeatureGate.Enabled(features.PodPriority) {
            var priority int32
            if len(pod.Spec.PriorityClassName) == 0 &&
                utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
                kubelettypes.IsCritical(a.GetNamespace(), pod.Annotations) {
                pod.Spec.PriorityClassName = scheduling.SystemClusterCritical
            }
            ...
    }
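During admission, the Pod's priority is checked. If all of the following hold:
the PodPriority and ExperimentalCriticalPodAnnotation feature gates are enabled;
the Pod is in the kube-system namespace and carries the scheduler.alpha.kubernetes.io/critical-pod="" annotation;
the Pod does not set a PriorityClassName;
then the Priority admission stage automatically assigns the SystemClusterCritical (system-cluster-critical) PriorityClass to the Pod.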
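The defaulting behavior described above can be modeled in a few lines of Go. This is an illustrative sketch only: the `pod` and `featureGates` types and the function names are hypothetical, and the real admission plugin does considerably more (such as resolving the numeric priority), which is elided in the excerpt and omitted here.

```go
package main

import "fmt"

const (
	criticalPodAnnotation = "scheduler.alpha.kubernetes.io/critical-pod"
	systemNamespace       = "kube-system"
	systemClusterCritical = "system-cluster-critical"
)

// pod is a hypothetical, stripped-down stand-in for the real Pod object.
type pod struct {
	Namespace         string
	Annotations       map[string]string
	PriorityClassName string
}

// featureGates is a hypothetical stand-in for the cluster's feature gates.
type featureGates struct {
	PodPriority                       bool
	ExperimentalCriticalPodAnnotation bool
}

// isCritical models the annotation-based check: kube-system namespace
// plus the critical-pod annotation with an empty value.
func isCritical(namespace string, annotations map[string]string) bool {
	if namespace != systemNamespace {
		return false
	}
	v, ok := annotations[criticalPodAnnotation]
	return ok && v == ""
}

// defaultPriorityClass models the admission-time defaulting: it fills in
// system-cluster-critical only when the pod has no PriorityClassName yet.
func defaultPriorityClass(p *pod, g featureGates) {
	if !g.PodPriority {
		return
	}
	if p.PriorityClassName == "" &&
		g.ExperimentalCriticalPodAnnotation &&
		isCritical(p.Namespace, p.Annotations) {
		p.PriorityClassName = systemClusterCritical
	}
}

func main() {
	gates := featureGates{PodPriority: true, ExperimentalCriticalPodAnnotation: true}
	p := pod{
		Namespace:   systemNamespace,
		Annotations: map[string]string{criticalPodAnnotation: ""},
	}
	defaultPriorityClass(&p, gates)
	fmt.Println("priorityClassName:", p.PriorityClassName)
}
```

Note that an explicitly set PriorityClassName is left untouched, so a Pod that already names system-node-critical keeps it.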
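The analysis above leads to the following best practice. When you deploy a key service in a Kubernetes cluster through anything other than a DaemonSet (for example a Deployment or ReplicaSet), and you want preemptive scheduling to succeed even when cluster resources are short, make sure the Pod ends up with the system-cluster-critical or system-node-critical Priority Class: either set it explicitly, or satisfy Rule 1 with the PodPriority feature gate enabled so that admission assigns system-cluster-critical for you.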
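This article described two ways to mark a key service as critical, explained how Critical Pods (other than those deployed by a DaemonSet) behave in the scheduler's predicate phase, and gave a best practice.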