Please credit the source when reposting~. This article was published on luozhiyun's blog: https://www.luozhiyun.com
The source code version is 1.19.
In the previous article we covered the case where a node is found successfully. If a high-priority pod fails to get a node, it enters the preemption phase. What does k8s do during preemption, how is preemption triggered, and which resources get preempted? Those are the questions this article explores.
Normally, when a Pod fails to be scheduled, it is temporarily "shelved" until the Pod is updated or the cluster state changes, at which point the scheduler retries it. We can avoid this with a PriorityClass: by assigning a higher priority to certain pods, a pod that fails to schedule is not "shelved" but instead "evicts" some lower-priority pods on a node, so that the high-priority pod is guaranteed to schedule successfully.
To use a PriorityClass, we first define a PriorityClass object, for example:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."
The higher the value, the higher the priority. If globalDefault is set to true, this PriorityClass's value becomes the cluster-wide default; with false, only Pods that explicitly declare this PriorityClass get a priority of 1000000, while Pods that declare no PriorityClass keep a priority of 0.
A Pod can then declare that it uses it:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority
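For pods that never set a priority, the scheduler falls back to 0 when comparing pods. As a minimal sketch of how that resolution reads (mirroring the podutil.GetPodPriority helper the walkthrough below keeps calling; the admission plugin is what resolves priorityClassName into Spec.Priority):

package main

import (
    v1 "k8s.io/api/core/v1"
)

// getPodPriority sketches podutil.GetPodPriority in 1.19: a pod whose
// priority was never resolved by admission simply reads as 0.
func getPodPriority(pod *v1.Pod) int32 {
    if pod.Spec.Priority != nil {
        return *pod.Spec.Priority
    }
    return 0
}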
When a high-priority Pod fails to schedule, the scheduler's preemption logic is triggered: the scheduler tries to find a node in the current cluster where, once one or more lower-priority Pods are deleted, the pending high-priority Pod can be scheduled.
When a high-priority Pod preempts, its nominatedNodeName field is set to the name of the preempted Node. Then, in the next scheduling cycle, the scheduler decides whether it should actually run on that node. While the Pod is waiting, if another Pod with an even higher priority wants to preempt the same node, the scheduler clears the original preemptor's status.nominatedNodeName field, allowing the higher-priority preemptor to go ahead.
Here I'll bring out this diagram again for the walkthrough. The previous article covered the case where a node is found successfully; if a high-priority pod fails to get a node, it enters the preemption phase.
From the analysis in the previous article we know that in the scheduleOne method, sched.Algorithm.Schedule picks a suitable node. If that fails, execution falls into an if branch that performs preemption.
Code path: pkg/scheduler/scheduler.go
func (sched *Scheduler) scheduleOne(ctx context.Context) {
    ...
    // Pick a suitable node for the pod
    scheduleResult, err := sched.Algorithm.Schedule(schedulingCycleCtx, prof, state, pod)
    // Getting a node failed; run the preemption logic
    if err != nil {
        // After the call above fails, preemption runs on behalf of the pod
        nominatedNode := ""
        if fitError, ok := err.(*core.FitError); ok {
            if !prof.HasPostFilterPlugins() {
                klog.V(3).Infof("No PostFilter plugins are registered, so no preemption will be performed.")
            } else {
                result, status := prof.RunPostFilterPlugins(ctx, state, pod, fitError.FilteredNodesStatuses)
                if status.Code() == framework.Error {
                    klog.Errorf("Status after running PostFilter plugins for pod %v/%v: %v", pod.Namespace, pod.Name, status)
                } else {
                    klog.V(5).Infof("Status after running PostFilter plugins for pod %v/%v: %v", pod.Namespace, pod.Name, status)
                }
                // On a successful preemption, set nominatedNodeName to the preempted
                // Node's name and re-enter the next scheduling cycle
                if status.IsSuccess() && result != nil {
                    nominatedNode = result.NominatedNodeName
                }
            }
            metrics.PodUnschedulable(prof.Name, metrics.SinceInSeconds(start))
        } else if err == core.ErrNoNodesAvailable {
            metrics.PodUnschedulable(prof.Name, metrics.SinceInSeconds(start))
        } else {
            klog.ErrorS(err, "Error selecting node for pod", "pod", klog.KObj(pod))
            metrics.PodScheduleError(prof.Name, metrics.SinceInSeconds(start))
        }
        sched.recordSchedulingFailure(prof, podInfo, err, v1.PodReasonUnschedulable, nominatedNode)
        return
    }
    ...
}
In this method, RunPostFilterPlugins executes the actual preemption logic and returns the preempted node. The preemptor is not immediately scheduled onto that node; the scheduler only sets the preemptor's status.nominatedNodeName field to the preempted node's name. The preemptor then re-enters the next scheduling cycle, where the decision is made whether it should run on the preempted node. Even in that next cycle, the scheduler does not guarantee that the preemptor will end up there.
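For reference, recording a nominated node on a pod ultimately comes down to a patch against the pod object. Below is a hedged client-go sketch of that idea; the scheduler funnels this through its error handler and its own patch helper, so treat the exact call shape as illustrative rather than the scheduler's literal code:

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
)

// nominateNode sketches how status.nominatedNodeName could be recorded with a
// strategic-merge patch on the pod's status subresource; illustrative only.
func nominateNode(ctx context.Context, cs kubernetes.Interface, ns, podName, nodeName string) error {
    patch := []byte(fmt.Sprintf(`{"status":{"nominatedNodeName":%q}}`, nodeName))
    _, err := cs.CoreV1().Pods(ns).Patch(ctx, podName, types.StrategicMergePatchType, patch, metav1.PatchOptions{}, "status")
    return err
}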
An important reason for this design is that the scheduler only deletes preempted pods through the standard DELETE API, so those pods necessarily get a "graceful termination" window (30s by default). During that window, other nodes may become schedulable, or new nodes may be added to the cluster.
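Since victims go away through the standard DELETE API, the call is nothing more exotic than the following client-go sketch; with an empty DeleteOptions the pod keeps its default 30s grace period (assuming the pod spec does not override it):

package main

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// deleteVictim mirrors what util.DeletePod boils down to: a plain DELETE,
// so the victim still gets its graceful termination window.
func deleteVictim(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
    return cs.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{})
}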
While the preemptor is waiting to be scheduled, if another pod with an even higher priority wants to preempt the same node, the scheduler clears the original preemptor's status.nominatedNodeName field, allowing the higher-priority preemptor to proceed; this also gives the original preemptor a chance to go and preempt a different node instead.
Moving on: RunPostFilterPlugins iterates over all postFilterPlugins and calls the runPostFilterPlugin method:
func (f *frameworkImpl) RunPostFilterPlugins(ctx context.Context, state *framework.CycleState, pod *v1.Pod, filteredNodeStatusMap framework.NodeToStatusMap) (_ *framework.PostFilterResult, status *framework.Status) {
    startTime := time.Now()
    defer func() {
        metrics.FrameworkExtensionPointDuration.WithLabelValues(postFilter, status.Code().String(), f.profileName).Observe(metrics.SinceInSeconds(startTime))
    }()

    statuses := make(framework.PluginToStatus)
    // postFilterPlugins contains only one plugin: defaultpreemption
    for _, pl := range f.postFilterPlugins {
        r, s := f.runPostFilterPlugin(ctx, pl, state, pod, filteredNodeStatusMap)
        if s.IsSuccess() {
            return r, s
        } else if !s.IsUnschedulable() {
            // Any status other than Success or Unschedulable is Error.
            return nil, framework.NewStatus(framework.Error, s.Message())
        }
        statuses[pl.Name()] = s
    }

    return nil, statuses.Merge()
}
From the scheduler initialization we looked at in the previous article, the registered PostFilter plugins are as follows:
Code path: pkg/scheduler/algorithmprovider/registry.go
PostFilter: &schedulerapi.PluginSet{
    Enabled: []schedulerapi.Plugin{
        {Name: defaultpreemption.Name},
    },
},
So currently only defaultpreemption implements the preemption logic. The loop over postFilterPlugins calls runPostFilterPlugin, which runs defaultpreemption's PostFilter method, which in turn calls preempt to do the actual preemption.
Code path: pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go
func (pl *DefaultPreemption) PostFilter(...) (*framework.PostFilterResult, *framework.Status) {
    ...
    // Run preemption
    nnn, err := pl.preempt(ctx, state, pod, m)
    ...
    return &framework.PostFilterResult{NominatedNodeName: nnn}, framework.NewStatus(framework.Success)
}
The execution flow of preemption is shown in the following diagram:
Code path: pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go
func (pl *DefaultPreemption) preempt(ctx context.Context, state *framework.CycleState, pod *v1.Pod, m framework.NodeToStatusMap) (string, error) {
    cs := pl.fh.ClientSet()
    ph := pl.fh.PreemptHandle()
    // Returns the node list
    nodeLister := pl.fh.SnapshotSharedLister().NodeInfos()

    pod, err := util.GetUpdatedPod(cs, pod)
    if err != nil {
        klog.Errorf("Error getting the updated preemptor pod object: %v", err)
        return "", err
    }

    // Check whether the preemptor is allowed to preempt; if a pod on the nominated
    // node is already in graceful termination, no preemption should happen
    if !PodEligibleToPreemptOthers(pod, nodeLister, m[pod.Status.NominatedNodeName]) {
        klog.V(5).Infof("Pod %v/%v is not eligible for more preemption.", pod.Namespace, pod.Name)
        return "", nil
    }

    // Find all preemption candidates
    candidates, err := FindCandidates(ctx, cs, state, pod, m, ph, nodeLister, pl.pdbLister)
    if err != nil || len(candidates) == 0 {
        return "", err
    }

    // Run extenders if there are any
    candidates, err = CallExtenders(ph.Extenders(), pod, nodeLister, candidates)
    if err != nil {
        return "", err
    }

    // Find the best preemption candidate
    bestCandidate := SelectCandidate(candidates)
    if bestCandidate == nil || len(bestCandidate.Name()) == 0 {
        return "", nil
    }

    // Do some preparation before preempting a node
    if err := PrepareCandidate(bestCandidate, pl.fh, cs, pod); err != nil {
        return "", err
    }

    return bestCandidate.Name(), nil
}
The preempt method first fetches the node list, then refreshes the preemptor pod's information, and then performs preemption in the following steps:
PodEligibleToPreemptOthers
func PodEligibleToPreemptOthers(pod *v1.Pod, nodeInfos framework.NodeInfoLister, nominatedNodeStatus *framework.Status) bool {
    if pod.Spec.PreemptionPolicy != nil && *pod.Spec.PreemptionPolicy == v1.PreemptNever {
        klog.V(5).Infof("Pod %v/%v is not eligible for preemption because it has a preemptionPolicy of %v", pod.Namespace, pod.Name, v1.PreemptNever)
        return false
    }
    // Check whether the preemptor has already preempted before
    nomNodeName := pod.Status.NominatedNodeName
    if len(nomNodeName) > 0 {
        if nominatedNodeStatus.Code() == framework.UnschedulableAndUnresolvable {
            return true
        }
        // Get the node that was preempted
        if nodeInfo, _ := nodeInfos.Get(nomNodeName); nodeInfo != nil {
            // Check whether a pod with a lower priority than the preemptor is
            // already being deleted
            podPriority := podutil.GetPodPriority(pod)
            for _, p := range nodeInfo.Pods {
                if p.Pod.DeletionTimestamp != nil && podutil.GetPodPriority(p.Pod) < podPriority {
                    // There is a terminating pod on the nominated node.
                    return false
                }
            }
        }
    }
    return true
}
This method checks whether the pod has already preempted another node. If so, it walks all pods on that node: if any pod has a lower priority than the pending pod and is already terminating, it returns false and no further preemption happens, since that termination is already freeing resources.
Next, let's look at the FindCandidates method:
FindCandidates
func FindCandidates(ctx context.Context, cs kubernetes.Interface, state *framework.CycleState, pod *v1.Pod,
    m framework.NodeToStatusMap, ph framework.PreemptHandle, nodeLister framework.NodeInfoLister,
    pdbLister policylisters.PodDisruptionBudgetLister) ([]Candidate, error) {
    allNodes, err := nodeLister.List()
    if err != nil {
        return nil, err
    }
    if len(allNodes) == 0 {
        return nil, core.ErrNoNodesAvailable
    }

    // Find nodes that failed the predicates phase but might be schedulable
    // via preemption
    potentialNodes := nodesWherePreemptionMightHelp(allNodes, m)
    if len(potentialNodes) == 0 {
        klog.V(3).Infof("Preemption will not help schedule pod %v/%v on any node.", pod.Namespace, pod.Name)
        if err := util.ClearNominatedNodeName(cs, pod); err != nil {
            klog.Errorf("Cannot clear 'NominatedNodeName' field of pod %v/%v: %v", pod.Namespace, pod.Name, err)
        }
        return nil, nil
    }
    if klog.V(5).Enabled() {
        var sample []string
        for i := 0; i < 10 && i < len(potentialNodes); i++ {
            sample = append(sample, potentialNodes[i].Node().Name)
        }
        klog.Infof("%v potential nodes for preemption, first %v are: %v", len(potentialNodes), len(sample), sample)
    }

    // Get the PDB objects; a PDB limits how many pods can be disrupted at the
    // same time, preserving cluster availability
    pdbs, err := getPodDisruptionBudgets(pdbLister)
    if err != nil {
        return nil, err
    }

    // Find the qualifying nodes and wrap them into a slice of candidates
    return dryRunPreemption(ctx, ph, state, pod, potentialNodes, pdbs), nil
}
FindCandidates first lists the nodes, then calls nodesWherePreemptionMightHelp to pick out the nodes that failed the predicates phase but might succeed through preemption, because not every node can be fixed by evicting pods. Finally it calls dryRunPreemption to collect the nodes that actually qualify.
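The filtering idea is simple; a hedged sketch of nodesWherePreemptionMightHelp, simplified from the 1.19 source (imports as in the surrounding file, details may differ slightly), looks like this:

// nodesWherePreemptionMightHelp keeps only nodes whose Filter failure was not
// marked UnschedulableAndUnresolvable: evicting pods cannot fix a node whose
// rejection has nothing to do with the pods running on it (for example a node
// affinity mismatch).
func nodesWherePreemptionMightHelp(nodes []*framework.NodeInfo, m framework.NodeToStatusMap) []*framework.NodeInfo {
    var potentialNodes []*framework.NodeInfo
    for _, node := range nodes {
        name := node.Node().Name
        // A nil status reads as Success, so nodes missing from the map pass too.
        if m[name].Code() == framework.UnschedulableAndUnresolvable {
            continue
        }
        potentialNodes = append(potentialNodes, node)
    }
    return potentialNodes
}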
dryRunPreemption
func dryRunPreemption(ctx context.Context, fh framework.PreemptHandle, state *framework.CycleState, pod *v1.Pod,
    potentialNodes []*framework.NodeInfo, pdbs []*policy.PodDisruptionBudget) []Candidate {
    var resultLock sync.Mutex
    var candidates []Candidate

    checkNode := func(i int) {
        nodeInfoCopy := potentialNodes[i].Clone()
        stateCopy := state.Clone()
        // Find the pods that would be preempted on this node, i.e. the victims
        pods, numPDBViolations, fits := selectVictimsOnNode(ctx, fh, stateCopy, pod, nodeInfoCopy, pdbs)
        if fits {
            resultLock.Lock()
            victims := extenderv1.Victims{
                Pods:             pods,
                NumPDBViolations: int64(numPDBViolations),
            }
            c := candidate{
                victims: &victims,
                name:    nodeInfoCopy.Node().Name,
            }
            candidates = append(candidates, &c)
            resultLock.Unlock()
        }
    }
    parallelize.Until(ctx, len(potentialNodes), checkNode)
    return candidates
}
Here 16 goroutines run checkNode in parallel. checkNode calls selectVictimsOnNode to check whether a node can be preempted; if it can, the returned pods are the victims that would need to be deleted, and the node is wrapped into a candidate and appended to the candidates list.
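parallelize.Until is a thin wrapper over client-go's workqueue parallelizer. A self-contained sketch of the fan-out shape (16 being the scheduler's default parallelism; this is the pattern, not the exact upstream implementation) could look like this:

package main

import (
    "context"
    "sync"
)

// until fans the piece indices out to 16 workers over a shared channel and
// stops handing out new pieces once the context is cancelled.
func until(ctx context.Context, pieces int, doWork func(i int)) {
    const workers = 16
    toProcess := make(chan int, pieces)
    for i := 0; i < pieces; i++ {
        toProcess <- i
    }
    close(toProcess)

    var wg sync.WaitGroup
    wg.Add(workers)
    for w := 0; w < workers; w++ {
        go func() {
            defer wg.Done()
            for i := range toProcess {
                select {
                case <-ctx.Done():
                    return
                default:
                    doWork(i)
                }
            }
        }()
    }
    wg.Wait()
}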
selectVictimsOnNode
func selectVictimsOnNode(
    ctx context.Context,
    ph framework.PreemptHandle,
    state *framework.CycleState,
    pod *v1.Pod,
    nodeInfo *framework.NodeInfo,
    pdbs []*policy.PodDisruptionBudget,
) ([]*v1.Pod, int, bool) {
    var potentialVictims []*v1.Pod

    // Remove a pod from the node
    removePod := func(rp *v1.Pod) error {
        if err := nodeInfo.RemovePod(rp); err != nil {
            return err
        }
        status := ph.RunPreFilterExtensionRemovePod(ctx, state, pod, rp, nodeInfo)
        if !status.IsSuccess() {
            return status.AsError()
        }
        return nil
    }
    // Add a pod back onto the node
    addPod := func(ap *v1.Pod) error {
        nodeInfo.AddPod(ap)
        status := ph.RunPreFilterExtensionAddPod(ctx, state, pod, ap, nodeInfo)
        if !status.IsSuccess() {
            return status.AsError()
        }
        return nil
    }

    // Get the preemptor's priority and remove every pod on the node with a
    // lower priority via removePod
    podPriority := podutil.GetPodPriority(pod)
    for _, p := range nodeInfo.Pods {
        if podutil.GetPodPriority(p.Pod) < podPriority {
            potentialVictims = append(potentialVictims, p.Pod)
            if err := removePod(p.Pod); err != nil {
                return nil, 0, false
            }
        }
    }
    // No lower-priority pods on this node; return immediately
    if len(potentialVictims) == 0 {
        return nil, 0, false
    }

    if fits, _, err := core.PodPassesFiltersOnNode(ctx, ph, state, pod, nodeInfo); !fits {
        if err != nil {
            klog.Warningf("Encountered error while selecting victims on node %v: %v", nodeInfo.Node().Name, err)
        }
        return nil, 0, false
    }

    var victims []*v1.Pod
    numViolatingVictim := 0
    // Sort potentialVictims by priority, highest first
    sort.Slice(potentialVictims, func(i, j int) bool { return util.MoreImportantPod(potentialVictims[i], potentialVictims[j]) })

    // Split the pods into two groups, violatingVictims and nonViolatingVictims,
    // based on whether they are covered by a PDB
    // PDB: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
    violatingVictims, nonViolatingVictims := filterPodsWithPDBViolation(potentialVictims, pdbs)

    reprievePod := func(p *v1.Pod) (bool, error) {
        if err := addPod(p); err != nil {
            return false, err
        }
        fits, _, _ := core.PodPassesFiltersOnNode(ctx, ph, state, pod, nodeInfo)
        if !fits {
            if err := removePod(p); err != nil {
                return false, err
            }
            // Add it to the victims
            victims = append(victims, p)
            klog.V(5).Infof("Pod %v/%v is a potential preemption victim on node %v.", p.Namespace, p.Name, nodeInfo.Node().Name)
        }
        return fits, nil
    }

    // Evict pods, counting how many of them violate a PDB
    for _, p := range violatingVictims {
        if fits, err := reprievePod(p); err != nil {
            klog.Warningf("Failed to reprieve pod %q: %v", p.Name, err)
            return nil, 0, false
        } else if !fits {
            numViolatingVictim++
        }
    }
    // Evict pods
    for _, p := range nonViolatingVictims {
        if _, err := reprievePod(p); err != nil {
            klog.Warningf("Failed to reprieve pod %q: %v", p.Name, err)
            return nil, 0, false
        }
    }
    return victims, numViolatingVictim, true
}
The method first defines two helpers, removePod and addPod, which mirror each other. removePod takes a pod off the node and updates the node's bookkeeping: the pod's requests are subtracted from Requested.MilliCPU and Requested.Memory (the node's accounted resource usage), the pod is dropped from the node's Pods list, and so on.
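As a toy illustration of that bookkeeping (the type here is a hypothetical stand-in, not the real framework.NodeInfo):

package main

import v1 "k8s.io/api/core/v1"

// nodeAccounting stands in for the counters framework.NodeInfo keeps:
// removing a victim subtracts its requests from the accounted usage and
// drops it from the pod list; adding a pod does the reverse.
type nodeAccounting struct {
    requestedMilliCPU int64
    requestedMemory   int64
    pods              []*v1.Pod
}

func (n *nodeAccounting) removePod(rp *v1.Pod, milliCPU, memory int64) {
    n.requestedMilliCPU -= milliCPU
    n.requestedMemory -= memory
    for i, p := range n.pods {
        if p.UID == rp.UID {
            n.pods = append(n.pods[:i], n.pods[i+1:]...)
            break
        }
    }
}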
Back in selectVictimsOnNode, the method iterates over the node's pod list; every pod with a lower priority than the preemptor is appended to the potentialVictims collection and removed from the node via removePod.
Then PodPassesFiltersOnNode is called, and it runs the filters twice. The first pass calls addNominatedPods to fold into the nodeInfo any nominatedPods from the scheduling queue whose priority is greater than or equal to the current pod's, then runs the FilterPlugin list; the second pass runs the FilterPlugins list directly. This is necessary because of inter-pod affinity: k8s has to check whether the pod being scheduled has an affinity dependency on those nominatedPods, even though they are not bound yet.
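A minimal, self-contained sketch of that two-pass idea (the names here are hypothetical; the real PodPassesFiltersOnNode lives in pkg/scheduler/core):

package main

// twoPassFit distills the loop inside PodPassesFiltersOnNode: the first pass
// folds higher-or-equal priority nominated pods into a copy of the node before
// filtering, and the second pass filters against the bare node, because the
// pod must fit in both worlds when inter-pod affinity is involved.
func twoPassFit(runFilters func(includeNominated bool) bool, nominatedPodsExist bool) bool {
    if !runFilters(true) { // pass 1: node + nominated pods
        return false
    }
    if !nominatedPodsExist {
        return true // the second pass would be identical, skip it
    }
    return runFilters(false) // pass 2: bare node
}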
Continuing on, potentialVictims is sorted by priority, highest first.
Next, filterPodsWithPDBViolation separates the Pods constrained by a PDB from the unconstrained ones, and violatingVictims and nonViolatingVictims are each walked with reprievePod, which tries to add a pod back and only keeps it as a victim if the preemptor then no longer fits. As the official documentation notes, PodDisruptionBudget is supported during preemption, but the guarantee is best-effort. The removed pods are collected into the victims list, the count of PDB-violating victims is recorded, and both are returned.
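A simplified, hypothetical sketch of the split filterPodsWithPDBViolation performs; matchesPDB and the allowed map stand in for the real label-selector matching and PDB status bookkeeping:

package main

// splitByPDB sends a pod to the violating group when its matching PDB has no
// disruptions left to spend; otherwise evicting it consumes one allowance.
func splitByPDB(pods []string, matchesPDB func(pod string) (pdbKey string, ok bool), allowed map[string]int) (violating, nonViolating []string) {
    for _, p := range pods {
        if pdbKey, ok := matchesPDB(p); ok {
            if allowed[pdbKey] <= 0 {
                violating = append(violating, p)
                continue
            }
            allowed[pdbKey]--
        }
        nonViolating = append(nonViolating, p)
    }
    return violating, nonViolating
}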
That completes the FindCandidates method, which was fairly long. Back in the preempt method, the next step is SelectCandidate, which picks the best preemption candidate.
SelectCandidate
func SelectCandidate(candidates []Candidate) Candidate {
    if len(candidates) == 0 {
        return nil
    }
    if len(candidates) == 1 {
        return candidates[0]
    }

    victimsMap := candidatesToVictimsMap(candidates)
    // Pick one node to schedule onto
    candidateNode := pickOneNodeForPreemption(victimsMap)
    for _, candidate := range candidates {
        if candidateNode == candidate.Name() {
            return candidate
        }
    }
    klog.Errorf("None candidate can be picked from %v.", candidates)
    return candidates[0]
}
This method calls candidatesToVictimsMap to build a name-to-victims map, then calls pickOneNodeForPreemption, which does the main filtering.
pickOneNodeForPreemption
func pickOneNodeForPreemption(nodesToVictims map[string]*extenderv1.Victims) string {
    // Return if no node has victims
    if len(nodesToVictims) == 0 {
        return ""
    }
    minNumPDBViolatingPods := int64(math.MaxInt32)
    var minNodes1 []string
    lenNodes1 := 0
    // Find the nodes with the fewest PDB violations
    for node, victims := range nodesToVictims {
        numPDBViolatingPods := victims.NumPDBViolations
        if numPDBViolatingPods < minNumPDBViolatingPods {
            minNumPDBViolatingPods = numPDBViolatingPods
            minNodes1 = nil
            lenNodes1 = 0
        }
        if numPDBViolatingPods == minNumPDBViolatingPods {
            minNodes1 = append(minNodes1, node)
            lenNodes1++
        }
    }
    // If only one node remains, return it directly
    if lenNodes1 == 1 {
        return minNodes1[0]
    }

    minHighestPriority := int32(math.MaxInt32)
    var minNodes2 = make([]string, lenNodes1)
    lenNodes2 := 0
    // Find the nodes whose highest victim priority is the lowest
    for i := 0; i < lenNodes1; i++ {
        node := minNodes1[i]
        victims := nodesToVictims[node]
        highestPodPriority := podutil.GetPodPriority(victims.Pods[0])
        if highestPodPriority < minHighestPriority {
            minHighestPriority = highestPodPriority
            lenNodes2 = 0
        }
        if highestPodPriority == minHighestPriority {
            minNodes2[lenNodes2] = node
            lenNodes2++
        }
    }
    if lenNodes2 == 1 {
        return minNodes2[0]
    }

    // Find the nodes whose victims have the smallest priority sum
    minSumPriorities := int64(math.MaxInt64)
    lenNodes1 = 0
    for i := 0; i < lenNodes2; i++ {
        var sumPriorities int64
        node := minNodes2[i]
        for _, pod := range nodesToVictims[node].Pods {
            sumPriorities += int64(podutil.GetPodPriority(pod)) + int64(math.MaxInt32+1)
        }
        if sumPriorities < minSumPriorities {
            minSumPriorities = sumPriorities
            lenNodes1 = 0
        }
        if sumPriorities == minSumPriorities {
            minNodes1[lenNodes1] = node
            lenNodes1++
        }
    }
    if lenNodes1 == 1 {
        return minNodes1[0]
    }

    // Find the nodes with the fewest victims
    minNumPods := math.MaxInt32
    lenNodes2 = 0
    for i := 0; i < lenNodes1; i++ {
        node := minNodes1[i]
        numPods := len(nodesToVictims[node].Pods)
        if numPods < minNumPods {
            minNumPods = numPods
            lenNodes2 = 0
        }
        if numPods == minNumPods {
            minNodes2[lenNodes2] = node
            lenNodes2++
        }
    }
    if lenNodes2 == 1 {
        return minNodes2[0]
    }

    // If several nodes tie on victim count, pick the one whose highest-priority
    // victim started most recently
    latestStartTime := util.GetEarliestPodStartTime(nodesToVictims[minNodes2[0]])
    if latestStartTime == nil {
        klog.Errorf("earliestStartTime is nil for node %s. Should not reach here.", minNodes2[0])
        return minNodes2[0]
    }
    nodeToReturn := minNodes2[0]
    for i := 1; i < lenNodes2; i++ {
        node := minNodes2[i]
        earliestStartTimeOnNode := util.GetEarliestPodStartTime(nodesToVictims[node])
        if earliestStartTimeOnNode == nil {
            klog.Errorf("earliestStartTime is nil for node %s. Should not reach here.", node)
            continue
        }
        if earliestStartTimeOnNode.After(latestStartTime.Time) {
            latestStartTime = earliestStartTimeOnNode
            nodeToReturn = node
        }
    }
    return nodeToReturn
}
This method looks long, but the logic is quite clear; the tie-breaking criteria are applied in order:

1. pick the nodes with the fewest PDB violations;
2. among those, pick the nodes whose highest victim priority is the lowest;
3. if still tied, pick the nodes whose victims have the smallest priority sum;
4. if still tied, pick the nodes with the fewest victims;
5. if still tied, pick the node whose highest-priority victim has the latest start time, i.e. whose workload has been running for the shortest time.
The preempt method then moves on to PrepareCandidate:
PrepareCandidate
func PrepareCandidate(c Candidate, fh framework.FrameworkHandle, cs kubernetes.Interface, pod *v1.Pod) error {
    for _, victim := range c.Victims().Pods {
        if err := util.DeletePod(cs, victim); err != nil {
            klog.Errorf("Error preempting pod %v/%v: %v", victim.Namespace, victim.Name, err)
            return err
        }
        // If the victim is still waiting at a permit plugin, reject it instead
        if waitingPod := fh.GetWaitingPod(victim.UID); waitingPod != nil {
            waitingPod.Reject("preempted")
        }
        fh.EventRecorder().Eventf(victim, pod, v1.EventTypeNormal, "Preempted", "Preempting", "Preempted by %v/%v on node %v", pod.Namespace, pod.Name, c.Name())
    }
    metrics.PreemptionVictims.Observe(float64(len(c.Victims().Pods)))

    // Clear the Nominated field of lower-priority pods and update them, moving
    // them back to the activeQ so the scheduler rebinds them to nodes
    nominatedPods := getLowerPriorityNominatedPods(fh.PreemptHandle(), pod, c.Name())
    if err := util.ClearNominatedNodeName(cs, nominatedPods...); err != nil {
        klog.Errorf("Cannot clear 'NominatedNodeName' field: %v", err)
        // We do not return as this error is not critical.
    }

    return nil
}
This method calls DeletePod on each pod in the Victims list, then clears the Status.NominatedNodeName field of the lower-priority pods that were nominated to the same node, so they go back through scheduling.
That wraps up the whole preemption process~
After this article we have a global picture of preemption in k8s and should be clear about what happens when the scheduler preempts: when and which pods trigger preemption, why it runs, whose resources get preempted, whether the preempted pods are rescheduled, and so on.
https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/
https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/
https://kubernetes.io/docs/concepts/configuration/pod-overhead/
https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
https://kubernetes.io/docs/tasks/run-application/configure-pdb/
https://kubernetes.io/docs/tasks/configure-pod-container/assign-pods-nodes-using-node-affinity/