Author: xidianwangtao@gmail.com
Abstract: The Kubernetes CPU Manager is deployed at scale in our production environment, so we need a deep understanding of it to operate it with confidence. This article covers the CPU Manager's use cases, how to use it, how it works, the problems that can arise, and how to fix them. I hope it is helpful.
Users familiar with Docker have almost certainly used its cpuset capability, which binds a container to specific CPUs and memory nodes at startup:
```
--cpuset-cpus=""   CPUs in which to allow execution (0-3, 0,1)
--cpuset-mems=""   Memory nodes (MEMs) in which to allow execution (0-3, 0,1). Only effective on NUMA systems.
```
Kubernetes, however, did not expose an equivalent capability until v1.8, when the CPU Manager feature was introduced to support cpuset. From Kubernetes 1.10 up to the current 1.12, the feature is still Beta.
CPU Manager is a module inside the kubelet's Container Manager (CM). Its goal is to bind certain containers to dedicated CPUs, thereby improving the performance of CPU-sensitive workloads.
As mentioned above, CPU-sensitive workloads can gain a significant performance boost from using cpuset. So what characteristics make a workload CPU-sensitive?
The Kubernetes blog post "Feature Highlight: CPU Manager" also lists some concrete sample comparisons, which are worth a read if you are interested. Many of our company's applications are of this type, and cpuset has the added benefit of making CPU resource accounting more convenient. Of course, this will almost certainly lower overall cluster CPU utilization somewhat, so it comes down to whether application performance is your top priority.
In Kubernetes v1.8–v1.9 the CPU Manager was Alpha; in v1.10–v1.12 it is Beta. I have not followed the CPU Manager changelogs for these releases, but I still recommend using it on 1.10 or later.
Make sure the kubelet's CPUManager feature gate is set to true (BETA - default=true).
Currently the CPU Manager supports two policies, none and static, set via the kubelet flag --cpu-manager-policy. A dynamic policy, which would adjust a container's cpuset during its lifetime, may be added in the future.
none: the default value for the CPU Manager, equivalent to not enabling cpuset at all. The CPU request maps to cpu shares and the CPU limit maps to CFS quota (a sketch of this mapping follows this configuration list).
static: currently enabled by setting --cpu-manager-policy=static. The kubelet assigns a dedicated CPU set to the container before it starts, and the assignment also takes CPU topology into account to improve CPU affinity, as described later.
Make sure the kubelet has values configured for both --kube-reserved and --system-reserved. They do not have to be an integer number of CPUs; the reserved CPU count is rounded up when it is computed. The purpose is to prevent the CPU Manager from handing out all of the node's CPU cores, which would leave the kubelet and system processes with no usable CPU.
Note that the CPU Manager also has a --cpu-manager-reconcile-period option, which configures how often the CPU Manager reconciles the CPU assignments held in kubelet memory into the cpuset cgroups. If it is not set, the value of --node-status-update-frequency (default 10s) is used.
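For reference, here is a minimal sketch of how a CPU request and limit translate into cpu.shares and CFS quota under the none policy. The conversion constants follow the standard cgroup formulas the kubelet uses (shares = milliCPU × 1024 / 1000, quota = milliCPU × period / 1000 with a 100ms period); the helper names and the example values are mine.

```go
package main

import "fmt"

const (
	sharesPerCPU  = 1024   // cpu.shares granted per full CPU
	milliCPUToCPU = 1000   // milliCPU per CPU
	quotaPeriod   = 100000 // CFS period in microseconds (100ms)
)

// milliCPUToShares converts a CPU request in milliCPU to cpu.shares.
func milliCPUToShares(milliCPU int64) int64 {
	return milliCPU * sharesPerCPU / milliCPUToCPU
}

// milliCPUToQuota converts a CPU limit in milliCPU to a CFS quota in µs per period.
func milliCPUToQuota(milliCPU int64) int64 {
	return milliCPU * quotaPeriod / milliCPUToCPU
}

func main() {
	// Example: requests.cpu=500m, limits.cpu=2
	fmt.Println(milliCPUToShares(500)) // 512 shares
	fmt.Println(milliCPUToQuota(2000)) // 200000µs of CPU time per 100000µs period
}
```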
Once the above is configured, the static CPU Manager is enabled; the next step is to use it in a workload. Kubernetes requires that a Pod/Container using the CPU Manager meet the following two conditions: the Pod's QoS class must be Guaranteed, and the container's CPU request (equal to its limit) must be an integer number of cores. For example:
```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "200Mi"
        cpu: "2"
```
For containers in any other situation, the CPU Manager does not assign dedicated CPUs; they run under CFS on the CPUs in the shared pool. The shared pool is the node's CPUCapacity - ReservedCPUs - ExclusiveCPUs.
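A minimal sketch of that shared-pool calculation, using plain integer sets rather than the kubelet's cpuset package; the CPU IDs here are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// difference returns the CPUs in a that are not in b.
func difference(a, b []int) []int {
	drop := map[int]bool{}
	for _, c := range b {
		drop[c] = true
	}
	out := []int{}
	for _, c := range a {
		if !drop[c] {
			out = append(out, c)
		}
	}
	sort.Ints(out)
	return out
}

func main() {
	capacity := []int{0, 1, 2, 3, 4, 5, 6, 7} // all logical CPUs on the node
	reserved := []int{0}                      // ceil(kube-reserved + system-reserved)
	exclusive := []int{2, 3}                  // CPUs pinned to Guaranteed, integer-request containers

	shared := difference(difference(capacity, reserved), exclusive)
	fmt.Println(shared) // [1 4 5 6 7] — the pool used by all other containers via CFS
}
```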
When the CPU Manager assigns CPUs to a qualifying container, it tries to follow the CPU topology, i.e. it considers CPU affinity and selects CPUs in the following order of preference (a logical CPU is a hyperthread):
If the number of logical CPUs requested by the container is no less than the number of logical CPUs in a single CPU socket, whole sockets' worth of logical CPUs are assigned to the container first.
If the number of logical CPUs the container still needs is no less than the number of logical CPUs provided by a single physical CPU core, whole physical cores' worth of logical CPUs are assigned next.
The container's remaining requested logical CPUs are then picked from a list of the remaining free logical CPUs, sorted as in the code below:
```go
// pkg/kubelet/cm/cpumanager/cpu_assignment.go:149
func takeByTopology(topo *topology.CPUTopology, availableCPUs cpuset.CPUSet, numCPUs int) (cpuset.CPUSet, error) {
	acc := newCPUAccumulator(topo, availableCPUs, numCPUs)
	if acc.isSatisfied() {
		return acc.result, nil
	}
	if acc.isFailed() {
		return cpuset.NewCPUSet(), fmt.Errorf("not enough cpus available to satisfy request")
	}

	// Algorithm: topology-aware best-fit
	// 1. Acquire whole sockets, if available and the container requires at
	//    least a socket's-worth of CPUs.
	for _, s := range acc.freeSockets() {
		if acc.needs(acc.topo.CPUsPerSocket()) {
			glog.V(4).Infof("[cpumanager] takeByTopology: claiming socket [%d]", s)
			acc.take(acc.details.CPUsInSocket(s))
			if acc.isSatisfied() {
				return acc.result, nil
			}
		}
	}

	// 2. Acquire whole cores, if available and the container requires at least
	//    a core's-worth of CPUs.
	for _, c := range acc.freeCores() {
		if acc.needs(acc.topo.CPUsPerCore()) {
			glog.V(4).Infof("[cpumanager] takeByTopology: claiming core [%d]", c)
			acc.take(acc.details.CPUsInCore(c))
			if acc.isSatisfied() {
				return acc.result, nil
			}
		}
	}

	// 3. Acquire single threads, preferring to fill partially-allocated cores
	//    on the same sockets as the whole cores we have already taken in this
	//    allocation.
	for _, c := range acc.freeCPUs() {
		glog.V(4).Infof("[cpumanager] takeByTopology: claiming CPU [%d]", c)
		if acc.needs(1) {
			acc.take(cpuset.NewCPUSet(c))
		}
		if acc.isSatisfied() {
			return acc.result, nil
		}
	}

	return cpuset.NewCPUSet(), fmt.Errorf("failed to allocate cpus")
}
```
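To make the preference order concrete, here is a small self-contained sketch (not kubelet code; the topology and request size are hypothetical) that applies the same socket → core → thread best-fit to a 2-socket machine with 4 hyperthreaded cores per socket, for a container requesting 11 logical CPUs:

```go
package main

import "fmt"

func main() {
	// Hypothetical 2-socket machine, 4 hyperthreaded cores per socket (16 logical CPUs).
	// Each inner slice is one physical core's two hyperthread IDs.
	sockets := [][][]int{
		{{0, 8}, {1, 9}, {2, 10}, {3, 11}},   // socket 0
		{{4, 12}, {5, 13}, {6, 14}, {7, 15}}, // socket 1
	}
	const cpusPerSocket, cpusPerCore = 8, 2

	need, picked := 11, []int{}
	takenCores := map[int]bool{} // keyed by a core's first hyperthread ID

	takeCore := func(core []int, n int) {
		takenCores[core[0]] = true
		picked = append(picked, core[:n]...)
		need -= n
	}

	// Pass 1: whole sockets, while the remaining request still covers a full socket.
	for _, socket := range sockets {
		if need >= cpusPerSocket {
			for _, core := range socket {
				takeCore(core, cpusPerCore)
			}
		}
	}
	// Pass 2: whole cores, while the remaining request still covers a full core.
	// Pass 3: single hyperthreads for whatever is left.
	for _, socket := range sockets {
		for _, core := range socket {
			if takenCores[core[0]] || need == 0 {
				continue
			}
			if need >= cpusPerCore {
				takeCore(core, cpusPerCore) // pass 2
			} else {
				takeCore(core, 1) // pass 3
			}
		}
	}
	fmt.Println(picked) // [0 8 1 9 2 10 3 11 4 12 5]: all of socket 0, one full core plus one thread from socket 1
}
```

As the comments in takeByTopology note, the real implementation additionally orders the remaining free CPUs so that partially allocated cores on already-used sockets are filled first; the sketch above omits that sorting.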
A prerequisite for the CPU Manager to work correctly is discovering the node's CPU topology, and that discovery is done by cAdvisor. cAdvisor's MachineInfo records the CPU and memory topology in its Topology field, where each Node object corresponds to one CPU socket.
```go
// vendor/github.com/google/cadvisor/info/v1/machine.go
type MachineInfo struct {
	// The number of cores in this machine.
	NumCores int `json:"num_cores"`
	...
	// Machine Topology
	// Describes cpu/memory layout and hierarchy.
	Topology []Node `json:"topology"`
	...
}

type Node struct {
	Id int `json:"node_id"`
	// Per-node memory
	Memory uint64  `json:"memory"`
	Cores  []Core  `json:"cores"`
	Caches []Cache `json:"caches"`
}
```
cAdvisor builds this information in GetTopology, mainly by parsing /proc/cpuinfo for the CPU topology and by reading /sys/devices/system/cpu/cpu for the CPU cache information.
```go
// vendor/github.com/google/cadvisor/machine/machine.go
func GetTopology(sysFs sysfs.SysFs, cpuinfo string) ([]info.Node, int, error) {
	nodes := []info.Node{}
	...
	return nodes, numCores, nil
}
```
A typical NUMA CPU topology is illustrated below:
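The following is a minimal sketch of such a topology, using the same hypothetical 2-socket machine as the allocation sketch earlier (4 hyperthreaded cores per socket, 16 logical CPUs); the ID numbering mirrors a common Linux enumeration but will vary between platforms:

```go
package main

import "fmt"

// core pairs a physical core ID with its hyperthread (logical CPU) IDs.
type core struct {
	id      int
	threads []int
}

func main() {
	// Hypothetical NUMA layout: each socket is one NUMA node.
	topology := map[int][]core{
		0: {{0, []int{0, 8}}, {1, []int{1, 9}}, {2, []int{2, 10}}, {3, []int{3, 11}}},
		1: {{4, []int{4, 12}}, {5, []int{5, 13}}, {6, []int{6, 14}}, {7, []int{7, 15}}},
	}
	for socket := 0; socket < len(topology); socket++ {
		fmt.Printf("Socket/NUMA node %d:\n", socket)
		for _, c := range topology[socket] {
			fmt.Printf("  physical core %d -> logical CPUs (hyperthreads) %v\n", c.id, c.threads)
		}
	}
}
```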
When a container that satisfies the static-policy conditions described above is created, the kubelet picks the best CPU set for it according to the CPU-affinity rules. The CPU Manager workflow at container creation is roughly as follows:
```go
func (m *manager) AddContainer(p *v1.Pod, c *v1.Container, containerID string) error {
	m.Lock()
	err := m.policy.AddContainer(m.state, p, c, containerID)
	if err != nil {
		glog.Errorf("[cpumanager] AddContainer error: %v", err)
		m.Unlock()
		return err
	}
	cpus := m.state.GetCPUSetOrDefault(containerID)
	m.Unlock()

	if !cpus.IsEmpty() {
		err = m.updateContainerCPUSet(containerID, cpus)
		if err != nil {
			glog.Errorf("[cpumanager] AddContainer error: %v", err)
			return err
		}
	} else {
		glog.V(5).Infof("[cpumanager] update container resources is skipped due to cpu set is empty")
	}

	return nil
}
```
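updateContainerCPUSet is not shown in the excerpt above. The following is a sketch of what it plausibly does, assuming the CRI UpdateContainerResources call that the kubelet uses to rewrite a container's cpuset cgroup; treat the exact wiring as an assumption rather than verbatim kubelet source:

```go
// Sketch (not copied verbatim from kubelet): updateContainerCPUSet hands the
// assignment to the container runtime over CRI, which rewrites the container's
// cpuset cgroup.
func (m *manager) updateContainerCPUSet(containerID string, cpus cpuset.CPUSet) error {
	return m.containerRuntime.UpdateContainerResources(
		containerID,
		&runtimeapi.LinuxContainerResources{
			// e.g. "2-3" — the exclusively assigned CPUs, or the shared pool.
			CpusetCpus: cpus.String(),
		})
}
```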
When a container that was assigned CPUs by the CPU Manager is deleted, the CPU Manager workflow is roughly as follows:
```go
func (m *manager) RemoveContainer(containerID string) error {
	m.Lock()
	defer m.Unlock()

	err := m.policy.RemoveContainer(m.state, containerID)
	if err != nil {
		glog.Errorf("[cpumanager] RemoveContainer error: %v", err)
		return err
	}
	return nil
}
```
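Under the static policy, policy.RemoveContainer is expected to release the container's exclusive CPUs back into the shared (default) pool. A simplified sketch under that assumption, reusing the state accessors that appear in the validateState excerpt later in this article:

```go
// Simplified sketch of the static policy's RemoveContainer: the container's
// entry is dropped from the state and its CPUs are merged back into the
// default (shared) CPU set, so subsequent reconciles widen other containers'
// cpusets again.
func (p *staticPolicy) RemoveContainer(s state.State, containerID string) error {
	if toRelease, ok := s.GetCPUSet(containerID); ok {
		s.Delete(containerID)
		s.SetDefaultCPUSet(s.GetDefaultCPUSet().Union(toRelease))
	}
	return nil
}
```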
What should we do if this state file is corrupted or deleted?
Note: CPU Manager doesn’t support offlining and onlining of CPUs at runtime. Also, if the set of online CPUs changes on the node, the node must be drained and CPU manager manually reset by deleting the state file cpu_manager_state in the kubelet root directory.
The CPU Manager is created as part of creating the Container Manager. Let's look at what happens when the CPU Manager is created, which also tells us what the CPU Manager does when the kubelet restarts.
```go
// NewManager creates new cpu manager based on provided policy
func NewManager(cpuPolicyName string, reconcilePeriod time.Duration, machineInfo *cadvisorapi.MachineInfo, nodeAllocatableReservation v1.ResourceList, stateFileDirectory string) (Manager, error) {
	var policy Policy

	switch policyName(cpuPolicyName) {

	case PolicyNone:
		policy = NewNonePolicy()

	case PolicyStatic:
		topo, err := topology.Discover(machineInfo)
		if err != nil {
			return nil, err
		}
		glog.Infof("[cpumanager] detected CPU topology: %v", topo)
		reservedCPUs, ok := nodeAllocatableReservation[v1.ResourceCPU]
		if !ok {
			// The static policy cannot initialize without this information.
			return nil, fmt.Errorf("[cpumanager] unable to determine reserved CPU resources for static policy")
		}
		if reservedCPUs.IsZero() {
			// The static policy requires this to be nonzero. Zero CPU reservation
			// would allow the shared pool to be completely exhausted. At that point
			// either we would violate our guarantee of exclusivity or need to evict
			// any pod that has at least one container that requires zero CPUs.
			// See the comments in policy_static.go for more details.
			return nil, fmt.Errorf("[cpumanager] the static policy requires systemreserved.cpu + kubereserved.cpu to be greater than zero")
		}

		// Take the ceiling of the reservation, since fractional CPUs cannot be
		// exclusively allocated.
		reservedCPUsFloat := float64(reservedCPUs.MilliValue()) / 1000
		numReservedCPUs := int(math.Ceil(reservedCPUsFloat))
		policy = NewStaticPolicy(topo, numReservedCPUs)

	default:
		glog.Errorf("[cpumanager] Unknown policy \"%s\", falling back to default policy \"%s\"", cpuPolicyName, PolicyNone)
		policy = NewNonePolicy()
	}

	stateImpl, err := state.NewCheckpointState(stateFileDirectory, cpuManagerStateFileName, policy.Name())
	if err != nil {
		return nil, fmt.Errorf("could not initialize checkpoint manager: %v", err)
	}

	manager := &manager{
		policy:                     policy,
		reconcilePeriod:            reconcilePeriod,
		state:                      stateImpl,
		machineInfo:                machineInfo,
		nodeAllocatableReservation: nodeAllocatableReservation,
	}
	return manager, nil
}
```
state.NewCheckpointState creates the cpu_manager_state checkpoint file (if it already exists, it is not cleared), initializes the in-memory state, and restores the checkpoint contents into that memory state. The cpu_manager_state checkpoint file is simply the JSON serialization of the CPUManagerCheckpoint struct, where each key in Entries is a container ID and the value is that container's assigned CPU set.
```go
// CPUManagerCheckpoint struct is used to store cpu/pod assignments in a checkpoint
type CPUManagerCheckpoint struct {
	PolicyName    string            `json:"policyName"`
	DefaultCPUSet string            `json:"defaultCpuSet"`
	Entries       map[string]string `json:"entries,omitempty"`
	Checksum      checksum.Checksum `json:"checksum"`
}
```
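For illustration, here is a minimal sketch that decodes a hypothetical cpu_manager_state file; the container ID, CPU ranges, and checksum value are made up, and the mirror struct simply repeats the fields shown above:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkpoint mirrors the CPUManagerCheckpoint fields shown above, for illustration only.
type checkpoint struct {
	PolicyName    string            `json:"policyName"`
	DefaultCPUSet string            `json:"defaultCpuSet"`
	Entries       map[string]string `json:"entries,omitempty"`
	Checksum      uint64            `json:"checksum"`
}

func main() {
	// Hypothetical contents of the cpu_manager_state file in the kubelet root directory.
	raw := `{"policyName":"static","defaultCpuSet":"0-1,4-7","entries":{"3e1f0aa6...":"2-3"},"checksum":1337}`

	var cp checkpoint
	if err := json.Unmarshal([]byte(raw), &cp); err != nil {
		panic(err)
	}
	fmt.Println("shared pool:", cp.DefaultCPUSet) // 0-1,4-7
	for id, cpus := range cp.Entries {
		fmt.Printf("container %s -> exclusive CPUs %s\n", id, cpus)
	}
}
```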
Next comes the CPU Manager's startup:
```go
func (m *manager) Start(activePods ActivePodsFunc, podStatusProvider status.PodStatusProvider, containerRuntime runtimeService) {
	glog.Infof("[cpumanager] starting with %s policy", m.policy.Name())
	glog.Infof("[cpumanager] reconciling every %v", m.reconcilePeriod)

	m.activePods = activePods
	m.podStatusProvider = podStatusProvider
	m.containerRuntime = containerRuntime

	m.policy.Start(m.state)
	if m.policy.Name() == string(PolicyNone) {
		return
	}
	go wait.Until(func() { m.reconcileState() }, m.reconcilePeriod, wait.NeverStop)
}
```
The CPU Manager's reconcile loop runs at the interval configured by --cpu-manager-reconcile-period and mainly does the following: for every container (including init containers) of every active pod, it looks up the pod status and resolves the container ID; if a container is missing from the state but its pod is Running and has no DeletionTimestamp, it is added via AddContainer, which records the assignment in the memory state (and therefore in cpu_manager_state), and the flow continues; finally, the container's assigned CPU set (or the default/shared set) is written to its cpuset cgroup via updateContainerCPUSet.

```go
// pkg/kubelet/cm/cpumanager/cpu_manager.go:219
func (m *manager) reconcileState() (success []reconciledContainer, failure []reconciledContainer) {
	success = []reconciledContainer{}
	failure = []reconciledContainer{}

	for _, pod := range m.activePods() {
		allContainers := pod.Spec.InitContainers
		allContainers = append(allContainers, pod.Spec.Containers...)
		for _, container := range allContainers {
			status, ok := m.podStatusProvider.GetPodStatus(pod.UID)
			if !ok {
				glog.Warningf("[cpumanager] reconcileState: skipping pod; status not found (pod: %s, container: %s)", pod.Name, container.Name)
				failure = append(failure, reconciledContainer{pod.Name, container.Name, ""})
				break
			}

			containerID, err := findContainerIDByName(&status, container.Name)
			if err != nil {
				glog.Warningf("[cpumanager] reconcileState: skipping container; ID not found in status (pod: %s, container: %s, error: %v)", pod.Name, container.Name, err)
				failure = append(failure, reconciledContainer{pod.Name, container.Name, ""})
				continue
			}

			// Check whether container is present in state, there may be 3 reasons why it's not present:
			// - policy does not want to track the container
			// - kubelet has just been restarted - and there is no previous state file
			// - container has been removed from state by RemoveContainer call (DeletionTimestamp is set)
			if _, ok := m.state.GetCPUSet(containerID); !ok {
				if status.Phase == v1.PodRunning && pod.DeletionTimestamp == nil {
					glog.V(4).Infof("[cpumanager] reconcileState: container is not present in state - trying to add (pod: %s, container: %s, container id: %s)", pod.Name, container.Name, containerID)
					err := m.AddContainer(pod, &container, containerID)
					if err != nil {
						glog.Errorf("[cpumanager] reconcileState: failed to add container (pod: %s, container: %s, container id: %s, error: %v)", pod.Name, container.Name, containerID, err)
						failure = append(failure, reconciledContainer{pod.Name, container.Name, containerID})
						continue
					}
				} else {
					// if DeletionTimestamp is set, pod has already been removed from state
					// skip the pod/container since it's not running and will be deleted soon
					continue
				}
			}

			cset := m.state.GetCPUSetOrDefault(containerID)
			if cset.IsEmpty() {
				// NOTE: This should not happen outside of tests.
				glog.Infof("[cpumanager] reconcileState: skipping container; assigned cpuset is empty (pod: %s, container: %s)", pod.Name, container.Name)
				failure = append(failure, reconciledContainer{pod.Name, container.Name, containerID})
				continue
			}

			glog.V(4).Infof("[cpumanager] reconcileState: updating container (pod: %s, container: %s, container id: %s, cpuset: \"%v\")", pod.Name, container.Name, containerID, cset)
			err = m.updateContainerCPUSet(containerID, cset)
			if err != nil {
				glog.Errorf("[cpumanager] reconcileState: failed to update container (pod: %s, container: %s, container id: %s, cpuset: \"%v\", error: %v)", pod.Name, container.Name, containerID, cset, err)
				failure = append(failure, reconciledContainer{pod.Name, container.Name, containerID})
				continue
			}
			success = append(success, reconciledContainer{pod.Name, container.Name, containerID})
		}
	}
	return success, failure
}
```
At startup, besides launching a goroutine to run the reconcile loop, the CPU Manager also validates the state:
If the shared (default) CPU set in the memory state is empty, the CPU assignments must also be empty; the shared pool in the memory state is then initialized to all CPUs and written to the checkpoint file (initializing the checkpoint).
As long as we have not manually deleted the checkpoint file, state.NewCheckpointState (mentioned earlier) restores the memory state from it, so the previously assigned CPU sets and the default CPU set are still there.
Once the memory state is detected as successfully initialized (restored from the checkpoint), the CPU Manager checks whether the reserved CPU set for this startup is fully contained in the default CPU set. If it is not (for example because kube/system reserved CPUs were increased), it returns an error, because it means some CPUs in the reserved set have already been assigned to containers, which could cause those containers to fail to start; in that case the user has to fix the checkpoint file manually.
After the reserved CPU check passes, it checks whether the default CPU set and the assigned CPU sets intersect. If they do, the data restored from the checkpoint into the memory state is wrong, and an error is returned.
Finally, it checks whether all CPUs in the CPU topology obtained from cAdvisor at this startup match all CPUs recorded in the memory state (default CPU set + assigned CPU sets, restored from the checkpoint). If they differ, an error is returned; this can happen if the node's available CPUs changed between the previous CPU Manager shutdown and this startup.
```go
// pkg/kubelet/cm/cpumanager/policy_static.go:116
func (p *staticPolicy) validateState(s state.State) error {
	tmpAssignments := s.GetCPUAssignments()
	tmpDefaultCPUset := s.GetDefaultCPUSet()

	// Default cpuset cannot be empty when assignments exist
	if tmpDefaultCPUset.IsEmpty() {
		if len(tmpAssignments) != 0 {
			return fmt.Errorf("default cpuset cannot be empty")
		}
		// state is empty initialize
		allCPUs := p.topology.CPUDetails.CPUs()
		s.SetDefaultCPUSet(allCPUs)
		return nil
	}

	// State has already been initialized from file (is not empty)
	// 1. Check if the reserved cpuset is not part of default cpuset because:
	// - kube/system reserved have changed (increased) - may lead to some containers not being able to start
	// - user tampered with file
	if !p.reserved.Intersection(tmpDefaultCPUset).Equals(p.reserved) {
		return fmt.Errorf("not all reserved cpus: \"%s\" are present in defaultCpuSet: \"%s\"",
			p.reserved.String(), tmpDefaultCPUset.String())
	}

	// 2. Check if state for static policy is consistent
	for cID, cset := range tmpAssignments {
		// None of the cpu in DEFAULT cset should be in s.assignments
		if !tmpDefaultCPUset.Intersection(cset).IsEmpty() {
			return fmt.Errorf("container id: %s cpuset: \"%s\" overlaps with default cpuset \"%s\"",
				cID, cset.String(), tmpDefaultCPUset.String())
		}
	}

	// 3. It's possible that the set of available CPUs has changed since
	// the state was written. This can be due to for example
	// offlining a CPU when kubelet is not running. If this happens,
	// CPU manager will run into trouble when later it tries to
	// assign non-existent CPUs to containers. Validate that the
	// topology that was received during CPU manager startup matches with
	// the set of CPUs stored in the state.
	totalKnownCPUs := tmpDefaultCPUset.Clone()
	for _, cset := range tmpAssignments {
		totalKnownCPUs = totalKnownCPUs.Union(cset)
	}
	if !totalKnownCPUs.Equals(p.topology.CPUDetails.CPUs()) {
		return fmt.Errorf("current set of available CPUs \"%s\" doesn't match with CPUs in state \"%s\"",
			p.topology.CPUDetails.CPUs().String(), totalKnownCPUs.String())
	}

	return nil
}
```
When a static-policy container is added, besides picking the best CPU set for itself, those CPUs are also removed from the shared-pool CPU set. So in this situation, tasks already running on those CPUs keep executing until the CPU scheduler next schedules them; once the cpuset cgroups take effect, those tasks will no longer see the CPUs that were taken away.
From the workflow analysed earlier, after a static-policy container has been assigned CPUs, the reconcile loop (every 10s by default) propagates the assignments in the memory state to the cpuset cgroups, so in the worst case the static-policy container will share those CPUs with non-static-policy containers for up to 10s.
From this analysis of the CPU Manager we know that reconcile cannot repair this discrepancy by itself. It can be fixed in the following ways:
Method 1: regenerate the checkpoint file. Delete the checkpoint file and restart the kubelet; the CPU Manager's reconcile mechanism will walk all containers, re-assign CPUs to those that satisfy the static-policy conditions, and update the cpuset cgroups. This may cause running containers to be re-assigned to a different CPU set and therefore suffer brief application jitter.
Method 2: drain the node so that the Pods are evicted and rescheduled onto nodes with a healthy checkpoint, then clear or delete the checkpoint file. This also has some impact on the applications, since the Pods have to be recreated on other nodes.
The CPU Manager is also not yet compatible with the isolcpus Linux kernel boot parameter: it would need to obtain the isolated CPUs configured via isolcpus (through cAdvisor or by reading them directly) and exclude them when assigning CPUs to static-policy containers.
Dynamic allocation is not supported yet either, i.e. changing a container's cpuset cgroups while it is running.
Through this deep analysis of the kubelet CPU Manager, we now have a thorough understanding of how it works, including its reconcile loop, the state validation at startup, the checkpoint mechanism and how to repair it, and the CPU Manager's current shortcomings.