Author: xidianwangtao@gmail.com
Abstract: The Kubernetes CPU Manager is deployed at scale in our production environment, so we need a deep understanding of it to operate it with confidence. This article covers the CPU Manager's use cases, how to use it, how it works, the problems that can arise, and how to fix them. I hope it is helpful.
Users familiar with Docker have almost certainly used its cpuset capability, which binds a container to specific CPUs and memory nodes at startup:
```
--cpuset-cpus=""   CPUs in which to allow execution (0-3, 0,1)
--cpuset-mems=""   Memory nodes (MEMs) in which to allow execution (0-3, 0,1). Only effective on NUMA systems.
```
Kubernetes, however, did not expose an equivalent capability until v1.8, when the CPU Manager feature was introduced to support cpuset. From Kubernetes 1.10 up to the current 1.12, the feature is still Beta.
CPU Manager is a module inside the kubelet's Container Manager (CM). Its goal is to bind certain containers to dedicated CPUs, thereby improving the performance of CPU-sensitive workloads.
As mentioned above, CPU-sensitive workloads can gain a significant performance boost from using cpuset. So what characteristics make a workload CPU-sensitive?
The Kubernetes blog post "Feature Highlight: CPU Manager" also lists some concrete sample comparisons, which are worth a read if you are interested. Many of our company's applications are of this type, and cpuset has the added benefit of making CPU resource accounting more convenient. Of course, this will almost certainly lower overall cluster CPU utilization somewhat, so it comes down to whether application performance is your top priority.
In Kubernetes v1.8–v1.9 the CPU Manager was Alpha; in v1.10–v1.12 it is Beta. I have not followed the CPU Manager changelogs for these releases, but I still recommend using it on 1.10 or later.
Make sure the kubelet's CPUManager feature gate is set to true (BETA - default=true).
Currently the CPU Manager supports two policies, none and static, set via the kubelet flag --cpu-manager-policy. A dynamic policy, which would adjust a container's cpuset during its lifetime, may be added in the future.
none: the default value for the CPU Manager, equivalent to not enabling cpuset at all. The CPU request maps to cpu shares and the CPU limit maps to CFS quota (a sketch of this mapping follows this configuration list).
static: currently enabled by setting --cpu-manager-policy=static. The kubelet assigns a dedicated CPU set to the container before it starts, and the assignment also takes CPU topology into account to improve CPU affinity, as described later.
Make sure the kubelet has values configured for both --kube-reserved and --system-reserved. They do not have to be an integer number of CPUs; the reserved CPU count is rounded up when it is computed. The purpose is to prevent the CPU Manager from handing out all of the node's CPU cores, which would leave the kubelet and system processes with no usable CPU.
Note that the CPU Manager also has a --cpu-manager-reconcile-period option, which configures how often the CPU Manager reconciles the CPU assignments held in kubelet memory into the cpuset cgroups. If it is not set, the value of --node-status-update-frequency (default 10s) is used.
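For reference, here is a minimal sketch of how a CPU request and limit translate into cpu.shares and CFS quota under the none policy. The conversion constants follow the standard cgroup formulas the kubelet uses (shares = milliCPU × 1024 / 1000, quota = milliCPU × period / 1000 with a 100ms period); the helper names and the example values are mine.

```go
package main

import "fmt"

const (
	sharesPerCPU  = 1024   // cpu.shares granted per full CPU
	milliCPUToCPU = 1000   // milliCPU per CPU
	quotaPeriod   = 100000 // CFS period in microseconds (100ms)
)

// milliCPUToShares converts a CPU request in milliCPU to cpu.shares.
func milliCPUToShares(milliCPU int64) int64 {
	return milliCPU * sharesPerCPU / milliCPUToCPU
}

// milliCPUToQuota converts a CPU limit in milliCPU to a CFS quota in µs per period.
func milliCPUToQuota(milliCPU int64) int64 {
	return milliCPU * quotaPeriod / milliCPUToCPU
}

func main() {
	// Example: requests.cpu=500m, limits.cpu=2
	fmt.Println(milliCPUToShares(500)) // 512 shares
	fmt.Println(milliCPUToQuota(2000)) // 200000µs of CPU time per 100000µs period
}
```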
Once the above is configured, the static CPU Manager is enabled; the next step is to use it in a workload. Kubernetes requires that a Pod/Container using the CPU Manager meet the following two conditions: the Pod's QoS class must be Guaranteed, and the container's CPU request (equal to its limit) must be an integer number of cores. For example:
```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "200Mi"
        cpu: "2"
```
For containers in any other situation, the CPU Manager does not assign dedicated CPUs; they run under CFS on the CPUs in the shared pool. The shared pool is the node's CPUCapacity - ReservedCPUs - ExclusiveCPUs.
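A minimal sketch of that shared-pool calculation, using plain integer sets rather than the kubelet's cpuset package; the CPU IDs here are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// difference returns the CPUs in a that are not in b.
func difference(a, b []int) []int {
	drop := map[int]bool{}
	for _, c := range b {
		drop[c] = true
	}
	out := []int{}
	for _, c := range a {
		if !drop[c] {
			out = append(out, c)
		}
	}
	sort.Ints(out)
	return out
}

func main() {
	capacity := []int{0, 1, 2, 3, 4, 5, 6, 7} // all logical CPUs on the node
	reserved := []int{0}                      // ceil(kube-reserved + system-reserved)
	exclusive := []int{2, 3}                  // CPUs pinned to Guaranteed, integer-request containers

	shared := difference(difference(capacity, reserved), exclusive)
	fmt.Println(shared) // [1 4 5 6 7] — the pool used by all other containers via CFS
}
```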
When the CPU Manager assigns CPUs to a qualifying container, it tries to follow the CPU topology, i.e. it considers CPU affinity and selects CPUs in the following order of preference (a logical CPU is a hyperthread):
If the number of logical CPUs requested by the container is no less than the number of logical CPUs in a single CPU socket, whole sockets' worth of logical CPUs are assigned to the container first.
If the number of logical CPUs the container still needs is no less than the number of logical CPUs provided by a single physical CPU core, whole physical cores' worth of logical CPUs are assigned next.
The container's remaining requested logical CPUs are then picked from a list of the remaining free logical CPUs, sorted as in the code below:
```go
// pkg/kubelet/cm/cpumanager/cpu_assignment.go:149
func takeByTopology(topo *topology.CPUTopology, availableCPUs cpuset.CPUSet, numCPUs int) (cpuset.CPUSet, error) {
	acc := newCPUAccumulator(topo, availableCPUs, numCPUs)
	if acc.isSatisfied() {
		return acc.result, nil
	}
	if acc.isFailed() {
		return cpuset.NewCPUSet(), fmt.Errorf("not enough cpus available to satisfy request")
	}

	// Algorithm: topology-aware best-fit
	// 1. Acquire whole sockets, if available and the container requires at
	//    least a socket's-worth of CPUs.
	for _, s := range acc.freeSockets() {
		if acc.needs(acc.topo.CPUsPerSocket()) {
			glog.V(4).Infof("[cpumanager] takeByTopology: claiming socket [%d]", s)
			acc.take(acc.details.CPUsInSocket(s))
			if acc.isSatisfied() {
				return acc.result, nil
			}
		}
	}

	// 2. Acquire whole cores, if available and the container requires at least
	//    a core's-worth of CPUs.
	for _, c := range acc.freeCores() {
		if acc.needs(acc.topo.CPUsPerCore()) {
			glog.V(4).Infof("[cpumanager] takeByTopology: claiming core [%d]", c)
			acc.take(acc.details.CPUsInCore(c))
			if acc.isSatisfied() {
				return acc.result, nil
			}
		}
	}

	// 3. Acquire single threads, preferring to fill partially-allocated cores
	//    on the same sockets as the whole cores we have already taken in this
	//    allocation.
	for _, c := range acc.freeCPUs() {
		glog.V(4).Infof("[cpumanager] takeByTopology: claiming CPU [%d]", c)
		if acc.needs(1) {
			acc.take(cpuset.NewCPUSet(c))
		}
		if acc.isSatisfied() {
			return acc.result, nil
		}
	}

	return cpuset.NewCPUSet(), fmt.Errorf("failed to allocate cpus")
}
```
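To make the preference order concrete, here is a small self-contained sketch (not kubelet code; the topology and request size are hypothetical) that applies the same socket → core → thread best-fit to a 2-socket machine with 4 hyperthreaded cores per socket, for a container requesting 11 logical CPUs:

```go
package main

import "fmt"

func main() {
	// Hypothetical 2-socket machine, 4 hyperthreaded cores per socket (16 logical CPUs).
	// Each inner slice is one physical core's two hyperthread IDs.
	sockets := [][][]int{
		{{0, 8}, {1, 9}, {2, 10}, {3, 11}},   // socket 0
		{{4, 12}, {5, 13}, {6, 14}, {7, 15}}, // socket 1
	}
	const cpusPerSocket, cpusPerCore = 8, 2

	need, picked := 11, []int{}
	takenCores := map[int]bool{} // keyed by a core's first hyperthread ID

	takeCore := func(core []int, n int) {
		takenCores[core[0]] = true
		picked = append(picked, core[:n]...)
		need -= n
	}

	// Pass 1: whole sockets, while the remaining request still covers a full socket.
	for _, socket := range sockets {
		if need >= cpusPerSocket {
			for _, core := range socket {
				takeCore(core, cpusPerCore)
			}
		}
	}
	// Pass 2: whole cores, while the remaining request still covers a full core.
	// Pass 3: single hyperthreads for whatever is left.
	for _, socket := range sockets {
		for _, core := range socket {
			if takenCores[core[0]] || need == 0 {
				continue
			}
			if need >= cpusPerCore {
				takeCore(core, cpusPerCore) // pass 2
			} else {
				takeCore(core, 1) // pass 3
			}
		}
	}
	fmt.Println(picked) // [0 8 1 9 2 10 3 11 4 12 5]: all of socket 0, one full core plus one thread from socket 1
}
```

As the comments in takeByTopology note, the real implementation additionally orders the remaining free CPUs so that partially allocated cores on already-used sockets are filled first; the sketch above omits that sorting.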
A prerequisite for the CPU Manager to work correctly is discovering the node's CPU topology, and that discovery is done by cAdvisor. cAdvisor's MachineInfo records the CPU and memory topology in its Topology field, where each Node object corresponds to one CPU socket.
```go
// vendor/github.com/google/cadvisor/info/v1/machine.go
type MachineInfo struct {
	// The number of cores in this machine.
	NumCores int `json:"num_cores"`
	...
	// Machine Topology
	// Describes cpu/memory layout and hierarchy.
	Topology []Node `json:"topology"`
	...
}

type Node struct {
	Id int `json:"node_id"`
	// Per-node memory
	Memory uint64  `json:"memory"`
	Cores  []Core  `json:"cores"`
	Caches []Cache `json:"caches"`
}
```
cAdvisor builds this information in GetTopology, mainly by parsing /proc/cpuinfo for the CPU topology and by reading /sys/devices/system/cpu/cpu for the CPU cache information.
```go
// vendor/github.com/google/cadvisor/machine/machine.go
func GetTopology(sysFs sysfs.SysFs, cpuinfo string) ([]info.Node, int, error) {
	nodes := []info.Node{}
	...
	return nodes, numCores, nil
}
```
A typical NUMA CPU topology is illustrated below:
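The following is a minimal sketch of such a topology, using the same hypothetical 2-socket machine as the allocation sketch earlier (4 hyperthreaded cores per socket, 16 logical CPUs); the ID numbering mirrors a common Linux enumeration but will vary between platforms:

```go
package main

import "fmt"

// core pairs a physical core ID with its hyperthread (logical CPU) IDs.
type core struct {
	id      int
	threads []int
}

func main() {
	// Hypothetical NUMA layout: each socket is one NUMA node.
	topology := map[int][]core{
		0: {{0, []int{0, 8}}, {1, []int{1, 9}}, {2, []int{2, 10}}, {3, []int{3, 11}}},
		1: {{4, []int{4, 12}}, {5, []int{5, 13}}, {6, []int{6, 14}}, {7, []int{7, 15}}},
	}
	for socket := 0; socket < len(topology); socket++ {
		fmt.Printf("Socket/NUMA node %d:\n", socket)
		for _, c := range topology[socket] {
			fmt.Printf("  physical core %d -> logical CPUs (hyperthreads) %v\n", c.id, c.threads)
		}
	}
}
```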
When a container that satisfies the static-policy conditions described above is created, the kubelet picks the best CPU set for it according to the CPU-affinity rules. The CPU Manager workflow at container creation is roughly as follows:
```go
func (m *manager) AddContainer(p *v1.Pod, c *v1.Container, containerID string) error {
	m.Lock()
	err := m.policy.AddContainer(m.state, p, c, containerID)
	if err != nil {
		glog.Errorf("[cpumanager] AddContainer error: %v", err)
		m.Unlock()
		return err
	}
	cpus := m.state.GetCPUSetOrDefault(containerID)
	m.Unlock()

	if !cpus.IsEmpty() {
		err = m.updateContainerCPUSet(containerID, cpus)
		if err != nil {
			glog.Errorf("[cpumanager] AddContainer error: %v", err)
			return err
		}
	} else {
		glog.V(5).Infof("[cpumanager] update container resources is skipped due to cpu set is empty")
	}

	return nil
}
```
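updateContainerCPUSet is not shown in the excerpt above. The following is a sketch of what it plausibly does, assuming the CRI UpdateContainerResources call that the kubelet uses to rewrite a container's cpuset cgroup; treat the exact wiring as an assumption rather than verbatim kubelet source:

```go
// Sketch (not copied verbatim from kubelet): updateContainerCPUSet hands the
// assignment to the container runtime over CRI, which rewrites the container's
// cpuset cgroup.
func (m *manager) updateContainerCPUSet(containerID string, cpus cpuset.CPUSet) error {
	return m.containerRuntime.UpdateContainerResources(
		containerID,
		&runtimeapi.LinuxContainerResources{
			// e.g. "2-3" — the exclusively assigned CPUs, or the shared pool.
			CpusetCpus: cpus.String(),
		})
}
```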
When a container that was assigned CPUs by the CPU Manager is deleted, the CPU Manager workflow is roughly as follows:
```go
func (m *manager) RemoveContainer(containerID string) error {
	m.Lock()
	defer m.Unlock()

	err := m.policy.RemoveContainer(m.state, containerID)
	if err != nil {
		glog.Errorf("[cpumanager] RemoveContainer error: %v", err)
		return err
	}
	return nil
}
```
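Under the static policy, policy.RemoveContainer is expected to release the container's exclusive CPUs back into the shared (default) pool. A simplified sketch under that assumption, reusing the state accessors that appear in the validateState excerpt later in this article:

```go
// Simplified sketch of the static policy's RemoveContainer: the container's
// entry is dropped from the state and its CPUs are merged back into the
// default (shared) CPU set, so subsequent reconciles widen other containers'
// cpusets again.
func (p *staticPolicy) RemoveContainer(s state.State, containerID string) error {
	if toRelease, ok := s.GetCPUSet(containerID); ok {
		s.Delete(containerID)
		s.SetDefaultCPUSet(s.GetDefaultCPUSet().Union(toRelease))
	}
	return nil
}
```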
What should we do if this state file is corrupted or deleted?
Note: CPU Manager doesn’t support offlining and onlining of CPUs at runtime. Also, if the set of online CPUs changes on the node, the node must be drained and CPU manager manually reset by deleting the state file cpu_manager_state in the kubelet root directory.
The CPU Manager is created as part of creating the Container Manager. Let's look at what happens when the CPU Manager is created, which also tells us what the CPU Manager does when the kubelet restarts.
```go
// NewManager creates new cpu manager based on provided policy
func NewManager(cpuPolicyName string, reconcilePeriod time.Duration, machineInfo *cadvisorapi.MachineInfo, nodeAllocatableReservation v1.ResourceList, stateFileDirectory string) (Manager, error) {
	var policy Policy

	switch policyName(cpuPolicyName) {

	case PolicyNone:
		policy = NewNonePolicy()

	case PolicyStatic:
		topo, err := topology.Discover(machineInfo)
		if err != nil {
			return nil, err
		}
		glog.Infof("[cpumanager] detected CPU topology: %v", topo)
		reservedCPUs, ok := nodeAllocatableReservation[v1.ResourceCPU]
		if !ok {
			// The static policy cannot initialize without this information.
			return nil, fmt.Errorf("[cpumanager] unable to determine reserved CPU resources for static policy")
		}
		if reservedCPUs.IsZero() {
			// The static policy requires this to be nonzero. Zero CPU reservation
			// would allow the shared pool to be completely exhausted. At that point
			// either we would violate our guarantee of exclusivity or need to evict
			// any pod that has at least one container that requires zero CPUs.
			// See the comments in policy_static.go for more details.
			return nil, fmt.Errorf("[cpumanager] the static policy requires systemreserved.cpu + kubereserved.cpu to be greater than zero")
		}

		// Take the ceiling of the reservation, since fractional CPUs cannot be
		// exclusively allocated.
		reservedCPUsFloat := float64(reservedCPUs.MilliValue()) / 1000
		numReservedCPUs := int(math.Ceil(reservedCPUsFloat))
		policy = NewStaticPolicy(topo, numReservedCPUs)

	default:
		glog.Errorf("[cpumanager] Unknown policy \"%s\", falling back to default policy \"%s\"", cpuPolicyName, PolicyNone)
		policy = NewNonePolicy()
	}

	stateImpl, err := state.NewCheckpointState(stateFileDirectory, cpuManagerStateFileName, policy.Name())
	if err != nil {
		return nil, fmt.Errorf("could not initialize checkpoint manager: %v", err)
	}

	manager := &manager{
		policy:                     policy,
		reconcilePeriod:            reconcilePeriod,
		state:                      stateImpl,
		machineInfo:                machineInfo,
		nodeAllocatableReservation: nodeAllocatableReservation,
	}
	return manager, nil
}
```
state.NewCheckpointState creates the cpu_manager_state checkpoint file (if it already exists, it is not cleared), initializes the in-memory state, and restores the checkpoint contents into that memory state. The cpu_manager_state checkpoint file is simply the JSON serialization of the CPUManagerCheckpoint struct, where each key in Entries is a container ID and the value is that container's assigned CPU set.
```go
// CPUManagerCheckpoint struct is used to store cpu/pod assignments in a checkpoint
type CPUManagerCheckpoint struct {
	PolicyName    string            `json:"policyName"`
	DefaultCPUSet string            `json:"defaultCpuSet"`
	Entries       map[string]string `json:"entries,omitempty"`
	Checksum      checksum.Checksum `json:"checksum"`
}
```
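For illustration, here is a minimal sketch that decodes a hypothetical cpu_manager_state file; the container ID, CPU ranges, and checksum value are made up, and the mirror struct simply repeats the fields shown above:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkpoint mirrors the CPUManagerCheckpoint fields shown above, for illustration only.
type checkpoint struct {
	PolicyName    string            `json:"policyName"`
	DefaultCPUSet string            `json:"defaultCpuSet"`
	Entries       map[string]string `json:"entries,omitempty"`
	Checksum      uint64            `json:"checksum"`
}

func main() {
	// Hypothetical contents of the cpu_manager_state file in the kubelet root directory.
	raw := `{"policyName":"static","defaultCpuSet":"0-1,4-7","entries":{"3e1f0aa6...":"2-3"},"checksum":1337}`

	var cp checkpoint
	if err := json.Unmarshal([]byte(raw), &cp); err != nil {
		panic(err)
	}
	fmt.Println("shared pool:", cp.DefaultCPUSet) // 0-1,4-7
	for id, cpus := range cp.Entries {
		fmt.Printf("container %s -> exclusive CPUs %s\n", id, cpus)
	}
}
```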
Next comes the CPU Manager's startup:
```go
func (m *manager) Start(activePods ActivePodsFunc, podStatusProvider status.PodStatusProvider, containerRuntime runtimeService) {
	glog.Infof("[cpumanager] starting with %s policy", m.policy.Name())
	glog.Infof("[cpumanager] reconciling every %v", m.reconcilePeriod)

	m.activePods = activePods
	m.podStatusProvider = podStatusProvider
	m.containerRuntime = containerRuntime

	m.policy.Start(m.state)
	if m.policy.Name() == string(PolicyNone) {
		return
	}
	go wait.Until(func() { m.reconcileState() }, m.reconcilePeriod, wait.NeverStop)
}
```
The CPU Manager's reconcile loop runs at the interval configured by --cpu-manager-reconcile-period and mainly does the following: for every container (including init containers) of every active pod, it looks up the pod status and resolves the container ID; if a container is missing from the state but its pod is Running and has no DeletionTimestamp, it is added via AddContainer, which records the assignment in the memory state (and therefore in cpu_manager_state), and the flow continues; finally, the container's assigned CPU set (or the default/shared set) is written to its cpuset cgroup via updateContainerCPUSet.

```go
// pkg/kubelet/cm/cpumanager/cpu_manager.go:219
func (m *manager) reconcileState() (success []reconciledContainer, failure []reconciledContainer) {
	success = []reconciledContainer{}
	failure = []reconciledContainer{}

	for _, pod := range m.activePods() {
		allContainers := pod.Spec.InitContainers
		allContainers = append(allContainers, pod.Spec.Containers...)
		for _, container := range allContainers {
			status, ok := m.podStatusProvider.GetPodStatus(pod.UID)
			if !ok {
				glog.Warningf("[cpumanager] reconcileState: skipping pod; status not found (pod: %s, container: %s)", pod.Name, container.Name)
				failure = append(failure, reconciledContainer{pod.Name, container.Name, ""})
				break
			}

			containerID, err := findContainerIDByName(&status, container.Name)
			if err != nil {
				glog.Warningf("[cpumanager] reconcileState: skipping container; ID not found in status (pod: %s, container: %s, error: %v)", pod.Name, container.Name, err)
				failure = append(failure, reconciledContainer{pod.Name, container.Name, ""})
				continue
			}

			// Check whether container is present in state, there may be 3 reasons why it's not present:
			// - policy does not want to track the container
			// - kubelet has just been restarted - and there is no previous state file
			// - container has been removed from state by RemoveContainer call (DeletionTimestamp is set)
			if _, ok := m.state.GetCPUSet(containerID); !ok {
				if status.Phase == v1.PodRunning && pod.DeletionTimestamp == nil {
					glog.V(4).Infof("[cpumanager] reconcileState: container is not present in state - trying to add (pod: %s, container: %s, container id: %s)", pod.Name, container.Name, containerID)
					err := m.AddContainer(pod, &container, containerID)
					if err != nil {
						glog.Errorf("[cpumanager] reconcileState: failed to add container (pod: %s, container: %s, container id: %s, error: %v)", pod.Name, container.Name, containerID, err)
						failure = append(failure, reconciledContainer{pod.Name, container.Name, containerID})
						continue
					}
				} else {
					// if DeletionTimestamp is set, pod has already been removed from state
					// skip the pod/container since it's not running and will be deleted soon
					continue
				}
			}

			cset := m.state.GetCPUSetOrDefault(containerID)
			if cset.IsEmpty() {
				// NOTE: This should not happen outside of tests.
				glog.Infof("[cpumanager] reconcileState: skipping container; assigned cpuset is empty (pod: %s, container: %s)", pod.Name, container.Name)
				failure = append(failure, reconciledContainer{pod.Name, container.Name, containerID})
				continue
			}

			glog.V(4).Infof("[cpumanager] reconcileState: updating container (pod: %s, container: %s, container id: %s, cpuset: \"%v\")", pod.Name, container.Name, containerID, cset)
			err = m.updateContainerCPUSet(containerID, cset)
			if err != nil {
				glog.Errorf("[cpumanager] reconcileState: failed to update container (pod: %s, container: %s, container id: %s, cpuset: \"%v\", error: %v)", pod.Name, container.Name, containerID, cset, err)
				failure = append(failure, reconciledContainer{pod.Name, container.Name, containerID})
				continue
			}
			success = append(success, reconciledContainer{pod.Name, container.Name, containerID})
		}
	}
	return success, failure
}
```
At startup, besides launching a goroutine to run the reconcile loop, the CPU Manager also validates the state:
If the shared (default) CPU set in the memory state is empty, the CPU assignments must also be empty; the shared pool in the memory state is then initialized to all CPUs and written to the checkpoint file (initializing the checkpoint).
As long as we have not manually deleted the checkpoint file, state.NewCheckpointState (mentioned earlier) restores the memory state from it, so the previously assigned CPU sets and the default CPU set are still there.
Once the memory state is detected as successfully initialized (restored from the checkpoint), the CPU Manager checks whether the reserved CPU set for this startup is fully contained in the default CPU set. If it is not (for example because kube/system reserved CPUs were increased), it returns an error, because it means some CPUs in the reserved set have already been assigned to containers, which could cause those containers to fail to start; in that case the user has to fix the checkpoint file manually.
After the reserved CPU check passes, it checks whether the default CPU set and the assigned CPU sets intersect. If they do, the data restored from the checkpoint into the memory state is wrong, and an error is returned.
Finally, it checks whether all CPUs in the CPU topology obtained from cAdvisor at this startup match all CPUs recorded in the memory state (default CPU set + assigned CPU sets, restored from the checkpoint). If they differ, an error is returned; this can happen if the node's available CPUs changed between the previous CPU Manager shutdown and this startup.
```go
// pkg/kubelet/cm/cpumanager/policy_static.go:116
func (p *staticPolicy) validateState(s state.State) error {
	tmpAssignments := s.GetCPUAssignments()
	tmpDefaultCPUset := s.GetDefaultCPUSet()

	// Default cpuset cannot be empty when assignments exist
	if tmpDefaultCPUset.IsEmpty() {
		if len(tmpAssignments) != 0 {
			return fmt.Errorf("default cpuset cannot be empty")
		}
		// state is empty initialize
		allCPUs := p.topology.CPUDetails.CPUs()
		s.SetDefaultCPUSet(allCPUs)
		return nil
	}

	// State has already been initialized from file (is not empty)
	// 1. Check if the reserved cpuset is not part of default cpuset because:
	// - kube/system reserved have changed (increased) - may lead to some containers not being able to start
	// - user tampered with file
	if !p.reserved.Intersection(tmpDefaultCPUset).Equals(p.reserved) {
		return fmt.Errorf("not all reserved cpus: \"%s\" are present in defaultCpuSet: \"%s\"",
			p.reserved.String(), tmpDefaultCPUset.String())
	}

	// 2. Check if state for static policy is consistent
	for cID, cset := range tmpAssignments {
		// None of the cpu in DEFAULT cset should be in s.assignments
		if !tmpDefaultCPUset.Intersection(cset).IsEmpty() {
			return fmt.Errorf("container id: %s cpuset: \"%s\" overlaps with default cpuset \"%s\"",
				cID, cset.String(), tmpDefaultCPUset.String())
		}
	}

	// 3. It's possible that the set of available CPUs has changed since
	// the state was written. This can be due to for example
	// offlining a CPU when kubelet is not running. If this happens,
	// CPU manager will run into trouble when later it tries to
	// assign non-existent CPUs to containers. Validate that the
	// topology that was received during CPU manager startup matches with
	// the set of CPUs stored in the state.
	totalKnownCPUs := tmpDefaultCPUset.Clone()
	for _, cset := range tmpAssignments {
		totalKnownCPUs = totalKnownCPUs.Union(cset)
	}
	if !totalKnownCPUs.Equals(p.topology.CPUDetails.CPUs()) {
		return fmt.Errorf("current set of available CPUs \"%s\" doesn't match with CPUs in state \"%s\"",
			p.topology.CPUDetails.CPUs().String(), totalKnownCPUs.String())
	}

	return nil
}
```
When a static-policy container is added, besides picking the best CPU set for itself, those CPUs are also removed from the shared-pool CPU set. So in this situation, tasks already running on those CPUs keep executing until the CPU scheduler next schedules them; once the cpuset cgroups take effect, those tasks will no longer see the CPUs that were taken away.
From the workflow analysed earlier, after a static-policy container has been assigned CPUs, the reconcile loop (every 10s by default) propagates the assignments in the memory state to the cpuset cgroups, so in the worst case the static-policy container will share those CPUs with non-static-policy containers for up to 10s.
From this analysis of the CPU Manager we know that reconcile cannot repair this discrepancy by itself. It can be fixed in the following ways:
Method 1: regenerate the checkpoint file. Delete the checkpoint file and restart the kubelet; the CPU Manager's reconcile mechanism will walk all containers, re-assign CPUs to those that satisfy the static-policy conditions, and update the cpuset cgroups. This may cause running containers to be re-assigned to a different CPU set and therefore suffer brief application jitter.
Method 2: drain the node so that the Pods are evicted and rescheduled onto nodes with a healthy checkpoint, then clear or delete the checkpoint file. This also has some impact on the applications, since the Pods have to be recreated on other nodes.
The CPU Manager is also not yet compatible with the isolcpus Linux kernel boot parameter: it would need to obtain the isolated CPUs configured via isolcpus (through cAdvisor or by reading them directly) and exclude them when assigning CPUs to static-policy containers.
Dynamic allocation is not supported yet either, i.e. changing a container's cpuset cgroups while it is running.
Through this deep analysis of the kubelet CPU Manager, we now have a thorough understanding of how it works, including its reconcile loop, the state validation at startup, the checkpoint mechanism and how to repair it, and the CPU Manager's current shortcomings.