In day-to-day Kubernetes operation, depending on how the cluster is configured, one or more failure conditions can cause a node to become NotReady.
Once a node has stayed NotReady for longer than the value of `--pod-eviction-timeout` on kube-controller-manager (5 minutes by default), Pod eviction is triggered.
How the affected Pods are handled differs per workload type, because each controller in controller-manager implements its own logic. In summary:
- deployment: after NotReady triggers eviction, the Pod is recreated on a new node (it stays Pending if a nodeSelector or affinity requirement cannot be satisfied), while the Pod on the failed node is kept in Unknown state, so at this point more Pods are visible than the replica count.
- statefulset: NotReady likewise triggers eviction for a StatefulSet, but the Pod the user sees remains in Unknown state without change.
- daemonset: NotReady has no effect on a DaemonSet; querying shows its Pod in NodeLost state, and it stays that way.

Note that for deployment and statefulset resources, the Pod status shown after the node goes NotReady is Unknown. The state actually stored in etcd is NodeLost; it is translated at display time to distinguish these cases from daemonset. The corresponding logic in the code:
### node controller

```go
// When NodeEviction is triggered, DeletePods is called. The delete is a
// GracefulDelete; the apiserver REST layer adds a DeletionTimestamp to the
// Pod object.
func DeletePods(kubeClient clientset.Interface, recorder record.EventRecorder, nodeName, nodeUID string, daemonStore extensionslisters.DaemonSetLister) (bool, error) {
	...
	for _, pod := range pods.Items {
		...
		// Set reason and message in the pod object.
		if _, err = SetPodTerminationReason(kubeClient, &pod, nodeName); err != nil {
			if apierrors.IsConflict(err) {
				updateErrList = append(updateErrList, fmt.Errorf("update status failed for pod %q: %v", format.Pod(&pod), err))
				continue
			}
		}
		// if the pod has already been marked for deletion, we still return true that there are remaining pods.
		if pod.DeletionGracePeriodSeconds != nil {
			remaining = true
			continue
		}
		// if the pod is managed by a daemonset, ignore it
		_, err := daemonStore.GetPodDaemonSets(&pod)
		if err == nil { // No error means at least one daemonset was found
			continue
		}
		glog.V(2).Infof("Starting deletion of pod %v/%v", pod.Namespace, pod.Name)
		recorder.Eventf(&pod, v1.EventTypeNormal, "NodeControllerEviction", "Marking for deletion Pod %s from Node %s", pod.Name, nodeName)
		if err := kubeClient.CoreV1().Pods(pod.Namespace).Delete(pod.Name, nil); err != nil {
			return false, err
		}
		remaining = true
	}
	...
}
```

### staging apiserver REST interface

```go
// For a graceful delete, processing effectively stops here and nothing is
// deleted further; the rest is left to the kubelet, which watches the change
// and performs the actual delete.
func (e *Store) Delete(ctx genericapirequest.Context, name string, options *metav1.DeleteOptions) (runtime.Object, bool, error) {
	...
	if graceful || pendingFinalizers || shouldUpdateFinalizers {
		err, ignoreNotFound, deleteImmediately, out, lastExisting = e.updateForGracefulDeletionAndFinalizers(ctx, name, key, options, preconditions, obj)
	}
	// !deleteImmediately covers all cases where err != nil. We keep both to be future-proof.
	if !deleteImmediately || err != nil {
		return out, false, err
	}
	...
}

// Called from the staging/apiserver REST path; sets DeletionTimestamp and
// DeletionGracePeriodSeconds.
func (e *Store) updateForGracefulDeletionAndFinalizers(ctx genericapirequest.Context, name, key string, options *metav1.DeleteOptions, preconditions storage.Preconditions, in runtime.Object) (err error, ignoreNotFound, deleteImmediately bool, out, lastExisting runtime.Object) {
	...
	if options.GracePeriodSeconds != nil {
		period := int64(*options.GracePeriodSeconds)
		if period >= *objectMeta.GetDeletionGracePeriodSeconds() {
			return false, true, nil
		}
		newDeletionTimestamp := metav1.NewTime(
			objectMeta.GetDeletionTimestamp().Add(-time.Second * time.Duration(*objectMeta.GetDeletionGracePeriodSeconds())).
				Add(time.Second * time.Duration(*options.GracePeriodSeconds)))
		objectMeta.SetDeletionTimestamp(&newDeletionTimestamp)
		objectMeta.SetDeletionGracePeriodSeconds(&period)
		return true, false, nil
	}
	...
}
```

### node controller

```go
// SetPodTerminationReason attempts to set the termination reason and message
// on the Pod object.
func SetPodTerminationReason(kubeClient clientset.Interface, pod *v1.Pod, nodeName string) (*v1.Pod, error) {
	if pod.Status.Reason == nodepkg.NodeUnreachablePodReason {
		return pod, nil
	}
	pod.Status.Reason = nodepkg.NodeUnreachablePodReason
	pod.Status.Message = fmt.Sprintf(nodepkg.NodeUnreachablePodMessage, nodeName, pod.Name)
	var updatedPod *v1.Pod
	var err error
	if updatedPod, err = kubeClient.CoreV1().Pods(pod.Namespace).UpdateStatus(pod); err != nil {
		return nil, err
	}
	return updatedPod, nil
}
```

### command-line output

```go
// Status translation at print time: if DeletionTimestamp is set and the pod's
// status reason is NodeLost, the displayed status is Unknown.
func printPod(pod *api.Pod, options printers.PrintOptions) ([]metav1alpha1.TableRow, error) {
	...
	if pod.DeletionTimestamp != nil && pod.Status.Reason == node.NodeUnreachablePodReason {
		reason = "Unknown"
	} else if pod.DeletionTimestamp != nil {
		reason = "Terminating"
	}
	...
}
```
When the node recovers, the Pod state transitions again differ per workload:

- deployment: as described in the previous section, a correct Pod is already running on another node; when the failed node comes back, its kubelet performs the graceful deletion and removes the stale Pod object.
- statefulset: the Pod moves from Unknown to Terminating, the graceful deletion runs, the PV is detached, and the Pod is then rescheduled and recreated.
- daemonset: the Pod goes straight from NodeLost back to Running; no recreation is involved.
Two questions usually follow: why is the StatefulSet Pod not recreated, and how can a single-replica StatefulSet be kept highly available?
Why is it not recreated?

First, a brief look at the StatefulSet controller's logic. The StatefulSet controller coordinates two modules, StatefulSetControl and StatefulPodControl, to handle StatefulSet workloads: state management (StatefulSetStatusUpdater) and scaling control (StatefulPodControl). In practice, StatefulSetControl calls into StatefulPodControl to create, update, and delete Pods.
When podManagementPolicy is the default OrderedReady, a StatefulSet creates its Pods one at a time in monotonically increasing ordinal order; when it is Parallel, the Pods still receive integer ordinals but are scheduled and created concurrently.
The concrete logic lives in the core method UpdateStatefulSet (see the figure). The reason we see the StatefulSet Pod stuck in Unknown is that this controller deliberately skips it: as described in the first section, the NodeController's Pod eviction has already marked the Pod for deletion, so DeletionTimestamp is set on the Pod object; the StatefulSet controller's isTerminating check then matches, and the method simply returns.
```go
// updateStatefulSet performs the update function for a StatefulSet. This method creates, updates, and deletes Pods in
// the set in order to conform the system to the target state for the set. The target state always contains
// set.Spec.Replicas Pods with a Ready Condition. If the UpdateStrategy.Type for the set is
// RollingUpdateStatefulSetStrategyType then all Pods in the set must be at set.Status.CurrentRevision.
// If the UpdateStrategy.Type for the set is OnDeleteStatefulSetStrategyType, the target state implies nothing about
// the revisions of Pods in the set. If the UpdateStrategy.Type for the set is PartitionStatefulSetStrategyType, then
// all Pods with ordinal less than UpdateStrategy.Partition.Ordinal must be at Status.CurrentRevision and all other
// Pods must be at Status.UpdateRevision. If the returned error is nil, the returned StatefulSetStatus is valid and the
// update must be recorded. If the error is not nil, the method should be retried until successful.
func (ssc *defaultStatefulSetControl) updateStatefulSet(
	...
	for i := range replicas {
		...
		// If we find a Pod that is currently terminating, we must wait until graceful deletion
		// completes before we continue to make progress.
		if isTerminating(replicas[i]) && monotonic {
			glog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to Terminate",
				set.Namespace, set.Name, replicas[i].Name)
			return &status, nil
		}
		...
	}
}

// isTerminating returns true if pod's DeletionTimestamp has been set
func isTerminating(pod *v1.Pod) bool {
	return pod.DeletionTimestamp != nil
}
```
How, then, do we keep a single replica highly available?

Applications often include Pods that cannot run with multiple replicas, yet the cluster must still be able to self-heal. A node going down or a NIC failing then has a large impact. How can self-healing be achieved?
A StatefulSet Pod stuck in Unknown can be removed with a force delete. The community does not recommend force deletion, because it can break the uniqueness of the Pod's identity (the monotonically increasing ordinal); if that happens it is fatal for a StatefulSet and may lead to data loss (whether through split-brain in the application cluster or through multiple writers on the same PV).
```
kubectl delete pods <pod> --grace-period=0 --force
```
Even then, the deletion needs safeguards. Take the Ceph RBD storage plugin as an example: before running the force delete, experience says the user should first add a ceph osd blacklist entry, so that if the network recovers mid-migration the old container cannot keep writing to the PV and corrupt the filesystem. A force delete removes the Pod object from etcd immediately, so the StatefulSet controller creates a new Pod on another node, but the kubelet on the failed node still needs time to clean up the old container. During that window two containers inevitably have the same PV mounted (the old container on the failed node and the container of the newly scheduled Pod); if the network recovers at that point, both may write to it at once, with severe consequences.
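Putting those safeguards in order, the recovery procedure described above might look like the following sketch (all names and addresses are placeholders, and the exact blacklist syntax depends on your Ceph version; verify against your own setup before use):

```
# 1. Blacklist the failed node's client address so it can no longer write to the RBD image.
ceph osd blacklist add <failed-node-ip>

# 2. Force delete the stuck Pod; the StatefulSet controller will recreate it elsewhere.
kubectl delete pods <pod> --grace-period=0 --force

# 3. Once the failed node has been recovered and cleaned up, remove the blacklist entry.
ceph osd blacklist rm <failed-node-ip>
```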