Kubernetes Troubleshooting in the Field, Part 1: Orphaned pod

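Problem description: On Friday the whole office building lost power. When we came back on Monday, many pods were stuck in the Terminating state. Investigation showed that some nodes in our test-environment Kubernetes cluster are desktop PCs, which have to be powered on by hand after an outage. Once booted, the nodes returned to normal, but journalctl -fu kubelet showed the following error being logged over and over: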

[root@k8s-node4 pods]# journalctl -fu kubelet
-- Logs begin at 二 2019-05-21 08:52:08 CST. --
5月 21 14:48:48 k8s-node4 kubelet[2493]: E0521 14:48:48.748460    2493 kubelet_volumes.go:140] Orphaned pod "d29f26dc-77bb-11e9-971b-0050568417a2" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
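
When several pods are affected, reading UIDs out of the log by eye gets tedious. The following is a minimal sketch, assuming the kubelet message keeps the `Orphaned pod "<uid>"` wording shown above; `orphaned_pod_uids` is a hypothetical helper name, not part of any Kubernetes tool.

```shell
#!/usr/bin/env bash
# Print the unique pod UIDs named in kubelet "Orphaned pod" log lines.
# Reads log text on stdin. The function name is an illustration only.
orphaned_pod_uids() {
  # Keep only the quoted UID from each matching line, then de-duplicate.
  grep -o 'Orphaned pod "[0-9a-f-]*"' | cut -d'"' -f2 | sort -u
}

# Usage on the affected node (last hour of kubelet logs):
#   journalctl -u kubelet --since "1 hour ago" --no-pager | orphaned_pod_uids
```

Because the kubelet repeats the message every few seconds, `sort -u` collapses the output to one line per orphaned pod.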

We cd into /var/lib/kubelet/pods and list it with ls:

[root@k8s-node4 pods]# ls
36e224e2-7b73-11e9-99bc-0050568417a2  42e8cd65-76b1-11e9-971b-0050568417a2  42eaca2d-76b1-11e9-971b-0050568417a2
36e30462-7b73-11e9-99bc-0050568417a2  42e94e29-76b1-11e9-971b-0050568417a2  d29f26dc-77bb-11e9-971b-0050568417a2

The pod ID from the error message is in this listing. We cd into that directory (d29f26dc-77bb-11e9-971b-0050568417a2) and find the following files:

[root@k8s-node4 d29f26dc-77bb-11e9-971b-0050568417a2]# ls
containers  etc-hosts  plugins  volumes

We look at the etc-hosts file:

[root@k8s-node4 d29f26dc-77bb-11e9-971b-0050568417a2]# cat etc-hosts
# Kubernetes-managed hosts file.
127.0.0.1       localhost
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
10.244.7.7      sagent-b4dd8b5b9-zq649

On the master node we run kubectl get pod|grep sagent-b4dd8b5b9-zq649 and find that this pod no longer exists.
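
The lookup from etc-hosts to pod name can be scripted. This is a rough sketch, assuming the hosts file has the layout shown above (loopback and ip6 entries first, then the pod's own IP and hostname) and that the hostname equals the pod name, which may not hold if the pod sets a custom hostname or host aliases. `pod_name_from_hosts` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Print the hostname of the first entry in a Kubernetes-managed hosts file
# that is neither a comment nor a loopback/ip6 helper entry.
# $1: path to the pod's etc-hosts file
pod_name_from_hosts() {
  awk '!/^#/ && NF >= 2 && $1 !~ /^(127\.|::1$|fe00::)/ { print $2; exit }' "$1"
}

# Usage: run the first command on the node, the second on the master.
#   name=$(pod_name_from_hosts /var/lib/kubelet/pods/<pod-uid>/etc-hosts)
#   kubectl get pod --all-namespaces | grep "$name"
```

Searching with `--all-namespaces` avoids a false "not found" when the pod lived outside the default namespace.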

The problem has been discussed upstream, and someone has submitted a PR to fix it, but at the time of writing the PR is still unmerged.

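The current workaround is to go into /var/lib/kubelet/pods on the problem node and delete the directory named after the pod ID in the error (rm -rf <name>), then delete the node from the cluster on the master node (kubectl delete node), and then run the following on the problem node: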

kubeadm reset
systemctl stop kubelet
systemctl stop docker
systemctl start docker
systemctl start kubelet

Once these commands have finished, rejoin the node to the cluster with kubeadm join.
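
The directory-removal step of the workaround is the risky part, since it is an rm -rf under the kubelet's state directory. Below is a cautious sketch of that single step, not a definitive procedure: `remove_orphaned_pod_dir` is a hypothetical helper, and the mount check assumes a Linux node where /proc/mounts is available.

```shell
#!/usr/bin/env bash
# Remove one orphaned pod directory after a few sanity checks.
# $1: pod UID from the kubelet error
# $2: pods directory (defaults to /var/lib/kubelet/pods)
remove_orphaned_pod_dir() {
  local uid="$1" pods_dir="${2:-/var/lib/kubelet/pods}"
  # Accept only a plain UID so a bad argument can never widen the rm.
  if ! printf '%s' "$uid" | grep -Eq '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'; then
    echo "not a pod UID: $uid" >&2
    return 1
  fi
  if [ ! -d "$pods_dir/$uid" ]; then
    echo "no such pod directory: $pods_dir/$uid" >&2
    return 1
  fi
  # Stop if a volume is still mounted below the directory; unmount it first,
  # otherwise rm would descend into the mounted filesystem.
  if grep -qF "$pods_dir/$uid" /proc/mounts 2>/dev/null; then
    echo "volumes still mounted under $pods_dir/$uid" >&2
    return 1
  fi
  rm -rf -- "$pods_dir/$uid"
}

# Usage on the problem node:
#   remove_orphaned_pod_dir d29f26dc-77bb-11e9-971b-0050568417a2
```

The checks exist because the kubelet message itself says volume paths are still present on disk, so it is worth confirming nothing is mounted there before deleting.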
