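A customer added a new node to their Kubernetes cluster. Applications deployed to the new node became intermittently unavailable: users got 503 errors, yet there were no event messages and the host status looked normal.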
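The new node itself was the first suspect. Neither the system log /var/log/messages nor dmesg showed any related errors, but the kubelet log contained the entries shown further down.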
The Kubernetes cluster was installed with rke, so the kubelet log can be viewed by running the following command directly on the node:
docker logs -f --tail=30 kubelet
E0602 03:18:27.766726 1301 controller.go:136] failed to ensure node lease exists, will retry in 7s, error: an error on the server ("") has prevented the request from succeeding (get leases.coordination.k8s.io k8s-node-dev-6)
E0602 03:18:34.847254 1301 reflector.go:178] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.CSIDriver: an error on the server ("") has prevented the request from succeeding (get csidrivers.storage.k8s.io)
I0602 03:18:39.176996 1301 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
E0602 03:18:43.771023 1301 controller.go:136] failed to ensure node lease exists, will retry in 7s, error: an error on the server ("") has prevented the request from succeeding (get leases.coordination.k8s.io k8s-node-dev-6)
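When the log is long, it can help to pull out only the lease failures and their timestamps to see how often they recur. The following is a minimal sketch, not part of the original investigation: the function name lease_errors is made up for illustration, and it assumes the log header layout seen above, where the second field is the time.

```shell
#!/usr/bin/env bash
# lease_errors (hypothetical helper): read kubelet log lines on stdin and
# print the time field (second column of the log header) of every
# "failed to ensure node lease exists" entry.
lease_errors() {
  grep 'failed to ensure node lease exists' | awk '{print $2}'
}
```

One possible use on the node is: docker logs --tail=500 kubelet 2>&1 | lease_errors (the 2>&1 matters because the container log may arrive on stderr).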
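The error that stands out is failed to ensure node lease exists. Taken literally, it means the kubelet could not register (renew) the node's lease with apiserver, yet kubectl get nodes always reported Ready. Given that the application was only intermittently unavailable, the suspicion was that the node went unavailable for a short time and then recovered quickly, so it happened to be healthy whenever the command was run. To test this guess, kubectl get nodes was run continuously in the background, and it eventually caught a NotReady status.
The detailed information at the moment of unavailability was captured as well.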
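The polling described above can be sketched roughly as follows. This is an illustration rather than the exact command used at the time; the function names not_ready_lines and watch_nodes and the 2-second default interval are assumptions.

```shell
#!/usr/bin/env bash
# not_ready_lines (hypothetical helper): read `kubectl get nodes --no-headers`
# output on stdin and print only rows whose STATUS column is not exactly
# "Ready". Cordoned nodes ("Ready,SchedulingDisabled") are printed too.
not_ready_lines() {
  awk '$2 != "Ready"'
}

# watch_nodes (hypothetical helper): poll node status every INTERVAL seconds
# (default 2) and print any non-Ready row under a timestamp.
# Runs until interrupted with Ctrl-C.
watch_nodes() {
  local interval="${1:-2}" rows
  while true; do
    # stderr is included so that failed API calls also show up in the output
    rows=$(kubectl get nodes --no-headers 2>&1 | not_ready_lines)
    if [ -n "$rows" ]; then
      printf '%s\n%s\n' "$(date '+%F %T')" "$rows"
    fi
    sleep "$interval"
  done
}
```

Calling watch_nodes 1 in a spare terminal polls once per second. When a node shows up, kubectl describe node with that node's name shows its conditions and the reason reported for them.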
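Searching for the keyword kubelet stopped posting node status turned up a similar question on Stack Overflow, where the top-voted answer suggests setting the kube-apiserver flag --http2-max-streams-per-connection. The cluster had recently gained a Prometheus deployment and several new nodes, so the sudden rise in requests to apiserver may have exhausted the number of concurrent streams it allows on each HTTP/2 connection; this flag raises that limit. Because the cluster was installed with rke, the rke configuration file has to be changed and rke up run again. The change is as follows: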
kube-api:
  service_node_port_range: "1-65535"
  extra_args:
    http2-max-streams-per-connection: 1000
Running rke up again updated the cluster. Once the update finished, kubelet was restarted and the problem was resolved.
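To confirm that the new setting actually reached the running API server, its container arguments can be inspected on a control plane node. This is a sketch under assumptions not stated in the original: that the API server container is named kube-apiserver, and that its flags are passed in the --name=value form. The function names flag_value and apiserver_flag are made up for illustration.

```shell
#!/usr/bin/env bash
# flag_value (hypothetical helper): read one command-line argument per line
# on stdin and print the value of the flag named in $1.
# Only the --name=value form is recognised.
flag_value() {
  local name="$1"
  awk -v prefix="--${name}=" \
    'index($0, prefix) == 1 { print substr($0, length(prefix) + 1) }'
}

# apiserver_flag (hypothetical helper): print the value the running container
# was started with.
# Assumption: the API server container is named "kube-apiserver".
apiserver_flag() {
  docker inspect --format '{{range .Args}}{{println .}}{{end}}' kube-apiserver \
    | flag_value "$1"
}
```

Under those assumptions, apiserver_flag http2-max-streams-per-connection should print the configured value; an empty result would suggest the flag was not applied or is passed in a different form.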
On an rke installation, kubelet can be restarted by running this command on the node:
docker restart kubelet