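Our blog system is deployed on a Kubernetes cluster that we built ourselves on Alibaba Cloud servers. The failure already shows up while k8s is rolling out updated pods. During yesterday's release we deliberately watched the process, and we share what we saw in this installment.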
During a deployment, k8s updates the pods in 3 phases.
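For a normal release, the whole deployment usually completes in about 5-8 minutes (this depends on how livenessProbe and readinessProbe are configured). Below is the console output during a deployment: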
Waiting for deployment "blog-web" rollout to finish: 4 out of 8 new replicas have been updated...
Waiting for deployment spec update to be observed...
Waiting for deployment "blog-web" rollout to finish: 4 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 4 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 4 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 8 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 8 out of 15 new replicas have been updated...
Waiting for deployment "blog-web" rollout to finish: 8 out of 15 new replicas have been updated...
...
Waiting for deployment "blog-web" rollout to finish: 4 old replicas are pending termination...
...
Waiting for deployment "blog-web" rollout to finish: 14 of 15 updated replicas are available...
deployment "blog-web" successfully rolled out
In the failure scenario, the whole deployment takes about 15 minutes to complete. All 3 phases of the pod update are slower than normal, especially the "old replicas are pending termination" phase.
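During the deployment we checked the pod status with the following command:
kubectl get pods -l app=blog-web -o wide
The newly deployed pods were in the Running state, which suggests the livenessProbe health check was passing at that moment. Most of them had not become ready, however, which means their readinessProbe health check was failing, and a RESTARTS value greater than 0 means the livenessProbe had failed earlier and the pod had been restarted.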
NAME                        READY   STATUS    RESTARTS   AGE     IP                NODE         NOMINATED NODE   READINESS GATES
blog-web-55d5677cf-2854n    0/1     Running   1          5m1s    192.168.107.213   k8s-node3    <none>           <none>
blog-web-55d5677cf-7vkqb    0/1     Running   2          6m17s   192.168.228.33    k8s-n9       <none>           <none>
blog-web-55d5677cf-8gq6n    0/1     Running   2          5m29s   192.168.102.235   k8s-n19      <none>           <none>
blog-web-55d5677cf-g8dsr    0/1     Running   2          5m54s   192.168.104.78    k8s-node11   <none>           <none>
blog-web-55d5677cf-kk9mf    0/1     Running   2          6m9s    192.168.42.3      k8s-n13      <none>           <none>
blog-web-55d5677cf-kqwzc    0/1     Pending   0          4m44s   <none>            <none>       <none>           <none>
blog-web-55d5677cf-lmbvf    0/1     Running   2          5m54s   192.168.201.123   k8s-n14      <none>           <none>
blog-web-55d5677cf-ms2tk    0/1     Pending   0          6m9s    <none>            <none>       <none>           <none>
blog-web-55d5677cf-nkjrd    1/1     Running   2          6m17s   192.168.254.129   k8s-n7       <none>           <none>
blog-web-55d5677cf-nnjdx    0/1     Pending   0          4m48s   <none>            <none>       <none>           <none>
blog-web-55d5677cf-pqgpr    0/1     Pending   0          4m33s   <none>            <none>       <none>           <none>
blog-web-55d5677cf-qrjr5    0/1     Pending   0          2m38s   <none>            <none>       <none>           <none>
blog-web-55d5677cf-t5wvq    1/1     Running   3          6m17s   192.168.10.100    k8s-n12      <none>           <none>
blog-web-55d5677cf-w52xc    1/1     Running   3          6m17s   192.168.73.35     k8s-node10   <none>           <none>
blog-web-55d5677cf-zk559    0/1     Running   1          5m21s   192.168.118.6     k8s-n4       <none>           <none>
blog-web-5b57b7fcb6-7cbdt   1/1     Running   2          18m     192.168.168.77    k8s-n6       <none>           <none>
blog-web-5b57b7fcb6-cgfr4   1/1     Running   4          19m     192.168.89.250    k8s-n8       <none>           <none>
blog-web-5b57b7fcb6-cz278   1/1     Running   3          19m     192.168.218.99    k8s-n18      <none>           <none>
blog-web-5b57b7fcb6-hvzwp   1/1     Running   3          18m     192.168.195.242   k8s-node5    <none>           <none>
blog-web-5b57b7fcb6-rhgkq   1/1     Running   1          16m     192.168.86.126    k8s-n20      <none>           <none>
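To make the listing above easier to read at a glance, here is a minimal sketch that tallies pod states from this kind of output. It is not part of our tooling: the function name summarize and the four sample rows (copied from the listing, with the trailing columns dropped) are only for illustration, and it assumes the plain column layout shown above.

```python
from collections import Counter

# Four rows copied from the listing above: NAME READY STATUS RESTARTS AGE
SAMPLE = """\
blog-web-55d5677cf-2854n 0/1 Running 1 5m1s
blog-web-55d5677cf-kqwzc 0/1 Pending 0 4m44s
blog-web-55d5677cf-nkjrd 1/1 Running 2 6m17s
blog-web-5b57b7fcb6-cgfr4 1/1 Running 4 19m
"""


def summarize(listing: str) -> Counter:
    """Count pods that are ready, running but not ready, pending, and restarted."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] == "NAME":
            continue  # skip blank lines and the header row
        ready, status, restarts = fields[1], fields[2], int(fields[3])
        have, want = ready.split("/")
        if status == "Pending":
            counts["pending"] += 1
        elif have == want:
            counts["ready"] += 1
        else:
            counts["not_ready"] += 1
        if restarts > 0:
            counts["restarted"] += 1  # restarted at least once
    return counts


if __name__ == "__main__":
    for key, value in sorted(summarize(SAMPLE).items()):
        print(key, value)
```

Counting all 20 rows of the listing by the same rules gives 8 ready pods, 7 that are Running but not ready, and 5 Pending, and every pod that got as far as Running has been restarted at least once.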
In our k8s deployment configuration, livenessProbe and readinessProbe check the same address. The configuration is as follows:
livenessProbe:
  httpGet:
    path: /
    port: 80
    httpHeaders:
    - name: X-Forwarded-Proto
      value: https
    - name: Host
      value: www.cnblogs.com
  initialDelaySeconds: 30
  periodSeconds: 3
  successThreshold: 1
  failureThreshold: 5
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /
    port: 80
    httpHeaders:
    - name: X-Forwarded-Proto
      value: https
    - name: Host
      value: www.cnblogs.com
  initialDelaySeconds: 40
  periodSeconds: 5
  successThreshold: 1
  failureThreshold: 5
  timeoutSeconds: 5
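As a rough back-of-the-envelope check on these settings, the sketch below estimates how long a probe has to keep failing before the kubelet reacts. The Probe class and the failure_window method are hypothetical names used only for this illustration, and the estimate is simply periodSeconds times failureThreshold; it ignores timeoutSeconds and scheduling jitter, so treat the numbers as approximate.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Hypothetical holder for the probe fields shown in the config above."""
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int
    timeout_seconds: int

    def failure_window(self) -> int:
        """Approximate seconds of consecutive failures before the kubelet acts.

        Simplification: period * threshold, ignoring timeouts and jitter.
        """
        return self.period_seconds * self.failure_threshold


# Values taken from the deployment config above.
liveness = Probe(initial_delay_seconds=30, period_seconds=3,
                 failure_threshold=5, timeout_seconds=5)
readiness = Probe(initial_delay_seconds=40, period_seconds=5,
                  failure_threshold=5, timeout_seconds=5)

if __name__ == "__main__":
    print("liveness window (s):", liveness.failure_window())
    print("readiness window (s):", readiness.failure_window())
```

Under this simplification the liveness window is about 15 seconds and the readiness window about 25 seconds. Since both probes hit the same address, a pod that stops answering could be restarted by the livenessProbe before the readinessProbe has taken it out of rotation, which may be part of why even pods showing 1/1 in the listing have restart counts of 2 to 4.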
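A latent concurrency problem makes the livenessProbe and readinessProbe health checks fail frequently, so k8s stumbles through the pod update. During this process some old pods are still sharing the load, and when new pods run into trouble the rollout pauses until the pods being deployed recover. The impact of the failure is therefore confined to a limited scope at this stage, and visitors see the site working intermittently.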
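This stumbling, difficult deployment does eventually complete, and the moment it completes is the moment the failure breaks out in full. Once the deployment is finished, the new pods take over the entire load, and these pods, which carry the concurrency problem, collapse under the weight of concurrent requests. Several pods are restarted because the livenessProbe health check fails, and after restarting they struggle to become ready because the readinessProbe health check fails, so they cannot share the load. The few remaining pods are overwhelmed, CrashLoopBackOff appears on one pod after another, and under the steady stream of concurrent requests there are never enough pods to handle the current load, so the failure never recovers.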
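The feedback loop described above can be illustrated with a toy model. This is only a sketch under loud assumptions, not a model of our actual application: it replaces the real concurrency problem with a simple per-pod capacity limit, and the function simulate and all the numbers (4 pods, load 300, capacity 100, restart delay 3 ticks) are hypothetical.

```python
def simulate(ready_at, load, capacity, restart_delay, ticks):
    """Return the number of ready pods at each tick.

    ready_at[i] is the tick at which pod i next becomes ready.
    Assumption: ready pods share the load evenly, and if the per-pod
    share exceeds `capacity`, every ready pod fails its probes and is
    restarted, coming back `restart_delay` ticks later.
    """
    ready_at = list(ready_at)
    history = []
    for t in range(ticks):
        ready = [i for i, at in enumerate(ready_at) if at <= t]
        history.append(len(ready))
        if ready and load / len(ready) > capacity:
            for i in ready:
                ready_at[i] = t + 1 + restart_delay
    return history


if __name__ == "__main__":
    # All 4 pods ready together: 300 / 4 = 75 per pod, within capacity.
    print(simulate([0, 0, 0, 0], load=300, capacity=100,
                   restart_delay=3, ticks=12))
    # Same pods and load, but they come up one at a time.
    print(simulate([0, 1, 2, 3], load=300, capacity=100,
                   restart_delay=3, ticks=12))
```

In the first run all four pods stay ready for the whole simulation. In the second run, with the same total capacity, each pod is knocked over as soon as it comes up alone, so there is never more than one ready pod at a time. That is the shape of the situation after the deployment completes: the cluster would cope if enough pods were ready at once, but it cannot get there while the load keeps arriving.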