Strong self-healing is an important feature of container orchestration engines such as Kubernetes. In most cases, Kubernetes heals itself by restarting containers that have failed. Beyond that, are there other ways to health-check containers orchestrated by Kubernetes? Liveness and Readiness probes are good options.
2.1 The default health check.
```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: healthcheck
  name: healthcheck
spec:
  restartPolicy: OnFailure
  containers:
  - name: healthcheck
    image: busybox
    args:
    - /bin/sh
    - -c
    - sleep 10;exit 1
```
Create a YAML file with the content above, name it HealthCheck.yaml, and apply it:
```shell
[root@k8s-m health-check]# kubectl apply -f HealthCheck.yaml
pod/healthcheck created
[root@k8s-m health-check]# kubectl get pod
NAME          READY   STATUS             RESTARTS   AGE
healthcheck   0/1     CrashLoopBackOff   3          4m52s
```
As we can see, this pod is not running normally and has already been restarted 3 times. The restart details can be inspected with the describe command, which we will not repeat here. Let's run the following command:
```shell
[root@k8s-m health-check]# sh -c "sleep 2;exit 1"
[root@k8s-m health-check]# echo $?
1
```
The command runs and returns 1. By default on Linux, a return value of 0 means the command succeeded. Because the container's process exits with a non-zero value, Kubernetes considers the container to have failed and keeps restarting it. However, there are many situations where the service has actually failed but its process has not exited; in those cases a restart is often a simple and effective remedy. For example, when visiting a web service you may get a 500 Internal Server Error; many causes can produce such a fault, and a restart may fix it quickly.
2.2 In Kubernetes, a Liveness probe tells Kubernetes when to self-heal by restarting the container.
```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness
spec:
  restartPolicy: OnFailure
  containers:
  - name: liveness
    image: busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthcheck;sleep 30;rm -rf /tmp/healthcheck;sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthcheck
      initialDelaySeconds: 10
      periodSeconds: 5
```
Create a file named Liveness.yaml and create the Pod:
```shell
[root@k8s-m health-check]# kubectl apply -f Liveness.yaml
pod/liveness created
[root@k8s-m health-check]# kubectl get pod
NAME       READY   STATUS    RESTARTS   AGE
liveness   1/1     Running   1          5m50s
```
From the YAML file we can see that the container creates /tmp/healthcheck on startup, deletes it after 30s, and then sleeps for 600s. cat /tmp/healthcheck is used to probe whether the container has failed: if the file exists, the container is healthy; if it does not, the container is killed and restarted.
initialDelaySeconds: 10 means the first probe runs 10s after the container starts; this value should generally be larger than the container's startup time. periodSeconds: 5 means a probe runs every 5s; if three consecutive Liveness probes fail, the container is killed and restarted.
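Besides the exec handler used here, a probe can also use an httpGet or tcpSocket handler. A minimal sketch of the httpGet form, assuming a hypothetical /healthz endpoint on port 8080 (not part of the Pod above):

```yaml
livenessProbe:
  httpGet:             # success = HTTP status between 200 and 399
    path: /healthz     # hypothetical endpoint for illustration
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3  # default: 3 consecutive failures kill the container
```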
2.3 A Readiness probe tells Kubernetes when a container can be added to a Service's load-balancing pool and serve external traffic.
```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: readiness
  name: readiness
spec:
  restartPolicy: OnFailure
  containers:
  - name: readiness
    image: busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthcheck;sleep 30;rm -rf /tmp/healthcheck;sleep 600
    readinessProbe:
      exec:
        command:
        - cat
        - /tmp/healthcheck
      initialDelaySeconds: 10
      periodSeconds: 5
```
Apply the file:
```shell
[root@k8s-m health-check]# kubectl apply -f Readiness.yaml
pod/readiness created
[root@k8s-m health-check]# kubectl get pod
NAME        READY   STATUS    RESTARTS   AGE
readiness   0/1     Running   0          84s
[root@k8s-m health-check]# kubectl get pod
NAME        READY   STATUS      RESTARTS   AGE
readiness   0/1     Completed   0          23m
```
From the YAML we can see that the Readiness and Liveness configurations are almost identical; one can be turned into the other with minor changes. In the kubectl get pod output, the main difference between the two health checks shows up in the READY and STATUS columns: with a Readiness probe, STATUS stays Running throughout, while READY changes from 1/1 to 0/1 after a while. When READY is 0/1, the container is not ready to serve traffic. You can watch this happen with:
```shell
[root@k8s-m health-check]# while true;do kubectl describe pod readiness;done
```
Liveness and Readiness are two independent Health Check mechanisms; they do not depend on each other and can be used at the same time.
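Because the two mechanisms are independent, a single container can declare both probes at once. A minimal sketch, reusing the busybox container from the examples above:

```yaml
containers:
- name: healthcheck
  image: busybox
  args:
  - /bin/sh
  - -c
  - touch /tmp/healthcheck;sleep 30;rm -rf /tmp/healthcheck;sleep 600
  livenessProbe:          # failure -> container is killed and restarted
    exec:
      command:
      - cat
      - /tmp/healthcheck
    initialDelaySeconds: 10
    periodSeconds: 5
  readinessProbe:         # failure -> pod is removed from Service endpoints
    exec:
      command:
      - cat
      - /tmp/healthcheck
    initialDelaySeconds: 10
    periodSeconds: 5
```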
3.1 Health Check in Scale Up.
```yaml
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  template:
    metadata:
      labels:
        run: web
    spec:
      containers:
      - name: web
        image: httpd
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            scheme: HTTP
            path: /health-check
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: web-svc
spec:
  selector:
    run: web
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 80
```
The YAML above creates a Service named web-svc and a Deployment named web.
```shell
[root@k8s-m health-check]# kubectl apply -f HealthCheck-web-deployment.yaml
deployment.apps/web unchanged
service/web-svc created
[root@k8s-m health-check]# kubectl get service web-svc
NAME      TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
web-svc   ClusterIP   10.101.1.6   <none>        8080/TCP   2m20s
[root@k8s-m health-check]# kubectl get deployment web
NAME   DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
web    3         3         3            0           3m26s
[root@k8s-m health-check]# kubectl get pod
NAME                   READY   STATUS    RESTARTS   AGE
web-7d96585f7f-q5p4d   0/1     Running   0          3m35s
web-7d96585f7f-w6tqx   0/1     Running   0          3m35s
web-7d96585f7f-xrqwm   0/1     Running   0          3m35s
```
Focus on the readinessProbe section of the Deployment: this example uses the Readiness mechanism with the httpGet probe method. Kubernetes considers such a probe successful when the HTTP response status code is between 200 and 399. scheme specifies the protocol, HTTP (the default) or HTTPS; path specifies the request path, and port specifies the port.
Probing starts 10s after the container starts. If http://container_ip:8080/health-check does not return a status code in the 200-399 range, the container is considered not ready and does not receive requests from the Service web-svc. /health-check is the endpoint where we implement the probe logic. A sample probe result:
```shell
[root@k8s-m health-check]# kubectl describe pod web
  Warning  Unhealthy  57s (x219 over 19m)  kubelet, k8s-n2  Readiness probe failed: Get http://10.244.2.61:8080/healthy: dial tcp 10.244.2.61:8080: connect: connection refused
```
3.2 Health Check in Rolling Update.
By default, during a Rolling Update Kubernetes assumes a new container is ready as soon as it starts, and gradually replaces old replicas accordingly. If the new version is faulty, the whole application may be unable to handle requests and serve traffic once the update finishes. In a production environment such an incident can have very serious consequences. With Health Check correctly configured, only new replicas that pass the Readiness probe are added to the Service; if they do not pass the probe, the existing replicas are not replaced and the business keeps running normally.
Next, create two YAML files, app.v1.yaml and app.v2.yaml:
```yaml
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 8
  template:
    metadata:
      labels:
        run: app
    spec:
      containers:
      - name: app
        image: busybox
        args:
        - /bin/sh
        - -c
        - sleep 10;touch /tmp/health-check;sleep 30000
        readinessProbe:
          exec:
            command:
            - cat
            - /tmp/health-check
          initialDelaySeconds: 10
          periodSeconds: 5
```
```yaml
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 8
  template:
    metadata:
      labels:
        run: app
    spec:
      containers:
      - name: app
        image: busybox
        args:
        - /bin/sh
        - -c
        - sleep 3000
        readinessProbe:
          exec:
            command:
            - cat
            - /tmp/health-check
          initialDelaySeconds: 10
          periodSeconds: 5
```
Apply app.v1.yaml:
```shell
[root@k8s-m health-check]# kubectl apply -f app.v1.yaml --record
deployment.apps/app created
[root@k8s-m health-check]# kubectl get pod
NAME                  READY   STATUS    RESTARTS   AGE
app-844b9b5bf-9nnrb   1/1     Running   0          2m52s
app-844b9b5bf-b8tw2   1/1     Running   0          2m52s
app-844b9b5bf-j2n9c   1/1     Running   0          2m52s
app-844b9b5bf-ml8c5   1/1     Running   0          2m52s
app-844b9b5bf-mtgr9   1/1     Running   0          2m52s
app-844b9b5bf-n4dn8   1/1     Running   0          2m52s
app-844b9b5bf-ppzh6   1/1     Running   0          2m52s
app-844b9b5bf-z55d4   1/1     Running   0          2m52s
```
Update to app.v2.yaml:
```shell
[root@k8s-m health-check]# kubectl apply -f app.v2.yaml --record
deployment.apps/app configured
[root@k8s-m health-check]# kubectl get pod
NAME                  READY   STATUS              RESTARTS   AGE
app-844b9b5bf-9nnrb   1/1     Running             0          3m30s
app-844b9b5bf-b8tw2   1/1     Running             0          3m30s
app-844b9b5bf-j2n9c   1/1     Running             0          3m30s
app-844b9b5bf-ml8c5   1/1     Terminating         0          3m30s
app-844b9b5bf-mtgr9   1/1     Running             0          3m30s
app-844b9b5bf-n4dn8   1/1     Running             0          3m30s
app-844b9b5bf-ppzh6   1/1     Terminating         0          3m30s
app-844b9b5bf-z55d4   1/1     Running             0          3m30s
app-cd49b84-bxvtc     0/1     ContainerCreating   0          6s
app-cd49b84-gkkj8     0/1     ContainerCreating   0          6s
app-cd49b84-jfzcm     0/1     ContainerCreating   0          6s
app-cd49b84-xl8ws     0/1     ContainerCreating   0          6s
```
Observe again a little later:
```shell
[root@k8s-m health-check]# kubectl get pod
NAME                  READY   STATUS    RESTARTS   AGE
app-844b9b5bf-9nnrb   1/1     Running   0          4m59s
app-844b9b5bf-b8tw2   1/1     Running   0          4m59s
app-844b9b5bf-j2n9c   1/1     Running   0          4m59s
app-844b9b5bf-mtgr9   1/1     Running   0          4m59s
app-844b9b5bf-n4dn8   1/1     Running   0          4m59s
app-844b9b5bf-z55d4   1/1     Running   0          4m59s
app-cd49b84-bxvtc     0/1     Running   0          95s
app-cd49b84-gkkj8     0/1     Running   0          95s
app-cd49b84-jfzcm     0/1     Running   0          95s
app-cd49b84-xl8ws     0/1     Running   0          95s
```
At this point every pod's STATUS is Running, but 4 pods still show READY as 0/1. Take another look:
```shell
[root@k8s-m health-check]# kubectl get deployment app
NAME   DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
app    8         10        4            6           7m20s
```
DESIRED is the desired number of replicas, 8; CURRENT is the current number of replicas, 10; UP-TO-DATE is the number of updated replicas, 4; AVAILABLE is the number of available replicas, 6. If nothing is changed, this state will persist indefinitely. Note that during this Rolling Update, 2 old replicas were deleted and 4 new replicas were created; we will come back to this at the end.
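These counts follow from the default rolling-update parameters (25% maxSurge, 25% maxUnavailable, discussed further in 4.2). A quick sketch of the arithmetic for 8 replicas:

```shell
#!/bin/sh
# Rolling-update pod counts for 8 replicas with the default 25% settings.
replicas=8
max_surge=$(( (replicas * 25 + 99) / 100 ))   # percentage rounds up   -> 2
max_unavailable=$(( replicas * 25 / 100 ))    # percentage rounds down -> 2

echo "max total pods:      $(( replicas + max_surge ))"        # 10 (CURRENT)
echo "min available pods:  $(( replicas - max_unavailable ))"  # 6  (AVAILABLE)
echo "old pods deleted:    $max_unavailable"                   # 2
echo "new pods created:    $(( max_surge + max_unavailable ))" # 4  (UP-TO-DATE)
```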
Roll back to v1:
```shell
[root@k8s-m health-check]# kubectl rollout history deployment app
deployment.extensions/app
REVISION  CHANGE-CAUSE
1         kubectl apply --filename=app.v1.yaml --record=true
2         kubectl apply --filename=app.v2.yaml --record=true
[root@k8s-m health-check]# kubectl rollout undo deployment app --to-revision=1
deployment.extensions/app
[root@k8s-m health-check]# kubectl get pod
NAME                  READY   STATUS    RESTARTS   AGE
app-844b9b5bf-8qqhk   1/1     Running   0          2m37s
app-844b9b5bf-9nnrb   1/1     Running   0          18m
app-844b9b5bf-b8tw2   1/1     Running   0          18m
app-844b9b5bf-j2n9c   1/1     Running   0          18m
app-844b9b5bf-mtgr9   1/1     Running   0          18m
app-844b9b5bf-n4dn8   1/1     Running   0          18m
app-844b9b5bf-pqpm5   1/1     Running   0          2m37s
app-844b9b5bf-z55d4   1/1     Running   0          18m
```
4.1 Liveness and Readiness are two different Health Check mechanisms in Kubernetes. They are very similar yet distinct, and can be used together or separately; the specific differences were covered above.
4.2 In the previous article on Rolling Update I mentioned the replacement rules applied during a rolling update; in this article we again used the default behavior. The maxSurge and maxUnavailable parameters determine how many replicas exist in each state during the update, and both default to 25%. During the update, total replicas = 8 + 8 × 0.25 = 10, and available replicas = 8 − 8 × 0.25 = 6. In this process, 2 old replicas were destroyed and 4 new replicas were created.
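These defaults can also be made explicit (or tuned) in the Deployment spec. A sketch, assuming the 8-replica app Deployment used above:

```yaml
spec:
  replicas: 8
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # ceil(8 * 0.25) = 2 extra pods allowed -> 10 total
      maxUnavailable: 25%  # floor(8 * 0.25) = 2 pods may be down  -> 6 available
```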
4.3 When going live in a production environment, use Health Check whenever possible to make sure the business is not affected. There are many ways to implement it; summarize and choose based on your actual situation.
5.2 Official documentation: maxSurge and maxUnavailable
5.3 Code used in this article