[TOC]
A Pod's container health is checked through two kinds of probes: the LivenessProbe (liveness probing) and the ReadinessProbe (readiness probing).
## LivenessProbe

The LivenessProbe determines whether a container is healthy (in the Running state) and reports the result to the kubelet. Many applications gradually degrade into an unusable state after running for a long time and can only be recovered by a restart; Kubernetes' liveness probing mechanism detects such problems and, based on the probe result combined with the restart policy, triggers the appropriate action. Liveness probing is configured at the container level, and the kubelet uses it to decide when a container needs to be restarted. If a container does not define a LivenessProbe, the kubelet treats the probe result as always being Success. Liveness probing is enabled by defining a probe for the corresponding container in the Pod spec's container list. Kubernetes currently supports three liveness check methods: ExecAction, TCPSocketAction, and HTTPGetAction.
## ReadinessProbe

The ReadinessProbe determines whether a container's service is available (in the Ready state); only Pods that have reached the Ready state can receive requests. For Pods managed by a Service, the association between the Service and the Pod Endpoints is also maintained based on whether the Pod is Ready. After a Pod starts, the containerized application usually needs some time to complete its initialization, such as loading configuration or data, and some programs even need to run some kind of warm-up; accepting client requests before this phase completes would force clients to wait too long and hurt the user experience. A Pod should therefore not handle client requests immediately after starting, but should wait until initialization has finished and the container has turned Ready — especially when other Pods provide the same service. If the Ready condition becomes False at runtime, the system automatically removes the Pod from the Service's backend Endpoint list and adds it back once it returns to Ready. This guarantees that client requests to the Service are never forwarded to an unavailable Pod instance.
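Although the manifests later in this article all configure liveness probes, a readinessProbe is declared the same way at the container level. The following is a minimal sketch of my own (the Pod name, path, and timings are illustrative, not from the article):

```yaml
# Illustration (not from the article): a Pod whose container is reported
# Ready only once an HTTP GET against / succeeds.
apiVersion: v1
kind: Pod
metadata:
  name: readiness-demo        # hypothetical name
spec:
  containers:
  - name: readiness-demo
    image: nginx:1.12-alpine
    ports:
    - name: http
      containerPort: 80
    readinessProbe:
      httpGet:
        path: /
        port: http
      initialDelaySeconds: 5  # wait for initialization before the first check
      periodSeconds: 10
```

Until the probe succeeds, a Service selecting this Pod will keep it out of its Endpoint list.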
Both the LivenessProbe and the ReadinessProbe can be implemented with any of the following three probe types; see the official documentation: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
## ExecAction

An ExecAction probe determines container health by running a user-defined command inside the target container: if the command's exit code is 0, the container is considered healthy. The spec.containers.livenessProbe.exec field defines this kind of check; its only property, command, specifies the command to run. Below is a manifest using the liveness-exec approach:
```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness-exec
  name: liveness-exec
spec:
  containers:
  - name: liveness-demo
    image: busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 60; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - test
        - -e
        - /tmp/healthy
      initialDelaySeconds: 15
      timeoutSeconds: 1
```
The manifest above defines a Pod that starts a busybox container running "touch /tmp/healthy; sleep 60; rm -rf /tmp/healthy; sleep 600": it creates the file /tmp/healthy at container startup and deletes it 60 seconds later. The liveness probe runs "test -e /tmp/healthy" to check for the file's existence; while the file exists, the command returns exit code 0 and the check passes. Create the resource, then view the details with kubectl describe pods liveness-exec:
```
Containers:
  liveness-demo:
    Container ID:  docker://a2974585905bdeef4ab39ba9a87bf710a61beae5180f31907ba33c8725c0bf79
    Image:         busybox
    Image ID:      docker-pullable://busybox@sha256:895ab622e92e18d6b461d671081757af7dbaa3b00e3e28e12505af7817f73649
    Port:          <none>
    Host Port:     <none>
    Args:
      /bin/sh
      -c
      touch /tmp/healthy; sleep 60; rm -rf /tmp/healthy; sleep 600
    State:          Running
      Started:      Wed, 14 Aug 2019 11:26:09 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Wed, 14 Aug 2019 11:24:10 +0800
      Finished:     Wed, 14 Aug 2019 11:26:08 +0800
    Ready:          True
    Restart Count:  2
    Liveness:       exec [test -e /tmp/healthy] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
......
Events:
  Type     Reason     Age                  From                  Message
  ----     ------     ----                 ----                  -------
  Normal   Scheduled  2m26s                default-scheduler     Successfully assigned default/liveness-exec to 172.16.1.66
  Warning  Unhealthy  57s (x3 over 77s)    kubelet, 172.16.1.66  Liveness probe failed:
  Normal   Pulling    27s (x2 over 2m23s)  kubelet, 172.16.1.66  pulling image "busybox"
  Normal   Pulled     27s (x2 over 2m23s)  kubelet, 172.16.1.66  Successfully pulled image "busybox"
  Normal   Killing    27s                  kubelet, 172.16.1.66  Killing container with id docker://liveness-demo:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Created    26s (x2 over 2m22s)  kubelet, 172.16.1.66  Created container
  Normal   Started    25s (x2 over 2m21s)  kubelet, 172.16.1.66  Started container
```
The output clearly shows how the container's health state changed: the container is currently Running, but its previous state was Terminated with exit code 137, which indicates the process was killed by an external signal. The value 137 is in fact the sum of two numbers, 128 + signum, where signum is the number of the signal that terminated the process; 9 identifies SIGKILL, meaning the process was forcibly killed. After the container restarts, querying again shows it running normally until the file is deleted once more, the liveness probe fails, and the container is restarted again. As the result below shows, the Pod named liveness-exec restarted 5 times within 10 minutes:
```
[root@master01 demo]# kubectl get pods liveness-exec
NAME            READY   STATUS    RESTARTS   AGE
liveness-exec   1/1     Running   5          10m
```
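The 128 + signum arithmetic can be verified outside Kubernetes. The snippet below is my own illustration, not from the article: a child shell kills itself with SIGKILL (signal 9), and its parent observes exit status 128 + 9 = 137, the same code the kubelet reported above.

```shell
# Illustration (not from the article): a process terminated by SIGKILL
# exits with status 128 + 9 = 137.
sh -c 'kill -9 $$'        # the child shell sends SIGKILL to itself
echo "exit status: $?"    # prints: exit status: 137
```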
Note that the command specified by exec runs inside the container and consumes the container's resource quota; for the sake of probe efficiency, the command should be kept as simple and lightweight as possible.
## HTTPGetAction

An HTTPGetAction probe issues an HTTP GET request against the container's IP address, port, and path; if the response status code is at least 200 and below 400, the container is considered healthy. The spec.containers.livenessProbe.httpGet field defines this kind of check, and its available configuration fields include the following:
- host: the host to request; defaults to the Pod IP. A host can also be set via a Host: header in httpHeaders.
- port: the port to connect to (required).
- httpHeaders: custom headers to set in the request.
- path: the URL path to request.
- scheme: the protocol used for the connection, HTTP or HTTPS; defaults to HTTP.
Below is a manifest using the liveness-http approach; the postStart hook under lifecycle creates a page file, healthz, dedicated to the httpGet test:

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
  - name: liveness-demo
    image: nginx:1.12-alpine
    ports:
    - name: http
      containerPort: 80
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/sh
          - -c
          - 'echo Healty > /usr/share/nginx/html/healthz'
    livenessProbe:
      httpGet:
        path: /healthz
        port: http
        scheme: HTTP
      initialDelaySeconds: 30
      timeoutSeconds: 1
```
In the httpGet test defined above, the requested resource path is /healthz, the host defaults to the Pod IP, and the port references the container port named http — one of the uses of explicitly naming a container's exposed ports. kubectl describe pods liveness-http shows the container running normally and the health check passing:
```
Containers:
  liveness-demo:
    Container ID:   docker://bf05e0a9e6e1ac95f67b91f0b167b9fc2e3ad0bd0ffa4336debcc4b3c24978a7
    Image:          nginx:1.12-alpine
    Image ID:       docker-pullable://nginx@sha256:3a7edf11b0448f171df8f4acac8850a55eff30d1d78c46cd65e7bc8260b0be5d
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 14 Aug 2019 12:01:31 +0800
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:http/healthz delay=30s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-g7ls6 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-g7ls6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-g7ls6
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age    From                  Message
  ----    ------     ----   ----                  -------
  Normal  Scheduled  6m51s  default-scheduler     Successfully assigned default/liveness-http to 172.16.1.66
  Normal  Pulling    6m48s  kubelet, 172.16.1.66  pulling image "nginx:1.12-alpine"
  Normal  Pulled     6m43s  kubelet, 172.16.1.66  Successfully pulled image "nginx:1.12-alpine"
  Normal  Created    6m43s  kubelet, 172.16.1.66  Created container
  Normal  Started    6m42s  kubelet, 172.16.1.66  Started container
```
Delete the healthz test page created by the postStart hook with kubectl exec:

```shell
kubectl exec liveness-http rm /usr/share/nginx/html/healthz
```
Inspecting the resource details again, the event output shows that the probe failed and the container was killed and recreated:
```
......
Events:
  Type     Reason     Age               From                  Message
  ----     ------     ----              ----                  -------
  Normal   Scheduled  12m               default-scheduler     Successfully assigned default/liveness-http to 172.16.1.66
  Normal   Pulling    12m               kubelet, 172.16.1.66  pulling image "nginx:1.12-alpine"
  Normal   Pulled     12m               kubelet, 172.16.1.66  Successfully pulled image "nginx:1.12-alpine"
  Warning  Unhealthy  3s (x3 over 23s)  kubelet, 172.16.1.66  Liveness probe failed: HTTP probe failed with statuscode: 404
  Normal   Created    2s (x2 over 12m)  kubelet, 172.16.1.66  Created container
  Normal   Killing    2s                kubelet, 172.16.1.66  Killing container with id docker://liveness-demo:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Pulled     2s                kubelet, 172.16.1.66  Container image "nginx:1.12-alpine" already present on machine
  Normal   Started    1s (x2 over 12m)  kubelet, 172.16.1.66  Started container
```
通常來講,HTTP
類型的探測操做應該針對專用的URL路徑進行,例如,示例中爲其準備的/healthz
,另外,此URL路徑對應的web資源應該以輕量化的方式在內部對應用程序的各關鍵組件進行全面檢測以確保它們能夠正常向客戶端提供完整的服務。 這種檢測方式僅對分層架構中的當前一層有效,例如,它能檢測應用程序工做正常與否的狀態,但重啓操做卻沒法解決其後端服務(如數據庫或緩存服務)致使的故障,此時,容器可能會被一次次重啓,直到後端服務恢復正常爲止。
## TCPSocketAction

A TCPSocketAction probe performs a TCP check against the container's IP address and port: if a TCP connection can be established, the container is considered healthy. Compared with HTTP-based probing it is more efficient and uses fewer resources, but it is also less precise — after all, a successful connection does not necessarily mean a page resource is available. The spec.containers.livenessProbe.tcpSocket field defines this kind of check; it has two main properties:

- host: the IP address to connect to; defaults to the Pod IP.
- port: the port to connect to (required).
Below is a manifest using the liveness-tcp approach; it opens a connection to port 80/tcp at the Pod IP and judges the test result by whether the connection is established:

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-tcp
spec:
  containers:
  - name: liveness-tcp-demo
    image: nginx:1.12-alpine
    ports:
    - name: http
      containerPort: 80
    livenessProbe:
      tcpSocket:
        port: 80
      initialDelaySeconds: 30
      timeoutSeconds: 1
```
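What the kubelet does for a tcpSocket check boils down to "can a TCP connection to host:port be opened". The hedged snippet below is my own illustration, not from the article; it uses bash's /dev/tcp redirection, and 127.0.0.1:80 is a placeholder target, so the printed result depends on whether anything is listening there.

```shell
# Illustration (not from the article): succeed iff a TCP connection to the
# target can be opened within 1 second, like a tcpSocket probe.
if timeout 1 bash -c 'exec 3<>/dev/tcp/127.0.0.1/80' 2>/dev/null; then
  echo healthy      # connection established
else
  echo unhealthy    # connection refused or timed out
fi
```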
kubectl describe pods liveness-tcp shows the container running normally and the health check passing:
```
Containers:
  liveness-tcp-demo:
    Container ID:   docker://816b27781aeb384e1305e0a5badebd5ea21ea98c834e62179cd1dac2a704ccd7
    Image:          nginx:1.12-alpine
    Image ID:       docker-pullable://nginx@sha256:3a7edf11b0448f171df8f4acac8850a55eff30d1d78c46cd65e7bc8260b0be5d
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 14 Aug 2019 12:29:12 +0800
    Ready:          True
    Restart Count:  0
    Liveness:       tcp-socket :80 delay=30s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-g7ls6 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-g7ls6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-g7ls6
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age  From                  Message
  ----    ------     ---- ----                  -------
  Normal  Scheduled  62s  default-scheduler     Successfully assigned default/liveness-tcp to 172.16.1.66
  Normal  Pulled     60s  kubelet, 172.16.1.66  Container image "nginx:1.12-alpine" already present on machine
  Normal  Created    60s  kubelet, 172.16.1.66  Created container
  Normal  Started    59s  kubelet, 172.16.1.66  Started container
```
## Probe configuration attributes

When kubectl describe is used to view an object configured with a liveness or readiness probe, the output contains a line like the following:
```
Liveness: exec [test -e /tmp/healthy] delay=15s timeout=1s period=10s #success=1 #failure=3
```
It shows the probe type together with its additional configuration attributes — delay, timeout, period, success, and failure — and their values. When the user does not explicitly define these fields, their defaults apply. They can be set through the following fields under spec.containers.livenessProbe:
- initialDelaySeconds &lt;integer&gt;: the liveness probe's initial delay, i.e., how long after container startup the first probe runs; shown as the delay attribute. Defaults to 0 seconds, meaning probing begins as soon as the container starts.
- timeoutSeconds &lt;integer&gt;: the probe timeout; shown as the timeout attribute. Defaults to 1s, which is also the minimum.
- periodSeconds &lt;integer&gt;: the probe frequency; shown as the period attribute. Defaults to 10s, minimum 1s. Too high a frequency adds considerable overhead to the Pod, while too low a frequency reacts to failures too slowly.
- successThreshold &lt;integer&gt;: when in a failed state, the minimum number of consecutive successful probes required for the check to count as passed; shown as the #success attribute. Defaults to 1, which is also the minimum.
- failureThreshold &lt;integer&gt;: when in a successful state, the minimum number of consecutive failed probes required for the check to count as failed; shown as the #failure attribute. Defaults to 3, minimum 1.

## Pod Readiness Gates

Kubernetes' ReadinessProbe mechanism may not be able to express how some complex applications decide that the service inside a container is available, so starting with version 1.11 Kubernetes introduced the Pod Ready++ feature to extend the readiness mechanism; it reached GA in version 1.14 and is called Pod Readiness Gates. Through the Pod Readiness Gates mechanism, users can attach custom readiness conditions to a Pod to help Kubernetes decide when the Pod has reached the service-available (Ready) state. For a custom readiness condition to take effect, the user must provide an external controller that sets the corresponding Condition status. A Pod's readiness gates are declared in the readinessGates field of the Pod definition; the following example sets a new readiness gate with the condition type www.example.com/feature-1:
```yaml
Kind: Pod
......
spec:
  readinessGates:
  - conditionType: "www.example.com/feature-1"
status:
  conditions:
  - type: Ready                        # Kubernetes' built-in Condition named Ready
    status: "True"
    lastProbeTime: null
    lastTransitionTime: 2018-01-01T00:00:00Z
  - type: "www.example.com/feature-1"  # user-defined Condition
    status: "False"
    lastProbeTime: null
    lastTransitionTime: 2018-01-01T00:00:00Z
  containerStatuses:
  - containerID: docker://abcd...
    ready: true
......
```
The status of the newly added custom Condition is set by the user's external controller and defaults to False. Kubernetes sets the Pod to the service-available state (Ready is True) only when all readinessGates conditions are True.

Reference: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-readiness-gate