Enabling Preemption in Kubernetes

Date: 2019-11-08


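Pod priority and preemption

Pod priority and preemption was introduced in Kubernetes v1.8, reached beta in v1.11, and went GA in v1.14, so it is now a mature feature.

As the name suggests, the feature divides applications into priority levels and gives resources to higher-priority applications first. This improves resource utilization while protecting the quality of service of high-priority workloads.

Let's start with a quick walkthrough of Pod priority and preemption.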

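The cluster runs v1.14, so the PodPriority feature is enabled by default. Using preemption takes two steps:

Define PriorityClasses. Each PriorityClass has its own value; the larger the value, the higher the priority.

Create a Pod and set its priorityClassName field to the desired PriorityClass.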
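
As a minimal sketch of the second step, the manifest below shows where priorityClassName sits in a Pod spec. The Pod name demo, the container name app, and the nginx image are placeholders chosen for illustration, and the sketch assumes a PriorityClass named high-priority already exists (one is created in the next section).

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo                  # hypothetical name, for illustration only
spec:
  # priorityClassName is a Pod-level field, a sibling of containers
  priorityClassName: high-priority
  containers:
  - name: app                 # hypothetical container name
    image: nginx
```

If the named PriorityClass does not exist, the Pod is rejected at admission, so the class has to be created first.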

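Creating the PriorityClasses

As shown below, first create two PriorityClasses, high-priority and low-priority, with values 1000000 and 10 respectively.

Note that globalDefault is set to true for low-priority, which makes it the cluster's default PriorityClass: any Pod without a priorityClassName field gets low-priority's value of 10 as its priority. A cluster can have only one default PriorityClass. If no default PriorityClass is set, Pods without a priorityClassName have priority 0.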

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "for high priority pod"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 10
globalDefault: true
description: "for low priority pod"

After creating them, list the PriorityClasses currently in the cluster.

kubectl get priorityclasses.scheduling.k8s.io
NAME                      VALUE        GLOBAL-DEFAULT   AGE
high-priority             1000000      false            47m
low-priority              10           true             47m
system-cluster-critical   2000000000   false            254d
system-node-critical      2000001000   false            254d

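Besides the two PriorityClasses created above, the cluster ships with two built-in classes, system-cluster-critical and system-node-critical, which are used for high-priority system workloads.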
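
To illustrate the globalDefault behavior, here is roughly what a Pod created without priorityClassName looks like when read back from the API server, for example with kubectl get pod -o yaml. This is an abridged sketch rather than captured output: the Pod and container names are hypothetical, and it assumes the Priority admission controller is enabled, which is the default in v1.14.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: no-class-demo               # hypothetical Pod created without priorityClassName
spec:
  priorityClassName: low-priority   # filled in from the globalDefault class
  priority: 10                      # integer value resolved at admission time
  containers:
  - name: app                       # hypothetical container name
    image: nginx
```

The scheduler works with the resolved spec.priority integer. Adding or changing the default PriorityClass later does not alter Pods that already exist.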

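Setting the Pod's priorityClassName

To make verification easier, this example uses an extended resource: node x1 is given a capacity of 1 for the extended resource example.com/foo.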

curl -k --header "Authorization: Bearer ${token}" --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data '[{"op": "add", "path": "/status/capacity/example.com~1foo", "value": "1"}]' \
https://{apiServerIP}:{apiServerPort}/api/v1/nodes/x1/status

Check the capacity and allocatable values of x1; the node now has 1 unit of example.com/foo.

Capacity:
 cpu:                2
 example.com/foo:    1
 hugepages-2Mi:      0
 memory:             4040056Ki
 pods:               110
Allocatable:
 cpu:                2
 example.com/foo:    1
 hugepages-2Mi:      0
 memory:             3937656Ki
 pods:               110

First, create a Deployment named nginx that requests 1 unit of example.com/foo. Because no priorityClassName is set, its Pod gets priority 10 from the default low-priority class.

template:
    spec:
      containers:
      - image: nginx
        imagePullPolicy: Always
        name: nginx
        resources:
          limits:
            example.com/foo: "1"
          requests:
            example.com/foo: "1"

Then create a Deployment named debian, which requests no example.com/foo (the request is set to 0).

  template:
    spec:
      containers:
      - args:
        - bash
        image: debian
        name: debian
        resources:
          limits:
            example.com/foo: "0"
          requests:
            example.com/foo: "0"
      # Pod-level field (sibling of containers), not a container field
      priorityClassName: high-priority

At this point both Pods start normally.

Triggering preemption

Now change the example.com/foo request of the debian Deployment to 1, with priorityClassName set to high-priority.

  template:
    spec:
      containers:
      - args:
        - bash
        image: debian
        name: debian
        resources:
          limits:
            example.com/foo: "1"
          requests:
            example.com/foo: "1"
      # Pod-level field (sibling of containers), not a container field
      priorityClassName: high-priority
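
The fragments above omit the surrounding Deployment fields. For reference, a complete manifest might look like the sketch below. The app: debian label, replicas: 1, the Recreate strategy (suggested by the watch output later in this section), and the stdin/tty settings that keep bash running are all assumptions, not taken from the original manifest.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: debian
spec:
  replicas: 1                    # assumed
  strategy:
    type: Recreate               # assumed from the observed Pod transitions
  selector:
    matchLabels:
      app: debian                # assumed label
  template:
    metadata:
      labels:
        app: debian
    spec:
      priorityClassName: high-priority
      containers:
      - name: debian
        image: debian
        args:
        - bash
        stdin: true              # assumed, keeps bash from exiting immediately
        tty: true                # assumed
        resources:
          limits:
            example.com/foo: "1"
          requests:
            example.com/foo: "1"
```

For extended resources, requests and limits must be equal when both are given, which is why both are set to "1".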

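Because x1 is the only node in the cluster with example.com/foo (1 unit) and debian has the higher priority, the scheduler now starts preempting. The observed Pod transitions are shown below.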

kubectl get pods -o wide -w
NAME                      READY   STATUS    AGE     IP             NODE       NOMINATED NODE
debian-55d94c54cb-pdfmd   1/1     Running   3m53s   10.244.4.178   x201       <none>
nginx-58dc57fbff-g5fph    1/1     Running   2m4s    10.244.3.28    x1         <none>
// The debian Deployment starts to Recreate
debian-55d94c54cb-pdfmd   1/1     Terminating   4m49s   10.244.4.178   x201       <none>
debian-55d94c54cb-pdfmd   0/1     Terminating   5m21s   10.244.4.178   x201       <none>
debian-55d94c54cb-pdfmd   0/1     Terminating   5m22s   10.244.4.178   x201       <none>
debian-55d94c54cb-pdfmd   0/1     Terminating   5m22s   10.244.4.178   x201       <none>
// example.com/foo cannot be satisfied, so the new Pod is blocked
debian-5bc46885dd-rvtwv   0/1     Pending       0s      <none>         <none>     <none>
debian-5bc46885dd-rvtwv   0/1     Pending       0s      <none>         <none>     <none>
// The scheduler determines that evicting the Pod on x1 would satisfy the debian Pod, and sets NOMINATED NODE to x1
debian-5bc46885dd-rvtwv   0/1     Pending       0s      <none>         <none>     x1
// The scheduler starts evicting the nginx Pod
nginx-58dc57fbff-g5fph    1/1     Terminating   3m33s   10.244.3.28    x1         <none>
// The replacement nginx Pod has to wait. Its priority is low, so there is nothing it can do.
nginx-58dc57fbff-29rzw    0/1     Pending       0s      <none>         <none>     <none>
nginx-58dc57fbff-29rzw    0/1     Pending       0s      <none>         <none>     <none>
// Graceful termination period: the old nginx Pod exits cleanly
nginx-58dc57fbff-g5fph    0/1     Terminating   3m34s   10.244.3.28    x1         <none>
nginx-58dc57fbff-g5fph    0/1     Terminating   3m37s   10.244.3.28    x1         <none>
nginx-58dc57fbff-g5fph    0/1     Terminating   3m37s   10.244.3.28    x1         <none>
// The debian Pod is bound to node x1
debian-5bc46885dd-rvtwv   0/1     Pending       5s      <none>         x1         x1
// The resource has been preempted; the Pod starts
debian-5bc46885dd-rvtwv   0/1     ContainerCreating   5s      <none>         x1         <none>
debian-5bc46885dd-rvtwv   1/1     Running             14s     10.244.3.29    x1         <none>

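The gentleman: non-preempting PriorityClasses

Kubernetes v1.15 added a preemptionPolicy field to PriorityClass (alpha in that release, behind the NonPreemptingPriority feature gate). When it is set to Never, Pods of that class do not preempt lower-priority Pods. They are only placed ahead of lower-priority Pods in the scheduling queue, according to the PriorityClass value.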

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never
globalDefault: false

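That is why I call this kind of PriorityClass a "gentleman": it quietly waits in line on its own merit (its priority) and never grabs resources from others. The official documentation gives data science workloads as a fitting example.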
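
A Pod opts into this behavior simply by referencing the class. The sketch below reuses the example.com/foo extended resource from earlier; the Pod name, container name, and sleep command are hypothetical placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: patient-job             # hypothetical name
spec:
  priorityClassName: high-priority-nonpreempting
  containers:
  - name: worker                # hypothetical container name
    image: debian
    args: ["sleep", "3600"]     # placeholder workload
    resources:
      limits:
        example.com/foo: "1"
```

If example.com/foo is already held by a lower-priority Pod, this Pod stays Pending until the resource is freed instead of evicting the holder. It can still be preempted itself by Pods with a higher priority.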

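Compared with Cluster Autoscaler

When a cloud-hosted Kubernetes cluster runs short of resources, Cluster Autoscaler can scale out automatically: it requests more nodes from the cloud provider and adds them to the cluster, which provides more resources.

However, this approach has some shortcomings: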