Learn Knative with Me (4): Serving Autoscaling

Date: 2020-07-04

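Knative Serving uses a single shared autoscaler. By default this is the Knative Pod Autoscaler (KPA), which provides fast, request-based autoscaling out of the box.

You can also configure Knative to use the Horizontal Pod Autoscaler (HPA) or a custom autoscaler.

KPA configuration

The KPA configuration lives in the config-autoscaler ConfigMap in the knative-serving namespace. Run the following command to view its default contents: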

kubectl -n knative-serving describe cm config-autoscaler

The default configuration is:

====
_example:
----
################################
#                              #
#    EXAMPLE CONFIGURATION     #
#                              #
################################

# This block is not actually functional configuration,
# but serves to illustrate the available configuration
# options and document them in a way that is accessible
# to users that `kubectl edit` this config map.
#
# These sample configuration options may be copied out of
# this example block and unindented to be in the data block
# to actually change the configuration.

# The Revision ContainerConcurrency field specifies the maximum number
# of requests the Container can handle at once. Container concurrency
# target percentage is how much of that maximum to use in a stable
# state. E.g. if a Revision specifies ContainerConcurrency of 10, then
# the Autoscaler will try to maintain 7 concurrent connections per pod
# on average.
# Note: this limit will be applied to container concurrency set at every
# level (ConfigMap, Revision Spec or Annotation).
# For legacy and backwards compatibility reasons, this value also accepts
# fractional values in (0, 1] interval (i.e. 0.7 ⇒ 70%).
# Thus minimal percentage value must be greater than 1.0, or it will be
# treated as a fraction.
# NOTE: that this value does not affect actual number of concurrent requests
#       the user container may receive, but only the average number of requests
#       that the revision pods will receive.
container-concurrency-target-percentage: "70"

# The container concurrency target default is what the Autoscaler will
# try to maintain when concurrency is used as the scaling metric for the
# Revision and the Revision specifies unlimited concurrency.
# When revision explicitly specifies container concurrency, that value
# will be used as a scaling target for autoscaler.
# When specifying unlimited concurrency, the autoscaler will
# horizontally scale the application based on this target concurrency.
# This is what we call "soft limit" in the documentation, i.e. it only
# affects number of pods and does not affect the number of requests
# individual pod processes.
# The value must be a positive number such that the value multiplied
# by container-concurrency-target-percentage is greater than 0.01.
# NOTE: that this value will be adjusted by application of
#       container-concurrency-target-percentage, i.e. by default
#       the system will target on average 70 concurrent requests
#       per revision pod.
# NOTE: Only one metric can be used for autoscaling a Revision.
container-concurrency-target-default: "100"

# The requests per second (RPS) target default is what the Autoscaler will
# try to maintain when RPS is used as the scaling metric for a Revision and
# the Revision specifies unlimited RPS. Even when specifying unlimited RPS,
# the autoscaler will horizontally scale the application based on this
# target RPS.
# Must be greater than 1.0.
# NOTE: Only one metric can be used for autoscaling a Revision.
requests-per-second-target-default: "200"

# The target burst capacity specifies the size of burst in concurrent
# requests that the system operator expects the system will receive.
# Autoscaler will try to protect the system from queueing by introducing
# Activator in the request path if the current spare capacity of the
# service is less than this setting.
# If this setting is 0, then Activator will be in the request path only
# when the revision is scaled to 0.
# If this setting is > 0 and container-concurrency-target-percentage is
# 100% or 1.0, then activator will always be in the request path.
# -1 denotes unlimited target-burst-capacity and activator will always
# be in the request path.
# Other negative values are invalid.
target-burst-capacity: "200"

# When operating in a stable mode, the autoscaler operates on the
# average concurrency over the stable window.
# Stable window must be in whole seconds.
stable-window: "60s"

# When observed average concurrency during the panic window reaches
# panic-threshold-percentage the target concurrency, the autoscaler
# enters panic mode. When operating in panic mode, the autoscaler
# scales on the average concurrency over the panic window which is
# panic-window-percentage of the stable-window.
# When computing the panic window it will be rounded to the closest
# whole second.
panic-window-percentage: "10.0"

# The percentage of the container concurrency target at which to
# enter panic mode when reached within the panic window.
panic-threshold-percentage: "200.0"

# Max scale up rate limits the rate at which the autoscaler will
# increase pod count. It is the maximum ratio of desired pods versus
# observed pods.
# Cannot be less or equal to 1.
# I.e with value of 2.0 the number of pods can at most go N to 2N
# over single Autoscaler period (see tick-interval), but at least N to
# N+1, if Autoscaler needs to scale up.
max-scale-up-rate: "1000.0"

# Max scale down rate limits the rate at which the autoscaler will
# decrease pod count. It is the maximum ratio of observed pods versus
# desired pods.
# Cannot be less or equal to 1.
# I.e. with value of 2.0 the number of pods can at most go N to N/2
# over single Autoscaler evaluation period (see tick-interval), but at
# least N to N-1, if Autoscaler needs to scale down.
max-scale-down-rate: "2.0"

# Scale to zero feature flag
enable-scale-to-zero: "true"

# Tick interval is the time between autoscaling calculations.
tick-interval: "2s"

# Scale to zero grace period is the time an inactive revision is left
# running before it is scaled to zero (min: 6s).
scale-to-zero-grace-period: "30s"

# Enable graceful scaledown feature flag.
# Once enabled, it allows the autoscaler to prioritize pods processing
# fewer (or zero) requests for removal when scaling down.
enable-graceful-scaledown: "false"

# pod-autoscaler-class specifies the default pod autoscaler class
# that should be used if none is specified. If omitted, the Knative
# Horizontal Pod Autoscaler (KPA) is used by default.
pod-autoscaler-class: "kpa.autoscaling.knative.dev"

# The capacity of a single activator task.
# The `unit` is one concurrent request proxied by the activator.
# activator-capacity must be at least 1.
# This value is used for computation of the Activator subset size.
# See the algorithm here: http://bit.ly/38XiCZ3.
# TODO(vagababov): tune after actual benchmarking.
activator-capacity: "100.0"

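Next, let's go through what each configuration option means.

enable-scale-to-zero: Set this to true if you need scale to zero. It is enabled by default.
scale-to-zero-grace-period: How long an inactive revision is left running before it is scaled to zero (minimum: 6s). Defaults to 30s.
stable-window: In stable mode, the autoscaler operates on the average concurrency over the stable window (minimum: 6s). Defaults to 60s. It can also be set per Revision through an annotation in the template, for example autoscaling.knative.dev/window: "60s".
container-concurrency-target-percentage: The share of the container concurrency limit that the autoscaler aims to use in a stable state. Defaults to 70, so a Revision with a ContainerConcurrency of 10 is kept at about 7 concurrent requests per pod on average. Values in the (0, 1] interval are treated as fractions (0.7 means 70%).
activator-capacity: The capacity of a single activator task. The unit is one concurrent request proxied by the activator. The value must be at least 1 and is used to compute the activator subset size.
pod-autoscaler-class: The pod autoscaler class to use when none is specified. If omitted, the Knative Pod Autoscaler (KPA) is used by default.
enable-graceful-scaledown: Feature flag for graceful scale-down. When enabled, the autoscaler prefers to remove pods that are processing fewer (or zero) requests when scaling down. Defaults to false.
tick-interval: The time between autoscaling calculations. Defaults to 2s.
max-scale-down-rate: Limits how fast the autoscaler removes pods. It cannot be less than or equal to 1. With a value of 2.0 and N current pods, the pod count can drop to at most N/2 within a single autoscaler period (see tick-interval), but it drops by at least one pod (N to N-1) if the autoscaler needs to scale down.
max-scale-up-rate: Limits how fast the autoscaler adds pods. It cannot be less than or equal to 1. With a value of 2.0 and N current pods, the pod count can grow to at most 2N within a single autoscaler period (see tick-interval), but it grows by at least one pod (N to N+1) if the autoscaler needs to scale up.
panic-threshold-percentage: The percentage of the container concurrency target that, when reached within the panic window, triggers panic mode. Defaults to 200.0.
panic-window-percentage: The size of the panic window as a percentage of the stable window. When the average concurrency observed during the panic window reaches panic-threshold-percentage of the target concurrency, the autoscaler enters panic mode and scales on the average concurrency over the panic window. The computed panic window is rounded to the nearest whole second. Defaults to 10.0, which gives 6s with a 60s stable window.
target-burst-capacity: The size of the burst, in concurrent requests, that the operator expects the system to receive. If the spare capacity of the service is below this value, the autoscaler puts the Activator into the request path to protect the system from queueing. If the value is 0, the Activator is in the request path only when the revision is scaled to zero. If the value is greater than 0 and container-concurrency-target-percentage is 100% or 1.0, the Activator is always in the request path. -1 means unlimited target burst capacity, and the Activator is always in the request path. Other negative values are invalid.
requests-per-second-target-default: The target the autoscaler tries to maintain when requests per second (RPS) is the scaling metric for a Revision and the Revision specifies unlimited RPS. Even with unlimited RPS, the autoscaler scales the application horizontally based on this target. The value must be greater than 1.0. Note: only one metric can be used for autoscaling a Revision.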
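
As a rough illustration of how these defaults combine, the following Python sketch derives the effective per-pod concurrency target and the panic window from the numbers in the example block above. The dictionary keys and the helper name are made up for this sketch; only the values come from the ConfigMap.

```python
# Numbers copied from the example ConfigMap above; the key names are
# hypothetical and exist only for this sketch.
defaults = {
    "target_default": 100.0,              # container-concurrency-target-default
    "target_percentage": 70.0,            # container-concurrency-target-percentage
    "stable_window_s": 60.0,              # stable-window
    "panic_window_percentage": 10.0,      # panic-window-percentage
    "panic_threshold_percentage": 200.0,  # panic-threshold-percentage
}


def as_fraction(percentage: float) -> float:
    """Values in (0, 1] are read as fractions, anything larger as a percentage."""
    return percentage if percentage <= 1.0 else percentage / 100.0


# Average concurrency per pod the autoscaler aims for: 100 * 70%.
effective_target = defaults["target_default"] * as_fraction(defaults["target_percentage"])

# Panic window: 10% of the stable window, rounded to whole seconds.
panic_window_s = round(
    defaults["stable_window_s"] * defaults["panic_window_percentage"] / 100.0
)

# Per-pod concurrency at which panic mode starts: 200% of the target.
panic_threshold = effective_target * defaults["panic_threshold_percentage"] / 100.0

print(effective_target, panic_window_s, panic_threshold)
```

With the defaults, the autoscaler aims for about 70 concurrent requests per pod and looks at a 6-second panic window.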

Termination period

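The termination period is the time a pod keeps running after its last request completes before it is shut down. It equals the sum of the stable window and the scale-to-zero grace period. In this example that is 60s + 30s = 90 seconds.

Configuring concurrency

The autoscaler's concurrency can be configured in the following ways:

target

target defines how many concurrent requests are desired at a given time (a soft limit). It is the recommended way to configure the autoscaler in Knative.

The default concurrency target in the ConfigMap is 100:

container-concurrency-target-default: "100"

This value can be overridden with the autoscaling.knative.dev/target annotation on the Revision: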

autoscaling.knative.dev/target: "50"

containerConcurrency

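Note: use containerConcurrency only when the application needs a strictly enforced cap on how many requests reach it at a given time.

containerConcurrency limits the number of concurrent requests allowed at a given time (a hard limit) and is configured in the Revision template.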

containerConcurrency: 0 | 1 | 2-N

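1: guarantees that a container instance of the Revision handles only one request at a time;
2-N: limits concurrent requests to the given value of 2 or more;
0: no limit; the system decides.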
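
To make the interplay of the soft and hard limits concrete, here is a simplified Python sketch of how a scaling target could be picked from these settings. The function name and the rule of taking the smaller of the annotation target and containerConcurrency are assumptions of this sketch, loosely based on the ConfigMap comments above; it is not the actual Knative code.

```python
from typing import Optional


def resolve_concurrency_target(
    container_concurrency: int,
    annotation_target: Optional[float] = None,
    default_target: float = 100.0,
    target_fraction: float = 0.7,
) -> float:
    """Return the average per-pod concurrency to scale on (hypothetical helper)."""
    # containerConcurrency 0 means "no hard limit", so fall back to the default.
    if container_concurrency > 0:
        total = float(container_concurrency)
    else:
        total = default_target

    # The target annotation overrides the default, but in this sketch it is
    # assumed never to exceed an explicit hard limit.
    if annotation_target is not None:
        if container_concurrency > 0:
            total = min(annotation_target, float(container_concurrency))
        else:
            total = annotation_target

    # The target percentage is applied at every level, as the ConfigMap
    # comment for container-concurrency-target-percentage describes.
    return total * target_fraction


print(resolve_concurrency_target(0))        # default soft limit
print(resolve_concurrency_target(0, 50.0))  # soft limit from the annotation
print(resolve_concurrency_target(10))       # hard limit from containerConcurrency
```

The hard limit is still enforced per pod at request time; the value returned here only steers how many pods the autoscaler asks for.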

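Configuring scale bounds (minScale and maxScale)

minScale and maxScale set the minimum and maximum number of pods that serve the application. Use them to reduce cold starts or to cap compute cost.

minScale and maxScale can be set in the Revision template as follows: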

spec:
 template:
  metadata:
   annotations:
    autoscaling.knative.dev/minScale: "2"
    autoscaling.knative.dev/maxScale: "10"

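Default behavior

If the minScale annotation is not set, pods scale down to zero (or to 1 if enable-scale-to-zero is false in the ConfigMap above).

If the maxScale annotation is not set, there is no upper limit on the number of pods created.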
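
The bounds and their defaults can be summarized in a few lines of Python. This is only a sketch: the function name is hypothetical, and treating 0 as "annotation not set" is a convention chosen for this example.

```python
def apply_scale_bounds(
    desired_pods: int,
    min_scale: int = 0,
    max_scale: int = 0,
    enable_scale_to_zero: bool = True,
) -> int:
    """Clamp a desired pod count to minScale/maxScale (0 means "not set" here)."""
    lower = min_scale
    # Without minScale the floor is 0, or 1 when scale to zero is disabled.
    if lower == 0 and not enable_scale_to_zero:
        lower = 1

    bounded = max(desired_pods, lower)
    # Without maxScale there is no upper limit.
    if max_scale > 0:
        bounded = min(bounded, max_scale)
    return bounded


print(apply_scale_bounds(5, min_scale=2, max_scale=10))
print(apply_scale_bounds(0, enable_scale_to_zero=False))
```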

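How KPA works

Autoscaling comes down to two questions. The first is which metric to scale on: CPU, memory, or RPS? The second is the scaling policy, that is, how many instances to scale to.

Components

The autoscaling system consists of a few "physical" and logical components, briefly defined here. Knowing what they are, where they are deployed, and what they do goes a long way toward understanding the control and data flow. The components may do more than is outlined here; this article covers only the details that affect autoscaling.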

Queue Proxy

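The queue proxy is a sidecar container deployed alongside the user container in every pod. Every request sent to an application instance passes through the queue proxy first, hence the "proxy" in its name.

The queue proxy's main purpose is to measure and limit concurrency to the user's application. If a revision defines a concurrency limit of 5, the queue proxy makes sure that no more than 5 requests reach the application instance at once. If more requests arrive, it queues them locally, hence the "queue" in its name. The queue proxy also measures incoming request load and reports average concurrency and requests per second on a separate port.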
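
The limiting behavior can be imitated with a semaphore. The Python sketch below is not how the queue proxy is implemented; the class and function names are invented here, and it only shows the idea of letting 5 requests through at a time while the rest wait in a local queue.

```python
import asyncio


class ConcurrencyLimiter:
    """Lets at most `limit` requests reach the application at once."""

    def __init__(self, limit: int) -> None:
        self._slots = asyncio.Semaphore(limit)
        self.in_flight = 0  # requests currently inside the application
        self.peak = 0       # highest in_flight value observed so far

    async def handle(self, app_call):
        # Requests beyond the limit wait here, which is the "queue" part.
        async with self._slots:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await app_call()
            finally:
                self.in_flight -= 1


async def demo(limit: int = 5, requests: int = 20):
    limiter = ConcurrencyLimiter(limit)

    async def app_call():
        await asyncio.sleep(0.01)  # stand-in for real request handling
        return "ok"

    results = await asyncio.gather(
        *(limiter.handle(app_call) for _ in range(requests))
    )
    return limiter.peak, len(results)


peak, served = asyncio.run(demo())
print(f"peak concurrency: {peak}, requests served: {served}")
```

All 20 requests are served, but the application never sees more than 5 of them at the same moment.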

Autoscaler

The autoscaler is a standalone pod with three main components:

PodAutoscaler reconciler
Collector
Decider

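The PodAutoscaler reconciler makes sure that any change to a PodAutoscaler (see the API section) is picked up and reflected in the Decider, the Collector, or both.

The Collector gathers metrics from the queue proxies of the application instances. It scrapes their internal metrics endpoints and sums the results into a metric that represents the whole system. For scalability, it scrapes only a sample of the application instances and extrapolates the received metrics to the whole cluster.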
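
The sampling idea can be sketched as follows. This is a simplification with a hypothetical function name: it averages the concurrency reported by the scraped pods and scales that average up to the total pod count.

```python
from typing import Sequence


def extrapolate_concurrency(sampled: Sequence[float], total_pods: int) -> float:
    """Estimate system-wide concurrency from a sample of per-pod values."""
    if not sampled or total_pods <= 0:
        return 0.0
    average_per_pod = sum(sampled) / len(sampled)
    return average_per_pod * total_pods


# Three of nine pods were scraped and reported 8, 12 and 10 in-flight requests.
print(extrapolate_concurrency([8.0, 12.0, 10.0], total_pods=9))
```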

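The Decider takes all available metrics and decides how many pods the application deployment should be scaled to. Basically, it computes want = concurrencyInSystem / targetConcurrencyPerInstance.

On top of that, it adjusts the value for the revision's maximum scale rates and its minimum and maximum instance settings. It also computes how much burst capacity is left in the current deployment and so determines whether the Activator can be taken out of the data path.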
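
A minimal Python sketch of that calculation, assuming the result is rounded up to whole pods and then limited by max-scale-up-rate and max-scale-down-rate. The function name is hypothetical, and the minScale/maxScale adjustment is left out to keep the sketch short.

```python
import math


def desired_pods(
    concurrency_in_system: float,
    target_per_instance: float,
    ready_pods: int,
    max_scale_up_rate: float = 1000.0,
    max_scale_down_rate: float = 2.0,
) -> int:
    """want = concurrencyInSystem / targetConcurrencyPerInstance, rate-limited."""
    want = math.ceil(concurrency_in_system / target_per_instance)

    # Treat zero ready pods as one so the rate limits stay meaningful.
    current = max(ready_pods, 1)
    max_up = math.ceil(current * max_scale_up_rate)
    max_down = math.floor(current / max_scale_down_rate)

    return min(max(want, max_down), max_up)


print(desired_pods(50, 10, ready_pods=1))  # 50 / 10 with one pod running
print(desired_pods(0, 10, ready_pods=8))   # idle, but scale-down is rate-limited
```

With the default scale-down rate of 2.0, an idle revision with 8 pods is halved on each evaluation instead of dropping straight to its lower bound.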

Activator

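The Activator is a globally shared deployment that is itself highly scalable. Its main purposes are to buffer requests and to report metrics to the autoscaler.

The Activator is mainly involved in scaling to and from zero and in capacity-aware load balancing. When a revision is scaled to zero instances, the Activator is put into the data path instead of the revision's instances. If requests arrive for this revision, the Activator buffers them, pokes the autoscaler with metrics, and holds the requests until application instances appear. At that point the Activator immediately forwards the buffered requests to the new instances while taking care not to overload the existing ones. The Activator effectively acts as a load balancer here: it spreads load across all pods as they become available and does not overload them with respect to their concurrency settings. The system puts the Activator into the data path or takes it out as it sees fit so that it can act as the load balancer described above. If the current deployment has enough headroom that it is unlikely to be overloaded, the Activator is removed from the data path to minimize network overhead.

Unlike the queue proxy, the Activator actively pushes metrics to the autoscaler over a WebSocket connection to minimize scale-from-zero latency.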
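
Whether the Activator stays in the data path depends on the spare capacity compared with target-burst-capacity. The sketch below follows the rules stated in the ConfigMap comments; the function names and the exact spare-capacity formula are assumptions made for illustration.

```python
import math


def excess_burst_capacity(
    ready_pods: int,
    target_per_pod: float,
    observed_concurrency: float,
    target_burst_capacity: float,
) -> int:
    """Spare capacity left after reserving room for the expected burst."""
    if target_burst_capacity == 0:
        return 0   # burst protection disabled
    if target_burst_capacity < 0:
        return -1  # -1 means unlimited: never enough spare capacity
    capacity = ready_pods * target_per_pod
    return math.floor(capacity - observed_concurrency - target_burst_capacity)


def activator_in_path(
    ready_pods: int,
    target_per_pod: float,
    observed_concurrency: float,
    target_burst_capacity: float = 200.0,
) -> bool:
    # A revision scaled to zero is always fronted by the Activator.
    if ready_pods == 0:
        return True
    excess = excess_burst_capacity(
        ready_pods, target_per_pod, observed_concurrency, target_burst_capacity
    )
    return excess < 0


print(activator_in_path(3, 10.0, 5.0))    # 30 - 5 - 200 is negative
print(activator_in_path(30, 10.0, 50.0))  # 300 - 50 - 200 leaves headroom
```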

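Algorithm

The autoscaler scales on the average number of in-flight requests per pod (concurrency). The system's default target concurrency is 100, but we use 10 for our service. We load the service with 50 concurrent requests, so the autoscaler creates 5 pods (50 concurrent requests / target of 10 = 5 pods).

The algorithm has two modes, stable and panic. Stable mode works over a long window and panic mode over a short one, so that the autoscaler can scale out quickly when requests spike within a short time.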

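Stable Mode

In stable mode, the autoscaler adjusts the Deployment's replica count to reach the desired concurrency per pod. The calculation uses the average concurrency per pod over the 60-second window rather than the current replica count, because there is a delay between adding a pod and that pod becoming ready to serve and report metrics.

Panic Mode

The KPA computes average concurrency over a 60-second window, so the system takes about a minute to settle at the desired concurrency level. However, the autoscaler also computes a 6-second panic window and enters panic mode if that window reaches 2x the target concurrency. In panic mode the autoscaler operates on the shorter, more sensitive panic window. Once the panic condition has not been met for 60 seconds, the autoscaler returns to the original 60-second stable window.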

|
                                  Panic Target--->  +--| 20
                                                    |  |
                                                    | <------Panic Window
                                                    |  |
       Stable Target--->  +-------------------------|--| 10   CONCURRENCY
                          |                         |  |
                          |                      <-----------Stable Window
                          |                         |  |
--------------------------+-------------------------+--+ 0
120                       60                           0
                     TIME
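
The switch between the two modes can be sketched as a small state tracker. This follows the description above (enter panic at 2x the target, leave after 60 seconds without a new trigger); the class and its fields are hypothetical and much simpler than the real autoscaler.

```python
from typing import Optional


class ModeTracker:
    """Tracks whether the autoscaler is in stable or panic mode."""

    def __init__(self, stable_window_s: float = 60.0, panic_threshold: float = 2.0) -> None:
        self.stable_window_s = stable_window_s
        self.panic_threshold = panic_threshold
        self.last_trigger_s: Optional[float] = None  # last time the threshold was hit

    def update(self, now_s: float, panic_avg_per_pod: float, target_per_pod: float) -> str:
        # Entering (or extending) panic: the short window reached 2x the target.
        if panic_avg_per_pod >= self.panic_threshold * target_per_pod:
            self.last_trigger_s = now_s

        # Stay in panic until a full stable window passes without a trigger.
        if (
            self.last_trigger_s is not None
            and now_s - self.last_trigger_s < self.stable_window_s
        ):
            return "panic"

        self.last_trigger_s = None
        return "stable"


tracker = ModeTracker()
samples = [(0, 8.0), (2, 25.0), (30, 5.0), (61, 5.0), (62, 5.0)]
modes = [tracker.update(t, avg, target_per_pod=10.0) for t, avg in samples]
print(modes)
```

With a target of 10, the burst to 25 at t=2 triggers panic mode, and the tracker returns to stable mode only at t=62, a full 60 seconds later.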

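Data flow

Scaling in stable mode

In a steady state, the autoscaler constantly scrapes the currently active revision pods to keep adjusting the revision's scale. As requests flow into the system, the scraped values change and the autoscaler instructs the revision's deployment to follow the computed scale.

The SKS (ServerlessService) tracks changes in the deployment's size through the private service and updates the public service accordingly.

Scaling to zero

Once there are no more requests in the system, the revision scales to zero. All scrapes from the autoscaler to the revision pods return a concurrency of 0, and the activator reports the same (1).

Before the last pod of the revision is actually removed, the system makes sure that the activator is in the path and routable. The autoscaler, which decided to scale to zero in the first place, instructs the SKS to use proxy mode so that all traffic is directed to the activator (4.1). The SKS's public service is then probed until it is certain to return responses from the activator. Once that is the case and the grace period (configurable via scale-to-zero-grace-period) has elapsed, the last pod of the revision is removed and the revision has been scaled to zero (5).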
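
The conditions for removing the last pod can be written down as a single check. This is an illustrative Python sketch with hypothetical parameter names, not the actual controller logic.

```python
def may_remove_last_pod(
    observed_concurrency: float,
    activator_routable: bool,
    seconds_in_proxy_mode: float,
    grace_period_s: float = 30.0,
    enable_scale_to_zero: bool = True,
) -> bool:
    """True when the last pod of a revision can be removed safely."""
    if not enable_scale_to_zero:
        return False
    # Both the scraped pods and the activator must report zero concurrency.
    if observed_concurrency > 0:
        return False
    # The activator has to answer on the public service before pods go away.
    if not activator_routable:
        return False
    # scale-to-zero-grace-period must have elapsed.
    return seconds_in_proxy_mode >= grace_period_s


print(may_remove_last_pod(0.0, activator_routable=True, seconds_in_proxy_mode=31.0))
print(may_remove_last_pod(0.0, activator_routable=True, seconds_in_proxy_mode=10.0))
```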

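Scaling from zero

If a revision is scaled to zero and a request arrives for it, the system needs to scale it up. Because the SKS is in proxy mode, the request reaches the activator (1), which counts it and reports it to the autoscaler (2.1). The activator then buffers the request and watches the SKS's private service for endpoints to appear (2.2).

The autoscaler receives the metric from the activator and immediately runs an autoscaling cycle (3). That cycle determines that at least one pod is needed (4), and the autoscaler instructs the revision's deployment to scale to N > 0 replicas (5.1). It also puts the SKS into serve mode so that traffic flows directly to the revision's pods once they are up (5.2).

The activator eventually sees the endpoints come up and starts probing them. Once a probe passes, the address is considered healthy and is used to route the buffered request and any other requests that arrived in the meantime (8.2).

The revision has now been scaled up from zero.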

KPA example

We use the official autoscale-go sample for the demo. The service.yaml is as follows:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: autoscale-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        # Target 10 in-flight-requests per pod.
        autoscaling.knative.dev/target: "10"
        autoscaling.knative.dev/minScale: "1" 
        autoscaling.knative.dev/maxScale: "3"
    spec:
      containers:
      - image: gcr.io/knative-samples/autoscale-go:0.1

After deployment, one instance stays running even with no traffic, because we set the minimum scale to 1.

kubectl get pods
NAME                                           READY   STATUS    RESTARTS   AGE
autoscale-go-h9x5z-deployment-84d57876-5mjzt   2/2     Running   0          12s

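We scale on concurrency: we send 50 concurrent requests for 30 seconds, with minScale keeping at least 1 instance and maxScale capping scale-out at 3 instances.

We run the load test with hey: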

hey -z 30s -c 50 "http://autoscale-go.default.serverless.ushareit.me?sleep=100&prime=10000&bloat=5"

When it finishes, hey prints some statistics:

Summary:
  Total:    30.1853 secs
  Slowest:    0.4866 secs
  Fastest:    0.1753 secs
  Average:    0.1838 secs
  Requests/sec:    271.6219

  Total data:    819814 bytes
  Size/request:    99 bytes

Response time histogram:
  0.175 [1]    |
  0.206 [8044]    |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.238 [63]    |
  0.269 [39]    |
  0.300 [0]    |
  0.331 [1]    |
  0.362 [1]    |
  0.393 [7]    |
  0.424 [17]    |
  0.455 [11]    |
  0.487 [15]    |


Latency distribution:
  10% in 0.1782 secs
  25% in 0.1794 secs
  50% in 0.1808 secs
  75% in 0.1828 secs
  90% in 0.1863 secs
  95% in 0.1910 secs
  99% in 0.2502 secs

Details (average, fastest, slowest):
  DNS+dialup:    0.0012 secs, 0.1753 secs, 0.4866 secs
  DNS-lookup:    0.0007 secs, 0.0000 secs, 0.1098 secs
  req write:    0.0000 secs, 0.0000 secs, 0.0024 secs
  resp wait:    0.1824 secs, 0.1752 secs, 0.3321 secs
  resp read:    0.0001 secs, 0.0000 secs, 0.0071 secs

Status code distribution:
  [200]    8199 responses

The pods scaled as follows:

kubectl get pods
NAME                                           READY   STATUS    RESTARTS   AGE
autoscale-go-h9x5z-deployment-84d57876-5mjzt   2/2     Running   0          5m15s
autoscale-go-h9x5z-deployment-84d57876-64b2f   2/2     Running   0          21s
autoscale-go-h9x5z-deployment-84d57876-pf2c9   2/2     Running   0          21s

By the numbers the service should have scaled out to 5 instances, but because maxScale is 3, it stops at 3 instances.
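
As a sanity check, the hey summary above agrees with this arithmetic. By Little's law, average in-flight requests equal throughput times average latency, which the Python sketch below applies to the reported numbers before repeating the 50 / 10 calculation and the maxScale cap. The variable names are ours; the values come from the hey output and the service.yaml.

```python
import math

requests_per_second = 271.6219  # "Requests/sec" from the hey summary
average_latency_s = 0.1838      # "Average" from the hey summary
target_per_pod = 10             # autoscaling.knative.dev/target
max_scale = 3                   # autoscaling.knative.dev/maxScale

# Little's law: in-flight requests = arrival rate * time spent in the system.
in_flight = requests_per_second * average_latency_s

wanted_pods = math.ceil(in_flight / target_per_pod)
actual_pods = min(wanted_pods, max_scale)

print(f"in-flight ~ {in_flight:.1f}, wanted {wanted_pods}, capped at {actual_pods}")
```

The measured in-flight load comes out very close to the 50 concurrent connections that hey was asked to hold open.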

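You can configure Knative autoscaling to work with either the default KPA or a CPU-based metric, that is, the Horizontal Pod Autoscaler (HPA). Add or modify the autoscaling.knative.dev/class and autoscaling.knative.dev/metric annotations in the revision template to use CPU-based autoscaling instead of the default request-based metric.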

spec:
 template:
  metadata:
   annotations:
    autoscaling.knative.dev/metric: cpu
    autoscaling.knative.dev/target: "70"
    autoscaling.knative.dev/class: hpa.autoscaling.knative.dev

If you have deployed the monitoring stack in the knative-monitoring namespace, Grafana shows these changes more visually.