Kubernetes Affinity Scheduling

Date: 2019-12-08

Tags: kubernetes, affinity scheduling

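1. Overview

The previous article, "Kubernetes 調度器淺析" (a brief look at the Kubernetes scheduler), outlined how the scheduler works and which scheduling policies it applies. This chapter goes deeper into the scheduler and introduces affinity scheduling.

Kubernetes can restrict a Pod to run only on specified Nodes, or express a preference for certain Nodes.
There are several ways to do this:

NodeName: the simplest way to select a node. It names the node directly and bypasses the scheduler.
NodeSelector: an early, simple mechanism that uses key-value pairs to schedule a Pod onto Nodes carrying specific labels.
NodeAffinity: an upgraded NodeSelector that supports richer rules and is more flexible. (At the time of writing, the documentation said NodeSelector would eventually be deprecated.)
PodAffinity: constrains which nodes a Pod can be scheduled to based on the labels of Pods already running on those nodes, rather than on node labels.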

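2. NodeName

nodeName is a field of PodSpec that names the node the pod should run on. When the scheduler works, it only picks up pods whose nodeName is empty, schedules them, and then fills in nodeName. Setting nodeName yourself therefore skips the scheduler entirely. In other words, nodeName takes precedence over every other node selection method.

Usage is simple. Here is the official example: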

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: kube-01

Of course, if the named node does not exist or lacks the resources the pod needs, the pod will fail to run.

3. NodeSelector

nodeSelector is also a field of PodSpec. It specifies a map of key-value pairs.
To run a pod on particular nodes, first label those nodes, then list the same labels under nodeSelector in the pod spec.

The steps are:

Label the node:

kubectl label nodes kubernetes-node type=gpu

Add a nodeSelector field to the pod configuration:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    type: gpu
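
The matching rule behind nodeSelector can be sketched in a few lines of Go. This is only an illustration, not the scheduler's own code: the function name matchesNodeSelector and the node name other-node are made up for the example. A node qualifies only if its labels contain every key-value pair in the selector.

```go
package main

import "fmt"

// matchesNodeSelector reports whether a node's labels contain every
// key-value pair requested by the pod's nodeSelector.
// Hypothetical helper, written for illustration only.
func matchesNodeSelector(nodeLabels, selector map[string]string) bool {
	for key, want := range selector {
		got, ok := nodeLabels[key]
		if !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"type": "gpu"}

	// kubernetes-node was labelled with type=gpu above.
	gpuNode := map[string]string{
		"kubernetes.io/hostname": "kubernetes-node",
		"type":                   "gpu",
	}
	// other-node is an assumed second node without the label.
	plainNode := map[string]string{
		"kubernetes.io/hostname": "other-node",
	}

	fmt.Println(matchesNodeSelector(gpuNode, selector))   // true
	fmt.Println(matchesNodeSelector(plainNode, selector)) // false
}
```

An empty selector matches every node, which mirrors a pod that sets no nodeSelector at all.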

Built-in Node labels

Kubernetes attaches some labels to nodes by default:

kubernetes.io/hostname
beta.kubernetes.io/instance-type
beta.kubernetes.io/os
beta.kubernetes.io/arch
failure-domain.beta.kubernetes.io/zone
failure-domain.beta.kubernetes.io/region

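Some of these labels are only set by cloud providers. The beta-prefixed keys were later superseded by stable ones (kubernetes.io/os, kubernetes.io/arch, node.kubernetes.io/instance-type, topology.kubernetes.io/zone and topology.kubernetes.io/region), so check which keys your cluster version actually sets.

There are also labels that describe the node role (for example master or lb):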

kubernetes.io/role
node-role.kubernetes.io

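4. NodeAffinity

nodeSelector offers a very simple key-value way to restrict pods to nodes with specific labels. nodeAffinity greatly expands the constraints that can be expressed, covering both affinity and anti-affinity.

nodeAffinity was designed from the start to replace nodeSelector.

nodeAffinity currently supports these operators: In, NotIn, Exists, DoesNotExist, Gt, Lt. NotIn and DoesNotExist provide the anti-affinity behaviour.

nodeAffinity currently supports two scheduling modes:

requiredDuringSchedulingIgnoredDuringExecution: the conditions must be met. If no node satisfies them, the Pod cannot be scheduled and stays Pending. This is therefore also called hard mode.
preferredDuringSchedulingIgnoredDuringExecution: nodes that satisfy the conditions are preferred. If none is found, the Pod is placed on the best of the remaining nodes. This is therefore also called soft mode.

Both names are very long, which is typical of k8s naming. The IgnoredDuringExecution part means the same as in the nodeSelector implementation: if node labels change later, pods that are already deployed and no longer satisfy the affinity rules are not affected and keep running on that node. In other words, affinity only influences node selection at the moment a Pod is scheduled.

The k8s community was planning a requiredDuringSchedulingRequiredDuringExecution mode that would evict pods from nodes where they no longer satisfy the affinity rules.

Here is the official example: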

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      # The node must carry the label key kubernetes.io/e2e-az-name
      # with the value e2e-az1 or e2e-az2.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
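      # After the required rule above has filtered the nodes, prefer nodes whose
      # label another-node-label-key has the value another-node-label-value.
      preferredDuringSchedulingIgnoredDuringExecution:
      # A node that matches the preference has the weight added to its score
      # (the priority step scores every candidate node).
      # weight: 1 - 100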
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity
    image: k8s.gcr.io/pause:2.0
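
To make the interplay of the two modes concrete, here is a rough Go sketch of "filter on the hard rule, then score on the soft rules". It is a simplification under stated assumptions, not the scheduler's implementation: the types node and preferredTerm, the function pickNode, and the node names node-a, node-b and node-c are all invented for the example, and each rule is reduced to a single "key In values" expression. The real scheduler also combines this score with many other priorities.

```go
package main

import "fmt"

// node is a minimal stand-in for a cluster node (illustrative only).
type node struct {
	name   string
	labels map[string]string
}

// preferredTerm is a simplified soft rule: one "key In values"
// expression plus the weight added to the score when it matches.
type preferredTerm struct {
	weight int
	key    string
	values []string
}

// labelIn reports whether labels[key] exists and equals one of values.
func labelIn(labels map[string]string, key string, values []string) bool {
	got, ok := labels[key]
	if !ok {
		return false
	}
	for _, v := range values {
		if v == got {
			return true
		}
	}
	return false
}

// pickNode drops nodes that fail the hard rule, sums the weights of the
// soft rules each remaining node matches, and returns the highest scorer.
// The boolean is false when no node passes the hard rule.
func pickNode(nodes []node, reqKey string, reqValues []string, prefs []preferredTerm) (string, bool) {
	best, bestScore, found := "", -1, false
	for _, n := range nodes {
		if !labelIn(n.labels, reqKey, reqValues) {
			continue // hard mode: the node is not a candidate at all
		}
		score := 0
		for _, p := range prefs {
			if labelIn(n.labels, p.key, p.values) {
				score += p.weight // soft mode: only affects ranking
			}
		}
		if score > bestScore {
			best, bestScore, found = n.name, score, true
		}
	}
	return best, found
}

func main() {
	nodes := []node{
		{"node-a", map[string]string{"kubernetes.io/e2e-az-name": "e2e-az1"}},
		{"node-b", map[string]string{"kubernetes.io/e2e-az-name": "e2e-az2", "another-node-label-key": "another-node-label-value"}},
		{"node-c", map[string]string{"kubernetes.io/e2e-az-name": "e2e-az3", "another-node-label-key": "another-node-label-value"}},
	}
	prefs := []preferredTerm{{1, "another-node-label-key", []string{"another-node-label-value"}}}

	name, ok := pickNode(nodes, "kubernetes.io/e2e-az-name", []string{"e2e-az1", "e2e-az2"}, prefs)
	fmt.Println(name, ok) // node-b true
}
```

node-c matches the preference but is excluded by the hard rule, while node-a passes the hard rule but scores lower than node-b.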

Here is a quick look at the NodeAffinity structs, which the configuration notes below refer to:

type NodeAffinity struct {
    RequiredDuringSchedulingIgnoredDuringExecution *NodeSelector
    PreferredDuringSchedulingIgnoredDuringExecution []PreferredSchedulingTerm
}

type NodeSelector struct {
    NodeSelectorTerms []NodeSelectorTerm
}

type NodeSelectorTerm struct {
    MatchExpressions []NodeSelectorRequirement
    MatchFields []NodeSelectorRequirement
}

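Configuration notes:

If both nodeSelector and nodeAffinity are specified, a node must satisfy both before the pod can be scheduled to it.
If multiple NodeSelectorTerms are specified, a node only needs to satisfy one of them for the pod to be scheduled.
If multiple MatchExpressions are specified within a term, a node must satisfy all of them for the pod to be scheduled to it.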
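
The OR-of-terms, AND-of-expressions rule from the notes above can be sketched as follows. This is a simplified model rather than the Kubernetes source: the types requirement and term and the function matchesAnyTerm are names chosen for the example, Gt and Lt are left out for brevity, and MatchFields is ignored.

```go
package main

import "fmt"

// requirement is a simplified NodeSelectorRequirement (illustrative).
type requirement struct {
	key      string
	operator string // "In", "NotIn", "Exists" or "DoesNotExist"
	values   []string
}

// term is a simplified NodeSelectorTerm holding only MatchExpressions.
type term struct {
	matchExpressions []requirement
}

func contains(values []string, s string) bool {
	for _, v := range values {
		if v == s {
			return true
		}
	}
	return false
}

// matches evaluates a single expression against the node labels.
func (r requirement) matches(labels map[string]string) bool {
	got, ok := labels[r.key]
	switch r.operator {
	case "Exists":
		return ok
	case "DoesNotExist":
		return !ok
	case "In":
		return ok && contains(r.values, got)
	case "NotIn":
		// A node without the key also satisfies NotIn.
		return !ok || !contains(r.values, got)
	}
	return false // unknown operator: treat as no match
}

// matches is true only if ALL expressions of the term match (AND).
// An empty term is treated as matching nothing.
func (t term) matches(labels map[string]string) bool {
	if len(t.matchExpressions) == 0 {
		return false
	}
	for _, r := range t.matchExpressions {
		if !r.matches(labels) {
			return false
		}
	}
	return true
}

// matchesAnyTerm is true if AT LEAST ONE term matches (OR).
func matchesAnyTerm(terms []term, labels map[string]string) bool {
	for _, t := range terms {
		if t.matches(labels) {
			return true
		}
	}
	return false
}

func main() {
	labels := map[string]string{"zone": "az1", "disk": "hdd"}
	terms := []term{
		// Term 1: zone In (az1, az2) AND disk In (ssd). Fails on disk.
		{[]requirement{{"zone", "In", []string{"az1", "az2"}}, {"disk", "In", []string{"ssd"}}}},
		// Term 2: gpu DoesNotExist. Matches.
		{[]requirement{{"gpu", "DoesNotExist", nil}}},
	}
	fmt.Println(matchesAnyTerm(terms, labels))     // true
	fmt.Println(matchesAnyTerm(terms[:1], labels)) // false
}
```

The label keys zone, disk and gpu are placeholders. The first call succeeds because the second term matches even though the first does not.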

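5. PodAffinity

nodeSelector and nodeAffinity both schedule on node labels. Sometimes, though, we want scheduling to take the relationship between pods into account, not only the relationship between a pod and a node.

For example, services A and B may need to be deployed in the same data center, rack, or machine because they are sensitive to network latency. Conversely, services C and D may need to be kept apart as far as possible, so that the failure of one host, or even one data center, does not take both down at once. This improves availability and fault tolerance.

podAffinity constrains the scheduling of a new pod based on the labels of pods already running on nodes.
The rule has the form "this Pod should (or, for anti-affinity, should not) be scheduled onto X if X is already running one or more Pods that satisfy rule Y".
Y is a labelSelector with an optional list of associated namespaces. Unlike nodes, pods are namespaced resources, so a pod labelSelector always applies to specific namespaces. X can be understood as a topology domain such as a node, rack, zone, or cloud region. It is identified by a node label key, either one of the built-in Node labels mentioned earlier or a custom one.

Here are the structs involved in pod affinity, which help explain the feature: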

type Affinity struct {
    // Node affinity, covered above
    NodeAffinity *NodeAffinity
    // Pod affinity
    PodAffinity *PodAffinity
    // Pod anti-affinity
    PodAntiAffinity *PodAntiAffinity
}

type PodAffinity struct {
    // Hard mode: terms that must be satisfied
    RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm
    // Soft mode: terms used to rank candidate nodes
    PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm
}

type PodAffinityTerm struct {
    LabelSelector *metav1.LabelSelector
    Namespaces []string
    TopologyKey string
}

type WeightedPodAffinityTerm struct {
    Weight int32
    PodAffinityTerm PodAffinityTerm
}

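podAffinity resembles nodeAffinity in some ways. It matches with a labelSelector, and the supported operators are In, NotIn, Exists, and DoesNotExist.
It also supports the two scheduling modes requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution, which behave as they do for nodeAffinity and are not repeated here.

podAffinity also differs from nodeAffinity in important ways. As noted above, pods are namespaced resources, so a term can carry a namespaces field listing one or more namespaces. In the v1 API, if the field is omitted or empty it defaults to the namespace of the pod being scheduled. (Early alpha versions of the feature treated an explicitly empty list as "all" namespaces.)

The other major difference is TopologyKey, which is described separately to make it easier to follow.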

TopologyKey

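TopologyKey defines what "in the same place" means. As described above, it names a topology domain.

When do two pods count as being in the same topology domain? The original article illustrated three cases with figures, each showing two pods on different nodes:

If we use kubernetes.io/hostname, "in the same place" means on the same node, so the two pods are not in the same place.

If we use failure-domain.beta.kubernetes.io/zone to define a place, the two pods are in the same zone.

We can also use a custom node label as the TopologyKey. For example, if we give a group of nodes a rack label, the two pods are in the same place.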
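
The three cases can be reproduced with a small Go sketch. It assumes a deliberately reduced model: the types node and pod, the functions sameTopology and affinitySatisfied, the node names node-1 and node-2, and the label values zone-a and r1 are all invented for illustration, and the label selector is reduced to exact key-value matching. Two nodes are in the same topology domain when both carry the topology key with the same value.

```go
package main

import "fmt"

type node struct {
	name   string
	labels map[string]string
}

type pod struct {
	name     string
	nodeName string
	labels   map[string]string
}

// sameTopology reports whether both nodes carry topologyKey with an
// identical value. A node missing the label belongs to no domain.
func sameTopology(a, b node, topologyKey string) bool {
	va, okA := a.labels[topologyKey]
	vb, okB := b.labels[topologyKey]
	return okA && okB && va == vb
}

// affinitySatisfied reports whether at least one existing pod that
// matches selector runs in the same topology domain as candidate.
func affinitySatisfied(candidate node, nodes map[string]node, pods []pod,
	selector map[string]string, topologyKey string) bool {
	for _, p := range pods {
		matched := true
		for k, v := range selector {
			if p.labels[k] != v {
				matched = false
				break
			}
		}
		host, known := nodes[p.nodeName]
		if matched && known && sameTopology(candidate, host, topologyKey) {
			return true
		}
	}
	return false
}

func main() {
	const zoneKey = "failure-domain.beta.kubernetes.io/zone"
	n1 := node{"node-1", map[string]string{"kubernetes.io/hostname": "node-1", zoneKey: "zone-a", "rack": "r1"}}
	n2 := node{"node-2", map[string]string{"kubernetes.io/hostname": "node-2", zoneKey: "zone-a", "rack": "r1"}}
	nodes := map[string]node{n1.name: n1, n2.name: n2}

	// One matching pod already runs on node-1; node-2 is the candidate.
	pods := []pod{{"redis-0", "node-1", map[string]string{"app": "store"}}}
	selector := map[string]string{"app": "store"}

	fmt.Println(affinitySatisfied(n2, nodes, pods, selector, "kubernetes.io/hostname")) // false
	fmt.Println(affinitySatisfied(n2, nodes, pods, selector, zoneKey))                  // true
	fmt.Println(affinitySatisfied(n2, nodes, pods, selector, "rack"))                   // true
}
```

For anti-affinity the result is simply negated: a candidate is rejected when a matching pod already runs in its domain.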

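In principle, topologyKey can be any legal label key. For performance and security reasons, however, there are some restrictions:

For pod affinity, and for pod anti-affinity in requiredDuringSchedulingIgnoredDuringExecution mode, topologyKey must not be empty.
For pod anti-affinity in requiredDuringSchedulingIgnoredDuringExecution mode, the LimitPodHardAntiAffinityTopology admission controller restricts topologyKey to kubernetes.io/hostname. If you want to use a custom topology, you can modify or simply disable that admission controller.
For pod anti-affinity in preferredDuringSchedulingIgnoredDuringExecution mode, an empty topologyKey means all topologies. As of v1.12, "all topologies" is limited to the combination of kubernetes.io/hostname, failure-domain.beta.kubernetes.io/zone, and failure-domain.beta.kubernetes.io/region.
Apart from these cases, topologyKey can be any legal label key.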

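Example

Here is the official example. On a three-node cluster we need to deploy three replicas each of a web service and a redis service. Each web replica should be co-located with a redis replica, while the replicas of each service are spread across different nodes.
First create the redis cluster: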

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      affinity:
        # Pod anti-affinity: spread the redis replicas across nodes
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: redis-server
        image: redis:3.2-alpine

Then deploy the web service, which must be spread across nodes and co-located with redis. The configuration is:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  selector:
    matchLabels:
      app: web-store
  replicas: 3
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-store
            topologyKey: "kubernetes.io/hostname"
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.12-alpine

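Note 1: pod affinity requires a substantial amount of processing and can noticeably slow down scheduling in large clusters. It is not recommended for clusters larger than several hundred nodes.
Note 2: pod anti-affinity requires nodes to be labelled consistently. Every node in the cluster must carry an appropriate label for the key given in topologyKey. If some nodes lack that label, the behaviour may be unexpected.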