K8S: Cluster Scheduling

Date: 2021-08-12

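Scheduling Overview

Introduction

The scheduler (kube-scheduler) is the Kubernetes component whose main task is to assign the Pods you define to nodes in the cluster. In doing so it has to balance several goals:

Fairness: every node should get a chance to be allocated resources
Efficient resource use: the cluster's resources as a whole should be used as fully as possible
Efficiency: scheduling itself must perform well, so that large batches of Pods are placed quickly
Flexibility: users can control the scheduling logic according to their own needs

The scheduler runs as a separate program. Once started, it keeps watching the API Server for Pods whose PodSpec.NodeName is empty, and for each such Pod it creates a Binding that states which node the Pod should be placed on.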

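Scheduling process

Scheduling proceeds in several stages:

1. Filter out the nodes that do not satisfy the Pod's requirements. This stage is called predicate.
2. Rank the nodes that passed by priority. This stage is called priority.
3. Pick the node with the highest priority.

If an error occurs at any point, it is returned immediately.

Several predicate algorithms are available:

PodFitsResources: whether the resources remaining on the node exceed the resources the Pod requests
PodFitsHost: if the Pod specifies a NodeName, whether the node's name matches it
PodFitsHostPorts: whether ports already in use on the node conflict with the ports the Pod requests
PodSelectorMatches: filters out nodes whose labels do not match the labels the Pod specifies
NoDiskConflict: volumes already mounted on the node must not conflict with the volumes the Pod specifies, unless both are read-only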
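
The resource and port predicates work from what the Pod declares, not from what it later uses. As a minimal sketch, here is a Pod that declares CPU and memory requests and a host port; the Pod name and all values are illustrative assumptions, not taken from the original setup.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: requests-demo        # hypothetical name
spec:
  containers:
  - name: web
    image: nginx
    resources:
      requests:
        cpu: "500m"          # half a CPU core; illustrative value
        memory: "256Mi"      # illustrative value
    ports:
    - containerPort: 80
      hostPort: 8080         # a host port is what PodFitsHostPorts checks for conflicts
```

A node whose unreserved CPU or memory is smaller than these requests, or on which host port 8080 is already taken, would be filtered out during the predicate stage.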

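If no node is suitable at the end of the predicate stage, the Pod stays in the Pending state and scheduling is retried until some node satisfies the conditions. If several nodes pass this stage, scheduling continues with the priority stage, which ranks the nodes by priority.

Priorities are expressed as a set of key-value pairs: the key is the name of a priority item and the value is its weight. The priority items include:

LeastRequestedPriority: the weight is derived from the node's CPU and memory utilization; the lower the utilization, the higher the weight. In other words, this item favors nodes with a lower proportion of resources in use.
BalancedResourceAllocation: the closer a node's CPU and memory utilization are to each other, the higher the weight. This item should be used together with LeastRequestedPriority, not on its own.
ImageLocalityPriority: favors nodes that already hold the images the Pod will use; the larger the total size of those images, the higher the weight.

The scheduler combines all priority items and their weights to compute the final result.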
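
To make the "name plus weight" idea concrete: older Kubernetes releases let you write the predicates and weighted priorities into a scheduler Policy file passed to kube-scheduler with --policy-config-file. That mechanism was removed in v1.23 in favor of scheduler profiles and plugins, so treat the following only as a sketch of the legacy format. The weights are illustrative, and the names HostName and MatchNodeSelector are my assumption for how that file referred to the PodFitsHost and PodSelectorMatches checks described above.

```yaml
# Legacy scheduler Policy (older releases only); weights are illustrative.
apiVersion: v1
kind: Policy
predicates:
- name: PodFitsResources
- name: PodFitsHostPorts
- name: NoDiskConflict
- name: MatchNodeSelector    # assumed legacy name for the label-matching predicate
- name: HostName             # assumed legacy name for the NodeName predicate
priorities:
- name: LeastRequestedPriority
  weight: 1
- name: BalancedResourceAllocation
  weight: 1
- name: ImageLocalityPriority
  weight: 2                  # counts twice as much as the other two items
```

Each node that survives the predicates gets a score from every priority item, and the scores are combined using these weights.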

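Custom schedulers

Besides the scheduler that ships with Kubernetes, you can also write your own. The spec.schedulerName field names the scheduler that should handle a Pod, so each Pod can choose its scheduler. For example, the Pod below is scheduled by my-scheduler rather than the default default-scheduler.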

apiVersion: v1
kind: Pod
metadata:
  name: annotation-second-scheduler
  labels:
    name: multischeduler-example
spec:
  schedulerName: my-scheduler
  containers: 
  - name: pod-with-second-annotation-container
    image: gcr.io/google_containers/pause:2.0

Scheduling affinity

Node affinity

pod.spec.affinity.nodeAffinity

preferredDuringSchedulingIgnoredDuringExecution: soft policy (a preference)
requiredDuringSchedulingIgnoredDuringExecution: hard policy (a requirement)

requiredDuringSchedulingIgnoredDuringExecution

apiVersion: v1
kind: Pod
metadata:
  name: affinity
  labels:
    app: node-affinity-pod
spec:
  containers:
  - name: with-node-affinity
    image: busybox
  affinity:
    nodeAffinity: 
      requiredDuringSchedulingIgnoredDuringExecution: 
        nodeSelectorTerms:
        - matchExpressions: 
          - key: kubernetes.io/hostname
            operator: NotIn
            values: 
            - xuh04
            - k8s-node02

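operator: NotIn means the Pod will not be created on any of the nodes listed in values.

If you change operator: NotIn to operator: In and set values to node 04 (xuh04 in the manifest above), the Pod lands on that node every time it is deleted and recreated.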

preferredDuringSchedulingIgnoredDuringExecution:

apiVersion: v1
kind: Pod
metadata:
  name: affinity
  labels:
    app: node-affinity-pod
spec:
  containers:
  - name: with-node-affinity
    image: busybox
  affinity:
    nodeAffinity: 
      preferredDuringSchedulingIgnoredDuringExecution: 
      - weight: 1  # weight (1-100); the higher, the stronger the preference
        preference: 
          matchExpressions: 
          - key: kubernetes.io/hostname
            operator: In
            values: 
            - xuh02

Combining both policies

apiVersion: v1
kind: Pod
metadata:
  name: affinity
  labels:
    app: node-affinity-pod
spec:
  containers:
  - name: with-node-affinity
    image: busybox
  affinity:
    nodeAffinity: 
      requiredDuringSchedulingIgnoredDuringExecution: 
        nodeSelectorTerms:
        - matchExpressions: 
          - key: kubernetes.io/hostname
            operator: NotIn
            values: 
            - xuh04
      preferredDuringSchedulingIgnoredDuringExecution: 
      - weight: 1  # weight (1-100); the higher, the stronger the preference
        preference: 
          matchExpressions: 
          - key: kubernetes.io/hostname
            operator: In
            values: 
            - xuh03

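Operators

In: the label's value is in the given list
NotIn: the label's value is not in the given list
Gt: the label's value is greater than the given value
Lt: the label's value is less than the given value
Exists: the label exists
DoesNotExist: the label does not exist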
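
The manifests above only use In and NotIn. As a sketch of the other operators, the Pod below requires a node whose label value is numerically greater than 8 and which also carries a disk label. The Pod name and the cpu-cores label are hypothetical; such a label would have to be added to the nodes by hand.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-gt          # hypothetical name
spec:
  containers:
  - name: with-node-affinity
    image: busybox
    command: ["sleep", "3600"]
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:  # expressions within one term must all match
          - key: cpu-cores   # hypothetical node label
            operator: Gt
            values:
            - "8"            # exactly one value, interpreted as an integer
          - key: disk
            operator: Exists # Exists and DoesNotExist take no values
```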

Pod affinity

pod.spec.affinity.podAffinity/podAntiAffinity

preferredDuringSchedulingIgnoredDuringExecution: soft policy (a preference)
requiredDuringSchedulingIgnoredDuringExecution: hard policy (a requirement)

apiVersion: v1
kind: Pod
metadata:
  name: pod-3
  labels:
    app: pod-3
spec:
  containers:
  - name: pod-3
    image: busybox
  affinity:
    podAffinity: 
      requiredDuringSchedulingIgnoredDuringExecution: 
      - labelSelector:
          matchExpressions: 
          - key: app
            operator: In
            values: 
            - pod-1
        topologyKey: kubernetes.io/hostname
    podAntiAffinity: 
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1  # weight (1-100); the higher, the stronger the preference to stay apart
        podAffinityTerm:
          labelSelector: 
            matchExpressions: 
            - key: app
              operator: In
              values: 
              - pod-1
          topologyKey: kubernetes.io/hostname

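Comparison of affinity and anti-affinity policies

Policy	Matches labels on	Operators	Topology domain support	Scheduling target
nodeAffinity	Node	In, NotIn, Exists, DoesNotExist, Gt, Lt	No	The specified nodes
podAffinity	Pod	In, NotIn, Exists, DoesNotExist	Yes	Same topology domain as the specified Pod
podAntiAffinity	Pod	In, NotIn, Exists, DoesNotExist	Yes	Not in the same topology domain as the specified Pod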
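
A common use of podAntiAffinity is spreading the replicas of one workload across nodes. The following is a minimal sketch under that assumption; the Deployment name and the app label are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-web           # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: spread-web
  template:
    metadata:
      labels:
        app: spread-web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - spread-web
            topologyKey: kubernetes.io/hostname   # at most one replica per node
      containers:
      - name: web
        image: nginx
```

Because this is the hard policy, any replica beyond the number of eligible nodes stays Pending; switching to preferredDuringSchedulingIgnoredDuringExecution relaxes that.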

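Taints and Tolerations

Taint and Toleration

Node affinity is a property of a Pod (a preference or a hard requirement) that attracts the Pod to a particular class of nodes. A taint does the opposite: it lets a node repel a particular class of Pods.

Taints and tolerations work together to keep Pods off nodes that are unsuitable for them. One or more taints can be applied to each node, which means the node will not accept any Pod that does not tolerate those taints. Applying a toleration to a Pod means the Pod may (but is not required to) be scheduled onto nodes with matching taints.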

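Taint

1. Composition of a taint

The kubectl taint command sets a taint on a Node. Once tainted, the Node repels Pods: it can refuse to have Pods scheduled onto it and can even evict Pods that are already running on it.

Each taint has the following form:

key=value:effect

Each taint has a key and a value that act as its label; the value may be empty. The effect describes what the taint does. Three taint effects are currently supported:

NoSchedule: Kubernetes will not schedule Pods onto a Node with this taint
PreferNoSchedule: Kubernetes will try to avoid scheduling Pods onto a Node with this taint
NoExecute: Kubernetes will not schedule Pods onto a Node with this taint, and will also evict Pods already running on that Node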
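
Taints are stored on the Node object itself, under spec.taints. As a sketch, this is roughly the fragment you might see when printing a tainted node as YAML; the second taint is a hypothetical addition to show an empty value, and the fragment is meant for reading, not for applying.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: xuh01
spec:
  taints:
  - key: key1
    value: value1
    effect: NoSchedule       # corresponds to key1=value1:NoSchedule
  - key: key2                # hypothetical taint with an empty value
    effect: NoExecute        # corresponds to key2:NoExecute
```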

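2. Setting, viewing, and removing taints

# Set a taint
kubectl taint nodes xuh01 key1=value1:NoSchedule

# Look for the Taints field in the node description
kubectl describe node xuh01

# Remove the taint (note the trailing minus sign)
kubectl taint nodes xuh01 key1:NoSchedule-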

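Toleration

A tainted Node repels Pods according to the taint's effect (NoSchedule, PreferNoSchedule, or NoExecute), so to some degree Pods will not be scheduled onto it. A toleration set on a Pod lets that Pod tolerate the taint, so it can still be scheduled onto a Node that carries it.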

pod.spec.tolerations

tolerations: 
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600  # only valid together with effect NoExecute
- key: "key2"
  operator: "Exists"
  effect: "NoSchedule"

The key, value, and effect must match the taint set on the Node.
When operator is Exists, the value is ignored.
tolerationSeconds describes how long the Pod may keep running on the node once it is due to be evicted.
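
For context, tolerations sit at the same level as containers in the Pod spec. The following is a minimal sketch of a complete Pod, assuming a node tainted with key1=value1; the Pod name is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: toleration-demo      # hypothetical name
spec:
  containers:
  - name: web
    image: nginx
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoSchedule"     # may be scheduled onto nodes tainted key1=value1:NoSchedule
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"
    tolerationSeconds: 3600  # keeps running for up to one hour after the taint appears
```

A toleration only permits placement on the tainted node; it does not force the Pod there. To pin it, combine the toleration with node affinity or a nodeSelector.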

1. When no key is specified, the toleration matches every taint key:

tolerations:
- operator: "Exists"
- operator: "Exists"

2. When no effect is specified, the toleration matches every effect for that key:

tolerations:
- key: "key"
  operator: "Exists"

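3. When there are multiple master nodes, you can avoid wasting their resources by switching them to a soft taint, so that they accept Pods when no other node fits: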

kubectl taint nodes Node-Name node-role.kubernetes.io/master=:PreferNoSchedule

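Pinning Pods to specific nodes

1. pod.spec.nodeName places the Pod directly on the named Node, bypassing the scheduler's policies entirely. The match is mandatory.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myweb
spec: 
  replicas: 7
  selector:
    matchLabels:
      app: myweb
  template: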
    metadata:
      labels:
        app: myweb
    spec:
      nodeName: xuh03
      containers: 
      - name: myweb
        image: nginx
        ports:
        - containerPort: 80

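2. pod.spec.nodeSelector selects nodes through the Kubernetes label-selector mechanism: the scheduler matches the labels and then places the Pod on a matching node. This rule is a hard constraint.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myweb1
spec: 
  replicas: 5
  selector:
    matchLabels:
      app: myweb1
  template: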
    metadata:
      labels:
        app: myweb1
    spec:
      nodeSelector:
        disk: ssd
      containers: 
      - name: myweb
        image: nginx
        ports:
        - containerPort: 80

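If no Node carries the label disk=ssd, the Pods cannot find a node to be scheduled onto and stay in the Pending state until some node is given that label, for example with kubectl label node xuh02 disk=ssd.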