Advanced scheduling mechanisms fall into the following two categories:
Node selectors: nodeSelector, nodeName
Node affinity: nodeAffinity
Both are evaluated as part of the scheduler's logic.
1 Node selectors
nodeSelector, nodeName, NodeAffinity
If you want a pod to run on one specific node, simply set the node name (nodeName); the pod can then only be placed on that node.
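A minimal sketch of the nodeName approach (the pod name here is made up for illustration; the node name matches the cluster used below). Because nodeName is set, the scheduler is bypassed for this pod:
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodename-demo          # hypothetical name, for illustration only
spec:
  nodeName: node1.test.k8s.com     # bind the pod directly to this node, bypassing the scheduler
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1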
If a whole class of nodes qualifies, use nodeSelector instead: label those nodes and match the labels in the pod spec. This greatly narrows the candidate set.
nodeSelector
Example: find the nodes labelled disktype=ssd
[root@master k8s]# mkdir schedule
[root@master k8s]# cd schedule/
[root@master schedule]# ll
total 0
[root@master schedule]# cp ../pod-demo.yaml .
[root@master schedule]#
apiVersion: v1
kind: Pod
metadata:
  name: pod-demo
  namespace: default
  labels:
    app: myapp
    tier: frontend
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
    imagePullPolicy: IfNotPresent
  nodeSelector:        # uses the MatchNodeSelector predicate: only nodes carrying the ssd label qualify
    disktype: ssd
Check the node labels:
[root@master schedule]# kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
master.test.k8s.com Ready master 2d v1.11.3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=master.test.k8s.com,node-role.kubernetes.io/master=
node1.test.k8s.com Ready <none> 2d v1.11.3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node1.test.k8s.com
node2.test.k8s.com Ready <none> 2d v1.11.3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node2.test.k8s.com
[root@master schedule]#
If we add the matching label to one of the nodes, the pod is guaranteed to be created on that node.
If no node carries the label, the pod stays in the Pending state.
[root@master schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
pod-demo 0/1 Pending 0 47s <none> <none> <none>
[root@master schedule]#
Scheduling cannot succeed: nodeSelector is a hard constraint, so its condition must be met.
describe shows why:
Events:
Type     Reason            Age                From               Message
----     ------            ----               ----               -------
Warning  FailedScheduling  18s (x25 over 1m)  default-scheduler  0/3 nodes are available: 3 node(s) didn't match node selector.
Unless we apply the label:
[root@master schedule]# kubectl label nodes node1.test.k8s.com disktype=ssd
node/node1.test.k8s.com labeled
Check again:
[root@master schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
pod-demo 1/1 Running 0 4m 10.244.1.153 node1.test.k8s.com <none>
[root@master schedule]#
NodeAffinity
Node affinity is very similar to nodeSelector: it can require Pods to run on particular nodes, or merely prefer them.
Usage:
[root@master schedule]# kubectl explain pods.spec.affinity
KIND: Pod
VERSION: v1
RESOURCE: affinity <Object>
[root@master schedule]# kubectl explain pods.spec.affinity.nodeAffinity | grep '<'
RESOURCE: nodeAffinity <Object>
preferredDuringSchedulingIgnoredDuringExecution <[]Object> # its value is a list of objects
requiredDuringSchedulingIgnoredDuringExecution <Object>
The two kinds of node affinity:
requiredDuringSchedulingIgnoredDuringExecution - hard affinity: the condition must be satisfied
preferredDuringSchedulingIgnoredDuringExecution - soft affinity: satisfy the condition if possible, otherwise run on another node
Define a hard affinity with requiredDuringSchedulingIgnoredDuringExecution.
It selects by zone: the pod is only created on nodes that carry this label.
apiVersion: v1
kind: Pod
metadata:
  name: pod-demo
  namespace: default
  labels:
    app: myapp
    tier: frontend
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
    imagePullPolicy: IfNotPresent
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone          # the pod is only created on nodes whose zone label has one of these values
            operator: In
            values:
            - foo
            - bar
When applied, the pod stays Pending: this is hard affinity and no node currently carries the zone label.
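For example, labelling a node with one of the expected zone values would let this pod schedule (the same kind of label is applied later in the anti-affinity example):
kubectl label nodes node1.test.k8s.com zone=foo    # after this, the hard-affinity pod can be scheduled to node1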
Node soft affinity
[root@master schedule]# kubectl explain pods.spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution
KIND: Pod
VERSION: v1
The relevant fields:
preference <Object> -required-
A node selector term, associated with the corresponding weight.
weight <integer> -required- # a weight plus a preference object (which nodes to favor)
Weight associated with matching the corresponding nodeSelectorTerm, in the
range 1-100.
[root@master schedule]# cat preferred-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-demo
  namespace: default
  labels:
    app: myapp
    tier: frontend
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
    imagePullPolicy: IfNotPresent
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: zone
            operator: In
            values:
            - foo
            - bar
        weight: 60
No label matches, but the pod still runs:
[root@master schedule]# kubectl get pods
NAME READY STATUS RESTARTS AGE
pod-demo 1/1 Running 0 1m
[root@master schedule]#
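To make the weight field concrete, here is a sketch (the zone/disktype values are only illustrative) of several preferences combined; the scheduler scores nodes by the weights of the terms they match and favours the higher total:
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 80                     # strongly preferred
      preference:
        matchExpressions:
        - key: zone
          operator: In
          values: ["foo"]
    - weight: 20                     # weaker fallback preference
      preference:
        matchExpressions:
        - key: disktype
          operator: In
          values: ["ssd"]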
Pod affinity
Unlike node affinity, pod affinity is not expressed against specific nodes; it places pods relative to other pods that are already running.
If we treat each node name as a distinct location, then obviously every node is different, so every node is its own unique location.
So another criterion is needed: use a label as the location. Nodes with the same label value count as the same location, and only then can we decide which placements satisfy affinity and the other scheduling attributes.
Pods also support both hard and soft affinity, as shown below:
[root@master schedule]# kubectl explain pods.spec.affinity.podAffinity
preferredDuringSchedulingIgnoredDuringExecution
requiredDuringSchedulingIgnoredDuringExecution
[root@master schedule]# kubectl explain pods.spec.affinity.podAffinity.preferredDuringSchedulingIgnoredDuringExecution
podAffinityTerm <Object> -required-
weight <integer> -required-
[root@master schedule]# kubectl explain pods.spec.affinity.podAffinity.requiredDuringSchedulingIgnoredDuringExecution
labelSelector
namespaces
topologyKey
Define the pods
The first resource:
apiVersion: v1
kind: Pod
metadata:
  name: pod-first
  namespace: default
  labels:
    app: myapp
    tier: frontend
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
Define multiple resources:
[root@master schedule]# cat pod-first.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-first
  namespace: default
  labels:
    app: myapp
    tier: frontend
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-second
  labels:
    app: db
    tier: db
spec:
  containers:
  - name: busybox
    image: busybox:latest
    imagePullPolicy: IfNotPresent
    command: ["/bin/sh","-c","sleep 360000"]
Every node automatically carries a label holding its hostname (kubernetes.io/hostname):
[root@master schedule]# kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
master.test.k8s.com Ready master 3d v1.11.3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=master.test.k8s.com,node-role.kubernetes.io/master=
node1.test.k8s.com Ready <none> 3d v1.11.3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/hostname=node1.test.k8s.com
node2.test.k8s.com Ready <none> 3d v1.11.3 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node2.test.k8s.com
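A quick way to confirm the per-node hostname label is to print it as its own column:
[root@master schedule]# kubectl get nodes -L kubernetes.io/hostname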
Next, define the affinity.
topologyKey defines what counts as one location: with kubernetes.io/hostname, pods on nodes with the same hostname value are considered co-located. Since every node's hostname is different, each node is a location of its own.
As follows:
[root@master schedule]# cat pod-first.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-first
  namespace: default
  labels:
    app: myapp
    tier: frontend
spec:
  containers:
  - name: myapp
    image: ikubernetes/myapp:v1
    imagePullPolicy: IfNotPresent
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-second
  labels:
    app: backend
    tier: db
spec:
  containers:
  - name: busybox
    image: busybox:latest
    imagePullPolicy: IfNotPresent
    command: ["/bin/sh","-c","sleep 360000"]
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:     # define the hard pod affinity
      - labelSelector:
          matchExpressions:                               # which pods to match, bound to the pod's labels
          - {key: app, operator: In, values: ["myapp"]}   # must be placed with a pod that carries the label app=myapp
        topologyKey: kubernetes.io/hostname
By default scheduling just balances load: the two nodes are equivalent, and priority functions (such as least resource usage) pick the node. With the pod affinity rule in place, pod-second follows pod-first onto the same node:
[root@master schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
pod-first 1/1 Running 0 2m 10.244.2.56 node2.test.k8s.com <none>
pod-second 1/1 Running 0 35s 10.244.2.57 node2.test.k8s.com <none>
[root@master schedule]#
[root@master schedule]# kubectl describe pod pod-second
Check how it was scheduled:
---- ------ ---- ---- -------
Normal Scheduled 3m default-scheduler Successfully assigned default/pod-second to node2.test.k8s.com # explicitly shows it was assigned to node2
Normal Pulled 3m kubelet, node2.test.k8s.com Container image "busybox:latest" already present on machine
Normal Created 3m kubelet, node2.test.k8s.com Created container
Normal Started 3m kubelet, node2.test.k8s.com Started container
With soft (preferred) affinity, the pod could have been scheduled onto another node, because the policy is not as strict.
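As a sketch, the soft form for pod-second would wrap the term in podAffinityTerm and give it a weight, so the scheduler merely favours co-locating with app=myapp instead of requiring it:
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 50
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - {key: app, operator: In, values: ["myapp"]}
        topologyKey: kubernetes.io/hostname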
Anti-affinity
The inverse: pods that match the selector must not share the same topology value, i.e. the two pods must end up in different locations.
Change the manifest as follows:
  affinity:
    podAntiAffinity:                                      # changed to anti-affinity
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - {key: app, operator: In, values: ["myapp"]}
        topologyKey: kubernetes.io/hostname
[root@master schedule]# kubectl apply -f pod-first.yaml
[root@master schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
pod-first 1/1 Running 0 13s 10.244.1.161 node1.test.k8s.com <none>
pod-second 1/1 Running 0 13s 10.244.2.58 node2.test.k8s.com <none>
Likewise, whichever node pod-first runs on, pod-second must not run on that same node.
Label the nodes:
[root@master schedule]# kubectl label nodes node1.test.k8s.com zone=foo
node/node1.test.k8s.com labeled
[root@master schedule]# kubectl label nodes node2.test.k8s.com zone=foo
node/node2.test.k8s.com labeled
[root@master schedule]#
Change the topologyKey.
Edit the manifest again:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - {key: app, operator: In, values: ["myapp"]}
        topologyKey: zone          # nodes with the same zone value count as one location to avoid
Re-create the pods:
[root@master schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
pod-first 1/1 Running 0 5s 10.244.2.59 node2.test.k8s.com <none>
pod-second 0/1 Pending 0 5s <none> <none> <none>
[root@master schedule]#
pod-second is Pending: at scheduling time its anti-affinity rule is evaluated against topologyKey: zone. Both nodes carry zone=foo, so they form a single location; pod-first (app=myapp) already runs in that location, and anti-affinity forbids pod-second from running anywhere in it, so no node qualifies.
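One way to unblock it, as a sketch: put the two nodes into different zones so they no longer count as the same location, after which pod-second can land on the node that pod-first does not occupy:
kubectl label nodes node1.test.k8s.com zone=bar --overwrite    # node1 and node2 are now different zone locations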
Taint and toleration scheduling
This reverses the direction of preference: with affinity the pod does the choosing and the node is passive; taints give the node the choice of which pods it allows to be scheduled onto it.
Defining taints
Taints are defined in node.spec:
[root@master schedule]# kubectl explain node.spec.taints
View the node's full description:
[root@master schedule]# kubectl get nodes node1.test.k8s.com -o yaml
Find the spec section:
spec:
podCIDR: 10.244.1.0/24
taints is a list of objects that defines the node's taints.
Defining a taint:
Key fields:
effect is required: it defines the repelling effect on Pods, i.e. what the node does with pods that cannot tolerate the taint. Its possible values are listed below.
[root@master schedule]# kubectl explain node.spec.taints.effect
KIND: Node
VERSION: v1
FIELD: effect <string>
DESCRIPTION:
Required. The effect of the taint on pods that do not tolerate the taint.
Valid effects are NoSchedule, PreferNoSchedule and NoExecute.
NoSchedule       - only affects the scheduling process; pods already on the node are not affected
PreferNoSchedule - a soft version of NoSchedule: the scheduler tries to avoid the node, but may still place a pod there if nothing else fits
NoExecute        - affects both scheduling and pods already running: pods that cannot tolerate the taint are evicted
When a node carries taints, whether a pod can be scheduled onto it is decided by first checking which of the taints are covered by the pod's tolerations.
For example, if the first taint happens to be matched by the first toleration, the remaining untolerated taints are then judged by their effect: NoSchedule only blocks new placement, while NoExecute also evicts the pod.
Check the taints on a node:
[root@master schedule]# kubectl describe node master.test.k8s.com | grep -i taints
Taints: node-role.kubernetes.io/master:NoSchedule # any pod that does not tolerate this taint cannot be scheduled here
The master carries the NoSchedule taint, which is exactly why pods are not scheduled onto the master.
So, however many pods exist, none of them ever land on the master, because they never define a toleration for this taint.
For example, look at kube-apiserver-master:
[root@master schedule]# kubectl describe pod kube-apiserver-master.test.k8s.com -n kube-system
The Tolerations line below shows its toleration: the empty key matches all taints, and the NoExecute effect means every NoExecute taint is tolerated, so this pod is never evicted by one.
Tolerations: :NoExecute # empty key = tolerate all taints that have the NoExecute effect
Check the kube-proxy pod:
[root@master schedule]# kubectl describe pod kube-proxy-lncxb -n kube-system
Its tolerations are listed explicitly:
Tolerations:
CriticalAddonsOnly # critical add-on
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
All of these influence the toleration check.
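By contrast, an ordinary pod that did want to run on the master would have to declare a toleration for that taint itself, roughly like this (a sketch, not taken from the cluster above):
tolerations:
- key: "node-role.kubernetes.io/master"
  operator: "Exists"          # tolerate the taint regardless of its value
  effect: "NoSchedule"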
Add a taint to a node
[root@master ~]# kubectl taint node node1.test.k8s.com node-type=production:NoSchedule # pods that cannot tolerate this taint will not be scheduled here
node/node1.test.k8s.com tainted
[root@master ~]# kubectl describe nodes node1.test.k8s.com | grep -i taint
Taints: node-type=production:NoSchedule
This puts the taint on node1; from now on, pods without a matching toleration will not be scheduled onto node1.
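For reference, the taint can be removed again later by appending a minus sign to the key (and optionally the effect):
kubectl taint node node1.test.k8s.com node-type:NoSchedule-    # remove this taint from node1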
Define the following Deployment; its 3 pods have no tolerations:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-deploy
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      release: cancary
  template:
    metadata:
      labels:
        app: myapp
        release: cancary
    spec:
      containers:
      - name: myapp
        image: ikubernetes/myapp:v2
        ports:
        - name: http
          containerPort: 80
They all end up on node2: they cannot tolerate node1's taint because the pod spec defines no tolerations.
[root@master daemonset]# kubectl apply -f deploy.yaml
[root@master daemonset]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
myapp-deploy-86c975f8b-7x6m7 1/1 Running 0 4s 10.244.2.62 node2.test.k8s.com <none>
myapp-deploy-86c975f8b-bk9c7 1/1 Running 0 4s 10.244.2.61 node2.test.k8s.com <none>
myapp-deploy-86c975f8b-rpd84 1/1 Running 0 4s 10.244.2.60 node2.test.k8s.com <none>
Now add a taint to node2 as well to see the effect; this time the effect is NoExecute.
As shown below, the existing pods are evicted and the replacement pods all stay Pending:
[root@master daemonset]# kubectl taint node node2.test.k8s.com node-type=production:NoExecute
node/node2.test.k8s.com tainted
[root@master daemonset]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
myapp-deploy-86c975f8b-4sd6c 0/1 Pending 0 11s <none> <none> <none>
myapp-deploy-86c975f8b-nf985 0/1 Pending 0 11s <none> <none> <none>
myapp-deploy-86c975f8b-vx2h2 0/1 Pending 0 11s <none> <none> <none>
[root@master daemonset]#
Add pod tolerations
We only need to declare which taints the pod tolerates; each toleration is an element of a list.
[root@master daemonset]# kubectl explain pods.spec.tolerations
KIND: Pod
VERSION: v1
RESOURCE: tolerations <[]Object>
tolerationSeconds is how long the pod keeps tolerating after a matching NoExecute taint appears: it waits this period before being evicted. If unset, the taint is tolerated forever; zero or negative values mean evict immediately.
tolerationSeconds <integer>
TolerationSeconds represents the period of time the toleration (which must
be of effect NoExecute, otherwise this field is ignored) tolerates the
taint. By default, it is not set, which means tolerate the taint forever
(do not evict). Zero and negative values will be treated as 0 (evict
immediately) by the system.
The operator field
operator <string>
Operator represents a key's relationship to the value. Valid operators are
Exists and Equal.
Defaults to Equal. Exists is equivalent to wildcard for
value, so that a pod can tolerate all taints of a particular category.
Exists: the taint key merely has to exist on the node; any value is tolerated.
Equal: the toleration value must exactly equal the taint's value (equality comparison).
Define the toleration, here for the node-type taint added earlier:
[root@master schedule]# cat deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-deploy
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      release: cancary
  template:
    metadata:
      labels:
        app: myapp
        release: cancary
    spec:
      containers:
      - name: myapp
        image: ikubernetes/myapp:v2
        ports:
        - name: http
          containerPort: 80
      tolerations:
      - key: "node-type"
        operator: "Equal"
        value: "production"
        effect: "NoExecute"
        tolerationSeconds: 300
[root@master schedule]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
myapp-deploy-595c744cf7-6cll6 1/1 Running 0 16s 10.244.2.65 node2.test.k8s.com <none>
myapp-deploy-595c744cf7-fwgqr 1/1 Running 0 16s 10.244.2.63 node2.test.k8s.com <none>
myapp-deploy-595c744cf7-hhdfq 1/1 Running 0 16s 10.244.2.64 node2.test.k8s.com <none>
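The pods stay on node2 because the toleration above matches the NoExecute taint there but not the NoSchedule taint on node1 (when an effect is specified it must match). As a sketch, an Exists toleration with an empty effect would cover both taints and let the replicas spread across both nodes again:
tolerations:
- key: "node-type"
  operator: "Exists"          # any value of node-type is tolerated
  effect: ""                  # empty effect matches every effect (NoSchedule and NoExecute)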