Kubernetes in Action: K8s Advanced Scheduling (Taints and Affinity)

  • Node selectors and node affinity rules decide which nodes a Pod can or cannot be scheduled onto by explicitly adding information to the Pod. Taints instead reject Pods from certain nodes by adding taint information to the node, without modifying existing Pods

I. Taints and Tolerations

1. Introducing Taints and Tolerations

(Figure: taint)

  • A taint consists of a key, a value, and an effect, written as <key>=<value>:<effect>
  • Show a node's taints:
$ kubectl describe node debian201 | grep -i taint
# key is node-role.kubernetes.io/control-plane, value is empty, effect is NoSchedule
Taints: node-role.kubernetes.io/control-plane:NoSchedule
  • The taint prevents Pods from being scheduled onto this node unless they tolerate it
  • Show a Pod's tolerations:
$ kubectl describe pod -n kube-system kube-proxy-rgpn2 | grep -i toleration -A 7
Tolerations:  op=Exists
              node.kubernetes.io/disk-pressure:NoSchedule op=Exists
              node.kubernetes.io/memory-pressure:NoSchedule op=Exists
              node.kubernetes.io/network-unavailable:NoSchedule op=Exists
              node.kubernetes.io/not-ready:NoExecute op=Exists
              node.kubernetes.io/pid-pressure:NoSchedule op=Exists
              node.kubernetes.io/unreachable:NoExecute op=Exists
              node.kubernetes.io/unschedulable:NoSchedule op=Exists
  • A node can have multiple taints, and a Pod can have multiple tolerations. A taint may carry only a key and an effect, without a value. A toleration can match a taint's value with the Equal operator, or match only the taint's key with the Exists operator (see the sketch after this list)
  • Taint effects:
    • NoSchedule: if a Pod does not tolerate the taint, it cannot be scheduled onto the node
    • PreferNoSchedule: a soft version of NoSchedule; the scheduler tries to keep the Pod off the node but may still place it there in special cases
    • NoExecute: the first two effects apply only at scheduling time, while NoExecute also affects Pods already running on the node; when such a taint is added, running Pods that do not tolerate it are evicted
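  • For example, a minimal sketch of a key-only taint plus a matching Exists toleration (the key dedicated is hypothetical):
$ kubectl taint node debian202 dedicated:NoSchedule   # key only, no value
# in the Pod spec:
tolerations:
- key: dedicated
  operator: Exists    # matches the key regardless of the taint's value
  effect: NoSchedule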

2. Adding a Custom Taint

  • Example: isolating the production environment
$ kubectl taint node debian202 type=prod:NoSchedule
node/debian202 tainted
$ kubectl create deployment test --image busybox --replicas 5 -- sleep 99999
$ kubectl get pod -o wide | grep test
test-59df75d596-4hr87 1/1 Running 0 72s 10.244.7.101 debian203 <none> <none>
test-59df75d596-cl4mh 1/1 Running 0 72s 10.244.7.102 debian203 <none> <none>
test-59df75d596-fh7zs 1/1 Running 0 72s 10.244.7.103 debian203 <none> <none>
test-59df75d596-p9dn6 1/1 Running 0 72s 10.244.7.100 debian203 <none> <none>
test-59df75d596-qj6jp 1/1 Running 0 72s 10.244.7.99 debian203 <none> <none>
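  • All five replicas land on debian203: debian201 is tainted as control-plane and debian202 now carries type=prod:NoSchedule. To remove the taint later, append - to the taint specification:
$ kubectl taint node debian202 type=prod:NoSchedule-
node/debian202 untainted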

3. Adding Tolerations to a Pod

$ vim deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test
  name: test
spec:
  replicas: 5
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - command:
        - sleep
        - "99999"
        image: busybox
        name: busybox
      tolerations: # *
      - effect: NoSchedule
        key: type
        value: prod
        operator: Equal
$ kubectl apply -f deploy.yaml
$ kubectl get pod -o wide | grep test
test-5c6d4745bb-47gcp 1/1 Running 0 50s 10.244.64.212 debian202 <none> <none>
test-5c6d4745bb-724fc 1/1 Running 0 50s 10.244.7.114 debian203 <none> <none>
test-5c6d4745bb-bhqgb 1/1 Running 0 50s 10.244.64.247 debian202 <none> <none>
test-5c6d4745bb-pcjmz 1/1 Running 0 50s 10.244.7.116 debian203 <none> <none>
test-5c6d4745bb-q46c2 1/1 Running 0 50s 10.244.64.252 debian202 <none> <none>
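  • A toleration only allows scheduling onto the tainted node; it does not require it, which is why the replicas above spread across debian202 and debian203. A common pattern for truly dedicated nodes combines the toleration with a node label and a nodeSelector (a sketch; the label type=prod is an assumption and must be added separately):
$ kubectl label node debian202 type=prod
# then in the Pod template spec, alongside tolerations:
      nodeSelector:
        type: prod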

4. Rescheduling Wait Time After Node Failure

  • A NoExecute toleration can also set how long the control plane waits before rescheduling a Pod whose node has become unreachable or not ready
  • These two tolerations are added to every Pod automatically:
$ kubectl get pod backend -o yaml
...
  tolerations:
  - effect: NoExecute # tolerate the node being NotReady for 300s
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute # tolerate the node being Unreachable for 300s
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
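  • To fail over faster, override the defaults by declaring the toleration explicitly in the Pod spec; a minimal sketch with a 60-second window:
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 60 # evict and reschedule after 60s instead of 300s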

II. Node Affinity for Pods

  • Taints keep Pods away from particular nodes; affinity makes Pods schedule only onto particular nodes

1. Required Node Affinity

  • Scheduling a Pod onto specific nodes with a node selector:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeSelector:
    disktype: ssd
...
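  • nodeSelector matches node labels, so the target node needs the label first (debian202 is just an example):
$ kubectl label node debian202 disktype=ssd
node/debian202 labeled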
  • Scheduling a Pod onto specific nodes with node affinity:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
...
  • requiredDuringScheduling... means the Pod can only be scheduled onto nodes that carry the specified labels
  • ...IgnoredDuringExecution means that removing the label from a node does not affect Pods already running on it
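  • Besides In, matchExpressions also accepts NotIn, Exists, DoesNotExist, Gt, and Lt. A sketch that only requires the label to be present:
        - matchExpressions:
          - key: disktype
            operator: Exists # match any node that has a disktype label, whatever its value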

2. Preferred Node Affinity

  • Scenario: prefer certain nodes when scheduling a Pod
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      - weight: 20
        preference:
          matchExpressions:
          - key: cpucore
            operator: In
            values:
            - "8" # label values are strings, so the number must be quoted
...
  • preferredDuringScheduling... means the scheduler favors nodes with the specified labels when placing the Pod
  • Besides the node affinity priority function, the scheduler weighs other priority functions to decide where a Pod lands. One of them spreads Pods that belong to the same ReplicaSet or Service across nodes, so that a single node failure does not take the whole service down (see the sketch below)
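  • This spreading can also be requested explicitly through topologySpreadConstraints in the Pod spec; a minimal sketch (app: test reuses the Deployment label from earlier):
# in the Pod (template) spec:
      topologySpreadConstraints:
      - maxSkew: 1 # per-node Pod counts may differ by at most 1
        topologyKey: kubernetes.io/hostname # spread across nodes
        whenUnsatisfiable: ScheduleAnyway # soft; DoNotSchedule makes it a hard rule
        labelSelector:
          matchLabels:
            app: test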

III. Inter-Pod Affinity

  • The affinity rules above only relate Pods to nodes; affinity can also be declared between Pods themselves, for example to keep frontend and backend Pods on the same node whenever possible

1. Deploying Multiple Pods on the Same Node

  • Deploy the backend Pod:
$ kubectl run backend -l app=backend --image busybox -- sleep 999999
  • Deploy the frontend Pods with Pod affinity: they must (requiredDuringScheduling) be placed onto a node (topologyKey: kubernetes.io/hostname) that runs a Pod matching the label selector (app: backend)
  • The label selector matches Pods in the same namespace by default; a namespaces field can be added at the same level as labelSelector to select Pods from other namespaces (see the sketch after the manifest)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend-with-affinity
spec:
  selector:
    matchLabels:
      app: frontend
  replicas: 5
  template:
    metadata:
      labels:
        app: frontend
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname # *
            labelSelector:
              matchLabels:
                app: backend
      containers:
      - name: busybox
        image: busybox
        command: ["sleep", "99999"]
  • Verify:
$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
backend 1/1 Running 0 3m 10.244.64.254 debian202 <none> <none>
frontend-with-affinity-75c7b66445-ctc5l 1/1 Running 0 86s 10.244.64.205 debian202 <none> <none>
frontend-with-affinity-75c7b66445-f274c 1/1 Running 0 86s 10.244.64.225 debian202 <none> <none>
frontend-with-affinity-75c7b66445-nqc4q 1/1 Running 0 86s 10.244.64.223 debian202 <none> <none>
frontend-with-affinity-75c7b66445-vzsdc 1/1 Running 0 87s 10.244.64.226 debian202 <none> <none>
frontend-with-affinity-75c7b66445-wgpbd 1/1 Running 0 86s 10.244.64.193 debian202 <none> <none>
  • If the backend Pod is deleted and recreated, it is still scheduled onto debian202 (the scheduler also honors the affinity rules of Pods that are already running)

2. Deploying Pods in the Same Rack, Zone, or Region

  • Scenario: the frontend Pods should not all land on one node, but should still stay close to the backend Pod, for example in the same zone
  • If the nodes run in different zones, set topologyKey to topology.kubernetes.io/zone so frontend and backend Pods stay in the same zone; across regions, set it to topology.kubernetes.io/region
  • topologyKey expresses how close the scheduled Pod must be to the matched Pod, and its value can be custom. For example, to keep Pods in the same rack, label every node with rack=<rack number> and set topologyKey to rack in the Pod spec (see the sketch after the figure). Suppose the label selector matches a backend Pod running on Node 12, whose rack label is rack2; the scheduler will then choose only among nodes labeled rack=rack2
  • When scheduling such a Pod, the scheduler first finds the Pods matching its podAffinity label selector, then looks up the nodes they run on, specifically those nodes' topologyKey label values, and prefers nodes whose label values match

(Figure: affinity)
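  • A sketch of the rack example (node names and rack numbers are assumptions):
$ kubectl label node debian202 rack=rack2
$ kubectl label node debian203 rack=rack2
# in the Pod spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: rack # co-locate in the same rack, not necessarily the same node
            labelSelector:
              matchLabels:
                app: backend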

3. Preferred Inter-Pod Affinity

  • Scenario: prefer to place frontend Pods on the same node as the backend Pod, but allow other nodes when that is not possible
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend-with-affinity
spec:
  selector:
    matchLabels:
      app: frontend
  replicas: 5
  template:
    metadata:
      labels:
        app: frontend
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution: # *
          - weight: 80
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: backend
      containers:
      - name: busybox
        image: busybox
        command: ["sleep", "99999"]
  • Verify:
$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
backend 1/1 Running 0 3m54s 10.244.7.65 debian203 <none> <none>
frontend-with-affinity-78458c8788-7srg6 1/1 Running 0 98s 10.244.7.75 debian203 <none> <none>
frontend-with-affinity-78458c8788-dgfq4 1/1 Running 0 98s 10.244.64.221 debian202 <none> <none>
frontend-with-affinity-78458c8788-dzg2h 1/1 Running 0 98s 10.244.7.74 debian203 <none> <none>
frontend-with-affinity-78458c8788-gjttb 1/1 Running 0 98s 10.244.7.116 debian203 <none> <none>
frontend-with-affinity-78458c8788-nvkp2 1/1 Running 0 98s 10.244.7.73 debian203 <none> <none>
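  • Four replicas land next to the backend Pod on debian203, yet one still ends up on debian202: the affinity is only a preference, and the scheduler also runs spreading priority functions that favor distributing Pods of the same Deployment across nodes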

4. Inter-Pod Anti-Affinity

  • Scenario: keep two Pods off the same node, for example because they would degrade each other's performance, or spread a group of Pods across zones to keep the service highly available
  • Simply replace podAffinity with podAntiAffinity; everything else works as before
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend-with-antiaffinity
spec:
  selector:
    matchLabels:
      app: frontend
  replicas: 5
  template:
    metadata:
      labels:
        app: frontend
    spec:
      affinity:
        podAntiAffinity: # *
          requiredDuringSchedulingIgnoredDuringExecution: # or preferredDuringScheduling...
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: frontend
      containers:
      ...
  • Verify:
$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
frontend-with-antiaffinity-85964ff577-7jgtk 1/1 Running 0 30s 10.244.64.252 debian202 <none> <none>
frontend-with-antiaffinity-85964ff577-gnkts 0/1 Pending 0 30s <none> <none> <none> <none>
frontend-with-antiaffinity-85964ff577-kc4jt 1/1 Running 0 30s 10.244.7.80 debian203 <none> <none>
frontend-with-antiaffinity-85964ff577-lpfvn 0/1 Pending 0 30s <none> <none> <none> <none>
frontend-with-antiaffinity-85964ff577-s8wkc 0/1 Pending 0 30s <none> <none> <none> <none>
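  • Three replicas stay Pending: the rule is a hard requirement and the selector matches the frontend Pods themselves, so at most one replica can run per node, and only debian202 and debian203 are schedulable (debian201 carries the control-plane taint). With the preferred variant, all five would run, spread as evenly as possible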