kube-scheduler updates pod status mistakenly during preemption #126643
Closed
Description
What happened?
When a pod with qosClass "BestEffort" is preempted by a higher-priority pod whose qosClass is "Burstable", the victim pod's qosClass gets updated to "Burstable", because the scheduler patches the victim's status with content taken from the higher-priority pod before deleting the victim pod. The pod is then stuck in the Terminating state after deletion because of this check. Normally, kubelet would reconcile the status back almost immediately; however, the pending deletion request prevents kubelet from doing so.
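For illustration, here is a minimal Go sketch of the kind of mix-up described above. It is built on plain client-go rather than the actual kube-scheduler source; the function name, the DisruptionTarget condition wiring, and the patch shape are assumptions for the sketch, not the upstream implementation.

```go
package preemptionsketch

import (
	"context"
	"encoding/json"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// prepareVictim is a hypothetical, simplified version of the suspected flow:
// the patched status is built from the preemptor instead of the victim, so the
// victim's status fields (including qosClass) are overwritten right before the
// delete call is issued.
func prepareVictim(ctx context.Context, cs kubernetes.Interface, preemptor, victim *v1.Pod) error {
	// Suspected mistake: the base status is copied from the preemptor...
	newStatus := preemptor.Status.DeepCopy()
	newStatus.Conditions = append(newStatus.Conditions, v1.PodCondition{
		Type:    v1.DisruptionTarget,
		Status:  v1.ConditionTrue,
		Reason:  "PreemptionByScheduler",
		Message: fmt.Sprintf("%s: preempting to accommodate a higher priority pod", preemptor.Spec.SchedulerName),
	})

	// ...but merged onto the victim's status subresource, carrying the
	// preemptor's qosClass ("Burstable") onto the BestEffort victim.
	patch, err := json.Marshal(map[string]interface{}{"status": newStatus})
	if err != nil {
		return err
	}
	if _, err := cs.CoreV1().Pods(victim.Namespace).Patch(ctx, victim.Name,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{}, "status"); err != nil {
		return err
	}

	// The victim is deleted while its status still reports the wrong qosClass;
	// the pending deletion then blocks kubelet from reconciling the status back,
	// so the pod hangs in Terminating.
	return cs.CoreV1().Pods(victim.Namespace).Delete(ctx, victim.Name, metav1.DeleteOptions{})
}
```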
What did you expect to happen?
The victim pod's qosClass shouldn't be changed, and the pod should be deleted successfully.
How can we reproduce it (as minimally and precisely as possible)?
Reproduced on v1.29.4.
Steps to reproduce:
- Scale the cluster down to a single node.
- Create the victim pod (here I use a CronJob to create a pod with no resource requests/limits, so it is BestEffort).
- Create a higher PriorityClass (p1) with a preemption policy that allows preemption ("PreemptLowerPriority").
- Create extra pods (via a Deployment) with priorityClassName "p1" so that the total pod count exceeds the node's maximum pod capacity; these pods also have cpu/memory requests set, so they are Burstable.
- The victim pod gets preempted and is then stuck in the "Terminating" state.
cronjob.txt
deployment.txt
pc.txt
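The attachment contents aren't reproduced here; for illustration, a rough client-go sketch of the same setup on an assumed single-node cluster could look like the following. The names, image, replica count, and resource values are assumptions (and the victim is created directly as a Pod rather than via a CronJob), so treat this as a sketch of the steps above, not the attached manifests.

```go
package main

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	schedulingv1 "k8s.io/api/scheduling/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig location (assumption).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	ctx := context.TODO()

	// 1. A BestEffort victim pod: no resource requests or limits.
	victim := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "victim", Namespace: "default"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "pause", Image: "registry.k8s.io/pause:3.9"}},
		},
	}

	// 2. A higher PriorityClass (p1) that is allowed to preempt.
	preempt := corev1.PreemptLowerPriority
	pc := &schedulingv1.PriorityClass{
		ObjectMeta:       metav1.ObjectMeta{Name: "p1"},
		Value:            1000000,
		PreemptionPolicy: &preempt,
	}

	// 3. A Deployment of Burstable pods (cpu/memory requests set) using p1,
	//    with enough replicas to exceed the single node's pod capacity.
	replicas := int32(150)
	labels := map[string]string{"app": "pressure"}
	dep := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "pressure", Namespace: "default"},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					PriorityClassName: "p1",
					Containers: []corev1.Container{{
						Name:  "pause",
						Image: "registry.k8s.io/pause:3.9",
						Resources: corev1.ResourceRequirements{
							Requests: corev1.ResourceList{
								corev1.ResourceCPU:    resource.MustParse("10m"),
								corev1.ResourceMemory: resource.MustParse("16Mi"),
							},
						},
					}},
				},
			},
		},
	}

	if _, err := cs.CoreV1().Pods("default").Create(ctx, victim, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	if _, err := cs.SchedulingV1().PriorityClasses().Create(ctx, pc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	if _, err := cs.AppsV1().Deployments("default").Create(ctx, dep, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("created BestEffort victim pod, PriorityClass p1, and pressure Deployment")
}
```

Once the Deployment's pods pile up on the single node, the BestEffort victim should be preempted and, on affected versions, end up stuck in Terminating with its qosClass flipped to "Burstable".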
Anything else we need to know?
No response
Kubernetes version
$ kubectl version
Client Version: v1.29.4
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4
Cloud provider
Azure
OS version
No response