[Bug] Unexpected scheduling results due to mismatch between the inter-pod affinity rule implementation and the doc #129319
Labels: kind/bug, needs-triage, sig/scheduling
What happened?
There is a special rule in the scheduler's pod affinity plugin for scheduling a group of pods with inter-pod affinity to themselves. However, the current implementation does not match the documentation and the code comment, causing unexpected scheduling results.
The Inconsistency
The current version of the documentation and the current version of the code comment both state that "no other pod in the cluster matches the namespace and selector of this pod," which implies that the scheduler checks all pods in the cluster.
However, after investigating the implementation, we found that the scheduler actually checks only the pods on nodes with at least one matching topology key, not all pods in the cluster. (For more details, please see "Anything else we need to know?".)
As a result, the current implementation leads to unexpected scheduling results.
The Original Intent
We investigated the history of this special rule, and it shows the following:
At the very beginning, the code and the comment were consistent: the code checked all pods in the cluster, and the comment stated the same.
Later, previous developers introduced a mechanism that pre-calculates some data structures and uses them to filter on pod affinity. The newly added code became inconsistent with the comment.
At this point, the scheduler still had fallback logic to the original code if the pre-calculated data didn't exist. Therefore, the scheduler had two routes simultaneously: one consistent with the comment and the other inconsistent.
Finally, previous developers removed both the fallback logic and the original code. The current implementation uses only the pre-calculated data structures and is therefore inconsistent with the comment.
What did you expect to happen?
Based on the history of this rule, we assume the original intent was to check all pods in the cluster. Because of the newly added data structures, the implementation became incorrect: it checks only the pods on nodes with at least one matching topology key.
However, we think this still needs developers' help to confirm the original / ideal intent of this rule.
How can we reproduce it (as minimally and precisely as possible)?
Steps:
The incoming pod's pod affinity selector matches both the incoming pod itself and the existing pod. The incoming pod's pod affinity has 2 terms with 2 different topology keys (a rough sketch is shown after the manifests below).
1. kubectl apply -f nodes.yaml
2. kubectl apply -f existing_pod.yaml
3. kubectl apply -f incoming_pod.yaml
4. kubectl delete pod --all
5. Change the existing pod's nodeSelector into node-name: node-1, then add the existing pod again (this time it will land on node-1): kubectl apply -f existing_pod.yaml
6. kubectl apply -f incoming_pod.yaml
Nodes:
Existing pod:
Incoming pod:
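As a rough illustration of the shape of the incoming pod described in the steps, here is a small Go sketch (using the k8s.io/api types instead of YAML) that builds a pod with two required pod affinity terms on two different topology keys, both using a selector that also matches the pod's own labels. All names, labels, the image, and the topology keys below are hypothetical placeholders, not the values from the manifests above:

```go
package main

// Requires the k8s.io/api and k8s.io/apimachinery modules.
import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical labels; the point is only that the selector below matches
	// the incoming pod's own labels (and the existing pod's labels as well).
	labels := map[string]string{"app": "demo"}
	selector := &metav1.LabelSelector{MatchLabels: labels}

	incoming := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "incoming-pod",
			Labels: labels,
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "app", Image: "registry.k8s.io/pause:3.9"}},
			Affinity: &corev1.Affinity{
				PodAffinity: &corev1.PodAffinity{
					// Two required terms with two different topology keys,
					// both using the same self-matching selector.
					RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
						{LabelSelector: selector, TopologyKey: "topology-key-1"},
						{LabelSelector: selector, TopologyKey: "topology-key-2"},
					},
				},
			},
		},
	}

	fmt.Printf("%d required pod affinity terms\n",
		len(incoming.Spec.Affinity.PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution))
}
```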
Anything else we need to know?
Why does the current implementation check pods on nodes with at least one matching topology key?
The state.affinityCounts is a map from topology key-value pairs to the number of pods in that topology domain that match the namespace and selector. In the code that builds this map and in the code that consumes it, only an existing pod on a node with at least one topology key required by the incoming pod is counted in state.affinityCounts.
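Below is a minimal, self-contained sketch of that behavior. It is not the actual scheduler source; the type and function names (topologyPair, affinityTerm, countExistingPod, specialRuleFires) are simplified stand-ins. It only illustrates how a matching existing pod contributes nothing to the counts when its node lacks the required topology keys, so the "no other pod matches" shortcut still fires:

```go
package main

import "fmt"

// topologyPair identifies a topology domain, e.g. {"topology-key-1", "value-a"}.
type topologyPair struct{ key, value string }

// affinityTerm is a simplified stand-in for one required pod affinity term.
type affinityTerm struct {
	topologyKey string
	// matches reports whether a pod's labels satisfy the term's selector.
	matches func(podLabels map[string]string) bool
}

// countExistingPod adds an existing pod to the counts map. A term contributes
// a count only when the pod's node actually carries the term's topology key;
// a matching pod on a node without any required topology key is skipped.
func countExistingPod(counts map[topologyPair]int64, terms []affinityTerm,
	existingPodLabels, nodeLabels map[string]string) {
	for _, term := range terms {
		if !term.matches(existingPodLabels) {
			continue
		}
		if value, ok := nodeLabels[term.topologyKey]; ok {
			counts[topologyPair{key: term.topologyKey, value: value}]++
		}
	}
}

// specialRuleFires mirrors the shortcut described by the comment: allow the
// incoming pod when it matches its own terms and the counts map is empty.
// Because the map only sees pods on nodes with a required topology key,
// an empty map does not mean "no other pod in the cluster matches".
func specialRuleFires(counts map[topologyPair]int64, incomingMatchesOwnTerms bool) bool {
	return len(counts) == 0 && incomingMatchesOwnTerms
}

func main() {
	selector := func(labels map[string]string) bool { return labels["app"] == "demo" }
	terms := []affinityTerm{
		{topologyKey: "topology-key-1", matches: selector},
		{topologyKey: "topology-key-2", matches: selector},
	}

	counts := map[topologyPair]int64{}
	// An existing pod matches the selector, but its node has neither topology key,
	// so it never shows up in the counts.
	countExistingPod(counts, terms, map[string]string{"app": "demo"}, map[string]string{})

	// Prints "true": the shortcut fires even though a matching pod exists in the cluster.
	fmt.Println(specialRuleFires(counts, true))
}
```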
/sig scheduling
Kubernetes version
1.32.0
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)