
[Bug] Unexpected scheduling results due to mismatch between the inter-pod affinity rule implementation and the doc #129319

Open
scheduler-tester opened this issue Dec 20, 2024 · 1 comment
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@scheduler-tester

What happened?

There is a special rule in the scheduler's pod affinity plugin for scheduling a group of pods with inter-pod affinity to themselves. However, the current implementation matches neither the documentation nor the code comment, which causes unexpected scheduling results.

[doc]

Scheduling a group of pods with inter-pod affinity to themselves

If the current pod being scheduled is the first in a series that have affinity to themselves, it is allowed to be scheduled if it passes all other affinity checks. This is determined by verifying that no other pod in the cluster matches the namespace and selector of this pod, that the pod matches its own terms, and the chosen node matches all requested topologies. This ensures that there will not be a deadlock even if all the pods have inter-pod affinity specified.

The Inconsistency

Both the current version of the documentation and the current version of the code comment say that "no other pod in the cluster matches the namespace and selector of this pod," which implies that the scheduler checks all pods.

However, after investigating the implementation, we found that the scheduler actually checks only the pods on nodes that match at least one of the requested topology keys, not all pods in the cluster. (For more details, please see "Anything else we need to know?".)

As a result, the current implementation leads to unexpected scheduling results.

The Original Intent

We investigated the history of this special rule and found the following:

  1. At the very beginning, the code and the comment were consistent: the code checked, and the comment stated, that the scheduler would consider all pods in the cluster.

  2. Later, previous developers introduced a mechanism for pre-calculating some data structures and using them to filter pod affinity. The newly added code became inconsistent with the comment:

    • The code checked only the pods on nodes with at least one matched topology key.
    • The comment, however, still stated all pods in the cluster.
      At this point, the scheduler fell back to the original code path if the pre-calculated data didn't exist. Therefore, the scheduler had two routes simultaneously, one consistent with the comment and the other not.
  3. Finally, previous developers removed both the fallback logic and the original code. The current implementation uses only the pre-calculated data structures, which is inconsistent with the comment.

What did you expect to happen?

Based on the history of this rule, we assume the original intent was to check all pods in the cluster. Because of the newly added data structure, the implementation became incorrect: it now checks only the pods on nodes with at least one matched topology key.

However, we think this still needs the developers' help to confirm the original / ideal intent of this rule.

  1. If the intent is "check all pods", the implementation should be fixed (a rough sketch of such a check is included below).
  2. If the intent is "only check pods on nodes with at least one matched topology key", then the comment in kubernetes/kubernetes and the documentation in kubernetes/website should be updated.
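
For reference, below is a minimal, hypothetical sketch of what a cluster-wide check could look like if option 1 turns out to be the intent. It is not the plugin's actual code: the helper name anyExistingPodMatchesAffinity is made up, it only handles same-namespace terms, and it ignores term.Namespaces / term.NamespaceSelector as well as the "pod matches its own terms" part of the rule.

// Hypothetical sketch, not the actual plugin code: a cluster-wide version of the
// special-rule check ("does any existing pod match the namespace and label
// selector of the incoming pod's affinity terms?"), iterating over all pods
// regardless of which node, and which topology domain, they sit on.
package affinitysketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

func anyExistingPodMatchesAffinity(incoming *v1.Pod, allPods []*v1.Pod) bool {
	if incoming.Spec.Affinity == nil || incoming.Spec.Affinity.PodAffinity == nil {
		return false
	}
	terms := incoming.Spec.Affinity.PodAffinity.RequiredDuringSchedulingIgnoredDuringExecution
	for _, existing := range allPods {
		// Simplification: only same-namespace terms are considered here; the real
		// plugin also evaluates term.Namespaces and term.NamespaceSelector.
		if existing.Namespace != incoming.Namespace {
			continue
		}
		for _, term := range terms {
			sel, err := metav1.LabelSelectorAsSelector(term.LabelSelector)
			if err != nil {
				continue
			}
			if sel.Matches(labels.Set(existing.Labels)) {
				return true
			}
		}
	}
	return false
}

Whether such a check should require an existing pod to match all of the incoming pod's terms (as podMatchesAllAffinityTerms does today) or just one of them is part of the same intent question for the maintainers.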

How can we reproduce it (as minimally and precisely as possible)?

Steps:

The incoming pod's affinity selector matches both the incoming pod itself and the existing pod.
The incoming pod's affinity has 2 terms with 2 different topology keys:

  • node-0 has 2 of these topology keys
  • node-1 has 1 of these topology keys
  • node-2 has 0 of these topology keys
  1. Add the 3 nodes using the YAML file below.
    kubectl apply -f nodes.yaml
  2. Add the existing pod first; it will land on node-2.
    kubectl apply -f existing_pod.yaml
  3. Add the incoming pod; it can be scheduled even though a pod matching these selectors already exists in the cluster, because that pod is on node-2, which does not match any of the topology keys.
    kubectl apply -f incoming_pod.yaml
  4. Delete all pods.
    kubectl delete pod --all
  5. Change the existing pod's nodeSelector to node-name: node-1, then add the existing pod again; it will land on node-1.
    kubectl apply -f existing_pod.yaml
  6. Add the incoming pod; it will not be scheduled, because this time the pod matching these selectors is on node-1, and node-1 matches 1 topology key.
    kubectl apply -f incoming_pod.yaml
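
(At each step, kubectl get pods -o wide can be used to confirm which node each pod actually landed on, and that the incoming pod stays Pending in step 6.)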

Nodes:

apiVersion: v1
kind: Node
metadata:
  labels:
    label1: value1
    label2: value2
    node-name: node-0
  name: node-0
  namespace: default
status:
  allocatable:
    cpu: '10000'
    memory: 64T
    pods: '100000'
  capacity:
    cpu: '10000'
    memory: 64T
    pods: '10000'
---
apiVersion: v1
kind: Node
metadata:
  labels:
    label2: value2
    label3: value3
    node-name: node-1
  name: node-1
  namespace: default
status:
  allocatable:
    cpu: '10000'
    memory: 64T
    pods: '100000'
  capacity:
    cpu: '10000'
    memory: 64T
    pods: '10000'
---
apiVersion: v1
kind: Node
metadata:
  labels:
    label3: value3
    node-name: node-2
  name: node-2
  namespace: default
status:
  allocatable:
    cpu: '10000'
    memory: 64T
    pods: '100000'
  capacity:
    cpu: '10000'
    memory: 64T
    pods: '10000'

Existing pod:

apiVersion: v1
kind: Pod
metadata:
  labels:
    pod-label1: value1
  name: pod-0
  namespace: default
spec:
  containers:
  - image: nginx
    name: test-container
  nodeSelector:
    node-name: node-2

Incoming pod:

apiVersion: v1
kind: Pod
metadata:
  labels:
    pod-label1: value1
  name: test-pod
  namespace: default
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: pod-label1
            operator: In
            values:
            - value1
        topologyKey: label1
      - labelSelector:
          matchExpressions:
            - key: pod-label1
              operator: In
              values:
                - value1
        topologyKey: label2
  containers:
  - image: nginx
    name: test-container

Anything else we need to know?

Why does the current implementation check only the pods on nodes with at least one matched topology key?

state.affinityCounts is a map from a "topology key-value pair" to the "number of pods in that topology domain that match the namespace and selector." Below is the code related to this rule:

// Check the special rule: no other pod in the cluster matches the namespace and selector of this pod.
if len(state.affinityCounts) == 0 &&...

// update state.affinityCounts
func (m topologyToMatchedTermCount) updateWithAffinityTerms(
	terms []framework.AffinityTerm, pod *v1.Pod, node *v1.Node, value int64) {
	if podMatchesAllAffinityTerms(terms, pod) {
		for _, t := range terms {
			m.update(node, t.TopologyKey, value)
		}
	}
}

func (m topologyToMatchedTermCount) update(node *v1.Node, tk string, value int64) {
	if tv, ok := node.Labels[tk]; ok {
		pair := topologyPair{key: tk, value: tv}
		m[pair] += value
		// value could be negative, hence we delete the entry if it is down to zero.
		if m[pair] == 0 {
			delete(m, pair)
		}
	}
}

In the last two functions, we can see that only an existing pod on a node that carries at least one of the topology keys required by the incoming pod is counted in state.affinityCounts.
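
To tie this back to the reproduction above (this is our reading of the repro, not output from the scheduler):

// Existing pod on node-2 (labels: label3, node-name):
//   node-2 carries neither topology key "label1" nor "label2", so nothing is
//   recorded and len(state.affinityCounts) == 0
//   -> the special rule applies and the incoming pod is schedulable.
//
// Existing pod on node-1 (labels: label2=value2, label3, node-name):
//   only the term with topologyKey "label2" is recorded:
//     state.affinityCounts == map[topologyPair]int64{{key: "label2", value: "value2"}: 1}
//   -> len(state.affinityCounts) != 0, so the special rule does not apply, and the
//      incoming pod stays Pending because no node has an existing matching pod in
//      its "label1" topology domain.
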
/sig scheduling

Kubernetes version

1.32.0

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@scheduler-tester scheduler-tester added the kind/bug Categorizes issue or PR as related to a bug. label Dec 20, 2024
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Dec 20, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Dec 20, 2024