cleanup: removes unnomination logic for lower preemptors on the same node #128067

Draft
wants to merge 2 commits into
base: master
from

Conversation

sanposhiho
Member

@sanposhiho sanposhiho commented Oct 14, 2024

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

When the current scheduler runs a preemption for node1, it removes .status.nominatedNodeName from all the lower-priority Pods nominated to that node, on the assumption that they no longer fit the node because of the new higher-priority preemptor.

However, this implementation does not currently work as intended. The code comment says:

So, we should remove their nomination. Removing their
nomination updates these pods and moves them to the active queue. It
lets scheduler find another place for them.

But this is not true. An update to a Pod's status is ignored and doesn't trigger requeueing. So the current implementation carries the risk that Pods stripped of their nominatedNodeName won't be retried.
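To make the problem concrete, here is a minimal toy model, not the scheduler's actual code. It assumes the queue decides whether an update is worth a requeue by comparing Pods with their status blanked out; the `Pod` type and `specChanged` helper are hypothetical names introduced only for this sketch.

```go
package main

import (
	"fmt"
	"reflect"
)

// Pod is a toy stand-in for v1.Pod; the fields are illustrative only.
type Pod struct {
	Name              string
	Labels            map[string]string
	NominatedNodeName string // stands in for .status.nominatedNodeName
}

// specChanged reports whether anything other than status differs.
// Assumption: status is ignored when deciding whether an update matters.
func specChanged(oldPod, newPod Pod) bool {
	oldPod.NominatedNodeName = ""
	newPod.NominatedNodeName = ""
	return !reflect.DeepEqual(oldPod, newPod)
}

func main() {
	before := Pod{Name: "low-priority", NominatedNodeName: "node1"}
	after := before
	after.NominatedNodeName = "" // nomination cleared for a higher-priority preemptor

	// A status-only change is not seen as an update, so no requeue happens.
	fmt.Println("requeue after unnomination:", specChanged(before, after))
}
```

Under that assumption, clearing the nomination alone never moves the Pod back to the active queue, which matches the risk described above.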

So, we have two options here:

  1. Requeue all those Pods to activeQ somehow. There are two ways to do this:
    a. Use PodsToActivate. This requires changing how PodsToActivate is used, because it currently takes effect only when the scheduling cycle finishes successfully, meaning it is not handled in the PostFilter path (ref).
    b. Add a new function to framework.Handle that moves Pods to activeQ, similar to PodsToActivate.
  2. Just remove this logic. This is safe because, when the scheduler finds that a Pod doesn't fit on its .status.nominatedNodeName, it goes on to check the other nodes as well (ref).
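The fallback that option 2 relies on can be sketched with a toy model. This is a simplified illustration under stated assumptions, not the scheduler's real code: `Node`, `Pod`, `fits`, and `feasibleNodes` are hypothetical names, and only CPU is modelled.

```go
package main

import "fmt"

// Node and Pod are toy types; only CPU is modelled.
type Node struct {
	Name    string
	FreeCPU int
}

type Pod struct {
	Name              string
	CPU               int
	NominatedNodeName string
}

func fits(p Pod, n Node) bool { return p.CPU <= n.FreeCPU }

// feasibleNodes tries the nominated node first and falls back to
// checking every node when the nominated node no longer fits.
func feasibleNodes(p Pod, nodes []Node) []string {
	if p.NominatedNodeName != "" {
		for _, n := range nodes {
			if n.Name == p.NominatedNodeName && fits(p, n) {
				return []string{n.Name}
			}
		}
	}
	var out []string
	for _, n := range nodes {
		if fits(p, n) {
			out = append(out, n.Name)
		}
	}
	return out
}

func main() {
	// node1 has been taken over by a higher-priority preemptor.
	nodes := []Node{
		{Name: "node1", FreeCPU: 0},
		{Name: "node2", FreeCPU: 4},
	}
	pod := Pod{Name: "low-priority", CPU: 2, NominatedNodeName: "node1"}
	fmt.Println(feasibleNodes(pod, nodes)) // prints [node2]
}
```

Even with a stale nomination on node1, the Pod in this model still finds node2, which is why keeping the nomination in place does not strand it.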

I'm open to discussion about which option we should take, but this PR goes with (2) because:

  1. Option (1) would add complexity. Especially given that KEP-4832 allows the scheduler to schedule Pods while a preemption is ongoing, keeping preemption simple matters: it reduces the scenarios we have to consider.
  2. Option (2) removes the API calls that clear .status.nominatedNodeName, which probably benefits performance.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 14, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Oct 14, 2024
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 14, 2024
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 14, 2024
@sanposhiho sanposhiho marked this pull request as ready for review October 14, 2024 22:52
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 14, 2024
@k8s-ci-robot k8s-ci-robot requested a review from damemi October 14, 2024 22:52
@sanposhiho
Member Author

/hold
To go through the approver.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 14, 2024
@sanposhiho sanposhiho marked this pull request as draft October 15, 2024 01:10
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 15, 2024
@sanposhiho
Member Author

/test pull-kubernetes-integration

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/test and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 15, 2024
@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Oct 15, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sanposhiho

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sanposhiho sanposhiho marked this pull request as ready for review October 15, 2024 01:31
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 15, 2024
@sanposhiho
Member Author

sanposhiho commented Oct 15, 2024

@alculquicondor What do you think about this change?

@alculquicondor
Member

cc @dom4ha

@AxeZhan
Member

AxeZhan commented Oct 16, 2024

So, now we know that lower-priority preemptors don't get requeued into activeQ immediately.

After some time, the lower-priority preemptor will enter the scheduling cycle again.
If the preemptor still has its nomination and the node can accept it, we'll directly return the nominated node in the filter phase.

    if len(pod.Status.NominatedNodeName) > 0 {
        feasibleNodes, err := sched.evaluateNominatedNode(ctx, pod, fwk, state, diagnosis)
        if err != nil {
            logger.Error(err, "Evaluation failed on nominated node", "pod", klog.KObj(pod), "node", pod.Status.NominatedNodeName)
        }
        // Nominated node passes all the filters, scheduler is good to assign this node to the pod.
        if len(feasibleNodes) != 0 {
            return feasibleNodes, diagnosis, nil
        }
    }

However, at that point there may be other nodes that can also accept the preemptor and would have a higher score.
Won't this affect the scheduling result?
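The concern can be illustrated with a small toy model. It is only a sketch under the assumption that a nominated node that still passes filtering is returned without scoring the alternatives; `Node` and `pick` are hypothetical names, and the scores are made up.

```go
package main

import "fmt"

// Node is a toy type: Fits is the filter result, Score the scoring result.
type Node struct {
	Name  string
	Fits  bool
	Score int
}

// pick returns the node chosen for a Pod. If the nominated node still
// fits, it is returned immediately and no other node is scored.
func pick(nominated string, nodes []Node) string {
	for _, n := range nodes {
		if n.Name == nominated && n.Fits {
			return n.Name
		}
	}
	best, bestScore := "", -1
	for _, n := range nodes {
		if n.Fits && n.Score > bestScore {
			best, bestScore = n.Name, n.Score
		}
	}
	return best
}

func main() {
	nodes := []Node{
		{Name: "node1", Fits: true, Score: 10},
		{Name: "node2", Fits: true, Score: 90},
	}
	fmt.Println("with nomination:   ", pick("node1", nodes))
	fmt.Println("without nomination:", pick("", nodes))
}
```

In this model the Pod lands on node1 when it keeps its nomination, and on the higher-scoring node2 when the nomination has been cleared, so the two behaviours can produce different placements.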

@sanposhiho sanposhiho marked this pull request as draft October 16, 2024 08:09
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 16, 2024
@sanposhiho
Member Author

On reflection, the best approach might be to make preemption by high-priority Pods take lower-priority nominated Pods into account. Let me think about it more deeply over the weekend.

@dom4ha
Member

dom4ha commented Nov 4, 2024

It would be good to consider how the Autoscaler handles nominated nodes before deciding to remove this logic. If some of the low-priority pods no longer fit the nominated node, the Autoscaler should be aware that it still needs to consider them in its simulation elsewhere, not necessarily tied to this particular node, which would be overallocated.

Another argument for keeping this logic is that such overallocation would block the scheduling of even lower-priority pods that could fit. If unnominated pods still fit, they could still be scheduled in the same place. The only worries are potentially wasted ongoing preemptions and the risk that unnominated pods won't be rescheduled, but I'm not sure about that yet.
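The overallocation argument can be sketched with a toy calculation. It assumes that, when checking whether a Pod fits a node, Pods nominated to that node with equal or higher priority are counted as if they were already running there; `Pod` and `fitsWithNominated` are hypothetical names, and only CPU is modelled.

```go
package main

import "fmt"

// Pod is a toy type with a priority and a CPU request.
type Pod struct {
	Name     string
	Priority int
	CPU      int
}

// fitsWithNominated checks whether candidate fits on a node of the given
// capacity. Assumption: nominated Pods with priority >= the candidate's
// are counted as if they were already running on the node.
func fitsWithNominated(candidate Pod, capacity int, nominated []Pod) bool {
	used := candidate.CPU
	for _, p := range nominated {
		if p.Priority >= candidate.Priority {
			used += p.CPU
		}
	}
	return used <= capacity
}

func main() {
	high := Pod{Name: "high", Priority: 100, CPU: 3}
	low := Pod{Name: "low", Priority: 50, CPU: 2} // no longer fits next to high
	lowest := Pod{Name: "lowest", Priority: 10, CPU: 1}

	// The node has 4 CPUs. With low's stale nomination kept, lowest is blocked.
	fmt.Println("stale nomination kept:   ", fitsWithNominated(lowest, 4, []Pod{high, low}))
	fmt.Println("stale nomination removed:", fitsWithNominated(lowest, 4, []Pod{high}))
}
```

With the stale nomination counted, the node looks full (3 + 2 + 1 > 4) and the lowest-priority Pod is rejected even though it would fit alongside the high-priority preemptor (3 + 1 = 4).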

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 8, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

@dims dims added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 3, 2025
7 participants