
feature(KEP-4832): asynchronous preemption #128170

Merged — 27 commits merged into kubernetes:master on Nov 7, 2024

Conversation

@sanposhiho (Member) commented Oct 18, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

The core implementation for KEP-4832: it moves the API calls inside preemption into a goroutine.

I confirmed that this feature:

  • cuts the latency of PostFilter to 1/5
    • This is when each preemption evicts 3 Pods. The more Pods to evict, the bigger the improvement, and vice versa.
  • doubles the throughput in the PreemptionAsync scenario of scheduler-perf, across multiple runs.

See: #128170 (comment)
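To make the mechanism concrete, here is a minimal sketch of the idea. The name `prepareCandidateAsync` appears in the review discussion below, but the body here is illustrative, not the PR's actual code:

```go
package preemption

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// prepareCandidateAsync sketches the core idea: the victim-deletion API
// calls run in a goroutine so that PostFilter can return without waiting
// on the API server. onFailure stands in for the Activate() fallback
// described in the note below.
func prepareCandidateAsync(cs kubernetes.Interface, victims []*v1.Pod, onFailure func()) {
	go func() {
		for _, victim := range victims {
			err := cs.CoreV1().Pods(victim.Namespace).Delete(
				context.Background(), victim.Name, metav1.DeleteOptions{})
			if err != nil {
				// A failed deletion emits no Pod/Delete cluster event, so
				// nothing would requeue the preemptor; hand control to the
				// failure path instead.
				onFailure()
				return
			}
		}
	}()
}
```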


A note on the implementation details:

About Activate()

The preemption plugin moves the preemptor Pod back to activeQ when the goroutine fails.

This is because if, for example, the deletion API call fails and no Pod/Delete event happens, the preemptor Pod could get stuck in the unschedulable pod pool, since no event would move it out.

For this operation, we reuse the existing Activate() method on the queue, which moves a Pod from the unschedulable pool to activeQ.
And, to address the case where the goroutine fails immediately, before the preemptor Pod has come back to the queue, we utilize the in-flight events and moveRequestCycle mechanisms, as we do with cluster events.
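A minimal sketch of that fallback, assuming a scheduling queue that exposes Activate(logger, pods) as described above; the interface and function names here are illustrative:

```go
package preemption

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// schedulingQueue is a narrowed, illustrative view of the scheduler's
// queue: the fallback only needs the existing Activate() method.
type schedulingQueue interface {
	Activate(logger klog.Logger, pods map[string]*v1.Pod)
}

// onAsyncPreemptionFailure moves the preemptor from the unschedulable
// pool back to activeQ, so it is retried even though the failed API call
// produced no cluster event.
func onAsyncPreemptionFailure(logger klog.Logger, q schedulingQueue, preemptor *v1.Pod) {
	q.Activate(logger, map[string]*v1.Pod{preemptor.Name: preemptor})
}
```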

Which issue(s) this PR fixes:

Fixes #128020
Fixes #128019

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Added alpha support for asynchronous Pod preemption.
When the `SchedulerAsyncPreemption` feature gate is enabled, the scheduler now runs API calls to trigger preemptions asynchronously for better performance.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]:  http://kep.k8s.io/4832
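As a rough illustration of how the gate selects the code path — a sketch of typical feature-gate plumbing, not necessarily the PR's exact wiring; `Evaluator`, `Candidate`, and the two prepare functions are names from the discussion below with assumed signatures:

```go
import (
	"context"

	v1 "k8s.io/api/core/v1"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

// prepare picks the preparation path based on the SchedulerAsyncPreemption
// feature gate. When the gate is on, the API calls are fired in a
// goroutine and scheduling continues immediately.
func (ev *Evaluator) prepare(ctx context.Context, c Candidate, pod *v1.Pod, pluginName string) error {
	if utilfeature.DefaultFeatureGate.Enabled(features.SchedulerAsyncPreemption) {
		ev.prepareCandidateAsync(c, pod, pluginName) // fire and forget
		return nil
	}
	return ev.prepareCandidate(ctx, c, pod, pluginName) // blocks on API calls
}
```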

@k8s-ci-robot (Contributor):
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot added labels: release-note, do-not-merge/work-in-progress, kind/feature, size/L, cncf-cla: yes, do-not-merge/needs-sig, needs-triage (Oct 18, 2024)
@k8s-ci-robot (Contributor):
This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


@k8s-ci-robot added label needs-priority (Oct 18, 2024)
@k8s-ci-robot added label sig/scheduling and removed label do-not-merge/needs-sig (Oct 18, 2024)
@sanposhiho (Member, Author):
/test all

This PR is still WIP as I need to add more tests.

@sanposhiho (Member, Author):
/hold

Need to go through the approver's approval first.

@k8s-ci-robot added label do-not-merge/hold (Oct 18, 2024)
Review thread on this excerpt:

```go
// nomination updates these pods and moves them to the active queue. It
// lets scheduler find another place for them.
nominatedPods := getLowerPriorityNominatedPods(logger, ev.Handler, pod, c.Name())
if err := util.ClearNominatedNodeName(ctx, ev.Handler.ClientSet(), nominatedPods...); err != nil {
```
Member: We're not clearing this in prepareCandidate (when the feature is disabled)?

@sanposhiho: Currently, we are. But, as mentioned in #128067, I have to think a bit more about it over the weekend.

Member: Alright, I missed L455 in this PR before; I thought we had already removed it in this PR 😅.
But we should clear nominatedNodeName for low-priority pods after a successful preemption, no?

@sanposhiho: We should do it before, because from the moment the high-priority pod is nominated to the node, lower-priority pods could become unschedulable, regardless of whether the API calls have completed.

@sanposhiho: But I still believe we don't need to clear the nominated node names of all lower-priority pods on the node.
As an extreme example: if the high-priority pod only requires 2 CPU, and there are 10 lower-priority Pods that each require 2 CPU and have already been nominated to the node, it doesn't make sense to remove the nomination from all of them.

So I believe we should seek another way, like:
#128067

But I consider such an enhancement separate from this KEP (i.e., it doesn't need to be implemented together with it), and thus I wouldn't like to include that change here.
I was thinking I'd pursue it when I have time after KubeCon (... or you can take over if you want! :)).
Also, either way, I believe that kind of scenario (a lower-priority pod that has just completed preemption on the same node) is fairly rare, and hence we needn't actively prioritize it.

Member: I'm +1 for separating it from this KEP; that makes it easier to roll back if something goes wrong after the implementation.

> ... or you can take over if you want! :)

I'll look into this when I have some free time, but I haven't thought of a good solution yet either :(

@sanposhiho: The rough idea that I have is:

  • When calculating the victims on the node, we assume the nominated Pods are already on the node.
    • Lower-priority nominated Pods could be selected as victims, while higher-priority ones shouldn't be.
  • Then, if lower-priority nominated Pods are selected, prepareCandidate (or prepareCandidateAsync) un-nominates them instead of deleting them.

So, in short, the nomination on lower-priority pods is removed only when necessary (see the sketch below).
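A hypothetical sketch of that idea — this is not something the PR implements; `clearNomination` and `deletePod` are illustrative helpers:

```go
import v1 "k8s.io/api/core/v1"

// prepareVictims sketches the proposed behavior. A lower-priority pod
// that is merely nominated to this node is not running there yet, so
// dropping its nomination frees the room it reserved without deleting
// anything.
func prepareVictims(victims []*v1.Pod, nodeName string) error {
	for _, victim := range victims {
		if victim.Status.NominatedNodeName == nodeName && victim.Spec.NodeName == "" {
			if err := clearNomination(victim); err != nil { // illustrative helper
				return err
			}
			continue
		}
		// Victims that are actually running still have to be deleted.
		if err := deletePod(victim); err != nil { // illustrative helper
			return err
		}
	}
	return nil
}
```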

@sanposhiho: But, apart from that idea, I'd still like to explore the simplest option, like #128067.
We'd have to think thoroughly about whether there are any corner cases, but maybe we can just do that.

Member: It would be good to consider how the Autoscaler handles nominated nodes before deciding to remove this logic. If some of the low-priority pods no longer fit the nominated node, the Autoscaler should be aware that it still needs to consider them in simulations elsewhere, just not necessarily tied to this particular node, which would be overallocated.

Another argument for keeping this logic is that such overallocation would block scheduling of even lower-priority pods that could fit. If un-nominated pods still fit, they could still be scheduled in the same place. The only worries are that ongoing preemptions might be wasted and that un-nominated pods might not reschedule, but I'm not sure about that yet.

@sanposhiho force-pushed the async-preemption branch 2 times, most recently from f71a80f to 8436ead (October 18, 2024 12:31)
@k8s-ci-robot added labels size/XXL, area/test, sig/testing and removed label size/L (Oct 18, 2024)
@sftim (Contributor) commented Oct 19, 2024

Changelog suggestion

```diff
-The scheduler now runs some expensive operations within the preemption asynchronously for better performance.
-The feature gate SchedulerAsyncPreemption gates this feature, which is disabled by default.
-
-See http://kep.k8s.io/4832 for more details.
+Added alpha support for asynchronous Pod preemption.
+When the `SchedulerAsyncPreemption` feature gate is enabled, the scheduler now runs some expensive operations asynchronously for better performance.
```

@sanposhiho (Member, Author):
/test all

@sanposhiho (Member, Author):
/test all

@k8s-ci-robot added label kind/api-change (Oct 23, 2024)
@sanposhiho (Member, Author):
/retest

1 similar comment
@sanposhiho (Member, Author):
/retest

@k8s-ci-robot (Contributor):
@sanposhiho: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Required | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-e2e-gce-cos-alpha-features | 105d489 | false | /test pull-kubernetes-e2e-gce-cos-alpha-features |



@sanposhiho (Member, Author) commented Nov 7, 2024

The alpha CI keeps failing due to unrelated tests (#128656).
Please ignore this CI failure; this PR is ready for review.
After @alculquicondor's and @dom4ha's approvals, I only added the comment in 105d489.

@ahg-g (Member) commented Nov 7, 2024

/approve

for the feature gate file

@k8s-ci-robot (Contributor):
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, alculquicondor, sanposhiho

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added label approved (Nov 7, 2024)
@sanposhiho (Member, Author):
@alculquicondor @macsko @dom4ha can anyone give it a final /lgtm?

@sanposhiho (Member, Author):
/unhold

We've got an approver's review already, although the final /lgtm is still necessary.

@k8s-ci-robot removed label do-not-merge/hold (Nov 7, 2024)
@dom4ha (Member) left a comment:
/lgtm

@k8s-ci-robot added label lgtm (Nov 7, 2024)
@k8s-ci-robot (Contributor):
LGTM label has been added.

Git tree hash: 69c4f9641535a0eeeec0cb9f988c6a31a1bb3204

@k8s-triage-robot:
This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@k8s-ci-robot k8s-ci-robot merged commit fb03382 into kubernetes:master Nov 7, 2024
15 of 16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Nov 7, 2024