
feature(KEP-4832): asynchronous preemption #128170

Merged — 27 commits merged into kubernetes:master on Nov 7, 2024

Conversation

@sanposhiho (Member) commented Oct 18, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

The core implementation for KEP-4832: it moves the API calls inside preemption into a goroutine.

I confirmed that this feature:

  • cuts the latency of PostFilter to 1/5
    • This is when each preemption evicts 3 Pods. The more Pods to evict, the bigger the improvement, and vice versa.
  • doubles the throughput in the PreemptionAsync scenario of scheduler-perf, across multiple runs.

See: #128170 (comment)
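To make the mechanism concrete, here is a minimal sketch of the idea. The name `prepareCandidateAsync` appears in the review discussion below, but the body here is illustrative, not the PR's actual code:

```go
package preemption

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// prepareCandidateAsync sketches the core idea: the victim-deletion API
// calls run in a goroutine so that PostFilter can return without waiting
// on the API server. onFailure stands in for the Activate() fallback
// described in the note below.
func prepareCandidateAsync(cs kubernetes.Interface, victims []*v1.Pod, onFailure func()) {
	go func() {
		for _, victim := range victims {
			err := cs.CoreV1().Pods(victim.Namespace).Delete(
				context.Background(), victim.Name, metav1.DeleteOptions{})
			if err != nil {
				// A failed deletion emits no Pod/Delete cluster event, so
				// nothing would requeue the preemptor; hand control to the
				// failure path instead.
				onFailure()
				return
			}
		}
	}()
}
```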


A note on the implementation details:

About Activate()

The preemption plugin moves the preemptor Pod back to activeQ when the goroutine fails.

This is because if, for example, the deletion API call fails and no Pod/Delete event happens, the preemptor Pod could get stuck in the unschedulable pod pool, since no event would move it out.

For this operation, we reuse the existing Activate() method on the queue, which moves a Pod from the unschedulable pool to activeQ.
And, to address the case where the goroutine fails immediately, before the preemptor Pod has come back to the queue, we utilize the in-flight events and moveRequestCycle mechanisms, as we do with cluster events.
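A minimal sketch of that fallback, assuming a scheduling queue that exposes Activate(logger, pods) as described above; the interface and function names here are illustrative:

```go
package preemption

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// schedulingQueue is a narrowed, illustrative view of the scheduler's
// queue: the fallback only needs the existing Activate() method.
type schedulingQueue interface {
	Activate(logger klog.Logger, pods map[string]*v1.Pod)
}

// onAsyncPreemptionFailure moves the preemptor from the unschedulable
// pool back to activeQ, so it is retried even though the failed API call
// produced no cluster event.
func onAsyncPreemptionFailure(logger klog.Logger, q schedulingQueue, preemptor *v1.Pod) {
	q.Activate(logger, map[string]*v1.Pod{preemptor.Name: preemptor})
}
```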

Which issue(s) this PR fixes:

Fixes #128020
Fixes #128019

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Added alpha support for asynchronous Pod preemption.
When the `SchedulerAsyncPreemption` feature gate is enabled, the scheduler now runs API calls to trigger preemptions asynchronously for better performance.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]:  http://kep.k8s.io/4832
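As a rough illustration of how the gate selects the code path — a sketch of typical feature-gate plumbing, not necessarily the PR's exact wiring; `Evaluator`, `Candidate`, and the two prepare functions are names from the discussion below with assumed signatures:

```go
import (
	"context"

	v1 "k8s.io/api/core/v1"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

// prepare picks the preparation path based on the SchedulerAsyncPreemption
// feature gate. When the gate is on, the API calls are fired in a
// goroutine and scheduling continues immediately.
func (ev *Evaluator) prepare(ctx context.Context, c Candidate, pod *v1.Pod, pluginName string) error {
	if utilfeature.DefaultFeatureGate.Enabled(features.SchedulerAsyncPreemption) {
		ev.prepareCandidateAsync(c, pod, pluginName) // fire and forget
		return nil
	}
	return ev.prepareCandidate(ctx, c, pod, pluginName) // blocks on API calls
}
```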

@k8s-ci-robot (Contributor):
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot added labels: release-note, do-not-merge/work-in-progress, kind/feature, size/L, cncf-cla: yes, do-not-merge/needs-sig, needs-triage (Oct 18, 2024)
@k8s-ci-robot (Contributor):
This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


@k8s-ci-robot added label needs-priority (Oct 18, 2024)
@k8s-ci-robot added label sig/scheduling and removed label do-not-merge/needs-sig (Oct 18, 2024)
@sanposhiho (Member, Author):
/test all

This PR is still WIP as I need to add more tests.

@sanposhiho (Member, Author):
/hold

Need to go through the approver's approval first.

@k8s-ci-robot added label do-not-merge/hold (Oct 18, 2024)
Review thread on this excerpt:

```go
// nomination updates these pods and moves them to the active queue. It
// lets scheduler find another place for them.
nominatedPods := getLowerPriorityNominatedPods(logger, ev.Handler, pod, c.Name())
if err := util.ClearNominatedNodeName(ctx, ev.Handler.ClientSet(), nominatedPods...); err != nil {
```
Member: We're not clearing this in prepareCandidate (when the feature is disabled)?

@sanposhiho: Currently, we are. But, as mentioned in #128067, I have to think a bit more about it over the weekend.

Member: Alright, I missed L455 in this PR before; I thought we had already removed it in this PR 😅.
But we should clear nominatedNodeName for low-priority pods after a successful preemption, no?

@sanposhiho: We should do it before, because from the moment the high-priority pod is nominated to the node, lower-priority pods could become unschedulable, regardless of whether the API calls have completed.

@sanposhiho: But I still believe we don't need to clear the nominated node names of all lower-priority pods on the node.
As an extreme example: if the high-priority pod only requires 2 CPU, and there are 10 lower-priority Pods that each require 2 CPU and have already been nominated to the node, it doesn't make sense to remove the nomination from all of them.

So I believe we should seek another way, like:
#128067

But I consider such an enhancement separate from this KEP (i.e., it doesn't need to be implemented together with it), and thus I wouldn't like to include that change here.
I was thinking I'd pursue it when I have time after KubeCon (... or you can take over if you want! :)).
Also, either way, I believe that kind of scenario (a lower-priority pod that has just completed preemption on the same node) is fairly rare, and hence we needn't actively prioritize it.

Member: I'm +1 for separating it from this KEP; that makes it easier to roll back if something goes wrong after the implementation.

> ... or you can take over if you want! :)

I'll look into this when I have some free time, but I haven't thought of a good solution yet either :(

@sanposhiho: The rough idea that I have is:

  • When calculating the victims on the node, we assume the nominated Pods are already on the node.
    • Lower-priority nominated Pods could be selected as victims, while higher-priority ones shouldn't be.
  • Then, if lower-priority nominated Pods are selected, prepareCandidate (or prepareCandidateAsync) un-nominates them instead of deleting them.

So, in short, the nomination on lower-priority pods is removed only when necessary (see the sketch below).
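A hypothetical sketch of that idea — this is not something the PR implements; `clearNomination` and `deletePod` are illustrative helpers:

```go
import v1 "k8s.io/api/core/v1"

// prepareVictims sketches the proposed behavior. A lower-priority pod
// that is merely nominated to this node is not running there yet, so
// dropping its nomination frees the room it reserved without deleting
// anything.
func prepareVictims(victims []*v1.Pod, nodeName string) error {
	for _, victim := range victims {
		if victim.Status.NominatedNodeName == nodeName && victim.Spec.NodeName == "" {
			if err := clearNomination(victim); err != nil { // illustrative helper
				return err
			}
			continue
		}
		// Victims that are actually running still have to be deleted.
		if err := deletePod(victim); err != nil { // illustrative helper
			return err
		}
	}
	return nil
}
```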

@sanposhiho: But, apart from that idea, I'd still like to explore the simplest option, like #128067.
We'd have to think thoroughly about whether there are any corner cases, but maybe we can just do that.

Member: It would be good to consider how the Autoscaler handles nominated nodes before deciding to remove this logic. If some of the low-priority pods no longer fit the nominated node, the Autoscaler should be aware that it still needs to consider them in simulations elsewhere, just not necessarily tied to this particular node, which would be overallocated.

Another argument for keeping this logic is that such overallocation would block scheduling of even lower-priority pods that could fit. If un-nominated pods still fit, they could still be scheduled in the same place. The only worries are that ongoing preemptions might be wasted and that un-nominated pods might not reschedule, but I'm not sure about that yet.

@sanposhiho force-pushed the async-preemption branch 2 times, most recently from f71a80f to 8436ead (October 18, 2024 12:31)
@k8s-ci-robot added labels size/XXL, area/test, sig/testing and removed label size/L (Oct 18, 2024)
@sftim (Contributor) commented Oct 19, 2024

Changelog suggestion

```diff
-The scheduler now runs some expensive operations within the preemption asynchronously for better performance.
-The feature gate SchedulerAsyncPreemption gates this feature, which is disabled by default.
-
-See http://kep.k8s.io/4832 for more details.
+Added alpha support for asynchronous Pod preemption.
+When the `SchedulerAsyncPreemption` feature gate is enabled, the scheduler now runs some expensive operations asynchronously for better performance.
```

@sanposhiho (Member, Author):
/test all

@sanposhiho (Member, Author):
/test all

@k8s-ci-robot added label kind/api-change (Oct 23, 2024)
@sanposhiho (Member, Author):
/retest

1 similar comment
@sanposhiho (Member, Author):
/retest

@k8s-ci-robot (Contributor):
@sanposhiho: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Required | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-e2e-gce-cos-alpha-features | 105d489 | false | /test pull-kubernetes-e2e-gce-cos-alpha-features |



@sanposhiho (Member, Author) commented Nov 7, 2024

The alpha CI keeps failing due to unrelated tests (#128656).
Please ignore this CI failure; this PR is ready for review.
After @alculquicondor's and @dom4ha's approvals, I only added the comment in 105d489.

@ahg-g (Member) commented Nov 7, 2024

/approve

for the feature gate file

@k8s-ci-robot (Contributor):
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, alculquicondor, sanposhiho

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added label approved (Nov 7, 2024)
@sanposhiho (Member, Author):
@alculquicondor @macsko @dom4ha can anyone give it a final /lgtm?

@sanposhiho (Member, Author):
/unhold

We've got an approver's review already, although the final /lgtm is still necessary.

@k8s-ci-robot removed label do-not-merge/hold (Nov 7, 2024)
@dom4ha (Member) left a comment:
/lgtm

@k8s-ci-robot added label lgtm (Nov 7, 2024)
@k8s-ci-robot (Contributor):
LGTM label has been added.

Git tree hash: 69c4f9641535a0eeeec0cb9f988c6a31a1bb3204

@k8s-triage-robot:
This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@k8s-ci-robot k8s-ci-robot merged commit fb03382 into kubernetes:master Nov 7, 2024
15 of 16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Nov 7, 2024