feature(KEP-4832): asynchronous preemption #128170
Conversation
Skipping CI for Draft Pull Request.
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label.
/test all
This PR is still WIP as I need to add more tests.
/hold
Still needs to go through the approver's approval.
// nomination updates these pods and moves them to the active queue. It
// lets scheduler find another place for them.
nominatedPods := getLowerPriorityNominatedPods(logger, ev.Handler, pod, c.Name())
if err := util.ClearNominatedNodeName(ctx, ev.Handler.ClientSet(), nominatedPods...); err != nil {
We're not clearing this in `prepareCandidate` (when the feature is disabled)?
Currently, we are.
But, as mentioned in #128067, I have to think a bit more about it over the weekend.
Alright, I missed L455 in this PR before; I thought we had already removed it in this PR 😅.
But we should clear nominatedNodeName for lower-priority pods after the preemption succeeds, no?
We should do it before that.
From the moment the high-priority pod gets nominated to the node, lower-priority pods could become unschedulable, regardless of whether the API calls are completely done or not.
But I still believe we don't need to clear the nominated node names of all lower-priority pods on the node.
As an extreme example, if the high-priority pod only requires 2 CPU, and there are 10 lower-priority Pods that require 2 CPU each and have already been nominated to the node, it doesn't make sense to remove the nomination from all of them.
So I do believe we should seek another way, like:
#128067
But I'm considering such an enhancement separate from this KEP (i.e., no need to implement it together), and thus I wouldn't like to include the change here.
I was thinking I'd pursue it when I've got time after KubeCon. (... or you can take over if you want! :))
Also, either way, I believe that kind of scenario (a lower-priority pod that just completed preemption on the same node) is fairly rare, and hence we needn't actively prioritize it.
I'm +1 for separating it from this KEP; it will be easier to roll back if something goes wrong after the implementation.

"... or you can take over if you want! :)"

I'll look into this when I have some free time, but I haven't thought of a good solution yet either :(
The rough idea that I have is like:
- When calculating the victims on the node, we assume the nominated Pods are already on the node.
- Lower-priority nominated Pods could be selected as victims, while higher-priority ones shouldn't.
- Then, if lower-priority nominated Pods are selected, `prepareCandidate` (or `prepareCandidateAsync`) un-nominates them instead of deleting them (see the sketch below).

So, in short, the nominated info on lower-priority pods will be cleared only when necessary.
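A minimal sketch of that idea, not code from this PR: assuming a client-go clientset, a hypothetical helper (here called `unNominateOrDelete`) could clear `status.nominatedNodeName` for a victim that is only nominated to the node and not yet bound, roughly what `util.ClearNominatedNodeName` already does, and fall back to deletion for victims that are actually running.

```go
// Hypothetical sketch (not this PR's code): treat a victim that is merely
// nominated to the node differently from one that is already running there.
package preemptionsketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// unNominateOrDelete clears the nomination of a victim that is not yet bound
// (spec.nodeName empty, status.nominatedNodeName set) instead of deleting it;
// victims that are already running on the node are still deleted.
func unNominateOrDelete(ctx context.Context, cs kubernetes.Interface, victim *v1.Pod) error {
	if victim.Spec.NodeName == "" && victim.Status.NominatedNodeName != "" {
		// Un-nominate: drop status.nominatedNodeName with a status patch.
		patch := []byte(`{"status":{"nominatedNodeName":null}}`)
		_, err := cs.CoreV1().Pods(victim.Namespace).Patch(
			ctx, victim.Name, types.StrategicMergePatchType, patch, metav1.PatchOptions{}, "status")
		return err
	}
	// Running victim: preempt it by deletion, as today.
	return cs.CoreV1().Pods(victim.Namespace).Delete(ctx, victim.Name, metav1.DeleteOptions{})
}
```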
But, apart from that idea, I'd still like to seek the simplest option, like #128067.
We have to think thoroughly about whether there's any corner case, though; maybe we can just do that.
It would be good to consider how the Autoscaler handles the nominated node before deciding to remove this logic. If some of the low-priority pods no longer fit the nominated node, the Autoscaler should be aware that it still needs to consider them in simulation elsewhere, but not necessarily tied to this particular node, which would be overallocated.
Another argument for keeping this logic is that such overallocation would block scheduling of even lower-priority pods which could fit. If un-nominated pods still fit, they could still be scheduled in the same place. The only worry is that potentially unnecessary ongoing preemptions would be wasted, plus the risk that un-nominated pods won't reschedule, but I'm not sure about that yet.
(force-pushed from f71a80f to 8436ead)
Changelog suggestion:

```diff
-The scheduler now runs some expensive operations within the preemption asynchronously for better performance.
-The feature gate SchedulerAsyncPreemption gates this feature, which is disabled by default.
-
-See http://kep.k8s.io/4832 for more details.
+Added alpha support for asynchronous Pod preemption.
+When the `SchedulerAsyncPreemption` feature gate is enabled, the scheduler now runs some expensive operations asynchronously for better performance.
```
/test all
(force-pushed from 8436ead to 48cfff6)
/test all
(force-pushed from d776f0c to 105d489)
/retest
1 similar comment
/retest
@sanposhiho: The following test failed; say `/retest` to rerun all failed tests.
The alpha CI keeps failing due to unrelated tests. #128656
/approve for the feature gate file
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ahg-g, alculquicondor, sanposhiho.
@alculquicondor @macsko @dom4ha can anyone give it a final `lgtm`?
/unhold
We've got an approver's review already, although the final `lgtm` is still pending.
/lgtm
LGTM label has been added. Git tree hash: 69c4f9641535a0eeeec0cb9f988c6a31a1bb3204
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
What type of PR is this?
/kind feature
What this PR does / why we need it:
The core implementation for KEP-4832: the part of preemption that makes API calls is moved into a goroutine.
I confirmed this feature with the `PreemptionAsync` test case over multiple runs of scheduler-perf. See: #128170 (comment)
A note on the implementation details:

About `Activate()`

The preemption plugin moves the preemptor pod to activeQ when the goroutine fails. This is because if, for example, the deletion API call fails and no Pod/Delete event happens, the preemptor Pod could get stuck in the unschedulable pod pool (since no event would move the Pod). For this operation, we reuse the existing `Activate()` method on the queue, which moves the Pod from the unschedulable pool to activeQ. And, to address the case where the goroutine fails immediately and the preemptor pod hasn't come back to the queue yet, we utilize the in-flight event and `moveRequestCycle`, like we do with cluster events. A rough sketch of this failure handling is below.
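A rough, illustrative sketch of that failure handling, not the PR's actual code: the `activator` interface and `preemptAsync` function below are invented for this example, standing in for the scheduling queue's `Activate()` and the preemption plugin's async path.

```go
// Illustrative sketch only: run the preemption API calls in a goroutine and,
// if any of them fail, move the preemptor back to the active queue so it is
// not stuck in the unschedulable pool waiting for an event that never comes.
package preemptionsketch

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/klog/v2"
)

// activator is a stand-in for the scheduling queue's Activate() used in this PR.
type activator interface {
	Activate(pods ...*v1.Pod)
}

// preemptAsync deletes the victims in a goroutine instead of blocking the
// scheduling cycle; on failure it re-activates the preemptor so that it
// doesn't stay in the unschedulable pool forever.
func preemptAsync(ctx context.Context, cs kubernetes.Interface, q activator, preemptor *v1.Pod, victims []*v1.Pod) {
	go func() {
		for _, victim := range victims {
			if err := cs.CoreV1().Pods(victim.Namespace).Delete(ctx, victim.Name, metav1.DeleteOptions{}); err != nil {
				klog.ErrorS(err, "Async preemption failed; re-activating the preemptor", "preemptor", klog.KObj(preemptor))
				// Without this, no Pod/Delete event would move the preemptor
				// out of the unschedulable pool.
				q.Activate(preemptor)
				return
			}
		}
	}()
}
```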
Which issue(s) this PR fixes:
Fixes #128020
Fixes #128019
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: