Scheduler, StatefulSets, and External Controllers #39687
@jayunit100 assigning
I assume the problem of "treating pods as equivalent" is more related to the concern of rescheduling than anything else, right? Because the use of an equivalence class to simply match predicates is innocuous whether or not a pod is stateful...

Assuming the above, then to up-level: this issue is a good example of the general problem of scheduler logic conflicting with external controllers. I guess there are two categories here, where what the scheduler wants to do can either:

(1) *subvert* other features (like statefulness), or
(2) be *subverted* by other issues (e.g. #39687).

Solutions to (1) are likely to conflict with solutions to (2). For example, aggressive rescheduling guarantees scheduler policy is adhered to and fixed, but could break any number of features, like StatefulSets, which we otherwise care about.

Now, if we assume other pod-mutating processes are imperfect (I agree on 4301 w/ @timothysc that rescheduling is a better solution than forcing a dependency on the scheduler on everyone who wants to do a one-off scheduling event), then we must take rescheduling as a necessary feature to fix asymmetrical affinity or other things that can be caused by downscaling...

If that's the case, we have solved problem (2) but possibly inflamed problem (1)... so a simple rectification is to give pods a way to communicate, generically, to the scheduler that they are immune to rescheduling.

So, is it crazy-talk to propose a (possibly expiring) rescheduler-immunity field in the pod-spec API? This seems like it could be an elegant primitive for building scheduler logic that is robust enough to always, eventually, rebalance a cluster, yet flexible enough not to trample fine-grained controllers that may be required in core Kubernetes, or in user applications for highly specified, high-performance workloads.
Rescheduling needs to have a controller-independent way to trigger safe transformation of pod sets. That sounds like disruption budget.

I might be missing the thread of your argument, but the rescheduler can break stateful and stateless applications equally: a 2-node stateless HA router controlling ingress to your platform is broken if the rescheduler knocks them off at the same time, just like an active/passive MySQL instance.

Since we decided StatefulSets aren't wildly different, I agree with Anirudh that we should remove this check.
I think we all agree on the change in this context; the remaining question is the last one posed in the original issue: how should external controllers ensure the rescheduler doesn't touch certain pods / treat them as uniform?

That is: I think the implication being made above is that special-casing is bad, but it is a means of protecting StatefulSets. So we can agree that we don't want special-casing, but in that case we need a generic mechanism for protecting stateful pod creators (and other special controllers). I guess a disruption budget of 0 would prevent any destabilization, but that would disable evictions entirely... I assume folks would want to support eviction while still telling the rescheduler not to rebalance automagically. Hence I think we might need some more direct means of protecting state-dependent / non-uniform pods from rescheduling, i.e. an expiring pod-rescheduler-immunity field in a pod spec.
(Sorry, I totally mis-read the initial comment in the thread, so ignore what I previously wrote and now deleted.) @foxish is making a good point, but I think we should not bother to discuss this until we have an actual example of a controller that generates pods that are not identical with respect to the things the scheduler cares about. cc/ @wojtek-t
I agree. I vote for deleting the special-casing of the StatefulSet from the scheduler for now. As for the more general issue, we would either want a way to use an annotation or field to specify whether a controller is managing homogeneous or heterogeneous pods, or for the scheduler to figure that out by inspecting the resource requirements of individual pods. But I agree with @davidopp that we can discuss that when we get there.
Yeah, the "normal" way to identify equivalent pods is to hash (as a single unit) the fields of the pod template that the scheduler cares about, and build a map from hash to set of pods (that are all equivalent). |
Automatic merge from submit-queue (batch tested with PRs 39230, 39718)

Remove special case for StatefulSets in scheduler

**What this PR does / why we need it**: Removes special case for StatefulSet in scheduler code. /ref: #39687

**Release note**:
```release-note
Scheduler treats StatefulSet pods as belonging to a single equivalence class.
```
@davidopp, when we're talking about the scheduler, I suppose we mean the default scheduler handling all kinds of workloads. If we're going to support all kinds of workloads in Kubernetes, I agree with @jayunit100 on adding some flag that lets the "rescheduler not touch certain pods": for some "computing" workloads, e.g. MPI, eviction loses intermediate results, forcing those workloads to re-run. But for other kinds of workloads, I agree that a rescheduler is helpful for cluster utilization.
I think we should move the discussion of how to ensure the rescheduler operates "safely", and how it would prioritize/deprioritize which pod(s) to kill when it has a choice, to the design issue for rescheduler (#12140). But in general I think it comes down to mechanisms like PDB and priority, and appropriate quota on those, so that there isn't an "arms race" where everyone puts "max disruption == 0" and "priority == infinite" on all of their pods.
sure, will continue the discussion at #12140
Great 👍
Yes - we shouldn't mix the rescheduler in here. The real rescheduler doesn't exist yet and hasn't even been designed, so all discussion about the rescheduler should move there. And I agree that removing StatefulSet from the check was a good thing to do for now.
The implementation of equivalence classes currently ignores StatefulSets. The reason seems to be #32024 (comment). I think the reasoning no longer holds given the goals and the current implementation of the StatefulSet: it does not create pods with different resource shapes, so this special case can be removed safely.
From the standpoint of fork-ability of StatefulSets and similar controllers, we do not want the scheduler to have any special casing of the StatefulSet kind. However, it exposes a different concern, one of external controllers that do manage non-uniform pods, which would need a way to tell the scheduler that they cannot each be treated as equivalent.
cc @davidopp @kubernetes/sig-scheduling-misc @kubernetes/sig-apps-misc
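The special case under discussion can be pictured as a branch keyed on the pod's controlling owner kind. The following Go sketch uses hypothetical types (`OwnerRef` stands in for `metav1.OwnerReference`; `equivalenceGroup` is not the actual scheduler function) to show what removing the branch changes: with the special case, StatefulSet-owned pods opt out of equivalence caching entirely; without it, they group by owner like any other controller's pods.

```go
package main

import "fmt"

// OwnerRef is a stand-in for metav1.OwnerReference, reduced to the
// fields relevant here.
type OwnerRef struct {
	Kind       string
	Controller bool
}

// equivalenceGroup returns the key used to group a pod with others
// whose predicate results can be reused, and whether the pod gets an
// equivalence class at all. The specialCaseStatefulSet flag models
// the branch this issue proposes removing.
func equivalenceGroup(owners []OwnerRef, specialCaseStatefulSet bool) (string, bool) {
	for _, o := range owners {
		if !o.Controller {
			continue // only the controlling owner defines the class
		}
		if specialCaseStatefulSet && o.Kind == "StatefulSet" {
			return "", false // special case: no equivalence class
		}
		return o.Kind, true
	}
	return "", false // pods with no controller are not grouped
}

func main() {
	owners := []OwnerRef{{Kind: "StatefulSet", Controller: true}}
	_, before := equivalenceGroup(owners, true)
	_, after := equivalenceGroup(owners, false)
	fmt.Println(before, after) // prints: false true
}
```

This is exactly the fork-ability concern: a branch on `o.Kind == "StatefulSet"` privileges one built-in controller, which a forked or third-party stateful controller would not benefit from.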