
Scheduler, StatefulSets, and External Controllers #39687

Closed
foxish opened this issue Jan 10, 2017 · 14 comments
Labels: area/stateful-apps, lifecycle/rotten, priority/backlog, sig/apps, sig/scheduling

Comments

@foxish (Contributor) commented Jan 10, 2017

The implementation of equivalence classes currently ignores StatefulSets. The reason seems to be #32024 (comment). I think that reasoning no longer holds given the goals and the current implementation of StatefulSet: it does not create pods with different resource shapes, so the special case can be removed safely.

From the standpoint of fork-ability of StatefulSets and similar controllers, we do not want the scheduler to have any special casing of the StatefulSet kind. However, this exposes a different concern: external controllers that do manage non-uniform pods would need a way to tell the scheduler that their pods cannot be treated as equivalent.

cc @davidopp @kubernetes/sig-scheduling-misc @kubernetes/sig-apps-misc

@foxish added the area/stateful-apps, sig/apps, and sig/scheduling labels on Jan 10, 2017
@timothysc added the priority/backlog label on Jan 10, 2017
@timothysc added this to the next-candidate milestone on Jan 10, 2017
@timothysc (Member) commented:

@jayunit100 assigning

@jayunit100 (Member) commented Jan 10, 2017

I assume the problem of "treating pods as equivalent" is more related to the concern of rescheduling than anything else, right? Using an equivalence class simply for predicate matching is innocuous whether or not a pod is stateful...

Assuming the above, and to take a step back: this issue is a good example of the general problem of scheduler logic conflicting with external controllers. I guess there are two categories here, where what the scheduler wants to do can either:

(1) subvert other features (like statefulness), or

(2) be subverted by other features (for example, this issue, #39687).

Solutions to (1) are likely to conflict with solutions to (2). For example, aggressive rescheduling guarantees that scheduler policy is adhered to and violations get fixed, but could break any number of features, like StatefulSets, which we otherwise care about.

Now, if we assume other pod-mutating processes are imperfect (I agree on 4301 w/ @timothysc that rescheduling is a better solution than forcing a dependency on the scheduler for everyone who wants to do a one-off scheduling event), then we must take rescheduling as a necessary feature to fix asymmetrical affinity or other problems that can be caused by downscaling...

... if that's the case we have solved problem (2), but possibly inflamed problem (1)... so a simple rectification is to give pods a way to communicate, generically, to the scheduler that they are immune to rescheduling...

So, is it crazy-talk to propose a (possibly expiring) rescheduler-immunity field in the pod-spec API? This seems like it could be an elegant primitive for building scheduler logic that is robust enough to always, eventually, rebalance a cluster, yet flexible enough not to trample fine-grained controllers that may be required in core Kubernetes or in user applications for highly specialized, high-performance workloads.
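
For concreteness, a minimal and purely hypothetical Go sketch of what such a field might look like; the type name, field name, and expiry semantics are all invented for illustration, and nothing like this exists in the Kubernetes API:

```go
// Hypothetical sketch only: illustrates the expiring "rescheduler immunity"
// marker proposed above. Nothing like this exists in the Kubernetes API.
package api

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ReschedulerImmunity would tell a rescheduler not to move this pod for
// rebalancing purposes. The expiry bounds the immunity so a forgotten
// marker cannot pin a pod to its node forever.
type ReschedulerImmunity struct {
	// ExpiresAt is the time after which the rescheduler may again consider
	// this pod for rebalancing; nil could mean "never expires".
	ExpiresAt *metav1.Time `json:"expiresAt,omitempty"`
}
```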

@smarterclayton (Contributor) commented Jan 11, 2017 via email

@jayunit100 (Member) commented Jan 11, 2017

I think we all agree on the change in this context; the remaining question is the last one posed in the original issue: how should external controllers ensure the rescheduler doesn't touch certain pods or treat them as uniform?

That is: I think the implication made above is that special-casing is bad, but it is a means of protecting StatefulSets. So... we can agree that we don't want special casing, but in that case we need a generic mechanism for protecting stateful pod creators (and other special controllers).

I guess a disruption budget of 0 would prevent any destabilization, but that would disable evictions entirely... I assume folks would want to support eviction while still telling the rescheduler not to rebalance automagically. Hence I think we might need some more direct means of protecting state-dependent / non-uniform pods from rescheduling, i.e. an expiring pod-rescheduler-immunity field in the pod spec.
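
As a concrete illustration of the "disruption budget of 0" extreme, here is a minimal sketch using the PodDisruptionBudget API as it exists today in policy/v1 (at the time of this thread the group was policy/v1beta1); the object name and label selector are placeholders:

```go
package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// maxUnavailable of 0 forbids any voluntary eviction of the selected
	// pods, which is exactly the "disable evictions entirely" effect
	// described above.
	maxUnavailable := intstr.FromInt(0)
	pdb := policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "no-voluntary-disruption"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "stateful-worker"},
			},
		},
	}
	fmt.Printf("%+v\n", pdb.Spec)
}
```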

@davidopp (Member) commented Jan 11, 2017

(Sorry, I totally mis-read the initial comment in the thread, so ignore what I previously wrote and now deleted.)

@foxish is making a good point but I think we should not bother to discuss this until we have an actual example of a controller that generates pods that are not identical with respect to the things the scheduler cares about.

cc/ @wojtek-t

@foxish (Contributor, Author) commented Jan 11, 2017

I agree. I vote for deleting the special casing of the StatefulSet from the scheduler for now.

As for the more general issue, we would want either an annotation or a field to specify whether a controller is managing homogeneous or heterogeneous pods, or for the scheduler to figure that out by inspecting the resource requirements of individual pods. But I agree with @davidopp that we can discuss that when we get there.
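
Purely to illustrate the annotation idea, a sketch with an invented annotation key (it is not a real Kubernetes annotation) that a controller creating heterogeneous pods might stamp onto its pods:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical annotation key: a controller that creates non-uniform
	// pods could set it to tell the scheduler not to put its pods into a
	// shared equivalence class. This key does not exist in Kubernetes.
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name: "example-0",
			Annotations: map[string]string{
				"scheduler.example.com/heterogeneous-pods": "true",
			},
		},
	}
	fmt.Println(pod.Annotations)
}
```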

@davidopp (Member) commented:

Yeah, the "normal" way to identify equivalent pods is to hash (as a single unit) the fields of the pod template that the scheduler cares about, and build a map from hash to set of pods (that are all equivalent).

k8s-github-robot pushed a commit that referenced this issue Jan 11, 2017
Automatic merge from submit-queue (batch tested with PRs 39230, 39718)

Remove special case for StatefulSets in scheduler

**What this PR does / why we need it**: Removes special case for StatefulSet in scheduler code
/ref: #39687

**Special notes for your reviewer**:

**Release note**:

```release-note
Scheduler treats StatefulSet pods as belonging to a single equivalence class.
```
@k82cn (Member) commented Jan 11, 2017

@davidopp, when we're talking about the scheduler, I assume we mean the default scheduler, for all kinds of workloads. If we're going to support all kinds of workloads in Kubernetes, I agree with @jayunit100 about adding some flag that lets the "rescheduler not touch certain pods": for some computing workloads, e.g. MPI, an eviction loses internal results and forces the workload to re-run. But for other kinds of workloads, I agree that a rescheduler is helpful for cluster utilization.

@timothysc assigned foxish and unassigned jayunit100 on Jan 11, 2017
@davidopp (Member) commented:

I think we should move the discussion of how to ensure the rescheduler operates "safely", and how it would prioritize/deprioritize which pod(s) to kill when it has a choice, to the design issue for the rescheduler (#12140). But in general I think it comes down to mechanisms like PDB and priority, and appropriate quota on those so that there isn't an "arms race" where everyone puts "max disruption == 0" and "priority == infinite" on all of their pods.
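
For the "priority" half of that, a sketch using the scheduling.k8s.io/v1 PriorityClass API that exists today (pod priority had not yet been implemented when this thread was written); the class name, value, and description are placeholders:

```go
package main

import (
	"fmt"

	schedulingv1 "k8s.io/api/scheduling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// "priority == infinite" in spirit: a maximally high user-defined
	// priority class. Without quota on who may use such classes, every
	// team would request one, which is the "arms race" concern above.
	pc := schedulingv1.PriorityClass{
		ObjectMeta:  metav1.ObjectMeta{Name: "must-not-be-preempted"},
		Value:       1000000000,
		Description: "Pods the rescheduler should kill only as a last resort.",
	}
	fmt.Printf("%s: %d\n", pc.Name, pc.Value)
}
```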

@k82cn (Member) commented Jan 16, 2017

sure, will continue the discussion at #12140

"max disruption == 0" and "priority == infinite" on all of their pods

Great 👍

@wojtek-t (Member) commented:

Yes - we shouldn't mix the rescheduler into this issue. The real rescheduler doesn't exist yet and isn't even designed, so all discussions about the rescheduler should be moved there.

And I agree that removing StatefulSet from the check was a good thing to do for now.

@fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 21, 2017
@fejta-bot commented:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 21, 2018
@fejta-bot commented:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/close
