
Scheduler, StatefulSets, and External Controllers #39687

Closed
foxish opened this issue Jan 10, 2017 · 14 comments
Labels: area/stateful-apps, lifecycle/rotten, priority/backlog, sig/apps, sig/scheduling

Comments

@foxish (Contributor) commented Jan 10, 2017

The implementation of equivalence classes currently ignores StatefulSets. The reason seems to be #32024 (comment). I think that reasoning no longer holds given the goals and the current implementation of StatefulSet: it does not create pods with different resource shapes, so the special case can be removed safely.

From the standpoint of fork-ability of StatefulSets and similar controllers, we do not want the scheduler to have any special casing of the StatefulSet kind. However, this exposes a different concern: external controllers that do manage non-uniform pods would need a way to tell the scheduler that their pods cannot be treated as equivalent.

cc @davidopp @kubernetes/sig-scheduling-misc @kubernetes/sig-apps-misc

@foxish added the area/stateful-apps, sig/apps, and sig/scheduling labels on Jan 10, 2017
@timothysc added the priority/backlog label on Jan 10, 2017
@timothysc added this to the next-candidate milestone on Jan 10, 2017
@timothysc (Member) commented:

@jayunit100 assigning

@jayunit100 (Member) commented Jan 10, 2017

I assume the problem of "treating pods as equivalent" is more related to the concern of rescheduling than anything else, right? Using an equivalence class simply for predicate matching is innocuous whether or not a pod is stateful...

Assuming the above, and to take a step back: this issue is a good example of the general problem of scheduler logic conflicting with external controllers. I guess there are two categories here, where what the scheduler wants to do can either:

(1) subvert other features (like statefulness), or

(2) be subverted by other features (for example, this issue, #39687).

Solutions to (1) are likely to conflict with solutions to (2). For example, aggressive rescheduling guarantees that scheduler policy is adhered to and violations get fixed, but could break any number of features, like StatefulSets, which we otherwise care about.

Now, if we assume other pod-mutating processes are imperfect (I agree on 4301 w/ @timothysc that rescheduling is a better solution than forcing a dependency on the scheduler for everyone who wants to do a one-off scheduling event), then we must take rescheduling as a necessary feature to fix asymmetrical affinity or other problems that can be caused by downscaling...

... if that's the case we have solved problem (2), but possibly inflamed problem (1)... so a simple rectification is to give pods a way to communicate, generically, to the scheduler that they are immune to rescheduling...

So, is it crazy-talk to propose a (possibly expiring) rescheduler-immunity field in the pod-spec API? This seems like it could be an elegant primitive for building scheduler logic that is robust enough to always, eventually, rebalance a cluster, yet flexible enough not to trample fine-grained controllers that may be required in core Kubernetes or in user applications for highly specialized, high-performance workloads.
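
For concreteness, a minimal and purely hypothetical Go sketch of what such a field might look like; the type name, field name, and expiry semantics are all invented for illustration, and nothing like this exists in the Kubernetes API:

```go
// Hypothetical sketch only: illustrates the expiring "rescheduler immunity"
// marker proposed above. Nothing like this exists in the Kubernetes API.
package api

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ReschedulerImmunity would tell a rescheduler not to move this pod for
// rebalancing purposes. The expiry bounds the immunity so a forgotten
// marker cannot pin a pod to its node forever.
type ReschedulerImmunity struct {
	// ExpiresAt is the time after which the rescheduler may again consider
	// this pod for rebalancing; nil could mean "never expires".
	ExpiresAt *metav1.Time `json:"expiresAt,omitempty"`
}
```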

@smarterclayton (Contributor) commented Jan 11, 2017 via email

@jayunit100 (Member) commented Jan 11, 2017

I think we all agree on the change in this context; the remaining question is the last one posed in the original issue: how should external controllers ensure the rescheduler doesn't touch certain pods or treat them as uniform?

That is: I think the implication made above is that special-casing is bad, but it is a means of protecting StatefulSets. So... we can agree that we don't want special casing, but in that case we need a generic mechanism for protecting stateful pod creators (and other special controllers).

I guess a disruption budget of 0 would prevent any destabilization, but that would disable evictions entirely... I assume folks would want to support eviction while still telling the rescheduler not to rebalance automagically. Hence I think we might need some more direct means of protecting state-dependent / non-uniform pods from rescheduling, i.e. an expiring pod-rescheduler-immunity field in the pod spec.
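
As a concrete illustration of the "disruption budget of 0" extreme, here is a minimal sketch using the PodDisruptionBudget API as it exists today in policy/v1 (at the time of this thread the group was policy/v1beta1); the object name and label selector are placeholders:

```go
package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// maxUnavailable of 0 forbids any voluntary eviction of the selected
	// pods, which is exactly the "disable evictions entirely" effect
	// described above.
	maxUnavailable := intstr.FromInt(0)
	pdb := policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "no-voluntary-disruption"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "stateful-worker"},
			},
		},
	}
	fmt.Printf("%+v\n", pdb.Spec)
}
```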

@davidopp (Member) commented Jan 11, 2017

(Sorry, I totally mis-read the initial comment in the thread, so ignore what I previously wrote and now deleted.)

@foxish is making a good point but I think we should not bother to discuss this until we have an actual example of a controller that generates pods that are not identical with respect to the things the scheduler cares about.

cc/ @wojtek-t

@foxish (Contributor, Author) commented Jan 11, 2017

I agree. I vote for deleting the special casing of the StatefulSet from the scheduler for now.

As for the more general issue, we would want either an annotation or a field to specify whether a controller is managing homogeneous or heterogeneous pods, or for the scheduler to figure that out by inspecting the resource requirements of individual pods. But I agree with @davidopp that we can discuss that when we get there.
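
Purely to illustrate the annotation idea, a sketch with an invented annotation key (it is not a real Kubernetes annotation) that a controller creating heterogeneous pods might stamp onto its pods:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical annotation key: a controller that creates non-uniform
	// pods could set it to tell the scheduler not to put its pods into a
	// shared equivalence class. This key does not exist in Kubernetes.
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name: "example-0",
			Annotations: map[string]string{
				"scheduler.example.com/heterogeneous-pods": "true",
			},
		},
	}
	fmt.Println(pod.Annotations)
}
```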

@davidopp (Member) commented:

Yeah, the "normal" way to identify equivalent pods is to hash (as a single unit) the fields of the pod template that the scheduler cares about, and build a map from hash to set of pods (that are all equivalent).

k8s-github-robot pushed a commit that referenced this issue Jan 11, 2017
Automatic merge from submit-queue (batch tested with PRs 39230, 39718)

Remove special case for StatefulSets in scheduler

**What this PR does / why we need it**: Removes special case for StatefulSet in scheduler code
/ref: #39687

**Special notes for your reviewer**:

**Release note**:

```release-note
Scheduler treats StatefulSet pods as belonging to a single equivalence class.
```
@k82cn (Member) commented Jan 11, 2017

@davidopp, when we're talking about the scheduler, I assume we mean the default scheduler, for all kinds of workloads. If we're going to support all kinds of workloads in Kubernetes, I agree with @jayunit100 about adding some flag that lets the "rescheduler not touch certain pods": for some computing workloads, e.g. MPI, an eviction loses internal results and forces the workload to re-run. But for other kinds of workloads, I agree that a rescheduler is helpful for cluster utilization.

@timothysc assigned foxish and unassigned jayunit100 on Jan 11, 2017
@davidopp (Member) commented:

I think we should move the discussion of how to ensure the rescheduler operates "safely", and how it would prioritize/deprioritize which pod(s) to kill when it has a choice, to the design issue for the rescheduler (#12140). But in general I think it comes down to mechanisms like PDB and priority, and appropriate quota on those so that there isn't an "arms race" where everyone puts "max disruption == 0" and "priority == infinite" on all of their pods.
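
For the "priority" half of that, a sketch using the scheduling.k8s.io/v1 PriorityClass API that exists today (pod priority had not yet been implemented when this thread was written); the class name, value, and description are placeholders:

```go
package main

import (
	"fmt"

	schedulingv1 "k8s.io/api/scheduling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// "priority == infinite" in spirit: a maximally high user-defined
	// priority class. Without quota on who may use such classes, every
	// team would request one, which is the "arms race" concern above.
	pc := schedulingv1.PriorityClass{
		ObjectMeta:  metav1.ObjectMeta{Name: "must-not-be-preempted"},
		Value:       1000000000,
		Description: "Pods the rescheduler should kill only as a last resort.",
	}
	fmt.Printf("%s: %d\n", pc.Name, pc.Value)
}
```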

@k82cn (Member) commented Jan 16, 2017

sure, will continue the discussion at #12140

"max disruption == 0" and "priority == infinite" on all of their pods

Great 👍

@wojtek-t (Member) commented:

Yes - we shouldn't mix the rescheduler into this issue. The real rescheduler doesn't exist yet and isn't even designed, so all discussions about the rescheduler should be moved there.

And I agree that removing StatefulSet from the check was a good thing to do for now.

@fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 21, 2017
@fejta-bot commented:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 21, 2018
@fejta-bot commented:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/close
