Create a rescheduler #12140
We also had some discussion of how a rescheduler might trigger cluster auto-scaling (to scale up). Instead of moving Pods around to free up space, it might just add a new node (and then move some Pods onto the new node). More generally, it might be useful to integrate the rescheduler and cluster auto-scaler. @erictune made the observation that for scaling up the cluster a reasonable workflow might be:
We talked a little about the role of simulation. Knowing the effect of different rearrangements of Pods requires a form of simulation of the scheduling algorithm (see also the discussion in the previous entry about what the rescheduler needs to know about the predicate and priority functions of the cluster's scheduler(s)). For cluster auto-scaling down, @erictune pointed out that you could run a simulation to see whether, after removing a node from the cluster, the Pods that were on that node would be able to reschedule, either directly or with the help of the rescheduler; if the answer is yes, then you can safely auto-scale down (assuming services will still be meeting their application-level SLOs).
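The scale-down simulation @erictune describes can be sketched in miniature. This is a toy model, not the actual autoscaler code: the `fits`/`can_scale_down` helpers and the node/pod dictionaries are invented for illustration, and a real simulation would run the scheduler's full predicate and priority functions.

```python
def fits(pod, node):
    """Toy predicate check: do the pod's resource requests fit in the node's free capacity?"""
    return all(pod["requests"].get(r, 0) <= node["free"].get(r, 0)
               for r in pod["requests"])

def can_scale_down(node_name, nodes, pods_by_node):
    """Simulate removing `node_name`: greedily try to place each of its pods
    onto the remaining nodes; True means scale-down looks safe."""
    remaining = {n: {"free": dict(spec["free"])}
                 for n, spec in nodes.items() if n != node_name}
    for pod in pods_by_node.get(node_name, []):
        target = next((n for n in remaining if fits(pod, remaining[n])), None)
        if target is None:
            return False  # at least one pod would stay PENDING
        for r, amount in pod["requests"].items():
            remaining[target]["free"][r] = remaining[target]["free"].get(r, 0) - amount
    return True
```

A greedy first-fit pass like this can report False even when a cleverer rearrangement (i.e. the rescheduler's help) would succeed, which is exactly the "directly or with the help of the rescheduler" distinction made above.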
@davidopp Could you write this up as a .txt so we can comment inline? I've become a fan of the proposal-before-code mantra that has been going on.
I'm reluctant to describe this as a proposal right now because the above was more targeted as a description of the full space, as opposed to a concrete proposal. Or to put it another way, what I wrote above might make for a good introduction to a concrete proposal describing exactly what we would implement. But I can put it in a PR that tries to make it clear that it's not a proposal or design doc, just a description of the space.
(My expectation is that nobody would work on implementing anything rescheduler-like for at least the next 2-3 months).
Question: If a script were to continually increase the replica count by one, then decrease it back by one moments later, will the current scheduler do the right thing and favor creating the new replica on an "empty-er" node and/or killing pending/blocked pods that have failed to schedule for a long time? (i.e. a workaround for #12195 and this)
Yes. Although it would probably be less efficient than finding the worst
I think the answer is 'probably, eventually'. Currently the RC favors removal of Pods in earlier stages (Pending < Unknown < Running), so if you flip the counts too quickly it'll remove the pod it added just before. We also don't mark Nodes as 'just-been-kicked-out-of-it-please-don't-put-me-there-again', which, with the current scheduler priority implementation, may in some cases force the scheduler to put the new Pod on the same machine over and over again.
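The removal preference described above (Pending < Unknown < Running) can be sketched as a victim-selection rule. `pick_victim` and the pod dictionaries are hypothetical names for illustration, not the actual ReplicationController code:

```python
# Assumed phase ordering from the comment above: Pending < Unknown < Running,
# i.e. Pending pods are removed first when scaling a controller down.
PHASE_RANK = {"Pending": 0, "Unknown": 1, "Running": 2}

def pick_victim(pods):
    """Return the pod a ReplicationController-style scale-down would remove
    first: the one in the earliest lifecycle phase (toy model)."""
    return min(pods, key=lambda p: PHASE_RANK.get(p["phase"], len(PHASE_RANK)))
```

This is why the flip-the-count-quickly trick can be self-defeating: the freshly created pod is usually still Pending, so it ranks first for removal.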
@gmarek wasn't there something about the scheduler trying to spread out pods? From the user guide, http://kubernetes.io/v1.0/docs/user-guide/compute-resources.html
...should I be reading this as "unrestricted pods" will most likely end up on the same node? (at this time)

It would help to have more transparency on what the algorithm is really doing (i.e. what are the cases where it doesn't spread, what are the cases where it spreads fine, etc.), since not everyone has the same "corner cases"; someone's corner case might be someone else's default way of doing things. More transparency and insight would also help a lot with potential debugging scenarios, or with optimizing user-land processes so they fit in nicely with what Kubernetes wants to do under the hood (whatever that may be).

From trying to integrate Kubernetes, some of the biggest headaches have been with these semi-blackbox undocumented algorithms (potentially involving opinionated strategies), since they lead to some really nasty gotchas like "if a container has limits set, and no ideal fit is found, it will NOT be scheduled anywhere and will just stay in Pending". (I assumed "running", even at the cost of potentially crowding other containers or being slower than desirable, would be the more robust way of compromising on not having enough resources, until I stumbled on the documentation--or comments--suggesting otherwise.)

Side note: I'm aware that in theory "perfectly allocating resources" would be ideal and would technically mitigate some of these problems, but in practice I find it hard to motivate such a strategy, mainly because of "who exactly can tell what's the perfect fit", "how much time and resources does it take to search for the perfect fit", and also "why would I want it not to try to use as much as it can when I'm already paying for the underlying hardware?" (assuming I'm not misunderstanding the "completely fair shares" system that's applied at the OS level when resources are restricted).

It would be nice if in the future we could have a "maximum node CPU/replica ratio" as a soft hint (i.e. one that can be ignored) to the (re)scheduler, e.g. 1:1, no more than 1 replica for every 1 CPU on the node; so if a node has 4 CPUs, no more than 4 replicas "should" try to get squeezed in if there's space elsewhere. Replicas would still see everything as if not restricted at all.
@srcspider The problem is that I'm not sure anyone actually understands how exactly it is done. Currently the scheduler takes into account a few things (the amount of free resources on a Node, the number of Pods running on the Node, how many Pods from the same ReplicationController/Service are already running on a given Node, how 'balanced' the resource usage is), assigns a number from 0 to 10 for each of them, adds them up, and picks the Node with the highest total. I think the unrestricted-pods case is already solved, so that part of the doc is outdated. You might be interested in #11713
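The scoring scheme described above (several 0-to-10 priority functions, summed, highest total wins) can be sketched like this. The two priority functions here are illustrative stand-ins, not the real implementations, and the node model is invented for the example:

```python
def least_requested(node):
    # Toy stand-in: favor nodes with more free CPU, scaled to 0-10.
    return 10 * node["free_cpu"] / node["capacity_cpu"]

def spreading(node):
    # Toy stand-in: favor nodes running fewer pods from the same
    # ReplicationController/Service, scaled to 0-10.
    return max(0, 10 - 2 * node["same_controller_pods"])

PRIORITIES = [least_requested, spreading]

def pick_node(nodes):
    """Sum each priority function's 0-10 score per node and pick the
    highest total, mirroring the scheme described in the comment."""
    return max(nodes, key=lambda name: sum(f(nodes[name]) for f in PRIORITIES))
```

Because the final score is a plain sum, a node that is strongly preferred by one function can win even if another function mildly penalizes it, which is one source of the "in some cases it doesn't spread" behavior discussed above.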
@gmarek knowing that it's trying to weigh those factors as its strategy is still more helpful than not knowing at all. Thanks for the explanation, much appreciated. :)
So long as we have the ability to plug in our own policy engines, I like it. Other systems often refer to this as the defragmentation process, and typically employ analogous algorithms. Ideally I would like an engine/controller/rescheduler/process where I can plug in a cloud provider and spin down resources as they are offloaded (the inverse of bursting).
@timothysc I think the architecture is TBD. (Even the feature set is TBD. :-) Would love to get your input on how to make the policy pluggable, once we get to the design stage. Probably something along the lines of how the scheduler policy is pluggable would work. (Though I think we'd like to refactor the scheduler to make customization a bit simpler.) Defragmentation is definitely part of this, but the objective function can be more general, i.e. incorporate more than just defragmentation as a goal. And I agree that this is related to cluster auto-scaling (seems to be what you referred to in your last paragraph).
I missed commenting on this issue and on the proposal: https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/rescheduler.md A few quick comments now:
Simplify compared to what? (Please don't say "compared to not taking cluster auto-scaling into account :) )
This is a good point. Since our restarts never go through the scheduler, the rescheduler is the only place we can put this logic. (Contrast with Borg, where IIRC after some number of local restarts, it punts back to the master to make a decision about restarting it, in which case the scheduler makes a decision and can take into account previous crashlooping if it wants to.)
Agreed; this is covered in the doc (" [obviously this killing operation must be able to specify "don't allow the killed Pod to reschedule back to whence it was killed" otherwise the killing is pointless]")
@aveshagarwal offered to take a crack at this. I think the spreading use case is the best one to start with: kill pods on heavily utilized nodes if you think that will move them to under-utilized nodes. Of course the utilization threshold between the over-

I would suggest starting with the current "rescheduler" codebase and modifying it, since some of the code can be reused. Hopefully we'll be able to get rid of the critical-pod preemption stuff from the rescheduler in 1.7 by implementing a more general priority/preemption scheme, but we need to leave it in there for now.

I think a key design criterion is extensibility -- make it easy for people to add new policies and to choose which ones to activate. It should definitely use the /eviction subresource, i.e. respect PDB.

We talked about importing scheduler code into other components in the sig-scheduling meeting today (e.g. the cluster autoscaler appears to do it here). You may want to do a "simulation" like that to verify that the evicted pods will actually move to the under-utilized nodes. I'm sure @mwielgus and others would be happy to explain the cluster autoscaler simulation code if it's not straightforward (I haven't looked at it carefully).
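A toy model of the spreading policy sketched above: evict pods from nodes over a utilization threshold, but only while a (much simplified) disruption budget allows. `evict_from_hot_nodes`, the `min_available` field, and the 0.8 threshold are all invented for illustration; the real mechanism is the /eviction subresource and PodDisruptionBudget objects.

```python
def allowed_disruptions(pdb, healthy_pods):
    """Simplified PDB semantics: evictions are allowed only while the
    number of healthy pods stays at or above min_available."""
    return max(0, healthy_pods - pdb["min_available"])

def evict_from_hot_nodes(nodes, pdb, healthy, threshold=0.8):
    """Pick eviction candidates from nodes above the utilization threshold,
    hottest first, stopping when the disruption budget is exhausted."""
    budget = allowed_disruptions(pdb, healthy)
    victims = []
    for name, node in sorted(nodes.items(), key=lambda kv: -kv[1]["utilization"]):
        if budget == 0:
            break
        if node["utilization"] > threshold and node["pods"]:
            victims.append((name, node["pods"][0]))  # evict one pod per hot node
            budget -= 1
    return victims
```

Note that this only chooses victims; per the discussion above, a real implementation should also run a scheduler simulation to confirm the victims would actually land on under-utilized nodes rather than rescheduling back where they came from.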
Ideally I would hope we fork a repo for this; putting this in contrib is a bad move imo.
@timothysc you mean a new sub-repo in kubernetes or in kube-incubator?
Doesn't matter to me, so long as it's not in the main or contrib repos.
/sub We'd be really keen to see this feature implemented sooner rather than later.
@ajtrichards Can you say what particular functionality you're looking for?
Hi @davidopp the specific need we had was to shift pods around onto other nodes to make space for some pods with larger resource requirements. We specify the requests and limits for each Deployment. We then have one deployment that requests 40% of the CPU, and it gets stuck in a Pending state as it can't schedule anywhere. It would be good to be able to shift some of the smaller pods around onto other nodes to help get the larger one provisioned and running.
@ajtrichards Priority/preemption (kubernetes/enhancements#268) is one possible solution to that problem, and I'm not sure the rescheduler should be involved. I think we were planning to do "spreading pods away from high/over-utilized nodes" as the first rescheduler use case. (It's true that the rescheduler is needed for your use case if the large deployment runs at the same priority as the smaller pods.)
Thanks @davidopp - sounds like Priority / Preemption would be a good solution for us. Will keep an eye out for that becoming available.
One of the key use cases we're interested in is re-spreading pods when a zone comes back online after having failed. The pods will have been moved onto nodes in the non-failed zones, so we would want to rebalance load by moving some back to the recovered zone. This is similar to a node failing and then recovering, but on a larger scale.
This feature request is extremely similar to #47965
@davidopp I can't access this doc. Is it perhaps only available to a mailing list I'm not on, or is it intentionally private for now? 🙃
@obeattie I guess you should join the sig-scheduling mailing list
As far as I have read, this whole ticket is about "defragmentation" of a cluster consisting of running containers. What is the current progress on this?
Work has been progressing in an incubator repo: |
I think we should close this issue fwiw.
SGTM. People can file issues in the descheduler repo.
It's premature to start working on this, but I wanted to jot down some notes collected from past conversations and experience on this topic (and I noticed we didn't have an issue for this yet).
A rescheduler is an agent that proactively causes currently-running Pods to be moved, so as to optimize some objective function for goodness of the layout of Pods in the cluster. (The objective function doesn't have to be expressed mathematically; it may just be a collection of ad-hoc rules, but in principle there is an objective function. Implicitly an objective function is described by the scheduler's predicate and priority functions.) It might be triggered to run every N minutes, or whenever some event happens that is known to make the objective function worse (for example, whenever a Pod goes PENDING for a long time.)
A rescheduler is useful because without a rescheduler, scheduling decisions are only made at the time Pods are created. But as the cluster layout changes over time, free "holes" are often produced that were not available when a Pod was initially scheduled. These holes are produced by run-to-completion Pods terminating, empty nodes being added by a node auto-scaler, etc. Moving already-running Pods into these holes may lead to a better cluster layout. A rescheduler might not just exploit existing holes, but also create holes by evicting Pods (assuming it knows they can reschedule elsewhere), as in free space defragmentation.
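In miniature, the "exploit holes / defragment" idea reads like a greedy bin-packing pass: drain lightly-loaded nodes by moving their pods into holes on fuller nodes. `find_move` and the node model are invented for illustration; a real rescheduler would consult the scheduler's predicate and priority functions rather than raw capacity arithmetic.

```python
def free_capacity(node):
    """Toy free-capacity calculation for a single scalar resource."""
    return node["capacity"] - sum(node["pods"].values())

def find_move(nodes):
    """Find one defragmenting move: take the smallest pod on the
    least-loaded node and check whether it fits into a 'hole' on a
    fuller node (fullest first). Returns (pod, src, dst) or None."""
    by_load = sorted(nodes, key=lambda n: sum(nodes[n]["pods"].values()))
    src = by_load[0]
    if not nodes[src]["pods"]:
        return None  # least-loaded node is already empty
    pod, size = min(nodes[src]["pods"].items(), key=lambda kv: kv[1])
    for dst in reversed(by_load[1:]):  # prefer fuller destinations
        if free_capacity(nodes[dst]) >= size:
            return (pod, src, dst)
    return None
```

Repeating this until no move is found tends to empty out lightly-used nodes, which is the precondition for scaling the cluster down, while the inverse policy (creating holes by evicting pods that can reschedule elsewhere) needs the eviction machinery discussed below.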
[Although alluded to above, it's worth emphasizing that rescheduling is the only way to make use of new nodes added by a cluster auto-scaler (unless Pods were already PENDING; but even then, it's likely advantageous to put more than just the previously PENDING Pods on the new nodes.)]
Because rescheduling is disruptive--it causes one or more already-running Pods to die when they otherwise wouldn't--a key constraint on rescheduling is that it must be done subject to disruption SLOs. There are a number of ways to specify these SLOs--a global rate limit across all Pods, a rate limit across a set of Pods defined by some particular label selector, a maximum number of Pods that can be down at any one time among a set defined by some particular label selector, etc. These policies are presumably part of the Rescheduler's configuration.
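One of the SLO mechanisms listed above, a rate limit on evictions, can be sketched as a token bucket. `DisruptionLimiter` is a hypothetical name invented here; a real policy would likely keep one bucket globally plus one per label-selector group, as the paragraph describes.

```python
import time

class DisruptionLimiter:
    """Token-bucket rate limit on rescheduler-initiated evictions (toy model).

    rate_per_sec: steady-state evictions allowed per second.
    burst: maximum evictions allowed in a sudden burst.
    """
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        """Return True (and consume a token) if an eviction may proceed now."""
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A "maximum number of Pods down at once" policy would instead track in-flight disruptions and decrement on recovery, but the gating call site (`if limiter.allow(): evict(pod)`) looks the same either way.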
There are a lot of design possibilities for a rescheduler. To explain them, it's easiest to start with the description of a baseline rescheduler, and then describe possible modifications.

The Baseline rescheduler
Possible variations on this Baseline rescheduler are
A key design question for a Rescheduler is how much knowledge it needs about the scheduling policies used by the cluster's scheduler(s).
The vast majority of users probably only care about rescheduling for three scenarios: