
[WIP/RFC] Rescheduling in Kubernetes design proposal #22217

Merged
merged 1 commit on Jul 10, 2016

Conversation

davidopp
Member

Proposal by @bgrant0607 and @davidopp (and inspired by years of discussion and experience from folks who worked on Borg and Omega).

This doc is a proposal for a set of inter-related concepts related to "rescheduling" -- that is, "moving" an already-running pod to a new node in order to improve where it is running. (Specific concepts discussed are priority, preemption, disruption budget, quota, /evict subresource, and rescheduler.)

Feedback on the proposal is very welcome. For now, please stick to comments about the design, not spelling, punctuation, grammar, broken links, etc., so we can keep the doc uncluttered enough to make it easy for folks to comment on the more important things.

ref/ #22054 #18724 #19080 #12611 #20699 #17393 #12140 #22212

@HaiyangDING @mqliang @derekwaynecarr @kubernetes/sig-scheduling @kubernetes/huawei @timothysc @mml @dchen1107

@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 29, 2016
@k8s-bot

k8s-bot commented Feb 29, 2016

GCE e2e build/test passed for commit a543cfe3de6872f60c7d0a72e084082603cbcf72.

Kubernetes will terminate a pod that is managed by a controller, and the controller will
create a replacement pod that is then scheduled by the pod's scheduler. The terminated
pod and replacement pod are completely separate pods, and no pod migration is
implied. However, describing the process as "moving" the pod is approximately accurate
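To make the quoted paragraph concrete, here is a minimal sketch in plain Go; the types and the reschedule function are simplified, hypothetical stand-ins rather than the real Kubernetes API. The point is that a "move" is just a termination plus the independent creation of a brand-new pod object.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for the real Kubernetes Pod type,
// used only to illustrate that a "moved" pod is in fact two distinct objects.
type Pod struct {
	UID  string
	Name string
	Node string
}

// reschedule terminates the old pod and returns a brand-new replacement.
// Nothing is migrated: the replacement has its own UID and is placed by the
// pod's scheduler like any other newly created pod.
func reschedule(old Pod, newUID, chosenNode string) Pod {
	fmt.Printf("terminating pod %s (uid %s) on %s\n", old.Name, old.UID, old.Node)
	return Pod{
		UID:  newUID,          // new identity
		Name: old.Name + "-x", // controllers generate a new name in practice
		Node: chosenNode,      // decided by the scheduler, not carried over
	}
}

func main() {
	old := Pod{UID: "uid-1", Name: "web-abc12", Node: "node-a"}
	repl := reschedule(old, "uid-2", "node-b")
	fmt.Printf("replacement pod %s (uid %s) scheduled to %s\n", repl.Name, repl.UID, repl.Node)
}
```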
Contributor

Given that we have a pretty tight container abstraction and live migration has come a long way, should we consider whether live container (pod) migration isn't a better way to go? At the least, it probably deserves an "alternatives considered" entry.


We (and the user) would still need to deal with the node failure case, where live migration is impossible. So it becomes a matter of how many terminations the client sees, not whether they see them. Unless the reduction in the number of terminations is dramatic and crucial (which I don't think it is), I think that the consistency of termination/failure semantics wins here (i.e. pods always terminate and replacements are created, rather than pods sometimes moving, and sometimes dying and being replaced).

Member

I would prefer we avoid live container (pod) migration, and I would imagine the rescheduler will continue to respect graceful termination.

Contributor

+1 for avoiding live migration. Currently, Pods in k8s are stateless and data is saved in PersistentVolumes, so live migration is less meaningful.


+1 for avoiding live migration.

Member

This proposal doesn't preclude migration, and this functionality would be required in order to implement migration.

However, migration would be a lot more work than what is described here.

Migration is covered by #3949.

Member Author

Right. More generally, this proposal is an attempt to define an initial version and short-term roadmap, not the entire design space or long-term ideas. Once live migration of containers is available (the containerd docs seem to imply it will be soon? https://github.com/docker/containerd ) I would assume Kubernetes will take advantage of it.

Member

@hurf Persisted state can still be "migrated" via persistent volume claim or flocker. Presumably the stateful applications have to be able to deal with restart after failure, so migrating the in-memory state shouldn't be strictly necessary.

Contributor

Yes, that's what we do now. A problem is that dealing with failure can sometimes lower a service's performance indicators (that doesn't mean we shouldn't deal with failure, only that we should try to reduce it). Especially in the rescheduler case, a failure may not be caused by the service itself but by an eviction (unless we give it a disruption budget of none). If we had in-memory migration, the pod could be rescheduled without interrupting its ongoing task. That frees up more pods and loosens the disruption budget restriction. Indeed it's not a necessary feature, but it would be a nice optimization.

Contributor

I'm not asking for live migration. It's a good thing but not urgent.

@SrinivasChilveri

@kubernetes/huawei

@mqliang
Contributor

mqliang commented Mar 1, 2016

@bgrant0607 @davidopp This proposal is huge. May I volunteer to implement part of it? I am really interested in:

  1. Priority and Preemption. I have a proposal about Preemption in Random thought about Rescheduler implementation #22054, but without disruption budgets. And I have a proposal about "the order in which a scheduler examines pods in its scheduling loop" in Scheduling policy proposal #20203
  2. One feature of Rescheduler: moving a pod onto an under-utilized node. I'd like to implement this using the idea of "Pod Stealing" that I described in Random thought about Rescheduler implementation #22054

signature will be a
[controllerRef](https://github.com/kubernetes/kubernetes/issues/14961#issuecomment-183431648),
i.e. a reference to the pod's controller. Controllers are responsible for garbage
collecting, after some period of time, `PreferAvoidPods` that point to them, but the API
Contributor

Could you be more specific here? I am a little confused. Does "garbage collecting" mean: Pod "Pa" appears in the PreferAvoidPods of Node "Na", but unfortunately the scheduler schedules it to "Na" again, so after some period of time the Controller will try to delete it to see if there is a better Node for "Pa"?


Correct me if I am wrong. IIUC, if the Pod "Pa" unfortunately ends up on "Na" anyway, re-evaluating the situation of "Pa" is not GC's job; it is a completely separate concern. Once "Pa" is scheduled, it is scheduled.

The GC here means the removal of the controllerRef from PreferAvoidPods, since this "anti-affinity" is by no means permanent. Ideally, once the Pod we want to be running on the node gets correctly scheduled, the controllerRef of the evicted pod should be deleted, but that is hard to implement and of limited use. Just making the controllerRef in PreferAvoidPods a temporary thing is enough (via GC by the controllers and the api-server, described below).

Member Author

Yeah, sorry for the confusing terminology. As @HaiyangDING inferred, I just meant the controller should remove itself from the list after some period of time.
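For illustration only, here is a minimal sketch of that time-based cleanup, assuming a hypothetical shape for PreferAvoidPods entries (the proposal does not fix the exact API layout):

```go
package main

import (
	"fmt"
	"time"
)

// ControllerRef and AvoidEntry are hypothetical, simplified shapes for the
// PreferAvoidPods entries discussed above.
type ControllerRef struct {
	Kind string
	Name string
	UID  string
}

type AvoidEntry struct {
	Ref     ControllerRef
	AddedAt time.Time
}

// gcPreferAvoidPods drops entries older than ttl that point at the given
// controller, which is what "the controller removes itself from the list
// after some period of time" amounts to.
func gcPreferAvoidPods(entries []AvoidEntry, self ControllerRef, ttl time.Duration, now time.Time) []AvoidEntry {
	kept := entries[:0]
	for _, e := range entries {
		expired := now.Sub(e.AddedAt) > ttl
		mine := e.Ref.UID == self.UID
		if mine && expired {
			continue // garbage-collect our own stale entry
		}
		kept = append(kept, e)
	}
	return kept
}

func main() {
	self := ControllerRef{Kind: "ReplicaSet", Name: "web", UID: "rs-1"}
	entries := []AvoidEntry{
		{Ref: self, AddedAt: time.Now().Add(-2 * time.Hour)},
		{Ref: ControllerRef{Kind: "ReplicaSet", Name: "db", UID: "rs-2"}, AddedAt: time.Now()},
	}
	// Only the fresh entry belonging to the other controller survives.
	fmt.Println(len(gcPreferAvoidPods(entries, self, time.Hour, time.Now()))) // 1
}
```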

@davidopp
Member Author

davidopp commented Mar 1, 2016

Let's use the scheduling SIG mailing list and meetings to discuss who is interested in implementing the various pieces. But let's make sure we converge on the design for a piece before building it. :)

TBD: In addition to `PodSpec`, where do we store pointer to disruption budget
(podTemplate in controller that managed the pod?)? Do we auto-generate a disruption
budget (e.g. when instantiating a Service), or require the user to create it manually
before they create a controller? Which objects should return the disruption budget object
Contributor

I think we could use an admission controller to auto-generate a disruption budget. Users could enable such an admission controller so that when they create a Service, a disruption budget is auto-generated, or they can disable it and create disruption budgets manually.

Contributor

We can do this just like LimitRange: users can create one, but if they don't specify it, apply a default one.
Question: what's the behavior if this admission controller is disabled? Can all of the pods be disrupted, or none?
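A rough sketch of the LimitRange-style defaulting suggested above, using simplified, hypothetical types rather than the real admission-controller interfaces:

```go
package main

import "fmt"

// DisruptionBudget is a hypothetical, simplified shape used only to sketch
// the admission-controller behavior discussed above.
type DisruptionBudget struct {
	Name         string
	MinAvailable int // minimum number of pods that must stay up
}

// Service is likewise a stand-in for the real object being admitted.
type Service struct {
	Name     string
	Replicas int
	Budget   *DisruptionBudget
}

// admitService mimics a defaulting admission plugin: if the user did not
// create a budget, attach a default one; if they did, leave it alone. If the
// plugin is disabled, svc.Budget simply stays nil and whatever cluster-wide
// policy exists (all-or-nothing disruptability) applies.
func admitService(svc *Service) {
	if svc.Budget != nil {
		return
	}
	svc.Budget = &DisruptionBudget{
		Name:         svc.Name + "-default-budget",
		MinAvailable: svc.Replicas - 1, // default: tolerate one disruption at a time
	}
}

func main() {
	svc := Service{Name: "web", Replicas: 3}
	admitService(&svc)
	fmt.Printf("%s: minAvailable=%d\n", svc.Budget.Name, svc.Budget.MinAvailable)
}
```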

* moving a running pod off of a node from which it is receiving poor service
* anomalous crashlooping or other mysterious incompatibility between the pod and the node
* repeated out-of-resource killing (see #18724)
* repeated attempts by the scheduler to schedule the pod onto some node, but it is

Is this really possible? If so, we should presumably fix it in the scheduler, not a rescheduler. If the scheduler schedules a pod to a node, and then discovers that it had out-of-date information about the node, and the pod cannot in fact be scheduled there, shouldn't it automatically reschedule?

Member

If a pod is scheduled to a kubelet and rejected in admission, isn't the pod immediately moved to a terminal state? I agree with @quinton-hoole here that this seems like something the scheduler should get right, but from what I can tell so far in this PR, the rescheduler does more selective killing of pods on a node and does not actually take over primary scheduling responsibility, so maybe it's a no-op here and a solution is needed in both places?

Member Author

In the model where scheduler has full information about kubelet resources, what you guys are saying is correct (and this is indeed how things are today). However, in the future it's possible that Kubelet will have more information than the scheduler, especially if the resource topology within a node becomes very complicated and it's not scalable for the scheduler to know all of the details. I would like to avoid moving to that world as long as possible, though.


+1 to @davidopp's comments.



@davidopp @HaiyangDING In my mind there are two distinct cases where a pod fails to schedule:

  1. The scheduler picks a bad node, due to out-of-date or incomplete information. It's not clear to me that fixing this is the responsibility of a rescheduler. It seems like the scheduler should fix its own mistakes here, e.g. by trying again.
  2. The node picked by the scheduler is non-optimal (or becomes non-optimal over time), or no suitable node exists (without moving some other pods around). This seems like the purview of the rescheduler.

Am I missing something?


I will try to answer with my knowledge, correct me if I am wrong.

The scheduler picks a bad node, due to out-of-date or incomplete information. It's not clear to me that fixing this is the responsibility of a rescheduler. It seems like the scheduler should fix its own mistakes here, e.g. by trying again.

Currently, if the pod is denied by the node proposed by the scheduler, the pod is simply marked 'failed'; there is no retry. I think in the future:

  1. It is the scheduler's responsibility to try again to schedule a pod denied by the kubelet, possibly several times (<= N).
  2. It is the rescheduler's responsibility to handle a pod that has been denied by the kubelet several times (>= N). However, the mechanism needs further consideration. For instance, if the pod is denied by the same kubelet several times, we can add the avoid annotation to the node; but if the pod is denied by different kubelets, maybe we could just wait and see.

The node picked by the scheduler is non-optimal (or becomes non-optimal over time), or no suitable node exists (without moving some other pods around). This seems like the purview of the rescheduler.

Yes, it is. FWIW:

  • The node picked by the scheduler is non-optimal (or becomes non-optimal over time): this is in scope for the first version of the rescheduler, but we need to figure out some policy to decide how non-optimal a placement has to be (and how much improvement is possible) to trigger rescheduling. In effect, we are asking the rescheduler to know (at least some of) the priority functions.
  • No suitable node exists (without moving some other pods around): this is within the scope of the rescheduler, but it depends on preemption & priority, so it is not going to be implemented in the first step.
  • Another use case of the rescheduler in the first step is to move some pod to an under-utilized node, and both 'some pod' and 'under-utilized' need to be defined.

And finally, yeah, I don't think you miss anything :)

Member Author

Yeah, I'd say this falls into the category of things that could go into either the rescheduler or every scheduler. As mentioned somewhere in the doc, we don't actually need a rescheduler component at all--we could just implement all of the rescheduler logic in every scheduler, creating a virtual/distributed rescheduler. But it's easier for people to write new schedulers (and the system is easier to understand, global policies are easier to configure, etc.) if we have a single rescheduler component rather than putting the responsibility on every scheduler. With that in mind, the reasoning is basically what @HaiyangDING said -- while you could make schedulers responsible for noticing and stopping "rescheduling loops," you can instead make that the responsibility of the rescheduler, which would notice it happening (for any scheduler) and would add an indication to the node that equivalent pods should avoid that node for some period of time. But it is certainly the case that you could put this logic in the scheduler. (And we are not suggesting to address this at all in the first version of the rescheduler, especially since we don't have any scenarios today AFAIK that should cause rescheduling loops, other than stale information, which I don't consider a rescheduling loop because it will quickly stop.)
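As a rough illustration of the loop-detection idea described above (the names, shapes, and thresholds here are hypothetical; the proposal deliberately leaves the actual policy open):

```go
package main

import (
	"fmt"
	"time"
)

// rejection records one failed attempt to run a pod on a node; the shape is
// hypothetical and only serves to illustrate the idea.
type rejection struct {
	Pod  string
	Node string
	At   time.Time
}

// nodesToAvoid returns, per pod, the nodes that have rejected it at least
// `threshold` times within `window`. A rescheduler could then record an
// avoidance hint on those nodes for some period of time, regardless of which
// scheduler originally placed the pod.
func nodesToAvoid(history []rejection, threshold int, window time.Duration, now time.Time) map[string][]string {
	counts := map[string]map[string]int{}
	for _, r := range history {
		if now.Sub(r.At) > window {
			continue // stale information ages out on its own
		}
		if counts[r.Pod] == nil {
			counts[r.Pod] = map[string]int{}
		}
		counts[r.Pod][r.Node]++
	}
	out := map[string][]string{}
	for pod, perNode := range counts {
		for node, n := range perNode {
			if n >= threshold {
				out[pod] = append(out[pod], node)
			}
		}
	}
	return out
}

func main() {
	now := time.Now()
	hist := []rejection{
		{"web-1", "node-a", now.Add(-1 * time.Minute)},
		{"web-1", "node-a", now.Add(-2 * time.Minute)},
		{"web-1", "node-a", now.Add(-3 * time.Minute)},
	}
	fmt.Println(nodesToAvoid(hist, 3, 10*time.Minute, now)) // map[web-1:[node-a]]
}
```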

@erictune
Member

If you add an /evict subresource, then we can easily authorize that separately from DELETE /pods/foo. If you use a DeleteOptions, we don't yet have a way to handle that in authorization.

@erictune
Member

I haven't read all of this, but from our conversation, I got the impression the eviction is handled synchronously in the apiserver. That made me wonder:
Scheduling is asynchronous, and happens out of the apiserver process. Why isn't eviction also asynchronous, and out of the apiserver process?

Like scheduling, eviction computations:

  • might require a lot of computation, which we don't want in the request path of the apiserver.
  • might require a lot of memory for caching and holding state of all pods and nodes, which we don't want to hold in memory of apiserver.
  • might benefit from handling multiple requests in a batch.
  • benefits from letting other people try their own implementations
  • might need to parallelize someday
  • does not benefit from being "transactional", because some inputs cannot be serialized: if a node reboots, causing a pod to fail, you can't say "sorry, you can't fail, because that would exceed your disruption budget."

By the way, assigning a default disruptionBudget to newly created pods in apiserver seems fine. I'm just asking about eviction.
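A small sketch of what the asynchronous model being argued for could look like, with hypothetical request shapes; this is not a description of any existing component:

```go
package main

import "fmt"

// evictionRequest is a hypothetical shape for the asynchronous model being
// suggested: requests are recorded (e.g. as an intent on the pod) and an
// out-of-apiserver controller processes them later, possibly in batches.
type evictionRequest struct {
	Pod    string
	Reason string
}

// processBatch is the asynchronous worker: it can take its time, consult
// disruption budgets, and batch decisions, none of which happens in the
// apiserver request path.
func processBatch(queue []evictionRequest, budgetAllows func(pod string) bool) (evicted, deferred []string) {
	for _, req := range queue {
		if budgetAllows(req.Pod) {
			evicted = append(evicted, req.Pod)
		} else {
			deferred = append(deferred, req.Pod) // retry in a later batch
		}
	}
	return evicted, deferred
}

func main() {
	queue := []evictionRequest{{"web-1", "rebalance"}, {"web-2", "rebalance"}}
	allow := func(pod string) bool { return pod != "web-2" } // pretend the budget blocks web-2
	ev, def := processBatch(queue, allow)
	fmt.Println(ev, def) // [web-1] [web-2]
}
```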

of overcommitment, by allowing prioritization of which pods should be allowed to run
when demand for cluster resources exceeds supply.

### Disruption budget
Member

Let's move this to a design doc. It's underway already.

Member

Never mind. If we move priority out, isn't the rest resolved and underway, so this could be merged?

Member

@bgrant0607 Issue/PR?

Member Author

PodDisruptionBudget API is already merged, see
https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/policy/v1alpha1/types.go#L56

The controller for it is awaiting my review, #25921.

Last step is to implement /evict subresource (no PR for that yet).
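For orientation, a self-contained sketch of the budget check an /evict handler would have to perform; the types below are simplified stand-ins, not the merged v1alpha1 API (see the linked types.go for the real fields):

```go
package main

import "fmt"

// A simplified, self-contained stand-in for the PodDisruptionBudget concept:
// a label selector plus a minimum number of matching pods that must remain
// available.
type PodDisruptionBudget struct {
	Selector     map[string]string
	MinAvailable int
}

// evictionAllowed is the check an /evict subresource handler would need to
// perform (names here are illustrative): the eviction may proceed only if the
// budget still holds after removing one pod.
func evictionAllowed(b PodDisruptionBudget, currentlyAvailable int) bool {
	return currentlyAvailable-1 >= b.MinAvailable
}

func main() {
	budget := PodDisruptionBudget{Selector: map[string]string{"app": "web"}, MinAvailable: 2}
	fmt.Println(evictionAllowed(budget, 3)) // true: one disruption is tolerable
	fmt.Println(evictionAllowed(budget, 2)) // false: would violate the budget
}
```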

@davidopp
Member Author

Thanks to everyone for their feedback. Obviously this is a complex topic with a large design space. I have incorporated some of the suggestions, and am going to submit the doc. Note that it is intentionally in the proposals/ directory -- it's somewhere between a proposal and a design doc. I expect a fuller design doc for some issues, such as priorities and preemption, before any implementation happens. Other features, such as disruption budget and evict subresource, are almost finished. I don't think it's worth splitting different aspects of this doc into separate docs, since the ideas are closely inter-related, and I think it's helpful for someone to see a global view of how they fit together. Instead I think we should view this as an overview doc, with the expectation that there will be more detailed design docs for some features as necessary.

@davidopp
Member Author

(Note: I haven't pushed the commit with the fixes yet. Having some trouble on my machine. Will do it soon.)

@davidopp davidopp force-pushed the rescheduling branch 2 times, most recently from a852773 to 19fbd90 Compare July 10, 2016 21:55
@k8s-github-robot k8s-github-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 10, 2016
@davidopp davidopp added release-note-none Denotes a PR that doesn't merit a release note. lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed release-note-label-needed labels Jul 10, 2016
@k8s-bot

k8s-bot commented Jul 10, 2016

GCE e2e build/test passed for commit a85277313ea6372d35a93fd01411544d700accf3.

@k8s-bot

k8s-bot commented Jul 10, 2016

GCE e2e build/test passed for commit b77e392.

@k8s-bot

k8s-bot commented Jul 10, 2016

GCE e2e build/test passed for commit 19fbd90e691932724f2c50e4f21bcbd48d92fab1.

@k8s-github-robot

Automatic merge from submit-queue

@krmayankk

@davidopp is this available in 1.7?

@timothysc
Member

@krmayankk It is being built out of core and is working towards incubation.

@aveshagarwal has more details.
