
[WIP/RFC] Rescheduling in Kubernetes design proposal #22217

Merged
merged 1 commit on Jul 10, 2016

Conversation

davidopp
Member

Proposal by @bgrant0607 and @davidopp (and inspired by years of discussion and experience from folks who worked on Borg and Omega).

This doc is a proposal for a set of inter-related concepts related to "rescheduling" -- that is, "moving" an already-running pod to a new node in order to improve where it is running. (Specific concepts discussed are priority, preemption, disruption budget, quota, /evict subresource, and rescheduler.)

Feedback on the proposal is very welcome. For now, please stick to comments about the design, not spelling, punctuation, grammar, broken links, etc., so we can keep the doc uncluttered enough to make it easy for folks to comment on the more important things.

ref/ #22054 #18724 #19080 #12611 #20699 #17393 #12140 #22212

@HaiyangDING @mqliang @derekwaynecarr @kubernetes/sig-scheduling @kubernetes/huawei @timothysc @mml @dchen1107

@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 29, 2016
@k8s-bot

k8s-bot commented Feb 29, 2016

GCE e2e build/test passed for commit a543cfe3de6872f60c7d0a72e084082603cbcf72.

Kubernetes will terminate a pod that is managed by a controller, and the controller will
create a replacement pod that is then scheduled by the pod's scheduler. The terminated
pod and replacement pod are completely separate pods, and no pod migration is
implied. However, describing the process as "moving" the pod is approximately accurate
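To make the quoted paragraph concrete, here is a minimal sketch in plain Go; the types and the reschedule function are simplified, hypothetical stand-ins rather than the real Kubernetes API. The point is that a "move" is just a termination plus the independent creation of a brand-new pod object.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for the real Kubernetes Pod type,
// used only to illustrate that a "moved" pod is in fact two distinct objects.
type Pod struct {
	UID  string
	Name string
	Node string
}

// reschedule terminates the old pod and returns a brand-new replacement.
// Nothing is migrated: the replacement has its own UID and is placed by the
// pod's scheduler like any other newly created pod.
func reschedule(old Pod, newUID, chosenNode string) Pod {
	fmt.Printf("terminating pod %s (uid %s) on %s\n", old.Name, old.UID, old.Node)
	return Pod{
		UID:  newUID,          // new identity
		Name: old.Name + "-x", // controllers generate a new name in practice
		Node: chosenNode,      // decided by the scheduler, not carried over
	}
}

func main() {
	old := Pod{UID: "uid-1", Name: "web-abc12", Node: "node-a"}
	repl := reschedule(old, "uid-2", "node-b")
	fmt.Printf("replacement pod %s (uid %s) scheduled to %s\n", repl.Name, repl.UID, repl.Node)
}
```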
Contributor

Given that we have a pretty tight container abstraction and live migration has come a long way, should we consider whether live container (pod) migration isn't a better way to go? At the least, it probably deserves an "alternatives considered" entry.


We (and the user) would still need to deal with the node failure case, where live migration is impossible. So it becomes a matter of how many terminations the client sees, not whether they see them. Unless the reduction in the number of terminations is dramatic and crucial (which I don't think it is), I think that the consistency of termination/failure semantics wins here (i.e. pods always terminate and replacements are created, rather than pods sometimes moving, and sometimes dying and being replaced).

Member

I would prefer we avoid live container (pod) migration, and I would imagine the rescheduler will continue to respect graceful termination.

Contributor

+1 for avoiding live migration. Currently, Pods in k8s are stateless and data is saved in PersistentVolumes, so live migration is less meaningful.


+1 for avoiding live migration.

Member

This proposal doesn't preclude migration, and this functionality would be required in order to implement migration.

However, migration would be a lot more work than what is described here.

Migration is covered by #3949.

Member Author

Right. More generally, this proposal is an attempt to define an initial version and short-term roadmap, not the entire design space or long-term ideas. Once live migration of containers is available (the containerd docs seem to imply it will be soon? https://github.com/docker/containerd ) I would assume Kubernetes will take advantage of it.

Member

@hurf Persisted state can still be "migrated" via persistent volume claim or flocker. Presumably the stateful applications have to be able to deal with restart after failure, so migrating the in-memory state shouldn't be strictly necessary.

Contributor

Yes, that's what we do now. A problem is that dealing with failure can sometimes lower a service's performance indicators (that doesn't mean we shouldn't deal with failure, only that we should try to reduce it). Especially in the rescheduler case, a failure may not be caused by the service itself but by an eviction (unless we give it a disruption budget of none). If we had in-memory migration, the pod could be rescheduled without interrupting its ongoing task. That frees up more pods and loosens the disruption budget restriction. Indeed it's not a necessary feature, but it would be a nice optimization.

Contributor

I'm not asking for live migration. It's a good thing but not urgent.

@SrinivasChilveri

@kubernetes/huawei

@mqliang
Contributor

mqliang commented Mar 1, 2016

@bgrant0607 @davidopp This proposal is huge. May I volunteer to implement part of it? I am really interested in:

  1. Priority and Preemption. I have a proposal about Preemption in Random thought about Rescheduler implementation #22054, but without disruption budgets. And I have a proposal about "the order in which a scheduler examines pods in its scheduling loop" in Scheduling policy proposal #20203
  2. One feature of Rescheduler: moving a pod onto an under-utilized node. I'd like to implement this using the idea of "Pod Stealing" that I described in Random thought about Rescheduler implementation #22054

signature will be a
[controllerRef](https://github.com/kubernetes/kubernetes/issues/14961#issuecomment-183431648),
i.e. a reference to the pod's controller. Controllers are responsible for garbage
collecting, after some period of time, `PreferAvoidPods` that point to them, but the API
Contributor

Could you be more specific here? I am a little confused. Does "garbage collecting" mean: Pod "Pa" appears in the PreferAvoidPods of Node "Na", but unfortunately the scheduler schedules it to "Na" again, so after some period of time the Controller will try to delete it to see if there is a better Node for "Pa"?


Correct me if I am wrong. IIUC, if the Pod "Pa" unfortunately ends up on "Na" anyway, re-evaluating the situation of "Pa" is not GC's job; it is a completely separate concern. Once "Pa" is scheduled, it is scheduled.

The GC here means the removal of the controllerRef from PreferAvoidPods, since this "anti-affinity" is by no means permanent. Ideally, once the Pod we want to be running on the node gets correctly scheduled, the controllerRef of the evicted pod should be deleted, but that is hard to implement and of limited use. Just making the controllerRef in PreferAvoidPods a temporary thing is enough (via GC by the controllers and the api-server, described below).

Member Author

Yeah, sorry for the confusing terminology. As @HaiyangDING inferred, I just meant the controller should remove itself from the list after some period of time.
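For illustration only, here is a minimal sketch of that time-based cleanup, assuming a hypothetical shape for PreferAvoidPods entries (the proposal does not fix the exact API layout):

```go
package main

import (
	"fmt"
	"time"
)

// ControllerRef and AvoidEntry are hypothetical, simplified shapes for the
// PreferAvoidPods entries discussed above.
type ControllerRef struct {
	Kind string
	Name string
	UID  string
}

type AvoidEntry struct {
	Ref     ControllerRef
	AddedAt time.Time
}

// gcPreferAvoidPods drops entries older than ttl that point at the given
// controller, which is what "the controller removes itself from the list
// after some period of time" amounts to.
func gcPreferAvoidPods(entries []AvoidEntry, self ControllerRef, ttl time.Duration, now time.Time) []AvoidEntry {
	kept := entries[:0]
	for _, e := range entries {
		expired := now.Sub(e.AddedAt) > ttl
		mine := e.Ref.UID == self.UID
		if mine && expired {
			continue // garbage-collect our own stale entry
		}
		kept = append(kept, e)
	}
	return kept
}

func main() {
	self := ControllerRef{Kind: "ReplicaSet", Name: "web", UID: "rs-1"}
	entries := []AvoidEntry{
		{Ref: self, AddedAt: time.Now().Add(-2 * time.Hour)},
		{Ref: ControllerRef{Kind: "ReplicaSet", Name: "db", UID: "rs-2"}, AddedAt: time.Now()},
	}
	// Only the fresh entry belonging to the other controller survives.
	fmt.Println(len(gcPreferAvoidPods(entries, self, time.Hour, time.Now()))) // 1
}
```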

@davidopp
Member Author

davidopp commented Mar 1, 2016

Let's use the scheduling SIG mailing list and meetings to discuss who is interested in implementing the various pieces. But let's make sure we converge on the design for a piece before building it. :)

TBD: In addition to `PodSpec`, where do we store pointer to disruption budget
(podTemplate in controller that managed the pod?)? Do we auto-generate a disruption
budget (e.g. when instantiating a Service), or require the user to create it manually
before they create a controller? Which objects should return the disruption budget object
Contributor

I think we could use an admission controller to auto-generate a disruption budget. Users could enable such an admission controller so that when they create a Service, a disruption budget is auto-generated, or they can disable it and create disruption budgets manually.

Contributor

We can do this just like LimitRange: users can create one, but if they don't specify it, apply a default one.
Question: what's the behavior if this admission controller is disabled? Can all of the pods be disrupted, or none?
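A rough sketch of the LimitRange-style defaulting suggested above, using simplified, hypothetical types rather than the real admission-controller interfaces:

```go
package main

import "fmt"

// DisruptionBudget is a hypothetical, simplified shape used only to sketch
// the admission-controller behavior discussed above.
type DisruptionBudget struct {
	Name         string
	MinAvailable int // minimum number of pods that must stay up
}

// Service is likewise a stand-in for the real object being admitted.
type Service struct {
	Name     string
	Replicas int
	Budget   *DisruptionBudget
}

// admitService mimics a defaulting admission plugin: if the user did not
// create a budget, attach a default one; if they did, leave it alone. If the
// plugin is disabled, svc.Budget simply stays nil and whatever cluster-wide
// policy exists (all-or-nothing disruptability) applies.
func admitService(svc *Service) {
	if svc.Budget != nil {
		return
	}
	svc.Budget = &DisruptionBudget{
		Name:         svc.Name + "-default-budget",
		MinAvailable: svc.Replicas - 1, // default: tolerate one disruption at a time
	}
}

func main() {
	svc := Service{Name: "web", Replicas: 3}
	admitService(&svc)
	fmt.Printf("%s: minAvailable=%d\n", svc.Budget.Name, svc.Budget.MinAvailable)
}
```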

* moving a running pod off of a node from which it is receiving poor service
* anomalous crashlooping or other mysterious incompatibility between the pod and the node
* repeated out-of-resource killing (see #18724)
* repeated attempts by the scheduler to schedule the pod onto some node, but it is

Is this really possible? If so, we should presumably fix it in the scheduler, not a rescheduler. If the scheduler schedules a pod to a node, and then discovers that it had out-of-date information about the node, and the pod cannot in fact be scheduled there, shouldn't it automatically reschedule?

Member

If a pod is scheduled to a kubelet and rejected in admission, isn't the pod immediately moved to a terminal state? I agree with @quinton-hoole here that this seems like something the scheduler should get right, but from what I can tell so far in this PR, the rescheduler does more selective killing of pods on a node and does not actually take over primary scheduling responsibility, so maybe it's a no-op here and a solution is needed in both places?

Member Author

In the model where scheduler has full information about kubelet resources, what you guys are saying is correct (and this is indeed how things are today). However, in the future it's possible that Kubelet will have more information than the scheduler, especially if the resource topology within a node becomes very complicated and it's not scalable for the scheduler to know all of the details. I would like to avoid moving to that world as long as possible, though.


+1 to @davidopp's comments.



@davidopp @HaiyangDING In my mind there are two distinct cases where a pod fails to schedule:

  1. The scheduler picks a bad node, due to out-of-date or incomplete information. It's not clear to me that fixing this is the responsibility of a rescheduler. It seems like the scheduler should fix its own mistakes here, e.g. by trying again.
  2. The node picked by the scheduler is non-optimal (or becomes non-optimal over time), or no suitable node exists (without moving some other pods around). This seems like the purview of the rescheduler.

Am I missing something?


I will try to answer with my knowledge, correct me if I am wrong.

The scheduler picks a bad node, due to out-of-date or incomplete information. It's not clear to me that fixing this is the responsibility of a rescheduler. It seems like the scheduler should fix its own mistakes here, e.g. by trying again.

Currently, if the pod is denied by the node proposed by the scheduler, the pod is simply marked 'failed'; there is no retry. I think in the future:

  1. It is the scheduler's responsibility to try again to schedule a pod denied by the kubelet, possibly several times (<= N).
  2. It is the rescheduler's responsibility to handle a pod that has been denied by the kubelet several times (>= N). However, the mechanism needs further consideration. For instance, if the pod is denied by the same kubelet several times, we can add the avoid annotation to the node; but if the pod is denied by different kubelets, maybe we could just wait and see.

The node picked by the scheduler is non-optimal (or becomes non-optimal over time), or no suitable node exists (without moving some other pods around). This seems like the purview of the rescheduler.

Yes, it is. FWIW:

  • The node picked by the scheduler is non-optimal (or becomes non-optimal over time): this is in scope for the first version of the rescheduler, but we need to figure out some policy to decide how non-optimal a placement has to be (and how much improvement is possible) to trigger rescheduling. In effect, we are asking the rescheduler to know (at least some of) the priority functions.
  • No suitable node exists (without moving some other pods around): this is within the scope of the rescheduler, but it depends on preemption & priority, so it is not going to be implemented in the first step.
  • Another use case of the rescheduler in the first step is to move some pod to an under-utilized node, and both 'some pod' and 'under-utilized' need to be defined.

And finally, yeah, I don't think you miss anything :)

Member Author

Yeah, I'd say this falls into the category of things that could go into either the rescheduler or every scheduler. As mentioned somewhere in the doc, we don't actually need a rescheduler component at all--we could just implement all of the rescheduler logic in every scheduler, creating a virtual/distributed rescheduler. But it's easier for people to write new schedulers (and the system is easier to understand, global policies are easier to configure, etc.) if we have a single rescheduler component rather than putting the responsibility on every scheduler. With that in mind, the reasoning is basically what @HaiyangDING said -- while you could make schedulers responsible for noticing and stopping "rescheduling loops," you can instead make that the responsibility of the rescheduler, which would notice it happening (for any scheduler) and would add an indication to the node that equivalent pods should avoid that node for some period of time. But it is certainly the case that you could put this logic in the scheduler. (And we are not suggesting to address this at all in the first version of the rescheduler, especially since we don't have any scenarios today AFAIK that should cause rescheduling loops, other than stale information, which I don't consider a rescheduling loop because it will quickly stop.)
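As a rough illustration of the loop-detection idea described above (the names, shapes, and thresholds here are hypothetical; the proposal deliberately leaves the actual policy open):

```go
package main

import (
	"fmt"
	"time"
)

// rejection records one failed attempt to run a pod on a node; the shape is
// hypothetical and only serves to illustrate the idea.
type rejection struct {
	Pod  string
	Node string
	At   time.Time
}

// nodesToAvoid returns, per pod, the nodes that have rejected it at least
// `threshold` times within `window`. A rescheduler could then record an
// avoidance hint on those nodes for some period of time, regardless of which
// scheduler originally placed the pod.
func nodesToAvoid(history []rejection, threshold int, window time.Duration, now time.Time) map[string][]string {
	counts := map[string]map[string]int{}
	for _, r := range history {
		if now.Sub(r.At) > window {
			continue // stale information ages out on its own
		}
		if counts[r.Pod] == nil {
			counts[r.Pod] = map[string]int{}
		}
		counts[r.Pod][r.Node]++
	}
	out := map[string][]string{}
	for pod, perNode := range counts {
		for node, n := range perNode {
			if n >= threshold {
				out[pod] = append(out[pod], node)
			}
		}
	}
	return out
}

func main() {
	now := time.Now()
	hist := []rejection{
		{"web-1", "node-a", now.Add(-1 * time.Minute)},
		{"web-1", "node-a", now.Add(-2 * time.Minute)},
		{"web-1", "node-a", now.Add(-3 * time.Minute)},
	}
	fmt.Println(nodesToAvoid(hist, 3, 10*time.Minute, now)) // map[web-1:[node-a]]
}
```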

@erictune
Member

If you add an /evict subresource, then we can easily authorize that separately from DELETE /pods/foo. If you use a DeleteOptions, we don't yet have a way to handle that in authorization.

@erictune
Member

I haven't read all of this, but from our conversation, I got the impression the eviction is handled synchronously in the apiserver. That made me wonder:
Scheduling is asynchronous, and happens out of the apiserver process. Why isn't eviction also asynchronous, and out of the apiserver process?

Like scheduling, eviction computations:

  • might require a lot of computation, which we don't want in the request path of the apiserver.
  • might require a lot of memory for caching and holding state of all pods and nodes, which we don't want to hold in memory of apiserver.
  • might benefit from handling multiple requests in a batch.
  • benefits from letting other people try their own implementations
  • might need to parallelize someday
  • does not benefit from being "transactional", because some inputs cannot be serialized: if a node reboots, causing a pod to fail, you can't say "sorry, you can't fail, because that would exceed your disruption budget."

By the way, assigning a default disruptionBudget to newly created pods in apiserver seems fine. I'm just asking about eviction.
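A small sketch of what the asynchronous model being argued for could look like, with hypothetical request shapes; this is not a description of any existing component:

```go
package main

import "fmt"

// evictionRequest is a hypothetical shape for the asynchronous model being
// suggested: requests are recorded (e.g. as an intent on the pod) and an
// out-of-apiserver controller processes them later, possibly in batches.
type evictionRequest struct {
	Pod    string
	Reason string
}

// processBatch is the asynchronous worker: it can take its time, consult
// disruption budgets, and batch decisions, none of which happens in the
// apiserver request path.
func processBatch(queue []evictionRequest, budgetAllows func(pod string) bool) (evicted, deferred []string) {
	for _, req := range queue {
		if budgetAllows(req.Pod) {
			evicted = append(evicted, req.Pod)
		} else {
			deferred = append(deferred, req.Pod) // retry in a later batch
		}
	}
	return evicted, deferred
}

func main() {
	queue := []evictionRequest{{"web-1", "rebalance"}, {"web-2", "rebalance"}}
	allow := func(pod string) bool { return pod != "web-2" } // pretend the budget blocks web-2
	ev, def := processBatch(queue, allow)
	fmt.Println(ev, def) // [web-1] [web-2]
}
```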

of overcommitment, by allowing prioritization of which pods should be allowed to run
when demand for cluster resources exceeds supply.

### Disruption budget
Member

Let's move this to a design doc. It's underway already.

Member

Never mind. If we move priority out, isn't the rest resolved and underway, so this could be merged?

Member

@bgrant0607 Issue/PR?

Member Author

PodDisruptionBudget API is already merged, see
https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/policy/v1alpha1/types.go#L56

The controller for it is awaiting my review, #25921.

Last step is to implement /evict subresource (no PR for that yet).
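For orientation, a self-contained sketch of the budget check an /evict handler would have to perform; the types below are simplified stand-ins, not the merged v1alpha1 API (see the linked types.go for the real fields):

```go
package main

import "fmt"

// A simplified, self-contained stand-in for the PodDisruptionBudget concept:
// a label selector plus a minimum number of matching pods that must remain
// available.
type PodDisruptionBudget struct {
	Selector     map[string]string
	MinAvailable int
}

// evictionAllowed is the check an /evict subresource handler would need to
// perform (names here are illustrative): the eviction may proceed only if the
// budget still holds after removing one pod.
func evictionAllowed(b PodDisruptionBudget, currentlyAvailable int) bool {
	return currentlyAvailable-1 >= b.MinAvailable
}

func main() {
	budget := PodDisruptionBudget{Selector: map[string]string{"app": "web"}, MinAvailable: 2}
	fmt.Println(evictionAllowed(budget, 3)) // true: one disruption is tolerable
	fmt.Println(evictionAllowed(budget, 2)) // false: would violate the budget
}
```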

@davidopp
Member Author

Thanks to everyone for their feedback. Obviously this is a complex topic with a large design space. I have incorporated some of the suggestions, and am going to submit the doc. Note that it is intentionally in the proposals/ directory -- it's somewhere between a proposal and a design doc. I expect a fuller design doc for some issues, such as priorities and preemption, before any implementation happens. Other features, such as disruption budget and evict subresource, are almost finished. I don't think it's worth splitting different aspects of this doc into separate docs, since the ideas are closely inter-related, and I think it's helpful for someone to see a global view of how they fit together. Instead I think we should view this as an overview doc, with the expectation that there will be more detailed design docs for some features as necessary.

@davidopp
Member Author

(Note: I haven't pushed the commit with the fixes yet. Having some trouble on my machine. Will do it soon.)

@davidopp davidopp force-pushed the rescheduling branch 2 times, most recently from a852773 to 19fbd90 Compare July 10, 2016 21:55
@k8s-github-robot k8s-github-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 10, 2016
@davidopp davidopp added release-note-none Denotes a PR that doesn't merit a release note. lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed release-note-label-needed labels Jul 10, 2016
@k8s-bot

k8s-bot commented Jul 10, 2016

GCE e2e build/test passed for commit a85277313ea6372d35a93fd01411544d700accf3.

@k8s-bot

k8s-bot commented Jul 10, 2016

GCE e2e build/test passed for commit b77e392.

@k8s-bot

k8s-bot commented Jul 10, 2016

GCE e2e build/test passed for commit 19fbd90e691932724f2c50e4f21bcbd48d92fab1.

@k8s-github-robot

Automatic merge from submit-queue

@krmayankk

@davidopp is this available in 1.7?

@timothysc
Member

@krmayankk It is being built out of core and is working towards incubation.

@aveshagarwal has more details.
