[WIP/RFC] Rescheduling in Kubernetes design proposal #22217
GCE e2e build/test passed for commit a543cfe3de6872f60c7d0a72e084082603cbcf72.
Kubernetes will terminate a pod that is managed by a controller, and the controller will
create a replacement pod that is then scheduled by the pod's scheduler. The terminated
pod and replacement pod are completely separate pods, and no pod migration is
implied. However, describing the process as "moving" the pod is approximately accurate
Given that we have a pretty tight container abstraction and live migration has come a long way, should we consider whether live container (pod) migration is a better way to go? At the least, it probably deserves an "alternatives considered" entry.
We (and the user) would still need to deal with the node failure case, where live migration is impossible. So it becomes a matter of how many terminations the client sees, not whether they see them. Unless the reduction in number of terminations is dramatic and crucial (which I don't think it is), I think that the consistency of termination/failure semantics wins here (i.e. pods always terminate and replacements are created, rather than pods sometimes moving, and sometimes dying and being replaced).
I would prefer we avoid live container (pod) migration, and I would imagine the rescheduler will continue to respect graceful termination.
+1 for avoiding live migration. Currently, Pods in k8s are stateless and we save data in PersistentVolume, so live migration is less meaningful.
+1 for avoiding live migration.
This proposal doesn't preclude migration, and this functionality would be required in order to implement migration.
However, migration would be a lot more work than what is described here.
Migration is covered by #3949.
Right. More generally, this proposal is an attempt to define an initial version and short-term roadmap, not the entire design space or long-term ideas. Once live migration of containers is available (the containerd docs seem to imply it will be soon? https://github.com/docker/containerd ) I would assume Kubernetes will take advantage of it.
@hurf Persisted state can still be "migrated" via persistent volume claim or flocker. Presumably the stateful applications have to be able to deal with restart after failure, so migrating the in-memory state shouldn't be strictly necessary.
Yes, that's what we do now. One problem is that dealing with failure may sometimes lower a service's performance indicators (this doesn't mean we won't deal with failure, only that we try to reduce it). Especially in the rescheduler case, the failure may be caused not by the service itself but by an eviction (unless we give it a disruption budget of none). If we can have in-memory migration, the pod can get rescheduled without breaking its ongoing task. It frees more pods and loosens the disruption budget restriction. Indeed it's not a necessary feature but an optimization.
I'm not asking for live migration. It's a good thing but not urgent.
@kubernetes/huawei |
@bgrant0607 @davidopp This proposal is very huge. May I volunteer myself to implement part of this? I am really interested in:
|
signature will be a
[controllerRef](https://github.com/kubernetes/kubernetes/issues/14961#issuecomment-183431648),
i.e. a reference to the pod's controller. Controllers are responsible for garbage
collecting, after some period of time, `PreferAvoidPods` that point to them, but the API
Could you be more specific here? I am a little confused. Does "garbage collecting" mean: Pod "Pa" appears in the `PreferAvoidPods` of Node "Na", but unfortunately the scheduler schedules it to "Na" again, so after some period of time, the controller will try to delete it to see if there is a better Node for "Pa"?
Correct me if I am wrong. IIUC, if the Pod "Pa" unfortunately goes to "Na" eventually, whether to re-evaluate the situation of "Pa" is not GC's work; it is a completely separate job. Once "Pa" is scheduled, it is scheduled.
The GC here means the removal of the `controllerRef` of the `PreferAvoidPods`, since this "anti-affinity" is by no means a permanent thing. Ideally, once the Pod we want to be running on the node gets correctly scheduled, the `controllerRef` of the evicted pod should be deleted, but that is hard to implement and kind of useless. Just making the `controllerRef` in the `PreferAvoidPods` a temporary thing is enough (by GC of the controllers and the api-server described below).
Yeah, sorry for the confusing terminology. As @HaiyangDING inferred, I just meant the controller should remove itself from the list after some period of time.
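To make the "remove itself from the list after some period of time" idea concrete, here is a minimal sketch, not the proposal's actual API. `AvoidEntry`, `pruneExpired`, and `defaultTTL` are hypothetical names invented for illustration, and the five-minute TTL is an arbitrary assumption.

```go
package main

import (
	"fmt"
	"time"
)

// defaultTTL is an assumed lifetime for an avoid entry; the proposal only
// says "some period of time".
const defaultTTL = 5 * time.Minute

// AvoidEntry is a hypothetical stand-in for one PreferAvoidPods entry:
// a reference to a controller plus the time the entry was added.
type AvoidEntry struct {
	ControllerRef string
	AddedAt       time.Time
}

// pruneExpired returns only the entries that are younger than ttl at time now.
// Entries whose age is ttl or more are dropped.
func pruneExpired(entries []AvoidEntry, now time.Time, ttl time.Duration) []AvoidEntry {
	kept := make([]AvoidEntry, 0, len(entries))
	for _, e := range entries {
		if now.Sub(e.AddedAt) < ttl {
			kept = append(kept, e)
		}
	}
	return kept
}

func main() {
	now := time.Date(2016, 3, 1, 12, 0, 0, 0, time.UTC)
	entries := []AvoidEntry{
		{ControllerRef: "rc/frontend", AddedAt: now.Add(-10 * time.Minute)},
		{ControllerRef: "rc/backend", AddedAt: now.Add(-2 * time.Minute)},
	}
	// Only the entry added two minutes ago survives the five-minute TTL.
	for _, e := range pruneExpired(entries, now, defaultTTL) {
		fmt.Println(e.ControllerRef)
	}
}
```

Whether the controller or the API server runs this pruning is left open here; the sketch only shows the time-based expiry itself.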
Let's use the scheduling SIG mailing list and meetings to discuss who is interested in implementing the various pieces. But let's make sure we converge on the design for a piece before building it. :) |
TBD: In addition to `PodSpec`, where do we store pointer to disruption budget
(podTemplate in controller that managed the pod?)? Do we auto-generate a disruption
budget (e.g. when instantiating a Service), or require the user to create it manually
before they create a controller? Which objects should return the disruption budget object
I think we could use an admission controller to auto-generate a disruption budget. Users could enable such an admission controller so that when they create a Service, a disruption budget is auto-generated, or they could disable it and create the disruption budget manually.
We can do it just like LimitRange: users can create one, but if they don't specify it, a default one is applied.
Question: what's the behavior if this admission controller is disabled? Can all or none of the pods be disrupted?
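The LimitRange-style defaulting suggested above could look roughly like the following sketch. `Pod`, `BudgetName`, and `withDefaultBudget` are hypothetical names for illustration only; how a pod actually points at its disruption budget is still an open TBD in the proposal.

```go
package main

import "fmt"

// Pod is a hypothetical, minimal pod record. An empty BudgetName means the
// user did not specify a disruption budget.
type Pod struct {
	Name       string
	BudgetName string
}

// withDefaultBudget mimics LimitRange-style defaulting: a budget chosen by
// the user is kept, and the default is applied only when none was given.
func withDefaultBudget(p Pod, defaultBudget string) Pod {
	if p.BudgetName == "" {
		p.BudgetName = defaultBudget
	}
	return p
}

func main() {
	pods := []Pod{
		{Name: "web-1"},
		{Name: "db-1", BudgetName: "db-budget"},
	}
	for _, p := range pods {
		p = withDefaultBudget(p, "default-budget")
		fmt.Printf("%s uses %s\n", p.Name, p.BudgetName)
	}
}
```

The sketch does not answer the question of what happens when the admission controller is disabled; in that case `BudgetName` would simply stay empty and the eviction path would need its own rule.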
* moving a running pod off of a node from which it is receiving poor service
* anomalous crashlooping or other mysterious incompatibility between the pod and the node
* repeated out-of-resource killing (see #18724)
* repeated attempts by the scheduler to schedule the pod onto some node, but it is
Is this really possible? If so, we should presumably fix it in the scheduler, not a rescheduler? If the scheduler schedules a pod to a node, and then discovers that it had out of date information about the node, and the pod can not in fact be scheduled there, it should automatically reschedule, possibly?
If a pod is scheduled to a kubelet and rejected in admission, isn't the pod immediately moved to a terminal state? I agree with @quinton-hoole here that this seems like something the scheduler should get right, but from what I can tell thus far in this PR, the rescheduler does more selective killing of pods on a node but does not actually take over primary scheduling responsibility, so maybe it's a no-op here and a solution is needed in both places?
In the model where scheduler has full information about kubelet resources, what you guys are saying is correct (and this is indeed how things are today). However, in the future it's possible that Kubelet will have more information than the scheduler, especially if the resource topology within a node becomes very complicated and it's not scalable for the scheduler to know all of the details. I would like to avoid moving to that world as long as possible, though.
+1 @davidopp 's comments.
@davidopp @HaiyangDING In my mind there are two distinct cases where a pod fails to schedule:
- The scheduler picks a bad node, due to out of date or incomplete information. It's not clear to me that fixing this is the responsibility of a rescheduler.
- The scheduler picks a non-optimal node, or no suitable node exists (without moving some other pods around). This seems like the purview of the rescheduler.
Am I missing something?
@davidopp @HaiyangDING In my mind there are two distinct cases where a pod fails to schedule:
- The scheduler picks a bad node, due to out of date or incomplete information. It's not clear to me that fixing this is the responsibility of a rescheduler. It seems like the scheduler should fix its own mistakes here, e.g. by trying again.
- The node picked by the scheduler is non-optimal (or becomes non-optimal over time), or no suitable node exists (without moving some other pods around). This seems like the purview of the rescheduler.
Am I missing something?
I will try to answer with my knowledge; correct me if I am wrong.
> The scheduler picks a bad node, due to out of date or incomplete information. It's not clear to me that fixing this is the responsibility of a rescheduler. It seems like the scheduler should fix its own mistakes here, e.g. by trying again.
Currently, if the pod is denied by the node proposed by the scheduler, the pod is simply marked 'failed'; there is no trying again. I think in future:
- It is the responsibility of the scheduler to try to schedule the pod denied by kubelet again, possibly several times (<=N).
- It is the responsibility of the rescheduler to handle the pod that has been denied by kubelet several times (>= N). However, the mechanism needs further consideration. For instance, if the pod is denied by the same kubelet several times, we can add the avoid annotation on the node; but if the pod is denied by different kubelets, maybe we could just wait and see.
> The node picked by the scheduler is non-optimal (or becomes non-optimal over time), or no suitable node exists (without moving some other pods around). This seems like the purview of the rescheduler.
Yes, it is. FWIW:
- "The node picked by the scheduler is non-optimal (or becomes non-optimal over time)": this is in the scope of the first version of rescheduler, but we need to figure out some policy to decide how non-optimal is bad enough (as well as the possibility to improve) to trigger the rescheduling behavior. Actually, by this we are asking the rescheduler to know (at least some of) the priority functions.
- "no suitable node exists (without moving some other pods around)": this is within the scope of rescheduler but is related to preemption & priority, so it is not going to be implemented in the first step.
- Another use case of the rescheduler in the first step is to move some pod to an under-utilized node, and both 'some pod' and 'under-utilized' need to be defined.
And finally, yeah, I don't think you missed anything :)
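The split of responsibilities described in this comment can be sketched as a small decision function. This is only an illustration under assumptions: `Action`, `nextAction`, and the retry limit `n` are hypothetical, the boundary is taken as "fewer than n denials means the scheduler retries", and "denied by the same kubelet" is simplified to "every recorded denial came from one node".

```go
package main

import "fmt"

// Action names what should happen next to a pod that a kubelet has denied.
type Action string

const (
	RetryScheduling Action = "scheduler retries"
	AvoidNode       Action = "rescheduler adds avoid annotation to node"
	WaitAndSee      Action = "rescheduler waits and sees"
)

// nextAction takes the nodes that denied the pod (one entry per denial) and
// a hypothetical retry limit n.
func nextAction(denyingNodes []string, n int) Action {
	// Below the limit (or with no denials yet) the scheduler just tries again.
	if len(denyingNodes) == 0 || len(denyingNodes) < n {
		return RetryScheduling
	}
	// At or above the limit the rescheduler takes over. If every denial came
	// from the same node, mark that node as one to avoid.
	first := denyingNodes[0]
	for _, node := range denyingNodes[1:] {
		if node != first {
			return WaitAndSee
		}
	}
	return AvoidNode
}

func main() {
	fmt.Println(nextAction([]string{"node-a"}, 3))
	fmt.Println(nextAction([]string{"node-a", "node-a", "node-a"}, 3))
	fmt.Println(nextAction([]string{"node-a", "node-b", "node-a"}, 3))
}
```

Keeping this logic in one rescheduler component, rather than in each scheduler, is the design choice argued for later in the thread.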
Yeah, I'd say this falls into the category of things that could go into either the rescheduler or every scheduler. As mentioned somewhere in the doc, we don't actually need a rescheduler component at all--we could just implement all of the rescheduler in every scheduler, creating a virtual/distributed rescheduler. But it's easier for people to write new schedulers (and the system is easier to understand, global policies are easier to configure, etc. etc.) if we have a single rescheduler component rather than putting the responsibility on every scheduler. With that in mind, the reasoning is basically what @HaiyangDING said -- while you could make schedulers responsible for noticing and stopping "rescheduling loops," you can instead make that the responsibility of the rescheduler, which would notice it happening (for any scheduler) and would add an indication to the node that equivalent pods should avoid that node for some period of time. But it is certainly the case that you could put this logic in the scheduler. (And we are not suggesting to address this at all in the first version of the rescheduler, especially since we don't have any scenarios today AFAIK that should cause rescheduling loops, other than stale information, which I don't consider a rescheduling loop because it will quickly stop.)
If you add an |
I haven't read all of this, but from our conversation, I got the impression the eviction is handled synchronously in the apiserver. That made me wonder: Like scheduling, eviction computations:
By the way, assigning a default disruptionBudget to newly created pods in apiserver seems fine. I'm just asking about eviction. |
of overcommitment, by allowing prioritization of which pods should be allowed to run
when demand for cluster resources exceeds supply.

### Disruption budget
Let's move this to a design doc. It's underway already.
Never mind. If we move priority out, isn't the rest resolved and underway, and this could be merged?
@bgrant0607 Issue/PR?
PodDisruptionBudget API is already merged, see
https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/policy/v1alpha1/types.go#L56
The controller for it is awaiting my review, #25921.
Last step is to implement /evict subresource (no PR for that yet).
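For readers wondering how an `/evict` handler might consult a disruption budget, here is a rough sketch. It is an assumption, not the merged API: `canEvict` and its two integer parameters are hypothetical simplifications, and the rule shown (an eviction is allowed only if enough healthy pods remain afterwards) is one plausible reading of "disruption budget".

```go
package main

import "fmt"

// canEvict reports whether one more voluntary eviction still leaves at least
// minAvailable healthy pods running. Both parameters are hypothetical
// simplifications and are not taken from the PodDisruptionBudget types.
func canEvict(healthy, minAvailable int) bool {
	return healthy-1 >= minAvailable
}

func main() {
	fmt.Println(canEvict(3, 2)) // three healthy pods, two required
	fmt.Println(canEvict(2, 2)) // two healthy pods, two required
}
```

With this rule, a budget that requires every pod to stay available never permits a voluntary eviction.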
Thanks to everyone for their feedback. Obviously this is a complex topic with a large design space. I have incorporated some of the suggestions, and am going to submit the doc. Note that it is intentionally in the proposals/ directory -- it's somewhere between a proposal and a design doc. I expect a fuller design doc for some issues, such as priorities and preemption, before any implementation happens. Other features, such as disruption budget and evict subresource, are almost finished. I don't think it's worth splitting different aspects of this doc into separate docs, since the ideas are closely inter-related, and I think it's helpful for someone to see a global view of how they fit together. Instead I think we should view this as an overview doc, with the expectation that there will be more detailed design docs for some features as necessary. |
(Note: I haven't pushed the commit with the fixes yet. Having some trouble on my machine. Will do it soon.)
GCE e2e build/test passed for commit a85277313ea6372d35a93fd01411544d700accf3.
GCE e2e build/test passed for commit b77e392.
GCE e2e build/test passed for commit 19fbd90e691932724f2c50e4f21bcbd48d92fab1.
Automatic merge from submit-queue
@davidopp is this available in 1.7?
@krmayankk It is being built out of core and is working towards incubation. @aveshagarwal has more details.
Proposal by @bgrant0607 and @davidopp (and inspired by years of discussion and experience from folks who worked on Borg and Omega).
This doc is a proposal for a set of inter-related concepts related to "rescheduling" -- that is, "moving" an already-running pod to a new node in order to improve where it is running. (Specific concepts discussed are priority, preemption, disruption budget, quota, `/evict` subresource, and rescheduler.)
Feedback on the proposal is very welcome. For now, please stick to comments about the design, not spelling, punctuation, grammar, broken links, etc., so we can keep the doc uncluttered enough to make it easy for folks to comment on the more important things.
ref/ #22054 #18724 #19080 #12611 #20699 #17393 #12140 #22212
@HaiyangDING @mqliang @derekwaynecarr @kubernetes/sig-scheduling @kubernetes/huawei @timothysc @mml @dchen1107