Create a rescheduler #12140
We also had some discussion of how a rescheduler might trigger cluster auto-scaling (to scale up). Instead of moving Pods around to free up space, it might just add a new node (and then move some Pods onto the new node). More generally, it might be useful to integrate the rescheduler and cluster auto-scaler. @erictune made the observation that for scaling up the cluster a reasonable workflow might be:
We talked a little about the role of simulation. Knowing the effect of different rearrangements of Pods requires a form of simulation of the scheduling algorithm (see also the discussion in the previous entry about what the rescheduler needs to know about the predicate and priority functions of the cluster's scheduler(s)). For cluster auto-scaling down, @erictune pointed out that you could run a simulation to see whether, after removing a node from the cluster, the Pods that were on that node would be able to reschedule, either directly or with the help of the rescheduler; if the answer is yes, then you can safely auto-scale down (assuming services will still be meeting their application-level SLOs).
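The scale-down simulation @erictune describes can be sketched in miniature. This is a toy model, not the actual autoscaler code: the `fits`/`can_scale_down` helpers and the node/pod dictionaries are invented for illustration, and a real simulation would run the scheduler's full predicate and priority functions.

```python
def fits(pod, node):
    """Toy predicate check: do the pod's resource requests fit in the node's free capacity?"""
    return all(pod["requests"].get(r, 0) <= node["free"].get(r, 0)
               for r in pod["requests"])

def can_scale_down(node_name, nodes, pods_by_node):
    """Simulate removing `node_name`: greedily try to place each of its pods
    onto the remaining nodes; True means scale-down looks safe."""
    remaining = {n: {"free": dict(spec["free"])}
                 for n, spec in nodes.items() if n != node_name}
    for pod in pods_by_node.get(node_name, []):
        target = next((n for n in remaining if fits(pod, remaining[n])), None)
        if target is None:
            return False  # at least one pod would stay PENDING
        for r, amount in pod["requests"].items():
            remaining[target]["free"][r] = remaining[target]["free"].get(r, 0) - amount
    return True
```

A greedy first-fit pass like this can report False even when a cleverer rearrangement (i.e. the rescheduler's help) would succeed, which is exactly the "directly or with the help of the rescheduler" distinction made above.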
@davidopp Could you write this up as a .txt so we can comment inline? I've become a fan of the proposal-before-code mantra that has been going on.
I'm reluctant to describe this as a proposal right now because the above was more targeted as a description of the full space, as opposed to a concrete proposal. Or to put it another way, what I wrote above might make for a good introduction to a concrete proposal describing exactly what we would implement. But I can put it in a PR that tries to make it clear that it's not a proposal or design doc, just a description of the space.
(My expectation is that nobody would work on implementing anything rescheduler-like for at least the next 2-3 months).
Question: If a script were to continually increase the replica count by one, then decrease it back by one moments later, will the current scheduler do the right thing and favor creating the new replica on an "empty-er" node and/or killing pending/blocked pods that have failed to schedule for a long time? (i.e. a workaround for #12195 and this)
Yes. Although it would probably be less efficient than finding the worst
I think the answer is 'probably, eventually'. Currently the RC favors removal of Pods in earlier stages (Pending < Unknown < Running), so if you flip the counts too quickly it'll remove the pod it added just before. We also don't mark Nodes as 'just-been-kicked-out-of-it-please-don't-put-me-there-again', which, with the current scheduler priority implementation, may in some cases force the scheduler to put the new Pod on the same machine over and over again.
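The removal preference described above (Pending < Unknown < Running) can be sketched as a victim-selection rule. `pick_victim` and the pod dictionaries are hypothetical names for illustration, not the actual ReplicationController code:

```python
# Assumed phase ordering from the comment above: Pending < Unknown < Running,
# i.e. Pending pods are removed first when scaling a controller down.
PHASE_RANK = {"Pending": 0, "Unknown": 1, "Running": 2}

def pick_victim(pods):
    """Return the pod a ReplicationController-style scale-down would remove
    first: the one in the earliest lifecycle phase (toy model)."""
    return min(pods, key=lambda p: PHASE_RANK.get(p["phase"], len(PHASE_RANK)))
```

This is why the flip-the-count-quickly trick can be self-defeating: the freshly created pod is usually still Pending, so it ranks first for removal.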
@gmarek wasn't there something about the scheduler trying to spread out pods? From the user guide, http://kubernetes.io/v1.0/docs/user-guide/compute-resources.html
...should I be reading this as "unrestricted pods" will most likely end up on the same node? (at this time)

It would help to have more transparency on what the algorithm is really doing (i.e. what are the cases where it doesn't spread, what are the cases where it spreads fine, etc.), since not everyone has the same "corner cases"; someone's corner case might be someone else's default way of doing things. More transparency and insight would also help a lot with potential debugging scenarios, or with optimizing user-land processes so they fit in nicely with what Kubernetes wants to do under the hood (whatever that may be).

From trying to integrate Kubernetes, some of the biggest headaches have been with these semi-blackbox undocumented algorithms (potentially involving opinionated strategies), since they lead to some really nasty gotchas like "if a container has limits set, and no ideal fit is found, it will NOT be scheduled anywhere and will just stay in Pending". (I assumed "running", even at the cost of potentially crowding other containers or being slower than desirable, would be the more robust way of compromising on not having enough resources, until I stumbled on the documentation--or comments--suggesting otherwise.)

Side note: I'm aware that in theory "perfectly allocating resources" would be ideal and would technically mitigate some of these problems, but in practice I find it hard to motivate such a strategy, mainly because of "who exactly can tell what's the perfect fit", "how much time and resources does it take to search for the perfect fit", and also "why would I want it not to try to use as much as it can when I'm already paying for the underlying hardware?" (assuming I'm not misunderstanding the "completely fair shares" system that's applied at the OS level when resources are restricted).

It would be nice if in the future we could have a "maximum node CPU/replica ratio" as a soft hint (i.e. one that can be ignored) to the (re)scheduler, e.g. 1:1, no more than 1 replica for every 1 CPU on the node; so if a node has 4 CPUs, no more than 4 replicas "should" try to get squeezed in if there's space elsewhere. Replicas would still see everything as if not restricted at all.
@srcspider The problem is that I'm not sure anyone actually understands how exactly it is done. Currently the scheduler takes into account a few things (the amount of free resources on a Node, the number of Pods running on the Node, how many Pods from the same ReplicationController/Service are already running on a given Node, how 'balanced' the resource usage is), assigns a number from 0 to 10 for each of them, adds them up, and picks the Node with the highest total. I think the unrestricted-pods case is already solved, so that part of the doc is outdated. You might be interested in #11713
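The scoring scheme described above (several 0-to-10 priority functions, summed, highest total wins) can be sketched like this. The two priority functions here are illustrative stand-ins, not the real implementations, and the node model is invented for the example:

```python
def least_requested(node):
    # Toy stand-in: favor nodes with more free CPU, scaled to 0-10.
    return 10 * node["free_cpu"] / node["capacity_cpu"]

def spreading(node):
    # Toy stand-in: favor nodes running fewer pods from the same
    # ReplicationController/Service, scaled to 0-10.
    return max(0, 10 - 2 * node["same_controller_pods"])

PRIORITIES = [least_requested, spreading]

def pick_node(nodes):
    """Sum each priority function's 0-10 score per node and pick the
    highest total, mirroring the scheme described in the comment."""
    return max(nodes, key=lambda name: sum(f(nodes[name]) for f in PRIORITIES))
```

Because the final score is a plain sum, a node that is strongly preferred by one function can win even if another function mildly penalizes it, which is one source of the "in some cases it doesn't spread" behavior discussed above.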
@gmarek knowing that it's trying to weigh those factors as its strategy is still more helpful than not knowing at all. Thanks for the explanation, much appreciated. :)
So long as we have the ability to plug in our own policy engines, I like it. Other systems often refer to this as the defragmentation process, and typically employ analogous algorithms. Ideally I would like an engine/controller/rescheduler/process where I can plug in a cloud provider and spin down resources as they are offloaded (the inverse of bursting).
@timothysc I think the architecture is TBD. (Even the feature set is TBD. :-) Would love to get your input on how to make the policy pluggable, once we get to the design stage. Probably something along the lines of how the scheduler policy is pluggable would work. (Though I think we'd like to refactor the scheduler to make customization a bit simpler.) Defragmentation is definitely part of this, but the objective function can be more general, i.e. incorporate more than just defragmentation as a goal. And I agree that this is related to cluster auto-scaling (seems to be what you referred to in your last paragraph).
I missed commenting on this issue and on the proposal: https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/rescheduler.md A few quick comments now:
Simplify compared to what? (Please don't say "compared to not taking cluster auto-scaling into account :) )
This is a good point. Since our restarts never go through the scheduler, the rescheduler is the only place we can put this logic. (Contrast with Borg, where IIRC after some number of local restarts, it punts back to the master to make a decision about restarting it, in which case the scheduler makes a decision and can take into account previous crashlooping if it wants to.)
Agreed; this is covered in the doc (" [obviously this killing operation must be able to specify "don't allow the killed Pod to reschedule back to whence it was killed" otherwise the killing is pointless]")
@aveshagarwal offered to take a crack at this. I think the spreading use case is the best one to start with: kill pods on heavily utilized nodes if you think that will move them to under-utilized nodes. Of course the utilization threshold between the over-

I would suggest starting with the current "rescheduler" codebase and modifying it, since some of the code can be reused. Hopefully we'll be able to get rid of the critical-pod preemption stuff from the rescheduler in 1.7 by implementing a more general priority/preemption scheme, but we need to leave it in there for now.

I think a key design criterion is extensibility -- make it easy for people to add new policies and to choose which ones to activate. It should definitely use the /eviction subresource, i.e. respect PDB.

We talked about importing scheduler code into other components in the sig-scheduling meeting today (e.g. the cluster autoscaler appears to do it here). You may want to do a "simulation" like that to verify that the evicted pods will actually move to the under-utilized nodes. I'm sure @mwielgus and others would be happy to explain the cluster autoscaler simulation code if it's not straightforward (I haven't looked at it carefully).
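A toy model of the spreading policy sketched above: evict pods from nodes over a utilization threshold, but only while a (much simplified) disruption budget allows. `evict_from_hot_nodes`, the `min_available` field, and the 0.8 threshold are all invented for illustration; the real mechanism is the /eviction subresource and PodDisruptionBudget objects.

```python
def allowed_disruptions(pdb, healthy_pods):
    """Simplified PDB semantics: evictions are allowed only while the
    number of healthy pods stays at or above min_available."""
    return max(0, healthy_pods - pdb["min_available"])

def evict_from_hot_nodes(nodes, pdb, healthy, threshold=0.8):
    """Pick eviction candidates from nodes above the utilization threshold,
    hottest first, stopping when the disruption budget is exhausted."""
    budget = allowed_disruptions(pdb, healthy)
    victims = []
    for name, node in sorted(nodes.items(), key=lambda kv: -kv[1]["utilization"]):
        if budget == 0:
            break
        if node["utilization"] > threshold and node["pods"]:
            victims.append((name, node["pods"][0]))  # evict one pod per hot node
            budget -= 1
    return victims
```

Note that this only chooses victims; per the discussion above, a real implementation should also run a scheduler simulation to confirm the victims would actually land on under-utilized nodes rather than rescheduling back where they came from.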
Ideally I would hope we fork a repo for this; putting this in contrib is a bad move imo.
@timothysc you mean a new sub-repo in kubernetes or in kube-incubator?
Doesn't matter to me, so long as it's not in the main or contrib repos.
/sub We'd be really keen to see this feature implemented sooner rather than later.
@ajtrichards Can you say what particular functionality you're looking for?
Hi @davidopp the specific need we had was to shift pods around onto other nodes to make space for some pods with larger resource requirements. We specify the requests and limits for each Deployment. We then have one deployment that requests 40% of the CPU, and it gets stuck in a Pending state as it can't schedule anywhere. It would be good to be able to shift some of the smaller pods around onto other nodes to help get the larger one provisioned and running.
@ajtrichards Priority/preemption (kubernetes/enhancements#268) is one possible solution to that problem, and I'm not sure the rescheduler should be involved. I think we were planning to do "spreading pods away from high/over-utilized nodes" as the first rescheduler use case. (It's true that the rescheduler is needed for your use case if the large deployment runs at the same priority as the smaller pods.)
Thanks @davidopp - sounds like Priority / Preemption would be a good solution for us. Will keep an eye out for that becoming available.
One of the key use cases we're interested in is re-spreading pods when a zone comes back online after having failed. The pods will have been moved onto nodes in the non-failed zones, so we would want to rebalance load by moving some back to the recovered zone. This is similar to a node failing and then recovering, but on a larger scale.
This feature request is extremely similar to #47965
@davidopp I can't access this doc. Is it perhaps only available to a mailing list I'm not on, or is it intentionally private for now? 🙃
@obeattie I guess you should join the sig-scheduling mailing list
As far as I have read, this whole ticket is about "defragmentation" of a cluster consisting of running containers. What is the current progress on this?
Work has been progressing in an incubator repo: |
I think we should close this issue fwiw.
SGTM. People can file issues in the descheduler repo.
It's premature to start working on this, but I wanted to jot down some notes collected from past conversations and experience on this topic (and I noticed we didn't have an issue for this yet).
A rescheduler is an agent that proactively causes currently-running Pods to be moved, so as to optimize some objective function for goodness of the layout of Pods in the cluster. (The objective function doesn't have to be expressed mathematically; it may just be a collection of ad-hoc rules, but in principle there is an objective function. Implicitly an objective function is described by the scheduler's predicate and priority functions.) It might be triggered to run every N minutes, or whenever some event happens that is known to make the objective function worse (for example, whenever a Pod goes PENDING for a long time.)
A rescheduler is useful because without a rescheduler, scheduling decisions are only made at the time Pods are created. But as the cluster layout changes over time, free "holes" are often produced that were not available when a Pod was initially scheduled. These holes are produced by run-to-completion Pods terminating, empty nodes being added by a node auto-scaler, etc. Moving already-running Pods into these holes may lead to a better cluster layout. A rescheduler might not just exploit existing holes, but also create holes by evicting Pods (assuming it knows they can reschedule elsewhere), as in free space defragmentation.
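In miniature, the "exploit holes / defragment" idea reads like a greedy bin-packing pass: drain lightly-loaded nodes by moving their pods into holes on fuller nodes. `find_move` and the node model are invented for illustration; a real rescheduler would consult the scheduler's predicate and priority functions rather than raw capacity arithmetic.

```python
def free_capacity(node):
    """Toy free-capacity calculation for a single scalar resource."""
    return node["capacity"] - sum(node["pods"].values())

def find_move(nodes):
    """Find one defragmenting move: take the smallest pod on the
    least-loaded node and check whether it fits into a 'hole' on a
    fuller node (fullest first). Returns (pod, src, dst) or None."""
    by_load = sorted(nodes, key=lambda n: sum(nodes[n]["pods"].values()))
    src = by_load[0]
    if not nodes[src]["pods"]:
        return None  # least-loaded node is already empty
    pod, size = min(nodes[src]["pods"].items(), key=lambda kv: kv[1])
    for dst in reversed(by_load[1:]):  # prefer fuller destinations
        if free_capacity(nodes[dst]) >= size:
            return (pod, src, dst)
    return None
```

Repeating this until no move is found tends to empty out lightly-used nodes, which is the precondition for scaling the cluster down, while the inverse policy (creating holes by evicting pods that can reschedule elsewhere) needs the eviction machinery discussed below.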
[Although alluded to above, it's worth emphasizing that rescheduling is the only way to make use of new nodes added by a cluster auto-scaler (unless Pods were already PENDING; but even then, it's likely advantageous to put more than just the previously PENDING Pods on the new nodes.)]
Because rescheduling is disruptive--it causes one or more already-running Pods to die when they otherwise wouldn't--a key constraint on rescheduling is that it must be done subject to disruption SLOs. There are a number of ways to specify these SLOs--a global rate limit across all Pods, a rate limit across a set of Pods defined by some particular label selector, a maximum number of Pods that can be down at any one time among a set defined by some particular label selector, etc. These policies are presumably part of the Rescheduler's configuration.
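One of the SLO mechanisms listed above, a rate limit on evictions, can be sketched as a token bucket. `DisruptionLimiter` is a hypothetical name invented here; a real policy would likely keep one bucket globally plus one per label-selector group, as the paragraph describes.

```python
import time

class DisruptionLimiter:
    """Token-bucket rate limit on rescheduler-initiated evictions (toy model).

    rate_per_sec: steady-state evictions allowed per second.
    burst: maximum evictions allowed in a sudden burst.
    """
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        """Return True (and consume a token) if an eviction may proceed now."""
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A "maximum number of Pods down at once" policy would instead track in-flight disruptions and decrement on recovery, but the gating call site (`if limiter.allow(): evict(pod)`) looks the same either way.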
There are a lot of design possibilities for a rescheduler. To explain them, it's easiest to start with the description of a baseline rescheduler, and then describe possible modifications.

The Baseline rescheduler
Possible variations on this Baseline rescheduler are
A key design question for a Rescheduler is how much knowledge it needs about the scheduling policies used by the cluster's scheduler(s).
The vast majority of users probably only care about rescheduling for three scenarios: