From b77e39298eaf3427bd004402e275dadc3b2e9fbb Mon Sep 17 00:00:00 2001
From: David Oppenheimer
Date: Sun, 28 Feb 2016 12:52:36 -0800
Subject: [PATCH] Rescheduling in Kubernetes design proposal.
---
 CHANGELOG.md                   |   2 +
 docs/proposals/rescheduling.md | 522 +++++++++++++++++++++++++++++++++
 2 files changed, 524 insertions(+)
 create mode 100644 docs/proposals/rescheduling.md

diff --git a/CHANGELOG.md b/CHANGELOG.md
index d38902a25560d..fce9fc111fe50 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,8 @@
   - [Downloads](#downloads)
   - [Highlights](#highlights)
   - [Known Issues and Important Steps before Upgrading](#known-issues-and-important-steps-before-upgrading)
+    - [ThirdPartyResource](#thirdpartyresource)
+    - [kubectl](#kubectl)
   - [kubernetes Core Known Issues](#kubernetes-core-known-issues)
   - [Docker runtime Known Issues](#docker-runtime-known-issues)
   - [Rkt runtime Known Issues](#rkt-runtime-known-issues)

diff --git a/docs/proposals/rescheduling.md b/docs/proposals/rescheduling.md
new file mode 100644
index 0000000000000..91289335b9629
--- /dev/null
+++ b/docs/proposals/rescheduling.md
@@ -0,0 +1,522 @@

PLEASE NOTE: This document applies to the HEAD of the source tree

If you are using a released version of Kubernetes, you should refer to the docs that go with that version.

Documentation for other releases can be found at [releases.k8s.io](http://releases.k8s.io).

# Controlled Rescheduling in Kubernetes

## Overview

Although the Kubernetes scheduler(s) try to make good placement decisions for pods, conditions in the cluster change over time (e.g. jobs finish and new pods arrive; nodes are removed due to failures, planned maintenance, or auto-scaling down; nodes appear due to recovery after a failure, re-joining after maintenance, auto-scaling up, or adding new hardware to a bare-metal cluster), and schedulers are not omniscient (e.g. there are some interactions between pods, or between pods and nodes, that they cannot predict). As a result, the initial node selected for a pod may turn out to be a bad match, from the perspective of the pod and/or the cluster as a whole, at some point after the pod has started running.

Today (Kubernetes version 1.2), once a pod is scheduled to a node it never moves unless it terminates on its own, is deleted by the user, or experiences some unplanned event (e.g. the node where it is running dies). Thus in a cluster with long-running pods, the assignment of pods to nodes degrades over time, no matter how good an initial scheduling decision the scheduler makes. This observation motivates "controlled rescheduling," a mechanism by which Kubernetes will "move" already-running pods over time to improve their placement. Controlled rescheduling is the subject of this proposal.

Note that the term "move" is not technically accurate -- the mechanism used is that Kubernetes will terminate a pod that is managed by a controller, and the controller will create a replacement pod that is then scheduled by the pod's scheduler. The terminated pod and replacement pod are completely separate pods, and no pod migration is implied. However, describing the process as "moving" the pod is approximately accurate and easier to understand, so we will use this terminology in the document.

We use the term "rescheduling" to describe any action the system takes to move an already-running pod. The decision may be made and executed by any component; we will introduce the concept of a "rescheduler" component later, but it is not the only component that can do rescheduling.

This proposal primarily focuses on the architecture and features/mechanisms used to achieve rescheduling, and only briefly discusses example policies. We expect that community experimentation will lead to a significantly better understanding of the range, potential, and limitations of rescheduling policies.

## Example use cases

Example use cases for rescheduling are:

* moving a running pod onto a node that better satisfies its scheduling criteria
  * moving a pod onto an under-utilized node
  * moving a pod onto a node that meets more of the pod's affinity/anti-affinity preferences
* moving a running pod off of a node in anticipation of a known or speculated future event
  * draining a node in preparation for maintenance, decommissioning, auto-scale-down, etc.
  * "preempting" a running pod to make room for a pending pod to schedule
  * proactively/speculatively making room for large and/or exclusive pods to facilitate fast scheduling in the future (often called "defragmentation")
  * (note that these last two cases are the only use cases where the first-order intent is to move a pod specifically for the benefit of another pod)
* moving a running pod off of a node from which it is receiving poor service
  * anomalous crashlooping or other mysterious incompatibility between the pod and the node
  * repeated out-of-resource killing (see #18724)
  * repeated attempts by the scheduler to schedule the pod onto some node, but it is rejected by Kubelet admission control due to incomplete scheduler knowledge
  * poor performance due to interference from other containers on the node (CPU hogs, cache thrashers, etc.) (note that in this case there is a choice of moving the victim or the aggressor)
+ * "preempting" a running pod to make room for a pending pod to schedule + * proactively/speculatively make room for large and/or exclusive pods to facilitate + fast scheduling in the future (often called "defragmentation") + * (note that these last two cases are the only use cases where the first-order intent + is to move a pod specifically for the benefit of another pod) +* moving a running pod off of a node from which it is receiving poor service + * anomalous crashlooping or other mysterious incompatiblity between the pod and the node + * repeated out-of-resource killing (see #18724) + * repeated attempts by the scheduler to schedule the pod onto some node, but it is + rejected by Kubelet admission control due to incomplete scheduler knowledge + * poor performance due to interference from other containers on the node (CPU hogs, + cache thrashers, etc.) (note that in this case there is a choice of moving the victim + or the aggressor) + +## Some axes of the design space + +Among the key design decisions are + +* how does a pod specify its tolerance for these system-generated disruptions, and how + does the system enforce such disruption limits +* for each use case, where is the decision made about when and which pods to reschedule + (controllers, schedulers, an entirely new component e.g. "rescheduler", etc.) +* rescheduler design issues: how much does a rescheduler need to know about pods' + schedulers' policies, how does the rescheduler specify its rescheduling + requests/decisions (e.g. just as an eviction, an eviction with a hint about where to + reschedule, or as an eviction paired with a specific binding), how does the system + implement these requests, does the rescheduler take into account the second-order + effects of decisions (e.g. whether an evicted pod will reschedule, will cause + a preemption when it reschedules, etc.), does the rescheduler execute multi-step plans + (e.g. evict two pods at the same time with the intent of moving one into the space + vacated by the other, or even more complex plans) + +Additional musings on the rescheduling design space can be found [here](rescheduler.md). + +## Design proposal + +The key mechanisms and components of the proposed design are priority, preemption, +disruption budgets, the `/evict` subresource, and the rescheduler. + +### Priority + +#### Motivation + + +Just as it is useful to overcommit nodes to increase node-level utilization, it is useful +to overcommit clusters to increase cluster-level utilization. Scheduling priority (which +we abbreviate as *priority*, in combination with disruption budgets (described in the +next section), allows Kubernetes to safely overcommit clusters much as QoS levels allow +it to safely overcommit nodes. + +Today, cluster sharing among users, workload types, etc. is regulated via the +[quota](../admin/resourcequota/README.md) mechanism. When allocating quota, a cluster +administrator has two choices: (1) the sum of the quotas is less than or equal to the +capacity of the cluster, or (2) the sum of the quotas is greater than the capacity of the +cluster (that is, the cluster is overcommitted). (1) is likely to lead to cluster +under-utilization, while (2) is unsafe in the sense that someone's pods may go pending +indefinitely even though they are still within their quota. Priority makes cluster +overcommitment (i.e. 
When a scheduler is scheduling a new pod P and cannot find any node that meets all of P's scheduling predicates, it is allowed to evict ("preempt") one or more pods that are at the same or lower priority than P (subject to disruption budgets, see next section) from a node in order to make room for P, i.e. in order to make the scheduling predicates satisfied for P on that node. (Note that when we add cluster-level resources (#19080), it might be necessary to preempt from multiple nodes, but that scenario is outside the scope of this document.) The preempted pod(s) may or may not be able to reschedule. The net effect of this process is that when demand for cluster resources exceeds supply, the higher-priority pods will be able to run while the lower-priority pods will be forced to wait. The detailed mechanics of preemption are described in a later section.

In addition to taking disruption budget into account, for equal-priority preemptions the scheduler will try to enforce fairness (across victim controllers, services, etc.).

Priorities could be specified directly by users in the podTemplate, or assigned by an admission controller using properties of the pod. Either way, all schedulers must be configured to understand the same priorities (names and ordering). This could be done by making them constants in the API, or by using ConfigMap to configure the schedulers with the information. The advantage of the former (at least making the names, if not the ordering, constants in the API) is that it allows the API server to do validation (e.g. to catch mis-spellings).

In the future, which priorities are usable for a given namespace and pods with certain attributes may be configurable, similar to ResourceQuota, LimitRange, or security policy.

Priority and resource QoS are independent.

The priority we have described here might be used to prioritize the scheduling queue (i.e. the order in which a scheduler examines pods in its scheduling loop), but the two priority concepts do not have to be connected. It is somewhat logical to tie them together, since a higher priority generally indicates that a pod is more urgent to get running. Also, scheduling low-priority pods before high-priority pods might lead to avoidable preemptions if the high-priority pods end up preempting the low-priority pods that were just scheduled.

TODO: Priority and preemption are global or namespace-relative? See [this discussion thread](https://github.com/kubernetes/kubernetes/pull/22217#discussion_r55737389).

#### Relationship of priority to quota

Of course, if the decision of what priority to give a pod is solely up to the user, then users have no incentive to ever request any priority less than the maximum. Thus priority is intimately related to quota, in the sense that resource quotas must be allocated on a per-priority-level basis (X amount of RAM at priority A, Y amount of RAM at priority B, etc.). The "guarantee" that highest-priority pods will always be able to schedule can only be achieved if the sum of the quotas at the top priority level is less than or equal to the cluster capacity. This is analogous to QoS, where safety can only be achieved if the sum of the limits of the top QoS level ("Guaranteed") is less than or equal to the node capacity. In terms of incentives, an organization could "charge" an amount proportional to the priority of the resources.

The topic of how to allocate quota at different priority levels to achieve a desired balance between utilization and probability of schedulability is an extremely complex topic that is outside the scope of this document. For example, resource fragmentation and RequiredDuringScheduling node and pod affinity and anti-affinity mean that even if the sum of the quotas at the top priority level is less than or equal to the total aggregate capacity of the cluster, some pods at the top priority level might still go pending. In general, priority provides a *probabilistic* guarantee of pod schedulability in the face of overcommitment, by allowing prioritization of which pods should be allowed to run when demand for cluster resources exceeds supply.
### Disruption budget

While priority can protect pods from one source of disruption (preemption by a lower-priority pod), *disruption budgets* limit disruptions from all Kubernetes-initiated causes, including preemption by an equal or higher-priority pod, or being evicted to achieve other rescheduling goals. In particular, each pod is optionally associated with a "disruption budget," a new API resource that limits Kubernetes-initiated terminations across a set of pods (e.g. the pods of a particular Service might all point to the same disruption budget object), regardless of cause. Initially we expect a disruption budget (e.g. `DisruptionBudgetSpec`) to consist of:

* a rate limit on disruptions (preemptions and other evictions) across the corresponding set of pods, e.g. no more than one disruption per hour across the pods of a particular Service
* a minimum number of pods that must be up simultaneously (sometimes called "shard strength") (of course this can also be expressed as the inverse, i.e. the number of pods of the collection that can be down simultaneously)

The second item merits a bit more explanation. One use case is to specify a quorum size, e.g. to ensure that at least 3 replicas of a quorum-based service with 5 replicas are up at the same time. In practice, a service should ideally create enough replicas to survive at least one planned and one unplanned outage. So in our quorum example, we would specify that at least 4 replicas must be up at the same time; this allows for one intentional disruption (bringing the number of live replicas down from 5 to 4 and consuming one unit of shard strength budget) and one unplanned disruption (bringing the number of live replicas down from 4 to 3) while still maintaining a quorum.
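The following Go sketch illustrates one possible shape for these objects and the shard-strength arithmetic from the quorum example; the field names here are provisional illustrations, not a final API:

```go
package main

import "fmt"

// DisruptionBudgetSpec is an illustrative sketch, not a final API. It
// captures the two pieces of budget described above.
type DisruptionBudgetSpec struct {
	MaxDisruptionsPerHour int // rate limit across the covered set of pods
	MinAvailable          int // "shard strength": pods that must stay up
}

// DisruptionBudgetStatus would be kept up to date by a controller.
type DisruptionBudgetStatus struct {
	DisruptionsInLastHour int
	CurrentlyAvailable    int
}

// allowedDisruptions is the headroom left in the shard-strength budget.
func allowedDisruptions(spec DisruptionBudgetSpec, status DisruptionBudgetStatus) int {
	n := status.CurrentlyAvailable - spec.MinAvailable
	if n < 0 {
		return 0
	}
	return n
}

func main() {
	// The quorum example: 5 replicas and a quorum of 3, so we ask for
	// MinAvailable=4 to tolerate one planned plus one unplanned outage.
	spec := DisruptionBudgetSpec{MaxDisruptionsPerHour: 1, MinAvailable: 4}
	status := DisruptionBudgetStatus{DisruptionsInLastHour: 0, CurrentlyAvailable: 5}
	fmt.Println(allowedDisruptions(spec, status)) // 1
}
```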
Shard strength is also useful for simpler replicated services; for example, you might not want more than 10% of your front-ends to be down at the same time, so as to avoid overloading the remaining replicas.

Initially, disruption budgets will be specified by the user. Thus, as with priority, disruption budgets need to be tied into quota, to prevent users from saying that none of their pods can ever be disrupted. The exact way of expressing and enforcing this quota is TBD, though a simple starting point would be to have an admission controller assign a default disruption budget based on priority level (more liberal with decreasing priority). We also likely need a quota that applies to Kubernetes *components*, to limit the rate at which any one component is allowed to consume disruption budget.

Of course there should also be a `DisruptionBudgetStatus` that indicates the current disruption rate that the collection of pods is experiencing, and the number of pods that are up.

For the purposes of disruption budget, a pod is considered to be disrupted as soon as its graceful termination period starts.

A pod that is not covered by a disruption budget but is managed by a controller gets an implicit disruption budget of infinite (though the system should try not to unduly victimize such pods). How a pod that is not managed by a controller is handled is TBD.

TBD: In addition to `PodSpec`, where do we store the pointer to the disruption budget (the podTemplate in the controller that manages the pod?)? Do we auto-generate a disruption budget (e.g. when instantiating a Service), or require the user to create it manually before they create a controller? Which objects should return the disruption budget object as part of the output on `kubectl get`, other than (obviously) `kubectl get` for the disruption budget itself?

TODO: Clean up the distinction between "down due to voluntary action taken by Kubernetes" and "down due to unplanned outage" in spec and status.

For now, there is nothing to prevent clients from circumventing the disruption budget protections. Of course, clients that do this are not being "good citizens." In the next section we describe a mechanism that at least makes it easy for well-behaved clients to obey the disruption budgets.

See #12611 for additional discussion of disruption budgets.

### /evict subresource and PreferAvoidPods

Although we could put the responsibility for checking and updating disruption budgets solely on the client, it is safer and more convenient if we implement that functionality in the API server. Thus we will introduce a new `/evict` subresource on pod. It is similar to today's "delete" on pod except:

* It will be rejected if the deletion would violate disruption budget. (See how Deployment handles failure of /rollback for ideas on how clients could handle failure of `/evict`.) There are two possible ways to implement this (a sketch of the budget check follows this list):

  * For the initial implementation, this will be accomplished by the API server just looking at the `DisruptionBudgetStatus` and seeing if the disruption would violate the `DisruptionBudgetSpec`. In this approach, we assume a disruption budget controller keeps the `DisruptionBudgetStatus` up-to-date by observing all pod deletions and creations in the cluster, so that an approved disruption is quickly reflected in the `DisruptionBudgetStatus`. Of course this approach does allow a race in which one or more additional disruptions could be approved before the first one is reflected in the `DisruptionBudgetStatus`.

  * Thus a subsequent implementation will have the API server explicitly debit the `DisruptionBudgetStatus` when it accepts an `/evict`. (There still needs to be a controller, to keep the shard strength status up-to-date when replacement pods are created after an eviction; the controller may also be necessary for the rate status depending on how rate is represented, e.g. adding tokens to a bucket at a fixed rate.) Once etcd supports multi-object transactions (etcd v3), the debit and pod deletion will be placed in the same transaction.

  * Note: For the purposes of disruption budget, a pod is considered to be disrupted as soon as its graceful termination period starts (so when we say "delete" here we do not mean "deleted from etcd" but rather "graceful termination period has started").

* It will allow clients to communicate additional parameters when they wish to delete a pod. (In the absence of the `/evict` subresource, we would have to create a pod-specific type analogous to `api.DeleteOptions`.)
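A minimal sketch of the initial, status-based admission check described in the first bullet above, using the illustrative budget types from the earlier sketch (all names hypothetical):

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative stand-ins for the budget objects sketched earlier.
type budgetSpec struct {
	MaxDisruptionsPerHour int
	MinAvailable          int
}
type budgetStatus struct {
	DisruptionsInLastHour int
	CurrentlyAvailable    int
}

// admitEviction sketches the initial /evict implementation: the API server
// only inspects the (controller-maintained) status, so two racing evictions
// may both pass; the later debit-based implementation closes that race by
// updating the status transactionally with the deletion.
func admitEviction(spec budgetSpec, status budgetStatus) error {
	if status.DisruptionsInLastHour+1 > spec.MaxDisruptionsPerHour {
		return errors.New("evict rejected: disruption rate budget exhausted")
	}
	if status.CurrentlyAvailable-1 < spec.MinAvailable {
		return errors.New("evict rejected: would violate shard strength")
	}
	return nil
}

func main() {
	spec := budgetSpec{MaxDisruptionsPerHour: 1, MinAvailable: 4}
	status := budgetStatus{DisruptionsInLastHour: 0, CurrentlyAvailable: 5}
	fmt.Println(admitEviction(spec, status)) // <nil>: one eviction fits the budget
}
```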
We will make `kubectl delete pod` use `/evict` by default, and require a command-line flag to delete the pod directly.

We will add to `NodeStatus` a bounded-size list of signatures of pods that should avoid that node (provisionally called `PreferAvoidPods`). One of the pieces of information specified in the `/evict` subresource is whether the eviction should add the evicted pod's signature to the corresponding node's `PreferAvoidPods`. Initially the pod signature will be a [controllerRef](https://github.com/kubernetes/kubernetes/issues/14961#issuecomment-183431648), i.e. a reference to the pod's controller. Controllers are responsible for garbage collecting, after some period of time, `PreferAvoidPods` entries that point to them, but the API server will also enforce a bounded size on the list. All schedulers will have a highest-weighted priority function that gives a node the worst priority if the pod being scheduled appears in that node's `PreferAvoidPods` list (a sketch appears at the end of this section). Thus appearing in `PreferAvoidPods` is similar to [RequiredDuringScheduling node anti-affinity](../../docs/user-guide/node-selection/README.md) but it takes precedence over all other priority criteria and is not explicitly listed in the `NodeAffinity` of the pod.

`PreferAvoidPods` is useful for the "moving a running pod off of a node from which it is receiving poor service" use case, as it reduces the chance that the replacement pod will end up on the same node (keep in mind that most of those cases are situations that the scheduler does not have explicit priority functions for; for example, it cannot know in advance that a pod will be starved). Also, though we do not intend to implement any such policies in the first version of the rescheduler, it is useful whenever the rescheduler evicts two pods A and B with the intention of moving A into the space vacated by B (it prevents B from rescheduling back into the space it vacated before A's scheduler has a chance to reschedule A there). Note that these two uses are subtly different; in the first case we want the avoidance to last a relatively long time, whereas in the second case we may only need it to last until A schedules.
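A sketch of the scheduler-side check, written as a highest-weighted priority function; the types and the 0-10 scoring scale are illustrative, and the real scheduler plumbing differs:

```go
package main

import "fmt"

// Illustrative types; the real scheduler's data structures differ.
type controllerRef struct{ Kind, Name string }

type node struct {
	Name            string
	PreferAvoidPods []controllerRef // provisional NodeStatus field
}

// avoidancePriority sketches the proposed highest-weighted priority function:
// a node that lists the pod's controllerRef in PreferAvoidPods gets the worst
// possible score (0), and otherwise the best (10), so with a high enough
// weight it dominates all other priority functions.
func avoidancePriority(podController controllerRef, n node) int {
	for _, ref := range n.PreferAvoidPods {
		if ref == podController {
			return 0 // worst score: steer the replacement pod elsewhere
		}
	}
	return 10 // best score: node is not flagged for this controller
}

func main() {
	rc := controllerRef{Kind: "ReplicationController", Name: "web"}
	flagged := node{Name: "node-1", PreferAvoidPods: []controllerRef{rc}}
	fmt.Println(avoidancePriority(rc, flagged))              // 0
	fmt.Println(avoidancePriority(rc, node{Name: "node-2"})) // 10
}
```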
See #20699 for more discussion.

### Preemption mechanics

**NOTE: We expect a fuller design doc to be written on preemption before it is implemented. However, a sketch of some ideas is presented here, since preemption is closely related to the concepts discussed in this doc.**

Pod schedulers will decide and enact preemptions, subject to the priority and disruption budget rules described earlier. (Though note that we currently do not have any mechanism to prevent schedulers from bypassing either the priority or disruption budget rules.) The scheduler does not concern itself with whether the evicted pod(s) can reschedule. The eviction(s) use(s) the `/evict` subresource so that it is subject to the disruption budget(s) of the victim(s), but it does not request to add the victim pod(s) to the nodes' `PreferAvoidPods`.

Evicting victim(s) and binding the pending pod that the evictions are intended to enable to schedule are not transactional. We expect the scheduler to issue the operations in sequence, but it is still possible that another scheduler could schedule its pod in between the eviction(s) and the binding, or that the set of pods running on the node in question changed between the time the scheduler made its decision and the time it sent the operations to the API server, thereby causing the eviction(s) to be insufficient to get the pending pod to schedule. In general there are a number of race conditions that cannot be avoided without (1) making the evictions and binding be part of a single transaction, and (2) making the binding preconditioned on a version number that is associated with the node and is incremented on every binding. We may or may not implement those mechanisms in the future.

Given a choice between a node where scheduling a pod requires preemption and one where it does not, all other things being equal, a scheduler should choose the one where preemption is not required. (TBD: Also, if the selected node does require preemption, the scheduler should preempt lower-priority pods before higher-priority pods; e.g. if the scheduler needs to free up 4 GB of RAM, and the node has two 2 GB low-priority pods and one 4 GB high-priority pod, all of which have sufficient disruption budget, it should preempt the two low-priority pods. This is debatable, since all have sufficient disruption budget, but it is still better to err on the side of giving a better disruption SLO to higher-priority pods when possible. A sketch of this victim-selection heuristic appears at the end of this section.)

Preemption victims must be given their termination grace period. One possible sequence of events is:

1. The API server binds the preemptor to the node (i.e. sets `nodeName` on the preempting pod) and sets `deletionTimestamp` on the victims.
2. Kubelet sees that `deletionTimestamp` has been set on the victims; they enter their graceful termination period.
3. Kubelet sees the preempting pod. It runs the admission checks on the new pod assuming all pods that are in their graceful termination period are gone and that all pods that are in the waiting state (see (4)) are running.
4. If (3) fails, then the new pod is rejected. If (3) passes, then Kubelet holds the new pod in a waiting state, and does not run it until the pod passes the admission checks using the set of actually running pods.

Note that there are a lot of details to be figured out here; the above is just a very hand-wavy sketch of one general approach that might work.
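To make the TBD victim-selection heuristic above concrete, here is an illustrative Go sketch; the types are hypothetical, and disruption budgets and non-memory resources are ignored for brevity:

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical stand-in for a running pod that could be
// preempted: its priority rank (lower = less important) and its RAM usage.
type candidate struct {
	Name  string
	Rank  int // position in the administrator's priority ordering
	MemGB int
}

// pickVictims sketches the heuristic discussed above: free up at least
// needGB of RAM by preempting lower-priority pods before higher-priority
// ones, stopping as soon as enough has been freed.
func pickVictims(pods []candidate, needGB int) []candidate {
	sort.Slice(pods, func(i, j int) bool { return pods[i].Rank < pods[j].Rank })
	var victims []candidate
	freed := 0
	for _, p := range pods {
		if freed >= needGB {
			break
		}
		victims = append(victims, p)
		freed += p.MemGB
	}
	if freed < needGB {
		return nil // even preempting everything would not be enough
	}
	return victims
}

func main() {
	node := []candidate{
		{Name: "low-a", Rank: 0, MemGB: 2},
		{Name: "low-b", Rank: 0, MemGB: 2},
		{Name: "high-c", Rank: 2, MemGB: 4},
	}
	// Need 4 GB: the two 2 GB low-priority pods are chosen, not the
	// single 4 GB high-priority pod.
	fmt.Println(pickVictims(node, 4)) // [{low-a 0 2} {low-b 0 2}]
}
```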
See #22212 for additional discussion.

### Node drain

Node drain will be handled by one or more components not described in this document. They will respect disruption budgets. Initially, we will just make `kubectl drain` respect disruption budgets. See #17393 for other discussion.
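For example, a budget-respecting drain loop might look like the following sketch; the client interface is hypothetical, and the real `kubectl drain` logic is more involved:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// evictor is a hypothetical client interface to the /evict subresource;
// Evict returns an error when the eviction would violate a budget.
type evictor interface {
	Evict(pod string) error
}

// drainNode sketches a budget-respecting drain: evict each pod via /evict,
// and when a budget rejects the eviction, back off and retry instead of
// deleting the pod outright.
func drainNode(c evictor, pods []string) {
	for _, pod := range pods {
		for {
			err := c.Evict(pod)
			if err == nil {
				fmt.Println("evicted", pod)
				break
			}
			fmt.Println("budget exhausted for", pod, "- retrying:", err)
			time.Sleep(30 * time.Second) // wait for budget headroom
		}
	}
}

// fakeEvictor always succeeds; a real implementation would call the API server.
type fakeEvictor struct{}

func (fakeEvictor) Evict(pod string) error {
	if pod == "" {
		return errors.New("no pod")
	}
	return nil
}

func main() {
	drainNode(fakeEvictor{}, []string{"web-1", "db-0"})
}
```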
### Rescheduler

All rescheduling other than preemption and node drain will be decided and enacted by a new component called the *rescheduler*. It runs continuously in the background, looking for opportunities to move pods to better locations. It acts when the degree of improvement meets some threshold and is allowed by the pod's disruption budget. The action is eviction of a pod using the `/evict` subresource, with the pod's signature enqueued in the node's `PreferAvoidPods`. It does not force the pod to reschedule to any particular node. Thus it is really an "unscheduler"; only in combination with the evicted pod's scheduler, which schedules the replacement pod, do we get true "rescheduling." See the "Example use cases" section earlier for some example use cases.

The rescheduler is a best-effort service that makes no guarantees about how quickly (or whether) it will resolve a suboptimal pod placement.

The first version of the rescheduler will not take into consideration where or whether an evicted pod will reschedule. The evicted pod may go pending, consuming one unit of the corresponding shard strength disruption budget indefinitely. By using the `/evict` subresource, the rescheduler ensures that the evicted pod has sufficient budget to go and stay pending. We expect future versions of the rescheduler may be linked with the "mandatory" predicate functions (currently, the ones that constitute the Kubelet admission criteria), and will only evict if the rescheduler determines that the pod can reschedule somewhere according to those criteria. (Note that this still does not guarantee that the pod actually will be able to reschedule, for at least two reasons: (1) the state of the cluster may change between the time the rescheduler evaluates it and when the evicted pod's scheduler tries to schedule the replacement pod, and (2) the evicted pod's scheduler may have additional predicate functions beyond the mandatory ones.)

(Note: see [this comment](https://github.com/kubernetes/kubernetes/pull/22217#discussion_r54527968).)

The first version of the rescheduler will only implement two objectives: moving a pod onto an under-utilized node, and moving a pod onto a node that meets more of the pod's affinity/anti-affinity preferences than wherever it is currently running. (We assume that nodes that are intentionally under-utilized, e.g. because they are being drained, are marked unschedulable, thus the first objective will not cause the rescheduler to "fight" a system that is draining nodes.) We assume that all schedulers sufficiently weight the priority functions for affinity/anti-affinity and avoiding very packed nodes; otherwise evicted pods may not actually move onto a node that is better according to the criteria that caused them to be evicted. (But note that in all cases a pod will move to a node that is better according to the totality of its scheduler's priority functions, except in the case where the node where it was already running was the only node where it can run.)

As a general rule, the rescheduler should only act when it sees particularly bad situations, since (1) an eviction for a marginal improvement is likely not worth the disruption -- just because there is sufficient budget for an eviction doesn't mean an eviction is painless to the application -- and (2) rescheduling the pod might not actually mitigate the identified problem if it is minor enough that other scheduling factors dominate the decision of where the replacement pod is scheduled.

We assume schedulers' priority functions are at least vaguely aligned with the rescheduler's policies; otherwise the rescheduler will never accomplish anything useful, given that it relies on the schedulers to actually reschedule the evicted pods. (Even if the rescheduler acted as a scheduler, explicitly rebinding evicted pods, we'd still want this to be true, to prevent the schedulers and rescheduler from "fighting" one another.)

The rescheduler will be configured using ConfigMap; the cluster administrator can enable or disable policies and can tune the rescheduler's aggressiveness (aggressive means it will use a relatively low threshold for triggering an eviction and may consume a lot of disruption budget, while non-aggressive means it will use a relatively high threshold for triggering an eviction and will try to leave plenty of buffer in disruption budgets). The first version of the rescheduler will not be extensible or pluggable, since we want to keep the code simple while we gain experience with the overall concept. In the future, we anticipate a version that will be extensible and pluggable.

We might want some way to force the evicted pod to the front of the scheduler queue, independently of its priority.
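Putting the pieces together, the rescheduler's main loop might be structured as in the following sketch. Everything here -- the names, the improvement score, the threshold values -- is an illustrative assumption, not a committed design:

```go
package main

import "fmt"

// podInfo is a hypothetical view of a running pod: where it is, and scores
// (0-10) for how well its current node satisfies its scheduling criteria
// versus the best alternative node in the cluster.
type podInfo struct {
	Name, Node      string
	CurrentFit      int
	BestAlternative int
	BudgetHeadroom  int // allowed disruptions left in its budget
}

// config maps the administrator's aggressiveness knob to thresholds:
// aggressive = act on small improvements, non-aggressive = act only on
// particularly bad placements (and leave budget headroom).
type config struct {
	MinImprovement int
	MinHeadroom    int
}

// reschedduleOnce is intentionally named? No: see reschedleOnce below.
// reschedleOnce sketches one pass of the control loop: evict (via the
// /evict subresource, adding the pod's signature to PreferAvoidPods) any
// pod whose placement is bad enough and whose budget allows disruption.
func rescheduleOnce(cfg config, pods []podInfo) {
	for _, p := range pods {
		improvement := p.BestAlternative - p.CurrentFit
		if improvement < cfg.MinImprovement || p.BudgetHeadroom < cfg.MinHeadroom {
			continue // marginal gain or no budget: leave the pod alone
		}
		fmt.Printf("evicting %s from %s (improvement %d) and flagging PreferAvoidPods\n",
			p.Name, p.Node, improvement)
	}
}

func main() {
	conservative := config{MinImprovement: 5, MinHeadroom: 2}
	pods := []podInfo{
		{Name: "web-1", Node: "n1", CurrentFit: 2, BestAlternative: 9, BudgetHeadroom: 3},
		{Name: "db-0", Node: "n2", CurrentFit: 6, BestAlternative: 8, BudgetHeadroom: 3},
	}
	rescheduleOnce(conservative, pods) // only web-1 is evicted
}
```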
See #12140 for additional discussion.

### Final comments

In general, the design space for this topic is huge. This document describes some of the design considerations and proposes one particular initial implementation. We expect certain aspects of the design to be "permanent" (e.g. the notion and use of priorities, preemption, disruption budgets, and the `/evict` subresource) while others may change over time (e.g. the partitioning of functionality between schedulers, controllers, rescheduler, horizontal pod autoscaler, and cluster autoscaler; the policies the rescheduler implements; the factors the rescheduler takes into account when making decisions (e.g. knowledge of schedulers' predicate and priority functions, second-order effects like whether and where an evicted pod will be able to reschedule, etc.); the way the rescheduler enacts its decisions; and the complexity of the plans the rescheduler attempts to implement).

## Implementation plan

The highest-priority feature to implement is the rescheduler with the two use cases highlighted earlier: moving a pod onto an under-utilized node, and moving a pod onto a node that meets more of the pod's affinity/anti-affinity preferences. The former is useful to rebalance pods after cluster auto-scale-up, and the latter is useful for Ubernetes. This requires implementing disruption budgets and the `/evict` subresource, but not priority or preemption.

Because the general topic of rescheduling is very speculative, we have intentionally proposed that the first version of the rescheduler be very simple -- it only uses eviction (no attempt to guide the replacement pod to any particular node), doesn't know schedulers' predicate or priority functions, doesn't try to move two pods at the same time, and only implements two use cases. As alluded to in the previous subsection, we expect the design and implementation to evolve over time, and we encourage members of the community to experiment with more sophisticated policies and to report their results from using them on real workloads.

## Alternative implementations

TODO.

## Additional references

TODO.

TODO: Add reference to this doc from docs/proposals/rescheduler.md

[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/rescheduling.md?pixel)]()