
Design proposal for pod preemption in Kubernetes #745

Merged: 2 commits merged into kubernetes:master on Aug 2, 2017

Conversation

bsalamat (Member)

@kubernetes/sig-scheduling-feature-requests
@kubernetes/sig-scheduling-misc
@kubernetes/sig-node-proposals
@kubernetes/api-approvers
@kubernetes/api-reviewers

@k8s-ci-robot added the `cncf-cla: yes` label (indicates the PR's author has signed the CNCF CLA) on Jun 20, 2017

## Scheduler performs preemption

We propose that preemption be performed by the scheduler, which does it by deleting the pods being preempted. The component that performs preemption must have the logic to find the right nodes for the pending pod. It must also have the logic to check whether preempting the chosen pods allows the pending pod to be scheduled. Both require the component to have knowledge of the predicate and priority functions.
Contributor

I assume delete here means a graceful deletion similar to how kubectl delete works?

Member Author

Yes. It is explained in more detail later in the doc.

@ChenLingPeng

How about un-binding (unsetting nodeName) that pod instead of deleting it?

Member Author

@ChenLingPeng Kubernetes already has an "eviction" mechanism where the kubelet on a node may remove a pod from the node under resource pressure. It works by deleting pods and giving them their termination grace period. We will follow the same procedure.


Is the node selection part of this design too or is this just an API improvement?

The procedure to select a suitable node for preemption can actually be pretty complex (fully resolving a system with dependencies and conflicts is usually an NP-complete task), but I see no description of that specific algorithm in this pull request.

@bsalamat (Member Author), Jun 28, 2017

I think node selection is kind of an implementation detail, although it is a hard problem. As you said, we cannot find the best node for preemption in larger clusters, as it becomes computationally very expensive. We will have to stop once we find a certain number of nodes that are preemption candidates. We will probably stop our search once we find something like min(a fixed number, a percentage of all nodes) nodes. We can fine tune the number in the future.
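For illustration only, a tiny sketch of the kind of cap described in this comment: stop the candidate search after min(a fixed number, a percentage of all nodes) nodes. The constants are hypothetical placeholders, not numbers from the proposal.

```go
// Hedged sketch of capping the preemption-candidate search.
package main

import "fmt"

func candidateNodeLimit(totalNodes int) int {
	const fixedCap = 100  // hypothetical fixed number of candidate nodes
	const percentage = 10 // hypothetical percentage of all nodes
	byPercent := totalNodes * percentage / 100
	if byPercent < 1 {
		byPercent = 1 // always examine at least one node
	}
	if byPercent < fixedCap {
		return byPercent
	}
	return fixedCap
}

func main() {
	fmt.Println(candidateNodeLimit(50))   // 5
	fmt.Println(candidateNodeLimit(5000)) // 100
}
```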


Well, the search combined with "These require the component to have the knowledge of predicate and priority functions." might get tricky, and I would expect the way to access the predicates and priorities to be part of the design. Do not forget that the scheduler is extensible and that not all pods even have to use the same scheduler.

Member Author

> I would expect the way to access the predicates and priorities to be part of the design.

I am not sure what you mean by the above sentence. The scheduler already has access to the predicate and priority functions, as do other schedulers. What am I missing?


I think I kind of assumed that preemption would be implemented as an extra service and that the scheduler flow would be extended to handle not-enough-resources situations via it. That way it would not depend on any specific scheduler, and so my comment would make sense (a way to cooperate would be needed).

Member Author

There won't be any new component/service that only performs preemption. Preemption will be implemented as a few functions. The main scheduler and/or other schedulers will be able to call them. The approach will be similar to the way priority and predicate functions are implemented.


- When ordering the pods from lowest to highest priority for considering which pod(s) to preempt, among pods with equal priority the pods are ordered by their [QoS class](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md#qos-classes): Best Effort, Burstable, Guaranteed.
- Scheduler respects pods' disruption budget when considering them for preemption.
- Scheduler will try to minimize the number of preempted pods. As a result, it may preempt a pod while leaving lower priority pods running if preemption of those lower priority pods is not enough to schedule the pending pod while preemption of the higher priority pod(s) is enough to schedule it. For example, if node capacity is 10, the pending pod has priority 10 and requests 5 units of resource, and the running pods are {priority 0 request 3, priority 1 request 1, priority 2 request 5, priority 3 request 1}, the scheduler will preempt only the priority 2 pod and leave the priority 0 and priority 1 pods running.
Contributor

Why? If priority 0 mapped to batch tasks, wouldn't users want to kill most of their batch tasks before killing any other tasks?

Member Author

In the given example, preemption of priority 0 and priority 1 pods does not release enough resources to let the pending pod be scheduled, and preemption of the priority 2 pod will still be needed.
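To make this example concrete, here is a runnable sketch under simplified assumptions (a single scalar resource, one candidate node). The reprieve heuristic shown is just one way to get the behavior the proposal describes, not necessarily the exact algorithm the scheduler will use.

```go
// Hedged sketch of victim selection for the worked example above.
package main

import (
	"fmt"
	"sort"
)

type pod struct {
	Name     string
	Priority int
	Request  int
}

// selectVictims returns the pods to preempt on a node so that pending fits,
// trying to keep as many pods running as possible.
func selectVictims(capacity int, running []pod, pending pod) []pod {
	used := 0
	for _, p := range running {
		used += p.Request
	}
	free := capacity - used

	// Start by assuming every lower-priority pod is a victim.
	victims := []pod{}
	for _, p := range running {
		if p.Priority < pending.Priority {
			victims = append(victims, p)
			free += p.Request
		}
	}
	if free < pending.Request {
		return nil // even preempting everything eligible is not enough
	}

	// Reprieve victims, highest priority first, whenever the pending pod still
	// fits without preempting them.
	sort.Slice(victims, func(i, j int) bool { return victims[i].Priority > victims[j].Priority })
	kept := []pod{}
	for _, v := range victims {
		if free-v.Request >= pending.Request {
			free -= v.Request // reprieve: this pod keeps running
			continue
		}
		kept = append(kept, v)
	}
	return kept
}

func main() {
	running := []pod{
		{"p0", 0, 3}, {"p1", 1, 1}, {"p2", 2, 5}, {"p3", 3, 1},
	}
	pending := pod{"pending", 10, 5}
	fmt.Println(selectVictims(10, running, pending)) // [{p2 2 5}]: only the priority-2 pod
}
```

With the example's numbers, preempting the priority 0 and priority 1 pods alone frees only 4 units, so the priority 2 pod remains the single necessary victim.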

- Scheduler respects pods' disruption budget when considering them for preemption.
- Scheduler will try to minimize the number of preempted pods. As a result, it may preempt a pod while leaving lower priority pods running if preemption of those lower priority pods is not enough to schedule the pending pod while preemption of the higher priority pod(s) is enough to schedule it. For example, if node capacity is 10, the pending pod has priority 10 and requests 5 units of resource, and the running pods are {priority 0 request 3, priority 1 request 1, priority 2 request 5, priority 3 request 1}, the scheduler will preempt only the priority 2 pod and leave the priority 0 and priority 1 pods running.
- Scheduler does not have knowledge of the resource usage of pods. It makes scheduling decisions based on the requested resources ("requests") of the pods, and when it considers a pod for preemption, it assumes the "requests" will be freed on the node.
- This means that the scheduler will never preempt a Best Effort pod to make more resources available. That's because the requests of Best Effort pods are zero and, therefore, preempting them will not release any resources on the node from the scheduler's point of view (see the sketch below).
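A minimal sketch of the point in the last bullet, assuming a single scalar resource: the scheduler counts only the victims' requests as freed capacity, so a Best Effort victim frees nothing.

```go
// Not actual scheduler code; illustrates why preempting a Best Effort pod
// (requests of zero) frees no capacity from the scheduler's point of view.
package main

import "fmt"

type victim struct {
	Name    string
	Request int64 // requested units of a single resource, 0 for Best Effort
}

// freedByPreempting returns the capacity the scheduler assumes becomes
// available if the given victims are preempted.
func freedByPreempting(victims []victim) int64 {
	var freed int64
	for _, v := range victims {
		freed += v.Request
	}
	return freed
}

func main() {
	fmt.Println(freedByPreempting([]victim{{Name: "best-effort", Request: 0}})) // 0
	fmt.Println(freedByPreempting([]victim{{Name: "burstable", Request: 3}}))   // 3
}
```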
Contributor

Nodes have a limit on the number of pods they can run, and in the future I expect them to limit the number of containers too. You might want to preempt when nodes are full of pods; this can happen when running batch workloads, for example.

Member Author

Ack.

Member

we really need a counted resource for "pod slots" and for IP addresses.

Member Author

@thockin Makes sense, but addition of those resources won't change things in this proposal.

@davidopp (Member), Jul 10, 2017

We have a counted resource for "pod slots" (kubernetes/kubernetes#5547)

We discussed doing it for IP Addresses (kubernetes/kubernetes#5507) but never did it. Maybe we should revisit that.

## Preemption - Eviction workflow

"Eviction" is the act of killing one or more pods on a node when the node is under resource pressure. Kubelet performs eviction. The eviction process is described in a separate document by sig-node, but since it is closely related to "preemption", we explain it briefly here.
Similar to preemption, the lowest priority pods are chosen for eviction first. The difference between preemption and eviction is that when pods with the same priority are considered for eviction, the one with the highest percentage of usage over "requests" is the one that is evicted first. In other words, Kubelet uses priority for eviction and breaks ties by usage over requests.
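A hedged sketch of the ordering rule described in this paragraph (not the kubelet's actual eviction code): candidates are ranked by priority, lowest first, and ties are broken by how far usage exceeds "requests", highest first.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

type candidate struct {
	Name     string
	Priority int
	Usage    float64 // observed usage of the resource under pressure
	Request  float64 // requested amount; 0 for Best Effort pods
}

// usageOverRequest returns usage relative to requests; pods with no request
// are treated as maximally over their request.
func usageOverRequest(c candidate) float64 {
	if c.Request == 0 {
		return math.Inf(1)
	}
	return c.Usage / c.Request
}

// rankForEviction sorts candidates so the first element is evicted first.
func rankForEviction(cands []candidate) {
	sort.SliceStable(cands, func(i, j int) bool {
		if cands[i].Priority != cands[j].Priority {
			return cands[i].Priority < cands[j].Priority // lower priority goes first
		}
		return usageOverRequest(cands[i]) > usageOverRequest(cands[j]) // bigger overuse goes first
	})
}

func main() {
	cands := []candidate{
		{"web", 10, 2.0, 1.0},
		{"api", 10, 1.1, 1.0},
		{"batch", 0, 0.5, 1.0},
	}
	rankForEviction(cands)
	for _, c := range cands {
		fmt.Println(c.Name) // batch, web, api
	}
}
```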
Contributor

A Burstable pod started with a higher priority can use up the entire node and evict all other pods with the policy described here. Giving a priority penalty for pods as they use more resources than their requests might be worth considering.
For example, a pod with request 100Mi and usage 1Gi with priority 10 might drop down to priority 1 because of its utilization.

@bsalamat (Member Author), Jun 21, 2017

I strongly believe that we should not change the priority of pods. There are many use cases in which clients want to run very important workloads as Burstable and let them use as many resources as they need. This is a common scenario when running servers.
Besides, scheduler does not have pod usage information today.
If abuse is a concern, we plan to address that with quota for priority.


So, best effort pods may be killed to make room for higher priority pods, although the scheduler does not preempt them directly.
Now, assume everything in the above example, but the best effort pod has priority 300. In this scenario, scheduler schedules the pending pod with priority 200 on the node, but it is evicted by the Kubelet, because the best effort pod has a higher priority. Given this scenario, scheduler should avoid the node and should try scheduling the pod on a different node if the pod is evicted by the Kubelet.

Contributor

There exists no notion of "avoid previous node" for pods, since pods do not get recreated. How do you plan on making the scheduler place subsequent incarnations of the same pod on different nodes?
This issue is relevant to other features as well.

@davidopp (Member), Jun 21, 2017

I think this is more of a nice-to-have than absolutely necessary. But actually we do have "avoid previous node" already, see kubernetes/kubernetes#20699, but note that (1) it's not used anywhere yet so is definitely an alpha feature, and (2) it's on the set granularity rather than pod granularity for the reason you mentioned (pods do not get re-created, so it's nontrivial to determine that a particular pod you are trying to schedule is the replacement pod for some particular other pod that terminated; as an approximation we just apply the avoidance to the entire set).


## Preemption mechanics

As explained above, evicting victim(s) and binding the pending pod are not transactional. Preemption victims may have "`TerminationGracePeriodSeconds`" which will create an even larger time gap between the eviction and binding points. When a victim with a termination grace period receives its termination signal, it keeps running on the node until it terminates successfully or its grace period is over. In the meantime, the node's resources won't be available to another pod. So, the scheduler cannot bind the pending pod right away. Scheduler should mark the pending pod as assigned and move on to schedule other pods. To do so, we propose adding a new field to PodSpec called "`FutureNodeName`". When this field is set, scheduler knows that the pod is destined to run on the given node and takes it into account when making scheduling decisions for other pods.
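For illustration, a simplified sketch of how the proposed field might sit next to `NodeName` in PodSpec. The field name follows this proposal only; the thread below discusses renaming it and setting it through a dedicated subresource, and the types shown are not the actual Kubernetes API source.

```go
// Sketch only: a simplified view of the proposed PodSpec change.
package sketch

type PodSpec struct {
	// NodeName is set when the pod is bound to a node (existing field).
	NodeName string `json:"nodeName,omitempty"`

	// FutureNodeName (proposed) is set by the scheduler once it has chosen a
	// node and started preemption there, before the victims finish graceful
	// termination and the pod can actually be bound.
	FutureNodeName string `json:"futureNodeName,omitempty"`

	// ...other existing PodSpec fields omitted...
}
```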
Contributor

Note that resources are not freed up until the pod object gets deleted from the API server by the kubelet.
Processes belonging to a pod get TerminationGracePeriodSeconds to exit gracefully. However, any compute resources (memory, storage) that were used by those processes do not have a cleanup SLA from the kubelet as of now.

Contributor

I assume by PodSpec you meant Pod.Status? Can this logic be dealt with via annotations?

@bsalamat (Member Author), Jun 21, 2017

PodSpec is where I think the field should go. Pod.status is also possible, but I am not sure if something like "FutureNodeName" belongs in status. Annotations would work too, but I learned that annotations are not discoverable by users. This field will be useful when debugging/reasoning about events in the system. So, I think it should be discoverable.

Member

Who has permission to set FutureNodeName? Is it done via bind or some other subresource? @liggitt

Member Author

The scheduler should have permission to set FutureNodeName. It is set when we cannot yet actually place the pod on a node, so it happens before bind.

Member

need to clarify:

  • when is futureNodeName settable and via what resources / subresources
  • when is futureNodeName mutable
  • its relationship to nodeName
  • what does futureNodeName govern? if it references a node that does not satisfy other aspects of the pod spec like the nodeSelector or tolerations, is it ignored? if it references a node that does not exist, is it ignored?

Do these constraints sound right:

  • spec.futureNodeName must be empty if spec.nodeName is non-empty
  • spec.futureNodeName must be empty on initial pod creation
  • spec.futureNodeName is set to "" when setting spec.nodeName via the bind subresource

Member Author

> need to clarify:
> • when is futureNodeName settable and via what resources / subresources
> • when is futureNodeName mutable
> • its relationship to nodeName
> • what does futureNodeName govern? if it references a node that does not satisfy other aspects of the pod spec like the nodeSelector or tolerations, is it ignored? if it references a node that does not exist, is it ignored?

I believe the examples have answered these questions. Please let me know if they haven't.

> Do these constraints sound right:
> • spec.futureNodeName must be empty if spec.nodeName is non-empty

No. Examples have covered this question.

> • spec.futureNodeName must be empty on initial pod creation

Correct.

> • spec.futureNodeName is set to "" when setting spec.nodeName via the bind subresource

No. It is left as it was.

Member Author

@liggitt
I wonder how important it is to set nominatedNodeName (aka futureNodeName) via a sub-resource? Is a subresource needed only for security and rbac purposes?

Member

> Is a subresource needed only for security and rbac purposes?

No, it's a matter of providing the API you mean. This field is designed to only be set at certain points in the pod lifecycle, and it's clearer if the API used to set it is fit to the purpose. The pods/binding subresource was formed for the same reasons before RBAC even existed, and before most clusters ran the scheduler with scoped permissions.

Member Author

I see. Thanks! We will add the subresource.

- requires replicating the scheduler logic in another component. In particular, the rescheduler is responsible for choosing which node the pending pod should schedule onto, which requires it to know the predicate and priority functions.
- increases the window for race conditions between pod preemption and pending pod scheduling.

Another option is for the scheduler to send the pending pod to a node without doing any preemption and rely on the kubelet to do the preemption(s). Similar to the rescheduler option, this option requires replicating the preemption and scheduling logic. Kubelet already has the logic to evict pods when a node is under resource pressure, but this logic is much simpler than the whole scheduling logic that considers various scheduling parameters, such as affinity, anti-affinity, PodDisruptionBudget, etc. That is why we believe the scheduler is the right component to perform preemption.
Contributor

s/has requires/requires?


## Race condition in multi-scheduler clusters

Kubernetes allows a cluster to have more than one scheduler. This introduces a race condition: one scheduler (scheduler A) may preempt one or more pods, and before it has a chance to schedule the initial pending pod, another scheduler (scheduler B) may schedule a different pod into the space opened by the preemption. In this case, scheduler A goes ahead and schedules the initial pending pod on the node, thinking that the space is still available. However, the pod from A will be rejected by the kubelet admission process if there are not enough free resources on the node after the pod from B has been bound (or if any other predicate that kubelet admission checks fails). This is not a major issue, as schedulers will try again to schedule the rejected pod.
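A small sketch of the kubelet-side admission check this paragraph relies on, assuming a single scalar resource; the real kubelet re-runs several admission predicates, not just this sum.

```go
// Not actual kubelet code; illustrates rejecting a pod bound by a racing
// scheduler when its requests no longer fit in the node's allocatable resources.
package main

import "fmt"

type boundPod struct {
	Name    string
	Request int64
}

// admit reports whether an incoming pod still fits on the node given what has
// already been admitted.
func admit(allocatable int64, admitted []boundPod, incoming boundPod) bool {
	var used int64
	for _, p := range admitted {
		used += p.Request
	}
	return used+incoming.Request <= allocatable
}

func main() {
	admitted := []boundPod{{"from-scheduler-B", 6}}
	fmt.Println(admit(10, admitted, boundPod{"from-scheduler-A", 5})) // false: rejected
}
```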


Is it true that schedulers will try again to schedule a pod rejected by the kubelet admission process?

Member Author

Practically yes. What happens is that the status of the pod is set to failed and the corresponding controller creates a replacement pod, which gets scheduled again later.

@bsalamat force-pushed the preemption branch 2 times, most recently from 8657e09 to 417c24b on June 28, 2017 00:26

## Preemption scenario

In this proposal, the only scenario under which a group of pods in Kubernetes may be preempted is when a higher priority pod cannot be scheduled due to various unmet scheduling requirements, such as lack of resources, unsatisfied affinity or anti-affinity rules, etc., and the preemption of the lower priority pods allows the higher priority pod to be scheduled. So, if the preemption of the lower priority pods does not help with scheduling of the higher priority pod, those lower priority pods will keep running and the higher priority pod will stay pending.
Member

how does this overlap with autoscaling?

Member Author

After a cluster reaches steady state (i.e., no more preemptions can be performed), the cluster may still have a few pending pods which are either lower priority than the running pods, or no more preemption can help schedule them. In this case, the autoscaler may decide to add more nodes to schedule them. I think autoscaler should add a knob for users to tell it what is the minimum priority of pending pods that warrants adding more nodes to the cluster.

@davidopp (Member), Jul 10, 2017

+1

The one thing that's not covered is if users want to grow the cluster instead of preempting, just to avoid the disruption (i.e. instead of preempting and then growing the cluster because the victim goes pending, avoid the preemption altogether by growing the cluster when the high-priority pod is submitted). Obviously one way to do this is to set all pods at the same priority (i.e. not use the priority/preemption feature) but it would be nice if clusters that use priority/preemption could somehow set the tradeoff. PDB might play a role here -- we could say that if a preemption would violate PDB then we will try to grow the cluster, and only preempt if the cluster has hit the max size the user configured for the cluster autoscaler. I suspect that in the absence of PDB, the current behavior is probably what users want because growing the cluster takes a while, whereas preemption allows the high-priority pod to get scheduled right away (and then if the victim has to wait a while for the new node to spin up, it's not a big deal). Keep in mind that PDB will eventually support eviction rate limits (due to "voluntary" causes, of course -- but preemption seems to fit in that bucket), not just minAvailable/maxUnavailable.

@ravisantoshgudimetla (Contributor)

@davidopp self-assigned this on Jul 23, 2017

@davidopp (Member)

/lgtm
/approve

@k8s-ci-robot added the `lgtm` label ("Looks good to me", indicates that a PR is ready to be merged) on Jul 23, 2017
@derekwaynecarr (Member) left a comment

Generally LGTM, just one question about how future node name is set.

@vishh - i am not sure what we will do when we do NUMA aware fits in the kubelet. The memory example here for priority winning in eviction I think may have to be tweaked to be local to that NUMA node only in future.

1. The high priority pod uses more than 1GB of memory. Kubelet detects the resource pressure and kills the best effort pod.

So, best effort pods may be killed to make room for higher priority pods, although the scheduler does not preempt them directly.
Now, assume everything in the above example, but the best effort pod has priority 2000. In this scenario, scheduler schedules the pending pod with priority 200 on the node, but it may be evicted by the Kubelet, because Kubelet's eviction function may determine that the best effort pod should stay given its high priority and despite its usage above request. Given this scenario, scheduler should avoid the node and should try scheduling the pod on a different node if the pod is evicted by the Kubelet. This is an optimization to prevent possible ping-pong behavior between Kubelet and Scheduler.
Member

I hope we do not encourage users to have high priority pods that make no resource requests. As a result, this would not be likely to happen.

Member Author

We certainly don't. In fact, doing so has a big drawback for users. Their pods may be scheduled on nodes with little available resources and get evicted.


## Preemption mechanics

As explained above, evicting victim(s) and binding the pending pod are not transactional. Preemption victims may have "`TerminationGracePeriodSeconds`" which will create an even larger time gap between the eviction and binding points. When a victim with a termination grace period receives its termination signal, it keeps running on the node until it terminates successfully or its grace period is over. In the meantime, the node's resources won't be available to another pod. So, the scheduler cannot bind the pending pod right away. Scheduler should mark the pending pod as assigned and move on to schedule other pods. To do so, we propose adding a new field to PodSpec called "`FutureNodeName`". When this field is set, scheduler knows that the pod is destined to run on the given node and takes it into account when making scheduling decisions for other pods.
Member

Who has permission to set FutureNodeName? Is it done via bind or some other subresource? @liggitt

Here are all the steps taken in the process:

1. Scheduler sets "`deletionTimestamp`" of the victims and sets "`FutureNodeName`" of the pending pod.
1. Kubelet sees the `deletionTimestamp` and the victims enter their graceful termination period.
Member

Kubelet

1. When any pod is terminated (whether a victim or not), the scheduler starts from the beginning of its queue, which is sorted by descending pod priority, to see if it can schedule them.
1. Scheduler skips a pod in its queue when there is no node for scheduling the pod.
1. Scheduler evaluates the "future" feasibility of a pending pod in the queue as if the preemption victims are already gone and the pods which are ahead in the queue and have that node as "`FutureNodeName`" are already bound. See example 1 below.
1. When a scheduler pass is triggered, scheduler reevaluates all the pods from the head of the queue and updates their "`FutureNodeName`" if needed. See example 4.
Member

I guess this is useful in scenarios where a pod may get stuck terminating or a node is lost

Member Author

Or a higher priority pod shows up in the meantime, between one scheduler pass and the time we can bind the pod, as explained in example 4.
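A minimal sketch of the priority-ordered pending queue described in the steps above; the types are simplified stand-ins, not scheduler code.

```go
// Pending pods are kept sorted by descending priority, and each pass walks the
// queue from the head, skipping pods that cannot be placed yet.
package main

import (
	"fmt"
	"sort"
)

type pendingPod struct {
	Name     string
	Priority int
}

// sortQueue orders the pending queue by descending priority, so the scheduler
// always reconsiders the highest-priority pending pod first.
func sortQueue(queue []pendingPod) {
	sort.SliceStable(queue, func(i, j int) bool {
		return queue[i].Priority > queue[j].Priority
	})
}

func main() {
	queue := []pendingPod{{"batch", 0}, {"web", 100}, {"etl", 10}}
	sortQueue(queue)
	for _, p := range queue {
		fmt.Println(p.Name, p.Priority) // web 100, etl 10, batch 0
	}
}
```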


An alternative to preemption by priority and breaking ties with QoS, which was proposed earlier, is to preempt by QoS first and break ties with priority. We believe this could cause confusion for users and might reduce cluster utilization. Imagine the following scenario:

- User runs a web server with a very high priority and is willing to give as many resources as possible to this web server. So, the user chooses reasonable "requests" for resources and does not set any "limits" to let the web server use as many resources as it needs.
Member

This example is resource-specific. It's really only meaningful with memory. CPU in this scenario will still be limited. Maybe clarify that it's memory only.

Member Author

Could you please explain more why this does not apply to CPU?

@derekwaynecarr (Member)

Last question, is this behind a feature gate in 1.8?

@bsalamat (Member Author)

> Last question, is this behind a feature gate in 1.8?

It will be implemented as an alpha feature in 1.8.

1. Scheduler evaluates the "future" feasibility of a pending pod in the queue as if the preemption victims are already gone and the pods which are ahead in the queue and have that node as "`FutureNodeName`" are already bound. See example 1 below.
1. When a scheduler pass is triggered, scheduler reevaluates all the pods from the head of the queue and updates their "`FutureNodeName`" if needed. See example 4.

1. When a node becomes available, scheduler binds the pending pod to the node. The node may or may not be the same as "`FutureNodeName`". Scheduler sets the "`NodeName`" field of PodSpec, but it does not clear "`FutureNodeName`". See example 2 to find our reasons.
@liggitt (Member), Jul 27, 2017

hmm... leaving "futureNodeName" set even once the pod is bound (possibly to a different node) is really confusing... it makes it appear as if a bound pod is going to be moved to a different node. if the intent of the field is to indicate on which node preemption is/was performed to make room for this pod, I'd rather the field name reflect that.

Member Author

That's a fair point. How about changing the field name to NominatedNodeName?

@resouer (Contributor) commented on Jul 31, 2017

@bsalamat This is promising! Do you have a work item list or issue to track or xref? Would like to see if there is any place we can help.

@bsalamat (Member Author)

> This is promising! Do you have a work item list or issue to track or xref? Would like to see if there is any place we can help.

@resouer Yes, please see kubernetes/kubernetes#47604.

@resouer (Contributor) commented on Aug 2, 2017

@bsalamat Thanks. Also, it seems this proposal has been lgtm'ed and approved; I think you can merge it now?

@bsalamat (Member Author) commented on Aug 2, 2017

@resouer Yes, I am aware that it is lgtm'ed, but since there were questions after the lgtm, I wanted to make sure they were answered before merging it. I guess I can merge now and make further changes if there are any major issues.

@davidopp (Member) commented on Aug 2, 2017

Sorry, I forgot what the outstanding issues were. I think you mentioned someone suggested a subresource (like the /binding subresource) for setting future nodeName (or whatever it will be called). This seems reasonable, for symmetry with setting regular nodeName. As to @liggitt's point about RBAC, it's a good point and we need to take it into account when deploying this feature, but I don't think there's anything fundamental in the design that we can change to make it a non-issue; it's really about security policy.

@bsalamat (Member Author) commented on Aug 2, 2017

@davidopp As you said, the question that @liggitt asked is which resource/subresource is going to set FutureNodeName. We are going to add a subresource for it, something like FutureBinding.

@davidopp (Member) commented on Aug 2, 2017

If you think there are no outstanding issues (sounds like that's the case) I think it's fine to merge it. (I guess it is already LGTMd anyway.)

@bsalamat merged commit c20f7cc into kubernetes:master on Aug 2, 2017
@bsalamat deleted the preemption branch on August 2, 2017 08:41
erinboyd pushed a commit to erinboyd/community that referenced this pull request Oct 23, 2017
erinboyd pushed a commit to erinboyd/community that referenced this pull request Oct 23, 2017
MadhavJivrajani pushed a commit to MadhavJivrajani/community that referenced this pull request Nov 30, 2021
danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023