
Design proposal for pod preemption in Kubernetes #745

Merged: 2 commits merged into kubernetes:master on Aug 2, 2017

Conversation

bsalamat (Member)

@kubernetes/sig-scheduling-feature-requests
@kubernetes/sig-scheduling-misc
@kubernetes/sig-node-proposals
@kubernetes/api-approvers
@kubernetes/api-reviewers

@k8s-ci-robot added the `cncf-cla: yes` label (indicates the PR's author has signed the CNCF CLA) on Jun 20, 2017

## Scheduler performs preemption

We propose that preemption be performed by the scheduler, which does it by deleting the pods being preempted. The component that performs preemption must have the logic to find the right nodes for the pending pod. It must also have the logic to check whether preempting the chosen pods allows the pending pod to be scheduled. Both require the component to have knowledge of the predicate and priority functions.
Contributor

I assume delete here means a graceful deletion similar to how kubectl delete works?

Member Author

Yes. It is explained in more detail later in the doc.

@ChenLingPeng

How about un-binding (unsetting nodeName) that pod instead of deleting it?

Member Author

@ChenLingPeng Kubernetes already has an "eviction" mechanism where the kubelet on a node may remove a pod from the node under resource pressure. It works by deleting pods and giving them their termination grace period. We will follow the same procedure.


Is the node selection part of this design too or is this just an API improvement?

The procedure to select a suitable node for preemption can actually be pretty complex (fully resolving a system with dependencies and conflicts is usually an NP-complete task), but I see no description of that specific algorithm in this pull request.

@bsalamat (Member Author), Jun 28, 2017

I think node selection is kind of an implementation detail, although it is a hard problem. As you said, we cannot find the best node for preemption in larger clusters, as it becomes computationally very expensive. We will have to stop once we find a certain number of nodes that are preemption candidates. We will probably stop our search once we find something like min(a fixed number, a percentage of all nodes) nodes. We can fine tune the number in the future.
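For illustration only, a tiny sketch of the kind of cap described in this comment: stop the candidate search after min(a fixed number, a percentage of all nodes) nodes. The constants are hypothetical placeholders, not numbers from the proposal.

```go
// Hedged sketch of capping the preemption-candidate search.
package main

import "fmt"

func candidateNodeLimit(totalNodes int) int {
	const fixedCap = 100  // hypothetical fixed number of candidate nodes
	const percentage = 10 // hypothetical percentage of all nodes
	byPercent := totalNodes * percentage / 100
	if byPercent < 1 {
		byPercent = 1 // always examine at least one node
	}
	if byPercent < fixedCap {
		return byPercent
	}
	return fixedCap
}

func main() {
	fmt.Println(candidateNodeLimit(50))   // 5
	fmt.Println(candidateNodeLimit(5000)) // 100
}
```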


Well, the search combined with "These require the component to have the knowledge of predicate and priority functions." might get tricky, and I would expect the way to access the predicates and priorities to be part of the design. Do not forget that the scheduler is extensible and that not all pods even have to use the same scheduler.

Member Author

> I would expect the way to access the predicates and priorities to be part of the design.

I am not sure what you mean by the above sentence. The scheduler already has access to the predicate and priority functions, as do other schedulers. What am I missing?


I think I kind of assumed that preemption would be implemented as an extra service and that the scheduler flow would be extended to handle not-enough-resources situations via it. That way it would not depend on any specific scheduler, and so my comment would make sense (a way to cooperate would be needed).

Member Author

There won't be any new component/service that only performs preemption. Preemption will be implemented as a few functions. The main scheduler and/or other schedulers will be able to call them. The approach will be similar to the way priority and predicate functions are implemented.


- When ordering the pods from lowest to highest priority for considering which pod(s) to preempt, among pods with equal priority the pods are ordered by their [QoS class](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md#qos-classes): Best Effort, Burstable, Guaranteed.
- Scheduler respects pods' disruption budget when considering them for preemption.
- Scheduler will try to minimize the number of preempted pods. As a result, it may preempt a pod while leaving lower priority pods running if preemption of those lower priority pods is not enough to schedule the pending pod while preemption of the higher priority pod(s) is enough to schedule it. For example, if node capacity is 10, the pending pod has priority 10 and requests 5 units of resource, and the running pods are {priority 0 request 3, priority 1 request 1, priority 2 request 5, priority 3 request 1}, the scheduler will preempt only the priority 2 pod and leave the priority 0 and priority 1 pods running.
Contributor

Why? If priority 0 mapped to batch tasks, wouldn't users want to kill most of their batch tasks before killing any other tasks?

Member Author

In the given example, preemption of priority 0 and priority 1 pods does not release enough resources to let the pending pod be scheduled, and preemption of the priority 2 pod will still be needed.
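To make this example concrete, here is a runnable sketch under simplified assumptions (a single scalar resource, one candidate node). The reprieve heuristic shown is just one way to get the behavior the proposal describes, not necessarily the exact algorithm the scheduler will use.

```go
// Hedged sketch of victim selection for the worked example above.
package main

import (
	"fmt"
	"sort"
)

type pod struct {
	Name     string
	Priority int
	Request  int
}

// selectVictims returns the pods to preempt on a node so that pending fits,
// trying to keep as many pods running as possible.
func selectVictims(capacity int, running []pod, pending pod) []pod {
	used := 0
	for _, p := range running {
		used += p.Request
	}
	free := capacity - used

	// Start by assuming every lower-priority pod is a victim.
	victims := []pod{}
	for _, p := range running {
		if p.Priority < pending.Priority {
			victims = append(victims, p)
			free += p.Request
		}
	}
	if free < pending.Request {
		return nil // even preempting everything eligible is not enough
	}

	// Reprieve victims, highest priority first, whenever the pending pod still
	// fits without preempting them.
	sort.Slice(victims, func(i, j int) bool { return victims[i].Priority > victims[j].Priority })
	kept := []pod{}
	for _, v := range victims {
		if free-v.Request >= pending.Request {
			free -= v.Request // reprieve: this pod keeps running
			continue
		}
		kept = append(kept, v)
	}
	return kept
}

func main() {
	running := []pod{
		{"p0", 0, 3}, {"p1", 1, 1}, {"p2", 2, 5}, {"p3", 3, 1},
	}
	pending := pod{"pending", 10, 5}
	fmt.Println(selectVictims(10, running, pending)) // [{p2 2 5}]: only the priority-2 pod
}
```

With the example's numbers, preempting the priority 0 and priority 1 pods alone frees only 4 units, so the priority 2 pod remains the single necessary victim.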

- Scheduler respects pods' disruption budget when considering them for preemption.
- Scheduler will try to minimize the number of preempted pods. As a result, it may preempt a pod while leaving lower priority pods running if preemption of those lower priority pods is not enough to schedule the pending pod while preemption of the higher priority pod(s) is enough to schedule it. For example, if node capacity is 10, the pending pod has priority 10 and requests 5 units of resource, and the running pods are {priority 0 request 3, priority 1 request 1, priority 2 request 5, priority 3 request 1}, the scheduler will preempt only the priority 2 pod and leave the priority 0 and priority 1 pods running.
- Scheduler does not have knowledge of the resource usage of pods. It makes scheduling decisions based on the requested resources ("requests") of the pods, and when it considers a pod for preemption, it assumes the "requests" will be freed on the node.
- This means that the scheduler will never preempt a Best Effort pod to make more resources available. That's because the requests of Best Effort pods are zero and, therefore, preempting them will not release any resources on the node from the scheduler's point of view (see the sketch below).
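A minimal sketch of the point in the last bullet, assuming a single scalar resource: the scheduler counts only the victims' requests as freed capacity, so a Best Effort victim frees nothing.

```go
// Not actual scheduler code; illustrates why preempting a Best Effort pod
// (requests of zero) frees no capacity from the scheduler's point of view.
package main

import "fmt"

type victim struct {
	Name    string
	Request int64 // requested units of a single resource, 0 for Best Effort
}

// freedByPreempting returns the capacity the scheduler assumes becomes
// available if the given victims are preempted.
func freedByPreempting(victims []victim) int64 {
	var freed int64
	for _, v := range victims {
		freed += v.Request
	}
	return freed
}

func main() {
	fmt.Println(freedByPreempting([]victim{{Name: "best-effort", Request: 0}})) // 0
	fmt.Println(freedByPreempting([]victim{{Name: "burstable", Request: 3}}))   // 3
}
```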
Contributor

Nodes have a limit on the number of pods they can run, and in the future I expect them to limit the number of containers too. You might want to preempt when nodes are full of pods; this can happen when running batch workloads, for example.

Member Author

Ack.

Member

we really need a counted resource for "pod slots" and for IP addresses.

Member Author

@thockin Makes sense, but addition of those resources won't change things in this proposal.

@davidopp (Member), Jul 10, 2017

We have a counted resource for "pod slots" (kubernetes/kubernetes#5547)

We discussed doing it for IP Addresses (kubernetes/kubernetes#5507) but never did it. Maybe we should revisit that.

## Preemption - Eviction workflow

"Eviction" is the act of killing one or more pods on a node when the node is under resource pressure. Kubelet performs eviction. The eviction process is described in a separate document by sig-node, but since it is closely related to "preemption", we explain it briefly here.
Similar to preemption, the lowest priority pods are chosen for eviction first. The difference between preemption and eviction is that when pods with the same priority are considered for eviction, the one with the highest percentage of usage over "requests" is the one that is evicted first. In other words, Kubelet uses priority for eviction and breaks ties by usage over requests.
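A hedged sketch of the ordering rule described in this paragraph (not the kubelet's actual eviction code): candidates are ranked by priority, lowest first, and ties are broken by how far usage exceeds "requests", highest first.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

type candidate struct {
	Name     string
	Priority int
	Usage    float64 // observed usage of the resource under pressure
	Request  float64 // requested amount; 0 for Best Effort pods
}

// usageOverRequest returns usage relative to requests; pods with no request
// are treated as maximally over their request.
func usageOverRequest(c candidate) float64 {
	if c.Request == 0 {
		return math.Inf(1)
	}
	return c.Usage / c.Request
}

// rankForEviction sorts candidates so the first element is evicted first.
func rankForEviction(cands []candidate) {
	sort.SliceStable(cands, func(i, j int) bool {
		if cands[i].Priority != cands[j].Priority {
			return cands[i].Priority < cands[j].Priority // lower priority goes first
		}
		return usageOverRequest(cands[i]) > usageOverRequest(cands[j]) // bigger overuse goes first
	})
}

func main() {
	cands := []candidate{
		{"web", 10, 2.0, 1.0},
		{"api", 10, 1.1, 1.0},
		{"batch", 0, 0.5, 1.0},
	}
	rankForEviction(cands)
	for _, c := range cands {
		fmt.Println(c.Name) // batch, web, api
	}
}
```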
Contributor

A Burstable pod started with a higher priority can use up the entire node and evict all other pods with the policy described here. Giving a priority penalty for pods as they use more resources than their requests might be worth considering.
For example, a pod with request 100Mi and usage 1Gi with priority 10 might drop down to priority 1 because of its utilization.

@bsalamat (Member Author), Jun 21, 2017

I strongly believe that we should not change the priority of pods. There are many use cases in which clients want to run very important workloads as Burstable and let them use as many resources as they need. This is a common scenario when running servers.
Besides, scheduler does not have pod usage information today.
If abuse is a concern, we plan to address that with quota for priority.


So, best effort pods may be killed to make room for higher priority pods, although the scheduler does not preempt them directly.
Now, assume everything in the above example, but the best effort pod has priority 300. In this scenario, scheduler schedules the pending pod with priority 200 on the node, but it is evicted by the Kubelet, because the best effort pod has a higher priority. Given this scenario, scheduler should avoid the node and should try scheduling the pod on a different node if the pod is evicted by the Kubelet.

Contributor

There exists no notion of "avoid previous node" for pods, since pods do not get recreated. How do you plan on making the scheduler place subsequent incarnations of the same pod on different nodes?
This issue is relevant to other features as well.

@davidopp (Member), Jun 21, 2017

I think this is more of a nice-to-have than absolutely necessary. But actually we do have "avoid previous node" already, see kubernetes/kubernetes#20699, but note that (1) it's not used anywhere yet so is definitely an alpha feature, and (2) it's on the set granularity rather than pod granularity for the reason you mentioned (pods do not get re-created, so it's nontrivial to determine that a particular pod you are trying to schedule is the replacement pod for some particular other pod that terminated; as an approximation we just apply the avoidance to the entire set).


## Preemption mechanics

As explained above, evicting victim(s) and binding the pending pod are not transactional. Preemption victims may have "`TerminationGracePeriodSeconds`" which will create an even larger time gap between the eviction and binding points. When a victim with a termination grace period receives its termination signal, it keeps running on the node until it terminates successfully or its grace period is over. In the meantime, the node's resources won't be available to another pod. So, the scheduler cannot bind the pending pod right away. Scheduler should mark the pending pod as assigned and move on to schedule other pods. To do so, we propose adding a new field to PodSpec called "`FutureNodeName`". When this field is set, scheduler knows that the pod is destined to run on the given node and takes it into account when making scheduling decisions for other pods.
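For illustration, a simplified sketch of how the proposed field might sit next to `NodeName` in PodSpec. The field name follows this proposal only; the thread below discusses renaming it and setting it through a dedicated subresource, and the types shown are not the actual Kubernetes API source.

```go
// Sketch only: a simplified view of the proposed PodSpec change.
package sketch

type PodSpec struct {
	// NodeName is set when the pod is bound to a node (existing field).
	NodeName string `json:"nodeName,omitempty"`

	// FutureNodeName (proposed) is set by the scheduler once it has chosen a
	// node and started preemption there, before the victims finish graceful
	// termination and the pod can actually be bound.
	FutureNodeName string `json:"futureNodeName,omitempty"`

	// ...other existing PodSpec fields omitted...
}
```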
Contributor

Note that resources are not freed up until the pod object gets deleted from the API server by the kubelet.
Processes belonging to a pod get TerminationGracePeriodSeconds to exit gracefully. However, any compute resources (memory, storage) that were used by those processes do not have a cleanup SLA from the kubelet as of now.

Contributor

I assume by PodSpec you meant Pod.Status? Can this logic be dealt with via annotations?

@bsalamat (Member Author), Jun 21, 2017

PodSpec is where I think the field should go. Pod.status is also possible, but I am not sure if something like "FutureNodeName" belongs in status. Annotations would work too, but I learned that annotations are not discoverable by users. This field will be useful when debugging/reasoning about events in the system. So, I think it should be discoverable.

Member

Who has permission to set FutureNodeName? Is it done via bind or some other subresource? @liggitt

Member Author

The scheduler should have permission to set FutureNodeName. It is set when we cannot yet actually place the pod on a node, so it happens before bind.

Member

need to clarify:

  • when is futureNodeName settable and via what resources / subresources
  • when is futureNodeName mutable
  • its relationship to nodeName
  • what does futureNodeName govern? if it references a node that does not satisfy other aspects of the pod spec like the nodeSelector or tolerations, is it ignored? if it references a node that does not exist, is it ignored?

Do these constraints sound right:

  • spec.futureNodeName must be empty if spec.nodeName is non-empty
  • spec.futureNodeName must be empty on initial pod creation
  • spec.futureNodeName is set to "" when setting spec.nodeName via the bind subresource

Member Author

> need to clarify:
> • when is futureNodeName settable and via what resources / subresources
> • when is futureNodeName mutable
> • its relationship to nodeName
> • what does futureNodeName govern? if it references a node that does not satisfy other aspects of the pod spec like the nodeSelector or tolerations, is it ignored? if it references a node that does not exist, is it ignored?

I believe the examples have answered these questions. Please let me know if they haven't.

> Do these constraints sound right:
> • spec.futureNodeName must be empty if spec.nodeName is non-empty

No. Examples have covered this question.

> • spec.futureNodeName must be empty on initial pod creation

Correct.

> • spec.futureNodeName is set to "" when setting spec.nodeName via the bind subresource

No. It is left as it was.

Member Author

@liggitt
I wonder how important it is to set nominatedNodeName (aka futureNodeName) via a sub-resource? Is a subresource needed only for security and rbac purposes?

Member

> Is a subresource needed only for security and rbac purposes?

No, it's a matter of providing the API you mean. This field is designed to only be set at certain points in the pod lifecycle, and it's clearer if the API used to set it is fit to the purpose. The pods/binding subresource was formed for the same reasons before RBAC even existed, and before most clusters ran the scheduler with scoped permissions.

Member Author

I see. Thanks! We will add the subresource.

- requires replicating the scheduler logic in another component. In particular, the rescheduler is responsible for choosing which node the pending pod should schedule onto, which requires it to know the predicate and priority functions.
- increases the window for race conditions between pod preemption and pending pod scheduling.

Another option is for the scheduler to send the pending pod to a node without doing any preemption and rely on the kubelet to do the preemption(s). Similar to the rescheduler option, this option requires replicating the preemption and scheduling logic. Kubelet already has the logic to evict pods when a node is under resource pressure, but this logic is much simpler than the whole scheduling logic that considers various scheduling parameters, such as affinity, anti-affinity, PodDisruptionBudget, etc. That is why we believe the scheduler is the right component to perform preemption.
Contributor

s/has requires/requires?


## Race condition in multi-scheduler clusters

Kubernetes allows a cluster to have more than one scheduler. This introduces a race condition: one scheduler (scheduler A) may preempt one or more pods, and before it has a chance to schedule the initial pending pod, another scheduler (scheduler B) may schedule a different pod into the space opened by the preemption. In this case, scheduler A goes ahead and schedules the initial pending pod on the node, thinking that the space is still available. However, the pod from A will be rejected by the kubelet admission process if there are not enough free resources on the node after the pod from B has been bound (or if any other predicate that kubelet admission checks fails). This is not a major issue, as schedulers will try again to schedule the rejected pod.
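A small sketch of the kubelet-side admission check this paragraph relies on, assuming a single scalar resource; the real kubelet re-runs several admission predicates, not just this sum.

```go
// Not actual kubelet code; illustrates rejecting a pod bound by a racing
// scheduler when its requests no longer fit in the node's allocatable resources.
package main

import "fmt"

type boundPod struct {
	Name    string
	Request int64
}

// admit reports whether an incoming pod still fits on the node given what has
// already been admitted.
func admit(allocatable int64, admitted []boundPod, incoming boundPod) bool {
	var used int64
	for _, p := range admitted {
		used += p.Request
	}
	return used+incoming.Request <= allocatable
}

func main() {
	admitted := []boundPod{{"from-scheduler-B", 6}}
	fmt.Println(admit(10, admitted, boundPod{"from-scheduler-A", 5})) // false: rejected
}
```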


Is it true that schedulers will try again to schedule a pod rejected by the kubelet admission process?

Member Author

Practically yes. What happens is that the status of the pod is set to failed and the corresponding controller creates a replacement pod, which gets scheduled again later.

@bsalamat force-pushed the preemption branch 2 times, most recently from 8657e09 to 417c24b on June 28, 2017 00:26

## Preemption scenario

In this proposal, the only scenario under which a group of pods in Kubernetes may be preempted is when a higher priority pod cannot be scheduled due to various unmet scheduling requirements, such as lack of resources, unsatisfied affinity or anti-affinity rules, etc., and the preemption of the lower priority pods allows the higher priority pod to be scheduled. So, if the preemption of the lower priority pods does not help with scheduling of the higher priority pod, those lower priority pods will keep running and the higher priority pod will stay pending.
Member

how does this overlap with autoscaling?

Member Author

After a cluster reaches steady state (i.e., no more preemptions can be performed), the cluster may still have a few pending pods which are either lower priority than the running pods, or no more preemption can help schedule them. In this case, the autoscaler may decide to add more nodes to schedule them. I think autoscaler should add a knob for users to tell it what is the minimum priority of pending pods that warrants adding more nodes to the cluster.

@davidopp (Member), Jul 10, 2017

+1

The one thing that's not covered is if users want to grow the cluster instead of preempting, just to avoid the disruption (i.e. instead of preempting and then growing the cluster because the victim goes pending, avoid the preemption altogether by growing the cluster when the high-priority pod is submitted). Obviously one way to do this is to set all pods at the same priority (i.e. not use the priority/preemption feature) but it would be nice if clusters that use priority/preemption could somehow set the tradeoff. PDB might play a role here -- we could say that if a preemption would violate PDB then we will try to grow the cluster, and only preempt if the cluster has hit the max size the user configured for the cluster autoscaler. I suspect that in the absence of PDB, the current behavior is probably what users want because growing the cluster takes a while, whereas preemption allows the high-priority pod to get scheduled right away (and then if the victim has to wait a while for the new node to spin up, it's not a big deal). Keep in mind that PDB will eventually support eviction rate limits (due to "voluntary" causes, of course -- but preemption seems to fit in that bucket), not just minAvailable/maxUnavailable.

@ravisantoshgudimetla (Contributor)

@davidopp self-assigned this on Jul 23, 2017

@davidopp (Member)

/lgtm
/approve

@k8s-ci-robot added the `lgtm` label ("Looks good to me", indicates that a PR is ready to be merged) on Jul 23, 2017
@derekwaynecarr (Member) left a comment

Generally LGTM, just one question about how future node name is set.

@vishh - i am not sure what we will do when we do NUMA aware fits in the kubelet. The memory example here for priority winning in eviction I think may have to be tweaked to be local to that NUMA node only in future.

1. The high priority pod uses more than 1GB of memory. Kubelet detects the resource pressure and kills the best effort pod.

So, best effort pods may be killed to make room for higher priority pods, although the scheduler does not preempt them directly.
Now, assume everything in the above example, but the best effort pod has priority 2000. In this scenario, scheduler schedules the pending pod with priority 200 on the node, but it may be evicted by the Kubelet, because Kubelet's eviction function may determine that the best effort pod should stay given its high priority and despite its usage above request. Given this scenario, scheduler should avoid the node and should try scheduling the pod on a different node if the pod is evicted by the Kubelet. This is an optimization to prevent possible ping-pong behavior between Kubelet and Scheduler.
Member

I hope we do not encourage users to have high priority pods that make no resource requests. As a result, this would not be likely to happen.

Member Author

We certainly don't. In fact, doing so has a big drawback for users. Their pods may be scheduled on nodes with little available resources and get evicted.


## Preemption mechanics

As explained above, evicting victim(s) and binding the pending pod are not transactional. Preemption victims may have "`TerminationGracePeriodSeconds`" which will create an even larger time gap between the eviction and binding points. When a victim with a termination grace period receives its termination signal, it keeps running on the node until it terminates successfully or its grace period is over. In the meantime, the node's resources won't be available to another pod. So, the scheduler cannot bind the pending pod right away. Scheduler should mark the pending pod as assigned and move on to schedule other pods. To do so, we propose adding a new field to PodSpec called "`FutureNodeName`". When this field is set, scheduler knows that the pod is destined to run on the given node and takes it into account when making scheduling decisions for other pods.
Member

Who has permission to set FutureNodeName? Is it done via bind or some other subresource? @liggitt

Here are all the steps taken in the process:

1. Scheduler sets "`deletionTimestamp`" of the victims and sets "`FutureNodeName`" of the pending pod.
1. Kubelet sees the `deletionTimestamp` and the victims enter their graceful termination period.
Member

Kubelet

1. When any pod is terminated (whether a victim or not), the scheduler starts from the beginning of its queue, which is sorted by descending pod priority, to see if it can schedule them.
1. Scheduler skips a pod in its queue when there is no node for scheduling the pod.
1. Scheduler evaluates the "future" feasibility of a pending pod in the queue as if the preemption victims are already gone and the pods which are ahead in the queue and have that node as "`FutureNodeName`" are already bound. See example 1 below.
1. When a scheduler pass is triggered, scheduler reevaluates all the pods from the head of the queue and updates their "`FutureNodeName`" if needed. See example 4.
Member

I guess this is useful in scenarios where a pod may get stuck terminating or a node is lost

Member Author

Or a higher priority pod shows up in the meantime, between one scheduler pass and the time we can bind the pod, as explained in example 4.
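A minimal sketch of the priority-ordered pending queue described in the steps above; the types are simplified stand-ins, not scheduler code.

```go
// Pending pods are kept sorted by descending priority, and each pass walks the
// queue from the head, skipping pods that cannot be placed yet.
package main

import (
	"fmt"
	"sort"
)

type pendingPod struct {
	Name     string
	Priority int
}

// sortQueue orders the pending queue by descending priority, so the scheduler
// always reconsiders the highest-priority pending pod first.
func sortQueue(queue []pendingPod) {
	sort.SliceStable(queue, func(i, j int) bool {
		return queue[i].Priority > queue[j].Priority
	})
}

func main() {
	queue := []pendingPod{{"batch", 0}, {"web", 100}, {"etl", 10}}
	sortQueue(queue)
	for _, p := range queue {
		fmt.Println(p.Name, p.Priority) // web 100, etl 10, batch 0
	}
}
```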


An alternative to preemption by priority and breaking ties with QoS, which was proposed earlier, is to preempt by QoS first and break ties with priority. We believe this could cause confusion for users and might reduce cluster utilization. Imagine the following scenario:

- User runs a web server with a very high priority and is willing to give as many resources as possible to this web server. So, the user chooses reasonable "requests" for resources and does not set any "limits" to let the web server use as many resources as it needs.
Member

This example is resource-specific. It's really only meaningful with memory. CPU in this scenario will still be limited. Maybe clarify that it's memory only.

Member Author

Could you please explain more why this does not apply to CPU?

@derekwaynecarr (Member)

Last question, is this behind a feature gate in 1.8?

@bsalamat (Member Author)

> Last question, is this behind a feature gate in 1.8?

It will be implemented as an alpha feature in 1.8.

1. Scheduler evaluates the "future" feasibility of a pending pod in the queue as if the preemption victims are already gone and the pods which are ahead in the queue and have that node as "`FutureNodeName`" are already bound. See example 1 below.
1. When a scheduler pass is triggered, scheduler reevaluates all the pods from the head of the queue and updates their "`FutureNodeName`" if needed. See example 4.

1. When a node becomes available, scheduler binds the pending pod to the node. The node may or may not be the same as "`FutureNodeName`". Scheduler sets the "`NodeName`" field of PodSpec, but it does not clear "`FutureNodeName`". See example 2 to find our reasons.
@liggitt (Member), Jul 27, 2017

hmm... leaving "futureNodeName" set even once the pod is bound (possibly to a different node) is really confusing... it makes it appear as if a bound pod is going to be moved to a different node. if the intent of the field is to indicate on which node preemption is/was performed to make room for this pod, I'd rather the field name reflect that.

Member Author

That's a fair point. How about changing the field name to NominatedNodeName?

@resouer (Contributor) commented on Jul 31, 2017

@bsalamat This is promising! Do you have a work item list or issue to track or xref? Would like to see if there is any place we can help.

@bsalamat (Member Author)

> This is promising! Do you have a work item list or issue to track or xref? Would like to see if there is any place we can help.

@resouer Yes, please see kubernetes/kubernetes#47604.

@resouer (Contributor) commented on Aug 2, 2017

@bsalamat Thanks. Also, it seems this proposal has been lgtm'ed and approved; I think you can merge it now?

@bsalamat (Member Author) commented on Aug 2, 2017

@resouer Yes, I am aware that it is lgtm'ed, but since there were questions after the lgtm, I wanted to make sure they were answered before merging it. I guess I can merge now and make further changes if there are any major issues.

@davidopp (Member) commented on Aug 2, 2017

Sorry, I forgot what the outstanding issues were. I think you mentioned someone suggested a subresource (like the /binding subresource) for setting future nodeName (or whatever it will be called). This seems reasonable, for symmetry with setting regular nodeName. As to @liggitt's point about RBAC, it's a good point and we need to take it into account when deploying this feature, but I don't think there's anything fundamental in the design that we can change to make it a non-issue; it's really about security policy.

@bsalamat (Member Author) commented on Aug 2, 2017

@davidopp As you said, the question that @liggitt asked is which resource/subresource is going to set FutureNodeName. We are going to add a subresource for it, something like FutureBinding.

@davidopp (Member) commented on Aug 2, 2017

If you think there are no outstanding issues (sounds like that's the case) I think it's fine to merge it. (I guess it is already LGTMd anyway.)

@bsalamat merged commit c20f7cc into kubernetes:master on Aug 2, 2017
@bsalamat deleted the preemption branch on August 2, 2017 08:41
erinboyd pushed a commit to erinboyd/community that referenced this pull request Oct 23, 2017
erinboyd pushed a commit to erinboyd/community that referenced this pull request Oct 23, 2017
MadhavJivrajani pushed a commit to MadhavJivrajani/community that referenced this pull request Nov 30, 2021
danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023