cluster-autoscaler: prevent scale-down of nodes for which VolumeBinding failed #7343
/area cluster-autoscaler
It might be better to use some more precise information that the Pod is meant to go to that particular node, like the `nominatedNodeName` field.
I wonder if the timeout is a root cause of this issue or an effect of scaling down a Node which was under delayed binding. Is it possible to verify that?
As we discussed, Pods which are not marked unschedulable (for instance the ones waiting for PV provisioning) are not considered in simulation (they are invisible to the Autoscaler), so maybe this Pod wasn't the reason for scale-up, but just a victim of an unrelated scale-down. Use of `nominatedNodeName` could help here.
It was confirmed in the code and in GKE logs that the Autoscaler does not consider in-flight pods in its internal simulations, and that delayed binding can cause the observed issues. Use of the `nominatedNodeName` to communicate the scheduler's binding decision should reduce the risk of the race to a minimum. There is a short doc describing the approach, so feel free to comment: https://docs.google.com/document/d/1aIhh4EsGNTfF6ZOnjZBTLJClhR-NXepWSClp2-oo26I/edit?usp=sharing
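The `nominatedNodeName`-based check discussed above can be sketched in Go. This is a minimal illustration using simplified stand-in types, not the real `k8s.io/api/core/v1` structs or the cluster-autoscaler's actual API:

```go
package main

import "fmt"

// PodStatus is a simplified stand-in for the relevant
// k8s.io/api/core/v1 field; the real type carries many more fields.
type PodStatus struct {
	NominatedNodeName string
}

type Pod struct {
	Name   string
	Spec   struct{ NodeName string }
	Status PodStatus
}

// podClaimsNode reports whether the pod is bound to, or nominated
// for, the given node -- the more precise signal the discussion
// suggests scale-down could consult before deleting a node.
func podClaimsNode(p *Pod, node string) bool {
	return p.Spec.NodeName == node || p.Status.NominatedNodeName == node
}

func main() {
	p := &Pod{Name: "web-0"}
	p.Status.NominatedNodeName = "node-a"
	fmt.Println(podClaimsNode(p, "node-a")) // true: nominated for this node
	fmt.Println(podClaimsNode(p, "node-b")) // false
}
```

The point of the sketch: `nominatedNodeName` survives a failed `PreBind` attempt, so scale-down can see the scheduler's intent even while the Pod is not yet bound.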
Which component are you using?: cluster-autoscaler
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
Background:
Scale-up logic finds unschedulable pods to help, simulates scheduling of those pods onto existing nodes, and performs scale-up for the remaining unschedulable pods that weren't helped by the simulation. Scale-up logic saves the simulated bindings into the cluster snapshot. Scale-down logic then reuses these simulated pod bindings to avoid removing nodes that are simulated targets for pod scheduling.
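The interaction described above can be modeled as a toy snapshot: scale-up records simulated pod-to-node bindings, and scale-down refuses to remove nodes that appear as simulated targets. Names and structure are illustrative only, not the real cluster-autoscaler code:

```go
package main

import "fmt"

// Snapshot is a toy model of the cluster snapshot described above.
type Snapshot struct {
	simulatedBindings map[string]string // pod name -> target node
}

// SimulateScaleUp records where each unschedulable pod would land,
// mimicking how scale-up saves simulated bindings.
func (s *Snapshot) SimulateScaleUp(placements map[string]string) {
	for pod, node := range placements {
		s.simulatedBindings[pod] = node
	}
}

// SafeToRemove mimics the scale-down check: a node that is a
// simulated target for some pod must not be deleted.
func (s *Snapshot) SafeToRemove(node string) bool {
	for _, n := range s.simulatedBindings {
		if n == node {
			return false
		}
	}
	return true
}

func main() {
	s := &Snapshot{simulatedBindings: map[string]string{}}
	s.SimulateScaleUp(map[string]string{"pvc-pod": "node-1"})
	fmt.Println(s.SafeToRemove("node-1")) // false: simulated target
	fmt.Println(s.SafeToRemove("node-2")) // true
}
```

The bug described below arises precisely because a pod stuck in delayed binding never enters `simulatedBindings`, so its node looks removable.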
Problem:
When a PVC takes a long time to provision, the `PreBind` phase of the scheduler's `VolumeBinding` plugin times out. The scheduler will retry scheduling the Pod, but until this happens the Pod's condition becomes `PodScheduled=false` with reason `SchedulerError`. In the next loop, cluster-autoscaler observes that the Pod is not unschedulable, doesn't simulate its scheduling on any node, and subsequently the scale-down logic is not prevented from deleting a node that was previously scaled up for this pod.
Describe the solution you'd like:
Pods with `PodScheduled=false` and reason `SchedulerError` should be treated as unschedulable by the scale-up logic. This seems safer than having the cluster-autoscaler ignore them: better to provision extra capacity than to scale down because of a possibly transient error.