
cluster-autoscaler: prevent scale-down of nodes for which VolumeBinding failed #7343

Open
pbetkier opened this issue Oct 4, 2024 · 5 comments
Labels
area/cluster-autoscaler kind/feature Categorizes issue or PR as related to a new feature.

Comments

@pbetkier
Contributor

pbetkier commented Oct 4, 2024

Which component are you using?: cluster-autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

Background:
Scale-up logic finds unschedulable pods to help, simulates scheduling them onto existing nodes, and performs a scale-up for the remaining unschedulable pods that weren't helped by the simulation. Scale-up logic saves the simulated bindings into the cluster snapshot. Scale-down logic then reuses these simulated pod bindings to avoid removing nodes that are simulated scheduling targets for pods.
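To make the bookkeeping concrete, here is a minimal, self-contained Go sketch of the flow described above (hypothetical names, not the actual cluster-autoscaler snapshot API): scale-up records which node each simulated pod landed on, and scale-down treats any node with simulated pods as non-removable.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// snapshot is a toy stand-in for the autoscaler's cluster snapshot: it only
// tracks which node each simulated (not yet bound) pod was placed on.
type snapshot struct {
	simulatedPods map[string][]*corev1.Pod // node name -> simulated pods
}

// addSimulatedPod records that the scale-up simulation placed pod onto node.
func (s *snapshot) addSimulatedPod(node string, pod *corev1.Pod) {
	if s.simulatedPods == nil {
		s.simulatedPods = map[string][]*corev1.Pod{}
	}
	s.simulatedPods[node] = append(s.simulatedPods[node], pod)
}

// isScaleDownCandidate is what scale-down would consult: a node that is a
// simulated target for some pod must not be removed.
func (s *snapshot) isScaleDownCandidate(node string) bool {
	return len(s.simulatedPods[node]) == 0
}
```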

Problem:
When a PVC takes a long time to provision, the scheduler's PreBind plugin VolumeBinding times out. The scheduler will retry scheduling the Pod, but until that happens the Pod's condition becomes PodScheduled=false with reason SchedulerError:

"Plugin failed" err="binding volumes: context deadline exceeded" plugin="VolumeBinding" pod="..." node="..."
"Error scheduling pod; retrying" err="running PreBind plugin \"VolumeBinding\": binding volumes: context deadline exceeded" pod="..."
"Updating pod condition" pod="..." conditionType="PodScheduled" conditionStatus="False" conditionReason="SchedulerError"

In the next loop cluster-autoscaler observes that the Pod is not unschedulable, doesn't simulate its scheduling on any node, and consequently the scale-down logic is not prevented from deleting a node that was previously scaled up for this pod.

Describe the solution you'd like.:
Pods with PodScheduled=false and reason SchedulerError should be treated as unschedulable by the scale-up logic. This seems safer than having the cluster-autoscaler ignore them: better to provision extra capacity than to scale down because of a possibly transient error.
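A minimal sketch of the proposed check (illustrative only, not the actual scale-up code; it assumes only the standard corev1 pod conditions API): a pod whose PodScheduled condition is False would be picked up by scale-up whether the reason is Unschedulable or SchedulerError.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// shouldTreatAsUnschedulable reports whether the pod failed scheduling either
// with the usual Unschedulable reason or with SchedulerError (e.g. a
// VolumeBinding PreBind timeout), so scale-up keeps simulating it and
// scale-down keeps its target node around.
func shouldTreatAsUnschedulable(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse {
			// "SchedulerError" is the reason observed in the logs above.
			return cond.Reason == corev1.PodReasonUnschedulable || cond.Reason == "SchedulerError"
		}
	}
	return false
}
```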

@pbetkier pbetkier added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 4, 2024
@pbetkier
Contributor Author

pbetkier commented Oct 4, 2024

@adrianmoisey
Member

/area cluster-autoscaler

@alculquicondor
Member

alculquicondor commented Oct 7, 2024

SchedulerError could happen due to other problems not related to volume binding, so the proposal isn't generally safe.

It might be better to use more precise information indicating that the Pod is meant to go to that particular node, such as the nominatedNodeName, as proposed in kubernetes/kubernetes#125491.

@dom4ha
Member

dom4ha commented Oct 15, 2024

I wonder if the timeout is the root cause of this issue or an effect of scaling down a Node that was under delayed binding. Is it possible to verify that?

In the next loop cluster-autoscaler observes that the Pod is not unschedulable, doesn't simulate its scheduling on any node, and consequently the scale-down logic is not prevented from deleting a node that was previously scaled up for this pod.

As we discussed, the pods which are not unschedulable (for instance the ones waiting for PV provisioning) are not considered in the simulation (they are invisible to the Autoscaler), so maybe this Pod wasn't the reason for the scale-up, but just a victim of an unrelated scale-down?

Use of nominatedNodeName could actually fill that gap and make the Autoscaler aware of the Pods that are, for whatever reason, waiting for binding.

@dom4ha
Member

dom4ha commented Oct 24, 2024

It was confirmed in the code and in GKE logs that the Autoscaler does not consider in-flight pods in its internal simulations and that delayed binding can cause the observed issues.

Use of the nominatedNodeName to communicate the scheduler's binding decision should reduce the risk of this race to a minimum. There is a short doc describing the approach, so feel free to comment: https://docs.google.com/document/d/1aIhh4EsGNTfF6ZOnjZBTLJClhR-NXepWSClp2-oo26I/edit?usp=sharing
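For illustration, a minimal sketch of how the nominatedNodeName signal could be consumed on the scale-down side (a hypothetical helper, not part of the current cluster-autoscaler; it assumes only the existing status.nominatedNodeName field):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// hasNominatedPods reports whether any pending pod has been nominated onto
// nodeName by the scheduler. While a pod is still binding (e.g. waiting for
// PV provisioning in VolumeBinding), spec.nodeName is empty, but under the
// proposed approach status.nominatedNodeName would already point at the
// target node, so scale-down could skip removing it.
func hasNominatedPods(nodeName string, pendingPods []*corev1.Pod) bool {
	for _, pod := range pendingPods {
		if pod.Spec.NodeName == "" && pod.Status.NominatedNodeName == nodeName {
			return true
		}
	}
	return false
}
```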
