
cluster-autoscaler: prevent scale-down of nodes for which VolumeBinding failed #7343

Open
pbetkier opened this issue Oct 4, 2024 · 5 comments
Labels
area/cluster-autoscaler kind/feature Categorizes issue or PR as related to a new feature.

Comments

@pbetkier
Contributor

pbetkier commented Oct 4, 2024

Which component are you using?: cluster-autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

Background:
Scale-up logic finds unschedulable pods to help, simulates scheduling them onto existing nodes, and performs a scale-up for the remaining unschedulable pods that weren't helped by the simulation. Scale-up logic saves the simulated bindings into the cluster snapshot. Scale-down logic then reuses these simulated pod bindings to avoid removing nodes that are simulated scheduling targets for pods.
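To make the bookkeeping concrete, here is a minimal, self-contained Go sketch of the flow described above (hypothetical names, not the actual cluster-autoscaler snapshot API): scale-up records which node each simulated pod landed on, and scale-down treats any node with simulated pods as non-removable.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// snapshot is a toy stand-in for the autoscaler's cluster snapshot: it only
// tracks which node each simulated (not yet bound) pod was placed on.
type snapshot struct {
	simulatedPods map[string][]*corev1.Pod // node name -> simulated pods
}

// addSimulatedPod records that the scale-up simulation placed pod onto node.
func (s *snapshot) addSimulatedPod(node string, pod *corev1.Pod) {
	if s.simulatedPods == nil {
		s.simulatedPods = map[string][]*corev1.Pod{}
	}
	s.simulatedPods[node] = append(s.simulatedPods[node], pod)
}

// isScaleDownCandidate is what scale-down would consult: a node that is a
// simulated target for some pod must not be removed.
func (s *snapshot) isScaleDownCandidate(node string) bool {
	return len(s.simulatedPods[node]) == 0
}
```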

Problem:
When a PVC takes a long time to provision, the scheduler's PreBind plugin VolumeBinding times out. The scheduler will retry scheduling the Pod, but until that happens the Pod's condition becomes PodScheduled=false with reason SchedulerError:

"Plugin failed" err="binding volumes: context deadline exceeded" plugin="VolumeBinding" pod="..." node="..."
"Error scheduling pod; retrying" err="running PreBind plugin \"VolumeBinding\": binding volumes: context deadline exceeded" pod="..."
"Updating pod condition" pod="..." conditionType="PodScheduled" conditionStatus="False" conditionReason="SchedulerError"

In the next loop cluster-autoscaler observes that the Pod is not unschedulable, doesn't simulate its scheduling on any node, and consequently the scale-down logic is not prevented from deleting a node that was previously scaled up for this pod.

Describe the solution you'd like.:
Pods with PodScheduled=false and reason SchedulerError should be treated as unschedulable by the scale-up logic. This seems safer than having the cluster-autoscaler ignore them: better to provision extra capacity than to scale down because of a possibly transient error.
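A minimal sketch of the proposed check (illustrative only, not the actual scale-up code; it assumes only the standard corev1 pod conditions API): a pod whose PodScheduled condition is False would be picked up by scale-up whether the reason is Unschedulable or SchedulerError.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// shouldTreatAsUnschedulable reports whether the pod failed scheduling either
// with the usual Unschedulable reason or with SchedulerError (e.g. a
// VolumeBinding PreBind timeout), so scale-up keeps simulating it and
// scale-down keeps its target node around.
func shouldTreatAsUnschedulable(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodScheduled && cond.Status == corev1.ConditionFalse {
			// "SchedulerError" is the reason observed in the logs above.
			return cond.Reason == corev1.PodReasonUnschedulable || cond.Reason == "SchedulerError"
		}
	}
	return false
}
```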

@pbetkier pbetkier added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 4, 2024
@pbetkier
Contributor Author

pbetkier commented Oct 4, 2024

@adrianmoisey
Member

/area cluster-autoscaler

@alculquicondor
Member

alculquicondor commented Oct 7, 2024

SchedulerError could happen due to other problems not related to volume binding, so the proposal isn't generally safe.

It might be better to use more precise information indicating that the Pod is meant to go to that particular node, such as the nominatedNodeName, as proposed in kubernetes/kubernetes#125491.

@dom4ha
Member

dom4ha commented Oct 15, 2024

I wonder if the timeout is the root cause of this issue or an effect of scaling down a Node that was under delayed binding. Is it possible to verify that?

In the next loop cluster-autoscaler observes that the Pod is not unschedulable, doesn't simulate its scheduling on any node, and consequently the scale-down logic is not prevented from deleting a node that was previously scaled up for this pod.

As we discussed, the pods which are not unschedulable (for instance the ones waiting for PV provisioning) are not considered in the simulation (they are invisible to the Autoscaler), so maybe this Pod wasn't the reason for the scale-up, but just a victim of an unrelated scale-down?

Use of nominatedNodeName could actually fill that gap and make the Autoscaler aware of the Pods that are, for whatever reason, waiting for binding.

@dom4ha
Member

dom4ha commented Oct 24, 2024

It was confirmed in the code and in GKE logs that the Autoscaler does not consider in-flight pods in its internal simulations and that delayed binding can cause the observed issues.

Use of the nominatedNodeName to communicate the scheduler's binding decision should reduce the risk of this race to a minimum. There is a short doc describing the approach, so feel free to comment: https://docs.google.com/document/d/1aIhh4EsGNTfF6ZOnjZBTLJClhR-NXepWSClp2-oo26I/edit?usp=sharing
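For illustration, a minimal sketch of how the nominatedNodeName signal could be consumed on the scale-down side (a hypothetical helper, not part of the current cluster-autoscaler; it assumes only the existing status.nominatedNodeName field):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// hasNominatedPods reports whether any pending pod has been nominated onto
// nodeName by the scheduler. While a pod is still binding (e.g. waiting for
// PV provisioning in VolumeBinding), spec.nodeName is empty, but under the
// proposed approach status.nominatedNodeName would already point at the
// target node, so scale-down could skip removing it.
func hasNominatedPods(nodeName string, pendingPods []*corev1.Pod) bool {
	for _, pod := range pendingPods {
		if pod.Spec.NodeName == "" && pod.Status.NominatedNodeName == nodeName {
			return true
		}
	}
	return false
}
```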
