
Scheduler pre-binding can cause race conditions with automated empty node removal #125491

Open
BigDarkClown opened this issue Jun 13, 2024 · 23 comments

Comments

@BigDarkClown
Contributor

What happened?

In a Google Kubernetes Engine (GKE) environment, a pod was requesting a large Persistent Volume Claim (PVC). After the appropriate node was identified for the pod, the pod became stuck in the prebinding stage for several minutes while the volume provisioning process completed. Since the node name was not assigned to the pod during this time, the Cluster Autoscaler perceived the node as unoccupied. Consequently, the Cluster Autoscaler initiated a scale-down of the node, unaware that the pending pod was scheduled to run there.

What did you expect to happen?

I would expect that the Scheduler would communicate the intended binding of the pod to the identified node. This would enable the Cluster Autoscaler to recognize that the node is not actually empty and prevent it from being scaled down prematurely.

How can we reproduce it (as minimally and precisely as possible)?

The issue arose in a large GKE cluster with pods requesting substantial PVCs, so reproducing it may be challenging. However, the race condition within the Scheduler is evident.

Anything else we need to know?

No response

Kubernetes version

1.27

Cloud provider

GKE

OS version

No response

Install tools

No response

Container runtime (CRI) and version (if applicable)

No response

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

@BigDarkClown BigDarkClown added the kind/bug Categorizes issue or PR as related to a bug. label Jun 13, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 13, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@neolit123
Member

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 13, 2024
@sanposhiho
Member

/sig autoscaling
/sig storage

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Aug 6, 2024
@sanposhiho
Member

sanposhiho commented Aug 6, 2024

The scheduler tracks node status (e.g., how much free capacity each Node has) in its internal cache.
When a Pod is at the PreBind stage (i.e., in the binding cycle), the scheduler reserves the node for the Pod so that other Pods cannot take it while the Pod is being bound.
The problem is that this cache is internal to the scheduler and invisible to the cluster autoscaler.
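
To illustrate, here is a minimal sketch (not the actual scheduler code or API): during the binding cycle the pod is only "assumed" onto the node in the scheduler's in-memory cache, while pod.Spec.NodeName on the API server is still empty, so external components such as the Cluster Autoscaler still see the node as empty.

package sketch

import "sync"

// assumeCache is a hypothetical, simplified stand-in for the scheduler's internal cache.
// Once a pod is assumed onto a node, concurrent scheduling cycles in the same process
// treat that node's capacity as used, but the assumption is never written to the API
// server, so the Cluster Autoscaler cannot see it.
type assumeCache struct {
	mu      sync.Mutex
	assumed map[string]string // pod UID -> node name (in-memory only)
}

func (c *assumeCache) AssumePod(podUID, nodeName string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.assumed[podUID] = nodeName
}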

@sanposhiho
Member

sanposhiho commented Aug 6, 2024

I'm a scheduler person, not a cluster autoscaler expert, but AFAIK the cluster autoscaler can simulate scheduling so that it makes proper decisions. I'm wondering why the cluster autoscaler didn't understand that the Pod was going to be scheduled onto the Node. I'd expect it to, because otherwise, after removing the node, it has to create another Node for this unschedulable Pod; would it keep removing and creating Nodes until the Pod somehow gets successfully scheduled and bound?

@googs1025
Member

After binding, a Kubernetes event is emitted. I wonder if this could be used as a notification mechanism?

root@VM-0-15-ubuntu:/home/ubuntu# kubectl get event
LAST SEEN   TYPE     REASON      OBJECT       MESSAGE
4m18s       Normal   Scheduled   pod/my-pod   Successfully assigned default/my-pod to cluster1-worker
4m17s       Normal   Pulling     pod/my-pod   Pulling image "nginx:latest"
4m6s        Normal   Pulled      pod/my-pod   Successfully pulled image "nginx:latest" in 11.698s (11.698s including waiting)
4m6s        Normal   Created     pod/my-pod   Created container my-container
4m6s        Normal   Started     pod/my-pod   Started container my-container

refer to: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/schedule_one.go#L1008

@sanposhiho
Member

The Scheduled event is emitted after the binding cycle finishes (i.e., after PreBind).
Also, events shouldn't be used as an API between components.

@alculquicondor
Member

Maybe we could add the information to nominatedNodeName?

This issue sounds related to #126499, in which, upon restart, kube-scheduler decides to schedule a different pod from the one for which a PVC was already attached.

@BigDarkClown, do you think you could use nominatedNodeName? This is the same field we populate when a Pod is waiting for preemptions to complete on a particular node.

cc @dom4ha

@dom4ha
Member

dom4ha commented Sep 13, 2024

Let me share my first thoughts about the issue. Forgive my limited knowledge of the system; I'm still ramping up.

The race between the scheduling context and autoscaling (scale-down) seems pretty obvious. As more scheduling decisions become decoupled from their actuation (like #126858), the likelihood of acting on a stale decision grows.

I think there are two main approaches we could take to address issues like this:

  1. Detect wrong scheduling decisions without unnecessary delay and correct them in the next cycle
  2. Reduce the race-condition window between components that can interfere with each other to a minimum

I suspect that as clusters grow, so will the number of inconsistencies the scheduler may run into. The scheduler and the autoscaler are probably just one example; pod vertical scaling, for instance, might be a future one. So maybe we should consider exploring the first approach, which would cover potential future issues as well.

@alculquicondor
Member

Detect wrong scheduling decisions without unnecessary delay and correct them in the next cycle

@BigDarkClown IIUC, the situation eventually recovers, right? Meaning, once the PVC is detached, the scheduler is able to initiate a different node assignment. But I understand that it can take time for a PVC to be detached and so on.

Reduce the race-condition window between components that can interfere with each other to a minimum

I think the only way is for the scheduler to communicate its intent and that it is waiting for the PVC to be attached to a particular node.

OTOH, could CA pay attention to the PVCs that are getting attached? Or does the information need to be contained in the Pod itself?

@sanposhiho
Member

Maybe we could be adding the information to nominatedNodeName?

Looks like a good option. So we could set nominatedNodeName just after Permit and before PreBind.
Does CA already take nominatedNodeName into consideration? (I'd guess yes, because otherwise, when preemption happens and the Node frees up space, CA might scale the Node down, which it shouldn't.)
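
For concreteness, a minimal sketch (using a hypothetical helper, not the actual scheduler change) of what setting the field could look like: patch status.nominatedNodeName right after Permit, before PreBind starts the slow volume provisioning, so the intent is visible on the API server.

package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// nominateForBinding is a hypothetical helper: it records the chosen node in
// status.nominatedNodeName (the field preemption already uses) so that components
// watching pods, such as Cluster Autoscaler, can see the intent before binding completes.
func nominateForBinding(ctx context.Context, cs kubernetes.Interface, ns, podName, nodeName string) error {
	patch := []byte(fmt.Sprintf(`{"status":{"nominatedNodeName":%q}}`, nodeName))
	_, err := cs.CoreV1().Pods(ns).Patch(ctx, podName, types.StrategicMergePatchType, patch, metav1.PatchOptions{}, "status")
	return err
}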

@dom4ha
Member

dom4ha commented Sep 16, 2024

Do I understand correctly that the PVC provisioning process is part of the scheduling loop? Since it is a long-lasting process (it sounds heavier than the binding itself), why isn't it performed asynchronously as well?

In that regard, the preemption problem (being part of the scheduling loop) looks similar to this one, which means we could use the same mechanisms:

  1. reuse the nominatedNodeName mechanism for communicating the scheduler's intention for various types of long-lasting operations (as proposed by @alculquicondor)
  2. eventually reuse the async preemption design to perform PVC provisioning asynchronously as well

Is my understanding correct here?

@alculquicondor indeed, using nominatedNodeName can potentially address #126499 as well

@alculquicondor
Member

Do I understand correctly that PVC provisioning process is a part of the scheduling loop?

Actually, no. I double-checked, and the API call happens in the PreBind extension point:

// PreBind will make the API update with the assumed bindings and wait until
// the PV controller has completely finished the binding operation.
//
// If binding errors, times out or gets undone, then an error will be returned to
// retry scheduling.
func (pl *VolumeBinding) PreBind(ctx context.Context, cs *framework.CycleState, pod *v1.Pod, nodeName string) *framework.Status {

PreBind is part of the binding cycle: https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#interfaces

So this issue is not so much about performance, just communicating the decision to the apiserver (which, in turn, would make cluster-autoscaler aware of it). This can still be done in PreBind, I think.

@dom4ha
Member

dom4ha commented Sep 16, 2024

Ack, thanks for checking.

Following the analogy to the preemption process: why can't async preemption become part of the binding (PreBind) process instead of a separate goroutine (as in KEP-4832) that requires putting the preemptor back into the scheduling queue (unblocking it)? This way the nominatedNodeName update (which, IIUC, is an expensive external call) would also be made asynchronously instead of inside the scheduling loop.

@dom4ha
Member

dom4ha commented Sep 17, 2024

/assign

@alculquicondor
Member

Following the analogy to the preemption process: why can't async preemption become part of the binding (PreBind) process instead of a separate goroutine (as in KEP-4832) that requires putting the preemptor back into the scheduling queue (unblocking it)?

We don't initiate a binding process if the Pod is not ready to be bound, which is the case when a pod needs preemption. Handling preemption is not just a matter of issuing the preemption requests: preempted pods have to finish, respecting their graceful termination period. And while pods are gracefully terminating (which can take hours, depending on their configuration), the incoming pod could potentially be placed somewhere else, which is why it goes back to the queue.

Maybe what you are proposing is just another routine, which happens to be similar to the binding routine. This could potentially work. But I think we should move this discussion to kubernetes/enhancements#4833

@sanposhiho
Member

sanposhiho commented Sep 17, 2024

Maybe what you are proposing is just another routine

Yeah, that looks very similar to the idea mentioned in the Alternatives section of kubernetes/enhancements#4833 (creating a new extension point that runs in a goroutine).

@pbetkier
Contributor

pbetkier commented Oct 4, 2024

I observed a similar problem, described in kubernetes/autoscaler#7343. There I propose treating pods with PodScheduled=false and reason SchedulerError as unschedulable in cluster-autoscaler.

@BigDarkClown could you verify whether, in the issue you're describing, the pod status was changed to PodScheduled=false with reason SchedulerError by the scheduler at any point? If yes, I think a simple fix as proposed in kubernetes/autoscaler#7343 would be sufficient.

@dom4ha
Member

dom4ha commented Oct 4, 2024

@pbetkier I'm not sure whether the autoscaler detects all unschedulable pods correctly in kubernetes/autoscaler#7343, but I think the problem might still exist even if that weren't an issue.

I'm not sure how the simulation works, but is it possible that it gives different results than the real scheduler and causes the autoscaler to take wrong actions? Consider a scenario where there are 10 nodes to scale down and 1 Pod still waiting for PVC provisioning: how does the autoscaler know which node to keep?

Communicating scheduler decisions for all kinds of delayed-binding scenarios (preemption, PVC, or DRA) should be beneficial. Reusing status.nominatedNodeName for that seems like a good idea.

@dom4ha
Member

dom4ha commented Oct 24, 2024

It was confirmed in the code and in GKE logs that the Cluster Autoscaler does not consider in-flight pods in its internal simulations; therefore, the delayed binding that takes place in the PVC and DRA plugins causes the observed issues.

The proposal is to use nominatedNodeName to communicate the scheduler's binding decision at the beginning of the delayed binding process, reducing the window for this race to a minimum. There is a short doc describing the approach, so feel free to comment: https://docs.google.com/document/d/1aIhh4EsGNTfF6ZOnjZBTLJClhR-NXepWSClp2-oo26I/edit?usp=sharing

@x13n
Member

x13n commented Dec 2, 2024

Cluster Autoscaler already injects pods into its internal in-memory representation of the cluster state based on the nominated node name:

https://github.com/kubernetes/autoscaler/blob/4463c60361425106c742b77d3cdcbbc9abc4e9da/cluster-autoscaler/core/podlistprocessor/filter_out_expendable.go#L54-L65

Thanks to that, subsequent logic won't trigger a cluster scale-up for that pod and won't trigger a scale-down for the node, which is exactly what we want. I think this is a good improvement: a small change is required in the scheduler, and no change at all in Cluster Autoscaler.
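
Roughly, the idea behind the linked filter looks like this (a paraphrased sketch with assumed types, not the exact autoscaler code):

package sketch

import corev1 "k8s.io/api/core/v1"

// clusterSnapshot is a hypothetical, simplified stand-in for the autoscaler's
// in-memory representation of the cluster.
type clusterSnapshot interface {
	AddPodToNode(pod *corev1.Pod, nodeName string) error
}

// injectNominatedPods sketches the idea: any pending pod that already carries
// status.nominatedNodeName is placed onto that node in the snapshot, so later
// scale-up/scale-down logic sees the node as occupied.
func injectNominatedPods(snapshot clusterSnapshot, pending []*corev1.Pod) []*corev1.Pod {
	var stillPending []*corev1.Pod
	for _, p := range pending {
		if node := p.Status.NominatedNodeName; node != "" {
			if err := snapshot.AddPodToNode(p, node); err == nil {
				continue // accounted for on its nominated node
			}
		}
		stillPending = append(stillPending, p)
	}
	return stillPending
}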

@dom4ha
Member

dom4ha commented Dec 3, 2024

There is a problem with directly reusing the concept of nominatedNodeName for delayed-binding scenarios as originally proposed.

For the scheduler it makes a big difference whether a node is merely nominated or already chosen for binding (for instance, pod affinity is taken into account only after the binding decision has been made). Note that, to avoid accounting resources twice, the nominated node is cleared and the scheduler starts treating the pod as if it were already bound (we say the pod is assumed).

The scheduler communicates nominated nodes externally (in the preemption case), but only until pods are assumed (since nominations are cleared at that point). So outside of the scheduler, we don't know about assumed pods until the pod is actually bound (assigned). Typically that window is short (which is why this wasn't considered an issue in the past), but in certain cases it takes a long time.

Let's enumerate these three pod scheduling phases explicitly:

  1. Nomination: a preliminary reservation.
  2. Assumption: after the scheduler makes its scheduling decision; this phase can be long-lasting in some cases, as the scheduler might be running pre-bind operations (like PVC or DRA provisioning).
  3. Assignment: the actual binding, which is also a process, but a relatively short one.

I think we have two options:

  1. Extend the semantics of nominated nodes to cover both phases 1 and 2 (outside of the scheduler the distinction is not that important). It would be quite confusing within the scheduler, though, as some pods would be tracked as both assigned and nominated (a double-accounting problem).
  2. Introduce a concept similar to the nominated node, but dedicated specifically to the assumption phase (2).

The advantage of the second approach is that we keep the concepts of nominations and assumptions separate. Communicating them differently also makes sense, since the latter is much stronger: the preparation for binding has already been initiated (a PVC might be under provisioning). Communicating, and thus persisting, the fact that the binding process has started would also help with yet another class of issues, since the scheduler could stick to its own decision even after a failover.

@haosdent
Member

haosdent commented Dec 6, 2024

tracked as both assigned and nominated (a double-accounting problem)

Regarding the double-accounting issue, what is the impact?

That said, the second approach does seem clearer.
