
Scheduler pre-binding can cause race conditions with automated empty node removal #125491

Open
BigDarkClown opened this issue Jun 13, 2024 · 23 comments

Comments

@BigDarkClown
Contributor

What happened?

In a Google Kubernetes Engine (GKE) environment, a pod was requesting a large Persistent Volume Claim (PVC). After the appropriate node was identified for the pod, the pod became stuck in the prebinding stage for several minutes while the volume provisioning process completed. Since the node name was not assigned to the pod during this time, the Cluster Autoscaler perceived the node as unoccupied. Consequently, the Cluster Autoscaler initiated a scale-down of the node, unaware that the pending pod was scheduled to run there.

What did you expect to happen?

I would expect that the Scheduler would communicate the intended binding of the pod to the identified node. This would enable the Cluster Autoscaler to recognize that the node is not actually empty and prevent it from being scaled down prematurely.

How can we reproduce it (as minimally and precisely as possible)?

The issue arose in a large GKE cluster with pods requesting substantial PVCs, so reproducing it may be challenging. However, the race condition within the Scheduler is evident.

Anything else we need to know?

No response

Kubernetes version

1.27

Cloud provider

GKE

OS version

No response

Install tools

No response

Container runtime (CRI) and version (if applicable)

No response

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

@BigDarkClown BigDarkClown added the kind/bug Categorizes issue or PR as related to a bug. label Jun 13, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 13, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@neolit123
Member

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 13, 2024
@sanposhiho
Member

/sig autoscaling
/sig storage

@k8s-ci-robot k8s-ci-robot added sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Aug 6, 2024
@sanposhiho
Member

sanposhiho commented Aug 6, 2024

The scheduler tracks node status (e.g., how much free capacity each Node has) in its internal cache.
When a Pod is at the PreBind stage (i.e., in the binding cycle), the scheduler reserves the node for the Pod so that other Pods cannot take it while the Pod is being bound.
The problem is that this cache is internal to the scheduler and invisible to the cluster autoscaler.
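
To illustrate, here is a minimal sketch (not the actual scheduler code or API): during the binding cycle the pod is only "assumed" onto the node in the scheduler's in-memory cache, while pod.Spec.NodeName on the API server is still empty, so external components such as the Cluster Autoscaler still see the node as empty.

package sketch

import "sync"

// assumeCache is a hypothetical, simplified stand-in for the scheduler's internal cache.
// Once a pod is assumed onto a node, concurrent scheduling cycles in the same process
// treat that node's capacity as used, but the assumption is never written to the API
// server, so the Cluster Autoscaler cannot see it.
type assumeCache struct {
	mu      sync.Mutex
	assumed map[string]string // pod UID -> node name (in-memory only)
}

func (c *assumeCache) AssumePod(podUID, nodeName string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.assumed[podUID] = nodeName
}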

@sanposhiho
Member

sanposhiho commented Aug 6, 2024

I'm a scheduler person, not a cluster autoscaler expert, but AFAIK the cluster autoscaler can simulate scheduling so that it makes proper decisions. I'm wondering why the cluster autoscaler didn't understand that the Pod was going to be scheduled onto the Node. I'd expect it to, because otherwise, after removing the node, it has to create another Node for this unschedulable Pod; would it keep removing and creating Nodes until the Pod somehow gets successfully scheduled and bound?

@googs1025
Member

After binding, a Kubernetes event is emitted. I wonder if this could be used as a notification mechanism?

root@VM-0-15-ubuntu:/home/ubuntu# kubectl get event
LAST SEEN   TYPE     REASON      OBJECT       MESSAGE
4m18s       Normal   Scheduled   pod/my-pod   Successfully assigned default/my-pod to cluster1-worker
4m17s       Normal   Pulling     pod/my-pod   Pulling image "nginx:latest"
4m6s        Normal   Pulled      pod/my-pod   Successfully pulled image "nginx:latest" in 11.698s (11.698s including waiting)
4m6s        Normal   Created     pod/my-pod   Created container my-container
4m6s        Normal   Started     pod/my-pod   Started container my-container

refer to: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/schedule_one.go#L1008

@sanposhiho
Member

The Scheduled event is emitted after the binding cycle finishes (i.e., after PreBind).
Also, events shouldn't be used as an API between components.

@alculquicondor
Member

Maybe we could add the information to nominatedNodeName?

This issue sounds related to #126499, in which, upon restart, kube-scheduler decides to schedule a different pod from the one for which a PVC was already attached.

@BigDarkClown, do you think you could use nominatedNodeName? This is the same field we populate when a Pod is waiting for preemptions to complete on a particular node.

cc @dom4ha

@dom4ha
Member

dom4ha commented Sep 13, 2024

Let me share my first thoughts about the issue. Forgive my limited knowledge of the system; I'm still ramping up.

The race between the scheduling context and autoscaling (scale-down) seems pretty obvious. As more scheduling decisions become decoupled from their actuation (like #126858), the likelihood of acting on a stale decision grows.

I think there are two main approaches we could take to address issues like this:

  1. Detect wrong scheduling decisions without unnecessary delay and correct them in the next cycle
  2. Reduce the race-condition window between components that can interfere with each other to a minimum

I suspect that as clusters grow, so will the number of inconsistencies the scheduler may run into. The scheduler and the autoscaler are probably just one example; pod vertical scaling, for instance, might be a future one. So maybe we should consider exploring the first approach, which would cover potential future issues as well.

@alculquicondor
Member

Detect wrong scheduling decisions without unnecessary delay and correct them in the next cycle

@BigDarkClown IIUC, the situation eventually recovers, right? Meaning, once the PVC is detached, the scheduler is able to initiate a different node assignment. But I understand that it can take time for a PVC to be detached and so on.

Reduce the race-condition window between components that can interfere with each other to a minimum

I think the only way is for the scheduler to communicate its intent and that it is waiting for the PVC to be attached to a particular node.

OTOH, could CA pay attention to the PVCs that are getting attached? Or does the information need to be contained in the Pod itself?

@sanposhiho
Member

Maybe we could be adding the information to nominatedNodeName?

Looks like a good option. So we could set nominatedNodeName just after Permit and before PreBind.
Does CA already take nominatedNodeName into consideration? (I'd guess yes, because otherwise, when preemption happens and the Node frees up space, CA might scale the Node down, which it shouldn't.)
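
For concreteness, a minimal sketch (using a hypothetical helper, not the actual scheduler change) of what setting the field could look like: patch status.nominatedNodeName right after Permit, before PreBind starts the slow volume provisioning, so the intent is visible on the API server.

package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// nominateForBinding is a hypothetical helper: it records the chosen node in
// status.nominatedNodeName (the field preemption already uses) so that components
// watching pods, such as Cluster Autoscaler, can see the intent before binding completes.
func nominateForBinding(ctx context.Context, cs kubernetes.Interface, ns, podName, nodeName string) error {
	patch := []byte(fmt.Sprintf(`{"status":{"nominatedNodeName":%q}}`, nodeName))
	_, err := cs.CoreV1().Pods(ns).Patch(ctx, podName, types.StrategicMergePatchType, patch, metav1.PatchOptions{}, "status")
	return err
}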

@dom4ha
Member

dom4ha commented Sep 16, 2024

Do I understand correctly that the PVC provisioning process is part of the scheduling loop? Since it is a long-lasting process (it sounds heavier than the binding itself), why isn't it performed asynchronously as well?

In that regard, the preemption problem (being part of the scheduling loop) looks similar to this one, which means we could use the same mechanisms:

  1. reuse the nominatedNodeName mechanism for communicating the scheduler's intention for various types of long-lasting operations (as proposed by @alculquicondor)
  2. eventually reuse the async preemption design to perform PVC provisioning asynchronously as well

Is my understanding correct here?

@alculquicondor indeed, using nominatedNodeName can potentially address #126499 as well

@alculquicondor
Member

Do I understand correctly that PVC provisioning process is a part of the scheduling loop?

Actually, no. I double-checked, and the API call happens in the PreBind extension point:

// PreBind will make the API update with the assumed bindings and wait until
// the PV controller has completely finished the binding operation.
//
// If binding errors, times out or gets undone, then an error will be returned to
// retry scheduling.
func (pl *VolumeBinding) PreBind(ctx context.Context, cs *framework.CycleState, pod *v1.Pod, nodeName string) *framework.Status {

PreBind is part of the binding cycle: https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#interfaces

So this issue is not so much about performance, just communicating the decision to the apiserver (which, in turn, would make cluster-autoscaler aware of it). This can still be done in PreBind, I think.

@dom4ha
Member

dom4ha commented Sep 16, 2024

Ack, thanks for checking.

Following the analogy to the preemption process: why can't async preemption become part of the binding (PreBind) process instead of a separate goroutine (as in KEP-4832) that requires putting the preemptor back into the scheduling queue (unblocking it)? This way the nominatedNodeName update (which, IIUC, is an expensive external call) would also be made asynchronously instead of inside the scheduling loop.

@dom4ha
Member

dom4ha commented Sep 17, 2024

/assign

@alculquicondor
Member

Following the analogy to the preemption process: why can't async preemption become part of the binding (PreBind) process instead of a separate goroutine (as in KEP-4832) that requires putting the preemptor back into the scheduling queue (unblocking it)?

We don't initiate a binding process if the Pod is not ready to be bound, which is the case when a pod needs preemption. Handling preemption is not just a matter of issuing the preemption requests: preempted pods have to finish, respecting their graceful termination period. And while pods are gracefully terminating (which can take hours, depending on their configuration), the incoming pod could potentially be placed somewhere else, which is why it goes back to the queue.

Maybe what you are proposing is just another routine, which happens to be similar to the binding routine. This could potentially work. But I think we should move this discussion to kubernetes/enhancements#4833

@sanposhiho
Member

sanposhiho commented Sep 17, 2024

Maybe what you are proposing is just another routine

Yeah, that looks very similar to the idea mentioned in the Alternatives section of kubernetes/enhancements#4833 (creating a new extension point that runs in a goroutine).

@pbetkier
Contributor

pbetkier commented Oct 4, 2024

I observed a similar problem, described in kubernetes/autoscaler#7343. There I propose treating pods with PodScheduled=false and reason SchedulerError as unschedulable in cluster-autoscaler.

@BigDarkClown could you verify whether, in the issue you're describing, the pod status was changed to PodScheduled=false with reason SchedulerError by the scheduler at any point? If yes, I think a simple fix as proposed in kubernetes/autoscaler#7343 would be sufficient.

@dom4ha
Member

dom4ha commented Oct 4, 2024

@pbetkier I'm not sure whether the autoscaler detects all unschedulable pods correctly in kubernetes/autoscaler#7343, but I think the problem might still exist even if that weren't an issue.

I'm not sure how the simulation works, but is it possible that it gives different results than the real scheduler and causes the autoscaler to take wrong actions? Consider a scenario where there are 10 nodes to scale down and 1 Pod still waiting for PVC provisioning: how does the autoscaler know which node to keep?

Communicating scheduler decisions for all kinds of delayed-binding scenarios (preemption, PVC, or DRA) should be beneficial. Reusing status.nominatedNodeName for that seems like a good idea.

@dom4ha
Member

dom4ha commented Oct 24, 2024

It was confirmed in the code and in GKE logs that the Cluster Autoscaler does not consider in-flight pods in its internal simulations; therefore, the delayed binding that takes place in the PVC and DRA plugins causes the observed issues.

The proposal is to use nominatedNodeName to communicate the scheduler's binding decision at the beginning of the delayed binding process, reducing the window for this race to a minimum. There is a short doc describing the approach, so feel free to comment: https://docs.google.com/document/d/1aIhh4EsGNTfF6ZOnjZBTLJClhR-NXepWSClp2-oo26I/edit?usp=sharing

@x13n
Member

x13n commented Dec 2, 2024

Cluster Autoscaler already injects pods into its internal in-memory representation of the cluster state based on the nominated node name:

https://github.com/kubernetes/autoscaler/blob/4463c60361425106c742b77d3cdcbbc9abc4e9da/cluster-autoscaler/core/podlistprocessor/filter_out_expendable.go#L54-L65

Thanks to that, subsequent logic won't trigger a cluster scale-up for that pod and won't trigger a scale-down for the node, which is exactly what we want. I think this is a good improvement: a small change is required in the scheduler, and no change at all in Cluster Autoscaler.
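
Roughly, the idea behind the linked filter looks like this (a paraphrased sketch with assumed types, not the exact autoscaler code):

package sketch

import corev1 "k8s.io/api/core/v1"

// clusterSnapshot is a hypothetical, simplified stand-in for the autoscaler's
// in-memory representation of the cluster.
type clusterSnapshot interface {
	AddPodToNode(pod *corev1.Pod, nodeName string) error
}

// injectNominatedPods sketches the idea: any pending pod that already carries
// status.nominatedNodeName is placed onto that node in the snapshot, so later
// scale-up/scale-down logic sees the node as occupied.
func injectNominatedPods(snapshot clusterSnapshot, pending []*corev1.Pod) []*corev1.Pod {
	var stillPending []*corev1.Pod
	for _, p := range pending {
		if node := p.Status.NominatedNodeName; node != "" {
			if err := snapshot.AddPodToNode(p, node); err == nil {
				continue // accounted for on its nominated node
			}
		}
		stillPending = append(stillPending, p)
	}
	return stillPending
}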

@dom4ha
Member

dom4ha commented Dec 3, 2024

There is a problem with directly reusing the concept of nominatedNodeName for delayed-binding scenarios as originally proposed.

For the scheduler it makes a big difference whether a node is merely nominated or already chosen for binding (for instance, pod affinity is taken into account only after the binding decision has been made). Note that, to avoid accounting resources twice, the nominated node is cleared and the scheduler starts treating the pod as if it were already bound (we say the pod is assumed).

The scheduler communicates nominated nodes externally (in the preemption case), but only until pods are assumed (since nominations are cleared at that point). So outside of the scheduler, we don't know about assumed pods until the pod is actually bound (assigned). Typically that window is short (which is why this wasn't considered an issue in the past), but in certain cases it takes a long time.

Let's enumerate these three pod scheduling phases explicitly:

  1. Nomination: a preliminary reservation.
  2. Assumption: after the scheduler makes its scheduling decision; this phase can be long-lasting in some cases, as the scheduler might be running pre-bind operations (like PVC or DRA provisioning).
  3. Assignment: the actual binding, which is also a process, but a relatively short one.

I think we have two options:

  1. Extend the semantics of nominated nodes to cover both phases 1 and 2 (outside of the scheduler the distinction is not that important). It would be quite confusing within the scheduler, though, as some pods would be tracked as both assigned and nominated (a double-accounting problem).
  2. Introduce a concept similar to the nominated node, but dedicated specifically to the assumption phase (2).

The advantage of the second approach is that we keep the concepts of nominations and assumptions separate. Communicating them differently also makes sense, since the latter is much stronger: the preparation for binding has already been initiated (a PVC might be under provisioning). Communicating, and thus persisting, the fact that the binding process has started would also help with yet another class of issues, since the scheduler could stick to its own decision even after a failover.

@haosdent
Member

haosdent commented Dec 6, 2024

tracked as both assigned and nominated (a double-accounting problem)

Regarding the double-accounting issue, what is the impact?

That said, the second approach does seem clearer.
