Scheduler pre-binding can cause race conditions with automated empty node removal #125491
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
/sig scheduling
/sig autoscaling
The scheduler tracks node status (e.g., how much free space each Node has) in its cache.
I'm a scheduler person and not an expert on the cluster autoscaler. But, AFAIK, the cluster autoscaler has the capability of simulating scheduling so that it can make a proper decision. I'm wondering why the cluster autoscaler didn't understand that the Pod was going to be scheduled onto the Node. I thought it should, because otherwise, after removing the node, it would have to create another Node again for this unschedulable Pod; would it keep removing -> creating a Node until the Pod somehow gets successfully scheduled/bound?
After binding, a k8s event will be sent. I wonder if this can be used as a notification method?
```
root@VM-0-15-ubuntu:/home/ubuntu# kubectl get event
LAST SEEN   TYPE     REASON      OBJECT       MESSAGE
4m18s       Normal   Scheduled   pod/my-pod   Successfully assigned default/my-pod to cluster1-worker
4m17s       Normal   Pulling     pod/my-pod   Pulling image "nginx:latest"
4m6s        Normal   Pulled      pod/my-pod   Successfully pulled image "nginx:latest" in 11.698s (11.698s including waiting)
4m6s        Normal   Created     pod/my-pod   Created container my-container
4m6s        Normal   Started     pod/my-pod   Started container my-container
```
Refer to: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/schedule_one.go#L1008
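For illustration, a minimal client-go sketch of that idea (not from the original comment; the namespace and in-cluster config are assumptions): it lists events with reason Scheduled, which the scheduler emits after a successful binding.
```go
// Sketch: listing "Scheduled" events with client-go, assuming in-cluster config.
// Hypothetical illustration of using binding events as a notification channel;
// not code from the scheduler or the cluster autoscaler.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Events emitted by the scheduler after a successful binding have reason "Scheduled".
	events, err := client.CoreV1().Events("default").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "reason=Scheduled",
	})
	if err != nil {
		panic(err)
	}
	for _, e := range events.Items {
		fmt.Printf("%s: %s\n", e.InvolvedObject.Name, e.Message)
	}
}
```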
Maybe we could be adding the information to nominatedNodeName. This issue sounds kind of related to #126499, in which, upon restart, kube-scheduler decides to schedule a different pod from the one that a PVC was already attached for. @BigDarkClown, do you think you could be using nominatedNodeName here? cc @dom4ha
Let me share my first thoughts about the issue. Forgive my limited knowledge of the system; I'm still ramping up. The race between the scheduling context and autoscaling (downsizing) seems pretty obvious. When more scheduling decisions are decoupled from their actuation (like #126858), there is a bigger likelihood of attempting to follow a wrong decision. I think there are two main approaches we could take to address issues like this:
1. Make the scheduler resilient to such inconsistencies, i.e. detect at actuation time that a decision is no longer valid and retry scheduling.
2. Communicate in-flight scheduling decisions to other components (like the Cluster Autoscaler) so they don't act against them.
I suspect that as clusters grow, there will be a growing number of inconsistencies the scheduler may run into. Scheduler vs. autoscaler is probably only one such example; pod vertical scaling, for instance, might be a potential future one, so maybe we should consider exploring the first approach, which should work for all potential future issues.
@BigDarkClown IIUC, the situation eventually recovers, right? Meaning, once the PVC is detached, the scheduler is able to initiate a different node assignment. But I understand that it can take time for a PVC to be detached and so on.
I think the only way is for the scheduler to communicate its intent and that it is waiting for the PVC to be attached to a particular node. OTOH, could CA pay attention to the PVCs that are getting attached? Or does the information need to be contained in the Pod itself?
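One possible signal for that, sketched below (a hypothetical helper, not Cluster Autoscaler code): as far as I know, during delayed binding the chosen node is recorded on the PVC via the volume.kubernetes.io/selected-node annotation, which an external component could read to learn where the pending pod is headed.
```go
// Sketch of the "watch the PVCs" idea. Hypothetical helper, not CA code.
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// selectedNodes returns PVC name -> node chosen by the scheduler, for PVCs
// that are still being provisioned for a particular node.
func selectedNodes(ctx context.Context, cs kubernetes.Interface, ns string) (map[string]string, error) {
	pvcs, err := cs.CoreV1().PersistentVolumeClaims(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	out := map[string]string{}
	for _, pvc := range pvcs.Items {
		// The selected-node annotation is set during delayed (WaitForFirstConsumer) binding.
		if node, ok := pvc.Annotations["volume.kubernetes.io/selected-node"]; ok {
			out[pvc.Name] = node
		}
	}
	return out, nil
}
```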
Looks like a good one. So, we can add that information to the Pod itself.
Do I understand correctly that the PVC provisioning process is part of the scheduling loop? Since it is a long-lasting process (it sounds heavier than the binding itself), why isn't it performed asynchronously as well? In that regard, the problem of preemption (being part of the scheduling loop) looks similar to this one, which means we could use the same mechanisms.
Is my understanding correct here?
@alculquicondor indeed, using nominatedNodeName sounds like it could help here.
Actually, no. I double-checked and the API call happens in kubernetes/pkg/scheduler/framework/plugins/volumebinding/volume_binding.go, lines 496 to 501 (at 3ab0c84). So this issue is not so much about performance, just about communicating the decision to the apiserver (which, in turn, would make cluster-autoscaler aware of it). This can still be done in PreBind, I think.
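As a rough sketch of what communicating the decision in PreBind could look like (a hypothetical helper, not the actual volumebinding plugin code; the function name and wiring are assumptions), the scheduler could patch status.nominatedNodeName before starting the slow PreBind work:
```go
// Hypothetical sketch: recording the scheduler's decision on the Pod before the
// (potentially long) PreBind work starts, by setting status.nominatedNodeName.
// Not the actual volumebinding plugin code, only an illustration of the idea.
package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// announceBindingDecision patches status.nominatedNodeName so that external
// components (e.g. the Cluster Autoscaler) can see where the pod is headed
// while PVC provisioning is still in progress.
func announceBindingDecision(ctx context.Context, cs kubernetes.Interface, ns, pod, node string) error {
	patch := []byte(fmt.Sprintf(`{"status":{"nominatedNodeName":%q}}`, node))
	_, err := cs.CoreV1().Pods(ns).Patch(ctx, pod, types.StrategicMergePatchType, patch, metav1.PatchOptions{}, "status")
	return err
}
```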
Ack, thanks for checking. Following the analogy to the preemption process: why can't async preemption become part of the binding (PreBind) process, instead of being a separate goroutine (in KEP-4832) that requires putting the preemptor back into the scheduling queue (unblocking it)?
/assign |
We don't initiate a binding process if the Pod is not ready to be bound, which is the case when a pod needs preemption. A pod needing preemption is not just a matter of issuing the preemptions: preempted pods have to finish, respecting their graceful termination period. And while pods are gracefully terminating (which can take even hours, depending on their configuration), the incoming pod could potentially be placed somewhere else, which is why it goes back to the queue. Maybe what you are proposing is just another routine, which happens to be similar to the binding routine. This could potentially work, but I think we should move this discussion to kubernetes/enhancements#4833.
Yeah, this looks very similar to the idea mentioned above.
I observed a similar problem, described in kubernetes/autoscaler#7343. There I propose how such pods should be treated. @BigDarkClown, could you verify what the pod status was changed to in the issue you're describing?
@pbetkier Not sure whether the autoscaler detects all unschedulable pods correctly in kubernetes/autoscaler#7343, but I think the problem here might still exist even if that weren't an issue. Not sure how the simulation works, but is it possible that it gives different results than the real scheduler and causes the autoscaler to take wrong actions? Let's consider a scenario where there are 10 nodes to turn down and 1 Pod still waiting for PVC provisioning. How does the autoscaler know which one to keep? Communicating scheduler decisions for all kinds of delayed binding scenarios (preemption, PVC or DRA) should be beneficial. Reusing the existing nominatedNodeName field seems like a natural way to do that.
It was confirmed in the code and in GKE logs that the Autoscaler does not consider in-flight pods in its internal simulations, so the delayed binding that takes place in the PVC and DRA plugins causes the observed issues. The proposal is to use nominatedNodeName to communicate the scheduler's binding decision at the beginning of the delayed binding process, to reduce the risk of this race to a minimum. There is a short doc describing the approach, so feel free to comment: https://docs.google.com/document/d/1aIhh4EsGNTfF6ZOnjZBTLJClhR-NXepWSClp2-oo26I/edit?usp=sharing
Cluster Autoscaler already injects pods into its internal in-memory representation of the cluster state based on the nominated node name. Thanks to that, the subsequent logic won't trigger a cluster scale-up for that pod and won't trigger a scale-down for the node, which is exactly what we want. I think this is a good improvement: a small change required in the scheduler, no change at all required in Cluster Autoscaler.
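A simplified sketch of that behavior (not the actual Cluster Autoscaler code; the function is hypothetical): when building the in-memory snapshot, a pod with status.nominatedNodeName set is counted against that node even though it is not bound yet, so the node is neither considered empty nor a scale-down candidate.
```go
// Simplified sketch of the snapshot-building idea above. Not CA code.
package snapshot

import corev1 "k8s.io/api/core/v1"

// assignPodsToNodes returns a map from node name to the pods that should be
// counted against that node, including not-yet-bound pods with a nomination.
func assignPodsToNodes(pods []*corev1.Pod) map[string][]*corev1.Pod {
	byNode := map[string][]*corev1.Pod{}
	for _, p := range pods {
		switch {
		case p.Spec.NodeName != "":
			// Already bound: count it against its node.
			byNode[p.Spec.NodeName] = append(byNode[p.Spec.NodeName], p)
		case p.Status.NominatedNodeName != "":
			// Not bound yet, but the scheduler has announced its decision:
			// reserve space on the nominated node.
			byNode[p.Status.NominatedNodeName] = append(byNode[p.Status.NominatedNodeName], p)
		}
	}
	return byNode
}
```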
There is a problem with directly reusing the concept of nominatedNodeName. For the scheduler it makes a big difference whether a node is merely nominated or already selected for binding (for instance, pod affinity is taken into account only after the binding decision has been made). Note that, to avoid accounting for resources twice, the nomination is cleared and the scheduler starts assuming the pod is already bound (we say the pod is assumed). The scheduler communicates nominated nodes externally (in the preemption case), but only until pods are assumed (as nominations are cleared at that point). So outside of the scheduler, we don't know about assumed pods until the pod is actually bound (assigned). In a typical case that is a short time (which is why this wasn't considered an issue in the past), but in certain cases it takes a long time. Let's enumerate these three different pod scheduling phases explicitly:
1. Nominated: a node has been proposed for the pod (e.g. after preemption), but the binding decision has not been made yet.
2. Assumed: the scheduler has decided on a node and started the binding process (PreBind, e.g. PVC provisioning), and the nomination has been cleared.
3. Bound (assigned): the binding completed and spec.nodeName is set.
I think we have two options:
1. Reuse nominatedNodeName to also communicate the binding decision, i.e. keep the nomination set once the pod becomes assumed, until it is actually bound.
2. Communicate the fact that the binding process has been initiated separately, keeping nominations and assumptions as distinct concepts.
The advantage of the second approach is that we keep the concepts of nominations and assumptions separate. Communicating them differently also makes sense, since the latter is much stronger: the preparation for binding has already been initiated (a PVC might be under provisioning). Communicating, and thus persisting, information about initiating the binding process would also allow solving yet another type of issue, as the scheduler could stick to its own decision even after a failover.
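To make the observability gap concrete, here is a hypothetical classification helper (not scheduler code): from outside the scheduler a pod can only be seen as bound or nominated; an assumed pod whose nomination has been cleared looks the same as a plain pending pod, which is what the second approach would address.
```go
// Sketch of the externally visible pod phases. Hypothetical helper, not scheduler code.
package phases

import corev1 "k8s.io/api/core/v1"

type Phase string

const (
	Bound     Phase = "bound"     // spec.nodeName is set
	Nominated Phase = "nominated" // status.nominatedNodeName is set
	Unknown   Phase = "unknown"   // could be pending or assumed (binding in progress)
)

// Classify reports what an external observer can infer from the Pod object alone.
func Classify(p *corev1.Pod) Phase {
	if p.Spec.NodeName != "" {
		return Bound
	}
	if p.Status.NominatedNodeName != "" {
		return Nominated
	}
	return Unknown // the "assumed" phase cannot be distinguished here
}
```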
Regarding the double accounting issue, what is the impact? That said, it is indeed clearer if the second approach is used.
What happened?
In a Google Kubernetes Engine (GKE) environment, a pod was requesting a large Persistent Volume Claim (PVC). After the appropriate node was identified for the pod, the pod became stuck in the prebinding stage for several minutes while the volume provisioning process completed. Since the node name was not assigned to the pod during this time, the Cluster Autoscaler perceived the node as unoccupied. Consequently, the Cluster Autoscaler initiated a scale-down of the node, unaware that the pending pod was scheduled to run there.
What did you expect to happen?
I would expect that the Scheduler would communicate the intended binding of the pod to the identified node. This would enable the Cluster Autoscaler to recognize that the node is not actually empty and prevent it from being scaled down prematurely.
How can we reproduce it (as minimally and precisely as possible)?
The issue arose in a large GKE cluster with pods requesting substantial PVCs, making replication potentially challenging. However, the race condition within the Scheduler is evident.
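A hedged reproduction sketch with client-go (the namespace, object names, size, and the slow-csi storage class are assumptions; it needs a WaitForFirstConsumer storage class backed by a slow provisioner, and a recent k8s.io/api where PVC resources use VolumeResourceRequirements):
```go
// Hedged reproduction sketch: create a large PVC on a WaitForFirstConsumer
// storage class plus a pod consuming it, so that volume provisioning happens
// during PreBind and the pod sits unbound for a while.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()
	sc := "slow-csi" // hypothetical WaitForFirstConsumer storage class

	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "big-claim", Namespace: "default"},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			StorageClassName: &sc,
			Resources: corev1.VolumeResourceRequirements{
				Requests: corev1.ResourceList{corev1.ResourceStorage: resource.MustParse("10Ti")},
			},
		},
	}
	if _, err := cs.CoreV1().PersistentVolumeClaims("default").Create(ctx, pvc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "pvc-consumer", Namespace: "default"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:         "app",
				Image:        "nginx:latest",
				VolumeMounts: []corev1.VolumeMount{{Name: "data", MountPath: "/data"}},
			}},
			Volumes: []corev1.Volume{{
				Name: "data",
				VolumeSource: corev1.VolumeSource{
					PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ClaimName: "big-claim"},
				},
			}},
		},
	}
	if _, err := cs.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	// While the PVC is being provisioned, spec.nodeName stays empty; an autoscaler
	// looking only at bound pods may consider the chosen node empty and remove it.
}
```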
Anything else we need to know?
No response
Kubernetes version
Cloud provider
OS version
No response
Install tools
No response
Container runtime (CRI) and version (if applicable)
No response
Related plugins (CNI, CSI, ...) and versions (if applicable)
No response