volume-binding scheduler prefilter assumes that a node's metadata.name == metadata.labels["kubernetes.io/hostname"] #125336
Comments
The behaviour for 1.27 is to bail out with a "nodeinfo not found" error.
/sig scheduling
/assign
🤦 I'll come up with a fix.
cc @gabesaba
I found a related (maybe) issue from a long time ago: #61410 (comment)
Seems we allowed distinct nodeName and hostname before. Not sure what the situation is now.
The situation is this: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#built-in-node-labels
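For concreteness, here is a minimal sketch (Go, using the k8s.io/api types; the node name and label value are invented for illustration) of the kind of Node object where `metadata.name` and the `kubernetes.io/hostname` label legitimately differ:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical node: the kubelet registered it under an FQDN, while the
	// kubernetes.io/hostname label carries the short host name.
	node := v1.Node{
		ObjectMeta: metav1.ObjectMeta{
			Name: "ip-10-0-0-1.eu-west-1.compute.internal",
			Labels: map[string]string{
				v1.LabelHostname: "ip-10-0-0-1",
			},
		},
	}
	// The two identifiers are not interchangeable, which is the crux of this issue.
	fmt.Println("node name:     ", node.Name)
	fmt.Println("hostname label:", node.Labels[v1.LabelHostname])
}
```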
If this is going to be changed I think it should be done via a proper deprecation cycle; there are clusters at the moment that are broken by the bug outlined in this issue. (I understand the desire here - originally k8s had a design constraint that clusters would have a maximum size of about 10k entities; under those circumstances, simple algorithms - ie,
For comparison, in the node affinity plugin, we only PreFilter nodes based on the
Maybe we should revert #109877? Or somehow limit it to field selectors, as opposed to label selectors.
I'm not sure what is being asked. Various cloud providers have a variety of methods of naming nodes (documented in kubernetes/website#8873, though that file appears to be gone now... maybe with the out-of-tree cloud provider move).

Are you asking if the way the kubelet sets the hostname label value should change? That will almost certainly break some existing users of the label on clouds where the current label value != the node name.

Are you asking if the API server should disallow the label being different than the node name? That will break nodes in environments where those differ.
I was asking about this. It sounds like we should be reverting #109877 then. Note that the cherry-pick deadline is tomorrow. I think that will be the last 1.27 patch release.
FYI @msau42 and @Huang-Wei as reviewers of that PR.
So, we do not consider converting hostname to nodeName in the prefilter for performance reasons?
How do you suggest we do it? If we are going to list all nodes to see which ones match, that might be similar to just letting Filter run.
You also have to consider what's feasible to cherry-pick. But now we are late for this patch release.
I think this will be a bit faster than running the filter as this is a quick check. But yea, maybe it's not worth it. No strong opinion here.
but that requires the prefilter step to have access to all nodes, which it doesn't today... that seems like a big change
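For illustration only, a rough sketch (not actual plugin code; the function name and signature are invented) of what resolving the collected hostname label values back to node names could look like if PreFilter did have access to the full node list:

```go
package volumebindingsketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/sets"
)

// hostnamesToNodeNames maps kubernetes.io/hostname label values gathered from
// PV node affinity back to node names. Doing this requires iterating every
// node in the cluster, which is the access PreFilter lacks today and roughly
// the same work Filter would do anyway.
func hostnamesToNodeNames(allNodes []*v1.Node, hostnames sets.Set[string]) sets.Set[string] {
	nodeNames := sets.New[string]()
	for _, node := range allNodes {
		if hostnames.Has(node.Labels[v1.LabelHostname]) {
			nodeNames.Insert(node.Name)
		}
	}
	return nodeNames
}
```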
Yes, this doesn't seem to align with the meaning of 'prefilter'. 😂 It looks like our only option is to revert #109877.
Agreed, let's revert now and we can take more time to discuss and see if there's a better solution.
I don't want to get in the way of mitigating this in the short term. Having said that, better approaches might include fixing up the local-path provisioner for these cases to provide a direct constraint, e.g. rancher/local-path-provisioner#414 and its attached PR.
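To make the "direct constraint" idea concrete: since PV node affinity reuses the generic NodeSelectorTerm type, one option could be to pin the PV to the node by `metadata.name` via matchFields instead of the hostname label. This is a hedged sketch only; I have not verified that the API server accepts matchFields in PV node affinity, and it is not necessarily what the linked provisioner PR does.

```go
package provisionersketch

import v1 "k8s.io/api/core/v1"

// directNodeConstraint builds a PV node-affinity term that selects the node by
// metadata.name rather than by the kubernetes.io/hostname label, sidestepping
// any mismatch between the two identifiers.
func directNodeConstraint(nodeName string) *v1.VolumeNodeAffinity {
	return &v1.VolumeNodeAffinity{
		Required: &v1.NodeSelector{
			NodeSelectorTerms: []v1.NodeSelectorTerm{{
				MatchFields: []v1.NodeSelectorRequirement{{
					Key:      "metadata.name", // the field key supported by node selector matchFields
					Operator: v1.NodeSelectorOpIn,
					Values:   []string{nodeName},
				}},
			}},
		},
	}
}
```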
I have created #125398 as the fix for the recent releases, which reverts #109877 and adds a unit test for this. In the long run, I think we need to proceed with the following steps:
Sounds reasonable?
/sig storage
/triage accepted
Revert scheduler GetEligibleNodes optimization from issue kubernetes#125336
What happened?
Running on a system which has node names that look like FQDNs, but hostname labels which are unqualified. The local path PV provisioner has (correctly) added nodeAffinity constraints to the PV that reference a node's `hostname` label. A replacement pod for a statefulset that has a bound PVC cannot be re-scheduled, because the scheduler interprets `PreFilterResult.NodeNames` as node names, but the code in volume_binding.go that runs the prefilter collects a set of kubernetes.io/hostname label values.

What did you expect to happen?
Pod rescheduling should not wedge. The volume-binding scheduler plugin should resolve match constraints to a set of nodes and return their node names in its PreFilterResult.
How can we reproduce it (as minimally and precisely as possible)?
1. Create a node with a distinct name and hostname label (the k8s documentation reiterates that this situation is possible).
2. Schedule a pod onto it with a local path PV bound.
3. Observe that the PV has a nodeAffinity constraint that contains the node's hostname label.
4. Attempt to reschedule a pod to use this PV.
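For reference, a hedged sketch (Go, k8s.io/api types; names and values are made up) of the shape of PV node affinity the reproduction produces, keyed on the hostname label value rather than the node name:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// The PV is pinned to the node via its kubernetes.io/hostname label value
	// ("node-1"), while the node's metadata.name is the FQDN
	// "node-1.example.internal". The volume-binding prefilter then returns
	// "node-1" as if it were a node name, and no node matches.
	pv := v1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{Name: "example-local-pv"},
		Spec: v1.PersistentVolumeSpec{
			NodeAffinity: &v1.VolumeNodeAffinity{
				Required: &v1.NodeSelector{
					NodeSelectorTerms: []v1.NodeSelectorTerm{{
						MatchExpressions: []v1.NodeSelectorRequirement{{
							Key:      v1.LabelHostname,
							Operator: v1.NodeSelectorOpIn,
							Values:   []string{"node-1"},
						}},
					}},
				},
			},
		},
	}
	fmt.Println(pv.Name, pv.Spec.NodeAffinity.Required.NodeSelectorTerms[0].MatchExpressions[0].Values)
}
```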
Precise behaviour may vary from 1.27 (which introduced this prefilter notion) onwards. On 1.27, the scheduler fails with a "nodeinfo not found" error. A workaround was backported into the prefilter loop of `schedulePod`, but AFAICT the root cause was never identified. Later versions look to end up filtering out all nodes in `schedulePod`, but the root cause is the same in both cases.

Anything else we need to know?
No response
Kubernetes version
The nodes in question were older Ubuntu EKS images, but that's largely irrelevant; the critical point is that nodes are registered by the kubelet with an FQDN name but a short hostname label. (AFAICT newer Ubuntu EKS images will mask this behaviour by setting both of these to the same value, but the same erroneous assumption is still baked into volume-binding.)
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)
but the PV record that it creates is (IMO) correct; the matchExpression attempts to identify a node by its hostname label.