volume-binding scheduler prefilter assumes that a node's metadata.name == metadata.labels["kubernetes.io/hostname"] #125336
Description
What happened?
Running on a system which has node names that look like FQDNs, but hostname labels which are unqualified.
The local path PV provisioner has (correctly) added nodeAffinity constraints to the PV that reference a node's hostname label.
A replacement pod for a StatefulSet that has a bound PVC cannot be rescheduled, because the scheduler interprets PreFilterResult.NodeNames as node names, but the code in volume_binding.go that runs the prefilter collects a set of kubernetes.io/hostname label values.
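The mismatch can be illustrated with a minimal, self-contained Go sketch. It does not use the scheduler framework; the `node` type, the `eligibleByName` helper, and the example node name are hypothetical stand-ins, chosen only to show what happens when a set of hostname label values is consumed as if it were a set of node names.

```go
package main

import "fmt"

// hostnameLabel is the well-known label key referenced by the PV's nodeAffinity.
const hostnameLabel = "kubernetes.io/hostname"

// node is a hypothetical stand-in for the two Node fields that matter here.
type node struct {
	Name   string            // metadata.name
	Labels map[string]string // metadata.labels
}

// eligibleByName mimics a consumer that treats the prefilter's set as node
// names: a node survives only if its metadata.name is in the set.
func eligibleByName(nodes []node, names map[string]bool) []string {
	var out []string
	for _, n := range nodes {
		if names[n.Name] {
			out = append(out, n.Name)
		}
	}
	return out
}

func main() {
	// Assumed example: FQDN-style node name, unqualified hostname label.
	nodes := []node{{
		Name:   "ip-10-0-0-1.example.internal",
		Labels: map[string]string{hostnameLabel: "ip-10-0-0-1"},
	}}

	// Values collected from the PV's nodeAffinity on the hostname label.
	fromPV := map[string]bool{"ip-10-0-0-1": true}

	// The label value never equals the node name, so no node is eligible.
	fmt.Println(len(eligibleByName(nodes, fromPV))) // prints 0
}
```

When the node name and the hostname label happen to be equal, the same lookup succeeds, which is why the problem stays hidden on most clusters.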
What did you expect to happen?
Pod rescheduling should not wedge. The volume-binding scheduler plugin should resolve match constraints to a set of nodes and return their node names in its PreFilterResult.
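A rough sketch of the expected behaviour, in plain Go rather than against the real scheduler framework: given the hostname label values taken from the PV's nodeAffinity, look up the nodes that carry one of those values and return their names. The `nodeMeta` type and the `resolveNodeNames` function are hypothetical and only illustrate the idea; this is not the actual plugin code or a proposed patch.

```go
package main

import (
	"fmt"
	"sort"
)

// hostnameLabel is the well-known label key referenced by the PV's nodeAffinity.
const hostnameLabel = "kubernetes.io/hostname"

// nodeMeta is a hypothetical stand-in for a Node's name and labels.
type nodeMeta struct {
	Name   string
	Labels map[string]string
}

// resolveNodeNames maps a set of hostname label values to the names of the
// nodes that carry one of those values. The result is sorted for stable output.
func resolveNodeNames(nodes []nodeMeta, hostnames map[string]bool) []string {
	names := []string{}
	for _, n := range nodes {
		host, ok := n.Labels[hostnameLabel]
		if ok && hostnames[host] {
			names = append(names, n.Name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	// Assumed example cluster: FQDN-style names, unqualified hostname labels.
	nodes := []nodeMeta{
		{Name: "ip-10-0-0-1.example.internal", Labels: map[string]string{hostnameLabel: "ip-10-0-0-1"}},
		{Name: "ip-10-0-0-2.example.internal", Labels: map[string]string{hostnameLabel: "ip-10-0-0-2"}},
	}
	fromPV := map[string]bool{"ip-10-0-0-1": true}

	fmt.Println(resolveNodeNames(nodes, fromPV)) // prints [ip-10-0-0-1.example.internal]
}
```

A set built this way contains node names, so a consumer that interprets PreFilterResult.NodeNames as node names would keep the matching node instead of filtering everything out.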
How can we reproduce it (as minimally and precisely as possible)?
Create a node with distinct name and hostname label [k8s documentation reiterates that this situation is possible]. Schedule a pod onto it with a local path PV bound. Observe the PV has a nodeAffinity constraint that contains the node's hostname label. Attempt to reschedule a pod to use this PV.
Precise behaviour may vary from 1.27 (which introduced this prefilter notion) onwards. On 1.27, the scheduler fails with a "nodeinfo not found" error. A workaround was backported into the prefilter loop of schedulePod, but AFAICT the root cause was never identified. Later versions appear to end up filtering out all nodes in schedulePod, but the root cause is the same in both cases.
Anything else we need to know?
No response
Kubernetes version
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.11", GitCommit:"b9e2ad67ad146db566be5a6db140d47e52c8adb2", GitTreeState:"clean", BuildDate:"2024-02-14T10:40:40Z", GoVersion:"go1.21.7", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.13-eks-3af4770", GitCommit:"4873544ec1ec7d3713084677caa6cf51f3b1ca6f", GitTreeState:"clean", BuildDate:"2024-04-30T03:31:44Z", GoVersion:"go1.21.9", Compiler:"gc", Platform:"linux/amd64"}
The nodes in question were older Ubuntu EKS images, but that's largely irrelevant; the critical point is that nodes are registered by kubelet with an FQDN name but a short hostname. (AFAICT newer Ubuntu EKS images will mask this behaviour by setting both of these to the same value, but the same erroneous assumption is still baked into volume-binding.)
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)
but the PV record that it creates is (IMO) correct; the matchExpression attempts to identify a node by its hostname label.