volume-binding scheduler prefilter assumes that a node's metadata.name == metadata.labels["kubernetes.io/hostname"] #125336

Open
@jan-g

Description

What happened?

We are running on a system whose node names look like FQDNs but whose hostname labels are unqualified.

The local path PV provisioner has (correctly) added nodeAffinity constraints to the PV that reference a node's hostname label.

A replacement pod for a statefulset that has a bound PVC cannot be rescheduled. The scheduler interprets PreFilterResult.NodeNames as node names, but the code in volume_binding.go that runs the prefilter collects a set of kubernetes.io/hostname label values instead.
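The mismatch can be illustrated with a minimal sketch. This is not the scheduler's code: the `node` type, the `filterByNodeNames` helper, and the example node name are all hypothetical, and the only assumption carried over from the report is that a set of hostname label values ends up being compared against node names.

```go
package main

import "fmt"

// node is a hypothetical stand-in for the two Node fields that matter here.
type node struct {
	Name     string // metadata.name
	Hostname string // metadata.labels["kubernetes.io/hostname"]
}

// filterByNodeNames keeps only the nodes whose Name is in the given set.
// It models a caller that treats the set as node names.
func filterByNodeNames(nodes []node, names map[string]bool) []node {
	var out []node
	for _, n := range nodes {
		if names[n.Name] {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	// Illustrative node: FQDN-style name, unqualified hostname label.
	nodes := []node{{Name: "worker-1.example.internal", Hostname: "worker-1"}}

	// Values taken from the PV's nodeAffinity, which refers to the hostname label.
	fromPV := map[string]bool{"worker-1": true}

	// The set holds label values, so no node name matches it.
	fmt.Println(len(filterByNodeNames(nodes, fromPV))) // prints 0
}
```

When the name and the label happen to be equal, the same lookup succeeds, which is why the problem stays hidden on most clusters.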

What did you expect to happen?

Pod rescheduling should not wedge. The volume-binding scheduler plugin should resolve match constraints to a set of nodes and return their node names in its PreFilterResult.
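One way to picture the expected behaviour is the sketch below. It is an illustration under stated assumptions, not a proposed patch: `node`, `resolveNodeNames`, and the example values are hypothetical, and the real plugin would have to evaluate full nodeAffinity terms rather than a flat set of hostname values.

```go
package main

import (
	"fmt"
	"sort"
)

// hostnameLabel is the well-known label key referenced by the PV's nodeAffinity.
const hostnameLabel = "kubernetes.io/hostname"

// node is a hypothetical stand-in for a Node object.
type node struct {
	Name   string            // metadata.name
	Labels map[string]string // metadata.labels
}

// resolveNodeNames returns the names of the nodes whose hostname label is in
// the given set, so that the result can safely be used as a set of node names.
func resolveNodeNames(nodes []node, hostnames map[string]bool) []string {
	var names []string
	for _, n := range nodes {
		if hostnames[n.Labels[hostnameLabel]] {
			names = append(names, n.Name)
		}
	}
	sort.Strings(names) // deterministic order for the example
	return names
}

func main() {
	nodes := []node{
		{Name: "worker-1.example.internal", Labels: map[string]string{hostnameLabel: "worker-1"}},
		{Name: "worker-2.example.internal", Labels: map[string]string{hostnameLabel: "worker-2"}},
	}
	fmt.Println(resolveNodeNames(nodes, map[string]bool{"worker-1": true}))
	// prints [worker-1.example.internal]
}
```

The design point is that the match is made on the label, while the value returned is the node's name.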

How can we reproduce it (as minimally and precisely as possible)?

Create a node with distinct name and hostname label [k8s documentation reiterates that this situation is possible]. Schedule a pod onto it with a local path PV bound. Observe the PV has a nodeAffinity constraint that contains the node's hostname label. Attempt to reschedule a pod to use this PV.

Precise behaviour may vary from 1.27 (which introduced this prefilter notion) forwards. On 1.27, the scheduler fails with a "nodeinfo not found" error. A workaround was backported into the prefilter loop of schedulePod, but AFAICT the root cause was never identified. Later versions appear to end up filtering out all nodes in schedulePod, but the root cause is the same in both cases.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.11", GitCommit:"b9e2ad67ad146db566be5a6db140d47e52c8adb2", GitTreeState:"clean", BuildDate:"2024-02-14T10:40:40Z", GoVersion:"go1.21.7", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.13-eks-3af4770", GitCommit:"4873544ec1ec7d3713084677caa6cf51f3b1ca6f", GitTreeState:"clean", BuildDate:"2024-04-30T03:31:44Z", GoVersion:"go1.21.9", Compiler:"gc", Platform:"linux/amd64"}

The nodes in question were older Ubuntu EKS images, but that's largely irrelevant; the critical point is that the nodes are registered by kubelet with an FQDN as the node name but a short hostname label. (AFAICT newer Ubuntu EKS images will mask this behaviour by setting both of these to the same value, but the same erroneous assumption is still baked into volume-binding.)

Cloud provider

EKS, 1.27 (at present)

OS version

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.8.0

but the PV record that it creates is (IMO) correct; the matchExpression attempts to identify a node by its hostname label.

Metadata

Labels

kind/bug: Categorizes issue or PR as related to a bug.
sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.
sig/storage: Categorizes an issue or PR as relevant to SIG Storage.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.