volume-binding scheduler prefilter assumes that a node's metadata.name == metadata.labels["kubernetes.io/hostname"] #125336

Open
jan-g opened this issue Jun 5, 2024 · 22 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

jan-g commented Jun 5, 2024

What happened?

Running on a system which has node names that look like FQDNs, but hostname labels which are unqualified.

The local path PV provisioner has (correctly) added nodeAffinity constraints to the PV that reference a node's hostname label.

A replacement pod for a StatefulSet that has a bound PVC cannot be re-scheduled, because the scheduler interprets PreFilterResult.NodeNames as node names, but the prefilter code in volume_binding.go collects a set of kubernetes.io/hostname label values instead.
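A minimal sketch of the mismatch (the struct, helper names, and node name here are hypothetical illustrations, not the real v1.Node type or scheduler code): the prefilter returns hostname label values, but the framework intersects them with metadata.name values, so a node whose two identifiers differ is always filtered out.

```go
package main

import "fmt"

// node is a hypothetical minimal model of the two identifiers involved in
// this bug; it is not the real v1.Node type.
type node struct {
	name          string // metadata.name, e.g. an FQDN on some clouds
	hostnameLabel string // metadata.labels["kubernetes.io/hostname"]
}

// buggyPreFilter mimics the volume-binding prefilter described above: it
// returns the set of hostname *label values* allowed by the PV's nodeAffinity.
func buggyPreFilter(n node) map[string]bool {
	return map[string]bool{n.hostnameLabel: true}
}

// schedulable mimics the framework, which treats PreFilterResult.NodeNames as
// metadata.name values and intersects them with each candidate node's name.
func schedulable(n node) bool {
	return buggyPreFilter(n)[n.name]
}

func main() {
	n := node{
		name:          "ip-10-0-0-1.eu-west-1.compute.internal",
		hostnameLabel: "ip-10-0-0-1",
	}
	// false: the only node that could host the PV is filtered out.
	fmt.Println(schedulable(n))
}
```

When name and label coincide (as on many clouds), the bug is invisible; it only bites on clusters where kubelet registers an FQDN name alongside a short hostname label.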

What did you expect to happen?

Pod rescheduling should not wedge. The volume-binding scheduler plugin should resolve match constraints to a set of nodes and return their node names in its PreFilterResult.

How can we reproduce it (as minimally and precisely as possible)?

Create a node with distinct name and hostname label [k8s documentation reiterates that this situation is possible]. Schedule a pod onto it with a local path PV bound. Observe the PV has a nodeAffinity constraint that contains the node's hostname label. Attempt to reschedule a pod to use this PV.

Precise behaviour may vary from 1.27 (which introduced this prefilter notion) onwards. On 1.27, the scheduler fails with a "nodeinfo not found" error. A workaround was backported into the prefilter loop of schedulePod, but AFAICT the root cause was never identified. Later versions appear to end up filtering out all nodes in schedulePod - but the root cause is the same in both cases.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.11", GitCommit:"b9e2ad67ad146db566be5a6db140d47e52c8adb2", GitTreeState:"clean", BuildDate:"2024-02-14T10:40:40Z", GoVersion:"go1.21.7", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.13-eks-3af4770", GitCommit:"4873544ec1ec7d3713084677caa6cf51f3b1ca6f", GitTreeState:"clean", BuildDate:"2024-04-30T03:31:44Z", GoVersion:"go1.21.9", Compiler:"gc", Platform:"linux/amd64"}

The nodes in question were older Ubuntu EKS images, but that's largely irrelevant; the critical point is that kubelet registers the nodes with an FQDN name but a short hostname label. (AFAICT newer Ubuntu EKS images mask this behaviour by setting both to the same value, but the same erroneous assumption is still baked into volume-binding.)

Cloud provider

EKS, 1.27 (at present)

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.8.0

but the PV record that it creates is (IMO) correct; the matchExpression attempts to identify a node by its hostname label.

@jan-g jan-g added the kind/bug Categorizes issue or PR as related to a bug. label Jun 5, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 5, 2024
jan-g commented Jun 5, 2024

The behaviour for 1.27 is to bail out with a "nodeinfo not found" error; the result with the master branch looks less obvious.

AxeZhan commented Jun 5, 2024

/sig scheduling

/assign

🤦. I'll come up with a fix.

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 5, 2024
alculquicondor commented:

cc @gabesaba

alculquicondor commented:

Actually, should we support this?

Is this one of the validations we are lacking? cc @liggitt @thockin

AxeZhan commented Jun 6, 2024

I found a related(maybe) issue from a long time ago: #61410 (comment)

That is not a valid assumption. Node name and hostname are distinct and are definitively known to be different in some clusters.

Seems we allow distinct nodeName and hostname before. Not sure what the situation is now.

jan-g commented Jun 6, 2024

The situation is this: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#built-in-node-labels

The value of these labels is cloud provider specific and is not guaranteed to be reliable. For example, the value of kubernetes.io/hostname may be the same as the node name in some environments and a different value in other environments.

If this is going to be changed I think it should be done via a proper deprecation cycle; there are clusters at the moment that are broken by the bug outlined in this issue.

(I understand the desire here - originally k8s had a design constraint that clusters would have a maximum size of about 10k entities; under those circumstances, simple algorithms - i.e., for loops over everything - work well. Some of the commentary about this prefilter mentioned testing under simulation with 15k nodes and 60k pods; I think that was one of the drivers to include this shortcut. As I understand it there's still not really a first-class notion of attribute indexing in k8s' data model; if there's a real need to move k8s from being suitable as a substrate for appliances to a large-scale general-purpose compute platform, then efficiently supporting attribute-based queries seems pretty critical, but I think that is beyond the scope of this issue.)

alculquicondor commented:

For comparison, in the node affinity plugin, we only PreFilter nodes based on the metadata.name field selector:

if r.Key == metav1.ObjectNameField && r.Operator == v1.NodeSelectorOpIn {

Maybe we should revert #109877?

Or somehow limit it to field selectors, as opposed to label selectors.

liggitt commented Jun 6, 2024

Actually, should we support this?

Is this one of the validations we are lacking? cc @liggitt @thockin

I'm not sure what is being asked.

Various cloud providers have a variety of methods of naming nodes (documented in kubernetes/website#8873, though that file appears to be gone now... maybe with the out-of-tree cloud provider move)

Are you asking if the way the kubelet sets the hostname label value should change? That will almost certainly break some existing users of the label on clouds where the current label value != the node name.

Are you asking if the API server should disallow the label being different than the node name? That will break nodes in environments where those differ.

alculquicondor commented:

Are you asking if the API server should disallow the label being different than the node name? That will break nodes in environments where those differ.

I was asking about this.

It sounds like we should be reverting #109877 then.

Note that the cherry-pick deadline is tomorrow. I think that will be the last 1.27 patch release.

alculquicondor commented:

FYI @msau42 and @Huang-Wei as reviewers of that PR.

AxeZhan commented Jun 7, 2024

It sounds like we should be reverting #109877 then.

So, we do not consider converting hostname to nodeName in the prefilter for performance reasons?

alculquicondor commented:

So, we do not consider converting hostname to nodeName in the prefilter for performance reasons?

How do you suggest we do it? If we are going to list all nodes to see which ones match, that might be similar to just letting Filter run.

alculquicondor commented:

You also have to consider what's feasible to cherry-pick. But now we are late for this patch release.

AxeZhan commented Jun 7, 2024

If we are going to list all nodes to see which ones match, that might be similar to just letting Filter run.

I think this will be a bit faster than running the filter as this is a quick check. But yea, maybe it's not worth it. No strong opinion here.

liggitt commented Jun 7, 2024

If we are going to list all nodes to see which ones match, that might be similar to just letting Filter run.

I think this will be a bit faster than running the filter as this is a quick check. But yea, maybe it's not worth it. No strong opinion here.

but that requires the prefilter step to have access to all nodes, which it doesn't today... that seems like a big change

AxeZhan commented Jun 7, 2024

Yes, this doesn't seem to align with the meaning of 'prefilter'. 😂

It looks like our only option is to revert #109877.

msau42 commented Jun 7, 2024

Agree let's revert now and we can take more time to discuss and see if there's a better solution.

jan-g commented Jun 7, 2024

I don't want to get in the way of mitigating this in the short term. Having said that, better approaches might include fixing up the local-path provisioner for these cases to provide a direct constraint - e.g. rancher/local-path-provisioner#414 and its attached PR.

AxeZhan commented Jun 8, 2024

I have created #125398 as the fix for the recent releases; it reverts #109877 and adds a unit test for this.

In the long run, I think we need to proceed with the following steps:

  1. Change the way we handle a PV's nodeAffinity, taking the node's name and the affinity's matchFields into consideration. The current code builds a node with labels only:

	func CheckNodeAffinity(pv *v1.PersistentVolume, nodeLabels map[string]string) error {
		if pv.Spec.NodeAffinity == nil {
			return nil
		}
		if pv.Spec.NodeAffinity.Required != nil {
			node := &v1.Node{ObjectMeta: metav1.ObjectMeta{Labels: nodeLabels}}
			terms := pv.Spec.NodeAffinity.Required
			if matches, err := corev1.MatchNodeSelectorTerms(node, terms); err != nil {
				return err
			} else if !matches {
				return fmt.Errorf("no matching NodeSelectorTerms")
			}
		}
		return nil
	}

     It should also set the node's name, so that matchFields on metadata.name can be evaluated:

	if pv.Spec.NodeAffinity.Required != nil {
		node := &v1.Node{ObjectMeta: metav1.ObjectMeta{
			Name:   nodeName,
			Labels: nodeLabels,
		}}
		terms := pv.Spec.NodeAffinity.Required
		matches, err := corev1.MatchNodeSelectorTerms(node, terms)
		...
	}

  2. Update the documentation to recommend that users and provisioners use matchFields with metadata.name to define affinity, instead of matchExpressions with the hostname label.

  3. Finally, bring back "scheduler volumebinding: leverage PreFilterResult for bound local PVs" #109877, but extract matchFields with metadata.name from the PV's affinity instead of the hostname label. The current extraction looks at MatchExpressions on the hostname label:

	func GetLocalPersistentVolumeNodeNames(pv *v1.PersistentVolume) []string {
		if pv == nil || pv.Spec.NodeAffinity == nil || pv.Spec.NodeAffinity.Required == nil {
			return nil
		}
		var result sets.Set[string]
		for _, term := range pv.Spec.NodeAffinity.Required.NodeSelectorTerms {
			var nodes sets.Set[string]
			for _, matchExpr := range term.MatchExpressions {
				if matchExpr.Key == v1.LabelHostname && matchExpr.Operator == v1.NodeSelectorOpIn {
					if nodes == nil {
						nodes = sets.New(matchExpr.Values...)
					} else {
						nodes = nodes.Intersection(sets.New(matchExpr.Values...))
					}
				}
			}
			result = result.Union(nodes)
		}
		return sets.List(result)
	}

     It would instead iterate over MatchFields:

	for _, term := range pv.Spec.NodeAffinity.Required.NodeSelectorTerms {
		var nodes sets.Set[string]
		for _, matchExpr := range term.MatchFields {
			if matchExpr.Key == "metadata.name" && matchExpr.Operator == v1.NodeSelectorOpIn {
				if nodes == nil {
					nodes = sets.New(matchExpr.Values...)
				} else {
					nodes = nodes.Intersection(sets.New(matchExpr.Values...))
				}
			}
		}
		result = result.Union(nodes)
	}

Sounds reasonable?
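The union-of-intersections logic in that extraction can be sketched standalone (a hedged illustration: the `matchExpr`, `term`, and `nodeNamesFromTerms` names are hypothetical, and plain maps stand in for k8s.io/apimachinery's sets.Set so it runs without k8s dependencies):

```go
package main

import (
	"fmt"
	"sort"
)

// matchExpr is a hypothetical stand-in for a NodeSelectorRequirement,
// restricted to the fields used here.
type matchExpr struct {
	Key      string
	Operator string
	Values   []string
}

// term is a hypothetical stand-in for a NodeSelectorTerm's MatchFields list.
type term struct {
	MatchFields []matchExpr
}

// nodeNamesFromTerms applies the proposed rule: within one term, intersect the
// values of all "metadata.name" In-expressions; across terms, union the results.
func nodeNamesFromTerms(terms []term) []string {
	result := map[string]bool{}
	for _, t := range terms {
		var nodes map[string]bool
		for _, e := range t.MatchFields {
			if e.Key != "metadata.name" || e.Operator != "In" {
				continue
			}
			vals := map[string]bool{}
			for _, v := range e.Values {
				vals[v] = true
			}
			if nodes == nil {
				nodes = vals
			} else {
				// Intersect: drop names not present in this expression.
				for n := range nodes {
					if !vals[n] {
						delete(nodes, n)
					}
				}
			}
		}
		for n := range nodes {
			result[n] = true
		}
	}
	out := make([]string, 0, len(result))
	for n := range result {
		out = append(out, n)
	}
	sort.Strings(out)
	return out
}

func main() {
	terms := []term{
		{MatchFields: []matchExpr{
			{Key: "metadata.name", Operator: "In", Values: []string{"node-a.internal", "node-b.internal"}},
			{Key: "metadata.name", Operator: "In", Values: []string{"node-b.internal", "node-c.internal"}},
		}},
		{MatchFields: []matchExpr{
			{Key: "metadata.name", Operator: "In", Values: []string{"node-d.internal"}},
		}},
	}
	fmt.Println(nodeNamesFromTerms(terms)) // [node-b.internal node-d.internal]
}
```

Because matchFields carry metadata.name directly, the values this returns really are node names, so handing them to PreFilterResult.NodeNames would be sound regardless of how the hostname label is set.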

AxeZhan commented Jun 8, 2024

/sig storage

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Jun 8, 2024
xing-yang commented:

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 26, 2024
domenicbozzuto added a commit to DataDog/kubernetes that referenced this issue Sep 6, 2024
Revert scheduler GetEligibleNodes optimization from issue kubernetes#125336