volume-binding scheduler prefilter assumes that a node's metadata.name == metadata.labels["kubernetes.io/hostname"] #125336

Open
jan-g opened this issue Jun 5, 2024 · 22 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

jan-g commented Jun 5, 2024

What happened?

Running on a system which has node names that look like FQDNs, but hostname labels which are unqualified.

The local path PV provisioner has (correctly) added nodeAffinity constraints to the PV that reference a node's hostname label.

A replacement pod for a StatefulSet that has a bound PVC cannot be re-scheduled, because the scheduler interprets PreFilterResult.NodeNames as node names, but the prefilter code in volume_binding.go collects a set of kubernetes.io/hostname label values instead.
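A minimal sketch of the mismatch (the struct, helper names, and node name here are hypothetical illustrations, not the real v1.Node type or scheduler code): the prefilter returns hostname label values, but the framework intersects them with metadata.name values, so a node whose two identifiers differ is always filtered out.

```go
package main

import "fmt"

// node is a hypothetical minimal model of the two identifiers involved in
// this bug; it is not the real v1.Node type.
type node struct {
	name          string // metadata.name, e.g. an FQDN on some clouds
	hostnameLabel string // metadata.labels["kubernetes.io/hostname"]
}

// buggyPreFilter mimics the volume-binding prefilter described above: it
// returns the set of hostname *label values* allowed by the PV's nodeAffinity.
func buggyPreFilter(n node) map[string]bool {
	return map[string]bool{n.hostnameLabel: true}
}

// schedulable mimics the framework, which treats PreFilterResult.NodeNames as
// metadata.name values and intersects them with each candidate node's name.
func schedulable(n node) bool {
	return buggyPreFilter(n)[n.name]
}

func main() {
	n := node{
		name:          "ip-10-0-0-1.eu-west-1.compute.internal",
		hostnameLabel: "ip-10-0-0-1",
	}
	// false: the only node that could host the PV is filtered out.
	fmt.Println(schedulable(n))
}
```

When name and label coincide (as on many clouds), the bug is invisible; it only bites on clusters where kubelet registers an FQDN name alongside a short hostname label.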

What did you expect to happen?

Pod rescheduling should not wedge. The volume-binding scheduler plugin should resolve match constraints to a set of nodes and return their node names in its PreFilterResult.

How can we reproduce it (as minimally and precisely as possible)?

Create a node with distinct name and hostname label [k8s documentation reiterates that this situation is possible]. Schedule a pod onto it with a local path PV bound. Observe the PV has a nodeAffinity constraint that contains the node's hostname label. Attempt to reschedule a pod to use this PV.

Precise behaviour may vary from 1.27 (which introduced this prefilter notion) onwards. On 1.27, the scheduler fails with a "nodeinfo not found" error. A workaround was backported into the prefilter loop of schedulePod, but AFAICT the root cause was never identified. Later versions appear to end up filtering out all nodes in schedulePod - but the root cause is the same in both cases.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.11", GitCommit:"b9e2ad67ad146db566be5a6db140d47e52c8adb2", GitTreeState:"clean", BuildDate:"2024-02-14T10:40:40Z", GoVersion:"go1.21.7", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.13-eks-3af4770", GitCommit:"4873544ec1ec7d3713084677caa6cf51f3b1ca6f", GitTreeState:"clean", BuildDate:"2024-04-30T03:31:44Z", GoVersion:"go1.21.9", Compiler:"gc", Platform:"linux/amd64"}

The nodes in question were older Ubuntu EKS images, but that's largely irrelevant; the critical point is that kubelet registers the nodes with an FQDN name but a short hostname label. (AFAICT newer Ubuntu EKS images mask this behaviour by setting both to the same value, but the same erroneous assumption is still baked into volume-binding.)

Cloud provider

EKS, 1.27 (at present)

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.8.0

but the PV record that it creates is (IMO) correct; the matchExpression attempts to identify a node by its hostname label.

@jan-g jan-g added the kind/bug Categorizes issue or PR as related to a bug. label Jun 5, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 5, 2024
jan-g commented Jun 5, 2024

The behaviour for 1.27 is to bail out with a "nodeinfo not found" error; the result with the master branch looks less obvious.

AxeZhan commented Jun 5, 2024

/sig scheduling

/assign

🤦. I'll come up with a fix.

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 5, 2024
alculquicondor commented:

cc @gabesaba

alculquicondor commented:

Actually, should we support this?

Is this one of the validations we are lacking? cc @liggitt @thockin

AxeZhan commented Jun 6, 2024

I found a related(maybe) issue from a long time ago: #61410 (comment)

That is not a valid assumption. Node name and hostname are distinct and are definitively known to be different in some clusters.

Seems we allow distinct nodeName and hostname before. Not sure what the situation is now.

jan-g commented Jun 6, 2024

The situation is this: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#built-in-node-labels

The value of these labels is cloud provider specific and is not guaranteed to be reliable. For example, the value of kubernetes.io/hostname may be the same as the node name in some environments and a different value in other environments.

If this is going to be changed I think it should be done via a proper deprecation cycle; there are clusters at the moment that are broken by the bug outlined in this issue.

(I understand the desire here - originally k8s had a design constraint that clusters would have a maximum size of about 10k entities; under those circumstances, simple algorithms - i.e., for loops over everything - work well. Some of the commentary about this prefilter mentioned testing under simulation with 15k nodes and 60k pods; I think that was one of the drivers to include this shortcut. As I understand it there's still not really a first-class notion of attribute indexing in k8s' data model; if there's a real need to move k8s from being suitable as a substrate for appliances to a large-scale general-purpose compute platform, then efficiently supporting attribute-based queries seems pretty critical, but I think that is beyond the scope of this issue.)

alculquicondor commented:

For comparison, in the node affinity plugin, we only PreFilter nodes based on the metadata.name field selector:

if r.Key == metav1.ObjectNameField && r.Operator == v1.NodeSelectorOpIn {

Maybe we should revert #109877?

Or somehow limit it to field selectors, as opposed to label selectors.

liggitt commented Jun 6, 2024

Actually, should we support this?

Is this one of the validations we are lacking? cc @liggitt @thockin

I'm not sure what is being asked.

Various cloud providers have a variety of methods of naming nodes (documented in kubernetes/website#8873, though that file appears to be gone now... maybe with the out-of-tree cloud provider move)

Are you asking if the way the kubelet sets the hostname label value should change? That will almost certainly break some existing users of the label on clouds where the current label value != the node name.

Are you asking if the API server should disallow the label being different than the node name? That will break nodes in environments where those differ.

alculquicondor commented:

Are you asking if the API server should disallow the label being different than the node name? That will break nodes in environments where those differ.

I was asking about this.

It sounds like we should be reverting #109877 then.

Note that the cherry-pick deadline is tomorrow. I think that will be the last 1.27 patch release.

alculquicondor commented:

FYI @msau42 and @Huang-Wei as reviewers of that PR.

AxeZhan commented Jun 7, 2024

It sounds like we should be reverting #109877 then.

So, we do not consider converting hostname to nodeName in the prefilter for performance reasons?

alculquicondor commented:

So, we do not consider converting hostname to nodeName in the prefilter for performance reasons?

How do you suggest we do it? If we are going to list all nodes to see which ones match, that might be similar to just letting Filter run.

alculquicondor commented:

You also have to consider what's feasible to cherry-pick. But now we are late for this patch release.

AxeZhan commented Jun 7, 2024

If we are going to list all nodes to see which ones match, that might be similar to just letting Filter run.

I think this will be a bit faster than running the filter as this is a quick check. But yea, maybe it's not worth it. No strong opinion here.

liggitt commented Jun 7, 2024

If we are going to list all nodes to see which ones match, that might be similar to just letting Filter run.

I think this will be a bit faster than running the filter as this is a quick check. But yea, maybe it's not worth it. No strong opinion here.

but that requires the prefilter step to have access to all nodes, which it doesn't today... that seems like a big change

AxeZhan commented Jun 7, 2024

Yes, this doesn't seem to align with the meaning of 'prefilter'. 😂

It looks like our only option is to revert #109877.

msau42 commented Jun 7, 2024

Agree let's revert now and we can take more time to discuss and see if there's a better solution.

jan-g commented Jun 7, 2024

I don't want to get in the way of mitigating this in the short term. Having said that, better approaches might include fixing up the local-path provisioner for these cases to provide a direct constraint - e.g. rancher/local-path-provisioner#414 and its attached PR.

AxeZhan commented Jun 8, 2024

I have created #125398 as the fix for the recent releases; it reverts #109877 and adds a unit test for this.

In the long run, I think we need to proceed with the following steps:

  1. Change the way we handle a PV's nodeAffinity, taking the node's name and the affinity's matchFields into consideration. The current code builds a node with labels only:

	func CheckNodeAffinity(pv *v1.PersistentVolume, nodeLabels map[string]string) error {
		if pv.Spec.NodeAffinity == nil {
			return nil
		}
		if pv.Spec.NodeAffinity.Required != nil {
			node := &v1.Node{ObjectMeta: metav1.ObjectMeta{Labels: nodeLabels}}
			terms := pv.Spec.NodeAffinity.Required
			if matches, err := corev1.MatchNodeSelectorTerms(node, terms); err != nil {
				return err
			} else if !matches {
				return fmt.Errorf("no matching NodeSelectorTerms")
			}
		}
		return nil
	}

     It should also set the node's name, so that matchFields on metadata.name can be evaluated:

	if pv.Spec.NodeAffinity.Required != nil {
		node := &v1.Node{ObjectMeta: metav1.ObjectMeta{
			Name:   nodeName,
			Labels: nodeLabels,
		}}
		terms := pv.Spec.NodeAffinity.Required
		matches, err := corev1.MatchNodeSelectorTerms(node, terms)
		...
	}

  2. Update the documentation to recommend that users and provisioners use matchFields with metadata.name to define affinity, instead of matchExpressions with the hostname label.

  3. Finally, bring back "scheduler volumebinding: leverage PreFilterResult for bound local PVs" #109877, but extract matchFields with metadata.name from the PV's affinity instead of the hostname label. The current extraction looks at MatchExpressions on the hostname label:

	func GetLocalPersistentVolumeNodeNames(pv *v1.PersistentVolume) []string {
		if pv == nil || pv.Spec.NodeAffinity == nil || pv.Spec.NodeAffinity.Required == nil {
			return nil
		}
		var result sets.Set[string]
		for _, term := range pv.Spec.NodeAffinity.Required.NodeSelectorTerms {
			var nodes sets.Set[string]
			for _, matchExpr := range term.MatchExpressions {
				if matchExpr.Key == v1.LabelHostname && matchExpr.Operator == v1.NodeSelectorOpIn {
					if nodes == nil {
						nodes = sets.New(matchExpr.Values...)
					} else {
						nodes = nodes.Intersection(sets.New(matchExpr.Values...))
					}
				}
			}
			result = result.Union(nodes)
		}
		return sets.List(result)
	}

     It would instead iterate over MatchFields:

	for _, term := range pv.Spec.NodeAffinity.Required.NodeSelectorTerms {
		var nodes sets.Set[string]
		for _, matchExpr := range term.MatchFields {
			if matchExpr.Key == "metadata.name" && matchExpr.Operator == v1.NodeSelectorOpIn {
				if nodes == nil {
					nodes = sets.New(matchExpr.Values...)
				} else {
					nodes = nodes.Intersection(sets.New(matchExpr.Values...))
				}
			}
		}
		result = result.Union(nodes)
	}

Sounds reasonable?
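The union-of-intersections logic in that extraction can be sketched standalone (a hedged illustration: the `matchExpr`, `term`, and `nodeNamesFromTerms` names are hypothetical, and plain maps stand in for k8s.io/apimachinery's sets.Set so it runs without k8s dependencies):

```go
package main

import (
	"fmt"
	"sort"
)

// matchExpr is a hypothetical stand-in for a NodeSelectorRequirement,
// restricted to the fields used here.
type matchExpr struct {
	Key      string
	Operator string
	Values   []string
}

// term is a hypothetical stand-in for a NodeSelectorTerm's MatchFields list.
type term struct {
	MatchFields []matchExpr
}

// nodeNamesFromTerms applies the proposed rule: within one term, intersect the
// values of all "metadata.name" In-expressions; across terms, union the results.
func nodeNamesFromTerms(terms []term) []string {
	result := map[string]bool{}
	for _, t := range terms {
		var nodes map[string]bool
		for _, e := range t.MatchFields {
			if e.Key != "metadata.name" || e.Operator != "In" {
				continue
			}
			vals := map[string]bool{}
			for _, v := range e.Values {
				vals[v] = true
			}
			if nodes == nil {
				nodes = vals
			} else {
				// Intersect: drop names not present in this expression.
				for n := range nodes {
					if !vals[n] {
						delete(nodes, n)
					}
				}
			}
		}
		for n := range nodes {
			result[n] = true
		}
	}
	out := make([]string, 0, len(result))
	for n := range result {
		out = append(out, n)
	}
	sort.Strings(out)
	return out
}

func main() {
	terms := []term{
		{MatchFields: []matchExpr{
			{Key: "metadata.name", Operator: "In", Values: []string{"node-a.internal", "node-b.internal"}},
			{Key: "metadata.name", Operator: "In", Values: []string{"node-b.internal", "node-c.internal"}},
		}},
		{MatchFields: []matchExpr{
			{Key: "metadata.name", Operator: "In", Values: []string{"node-d.internal"}},
		}},
	}
	fmt.Println(nodeNamesFromTerms(terms)) // [node-b.internal node-d.internal]
}
```

Because matchFields carry metadata.name directly, the values this returns really are node names, so handing them to PreFilterResult.NodeNames would be sound regardless of how the hostname label is set.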

AxeZhan commented Jun 8, 2024

/sig storage

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Jun 8, 2024
xing-yang commented:

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 26, 2024
domenicbozzuto added a commit to DataDog/kubernetes that referenced this issue Sep 6, 2024
Revert scheduler GetEligibleNodes optimization from issue kubernetes#125336