Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PV nodeAffinity matchExpressions problem with array items and in operator since 1.27.0 #123465

Closed
dsgnr opened this issue Feb 23, 2024 · 24 comments · Fixed by #124567
Closed

PV nodeAffinity matchExpressions problem with array items and in operator since 1.27.0 #123465

dsgnr opened this issue Feb 23, 2024 · 24 comments · Fixed by #124567
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@dsgnr
Copy link

dsgnr commented Feb 23, 2024

What happened?

Since v1.27.0, the nodeAffinity for PersistentVolumes appear to have added extra validation on whether the values exist, despite using the in operator. This works as expected on 1.26.14, and no longer works on 1.27.0 onwards (tested up to 1.29.2).

I am unsure whether this is an intentional breaking change in 1.27.0 or something wrong in our manifests and we've just so happened to get away with it for a few years. The only thing that stands out to me in the changelog for 1.27.0 is some changes to the schedulers Filter plugin but this may well be unrelated.

What did you expect to happen?

When using the in operator, I'd expect the expression to only apply to nodes where the match is truey.

How can we reproduce it (as minimally and precisely as possible)?

Note, using Kind for easy reproduction

v1.26.14 - working

creating a kind cluster

cat <<EOF > kind_conf
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.26.14
- role: worker
  image: kindest/node:v1.26.14
- role: worker
  image: kindest/node:v1.26.14
EOF

$ kind create cluster --name 12614-test --config kind_conf
$ k get no
NAME                       STATUS   ROLES           AGE   VERSION
12614-test-control-plane   Ready    control-plane   8h    v1.26.14
12614-test-worker          Ready    <none>          8h    v1.26.14
12614-test-worker2         Ready    <none>          8h    v1.26.14
# labelling a node to ensure pod is scheduled on specific worker
$ k label node 12614-test-worker nginx=nginx
cat <<EOF > manifest.yaml
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: nginx
  name: nginx
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  serviceName: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nginx
                operator: Exists
      containers:
      - name: nginx
        image: nginx:latest
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /var/lib/nginx
          name: default-nginx-data
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: default-nginx-data
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 5Gi
      storageClassName: my_sc
      volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: default-nginx-data-nginx-0
  namespace: default
  labels:
    app: default-nginx-data-nginx-0
spec:
  storageClassName: my_sc
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /srv/k8s/default/nginx-0
  claimRef:
    namespace: default
    name: default-nginx-data-nginx-0
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
            - 12614-test-worker
            - 12614-prod-worker
EOF
# Apply the manifest
$ k apply -f manifest.yaml

Since 12614-test-worker exists in the hostname values of this cluster, the pod schedules as expected. 12614-prod-worker node does not exist in this environment, and so is ignored.

$ k get po
NAME      READY   STATUS    RESTARTS   AGE
nginx-0   1/1     Running   0          6m2s

v1.27.0 - not working

creating a kind cluster

cat <<EOF > kind_conf
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  image: kindest/node:v1.27.0
- role: worker
  image: kindest/node:v1.27.0
- role: worker
  image: kindest/node:v1.27.0
EOF

$ kind create cluster --name 1270-test --config kind_conf
# labelling a node to ensure pod is scheduled on specific worker
$ k label node 1270-test-worker nginx=nginx
cat <<EOF > manifest.yaml
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: nginx
  name: nginx
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  serviceName: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nginx
                operator: Exists
      containers:
      - name: nginx
        image: nginx:latest
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /var/lib/nginx
          name: default-nginx-data
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: default-nginx-data
    spec:
      accessModes:
      - ReadWriteMany
      resources:
        requests:
          storage: 5Gi
      storageClassName: my_sc
      volumeMode: Filesystem
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: default-nginx-data-nginx-0
  namespace: default
  labels:
    app: default-nginx-data-nginx-0
spec:
  storageClassName: my_sc
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /srv/k8s/default/nginx-0
  claimRef:
    namespace: default
    name: default-nginx-data-nginx-0
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
            - 1270-test-worker
            - 1270-prod-worker
EOF
# Apply the manifest
$ k apply -f manifest.yaml

Same expectations as 1.26.14, but in this case, the pod stays stuck in pending state due to the following;

$ k get no
NAME                      STATUS   ROLES           AGE     VERSION
1270-test-control-plane   Ready    control-plane   5m1s    v1.27.0
1270-test-worker          Ready    <none>          4m34s   v1.27.0
1270-test-worker2         Ready    <none>          4m39s   v1.27.0
$ k get po
NAME      READY   STATUS    RESTARTS   AGE
nginx-0   0/1     Pending   0          5m56s
$ kubectl events --for pod/nginx-0
LAST SEEN   TYPE      REASON             OBJECT        MESSAGE
3m6s        Warning   FailedScheduling   Pod/nginx-0   nodeinfo not found for node name "1270-prod-worker"

I have tested this all the way up to 1.29.2.

Anything else we need to know?

No response

Kubernetes version

v1.27.0 through to v1.29.2

Cloud provider

OS version

[root@stage1 ~]# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

[root@stage1 ~]# uname -a
Linux stage1.domain.com 3.10.0-1160.108.1.el7.x86_64 #1 SMP Thu Jan 25 16:17:31 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@dsgnr dsgnr added the kind/bug Categorizes issue or PR as related to a bug. label Feb 23, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 23, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vaibhav2107
Copy link
Member

/sig storage

@k8s-ci-robot k8s-ci-robot added sig/storage Categorizes an issue or PR as relevant to SIG Storage. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 23, 2024
@msarfaty
Copy link

msarfaty commented Feb 23, 2024

In a similar vein, we have noticed the same issue in K8s v1.27 when the kubernetes.io/hostname label doesn't match the node name. We see nodeinfo not found for <hostname>, despite (I think) the nodeinfo util supposed to be called using a node name.

@PrasanaaV
Copy link

/sig scheduling

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Feb 26, 2024
@chengjoey
Copy link
Contributor

chengjoey commented Apr 18, 2024

This is because the non-existent node 1270-prod-worker is obtained from the cache, which has been fixed by #119779

before 119779, get node will return error:

nodes := allNodes
if !preRes.AllNodes() {
nodes = make([]*framework.NodeInfo, 0, len(preRes.NodeNames))
for n := range preRes.NodeNames {
nInfo, err := sched.nodeInfoSnapshot.NodeInfos().Get(n)
if err != nil {
return nil, diagnosis, err
}
nodes = append(nodes, nInfo)
}
}

after 119779 it's fixed

nodes := allNodes
if !preRes.AllNodes() {
nodes = make([]*framework.NodeInfo, 0, len(preRes.NodeNames))
for _, n := range allNodes {
if !preRes.NodeNames.Has(n.Node().Name) {
// We consider Nodes that are filtered out by PreFilterResult as rejected via UnschedulableAndUnresolvable.
// We have to record them in NodeToStatusMap so that they won't be considered as candidates in the preemption.
diagnosis.NodeToStatusMap[n.Node().Name] = framework.NewStatus(framework.UnschedulableAndUnresolvable, "node is filtered out by the prefilter result")
continue
}
nodes = append(nodes, n)
}
}

since version 1.30, it can run normally

@sanposhiho PTAL, need to cherry-pick for the previous version?

@sanposhiho
Copy link
Member

sanposhiho commented Apr 20, 2024

Looks like my PR #119779 accidentally fixed this issue. 😅

380c7f2 is the trigger of this issue, introduced in v1.27; volumebinding plugin just returns node names in PreFilterResult based on nodeAffinity in PV, and thus PreFilterResult could contain node names actually doesn't exist.


@kubernetes/sig-scheduling-leads

One Line Summary: If nodeaffinity in PV has non-existing node name in In with kubernetes.io/hostname, the scheduling always fails.
And, this bug is already coincidentally fixed by #119779.

Questions:

@alculquicondor
Copy link
Member

If it's a regression, we should consider cherry-pick.

🤔
#119779 looks big, but the actual code changes are about 10 lines (the rest are comments and tests), so I think it would be ok to cherry-pick.

Can you explain why your PR fixes the problem?

@sanposhiho
Copy link
Member

Can you explain why your PR fixes the problem?

Before my PR, we take NodeInfo of nodes in preFilterResult via sched.nodeInfoSnapshot.NodeInfos().Get(n) (ref). Consequently, we could get not found errors from it when preFilterResult has non-existing node name(s).

And, after my PR, we are not using sched.nodeInfoSnapshot.NodeInfos().Get(n) anymore.

@Huang-Wei
Copy link
Member

Huang-Wei commented Apr 25, 2024

The pitfall is that when PreFilterResult was introduced, scheduler directly fetch the nodeNames returned by PreFilter() in its node snapshot, so technically speaking, when a plugin (in-tree or out-of-tree) return non-existent/illegal nodes, the pod scheduling flow will abort immediately.

And the PR #109877 happened to stomp on this pitfall.

@alculquicondor
Copy link
Member

@Huang-Wei do you mean that you would like to see a smaller version of #119779 for cherry-pick? Note that it is only about 10 lines of actual change. The rest is comments and tests.

Since this is already released code, I would suggest cherry-picking as-is.

Also, worth testing E2E whether the issue reproduces in v1.30.

@dsgnr do you think you can give it a try?

@Huang-Wei
Copy link
Member

@Huang-Wei do you mean that you would like to see a smaller version of #119779 for cherry-pick? Note that it is only about 10 lines of actual change. The rest is comments and tests.

yup, i'm in favor of picking the minimum fix, along with a minimum test #124539:

		for _, n := range allNodes {
			if !preRes.NodeNames.Has(n.Node().Name) {
				// We consider Nodes that are filtered out by PreFilterResult as rejected via UnschedulableAndUnresolvable.
				// We have to record them in NodeToStatusMap so that they won't be considered as candidates in the preemption.
				diagnosis.NodeToStatusMap[n.Node().Name] = framework.NewStatus(framework.UnschedulableAndUnresolvable, "node is filtered out by the prefilter result")
				continue
			}
			nodes = append(nodes, n)

b/c in addition to the "fix" part, #119779 actually introduced a slight behavior change. @sanposhiho do you think the above ^^ minimum fix + #124539 makes sense?

@Huang-Wei
Copy link
Member

Also, worth testing E2E whether the issue reproduces in v1.30.

#124539 can guard it.

@chengjoey
Copy link
Contributor

chengjoey commented Apr 26, 2024

@Huang-Wei do you mean that you would like to see a smaller version of #119779 for cherry-pick? Note that it is only about 10 lines of actual change. The rest is comments and tests.

yup, i'm in favor of picking the minimum fix, along with a minimum test #124539:

		for _, n := range allNodes {
			if !preRes.NodeNames.Has(n.Node().Name) {
				// We consider Nodes that are filtered out by PreFilterResult as rejected via UnschedulableAndUnresolvable.
				// We have to record them in NodeToStatusMap so that they won't be considered as candidates in the preemption.
				diagnosis.NodeToStatusMap[n.Node().Name] = framework.NewStatus(framework.UnschedulableAndUnresolvable, "node is filtered out by the prefilter result")
				continue
			}
			nodes = append(nodes, n)

b/c in addition to the "fix" part, #119779 actually introduced a slight behavior change. @sanposhiho do you think the above ^^ minimum fix + #124539 makes sense?

@Huang-Wei @alculquicondor @sanposhiho Could you let me try it?,picking the minimum fix and add e2e test

@chengjoey
Copy link
Contributor

A simplified version of the sample is used for testing,manifest.yaml:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: data
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /tmp/data
  claimRef:
    namespace: default
    name: data
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - 1280-test-worker
            - 1280-prod-worker

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: default
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  volumeName: data
---
apiVersion: v1
kind: Pod
metadata:
  name: pause
  labels:
    app: pause
spec:
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.6
    volumeMounts:
    - mountPath: /tmp
      name: data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data

also get this failedScheduling error, kubectl describe pod pause:
image

@sanposhiho
Copy link
Member

@sanposhiho do you think the above ^^ minimum fix + #124539 makes sense?

@Huang-Wei That small fix should work.
I don't have a strong opinion on whether we should have a minimum fix vs cherry-pick as it is.
Defer to you both @alculquicondor @Huang-Wei on that point.

@alculquicondor
Copy link
Member

Let's move ahead with the small fix, since we all agree on it.

@sanposhiho
Copy link
Member

/reopen

Cherry-pick PR(s) aren't done

@k8s-ci-robot
Copy link
Contributor

@sanposhiho: Reopened this issue.

In response to this:

/reopen

Cherry-pick PR(s) aren't done

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alculquicondor
Copy link
Member

FYI on a related crash #124930

@0xdiba
Copy link

0xdiba commented Jun 4, 2024

Are there any news on the backporting the solution to older versions (v1.27+)?

@alculquicondor
Copy link
Member

This is already backported and released. But the fix for the crash reported in #124930 was not released yet.

Let's close this issue for now, as it is completed.

@alculquicondor
Copy link
Member

/close

@k8s-ci-robot
Copy link
Contributor

@alculquicondor: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jan-g
Copy link

jan-g commented Jun 5, 2024

The issue here stems from the observation by Mike Sarfaty, above that volume-binding's prefilter assumes that it can equate the matchConstraint hostname labels with node names. This doesn't have to be the case. See #125336 here. Prefiltering will always fail in this situation. AFAICT this issue persists with the current head of the master branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage.
Projects
None yet