bug(dra): when deleting resourceclaimtemplate, pod can't run again #129362

Open
googs1025 opened this issue Dec 22, 2024 · 6 comments
Labels: kind/bug, needs-triage, wg/device-management

Comments

googs1025 (Contributor) commented Dec 22, 2024

What happened?

root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart# kubectl apply -f gpu-test2-dep.yaml
namespace/gpu-test2 created
resourceclaimtemplate.resource.k8s.io/single-gpu created
deployment.apps/gpu-deployment created
root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart# kubectl get all -ngpu-test2
NAME                                  READY   STATUS    RESTARTS   AGE
pod/gpu-deployment-6965899554-zmq5j   2/2     Running   0          16s

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/gpu-deployment   1/1     1            1           16s

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/gpu-deployment-6965899554   1         1         1       16s
root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart# kubectl get resourceclaimtemplates -ngpu-test2
NAME         AGE
single-gpu   47s
root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart#
  1. delete the resourceclaimtemplate
  2. delete the pod
root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart# kubectl get resourceclaimtemplates -ngpu-test2
NAME         AGE
single-gpu   2m33s
root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart# kubectl delete resourceclaimtemplates single-gpu -ngpu-test2
resourceclaimtemplate.resource.k8s.io "single-gpu" deleted
root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart# kubectl get pods -ngpu-test2
NAME                              READY   STATUS    RESTARTS   AGE
gpu-deployment-6965899554-zmq5j   2/2     Running   0          2m50s
root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart# kubectl delete pods gpu-deployment-6965899554-zmq5j -ngpu-test2
pod "gpu-deployment-6965899554-zmq5j" deleted
root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart# kubectl get pods -ngpu-test2
NAME                              READY   STATUS    RESTARTS   AGE
gpu-deployment-6965899554-wl2dj   0/2     Pending   0          3s
root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart#

The replacement pod stays Pending indefinitely:

root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart# kubectl describe pods gpu-deployment-6965899554-wl2dj -ngpu-test2
Name:             gpu-deployment-6965899554-wl2dj
Namespace:        gpu-test2
Priority:         0
Service Account:  default
Node:             <none>
Labels:           app=gpu-app
                  pod-template-hash=6965899554
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Controlled By:    ReplicaSet/gpu-deployment-6965899554
Containers:
  ctr0:
    Image:      ubuntu:22.04
    Port:       <none>
    Host Port:  <none>
    Command:
      bash
      -c
    Args:
      nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m26j7 (ro)
  ctr1:
    Image:      ubuntu:22.04
    Port:       <none>
    Host Port:  <none>
    Command:
      bash
      -c
    Args:
      nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m26j7 (ro)
Volumes:
  kube-api-access-m26j7:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.present=true
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason                       Age                 From            Message
  ----     ------                       ----                ----            -------
  Warning  FailedResourceClaimCreation  11s (x13 over 31s)  resource_claim  PodResourceClaim shared-gpu: resource claim template "single-gpu": resourceclaimtemplate.resource.k8s.io "single-gpu" not found

What did you expect to happen?

When the resourceclaimtemplate is deleted, the pod should still be able to run after a restart; alternatively, the resourceclaimtemplate should not be deletable while it is in use.

How can we reproduce it (as minimally and precisely as possible)?

---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test2

---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test2
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com

---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: gpu-test2
  name: gpu-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      containers:
      - name: ctr0
        image: ubuntu:22.04
        command: ["bash", "-c"]
        args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
        resources:
          claims:
          - name: shared-gpu
      - name: ctr1
        image: ubuntu:22.04
        command: ["bash", "-c"]
        args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
        resources:
          claims:
          - name: shared-gpu
      resourceClaims:
      - name: shared-gpu
        resourceClaimTemplateName: single-gpu
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      nodeSelector:
        nvidia.com/gpu.present: "true"
      restartPolicy: Always

Apply this sample YAML, then delete the resourceclaimtemplate and the pod, as condensed below.
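
Condensed repro (the commands mirror the transcript above; the pod label comes from the manifest):

kubectl apply -f gpu-test2-dep.yaml
kubectl delete resourceclaimtemplates single-gpu -ngpu-test2
kubectl delete pods -ngpu-test2 -l app=gpu-app
kubectl get pods -ngpu-test2   # the replacement pod stays Pending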

Anything else we need to know?

I'm not sure if this is by design, but realistically a resourceclaimtemplate can be deleted by accident. If a training task is then restarted, the problem above can occur. 🤔

Kubernetes version

root@VM-0-6-ubuntu:/home/ubuntu# kubectl version
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.32.0


googs1025 added the kind/bug label on Dec 22, 2024
k8s-ci-robot added the needs-sig label on Dec 22, 2024
k8s-ci-robot (Contributor) commented Dec 22, 2024

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added the needs-triage label on Dec 22, 2024
googs1025 (Contributor, Author) commented:

/wg device-management

k8s-ci-robot added the wg/device-management label and removed the needs-sig label on Dec 22, 2024
googs1025 (Contributor, Author) commented:

/cc @pohly @klueska

googs1025 (Contributor, Author) commented:

Do we need to add a field to ResourceClaimTemplate that tracks which pods are using it, or add a finalizer?
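
(Illustrative sketch only, not part of the original comment: Kubernetes already blocks deletion of any object that carries a finalizer, so a custom finalizer can guard a template today; the finalizer name below is hypothetical.)

# Add a hypothetical protection finalizer; deletion then only sets
# deletionTimestamp until the finalizer is removed again:
kubectl patch resourceclaimtemplate single-gpu -ngpu-test2 \
  --type merge -p '{"metadata":{"finalizers":["example.com/template-protection"]}}'

# Clear the finalizer to let a pending deletion complete:
kubectl patch resourceclaimtemplate single-gpu -ngpu-test2 \
  --type merge -p '{"metadata":{"finalizers":[]}}'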

kannon92 (Contributor) commented:

Can you reproduce this with the DRA example driver?

pohly (Contributor) commented Dec 22, 2024

Was the ResourceClaim already created for the pod?

This event indicates otherwise:

  Warning  FailedResourceClaimCreation  11s (x13 over 31s)  resource_claim  PodResourceClaim shared-gpu: resource claim template "single-gpu": resourceclaimtemplate.resource.k8s.io "single-gpu" not found

Without the ResourceClaim, scheduling the pod cannot proceed, and without the ResourceClaimTemplate, the ResourceClaim cannot be created.

Once the ResourceClaim exists, it should be safe to remove the ResourceClaimTemplate.

There is no concept of "ResourceClaimTemplate is in use". It's used only for very brief moments in time when creating a ResourceClaim for a pod.
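
(A hedged way to check, not from the original thread: claims generated from a template are owned by their pod, so listing claims together with their owners shows whether one was ever created.)

kubectl get resourceclaims -ngpu-test2
# Print each claim together with its owning pod:
kubectl get resourceclaims -ngpu-test2 -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.ownerReferences[0].name}{"\n"}{end}'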
