bug(dra): when deleting resourceclaimtemplate, pod can't run again #129362

Open
googs1025 opened this issue Dec 22, 2024 · 6 comments
Labels: kind/bug, needs-triage, wg/device-management

Comments

googs1025 (Contributor) commented Dec 22, 2024

What happened?

root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart# kubectl apply -f gpu-test2-dep.yaml
namespace/gpu-test2 created
resourceclaimtemplate.resource.k8s.io/single-gpu created
deployment.apps/gpu-deployment created
root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart# kubectl get all -ngpu-test2
NAME                                  READY   STATUS    RESTARTS   AGE
pod/gpu-deployment-6965899554-zmq5j   2/2     Running   0          16s

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/gpu-deployment   1/1     1            1           16s

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/gpu-deployment-6965899554   1         1         1       16s
root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart# kubectl get resourceclaimtemplates -ngpu-test2
NAME         AGE
single-gpu   47s
root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart#
  1. delete the resourceclaimtemplate
  2. delete the pod
root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart# kubectl get resourceclaimtemplates -ngpu-test2
NAME         AGE
single-gpu   2m33s
root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart# kubectl delete resourceclaimtemplates single-gpu -ngpu-test2
resourceclaimtemplate.resource.k8s.io "single-gpu" deleted
root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart# kubectl get pods -ngpu-test2
NAME                              READY   STATUS    RESTARTS   AGE
gpu-deployment-6965899554-zmq5j   2/2     Running   0          2m50s
root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart# kubectl delete pods gpu-deployment-6965899554-zmq5j -ngpu-test2
pod "gpu-deployment-6965899554-zmq5j" deleted
root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart# kubectl get pods -ngpu-test2
NAME                              READY   STATUS    RESTARTS   AGE
gpu-deployment-6965899554-wl2dj   0/2     Pending   0          3s
root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart#

The replacement pod stays Pending indefinitely:

root@VM-0-6-ubuntu:/home/ubuntu/k8s-dra-driver/demo/specs/quickstart# kubectl describe pods gpu-deployment-6965899554-wl2dj -ngpu-test2
Name:             gpu-deployment-6965899554-wl2dj
Namespace:        gpu-test2
Priority:         0
Service Account:  default
Node:             <none>
Labels:           app=gpu-app
                  pod-template-hash=6965899554
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Controlled By:    ReplicaSet/gpu-deployment-6965899554
Containers:
  ctr0:
    Image:      ubuntu:22.04
    Port:       <none>
    Host Port:  <none>
    Command:
      bash
      -c
    Args:
      nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m26j7 (ro)
  ctr1:
    Image:      ubuntu:22.04
    Port:       <none>
    Host Port:  <none>
    Command:
      bash
      -c
    Args:
      nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m26j7 (ro)
Volumes:
  kube-api-access-m26j7:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.present=true
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason                       Age                 From            Message
  ----     ------                       ----                ----            -------
  Warning  FailedResourceClaimCreation  11s (x13 over 31s)  resource_claim  PodResourceClaim shared-gpu: resource claim template "single-gpu": resourceclaimtemplate.resource.k8s.io "single-gpu" not found

What did you expect to happen?

When the resourceclaimtemplate is deleted, the pod should still be able to run after a restart; alternatively, the resourceclaimtemplate should not be deletable while it is in use.

How can we reproduce it (as minimally and precisely as possible)?

---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test2

---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test2
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com

---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: gpu-test2
  name: gpu-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      containers:
      - name: ctr0
        image: ubuntu:22.04
        command: ["bash", "-c"]
        args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
        resources:
          claims:
          - name: shared-gpu
      - name: ctr1
        image: ubuntu:22.04
        command: ["bash", "-c"]
        args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
        resources:
          claims:
          - name: shared-gpu
      resourceClaims:
      - name: shared-gpu
        resourceClaimTemplateName: single-gpu
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      nodeSelector:
        nvidia.com/gpu.present: "true"
      restartPolicy: Always

Apply this sample YAML, then delete the resourceclaimtemplate and the pod, as condensed below.
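
Condensed repro (the commands mirror the transcript above; the pod label comes from the manifest):

kubectl apply -f gpu-test2-dep.yaml
kubectl delete resourceclaimtemplates single-gpu -ngpu-test2
kubectl delete pods -ngpu-test2 -l app=gpu-app
kubectl get pods -ngpu-test2   # the replacement pod stays Pending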

Anything else we need to know?

I'm not sure if this is by design, but realistically a resourceclaimtemplate can be deleted by accident. If a training task is then restarted, the problem above can occur. 🤔

Kubernetes version

root@VM-0-6-ubuntu:/home/ubuntu# kubectl version
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.32.0


googs1025 added the kind/bug label on Dec 22, 2024
k8s-ci-robot added the needs-sig label on Dec 22, 2024
k8s-ci-robot (Contributor) commented Dec 22, 2024

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added the needs-triage label on Dec 22, 2024
googs1025 (Contributor, Author) commented:

/wg device-management

k8s-ci-robot added the wg/device-management label and removed the needs-sig label on Dec 22, 2024
googs1025 (Contributor, Author) commented:

/cc @pohly @klueska

googs1025 (Contributor, Author) commented:

Do we need to add a field to ResourceClaimTemplate that tracks which pods are using it, or add a finalizer?
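
(Illustrative sketch only, not part of the original comment: Kubernetes already blocks deletion of any object that carries a finalizer, so a custom finalizer can guard a template today; the finalizer name below is hypothetical.)

# Add a hypothetical protection finalizer; deletion then only sets
# deletionTimestamp until the finalizer is removed again:
kubectl patch resourceclaimtemplate single-gpu -ngpu-test2 \
  --type merge -p '{"metadata":{"finalizers":["example.com/template-protection"]}}'

# Clear the finalizer to let a pending deletion complete:
kubectl patch resourceclaimtemplate single-gpu -ngpu-test2 \
  --type merge -p '{"metadata":{"finalizers":[]}}'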

kannon92 (Contributor) commented:

Can you reproduce this with the DRA example driver?

pohly (Contributor) commented Dec 22, 2024

Was the ResourceClaim already created for the pod?

This event indicates otherwise:

  Warning  FailedResourceClaimCreation  11s (x13 over 31s)  resource_claim  PodResourceClaim shared-gpu: resource claim template "single-gpu": resourceclaimtemplate.resource.k8s.io "single-gpu" not found

Without the ResourceClaim, scheduling the pod cannot proceed, and without the ResourceClaimTemplate, the ResourceClaim cannot be created.

Once the ResourceClaim exists, it should be safe to remove the ResourceClaimTemplate.

There is no concept of "ResourceClaimTemplate is in use". It's used only for very brief moments in time when creating a ResourceClaim for a pod.
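
(A hedged way to check, not from the original thread: claims generated from a template are owned by their pod, so listing claims together with their owners shows whether one was ever created.)

kubectl get resourceclaims -ngpu-test2
# Print each claim together with its owning pod:
kubectl get resourceclaims -ngpu-test2 -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.ownerReferences[0].name}{"\n"}{end}'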
