DRA: Using All allocation mode will schedule to nodes with zero devices #129310

Open
johnbelamaric opened this issue Dec 19, 2024 · 2 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.
wg/device-management: Categorizes an issue or PR as relevant to WG Device Management.

Comments

@johnbelamaric
Member

What happened?

I created a resource claim template to get "All" GPUs on a node:

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: all-gpus
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        allocationMode: All

I then created a Deployment whose Pod template used that claim template. The Pod was scheduled to a node even though my DRA driver was not running on that node, so no ResourceSlices existed for it.

What did you expect to happen?

I expected the Pod not to be scheduled, since no available devices met the request. "All" should imply "at least one".

How can we reproduce it (as minimally and precisely as possible)?

With no DRA driver running, create the ResourceClaimTemplate shown above and a Deployment whose Pods use it. The Pod is scheduled anyway.
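
For reference, a Deployment along these lines should be enough to reproduce the behavior. This is a minimal sketch, not the exact manifest from the report: the name `all-gpus-demo`, the `app` label, the claim name `gpus`, and the pause image are assumptions chosen for illustration. Only the template name `all-gpus` comes from the ResourceClaimTemplate in this issue.

```yaml
# Hypothetical reproduction manifest; names and image are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: all-gpus-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: all-gpus-demo
  template:
    metadata:
      labels:
        app: all-gpus-demo
    spec:
      # Each Pod gets its own ResourceClaim generated from the template.
      resourceClaims:
      - name: gpus
        resourceClaimTemplateName: all-gpus
      containers:
      - name: main
        image: registry.k8s.io/pause:3.10
        resources:
          # Grants this container access to the devices allocated for the claim.
          claims:
          - name: gpus
```

If the cluster has no ResourceSlices for the `gpu.nvidia.com` driver, the report implies the resulting Pod is still bound to a node with an allocation that contains zero devices.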

Anything else we need to know?

/wg device-management

Kubernetes version

$ kubectl version
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.32.0-gke.1358000

Cloud provider

GKE

OS version

$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux rodete"
NAME="Debian GNU/Linux rodete"
VERSION_CODENAME=rodete
ID=debian
HOME_URL="https://go/glinux"
SUPPORT_URL="https://go/techstop"
BUG_REPORT_URL="https://go/techstop"
$ uname -a
Linux jbelamaric.c.googlers.com 6.10.11-1rodete2-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.10.11-1rodete2 (2024-10-16) x86_64 GNU/Linux

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

johnbelamaric added the kind/bug label on Dec 19, 2024
k8s-ci-robot added the wg/device-management label on Dec 19, 2024
k8s-ci-robot added the needs-triage label (indicates an issue or PR lacks a `triage/foo` label and requires one) on Dec 19, 2024
@pohly
Contributor

pohly commented Dec 20, 2024

In the mathematical sense, "all devices in an empty set" is the empty set. But I agree that "all devices that match, and at least one" is the better semantics for this feature. Let's treat it as a bug; then we can backport the fix.

If someone really wants "all devices that match, none is okay, too" then in 1.33 they can use firstAvailableOf (first alternative is "all devices", second is "none").

/triage accepted
/priority important-soon

k8s-ci-robot added the triage/accepted and priority/important-soon labels and removed the needs-triage label on Dec 20, 2024
pohly moved this from 🆕 New to 🔖 Ready in SIG Node: Dynamic Resource Allocation on Dec 20, 2024
@sftim
Contributor

sftim commented Dec 20, 2024

I like the idea of explicitly documenting an all-or-nothing mode. Eventually we'll find someone who'd like it, I'm sure.

Projects
Status: 🔖 Ready
Development

No branches or pull requests

4 participants