NVIDIA GPU support #24836
Conversation
I already submitted scheduler changes for NVIDIA GPU on 14 Apr (#21446), similar to the scheduler changes you made here, but I treat the NVIDIA GPU as Digit = Format("Digit").
14794b5 to cb92710
@davidopp, I assigned this to you to review or delegate the scheduler predicate changes.
/cc @kubernetes/sig-node
Rebased, squashed, and changed int->int32, per @smarterclayton's recent changes
@@ -1817,6 +1817,7 @@ const (
	// Volume size, in bytes (e.g. 5Gi = 5GiB = 5 * 1024 * 1024 * 1024)
	ResourceStorage ResourceName = "storage"
	// Number of Pods that may be running on this Node: see ResourcePods
This comment seems to be unrelated to the field below. Is GitHub rendering it incorrectly?
Good catch, I think this is from the rebase. Fixing it.
It's a placeholder for the defunct maxpods. It misled me into thinking I had already added a description for ResourceNvidiaGPU, so I just moved it to the end.
Kubelet changes LGTM
@@ -231,6 +232,7 @@ func (s *KubeletServer) AddFlags(fs *pflag.FlagSet) {
	fs.BoolVar(&s.BabysitDaemons, "babysit-daemons", s.BabysitDaemons, "If true, the node has babysitter process monitoring docker and kubelet.")
	fs.MarkDeprecated("babysit-daemons", "Will be removed in a future version.")
	fs.Int32Var(&s.MaxPods, "max-pods", s.MaxPods, "Number of Pods that can run on this Kubelet.")
	fs.Int32Var(&s.NvidiaGPUs, "experimental-nvidia-gpus", s.NvidiaGPUs, "Number of NVIDIA GPU devices on this node.")
Say that only values 0 and 1 are currently supported?
Done
Updated with validation. I hadn't treated it as urgent, since this is an experimental feature and probably subject to lots of changes.
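With the flag added in the diff above, advertising a single GPU to the kubelet would look something like this (a sketch only; every other kubelet flag is elided):

```shell
# Alpha feature: per the discussion, only 0 (the default) and 1 are
# meaningful values for this flag in v0.
kubelet --experimental-nvidia-gpus=1
```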
@josephjacks for v0, only one GPU per machine is supported. See https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/gpu-support.md and #24071
// For GPUs, require that no request be set.
if resourceName == api.ResourceNvidiaGPU {
	allErrs = append(allErrs, field.Invalid(reqPath, requestQuantity.String(), "cannot be set"))
} else if quantity.Cmp(requestQuantity) < 0 {
Should we also add quantity.Cmp(requestQuantity) > 1? Because so far we only support /dev/nvidia0.
I don't think this is necessary. We would not, for example, validate that your RAM request is smaller than the largest-RAM machine.
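The rule under discussion can be sketched with plain integers. This is a hypothetical helper, not the actual Kubernetes validation code: `validateResource` and the int64 quantities are stand-ins for the real `resource.Quantity` plumbing, and the string constant stands in for `api.ResourceNvidiaGPU`.

```go
package main

import "fmt"

const resourceNvidiaGPU = "alpha.kubernetes.io/nvidia-gpu"

// validateResource is an illustrative sketch of the validation above:
// for the alpha GPU resource a request may not be set at all, while for
// every other resource the limit must be at least the request.
func validateResource(name string, limit, request int64, requestSet bool) error {
	if name == resourceNvidiaGPU {
		if requestSet {
			return fmt.Errorf("%s: request cannot be set", name)
		}
		return nil
	}
	if requestSet && limit < request {
		return fmt.Errorf("%s: limit must be >= request", name)
	}
	return nil
}

func main() {
	// CPU with limit >= request validates; a GPU request is rejected.
	fmt.Println(validateResource("cpu", 500, 250, true))
	fmt.Println(validateResource(resourceNvidiaGPU, 1, 1, true))
}
```

The sketch also shows why an upper-bound check is unnecessary here: limits larger than any real node simply make the pod unschedulable, just as an oversized RAM request would.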
LGTM
@k8s-bot test this issue: #IGNORE Tests have been pending for 24 hours
GCE e2e build/test passed for commit 362c763.
const (
	// CPU, in cores. (500m = .5 cores)
	ResourceCPU ResourceName = "cpu"
	// Memory, in bytes. (500Gi = 500GiB = 500 * 1024 * 1024 * 1024)
	ResourceMemory ResourceName = "memory"
	// Volume size, in bytes (e.g. 5Gi = 5GiB = 5 * 1024 * 1024 * 1024)
	ResourceStorage ResourceName = "storage"
	// NVIDIA GPU, in devices. Alpha, might change: although fractional and allowing values >1, only one whole device per node is assigned.
I didn't understand this comment. The resource is integer GPUs, not milliGPUs, so how can you specify fractional?
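For context on the fractional question: Kubernetes quantities carry milli-scale precision (e.g. "500m"), which is how a fractional GPU could even be expressed. A whole-device check would look roughly like the illustrative helper below; `isWholeGPU` is not code from this PR.

```go
package main

import "fmt"

// isWholeGPU is an illustrative check, not from this PR: given a
// quantity in milli-units (so "1" => 1000, "500m" => 500), a GPU count
// only makes sense as a positive multiple of 1000, i.e. a whole number
// of devices.
func isWholeGPU(milliValue int64) bool {
	return milliValue > 0 && milliValue%1000 == 0
}

func main() {
	fmt.Println(isWholeGPU(1000)) // one whole device
	fmt.Println(isWholeGPU(500))  // half a device
}
```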
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e build/test passed for commit 362c763.
Can you change the title of this PR? @therc
nvm, I did it for you :)
Automatic merge from submit-queue
Implements part of #24071 for #23587
I am not familiar with the scheduler enough to know what to do with the scores. Mostly punting for now.
Missing items from the implementation plan: limitranger, rkt support, kubectl support, and docs.
cc @erictune @davidopp @dchen1107 @vishh @Hui-Zhi @gopinatht