
NVIDIA GPU support #24836

Merged
merged 1 commit into from
May 12, 2016
Conversation

therc
Member

@therc therc commented Apr 27, 2016

* Alpha support for scheduling pods on machines with NVIDIA GPUs whose kubelets use the `--experimental-nvidia-gpus` flag, using the alpha.kubernetes.io/nvidia-gpu resource 

Implements part of #24071 for #23587

I am not familiar with the scheduler enough to know what to do with the scores. Mostly punting for now.

Missing items from the implementation plan: limitranger, rkt support, kubectl
support and docs

cc @erictune @davidopp @dchen1107 @vishh @Hui-Zhi @gopinatht
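To make the new resource name concrete, here is a minimal sketch of how a container's limits would carry the alpha GPU resource. This is not the real `api` package: `ResourceList` below is a simplified stand-in (the real one maps `api.ResourceName` to `resource.Quantity`), and only the resource name string is taken from the PR.

```go
package main

import "fmt"

// ResourceNvidiaGPU mirrors the alpha resource name introduced by this PR.
const ResourceNvidiaGPU = "alpha.kubernetes.io/nvidia-gpu"

// ResourceList is a simplified stand-in for api.ResourceList,
// mapping a resource name to an integer quantity.
type ResourceList map[string]int64

func main() {
	// A container that wants one GPU sets it under limits only;
	// per the validation added in this PR, a request must not be set.
	limits := ResourceList{ResourceNvidiaGPU: 1}
	fmt.Println(limits[ResourceNvidiaGPU])
}
```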

@k8s-github-robot k8s-github-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/old-docs size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 27, 2016
@Hui-Zhi
Contributor

Hui-Zhi commented Apr 27, 2016

I already submitted scheduler changes for NVIDIA GPU on Apr 14 (#21446). The scheduler changes you made are similar to mine, but I treat the NVIDIA GPU as

Digit = Format("Digit").

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 27, 2016
@therc therc force-pushed the gpu-impl branch 2 times, most recently from 14794b5 to cb92710 on April 27, 2016 at 18:39
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 27, 2016
@yujuhong yujuhong assigned davidopp and unassigned yujuhong Apr 27, 2016
@yujuhong
Contributor

@davidopp, I assigned this to you to review or delegate scheduler predicate changes.

@yujuhong
Contributor

/cc @kubernetes/sig-node

@k8s-github-robot k8s-github-robot added the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Apr 29, 2016
@eparis eparis removed the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Apr 29, 2016
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 30, 2016
@therc
Member Author

therc commented May 1, 2016

Rebased, squashed and changed int->int32, per @smarterclayton's recent changes

@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 1, 2016
@@ -1817,6 +1817,7 @@ const (
// Volume size, in bytes (e.g. 5Gi = 5GiB = 5 * 1024 * 1024 * 1024)
ResourceStorage ResourceName = "storage"
// Number of Pods that may be running on this Node: see ResourcePods
Contributor


This comment seems to be unrelated to the field below. Is GitHub rendering it incorrectly?

Member Author


Good catch, I think this is from the rebase. Fixing it.

Member Author


It's a placeholder for the defunct maxpods. It misled me into thinking I had already added a description for ResourceNvidiaGPU, so I just moved it to the end.

@vishh
Contributor

vishh commented May 2, 2016

Kubelet changes LGTM

@@ -231,6 +232,7 @@ func (s *KubeletServer) AddFlags(fs *pflag.FlagSet) {
fs.BoolVar(&s.BabysitDaemons, "babysit-daemons", s.BabysitDaemons, "If true, the node has babysitter process monitoring docker and kubelet.")
fs.MarkDeprecated("babysit-daemons", "Will be removed in a future version.")
fs.Int32Var(&s.MaxPods, "max-pods", s.MaxPods, "Number of Pods that can run on this Kubelet.")
fs.Int32Var(&s.NvidiaGPUs, "experimental-nvidia-gpus", s.NvidiaGPUs, "Number of NVIDIA GPU devices on this node.")
Member


Say that only values 0 and 1 are currently supported?

Member Author


Done
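The constraint discussed above (the alpha implementation exposes only /dev/nvidia0, so the flag accepts only 0 or 1) could be enforced with a small check like the following. This is a hypothetical helper for illustration, not code from the PR:

```go
package main

import "fmt"

// validateNvidiaGPUs illustrates the review point above: in the alpha
// implementation only 0 or 1 GPUs per node are supported, so any other
// value for --experimental-nvidia-gpus should be rejected.
func validateNvidiaGPUs(n int32) error {
	if n != 0 && n != 1 {
		return fmt.Errorf("--experimental-nvidia-gpus must be 0 or 1, got %d", n)
	}
	return nil
}

func main() {
	for _, n := range []int32{0, 1, 4} {
		if err := validateNvidiaGPUs(n); err != nil {
			fmt.Println(err)
		} else {
			fmt.Printf("%d GPUs: ok\n", n)
		}
	}
}
```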

@erictune erictune assigned erictune and unassigned erictune May 10, 2016
@therc
Member Author

therc commented May 10, 2016

Updated with validation. I hadn't treated that as urgent, because this is an experimental feature and probably subject to lots of changes.

@josephjacks

@therc For one K8s node with 4 GPUs and four 1-GPU jobs, how would allocation of those GPUs to the jobs work? // @gdb

@therc
Member Author

therc commented May 10, 2016

@josephjacks for v0, only one GPU per machine is supported. See https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/gpu-support.md and #24071

// For GPUs, require that no request be set.
if resourceName == api.ResourceNvidiaGPU {
allErrs = append(allErrs, field.Invalid(reqPath, requestQuantity.String(), "cannot be set"))
} else if quantity.Cmp(requestQuantity) < 0 {
Contributor


Should we also add a quantity.Cmp(requestQuantity) > 1 check? Because so far we only support /dev/nvidia0.

Member


I don't think this is necessary. We would not, for example, validate that your RAM request is smaller than the machine with the most RAM.
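The validation rule under discussion can be sketched in simplified, self-contained form. The function name and the plain int64 quantities below are stand-ins for the real api.ResourceName and resource.Quantity types; only the rule itself (a GPU request must not be set, while other resources need limit >= request) comes from the snippet above:

```go
package main

import "fmt"

// validateGPULimit sketches the validation discussed above: for the alpha
// GPU resource a request must not be set at all, while for every other
// resource a set request must not exceed the limit.
func validateGPULimit(resourceName string, limit, request int64, requestSet bool) error {
	const resourceNvidiaGPU = "alpha.kubernetes.io/nvidia-gpu"
	if resourceName == resourceNvidiaGPU {
		if requestSet {
			return fmt.Errorf("%s: request cannot be set", resourceName)
		}
		return nil
	}
	if requestSet && limit < request {
		return fmt.Errorf("%s: limit must be greater than or equal to request", resourceName)
	}
	return nil
}

func main() {
	// Setting a GPU request is rejected; a CPU request within its limit is fine.
	fmt.Println(validateGPULimit("alpha.kubernetes.io/nvidia-gpu", 1, 1, true))
	fmt.Println(validateGPULimit("cpu", 2, 1, true))
}
```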

@erictune erictune added the priority/backlog Higher priority than priority/awaiting-more-evidence. label May 11, 2016
@erictune
Member

LGTM

@erictune erictune added lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-label-needed labels May 11, 2016
@k8s-github-robot

@k8s-bot test this issue: #IGNORE

Tests have been pending for 24 hours

@k8s-bot

k8s-bot commented May 11, 2016

GCE e2e build/test passed for commit 362c763.

const (
// CPU, in cores. (500m = .5 cores)
ResourceCPU ResourceName = "cpu"
// Memory, in bytes. (500Gi = 500GiB = 500 * 1024 * 1024 * 1024)
ResourceMemory ResourceName = "memory"
// Volume size, in bytes (e.g. 5Gi = 5GiB = 5 * 1024 * 1024 * 1024)
ResourceStorage ResourceName = "storage"
// NVIDIA GPU, in devices. Alpha, might change: although fractional and allowing values >1, only one whole device per node is assigned.
Member


I didn't understand this comment. The resource is integer GPUs, not milliGPUs, so how can you specify fractional?

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot

k8s-bot commented May 12, 2016

GCE e2e build/test passed for commit 362c763.


@therc
Member Author

therc commented May 12, 2016

@k8s-bot unit test this issue: #25539


@yifan-gu
Contributor

Can you change the title of this PR? @therc

@yifan-gu yifan-gu changed the title WIP v0 NVIDIA GPU support NVIDIA GPU support May 12, 2016
@yifan-gu
Contributor

nvm, I did it for you :)

@k8s-github-robot

Automatic merge from submit-queue
