
NVIDIA GPU support #24836

Merged
merged 1 commit into from
May 12, 2016
Conversation

therc
Member

@therc therc commented Apr 27, 2016

* Alpha support for scheduling pods on machines with NVIDIA GPUs whose kubelets use the `--experimental-nvidia-gpus` flag, using the alpha.kubernetes.io/nvidia-gpu resource 

Implements part of #24071 for #23587

I am not familiar with the scheduler enough to know what to do with the scores. Mostly punting for now.

Missing items from the implementation plan: limitranger, rkt support, kubectl
support and docs

cc @erictune @davidopp @dchen1107 @vishh @Hui-Zhi @gopinatht
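To make the new resource name concrete, here is a minimal sketch of how a container's limits would carry the alpha GPU resource. This is not the real `api` package: `ResourceList` below is a simplified stand-in (the real one maps `api.ResourceName` to `resource.Quantity`), and only the resource name string is taken from the PR.

```go
package main

import "fmt"

// ResourceNvidiaGPU mirrors the alpha resource name introduced by this PR.
const ResourceNvidiaGPU = "alpha.kubernetes.io/nvidia-gpu"

// ResourceList is a simplified stand-in for api.ResourceList,
// mapping a resource name to an integer quantity.
type ResourceList map[string]int64

func main() {
	// A container that wants one GPU sets it under limits only;
	// per the validation added in this PR, a request must not be set.
	limits := ResourceList{ResourceNvidiaGPU: 1}
	fmt.Println(limits[ResourceNvidiaGPU])
}
```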

@k8s-github-robot k8s-github-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/old-docs size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 27, 2016
@Hui-Zhi
Contributor

Hui-Zhi commented Apr 27, 2016

I already submitted scheduler changes for NVIDIA GPU on Apr 14 (#21446). The scheduler changes you made are similar to mine, but I treat the NVIDIA GPU as

Digit = Format("Digit").

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 27, 2016
@therc therc force-pushed the gpu-impl branch 2 times, most recently from 14794b5 to cb92710 on April 27, 2016 at 18:39
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 27, 2016
@yujuhong yujuhong assigned davidopp and unassigned yujuhong Apr 27, 2016
@yujuhong
Contributor

@davidopp, I assigned this to you to review or delegate scheduler predicate changes.

@yujuhong
Contributor

/cc @kubernetes/sig-node

@k8s-github-robot k8s-github-robot added the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Apr 29, 2016
@eparis eparis removed the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Apr 29, 2016
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 30, 2016
@therc
Member Author

therc commented May 1, 2016

Rebased, squashed and changed int->int32, per @smarterclayton's recent changes

@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 1, 2016
@@ -1817,6 +1817,7 @@ const (
// Volume size, in bytes (e.g. 5Gi = 5GiB = 5 * 1024 * 1024 * 1024)
ResourceStorage ResourceName = "storage"
// Number of Pods that may be running on this Node: see ResourcePods
Contributor


This comment seems to be unrelated to the field below. Is GitHub rendering it incorrectly?

Member Author


Good catch, I think this is from the rebase. Fixing it.

Member Author


It's a placeholder for the defunct maxpods. It misled me into thinking I had already added a description for ResourceNvidiaGPU, so I just moved it to the end.

@vishh
Contributor

vishh commented May 2, 2016

Kubelet changes LGTM

@@ -231,6 +232,7 @@ func (s *KubeletServer) AddFlags(fs *pflag.FlagSet) {
fs.BoolVar(&s.BabysitDaemons, "babysit-daemons", s.BabysitDaemons, "If true, the node has babysitter process monitoring docker and kubelet.")
fs.MarkDeprecated("babysit-daemons", "Will be removed in a future version.")
fs.Int32Var(&s.MaxPods, "max-pods", s.MaxPods, "Number of Pods that can run on this Kubelet.")
fs.Int32Var(&s.NvidiaGPUs, "experimental-nvidia-gpus", s.NvidiaGPUs, "Number of NVIDIA GPU devices on this node.")
Member


Say that only values 0 and 1 are currently supported?

Member Author


Done
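The constraint discussed above (the alpha implementation exposes only /dev/nvidia0, so the flag accepts only 0 or 1) could be enforced with a small check like the following. This is a hypothetical helper for illustration, not code from the PR:

```go
package main

import "fmt"

// validateNvidiaGPUs illustrates the review point above: in the alpha
// implementation only 0 or 1 GPUs per node are supported, so any other
// value for --experimental-nvidia-gpus should be rejected.
func validateNvidiaGPUs(n int32) error {
	if n != 0 && n != 1 {
		return fmt.Errorf("--experimental-nvidia-gpus must be 0 or 1, got %d", n)
	}
	return nil
}

func main() {
	for _, n := range []int32{0, 1, 4} {
		if err := validateNvidiaGPUs(n); err != nil {
			fmt.Println(err)
		} else {
			fmt.Printf("%d GPUs: ok\n", n)
		}
	}
}
```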

@erictune erictune assigned erictune and unassigned erictune May 10, 2016
@therc
Member Author

therc commented May 10, 2016

Updated with validation. I hadn't treated that as urgent, because this is an experimental feature and probably subject to lots of changes.

@josephjacks

@therc For one K8s node with 4 GPUs and four 1-GPU jobs, how would allocation of those GPUs to the jobs work? // @gdb

@therc
Member Author

therc commented May 10, 2016

@josephjacks for v0, only one GPU per machine is supported. See https://github.com/kubernetes/kubernetes/blob/master/docs/proposals/gpu-support.md and #24071

// For GPUs, require that no request be set.
if resourceName == api.ResourceNvidiaGPU {
allErrs = append(allErrs, field.Invalid(reqPath, requestQuantity.String(), "cannot be set"))
} else if quantity.Cmp(requestQuantity) < 0 {
Contributor


Should we also add a quantity.Cmp(requestQuantity) > 1 check? Because so far we only support /dev/nvidia0.

Member


I don't think this is necessary. We would not, for example, validate that your RAM request is smaller than the machine with the most RAM.
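The validation rule under discussion can be sketched in simplified, self-contained form. The function name and the plain int64 quantities below are stand-ins for the real api.ResourceName and resource.Quantity types; only the rule itself (a GPU request must not be set, while other resources need limit >= request) comes from the snippet above:

```go
package main

import "fmt"

// validateGPULimit sketches the validation discussed above: for the alpha
// GPU resource a request must not be set at all, while for every other
// resource a set request must not exceed the limit.
func validateGPULimit(resourceName string, limit, request int64, requestSet bool) error {
	const resourceNvidiaGPU = "alpha.kubernetes.io/nvidia-gpu"
	if resourceName == resourceNvidiaGPU {
		if requestSet {
			return fmt.Errorf("%s: request cannot be set", resourceName)
		}
		return nil
	}
	if requestSet && limit < request {
		return fmt.Errorf("%s: limit must be greater than or equal to request", resourceName)
	}
	return nil
}

func main() {
	// Setting a GPU request is rejected; a CPU request within its limit is fine.
	fmt.Println(validateGPULimit("alpha.kubernetes.io/nvidia-gpu", 1, 1, true))
	fmt.Println(validateGPULimit("cpu", 2, 1, true))
}
```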

@erictune erictune added the priority/backlog Higher priority than priority/awaiting-more-evidence. label May 11, 2016
@erictune
Member

LGTM

@erictune erictune added lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-label-needed labels May 11, 2016
@k8s-github-robot

@k8s-bot test this issue: #IGNORE

Tests have been pending for 24 hours

@k8s-bot

k8s-bot commented May 11, 2016

GCE e2e build/test passed for commit 362c763.

const (
// CPU, in cores. (500m = .5 cores)
ResourceCPU ResourceName = "cpu"
// Memory, in bytes. (500Gi = 500GiB = 500 * 1024 * 1024 * 1024)
ResourceMemory ResourceName = "memory"
// Volume size, in bytes (e.g. 5Gi = 5GiB = 5 * 1024 * 1024 * 1024)
ResourceStorage ResourceName = "storage"
// NVIDIA GPU, in devices. Alpha, might change: although fractional and allowing values >1, only one whole device per node is assigned.
Member


I didn't understand this comment. The resource is integer GPUs, not milliGPUs, so how can you specify fractional?

@k8s-github-robot

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot

k8s-bot commented May 12, 2016

GCE e2e build/test passed for commit 362c763.


@therc
Member Author

therc commented May 12, 2016

@k8s-bot unit test this issue: #25539


@yifan-gu
Contributor

Can you change the title of this PR? @therc

@yifan-gu yifan-gu changed the title WIP v0 NVIDIA GPU support NVIDIA GPU support May 12, 2016
@yifan-gu
Contributor

nvm, I did it for you :)

@k8s-github-robot

Automatic merge from submit-queue
