Changes to be committed: #21446

Hui-Zhi · 2016-02-18T01:37:40Z

new file:   devices.md

This change is

k8s-bot · 2016-02-18T01:38:10Z

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

k8s-bot · 2016-02-18T01:39:09Z

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

derekwaynecarr · 2016-02-18T01:43:39Z

/cc @derekwaynecarr to track

Hui-Zhi · 2016-02-18T01:45:08Z

ok to test

davidopp · 2016-02-23T07:48:36Z

Will take a look after 1.2 push.

k8s-bot · 2016-02-23T21:15:05Z

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

Hui-Zhi · 2016-02-29T07:17:40Z

Could you let me know when 1.2 push could be done? @davidopp

davidopp · 2016-02-29T09:54:08Z

Realistically, I will be able to take a look at this towards the middle/end of this week. Very sorry for the delay.

davidopp · 2016-03-02T08:09:10Z

cc/ @erictune

k8s-bot · 2016-03-05T01:16:16Z

Can one of the admins verify that this patch is reasonable to test in $JOB_NAME? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

dchen1107 · 2016-03-05T01:31:06Z

cc/ @dchen1107

vishh · 2016-03-09T22:18:04Z

docs/design/devices.md

@@ -0,0 +1,34 @@
+# K8s support "/sys/fs/cgroup/devices"


device cgroups is one of the access control mechanisms. I'd like to keep that a kubelet implementation detail. This PR and this proposal should center around GPU support.

What you said make sense, but I think different devices could have difference implementation in kubelet, absolutely, and this time I only want to enable the NVIDIA GPU. But regarding the changes in scheduler, should have an unified interface for all the devices. It kind of like enable all the devices by scheduler's perspective. So considering all changes in kubelet and scheduler to make GPU enabled, what should be the name for the changes? Do you have any ideas about it?

erictune · 2016-03-10T17:30:20Z

Could you add a section that explains what things are the responsibility of:

plugin code
kubelet (not plugin code)
the person who sets up the node
the container image that will use the GPU

Specifically, classify these tasks in the above categories, or state if it is not in scope initially for this design.

modprobe nvidia and nvidia-uvm kernel modules. (or raedeon or whatever)
detect what pci devices exist and their characteristics (e.g. Vendor = Nvidia, Model = "GeForce GTX 480")
map device characteristics to node attributes that can be used for scheduling constraints if multiple devices present.
mknod /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm, etc.
build the list of devices to be exposed in the container (e.g. /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm).
install the CUDA driver (not the kernel module).

I think probably 1 and 4 are the responsibility of the person who sets up the node.
2 and 3 might no be in scope initially (e.g. assume all nodes have the same GPUs, or that the cluster admin applied appropriate labels to nodes already)
I'm not sure about 5.
I think maybe 6 is the responsibility of the container image that uses the GPU (e.g. see https://github.com/ozzyjohnson/docker-cuda/blob/master/Dockerfile). This would allow multiple CUDA driver versions to be used at once, maybe, and to rolling update CUDA versions, to test out new versions, support OpenCL, etc.

Hui-Zhi · 2016-03-14T08:19:42Z

@erictune Sure, I'll add it later.

Hui-Zhi · 2016-04-11T11:03:28Z

@therc @davidopp @erictune @vishh I just updated the documents, please let me know if the answer is not clear for your questions, any feedback is welcome.

therc · 2016-04-11T20:35:31Z

I'm wondering if instead of GPUs and devices, we should just bite the bullet and have a generic resource, like @davidopp suggested in #19049 (comment). For example, I might want to have a VPN tunnel set up. Or a special iptables rule to intercept traffic to 169.254.169.254 and forward it to my container, which is listening on a hostport.

These are all resources whose creation needs root-equivalent capabilities, which hopefully a container will lack, because the administrator has tightened things down. Then a resource plugin, after authentication and authorization, would mediate creation and mapping of the resources on behalf of the user. Administrators could define and implement helpers for all the custom resources they want. In addition to all that @davidopp identified for the GPU case, it'd be great to be able to pass user-defined data to the setup command. In my examples above, that would be the VPN tunnel configuration or the address that the iptables rule needs to intercept.

GPUs would just be the first real-world use of the feature and would go a long way in proving that it works.

Hui-Zhi · 2016-04-12T07:59:08Z

@therc @davidopp I think in the final version of this proposal, the scheduler should have a generic resource, and these resource descriptions should be written in https://github.com/kubernetes/kubernetes/blob/master/docs/design/resources.md. But so far we just need to find a way to enable NVIDIA GPU in Kubernetes, and verify it to ensure it could work, also the changes should be approved by community. I think most part of my code are fine. BTW, treat GPU as a first-class integer resource is what I did in the code. I would like to implement this just use scheduler and kubelet "--device" ability to enable GPU directly, and the code is already there, but need to be refined.
@therc Regarding the PR, I think it's better I merge your part into mine, because I want to just refine my code and submit it soon, and the part of scheduler is what @davidopp wants. There no need to write 2 PR for 1 proposal, but there has something different between our proposal, I think it's about how we use the container, in my proposal, I only put the app which use CUDA libraries into container is enough, the CUDA libraries could be mounted by kubelet from host, but if i understand correctly, you suppose to mount the CUDA libraries by docker volume. We also should add some code to Docker plugin, right? If we could create multiple Docker volumes for different CUDA version, and it works fine, I think it's better. But should we also create volumes for RKT? If we only let kubelet mount libs from host it could be an easier way to support both Docker and RKT.

erictune · 2016-05-03T20:22:16Z

I am now going to look at both this PR and #24836

erictune · 2016-05-03T20:30:48Z

docs/design/devices.md

+      requests:
+        cpu: 100m
+        memory: 100Mi
+        nvidiagpu: 1


I prefer how it is alpha.kubernetes.io/nvidia-gpu, in #24836.

For now. Makes it clear feature is alpha. Later, maybe we rename it to alpha.kubernetes.io/nvidia-gpu, or maybe we find a way to abstract GPU vendors and can call it "gpu"

erictune · 2016-05-03T23:02:48Z

This has some good stuff in it. However #24836, while very similar, is slightly more complete. So I am inclined to take that one. Your comments on that PR wanted.

googlebot · 2016-05-12T01:45:25Z

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.

If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
If you signed the CLA as a corporation, please let us know the company's name.

googlebot added the cla: yes label Feb 18, 2016

k8s-github-robot assigned brendandburns Feb 18, 2016

k8s-github-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 18, 2016

mikedanese assigned davidopp and unassigned brendandburns Feb 18, 2016

mikedanese added the kind/design Categorizes issue or PR as related to design. label Feb 18, 2016

davidopp assigned erictune and unassigned davidopp Mar 5, 2016

Hui-Zhi mentioned this pull request Mar 8, 2016

Nvidia GPU should be supported in K8s, and the feature could be added like this. #19049

Closed

vishh reviewed Mar 9, 2016
View reviewed changes

k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 11, 2016

k8s-github-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/design Categorizes issue or PR as related to design. labels Apr 13, 2016

Hui-Zhi mentioned this pull request Apr 27, 2016

NVIDIA GPU support #24836

Merged

erictune reviewed May 3, 2016
View reviewed changes

googlebot added cla: no and removed cla: yes labels May 12, 2016

Hui-Zhi closed this May 13, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changes to be committed: #21446

Changes to be committed: #21446

Hui-Zhi commented Feb 18, 2016 •

edited by mikedanese

Loading

k8s-bot commented Feb 18, 2016

k8s-bot commented Feb 18, 2016

derekwaynecarr commented Feb 18, 2016

Hui-Zhi commented Feb 18, 2016

davidopp commented Feb 23, 2016

k8s-bot commented Feb 23, 2016

Hui-Zhi commented Feb 29, 2016

davidopp commented Feb 29, 2016

davidopp commented Mar 2, 2016

k8s-bot commented Mar 5, 2016

dchen1107 commented Mar 5, 2016

vishh Mar 9, 2016

Hui-Zhi Mar 14, 2016

erictune commented Mar 10, 2016

Hui-Zhi commented Mar 14, 2016

Hui-Zhi commented Apr 11, 2016

therc commented Apr 11, 2016

Hui-Zhi commented Apr 12, 2016

erictune commented May 3, 2016

erictune May 3, 2016

erictune May 3, 2016

erictune commented May 3, 2016

googlebot commented May 12, 2016

Changes to be committed: #21446

Changes to be committed: #21446

Conversation

Hui-Zhi commented Feb 18, 2016 • edited by mikedanese Loading

k8s-bot commented Feb 18, 2016

k8s-bot commented Feb 18, 2016

derekwaynecarr commented Feb 18, 2016

Hui-Zhi commented Feb 18, 2016

davidopp commented Feb 23, 2016

k8s-bot commented Feb 23, 2016

Hui-Zhi commented Feb 29, 2016

davidopp commented Feb 29, 2016

davidopp commented Mar 2, 2016

k8s-bot commented Mar 5, 2016

dchen1107 commented Mar 5, 2016

vishh Mar 9, 2016

Choose a reason for hiding this comment

Hui-Zhi Mar 14, 2016

Choose a reason for hiding this comment

erictune commented Mar 10, 2016

Hui-Zhi commented Mar 14, 2016

Hui-Zhi commented Apr 11, 2016

therc commented Apr 11, 2016

Hui-Zhi commented Apr 12, 2016

erictune commented May 3, 2016

erictune May 3, 2016

Choose a reason for hiding this comment

erictune May 3, 2016

Choose a reason for hiding this comment

erictune commented May 3, 2016

googlebot commented May 12, 2016

Hui-Zhi commented Feb 18, 2016 •

edited by mikedanese

Loading