Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes to be committed: #21446

Closed
wants to merge 0 commits into from
Closed

Changes to be committed: #21446

wants to merge 0 commits into from

Conversation

Hui-Zhi
Copy link
Contributor

@Hui-Zhi Hui-Zhi commented Feb 18, 2016

new file:   devices.md

This change is Reviewable

@k8s-bot
Copy link

k8s-bot commented Feb 18, 2016

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

1 similar comment
@k8s-bot
Copy link

k8s-bot commented Feb 18, 2016

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

@derekwaynecarr
Copy link
Member

/cc @derekwaynecarr to track

@Hui-Zhi
Copy link
Contributor Author

Hui-Zhi commented Feb 18, 2016

ok to test

@k8s-github-robot k8s-github-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 18, 2016
@mikedanese mikedanese assigned davidopp and unassigned brendandburns Feb 18, 2016
@mikedanese mikedanese added the kind/design Categorizes issue or PR as related to design. label Feb 18, 2016
@davidopp
Copy link
Member

Will take a look after 1.2 push.

@k8s-bot
Copy link

k8s-bot commented Feb 23, 2016

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

@Hui-Zhi
Copy link
Contributor Author

Hui-Zhi commented Feb 29, 2016

Could you let me know when 1.2 push could be done? @davidopp

@davidopp
Copy link
Member

Realistically, I will be able to take a look at this towards the middle/end of this week. Very sorry for the delay.

@davidopp
Copy link
Member

davidopp commented Mar 2, 2016

cc/ @erictune

@k8s-bot
Copy link

k8s-bot commented Mar 5, 2016

Can one of the admins verify that this patch is reasonable to test in $JOB_NAME? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

@dchen1107
Copy link
Member

cc/ @dchen1107

@@ -0,0 +1,34 @@
# K8s support "/sys/fs/cgroup/devices"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

device cgroups is one of the access control mechanisms. I'd like to keep that a kubelet implementation detail. This PR and this proposal should center around GPU support.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you said make sense, but I think different devices could have difference implementation in kubelet, absolutely, and this time I only want to enable the NVIDIA GPU. But regarding the changes in scheduler, should have an unified interface for all the devices. It kind of like enable all the devices by scheduler's perspective. So considering all changes in kubelet and scheduler to make GPU enabled, what should be the name for the changes? Do you have any ideas about it?

@erictune
Copy link
Member

Could you add a section that explains what things are the responsibility of:

  1. plugin code
  2. kubelet (not plugin code)
  3. the person who sets up the node
  4. the container image that will use the GPU

Specifically, classify these tasks in the above categories, or state if it is not in scope initially for this design.

  1. modprobe nvidia and nvidia-uvm kernel modules. (or raedeon or whatever)
  2. detect what pci devices exist and their characteristics (e.g. Vendor = Nvidia, Model = "GeForce GTX 480")
  3. map device characteristics to node attributes that can be used for scheduling constraints if multiple devices present.
  4. mknod /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm, etc.
  5. build the list of devices to be exposed in the container (e.g. /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm).
  6. install the CUDA driver (not the kernel module).

I think probably 1 and 4 are the responsibility of the person who sets up the node.
2 and 3 might no be in scope initially (e.g. assume all nodes have the same GPUs, or that the cluster admin applied appropriate labels to nodes already)
I'm not sure about 5.
I think maybe 6 is the responsibility of the container image that uses the GPU (e.g. see https://github.com/ozzyjohnson/docker-cuda/blob/master/Dockerfile). This would allow multiple CUDA driver versions to be used at once, maybe, and to rolling update CUDA versions, to test out new versions, support OpenCL, etc.

@Hui-Zhi
Copy link
Contributor Author

Hui-Zhi commented Mar 14, 2016

@erictune Sure, I'll add it later.

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 11, 2016
@Hui-Zhi
Copy link
Contributor Author

Hui-Zhi commented Apr 11, 2016

@therc @davidopp @erictune @vishh I just updated the documents, please let me know if the answer is not clear for your questions, any feedback is welcome.

@therc
Copy link
Member

therc commented Apr 11, 2016

I'm wondering if instead of GPUs and devices, we should just bite the bullet and have a generic resource, like @davidopp suggested in #19049 (comment). For example, I might want to have a VPN tunnel set up. Or a special iptables rule to intercept traffic to 169.254.169.254 and forward it to my container, which is listening on a hostport.

These are all resources whose creation needs root-equivalent capabilities, which hopefully a container will lack, because the administrator has tightened things down. Then a resource plugin, after authentication and authorization, would mediate creation and mapping of the resources on behalf of the user. Administrators could define and implement helpers for all the custom resources they want. In addition to all that @davidopp identified for the GPU case, it'd be great to be able to pass user-defined data to the setup command. In my examples above, that would be the VPN tunnel configuration or the address that the iptables rule needs to intercept.

GPUs would just be the first real-world use of the feature and would go a long way in proving that it works.

@Hui-Zhi
Copy link
Contributor Author

Hui-Zhi commented Apr 12, 2016

@therc @davidopp I think in the final version of this proposal, the scheduler should have a generic resource, and these resource descriptions should be written in https://github.com/kubernetes/kubernetes/blob/master/docs/design/resources.md. But so far we just need to find a way to enable NVIDIA GPU in Kubernetes, and verify it to ensure it could work, also the changes should be approved by community. I think most part of my code are fine. BTW, treat GPU as a first-class integer resource is what I did in the code. I would like to implement this just use scheduler and kubelet "--device" ability to enable GPU directly, and the code is already there, but need to be refined.
@therc Regarding the PR, I think it's better I merge your part into mine, because I want to just refine my code and submit it soon, and the part of scheduler is what @davidopp wants. There no need to write 2 PR for 1 proposal, but there has something different between our proposal, I think it's about how we use the container, in my proposal, I only put the app which use CUDA libraries into container is enough, the CUDA libraries could be mounted by kubelet from host, but if i understand correctly, you suppose to mount the CUDA libraries by docker volume. We also should add some code to Docker plugin, right? If we could create multiple Docker volumes for different CUDA version, and it works fine, I think it's better. But should we also create volumes for RKT? If we only let kubelet mount libs from host it could be an easier way to support both Docker and RKT.

@k8s-github-robot k8s-github-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/design Categorizes issue or PR as related to design. labels Apr 13, 2016
@Hui-Zhi Hui-Zhi mentioned this pull request Apr 27, 2016
@erictune
Copy link
Member

erictune commented May 3, 2016

I am now going to look at both this PR and #24836

requests:
cpu: 100m
memory: 100Mi
nvidiagpu: 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer how it is alpha.kubernetes.io/nvidia-gpu, in #24836.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now. Makes it clear feature is alpha. Later, maybe we rename it to alpha.kubernetes.io/nvidia-gpu, or maybe we find a way to abstract GPU vendors and can call it "gpu"

@erictune
Copy link
Member

erictune commented May 3, 2016

This has some good stuff in it. However #24836, while very similar, is slightly more complete. So I am inclined to take that one. Your comments on that PR wanted.

@googlebot
Copy link

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please let us know the company's name.

@k8s-github-robot k8s-github-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/design Categorizes issue or PR as related to design. kind/old-docs size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 12, 2016
@Hui-Zhi Hui-Zhi closed this May 13, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/design Categorizes issue or PR as related to design. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.