Extend Tensorboard Controller to Support PVCs #5039
Comments
Issue Label Bot is not confident enough to auto-label this issue.
I have extended the Tensorboard Controller so that it supports PVCs, as described above.
Also, there is another problem, which is related to this Kubernetes issue. More specifically, if we use a [...]
I think it is really important that PVCs with ReadWriteOnce access mode are fully supported, because ReadWriteOnce volumes are the most widely used volumes out there. In order to fix the problem, I propose that [...]
This feels like a bit of a brittle solution. My interpretation of [...]
It's not clear to me why supporting [...]
If people really want to use PVs, then it seems reasonable to require that either [...]
Yes, supporting object stores is an option for storing/retrieving event files, but I think that PVCs provide an easier/more intuitive workflow for the end user. More specifically, users might prefer using PVCs instead of object stores because:
Let's also not forget that users are already using PVCs, and mostly
I think there might be a misunderstanding here. Now, the problem arises when pods, running on different nodes, try to access the same [...]
Thanks @kandrio98, these are great points. Based on your points, I took a deeper look at the code to understand what's going on. See my comment here: #3578 (comment). Should we resolve that first before getting too deep into how we manage PVCs? Some inline comments below.
If you make this work by forcing a bunch of pods to be scheduled on the same node, will this lead to unexpected failures? For example, suppose I start a TB pod and it is the first pod to use a PVC. TB doesn't really need GPUs, so it gets scheduled on a node with no GPUs. Now I schedule a Jupyter pod which requests the PVC and a GPU. So now you have conflicting scheduling constraints: you want to schedule on a different node with a GPU, but also on the same node as the PVC. Can we offload these types of scheduling decisions to the cluster?
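To make the conflict concrete, here is a toy Python model of the scenario above. It is only a sketch under stated assumptions, not how the Kubernetes scheduler works: the names Pod, feasible_nodes, cpu-node, gpu-node and logs-pvc are all hypothetical, and the only constraints modelled are "needs a GPU" and "must run on the node where the RWO claim is already in use".

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Pod:
    name: str
    needs_gpu: bool
    rwo_claim: Optional[str] = None  # name of a ReadWriteOnce PVC, if any


def feasible_nodes(
    pod: Pod,
    node_has_gpu: Dict[str, bool],
    claim_node: Dict[str, str],
) -> Set[str]:
    """Return the nodes that satisfy both the GPU and the same-node PVC constraint.

    claim_node maps an RWO claim name to the node where it is already in use
    (hypothetical bookkeeping, used only for this illustration).
    """
    # Nodes that satisfy the GPU requirement.
    nodes = {n for n, gpu in node_has_gpu.items() if gpu or not pod.needs_gpu}
    # If the claim is already in use on some node, the pod is pinned to it.
    pinned = claim_node.get(pod.rwo_claim) if pod.rwo_claim else None
    if pinned is not None:
        nodes &= {pinned}
    return nodes


nodes = {"cpu-node": False, "gpu-node": True}

# TensorBoard was scheduled first and landed on the node without GPUs.
claim_in_use = {"logs-pvc": "cpu-node"}

tensorboard = Pod("tensorboard", needs_gpu=False, rwo_claim="logs-pvc")
jupyter = Pod("jupyter", needs_gpu=True, rwo_claim="logs-pvc")

print(feasible_nodes(tensorboard, nodes, claim_in_use))
print(feasible_nodes(jupyter, nodes, claim_in_use))
```

In this model the Jupyter pod has no feasible node once the claim is pinned to the CPU-only node, whereas it could have used the GPU node had it been scheduled first.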
Even for Cloud? In Cloud if you store your data on GCS or S3 you get the benefit of
Isn't that easier than worrying about moving data to/from disks? When would you recommend using PVCs over S3/GCS in Cloud? Generally, I recommend PVCs only when dealing with legacy applications that don't speak object store. TensorBoard is not such an application. TensorBoard and TensorFlow can read/write to S3 and GCS directly. Jupyter is a legacy application. The notebook gets stored on a PVC because Jupyter doesn't know how to read/write GCS. I think storing on object store would be much better. I think that's why Netflix built bookstore to make it easy to save your notebooks on S3.
Currently, the Tensorboard Controller provides very basic functionality for PVCs. My proposition aimed at making the use of RWO PVCs a little easier for users. That said, the extra scheduling functionality that I proposed could become optional. More specifically, we could add an environment variable, say [...]
@jlewi what are your thoughts on that?
@kandrio98 See my comment #3578 (comment). Would that help us make progress toward the long-term goal of having a solid story for web apps? Or would it just end up being removed later? If it's just going to be removed later in favor of something more general (e.g. a pod template spec), is it still worth doing?
First of all, I really agree with your thoughts on creating a solid story for managing our stateful web apps in a cohesive way.
I believe that this added scheduling functionality (even if optional) would be useful, currently and in the future. It would make it easier for users to use the Tensorboard web-app in the case that their event files are stored in a RWO PVC, regardless of the Custom Resource that we are using. Also, since the implementation of this UX feature is part of my GSoC proposal, I have written the necessary code. @jlewi what do you propose?
/kind feature
Currently, the Tensorboard controller does not fully support arbitrary PVCs being mounted to pods running the Tensorboard server. In fact, it can only support PVCs named tb-volume.
I would like to extend the Tensorboard controller so that it fully supports any PVC created by the user in the namespace of the Tensorboard object applied. This is actually a part of my Google Summer of Code project for building a Tensorboard UI that also supports PVCs.
My proposition would be to extend the Tensorboard controller so that it obtains the name of the PVC from the spec.logspath field of the Tensorboard CR applied by the user. In detail, if a user wants to mount a PVC to a pod running the Tensorboard server, then the value of spec.logspath should follow the format pvc://{pvc-name}/{local-path} (similar to what kfserving uses in its InferenceService CRDs). This way, I will be able to extend the code of the controller so that it creates a deployment that mounts the desired PVC to the pod running the Tensorboard server.
The extended controller will also be backwards compatible.
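For illustration, the proposed pvc://{pvc-name}/{local-path} format could be parsed roughly as follows. This is a minimal Python sketch, not the controller's actual code: parse_logspath and PvcLogsPath are hypothetical names, and returning None for any non-pvc:// value is an assumption about how the existing behaviour would be kept for backwards compatibility.

```python
from typing import NamedTuple, Optional

PVC_SCHEME = "pvc://"


class PvcLogsPath(NamedTuple):
    pvc_name: str    # name of the PVC to mount
    local_path: str  # path inside the PVC that holds the event files


def parse_logspath(logspath: str) -> Optional[PvcLogsPath]:
    """Split a pvc://{pvc-name}/{local-path} value into its two parts.

    Returns None when the value does not use the pvc:// scheme, so that the
    caller can fall back to whatever it did before (assumed behaviour).
    """
    if not logspath.startswith(PVC_SCHEME):
        return None
    remainder = logspath[len(PVC_SCHEME):]
    # Everything up to the first "/" is the PVC name; the rest is the sub-path.
    pvc_name, _, local_path = remainder.partition("/")
    if not pvc_name:
        raise ValueError(f"missing PVC name in {logspath!r}")
    return PvcLogsPath(pvc_name, local_path)


print(parse_logspath("pvc://my-pvc/logs/run1"))
print(parse_logspath("gs://my-bucket/logs"))
```

With the parsed name, the controller would know which PVC to reference in the deployment's volume, and the local path would tell it where under the mount point TensorBoard should look for event files.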
cc @jlewi @kimwnasptd @elikatsis