Web-app to manage tensorboard instances #3578

Closed
jlewi opened this issue Jun 30, 2019 · 33 comments

Comments

@jlewi
Contributor

jlewi commented Jun 30, 2019

We should consider creating a standalone web app similar to our jupyter web app that makes it easy for folks to create/delete tensorboard instances.

I suspect copying and modifying the jupyter web app to create a web app for managing tensorboard would be pretty straightforward.

Some pointers to get started

  • Here is a link to the current jupyter web app code
  • The notebook in kubeflow/examples#723 has an example of deploying TensorBoard by creating
    1. An Istio VirtualService
    2. A K8s service
    3. A Deployment
  • The code for the tensorboard controller is here

This project could be broken down into several pieces

  1. Create a Web App for spinning up TensorBoard instances
  2. Create a Docker image for building it
  3. Create a kustomize manifest for deploying it
  4. Update Kubeflow's CD Pipelines to continually update the kustomize manifest with the latest image

Pipelines (@neuromage) creates a Viewer CRD that can be used for TensorBoard.
https://github.com/kubeflow/pipelines/tree/master/backend/src/crd/controller/viewer

Currently I believe that is integrated into pipelines and used to auto-visualize tensorboard data reported by pipelines.

@kimwnasptd Thoughts? Any interest in possibly tackling this as part of 0.7?

/cc @karthikv2k
/cc @neuromage
/cc @vkoukis

@issue-label-bot

Issue-Label Bot is automatically applying the label improvement/enhancement to this issue, with a confidence of 0.90. Please mark this comment with 👍 or 👎 to give our bot feedback!


@kimwnasptd
Member

@jlewi yes, I would like to contribute to this for 0.7. We could start with a first implementation that provides a UX very similar to the Jupyter web app and then iterate on that.

@jlewi
Contributor Author

jlewi commented Jul 15, 2019

/assign @kimwnasptd

@jlewi
Contributor Author

jlewi commented Aug 27, 2019

@kimwnasptd I'm punting this from 0.7 and downgrading to P2.

I think graduating the Jupyter infrastructure to 1.0 is more important.

@jbottum added the area/centraldashboard (UI/UX improvements for Kubeflow central dashboard / landing page) and area/design labels and removed the area/centraldashboard label on Oct 2, 2019

@jlewi
Contributor Author

jlewi commented Nov 26, 2019

@kimwnasptd any update on this?

I think there was a mention of the fact that you had a prototype for a UI? Do you have an ETA for when you will have something ready to demo and then a potential PR?

A TensorBoard controller was recently added
https://github.com/kubeflow/kubeflow/tree/master/components/tensorboard-controller

@gaocegege
Member

The Tensorboard CRD seems to be very simple right now. I am wondering if we should use a Deployment directly. Is there any benefit to introducing a new CRD here?

@jlewi
Contributor Author

jlewi commented Mar 12, 2020

@kimwnasptd are you interested in potentially being a mentor for any GSOC students interested in working on this?

@KaviyaPeriyasamy

@kimwnasptd I am interested in doing my GSoC contribution on this project. Can you guide me through the next steps and the environment setup?

@sarahmaddox
Contributor

/area gsoc

@k8s-ci-robot
Contributor

@sarahmaddox: The label(s) area/gsoc cannot be applied, because the repository doesn't have them

In response to this:

/area gsoc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sarahmaddox
Contributor

The label(s) area/gsoc cannot be applied, because the repository doesn't have them

@jlewi Please could you check the label sync job? I added the area/gsoc label in PR kubeflow/testing#623.

@jlewi
Contributor Author

jlewi commented Mar 19, 2020

@sarahmaddox the label is there.

@jlewi
Contributor Author

jlewi commented Jul 7, 2020

Thanks @kimwnasptd

Here's my conjecture; it looks like the path you are on will lead you to eventually putting a PodTemplateSpec in the TensorBoardController (or the equivalent)

Per #5039 you want to add nodeAffinity and presumably volumes as well.

To support object storage you need to support setting secrets and service accounts

  • Since the secrets are hard-coded today, object storage isn't really supported in practice; e.g. the controller hard-codes GCP secrets, which are no longer created when using workload identity

To accommodate different versions of TensorBoard we will likely need to allow the Docker image to be specified as well.

At that point, if we aren't using a PodTemplateSpec, the question becomes: which PodTemplateSpec fields aren't being exposed?

So it seems like what we really need is an easily extensible story for managing stateful web applications; e.g. jupyter and tensorboard.
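
To make the argument concrete, here is a minimal sketch of a spec that carries pod-level overrides and merges them with controller defaults. `PodOverrides`, `TensorboardSpec`, and `withDefaults` are hypothetical names invented for illustration; a real implementation would more likely embed the Kubernetes PodTemplateSpec type directly. The secret and bucket names in `main` are also assumptions.

```go
package main

import "fmt"

// PodOverrides is a hypothetical, trimmed-down stand-in for the
// PodTemplateSpec fields this thread asks for.
type PodOverrides struct {
	Image              string            // TensorBoard image/version
	ServiceAccountName string            // identity used to reach object storage
	EnvFromSecrets     []string          // Secrets exposed as env vars
	NodeSelector       map[string]string // simple stand-in for nodeAffinity
}

// TensorboardSpec is a hypothetical CR spec that carries the overrides.
type TensorboardSpec struct {
	LogsPath string
	Template PodOverrides
}

// withDefaults returns the user's overrides with unset fields filled
// from the controller's defaults. User-provided values always win.
func withDefaults(user, defaults PodOverrides) PodOverrides {
	out := user
	if out.Image == "" {
		out.Image = defaults.Image
	}
	if out.ServiceAccountName == "" {
		out.ServiceAccountName = defaults.ServiceAccountName
	}
	if out.EnvFromSecrets == nil {
		out.EnvFromSecrets = defaults.EnvFromSecrets
	}
	if out.NodeSelector == nil {
		out.NodeSelector = defaults.NodeSelector
	}
	return out
}

func main() {
	defaults := PodOverrides{Image: "tensorflow/tensorflow:2.1.0", ServiceAccountName: "default"}
	spec := TensorboardSpec{
		LogsPath: "s3://my-bucket/logs", // hypothetical bucket
		Template: PodOverrides{EnvFromSecrets: []string{"my-s3-secret"}}, // hypothetical Secret
	}
	fmt.Printf("%+v\n", withDefaults(spec.Template, defaults))
}
```

Each new requirement (image, secrets, service account, node placement) adds another field to `PodOverrides`, which suggests that exposing the full pod template from the start may be simpler.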

@kimwnasptd
Member

@jlewi

To support object storage you need to support setting secrets and service accounts.

To accommodate different versions of TensorBoard we will likely need to allow the Docker image to be specified as well.

So it seems like what we really need is an easily extensible story for managing stateful web applications; e.g. jupyter and tensorboard.

I really agree with your thought process and points, and I also think that having an extensible story for deploying our stateful apps, like Jupyter, Tensorboard, Theia, etc., is a step in the right direction.

My only concern right now is that I don't want the GSoC project to fall behind schedule or get blocked by this transition to a more abstract/reusable way of deploying our apps.

The ideal scenario for me would be to:

  1. Continue the GSoC project with the existing Tensorboard Controller, which provides very basic functionality for object stores and PVCs. It would be the alpha version, just to introduce users to using Tensorboard with Kubeflow
  2. Start a discussion in parallel and figure out a cohesive story for deploying our stateful apps.
    • I think a good first step here would be to create a design doc based on our Notebooks Controller. It should address what features it already provides and how it deploys/exposes Notebooks. Then we could start iterating from there on which parts could be generalized or extended [ for example the VirtualService's endpoint, or the ports that get exposed with the k8s Service ]
  3. Once we've settled and agreed on our story for stateful apps and have an implementation [ could be a new Custom Resource? ], then let's make the Tensorboard web app align with it.

@jlewi do you find the above plan reasonable?
Would you like us to proceed differently?

@jlewi
Contributor Author

jlewi commented Jul 10, 2020

LGTM

@kimwnasptd
Member

@jlewi @sarahmaddox, @kandrio98 @elikatsis and I are really excited to inform you that we have a first iteration of a web app for managing Tensorboard instances!

@kandrio98 made many contributions, both to the Tensorboard Controller (#5069 #5218 #5262 #5266) and to the actual web app (#5259 #5180 #5267). He has an e2e view of how to deploy a Tensorboard instance, all the way from the user's perspective in the UI down to the k8s controller that handles the CRs.

You can take a quick look at the app in the frontend's PR #5267.

With this we can start discussing together what the next steps can be. Some things that come to mind:

  • Extend the Central Dashboard to allow users to navigate to the new web app
  • Create scripts for continuously building images for the web app
  • Create the manifests for deploying the web app and the Tensorboard Controller
  • Deprecate the Tensorboard CR and move towards a more generalized Kubeflow CR as we've discussed in this issue. I'll be submitting the PRs for creating a Notebooks WG soon and this will be one of the top priorities we will be discussing.

All in all, I believe @kandrio98 learned a lot through this project, and it was a pleasure for both @elikatsis and me to mentor him.

@sarahmaddox
Contributor

Well done @kandrio98, @kimwnasptd, and @elikatsis!

@stale

stale bot commented Nov 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale bot closed this as completed on Dec 12, 2020

@davidspek
Contributor

/lifecycle frozen

@davidspek
Contributor

@kimwnasptd

I really agree with your thought process and points, and I also think that having an extensible story for deploying our stateful apps, like Jupyter, Tensorboard, Theia, etc., is a step in the right direction.

This would also address an old issue which has since been closed regarding "how to spin up a web application on a Kubeflow cluster": kubeflow/website#2044

@wyljpn

wyljpn commented Feb 7, 2022

@kimwnasptd @jlewi Thank you for your contribution to TWA.
I was wondering whether we can load logs from MinIO for Tensorboard because my logs were saved in MinIO.

I found an example that loads logs saved in GCS (gs://).

https://github.com/kubeflow/kubeflow/blob/v1.3-branch/components/tensorboard-controller/config/samples/tensorboard_v1alpha1_tensorboard.yaml

apiVersion: tensorboard.kubeflow.org/v1alpha1
kind: Tensorboard
metadata:
  name: tensorboard-sample1
  namespace: kubeflow-quanlin
spec:
  # Add fields here
  logspath: "gs://quanlinkubeflow1_tf_mnist_logs1/mnist_tutorial/"

I tried to replace the logspath with my MinIO path, but it did not work.
Could you kindly tell me how to configure it right?

@ConverJens

@wyljpn That's because TensorBoard uses tf.io to connect to filesystems other than local and GCS, and it requires some env vars to be set on the TensorBoard container. You need to set the following:

S3_ENDPOINT: "your-s3-endpoint:minio-port" #Notice there should be no http/https prefix here!
AWS_ACCESS_KEY_ID: "your-s3-access-key"
AWS_SECRET_ACCESS_KEY: "your-s3-secret-key"
S3_USE_HTTPS: "1"
S3_VERIFY_SSL: "1"
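
As a minimal sketch of the rule above, the helper below builds that variable list and strips any `http://` or `https://` scheme from the endpoint. `s3Env` is a hypothetical helper, not part of TensorBoard or the controller, and tying `S3_VERIFY_SSL` to the HTTPS setting is a simplifying assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// s3Env returns KEY=VALUE pairs for an S3-compatible store such as MinIO.
// The endpoint must be host:port only, so any scheme prefix is removed.
// Assumption: SSL verification is enabled exactly when HTTPS is used.
func s3Env(endpoint, accessKey, secretKey string, useHTTPS bool) []string {
	endpoint = strings.TrimPrefix(endpoint, "https://")
	endpoint = strings.TrimPrefix(endpoint, "http://")

	flag := "0"
	if useHTTPS {
		flag = "1"
	}
	return []string{
		"S3_ENDPOINT=" + endpoint,
		"AWS_ACCESS_KEY_ID=" + accessKey,
		"AWS_SECRET_ACCESS_KEY=" + secretKey,
		"S3_USE_HTTPS=" + flag,
		"S3_VERIFY_SSL=" + flag,
	}
}

func main() {
	// Placeholder endpoint and credentials.
	for _, kv := range s3Env("http://your-s3-endpoint:9000", "your-s3-access-key", "your-s3-secret-key", false) {
		fmt.Println(kv)
	}
}
```

The resulting slice has the same shape as the `Env` field of `exec.Cmd`, so it could be appended to `os.Environ()` when launching `tensorboard` from a wrapper process.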

@wyljpn

wyljpn commented Feb 7, 2022

Hi, @ConverJens so glad to hear from you again.
I exec'd into a Tensorboard pod and tested it, and it worked.

$ export S3_ENDPOINT=minio-service-second.rflow-test.svc.cluster.local:9000
$ export AWS_ACCESS_KEY_ID=minio
$ export AWS_SECRET_ACCESS_KEY=minio123
$ tensorboard --logdir s3://tfx/rflow-test/ccp5/Trainer/model_run/248/
2022-02-07 09:18:03.242317: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-02-07 09:18:03.242400: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all

So I think if we can pass the env vars to a Tensorboard pod, it could load logs from MinIO.

But how can we pass the env vars conveniently?
Can we define a Secret and then reference it in the YAML file?

apiVersion: tensorboard.kubeflow.org/v1alpha1
kind: Tensorboard
metadata:
  name: tensorboard-sample1
  namespace: kubeflow-quanlin
spec:
  logspath: "gs://quanlinkubeflow1_tf_mnist_logs1/mnist_tutorial/"
  # SecretName: minio-s3-secret # Can we specify secret here?

I read the source code of the Tensorboard controller. It seems to support only GCP: it mounts a hard-coded GCP secret.
https://github.com/kubeflow/kubeflow/blob/v1.3-branch/components/tensorboard-controller/controllers/tensorboard_controller.go#L213

There is no code for supporting S3 in the Tensorboard controller.
Does that mean we have to modify it for S3, as in what I wrote below?

	} else if isS3Path(tb.Spec.LogsPath) {
		// In this case, an S3 bucket is used as log storage for the Tensorboard server.
		volumeMounts = append(volumeMounts, corev1.VolumeMount{
			Name:      "s3-creds",
			ReadOnly:  true,
			MountPath: "/secret/s3",
		})
		volumes = append(volumes, corev1.Volume{
			Name: "s3-creds",
			VolumeSource: corev1.VolumeSource{
				Secret: &corev1.SecretVolumeSource{
					SecretName: "minio-s3-secret",
				},
			},
		})
	}
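
The snippet above calls `isS3Path`, which is not defined in the comment. A minimal sketch of what such a helper could look like, assuming it only needs to check the URL scheme (the name and behavior are assumptions, not code taken from the controller):

```go
package main

import (
	"fmt"
	"strings"
)

// isS3Path reports whether a logs path uses the s3:// scheme.
// Hypothetical helper: it checks only the prefix, not whether the
// bucket exists or is reachable.
func isS3Path(path string) bool {
	return strings.HasPrefix(path, "s3://")
}

func main() {
	paths := []string{
		"s3://tfx/rflow-test/ccp5/Trainer/model_run/248/",
		"gs://quanlinkubeflow1_tf_mnist_logs1/mnist_tutorial/",
	}
	for _, p := range paths {
		fmt.Printf("%s -> isS3Path=%v\n", p, isS3Path(p))
	}
}
```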

@wyljpn

wyljpn commented Feb 8, 2022

I added EnvFrom to the container spec, and it now supports S3-compatible object storage.
https://github.com/kubeflow/kubeflow/blob/v1.3-branch/components/tensorboard-controller/controllers/tensorboard_controller.go

Containers: []corev1.Container{
	{
		Name:  "tensorboard",
		Image: "tensorflow/tensorflow:2.1.0",
		...
		EnvFrom: []corev1.EnvFromSource{
			{
				SecretRef: &corev1.SecretEnvSource{
					LocalObjectReference: corev1.LocalObjectReference{
						Name: "minio-tb-secret",
					},
				},
			},
		},
	},
},

The referenced Secret:

apiVersion: v1
kind: Secret
metadata:
  name: minio-tb-secret
  namespace: rflow
  annotations:
      serving.kubeflow.org/s3-endpoint: minio-service-second.rflow-test.svc.cluster.local:9000
      serving.kubeflow.org/s3-usehttps: "0"
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: minio
  AWS_SECRET_ACCESS_KEY: minio123
  S3_ENDPOINT: minio-service.kubeflow.svc.cluster.local:9000
  S3_USE_HTTPS: "0"
  S3_VERIFY_SSL: "false"

@laserK3000

Can this change be merged?

@kimwnasptd
Member

Cross-posting here: let's track this feature in #6493.
