TensorBoard Integration #13
I was thinking that it would work something like the following. The user would just specify the location of the checkpoint in the TfJob, e.g. something like this:
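(A rough sketch; only the tensorboard / log_dir part is the point here, the rest mirrors an ordinary TfJob and the names are illustrative.)

apiVersion: "mlkube.io/v1beta1"
kind: "TfJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tensorboard:
    log_dir: /var/tensorflow/check_point_dir   # where the job writes its checkpoints/event files
  replica_specs:
    - replicas: 1
      tf_port: 2222
      tf_replica_type: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample_gpu:latest
              name: tensorflow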
The TfJob operator would then just create/manage a ReplicaSet running TensorBoard with the specified log_dir. The lifetime of that ReplicaSet would be tied to the lifetime of the TfJob.

My first assumption is that users are already using some sort of storage not tied to pod lifetime to preserve their models, e.g. GCS, HDFS, or NFS, so I don't think we need to run TensorBoard in a sidecar. To support this we might need to extend the spec for TensorBoard to allow specifying volume mounts where the checkpoints are located.

My second simplifying assumption is that we don't need to expose the TensorBoard service outside of the cluster. My expectation is that users will have some method of connecting to services running on the cluster, e.g. kubectl proxy or, on GKE, IAP. For more complicated scenarios, e.g. tying TensorBoard to a load balancer, it might make sense for users to set that job up separately.

Thoughts?
Regarding storage, to make sure I understand correctly: when dealing with something like Azure Files or GCE Persistent Disks, the template would instead look like:

apiVersion: "mlkube.io/v1beta1"
kind: "TfJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tensorboard:
    log_dir: /var/tensorflow/check_point_dir
    volumes:
      - name: azurefile
        azureFile:
          secretName: azure-secret
          shareName: data
    volumeMounts:
      - name: azurefile
        mountPath: "/var/tensorflow/"
  replica_specs:
    - replicas: 1
      tf_port: 2222
      tf_replica_type: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample_gpu:latest
              name: tensorflow
              resources:
                limits:
                  alpha.kubernetes.io/nvidia-gpu: 1
              volumeMounts:
                - name: azurefile
                  mountPath: "/var/tensorflow/"
          volumes:
            - name: azurefile
              azureFile:
                secretName: azure-secret
                shareName: data

Am I understanding this correctly?
> For more complicated scenarios, e.g. tying TensorBoard to a load balancer, it might make sense for users to set that job up separately.

Agreed 👍
Your spec looks good to me. If you want to take that one, that would be great.

On GCP you don't have to mount GCS because GCS is GCP's object storage system. Any process can read/write to GCS as long as it has appropriate credentials, and TensorFlow supports reading/writing directly to GCS, so you don't have to explicitly mount GCS as a volume. You could do something similar with HDFS, since TensorFlow can also read/write HDFS.

Can Azure mount a volume in multiple pods simultaneously? On GCP a PD can only be mounted with write permissions on a single VM, so if you wanted to use PD you would probably want to set up NFS backed by PD and then use the K8s support for NFS volumes. (That's just an FYI, not really related to this issue.)
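For the GCS case the tensorboard section could then, for instance, point log_dir straight at a bucket with no volumes section at all (bucket path hypothetical):

tensorboard:
  log_dir: gs://my-bucket/tf-smoke-gpu/check_point_dir   # read directly via TensorFlow's GCS support, nothing to mount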
Sure thing!

Yes, Azure Files supports mounting in multiple pods simultaneously, which is quite nice for this use case.
How do you see TensorBoard integrating with this solution?
It would be really cool if I could create a template, ask for TensorBoard to be deployed as well, and receive either a ClusterIP or a public IP.
For example, this could look like:
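(A rough sketch of what I mean; the exact shape of the tensorboard block and the service_type field are just illustrations, not a worked-out proposal.)

apiVersion: "mlkube.io/v1beta1"
kind: "TfJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tensorboard:
    log_dir: /var/tensorflow/logs
    service_type: ClusterIP   # or LoadBalancer for a public IP
  replica_specs:
    - replicas: 1
      tf_port: 2222
      tf_replica_type: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample_gpu:latest
              name: tensorflow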
TensorBoard would then run as a sidecar in the master's pod.
Now the main issue here is accessing the log files.
An easy way would be to document a certain convention. For example, we assume that the log files are saved under
/var/tensorflow/logs
and then mount this directory into the TensorBoard container through the node.

This also raises the question of data persistence: in this state, once the job shuts down, all data is lost. Do you think we need to address this question right away, or could this be discussed later on?
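For illustration, the master's pod might then end up with something like the following (the sidecar name, image, and the emptyDir choice are all assumptions; a hostPath on the node would be another way to share the directory):

template:
  spec:
    containers:
      - name: tensorflow
        image: gcr.io/tf-on-k8s-dogfood/tf_sample_gpu:latest
        volumeMounts:
          - name: tf-logs
            mountPath: /var/tensorflow/logs
      - name: tensorboard                       # hypothetical sidecar
        image: tensorflow/tensorflow:latest     # assumed image that ships the tensorboard binary
        command: ["tensorboard", "--logdir=/var/tensorflow/logs", "--host=0.0.0.0"]
        ports:
          - containerPort: 6006
        volumeMounts:
          - name: tf-logs
            mountPath: /var/tensorflow/logs
    volumes:
      - name: tf-logs
        emptyDir: {}   # shared between the two containers; gone when the pod goes away, which is exactly the persistence question above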
Happy to work on this if you approve.