TensorBoard Integration #13
I was thinking that it would work something like the following. The user would just specify the location of the checkpoint in the TfJob, e.g. something like this:
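(A rough sketch; only the tensorboard / log_dir part is the point here, the rest mirrors an ordinary TfJob and the names are illustrative.)

apiVersion: "mlkube.io/v1beta1"
kind: "TfJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tensorboard:
    log_dir: /var/tensorflow/check_point_dir   # where the job writes its checkpoints/event files
  replica_specs:
    - replicas: 1
      tf_port: 2222
      tf_replica_type: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample_gpu:latest
              name: tensorflow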
The TfJob operator would then just create/manage a ReplicaSet running TensorBoard with the specified log_dir. The lifetime of that ReplicaSet would be tied to the lifetime of the TfJob.

My first assumption is that users are already using some sort of storage not tied to pod lifetime to preserve their models, e.g. GCS, HDFS, or NFS, so I don't think we need to run TensorBoard in a sidecar. To support this we might need to extend the spec for TensorBoard to allow specifying volume mounts where the checkpoints are located.

My second simplifying assumption is that we don't need to expose the TensorBoard service outside of the cluster. My expectation is that users will have some method of connecting to services running on the cluster, e.g. kubectl proxy or, on GKE, IAP. For more complicated scenarios, e.g. tying TensorBoard to a load balancer, it might make sense for users to set that job up separately.

Thoughts?
Regarding storage, to make sure I understand correctly: when dealing with something like Azure Files or GCE Persistent Disks, the template would instead look like:

apiVersion: "mlkube.io/v1beta1"
kind: "TfJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tensorboard:
    log_dir: /var/tensorflow/check_point_dir
    volumes:
      - name: azurefile
        azureFile:
          secretName: azure-secret
          shareName: data
    volumeMounts:
      - name: azurefile
        mountPath: "/var/tensorflow/"
  replica_specs:
    - replicas: 1
      tf_port: 2222
      tf_replica_type: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample_gpu:latest
              name: tensorflow
              resources:
                limits:
                  alpha.kubernetes.io/nvidia-gpu: 1
              volumeMounts:
                - name: azurefile
                  mountPath: "/var/tensorflow/"
          volumes:
            - name: azurefile
              azureFile:
                secretName: azure-secret
                shareName: data

Am I understanding this correctly?
> For more complicated scenarios, e.g. tying TensorBoard to a load balancer, it might make sense for users to set that job up separately.

Agreed 👍
Your spec looks good to me. If you want to take that one, that would be great.

On GCP you don't have to mount GCS because GCS is GCP's object storage system. Any process can read/write to GCS as long as it has appropriate credentials, and TensorFlow supports reading/writing directly to GCS, so you don't have to explicitly mount GCS as a volume. You could do something similar with HDFS, since TensorFlow can also read/write HDFS.

Can Azure mount a volume in multiple pods simultaneously? On GCP a PD can only be mounted with write permissions on a single VM, so if you wanted to use PD you would probably want to set up NFS backed by PD and then use the K8s support for NFS volumes. (That's just an FYI, not really related to this issue.)
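For the GCS case the tensorboard section could then, for instance, point log_dir straight at a bucket with no volumes section at all (bucket path hypothetical):

tensorboard:
  log_dir: gs://my-bucket/tf-smoke-gpu/check_point_dir   # read directly via TensorFlow's GCS support, nothing to mount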
Sure thing!

Yes, Azure Files supports mounting in multiple pods simultaneously, which is quite nice for this use case.
How do you see TensorBoard integrating with this solution?
It would be really cool if I could create a template, ask for TensorBoard to be deployed as well, and receive either a ClusterIP or a public IP.
For example, this could look like:
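(A rough sketch of what I mean; the exact shape of the tensorboard block and the service_type field are just illustrations, not a worked-out proposal.)

apiVersion: "mlkube.io/v1beta1"
kind: "TfJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tensorboard:
    log_dir: /var/tensorflow/logs
    service_type: ClusterIP   # or LoadBalancer for a public IP
  replica_specs:
    - replicas: 1
      tf_port: 2222
      tf_replica_type: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample_gpu:latest
              name: tensorflow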
TensorBoard would then run as a sidecar in the master's pod.
Now the main issue here is accessing the log files.
An easy way would be to document a certain convention. For example, we assume that the log files are saved under
/var/tensorflow/logs
and then mount this directory into the TensorBoard container through the node.

This also raises the question of data persistence: in this state, once the job shuts down, all data is lost. Do you think we need to address this question right away, or could this be discussed later on?
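For illustration, the master's pod might then end up with something like the following (the sidecar name, image, and the emptyDir choice are all assumptions; a hostPath on the node would be another way to share the directory):

template:
  spec:
    containers:
      - name: tensorflow
        image: gcr.io/tf-on-k8s-dogfood/tf_sample_gpu:latest
        volumeMounts:
          - name: tf-logs
            mountPath: /var/tensorflow/logs
      - name: tensorboard                       # hypothetical sidecar
        image: tensorflow/tensorflow:latest     # assumed image that ships the tensorboard binary
        command: ["tensorboard", "--logdir=/var/tensorflow/logs", "--host=0.0.0.0"]
        ports:
          - containerPort: 6006
        volumeMounts:
          - name: tf-logs
            mountPath: /var/tensorflow/logs
    volumes:
      - name: tf-logs
        emptyDir: {}   # shared between the two containers; gone when the pod goes away, which is exactly the persistence question above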
Happy to work on this if you approve.