
[FeatureRequest] Support dynamic volume provisioning for TFJob and PyTorchJob #949

Closed
zhan849 opened this issue Mar 5, 2019 · 10 comments


zhan849 commented Mar 5, 2019

There are a lot of machine learning use cases where we need large scratch space for jobs, e.g. a job needs to download hundreds of GBs of data for processing. In cloud environments, block devices (EBS, for example) have advantages here: we don't need to over-provision hosts with large host volumes, nor do we need an expensive shared file system (EFS, for example), since machine learning workloads usually don't need to share local data.

Given such use cases, in Kubernetes a persistent volume would be a good and efficient way to support them. Kubeflow currently requires the user to provision persistent volume claims separately, as shown in https://github.com/kubeflow/tf-operator/tree/631dd0e31b8bfbb59b2b6ab7a3ea501cb289d479/examples/v1beta1/mnist_with_summaries, which adds operational overhead.

I was wondering if we could add support for dynamic volume provisioning to TFJobSpec and PyTorchJobSpec. A rough thought would be something similar to what StatefulSet does.
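For concreteness, a minimal sketch of what a StatefulSet-style field could look like. The types below are simplified, hypothetical stand-ins (a real implementation would use `k8s.io/api/core/v1.PersistentVolumeClaim` inside the operator's ReplicaSpec); this only illustrates the proposed shape, not the actual API.

```go
package main

import "fmt"

// Hypothetical stand-in for corev1.PersistentVolumeClaim, reduced to the
// fields relevant to this sketch.
type PersistentVolumeClaim struct {
	Name         string
	StorageClass string
	RequestGi    int // requested size in GiB
}

// Hypothetical stand-in for the operator's ReplicaSpec with the proposed
// field added, mirroring StatefulSetSpec.VolumeClaimTemplates: each template
// is provisioned dynamically per replica instead of being created by the
// user ahead of time.
type ReplicaSpec struct {
	Replicas             int
	VolumeClaimTemplates []PersistentVolumeClaim
}

func main() {
	worker := ReplicaSpec{
		Replicas: 2,
		VolumeClaimTemplates: []PersistentVolumeClaim{
			{Name: "scratch", StorageClass: "gp2", RequestGi: 200},
		},
	}
	fmt.Printf("%d worker replicas, %d claim template(s)\n",
		worker.Replicas, len(worker.VolumeClaimTemplates))
	// prints: 2 worker replicas, 1 claim template(s)
}
```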

We are happy to contribute and send out a PR implementing the feature.

johnugeorge (Member)

@zhan849 Can you explain more about the support required and the changes to the API?

zhan849 (Author) commented Mar 5, 2019

@johnugeorge I'm thinking about the following:

  1. Add a slice of volume claim templates to ReplicaSpec:
     VolumeClaimTemplates []corev1.PersistentVolumeClaim `json:"volumeClaimTemplates,omitempty"`
  2. In ReplicaSpec.Spec.InitContainers and ReplicaSpec.Spec.Containers, if any container requests a volume whose name is NOT in ReplicaSpec.Spec.Volumes but IS in ReplicaSpec.VolumeClaimTemplates, we create a PersistentVolumeClaim object for that container. Since the containers are named with an "-i" suffix, we can name the PVCs similarly.
  3. When the managing object (i.e. TFJob or PyTorchJob) is deleted, we delete the volumes as well.
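The expansion in step 2 could be sketched as follows. The naming scheme here is an assumption (the proposal only says PVCs would be named "similarly" to the "-i"-suffixed replicas), and the types are simplified local stand-ins rather than the real `k8s.io/api/core/v1` types. For step 3, note that in a real controller, setting an OwnerReference on each created PVC pointing at the TFJob/PyTorchJob would let Kubernetes garbage collection delete the claims automatically when the job is deleted.

```go
package main

import "fmt"

// PVCTemplate is a hypothetical, simplified stand-in for an entry in the
// proposed VolumeClaimTemplates slice.
type PVCTemplate struct {
	Name    string
	SizeGi  int
}

// pvcName derives a per-replica claim name, StatefulSet-style:
// <template name>-<job name>-<replica type>-<replica index>.
// The exact scheme is an assumption for illustration.
func pvcName(template, job, replicaType string, index int) string {
	return fmt.Sprintf("%s-%s-%s-%d", template, job, replicaType, index)
}

// expandClaims lists the PVC names to create for a replica type with n
// replicas: one claim per template per replica, as step 2 proposes.
func expandClaims(templates []PVCTemplate, job, replicaType string, n int) []string {
	var names []string
	for i := 0; i < n; i++ {
		for _, t := range templates {
			names = append(names, pvcName(t.Name, job, replicaType, i))
		}
	}
	return names
}

func main() {
	claims := expandClaims(
		[]PVCTemplate{{Name: "scratch", SizeGi: 200}},
		"mnist", "worker", 2)
	fmt.Println(claims)
	// prints: [scratch-mnist-worker-0 scratch-mnist-worker-1]
}
```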

zhan849 (Author) commented Mar 5, 2019

I can send out a PR for the community to review if the general direction is agreed upon.

johnugeorge (Member)

/cc @richardsliu
/cc @gaocegege

@zhan849 zhan849 changed the title [FeatureRequest] Support dynamic volume provisioning for TFJob [FeatureRequest] Support dynamic volume provisioning for TFJob and PyTorchJob Mar 6, 2019
@richardsliu richardsliu mentioned this issue Mar 27, 2019
gaocegege (Member)

The idea SGTM. However, since storage classes already support dynamic volume provisioning, what do we still need to do for this use case?

zhan849 (Author) commented Mar 28, 2019

@gaocegege Yes, the storage class defines how the volume is actually provisioned. We have already started some experiments in a forked branch. Do you want me to submit a brief proposal, or would you rather implement it yourselves? It would be something similar to StatefulSet.

gaocegege (Member)

@zhan849 I am glad to see your proposal! Thanks for your contribution.

gaocegege (Member)

@richardsliu @zhan849

Should we place the feature in common-operator? It is general to all PS/worker-based training jobs.

richardsliu (Contributor)

@gaocegege Yes, I think so. It should go into the next API version.

zhan849 (Author) commented Apr 19, 2019

@gaocegege @richardsliu Agreed. I will close this one and open a new one under common-operator.

As for the design doc: given all the discussion threads under tf-operator, I'd suggest we keep it there and check it into the community repo after the PR is finalized and merged.
