
[FeatureRequest] Support dynamic volume provisioning for TFJob and PyTorchJob #949

Closed
zhan849 opened this issue Mar 5, 2019 · 10 comments


zhan849 commented Mar 5, 2019

There are a lot of machine learning use cases where we need large scratch space for jobs, e.g. a job needs to download hundreds of GBs of data for processing. In cloud environments, block devices (EBS, for example) have advantages here: we don't need to over-provision hosts with large host volumes, nor do we need an expensive shared file system (EFS, for example), since machine learning workloads usually don't need to share local data.

Given such use cases, in Kubernetes a persistent volume would be a good and efficient way to support them. Kubeflow currently requires the user to provision persistent volume claims separately, as shown in https://github.com/kubeflow/tf-operator/tree/631dd0e31b8bfbb59b2b6ab7a3ea501cb289d479/examples/v1beta1/mnist_with_summaries, which adds operational overhead.

I was wondering if we could add support for dynamic volume provisioning to TFJobSpec and PyTorchJobSpec. A rough thought would be something similar to what StatefulSet does.
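For concreteness, a minimal sketch of what a StatefulSet-style field could look like. The types below are simplified, hypothetical stand-ins (a real implementation would use `k8s.io/api/core/v1.PersistentVolumeClaim` inside the operator's ReplicaSpec); this only illustrates the proposed shape, not the actual API.

```go
package main

import "fmt"

// Hypothetical stand-in for corev1.PersistentVolumeClaim, reduced to the
// fields relevant to this sketch.
type PersistentVolumeClaim struct {
	Name         string
	StorageClass string
	RequestGi    int // requested size in GiB
}

// Hypothetical stand-in for the operator's ReplicaSpec with the proposed
// field added, mirroring StatefulSetSpec.VolumeClaimTemplates: each template
// is provisioned dynamically per replica instead of being created by the
// user ahead of time.
type ReplicaSpec struct {
	Replicas             int
	VolumeClaimTemplates []PersistentVolumeClaim
}

func main() {
	worker := ReplicaSpec{
		Replicas: 2,
		VolumeClaimTemplates: []PersistentVolumeClaim{
			{Name: "scratch", StorageClass: "gp2", RequestGi: 200},
		},
	}
	fmt.Printf("%d worker replicas, %d claim template(s)\n",
		worker.Replicas, len(worker.VolumeClaimTemplates))
	// prints: 2 worker replicas, 1 claim template(s)
}
```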

We are happy to contribute and send out a PR implementing the feature.

johnugeorge (Member)

@zhan849 Can you explain more about the support required and the changes to the API?

zhan849 (Author) commented Mar 5, 2019

@johnugeorge I'm thinking about the following:

  1. Add a slice of volume claim templates to ReplicaSpec:
     VolumeClaimTemplates []corev1.PersistentVolumeClaim `json:"volumeClaimTemplates,omitempty"`
  2. In ReplicaSpec.Spec.InitContainers and ReplicaSpec.Spec.Containers, if any container requests a volume whose name is NOT in ReplicaSpec.Spec.Volumes but IS in ReplicaSpec.VolumeClaimTemplates, we create a PersistentVolumeClaim object for that container. Since the containers are named with an "-i" suffix, we can name the PVCs similarly.
  3. When the managing object (i.e. TFJob or PyTorchJob) is deleted, we delete the volumes as well.
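The expansion in step 2 could be sketched as follows. The naming scheme here is an assumption (the proposal only says PVCs would be named "similarly" to the "-i"-suffixed replicas), and the types are simplified local stand-ins rather than the real `k8s.io/api/core/v1` types. For step 3, note that in a real controller, setting an OwnerReference on each created PVC pointing at the TFJob/PyTorchJob would let Kubernetes garbage collection delete the claims automatically when the job is deleted.

```go
package main

import "fmt"

// PVCTemplate is a hypothetical, simplified stand-in for an entry in the
// proposed VolumeClaimTemplates slice.
type PVCTemplate struct {
	Name    string
	SizeGi  int
}

// pvcName derives a per-replica claim name, StatefulSet-style:
// <template name>-<job name>-<replica type>-<replica index>.
// The exact scheme is an assumption for illustration.
func pvcName(template, job, replicaType string, index int) string {
	return fmt.Sprintf("%s-%s-%s-%d", template, job, replicaType, index)
}

// expandClaims lists the PVC names to create for a replica type with n
// replicas: one claim per template per replica, as step 2 proposes.
func expandClaims(templates []PVCTemplate, job, replicaType string, n int) []string {
	var names []string
	for i := 0; i < n; i++ {
		for _, t := range templates {
			names = append(names, pvcName(t.Name, job, replicaType, i))
		}
	}
	return names
}

func main() {
	claims := expandClaims(
		[]PVCTemplate{{Name: "scratch", SizeGi: 200}},
		"mnist", "worker", 2)
	fmt.Println(claims)
	// prints: [scratch-mnist-worker-0 scratch-mnist-worker-1]
}
```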

zhan849 (Author) commented Mar 5, 2019

I can send out a PR for the community to review if the general direction is agreed upon.

johnugeorge (Member)

/cc @richardsliu
/cc @gaocegege

@zhan849 zhan849 changed the title [FeatureRequest] Support dynamic volume provisioning for TFJob [FeatureRequest] Support dynamic volume provisioning for TFJob and PyTorchJob Mar 6, 2019
@richardsliu richardsliu mentioned this issue Mar 27, 2019
gaocegege (Member)

The idea SGTM. However, since storage classes already support dynamic volume provisioning, what do we still need to do for this use case?

zhan849 (Author) commented Mar 28, 2019

@gaocegege Yes, the storage class defines how the volume is actually provisioned. We have already started some experiments in a forked branch. Do you want me to submit a brief proposal, or would you rather implement it yourselves? It would be something similar to StatefulSet.

gaocegege (Member)

@zhan849 I am glad to see your proposal! Thanks for your contribution.

gaocegege (Member)

@richardsliu @zhan849

Should we place the feature in common-operator? It is general to all PS/worker-based training jobs.

richardsliu (Contributor)

@gaocegege Yes, I think so. It should go into the next API version.

zhan849 (Author) commented Apr 19, 2019

@gaocegege @richardsliu Agreed. I will close this one and open a new one under common-operator.

As for the design doc: given all the discussion threads under tf-operator, I'd suggest we keep it there and check it into the community repo after the PR is finalized and merged.
