λDNN is a cost-efficient function resource provisioning framework that minimizes the monetary cost while guaranteeing the performance of distributed DNN (DDNN) training workloads on serverless platforms.
The λDNN framework runs on AWS Lambda and comprises two modules: a training performance predictor and a function resource provisioner. Guided by the predictor, the resource provisioner identifies the cost-efficient serverless function resource provisioning plan that guarantees the objective DDNN training time. Once the cost-efficient provisioning plan is determined, the function allocator finally sets up the chosen number of functions, each with the appropriate amount of memory.
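As an illustration of this final allocation step, below is a minimal sketch, assuming AWS Lambda is driven through boto3. It is not λDNN's released implementation; the function names, runtime, IAM role, handler, and deployment package are hypothetical placeholders.

```python
# A minimal allocation sketch, assuming AWS Lambda via boto3. Not λDNN's
# released code: the naming scheme, runtime, role ARN, handler, and package
# path are hypothetical placeholders.
import boto3


def allocate_functions(n_functions: int, memory_mb: int,
                       role_arn: str, zip_path: str) -> list:
    """Set up n identical Lambda functions, each with memory_mb MB of memory."""
    client = boto3.client("lambda")
    with open(zip_path, "rb") as f:
        package = f.read()

    names = []
    for i in range(n_functions):
        name = f"ddnn-worker-{i}"            # hypothetical naming scheme
        client.create_function(
            FunctionName=name,
            Runtime="python3.9",
            Role=role_arn,                   # IAM role with storage (e.g., S3) access
            Handler="worker.train_handler",  # hypothetical training entry point
            Code={"ZipFile": package},
            MemorySize=memory_mb,            # memory size also scales the vCPU share
            Timeout=900,                     # Lambda's 15-minute maximum
        )
        names.append(name)
    return names
```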
In general, the DNN model requires a number of iterations (denoted by k) to converge to an objective training loss value. Accordingly, the DDNN training time T can be calculated by summing up the loading time, the computation time, and the communication time over the k iterations, which is given by

T = k · (t_load + t_comp + t_comm).
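As a worked illustration of this formula, the sketch below simply sums the three per-iteration components over k iterations; the component times themselves are assumed to come from λDNN's analytical model (detailed in the paper cited below) and are passed in as plain numbers here.

```python
# A minimal sketch of the training-time formula above. The per-iteration
# loading, computation, and communication times are assumed to be estimated
# elsewhere; only the summation over k iterations is shown.
def predict_training_time(k: int, t_load: float, t_comp: float,
                          t_comm: float) -> float:
    """T = k * (t_load + t_comp + t_comm), with all times in seconds."""
    return k * (t_load + t_comp + t_comm)


# Example: 1000 iterations, each loading data for 0.4 s, computing gradients
# for 1.2 s, and exchanging gradients for 0.6 s -> T = 2200 s.
T = predict_training_time(k=1000, t_load=0.4, t_comp=1.2, t_comm=0.6)
```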
Given n provisioned functions, λDNN analytically models the loading time t_load of training data, the computation time t_comp of model gradients, and the data communication time t_comm for a candidate provisioning plan. The objective is to minimize the monetary cost of the provisioned function resources while guaranteeing the performance of DDNN training workloads. Accordingly, the optimization problem is formally defined as finding the provisioning plan (the number of functions and their memory size) with the minimum monetary cost whose predicted training time T does not exceed the objective training time; a minimal sketch of such a search is given after the BibTeX entry below.

Fei Xu, Yiling Qin, Li Chen, Zhi Zhou, and Fangming Liu, "λDNN: Achieving Predictable Distributed DNN Training with Serverless Architectures," IEEE Transactions on Computers, vol. 71, no. 2, pp. 450-463, 2022. DOI: 10.1109/TC.2021.3054656.
@article{xu2021lambdadnn,
  title={$\lambda$DNN: Achieving predictable distributed {DNN} training with serverless architectures},
author={Xu, Fei and Qin, Yiling and Chen, Li and Zhou, Zhi and Liu, Fangming},
journal={IEEE Transactions on Computers},
volume={71},
number={2},
pages={450--463},
  year={2022},
publisher={IEEE}
}
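As referenced above, the following is a minimal sketch of the cost-efficient provisioning search: it enumerates candidate plans (number of functions, memory size), discards those whose predicted training time exceeds the objective, and returns the cheapest remaining plan. The predictor callback stands in for λDNN's analytical performance model, the per-GB-second price is an illustrative AWS Lambda rate, and the candidate ranges are arbitrary examples rather than values from the paper.

```python
# A minimal sketch of the cost-efficient provisioning search, assuming a
# performance predictor predict_time(n, memory_mb) that returns the predicted
# DDNN training time T for a candidate plan. The price constant is an
# illustrative AWS Lambda per-GB-second rate; candidate ranges are examples.
from typing import Callable, Optional, Tuple

PRICE_PER_GB_SECOND = 0.0000166667


def plan_cost(n: int, memory_mb: int, t_seconds: float) -> float:
    """Monetary cost if each of the n functions runs for t_seconds."""
    return n * t_seconds * (memory_mb / 1024.0) * PRICE_PER_GB_SECOND


def cheapest_plan(predict_time: Callable[[int, int], float],
                  objective_time: float,
                  n_candidates=range(1, 65),
                  memory_candidates=range(1024, 10241, 256)
                  ) -> Optional[Tuple[int, int, float]]:
    """Return (n, memory_mb, cost) of the cheapest plan meeting the objective."""
    best = None
    for n in n_candidates:
        for memory_mb in memory_candidates:
            t = predict_time(n, memory_mb)   # predicted DDNN training time T
            if t > objective_time:
                continue                     # plan violates the time objective
            cost = plan_cost(n, memory_mb, t)
            if best is None or cost < best[2]:
                best = (n, memory_mb, cost)
    return best
```

A plain enumeration is enough for a sketch here because Lambda exposes only a bounded, discrete set of memory sizes and the practical number of functions per workload is small.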