λDNN is a cost-efficient function resource provisioning framework that minimizes the monetary cost while guaranteeing the performance of distributed DNN (DDNN) training workloads on serverless platforms.
The λDNN framework runs on AWS Lambda and comprises two modules: a training performance predictor and a function resource provisioner. Guided by the predictor, the resource provisioner identifies the cost-efficient serverless function resource provisioning plan that guarantees the objective DDNN training time. Once the cost-efficient provisioning plan is determined, the function allocator finally sets up the chosen number of functions, each with the appropriate amount of memory.
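As an illustration of this final allocation step, below is a minimal sketch, assuming AWS Lambda is driven through boto3. It is not λDNN's released implementation; the function names, runtime, IAM role, handler, and deployment package are hypothetical placeholders.

```python
# A minimal allocation sketch, assuming AWS Lambda via boto3. Not λDNN's
# released code: the naming scheme, runtime, role ARN, handler, and package
# path are hypothetical placeholders.
import boto3


def allocate_functions(n_functions: int, memory_mb: int,
                       role_arn: str, zip_path: str) -> list:
    """Set up n identical Lambda functions, each with memory_mb MB of memory."""
    client = boto3.client("lambda")
    with open(zip_path, "rb") as f:
        package = f.read()

    names = []
    for i in range(n_functions):
        name = f"ddnn-worker-{i}"            # hypothetical naming scheme
        client.create_function(
            FunctionName=name,
            Runtime="python3.9",
            Role=role_arn,                   # IAM role with storage (e.g., S3) access
            Handler="worker.train_handler",  # hypothetical training entry point
            Code={"ZipFile": package},
            MemorySize=memory_mb,            # memory size also scales the vCPU share
            Timeout=900,                     # Lambda's 15-minute maximum
        )
        names.append(name)
    return names
```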
In general, the DNN model requires a number of iterations (denoted by k) to converge to an objective training loss value. Accordingly, the DDNN training time T can be calculated by summing up the loading time, the computation time, and the communication time over the k iterations, which is given by

T = k · (t_load + t_comp + t_comm).
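As a worked illustration of this formula, the sketch below simply sums the three per-iteration components over k iterations; the component times themselves are assumed to come from λDNN's analytical model (detailed in the paper cited below) and are passed in as plain numbers here.

```python
# A minimal sketch of the training-time formula above. The per-iteration
# loading, computation, and communication times are assumed to be estimated
# elsewhere; only the summation over k iterations is shown.
def predict_training_time(k: int, t_load: float, t_comp: float,
                          t_comm: float) -> float:
    """T = k * (t_load + t_comp + t_comm), with all times in seconds."""
    return k * (t_load + t_comp + t_comm)


# Example: 1000 iterations, each loading data for 0.4 s, computing gradients
# for 1.2 s, and exchanging gradients for 0.6 s -> T = 2200 s.
T = predict_training_time(k=1000, t_load=0.4, t_comp=1.2, t_comm=0.6)
```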
Given n provisioned functions, λDNN analytically models the loading time t_load of training data, the computation time t_comp of model gradients, and the data communication time t_comm for a candidate provisioning plan. The objective is to minimize the monetary cost of the provisioned function resources while guaranteeing the performance of DDNN training workloads. Accordingly, the optimization problem is formally defined as finding the provisioning plan (the number of functions and their memory size) with the minimum monetary cost whose predicted training time T does not exceed the objective training time; a minimal sketch of such a search is given after the BibTeX entry below.

Fei Xu, Yiling Qin, Li Chen, Zhi Zhou, and Fangming Liu, "λDNN: Achieving Predictable Distributed DNN Training with Serverless Architectures," IEEE Transactions on Computers, vol. 71, no. 2, pp. 450-463, 2022. DOI: 10.1109/TC.2021.3054656.
@article{xu2021lambdadnn,
  title={$\lambda$DNN: Achieving predictable distributed {DNN} training with serverless architectures},
author={Xu, Fei and Qin, Yiling and Chen, Li and Zhou, Zhi and Liu, Fangming},
journal={IEEE Transactions on Computers},
volume={71},
number={2},
pages={450--463},
  year={2022},
publisher={IEEE}
}
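As referenced above, the following is a minimal sketch of the cost-efficient provisioning search: it enumerates candidate plans (number of functions, memory size), discards those whose predicted training time exceeds the objective, and returns the cheapest remaining plan. The predictor callback stands in for λDNN's analytical performance model, the per-GB-second price is an illustrative AWS Lambda rate, and the candidate ranges are arbitrary examples rather than values from the paper.

```python
# A minimal sketch of the cost-efficient provisioning search, assuming a
# performance predictor predict_time(n, memory_mb) that returns the predicted
# DDNN training time T for a candidate plan. The price constant is an
# illustrative AWS Lambda per-GB-second rate; candidate ranges are examples.
from typing import Callable, Optional, Tuple

PRICE_PER_GB_SECOND = 0.0000166667


def plan_cost(n: int, memory_mb: int, t_seconds: float) -> float:
    """Monetary cost if each of the n functions runs for t_seconds."""
    return n * t_seconds * (memory_mb / 1024.0) * PRICE_PER_GB_SECOND


def cheapest_plan(predict_time: Callable[[int, int], float],
                  objective_time: float,
                  n_candidates=range(1, 65),
                  memory_candidates=range(1024, 10241, 256)
                  ) -> Optional[Tuple[int, int, float]]:
    """Return (n, memory_mb, cost) of the cheapest plan meeting the objective."""
    best = None
    for n in n_candidates:
        for memory_mb in memory_candidates:
            t = predict_time(n, memory_mb)   # predicted DDNN training time T
            if t > objective_time:
                continue                     # plan violates the time objective
            cost = plan_cost(n, memory_mb, t)
            if best is None or cost < best[2]:
                best = (n, memory_mb, cost)
    return best
```

A plain enumeration is enough for a sketch here because Lambda exposes only a bounded, discrete set of memory sizes and the practical number of functions per workload is small.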