Fail to offload FSDP model weights and optimizer states without using CPUOffload(offload_params=True) #130530
Labels: needs reproduction, oncall: distributed, triaged
🚀 The feature, motivation and pitch
Hi PyTorch maintainers,
I am currently training multiple large language models (LLMs) sequentially on a single GPU machine, using FullyShardedDataParallel (FSDP) for each model. A significant challenge we face is the memory footprint of keeping multiple LLMs resident at once, including their optimizer states, gradients, and activations.
We notice that FSDP supports offloading model parameters and optimizer states during training via `cpu_offload=CPUOffload(offload_params=True)`. However, this feature keeps the parameters and optimizer states on CPU and performs the optimizer step on CPU, which hurts training throughput. In our scenario the models are computed one at a time, so we want to offload the idle models ourselves while a single LLM is performing computation on the GPU.
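For context, here is a minimal sketch of the built-in offload path described above (`base_model` is a placeholder for the actual LLM module; process-group setup and other arguments are omitted):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

# Built-in parameter offload: FSDP keeps the sharded parameters on CPU and
# moves them to GPU only around forward/backward, and the optimizer step
# runs on CPU -- this is the throughput cost we want to avoid.
model = FSDP(
    base_model,  # placeholder for the LLM module being wrapped
    cpu_offload=CPUOffload(offload_params=True),
)
```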
However, we fail to offload the `_fsdp_wrapped_module` to CPU with the approach sketched below (we offload the FSDP model by moving the parameters returned by `named_parameters()`). We observe identical GPU memory usage before and after the offload operation, indicating that nothing is actually offloaded. It appears that persistent references to these parameters remain, so `p.data.to('cpu')` merely copies the data to CPU and the original GPU storage is never freed.
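The original snippet is not reproduced here, but our manual offload attempt looks roughly like this sketch (reconstructed from the description above; `fsdp_model` is a placeholder for one of the FSDP-wrapped models):

```python
import torch

def offload_model_to_cpu(fsdp_model):
    # Naive attempt: reassign each parameter's .data to a CPU copy.
    for _, p in fsdp_model.named_parameters():
        p.data = p.data.to("cpu")
    torch.cuda.empty_cache()

before = torch.cuda.memory_allocated()
offload_model_to_cpu(fsdp_model)
after = torch.cuda.memory_allocated()
# Observed: before == after -- the GPU storage is apparently still referenced
# (presumably by FSDP's internal flat parameters), so nothing is freed.
print(f"allocated before: {before}, after: {after}")
```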
Could you provide guidance on how to properly offload these FSDP-wrapped parameters to CPU so that GPU memory is actually released? Any insights or updates that facilitate this would greatly improve our training capability and efficiency.
Thank you for your attention to this feature/issue!
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @ezyang @anijain2305 @chauhang @penguinwu