Although you can train an aitextgen model on TPUs by setting `n_tpu_cores=8` in an appropriate runtime, and the training loss does decrease, there are several blocking problems:

- The model stored in `aitextgen` does not update, even after training.
- Saving the model via `save_pretrained()` hangs, even with `xm.rendezvous()`.
- Memory leaks on the host system (especially with a large batch size).
- `fp16` doesn't work at all, and there's no training-loss decrease.
Will gladly take any suggestions/PRs to help resolve these!