
TPU Support #3

Open
@minimaxir

Description

Although you can train an aitextgen model on TPUs by setting n_tpu_cores=8 in an appropriate runtime, and the training loss does indeed decrease, a number of miscellaneous blocking problems remain (sketches of the relevant calls follow the list below):

  • The model stored in the aitextgen object does not update, even after training.
  • Saving the model via save_pretrained() causes a hang, even with xm.rendezvous().
  • Memory leaks occur on the host system (especially with a large batch size).
  • fp16 does not work at all, and the training loss does not decrease.
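For reference, this is roughly the kind of training call involved. It is a minimal sketch, assuming the default GPT-2 config; "input.txt" is a placeholder dataset and batch_size/num_steps are illustrative values:

```python
# Minimal sketch of the TPU training call described above; "input.txt" is a
# placeholder dataset and batch_size / num_steps are illustrative values.
from aitextgen import aitextgen

ai = aitextgen()  # default small GPT-2 model/config

# On a TPU runtime, the training loss decreases with n_tpu_cores=8,
# but the model held in `ai` is not updated afterward.
ai.train(
    "input.txt",
    n_tpu_cores=8,
    batch_size=1,
    num_steps=1000,
)
```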
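And this is the save pattern that hangs: gating the Hugging Face save_pretrained() call behind torch_xla's xm.rendezvous(). This is an illustrative sketch of the pattern, not the exact aitextgen code:

```python
# Sketch of the save pattern that hangs: synchronize the TPU processes with
# xm.rendezvous(), then let only the master ordinal write the model to disk
# via Hugging Face's save_pretrained(). Illustrative, not the exact code.
import torch_xla.core.xla_model as xm

def save_model(model, output_dir="trained_model"):
    xm.rendezvous("save_model")        # wait for all TPU cores
    if xm.is_master_ordinal():
        model.save_pretrained(output_dir)
```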

Will gladly take any suggestions/PRs to help resolve these!
