🚀 Feature request
This is a discussion issue for training/fine-tuning very large transformer models. Model parallelism was recently added for gpt2 and t5, but the current implementation is PyTorch-only and requires manually modifying each model class (a usage sketch is included after the list below). Possible routes forward (thanks to @stas00 for identifying these):
- fairscale, to avoid individual per-model implementations
- deepspeed, to possibly enable even larger models to be trained
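For reference, here is a minimal sketch of what the current per-model implementation looks like from the user side, assuming two GPUs and the `parallelize()`/`deparallelize()` methods added for gpt2; the exact layer split in `device_map` is illustrative only.

```python
# Sketch: naive (layer-wise) model parallelism via the per-model API,
# assuming transformers with GPT-2 parallelize() support and 2 GPUs.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")

# gpt2-xl has 48 transformer blocks; split them across the two devices
# (this particular split is just an example, not a recommendation).
device_map = {
    0: list(range(0, 24)),
    1: list(range(24, 48)),
}
model.parallelize(device_map)  # placement logic lives inside the model class

inputs = tokenizer("Model parallelism lets this model span two GPUs",
                   return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_length=30)
print(tokenizer.decode(outputs[0]))

model.deparallelize()  # move everything back to a single device
```

The point of the fairscale/deepspeed routes above is that this placement logic currently has to be re-implemented inside every model class, rather than being provided generically by an external library.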