🚀 Feature request
This is a discussion issue for training/fine-tuning very large transformer models. Model parallelism was recently added for gpt2 and t5, but the current implementation is PyTorch-only and requires manually modifying each model class (a usage sketch is included after the list below). Possible routes forward (thanks to @stas00 for identifying these):
- fairscale, to avoid individual per-model implementations
- deepspeed, to possibly enable even larger models to be trained
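For reference, here is a minimal sketch of what the current per-model implementation looks like from the user side, assuming two GPUs and the `parallelize()`/`deparallelize()` methods added for gpt2; the exact layer split in `device_map` is illustrative only.

```python
# Sketch: naive (layer-wise) model parallelism via the per-model API,
# assuming transformers with GPT-2 parallelize() support and 2 GPUs.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")

# gpt2-xl has 48 transformer blocks; split them across the two devices
# (this particular split is just an example, not a recommendation).
device_map = {
    0: list(range(0, 24)),
    1: list(range(24, 48)),
}
model.parallelize(device_map)  # placement logic lives inside the model class

inputs = tokenizer("Model parallelism lets this model span two GPUs",
                   return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_length=30)
print(tokenizer.decode(outputs[0]))

model.deparallelize()  # move everything back to a single device
```

The point of the fairscale/deepspeed routes above is that this placement logic currently has to be re-implemented inside every model class, rather than being provided generically by an external library.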