TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale. It provides:
- A repository of modular and composable building blocks (models, fusion layers, loss functions, datasets and utilities).
- A repository of examples that show how to combine these building blocks with components and common infrastructure from across the PyTorch Ecosystem to replicate state-of-the-art models published in the literature. These examples should serve as baselines for ongoing research in the field, as well as a starting point for future work.
As a first open source example, researchers will be able to train and extend FLAVA using TorchMultimodal.
TorchMultimodal requires Python >= 3.8. The library can be installed with or without CUDA support.
- Create conda environment
    conda create -n torch-multimodal python=<python_version>
    conda activate torch-multimodal
- Install pytorch, torchvision, and torchtext. See PyTorch documentation.
For now, only Linux is supported.
    conda install pytorch torchvision torchtext cudatoolkit=11.3 -c pytorch-nightly -c nvidia
    # For CPU-only install
    conda install pytorch torchvision torchtext cpuonly -c pytorch-nightly
- Download and install TorchMultimodal and remaining requirements.
Developers, please follow the development installation instructions.
    git clone --recursive https://github.com/facebookresearch/multimodal.git torchmultimodal
    cd torchmultimodal
    pip install -r requirements.txt
    python setup.py install
The library builds on the following concepts:
- Architectures: These are general and composable classes that capture the core logic associated with a family of models. In most cases these take modules as inputs instead of flat arguments (see Models below). Examples include the LateFusionArchitecture, FLAVA and CLIPArchitecture. Users should either reuse an existing architecture or contribute a new one. We avoid inheritance as much as possible.
- Models: These are specific instantiations of a given architecture implemented using builder functions. The builder functions take as input all of the parameters for constructing the modules needed to instantiate the architecture. See cnn_lstm.py for an example.
- Modules: These are self-contained components that can be stitched together in various ways to build an architecture. See lstm_encoder.py for an example.
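
The relationship between the three concepts above can be illustrated with a small sketch. It is not TorchMultimodal code: it uses plain Python callables in place of `torch.nn.Module` so that it runs without any dependencies, and every name in it (`MeanPoolEncoder`, `ConcatFusion`, `LateFusionSketch`, `build_toy_model`) is hypothetical. It only shows the pattern of an architecture that receives ready-made modules, and a builder function that turns flat parameters into those modules.

```python
from typing import Callable, Dict, List

Vector = List[float]


class MeanPoolEncoder:
    """Module (hypothetical): a self-contained component.

    Encodes a list of numbers as a one-element vector: scale * mean.
    """

    def __init__(self, scale: float) -> None:
        self.scale = scale

    def __call__(self, values: List[float]) -> Vector:
        return [self.scale * sum(values) / len(values)]


class ConcatFusion:
    """Module (hypothetical): concatenates embeddings in sorted key order."""

    def __call__(self, embeddings: Dict[str, Vector]) -> Vector:
        fused: Vector = []
        for name in sorted(embeddings):
            fused.extend(embeddings[name])
        return fused


class LateFusionSketch:
    """Architecture (hypothetical): takes modules as inputs, not flat arguments.

    It composes the injected components and does not rely on subclassing.
    """

    def __init__(
        self,
        encoders: Dict[str, Callable[[List[float]], Vector]],
        fusion: Callable[[Dict[str, Vector]], Vector],
        head: Callable[[Vector], float],
    ) -> None:
        self.encoders = encoders
        self.fusion = fusion
        self.head = head

    def __call__(self, inputs: Dict[str, List[float]]) -> float:
        # Encode each modality separately, fuse, then apply the head.
        embeddings = {name: self.encoders[name](x) for name, x in inputs.items()}
        return self.head(self.fusion(embeddings))


def build_toy_model(image_scale: float, text_scale: float) -> LateFusionSketch:
    """Model builder (hypothetical): flat parameters in, architecture out."""
    return LateFusionSketch(
        encoders={
            "image": MeanPoolEncoder(image_scale),
            "text": MeanPoolEncoder(text_scale),
        },
        fusion=ConcatFusion(),
        head=sum,
    )


model = build_toy_model(image_scale=2.0, text_scale=1.0)
score = model({"image": [1.0, 3.0], "text": [4.0, 6.0]})
```

Because the architecture only depends on the call signatures of its parts, swapping in a different encoder or fusion component means writing a new builder function rather than a new subclass, which is one way to read the "avoid inheritance" guideline.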
See the CONTRIBUTING file for how to help out.
TorchMultimodal is BSD licensed, as found in the LICENSE file.