This is the official code release for VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, published at NeurIPS 2021.
There is a minimum framework requirement for this codebase:
- Python 3.8
- CUDA 10.1
- NVIDIA Driver v440.100
- CuDNN 7.6.5
Make sure to install the following libraries by running `pip install -r requirements.txt`:
- tensorflow==2.7.0
- tensorflow_addons==0.15.0
- tensorflow_probability==0.15.0
- tensorflow_text==2.7.0
- keras==2.7.0
- scikit-image
- scikit-learn
- scipy
- six
- numpy
- yaml
- dmvr
- absl
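After installation, a quick sanity check (a minimal sketch, not part of the codebase) is to confirm that the expected TensorFlow version is importable and that the GPU is visible to it:

```python
import tensorflow as tf

# Expecting TF 2.7.x per requirements.txt.
print("TensorFlow:", tf.__version__)

# Should list at least one GPU if the CUDA / driver / CuDNN setup above is working.
print("GPUs:", tf.config.list_physical_devices("GPU"))
```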
The data pipeline in this code is based on DMVR, which supports TF Example and TF SequenceExample. The data loaders assume that the datasets are stored as TF Records, similar to this example.
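For illustration only, here is a minimal sketch of writing one clip as a `tf.train.SequenceExample` into a TF Record shard. The feature names, shapes, and file name below are hypothetical; the actual keys and layout are dictated by the dataset constructors under `vatt/data/datasets` (see the toy example mentioned below).

```python
import numpy as np
import tensorflow as tf

def _bytes_feature(value):
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(values):
  return tf.train.Feature(float_list=tf.train.FloatList(value=values))

# Hypothetical per-clip data: JPEG-encoded RGB frames, a raw waveform, and a caption.
jpeg_frames = [tf.io.encode_jpeg(np.zeros((224, 224, 3), np.uint8)).numpy()
               for _ in range(32)]
waveform = np.zeros(48000, np.float32)  # e.g. a few seconds of mono audio
caption = b"a person playing guitar"

example = tf.train.SequenceExample(
    # Context holds per-clip features (text, waveform); the key names are assumptions.
    context=tf.train.Features(feature={
        "caption/string": _bytes_feature(caption),
        "WAVEFORM/feature/floats": _float_feature(waveform.tolist()),
    }),
    # Feature lists hold per-frame features (one encoded image per frame).
    feature_lists=tf.train.FeatureLists(feature_list={
        "image/encoded": tf.train.FeatureList(
            feature=[_bytes_feature(f) for f in jpeg_frames]),
    }),
)

# Write a single-record shard; a real dataset would be sharded across many files.
with tf.io.TFRecordWriter("/tmp/toy_shard-00000-of-00001.tfrecord") as writer:
  writer.write(example.SerializeToString())
```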
Make sure to fill in the correct constructor under `vatt/data/datasets` before launching the main script. There is a toy example under `vatt/data/datasets/toy_dataset.py` for your reference.
Depending on the configuration, you might need the pre-trained text embeddings and vocabulary. Please download this file and extract it under `vatt/`.
Assuming all datasets are stored and dataloaders are functioning, pre-training can be launched using the following:

`python -m vatt.main --task=pretrain --mode=train --model_dir=PATH/TO/RUN --model_arch=tx_fac --strategy_type=mirrored`
If `--mode=train`, the self-supervised training will launch; if `--mode=eval`, the full evaluation will be launched. The evaluation pipeline continuously loops over the `model_dir` path and looks for new checkpoints. This means that you can launch the evaluation pipeline separately and benefit from continuous evaluation during the course of pre-training. Alternatively, you can set `--override_checkpoint=PATH/TO/CHECKPOINT` to evaluate a specific checkpoint. If you are using TPUs, you can set `--strategy_type=tpu --tpu=ADDRESS/OF/TPU`.
- The options for `model_arch` are the following:
  - `tx_fac`: Modality-specific VATT
  - `ut_fac`: Modality-agnostic VATT
  - `mmv_fac`: The CNN-based counterpart as in MMV
Once you pre-train a model, you can fine-tune the vision or audio Transformers on a classification dataset.
Assuming all datasets are stored and dataloaders are functioning, fine-tuning can be launched using the following:

`python -m vatt.main --task=finetune --mode=train --model_dir=PATH/TO/RUN --model_arch=ViT_Medium --strategy_type=mirrored`
Similarly, `mode` can take either `train` or `eval`, and continuous evaluation is possible by running the evaluation pipeline in parallel.
- The options for `model_arch` are the following:
  - `vit_base`: Vision Transformer with the Base configuration
  - `vit_medium`: Vision Transformer with the Medium configuration
  - `vit_large`: Vision Transformer with the Large configuration
  - `wat_base`: Waveform Transformer with the Base configuration
  - `wat_medium`: Waveform Transformer with the Medium configuration
  - `spt_base`: Spectrogram Transformer with the Base configuration
  - `spt_medium`: Spectrogram Transformer with the Medium configuration
  - `i3d`: Video model based on the I3D architecture
  - `resnet2d_50`: Audio model based on a ResNet-2D architecture (Spectrogram-only)
In any of the settings, make sure to set the correct configuration for data and optimization under `vatt/configs`.
| Backbone | Model Size (Video-Audio-Text) | Checkpoint |
|---|---|---|
| Modality Specific | Base-Base-Small | data, index |
| Modality Specific | Medium-Base-Small | data, index |
| Modality Specific | Large-Base-Small | data, index |
| Modality Agnostic | Medium (single backbone) | data, index |
| Dataset | Model Type | Pre-trained Checkpoint | Top-1 | Top-5 | Checkpoint |
|---|---|---|---|---|---|
| Kinetics-400 | ViT Base | Base-Base-Small | 79.6 | 94.9 | data, index |
| Kinetics-400 | ViT Medium | Medium-Base-Small | 81.1 | 95.6 | data, index |
| Kinetics-400 | ViT Large | Large-Base-Small | 82.1 | 95.5 | data, index |
| Kinetics-400 | ViT Medium | Medium (single backbone) | 79.9 | 94.9 | data, index |
| Kinetics-600 | ViT Base | Base-Base-Small | 80.5 | 95.5 | data, index |
| Kinetics-600 | ViT Medium | Medium-Base-Small | 82.4 | 96.1 | data, index |
| Kinetics-600 | ViT Large | Large-Base-Small | 83.6 | 96.6 | data, index |
| Kinetics-600 | ViT Medium | Medium (single backbone) | 80.8 | 95.5 | data, index |
| Kinetics-700 | ViT Base | Base-Base-Small | - | - | TBD |
| Kinetics-700 | ViT Medium | Medium-Base-Small | - | - | TBD |
| Kinetics-700 | ViT Large | Large-Base-Small | 72.7 | 90.5 | data, index |
| Kinetics-700 | ViT Medium | Medium (single backbone) | - | - | TBD |
| Moments-in-Time | ViT Base | Base-Base-Small | 38.7 | 67.5 | data, index |
| Moments-in-Time | ViT Medium | Medium-Base-Small | 39.5 | 68.2 | data, index |
| Moments-in-Time | ViT Large | Large-Base-Small | 41.1 | 67.7 | data, index |
| Moments-in-Time | ViT Medium | Medium (single backbone) | 37.8 | 65.9 | data, index |
| Dataset | Model Type | Pre-trained Checkpoint | mAP | AUC | d-prime | Checkpoint |
|---|---|---|---|---|---|---|
| AudioSet | WaT Base | Base-Base-Small | 39.4 | 97.1 | 2.895 | data, index |
| AudioSet | WaT Medium | Medium (single backbone) | 39.3 | 97.0 | 2.884 | data, index |
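The data and index files listed above are the two shards of a standard TensorFlow checkpoint. As a minimal, hypothetical sketch (the `ckpt-1` prefix below is a placeholder for wherever you extract the files), you can inspect a downloaded checkpoint before restoring it into a model:

```python
import tensorflow as tf

# Point to the common prefix of the shards, e.g. a directory containing
# ckpt-1.data-00000-of-00001 and ckpt-1.index (names here are placeholders).
ckpt_prefix = "PATH/TO/CHECKPOINT/ckpt-1"

reader = tf.train.load_checkpoint(ckpt_prefix)
shape_map = reader.get_variable_to_shape_map()

# Print the stored variable names and shapes to verify the download is intact.
for name in sorted(shape_map):
  print(name, shape_map[name])
```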
@article{akbari2021vatt,
title={Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text},
author={Akbari, Hassan and Yuan, Liangzhe and Qian, Rui and Chuang, Wei-Hong and Chang, Shih-Fu and Cui, Yin and Gong, Boqing},
journal={arXiv preprint arXiv:2104.11178},
year={2021}
}
Any feedback is appreciated. If you observe any issues, please contact us.
Corresponding author: https://github.com/hassanhub