This is the official code release for VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, published at NeurIPS 2021.
There is a minimum framework requirement for this codebase:
- Python 3.8
- CUDA 10.1
- NVIDIA Driver v440.100
- CuDNN 7.6.5
Make sure to install the following libraries by running `pip install -r requirements.txt`:
- tensorflow==2.7.0
- tensorflow_addons==0.15.0
- tensorflow_probability==0.15.0
- tensorflow_text==2.7.0
- keras==2.7.0
- scikit-image
- scikit-learn
- scipy
- six
- numpy
- yaml
- dmvr
- absl
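After installation, a quick sanity check (a minimal sketch, not part of the codebase) is to confirm that the expected TensorFlow version is importable and that the GPU is visible to it:

```python
import tensorflow as tf

# Expecting TF 2.7.x per requirements.txt.
print("TensorFlow:", tf.__version__)

# Should list at least one GPU if the CUDA / driver / CuDNN setup above is working.
print("GPUs:", tf.config.list_physical_devices("GPU"))
```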
The data pipeline in this code is based on DMVR, which supports TF Example and TF SequenceExample. The data loaders assume that the datasets are stored as TF Records, similar to this example.
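For illustration only, here is a minimal sketch of writing one clip as a `tf.train.SequenceExample` into a TF Record shard. The feature names, shapes, and file name below are hypothetical; the actual keys and layout are dictated by the dataset constructors under `vatt/data/datasets` (see the toy example mentioned below).

```python
import numpy as np
import tensorflow as tf

def _bytes_feature(value):
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(values):
  return tf.train.Feature(float_list=tf.train.FloatList(value=values))

# Hypothetical per-clip data: JPEG-encoded RGB frames, a raw waveform, and a caption.
jpeg_frames = [tf.io.encode_jpeg(np.zeros((224, 224, 3), np.uint8)).numpy()
               for _ in range(32)]
waveform = np.zeros(48000, np.float32)  # e.g. a few seconds of mono audio
caption = b"a person playing guitar"

example = tf.train.SequenceExample(
    # Context holds per-clip features (text, waveform); the key names are assumptions.
    context=tf.train.Features(feature={
        "caption/string": _bytes_feature(caption),
        "WAVEFORM/feature/floats": _float_feature(waveform.tolist()),
    }),
    # Feature lists hold per-frame features (one encoded image per frame).
    feature_lists=tf.train.FeatureLists(feature_list={
        "image/encoded": tf.train.FeatureList(
            feature=[_bytes_feature(f) for f in jpeg_frames]),
    }),
)

# Write a single-record shard; a real dataset would be sharded across many files.
with tf.io.TFRecordWriter("/tmp/toy_shard-00000-of-00001.tfrecord") as writer:
  writer.write(example.SerializeToString())
```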
Make sure to fill in the correct constructor under `vatt/data/datasets` before launching the main script. There is a toy example under `vatt/data/datasets/toy_dataset.py` for your reference.
Depending on the configuration, you might need the pre-trained text embeddings and vocabulary. Please download this file and extract it under `vatt/`.
Assuming all datasets are stored and dataloaders are functioning, pre-training can be launched using the following:

`python -m vatt.main --task=pretrain --mode=train --model_dir=PATH/TO/RUN --model_arch=tx_fac --strategy_type=mirrored`
If `--mode=train`, the self-supervised training will launch; if `--mode=eval`, the full evaluation will be launched. The evaluation pipeline continuously loops over the `model_dir` path and looks for new checkpoints. This means that you can launch the evaluation pipeline separately and benefit from continuous evaluation during the course of pre-training. Alternatively, you can set `--override_checkpoint=PATH/TO/CHECKPOINT` to evaluate a specific checkpoint. If you are using TPUs, you can set `--strategy_type=tpu --tpu=ADDRESS/OF/TPU`.
- The options for `model_arch` are the following:
  - `tx_fac`: Modality-specific VATT
  - `ut_fac`: Modality-agnostic VATT
  - `mmv_fac`: The CNN-based counterpart as in MMV
Once you pre-train a model, you can fine-tune the vision or audio Transformers on a classification dataset.
Assuming all datasets are stored and dataloaders are functioning, fine-tuning can be launched using the following:

`python -m vatt.main --task=finetune --mode=train --model_dir=PATH/TO/RUN --model_arch=ViT_Medium --strategy_type=mirrored`
Similarly, `mode` can take either `train` or `eval`, and continuous evaluation is possible by running the evaluation pipeline in parallel.
- The options for `model_arch` are the following:
  - `vit_base`: Vision Transformer with the Base configuration
  - `vit_medium`: Vision Transformer with the Medium configuration
  - `vit_large`: Vision Transformer with the Large configuration
  - `wat_base`: Waveform Transformer with the Base configuration
  - `wat_medium`: Waveform Transformer with the Medium configuration
  - `spt_base`: Spectrogram Transformer with the Base configuration
  - `spt_medium`: Spectrogram Transformer with the Medium configuration
  - `i3d`: Video model based on the I3D architecture
  - `resnet2d_50`: Audio model based on a ResNet-2D architecture (Spectrogram-only)
In any of the settings, make sure to set the correct configuration for data and optimization under `vatt/configs`.
| Backbone | Model Size (Video-Audio-Text) | Checkpoint |
|---|---|---|
| Modality Specific | Base-Base-Small | data, index |
| Modality Specific | Medium-Base-Small | data, index |
| Modality Specific | Large-Base-Small | data, index |
| Modality Agnostic | Medium (single backbone) | data, index |
| Dataset | Model Type | Pre-trained Checkpoint | Top-1 | Top-5 | Checkpoint |
|---|---|---|---|---|---|
| Kinetics-400 | ViT Base | Base-Base-Small | 79.6 | 94.9 | data, index |
| Kinetics-400 | ViT Medium | Medium-Base-Small | 81.1 | 95.6 | data, index |
| Kinetics-400 | ViT Large | Large-Base-Small | 82.1 | 95.5 | data, index |
| Kinetics-400 | ViT Medium | Medium (single backbone) | 79.9 | 94.9 | data, index |
| Kinetics-600 | ViT Base | Base-Base-Small | 80.5 | 95.5 | data, index |
| Kinetics-600 | ViT Medium | Medium-Base-Small | 82.4 | 96.1 | data, index |
| Kinetics-600 | ViT Large | Large-Base-Small | 83.6 | 96.6 | data, index |
| Kinetics-600 | ViT Medium | Medium (single backbone) | 80.8 | 95.5 | data, index |
| Kinetics-700 | ViT Base | Base-Base-Small | - | - | TBD |
| Kinetics-700 | ViT Medium | Medium-Base-Small | - | - | TBD |
| Kinetics-700 | ViT Large | Large-Base-Small | 72.7 | 90.5 | data, index |
| Kinetics-700 | ViT Medium | Medium (single backbone) | - | - | TBD |
| Moments-in-Time | ViT Base | Base-Base-Small | 38.7 | 67.5 | data, index |
| Moments-in-Time | ViT Medium | Medium-Base-Small | 39.5 | 68.2 | data, index |
| Moments-in-Time | ViT Large | Large-Base-Small | 41.1 | 67.7 | data, index |
| Moments-in-Time | ViT Medium | Medium (single backbone) | 37.8 | 65.9 | data, index |
| Dataset | Model Type | Pre-trained Checkpoint | mAP | AUC | d-prime | Checkpoint |
|---|---|---|---|---|---|---|
| AudioSet | WaT Base | Base-Base-Small | 39.4 | 97.1 | 2.895 | data, index |
| AudioSet | WaT Medium | Medium (single backbone) | 39.3 | 97.0 | 2.884 | data, index |
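The data and index files listed above are the two shards of a standard TensorFlow checkpoint. As a minimal, hypothetical sketch (the `ckpt-1` prefix below is a placeholder for wherever you extract the files), you can inspect a downloaded checkpoint before restoring it into a model:

```python
import tensorflow as tf

# Point to the common prefix of the shards, e.g. a directory containing
# ckpt-1.data-00000-of-00001 and ckpt-1.index (names here are placeholders).
ckpt_prefix = "PATH/TO/CHECKPOINT/ckpt-1"

reader = tf.train.load_checkpoint(ckpt_prefix)
shape_map = reader.get_variable_to_shape_map()

# Print the stored variable names and shapes to verify the download is intact.
for name in sorted(shape_map):
  print(name, shape_map[name])
```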
@article{akbari2021vatt,
title={Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text},
author={Akbari, Hassan and Yuan, Liangzhe and Qian, Rui and Chuang, Wei-Hong and Chang, Shih-Fu and Cui, Yin and Gong, Boqing},
journal={arXiv preprint arXiv:2104.11178},
year={2021}
}
Any feedback is appreciated. If you observe any issues, please contact us.
Corresponding author: https://github.com/hassanhub