Video Object Grounding using Semantic Roles in Language Description
Arka Sadhu, Kan Chen, Ram Nevatia
CVPR 2020
Video Object Grounding (VOG) is the task of localizing the objects in a video that are referred to in a query sentence description. We elevate the role of object relations via spatial and temporal concatenation of contrastive examples sampled from a newly contributed dataset called ActivityNet-SRL (ASRL).
This repository includes:
- code to create the ActivityNet-SRL dataset under dcode/
- code to run all the experiments provided in the paper under code/
- To foster reproducibility of our results, links to all trained models in the paper along with their log files are provided in EXPTS.md
Code has been modularized from its initial implementation. It should be easy to extend the code for other datasets by inheriting relevant modules.
Requirements:
- python>=3.6
- pytorch==1.1 (should work with pytorch >=1.3 as well but not tested)
To use the same environment, you can use conda with the provided environment file conda_env_vog.yml. Please refer to Miniconda for details on installing conda.
```
MINICONDA_ROOT=[to your Miniconda/Anaconda root directory]
conda env create -f conda_env_vog.yml --prefix $MINICONDA_ROOT/envs/vog_pyt
conda activate vog_pyt
```
- Clone repo:
    git clone https://github.com/TheShadow29/vognet-pytorch.git
    cd vognet-pytorch
    export ROOT=$(pwd)
- Download Data (~530 GB) (see DATA_README for more details):
    cd $ROOT/data
    bash download_data.sh all [data_folder]
- Train Models:
    cd $ROOT
    python code/main_dist.py "spat_vog_gt5" --ds.exp_setting='gt5' --mdl.name='vog' --mdl.obj_tx.use_rel=True --mdl.mul_tx.use_rel=True --train.prob_thresh=0.2 --train.bs=4 --train.epochs=10 --train.lr=1e-4
If you just want to use ASRL, refer to DATA_README, which contains direct links to download ASRL.
If instead, you want to recreate ASRL from ActivityNet Entities and ActivityNet Captions, or perhaps want to extend to a newer dataset, refer to DATA_PREP_README.md
Basic usage is python code/main_dist.py "experiment_name" --arg1=val1 --arg2=val2, where arg1 and arg2 can be found in configs/anet_srl_cfg.yml.
The hierarchical structure of the yml file is also supported, using a dot (.) to separate levels.
For example, if you want to change the mdl name, which looks like
    mdl:
      name: xyz
you can pass --mdl.name='abc'
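To make the dotted override concrete, here is a minimal sketch of how a `--a.b=value` argument could be mapped onto a nested config. It is not the repository's implementation: `apply_overrides` is a hypothetical helper, the config is a plain dict standing in for the contents of configs/anet_srl_cfg.yml, and parsing values with `ast.literal_eval` is an assumption made for illustration.

```python
import ast
from typing import Any, Dict, List


def apply_overrides(cfg: Dict[str, Any], args: List[str]) -> Dict[str, Any]:
    """Apply --a.b.c=value style overrides to a nested dict in place (hypothetical helper)."""
    for arg in args:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError(f"expected --key=value, got {arg!r}")
        dotted_key, raw_value = arg[2:].split("=", 1)
        try:
            # Interpret numbers, booleans and quoted strings as Python literals.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Fall back to the raw string, e.g. once the shell has removed the quotes.
            value = raw_value
        *parents, leaf = dotted_key.split(".")
        node = cfg
        for name in parents:
            node = node[name]  # a KeyError here means the section does not exist
        if leaf not in node:
            raise KeyError(f"unknown config key: {dotted_key}")
        node[leaf] = value
    return cfg


# Stand-in for a small slice of the yml config (values are placeholders).
cfg = {"mdl": {"name": "xyz", "obj_tx": {"use_rel": False}}, "train": {"lr": 0.001}}
apply_overrides(cfg, ["--mdl.name='abc'", "--mdl.obj_tx.use_rel=True", "--train.lr=1e-4"])
print(cfg["mdl"]["name"], cfg["mdl"]["obj_tx"]["use_rel"], cfg["train"]["lr"])
```

Rejecting keys that are not already in the config is a design choice of this sketch; it catches typos such as `--mdl.nmae` early instead of silently adding a new entry.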
As an example, to train VOGNet using the spat strategy with the gt5 setting:
    python code/main_dist.py "spat_vog_gt5" --ds.exp_setting='gt5' --mdl.name='vog' --mdl.obj_tx.use_rel=True --mdl.mul_tx.use_rel=True --train.prob_thresh=0.2 --train.bs=4 --train.epochs=10 --train.lr=1e-4
You can also change the default settings directly in configs/anet_srl_cfg.yml.
See EXPTS.md for command-line instructions for all experiments.
Logs are stored inside the tmp/ directory. When you run the code with $exp_name, the following are stored:
- txt_logs/$exp_name.txt: the config used and the training and validation losses after every epoch.
- models/$exp_name.pth: the model, optimizer, scheduler, accuracy, and number of epochs and iterations completed. Only the best model up to the current epoch is stored.
- ext_logs/$exp_name.txt: the logger.debug outputs, stored using Python's logging module. Mainly used for debugging.
- predictions: the validation outputs of the current best model.
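The layout above can be summarized in a few lines of Python. This is only a sketch for locating the files of a given run: `expected_outputs` is a hypothetical helper, not part of the repository, and it assumes that predictions are written to a per-experiment subdirectory, as in the ./tmp/predictions/spat_vog_gt5_valid/ path used in the evaluation example.

```python
from pathlib import Path
from typing import Dict


def expected_outputs(exp_name: str, tmp_dir: str = "tmp") -> Dict[str, Path]:
    """Return the paths where a run named exp_name is expected to store its outputs."""
    root = Path(tmp_dir)
    return {
        "txt_log": root / "txt_logs" / f"{exp_name}.txt",
        "model": root / "models" / f"{exp_name}.pth",
        "ext_log": root / "ext_logs" / f"{exp_name}.txt",
        # Assumption: one prediction directory per experiment name.
        "predictions": root / "predictions" / exp_name,
    }


# Report which outputs of a run are already on disk.
for kind, path in expected_outputs("spat_vog_gt5").items():
    status = "found" if path.exists() else "missing"
    print(f"{kind:12s} {path} ({status})")
```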
To evaluate a model, you need to first load it and then pass --only_val=True.
As an example, to validate the VOGNet model trained using the spat strategy with the gt5 setting:
    python code/main_dist.py "spat_vog_gt5_valid" --train.resume=True --train.resume_path='./tmp/models/spat_vog_gt5.pth' --mdl.name='vog' --mdl.obj_tx.use_rel=True --mdl.mul_tx.use_rel=True --only_val=True --train.prob_thresh=0.2
This will create ./tmp/predictions/spat_vog_gt5_valid/valid_0.pkl and print out the metrics.
You can also evaluate this file using code/eval_fn_corr.py. This assumes that the valid_0.pkl file has already been generated.
    python code/eval_fn_corr.py --pred_file='./tmp/predictions/spat_vog_gt5_valid/valid_0.pkl' --split_type='valid' --train.prob_thresh=0.2
To evaluate on test, simply use --split_type='test'
If you are using your own code but just want to use the evaluation, you must save your output in the following format:
    [
        {
            'idx_sent': id of the input query
            'pred_boxes': # num_srls x num_vids x num_frames x 5d prop boxes
            'pred_scores': # num_srls x num_vids x num_frames (between 0-1)
            'pred_cmp': # num_srls x num_frames (only required for sep). Basically, which video to choose
            'cmp_msk': 1/0s if any videos were padded and hence not considered
            'targ_cmp': which is the target video. This is in prediction and not ground-truth since we shuffle the video list at runtime
        },
        ...
    ]
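As a rough illustration of the format above, the sketch below builds two placeholder entries and writes them to a pickle file. Several things here are assumptions rather than facts about the repository: the file is pickled because the repository's own predictions end in .pkl, `cmp_msk` is taken to hold one flag per video, `targ_cmp` is taken to be an integer index, and plain nested lists are used only to keep the example dependency-free (the evaluation script may expect arrays or tensors, so check code/eval_fn_corr.py before relying on this). `make_dummy_entry`, `shape` and `my_preds.pkl` are hypothetical names.

```python
import pickle
import random
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("idx_sent", "pred_boxes", "pred_scores", "pred_cmp", "cmp_msk", "targ_cmp")


def make_dummy_entry(idx_sent, num_srls=2, num_vids=4, num_frames=3, seed=0):
    """Build one placeholder entry with the shapes listed above (random values)."""
    rng = random.Random(seed)
    return {
        "idx_sent": idx_sent,
        # num_srls x num_vids x num_frames x 5
        "pred_boxes": [[[[rng.random() for _ in range(5)] for _ in range(num_frames)]
                        for _ in range(num_vids)] for _ in range(num_srls)],
        # num_srls x num_vids x num_frames, values in [0, 1)
        "pred_scores": [[[rng.random() for _ in range(num_frames)]
                         for _ in range(num_vids)] for _ in range(num_srls)],
        # num_srls x num_frames, index of the chosen video (only needed for sep)
        "pred_cmp": [[rng.randrange(num_vids) for _ in range(num_frames)]
                     for _ in range(num_srls)],
        # Assumption: one flag per video, 1 = real video, 0 = padded
        "cmp_msk": [1] * num_vids,
        # Assumption: integer index of the target video in the shuffled list
        "targ_cmp": 0,
    }


def shape(nested):
    """Return the shape of a regular, non-empty nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Write a list of entries, then read it back and check keys and shapes.
entries = [make_dummy_entry(idx_sent=i) for i in range(2)]
out_file = Path(tempfile.mkdtemp()) / "my_preds.pkl"
with open(out_file, "wb") as f:
    pickle.dump(entries, f)

with open(out_file, "rb") as f:
    loaded = pickle.load(f)
assert all(key in loaded[0] for key in REQUIRED_KEYS)
print(shape(loaded[0]["pred_boxes"]), shape(loaded[0]["pred_scores"]), shape(loaded[0]["pred_cmp"]))
```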
Google Drive Link for all models: https://drive.google.com/open?id=1e3FiX4FTC8n6UrzY9fTYQzFNKWHihzoQ
Also see the individual models (with corresponding logs) in EXPTS.md.
We thank:
- @LuoweiZhou: for his codebase on GVD (https://github.com/facebookresearch/grounded-video-description) along with the extracted features.
- allennlp for providing demo and pre-trained model for SRL.
- fairseq for providing a neat implementation of LSTM.
@InProceedings{Sadhu_2020_CVPR,
author = {Sadhu, Arka and Chen, Kan and Nevatia, Ram},
title = {Video Object Grounding using Semantic Roles in Language Description},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2020}
}