CMCIR

Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering
IEEE T-PAMI 2023
For more details, please refer to our paper Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering.

ORCID: orcid.org/0000-0002-9423-9252

Homepage: https://yangliu9208.github.io/home/

Abstract

Existing visual question answering methods tend to capture spurious correlations from the visual and linguistic modalities, and fail to discover the true causal mechanism that facilitates reasoning truthfully based on the dominant visual evidence and the correct question intention. Additionally, existing methods usually ignore the complex event-level understanding in multi-modal settings, which requires a strong cognitive capability of causal inference to jointly model cross-modal event temporality, causality, and dynamics. In this work, we focus on event-level visual question answering from a new perspective, i.e., cross-modal causal relational reasoning, by introducing causal intervention methods to mitigate spurious correlations and discover the true causal structures for the integration of visual and linguistic modalities. Specifically, we propose a novel event-level visual question answering framework named Cross-Modal Causal RelatIonal Reasoning (CMCIR), which consists of three essential components, namely a causality-aware visual-linguistic reasoning module, a spatial-temporal transformer, and a visual-linguistic feature fusion module, to achieve robust causality-aware visual-linguistic question answering. To uncover the causal structures for the visual and linguistic modalities, the novel Causality-aware Visual-Linguistic Reasoning (CVLR) module is proposed to collaboratively disentangle the visual and linguistic spurious correlations via elaborately designed front-door and back-door causal intervention modules. To discover the fine-grained interactions between linguistic semantics and spatial-temporal representations, we build a novel Spatial-Temporal Transformer (STT) that models the multi-modal co-occurrence interactions between visual and linguistic content. To adaptively fuse the causality-aware visual and linguistic features, we introduce a Visual-Linguistic Feature Fusion (VLFF) module that leverages the hierarchical linguistic semantic relations as guidance to adaptively learn the global semantic-aware visual-linguistic representations. Extensive experiments on the large-scale event-level urban dataset SUTD-TrafficQA and three benchmark real-world datasets, TGIF-QA, MSVD-QA, and MSRVTT-QA, demonstrate the effectiveness of our CMCIR in discovering visual-linguistic causal structures and achieving robust event-level visual question answering.

Model

Figure 1: Framework of our proposed CMCIR.

Requirements

Datasets

We conduct our experiments on the large-scale event-level urban dataset SUTD-TrafficQA and three benchmark real-world datasets: TGIF-QA, MSVD-QA, and MSRVTT-QA. The preprocessing steps are the same as the official ones; please refer to the respective datasets for more details.

Setups

  1. Download the SUTD-TrafficQA, TGIF-QA, MSVD-QA, and MSRVTT-QA datasets.
  2. Edit the absolute paths in preprocess/preprocess_features.py and preprocess/preprocess_questions.py according to where your data is located (see the illustrative sketch after this list).
  3. Install dependencies.
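For step 2, the exact variable names differ between the preprocessing scripts, so the snippet below is only an illustrative sketch of the kind of absolute paths you would point at your local copies; every name and path here is a placeholder, not taken from the repository.

# Illustrative placeholders only -- adapt to the variables the scripts actually define.
video_dir = "/absolute/path/to/SUTD-TrafficQA/raw_videos/"        # where the downloaded videos live
annotation_dir = "/absolute/path/to/SUTD-TrafficQA/annotations/"  # official question/answer files
output_dir = "/absolute/path/to/CMCIR/data/sutd-traffic/"         # where extracted features are written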

Experiments with SUTD-TrafficQA

We refer to the official SUTD-TrafficQA code for preprocessing.

Preprocess Linguistic Features

  1. Download the pretrained 300d GloVe word vectors to /data/glove/ and process them into a pickle file (a minimal sketch of this conversion is shown after the command):
python txt2pickle.py
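If you want to see what this conversion amounts to, here is a minimal sketch (not the repository's txt2pickle.py); the input and output file names are assumptions and should match wherever you placed the GloVe file.

# Minimal sketch: read a 300d GloVe text file and store {word: vector} as a pickle.
# File names below are assumptions -- adjust them to your layout.
import pickle
import numpy as np

glove_txt = "data/glove/glove.840B.300d.txt"   # assumed input
glove_pkl = "data/glove/glove.840B.300d.pkl"   # assumed output

embeddings = {}
with open(glove_txt, encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        word, vec = parts[0], np.asarray(parts[1:], dtype=np.float32)
        if vec.shape[0] == 300:                # skip malformed lines
            embeddings[word] = vec

with open(glove_pkl, "wb") as f:
    pickle.dump(embeddings, f)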

  2. Preprocess train/val/test questions:
python 1_preprocess_questions_oie.py --mode train
    
python 1_preprocess_questions_oie.py --mode test

Preprocess Visual Features

  1. To extract appearance features with the Swin or ResNet-101 model (a generic feature-extraction sketch is given at the end of this subsection):
    Download the Swin pretrained model (swin_large_patch4_window7_224_22k.pth) and place it in configs/.
python 1_preprocess_features_appearance.py --model Swin --question_type none

 or
 
python 1_preprocess_features_appearance.py --model resnet101 --question_type none

  2. To extract motion features with the Swin or ResNeXt-101 model:

Download the Swin3D pretrained model (swin_base_patch244_window877_kinetics600_22k.pth) and place it in configs/.

Download the ResNeXt-101 pretrained model (resnext-101-kinetics.pth) and place it in data/preprocess/pretrained/.

python 1_preprocess_features_motion.py --model Swin --question_type none

or

python 1_preprocess_features_motion.py --model resnext101 --question_type none
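As a rough mental model of what the appearance-feature scripts compute, the sketch below pools per-frame features from a torchvision ResNet-101 backbone. It is not the repository's preprocessing code: frame sampling, batching, and the saved file format are assumptions, and the actual scripts also support the Swin backbones listed above.

# Minimal sketch (not 1_preprocess_features_appearance.py): pooled per-frame
# ResNet-101 features for a list of frame images.
import torch
import torch.nn as nn
from PIL import Image
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# ImageNet-pretrained backbone with the classification head removed (keeps pool5).
backbone = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
backbone = nn.Sequential(*list(backbone.children())[:-1]).eval()

@torch.no_grad()
def appearance_features(frame_paths):
    """Return a (num_frames, 2048) tensor of pooled appearance features."""
    frames = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in frame_paths])
    return backbone(frames).flatten(1)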

Visual K-means Clustering

  1. To extract training appearance features with the Swin or ResNet-101 model:
python 1_preprocess_features_appearance_train.py --model Swin --question_type none

 or
 
python 1_preprocess_features_appearance_train.py --model resnet101 --question_type none

  2. To extract training motion features with the Swin or ResNeXt-101 model:
python 1_preprocess_features_motion_train.py --model Swin --question_type none

or

python 1_preprocess_features_motion_train.py --model resnext101 --question_type none
  3. K-means clustering (a minimal sketch of this step follows below):
python k_means.py

Edit the absolute paths according to where your data is located.
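Conceptually, this step clusters the training appearance/motion features into a visual dictionary. The sketch below uses scikit-learn rather than the repository's k_means.py; the feature file names, the array layout, and the number of clusters are assumptions.

# Minimal sketch (not the repository's k_means.py): cluster training features
# and save the centroids as the visual dictionary.
import numpy as np
from sklearn.cluster import KMeans

features = np.load("data/sutd-traffic/appearance_feat_train.npy")   # assumed (N, D) feature array
features = features.reshape(-1, features.shape[-1])

kmeans = KMeans(n_clusters=512, n_init=10, random_state=0).fit(features)
np.save("data/sutd-traffic/appearance_centroids.npy", kmeans.cluster_centers_)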

Training and Testing

python train_SUTD.py

Experiments with TGIF-QA

Depending on the task, choose question_type from the four options: action, transition, count, frameqa.

Preprocess Linguistic Features

  1. Preprocess train/val/test questions:
python 1_preprocess_questions_oie_tgif.py --mode train --question_type {question_type}
    
python 1_preprocess_questions_oie_tgif.py --mode test  --question_type {question_type}

Preprocess Visual Features

  1. To extract appearance features with the Swin or ResNet-101 model:
python 1_preprocess_features_appearance_tgif_total.py --model Swin --question_type {question_type}

 or
 
python 1_preprocess_features_appearance_tgif_total.py --model resnet101 --question_type {question_type}

  2. To extract motion features with the Swin or ResNeXt-101 model:
python 1_preprocess_features_motion_tgif_total.py --model Swin --question_type {question_type}

or

python 1_preprocess_features_motion_tgif_total.py --model resnext101 --question_type {question_type}

Visual K-means Clustering

  1. To extract training appearance features with the Swin or ResNet-101 model:
python 1_preprocess_features_appearance_tgif.py --model Swin --question_type {question_type}

 or
 
python 1_preprocess_features_appearance_tgif.py --model resnet101 --question_type {question_type}

  2. To extract training motion features with the Swin or ResNeXt-101 model:
python 1_preprocess_features_motion_tgif.py --model Swin --question_type {question_type}

or

python 1_preprocess_features_motion_tgif.py --model resnext101 --question_type {question_type}

  3. K-means clustering:
python k_means.py

Edit the absolute paths according to where your data is located.

Training and Testing

python train_TGIF_Action.py

python train_TGIF_Transition.py

python train_TGIF_Count.py

python train_TGIF_FrameQA.py

Experiments with MSVD-QA/MSRVTT-QA

Preprocess Linguistic Features

  1. Preprocess train/val/test questions:
python 1_preprocess_questions_oie_msvd.py --mode train
    
python 1_preprocess_questions_oie_msvd.py --mode test

or

python 1_preprocess_questions_oie_msrvtt.py --mode train
    
python 1_preprocess_questions_oie_msrvtt.py --mode test

Preprocess Visual Features

  1. To extract appearance features with the Swin or ResNet-101 model:
python 1_preprocess_features_appearance_msvd.py --model Swin --question_type none

python 1_preprocess_features_appearance_msrvtt.py --model Swin --question_type none

 or
 
python 1_preprocess_features_appearance_msvd.py --model resnet101 --question_type none

python 1_preprocess_features_appearance_msrvtt.py --model resnet101 --question_type none

  2. To extract motion features with the Swin or ResNeXt-101 model:
python 1_preprocess_features_motion_msvd.py --model Swin --question_type none

python 1_preprocess_features_motion_msrvtt.py --model Swin --question_type none

or

python 1_preprocess_features_motion_msvd.py --model resnext101 --question_type none

python 1_preprocess_features_motion_msrvtt.py --model resnext101 --question_type none

Visual K-means Clustering

  1. To extract training appearance features with the Swin or ResNet-101 model:
python 1_preprocess_features_appearance_msvd_train.py --model Swin --question_type none

python 1_preprocess_features_appearance_msrvtt_train.py --model Swin --question_type none

 or
 
python 1_preprocess_features_appearance_msvd_train.py --model resnet101 --question_type none

python 1_preprocess_features_appearance_msrvtt_train.py --model resnet101 --question_type none

  2. To extract training motion features with the Swin or ResNeXt-101 model:
python 1_preprocess_features_motion_msvd_train.py --model Swin --question_type none

python 1_preprocess_features_motion_msrvtt_train.py --model Swin --question_type none

or

python 1_preprocess_features_motion_msvd_train.py --model resnext101 --question_type none

python 1_preprocess_features_motion_msrvtt_train.py --model resnext101 --question_type none

  3. K-means clustering:
python k_means.py

Edit the absolute paths according to where your data is located.

Training and Testing

python train_MSVD.py

python train_MSRVTT.py

Citation

If you use this code for your research, please cite our paper.

@article{liu2022cross,
  title={Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering},
  author={Liu, Yang and Li, Guanbin and Lin, Liang},
  journal={arXiv preprint arXiv:2207.12647},
  year={2022}
}

If you have any questions about this code, feel free to contact me at liuy856@mail.sysu.edu.cn.
