CMCIR

Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering
IEEE T-PAMI 2023
For more details, please refer to our paper Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering.

ORCID: orcid.org/0000-0002-9423-9252

Homepage: https://yangliu9208.github.io/home/

Abstract

Existing visual question answering methods tend to capture spurious correlations from the visual and linguistic modalities, and fail to discover the true causal mechanism that facilitates reasoning truthfully based on the dominant visual evidence and the correct question intention. Additionally, existing methods usually ignore the complex event-level understanding in multi-modal settings, which requires a strong cognitive capability of causal inference to jointly model cross-modal event temporality, causality, and dynamics. In this work, we focus on event-level visual question answering from a new perspective, i.e., cross-modal causal relational reasoning, by introducing causal intervention methods to mitigate spurious correlations and discover the true causal structures for the integration of visual and linguistic modalities. Specifically, we propose a novel event-level visual question answering framework named Cross-Modal Causal RelatIonal Reasoning (CMCIR), which consists of three essential components, namely a causality-aware visual-linguistic reasoning module, a spatial-temporal transformer, and a visual-linguistic feature fusion module, to achieve robust causality-aware visual-linguistic question answering. To uncover the causal structures for the visual and linguistic modalities, the novel Causality-aware Visual-Linguistic Reasoning (CVLR) module is proposed to collaboratively disentangle the visual and linguistic spurious correlations via elaborately designed front-door and back-door causal intervention modules. To discover the fine-grained interactions between linguistic semantics and spatial-temporal representations, we build a novel Spatial-Temporal Transformer (STT) that models the multi-modal co-occurrence interactions between visual and linguistic content. To adaptively fuse the causality-aware visual and linguistic features, we introduce a Visual-Linguistic Feature Fusion (VLFF) module that leverages the hierarchical linguistic semantic relations as guidance to adaptively learn the global semantic-aware visual-linguistic representations. Extensive experiments on the large-scale event-level urban dataset SUTD-TrafficQA and three benchmark real-world datasets, TGIF-QA, MSVD-QA, and MSRVTT-QA, demonstrate the effectiveness of our CMCIR in discovering visual-linguistic causal structures and achieving robust event-level visual question answering.

Model

Figure 1: Framework of our proposed CMCIR.

Requirements

Datasets

We conduct our experiments on the large-scale event-level urban dataset SUTD-TrafficQA and three benchmark real-world datasets: TGIF-QA, MSVD-QA, and MSRVTT-QA. The preprocessing steps are the same as the official ones; please refer to the respective datasets for more details.

Setups

  1. Download the SUTD-TrafficQA, TGIF-QA, MSVD-QA, and MSRVTT-QA datasets.
  2. Edit the absolute paths in preprocess/preprocess_features.py and preprocess/preprocess_questions.py according to where your data is located (see the illustrative sketch after this list).
  3. Install dependencies.
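For step 2, the exact variable names differ between the preprocessing scripts, so the snippet below is only an illustrative sketch of the kind of absolute paths you would point at your local copies; every name and path here is a placeholder, not taken from the repository.

# Illustrative placeholders only -- adapt to the variables the scripts actually define.
video_dir = "/absolute/path/to/SUTD-TrafficQA/raw_videos/"        # where the downloaded videos live
annotation_dir = "/absolute/path/to/SUTD-TrafficQA/annotations/"  # official question/answer files
output_dir = "/absolute/path/to/CMCIR/data/sutd-traffic/"         # where extracted features are written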

Experiments with SUTD-TrafficQA

We refer to the official SUTD-TrafficQA code for preprocessing.

Preprocess Linguistic Features

  1. Download the pretrained 300d GloVe word vectors to /data/glove/ and process them into a pickle file (a minimal sketch of this conversion is shown after the command):
python txt2pickle.py
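If you want to see what this conversion amounts to, here is a minimal sketch (not the repository's txt2pickle.py); the input and output file names are assumptions and should match wherever you placed the GloVe file.

# Minimal sketch: read a 300d GloVe text file and store {word: vector} as a pickle.
# File names below are assumptions -- adjust them to your layout.
import pickle
import numpy as np

glove_txt = "data/glove/glove.840B.300d.txt"   # assumed input
glove_pkl = "data/glove/glove.840B.300d.pkl"   # assumed output

embeddings = {}
with open(glove_txt, encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        word, vec = parts[0], np.asarray(parts[1:], dtype=np.float32)
        if vec.shape[0] == 300:                # skip malformed lines
            embeddings[word] = vec

with open(glove_pkl, "wb") as f:
    pickle.dump(embeddings, f)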

  2. Preprocess train/val/test questions:
python 1_preprocess_questions_oie.py --mode train
    
python 1_preprocess_questions_oie.py --mode test

Preprocess Visual Features

  1. To extract appearance features with the Swin or ResNet-101 model (a generic feature-extraction sketch is given at the end of this subsection):
    Download the Swin pretrained model (swin_large_patch4_window7_224_22k.pth) and place it in configs/.
python 1_preprocess_features_appearance.py --model Swin --question_type none

 or
 
python 1_preprocess_features_appearance.py --model resnet101 --question_type none

  2. To extract motion features with the Swin or ResNeXt-101 model:

Download the Swin3D pretrained model (swin_base_patch244_window877_kinetics600_22k.pth) and place it in configs/.

Download the ResNeXt-101 pretrained model (resnext-101-kinetics.pth) and place it in data/preprocess/pretrained/.

python 1_preprocess_features_motion.py --model Swin --question_type none

or

python 1_preprocess_features_motion.py --model resnext101 --question_type none
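As a rough mental model of what the appearance-feature scripts compute, the sketch below pools per-frame features from a torchvision ResNet-101 backbone. It is not the repository's preprocessing code: frame sampling, batching, and the saved file format are assumptions, and the actual scripts also support the Swin backbones listed above.

# Minimal sketch (not 1_preprocess_features_appearance.py): pooled per-frame
# ResNet-101 features for a list of frame images.
import torch
import torch.nn as nn
from PIL import Image
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# ImageNet-pretrained backbone with the classification head removed (keeps pool5).
backbone = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
backbone = nn.Sequential(*list(backbone.children())[:-1]).eval()

@torch.no_grad()
def appearance_features(frame_paths):
    """Return a (num_frames, 2048) tensor of pooled appearance features."""
    frames = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in frame_paths])
    return backbone(frames).flatten(1)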

Visual K-means Clustering

  1. To extract training appearance features with the Swin or ResNet-101 model:
python 1_preprocess_features_appearance_train.py --model Swin --question_type none

 or
 
python 1_preprocess_features_appearance_train.py --model resnet101 --question_type none

  2. To extract training motion features with the Swin or ResNeXt-101 model:
python 1_preprocess_features_motion_train.py --model Swin --question_type none

or

python 1_preprocess_features_motion_train.py --model resnext101 --question_type none
  3. K-means clustering (a minimal sketch of this step follows below):
python k_means.py

Edit the absolute paths according to where your data is located.
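Conceptually, this step clusters the training appearance/motion features into a visual dictionary. The sketch below uses scikit-learn rather than the repository's k_means.py; the feature file names, the array layout, and the number of clusters are assumptions.

# Minimal sketch (not the repository's k_means.py): cluster training features
# and save the centroids as the visual dictionary.
import numpy as np
from sklearn.cluster import KMeans

features = np.load("data/sutd-traffic/appearance_feat_train.npy")   # assumed (N, D) feature array
features = features.reshape(-1, features.shape[-1])

kmeans = KMeans(n_clusters=512, n_init=10, random_state=0).fit(features)
np.save("data/sutd-traffic/appearance_centroids.npy", kmeans.cluster_centers_)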

Training and Testing

python train_SUTD.py

Experiments with TGIF-QA

Depending on the task, choose question_type from the four options: action, transition, count, frameqa.

Preprocess Linguistic Features

  1. Preprocess train/val/test questions:
python 1_preprocess_questions_oie_tgif.py --mode train --question_type {question_type}
    
python 1_preprocess_questions_oie_tgif.py --mode test  --question_type {question_type}

Preprocess Visual Features

  1. To extract appearance features with the Swin or ResNet-101 model:
python 1_preprocess_features_appearance_tgif_total.py --model Swin --question_type {question_type}

 or
 
python 1_preprocess_features_appearance_tgif_total.py --model resnet101 --question_type {question_type}

  2. To extract motion features with the Swin or ResNeXt-101 model:
python 1_preprocess_features_motion_tgif_total.py --model Swin --question_type {question_type}

or

python 1_preprocess_features_motion_tgif_total.py --model resnext101 --question_type {question_type}

Visual K-means Clustering

  1. To extract training appearance features with the Swin or ResNet-101 model:
python 1_preprocess_features_appearance_tgif.py --model Swin --question_type {question_type}

 or
 
python 1_preprocess_features_appearance_tgif.py --model resnet101 --question_type {question_type}

  2. To extract training motion features with the Swin or ResNeXt-101 model:
python 1_preprocess_features_motion_tgif.py --model Swin --question_type {question_type}

or

python 1_preprocess_features_motion_tgif.py --model resnext101 --question_type {question_type}

  3. K-means clustering:
python k_means.py

Edit the absolute paths according to where your data is located.

Training and Testing

python train_TGIF_Action.py

python train_TGIF_Transition.py

python train_TGIF_Count.py

python train_TGIF_FrameQA.py

Experiments with MSVD-QA/MSRVTT-QA

Preprocess Linguistic Features

  1. Preprocess train/val/test questions:
python 1_preprocess_questions_oie_msvd.py --mode train
    
python 1_preprocess_questions_oie_msvd.py --mode test

or

python 1_preprocess_questions_oie_msrvtt.py --mode train
    
python 1_preprocess_questions_oie_msrvtt.py --mode test

Preprocess Visual Features

  1. To extract appearance features with the Swin or ResNet-101 model:
python 1_preprocess_features_appearance_msvd.py --model Swin --question_type none

python 1_preprocess_features_appearance_msrvtt.py --model Swin --question_type none

 or
 
python 1_preprocess_features_appearance_msvd.py --model resnet101 --question_type none

python 1_preprocess_features_appearance_msrvtt.py --model resnet101 --question_type none

  2. To extract motion features with the Swin or ResNeXt-101 model:
python 1_preprocess_features_motion_msvd.py --model Swin --question_type none

python 1_preprocess_features_motion_msrvtt.py --model Swin --question_type none

or

python 1_preprocess_features_motion_msvd.py --model resnext101 --question_type none

python 1_preprocess_features_motion_msrvtt.py --model resnext101 --question_type none

Visual K-means Clustering

  1. To extract training appearance features with the Swin or ResNet-101 model:
python 1_preprocess_features_appearance_msvd_train.py --model Swin --question_type none

python 1_preprocess_features_appearance_msrvtt_train.py --model Swin --question_type none

 or
 
python 1_preprocess_features_appearance_msvd_train.py --model resnet101 --question_type none

python 1_preprocess_features_appearance_msrvtt_train.py --model resnet101 --question_type none

  2. To extract training motion features with the Swin or ResNeXt-101 model:
python 1_preprocess_features_motion_msvd_train.py --model Swin --question_type none

python 1_preprocess_features_motion_msrvtt_train.py --model Swin --question_type none

or

python 1_preprocess_features_motion_msvd_train.py --model resnext101 --question_type none

python 1_preprocess_features_motion_msrvtt_train.py --model resnext101 --question_type none

  3. K-means clustering:
python k_means.py

Edit the absolute paths according to where your data is located.

Training and Testing

python train_MSVD.py

python train_MSRVTT.py

Citation

If you use this code for your research, please cite our paper.

@article{liu2022cross,
  title={Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering},
  author={Liu, Yang and Li, Guanbin and Lin, Liang},
  journal={arXiv preprint arXiv:2207.12647},
  year={2022}
}

If you have any questions about this code, feel free to contact me at liuy856@mail.sysu.edu.cn.
