Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering
IEEE Transactions on Pattern Analysis and Machine Intelligence 2023
For more details, please refer to our paper [Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering](https://arxiv.org/abs/2207.12647)

<a href="https://orcid.org/0000-0002-9423-9252" target="orcid.widget" rel="noopener noreferrer" style="vertical-align:top;"><img src="https://orcid.org/sites/default/files/images/orcid_16x16.png" style="width:1em;margin-right:.5em;" alt="ORCID iD icon">orcid.org/0000-0002-9423-9252</a>

Homepage: [https://yangliu9208.github.io/home/](https://yangliu9208.github.io/home/)

### Abstract
Existing visual question answering methods often suffer from cross-modal spurious correlations and oversimplified event-level reasoning processes that fail to capture event temporality, causality, and dynamics spanning over the video. In this work, to address the task of event-level visual question answering, we propose a framework for cross-modal causal relational reasoning. In particular, a set of causal intervention operations is introduced to discover the underlying causal structures across visual and linguistic modalities. Our framework, named Cross-Modal Causal RelatIonal Reasoning (CMCIR), involves three modules: i) Causality-aware Visual-Linguistic Reasoning (CVLR) module for collaboratively disentangling the visual and linguistic spurious correlations via front-door and back-door causal interventions; ii) Spatial-Temporal Transformer (STT) module for capturing the fine-grained interactions between visual and linguistic semantics; iii) Visual-Linguistic Feature Fusion (VLFF) module for learning the global semantic-aware visual-linguistic representations adaptively. Extensive experiments on four event-level datasets demonstrate the superiority of our CMCIR in discovering visual-linguistic causal structures and achieving robust event-level visual question answering.
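
As background for the back-door intervention mentioned above, the display below shows the standard back-door adjustment identity from causal inference; it is the textbook formula, included only to clarify what an intervention computes, and is not an equation copied from the paper (here $Z$ stands for the assumed confounder):

$$
P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z=z)\,P(Z=z)
$$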
![Image](Fig1.png)
Figure 1: Framework of our proposed CMCIR.
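
As a rough illustration of how the three modules fit together, the sketch below composes placeholder CVLR, STT, and VLFF blocks in a PyTorch-style forward pass. All class interfaces, tensor dimensions, and the simplified internals are assumptions made for exposition; they are not the released CMCIR implementation.

```python
# Minimal sketch of the CMCIR pipeline described in the abstract.
# Module internals are placeholders; only the composition order is taken from the paper.
import torch
import torch.nn as nn


class CausalityAwareVisualLinguisticReasoning(nn.Module):
    """Placeholder CVLR: stands in for the front-door/back-door interventions."""
    def __init__(self, dim):
        super().__init__()
        self.visual_deconfound = nn.Linear(dim, dim)
        self.linguistic_deconfound = nn.Linear(dim, dim)

    def forward(self, visual, linguistic):
        # In the paper this stage removes spurious correlations via causal
        # interventions; a linear projection merely marks where that happens.
        return self.visual_deconfound(visual), self.linguistic_deconfound(linguistic)


class SpatialTemporalTransformer(nn.Module):
    """Placeholder STT: cross-modal interaction via a standard transformer layer."""
    def __init__(self, dim, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, visual, linguistic):
        # Concatenate visual and linguistic tokens and let self-attention
        # model their fine-grained interactions.
        return self.encoder(torch.cat([visual, linguistic], dim=1))


class VisualLinguisticFeatureFusion(nn.Module):
    """Placeholder VLFF: pools the joint sequence into an answer distribution."""
    def __init__(self, dim, num_answers):
        super().__init__()
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, joint):
        return self.classifier(joint.mean(dim=1))


class CMCIRSketch(nn.Module):
    def __init__(self, dim=256, num_answers=1000):
        super().__init__()
        self.cvlr = CausalityAwareVisualLinguisticReasoning(dim)
        self.stt = SpatialTemporalTransformer(dim)
        self.vlff = VisualLinguisticFeatureFusion(dim, num_answers)

    def forward(self, visual_tokens, question_tokens):
        v, q = self.cvlr(visual_tokens, question_tokens)
        joint = self.stt(v, q)
        return self.vlff(joint)


if __name__ == "__main__":
    model = CMCIRSketch()
    video = torch.randn(2, 16, 256)      # 2 videos, 16 frame features each
    question = torch.randn(2, 12, 256)   # 2 questions, 12 token features each
    print(model(video, question).shape)  # torch.Size([2, 1000])
```

In the full model, the CVLR stage performs the front-door and back-door interventions rather than a plain projection, and the STT and VLFF stages use the attention and fusion mechanisms described in the paper.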

### Experimental Results
![Image](SUTD.png)
Figure 2: Results on SUTD-TrafficQA dataset.
![Image](TGIF.png)
Figure 3: Results on TGIF-QA dataset.
![Image](MSVD.png)
Figure 4: Results on MSVD-QA dataset.
![Image](MSRVTT.png)
Figure 5: Results on MSRVTT-QA dataset.

### Requirements
- python3.7
- numpy
If you use this code for your research, please cite our paper.
If you have any questions about this code, feel free to reach me at liuy856@mail.sysu.edu.cn.
