Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering
IEEE Transactions on Pattern Analysis and Machine Intelligence 2023
For more details, please refer to our paper [Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering](https://arxiv.org/abs/2207.12647)

<a href="https://orcid.org/0000-0002-9423-9252" target="orcid.widget" rel="noopener noreferrer" style="vertical-align:top;"><img src="https://orcid.org/sites/default/files/images/orcid_16x16.png" style="width:1em;margin-right:.5em;" alt="ORCID iD icon">orcid.org/0000-0002-9423-9252</a>

Homepage: [https://yangliu9208.github.io/home/](https://yangliu9208.github.io/home/)

### Abstract
Existing visual question answering methods often suffer from cross-modal spurious correlations and oversimplified event-level reasoning processes that fail to capture event temporality, causality, and dynamics spanning over the video. In this work, to address the task of event-level visual question answering, we propose a framework for cross-modal causal relational reasoning. In particular, a set of causal intervention operations is introduced to discover the underlying causal structures across visual and linguistic modalities. Our framework, named Cross-Modal Causal RelatIonal Reasoning (CMCIR), involves three modules: i) Causality-aware Visual-Linguistic Reasoning (CVLR) module for collaboratively disentangling the visual and linguistic spurious correlations via front-door and back-door causal interventions; ii) Spatial-Temporal Transformer (STT) module for capturing the fine-grained interactions between visual and linguistic semantics; iii) Visual-Linguistic Feature Fusion (VLFF) module for learning the global semantic-aware visual-linguistic representations adaptively. Extensive experiments on four event-level datasets demonstrate the superiority of our CMCIR in discovering visual-linguistic causal structures and achieving robust event-level visual question answering.
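
As background for the back-door intervention mentioned above, the display below shows the standard back-door adjustment identity from causal inference; it is the textbook formula, included only to clarify what an intervention computes, and is not an equation copied from the paper (here $Z$ stands for the assumed confounder):

$$
P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z=z)\,P(Z=z)
$$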
![Image](Fig1.png)
Figure 1: Framework of our proposed CMCIR.
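
As a rough illustration of how the three modules fit together, the sketch below composes placeholder CVLR, STT, and VLFF blocks in a PyTorch-style forward pass. All class interfaces, tensor dimensions, and the simplified internals are assumptions made for exposition; they are not the released CMCIR implementation.

```python
# Minimal sketch of the CMCIR pipeline described in the abstract.
# Module internals are placeholders; only the composition order is taken from the paper.
import torch
import torch.nn as nn


class CausalityAwareVisualLinguisticReasoning(nn.Module):
    """Placeholder CVLR: stands in for the front-door/back-door interventions."""
    def __init__(self, dim):
        super().__init__()
        self.visual_deconfound = nn.Linear(dim, dim)
        self.linguistic_deconfound = nn.Linear(dim, dim)

    def forward(self, visual, linguistic):
        # In the paper this stage removes spurious correlations via causal
        # interventions; a linear projection merely marks where that happens.
        return self.visual_deconfound(visual), self.linguistic_deconfound(linguistic)


class SpatialTemporalTransformer(nn.Module):
    """Placeholder STT: cross-modal interaction via a standard transformer layer."""
    def __init__(self, dim, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, visual, linguistic):
        # Concatenate visual and linguistic tokens and let self-attention
        # model their fine-grained interactions.
        return self.encoder(torch.cat([visual, linguistic], dim=1))


class VisualLinguisticFeatureFusion(nn.Module):
    """Placeholder VLFF: pools the joint sequence into an answer distribution."""
    def __init__(self, dim, num_answers):
        super().__init__()
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, joint):
        return self.classifier(joint.mean(dim=1))


class CMCIRSketch(nn.Module):
    def __init__(self, dim=256, num_answers=1000):
        super().__init__()
        self.cvlr = CausalityAwareVisualLinguisticReasoning(dim)
        self.stt = SpatialTemporalTransformer(dim)
        self.vlff = VisualLinguisticFeatureFusion(dim, num_answers)

    def forward(self, visual_tokens, question_tokens):
        v, q = self.cvlr(visual_tokens, question_tokens)
        joint = self.stt(v, q)
        return self.vlff(joint)


if __name__ == "__main__":
    model = CMCIRSketch()
    video = torch.randn(2, 16, 256)      # 2 videos, 16 frame features each
    question = torch.randn(2, 12, 256)   # 2 questions, 12 token features each
    print(model(video, question).shape)  # torch.Size([2, 1000])
```

In the full model, the CVLR stage performs the front-door and back-door interventions rather than a plain projection, and the STT and VLFF stages use the attention and fusion mechanisms described in the paper.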

### Experimental Results
![Image](SUTD.png)
Figure 2: Results on SUTD-TrafficQA dataset.
![Image](TGIF.png)
Figure 3: Results on TGIF-QA dataset.
![Image](MSVD.png)
Figure 4: Results on MSVD-QA dataset.
![Image](MSRVTT.png)
Figure 5: Results on MSRVTT-QA dataset.

### Requirements
- python3.7
- numpy
If you use this code for your research, please cite our paper.
If you have any questions about this code, feel free to reach me at liuy856@mail.sysu.edu.cn.
