This is the official repo for the paper: "MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training".
- [2024-09-20]: To better reflect the generality of our proposed method, we have renamed it RagVL.
- [2024-08-05]: Code of RagVL (RagLLaVA) released.
- [2024-07-31]: Paper of RagVL (RagLLaVA) available online.
The required libraries for running RagVL can be found in `requirements.txt`. We recommend following LLaVA to configure your environment.
Before running RagVL, please:
- Download the datasets and checkpoints from Google Drive.
- Download the image files from WebQA and MultimodalQA.
- Unzip the file, then place `checkpoints/` and `datasets/` into `RagVL/`.
- Place `tasks/` into `RagVL/finetune/`.
- Place `MMQA_imgs/` and `train_img/` into `RagVL/finetune/tasks/`.
- Place `val_image/` into `RagVL/datasets/`.
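Once everything is in place, a quick sanity check like the sketch below (run from the `RagVL/` root) can confirm the layout; the path list simply mirrors the placement steps above:

```python
import os

# Expected directories after the setup steps above, relative to RagVL/.
required = [
    "checkpoints",
    "datasets",
    "datasets/val_image",
    "finetune/tasks",
    "finetune/tasks/MMQA_imgs",
    "finetune/tasks/train_img",
]

for path in required:
    print(f"{path}: {'ok' if os.path.isdir(path) else 'MISSING'}")
```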
- Reranker

| Models | Global Batch Size | Epochs |
|---|---|---|
| LLaVA-v1.5-13B | 16 | 2 (WebQA) / 1 (others) |
| Qwen-VL-Chat | 16 | 2 (WebQA) / 1 (others) |
| mPLUG-Owl2 | 16 | 2 (WebQA) / 1 (others) |
| InternVL2-1B | 16 | 1 |
| InternVL2-2B | 16 | 1 |
- Generator

| Models | Global Batch Size | Epochs |
|---|---|---|
| LLaVA-v1.5-13B | 16 | 2 (WebQA) / 3 (MMQA) |
| InternVL2-1B | 16 | 1 |
| InternVL2-2B | 16 | 1 |
Except for the two hyperparameters above, all others follow the default settings of the respective models.
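For reference, the global batch size is the product of the per-device batch size, the gradient accumulation steps, and the number of GPUs; the particular split in this sketch is an illustrative assumption, not the repo's actual launch configuration:

```python
# Illustrative decomposition of the global batch size of 16.
# The specific split (4 GPUs x per-device batch 4 x 1 accumulation step)
# is an assumption for illustration, not the repo's actual setting.
per_device_batch_size = 4
grad_accum_steps = 1
num_gpus = 4

global_batch_size = per_device_batch_size * grad_accum_steps * num_gpus
assert global_batch_size == 16
```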
To finetune LLaVA-v1.5-13B, Qwen-VL-Chat, and mPLUG-Owl2, find the corresponding finetune script in `RagVL/finetune/scripts/`.
To finetune InternVL2-1B and InternVL2-2B, find the corresponding finetune script in `RagVL/internvl_chat/shell/internvl2.0/2nd_finetune`.
To evaluate RagVL on WebQA / MultimodalQA, use the following command:

```bash
# The same arguments apply to mmqa_pipeline.py.
# --reranker_model: select the reranker
# --generator_model: select the generator
# --filter: select the adaptive threshold
# --clip_topk: number of candidates retrieved first (20 by default)
python webqa_pipeline.py \
    --reranker_model caption_lora \
    --generator_model noise_injected_lora \
    --filter 0 \
    --clip_topk 20
```
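Conceptually, the pipeline retrieves `clip_topk` candidates, scores each with the MLLM reranker, and keeps only candidates that pass the adaptive threshold before generation. The sketch below illustrates that filtering step; `rerank_score` is a hypothetical stand-in for the reranker, not the repo's actual interface:

```python
def rerank_filter(query, candidates, rerank_score, threshold=0.0):
    """Keep candidates whose reranker relevance passes the threshold.

    `rerank_score(query, candidate)` stands in for the MLLM reranker's
    relevance score; the repo's actual interface may differ.
    """
    scored = sorted(
        ((cand, rerank_score(query, cand)) for cand in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [cand for cand, score in scored if score >= threshold]
```

The surviving candidates are then handed to the noise-injected generator to produce the final answer.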
To evaluate the oracle settings on WebQA / MultimodalQA, use the following command:

```bash
# The same arguments apply to mmqa_oracle.py.
python webqa_oracle.py
```
If you find this work interesting or inspiring, please cite us:
```bibtex
@article{chen2024mllm,
  title={MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training},
  author={Chen, Zhanpeng and Xu, Chengjin and Qi, Yiyan and Guo, Jian},
  journal={arXiv preprint arXiv:2407.21439},
  year={2024}
}
```
- LLaVA: Large Language and Vision Assistant
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- mPLUG-Owl: The Powerful Multi-modal Large Language Model Family
- InternVL: A Pioneering Open-Source Alternative to GPT-4o
- Visualized BGE: A universal multi-modal embedding model
- VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
- CAL: Prioritizing Visual Correlation by Contrastive Alignment