Multimodal Preference Data Synthetic Alignment with Reward Model
🔥 Introducing PDS-DPO: a new pipeline for generating synthetic preference data with a reward model for effective Multimodal LLM alignment ✨
Starting with an initial text-to-image prompt, the Stable Diffusion model generates synthetic images. These images are then filtered using a reward model to exclude low-quality samples and retain only those with the highest scores. The selected images, along with their corresponding instruction prompts, serve as input for open-source MLLMs to generate responses. These responses are evaluated based on various criteria, and only the highest-scoring ones are selected to identify the most suitable positive and negative pairs for DPO-based training.
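In rough Python terms, the selection logic behind this pipeline looks like the sketch below. The scores are illustrative values standing in for the reward models' outputs, and the pairing rule is simplified; see the paper for the exact criteria.

```python
# Conceptual sketch of the PDS-DPO selection logic, with reward scores already
# computed (illustrative values; the real pipeline obtains them from ImageReward
# and a response reward model).
scored_images = [("img-0.png", 0.12), ("img-1.png", 0.87),
                 ("img-2.png", 0.45), ("img-3.png", 0.33)]
best_image, _ = max(scored_images, key=lambda x: x[1])  # keep only the top-scoring image

scored_responses = [("response from MLLM A", 0.91),
                    ("response from MLLM B", 0.40),
                    ("response from MLLM C", 0.65)]
ranked = sorted(scored_responses, key=lambda x: x[1])
sample = {
    "image": best_image,
    "chosen": ranked[-1][0],   # highest-scoring response
    "rejected": ranked[0][0],  # lowest-scoring response (pairing rule simplified)
}
```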
- 2024-12: 📃 Our paper is accessible on arXiv now!
- 2024-12: 🚀 We open-source the code, weights (7B, 7B-LoRA) and dataset of PDS-DPO!
git clone https://github.com/pds-dpo/pds-dpo.git
cd pds-dpo
conda create -n pdsdpo python=3.10 -y
conda activate pdsdpo
pip install --upgrade pip
pip install -e .
You may skip steps 1 and 2 and proceed directly to step 3, as we have provided the resulting dataset on our HuggingFace page.
We have provided the sample text-to-image prompts in `prompt/sample.txt`. You can run the generation and ranking script directly as follows.
cd image_generation_ranking
python run.py
All images are stored in the `images` folder. For each prompt, the script produces four images, which are saved in the `sample` folder. The image with the highest ranking score is selected and saved separately in the `sample-ranked` folder.
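For reference, the sketch below shows one way this generation-and-ranking step can be implemented with the `diffusers` Stable Diffusion pipeline and the `image-reward` package (ImageReward is the reward model this project builds on). The model ids and folder layout are assumptions for illustration, not the exact configuration of `run.py`.

```python
# Sketch: generate four candidates per prompt with Stable Diffusion, score them
# with ImageReward, and keep the top-scoring image. Model ids and paths are
# illustrative; the repository's run.py may differ.
import os
import shutil
import torch
import ImageReward as RM
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
reward_model = RM.load("ImageReward-v1.0")

os.makedirs("images/sample", exist_ok=True)
os.makedirs("images/sample-ranked", exist_ok=True)

prompts = [line.strip() for line in open("prompt/sample.txt") if line.strip()]
for i, prompt in enumerate(prompts):
    # Generate four candidate images for this prompt.
    paths = []
    for j in range(4):
        image = pipe(prompt).images[0]
        path = f"images/sample/prompt{i:03d}-{j}.png"
        image.save(path)
        paths.append(path)
    # Score all candidates with ImageReward and keep the best one.
    rewards = reward_model.score(prompt, paths)
    best = paths[rewards.index(max(rewards))]
    shutil.copy(best, f"images/sample-ranked/prompt{i:03d}.png")
```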
We have provided the sample instruction prompts and ranked images in `instruction-prompts/sample.txt` and `images-ranked`, respectively. By default, we use four different open-source MLLMs, including llava-v1.6-mistral-7b, llava-v1.6-vicuna-13b, and llava-v1.6-vicuna-7b. You may modify the script to use your preferred MLLMs.
You can generate the responses by simply running this command.
cd response_generation_ranking
python run.py
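As a rough illustration of what this step does for a single image/instruction pair, the sketch below queries one LLaVA-1.6 model through `transformers`. The llava-hf checkpoint id and the Mistral-style prompt template are assumptions for this example; the repository's script loops over the configured MLLMs and then ranks their answers with a reward model to form the chosen/rejected pair.

```python
# Sketch: query one LLaVA-1.6 MLLM for a response to an instruction about a
# ranked image. Checkpoint id, paths, and prompt template are illustrative.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

image = Image.open("images-ranked/transport-919.jpg")
question = ("What challenges does a ferryboat face as it crosses a turbulent sea, "
            "with passengers bracing against the spray and wind?")
prompt = f"[INST] <image>\n{question} [/INST]"  # Mistral-style LLaVA-1.6 template

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```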
The output is the chosen and rejected conversations, saved as `output.json` with the following format:
[
  {
    "id": "transport-919",
    "image": "images/transport-919.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image> What challenges does a ferryboat face as it crosses a turbulent sea, with passengers bracing against the spray and wind?"
      },
      {
        "from": "gpt",
        "value": "chosen response"
      }
    ],
    "rejected_conversations": [
      {
        "from": "human",
        "value": "<image> What challenges does a ferryboat face as it crosses a turbulent sea, with passengers bracing against the spray and wind?"
      },
      {
        "from": "gpt",
        "value": "rejected response"
      }
    ]
  }
]
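For DPO training with libraries such as trl, these records are typically flattened into prompt/chosen/rejected columns. A minimal sketch of that post-processing is shown below; the output file name and column names are assumptions, not something this repository prescribes.

```python
# Sketch: flatten output.json into (image, prompt, chosen, rejected) records.
# Column names follow the common trl DPO convention and are assumptions here.
import json

with open("output.json") as f:
    samples = json.load(f)

pairs = []
for s in samples:
    pairs.append({
        "image": s["image"],
        "prompt": s["conversations"][0]["value"],             # human turn with <image>
        "chosen": s["conversations"][1]["value"],             # preferred response
        "rejected": s["rejected_conversations"][1]["value"],  # rejected response
    })

with open("dpo_pairs.json", "w") as f:
    json.dump(pairs, f, indent=2)
```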
- Modify `dpo_trainer.py` in the trl library
  To enable image token processing for DPO training, navigate to the trl library directory in your virtual environment (`cd ./envs/pdsdpo/lib/python3.10/site-packages/trl/trainer/`) and replace `dpo_trainer.py` with the provided file from the `tool` folder.
- Prepare the dataset
  Download and extract the entire dataset from HuggingFace, then save it in the `data` folder.
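If you prefer to script this step, something like the following works with `huggingface_hub`; the repository id is a placeholder, so substitute the dataset id published on the project's HuggingFace page.

```python
# Sketch: download the dataset with huggingface_hub. The repo_id is a
# placeholder; use the dataset id from the project's HuggingFace page.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<pds-dpo-dataset-id>",  # placeholder, not a real repository id
    repo_type="dataset",
    local_dir="data",
)
```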
- Run DPO training
  Double-check the parameters in `scripts/run_dpo.sh` and make adjustments if necessary, then simply train the model with this command:
cd scripts
bash run_dpo.sh
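Conceptually, the script drives trl's DPO trainer on prompt/chosen/rejected pairs. The text-only sketch below shows that core loop under stated assumptions: the model id and data file are placeholders, the real training additionally handles image tokens through the modified `dpo_trainer.py`, and trl's argument names differ across versions.

```python
# Minimal text-only sketch of a trl DPO run. Model id, file names, and
# hyperparameters are placeholders; the project's actual training uses the
# modified dpo_trainer.py for image tokens and the settings in run_dpo.sh.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "lmsys/vicuna-7b-v1.5"  # placeholder; PDS-DPO trains a LLaVA-based model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Expects "prompt", "chosen", "rejected" columns (see the dpo_pairs.json sketch above).
train_dataset = load_dataset("json", data_files="data/dpo_pairs.json", split="train")

args = DPOConfig(
    output_dir="checkpoints/pds-dpo",
    beta=0.1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
)
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,  # newer trl versions use processing_class=tokenizer
)
trainer.train()
```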
We trained the model using 2 x 80GB A100 GPUs.
For comprehensive tutorials on evaluating other benchmarks, please refer to the LLaVA repository documentation.
This project incorporates specific datasets and checkpoints, each governed by their respective original licenses. Users are required to adhere to the terms and conditions outlined in these licenses. The project's content is independently licensed under the Apache License 2.0.
@article{wijaya2024multimodal,
  title={Multimodal Preference Data Synthetic Alignment with Reward Model},
  author={Wijaya, Robert and Nguyen, Ngoc-Bao and Cheung, Ngai-Man},
  journal={arXiv preprint arXiv:2412.17417},
  year={2024}
}
This research benefits from LLaVA-1.5, ImageReward, and RLHF-Reward-Modeling. Thanks for their great work.