This repository contains the official implementation and data for "VisIT-Bench: A Dynamic Benchmark for Evaluating Instruction-Following Vision-and-Language Models". The paper was authored by Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schmidt.
Our work introduces VisIT-Bench, a robust benchmark for diverse real-life vision-language instructions across 70 tasks. We provide a comprehensive evaluation of models' ability to understand human instructions and generate useful, fluent, and safe outputs. Our dataset includes verified reference outputs for all test cases, and we incorporate an ELO-based ranking system for multimodal chatbots. More details can be found in our paper (coming soon).
Recent advances in instruction-following vision-language models have led to a surge of large-scale, accessible multimodal chatbots. However, existing works lack a comprehensive evaluation of their ability to understand human instructions and provide useful, fluent, and safe outputs. We introduce VisIT-Bench, a robust benchmark for diverse real-life vision-language instructions across 70 tasks, from recognition to reasoning, offering an in-depth view of a model's conversational abilities. Our dataset includes verified reference outputs for all test cases, facilitating automatic comparison with expected responses via a strong large language model (GPT-4). We also incorporate an Elo-based ranking system to establish a leaderboard for multimodal chatbots, and we source human preference annotations for ranking chatbot responses. Both of our Elo-based ranking approaches show strong agreement with human evaluations, demonstrating reliability. In our human evaluation, we find that the best-performing instruction-following model wins against the GPT-4 reference in just 27% of the comparisons. VisIT-Bench is dynamic and can integrate and evaluate new models.
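For intuition, here is a minimal sketch of how Elo-style ratings can be computed from pairwise chatbot comparisons. The K-factor, base rating, and match format below are illustrative choices only, not the exact settings of our ranking system.

```python
from collections import defaultdict

def expected_score(rating_a, rating_b):
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(matches, base_rating=1000, k=32):
    """Compute Elo ratings from (model_a, model_b, outcome) tuples,
    where outcome is 1.0 if model_a wins, 0.0 if it loses, 0.5 for a tie."""
    ratings = defaultdict(lambda: float(base_rating))
    for model_a, model_b, outcome in matches:
        exp_a = expected_score(ratings[model_a], ratings[model_b])
        ratings[model_a] += k * (outcome - exp_a)
        ratings[model_b] += k * ((1 - outcome) - (1 - exp_a))
    return dict(ratings)

# Hypothetical pairwise judgments, for illustration only.
matches = [("model_x", "gpt4_reference", 0.0),
           ("model_y", "gpt4_reference", 1.0),
           ("model_x", "model_y", 0.5)]
print(update_elo(matches))
```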
The dataset consists of 679 instances and 1,578 images, spanning a variety of real-world instruction scenarios. The data comes from both newly collected sources and existing datasets. It can be accessed at:
Our public leaderboard is available here.
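Once a dataset CSV is downloaded, it can be inspected with standard tooling. The snippet below is a minimal sketch; the filename is a placeholder, and the exact columns depend on which sheet (single-image or multi-image) you download.

```python
import pandas as pd

# Placeholder filename: replace with the downloaded single-image or multi-image CSV.
df = pd.read_csv("visit_bench_single_image.csv")

print(len(df), "instances")
print(df.columns.tolist())  # e.g., instruction, instruction_category, image, ...
print(df["instruction_category"].value_counts().head())
```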
1. Access the single-image and multi-image datasets above.
2. For every instance (row) in the dataset CSV, generate your model's prediction.
3. Create a `predictions.csv` with 4 mandatory columns: `instruction`, `instruction_category`, `image` (single-image case) / `images` (multi-image case), and `<model name> prediction`. Here, `<model name>` should be your model's name, including the version if multiple versions are available (see the sketch after this list).
4. Send the `predictions.csv` to us at yonatanbitton1@gmail.com.
5. We will use our internal prompting sandbox with reference-free GPT-4 as an evaluator.
6. We will add your model to the leaderboard once we receive all the pairwise judgments from the sandbox.
7. You will receive a confirmation email as soon as your model has been added to the leaderboard.
8. The estimated time for steps 4-7 is 1-2 weeks; however, we will try to process your prediction files as soon as they are sent.
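As a concrete illustration of step 3, the sketch below builds a `predictions.csv` from a downloaded single-image dataset CSV. The file paths, model name, and `generate_prediction` helper are placeholders for your own setup; check the downloaded sheet for its exact column names.

```python
import pandas as pd

MODEL_NAME = "my-model-v1"  # placeholder: your model's name (and version)

# Placeholder path to the downloaded single-image dataset CSV.
dataset = pd.read_csv("visit_bench_single_image.csv")

def generate_prediction(instruction, image):
    """Placeholder: call your model here and return its response as a string."""
    return "model response placeholder"

predictions = dataset[["instruction", "instruction_category", "image"]].copy()
predictions[f"{MODEL_NAME} prediction"] = [
    generate_prediction(row["instruction"], row["image"])
    for _, row in dataset.iterrows()
]
predictions.to_csv("predictions.csv", index=False)
```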
We provide the code for most of the instruction-following vision-language models in our paper. Please refer to the baselines readme for more details. Notably, we provide a single `VisITBaseModel` interface for model generations.
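For orientation, here is a hypothetical sketch of wrapping a model behind such an interface. The method name and signature are assumptions for illustration; please consult the baselines readme for the actual `VisITBaseModel` definition.

```python
# Hypothetical sketch: the real interface lives in the baselines code,
# and its method names/signatures may differ.
class VisITBaseModel:
    def generate(self, images, instruction):
        """Return the model's text response for the given image(s) and instruction."""
        raise NotImplementedError

class MyModelWrapper(VisITBaseModel):
    def __init__(self, model):
        self.model = model

    def generate(self, images, instruction):
        # Placeholder: run your own model's inference here.
        return self.model.answer(images=images, prompt=instruction)
```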
The new contributions of our dataset (e.g., the instructions, reference outputs, model ranking annotations, etc.) are licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). For the images that were used, please refer to the public license attached to each individual image in the "public_images_metadata" field in the dataset sheets.