This repository contains the official implementation and data for "VisIT-Bench: A Dynamic Benchmark for Evaluating Instruction-Following Vision-and-Language Models". The paper was authored by Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schmidt.
Our work introduces VisIT-Bench, a robust benchmark for diverse real-life vision-language instructions across 70 tasks. We provide a comprehensive evaluation of models' ability to understand human instructions and generate useful, fluent, and safe outputs. Our dataset includes verified reference outputs for all test cases, and we incorporate an ELO-based ranking system for multimodal chatbots. More details can be found in our paper (coming soon).
Recent advances in instruction-following vision-language models have led to a surge of large-scale, accessible multimodal chatbots. However, existing works lack a comprehensive evaluation of their ability to understand human instructions and provide useful, fluent, and safe outputs. We introduce VisIT-Bench, a robust benchmark for diverse real-life vision-language instructions across 70 tasks, from recognition to reasoning, offering an in-depth view of a model's conversational abilities. Our dataset includes verified reference outputs for all test cases, facilitating automatic comparison with expected responses via a strong large language model (GPT-4). We also incorporate an Elo-based ranking system to establish a leaderboard for multimodal chatbots, and we source human preference annotations for ranking chatbot responses. Both of our Elo-based ranking approaches show strong agreement with human evaluations, demonstrating reliability. In our human evaluation, we find that the best-performing instruction-following model wins against the GPT-4 reference in just 27% of the comparisons. VisIT-Bench is dynamic and can integrate and evaluate new models.
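For intuition, here is a minimal sketch of how Elo-style ratings can be computed from pairwise chatbot comparisons. The K-factor, base rating, and match format below are illustrative choices only, not the exact settings of our ranking system.

```python
from collections import defaultdict

def expected_score(rating_a, rating_b):
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(matches, base_rating=1000, k=32):
    """Compute Elo ratings from (model_a, model_b, outcome) tuples,
    where outcome is 1.0 if model_a wins, 0.0 if it loses, 0.5 for a tie."""
    ratings = defaultdict(lambda: float(base_rating))
    for model_a, model_b, outcome in matches:
        exp_a = expected_score(ratings[model_a], ratings[model_b])
        ratings[model_a] += k * (outcome - exp_a)
        ratings[model_b] += k * ((1 - outcome) - (1 - exp_a))
    return dict(ratings)

# Hypothetical pairwise judgments, for illustration only.
matches = [("model_x", "gpt4_reference", 0.0),
           ("model_y", "gpt4_reference", 1.0),
           ("model_x", "model_y", 0.5)]
print(update_elo(matches))
```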
The dataset consists of 679 instances and 1,578 images, spanning a variety of real-world instruction scenarios. The data comes from both newly collected sources and existing datasets. It can be accessed at:
Our public leaderboard is available here.
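Once a dataset CSV is downloaded, it can be inspected with standard tooling. The snippet below is a minimal sketch; the filename is a placeholder, and the exact columns depend on which sheet (single-image or multi-image) you download.

```python
import pandas as pd

# Placeholder filename: replace with the downloaded single-image or multi-image CSV.
df = pd.read_csv("visit_bench_single_image.csv")

print(len(df), "instances")
print(df.columns.tolist())  # e.g., instruction, instruction_category, image, ...
print(df["instruction_category"].value_counts().head())
```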
1. Access the single-image and multi-image datasets above.
2. For every instance (row) in the dataset CSV, generate your model's prediction.
3. Create a `predictions.csv` with 4 mandatory columns: `instruction`, `instruction_category`, `image` (single-image case) / `images` (multi-image case), and `<model name> prediction`. Here, `<model name>` should be your model's name, including the version if multiple versions are available (see the sketch after this list).
4. Send the `predictions.csv` to us at yonatanbitton1@gmail.com.
5. We will use our internal prompting sandbox with reference-free GPT-4 as an evaluator.
6. We will add your model to the leaderboard once we receive all the pairwise judgments from the sandbox.
7. You will receive a confirmation email as soon as your model has been added to the leaderboard.
8. The estimated time for steps 4-7 is 1-2 weeks; however, we will try to process your prediction files as soon as they are sent.
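As a concrete illustration of step 3, the sketch below builds a `predictions.csv` from a downloaded single-image dataset CSV. The file paths, model name, and `generate_prediction` helper are placeholders for your own setup; check the downloaded sheet for its exact column names.

```python
import pandas as pd

MODEL_NAME = "my-model-v1"  # placeholder: your model's name (and version)

# Placeholder path to the downloaded single-image dataset CSV.
dataset = pd.read_csv("visit_bench_single_image.csv")

def generate_prediction(instruction, image):
    """Placeholder: call your model here and return its response as a string."""
    return "model response placeholder"

predictions = dataset[["instruction", "instruction_category", "image"]].copy()
predictions[f"{MODEL_NAME} prediction"] = [
    generate_prediction(row["instruction"], row["image"])
    for _, row in dataset.iterrows()
]
predictions.to_csv("predictions.csv", index=False)
```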
We provide the code for most of the instruction-following vision-language models in our paper. Please refer to the baselines readme for more details. Notably, we provide a single `VisITBaseModel` interface for model generations.
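For orientation, here is a hypothetical sketch of wrapping a model behind such an interface. The method name and signature are assumptions for illustration; please consult the baselines readme for the actual `VisITBaseModel` definition.

```python
# Hypothetical sketch: the real interface lives in the baselines code,
# and its method names/signatures may differ.
class VisITBaseModel:
    def generate(self, images, instruction):
        """Return the model's text response for the given image(s) and instruction."""
        raise NotImplementedError

class MyModelWrapper(VisITBaseModel):
    def __init__(self, model):
        self.model = model

    def generate(self, images, instruction):
        # Placeholder: run your own model's inference here.
        return self.model.answer(images=images, prompt=instruction)
```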
The new contributions of our dataset (e.g., the instructions, reference outputs, model ranking annotations, etc.) are licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). For the images that were used, please refer to the public license attached to each individual image in the "public_images_metadata" field in the dataset sheets.