
Commit

initial commit message
Hritikbansal committed Aug 13, 2023
0 parents commit b19a51a
Showing 169 changed files with 34,340 additions and 0 deletions.
162 changes: 162 additions & 0 deletions .gitignore
@@ -0,0 +1,162 @@
ckpts/
.DS_Store

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/
.idea
.idea/
*/.idea/workspace.xml
*/.idea/tasks.xml
# User-specific stuff:
.idea/workspace.xml
.idea/tasks.xml
.idea/dictionaries
.idea/vcs.xml
.idea/jsLibraryMappings.xml

# Sensitive or high-churn files:
.idea/dataSources.ids
.idea/dataSources.xml
.idea/dataSources.local.xml
.idea/sqlDataSources.xml
.idea/dynamic.xml
.idea/uiDesigner.xml

# Gradle:
.idea/gradle.xml
.idea/libraries

# Mongo Explorer plugin:
.idea/mongoSettings.xml

sbatch
*sbatch
sbatch*
sbatch/
47 changes: 47 additions & 0 deletions README.md
@@ -0,0 +1,47 @@
# VisIT-Bench: A Dynamic Benchmark for Evaluating Instruction-Following Vision-and-Language Models

This repository contains the official implementation and data for "VisIT-Bench: A Dynamic Benchmark for Evaluating Instruction-Following Vision-and-Language Models". The paper was authored by Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schmidt.

![Alt text](fig1.png)

## TLDR

Our work introduces VisIT-Bench, a robust benchmark for diverse real-life vision-language instructions across 70 tasks. We provide a comprehensive evaluation of models' ability to understand human instructions and generate useful, fluent, and safe outputs. Our dataset includes verified reference outputs for all test cases, and we incorporate an Elo-based ranking system for multimodal chatbots. More details can be found in our paper (coming soon).

## Abstract

Recent advances in instruction-following vision-language models have led to a surge in large-scale and accessible multimodal chatbots. However, existing works lack a comprehensive evaluation of their capabilities to understand human instructions and provide useful, fluent, and safe outputs. We introduce VisIT-Bench, a robust benchmark for diverse real-life vision-language instructions across 70 tasks, from recognition to reasoning. VisIT-Bench offers an in-depth understanding of a model's conversational abilities. Our dataset includes verified reference outputs for all test cases, facilitating automatic comparison with expected responses via a strong large language model (GPT-4). We also incorporate an Elo-based ranking system to establish a leaderboard for multimodal chatbots. We source human preference annotations for ranking chatbot responses, and both of our Elo-ranking approaches show strong agreement with human evaluations, demonstrating their reliability. In our human evaluation, we find that the best-performing instruction-following model wins against the GPT-4 reference in just 27% of the comparisons. VisIT-Bench is dynamic and can integrate and evaluate new models.

## Dataset

The dataset consists of 679 instances and 1,578 images, spanning a variety of real-world instruction scenarios. The instances were sourced from both newly collected data and existing datasets. It can be accessed at:

- [VisIT-Bench Sheet](https://docs.google.com/spreadsheets/d/1hi8rGXf2WYufkFvGJ2MZ92JNChliM1QEJwZxNboUFlE/edit?usp=sharing)
- [VisIT-Bench Sheet Multi-Images](https://docs.google.com/spreadsheets/d/1IgCjJEd_obCawo1rWYfRZ_J7eiHP_68db5_OaNchKL0/edit?usp=sharing)


## Leaderboard

Our public leaderboard is available [here](https://visit-bench.github.io/).

## How to add new models to the leaderboard

1. You can access the single-image and multiple-image datasets above.
2. For every instance (row) in the dataset CSV, generate your model's prediction.
3. Create a `predictions.csv` with 4 mandatory columns: `instruction`, `instruction_category`, `image` (single-image case) / `images` (multi-image case), and `<model name> prediction`. Here, `<model name>` should be your model name, including the version if multiple versions are available (see the sketch after this list).
4. Send the `predictions.csv` to us at `yonatanbitton1@gmail.com`.
5. We will use our internal prompting sandbox with reference-free GPT-4 as an evaluator.
6. We will add your model to the leaderboard once we receive all the pairwise judgments from the sandbox.
7. You will receive a confirmation email as soon as your model has been added to the leaderboard.
8. The estimated time from Steps 4 to 7 is 1-2 weeks; however, we will try to process your prediction files as soon as they are sent.
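
As an illustration, the sketch below builds `predictions.csv` with pandas. The dataset file name, the column names read from the dataset CSV, the model name `MyModel-v1`, and `my_model_predict` are all assumptions for the example; adapt them to the actual sheet export and your own model.

```python
import pandas as pd

# Assumption: the dataset sheet was exported to this CSV and uses the same
# column names as the required submission columns.
dataset = pd.read_csv("visit_bench_single_image.csv")


def my_model_predict(instruction, image_url):
    """Placeholder for your instruction-following model's inference call."""
    return "model response"


rows = []
for _, row in dataset.iterrows():
    rows.append({
        "instruction": row["instruction"],
        "instruction_category": row["instruction_category"],
        "image": row["image"],
        # Replace "MyModel-v1" with your model name (and version).
        "MyModel-v1 prediction": my_model_predict(row["instruction"], row["image"]),
    })

pd.DataFrame(rows).to_csv("predictions.csv", index=False)
```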


## Baselines

We provide the code for most of the instruction-following vision-language models in our paper. Please refer to the baselines [readme](baselines/README.md) for more details. Notably, we provide a single `VisITBaseModel` interface for model generations.

## License
The new contributions of our dataset (e.g., the instructions, reference outputs, model ranking annotations, etc.) are licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
For the images that were used, please refer to the public license attached to each individual image in the "public_images_metadata" field in the dataset sheets.
145 changes: 145 additions & 0 deletions baselines/README.md
@@ -0,0 +1,145 @@
## LLaVA Model Evaluation

### Setup

1. Clone the LLaVA repository:
   ```bash
   git clone https://github.com/haotian-liu/LLaVA.git
   ```
2. Follow the steps in their README to install the dependencies and recover the LLaVA weights.
3. Move the `llava` package one level up and remove the cloned repository:
   ```bash
   mv llava ../
   cd ..
   rm -rf LLaVA
   ```

## MiniGPT-4 Model Evaluation

### Setup

1. Prepare conda environment:
```bash
conda env create -f minigpt4_utils/environment.yml
conda activate minigpt4
```

2. Follow the instructions [here](https://github.com/Vision-CAIR/MiniGPT-4/blob/main/PrepareVicuna.md) to prepare the Vicuna weights. The final weights should be in a single folder with a structure similar to the following:

```
vicuna_weights
├── config.json
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00003.bin
...
```

Then, set the path to the Vicuna weights in the model config file
[here](./minigpt4_utils/configs/models/minigpt4.yaml#16) at line 16.

3. Download the pretrained checkpoint corresponding to the Vicuna model you prepared.

| Checkpoint Aligned with Vicuna 13B | Checkpoint Aligned with Vicuna 7B |
|:----------------------------------:|:---------------------------------:|
| [Download](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view?usp=share_link) | [Download](https://drive.google.com/file/d/1RY9jV0dyqLX-o38LrumkKRh6Jtaop58R/view?usp=sharing) |


Then, set the path to the pretrained checkpoint in the evaluation config file [here](./minigpt4_utils/minigpt4_eval.yaml#11) at Line 11.


#### Reference
https://github.com/Vision-CAIR/MiniGPT-4#installation


## mPLUG-Owl Model Evaluation

### Setup

```bash
# Create conda environment
conda create -n mplug_owl python=3.10
conda activate mplug_owl

# Install PyTorch
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia

# Install other dependencies
pip install -r mplug_owl_utils/requirements.txt
```

#### Reference
https://github.com/X-PLUG/mPLUG-Owl#install-requirements


## Llama-Adapter-v2 Model Evaluation

### Setup
1. Prepare conda environment
```bash
conda create -n llama_adapter_v2 python=3.8 -y
pip install -r llama_adapter_v2_utils/requirements.txt
```
2. Prepare the LLaMA 7B weights and update [this line](./llama_adapter_v2_modeling.py#15). Organize the downloaded files in the following structure:
```
/path/to/llama_model_weights
├── 7B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ └── params.json
└── tokenizer.model
```


#### Reference
https://github.com/VegB/LLaMA-Adapter/tree/main/llama_adapter_v2_multimodal#setup


## PandaGPT Model Evaluation

### Setup
1. Prepare the environment according to https://github.com/yxuansu/PandaGPT
2. Download the ImageBind, Vicuna, and PandaGPT delta checkpoints following the instructions in the PandaGPT repository.
3. Pass `imagebind_ckpt_path`, `vicuna_ckpt_path`, and `delta_ckpt_path` to the `VisITPandaGPT` class, as in the sketch below.
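
For reference, a minimal sketch of step 3. The import path, checkpoint paths, and keyword-argument names are assumptions based on the text above and may differ from the actual class:

```python
# Assumption: the VisITPandaGPT wrapper lives in a module named
# pandagpt_modeling in this directory; adjust the import to the actual file.
from pandagpt_modeling import VisITPandaGPT

model = VisITPandaGPT(
    imagebind_ckpt_path="/path/to/imagebind/checkpoint",
    vicuna_ckpt_path="/path/to/vicuna/weights",
    delta_ckpt_path="/path/to/pandagpt/delta/checkpoint",
)

# Assuming the wrapper follows the VisITBaseModel interface
# (instruction string, list of image URLs).
print(model("Describe the image.", ["https://example.com/cat.jpg"]))
```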

#### Reference
https://github.com/yxuansu/PandaGPT#2-running-pandagpt-demo-back-to-top

## VisualChatGPT Model Evaluation
1. Prepare the environment
```bash
# clone the repo
git clone https://github.com/microsoft/TaskMatrix.git
# Go to directory
cd TaskMatrix
# create a new environment
conda create -n visgpt python=3.8
# activate the new environment
conda activate visgpt
# prepare the basic environments
pip install -r requirements.txt
pip install git+https://github.com/IDEA-Research/GroundingDINO.git
pip install git+https://github.com/facebookresearch/segment-anything.git
# prepare your private OpenAI key (for Linux)
export OPENAI_API_KEY={Your_Private_Openai_Key}
# prepare your private OpenAI key (for Windows)
set OPENAI_API_KEY={Your_Private_Openai_Key}
```
2. (Optional) Set the ChatGPT model name at [line 44 of `./visual_chatgpt_utils/visual_chatgpt.py`](https://github.com/mlfoundations/VisIT-Bench/blob/main/baselines/visual_chatgpt_utils/visual_chatgpt.py#L44). It is recommended to keep the default, `text-davinci-003`.

#### Reference
https://github.com/microsoft/TaskMatrix

## InstructBLIP2 Model Evaluation

### Option 1: use the transformers library (default)

#### Reference
https://huggingface.co/Salesforce/instructblip-vicuna-13b

### Option 2: use the lavis library

#### Reference
https://github.com/salesforce/LAVIS/tree/main/projects/instructblip
16 changes: 16 additions & 0 deletions baselines/base.py
@@ -0,0 +1,16 @@
import torch

class VisITBaseModel(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, instruction, images):
        return self.generate(instruction, images)

    def generate(self, instruction, images):
        """
        instruction: (str) a string of instruction
        images: (list) a list of image urls
        Return: (str) a string of generated response
        """
        raise NotImplementedError
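
For illustration, a minimal subclass sketch of this interface. The import path and the echo model are assumptions for the example, not one of the shipped baselines:

```python
# Assumption: base.py above is importable as baselines.base; adjust the import
# to match your working directory.
from baselines.base import VisITBaseModel


class EchoBaseline(VisITBaseModel):
    """Toy baseline that ignores the images and echoes the instruction."""

    def generate(self, instruction, images):
        # A real baseline would run a vision-language model on the image URLs.
        return f"Received {len(images)} image(s) for instruction: {instruction}"


model = EchoBaseline()
# nn.Module.__call__ dispatches to forward(), which in turn calls generate().
print(model("Describe the image.", ["https://example.com/cat.jpg"]))
```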
