
Commit

initial commit message
Hritikbansal committed Aug 13, 2023
0 parents commit b19a51a
Showing 169 changed files with 34,340 additions and 0 deletions.
162 changes: 162 additions & 0 deletions .gitignore
@@ -0,0 +1,162 @@
ckpts/
.DS_Store

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/
.idea
.idea/
*/.idea/workspace.xml
*/.idea/tasks.xml
# User-specific stuff:
.idea/workspace.xml
.idea/tasks.xml
.idea/dictionaries
.idea/vcs.xml
.idea/jsLibraryMappings.xml

# Sensitive or high-churn files:
.idea/dataSources.ids
.idea/dataSources.xml
.idea/dataSources.local.xml
.idea/sqlDataSources.xml
.idea/dynamic.xml
.idea/uiDesigner.xml

# Gradle:
.idea/gradle.xml
.idea/libraries

# Mongo Explorer plugin:
.idea/mongoSettings.xml

sbatch
*sbatch
sbatch*
sbatch/
47 changes: 47 additions & 0 deletions README.md
@@ -0,0 +1,47 @@
# VisIT-Bench: A Dynamic Benchmark for Evaluating Instruction-Following Vision-and-Language Models

This repository contains the official implementation and data for "VisIT-Bench: A Dynamic Benchmark for Evaluating Instruction-Following Vision-and-Language Models". The paper was authored by Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schmidt.

![Alt text](fig1.png)

## TLDR

Our work introduces VisIT-Bench, a robust benchmark for diverse real-life vision-language instructions across 70 tasks. We provide a comprehensive evaluation of models' ability to understand human instructions and generate useful, fluent, and safe outputs. Our dataset includes verified reference outputs for all test cases, and we incorporate an Elo-based ranking system for multimodal chatbots. More details can be found in our paper (coming soon).

## Abstract

Recent advances in instruction-following vision-language models have led to a surge in large-scale and accessible multimodal chatbots. However, existing works lack a comprehensive evaluation of their capabilities to understand human instructions and provide useful, fluent, and safe outputs. We introduce VisIT-Bench, a robust benchmark for diverse real-life vision-language instructions across 70 tasks, from recognition to reasoning. VisIT-Bench offers an in-depth understanding of a model's conversational abilities. Our dataset includes verified reference outputs for all test cases, facilitating automatic comparison with expected responses via a strong large language model (GPT-4). We also incorporate an Elo-based ranking system to establish a leaderboard for multimodal chatbots. We source human preference annotations for ranking chatbot responses, and both of our Elo-ranking approaches show strong agreement with human evaluations, demonstrating their reliability. In our human evaluation, we find that the best-performing instruction-following model wins against the GPT-4 reference in just 27% of the comparisons. VisIT-Bench is dynamic and can integrate and evaluate new models.

## Dataset

The dataset consists of 679 instances and 1,578 images, spanning a variety of real-world instruction scenarios. The instances were sourced from both newly collected data and existing datasets. It can be accessed at:

- [VisIT-Bench Sheet](https://docs.google.com/spreadsheets/d/1hi8rGXf2WYufkFvGJ2MZ92JNChliM1QEJwZxNboUFlE/edit?usp=sharing)
- [VisIT-Bench Sheet Multi-Images](https://docs.google.com/spreadsheets/d/1IgCjJEd_obCawo1rWYfRZ_J7eiHP_68db5_OaNchKL0/edit?usp=sharing)


## Leaderboard

Our public leaderboard is available [here](https://visit-bench.github.io/).

## How to add new models to the leaderboard

1. You can access the single-image and multiple-image datasets above.
2. For every instance (row) in the dataset CSV, generate your model's prediction.
3. Create a `predictions.csv` with 4 mandatory columns: `instruction`, `instruction_category`, `image` (single-image case) / `images` (multi-image case), and `<model name> prediction`. Here, `<model name>` should be your model name, including the version if multiple versions are available (see the sketch after this list).
4. Send the `predictions.csv` to us at `yonatanbitton1@gmail.com`.
5. We will use our internal prompting sandbox with reference-free GPT-4 as an evaluator.
6. We will add your model to the leaderboard once we receive all the pairwise judgments from the sandbox.
7. You will receive a confirmation email as soon as your model has been added to the leaderboard.
8. The estimated time from Steps 4 to 7 is 1-2 weeks; however, we will try to process your prediction files as soon as they are sent.
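
As an illustration, the sketch below builds `predictions.csv` with pandas. The dataset file name, the column names read from the dataset CSV, the model name `MyModel-v1`, and `my_model_predict` are all assumptions for the example; adapt them to the actual sheet export and your own model.

```python
import pandas as pd

# Assumption: the dataset sheet was exported to this CSV and uses the same
# column names as the required submission columns.
dataset = pd.read_csv("visit_bench_single_image.csv")


def my_model_predict(instruction, image_url):
    """Placeholder for your instruction-following model's inference call."""
    return "model response"


rows = []
for _, row in dataset.iterrows():
    rows.append({
        "instruction": row["instruction"],
        "instruction_category": row["instruction_category"],
        "image": row["image"],
        # Replace "MyModel-v1" with your model name (and version).
        "MyModel-v1 prediction": my_model_predict(row["instruction"], row["image"]),
    })

pd.DataFrame(rows).to_csv("predictions.csv", index=False)
```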


## Baselines

We provide the code for most of the instruction-following vision-language models in our paper. Please refer to the baselines [readme](baselines/README.md) for more details. Notably, we provide a single `VisITBaseModel` interface for model generations.

## License
The new contributions of our dataset (e.g., the instructions, reference outputs, model ranking annotations, etc.) are licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
For the images that were used, please refer to the public license attached to each individual image in the "public_images_metadata" field in the dataset sheets.
145 changes: 145 additions & 0 deletions baselines/README.md
@@ -0,0 +1,145 @@
## LLaVA Model Evaluation

### Setup

1. Clone the LLaVA repository:
   ```bash
   git clone https://github.com/haotian-liu/LLaVA.git
   ```
2. Follow the steps in their README to install the dependencies and recover the LLaVA weights.
3. Move the `llava` package one level up and remove the cloned repository:
   ```bash
   mv llava ../
   cd ..
   rm -rf LLaVA
   ```

## MiniGPT-4 Model Evaluation

### Setup

1. Prepare conda environment:
```bash
conda env create -f minigpt4_utils/environment.yml
conda activate minigpt4
```

2. Follow the instructions [here](https://github.com/Vision-CAIR/MiniGPT-4/blob/main/PrepareVicuna.md) to prepare the Vicuna weights. The final weights should be in a single folder with a structure similar to the following:

```
vicuna_weights
├── config.json
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00003.bin
...
```

Then, set the path to the Vicuna weights in the model config file
[here](./minigpt4_utils/configs/models/minigpt4.yaml#16) at line 16.

3. Download the pretrained checkpoint corresponding to the Vicuna model you prepared.

| Checkpoint Aligned with Vicuna 13B | Checkpoint Aligned with Vicuna 7B |
|:----------------------------------:|:---------------------------------:|
| [Download](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view?usp=share_link) | [Download](https://drive.google.com/file/d/1RY9jV0dyqLX-o38LrumkKRh6Jtaop58R/view?usp=sharing) |


Then, set the path to the pretrained checkpoint in the evaluation config file [here](./minigpt4_utils/minigpt4_eval.yaml#11) at Line 11.


#### Reference
https://github.com/Vision-CAIR/MiniGPT-4#installation


## mPLUG-Owl Model Evaluation

### Setup

```bash
# Create conda environment
conda create -n mplug_owl python=3.10
conda activate mplug_owl

# Install PyTorch
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia

# Install other dependencies
pip install -r mplug_owl_utils/requirements.txt
```

#### Reference
https://github.com/X-PLUG/mPLUG-Owl#install-requirements


## Llama-Adapter-v2 Model Evaluation

### Setup
1. Prepare conda environment
```bash
conda create -n llama_adapter_v2 python=3.8 -y
pip install -r llama_adapter_v2_utils/requirements.txt
```
2. Prepare the LLaMA 7B weights and update [this line](./llama_adapter_v2_modeling.py#15). Organize the downloaded files in the following structure:
```
/path/to/llama_model_weights
├── 7B
│ ├── checklist.chk
│ ├── consolidated.00.pth
│ └── params.json
└── tokenizer.model
```


#### Reference
https://github.com/VegB/LLaMA-Adapter/tree/main/llama_adapter_v2_multimodal#setup


## PandaGPT Model Evaluation

### Setup
1. Prepare the environment according to https://github.com/yxuansu/PandaGPT
2. Download the ImageBind, Vicuna, and PandaGPT delta checkpoints following the instructions in the PandaGPT repository.
3. Pass `imagebind_ckpt_path`, `vicuna_ckpt_path`, and `delta_ckpt_path` to the `VisITPandaGPT` class, as in the sketch below.
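
For reference, a minimal sketch of step 3. The import path, checkpoint paths, and keyword-argument names are assumptions based on the text above and may differ from the actual class:

```python
# Assumption: the VisITPandaGPT wrapper lives in a module named
# pandagpt_modeling in this directory; adjust the import to the actual file.
from pandagpt_modeling import VisITPandaGPT

model = VisITPandaGPT(
    imagebind_ckpt_path="/path/to/imagebind/checkpoint",
    vicuna_ckpt_path="/path/to/vicuna/weights",
    delta_ckpt_path="/path/to/pandagpt/delta/checkpoint",
)

# Assuming the wrapper follows the VisITBaseModel interface
# (instruction string, list of image URLs).
print(model("Describe the image.", ["https://example.com/cat.jpg"]))
```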

#### Reference
https://github.com/yxuansu/PandaGPT#2-running-pandagpt-demo-back-to-top

## VisualChatGPT Model Evaluation
1. Prepare the environment
```bash
# clone the repo
git clone https://github.com/microsoft/TaskMatrix.git
# Go to directory
cd TaskMatrix
# create a new environment
conda create -n visgpt python=3.8
# activate the new environment
conda activate visgpt
# prepare the basic environments
pip install -r requirements.txt
pip install git+https://github.com/IDEA-Research/GroundingDINO.git
pip install git+https://github.com/facebookresearch/segment-anything.git
# prepare your private OpenAI key (for Linux)
export OPENAI_API_KEY={Your_Private_Openai_Key}
# prepare your private OpenAI key (for Windows)
set OPENAI_API_KEY={Your_Private_Openai_Key}
```
2. (Optional) Set the ChatGPT model name at [line 44 of `./visual_chatgpt_utils/visual_chatgpt.py`](https://github.com/mlfoundations/VisIT-Bench/blob/main/baselines/visual_chatgpt_utils/visual_chatgpt.py#L44). It is recommended to keep the default, `text-davinci-003`.

#### Reference
https://github.com/microsoft/TaskMatrix

## InstructBLIP2 Model Evaluation

### Option 1: use the transformers library (default)

#### Reference
https://huggingface.co/Salesforce/instructblip-vicuna-13b

### Option 2: use the lavis library

#### Reference
https://github.com/salesforce/LAVIS/tree/main/projects/instructblip
16 changes: 16 additions & 0 deletions baselines/base.py
@@ -0,0 +1,16 @@
import torch

class VisITBaseModel(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, instruction, images):
        return self.generate(instruction, images)

    def generate(self, instruction, images):
        """
        instruction: (str) a string of instruction
        images: (list) a list of image urls
        Return: (str) a string of generated response
        """
        raise NotImplementedError
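
For illustration, a minimal subclass sketch of this interface. The import path and the echo model are assumptions for the example, not one of the shipped baselines:

```python
# Assumption: base.py above is importable as baselines.base; adjust the import
# to match your working directory.
from baselines.base import VisITBaseModel


class EchoBaseline(VisITBaseModel):
    """Toy baseline that ignores the images and echoes the instruction."""

    def generate(self, instruction, images):
        # A real baseline would run a vision-language model on the image URLs.
        return f"Received {len(images)} image(s) for instruction: {instruction}"


model = EchoBaseline()
# nn.Module.__call__ dispatches to forward(), which in turn calls generate().
print(model("Describe the image.", ["https://example.com/cat.jpg"]))
```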
