VHELM update #2592

Merged May 3, 2024 · 133 commits

Changes from 1 commit

Commits
34107bf
MathVistaScenario + perturbations
teetone Feb 29, 2024
968faa6
fix models
teetone Feb 29, 2024
69a227b
added test
teetone Feb 29, 2024
ccd1b69
update schema
teetone Feb 29, 2024
7015022
rename to vqa
teetone Feb 29, 2024
ecf24f4
add f1 to schema
teetone Feb 29, 2024
5819962
hateful memes as mc
teetone Feb 29, 2024
9191ba2
valid
teetone Feb 29, 2024
e48b734
updated conf
teetone Feb 29, 2024
30a2824
debug
teetone Mar 1, 2024
122a417
crossmodal + cider
teetone Mar 1, 2024
dde04f3
better adaptation
teetone Mar 1, 2024
d2508c8
comments
teetone Mar 2, 2024
ea80da4
Merge branch 'main' of https://github.com/stanford-crfm/benchmarking …
teetone Mar 3, 2024
aa5ca49
Merge branch 'main' of https://github.com/stanford-crfm/benchmarking …
teetone Mar 4, 2024
5e28485
resolve merge conflicts
teetone Mar 5, 2024
b9b7639
todos
teetone Mar 5, 2024
405a3d8
comment out empty open questions
teetone Mar 6, 2024
8f82799
flickr30k
teetone Mar 6, 2024
ac3be64
remove debug
teetone Mar 6, 2024
4100854
Merge branch 'main' of https://github.com/stanford-crfm/benchmarking …
teetone Mar 6, 2024
5bda3c7
fix instructions
teetone Mar 6, 2024
5af02db
GQA
teetone Mar 7, 2024
f2e8a8b
Merge branch 'main' of https://github.com/stanford-crfm/benchmarking …
teetone Mar 7, 2024
39667ba
support location for crossmodal
teetone Mar 7, 2024
a05432a
added geographic bias
teetone Mar 7, 2024
707c4cb
Merge branch 'main' of https://github.com/stanford-crfm/benchmarking …
teetone Mar 7, 2024
2478f7c
a-okvqa
teetone Mar 7, 2024
b96d931
Merge branch 'main' of https://github.com/stanford-crfm/benchmarking …
teetone Mar 8, 2024
ad5bac4
mm safety bench
teetone Mar 9, 2024
ac5574e
image perturbations
teetone Mar 10, 2024
ec6dad5
Merge branch 'main' of https://github.com/stanford-crfm/benchmarking …
teetone Mar 11, 2024
178397f
update conf
teetone Mar 11, 2024
6c415c5
Merge branch 'main' of https://github.com/stanford-crfm/benchmarking …
teetone Mar 11, 2024
7eb02fe
added PAIRS
teetone Mar 11, 2024
edead8f
added PAIRS
teetone Mar 11, 2024
46e2ea2
patch
teetone Mar 11, 2024
1064653
anthropic doesn't support empty text blocks
teetone Mar 12, 2024
59bdc06
updated schema
teetone Mar 12, 2024
cba8787
originality
teetone Mar 12, 2024
8bbed9f
originality
teetone Mar 12, 2024
97f80dc
add to conf
teetone Mar 12, 2024
deb77b6
update split
teetone Mar 12, 2024
fac14e7
Merge branch 'main' of https://github.com/stanford-crfm/benchmarking …
teetone Mar 14, 2024
4beb768
Merge branch 'main' of https://github.com/stanford-crfm/benchmarking …
teetone Mar 16, 2024
ba4cec1
debug
teetone Mar 16, 2024
0672b47
updated lite conf
teetone Mar 18, 2024
a7a0f52
resolve merge conflicts
teetone Mar 20, 2024
3d77b73
fix anthropic
teetone Mar 20, 2024
d0b4707
fix lilypond
teetone Mar 21, 2024
b4d7475
fix mediaobject
teetone Mar 21, 2024
86b5dbd
absolute path
teetone Mar 21, 2024
b78557b
revert
teetone Mar 22, 2024
9831114
resolve merge conflicts
teetone Mar 22, 2024
a34e368
vhelm lite
teetone Mar 22, 2024
5fb4202
add mscoco captioning
teetone Mar 23, 2024
0bb33ef
handle idefics multiple incontext examples
teetone Mar 24, 2024
363c04b
update conf
teetone Mar 24, 2024
39096bb
torch device
teetone Mar 24, 2024
c71bb7e
added llava v1.6 models
teetone Mar 24, 2024
a8d1a8f
fix gemini tokenizer
teetone Mar 24, 2024
2606eb8
rename model
teetone Mar 24, 2024
6afc1a9
fix schema
teetone Mar 24, 2024
fd0ae46
fix next
teetone Mar 24, 2024
769bd33
fix output
teetone Mar 24, 2024
29edebb
update instructions plus more models
teetone Mar 24, 2024
b130fa9
update instructions plus more models
teetone Mar 24, 2024
b75b551
update instructions plus more models
teetone Mar 24, 2024
70ea94b
aokvqa
teetone Mar 24, 2024
d9de52c
device auto
teetone Mar 24, 2024
f757f1c
categorization task
teetone Mar 25, 2024
74f7c44
more metrics
teetone Mar 25, 2024
7511db2
update split
teetone Mar 25, 2024
1d733f8
long caption
teetone Mar 25, 2024
972c6a8
long caption
teetone Mar 25, 2024
57a6db8
categorization multiple choice
teetone Mar 25, 2024
f8c804c
update long
teetone Mar 25, 2024
3e5603b
update schema
teetone Mar 26, 2024
d5ad7c2
resolve merge conflict
teetone Mar 26, 2024
0fd0b51
resolve merge conflict
teetone Mar 30, 2024
d24400f
Merge branch 'vh' of https://github.com/stanford-crfm/benchmarking in…
teetone Mar 30, 2024
07732cf
fix typo
teetone Mar 30, 2024
7b44b32
Merge branch 'main' of https://github.com/stanford-crfm/benchmarking …
teetone Mar 30, 2024
adeeaa3
Merge branch 'main' of https://github.com/stanford-crfm/benchmarking …
teetone Mar 31, 2024
d34285f
Merge branch 'main' of https://github.com/stanford-crfm/benchmarking …
teetone Mar 31, 2024
40ccaaf
Merge branch 'main' of https://github.com/stanford-crfm/benchmarking …
teetone Mar 31, 2024
54c8441
image2structure schema
teetone Mar 31, 2024
ae4d4d9
update instructions
teetone Apr 2, 2024
23e5827
quick run
teetone Apr 2, 2024
b29a58f
adjust
teetone Apr 2, 2024
76c2b20
fix idefics
teetone Apr 2, 2024
4128c0e
resize large images
teetone Apr 3, 2024
67207ee
more model fixes
teetone Apr 11, 2024
dbbdb7e
resize
teetone Apr 11, 2024
2a73e31
use existing copy
teetone Apr 11, 2024
c50623c
update
teetone Apr 11, 2024
259c782
fix content error
teetone Apr 12, 2024
35bccf3
high res
teetone Apr 16, 2024
76097cd
debug
teetone Apr 16, 2024
e650976
update metric
teetone Apr 16, 2024
e0055be
content error
teetone Apr 17, 2024
f20da4f
fix check
teetone Apr 17, 2024
8913bbf
update test split
teetone Apr 19, 2024
3720638
update test split
teetone Apr 19, 2024
b4915d0
narrow schema
teetone Apr 19, 2024
e02eecc
narrow schema
teetone Apr 19, 2024
12b6967
update
teetone Apr 20, 2024
637907b
better prompting for claude
teetone Apr 21, 2024
a8a3c6d
check
teetone Apr 21, 2024
cc64e05
newer instructions
teetone Apr 22, 2024
164cc90
done
teetone Apr 22, 2024
b2ae3b9
prompting
teetone Apr 22, 2024
c092d50
prompting
teetone Apr 22, 2024
efb76f8
idefics 2
teetone Apr 22, 2024
84d74a4
tokenizer
teetone Apr 22, 2024
0d32442
vizwiz
teetone Apr 22, 2024
81f3cc3
multiple num_completions
teetone Apr 22, 2024
60006a5
multiple num_completions
teetone Apr 22, 2024
0b6d00d
quasi
teetone Apr 23, 2024
795fc17
test
teetone Apr 23, 2024
21e3ff2
cleanup schema
teetone Apr 24, 2024
8550ffa
cleanup schema
teetone Apr 24, 2024
6d2cca7
resize
teetone Apr 25, 2024
cbfb6ad
cleanup
teetone Apr 25, 2024
e828f3b
log
teetone Apr 25, 2024
9a16d5f
resolve merge conflicts
teetone Apr 26, 2024
29dc34e
renamed file
teetone Apr 26, 2024
a909139
vhelm lite documentation
teetone Apr 26, 2024
ded6d2c
cleanup
teetone Apr 28, 2024
ca91ad5
get rid of extra space
teetone May 1, 2024
a70d0e5
earth mover similarity block
teetone May 2, 2024
11d9e48
fix
teetone May 2, 2024
f9a7646
fix type
teetone May 2, 2024
crossmodal + cider
teetone committed Mar 1, 2024
commit 122a417b63dd0c716610a12d3f9b9c2da067d85f
1 change: 1 addition & 0 deletions setup.cfg
@@ -47,6 +47,7 @@ install_requires=
# Basic metrics
nltk~=3.7
pyext~=0.7
pycocoevalcap~=1.2
rouge-score~=0.1.2
scipy~=1.10
uncertainty-calibration~=0.1.4
Expand Down
4 changes: 3 additions & 1 deletion src/helm/benchmark/metrics/common_metric_specs.py
@@ -164,4 +164,6 @@ def get_disinformation_metric_specs(args: Optional[Dict] = None) -> List[MetricS


def get_open_ended_generation_metric_specs() -> List[MetricSpec]:
return get_basic_metric_specs(["exact_match", "quasi_exact_match", "f1_score", "rouge_l", "bleu_1", "bleu_4"])
return get_basic_metric_specs(
["exact_match", "quasi_exact_match", "f1_score", "rouge_l", "bleu_1", "bleu_4", "cider"]
)
10 changes: 10 additions & 0 deletions src/helm/benchmark/metrics/evaluate_reference_metrics.py
@@ -20,6 +20,7 @@
import string
from . import code_metrics_helper
import nltk
from pycocoevalcap.cider.cider import Cider

try:
nltk.data.find("tokenizers/punkt")
@@ -188,6 +189,14 @@ def bleu_4(gold: str, pred: str) -> float:
return sentence_bleu([word_tokenize(gold)], word_tokenize(pred), weights=(0, 0, 0, 1))


def cider(gold: str, pred: str) -> float:
cider_evaluator = Cider()
candidate = {"caption": [pred]}
reference = {"caption": [gold]}
average_score, _ = cider_evaluator.compute_score(reference, candidate)
return average_score


def extract_set_from_text(
set_str: str,
set_start_str: str = " is ",
Expand Down Expand Up @@ -325,6 +334,7 @@ def compute_metrics_helper(
"math_equiv_chain_of_thought": is_equiv_chain_of_thought,
"code_eval_acc": code_eval,
"pass": code_eval,
"cider": cider,
"f1_score": f1_score,
"rouge_1": get_rouge_function("rouge1"),
"rouge_2": get_rouge_function("rouge2"),
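For context, here is a minimal standalone sketch of the `pycocoevalcap` scorer that the new `cider` helper wraps. The `"caption"` key and example strings are illustrative only; both dicts map an instance id to a list of caption strings.

```python
# Minimal sketch of the pycocoevalcap API wrapped by the new `cider` helper.
from pycocoevalcap.cider.cider import Cider

# Both dicts map an arbitrary instance id to a list of caption strings;
# each candidate list must contain exactly one caption.
references = {"caption": ["a dog runs across a grassy field"]}
candidates = {"caption": ["a dog running on grass"]}

scorer = Cider()
# compute_score returns the corpus-average score and a per-instance array.
average_score, per_instance_scores = scorer.compute_score(references, candidates)
print(average_score, per_instance_scores)
```

Note that CIDEr was designed as a corpus-level metric: its TF-IDF weights are estimated from the reference set passed to `compute_score`, so calling it on a single gold/pred pair, as the helper above does, relies on degenerate document-frequency statistics.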
9 changes: 4 additions & 5 deletions src/helm/benchmark/presentation/run_specs_debug.conf
@@ -1,8 +1,7 @@
entries: [
# {description: "vqa:model=vlm,data_augmentation=chinese", priority: 1, groups: ["vqa_chinese"]}
# {description: "vqa:model=vlm,data_augmentation=hindi", priority: 1, groups: ["vqa_hindi"]}
# {description: "vqa:model=vlm,data_augmentation=spanish", priority: 1, groups: ["vqa_spanish"]}

{description: "vqa:model=vlm,data_augmentation=dialect_deterministic", priority: 1, groups: ["vqa_dialect"]}
{description: "vqa:model=vlm,data_augmentation=robustness", priority: 1, groups: ["vqa_robustness"]}
{description: "crossmodal_3600:model=vlm,language=english", priority: 1}
{description: "crossmodal_3600:model=vlm,language=spanish", priority: 1}
{description: "crossmodal_3600:model=vlm,language=chinese", priority: 1}
{description: "crossmodal_3600:model=vlm,language=hindi", priority: 1}
]
15 changes: 12 additions & 3 deletions src/helm/benchmark/presentation/run_specs_vhelm.conf
@@ -6,6 +6,8 @@ entries: [
####################################################################################################################
# Accuracy: Is the output semantically correct, given the text and image inputs?
####################################################################################################################

# Questions about natural images
{description: "vqa:model=vlm", priority: 1, groups: ["vqa_base"]}
{description: "viz_wiz:model=vlm", priority: 1}

@@ -72,11 +74,15 @@
# Does the model generate creative content (e.g., poetry, art)?
####################################################################################################################

# TODO: story generation, poetry generation for given images

####################################################################################################################
# Bias: Are the generations biased in demographic representation (e.g., gender, skin tone)?
####################################################################################################################

# TODO: implement https://github.com/gzcch/Bingo

# Crossmodal-3600 dataset also measures geographic bias

####################################################################################################################
# Fairness: Does the model exhibit performance disparities across social groups (e.g., gender, dialect)?
@@ -99,13 +105,16 @@

{description: "vqa:model=vlm,data_augmentation=robustness", priority: 1, groups: ["vqa_robustness"]}

# Robustness https://arxiv.org/pdf/2311.16101.pdf

####################################################################################################################
# Multilinguality: Does the model support non-English languages?
####################################################################################################################

{description: "vqa:model=vlm,data_augmentation=chinese", priority: 1, groups: ["vqa_chinese"]}
{description: "vqa:model=vlm,data_augmentation=hindi", priority: 1, groups: ["vqa_hindi"]}
{description: "vqa:model=vlm,data_augmentation=spanish", priority: 1, groups: ["vqa_spanish"]}
{description: "crossmodal_3600:model=vlm,language=english", priority: 1}
{description: "crossmodal_3600:model=vlm,language=spanish", priority: 1}
{description: "crossmodal_3600:model=vlm,language=chinese", priority: 1}
{description: "crossmodal_3600:model=vlm,language=hindi", priority: 1}

####################################################################################################################
# Efficiency: How fast is the inference for the model?
19 changes: 19 additions & 0 deletions src/helm/benchmark/run_specs/vlm_run_specs.py
@@ -146,6 +146,25 @@ def get_chart2csv_spec() -> RunSpec:
)


@run_spec_function("crossmodal_3600")
def get_crossmodal_3600_spec(language: str) -> RunSpec:
scenario_spec = ScenarioSpec(
class_name="helm.benchmark.scenarios.vision_language.crossmodal_3600_scenario.Crossmodal3600Scenario",
args={"language": language},
)
adapter_spec: AdapterSpec = get_short_answer_generation_adapter_spec()
metric_specs: List[MetricSpec] = get_exact_match_metric_specs() + get_open_ended_generation_metric_specs()

run_spec_name: str = "crossmodal_3600"
return RunSpec(
name=f"{run_spec_name}:language={language}",
scenario_spec=scenario_spec,
adapter_spec=adapter_spec,
metric_specs=metric_specs,
groups=[run_spec_name],
)


@run_spec_function("hateful_memes")
def get_hateful_memes_spec() -> RunSpec:
scenario_spec = ScenarioSpec(
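As a rough sketch of how this registration is exercised: a conf entry such as `crossmodal_3600:model=vlm,language=english` is resolved through the `@run_spec_function` registry to the function above. The direct call below is purely illustrative.

```python
# Hedged sketch: invoking the registered run spec function directly.
# In practice the HELM runner resolves "crossmodal_3600" from the conf entry
# and forwards the language argument; this direct call is for illustration.
from helm.benchmark.run_specs.vlm_run_specs import get_crossmodal_3600_spec

spec = get_crossmodal_3600_spec(language="english")
print(spec.name)                # "crossmodal_3600:language=english"
print(spec.scenario_spec.args)  # {"language": "english"}
print(spec.groups)              # ["crossmodal_3600"]
```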
133 changes: 133 additions & 0 deletions src/helm/benchmark/scenarios/vision_language/crossmodal_3600_scenario.py
@@ -0,0 +1,133 @@
import json
import os
from typing import Dict, List

from helm.benchmark.scenarios.scenario import (
CORRECT_TAG,
TEST_SPLIT,
Instance,
Input,
Output,
Reference,
Scenario,
)
from helm.common.media_object import MediaObject, MultimediaObject
from helm.common.general import ensure_file_downloaded


class Crossmodal3600Scenario(Scenario):
"""
Crossmodal-3600 dataset (XM3600 in short), a geographically-diverse set of 3600 images annotated
with human-generated reference captions in 36 languages.

@inproceedings{ThapliyalCrossmodal2022,
author = {Ashish Thapliyal and Jordi Pont-Tuset and Xi Chen and Radu Soricut},
title = {{Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset}},
booktitle = {EMNLP},
year = {2022}
}

Paper: https://arxiv.org/abs/2205.12522
Website: https://google.github.io/crossmodal-3600/
"""

LANGUAGE_TO_ID: Dict[str, str] = {
"arabic": "ar",
"bengali": "bn",
"chinese": "zh",
"croatian": "hr",
"cusco_quechua": "quz",
"czech": "cs",
"danish": "da",
"dutch": "nl",
"english": "en",
"persian": "fa",
"finnish": "fi",
"french": "fr",
"german": "de",
"greek": "el",
"hebrew": "he",
"hindi": "hi",
"hungarian": "hu",
"indonesian": "id",
"italian": "it",
"japanese": "ja",
"korean": "ko",
"maori": "mi",
"norwegian": "no",
"polish": "pl",
"portuguese": "pt",
"romanian": "ro",
"russian": "ru",
"spanish": "es",
"swahili": "sw",
"swedish": "sv",
"telugu": "te",
"thai": "th",
"turkish": "tr",
"ukrainian": "uk",
"vietnamese": "vi",
}

IMAGES_URL: str = "https://open-images-dataset.s3.amazonaws.com/crossmodal-3600/images.tgz"
CAPTIONS_URL: str = "https://google.github.io/crossmodal-3600/web-data/captions.zip"

name = "crossmodal_3600"
description = (
"Crossmodal-3600 dataset (XM3600 in short), a geographically-diverse set of 3600 images annotated "
"with human-generated reference captions in 36 languages. ([paper](https://arxiv.org/abs/2205.12522))."
)
tags = ["vision-language", "multilinguality"]

def __init__(self, language: str):
super().__init__()
self._language_id: str = self.LANGUAGE_TO_ID[language]
self._instruction: str = f"Generate a short caption for the following image in {language}."

def get_instances(self, output_path: str) -> List[Instance]:
images_path: str = os.path.join(output_path, "images")
ensure_file_downloaded(
source_url=self.IMAGES_URL,
target_path=images_path,
unpack=True,
unpack_type="untar",
)

captions_path: str = os.path.join(output_path, "captions.jsonl")
ensure_file_downloaded(
source_url=self.CAPTIONS_URL,
target_path=captions_path,
unpack=True,
unpack_type="unzip",
)

instances: List[Instance] = []
with open(captions_path, "r") as captions_file:
for line in captions_file:
example: Dict = json.loads(line)

language_id: str = example["image/locale"]
if language_id != self._language_id:
continue

key: str = example["image/key"]
image_path: str = os.path.join(images_path, f"{key}.jpg")
assert os.path.exists(image_path), f"Image {image_path} does not exist"

assert language_id in example, f"Language {language_id} not found in example"
all_captions: Dict = example[language_id]
captions: List[str] = all_captions["caption"]

content: List[MediaObject] = [
MediaObject(text=self._instruction, content_type="text/plain"),
MediaObject(location=image_path, content_type="image/jpeg"),
]
instances.append(
Instance(
Input(multimedia_content=MultimediaObject(content)),
references=[Reference(Output(text=caption), tags=[CORRECT_TAG]) for caption in captions],
split=TEST_SPLIT,
)
)

return instances
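For context, a minimal sketch of driving the new scenario outside the HELM runner; the output path is illustrative, and `get_instances` downloads the full image tarball (large) plus `captions.jsonl` on first use.

```python
# Hedged sketch: instantiating the scenario directly.
from helm.benchmark.scenarios.vision_language.crossmodal_3600_scenario import (
    Crossmodal3600Scenario,
)

scenario = Crossmodal3600Scenario(language="spanish")
# Downloads images.tgz and captions.jsonl into the (illustrative) path, then
# keeps only the captions.jsonl records whose "image/locale" is "es".
instances = scenario.get_instances(output_path="/tmp/crossmodal_3600")

first = instances[0]
print(len(instances))         # one TEST_SPLIT instance per Spanish-locale image
print(len(first.references))  # every Spanish caption becomes a CORRECT_TAG reference
```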
21 changes: 21 additions & 0 deletions src/helm/benchmark/static/schema_vlm.yaml
@@ -196,6 +196,10 @@ metrics:
display_name: F1
description: Average F1 score in terms of word overlap between the model output and correct reference.
lower_is_better: false
- name: cider
display_name: CIDEr
description: Evaluates the quality of a generated caption by measuring the weighted similarity of n-grams between the caption and a set of human-written reference captions, emphasizing informativeness and consensus.
lower_is_better: false

############################################################
perturbations:
@@ -294,6 +298,23 @@ run_groups:
- mmmu
- image2structure

- name: crossmodal_3600
display_name: Crossmodal 3600
description: Crossmodal-3600 dataset (XM3600 in short), a geographically-diverse set of 3600 images annotated with human-generated reference captions in 36 languages. ([paper](https://arxiv.org/abs/2205.12522))
metric_groups:
- accuracy
- efficiency
- general_information
environment:
main_name: f1_score
main_split: valid
taxonomy:
task: multilingual captioning
what: Real-world images
who: Human experts
when: "2022"
language: 36 languages

- name: heim_human_eval
display_name: HEIM Human Eval Scenario
description: Seeing if we can use VLMs to evaluate AI-generated images from HEIM
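For reference, the schema description above summarizes the CIDEr score of Vedantam et al. (2015). As a sketch in the paper's notation, with $g^{n}(\cdot)$ the TF-IDF-weighted vector of n-grams, candidate caption $c$, and reference set $S = \{s_1, \dots, s_m\}$:

$$
\mathrm{CIDEr}_n(c, S) = \frac{1}{m} \sum_{j=1}^{m} \frac{g^{n}(c) \cdot g^{n}(s_j)}{\lVert g^{n}(c) \rVert \, \lVert g^{n}(s_j) \rVert},
\qquad
\mathrm{CIDEr}(c, S) = \sum_{n=1}^{4} \tfrac{1}{4}\, \mathrm{CIDEr}_n(c, S)
$$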