[Refactor] hide the video dataset related args (#675)
* [Refactor] merge the video-dataset-related args into the config JSON and the individual dataset classes

* fix the concat dataset problem

* update build_model_from_config to handle an empty dict

* add a supported_video_datasets function for quick start

* address the result_file_name problem

* fix lint

* update the ConfigSystem and Quickstart docs
FangXinyu-0913 authored Dec 25, 2024
1 parent 2fd7140 commit aa9f50e
Showing 16 changed files with 332 additions and 290 deletions.
20 changes: 15 additions & 5 deletions docs/en/ConfigSystem.md
@@ -1,6 +1,6 @@
# Config System

By default, VLMEvalKit launches the evaluation by setting the model name(s) (defined in `/vlmeval/config.py`) and dataset name(s) (defined in `vlmeval/dataset/__init__.py`) in the `run.py` script with the `--model` and `--data` arguments. Such an approach is simple and efficient in most scenarios; however, it may not be flexible enough when the user wants to evaluate multiple models / datasets with different settings.
By default, VLMEvalKit launches the evaluation by setting the model name(s) (defined in `/vlmeval/config.py`) and dataset name(s) (defined in `vlmeval/dataset/__init__.py` or `vlmeval/dataset/video_dataset_config.py`) in the `run.py` script with the `--model` and `--data` arguments. Such an approach is simple and efficient in most scenarios; however, it may not be flexible enough when the user wants to evaluate multiple models / datasets with different settings.

To address this, VLMEvalKit provides a more flexible config system. The user can specify the model and dataset settings in a json file, and pass the path to the config file to the `run.py` script with the `--config` argument. Here is a sample config json:

@@ -18,7 +18,8 @@
"model": "gpt-4o-2024-08-06",
"temperature": 1.0,
"img_detail": "low"
}
},
"GPT4o_20241120": {}
},
"data": {
"MME-RealWorld-Lite": {
@@ -28,7 +29,14 @@
"MMBench_DEV_EN_V11": {
"class": "ImageMCQDataset",
"dataset": "MMBench_DEV_EN_V11"
}
},
"MMBench_Video_8frame_nopack":{},
"Video-MME_16frame_subs": {
"class": "VideoMME",
"dataset": "Video-MME",
"nframe": 16,
"use_subtitle": true
}
}
}
```
@@ -39,10 +47,11 @@ Explanation of the config json:
2. For items in `model`, the value is a dictionary containing the following keys:
- `class`: The class name of the model, which should be a class name defined in `vlmeval/vlm/__init__.py` (open-source models) or `vlmeval/api/__init__.py` (API models).
- Other kwargs: Other kwargs are model-specific parameters; please refer to the definition of the model class for detailed usage. For example, `model`, `temperature`, and `img_detail` are arguments of the `GPT4V` class. It's noteworthy that the `model` argument is required by most model classes.
- Tip: Models already defined in `supported_VLM` in `vlmeval/config.py` can be used as shortcut keys with an empty value. For example, `GPT4o_20241120: {}` is equivalent to `GPT4o_20241120: {'class': 'GPT4V', 'model': 'gpt-4o-2024-11-20', 'temperature': 0, 'img_size': -1, 'img_detail': 'high', 'retry': 10, 'verbose': False}`.
3. For the dictionary `data`, we suggest using the official dataset name as the key (or part of the key), since we frequently determine the post-processing / judging settings based on the dataset name. For items in `data`, the value is a dictionary containing the following keys:
- `class`: The class name of the dataset, which should be a class name defined in `vlmeval/dataset/__init__.py`.
- Other kwargs: Other kwargs are dataset-specific parameters; please refer to the definition of the dataset class for detailed usage. Typically, the `dataset` argument is required by most dataset classes.

- Other kwargs: Other kwargs are dataset-specific parameters; please refer to the definition of the dataset class for detailed usage. Typically, the `dataset` argument is required by most dataset classes, and most video dataset classes also require either the `nframe` or the `fps` argument.
- Tip: Datasets already defined in `supported_video_datasets` in `vlmeval/dataset/video_dataset_config.py` can be used as shortcut keys with an empty value. For example, `MMBench_Video_8frame_nopack: {}` is equivalent to `MMBench_Video_8frame_nopack: {'class': 'MMBenchVideo', 'dataset': 'MMBench-Video', 'nframe': 8, 'pack': False}`.
After saving the example config json to `config.json`, you can launch the evaluation by:

```bash
@@ -55,3 +64,4 @@ That will generate the following output files under the working directory `$WORK_DIR`:
- `$WORK_DIR/GPT4o_20240806_T10_Low/GPT4o_20240806_T10_Low_MME-RealWorld-Lite*`
- `$WORK_DIR/GPT4o_20240806_T00_HIGH/GPT4o_20240806_T00_HIGH_MMBench_DEV_EN_V11*`
- `$WORK_DIR/GPT4o_20240806_T10_Low/GPT4o_20240806_T10_Low_MMBench_DEV_EN_V11*`
...
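The shortcut expansion described in this doc can be pictured as a small lookup-and-merge step. Below is a minimal sketch, assuming `supported_video_datasets` maps shortcut names to full config dicts; the registry entry mirrors the doc's own equivalence example, and `resolve_dataset_cfg` is a hypothetical helper rather than VLMEvalKit's actual implementation.

```python
# Sketch only: expand an empty-dict dataset shortcut into its full config.
# The registry entry copies the equivalence example from the doc; the
# helper below is hypothetical, not VLMEvalKit's actual code.

supported_video_datasets = {
    "MMBench_Video_8frame_nopack": {
        "class": "MMBenchVideo",
        "dataset": "MMBench-Video",
        "nframe": 8,
        "pack": False,
    },
}

def resolve_dataset_cfg(name: str, cfg: dict) -> dict:
    """Return the full dataset config: empty dicts fall back to the registry."""
    if not cfg and name in supported_video_datasets:
        return dict(supported_video_datasets[name])
    return cfg  # explicit settings are used as-is

print(resolve_dataset_cfg("MMBench_Video_8frame_nopack", {}))
# -> {'class': 'MMBenchVideo', 'dataset': 'MMBench-Video', 'nframe': 8, 'pack': False}
```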
10 changes: 4 additions & 6 deletions docs/en/Quickstart.md
@@ -68,8 +68,6 @@ We use `run.py` for evaluation. To use the script, you can use `$VLMEvalKit/run.py`
- `--mode (str, default to 'all', choices are ['all', 'infer'])`: When `mode` is set to "all", the script performs both inference and evaluation; when set to "infer", it performs inference only.
- `--nproc (int, default to 4)`: The number of threads for OpenAI API calling.
- `--work-dir (str, default to '.')`: The directory to save evaluation results.
- `--nframe (int, default to 8)`: The number of frames to sample from a video, only applicable to the evaluation of video benchmarks.
- `--pack (bool, store_true)`: A video may be associated with multiple questions; if `pack==True`, all questions for a video are asked in a single query.

**Command for Evaluating Image Benchmarks**

@@ -99,10 +97,10 @@ torchrun --nproc-per-node=2 run.py --data MME --model qwen_chat --verbose
# When running with `python`, only one VLM instance is instantiated, and it might use multiple GPUs (depending on its default behavior).
# That is recommended for evaluating very large VLMs (like IDEFICS-80B-Instruct).

# IDEFICS2-8B on MMBench-Video, with 8 frames as inputs and vanilla evaluation. On a node with 8 GPUs.
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model idefics2_8b --nframe 8
# GPT-4o (API model) on MMBench-Video, with 16 frames as inputs and pack evaluation (all questions of a video in a single query).
python run.py --data MMBench-Video --model GPT4o --nframe 16 --pack
# IDEFICS2-8B on MMBench-Video, with 8 frames as inputs and vanilla evaluation, on a node with 8 GPUs. MMBench_Video_8frame_nopack is a dataset setting defined in `vlmeval/dataset/video_dataset_config.py`.
torchrun --nproc-per-node=8 run.py --data MMBench_Video_8frame_nopack --model idefics2_8b
# GPT-4o (API model) on MMBench-Video, with one frame per second as input and pack evaluation (all questions of a video in a single query).
python run.py --data MMBench_Video_1fps_pack --model GPT4o
```

The evaluation results will be printed as logs. Besides, **Result Files** will also be generated in the directory `$YOUR_WORKING_DIRECTORY/{model_name}`. Files ending with `.csv` contain the evaluated metrics.
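To see which shortcut names can be passed to `--data`, the predefined video dataset settings can be inspected directly. A hedged snippet follows; it assumes `vlmeval/dataset/video_dataset_config.py` exposes `supported_video_datasets` as an iterable mapping, which is an inference from this commit's message rather than a documented API.

```python
# Hypothetical usage: list the predefined video dataset settings before
# choosing one for --data. The import path follows the file named in this
# commit; the exact symbol it exports is an assumption.
from vlmeval.dataset.video_dataset_config import supported_video_datasets

for name in supported_video_datasets:
    print(name)  # e.g. MMBench_Video_8frame_nopack, MMBench_Video_1fps_pack
```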
20 changes: 15 additions & 5 deletions docs/zh-CN/ConfigSystem.md
@@ -1,7 +1,7 @@

# Config System

By default, VLMEvalKit launches the evaluation by setting the model name(s) (defined in `/vlmeval/config.py`) and dataset name(s) (defined in `vlmeval/dataset/__init__.py`) in the `run.py` script with the `--model` and `--data` arguments. Such an approach is simple and efficient in most scenarios; however, it may not be flexible enough when the user wants to evaluate multiple models / datasets with different settings.
By default, VLMEvalKit launches the evaluation by setting the model name(s) (defined in `/vlmeval/config.py`) and dataset name(s) (defined in `vlmeval/dataset/__init__.py` or `vlmeval/dataset/video_dataset_config.py`) in the `run.py` script with the `--model` and `--data` arguments. Such an approach is simple and efficient in most scenarios; however, it may not be flexible enough when the user wants to evaluate multiple models / datasets with different settings.

To address this, VLMEvalKit provides a more flexible config system. The user can specify the model and dataset settings in a JSON file, and pass the path to the config file to the `run.py` script with the `--config` argument. Here is a sample config JSON:

@@ -19,7 +19,8 @@
"model": "gpt-4o-2024-08-06",
"temperature": 1.0,
"img_detail": "low"
}
},
"GPT4o_20241120": {}
},
"data": {
"MME-RealWorld-Lite": {
@@ -29,7 +30,14 @@
"MMBench_DEV_EN_V11": {
"class": "ImageMCQDataset",
"dataset": "MMBench_DEV_EN_V11"
}
},
"MMBench_Video_8frame_nopack":{},
"Video-MME_16frame_subs": {
"class": "VideoMME",
"dataset": "Video-MME",
"nframe": 16,
"use_subtitle": true
}
}
}
```
@@ -40,9 +48,11 @@
2. For items in `model`, the value is a dictionary containing the following keys:
- `class`: The class name of the model, which should be a class defined in `vlmeval/vlm/__init__.py` (open-source models) or `vlmeval/api/__init__.py` (API models).
- Other kwargs: Other kwargs are model-specific parameters; please refer to the definition of the model class for detailed usage. For example, `model`, `temperature`, and `img_detail` are arguments of the `GPT4V` class. It's noteworthy that the `model` argument is required by most model classes.
- Tip: Models already defined in the variable `supported_VLM` in `vlmeval/config.py` can be used as keys of `model` without filling in a value. For example, `GPT4o_20240806_T00_HIGH: {}` is equivalent to `GPT4o_20240806_T00_HIGH: {'class': 'GPT4V', 'model': 'gpt-4o-2024-08-06', 'temperature': 0, 'img_size': -1, 'img_detail': 'high', 'retry': 10, 'verbose': False}`.
3. For the dictionary `data`, we suggest using the official dataset name as the key (or part of the key), since the post-processing / judging settings are frequently determined based on the dataset name. For items in `data`, the value is a dictionary containing the following keys:
- `class`: The class name of the dataset, which should be a class defined in `vlmeval/dataset/__init__.py`.
- Other kwargs: Other kwargs are dataset-specific parameters; please refer to the definition of the dataset class for detailed usage. Typically, the `dataset` argument is required by most dataset classes.
- Other kwargs: Other kwargs are dataset-specific parameters; please refer to the definition of the dataset class for detailed usage. Typically, the `dataset` argument is required by most dataset classes, and most video dataset classes also require either the `nframe` or the `fps` argument.
- Tip: Datasets already defined in the variable `supported_video_datasets` in `vlmeval/dataset/video_dataset_config.py` can be used as keys of `data` without filling in a value. For example, `MMBench_Video_8frame_nopack: {}` is equivalent to `MMBench_Video_8frame_nopack: {'class': 'MMBenchVideo', 'dataset': 'MMBench-Video', 'nframe': 8, 'pack': False}`.

After saving the example config JSON as `config.json`, you can launch the evaluation with the following command:

@@ -56,4 +66,4 @@ python run.py --config config.json
- `$WORK_DIR/GPT4o_20240806_T10_Low/GPT4o_20240806_T10_Low_MME-RealWorld-Lite*`
- `$WORK_DIR/GPT4o_20240806_T00_HIGH/GPT4o_20240806_T00_HIGH_MMBench_DEV_EN_V11*`
- `$WORK_DIR/GPT4o_20240806_T10_Low/GPT4o_20240806_T10_Low_MMBench_DEV_EN_V11*`
-
...
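The `nframe` / `fps` arguments mentioned above correspond to two sampling strategies: a fixed count of evenly spaced frames versus a fixed sampling rate. The sketch below is purely illustrative and is not VLMEvalKit's actual sampler.

```python
# Illustrative frame sampling: nframe picks a fixed count of evenly spaced
# frames; fps keeps one frame every (video_fps / fps) frames. Sketch only.

def sample_indices(total_frames, video_fps, nframe=None, fps=None):
    """Return frame indices for either sampling mode."""
    assert (nframe is None) != (fps is None), "set exactly one of nframe / fps"
    if nframe is not None:
        step = total_frames / nframe
        return [int(step * (i + 0.5)) for i in range(nframe)]  # interval midpoints
    stride = max(int(round(video_fps / fps)), 1)
    return list(range(0, total_frames, stride))

print(sample_indices(900, 30.0, nframe=8))       # 8 evenly spaced indices
print(len(sample_indices(900, 30.0, fps=1.0)))   # 30: one frame per second
```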
10 changes: 4 additions & 6 deletions docs/zh-CN/Quickstart.md
@@ -67,8 +67,6 @@ pip install -e .
- `--mode (str, default to 'all', choices are ['all', 'infer'])`: When `mode` is set to "all", the script performs both inference and evaluation; when set to "infer", it performs inference only
- `--nproc (int, default to 4)`: The number of threads for API calls
- `--work-dir (str, default to '.')`: The directory to save evaluation results
- `--nframe (int, default to 8)`: The number of frames to sample from a video, only applicable to video benchmarks
- `--pack (bool, store_true)`: A video may be associated with multiple questions; if `pack==True`, all questions for a video are asked in a single query

**Command for Evaluating Image Benchmarks**

@@ -98,10 +96,10 @@ torchrun --nproc-per-node=2 run.py --data MME --model qwen_chat --verbose
# When running with `python`, only one VLM instance is instantiated, and it may use multiple GPUs.
# This is recommended for evaluating very large VLMs (such as IDEFICS-80B-Instruct).

# Evaluate IDEFICS2-8B on MMBench-Video, sampling 8 frames as input, without pack mode
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model idefics2_8b --nframe 8
# Evaluate GPT-4o (API model) on MMBench-Video, sampling 16 frames as input, with pack mode
python run.py --data MMBench-Video --model GPT4o --nframe 16 --pack
# Evaluate IDEFICS2-8B on MMBench-Video, sampling 8 frames as input, without pack mode. MMBench_Video_8frame_nopack is a dataset setting defined in `vlmeval/dataset/video_dataset_config.py`.
torchrun --nproc-per-node=8 run.py --data MMBench_Video_8frame_nopack --model idefics2_8b
# Evaluate GPT-4o (API model) on MMBench-Video, sampling one frame per second as input, with pack mode (all questions of a video in a single query)
python run.py --data MMBench_Video_1fps_pack --model GPT4o
```

The evaluation results will be printed as logs. Besides, **result files** will also be generated in the directory `$YOUR_WORKING_DIRECTORY/{model_name}`. Files ending with `.csv` contain the evaluated metrics.
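The result files mentioned above share a `{work_dir}/{model_name}/{model_name}_{dataset}` prefix, as the output listings in the ConfigSystem doc suggest. The sketch below only composes such paths; the helper name is hypothetical, and the exact file suffixes are produced by the kit itself.

```python
import os

def result_prefix(work_dir, model_name, dataset_name):
    """Prefix shared by the result files of one (model, dataset) pair,
    matching the $WORK_DIR/{model}/{model}_{dataset}* pattern in the docs."""
    return os.path.join(work_dir, model_name, f"{model_name}_{dataset_name}")

print(result_prefix(".", "GPT4o", "MMBench_Video_1fps_pack"))
# ./GPT4o/GPT4o_MMBench_Video_1fps_pack  (the kit appends suffixes like .csv)
```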