[Refactor] hide the video dataset related args (#675)
* [Refactor] merge the video-dataset-related args into the config JSON and the individual dataset classes

* fix the concat dataset problem

* update build_model_from_config to handle an empty dict

* add a supported_video_datasets function for quick start

* address the result_file_name problem

* fix lint

* update the ConfigSystem and Quickstart docs
FangXinyu-0913 authored Dec 25, 2024
1 parent 2fd7140 commit aa9f50e
Showing 16 changed files with 332 additions and 290 deletions.
20 changes: 15 additions & 5 deletions docs/en/ConfigSystem.md
@@ -1,6 +1,6 @@
# Config System

By default, VLMEvalKit launches the evaluation by setting the model name(s) (defined in `/vlmeval/config.py`) and dataset name(s) (defined in `vlmeval/dataset/__init__.py`) in the `run.py` script with the `--model` and `--data` arguments. Such an approach is simple and efficient in most scenarios; however, it may not be flexible enough when the user wants to evaluate multiple models / datasets with different settings.
By default, VLMEvalKit launches the evaluation by setting the model name(s) (defined in `/vlmeval/config.py`) and dataset name(s) (defined in `vlmeval/dataset/__init__.py` or `vlmeval/dataset/video_dataset_config.py`) in the `run.py` script with the `--model` and `--data` arguments. Such an approach is simple and efficient in most scenarios; however, it may not be flexible enough when the user wants to evaluate multiple models / datasets with different settings.

To address this, VLMEvalKit provides a more flexible config system. The user can specify the model and dataset settings in a json file, and pass the path to the config file to the `run.py` script with the `--config` argument. Here is a sample config json:

@@ -18,7 +18,8 @@
"model": "gpt-4o-2024-08-06",
"temperature": 1.0,
"img_detail": "low"
}
},
"GPT4o_20241120": {}
},
"data": {
"MME-RealWorld-Lite": {
@@ -28,7 +29,14 @@
"MMBench_DEV_EN_V11": {
"class": "ImageMCQDataset",
"dataset": "MMBench_DEV_EN_V11"
}
},
"MMBench_Video_8frame_nopack":{},
"Video-MME_16frame_subs": {
"class": "VideoMME",
"dataset": "Video-MME",
"nframe": 16,
"use_subtitle": true
}
}
}
```
@@ -39,10 +47,11 @@ Explanation of the config json:
2. For items in `model`, the value is a dictionary containing the following keys:
- `class`: The class name of the model, which should be a class name defined in `vlmeval/vlm/__init__.py` (open-source models) or `vlmeval/api/__init__.py` (API models).
- Other kwargs: Other kwargs are model-specific parameters; please refer to the definition of the model class for detailed usage. For example, `model`, `temperature`, and `img_detail` are arguments of the `GPT4V` class. It's noteworthy that the `model` argument is required by most model classes.
- Tip: Models already defined in `supported_VLM` in `vlmeval/config.py` can be used as shortcut keys with an empty value. For example, `GPT4o_20241120: {}` is equivalent to `GPT4o_20241120: {'class': 'GPT4V', 'model': 'gpt-4o-2024-11-20', 'temperature': 0, 'img_size': -1, 'img_detail': 'high', 'retry': 10, 'verbose': False}`.
3. For the dictionary `data`, we suggest using the official dataset name as the key (or part of the key), since we frequently determine the post-processing / judging settings based on the dataset name. For items in `data`, the value is a dictionary containing the following keys:
- `class`: The class name of the dataset, which should be a class name defined in `vlmeval/dataset/__init__.py`.
- Other kwargs: Other kwargs are dataset-specific parameters; please refer to the definition of the dataset class for detailed usage. Typically, the `dataset` argument is required by most dataset classes.

- Other kwargs: Other kwargs are dataset-specific parameters; please refer to the definition of the dataset class for detailed usage. Typically, the `dataset` argument is required by most dataset classes, and most video dataset classes also require either the `nframe` or the `fps` argument.
- Tip: Datasets already defined in `supported_video_datasets` in `vlmeval/dataset/video_dataset_config.py` can be used as shortcut keys with an empty value. For example, `MMBench_Video_8frame_nopack: {}` is equivalent to `MMBench_Video_8frame_nopack: {'class': 'MMBenchVideo', 'dataset': 'MMBench-Video', 'nframe': 8, 'pack': False}`.
After saving the example config json to `config.json`, you can launch the evaluation by:

```bash
@@ -55,3 +64,4 @@ That will generate the following output files under the working directory `$WORK_DIR`:
- `$WORK_DIR/GPT4o_20240806_T10_Low/GPT4o_20240806_T10_Low_MME-RealWorld-Lite*`
- `$WORK_DIR/GPT4o_20240806_T00_HIGH/GPT4o_20240806_T00_HIGH_MMBench_DEV_EN_V11*`
- `$WORK_DIR/GPT4o_20240806_T10_Low/GPT4o_20240806_T10_Low_MMBench_DEV_EN_V11*`
...
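The shortcut expansion described in this doc can be pictured as a small lookup-and-merge step. Below is a minimal sketch, assuming `supported_video_datasets` maps shortcut names to full config dicts; the registry entry mirrors the doc's own equivalence example, and `resolve_dataset_cfg` is a hypothetical helper rather than VLMEvalKit's actual implementation.

```python
# Sketch only: expand an empty-dict dataset shortcut into its full config.
# The registry entry copies the equivalence example from the doc; the
# helper below is hypothetical, not VLMEvalKit's actual code.

supported_video_datasets = {
    "MMBench_Video_8frame_nopack": {
        "class": "MMBenchVideo",
        "dataset": "MMBench-Video",
        "nframe": 8,
        "pack": False,
    },
}

def resolve_dataset_cfg(name: str, cfg: dict) -> dict:
    """Return the full dataset config: empty dicts fall back to the registry."""
    if not cfg and name in supported_video_datasets:
        return dict(supported_video_datasets[name])
    return cfg  # explicit settings are used as-is

print(resolve_dataset_cfg("MMBench_Video_8frame_nopack", {}))
# -> {'class': 'MMBenchVideo', 'dataset': 'MMBench-Video', 'nframe': 8, 'pack': False}
```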
10 changes: 4 additions & 6 deletions docs/en/Quickstart.md
@@ -68,8 +68,6 @@ We use `run.py` for evaluation. To use the script, you can use `$VLMEvalKit/run.py`
- `--mode (str, default to 'all', choices are ['all', 'infer'])`: When `mode` is set to "all", the script performs both inference and evaluation; when set to "infer", it performs inference only.
- `--nproc (int, default to 4)`: The number of threads for OpenAI API calling.
- `--work-dir (str, default to '.')`: The directory to save evaluation results.
- `--nframe (int, default to 8)`: The number of frames to sample from a video, only applicable to the evaluation of video benchmarks.
- `--pack (bool, store_true)`: A video may be associated with multiple questions; if `pack==True`, all questions for a video are asked in a single query.

**Command for Evaluating Image Benchmarks**

@@ -99,10 +97,10 @@ torchrun --nproc-per-node=2 run.py --data MME --model qwen_chat --verbose
# When running with `python`, only one VLM instance is instantiated, and it might use multiple GPUs (depending on its default behavior).
# That is recommended for evaluating very large VLMs (like IDEFICS-80B-Instruct).

# IDEFICS2-8B on MMBench-Video, with 8 frames as inputs and vanilla evaluation. On a node with 8 GPUs.
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model idefics2_8b --nframe 8
# GPT-4o (API model) on MMBench-Video, with 16 frames as inputs and pack evaluation (all questions of a video in a single query).
python run.py --data MMBench-Video --model GPT4o --nframe 16 --pack
# IDEFICS2-8B on MMBench-Video, with 8 frames as inputs and vanilla evaluation, on a node with 8 GPUs. MMBench_Video_8frame_nopack is a dataset setting defined in `vlmeval/dataset/video_dataset_config.py`.
torchrun --nproc-per-node=8 run.py --data MMBench_Video_8frame_nopack --model idefics2_8b
# GPT-4o (API model) on MMBench-Video, with one frame per second as input and pack evaluation (all questions of a video in a single query).
python run.py --data MMBench_Video_1fps_pack --model GPT4o
```

The evaluation results will be printed as logs. Besides, **Result Files** will also be generated in the directory `$YOUR_WORKING_DIRECTORY/{model_name}`. Files ending with `.csv` contain the evaluated metrics.
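To see which shortcut names can be passed to `--data`, the predefined video dataset settings can be inspected directly. A hedged snippet follows; it assumes `vlmeval/dataset/video_dataset_config.py` exposes `supported_video_datasets` as an iterable mapping, which is an inference from this commit's message rather than a documented API.

```python
# Hypothetical usage: list the predefined video dataset settings before
# choosing one for --data. The import path follows the file named in this
# commit; the exact symbol it exports is an assumption.
from vlmeval.dataset.video_dataset_config import supported_video_datasets

for name in supported_video_datasets:
    print(name)  # e.g. MMBench_Video_8frame_nopack, MMBench_Video_1fps_pack
```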
20 changes: 15 additions & 5 deletions docs/zh-CN/ConfigSystem.md
@@ -1,7 +1,7 @@

# Config System

By default, VLMEvalKit launches the evaluation by setting the model name(s) (defined in `/vlmeval/config.py`) and dataset name(s) (defined in `vlmeval/dataset/__init__.py`) in the `run.py` script with the `--model` and `--data` arguments. Such an approach is simple and efficient in most scenarios; however, it may not be flexible enough when the user wants to evaluate multiple models / datasets with different settings.
By default, VLMEvalKit launches the evaluation by setting the model name(s) (defined in `/vlmeval/config.py`) and dataset name(s) (defined in `vlmeval/dataset/__init__.py` or `vlmeval/dataset/video_dataset_config.py`) in the `run.py` script with the `--model` and `--data` arguments. Such an approach is simple and efficient in most scenarios; however, it may not be flexible enough when the user wants to evaluate multiple models / datasets with different settings.

To address this, VLMEvalKit provides a more flexible config system. The user can specify the model and dataset settings in a JSON file, and pass the path to the config file to the `run.py` script with the `--config` argument. Here is a sample config JSON:

@@ -19,7 +19,8 @@
"model": "gpt-4o-2024-08-06",
"temperature": 1.0,
"img_detail": "low"
}
},
"GPT4o_20241120": {}
},
"data": {
"MME-RealWorld-Lite": {
@@ -29,7 +30,14 @@
"MMBench_DEV_EN_V11": {
"class": "ImageMCQDataset",
"dataset": "MMBench_DEV_EN_V11"
}
},
"MMBench_Video_8frame_nopack":{},
"Video-MME_16frame_subs": {
"class": "VideoMME",
"dataset": "Video-MME",
"nframe": 16,
"use_subtitle": true
}
}
}
```
@@ -40,9 +48,11 @@
2. For items in `model`, the value is a dictionary containing the following keys:
- `class`: The class name of the model, which should be a class defined in `vlmeval/vlm/__init__.py` (open-source models) or `vlmeval/api/__init__.py` (API models).
- Other kwargs: Other kwargs are model-specific parameters; please refer to the definition of the model class for detailed usage. For example, `model`, `temperature`, and `img_detail` are arguments of the `GPT4V` class. It's noteworthy that the `model` argument is required by most model classes.
- Tip: Models already defined in the variable `supported_VLM` in `vlmeval/config.py` can be used as keys of `model` without filling in a value. For example, `GPT4o_20240806_T00_HIGH: {}` is equivalent to `GPT4o_20240806_T00_HIGH: {'class': 'GPT4V', 'model': 'gpt-4o-2024-08-06', 'temperature': 0, 'img_size': -1, 'img_detail': 'high', 'retry': 10, 'verbose': False}`.
3. For the dictionary `data`, we suggest using the official dataset name as the key (or part of the key), since the post-processing / judging settings are frequently determined based on the dataset name. For items in `data`, the value is a dictionary containing the following keys:
- `class`: The class name of the dataset, which should be a class defined in `vlmeval/dataset/__init__.py`.
- Other kwargs: Other kwargs are dataset-specific parameters; please refer to the definition of the dataset class for detailed usage. Typically, the `dataset` argument is required by most dataset classes.
- Other kwargs: Other kwargs are dataset-specific parameters; please refer to the definition of the dataset class for detailed usage. Typically, the `dataset` argument is required by most dataset classes, and most video dataset classes also require either the `nframe` or the `fps` argument.
- Tip: Datasets already defined in the variable `supported_video_datasets` in `vlmeval/dataset/video_dataset_config.py` can be used as keys of `data` without filling in a value. For example, `MMBench_Video_8frame_nopack: {}` is equivalent to `MMBench_Video_8frame_nopack: {'class': 'MMBenchVideo', 'dataset': 'MMBench-Video', 'nframe': 8, 'pack': False}`.

After saving the example config JSON as `config.json`, you can launch the evaluation with the following command:

@@ -56,4 +66,4 @@ python run.py --config config.json
- `$WORK_DIR/GPT4o_20240806_T10_Low/GPT4o_20240806_T10_Low_MME-RealWorld-Lite*`
- `$WORK_DIR/GPT4o_20240806_T00_HIGH/GPT4o_20240806_T00_HIGH_MMBench_DEV_EN_V11*`
- `$WORK_DIR/GPT4o_20240806_T10_Low/GPT4o_20240806_T10_Low_MMBench_DEV_EN_V11*`
-
...
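The `nframe` / `fps` arguments mentioned above correspond to two sampling strategies: a fixed count of evenly spaced frames versus a fixed sampling rate. The sketch below is purely illustrative and is not VLMEvalKit's actual sampler.

```python
# Illustrative frame sampling: nframe picks a fixed count of evenly spaced
# frames; fps keeps one frame every (video_fps / fps) frames. Sketch only.

def sample_indices(total_frames, video_fps, nframe=None, fps=None):
    """Return frame indices for either sampling mode."""
    assert (nframe is None) != (fps is None), "set exactly one of nframe / fps"
    if nframe is not None:
        step = total_frames / nframe
        return [int(step * (i + 0.5)) for i in range(nframe)]  # interval midpoints
    stride = max(int(round(video_fps / fps)), 1)
    return list(range(0, total_frames, stride))

print(sample_indices(900, 30.0, nframe=8))       # 8 evenly spaced indices
print(len(sample_indices(900, 30.0, fps=1.0)))   # 30: one frame per second
```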
10 changes: 4 additions & 6 deletions docs/zh-CN/Quickstart.md
@@ -67,8 +67,6 @@ pip install -e .
- `--mode (str, default to 'all', choices are ['all', 'infer'])`: When `mode` is set to "all", the script performs both inference and evaluation; when set to "infer", it performs inference only
- `--nproc (int, default to 4)`: The number of threads for API calls
- `--work-dir (str, default to '.')`: The directory to save evaluation results
- `--nframe (int, default to 8)`: The number of frames to sample from a video, only applicable to video benchmarks
- `--pack (bool, store_true)`: A video may be associated with multiple questions; if `pack==True`, all questions for a video are asked in a single query

**Command for Evaluating Image Benchmarks**

@@ -98,10 +96,10 @@ torchrun --nproc-per-node=2 run.py --data MME --model qwen_chat --verbose
# When running with `python`, only one VLM instance is instantiated, and it may use multiple GPUs.
# This is recommended for evaluating very large VLMs (such as IDEFICS-80B-Instruct).

# Evaluate IDEFICS2-8B on MMBench-Video, sampling 8 frames as input, without pack mode
torchrun --nproc-per-node=8 run.py --data MMBench-Video --model idefics2_8b --nframe 8
# Evaluate GPT-4o (API model) on MMBench-Video, sampling 16 frames as input, with pack mode
python run.py --data MMBench-Video --model GPT4o --nframe 16 --pack
# Evaluate IDEFICS2-8B on MMBench-Video, sampling 8 frames as input, without pack mode. MMBench_Video_8frame_nopack is a dataset setting defined in `vlmeval/dataset/video_dataset_config.py`.
torchrun --nproc-per-node=8 run.py --data MMBench_Video_8frame_nopack --model idefics2_8b
# Evaluate GPT-4o (API model) on MMBench-Video, sampling one frame per second as input, with pack mode (all questions of a video in a single query)
python run.py --data MMBench_Video_1fps_pack --model GPT4o
```

The evaluation results will be printed as logs. Besides, **result files** will also be generated in the directory `$YOUR_WORKING_DIRECTORY/{model_name}`. Files ending with `.csv` contain the evaluated metrics.
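The result files mentioned above share a `{work_dir}/{model_name}/{model_name}_{dataset}` prefix, as the output listings in the ConfigSystem doc suggest. The sketch below only composes such paths; the helper name is hypothetical, and the exact file suffixes are produced by the kit itself.

```python
import os

def result_prefix(work_dir, model_name, dataset_name):
    """Prefix shared by the result files of one (model, dataset) pair,
    matching the $WORK_DIR/{model}/{model}_{dataset}* pattern in the docs."""
    return os.path.join(work_dir, model_name, f"{model_name}_{dataset_name}")

print(result_prefix(".", "GPT4o", "MMBench_Video_1fps_pack"))
# ./GPT4o/GPT4o_MMBench_Video_1fps_pack  (the kit appends suffixes like .csv)
```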