
Video-Bench: Human Preference Aligned Video Generation Benchmark

Video-Bench is a benchmark designed to systematically leverage MLLMs across all dimensions relevant to video generation assessment in generative models. By incorporating few-shot scoring and chain-of-query techniques, Video-Bench provides a structured, scalable approach to evaluating generated videos that aligns closely with human preferences.

Multi-Modal · Foundation-Model · Video-Understanding · Video-Generation · Video-Recommendation

⭐Overview | 📒Leaderboard | 🤗HumanAlignment | 🛠️Installation | 🗃️Preparation | ⚡Instructions | 🚀Usage | 📭Citation | 📝Literature

Leaderboard

| Model | Imaging Quality | Aesthetic Quality | Temporal Consist. | Motion Effects | Avg Rank | Video-text Consist. | Object-class Consist. | Color Consist. | Action Consist. | Scene Consist. | Avg Rank | Overall Avg Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cogvideox [57] | 3.87 | 3.84 | 4.14 | 3.55 | 3.00 | **4.62** | 2.81 | **2.92** | **2.81** | **2.93** | 1.60 | 2.22 |
| Gen3 [42] | **4.66** | **4.44** | **4.74** | **3.99** | 1.00 | 4.38 | 2.81 | 2.87 | 2.59 | **2.93** | 2.40 | 1.78 |
| Kling [24] | 4.26 | 3.82 | 4.38 | 3.11 | 2.75 | 4.07 | 2.70 | 2.81 | 2.50 | 2.82 | 4.60 | 3.78 |
| VideoCrafter2 [5] | 4.08 | 3.85 | 3.69 | 2.81 | 3.75 | 4.18 | **2.85** | 2.90 | 2.53 | 2.78 | 2.80 | 3.22 |
| LaVie [52] | 3.00 | 2.94 | 3.00 | 2.43 | 7.00 | 3.71 | 2.82 | 2.81 | 2.45 | 2.63 | 5.00 | 5.88 |
| PiKa-Beta [38] | 3.78 | 3.76 | 3.40 | 2.59 | 5.50 | 3.78 | 2.51 | 2.52 | 2.25 | 2.60 | 6.80 | 6.22 |
| Show-1 [60] | 3.30 | 3.28 | 3.90 | 2.90 | 5.00 | 4.21 | 2.82 | 2.79 | 2.53 | 2.72 | 3.80 | 4.33 |

Notes:

  • Higher scores indicate better performance; for the Avg Rank columns, lower is better.
  • The best score in each dimension is highlighted in bold.

HumanAlignment

| Metrics | Benchmark | Imaging Quality | Aesthetic Quality | Temporal Consist. | Motion Effects | Video-text Consist. | Action Consist. | Object-class Consist. | Color Consist. | Scene Consist. |
|---|---|---|---|---|---|---|---|---|---|---|
| MUSIQ [21] | VBench [19] | 0.363 | - | - | - | - | - | - | - | - |
| LAION | VBench [19] | - | 0.446 | - | - | - | - | - | - | - |
| CLIP [40] | VBench [19] | - | - | 0.260 | - | - | - | - | - | - |
| RAFT [48] | VBench [19] | - | - | - | 0.329 | - | - | - | - | - |
| AMT [28] | VBench [19] | - | - | - | 0.329 | - | - | - | - | - |
| ViCLIP [53] | VBench [19] | - | - | - | - | 0.445 | - | - | - | - |
| UMT [27] | VBench [19] | - | - | - | - | - | 0.411 | - | - | - |
| GRiT [54] | VBench [19] | - | - | - | - | - | - | 0.469 | 0.545 | - |
| Tag2Text [16] | VBench [19] | - | - | - | - | - | - | - | - | 0.422 |
| ComBench [46] | ComBench [46] | - | - | - | - | 0.633 | 0.633 | 0.611 | 0.696 | 0.631 |
| Video-Bench | Video-Bench | **0.733** | **0.702** | **0.402** | **0.514** | **0.732** | **0.718** | **0.735** | **0.750** | **0.733** |

Notes:

  • Higher scores indicate better performance.
  • The best score in each dimension is highlighted in bold.
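
For reference, you can run the same kind of comparison on your own outputs with a rank correlation. The sketch below uses scipy's Spearman correlation over paired automatic and human scores; whether the numbers above are computed exactly this way, and the format of the annotation files, is defined by the released human-annotation data, not by this snippet.

```python
# Generic sketch: rank correlation between automatic scores and human ratings.
# The score lists below are placeholders; real values come from the evaluation
# outputs and the released human-annotation files.
from scipy.stats import spearmanr

def alignment(model_scores, human_scores):
    """Spearman rank correlation between automatic and human scores."""
    rho, p_value = spearmanr(model_scores, human_scores)
    return rho, p_value

if __name__ == "__main__":
    auto = [4.2, 3.1, 4.8, 2.5, 3.9]    # placeholder automatic scores
    human = [4.0, 3.0, 5.0, 2.0, 4.0]   # placeholder human ratings
    rho, p = alignment(auto, human)
    print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```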

Installation

Installation Requirements

  • Python >= 3.8
  • OpenAI API access. Update your OpenAI API keys in config.json:
    {
        "GPT4o_API_KEY": "your-api-key",
        "GPT4o_BASE_URL": "your-base-url",
        "GPT4o_mini_API_KEY": "your-mini-api-key",
        "GPT4o_mini_BASE_URL": "your-mini-base-url"
    }
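
A minimal sketch of how these keys can be read and turned into clients with the official openai Python package; the loading code here is illustrative, not the package's internal implementation.

```python
# Illustrative: read config.json and build OpenAI clients for GPT-4o and GPT-4o-mini.
import json
from openai import OpenAI

with open("config.json") as f:
    cfg = json.load(f)

gpt4o = OpenAI(api_key=cfg["GPT4o_API_KEY"], base_url=cfg["GPT4o_BASE_URL"])
gpt4o_mini = OpenAI(api_key=cfg["GPT4o_mini_API_KEY"], base_url=cfg["GPT4o_mini_BASE_URL"])

# Quick connectivity check (any lightweight prompt will do).
print(gpt4o_mini.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
).choices[0].message.content)
```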

Pip Installation

  • Install with pip

    pip install HAbench
  • Install with git clone

    git clone https://github.com/Video-Bench/Video-Bench.git
    cd Video-Bench
    pip install -r requirements.txt

Download From Huggingface

git clone https://huggingface.co/Video-Bench/Video-Bench

or

huggingface-cli download Video-Bench/Video-Bench

Preparation

Please organize your data according to the following data structure:

# Data Structure
/Video-Bench/data/
├── color/                           # 'color' dimension videos
│   ├── cogvideox5b/
│   │   ├── A red bird_0.mp4
│   │   ├── A red bird_1.mp4
│   │   └── ...
│   ├── lavie/
│   │   ├── A red bird_0.mp4
│   │   ├── A red bird_1.mp4
│   │   └── ...
│   ├── pika/
│   │   └── ...
│   └── ...
│
├── object_class/                    # 'object_class' dimension videos
│   ├── cogvideox5b/
│   │   ├── A train_0.mp4
│   │   ├── A train_1.mp4
│   │   └── ...
│   ├── lavie/
│   │   └── ...
│   └── ...
│
├── scene/                           # 'scene' dimension videos
│   ├── cogvideox5b/
│   │   ├── Botanical garden_0.mp4
│   │   ├── Botanical garden_1.mp4
│   │   └── ...
│   └── ...
│
├── action/                          # 'action' 'temporal_consistency' 'motion_effects' dimension videos
│   ├── cogvideox5b/
│   │   ├── A person is marching_0.mp4
│   │   ├── A person is marching_1.mp4
│   │   └── ...
│   └── ...
│
└── video-text consistency/             # 'video-text consistency' 'imaging_quality' 'aesthetic_quality' dimension videos
    ├── cogvideox5b/
    │   ├── Close up of grapes on a rotating table._0.mp4
    │   └── ...
    ├── lavie/
    │   └── ...
    ├── pika/
    │   └── ...
    └── ...
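
Before running an evaluation, it can help to verify the layout. The following helper is a hypothetical sketch (not part of the package) that checks for .mp4 files under each dimension/model folder:

```python
# Hypothetical sanity check for the expected ./data/<dimension>/<model>/*.mp4 layout.
from pathlib import Path

EXPECTED_DIMENSIONS = [
    "color", "object_class", "scene", "action", "video-text consistency",
]

def check_data_layout(root="./data", models=("cogvideox5b", "lavie", "pika")):
    root = Path(root)
    for dim in EXPECTED_DIMENSIONS:
        for model in models:
            videos = list((root / dim / model).glob("*.mp4"))
            if not videos:
                print(f"[warn] no .mp4 files under {root / dim / model}")
            else:
                print(f"[ok] {dim}/{model}: {len(videos)} videos")

if __name__ == "__main__":
    check_data_layout()
```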

Instructions

Video-Bench provides comprehensive evaluation across multiple dimensions of video generation quality. Each dimension is assessed using a specific scoring scale to ensure accurate and meaningful evaluation.

Evaluation Dimensions

| Dimension | Description | Scale | Module |
|---|---|---|---|
| **Static Quality** | | | |
| Imaging Quality | Evaluates technical aspects including clarity and sharpness | 1-5 | staticquality.py |
| Aesthetic Quality | Assesses visual appeal and artistic composition | 1-5 | staticquality.py |
| **Dynamic Quality** | | | |
| Temporal Consistency | Measures frame-to-frame coherence and smoothness | 1-5 | dynamicquality.py |
| Motion Effects | Evaluates quality of movement and dynamics | 1-5 | dynamicquality.py |
| **Video-Text Alignment** | | | |
| Video-Text Consistency | Overall alignment with the text prompt | 1-5 | VideoTextAlignment.py |
| Object-Class Consistency | Accuracy of object representation | 1-3 | VideoTextAlignment.py |
| Color Consistency | Matching of colors with the text prompt | 1-3 | VideoTextAlignment.py |
| Action Consistency | Accuracy of depicted actions | 1-3 | VideoTextAlignment.py |
| Scene Consistency | Correctness of the scene environment | 1-3 | VideoTextAlignment.py |
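
To make the scoring setup concrete, the sketch below shows how a 1-5 dimension could be scored with GPT-4o using few-shot prompting over sampled frames. The helper names, prompt wording, and frame-sampling choices are illustrative assumptions; the actual prompts live in the modules listed above (staticquality.py, dynamicquality.py, VideoTextAlignment.py).

```python
# Illustrative few-shot scoring sketch; not the package's actual prompts or code.
import base64
import json
from openai import OpenAI

def _image_part(frame_path: str) -> dict:
    # Encode a sampled frame (saved as JPEG) for the Chat Completions vision API.
    with open(frame_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

def score_dimension(frame_paths, few_shot, dimension="imaging quality",
                    config_path="config.json"):
    """Return GPT-4o's 1-5 rating for one video, conditioned on scored examples."""
    with open(config_path) as f:
        cfg = json.load(f)
    client = OpenAI(api_key=cfg["GPT4o_API_KEY"], base_url=cfg["GPT4o_BASE_URL"])

    messages = [{"role": "system",
                 "content": f"You rate the {dimension} of generated videos on a 1-5 scale."}]
    # Few-shot scoring: show reference frames together with their human scores.
    for ex in few_shot:  # e.g. [{"frame": "ref_0.jpg", "score": 4}, ...]
        messages.append({"role": "user",
                         "content": [_image_part(ex["frame"]),
                                     {"type": "text", "text": f"Reference score: {ex['score']}"}]})
    # Frames sampled from the video under test, plus the scoring question.
    messages.append({"role": "user",
                     "content": [*(_image_part(p) for p in frame_paths),
                                 {"type": "text",
                                  "text": f"Rate this video's {dimension} from 1 to 5. Reply with a number."}]})

    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    return reply.choices[0].message.content.strip()
```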

Usage

Video-Bench supports two modes: standard mode and custom input mode. Video-Bench only supports assessments of the following dimensions: 'aesthetic_quality', 'imaging_quality', 'temporal_consistency', 'motion_effects', 'color', 'object_class', 'scene', 'action', 'video-text consistency'.

Standard Mode

Standard mode assesses videos generated by various video generation models using the prompt suite defined in our HAbench_full.json.

You can organize three sets of videos for each of the seven provided models, or add three sets for additional models following the data structure above. Using a single set of videos for all models is also supported; just ensure that the number of video sets is consistent across all models in the data structure.

To evaluate videos, specify the models to be tested with the --models parameter. For example, to evaluate videos under modelname1 and modelname2, run:

python evaluate.py \
 --dimension $DIMENSION \
 --videos_path ./data/ \
 --config_path ./config.json \
 --models modelname1 modelname2

or

HAbench \
 --dimension $DIMENSION \
 --videos_path ./data/ \
 --config_path ./config.json \
 --models modelname1 modelname2

Custom Mode

This mode allows users to evaluate videos generated from prompts that are not included in the Video-Bench prompt suite.

You can provide prompts in two ways:

  1. Single prompt: Use --prompt "your customized prompt" to specify a single prompt.
  2. Multiple prompts: Create a JSON file containing your prompts and use --prompt_file $json_path to load them. The JSON file can follow this format:
{
    "0": "prompt1",
    "1": "prompt2",
    ...
}
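
If you prefer to generate that file programmatically, here is a tiny illustrative snippet (not part of the package):

```python
# Illustrative: dump a list of custom prompts into the JSON format expected by --prompt_file.
import json

prompts = ["prompt1", "prompt2"]
with open("custom_prompts.json", "w") as f:
    json.dump({str(i): p for i, p in enumerate(prompts)}, f, indent=4)
```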

For the video-text alignment or dynamic quality dimensions, set --mode custom_nonstatic:

python evaluate.py \
 --dimension $DIMENSION \
 --videos_path ./data/ \
 --mode custom_nonstatic \
 --config_path ./config.json \
 --models modelname1 modelname2

or

HAbench \
 --dimension $DIMENSION \
 --videos_path ./data/ \
 --mode custom_nonstatic \
 --config_path ./config.json \
 --models modelname1 modelname2

For the static quality dimensions, set --mode custom_static:

python evaluate.py \
 --dimension $DIMENSION \
 --videos_path ./data/ \
 --mode custom_static \
 --config_path ./config.json \
 --models modelname1 modelname2

or

HAbench \
 --dimension $DIMENSION \
 --videos_path ./data/ \
 --mode custom_static \
 --config_path ./config.json \
 --models modelname1 modelname2

Videos and Annotations

You can obtain the video data and human annotations in two ways:

Option 1: Download from Hugging Face

# Download videos
git clone https://huggingface.co/datasets/Video-Bench/Video-Bench_videos
# Download annotations  
git clone https://huggingface.co/datasets/Video-Bench/Video-Bench_human_annotation

Option 2: Local Directory

The human annotations can also be found in the local directory:

./data/human_anno/

Citation

If you use our dataset or code, or find Video-Bench useful, please cite our paper in your work as:

@article{ni2023content,
  title={Video-Bench: Human Preference Aligned Video Generation Benchmark},
  author={Han, Hui and Li, Siyuan and Chen, Jiaqi and Yuan, Yiwen and Wu, Yuling and Leong, Chak Tou and Du, Hanwen and Fu, Junchen and Li, Youhua and Zhang, Jie and Zhang, Chi and Li, Li-jia and Ni, Yongxin},
  journal={arXiv preprint arXiv:xxx},
  year={2024}
}

Literature

Video Generation Evaluation Methods

| Model | Paper | Resource | Conference/Journal/Preprint | Year | Features |
|---|---|---|---|---|---|
| Video-Bench | Link | GitHub | Arxiv | 2024 | Video-Bench leverages Multimodal Large Language Models (MLLMs) to provide highly accurate evaluations that closely align with human preferences across multiple dimensions of video quality. It incorporates few-shot scoring and chain-of-query techniques, allowing for scalable and structured assessments. Video-Bench supports cross-modal consistency and offers more objective insights when diverging from human judgments, making it a more reliable and comprehensive tool for video generation evaluation. It also demonstrates unique strength compared to human ratings in terms of accuracy. |
| FETV | Link | GitHub | NeurIPS | 2023 | FETV is multi-aspect, categorizing the prompts based on three orthogonal aspects: the major content, the attributes to control and the prompt complexity. |
| FVD | Link | GitHub | ICLR Workshop | 2023 | A novel metric for generative video models that extends the Fréchet Inception Distance (FID) to account for not only visual quality but also temporal coherence and diversity, addressing the lack of qualitative metrics in current video generation evaluation. |
| GAIA | Link | GitHub | Arxiv | 2024 | By adopting a causal reasoning perspective, it evaluates popular text-to-video (T2V) models on their ability to generate visually rational actions and benchmarks existing automatic evaluation methods, revealing a significant gap between current models and human perception patterns. |
| SAVGBench | Link | Links | Arxiv | 2024 | This work introduces a benchmark for Spatially Aligned Audio-Video Generation (SAVG), focusing on spatial alignment between audio and visuals. Key innovations include a new dataset, a baseline diffusion model for stereo audio-visual learning, and a spatial alignment metric, revealing significant gaps in quality and alignment between the model and ground truth. |
| VBench++ | Link | GitHub | Arxiv | 2024 | VBench++ is a comprehensive benchmark for video generation, featuring 16 evaluation dimensions, human alignment validation, and support for both text-to-video and image-to-video models, assessing both technical quality and model trustworthiness. |
| T2V-CompBench | Link | GitHub | Arxiv | 2024 | T2V-CompBench evaluates diverse aspects such as attribute binding, spatial relationships, motion, and object interactions. It introduces tailored evaluation metrics based on MLLM, detection, and tracking, validated by human evaluation. |
| VideoScore | Link | Website | EMNLP | 2024 | It introduces a dataset with human-provided multi-aspect scores for 37.6K videos from 11 generative models. VideoScore is trained on this to provide automatic video quality assessment, achieving a 77.1 Spearman correlation with human ratings. |
| ChronoMagic-Bench | Link | Website | NeurIPS | 2024 | ChronoMagic-Bench evaluates T2V models on their ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence, using 1,649 prompts across four categories. Its advantages include the introduction of new metrics (MTScore and CHScore) and a large-scale dataset (ChronoMagic-Pro) for comprehensive, high-quality evaluation. |
| T2VSafetyBench | Link | GitHub | NeurIPS | 2024 | T2VSafetyBench introduces a benchmark for assessing the safety of text-to-video models, focusing on 12 critical aspects of video generation safety, including temporal risks. It addresses the unique safety concerns of video generation, providing a malicious prompt dataset, and offering valuable insights into the trade-off between usability and safety. |
| T2VBench | Link | Website | CVPR | 2024 | T2VBench focuses on 16 critical temporal dimensions such as camera transitions and event sequences for evaluating text-to-video models, consisting of a hierarchical framework with over 1,600 prompts and 5,000 videos. |
| EvalCrafter | Link | Website | CVPR | 2024 | EvalCrafter provides a systematic framework for benchmarking and evaluating large-scale video generation models, ensuring high-quality assessments across various video generation attributes. |
| VQAScore | Link | GitHub | ECCV | 2024 | This work introduces VQAScore, a novel alignment metric that uses a visual-question-answering model to assess image-text coherence, addressing the limitations of CLIPScore with complex prompts. It also presents GenAI-Bench, a challenging benchmark of 1,600 compositional prompts and 15,000 human ratings, enabling more accurate evaluation of generative models like Stable Diffusion and DALL-E 3. |
| VBench | Link | GitHub | CVPR | 2024 | VBench introduces a comprehensive evaluation benchmark for video generation, addressing the misalignment between current metrics and human perception. Its key innovations include 16 detailed evaluation dimensions, human preference alignment for validation, and the ability to assess various content types and model gaps. |
| DEVIL | Link | GitHub | NeurIPS | 2024 | DEVIL introduces a new benchmark with dynamic scores at different temporal granularities, achieving over 90% Pearson correlation with human ratings for comprehensive model assessment. |
| AIGCBench | Link | Website | Arxiv | 2024 | AIGCBench is a benchmark for evaluating image-to-video (I2V) generation. It incorporates an open-domain image-text dataset and introduces 11 metrics across four dimensions: alignment, motion effects, temporal consistency, and video quality. |
| MiraData | Link | GitHub | NeurIPS | 2024 | MiraData offers longer videos, stronger motion intensity, and more detailed captions. Paired with MiraBench to enhance evaluation with metrics like 3D consistency and motion strength. |
| PhyGenEval | Link | Website | Arxiv | 2024 | PhyGenBench is designed to evaluate the understanding of physical commonsense in text-to-video (T2V) generation, consisting of 160 prompts covering 27 physical laws across four domains, paired with the PhyGenEval evaluation framework that enables assessments of models' adherence to physical commonsense. |
| VideoPhy | Link | GitHub | Arxiv | 2024 | VideoPhy is a benchmark designed to assess the physical commonsense accuracy of generated videos, particularly for T2V models, by evaluating their adherence to real-world physical laws and behaviors. |
| T2VHE | Link | GitHub | Arxiv | 2024 | The T2VHE protocol is an approach for evaluating text-to-video (T2V) models, addressing challenges in reproducibility, reliability, and practicality of manual evaluations. It includes defined metrics, annotator training, and a dynamic evaluation module. |