
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

This project accompanies the research paper,

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
Mingze Xu*, Mingfei Gao*, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan

SlowFast-LLaVA is a training-free multimodal large language model (LLM) for video understanding and reasoning. Without requiring fine-tuning on any data, it achieves performance comparable to, or even better than, state-of-the-art Video LLMs on a wide range of VideoQA tasks and benchmarks.

Table of contents

  • Getting Started
  • Configuration
  • Inference and Evaluation
  • Demo
  • License
  • Citations

Getting Started

Installation

  • The code is developed with CUDA 11.7, Python >= 3.10.12, and PyTorch >= 2.1.0. A quick environment sanity check is sketched after the steps below.

    1. [Optional but recommended] Create a new conda environment.

      conda create -n sf_llava python=3.10.12
      

      And activate the environment.

      conda activate sf_llava
      
    2. Install the requirements.

      bash setup_env.sh
      
    3. Add your OpenAI API key (and, optionally, organization) to the system environment to use GPT-3.5-turbo for model evaluation.

      export OPENAI_API_KEY=$YOUR_OPENAI_API_KEY
      export OPENAI_ORG=$YOUR_OPENAI_ORG  # optional
      
    4. Download pre-trained LLaVA-NeXT weights from HuggingFace, and put them under the ml-slowfast-llava folder.

      git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b liuhaotian/llava-v1.6-vicuna-7b
      git lfs clone https://huggingface.co/liuhaotian/llava-v1.6-34b liuhaotian/llava-v1.6-34b
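
Optionally, you can sanity-check the environment before moving on. The snippet below is an illustrative sketch, not part of this repository; it simply prints the installed Python, PyTorch, and CUDA versions for comparison against the requirements above.

# check_env.py -- minimal environment check (illustrative only, not repository code)
import sys

import torch

print(f"Python  : {sys.version.split()[0]}")    # developed with Python >= 3.10.12
print(f"PyTorch : {torch.__version__}")         # developed with PyTorch >= 2.1.0
print(f"CUDA    : {torch.version.cuda}")        # developed with CUDA 11.7
print(f"GPU OK  : {torch.cuda.is_available()}, {torch.cuda.device_count()} device(s) visible")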
      

Data Preparation

  1. We prepare the ground-truth question and answer files based on IG-VLM, and put them under playground/gt_qa_files.

    • MSVD-QA
      • Download the MSVD_QA.csv from here
      • Reformat the files by running
        python scripts/data/prepare_msvd_qa_file.py --qa_file $PATH_TO_CSV_FILE
        
    • MSRVTT-QA
      • Download the MSRVTT_QA.csv from here
      • Reformat the files by running
        python scripts/data/prepare_msrvtt_qa_file.py --qa_file $PATH_TO_CSV_FILE
        
    • TGIF-QA
      • Download the TGIF_FrameQA.csv from here
      • Reformat the files by running
        python scripts/data/prepare_tgif_qa_file.py --qa_file $PATH_TO_CSV_FILE
        
    • Activitynet-QA
      • Download the Activitynet_QA.csv from here
      • Reformat the files by running
        python scripts/data/prepare_activitynet_qa_file.py --qa_file $PATH_TO_CSV_FILE
        
    • NExT-QA
      • Download the NExT_QA.csv from here
      • Reformat the files by running
        python scripts/data/prepare_nextqa_qa_file.py --qa_file $PATH_TO_CSV_FILE
        
    • EgoSchema
      • Download the EgoSchema.csv from here
      • Reformat the files by running
        python scripts/data/prepare_egoschema_qa_file.py --qa_file $PATH_TO_CSV_FILE
        
    • IntentQA
      • Download the IntentQA.csv from here
      • Reformat the files by running
        python scripts/data/prepare_intentqa_qa_file.py --qa_file $PATH_TO_CSV_FILE
        
    • VCGBench
      • Download all files under text_generation_benchmark
      • Reformat the files by running
        python scripts/data/prepare_vcgbench_qa_file.py --qa_folder $TEXT_GENERATION_BENCHMARK
        
  2. Download the raw videos from the official websites.

    • Openset VideoQA

    • Multiple Choice VideoQA

    • Text Generation

      • The videos are based on ActivityNet, so you can reuse the ones from Openset VideoQA.
  3. Organize the raw videos under playground/data.

    • To directly use our data loaders without changing paths, please organize your datasets as follows (a quick layout check is sketched after the tree)

      $ ml-slowfast-llava/playground/data
          ├── video_qa
              ├── MSVD_Zero_Shot_QA
                  ├── videos
                      ├── ...
              ├── MSRVTT_Zero_Shot_QA
                  ├── videos
                      ├── all
                          ├── ...
              ├── TGIF_Zero_Shot_QA
                  ├── mp4
                      ├── ...
              ├── Activitynet_Zero_Shot_QA
                  ├── all_test
                      ├── ...
          ├── multiple_choice_qa
              ├── NExTQA
                  ├── video
                      ├── ...
              ├── EgoSchema
                  ├── video
                      ├── ...
              ├── IntentQA
                  ├── video
                      ├── ...
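
To catch path mistakes early, the following illustrative sketch (not part of the repository) checks that the folders from the tree above exist under playground/data; adjust DATA_ROOT if you keep the data elsewhere.

# check_data_layout.py -- verify the expected dataset layout (illustrative only)
from pathlib import Path

DATA_ROOT = Path("playground/data")  # adjust if your data lives elsewhere
EXPECTED_DIRS = [
    "video_qa/MSVD_Zero_Shot_QA/videos",
    "video_qa/MSRVTT_Zero_Shot_QA/videos/all",
    "video_qa/TGIF_Zero_Shot_QA/mp4",
    "video_qa/Activitynet_Zero_Shot_QA/all_test",
    "multiple_choice_qa/NExTQA/video",
    "multiple_choice_qa/EgoSchema/video",
    "multiple_choice_qa/IntentQA/video",
]

for rel in EXPECTED_DIRS:
    path = DATA_ROOT / rel
    status = "ok" if path.is_dir() else "MISSING"
    print(f"[{status:>7}] {path}")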
      

Configuration

We use YAML configs to control the design choices of SlowFast-LLaVA. Below, we use the config of SlowFast-LLaVA-7B as an example to explain the important parameters.

  • SCRIPT: It controls the tasks that you want to run.
  • DATA_DIR and CONV_MODE: They are the data directories and prompts for different tasks. They could be either a string or a list of strings, but must match the SCRIPT.
  • NUM_FRAMES: The total number of sampled video frames.
  • TEMPORAL_AGGREGATION: It controls the setting of the Slow and Fast pathways. It should be a string with the pattern slowfast-slow_{$S_FRMS}frms_{$S_POOL}-fast_{$F_OH}x{$F_OW} (see the sketch after this list), where
    • $S_FRMS should be an integer indicating the number of frames in the Slow pathway,
    • $S_POOL should be a string indicating the pooling operation for the Slow pathway,
    • $F_OH and $F_OW should be integers indicating the height and width of the output tokens in the Fast pathway.
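
To make the pattern concrete, here is a small illustrative sketch (not repository code) that composes a TEMPORAL_AGGREGATION string from the placeholders above; the pooling token used in the example is a stand-in, so please take valid values from the released configs.

# illustrative only: build a TEMPORAL_AGGREGATION string from its components
def temporal_aggregation(s_frms: int, s_pool: str, f_oh: int, f_ow: int) -> str:
    # pattern: slowfast-slow_{$S_FRMS}frms_{$S_POOL}-fast_{$F_OH}x{$F_OW}
    return f"slowfast-slow_{s_frms}frms_{s_pool}-fast_{f_oh}x{f_ow}"

# e.g., 8 Slow frames with a placeholder pooling op, and 4x4 Fast output tokens
print(temporal_aggregation(8, "POOL_OP", 4, 4))
# -> slowfast-slow_8frms_POOL_OP-fast_4x4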

Inference and Evaluation

SlowFast-LLaVA is a training-free method, so inference and evaluation can be run directly, without any model training.

By default, we use 8 GPUs for model inference. You can modify CUDA_VISIBLE_DEVICES in the config file to match your own setup. Please note that inference with SlowFast-LLaVA-34B requires GPUs with at least 80GB of memory.

cd ml-slowfast-llava
python run_inference.py --exp_config $PATH_TO_CONFIG_FILE
  • Optionally, run export PYTHONWARNINGS="ignore" to suppress warnings.

Output Structures

  • The inference outputs will be stored under outputs/artifacts.
  • The intermediate outputs of GPT-3.5-turbo will be stored under outputs/eval_save_dir.
  • The evaluation results will be stored under outputs/logs.
  • All of these can be changed in the config file.

Demo

We provide a script for running video question-answering on a single video.

cd ml-slowfast-llava
python run_demo.py --video_path $PATH_TO_VIDEO --model_path $PATH_TO_LLAVA_MODEL --question "Describe this video in detail"

License

This project is licensed under the Apple Sample Code License.

Citations

If you are using the data/code/model provided here in a publication, please cite our paper:

@article{xu2024slowfast,
	title={SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models},
	author={Xu, Mingze and Gao, Mingfei and Gan, Zhe and Chen, Hong-You and Lai, Zhengfeng and Gang, Haiming and Kang, Kai and Dehghan, Afshin},
	journal={arXiv:2407.15841},
	year={2024}
}
