[ECCV 2024 🔥] Official implementation of the paper "ST-LLM: Large Language Models Are Effective Temporal Learners"

TencentARC/ST-LLM

ST-LLM

News 📢

  • [2024/3/28] All code and weights are now available. Watch this repository for the latest updates.

Introduction 💡

  • ST-LLM is a temporally sensitive video large language model. It incorporates three key architectural designs:
    • (1) Joint spatial-temporal modeling within large language models for effective video understanding.
    • (2) A dynamic masking strategy and masked video modeling for efficiency and robustness.
    • (3) A global-local input module for long video understanding.
  • ST-LLM has established new state-of-the-art results on MVBench, VideoChatGPT Bench and VideoQA Bench:
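| Method | MVBench | VcgBench Avg | Correct | Detail | Context | Temporal | Consist | MSVD | MSRVTT | ANet |
|---|---|---|---|---|---|---|---|---|---|---|
| VideoLLaMA | 34.1 | 1.96 | 2.18 | 2.16 | 1.82 | 1.79 | 1.98 | 51.6 | 29.6 | 12.4 |
| LLaMA-Adapter | 31.7 | 2.03 | 2.32 | 2.30 | 1.98 | 2.15 | 2.16 | 54.9 | 43.8 | 34.2 |
| VideoChat | 35.5 | 2.23 | 2.50 | 2.53 | 1.94 | 2.24 | 2.29 | 56.3 | 45.0 | 26.5 |
| VideoChatGPT | 32.7 | 2.38 | 2.40 | 2.52 | 2.62 | 1.98 | 2.37 | 64.9 | 49.3 | 35.7 |
| MovieChat | - | 2.76 | 2.93 | 3.01 | 2.24 | 2.42 | 2.67 | 74.2 | 52.7 | 45.7 |
| Vista-LLaMA | - | 2.44 | 2.64 | 3.18 | 2.26 | 2.31 | 2.57 | 65.3 | 60.5 | 48.3 |
| LLaMA-VID | - | 2.89 | 2.96 | 3.00 | 3.53 | 2.46 | 2.51 | 69.7 | 57.7 | 47.4 |
| Chat-UniVi | - | 2.99 | 2.89 | 2.91 | 3.46 | 2.89 | 2.81 | 65.0 | 54.6 | 45.8 |
| VideoChat2 | 51.1 | 2.98 | 3.02 | 2.88 | 3.51 | 2.66 | 2.81 | 70.0 | 54.1 | 49.1 |
| ST-LLM | 54.9 | 3.15 | 3.23 | 3.05 | 3.74 | 2.93 | 2.81 | 74.6 | 63.2 | 50.9 |

The columns from VcgBench Avg through Consist are VcgBench scores; MSVD, MSRVTT, and ANet are VideoQABench scores.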
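
To make design (2) from the introduction more concrete, here is a minimal sketch of random video-token masking. It is not the repository's implementation: it assumes that "dynamic" means the mask ratio is resampled for every training sample and that masked tokens are simply dropped. The function name `dynamic_mask`, its parameters, and the ratio range are hypothetical.

```python
import random


def dynamic_mask(tokens, min_ratio=0.3, max_ratio=0.7, rng=None):
    """Drop a random subset of tokens.

    Assumption (not taken from the ST-LLM code): the mask ratio is drawn
    uniformly from [min_ratio, max_ratio] on every call, so each sample
    sees a different amount of masking.
    """
    if not tokens:
        return [], []
    rng = rng or random.Random()
    ratio = rng.uniform(min_ratio, max_ratio)
    # Keep at least one token and never more than the sequence length.
    num_keep = min(len(tokens), max(1, round(len(tokens) * (1.0 - ratio))))
    # Sort the kept indices so the temporal order of the tokens is preserved.
    keep_idx = sorted(rng.sample(range(len(tokens)), num_keep))
    return [tokens[i] for i in keep_idx], keep_idx


if __name__ == "__main__":
    video_tokens = [f"tok{i}" for i in range(16)]
    kept, idx = dynamic_mask(video_tokens, rng=random.Random(0))
    print(len(kept), idx)
```

Dropping tokens shortens the sequence fed to the language model, which is where an efficiency gain would come from under these assumptions.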

Demo 🤗

First download the conversation weights from here and follow the steps in the Installation section. Then run the Gradio demo:

CUDA_VISIBLE_DEVICES=0 python3 demo_gradio.py --ckpt-path /path/to/STLLM_conversation_weight

We also provide a local script that is easy to modify: demo.py
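
If you prefer to launch the demo from Python, for example to pick a GPU programmatically, the command above can be assembled as in the sketch below. Only `demo_gradio.py`, `--ckpt-path`, and `CUDA_VISIBLE_DEVICES` come from this README; the helper name `build_demo_command` and its parameters are hypothetical.

```python
import os
import subprocess


def build_demo_command(ckpt_path, gpu_id=0):
    """Return (argv, env) equivalent to the shell command shown above."""
    argv = ["python3", "demo_gradio.py", "--ckpt-path", str(ckpt_path)]
    env = dict(os.environ)
    # Restrict the demo process to a single GPU, as in the README command.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return argv, env


if __name__ == "__main__":
    argv, env = build_demo_command("/path/to/STLLM_conversation_weight", gpu_id=0)
    # Run from the repository root so that demo_gradio.py is found.
    subprocess.run(argv, env=env, check=True)
```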

Examples 👀

  • Video Description: for high-difficulty videos with complex scene changes, ST-LLM can accurately describe all the contents.

  • Action Identification: ST-LLM can accurately and comprehensively describe the actions occurring in the video.

  • Reasoning: for challenging open-ended reasoning questions, ST-LLM can also provide reasonable answers.

Installation 🛠️

Clone the repository, then create and activate a Python environment with the following commands:

git clone https://github.com/farewellthree/ST-LLM.git
cd ST-LLM
conda create --name stllm python=3.10
conda activate stllm
pip install -r requirement.txt
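
After installation, a quick sanity check of the environment can save a failed demo launch. The sketch below is a hypothetical helper, not part of the repository. It checks the Python version against the 3.10 used above and tests whether a few packages are importable; the package names `torch` and `gradio` are assumptions, since the contents of requirement.txt are not listed here.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 10), packages=("torch", "gradio")):
    """Report whether the interpreter and the given packages look usable.

    The default package names are assumed examples; replace them with
    the entries from requirement.txt.
    """
    report = {"python_ok": sys.version_info[:2] >= tuple(min_python)}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'MISSING'}")
```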

Training & Validation 📊

Instructions for data preparation, training, and evaluation can be found in trainval.md.

Acknowledgement 👍

Citation ✏️

If you find the code and paper useful for your research, please consider starring this repo and citing our papers:

@article{liu2023one,
  title={One for all: Video conversation is feasible without video instruction tuning},
  author={Liu, Ruyang and Li, Chen and Ge, Yixiao and Shan, Ying and Li, Thomas H and Li, Ge},
  journal={arXiv preprint arXiv:2309.15785},
  year={2023}
}
@article{liu2024stllm,
  title={ST-LLM: Large Language Models Are Effective Temporal Learners},
  author={Liu, Ruyang and Li, Chen and Tang, Haoran and Ge, Yixiao and Shan, Ying and Li, Ge},
  journal={arXiv preprint arXiv:2404.00308},
  year={2024}
}
