Movie Gen is a collection of cutting-edge foundation models developed by the Movie Gen team at Meta. These models are designed to generate high-quality 1080p HD videos in a variety of aspect ratios, with synchronized audio. Movie Gen excels at a range of tasks, including:
- Text-to-video synthesis
- Video personalization
- Precise video editing based on user instructions
- Video-to-audio generation
- Text-to-audio generation
These models set a new state of the art across multiple video and audio generation tasks and aim to push the boundaries of what's possible in media creation. The most powerful model in the collection is a 30-billion-parameter transformer capable of generating videos up to 16 seconds long at 16 frames per second (fps). It operates with a maximum context length of 73K video tokens, allowing for highly detailed and complex media output.
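As a rough sanity check on that context length, the calculation below shows how a 16-second, 16 fps clip works out to roughly 73K tokens. The compression factors used (an 8x temporal and 8x spatial autoencoder followed by 2x2 patchification at a 768 px generation resolution) are assumptions taken from the Movie Gen paper, not from this repository:

```python
# Back-of-the-envelope token count for a 16 s, 16 fps clip.
# ASSUMPTIONS (from the Movie Gen paper, not this repo): 8x temporal /
# 8x spatial TAE compression, 2x2 spatial patchify, and a 768 px
# generation resolution that is later upsampled to 1080p.

seconds, fps = 16, 16
frames = seconds * fps                          # 256 raw frames
height = width = 768                            # assumed generation resolution

t_compress, s_compress = 8, 8                   # assumed TAE compression
patch = 2                                       # assumed spatial patch size

latent_frames = frames // t_compress            # 32 latent frames
tokens_per_frame = (height // s_compress // patch) ** 2  # 48 * 48 = 2304

total_tokens = latent_frames * tokens_per_frame
print(f"{total_tokens:,} video tokens")         # 73,728, i.e. ~73K
```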
- HD Video Generation: Outputs high-quality 1080p videos with various aspect ratios and synchronized audio.
- Text-to-Video Synthesis: Generates fully realized videos from natural language descriptions.
- Personalized Video Creation: Tailors videos based on user-supplied images or inputs.
- Instruction-Based Video Editing: Allows precise control and editing of video content through instructions.
- Audio Synthesis: Generates audio based on video content and natural language descriptions.
- Scaling & Efficiency: Achieves high scalability through technical innovations in parallelization, architecture simplifications, and efficient data curation.
| Model | Parameters | Capabilities | Max Context Length | FPS |
|---|---|---|---|---|
| Movie Gen Base | 5B | Text-to-Video, Video-to-Audio | 18K video tokens | 16 |
| Movie Gen Pro | 15B | Personalized Video, Text-to-Video | 40K video tokens | 16 |
| Movie Gen Max (state-of-the-art) | 30B | Full-featured Video & Audio Generation | 73K video tokens | 16 |
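If you want to work with these tiers programmatically, the table might translate into a small configuration map like the one below. The dataclass and its field names are hypothetical and illustrative only, not an actual API of this repository:

```python
# Illustrative config map mirroring the model tiers in the table above.
# HYPOTHETICAL: these names and fields are not part of the repo's API.
from dataclasses import dataclass


@dataclass
class MovieGenConfig:
    params_billions: int
    max_context_tokens: int
    fps: int
    capabilities: tuple[str, ...]


CONFIGS = {
    "movie-gen-base": MovieGenConfig(5, 18_000, 16, ("text-to-video", "video-to-audio")),
    "movie-gen-pro": MovieGenConfig(15, 40_000, 16, ("personalized-video", "text-to-video")),
    "movie-gen-max": MovieGenConfig(30, 73_000, 16, ("video-generation", "audio-generation", "video-editing")),
}
```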
- Architecture Simplifications: Several architectural simplifications make it possible to scale media generation models effectively, including transformer-based structures tailored to video data.
- Latent Spaces & Training Objectives: Refined latent spaces and optimized training objectives let the models generate realistic, coherent, high-quality outputs across multiple media modalities (a toy latent-compression sketch follows this list).
- Data Curation: Training draws on a highly curated, diverse dataset designed specifically for multi-modal media generation tasks.
- Parallelization Techniques: Advanced parallelization techniques enable faster training and inference.
- Inference Optimizations: Inference-time optimizations significantly reduce latency, making real-time video generation and editing feasible.
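To make the latent-space idea concrete, here is a minimal toy sketch of temporal autoencoding with strided 3D convolutions. This is not the repository's `TemporalAutoencoder`; the layer choices, kernel sizes, and strides are illustrative assumptions only:

```python
# Toy temporal autoencoder: compress a video 2x in time and 4x per
# spatial axis overall, then reconstruct it. Purely illustrative; the
# real TemporalAutoencoder in movie_gen.tae differs.
import torch
import torch.nn as nn


class TinyTAE(nn.Module):
    def __init__(self, in_channels: int = 3, latent_channels: int = 16):
        super().__init__()
        # Encoder: strided 3D convolutions shrink T, H, W by 2x each layer.
        self.encoder = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(64, latent_channels, kernel_size=3, stride=2, padding=1),
        )
        # Decoder: transposed convolutions mirror the encoder.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 64, kernel_size=4, stride=2, padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(64, in_channels, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W); Conv3d expects (B, C, T, H, W)
        x = x.permute(0, 2, 1, 3, 4)
        z = self.encoder(x)          # latent is 4x smaller in T, H, and W
        recon = self.decoder(z)
        return recon.permute(0, 2, 1, 3, 4)


video = torch.randn(1, 16, 3, 64, 64)
out = TinyTAE()(video)
print(video.shape, "->", out.shape)  # both torch.Size([1, 16, 3, 64, 64])
```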
To use Movie Gen, clone the repository and install the necessary dependencies:
```bash
git clone https://github.com/kyegomez/movie-gen.git
cd movie-gen
pip install -r requirements.txt
```
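The snippet below runs a quick smoke test of the repository's `TemporalAutoencoder` on a dummy video batch: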
```python
import torch
from loguru import logger

from movie_gen.tae import TemporalAutoencoder


def test_temporal_autoencoder():
    """
    Test the TemporalAutoencoder model with a dummy input tensor.

    Creates a random input tensor representing a batch of videos,
    passes it through the model, and prints the input and output shapes.
    """
    # Route log messages to stdout
    logger.add(lambda msg: print(msg, end=""))

    # Instantiate the model
    model = TemporalAutoencoder(in_channels=3, latent_channels=16)

    # Dummy input: batch size B=1, T0=16 frames, 3 channels (RGB), H0=64, W0=64
    B, T0, C_in, H0, W0 = 1, 16, 3, 64, 64
    x = torch.randn(B, T0, C_in, H0, W0)

    # Forward pass through the model
    recon = model(x)

    # Print the shapes
    print(f"Input shape: {x.shape}")
    print(f"Reconstructed output shape: {recon.shape}")


if __name__ == "__main__":
    test_temporal_autoencoder()
```
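Because the model is an autoencoder, the reconstructed output should match the input shape, `torch.Size([1, 16, 3, 64, 64])`, assuming the model decodes back to the original resolution.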
The models have been rigorously evaluated on multiple tasks, including:
- Text-to-video generation benchmarks.
- Video personalization accuracy.
- Instruction-based video editing precision.
- Audio generation quality.
To reproduce the evaluation metrics from the Movie Gen paper, use the following command:

```bash
python evaluate.py --model movie-gen-max --task text-to-video
```
Contributions are welcome! Please follow the standard GitHub flow:
- Fork the repository
- Create a new feature branch (`git checkout -b feature-branch`)
- Make your changes
- Submit a pull request
For a list of core contributors, please refer to the appendix of the Movie Gen Paper.
Movie Gen is licensed under the MIT License. See the LICENSE file for more information.
For any questions or collaboration opportunities, please reach out to the Movie Gen team at:
- Website: http://agoralab.ai