Movie Gen

An open source community implementation of the model from the paper "Movie Gen: A Cast of Media Foundation Models". Join our community to help implement this model!

Join the Discord | Subscribe on YouTube | Connect on LinkedIn | Follow on X.com

Movie Gen is a collection of cutting-edge foundation models developed by the Movie Gen team at Meta. These models are designed to generate high-quality 1080p HD videos with various aspect ratios and synchronized audio. Movie Gen excels at a range of tasks, including:

  • Text-to-video synthesis
  • Video personalization
  • Precise video editing based on user instructions
  • Video-to-audio generation
  • Text-to-audio generation

These models set a new state of the art in multiple video and audio generation domains and push the boundaries of what's possible in media creation. The most powerful model in the collection is a 30-billion-parameter transformer capable of generating videos up to 16 seconds long at 16 frames per second (fps). It operates with a maximum context length of 73K video tokens, allowing for highly detailed and complex media output.
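To get a feel for where that 73K-token budget goes, here is a back-of-envelope calculation in Python. The 8x temporal compression factor and the square patch grid are illustrative assumptions, not figures taken from this repository:

# Back-of-envelope: how a 16 s, 16 fps clip could fill a 73K-token context.
# The compression factor and patch-grid shape below are assumptions for
# illustration only.
seconds, fps = 16, 16
frames = seconds * fps                            # 256 raw frames
temporal_compression = 8                          # assumed TAE factor
latent_frames = frames // temporal_compression    # 32 latent frames

context = 73_000
tokens_per_frame = context // latent_frames       # ~2281 tokens per latent frame
grid_side = int(tokens_per_frame ** 0.5)          # roughly a 47x47 patch grid
print(latent_frames, tokens_per_frame, grid_side)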

Key Features

  • HD Video Generation: Outputs high-quality 1080p videos with various aspect ratios and synchronized audio.
  • Text-to-Video Synthesis: Generates fully realized videos from natural language descriptions.
  • Personalized Video Creation: Tailors videos based on user-supplied images or inputs.
  • Instruction-Based Video Editing: Allows precise control and editing of video content through instructions.
  • Audio Synthesis: Generates audio based on video content and natural language descriptions.
  • Scaling & Efficiency: Achieves high scalability through technical innovations in parallelization, architecture simplifications, and efficient data curation.
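
The repository does not yet expose an end-to-end generation pipeline, so the sketch below is purely hypothetical: every class and function name in it is invented to illustrate the kind of interface the feature list implies.

# Hypothetical interface sketch -- none of these names exist in the
# repository; they only illustrate the feature list above.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GenerationRequest:
    prompt: str                                 # natural-language description
    duration_s: float = 16.0                    # up to 16 seconds
    fps: int = 16
    resolution: Tuple[int, int] = (1920, 1080)  # 1080p HD output
    reference_image: Optional[str] = None       # for personalized video
    edit_instruction: Optional[str] = None      # for instruction-based editing

def generate(request: GenerationRequest):
    """Placeholder: a finished pipeline would return video and audio tensors."""
    raise NotImplementedError("end-to-end generation is not implemented yet")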

Model Overview

Model                             Parameters   Capabilities                             Max Context Length   FPS
Movie Gen Base                    5B           Text-to-Video, Video-to-Audio            18K video tokens     16
Movie Gen Pro                     15B          Personalized Video, Text-to-Video        40K video tokens     16
Movie Gen Max (state-of-the-art)  30B          Full-featured Video & Audio Generation   73K video tokens     16

Technical Innovations

  1. Architecture Simplifications: The paper introduces several architectural simplifications to scale media generation models effectively, including transformer-based structures tailored for handling video data.

  2. Latent Spaces & Training Objectives: Refined latent spaces and optimized training objectives let the models generate realistic, coherent, and high-quality outputs across multiple media modalities (a minimal sketch of the training objective appears after this list).

  3. Data Curation: The team built a highly curated, diverse dataset designed specifically for multi-modal media generation tasks.

  4. Parallelization Techniques: The models leverage advanced parallelization techniques, enabling faster training and inference.

  5. Inference Optimizations: Inference-time optimizations significantly reduce latency, making real-time video generation and editing feasible.
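
The Movie Gen paper describes training its generation models with a flow-matching objective in the TAE latent space. The sketch below is a minimal, self-contained illustration of such a training step, not the repository's implementation: ToyVelocityModel is a stand-in for the real transformer, and conditioning on text and timestep is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVelocityModel(nn.Module):
    """Stand-in for the real video transformer; predicts a velocity field."""
    def __init__(self, channels: int = 16):
        super().__init__()
        self.net = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, xt, t):
        # A real model would condition on the timestep t (and on text);
        # this toy version ignores it.
        return self.net(xt)

def flow_matching_loss(model, x1):
    """One flow-matching step on clean latents x1 of shape (B, C, T, H, W)."""
    x0 = torch.randn_like(x1)                        # noise endpoint
    t = torch.rand(x1.size(0), device=x1.device).view(-1, 1, 1, 1, 1)
    xt = (1 - t) * x0 + t * x1                       # linear interpolation path
    target_velocity = x1 - x0                        # d(xt)/dt along the path
    predicted_velocity = model(xt, t.flatten())
    return F.mse_loss(predicted_velocity, target_velocity)

model = ToyVelocityModel()
latents = torch.randn(2, 16, 4, 8, 8)   # tiny fake latent batch
loss = flow_matching_loss(model, latents)
loss.backward()
print(f"flow-matching loss: {loss.item():.4f}")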

Installation

To use Movie Gen, clone the repository and install the necessary dependencies:

git clone https://github.com/kyegomez/movie-gen.git
cd movie-gen
pip install -r requirements.txt

Usage

TemporalAutoencoder (TAE)

The TAE compresses a video into a compact latent space and reconstructs it; the snippet below runs a smoke test on a dummy batch.

import torch
from loguru import logger
from movie_gen.tae import TemporalAutoencoder

def test_temporal_autoencoder():
    """
    Test the TemporalAutoencoder model with a dummy input tensor.
    This function creates a random input tensor representing a batch of videos,
    passes it through the model, and prints out the input and output shapes.
    """
    # Route loguru messages to stdout (loguru logs to stderr by default)
    logger.add(lambda msg: print(msg, end=''))

    # Instantiate the model
    model = TemporalAutoencoder(in_channels=3, latent_channels=16)

    # Create a dummy input tensor representing a batch of videos:
    # batch size B=1, T0=16 frames, C_in=3 channels (RGB), H0=64, W0=64
    B, T0, C_in, H0, W0 = 1, 16, 3, 64, 64
    x = torch.randn(B, T0, C_in, H0, W0)

    # Forward pass through the model
    recon = model(x)

    # Print the shapes
    print(f"Input shape: {x.shape}")
    print(f"Reconstructed output shape: {recon.shape}")

if __name__ == "__main__":
    test_temporal_autoencoder()
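
Run as a script, this should print an input shape of torch.Size([1, 16, 3, 64, 64]) and, if the autoencoder reconstructs at full resolution, a matching output shape; the compression the paper describes (8x along each spatiotemporal dimension) happens only in the internal latent space.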

Evaluation

The paper rigorously evaluates the models on multiple tasks, including:

  • Text-to-video generation benchmarks.
  • Video personalization accuracy.
  • Instruction-based video editing precision.
  • Audio generation quality.

Reproducing the Paper's Results

To reproduce the evaluation metrics reported in the paper, run:

python evaluate.py --model movie-gen-max --task text-to-video
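
Assuming the --model and --task flags accept the model and task names from the table above (an assumption worth checking against evaluate.py), other evaluations follow the same pattern:

python evaluate.py --model movie-gen-base --task video-to-audio
python evaluate.py --model movie-gen-pro --task personalized-video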

Contributing

Contributions are welcome! Please follow the standard GitHub flow:

  1. Fork the repository
  2. Create a new feature branch (git checkout -b feature-branch)
  3. Make your changes
  4. Submit a pull request

For a list of core contributors, see the appendix of the Movie Gen paper.

License

Movie Gen is licensed under the MIT License. See LICENSE for more information.

Contact

For questions or collaboration opportunities, please reach out to the Movie Gen community through the Discord linked above.
