This is the official implementation of the paper:
Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer
Danah Yatim*,
Rafail Fridman*,
Omer Bar-Tal,
Yoni Kasten,
Tali Dekel
(*equal contribution)
(Teaser video: teaser.mp4)
Introducing a zero-shot method for transferring motion across objects and scenes, without any training or fine-tuning.
We present a new method for text-driven motion transfer -- synthesizing a video that complies with an input text prompt describing the target objects and scene while maintaining an input video's motion and scene layout. Prior methods are confined to transferring motion across two subjects within the same or closely related object categories and are applicable for limited domains (e.g., humans). In this work, we consider a significantly more challenging setting in which the target and source objects differ drastically in shape and fine-grained motion characteristics (e.g., translating a jumping dog into a dolphin). To this end, we leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors. The pillar of our method is a new space-time feature loss derived directly from the model. This loss guides the generation process to preserve the overall motion of the input video while complying with the target object in terms of shape and fine-grained motion traits.
For more, visit the project webpage.
Clone the repo and create a new environment:
```
git clone https://github.com/diffusion-motion-transfer/diffusion-motion-transfer.git
cd diffusion-motion-transfer
conda create --name dmt python=3.9
conda activate dmt
```
Install our environment requirements:
```
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
```
- Our method is designed for transferring motion across objects and scenes.
- Our method is based on the ZeroScope text-to-video model; therefore, we can edit videos of 24 frames.
- In some cases, the combination of target object and input video motion is out of distribution for the T2V model, which can lead to visual artifacts in the generated video. It may be necessary to sample several seeds.
- The method was tested on a single NVIDIA A40 48GB; it uses ~32GB of video memory and takes approximately 7 minutes per video.
To preprocess a video, update the configuration file `configs/preprocess_config.yaml`:
Arguments to update:
- `video_path` - the input video frames should be located in this path
- `save_dir` - the latents will be saved in this path
- `prompt` - an empty string or a string describing the video content

Optional arguments to update:
- `save_ddim_reconstruction` - if `True`, the reconstructed video will be saved in `save_dir`
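For reference, here is a minimal sketch of what `configs/preprocess_config.yaml` might look like after editing. The field names follow the list above; the paths and prompt are illustrative placeholders, and the provided config file may contain additional fields and defaults:

```yaml
# Sketch of configs/preprocess_config.yaml -- paths and prompt are placeholders
video_path: data/my_video              # directory containing the input video frames
save_dir: outputs/my_video_latents     # directory where the inversion latents will be written
prompt: "a dog jumping on the grass"   # empty string or a short description of the video
save_ddim_reconstruction: True         # optionally save the DDIM-reconstructed video to save_dir
```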
After updating the config file, run the following command:

```
python preprocess_video_ddim.py --config_path configs/preprocess_config.yaml
```

Once the preprocessing is done, the latents will be saved in the `save_dir` path.
To edit the video, update the configuration file `configs/guidance_config.yaml`:
Arguments to update:
- `data_path` - the input video frames should be located in this path
- `output_path` - the edited video will be saved in this path
- `latents_path` - the latents of the input video should be located in this path
- `source_prompt` - prompt used for inversion
- `target_prompt` - prompt used for editing
Optional arguments to update:
- `negative_prompt` - prompt used for unconditional classifier-free guidance
- `seed` - randomly chosen by default; to use a specific seed, change this value
- `optimization_step` - number of optimization steps for each denoising step
- `optim_lr` - learning rate
- `with_lr_decay` - if `True`, overrides `optim_lr`, and the learning rate will decay during the optimization process in the range of `scale_range`
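As a reference, here is a minimal sketch of `configs/guidance_config.yaml` using the fields described above. All paths, prompts, and numeric values below are illustrative only (not the repository defaults), and the actual file may contain additional fields:

```yaml
# Sketch of configs/guidance_config.yaml -- all values below are illustrative
data_path: data/my_video                 # directory containing the input video frames
output_path: outputs/my_video_edit       # directory where the edited video will be saved
latents_path: outputs/my_video_latents   # latents produced by the preprocessing step
source_prompt: "Amazing quality, masterpiece, a dog jumping on the grass"
target_prompt: "Amazing quality, masterpiece, a dolphin jumping out of the water"
negative_prompt: ""                      # used for unconditional classifier-free guidance
# seed: 42                               # uncomment to fix the seed (random by default)
optimization_step: 20                    # optimization steps per denoising step
optim_lr: 0.001                          # learning rate (ignored when with_lr_decay is True)
with_lr_decay: True                      # decay the learning rate during optimization
scale_range: [0.01, 0.005]               # learning-rate range used when with_lr_decay is True
```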
After updating the config file, run the following command:

```
python run.py --config_path configs/guidance_config.yaml
```

Once the method is done, the edited video will be saved to `output_path` under `result.mp4`.
- To get better samples from the T2V model, we used the prefix text `"Amazing quality, masterpiece, "` for inversion and edits.
- If the video contains more complex motion or small objects, try increasing the number of optimization steps, e.g., `optimization_step: 30`.
- For a large deviation in structure between the source and target objects, try using a lower learning rate, e.g., `scale_range: [0.005, 0.002]`, or adding the source object to the negative prompt text.
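A minimal sketch of how these tweaks might look in `configs/guidance_config.yaml`; the negative prompt text is just an example for a dog-to-dolphin edit:

```yaml
# Possible adjustments for harder edits (illustrative values, based on the tips above)
optimization_step: 30         # more optimization steps for complex motion / small objects
with_lr_decay: True
scale_range: [0.005, 0.002]   # lower learning-rate range for large structural deviations
negative_prompt: "a dog"      # example: add the source object when editing a dog video
```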
We also provide the code for calculating the motion fidelity metric introduced in the paper (Section 5.1). To calculate the motion fidelity metric, first install Co-Tracker and download its checkpoint by following the instructions in the Co-Tracker repository. Then, run the following command:

```
python motion_fidelity_score.py --config_path configs/motion_fidelity_config.yaml
```
```
@article{yatim2023spacetime,
  title   = {Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer},
  author  = {Yatim, Danah and Fridman, Rafail and Bar-Tal, Omer and Kasten, Yoni and Dekel, Tali},
  journal = {arXiv preprint arXiv:2311.17009},
  year    = {2023}
}
```