Update: We now support HunyuanVideo for in-context learning in a new branch.
TL;DR: We explore in-context capabilities in video diffusion transformers and activate them with minimal tuning.
Abstract: Following In-Context LoRA, we directly concatenate the condition and target videos into a single composite video along the spatial or temporal dimension, while using natural language to define the task. This can serve as a general framework for controllable video generation with task-specific fine-tuning. More encouragingly, it can create consistent multi-scene videos of more than 30 seconds without any additional computational burden.
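As a rough illustration of the composite-video idea, concatenating a condition clip and a target clip along the temporal or spatial dimension can be sketched as follows (a minimal sketch with hypothetical tensor shapes, not the exact preprocessing used in our pipeline):

import torch

# Two clips as tensors of shape [frames, channels, height, width] (hypothetical shapes).
condition = torch.randn(49, 3, 480, 720)
target = torch.randn(49, 3, 480, 720)

# Temporal concatenation: the condition clip plays first, then the target clip.
composite_time = torch.cat([condition, target], dim=0)    # [98, 3, 480, 720]

# Spatial concatenation: both clips appear side by side in every frame.
composite_space = torch.cat([condition, target], dim=-1)  # [49, 3, 480, 1440]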
For more details, please read our technical report. This is a research project, and for production use we recommend trying more advanced products:
Our environment is identical to CogVideoX's, and you can install it with:
pip install -r requirements.txt
Download the LoRA checkpoint from Hugging Face and point the model path variable to it.
We provide scene and human LoRAs, which generate the cases with the different prompt types shown in the technical report.
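For example, the checkpoint can be fetched with huggingface_hub (the repository ID below is a placeholder; substitute the actual LoRA repository):

from huggingface_hub import snapshot_download

# Placeholder repo ID; replace with the actual LoRA repository on Hugging Face.
lora_path = snapshot_download("your-org/in-context-lora-cogvideox")
print(lora_path)  # use this directory as the LoRA path below and in infer.py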
After setting the LoRA path, you can run the minimal code below or refer to infer.py, which generates the example cases.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

lora_path = "/path/to/lora"  # directory containing the downloaded LoRA checkpoint
prompt = "..."  # your composite-task prompt; see the technical report for the prompt templates

# Load the base CogVideoX-5B pipeline in bfloat16.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
)

# Attach and activate the in-context LoRA weights.
pipe.load_lora_weights(lora_path, adapter_name="cogvideox-lora")
pipe.set_adapters(["cogvideox-lora"], [1.0])

# Memory-saving options so the pipeline fits on a single consumer GPU.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
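If you want to inspect the scenes of a composite output separately, a hypothetical post-processing step is to split the frame list, assuming two equal-length scenes concatenated in time (the actual split point depends on the LoRA and prompt):

# Hypothetical post-processing: split a two-scene temporal composite at its midpoint.
midpoint = len(video) // 2
export_to_video(video[:midpoint], "scene_1.mp4", fps=8)
export_to_video(video[midpoint:], "scene_2.mp4", fps=8)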
You can train your own LoRA for control tasks with the fine-tuning scripts, and our experiments can be reproduced by simply running the training script:
sh finetune.sh
Before training, you should prepare:
- Video-text pair data in the required format;
- A prompt template that combines the different video clips (see the illustrative sketch after this list).
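As an illustration, a prompt template that merges per-clip captions into one composite prompt could look like this (an illustrative sketch; the exact templates we use are listed in the technical report):

# Illustrative template: a task description followed by per-scene captions.
TEMPLATE = "This video contains {n} scenes of the same character. {scenes}"

def build_prompt(scene_captions):
    scenes = " ".join(
        f"[Scene {i + 1}] {caption}" for i, caption in enumerate(scene_captions)
    )
    return TEMPLATE.format(n=len(scene_captions), scenes=scenes)

print(build_prompt([
    "A woman walks through a rainy street at night.",
    "The same woman enters a dimly lit cafe and orders a coffee.",
]))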
The codebase is based on the awesome IC-LoRA, CogVideoX, CogVideoX-Factory, and diffusers repos.