Open
Description
Thanks for the code. I want to know how we can integrate this with the temporal attention based models like tune-a-video to generate videos. As the svdiff will be trained for 2D image while in video generation we have an additional dimension for number of frames.
Metadata
Assignees
Labels
No labels