Using SVDiff with temporal attention models

Thanks for the code. I would like to know how to integrate this with temporal-attention-based models such as Tune-A-Video to generate videos. SVDiff is trained on 2D images, whereas video generation adds an extra dimension for the number of frames.
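To make the question concrete, here is a rough NumPy sketch of what I have in mind; it is only my guess, not something taken from this repo. It assumes that a temporal attention projection is an ordinary 2D weight matrix, so an SVDiff-style spectral shift (`U diag(relu(s + delta)) V^T`) could be applied to it unchanged, and that the frame axis is handled by reshaping the features so attention runs across frames. The names `spectral_shift`, `to_temporal_tokens`, and `w_q` are hypothetical.

```python
import numpy as np

def spectral_shift(weight, delta):
    """Rebuild a weight as U diag(relu(s + delta)) V^T (SVDiff-style shift)."""
    out_shape = weight.shape
    mat = weight.reshape(out_shape[0], -1)       # flatten conv kernels to 2D
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    shifted = np.maximum(s + delta, 0.0)         # keep singular values non-negative
    return ((u * shifted) @ vt).reshape(out_shape)

def to_temporal_tokens(video_feats):
    """(batch, frames, tokens, channels) -> (batch * tokens, frames, channels)."""
    b, f, n, c = video_feats.shape
    # Move the spatial-token axis next to batch so the sequence axis is frames.
    return video_feats.transpose(0, 2, 1, 3).reshape(b * n, f, c)

rng = np.random.default_rng(0)
w_q = rng.standard_normal((8, 8))   # hypothetical temporal-attention query projection
delta = np.zeros(8)                 # the only trainable part; zero means "no change"
w_q_shifted = spectral_shift(w_q, delta)

feats = rng.standard_normal((2, 4, 16, 8))   # 2 videos, 4 frames, 16 tokens, 8 channels
tokens = to_temporal_tokens(feats)           # each row now attends across frames
queries = tokens @ w_q_shifted.T
```

If this is roughly right, the extra frame dimension would only change how the features are reshaped, not how the singular values are shifted. Is that the intended way to combine the two?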