Fine-tuning or training from scratch for longer outputs #19
My model targets high quality within a ten-second limit, so it may not be well suited to long-form music modeling. If you used my model anyway, I estimate the maximum trainable duration would be about 2–3 minutes with batch size 1 on an 80 GB A100, and training on 8 such GPUs might take effect after about a week. That said, technology keeps advancing; my work is meant to provide insights for the music-generation field and to serve small-scale scenarios (sound effects, BGM, short-video soundtracks, etc.), rather than to be a truly realistic, industrial-grade music generation model.

I recommend taking a look at the Stable Audio papers. Although their open-source version can only generate under one minute, they have a paper training on nearly five-minute audio, and I think you could achieve your goal with their techniques. Note that, with current technology, a ~5-minute music generation model is usually a language-model architecture rather than a diffusion architecture.

If you care more about audio length than about the quality ceiling, I recommend the training paradigms of MusicGen or MusicLM, which have more potential for generating longer audio. If you are research-oriented, Stability AI recently open-sourced Stable Codec; combining it with a large language model could enable long-duration music generation in the future (Suno or Udio may follow a similar technical paradigm). If you want to train with lyrics generation, I recommend DiffSinger and Text-to-Song.

Of course, if you are interested in my paper, the label-optimization and quality-aware training strategies there are not tied to any particular modeling method. You are free to combine these techniques with music models that can generate longer durations.
Links mentioned:
- MusicLM training code: https://github.com/lucidrains/musiclm-pytorch
- MusicGen training code: https://github.com/facebookresearch/audiocraft/blob/main/docs/MUSICGEN.md
- Stable Audio training code: https://github.com/Stability-AI/stable-audio-tools
- Stable Audio 2 paper: https://stability.ai/news/stable-audio-2-0
- Stable Codec: https://github.com/Stability-AI/stable-codec
- DiffSinger: https://github.com/MoonInTheRiver/DiffSinger
- Text-to-Song: https://arxiv.org/abs/2404.09313
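To get a feel for why five-minute generation pushes toward language-model architectures over codec tokens, here is a rough back-of-the-envelope sketch. The 50 Hz frame rate and 4 residual codebooks are illustrative assumptions for a generic neural audio codec, not figures from any specific model mentioned above:

```python
# Rough sketch: how many discrete tokens a codec-based language model
# must handle for a given audio duration. The frame rate and codebook
# count below are illustrative assumptions, not specs of a real codec.

FRAME_RATE_HZ = 50   # assumed codec frame rate (frames per second)
NUM_CODEBOOKS = 4    # assumed residual codebooks per frame

def token_count(duration_s: float) -> int:
    """Total discrete tokens needed for `duration_s` seconds of audio."""
    return int(duration_s * FRAME_RATE_HZ * NUM_CODEBOOKS)

for label, seconds in [("10 s clip", 10), ("5 min song", 300)]:
    print(f"{label}: {token_count(seconds):,} tokens")
```

Under these assumptions a five-minute song is a 60,000-token sequence, which is routine territory for long-context language models but far beyond what diffusion models typically denoise in one pass.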
Hi,
First off, thank you for open-sourcing this amazing work!
I wanted to ask if it's possible to fine-tune your checkpoint to work with longer audio using my own music dataset, which consists of songs averaging 5 minutes in length.
If not, is there a way to train from scratch for generating such long audio?