Fine-tuning or train from scratch for longer outputs #19

Open
Toolblox-live opened this issue Dec 11, 2024 · 1 comment

Comments

@Toolblox-live

Hi,

First off, thank you for open-sourcing this amazing work!

I wanted to ask if it's possible to fine-tune your checkpoint to work with longer audio using my own music dataset, which consists of songs averaging 5 minutes in length.

If not, is there a way to train from scratch for generating such long audio?

@ivcylc
Owner

ivcylc commented Dec 11, 2024

My model targets the best achievable quality for clips of about ten seconds, so I don't think it is well suited to long-form music modeling.
(This is mainly because the VAE I use comes from AudioLDM2, a model for generating sound effects of up to ten seconds. Its compression rate is very low, so training on long audio requires a huge amount of CUDA memory, which makes it impractical for my model in terms of both efficiency and resources.)
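
To give a rough sense of the scaling, here is a minimal back-of-the-envelope sketch. All numbers in it are assumptions for illustration (16 kHz audio, a mel hop of 160 samples, 64 mel bins, and a 4x downsampling VAE, which are the values commonly reported for the AudioLDM family), not values read from this repository's config, and `latent_positions` is a hypothetical helper. The quadratic ratio only applies if the backbone uses full self-attention over all latent positions.

```python
import math


def latent_positions(duration_s, sample_rate=16_000, hop_length=160,
                     n_mels=64, downsample=4):
    """Count latent time-frequency positions for one clip.

    Every default here is an assumption for illustration, not a value
    taken from the repository.
    """
    frames = math.ceil(duration_s * sample_rate / hop_length)  # mel frames
    time_steps = math.ceil(frames / downsample)                # latent time axis
    freq_steps = n_mels // downsample                          # latent frequency axis
    return time_steps * freq_steps


baseline = latent_positions(10)
for seconds in (10, 150, 300):
    n = latent_positions(seconds)
    # Full self-attention compares every position with every other one,
    # so its cost grows with the square of the position count.
    ratio = (n / baseline) ** 2
    print(f"{seconds:>4d} s: {n:>7d} latent positions, "
          f"{ratio:.0f}x attention cost relative to 10 s")
```

Under these assumptions the number of latent positions grows linearly with clip length, which is why a low-compression VAE becomes expensive well before five minutes.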

If you do use my model, I think the longest audio clip it can train on is about 2-3 minutes, with a batch size of 1 on an 80 GB A100 GPU, and with 8 such GPUs it may start to produce useful results after about a week of training. However, the technology keeps developing, and my work is more about providing insights for the field of music generation, or serving small-scale use cases (sound effects, BGM, short video soundtracks, etc.), than about being a truly industrial-grade music generation model.

I recommend taking a look at the Stable Audio paper. Although their open-source version can only generate less than one minute of audio, they have published a paper on a model trained on audio nearly five minutes long. I think you can achieve your goal with their approach. By the way, with current technology, a model that generates music about five minutes long is usually a language model architecture rather than a diffusion model architecture.

If you care more about the duration of the audio than about peak audio quality, I recommend looking at the training paradigm of MusicGen or MusicLM, which have more potential for generating longer audio. If you are research-oriented, Stability AI recently open-sourced Stable Codec. I think combining it with a large language model has the potential to generate long-duration music in the future (perhaps Suno and Udio follow this kind of paradigm as well).
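
As a rough illustration of why the codec plus language model route scales to long audio, here is a small sketch of the sequence length involved. The 50 Hz frame rate and 4 codebooks are assumptions based on the configuration reported for MusicGen's 32 kHz tokenizer; other codecs, including Stable Codec, use different settings. `codec_tokens` is a hypothetical helper, not part of any library.

```python
def codec_tokens(duration_s, frame_rate_hz=50, n_codebooks=4):
    """Return (time steps, total discrete tokens) for one clip.

    frame_rate_hz and n_codebooks are assumed values for illustration.
    """
    steps = round(duration_s * frame_rate_hz)  # one step per codec frame
    return steps, steps * n_codebooks          # each step carries n_codebooks tokens


for seconds in (10, 60, 300):
    steps, tokens = codec_tokens(seconds)
    print(f"{seconds:>4d} s: {steps:>6d} time steps, {tokens:>6d} tokens")
```

With these assumed settings, a five-minute song is a sequence of 15,000 time steps, which is long but within the range that long-context language models are designed to handle.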

If you want to train a model that also generates sung lyrics, I recommend DiffSinger and Text-to-Song.

Of course, if you are interested in my paper, its label optimization strategy and quality-aware training strategy are not tied to any particular modeling method. You are free to combine these techniques with music models that can generate longer audio.

Links mentioned:

MusicLM training code: https://github.com/lucidrains/musiclm-pytorch

MusicGen training code: https://github.com/facebookresearch/audiocraft/blob/main/docs/MUSICGEN.md

Stable audio training code: https://github.com/Stability-AI/stable-audio-tools

Stable audio 2 paper: https://stability.ai/news/stable-audio-2-0

Stable codec: https://github.com/Stability-AI/stable-codec

DiffSinger: https://github.com/MoonInTheRiver/DiffSinger

Text-to-song: https://arxiv.org/abs/2404.09313
