Fine-tuning or training from scratch for longer outputs #19
My model targets high quality within a ten-second limit, so it may not be well suited to long-form music modeling. If you used my model anyway, I estimate the maximum trainable duration would be about 2–3 minutes with batch size 1 on an 80 GB A100, and training on 8 such GPUs might take effect after about a week. That said, technology keeps advancing; my work is meant to provide insights for the music-generation field and to serve small-scale scenarios (sound effects, BGM, short-video soundtracks, etc.), rather than to be a truly realistic, industrial-grade music generation model.

I recommend taking a look at the Stable Audio papers. Although their open-source version can only generate under one minute, they have a paper training on nearly five-minute audio, and I think you could achieve your goal with their techniques. Note that, with current technology, a ~5-minute music generation model is usually a language-model architecture rather than a diffusion architecture.

If you care more about audio length than about the quality ceiling, I recommend the training paradigms of MusicGen or MusicLM, which have more potential for generating longer audio. If you are research-oriented, Stability AI recently open-sourced Stable Codec; combining it with a large language model could enable long-duration music generation in the future (Suno or Udio may follow a similar technical paradigm). If you want to train with lyrics generation, I recommend DiffSinger and Text-to-Song.

Of course, if you are interested in my paper, the label-optimization and quality-aware training strategies there are not tied to any particular modeling method. You are free to combine these techniques with music models that can generate longer durations.
Links mentioned:
- MusicLM training code: https://github.com/lucidrains/musiclm-pytorch
- MusicGen training code: https://github.com/facebookresearch/audiocraft/blob/main/docs/MUSICGEN.md
- Stable Audio training code: https://github.com/Stability-AI/stable-audio-tools
- Stable Audio 2 paper: https://stability.ai/news/stable-audio-2-0
- Stable Codec: https://github.com/Stability-AI/stable-codec
- DiffSinger: https://github.com/MoonInTheRiver/DiffSinger
- Text-to-Song: https://arxiv.org/abs/2404.09313
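To get a feel for why five-minute generation pushes toward language-model architectures over codec tokens, here is a rough back-of-the-envelope sketch. The 50 Hz frame rate and 4 residual codebooks are illustrative assumptions for a generic neural audio codec, not figures from any specific model mentioned above:

```python
# Rough sketch: how many discrete tokens a codec-based language model
# must handle for a given audio duration. The frame rate and codebook
# count below are illustrative assumptions, not specs of a real codec.

FRAME_RATE_HZ = 50   # assumed codec frame rate (frames per second)
NUM_CODEBOOKS = 4    # assumed residual codebooks per frame

def token_count(duration_s: float) -> int:
    """Total discrete tokens needed for `duration_s` seconds of audio."""
    return int(duration_s * FRAME_RATE_HZ * NUM_CODEBOOKS)

for label, seconds in [("10 s clip", 10), ("5 min song", 300)]:
    print(f"{label}: {token_count(seconds):,} tokens")
```

Under these assumptions a five-minute song is a 60,000-token sequence, which is routine territory for long-context language models but far beyond what diffusion models typically denoise in one pass.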
Hi,
First off, thank you for open-sourcing this amazing work!
I wanted to ask if it's possible to fine-tune your checkpoint to work with longer audio using my own music dataset, which consists of songs averaging 5 minutes in length.
If not, is there a way to train from scratch for generating such long audio?