This folder contains scripts for training a Denoising Diffusion Probabilistic Model (DDPM) on the AudioMNIST dataset, which contains recordings of spoken English digits in a variety of voices and accents.
Reference paper: https://arxiv.org/pdf/2006.11239.pdf
Denoising Diffusion Probabilistic Models (DDPMs) are a class of generative models built around a key idea: train a neural network that, given a sample with added noise, predicts that noise. At inference time, the model generates a sample by starting from pure noise and denoising it in a stepwise manner. The method is inspired by Langevin dynamics.
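As a concrete illustration of the training objective from the paper above (a minimal PyTorch sketch, not this recipe's actual code; `model` is a placeholder for any noise-prediction network):

```python
import torch
import torch.nn.functional as F

T = 1000                                # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)   # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def ddpm_training_loss(model, x0):
    """Corrupt a clean batch x0 at a random timestep t, then train
    the network to predict the added noise (simplified DDPM loss)."""
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # forward process
    return F.mse_loss(model(x_t, t), noise)
```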
Generation can be unconditioned, where the model generates an arbitrary sample from the target distribution, or conditioned, where it generates samples given a class label or a prompt.
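One common way to condition the generation is to embed the label and feed it to the noise predictor together with the timestep. The toy module below illustrates the idea; the layer sizes and architecture are made up for the sketch and are not those used in this recipe:

```python
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """Toy conditional noise predictor: the class label (e.g., a digit
    or speaker ID) and the timestep are embedded and concatenated with
    the noisy input. Sizes are illustrative only."""

    def __init__(self, feat_dim=80, num_classes=10, num_steps=1000, emb_dim=64):
        super().__init__()
        self.label_emb = nn.Embedding(num_classes, emb_dim)
        self.step_emb = nn.Embedding(num_steps, emb_dim)
        self.net = nn.Sequential(
            nn.Linear(feat_dim + emb_dim, 256),
            nn.SiLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, x_t, t, label):
        cond = self.step_emb(t) + self.label_emb(label)  # (B, emb_dim)
        return self.net(torch.cat([x_t, cond], dim=-1))  # predicted noise
```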
Diffusion models can operate in the original sample space, but for high-dimensional samples, such as full-resolution images, inference can be too slow. A common way to address this is latent diffusion (https://arxiv.org/abs/2112.10752), where the diffusion process runs in a compact latent space learned by an autoencoder. This recipe supports both sample-space (spectrogram-space) and latent diffusion.
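With latent diffusion, the reverse (denoising) process runs entirely in latent space and only decodes at the end. A hedged sketch of ancestral DDPM sampling in that setting, where `denoiser` and `decoder` are placeholders for an arbitrary latent noise predictor and autoencoder decoder (not this recipe's modules):

```python
import torch

@torch.no_grad()
def sample(denoiser, decoder, shape, T=1000):
    """Start from pure noise in latent space, denoise step by step,
    then decode the final latent into a spectrogram."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    z = torch.randn(shape)  # z_T ~ N(0, I)
    for i in reversed(range(T)):
        t = torch.full((shape[0],), i, dtype=torch.long)
        eps = denoiser(z, t)  # predicted noise
        mean = (z - betas[i] / (1.0 - alphas_cumprod[i]).sqrt() * eps) / alphas[i].sqrt()
        noise = torch.randn_like(z) if i > 0 else torch.zeros_like(z)
        z = mean + betas[i].sqrt() * noise  # sigma_t^2 = beta_t
    return decoder(z)  # latent -> spectrogram
```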
This recipe implements a basic DDPM to generate speech samples using the AudioMNIST dataset. It can be used to train an unconditioned model, a model conditioned on the speaker, or a model conditioned on the digit label.
To train the unconditioned model, run the following command:
python train.py hparams/train.yaml --data_folder=your/data/folder
The required data will be automatically downloaded into the specified data folder. Keep in mind that AudioMNIST is a relatively small dataset, which makes it challenging to train a diffusion model capable of generating very high-quality samples. Nonetheless, the generated samples should remain intelligible and sound like digits.
To train the model with speaker conditioning, execute the following command:
python train.py hparams/train.yaml --speaker_conditioned true --data_folder=your/data/folder
In this case, the model should generate intelligible digits. When the generation process is conditioned on the same speaker, the digits sound as if they were produced by a speaker with the same (or very similar) voice characteristics.
For a model focused on digit conditioning, useful for a simplified Text-to-Speech (TTS) use case, run the following command:
python train.py hparams/train.yaml --digit_conditioned true --data_folder=your/data/folder
Here too, the model should generate intelligible digits. When the generation process is conditioned on a given digit, it should produce that digit (normally spoken in different voices).
To train the latent diffusion model, use the following command:
python train.py hparams/train_latent.yaml --data_folder=your/data/folder
The quality of the generated digits is lower with latent diffusion; however, the generated signals should still sound like digits.
After each training epoch, the training scripts save generated samples to the <output_folder>/samples directory.
The output folder containing the generated samples, model checkpoints and training logs for all the aforementioned experiments can be found here: https://www.dropbox.com/sh/szpmkp8aok1nquf/AABziohiZ8UhYBJz5TXscu93a?dl=0
For more information about SpeechBrain:
- Website: https://speechbrain.github.io/
- Code: https://github.com/speechbrain/speechbrain/
- HuggingFace: https://huggingface.co/speechbrain/
Please cite SpeechBrain if you use it for your research or business.
@misc{speechbrainV1,
title={Open-Source Conversational AI with SpeechBrain 1.0},
author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
year={2024},
eprint={2407.00463},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
title={{SpeechBrain}: A General-Purpose Speech Toolkit},
author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
year={2021},
eprint={2106.04624},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2106.04624},
}