Denoising Diffusion Probabilistic Model

This folder contains scripts for training a Denoising Diffusion Probabilistic Model (DDPM) generative model on the AudioMNIST dataset, which contains recordings of spoken English digits in a variety of voices and accents.

Reference paper: https://arxiv.org/pdf/2006.11239.pdf

Denoising Diffusion Probabilistic Models (DDPMs) are a class of generative models built around a simple idea: train a neural network that, given a sample with added noise, predicts that noise. During inference, the model generates a sample by starting from pure noise and removing the predicted noise step by step. The method is inspired by Langevin dynamics.
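
As a rough illustration, the training objective can be sketched as follows. This is a minimal, self-contained PyTorch sketch under simplifying assumptions, not this recipe's actual code; `model` stands for any noise-prediction network that takes a noisy batch and a timestep:

    import torch

    T = 1000                                    # number of diffusion steps
    betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule
    alphas_bar = torch.cumprod(1.0 - betas, 0)  # cumulative product of (1 - beta)

    def training_loss(model, x0):
        """One DDPM training step: corrupt x0 at a random timestep t,
        then train the model to recover the injected noise (MSE)."""
        t = torch.randint(0, T, (x0.shape[0],))
        eps = torch.randn_like(x0)                          # the noise to predict
        a = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
        x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps        # q(x_t | x0)
        return torch.nn.functional.mse_loss(model(x_t, t), eps)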

Generation can be unconditioned, where the model will generate an arbitrary sample from the target distribution, or conditioned, where it will generate samples given a class label or a prompt.

Diffusion models can operate in the original sample space, but for high-dimensional samples, such as full-resolution images, this can be too slow at inference time. A common approach to addressing this is latent diffusion (https://arxiv.org/abs/2112.10752). This recipe supports both sample-space (spectrogram-space) and latent diffusion.
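
For intuition, latent diffusion moves the same noise-prediction loss into the compressed latent space of an autoencoder. The sketch below is illustrative only; `encoder`, `decoder`, and `denoiser` are hypothetical placeholders, not the modules used in this recipe:

    import torch

    def latent_training_loss(encoder, denoiser, x0, alphas_bar):
        """Same noise-prediction loss as a plain DDPM, but computed on the
        latent code z0 = encoder(x0) instead of the raw spectrogram."""
        z0 = encoder(x0)                          # compress to a small latent space
        t = torch.randint(0, alphas_bar.shape[0], (z0.shape[0],))
        eps = torch.randn_like(z0)
        a = alphas_bar[t].view(-1, *([1] * (z0.dim() - 1)))
        z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * eps
        return torch.nn.functional.mse_loss(denoiser(z_t, t), eps)

    # At inference, the reverse process runs entirely in the latent space, and
    # the final latent is mapped back to a spectrogram: x = decoder(z).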

This recipe implements a basic DDPM to generate speech samples using the AudioMNIST dataset. It can be used to train an unconditioned model, a model conditioned on the speaker identity, or a model conditioned on the digit label.

Training

Unconditioned Model

To train the unconditioned model, run the following command:

python train.py hparams/train.yaml --data_folder=your/data/folder

The required data will be automatically downloaded into the specified data folder. Keep in mind that AudioMNIST is a relatively small dataset, which may pose challenges in training a diffusion model capable of generating extremely high-quality samples. Nonetheless, the generated samples should remain intelligible and sound like spoken digits.
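
For reference, generating samples follows the DDPM ancestral sampling loop, sketched below under simplifying assumptions (variance choice sigma_t^2 = beta_t; `model` is again a placeholder noise predictor, not this recipe's code):

    import torch

    @torch.no_grad()
    def sample(model, shape, betas):
        """DDPM ancestral sampling: start from pure noise and iteratively
        remove the model's predicted noise, one timestep at a time."""
        alphas = 1.0 - betas
        alphas_bar = torch.cumprod(alphas, 0)
        x = torch.randn(shape)                                  # x_T ~ N(0, I)
        for t in reversed(range(len(betas))):
            t_batch = torch.full((shape[0],), t, dtype=torch.long)
            eps = model(x, t_batch)                             # predicted noise
            # posterior mean of x_{t-1} given x_t and the predicted noise
            x = (x - betas[t] / (1.0 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
            if t > 0:
                x = x + betas[t].sqrt() * torch.randn_like(x)   # sigma_t = sqrt(beta_t)
        return x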

Speaker-Conditioned Model

To train the model with speaker conditioning, execute the following command:

python train.py hparams/train.yaml --speaker_conditioned true --data_folder=your/data/folder

In this case, the model should generate intelligible digits. When generation is conditioned on the same speaker, the digits should sound as if produced by a speaker with the same (or very similar) voice characteristics.

Digit-Conditioned Model (Simplified TTS)

For a model focused on digit conditioning, useful for a simplified Text-to-Speech (TTS) use case, run the following command:

python train.py hparams/train.yaml --digit_conditioned true --data_folder=your/data/folder

Here too, the model should generate intelligible digits. When generation is conditioned on the same digit, the model should produce that digit, typically spoken by different speakers.
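
Conceptually, both conditioning modes can be implemented by embedding the conditioning label and injecting it into the denoising network alongside the timestep embedding. The following is only a sketch of that idea; `ConditionedDenoiser`, `backbone`, and the argument names are hypothetical, and the recipe's actual conditioning mechanism may differ:

    import torch
    import torch.nn as nn

    class ConditionedDenoiser(nn.Module):
        """Wraps a noise-prediction backbone (e.g., a UNet) and injects a
        class label by adding its embedding to the timestep embedding."""

        def __init__(self, backbone, num_classes, emb_dim):
            super().__init__()
            self.backbone = backbone
            self.label_emb = nn.Embedding(num_classes, emb_dim)

        def forward(self, x_t, t_emb, label):
            cond = t_emb + self.label_emb(label)  # combine label + timestep info
            return self.backbone(x_t, cond)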

Latent Diffusion Model

To train the latent diffusion model, use the following command:

python train.py hparams/train_latent.yaml --data_folder=your/data/folder

The quality of the generated digits is lower with latent diffusion. The generated signals should, however, still sound like spoken digits.

Samples, Checkpoints, and Training Logs

After each training epoch, the training scripts save generated samples to the <output_folder>/samples directory.

The output folder containing the generated samples, model checkpoints and training logs for all the aforementioned experiments can be found here: https://www.dropbox.com/sh/szpmkp8aok1nquf/AABziohiZ8UhYBJz5TXscu93a?dl=0

About SpeechBrain

SpeechBrain is an open-source, all-in-one conversational AI toolkit based on PyTorch. Website: https://speechbrain.github.io

Citing SpeechBrain

Please cite SpeechBrain if you use it for your research or business.

@misc{speechbrainV1,
  title={Open-Source Conversational AI with SpeechBrain 1.0},
  author={Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca Della Libera and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Gaelle Laperriere and Mickael Rouvier and Renato De Mori and Yannick Esteve},
  year={2024},
  eprint={2407.00463},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.00463},
}
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}