haoheliu/AudioLDM-training-finetuning

🔊 AudioLDM training, finetuning, inference and evaluation

Prepare Python running environment

# Create conda environment
conda create -n audioldm_train python=3.10
conda activate audioldm_train
# Clone the repo
git clone https://github.com/haoheliu/AudioLDM-training-finetuning.git && cd AudioLDM-training-finetuning
# Install running environment
pip install poetry
poetry install

Download checkpoints and dataset

  1. Download checkpoints (checkpoints.tar) from Google Drive (or from Zenodo): link. The checkpoints include the pretrained VAE, AudioMAE, CLAP, 16kHz HiFiGAN, and 48kHz HiFiGAN.
  2. Uncompress the checkpoint tar file and place the content into data/checkpoints/
  3. Download the preprocessed AudioCaps (dataset.tar) from Google Drive (or from Zenodo): link
  4. Similarly, uncompress the dataset tar file and place the content into data/dataset
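Steps 2 and 4 above can be sketched in Python as follows. This is a minimal sketch, not part of the repository: it assumes checkpoints.tar and dataset.tar sit in the current directory, and the helper name extract_archive is hypothetical. If an archive wraps its content in its own top-level folder, you may need to move the content up one level afterwards so that it sits directly in data/checkpoints or data/dataset.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Extract a tar archive into target_dir and return the top-level entry names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        tar.extractall(target)
    return sorted(entry.name for entry in target.iterdir())


if __name__ == "__main__":
    # Assumed archive locations: the current working directory.
    jobs = [("checkpoints.tar", "data/checkpoints"), ("dataset.tar", "data/dataset")]
    for archive, target in jobs:
        if Path(archive).exists():
            print(target, "->", extract_archive(archive, target))
        else:
            print(f"{archive} not found in the current directory; skipping")
```

Printing the top-level entries makes it easy to spot an unexpected extra folder level before running the validation script.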

To double-check that the dataset and checkpoints are ready, run the following command:

python3 tests/validate_dataset_checkpoint.py

If the structure is incorrect or partly missing, you will see an error message.

Play around with the code

Train the AudioLDM model

# Train the AudioLDM (latent diffusion part)
python3 audioldm_train/train/latent_diffusion.py -c audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_original.yaml

# Train the VAE (Optional)
# python3 audioldm_train/train/autoencoder.py -c audioldm_train/config/2023_11_13_vae_autoencoder/16k_64.yaml

The program will perform generation on the evaluation set every 5 epochs of training. After obtaining the audio generation folders (named val_), you can proceed to the next step for model evaluation.

Finetuning of the pretrained model

You can finetune from one of two pretrained checkpoints. First download the one you want (e.g., using wget):

  1. Medium size AudioLDM: https://zenodo.org/records/7884686/files/audioldm-m-full.ckpt
  2. Small size AudioLDM: https://zenodo.org/records/7884686/files/audioldm-s-full

Place the checkpoint in the data/checkpoints folder
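If you prefer to script the download instead of using wget, the following is a minimal sketch using only the Python standard library. The helper names destination_for and download_checkpoint are hypothetical and not part of this repository; the URLs are the two listed above. The files are large, so wget with resume support may still be the more practical choice.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# The two checkpoint URLs listed above.
CHECKPOINT_URLS = {
    "medium": "https://zenodo.org/records/7884686/files/audioldm-m-full.ckpt",
    "small": "https://zenodo.org/records/7884686/files/audioldm-s-full",
}


def destination_for(url, folder="data/checkpoints"):
    """Return the local path for a URL, keeping the remote file name."""
    return Path(folder) / Path(urlparse(url).path).name


def download_checkpoint(size, folder="data/checkpoints"):
    """Download the 'medium' or 'small' checkpoint unless it already exists."""
    url = CHECKPOINT_URLS[size]
    dest = destination_for(url, folder)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest
```

For example, download_checkpoint("small") would store the file as data/checkpoints/audioldm-s-full, the path expected by the finetuning command for the small model.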

Then perform finetuning with one of the following commands:

# Medium size AudioLDM
python3 audioldm_train/train/latent_diffusion.py -c audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_original_medium.yaml --reload_from_ckpt data/checkpoints/audioldm-m-full.ckpt

# Small size AudioLDM
python3 audioldm_train/train/latent_diffusion.py -c audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_original.yaml --reload_from_ckpt data/checkpoints/audioldm-s-full

You can specify your own dataset following the same format as the provided AudioCaps dataset.

Note that the pretrained AudioLDM checkpoints are released under the CC BY-NC 4.0 license, which does not permit commercial use.

Evaluate the model output

Automatically evaluate each folder of generated audio:

# Evaluate all existing generated folders
python3 audioldm_train/eval.py --log_path all

# Evaluate only a specific experiment folder
python3 audioldm_train/eval.py --log_path <path-to-the-experiment-folder>

The evaluation result will be saved in a JSON file at the same level as the audio folder.
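To read the results back programmatically, a sketch along the following lines may help. It relies only on the statement above that the JSON file sits next to the audio folder; the helper name load_results is hypothetical, and the file names and keys inside the JSON are not assumed. Note that it loads every JSON file found at that level, not only evaluation results.

```python
import json
from pathlib import Path


def load_results(audio_folder):
    """Load every JSON file stored at the same level as audio_folder."""
    parent = Path(audio_folder).resolve().parent
    results = {}
    for json_file in sorted(parent.glob("*.json")):
        with open(json_file, encoding="utf-8") as handle:
            results[json_file.name] = json.load(handle)
    return results
```

Calling load_results with the path to one generated audio folder returns a dictionary keyed by JSON file name, which you can then print or compare across experiments.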

Inference with the pretrained model

Use the following syntax:

python3 audioldm_train/infer.py --config_yaml <The-path-to-the-same-config-file-you-use-for-training> --list_inference <the-filelist-you-want-to-generate>

For example:

# Please make sure you have trained the model using audioldm_crossattn_flant5.yaml
# The generated audio will be saved in the same log folder as the trained model.
python3 audioldm_train/infer.py --config_yaml audioldm_train/config/2023_08_23_reproduce_audioldm/audioldm_crossattn_flant5.yaml --list_inference tests/captionlist/inference_test.lst

The generated audio will be named after the caption by default. If you would like to specify the filename, please check the format of tests/captionlist/inference_test_with_filename.lst.
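A caption list for --list_inference can be generated with a few lines of Python. The sketch below assumes the plain list format is one caption per line; that is an assumption, so compare the output with tests/captionlist/inference_test.lst before relying on it. The helper name write_caption_list and the output path are hypothetical. It does not cover the variant with explicit filenames.

```python
from pathlib import Path


def write_caption_list(captions, list_path):
    """Write captions to list_path, one per line (assumed format), and return them."""
    # Collapse internal whitespace and newlines so each caption stays on one line.
    cleaned = [" ".join(caption.split()) for caption in captions]
    cleaned = [caption for caption in cleaned if caption]
    path = Path(list_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


if __name__ == "__main__":
    # Hypothetical output file and example captions.
    write_caption_list(["A dog barking in the distance", "Rain falling on a roof"],
                       "my_captions.lst")
```

The resulting file can then be passed to audioldm_train/infer.py through --list_inference.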

This repo only supports inference with a model you trained yourself. If you want to use the pretrained model directly, please use these two repos: AudioLDM and AudioLDM2.

Train the model using your own dataset

Super easy, simply follow these steps:

  1. Prepare the metadata with the same format as the provided AudioCaps dataset.
  2. Register the metadata of your dataset in data/dataset/metadata/dataset_root.json
  3. Use your dataset in the YAML file.

You do not need to resample or pre-segment the audio files. The dataloader will handle most of that work.
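Since the steps above depend on matching the format of the provided files, it can help to print their structure before writing your own metadata. The sketch below makes no assumption about the keys: it simply summarizes whatever JSON it is given. The helper name describe is hypothetical, and the path is the dataset_root.json location mentioned in step 2.

```python
import json
from pathlib import Path


def describe(value, max_depth=3):
    """Return a compact description of the shape of a parsed JSON value."""
    if isinstance(value, dict):
        if max_depth == 0:
            return "dict"
        return {key: describe(item, max_depth - 1) for key, item in value.items()}
    if isinstance(value, list):
        if not value or max_depth == 0:
            return f"list[{len(value)}]"
        # Describe only the first element; entries are assumed to share one shape.
        return [describe(value[0], max_depth - 1), f"... {len(value)} item(s)"]
    return type(value).__name__


if __name__ == "__main__":
    root_file = Path("data/dataset/metadata/dataset_root.json")
    if root_file.exists():
        content = json.loads(root_file.read_text(encoding="utf-8"))
        print(json.dumps(describe(content), indent=2))
    else:
        print(f"{root_file} not found; download the dataset first")
```

Running the same function on one of the provided AudioCaps metadata files shows which fields each entry carries, which you can then mirror for your own dataset.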

Cite this work

If you found this tool useful, please consider citing

@article{audioldm2-2024taslp,
  author={Liu, Haohe and Yuan, Yi and Liu, Xubo and Mei, Xinhao and Kong, Qiuqiang and Tian, Qiao and Wang, Yuping and Wang, Wenwu and Wang, Yuxuan and Plumbley, Mark D.},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, 
  title={AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining}, 
  year={2024},
  volume={32},
  pages={2871-2883},
  doi={10.1109/TASLP.2024.3399607}
}

@article{liu2023audioldm,
  title={{AudioLDM}: Text-to-Audio Generation with Latent Diffusion Models},
  author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
  journal={Proceedings of the International Conference on Machine Learning},
  year={2023},
  pages={21450-21474}
}

Acknowledgement

We greatly appreciate the open-sourcing of the following code bases. Open-source code bases are the real-world infinity stones 💎!

This research was partly supported by the British Broadcasting Corporation Research and Development, Engineering and Physical Sciences Research Council (EPSRC) Grant EP/T019751/1 "AI for Sound", and a PhD scholarship from the Centre for Vision, Speech and Signal Processing (CVSSP), Faculty of Engineering and Physical Science (FEPS), University of Surrey. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) license to any Author Accepted Manuscript version arising. We would like to thank Tang Li, Ke Chen, Yusong Wu, Zehua Chen and Jinhua Liang for their support and discussions.