This is a TensorFlow implementation of the multispeaker TTS network introduced in the paper *From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint*. The repository also contains the deep speaker verification model that serves as the feedback network in the multispeaker TTS model. Synthesized samples are available online.
```
@inproceedings{Cai2020,
  author={Zexin Cai and Chuxiong Zhang and Ming Li},
  title={{From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint}},
  year=2020,
  booktitle={Proc. Interspeech 2020}
}
```
The speaker verification model is located in the `deep_speaker` directory. By default, it is trained on the VoxCeleb1 and VoxCeleb2 datasets. You can find the file lists in the directory. Hyperparameters are set in `vox12_hparams.py`.

To train the speaker verification model from scratch, prepare the data as listed in the file list and run:

```
CUDA_VISIBLE_DEVICES=0 python train.py
```
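Once trained, the model maps an utterance to a fixed-dimensional speaker embedding (g-vector). As a rough illustration of how such embeddings are used for verification, here is a minimal numpy sketch; `extract_embedding` and `model.predict` are hypothetical stand-ins, not this repository's API:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def extract_embedding(features, model):
    """Hypothetical stand-in: map utterance features to a fixed-size
    speaker embedding (g-vector) with a trained verification model."""
    return model.predict(features)

# Same-speaker pairs should score near 1.0, different speakers lower:
# score = cosine_similarity(extract_embedding(feat_a, model),
#                           extract_embedding(feat_b, model))
# accept = score > threshold  # threshold tuned on a development set
```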
By default, the synthesizer is trained on the VCTK dataset:
- Extract audio features using `process_audio.ipynb`.
- Extract speaker embeddings using the IPython notebook `deep_speaker/get_gvector.ipynb`.
- Train a baseline multispeaker TTS system (see the conditioning sketch after this list):

  ```
  CUDA_VISIBLE_DEVICES=0 python synthesizer_train.py vctk datasets/vctk/synthesizer
  ```

- Feel free to evaluate and synthesize samples with `syn.ipynb` during training.
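For intuition, the sketch below shows the common SV2TTS-style way a fixed speaker embedding conditions a Tacotron-like synthesizer: the g-vector is tiled across time and concatenated to the text-encoder outputs. All shapes and dimensions here are illustrative assumptions, not taken from this repository:

```python
import numpy as np

def condition_on_speaker(encoder_outputs, speaker_embedding):
    """Tile the speaker embedding across time and concatenate it to the
    text-encoder outputs so every decoder step sees the target voice."""
    timesteps = encoder_outputs.shape[0]                # (T, enc_dim)
    tiled = np.tile(speaker_embedding, (timesteps, 1))  # (T, emb_dim)
    return np.concatenate([encoder_outputs, tiled], axis=-1)

enc = np.random.randn(50, 256)   # 50 encoder steps, 256-dim states (assumed)
gvec = np.random.randn(128)      # 128-dim g-vector (dimension assumed)
cond = condition_on_speaker(enc, gvec)
assert cond.shape == (50, 256 + 128)
```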
By default, the vocoder is also trained on the VCTK dataset. This step is straightforward once you have extracted the acoustic features in the previous section (TTS synthesizer). For better performance, use the GTA mel-spectrograms produced by `vocoder_preprocess.py` after synthesizer training has finished.

```
CUDA_VISIBLE_DEVICES=0 python vocoder_train.py -g --syn_dir datasets/vctk/synthesizer vctk datasets/vctk
```
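For context, GTA ("ground-truth aligned") mels come from running the trained synthesizer with teacher forcing, so each predicted frame is conditioned on the previous ground-truth frame and stays time-aligned with the real audio. A minimal sketch of that idea, with a hypothetical single-step decoder `synthesizer_step`:

```python
import numpy as np

def gta_mels(synthesizer_step, text_states, gt_mels):
    """Ground-truth-aligned synthesis: feed the previous *ground-truth*
    frame back at every decoder step (teacher forcing), so the predicted
    mels stay time-aligned with the real waveform the vocoder will see.
    `synthesizer_step` is a hypothetical single-step decoder."""
    prev = np.zeros_like(gt_mels[0])
    frames = []
    for t in range(len(gt_mels)):
        frames.append(synthesizer_step(text_states, prev))
        prev = gt_mels[t]  # real frame, not the model's own prediction
    return np.stack(frames)
```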
- Set the paths to the two pretrained models (the speaker verification model and the multispeaker synthesizer) by changing the corresponding keys in `hparams.py`.
- Train the model; you can evaluate at any time with `feedback_syn.ipynb` (a sketch of the feedback loss follows this list):

  ```
  CUDA_VISIBLE_DEVICES=0 python fc_synthesizer_train.py
  ```
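As a rough sketch of the feedback constraint itself: the frozen speaker verification network embeds the synthesized spectrogram, and the synthesizer is penalized when that embedding drifts from the target speaker's. The cosine-based penalty and the names below are illustrative assumptions; consult the paper and code for the exact loss:

```python
import numpy as np

def feedback_loss(syn_embedding, ref_embedding):
    """One minus cosine similarity between the embedding of the
    synthesized mel (from the frozen verification network) and the
    target speaker's reference embedding."""
    cos = np.dot(syn_embedding, ref_embedding) / (
        np.linalg.norm(syn_embedding) * np.linalg.norm(ref_embedding))
    return 1.0 - cos

# Combined objective (names illustrative):
#   total_loss = reconstruction_loss + lambda_fb * feedback_loss(e_syn, e_ref)
# where e_syn = SV(synthesized mel), e_ref = SV(reference mel); the SV
# network's weights stay frozen while its gradient flows to the synthesizer.
```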
- Speaker embedding network
- Baseline synthesizer 1 (used as the pretrained model for the feedback training)
- Baseline synthesizer 2
- TTS synthesizer with feedback constraint
- WaveRNN vocoder