This project follows the structure presented in Conversational End-to-End TTS for Voice Agent. Note that the auxiliary encoder is not considered in the current implementation, which focuses only on the effect of the conversational context encoder (the auxiliary encoder and the chat history are independent of each other), so you can implement it on top of this project without any conflict.
Also, only Korean is supported by the current implementation. You can easily extend it to English or other languages; please refer to the Notes section of Emotional TTS.
- Please install the Python dependencies given in `requirements.txt`:
  ```bash
  pip3 install -r requirements.txt
  ```
- Install UKPLab's sentence-transformers for BERT embedding. It is used to extract a sentence (text) embedding for each turn in a dialog.
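The sketch below shows one way the turn-level embeddings could be extracted with sentence-transformers, assuming a multilingual model that covers Korean. The specific model name (`distiluse-base-multilingual-cased-v1`, which happens to output 512-dimensional embeddings, matching the hidden size noted in the Implementation Issues below) is an assumption for illustration and may differ from what the project actually uses.

```python
# Minimal sketch: extract a sentence (text) embedding for each turn in a dialog
# with sentence-transformers. The model name is an assumption; any multilingual
# model covering Korean could be substituted.
from sentence_transformers import SentenceTransformer

# Assumption: 'distiluse-base-multilingual-cased-v1' outputs 512-dim embeddings.
model = SentenceTransformer("distiluse-base-multilingual-cased-v1")

dialog_turns = [
    "안녕하세요, 오늘 날씨 어때요?",  # example turns; replace with real dialog text
    "맑고 따뜻하네요.",
]
embeddings = model.encode(dialog_turns)  # numpy array of shape (num_turns, 512)
print(embeddings.shape)
```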
We are not permitted to share the pre-trained models publicly due to the copyright of the AIHub Multimodal Video AI datasets.
- Follow the same process as in Emotional TTS.
- Unlike general TTS, we need to split the dataset into dialogs to build the conversational TTS. The following command generates new file lists (`train_dialog.txt` and `val_dialog.txt`), filtering out non-sane dialogs (e.g., dialogs with one or more missing turns):
  ```bash
  python3 prepare_dialog.py -p config/AIHub-MMV/preprocess.yaml -m config/AIHub-MMV/model.yaml
  ```
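As a rough illustration of what such a sanity filter does, the sketch below groups utterances by dialog id and keeps only dialogs whose turn indices form a gapless sequence. The `<dialog_id>_<turn_idx>` naming convention and the helper name are hypothetical and not taken from `prepare_dialog.py`.

```python
# Hypothetical sketch of a dialog sanity check: group utterances by dialog id
# and keep only dialogs whose turn indices are complete (no missing turns).
from collections import defaultdict

def filter_complete_dialogs(basenames):
    dialogs = defaultdict(list)
    for name in basenames:
        dialog_id, turn_idx = name.rsplit("_", 1)  # assumed "<dialog_id>_<turn_idx>" naming
        dialogs[dialog_id].append(int(turn_idx))

    kept = {}
    for dialog_id, turns in dialogs.items():
        turns = sorted(turns)
        # A dialog is kept only if its turns form the gapless sequence 0, 1, ..., len-1.
        if turns == list(range(len(turns))):
            kept[dialog_id] = turns
    return kept

# Example: dialog "d002" is dropped because turn 1 is missing.
print(filter_complete_dialogs(["d001_0", "d001_1", "d002_0", "d002_2"]))
```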
Now you have all the prerequisites! Train the model using the following command:
```bash
python3 train.py -p config/AIHub-MMV/preprocess.yaml -m config/AIHub-MMV/model.yaml -t config/AIHub-MMV/train.yaml
```
Only batch inference is supported, synthesizing one dialog at a time. Try
```bash
python3 synthesize.py --source preprocessed_data/AIHub-MMV/val_dialog.txt --restore_step STEP --mode batch -p config/AIHub-MMV/preprocess.yaml -m config/AIHub-MMV/model.yaml -t config/AIHub-MMV/train.yaml
```
to synthesize all dialogs in `preprocessed_data/AIHub-MMV/val_dialog.txt`.
The generated utterances will be saved at `output/result/AIHub-MMV`, dialog by dialog.
Use
```bash
tensorboard --logdir output/log
```
to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.
- A learned speaker embedding is used instead of a one-hot vector of the speaker id. This is motivated by the different speaker settings (e.g., gender, the total number of speakers).
- The utterance-level BERT embedding has a hidden size of 512 instead of 768.
- The total chat history length is 11 (10 turns of history plus the current turn). The turns are aggregated by a simple attention mechanism (named Sequence Level Attention in this project); see the sketch after this list.
- Two stacked BGRU layers are used so that dropout can be applied during context encoding.
- Since the AIHub Multimodal Video AI datasets contain enough training data, the encoder and decoder are not pre-trained on other available TTS datasets.
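The sketch below combines several of the notes above into a minimal PyTorch context encoder: a learned speaker embedding, 512-dimensional utterance embeddings for the 10 history turns plus the current turn, two stacked bidirectional GRUs with dropout, and a simple attention pooling over the turn sequence. All module names, dimensions, and hyperparameters are assumptions for illustration, not the project's exact implementation.

```python
# Minimal sketch of a conversational context encoder following the notes above.
# All names and sizes are assumptions, not the project's exact code.
import torch
import torch.nn as nn

class ConversationalContextEncoder(nn.Module):
    def __init__(self, text_emb_dim=512, n_speakers=10, speaker_dim=64,
                 hidden_dim=128, dropout=0.2):
        super().__init__()
        # Learned speaker embedding instead of a one-hot speaker id.
        self.speaker_emb = nn.Embedding(n_speakers, speaker_dim)
        # 2 stacked bidirectional GRUs; dropout is applied between the two layers.
        self.bgru = nn.GRU(text_emb_dim + speaker_dim, hidden_dim, num_layers=2,
                           batch_first=True, bidirectional=True, dropout=dropout)
        # Sequence-level attention: score each of the 11 turns, then pool them.
        self.attn = nn.Linear(2 * hidden_dim, 1)

    def forward(self, turn_text_emb, turn_speaker_id):
        # turn_text_emb:   (batch, 11, 512)  -- 10 history turns + current turn
        # turn_speaker_id: (batch, 11)       -- speaker id per turn
        spk = self.speaker_emb(turn_speaker_id)        # (batch, 11, speaker_dim)
        x = torch.cat([turn_text_emb, spk], dim=-1)    # (batch, 11, 512 + speaker_dim)
        h, _ = self.bgru(x)                            # (batch, 11, 2 * hidden_dim)
        weights = torch.softmax(self.attn(h), dim=1)   # (batch, 11, 1), sums to 1 over turns
        context = (weights * h).sum(dim=1)             # (batch, 2 * hidden_dim)
        return context

# Usage example with random inputs:
enc = ConversationalContextEncoder()
ctx = enc(torch.randn(2, 11, 512), torch.randint(0, 10, (2, 11)))
print(ctx.shape)  # torch.Size([2, 256])
```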