Section 4.4 End-to-End Speech Synthesis #169

freedomtowin · 2024-08-06T13:01:47Z

So it seems that training the vocoder on the mel-spectrograms predicted by the text-to-mel model (Tacotron2) gives better results, right?

The mel dataset returns the following:


(mel.squeeze(), audio.squeeze(0), filename, mel_loss.squeeze())

In the training loop, the batch is unpacked as follows:

x, y, _, y_mel = batch

But when not fine-tuning, x and y_mel seem to be the same. Where in the paper can I look to better understand this?
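To make the question concrete, here is a minimal sketch of the two cases as described above. It is not the repository's code: `toy_mel`, `get_item`, and `predicted_mels` are hypothetical names, `toy_mel` is only a stand-in for a real mel-spectrogram transform, and any parameter differences between the input mel and the loss mel in the real dataset are ignored.

```python
def toy_mel(audio, frame=4):
    """Stand-in for a mel-spectrogram (hypothetical, NOT a real mel transform).

    Returns the mean absolute amplitude of each non-overlapping frame.
    """
    return [
        sum(abs(s) for s in audio[i:i + frame]) / frame
        for i in range(0, len(audio), frame)
    ]


def get_item(audio, filename, fine_tuning, predicted_mels=None):
    """Return (mel, audio, filename, mel_loss), mirroring the tuple in the post.

    Assumption: when fine-tuning, the vocoder input comes from the upstream
    text-to-mel model; otherwise it is computed from the ground-truth audio.
    """
    if fine_tuning:
        # Input: mel predicted by the upstream model (looked up by filename).
        mel = predicted_mels[filename]
    else:
        # Input: mel computed directly from the ground-truth audio.
        mel = toy_mel(audio)
    # Loss target: always computed from the ground-truth audio.
    mel_loss = toy_mel(audio)
    return mel, audio, filename, mel_loss


audio = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8]

# Without fine-tuning, input and loss target coincide in this toy setup.
x, y, _, y_mel = get_item(audio, "a.wav", fine_tuning=False)
print("same without fine-tuning:", x == y_mel)

# With fine-tuning, the input is the predicted mel, the target is not.
predicted = {"a.wav": [0.2, 0.7]}  # hypothetical upstream prediction
x, y, _, y_mel = get_item(audio, "a.wav", fine_tuning=True, predicted_mels=predicted)
print("same with fine-tuning:", x == y_mel)
```

In this sketch the fourth element only differs from the first when fine-tuning, which is the behaviour the question is about.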
