
For model/prior.py _initial_sample, why is the prob calculated from N(0,1)? #2

Open
seekerzz opened this issue Sep 22, 2021 · 16 comments

Comments

@seekerzz

Hello, thanks for sharing the PyTorch-based code!
However, I have a question about the _initial_sample function in model/prior.py.
epsilon is sampled from N(0, t) (t is the temperature), so how is its logprob calculated? For a normal distribution,

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

and after taking the log (with mean 0),

$$\log p(x) = -\log\left(\sigma\sqrt{2\pi}\right) - \frac{x^2}{2\sigma^2}.$$

Can you explain why \sigma is taken as 1 instead of t here?
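To make the question concrete, the pattern I am asking about is roughly the following (a minimal sketch with illustrative names, not the repo's exact code): the noise is scaled by the temperature, but its log-probability is evaluated under a standard normal.

```python
import math
import torch

def _initial_sample_sketch(shape, temperature=1.0):
    """Sketch of the pattern in question (not the repo's exact code):
    eps is scaled by the temperature, but its log-prob is computed as if
    eps ~ N(0, 1). With temperature == 1 the two coincide exactly."""
    eps = torch.randn(shape) * temperature
    # element-wise log N(eps; 0, 1), summed over the last dimension
    logprob = (-0.5 * (math.log(2.0 * math.pi) + eps ** 2)).sum(dim=-1)
    return eps, logprob
```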

@keonlee9420
Owner

Hi @seekerzz, t is always 1 in our setting.

@seekerzz
Author

Thanks for your reply!
Have you tried the multi-speaker setting? I used the code for LibriTTS training, but the performance is bad and the KL is high (on the order of 10^3). I also added the initialization of mu and logvar from the flowseq repo (so that they output values around 0; see the sketch after this list), but it doesn't help.
I tried training the posterior first (using only the mel loss) and then the prior (using only the KL), but it still doesn't converge. I also checked whether the posterior P(Z|X,Y) and the decoder P(Y|Z,X) simply discard the information in X (acting like an encoder-decoder of Y), but the decoder alignment shows that the information in X is used.
So this makes me wonder why the prior fails to learn from the posterior:

  • Is it simply too hard for Glow to learn in the multi-speaker setting?
  • Or should I try maximum-likelihood training of Z instead of the KL?
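Here is a minimal sketch of the flowseq-style initialization I mean (illustrative names; the idea is just to zero-initialize the projection layers so that mu and logvar start near 0, i.e. the posterior starts close to N(0, I)):

```python
import torch.nn as nn

class PosteriorProjectionSketch(nn.Module):
    """Illustrative projection head whose mu/logvar outputs start near zero,
    in the spirit of the flowseq-style initialization mentioned above."""
    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.mu_projection = nn.Linear(hidden_dim, latent_dim)
        self.logvar_projection = nn.Linear(hidden_dim, latent_dim)
        # zero-init so the posterior starts close to N(0, I)
        for proj in (self.mu_projection, self.logvar_projection):
            nn.init.zeros_(proj.weight)
            nn.init.zeros_(proj.bias)

    def forward(self, h):
        return self.mu_projection(h), self.logvar_projection(h)
```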

@seekerzz
Author

By the way, this is my training curve:
[screenshot of the training curve]
I did not train the length predictor (I just used the ground-truth length).

@keonlee9420
Copy link
Owner

Can you share the synthesized samples? And where did you apply the speaker information, e.g., speaker embedding?

@seekerzz
Author

Thanks for the quick reply! 😁
I add the speaker embedding to the text embedding (since I think Z can be viewed as a style mapping from the text X to the mel Y, adding the speaker information to X seems more intuitive); roughly, I do something like the sketch below. However, the synthesized samples are still very bad after about 40 epochs on LibriTTS.
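(Illustrative names only, assuming the speaker embedding has the same dimension as the text hidden states:)

```python
import torch
import torch.nn as nn

def add_speaker_to_text(text_hidden: torch.Tensor,
                        speaker_ids: torch.Tensor,
                        speaker_embedding: nn.Embedding) -> torch.Tensor:
    """Broadcast-add a speaker embedding to the text-encoder output so that
    both the prior and the posterior see speaker-dependent hidden states X."""
    spk = speaker_embedding(speaker_ids)   # [B, D]
    return text_hidden + spk.unsqueeze(1)  # [B, T_text, D]
```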
For example, here are the predicted and the ground-truth mels:
[predicted mel spectrogram]
[ground-truth mel spectrogram]
However, if I only train the posterior, the predicted mel is quite OK:
[posterior-only predicted mel spectrogram]

I read another flow-based TTS, Glow-TTS, and found that they condition the speaker information on Z. Maybe I should try their conditioning method. 🤔

@keonlee9420
Owner

Thanks for sharing. So if I understood correctly, you add the speaker embedding to the text embedding right after the text encoder, so that both the posterior and the prior encoder take the speaker-dependent hidden representations X, am I right?
If so, is it different from Glow-TTS' conditioning method as they explain it?

> To train multi-speaker Glow-TTS, we add the speaker embedding and increase the hidden dimension.
> The speaker embedding is applied in all affine coupling layers of the decoder as a global conditioning.

I quoted it from section 4 of the Glow-TTS paper.
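For reference, here is a minimal sketch of what "global conditioning in an affine coupling layer" can look like (illustrative class name, shapes, and layer choices; not the actual Glow-TTS code): the speaker embedding is broadcast over time and added to the coupling network's hidden activations before predicting the scale and shift.

```python
import torch
import torch.nn as nn

class SpeakerConditionedCoupling(nn.Module):
    """Illustrative affine coupling block with global speaker conditioning:
    half of the channels are transformed by a scale/shift predicted from the
    other half plus a broadcast speaker embedding."""
    def __init__(self, channels: int, hidden: int, spk_dim: int):
        super().__init__()
        self.pre = nn.Conv1d(channels // 2, hidden, kernel_size=3, padding=1)
        self.spk_proj = nn.Conv1d(spk_dim, hidden, kernel_size=1)
        self.out = nn.Conv1d(hidden, channels, kernel_size=3, padding=1)

    def forward(self, z, spk_emb):
        # z: [B, C, T], spk_emb: [B, spk_dim]
        za, zb = z.chunk(2, dim=1)
        h = self.pre(za) + self.spk_proj(spk_emb.unsqueeze(-1))  # broadcast over T
        log_s, b = self.out(torch.tanh(h)).chunk(2, dim=1)
        zb = zb * torch.exp(log_s) + b
        logdet = log_s.sum(dim=(1, 2))
        return torch.cat([za, zb], dim=1), logdet
```

The same speaker vector is injected into every coupling block, which is what "global conditioning" refers to in the quote above.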

@seekerzz
Author

Yes! I am going to try their conditioning method. If it succeeds, I will share the result. 😊

@keonlee9420
Owner

Ah, I see. I think it should work if you adopt the same approach. Looking forward to seeing it!

@keonlee9420
Owner

@seekerzz hey, have you made any progress?

@seekerzz
Author

Hello! I just found what might be a mistake in the code!
In VAENAR.py:
[screenshot of the call in VAENAR.py]
But in posterior.py:
[screenshot of the corresponding definition in posterior.py]
I'm trying to train the multi-speaker version again to see the results. 😁
(Curious why LJSpeech still works, haha)

@keonlee9420
Owner

Great! Hope to get a clear sample soon.
That's intended: since we are not interested in the alignment from the posterior, you should see no error from it when you use the same code in the multi-speaker setting.

@seekerzz
Author

Hello, I mean that the positions of mu and logvar are swapped.

@keonlee9420
Owner

Ah, sorry for the misunderstanding. Yes, you're right, they should be switched. The reason it still works is that they are the same layers, just wrongly named (reversed): in the current implementation, mu_projection predicts logvar, and logvar_projection predicts mu. I will retrain the model with this fix when I have room for that. Thanks for the report!
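Schematically, the swap amounts to the following (illustrative, not the exact repo code); because both the sampling step and the KL term consume the same reversed pair everywhere, training is unaffected and only the names are misleading:

```python
# Intended naming                     # Current naming (reversed)
# mu     = mu_projection(h)           # mu     = logvar_projection(h)
# logvar = logvar_projection(h)       # logvar = mu_projection(h)

def reparameterize(mu, logvar, eps):
    """z = mu + sigma * eps, with sigma = exp(0.5 * logvar)."""
    return mu + (0.5 * logvar).exp() * eps
```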

@seekerzz
Author

Thanks for your reply! I understand now that they just swap each other's variable names!

My main problem with the multi-speaker training is that the prior cannot converge.
The posterior and decoder can be trained easily within about 20 epochs:
[screenshot of the posterior/decoder training curves]
Although the decoder attention looks a little noisy, it is correct:
[screenshot of the decoder attention]

So I decided to train the prior only (with the posterior and decoder frozen). However, the prior cannot converge to the Z learned by the posterior (the KL divergence stays at around the 2*10^3 level).
The logvar predicted by the posterior is very small compared to the single-speaker (LJSpeech) case, so the samples (I mean samples, eps = self.posterior.reparameterize(mu, logvar, self.n_sample)) are nearly equal to mu. As a result, the logprobs are very high (they even become positive, whereas they are negative for LJSpeech).
[screenshot of the posterior log-prob values]
I don't know whether this can be a problem for the flow-based model. 🤔
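The link between a very small logvar and a large (even positive) log-prob follows directly from the Gaussian log-density used above, with \sigma_d^2 = \exp(\mathrm{logvar}_d):

$$\log q(z \mid x, y) = -\frac{1}{2}\sum_d \left[\log(2\pi) + \mathrm{logvar}_d + \frac{(z_d - \mu_d)^2}{\exp(\mathrm{logvar}_d)}\right]$$

When logvar_d is strongly negative, z is almost exactly mu, so the quadratic term stays small while the -logvar_d/2 term grows; the per-dimension density exceeds 1 and the total log-prob can turn positive. The prior flow then has to match a nearly deterministic, sharply peaked posterior, which seems consistent with the large KL I observe.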

@wizardk

wizardk commented Dec 30, 2021

@seekerzz Could you share any synthesized samples?

@whh07141

whh07141 commented Apr 6, 2022

Hi, I have met the same problem when I added a VQ encoder after the posterior and prior encoders. The KL was at the 1e+4 level and would not converge. Did you finish the job?

> (quoting @seekerzz's earlier comment above)
