
For model/prior.py _initial_sample, why is the prob calculated from N(0,1)? #2

Open
seekerzz opened this issue Sep 22, 2021 · 16 comments

Comments

@seekerzz

Hello, thanks for sharing the PyTorch-based code!
However, I have a question about the _initial_sample function in model/prior.py.
epsilon is sampled from N(0, t) (t is the temperature), so how is its logprob calculated? For a normal distribution,

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

and after taking the log (with mean 0),

$$\log p(x) = -\log\left(\sigma\sqrt{2\pi}\right) - \frac{x^2}{2\sigma^2}.$$

Can you explain why \sigma is taken as 1 instead of t here?
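To make the question concrete, the pattern I am asking about is roughly the following (a minimal sketch with illustrative names, not the repo's exact code): the noise is scaled by the temperature, but its log-probability is evaluated under a standard normal.

```python
import math
import torch

def _initial_sample_sketch(shape, temperature=1.0):
    """Sketch of the pattern in question (not the repo's exact code):
    eps is scaled by the temperature, but its log-prob is computed as if
    eps ~ N(0, 1). With temperature == 1 the two coincide exactly."""
    eps = torch.randn(shape) * temperature
    # element-wise log N(eps; 0, 1), summed over the last dimension
    logprob = (-0.5 * (math.log(2.0 * math.pi) + eps ** 2)).sum(dim=-1)
    return eps, logprob
```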

@keonlee9420
Owner

Hi @seekerzz, t is always 1 in our setting.

@seekerzz
Author

Thanks for your reply!
Have you tried the multi-speaker setting? I used the code for LibriTTS training, but the performance is bad and the KL is high (on the order of 10^3). I also added the initialization of mu and logvar from the flowseq repo (so that they output values around 0; see the sketch after this list), but it doesn't help.
I tried training the posterior first (using only the mel loss) and then the prior (using only the KL), but it still doesn't converge. I also checked whether the posterior P(Z|X,Y) and the decoder P(Y|Z,X) simply discard the information in X (acting like an encoder-decoder of Y), but the decoder alignment shows that the information in X is used.
So this makes me wonder why the prior fails to learn from the posterior:

  • Is it simply too hard for Glow to learn in the multi-speaker setting?
  • Or should I try maximum-likelihood training of Z instead of the KL?
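Here is a minimal sketch of the flowseq-style initialization I mean (illustrative names; the idea is just to zero-initialize the projection layers so that mu and logvar start near 0, i.e. the posterior starts close to N(0, I)):

```python
import torch.nn as nn

class PosteriorProjectionSketch(nn.Module):
    """Illustrative projection head whose mu/logvar outputs start near zero,
    in the spirit of the flowseq-style initialization mentioned above."""
    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.mu_projection = nn.Linear(hidden_dim, latent_dim)
        self.logvar_projection = nn.Linear(hidden_dim, latent_dim)
        # zero-init so the posterior starts close to N(0, I)
        for proj in (self.mu_projection, self.logvar_projection):
            nn.init.zeros_(proj.weight)
            nn.init.zeros_(proj.bias)

    def forward(self, h):
        return self.mu_projection(h), self.logvar_projection(h)
```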

@seekerzz
Author

By the way, this is my training curve:
[screenshot of the training curve]
I did not train the length predictor (I just used the ground-truth length).

@keonlee9420
Copy link
Owner

Can you share the synthesized samples? And where did you apply the speaker information, e.g., speaker embedding?

@seekerzz
Author

Thanks for the quick reply! 😁
I add the speaker embedding to the text embedding (since I think Z can be viewed as a style mapping from the text X to the mel Y, adding the speaker information to X seems more intuitive); roughly, I do something like the sketch below. However, the synthesized samples are still very bad after about 40 epochs on LibriTTS.
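(Illustrative names only, assuming the speaker embedding has the same dimension as the text hidden states:)

```python
import torch
import torch.nn as nn

def add_speaker_to_text(text_hidden: torch.Tensor,
                        speaker_ids: torch.Tensor,
                        speaker_embedding: nn.Embedding) -> torch.Tensor:
    """Broadcast-add a speaker embedding to the text-encoder output so that
    both the prior and the posterior see speaker-dependent hidden states X."""
    spk = speaker_embedding(speaker_ids)   # [B, D]
    return text_hidden + spk.unsqueeze(1)  # [B, T_text, D]
```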
For example, here are the predicted and the ground-truth mels:
[predicted mel spectrogram]
[ground-truth mel spectrogram]
However, if I only train the posterior, the predicted mel is quite OK:
[posterior-only predicted mel spectrogram]

I read another flow-based TTS, Glow-TTS, and found that they condition the speaker information on Z. Maybe I should try their conditioning method. 🤔

@keonlee9420
Owner

Thanks for sharing. So if I understood correctly, you add the speaker embedding to the text embedding right after the text encoder, so that both the posterior and the prior encoder take the speaker-dependent hidden representations X, am I right?
If so, is it different from Glow-TTS' conditioning method as they explain it?

> To train multi-speaker Glow-TTS, we add the speaker embedding and increase the hidden dimension.
> The speaker embedding is applied in all affine coupling layers of the decoder as a global conditioning.

I quoted it from section 4 of the Glow-TTS paper.
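For reference, here is a minimal sketch of what "global conditioning in an affine coupling layer" can look like (illustrative class name, shapes, and layer choices; not the actual Glow-TTS code): the speaker embedding is broadcast over time and added to the coupling network's hidden activations before predicting the scale and shift.

```python
import torch
import torch.nn as nn

class SpeakerConditionedCoupling(nn.Module):
    """Illustrative affine coupling block with global speaker conditioning:
    half of the channels are transformed by a scale/shift predicted from the
    other half plus a broadcast speaker embedding."""
    def __init__(self, channels: int, hidden: int, spk_dim: int):
        super().__init__()
        self.pre = nn.Conv1d(channels // 2, hidden, kernel_size=3, padding=1)
        self.spk_proj = nn.Conv1d(spk_dim, hidden, kernel_size=1)
        self.out = nn.Conv1d(hidden, channels, kernel_size=3, padding=1)

    def forward(self, z, spk_emb):
        # z: [B, C, T], spk_emb: [B, spk_dim]
        za, zb = z.chunk(2, dim=1)
        h = self.pre(za) + self.spk_proj(spk_emb.unsqueeze(-1))  # broadcast over T
        log_s, b = self.out(torch.tanh(h)).chunk(2, dim=1)
        zb = zb * torch.exp(log_s) + b
        logdet = log_s.sum(dim=(1, 2))
        return torch.cat([za, zb], dim=1), logdet
```

The same speaker vector is injected into every coupling block, which is what "global conditioning" refers to in the quote above.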

@seekerzz
Author

Yes! I am going to try their conditioning method. If it succeeds, I will share the result. 😊

@keonlee9420
Owner

Ah, I see. I think it should work if you adopt the same approach. Looking forward to seeing it!

@keonlee9420
Owner

@seekerzz hey, have you made any progress?

@seekerzz
Author

Hello! I just found what might be a mistake in the code!
In VAENAR.py:
[screenshot of the call in VAENAR.py]
But in posterior.py:
[screenshot of the corresponding definition in posterior.py]
I'm trying to train the multi-speaker version again to see the results. 😁
(Curious why LJSpeech still works, haha)

@keonlee9420
Owner

Great! Hope to get a clear sample soon.
That's intended: since we are not interested in the alignment from the posterior, you should see no error from it when you use the same code in the multi-speaker setting.

@seekerzz
Author

Hello, I mean that the positions of mu and logvar are swapped.

@keonlee9420
Owner

Ah, sorry for the misunderstanding. Yes, you're right, they should be switched. The reason it still works is that they are the same layers, just wrongly named (reversed): in the current implementation, mu_projection predicts logvar, and logvar_projection predicts mu. I will retrain the model with this fix when I have room for that. Thanks for the report!
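Schematically, the swap amounts to the following (illustrative, not the exact repo code); because both the sampling step and the KL term consume the same reversed pair everywhere, training is unaffected and only the names are misleading:

```python
# Intended naming                     # Current naming (reversed)
# mu     = mu_projection(h)           # mu     = logvar_projection(h)
# logvar = logvar_projection(h)       # logvar = mu_projection(h)

def reparameterize(mu, logvar, eps):
    """z = mu + sigma * eps, with sigma = exp(0.5 * logvar)."""
    return mu + (0.5 * logvar).exp() * eps
```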

@seekerzz
Author

Thanks for your reply! I understand now that they just swap each other's variable names!

My main problem with the multi-speaker training is that the prior cannot converge.
The posterior and decoder can be trained easily within about 20 epochs:
[screenshot of the posterior/decoder training curves]
Although the decoder attention looks a little noisy, it is correct:
[screenshot of the decoder attention]

So I decided to train the prior only (with the posterior and decoder frozen). However, the prior cannot converge to the Z learned by the posterior (the KL divergence stays at around the 2*10^3 level).
The logvar predicted by the posterior is very small compared to the single-speaker (LJSpeech) case, so the samples (I mean samples, eps = self.posterior.reparameterize(mu, logvar, self.n_sample)) are nearly equal to mu. As a result, the logprobs are very high (they even become positive, whereas they are negative for LJSpeech).
[screenshot of the posterior log-prob values]
I don't know whether this can be a problem for the flow-based model. 🤔
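The link between a very small logvar and a large (even positive) log-prob follows directly from the Gaussian log-density used above, with \sigma_d^2 = \exp(\mathrm{logvar}_d):

$$\log q(z \mid x, y) = -\frac{1}{2}\sum_d \left[\log(2\pi) + \mathrm{logvar}_d + \frac{(z_d - \mu_d)^2}{\exp(\mathrm{logvar}_d)}\right]$$

When logvar_d is strongly negative, z is almost exactly mu, so the quadratic term stays small while the -logvar_d/2 term grows; the per-dimension density exceeds 1 and the total log-prob can turn positive. The prior flow then has to match a nearly deterministic, sharply peaked posterior, which seems consistent with the large KL I observe.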

@wizardk

wizardk commented Dec 30, 2021

@seekerzz Could you share any synthesized samples?

@whh07141

whh07141 commented Apr 6, 2022

Hi, I have met the same problem when I added a VQ encoder after the posterior and prior encoders. The KL was at the 1e+4 level and would not converge. Did you finish the job?

> (quoting @seekerzz's earlier comment above)
