Description
Describe the question
For the diarization task, I train on the AMI train+dev sets and the ICSI corpus, and test on the AMI test set. Both datasets contain audios with 3-5 speakers, each 50-70 minutes long. My d-vector embedding is trained on VoxCeleb 1 and 2 with EER = 4.55%. I train UIS-RNN with a 240 ms window, 50% overlap, and 400 ms segments. The results are poor on both the training and test sets.
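For reference, here is a rough sketch of how I aggregate the sliding-window d-vectors into segment embeddings (the VoxCeleb-trained network itself is not shown; `frame_dvectors` is a placeholder for its window-level outputs, and the helper name is mine):

```python
import numpy as np

def segment_dvectors(frame_dvectors, window_s=0.24, overlap=0.5, segment_s=0.4):
    """Average overlapping-window d-vectors into fixed-length segments.

    frame_dvectors: (num_windows, dim) array of window-level embeddings,
        where windows are `window_s` long with `overlap` fractional overlap.
    Returns (num_segments, dim) L2-normalized segment embeddings.
    """
    hop_s = window_s * (1.0 - overlap)                 # 0.12 s hop between windows
    windows_per_segment = max(1, int(round(segment_s / hop_s)))
    num_segments = len(frame_dvectors) // windows_per_segment
    segments = []
    for i in range(num_segments):
        chunk = frame_dvectors[i * windows_per_segment:(i + 1) * windows_per_segment]
        emb = chunk.mean(axis=0)
        segments.append(emb / (np.linalg.norm(emb) + 1e-8))
    return np.stack(segments)
```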
I also read through all of the UIS-RNN code, and there are two things I do not understand: 1) Why do you split up the original utterances, concatenate them by speaker, and then use that as the training input? 2) Why does the input ignore which audio each utterance belongs to, merging all utterances as if they came from a single audio? This process seems completely different from the inference process, and it also limits the usable batch size when one speaker talks too much. See the sketch below for what I mean.
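To make questions 1) and 2) concrete, this is my understanding of the training data preparation as a simplified sketch (it reflects my reading of the code, not the actual implementation, and the function name is mine):

```python
from collections import defaultdict
import numpy as np

def build_training_sequence(utterances):
    """Simplified view of the preprocessing I am asking about.

    utterances: list of dicts, each like
        {"audio_id": str, "speaker": str, "segments": (N, dim) np.ndarray}

    As I read it, the code (1) regroups segments by speaker, concatenating each
    speaker's segments regardless of where they occurred, and (2) merges all
    speakers from all audios into one long sequence, dropping the audio_id.
    """
    by_speaker = defaultdict(list)
    for utt in utterances:
        by_speaker[utt["speaker"]].append(utt["segments"])   # audio_id ignored

    train_sequence, train_cluster_id = [], []
    for speaker, chunks in by_speaker.items():
        speaker_segments = np.concatenate(chunks, axis=0)    # one block per speaker
        train_sequence.append(speaker_segments)
        train_cluster_id.extend([speaker] * len(speaker_segments))

    # Everything ends up in a single sequence, as if it were one audio.
    return np.concatenate(train_sequence, axis=0), np.array(train_cluster_id)
```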
For a 1-hour audio, the output contains 20-30 speakers instead of 3-5, no matter how small I make crp_alpha.
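For example, this is roughly how I sweep crp_alpha at inference time (the file paths, the d-vector dimension, and the test array come from my setup and are placeholders, not part of this repo):

```python
import numpy as np
import uisrnn

model_args, _, inference_args = uisrnn.parse_arguments()
model_args.observation_dim = 256   # placeholder: must match my d-vector dimension
test_sequence = np.load('ami_test_dvectors.npy').astype(float)  # placeholder path, (num_segments, dim)

for alpha in [1.0, 0.5, 0.1, 0.01]:
    model_args.crp_alpha = alpha
    model = uisrnn.UISRNN(model_args)
    model.load('saved_uisrnn_model.uisrnn')          # placeholder path to my trained model
    predicted_ids = model.predict(test_sequence, inference_args)
    print(alpha, len(set(predicted_ids)))            # still 20-30 clusters for ~1 h of audio
```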
My background
Have I read the README.md file?
- yes
Have I searched for similar questions from closed issues?
- yes
Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?
- yes
Have I tried to find the answers in the reference Speaker Diarization with LSTM?
- yes
Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?
- yes