
uis-rnn can't work for long utterances dataset? #50

Open
@wrongbattery

Description

Describe the question

For the diarization task, I train on the AMI train/dev sets plus the ICSI corpus, and test on the AMI test set. Both datasets contain recordings of 3-5 speakers lasting 50-70 minutes. My d-vector embedding is trained on VoxCeleb 1 and 2, with EER = 4.55%. I train UIS-RNN with a 240 ms window, 50% overlap, and a 400 ms segment size. The results are poor on both the train and test sets.
I have also read all of the UIS-RNN code, and there are two things I don't understand: (1) why do you split up the original utterances, concatenate the segments by speaker, and then use that as the training input? (2) Why does the input ignore which recording each utterance belongs to, merging all utterances as if they came from one single audio file? This process seems completely different from the inference process, and it also reduces the ability to use batching when one speaker talks too much.
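To make sure we are talking about the same preprocessing step, here is a minimal sketch of the per-speaker grouping I mean. This is toy data and my own re-implementation of the idea, not the actual uis-rnn code; the embedding dimension and labels are illustrative.

```python
import numpy as np

def concatenate_by_speaker(sequence, cluster_id):
    """Group the segments of one utterance by speaker label.

    sequence: (num_segments, emb_dim) array of segment embeddings.
    cluster_id: per-segment speaker labels (same length as sequence).
    Returns one sub-array per speaker, with that speaker's segments
    kept in their original temporal order.
    """
    order = []  # unique speakers, in order of first appearance
    for spk in cluster_id:
        if spk not in order:
            order.append(spk)
    return [sequence[[i for i, s in enumerate(cluster_id) if s == spk]]
            for spk in order]

# Toy example: 6 segments with 2-dim embeddings, three speakers.
emb = np.arange(12.0).reshape(6, 2)
labels = ['A', 'A', 'B', 'A', 'B', 'C']
per_speaker = concatenate_by_speaker(emb, labels)
print([x.shape for x in per_speaker])  # → [(3, 2), (2, 2), (1, 2)]
```

My concern is that after this grouping, the training sequences no longer look like real interleaved conversations, which is what the model sees at inference time.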
For a 1-hour recording, the output contains 20-30 speakers instead of 3-5, no matter how small I make crp_alpha.
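For context on why I expected a smaller crp_alpha to help: my (simplified) understanding of the CRP prior in the paper is that an existing speaker k is weighted by its block count N_k while a new speaker is weighted by alpha, so shrinking alpha should suppress new speakers. A sketch of that prior, with illustrative block counts:

```python
import numpy as np

def crp_prior(block_counts, alpha):
    """Simplified CRP prior over the next speaker.

    block_counts: number of continuous speech blocks already assigned to
    each existing speaker; alpha: the new-speaker weight (crp_alpha).
    Returns normalized probabilities [existing speakers..., new speaker].
    Note: the real UIS-RNN combines this prior with the RNN's sequence
    likelihood, so the prior alone does not determine the output.
    """
    weights = np.asarray(list(block_counts) + [alpha], dtype=float)
    return weights / weights.sum()

# Two existing speakers with 4 and 2 blocks, alpha = 0.5:
p = crp_prior([4, 2], 0.5)
print(p)  # → approx [0.615, 0.308, 0.077]
```

Given this, I would expect a tiny alpha to almost never open a new speaker, yet I still get 20-30 speakers, which makes me suspect the observation likelihood dominates.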

My background

Have I read the README.md file?

  • yes

Have I searched for similar questions from closed issues?

  • yes

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

  • yes

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

  • yes

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

  • yes

Metadata

Labels

question (Further information is requested)
