Connection between output tokens and input audio #4278
Comments
ESPnet models usually have a CTC branch, so we can get token onset information (and the offset, taken from the frame just before the onset of the succeeding token) with the Viterbi algorithm. @lumaku, is it easy to obtain token-level segmentation information from your CTC segmentation?
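For illustration, here is a minimal sketch of reading token onsets off the CTC spikes with a simple greedy pass (not the Viterbi forced alignment itself; `log_probs`, `char_list`, and `frame_shift` are assumed inputs, not a specific ESPnet API):

```python
import numpy as np

def greedy_ctc_onsets(log_probs, char_list, frame_shift, blank_id=0):
    """Approximate token onsets from frame-wise CTC posteriors.

    log_probs:   (T, V) CTC log-posteriors from the CTC branch (assumed given)
    char_list:   list of V output tokens, with char_list[blank_id] == "<blank>"
    frame_shift: seconds per CTC output frame (e.g. 0.04 for 10 ms features
                 with 4x encoder subsampling)
    """
    best = np.asarray(log_probs).argmax(axis=-1)  # most likely token per frame
    onsets = []
    prev = blank_id
    for t, idx in enumerate(best):
        # A token "occurs" where the per-frame argmax switches to a new non-blank id.
        if idx != blank_id and idx != prev:
            onsets.append((char_list[idx], t * frame_shift))
        prev = idx
    return onsets
```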
I would not recommend using the attention map to obtain the segmentation results.
I agree. I believe that what makes the attention map noisy is the self-attention in the encoder layers.
Sorry "an attention branch" --> "a CTC branch" |
It is possible: you can set the intervals (utterances) to single tokens. Use a character-based model. Also, the number of output tokens of the ASR model must be at least 2 times the number of ground-truth tokens; otherwise, the aligned tokens will overlap and become inaccurate. The CTC output always takes the form of spikes at the output index where a certain token "occurs". The CTC segmentation algorithm sets the interval as a next-neighbor range around the token activation. This is usually a bit tricky for the last token (but that also depends on the audio and the ASR model). @Daniel-asr You can test this, for example, with the character-based WSJ model.
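As a rough sketch of the single-token trick described above, using the standalone `ctc_segmentation` package (here `lpz`, `char_list`, and `index_duration` are assumed to come from a character-based ASR model; double-check the package's current API before relying on this):

```python
from ctc_segmentation import (
    CtcSegmentationParameters,
    ctc_segmentation,
    determine_utterance_segments,
    prepare_text,
)

def token_level_segments(lpz, char_list, index_duration, transcript):
    """Token-level intervals via CTC segmentation, one "utterance" per character.

    lpz:            (T, V) CTC log-posteriors for the audio (assumed given)
    char_list:      the model's token inventory
    index_duration: seconds per CTC output frame
    """
    config = CtcSegmentationParameters()
    config.char_list = char_list
    config.index_duration = index_duration

    tokens = list(transcript)  # each character becomes its own segment
    ground_truth_mat, utt_begin_indices = prepare_text(config, tokens)
    timings, char_probs, _ = ctc_segmentation(config, lpz, ground_truth_mat)
    segments = determine_utterance_segments(
        config, utt_begin_indices, char_probs, timings, tokens
    )
    # Each entry is (token, (start_time, end_time, score)).
    return list(zip(tokens, segments))
```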
Oh, I see. If we prepare a list of characters or tokens, the model is expected to provide token-level alignments.
Add a description of token-level alignment, as discussed in #4278
I want to find a way to know which interval in the input audio corresponds to each output token. For example, if the audio says "hello", I want a mapping such as:
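(Something along these lines, with made-up times just for illustration:)

```python
# Purely illustrative numbers; the actual intervals depend on the audio.
alignment = [
    ("h", 0.50, 0.62),
    ("e", 0.62, 0.70),
    ("l", 0.70, 0.81),
    ("l", 0.81, 0.93),
    ("o", 0.93, 1.10),
]
```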
I saw this attention images issue, which might help, since there is a linear correlation between the input audio timing and the encoder frames, but I am still not sure whether this is the right way to go or whether there are other tools in ESPnet for my need. Furthermore, I am expecting a single correlation mapping, so how can there be more than one attention image (as shown in the issue above)?
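My current understanding of that linear correlation is something like the following (the 10 ms feature hop and 4x subsampling are just my guesses at typical defaults, not values I have confirmed for my model):

```python
def encoder_frame_to_seconds(frame_index, hop_length_s=0.01, subsampling=4):
    # Assumed defaults: 10 ms feature hop and 4x encoder subsampling;
    # the actual values depend on the frontend and encoder configuration.
    return frame_index * hop_length_s * subsampling
```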
Thanks!