Connection between output tokens and input audio #4278
Comments
ESPnet models usually have a CTC branch, so we can get token onset information (and the offset, taken from the frame just before the onset of the succeeding token) with the Viterbi algorithm. @lumaku, is it easy to obtain token-level segmentation information from your CTC segmentation?
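For illustration, here is a minimal sketch of reading token onsets off the CTC spikes with a simple greedy pass (not the Viterbi forced alignment itself; `log_probs`, `char_list`, and `frame_shift` are assumed inputs, not a specific ESPnet API):

```python
import numpy as np

def greedy_ctc_onsets(log_probs, char_list, frame_shift, blank_id=0):
    """Approximate token onsets from frame-wise CTC posteriors.

    log_probs:   (T, V) CTC log-posteriors from the CTC branch (assumed given)
    char_list:   list of V output tokens, with char_list[blank_id] == "<blank>"
    frame_shift: seconds per CTC output frame (e.g. 0.04 for 10 ms features
                 with 4x encoder subsampling)
    """
    best = np.asarray(log_probs).argmax(axis=-1)  # most likely token per frame
    onsets = []
    prev = blank_id
    for t, idx in enumerate(best):
        # A token "occurs" where the per-frame argmax switches to a new non-blank id.
        if idx != blank_id and idx != prev:
            onsets.append((char_list[idx], t * frame_shift))
        prev = idx
    return onsets
```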
I would not recommend using the attention map to obtain the segmentation results.
I agree. I believe that what makes the attention map noisy is the self-attention in the encoder layers.
Sorry "an attention branch" --> "a CTC branch" |
It is possible: you can set the intervals (utterances) to single tokens. Use a character-based model. Also, the number of output tokens of the ASR model must be at least 2 times the number of ground-truth tokens; otherwise, the aligned tokens will overlap and become inaccurate. The CTC output always takes the form of spikes at the output index where a certain token "occurs". The CTC segmentation algorithm sets the interval as a next-neighbor range around the token activation. This is usually a bit tricky for the last token (but that also depends on the audio and the ASR model). @Daniel-asr You can test this, for example, with the character-based WSJ model.
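As a rough sketch of the single-token trick described above, using the standalone `ctc_segmentation` package (here `lpz`, `char_list`, and `index_duration` are assumed to come from a character-based ASR model; double-check the package's current API before relying on this):

```python
from ctc_segmentation import (
    CtcSegmentationParameters,
    ctc_segmentation,
    determine_utterance_segments,
    prepare_text,
)

def token_level_segments(lpz, char_list, index_duration, transcript):
    """Token-level intervals via CTC segmentation, one "utterance" per character.

    lpz:            (T, V) CTC log-posteriors for the audio (assumed given)
    char_list:      the model's token inventory
    index_duration: seconds per CTC output frame
    """
    config = CtcSegmentationParameters()
    config.char_list = char_list
    config.index_duration = index_duration

    tokens = list(transcript)  # each character becomes its own segment
    ground_truth_mat, utt_begin_indices = prepare_text(config, tokens)
    timings, char_probs, _ = ctc_segmentation(config, lpz, ground_truth_mat)
    segments = determine_utterance_segments(
        config, utt_begin_indices, char_probs, timings, tokens
    )
    # Each entry is (token, (start_time, end_time, score)).
    return list(zip(tokens, segments))
```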
Oh, I see. If we prepare a list of characters or tokens, the model is expected to provide token-level alignments.
Add a description of token-level alignment, as discussed in #4278
I want to find a way to know which interval in the input audio corresponds to each output token. For example, if the audio says "hello", I want a mapping such as:
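(Something along these lines, with made-up times just for illustration:)

```python
# Purely illustrative numbers; the actual intervals depend on the audio.
alignment = [
    ("h", 0.50, 0.62),
    ("e", 0.62, 0.70),
    ("l", 0.70, 0.81),
    ("l", 0.81, 0.93),
    ("o", 0.93, 1.10),
]
```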
I saw this attention images issue, which might help, since there is a linear correlation between the input audio timing and the encoder frames, but I am still not sure whether this is the right way to go or whether there are other tools in ESPnet for my need. Furthermore, I am expecting a single correlation mapping, so how can there be more than one attention image (as shown in the issue above)?
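My current understanding of that linear correlation is something like the following (the 10 ms feature hop and 4x subsampling are just my guesses at typical defaults, not values I have confirmed for my model):

```python
def encoder_frame_to_seconds(frame_index, hop_length_s=0.01, subsampling=4):
    # Assumed defaults: 10 ms feature hop and 4x encoder subsampling;
    # the actual values depend on the frontend and encoder configuration.
    return frame_index * hop_length_s * subsampling
```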
Thanks!