Inference for streaming transducer #4548

Draft · wants to merge 9 commits into base: master
Conversation

@jhlee9010 (Contributor) commented Jul 30, 2022

Created a new PR as mentioned in #4530.
This is a draft version of the streaming transducer inference code.

I trained a contextual Conformer transducer on Librispeech-100 with the attached YAML file.
The beam search algorithm for the streaming transducer is mAES; other algorithms are not considered.

|            | CER  | WER  |
|------------|------|------|
| dev_clean  | 4.0  | 9.1  |
| dev_other  | 12.7 | 23.7 |
| test_clean | 4.1  | 9.6  |
| test_other | 13.0 | 24.7 |

Also, I attached sample test code, which is almost the same as
https://espnet.github.io/espnet/notebook/espnet2_streaming_asr_demo.html#Prepare-for-inference.
The RTF for the sample speech was 0.26.

sample.zip

@b-flo marked this pull request as a draft on July 30, 2022, 19:20
@b-flo (Member) commented Jul 30, 2022

Hi,

Thanks a lot for this contribution!

At a quick glance, I'm not sure this is finished, so I switched your PR to a draft for now. Also, I'm not sure if I should start reviewing, so please tell me when it's OK!

> |            | CER  | WER  |
> |------------|------|------|
> | dev_clean  | 4.0  | 9.1  |
> | dev_other  | 12.7 | 23.7 |
> | test_clean | 4.1  | 9.6  |
> | test_other | 13.0 | 24.7 |

It's for another PR, but I think some tuning may be needed. It can't be directly compared (different configs and streaming approaches), but with PR #4479 I currently have the following WER:

| dev-clean | dev-other | test-clean | test-other |
|-----------|-----------|------------|------------|
| 6.6       | 18.9      | 7.0        | 19.2       |

I can share my config if you want.

> RTF for sample speech was 0.26

It's also high but expected. I think you can safely lower the beam size (20 -> 10) and also modify the mAES parameters (the default ones are used, if I'm not mistaken?) with: nstep: 1, expansion_gamma: 1.5, expansion_beta: 1 (or 0), and prefix-alpha: 1.
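
For reference, a decoding conf with these suggested values might look like the sketch below (the beam_search_config layout and the exact key names are assumptions based on the parameters named above; check espnet2/asr_transducer/beam_search_transducer.py for the actual options):

beam_size: 10
beam_search_config:
    search_type: maes
    nstep: 1
    prefix_alpha: 1
    expansion_gamma: 1.5
    expansion_beta: 1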

@b-flo added the labels ASR (Automatic speech recognition), RNNT ((RNN) transducer related issue), and Streaming on Jul 30, 2022
@b-flo self-assigned this on Jul 30, 2022
@jhlee9010 (Contributor, Author) commented Jul 31, 2022

> It's for another PR, but I think some tuning may be needed. [...] I can share my config if you want.

Yes, please. It would be helpful.

> It's also high but expected. I think you can safely lower the beam size (20 -> 10) and also modify the mAES parameters (the default ones are used, if I'm not mistaken?) [...]

Right, I used the default mAES parameters. I'll check with the new mAES parameters soon.

I added several beam search algorithms (default, tsd, nsc, maes) in the last commit.
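
Switching between these algorithms should only require changing the decoding conf; a minimal sketch (max_sym_exp is an assumed tsd-specific parameter name, by analogy with the time-synchronous decoding literature, and is not confirmed from this PR):

beam_size: 10
beam_search_config:
    search_type: tsd   # one of: default, tsd, nsc, maes
    max_sym_exp: 3     # tsd only: max symbol expansions per time step (assumed name)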

@karthik19967829 (Contributor) commented

@jhlee9010 Thanks for this contribution. Could you share a link to your trained model?
Also, @b-flo, is there any way we can decode a non-streaming Conformer-RNNT (https://huggingface.co/espnet/chai_librispeech_asr_train_conformer-rnn_transducer_raw_en_bpe5000_sp) via this decoding strategy? I understand there will be a performance drop, but I just want to streaming-decode the available model.

@b-flo (Member) commented Aug 22, 2022

> Also, @b-flo, is there any way we can decode a non-streaming Conformer-RNNT [...] via this decoding strategy? [...] I just want to streaming-decode the available model.

Sorry, I'm a bit confused. Are you asking if we can do online decoding with a non-streaming model, or if a specific decoding strategy can be performed in an online manner?

Btw, you can ask Yifan for an online Transducer model. We're working on it, but the baseline online system with the new version is on par with the old (offline) one:

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_transducer_asr_model_valid.loss.ave_10best/dev_clean|2703|54402|97.9|1.9|0.2|0.2|2.3|29.0|
|decode_transducer_asr_model_valid.loss.ave_10best/dev_other|2864|50948|94.7|4.8|0.5|0.6|5.9|47.9|
|decode_transducer_asr_model_valid.loss.ave_10best/test_clean|2620|52576|97.7|2.1|0.2|0.3|2.6|31.1|
|decode_transducer_asr_model_valid.loss.ave_10best/test_other|2939|52343|94.7|4.8|0.5|0.7|6.0|48.9|

@karthik19967829 (Contributor) commented

@b-flo Yes, I would like to know how to perform online decoding of the non-streaming Conformer (https://huggingface.co/espnet/chai_librispeech_asr_train_conformer-rnn_transducer_raw_en_bpe5000_sp).

Thanks for the info on the online transducer; I will get in touch with Yifan.

@joazoa commented Sep 6, 2022

@b-flo Can you share the config you used? (I suppose it's for the new standalone version?)

@b-flo (Member) commented Sep 6, 2022

@joazoa For Librispeech or Librispeech-100?

@joazoa commented Sep 6, 2022

Thanks for the super quick reply!
For Librispeech if possible, but Librispeech-100 would help as well. (Any other config for the new version would help me too.)

@b-flo (Member) commented Sep 6, 2022

Sure but I only have Librispeech-100 configs at hand right now.

Training conf:

# general
batch_type: numel
batch_bins: 2000000
accum_grad: 16
max_epoch: 60 # 100 produces better results.
patience: none
init: none
num_att_plot: 0

# optimizer
optim: adam
optim_conf:
    lr: 0.002
    weight_decay: 0.000001
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 15000

# criterion
val_scheduler_criterion:
    - valid
    - loss
best_model_criterion:
-   - valid
    - loss
    - min
keep_nbest_models: 10 # 20 produces slightly better results.

model_conf:
    transducer_weight: 1.0
    auxiliary_ctc_weight: 0.3
    report_cer: True
    report_wer: True

# specaug conf
specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 27
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_ratio_range:
    - 0.
    - 0.05
    num_time_mask: 5

encoder_conf:
    main_conf:
      pos_wise_act_type: swish
      conv_mod_act_type: swish
      pos_enc_dropout_rate: 0.2
      dynamic_chunk_training: True
      short_chunk_size: 25
      left_chunk_size: 4
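      # Assumed semantics (not stated in this thread): with
      # dynamic_chunk_training, the chunk size is sampled randomly during
      # training (up to short_chunk_size frames), and left_chunk_size sets
      # how many previous chunks the attention can see, so one model can
      # be decoded at several latency settings.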
    input_conf:
      vgg_like: True
    body_conf:
    - block_type: conformer
      linear_size: 1024
      hidden_size: 256
      heads: 4
      dropout_rate: 0.1
      pos_wise_dropout_rate: 0.1
      att_dropout_rate: 0.1
      conv_mod_kernel_size: 31
      num_blocks: 18
decoder: rnn
decoder_conf:
    rnn_type: lstm
    num_layers: 1
    embed_size: 256
    hidden_size: 256
    dropout_rate: 0.1
    embed_dropout_rate: 0.2
joint_network_conf:
    joint_space_size: 256

Decoding conf (offline, I'm using mAES for online decoding):

beam_size: 5 # 10 produces slightly better results.
beam_search_config:
    search_type: default

Note that it's not tuned yet. Also, don't forget to set --asr_task asr_transducer in the run script (and --inference_asr_model valid.loss.ave_10best.pth if you use model averaging).

Edit: I forgot to mention that I used a single A100 for training.

@joazoa commented Sep 6, 2022

Thanks! I will give it a try.

@duj12 commented Nov 17, 2022

> Sure but I only have Librispeech-100 configs at hand right now. [...]

I used this config for an experiment on the AISHELL dataset; I call this config Dynamic_chunked_conformer+RNNT, but when decoding I set streaming=True.
I also used a config similar to the one @jhlee9010 mentioned, which I call Contextual_conformer+RNNT.
Comparing the results, I found that Contextual_conformer+RNNT is better than Dynamic_chunked_conformer+RNNT, which does not match the result @b-flo reported on Librispeech-100. I wonder, @b-flo, if your result is an offline one (with streaming=False)?

The results are as follows (without LM):

| model                          | decoding      | valid CER | test CER |
|--------------------------------|---------------|-----------|----------|
| Contextual_conformer+RNNT      | streaming     | 8.1       | 9.7      |
| Dynamic_chunked_conformer+RNNT | non-streaming | 5.5       | 6.0      |
| Dynamic_chunked_conformer+RNNT | streaming     | 13.1      | 14.9     |
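
For context, the streaming rows above toggle chunk-by-chunk decoding in the inference conf; a minimal sketch (only the streaming flag is taken from this discussion, the other keys and values are illustrative assumptions):

streaming: true   # chunk-by-chunk (online) decoding, as used in the rows above
beam_size: 10
beam_search_config:
    search_type: maes   # online-friendly search suggested earlier in the thread
    nstep: 1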

@b-flo (Member) commented Nov 17, 2022

@duj12 Hi, sorry, I was away from the project for a bit.

> Dynamic_chunked_conformer+RNNT, streaming: 13.1 / 14.9

If you're using true streaming decoding (streaming: True), there is a training/decoding mismatch right now; chunk-by-chunk decoding will perform poorly in the current version.

I'm preparing a patch, sorry about that.

mergify bot commented Dec 17, 2023

This pull request is now in conflict :(

@mergify bot added the conflicts label on Dec 17, 2023
mergify bot commented Feb 6, 2024

This pull request is now in conflict :(

Labels: ASR (Automatic speech recognition), conflicts, ESPnet1, ESPnet2, RNNT ((RNN) transducer related issue), Streaming

5 participants