Inference for streaming transducer #4548

Draft · wants to merge 9 commits into base: master
Conversation

@jhlee9010 (Contributor) commented Jul 30, 2022

Created a new PR as mentioned in #4530.
This is a draft version of the streaming transducer inference code.

I trained a contextual Conformer transducer on Librispeech-100 with the attached YAML file.
The beam search algorithm for the streaming transducer is mAES; other algorithms are not considered.

|            | CER  | WER  |
|------------|------|------|
| dev_clean  | 4.0  | 9.1  |
| dev_other  | 12.7 | 23.7 |
| test_clean | 4.1  | 9.6  |
| test_other | 13.0 | 24.7 |

Also, I attached sample test code, which is almost the same as
https://espnet.github.io/espnet/notebook/espnet2_streaming_asr_demo.html#Prepare-for-inference.
The RTF for the sample speech was 0.26.

sample.zip

@b-flo marked this pull request as a draft on July 30, 2022, 19:20
@b-flo (Member) commented Jul 30, 2022

Hi,

Thanks a lot for this contribution!

At a quick glance, I'm not sure this is finished, so I switched your PR to a draft for now. Also, I'm not sure if I should start reviewing, so please tell me when it's OK!

> |            | CER  | WER  |
> |------------|------|------|
> | dev_clean  | 4.0  | 9.1  |
> | dev_other  | 12.7 | 23.7 |
> | test_clean | 4.1  | 9.6  |
> | test_other | 13.0 | 24.7 |

It's for another PR, but I think some tuning may be needed. It can't be directly compared (different configs and streaming approaches), but with PR #4479 I currently have the following WER:

| dev-clean | dev-other | test-clean | test-other |
|-----------|-----------|------------|------------|
| 6.6       | 18.9      | 7.0        | 19.2       |

I can share my config if you want.

> RTF for sample speech was 0.26

It's also high but expected. I think you can safely lower the beam size (20 -> 10) and also modify the mAES parameters (the default ones are used, if I'm not mistaken?) with: nstep: 1, expansion_gamma: 1.5, expansion_beta: 1 (or 0), and prefix-alpha: 1.
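
For reference, a decoding conf with these suggested values might look like the sketch below (the beam_search_config layout and the exact key names are assumptions based on the parameters named above; check espnet2/asr_transducer/beam_search_transducer.py for the actual options):

beam_size: 10
beam_search_config:
    search_type: maes
    nstep: 1
    prefix_alpha: 1
    expansion_gamma: 1.5
    expansion_beta: 1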

@b-flo added the labels ASR (Automatic speech recognition), RNNT ((RNN) transducer related issue), and Streaming on Jul 30, 2022
@b-flo self-assigned this on Jul 30, 2022
@jhlee9010 (Contributor, Author) commented Jul 31, 2022

> It's for another PR, but I think some tuning may be needed. [...] I can share my config if you want.

Yes, please. It would be helpful.

> It's also high but expected. I think you can safely lower the beam size (20 -> 10) and also modify the mAES parameters (the default ones are used, if I'm not mistaken?) [...]

Right, I used the default mAES parameters. I'll check with the new mAES parameters soon.

I added several beam search algorithms (default, tsd, nsc, maes) in the last commit.
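
Switching between these algorithms should only require changing the decoding conf; a minimal sketch (max_sym_exp is an assumed tsd-specific parameter name, by analogy with the time-synchronous decoding literature, and is not confirmed from this PR):

beam_size: 10
beam_search_config:
    search_type: tsd   # one of: default, tsd, nsc, maes
    max_sym_exp: 3     # tsd only: max symbol expansions per time step (assumed name)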

@karthik19967829 (Contributor) commented

@jhlee9010 Thanks for this contribution. Could you share a link to your trained model?
Also, @b-flo, is there any way we can decode a non-streaming Conformer-RNNT (https://huggingface.co/espnet/chai_librispeech_asr_train_conformer-rnn_transducer_raw_en_bpe5000_sp) via this decoding strategy? I understand there will be a performance drop, but I just want to streaming-decode the available model.

@b-flo (Member) commented Aug 22, 2022

> Also, @b-flo, is there any way we can decode a non-streaming Conformer-RNNT [...] via this decoding strategy? [...] I just want to streaming-decode the available model.

Sorry, I'm a bit confused. Are you asking if we can do online decoding with a non-streaming model, or if a specific decoding strategy can be performed in an online manner?

Btw, you can ask Yifan for an online Transducer model. We're working on it, but the baseline online system with the new version is on par with the old (offline) one:

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_transducer_asr_model_valid.loss.ave_10best/dev_clean|2703|54402|97.9|1.9|0.2|0.2|2.3|29.0|
|decode_transducer_asr_model_valid.loss.ave_10best/dev_other|2864|50948|94.7|4.8|0.5|0.6|5.9|47.9|
|decode_transducer_asr_model_valid.loss.ave_10best/test_clean|2620|52576|97.7|2.1|0.2|0.3|2.6|31.1|
|decode_transducer_asr_model_valid.loss.ave_10best/test_other|2939|52343|94.7|4.8|0.5|0.7|6.0|48.9|

@karthik19967829 (Contributor) commented

@b-flo Yes, I would like to know how to perform online decoding of the non-streaming Conformer (https://huggingface.co/espnet/chai_librispeech_asr_train_conformer-rnn_transducer_raw_en_bpe5000_sp).

Thanks for the info on the online transducer; I will get in touch with Yifan.

@joazoa commented Sep 6, 2022

@b-flo Can you share the config you used? (I suppose it's for the new standalone version?)

@b-flo (Member) commented Sep 6, 2022

@joazoa For Librispeech or Librispeech-100?

@joazoa commented Sep 6, 2022

Thanks for the super quick reply!
For Librispeech if possible, but Librispeech-100 would help as well. (Any other config for the new version would help me too.)

@b-flo (Member) commented Sep 6, 2022

Sure but I only have Librispeech-100 configs at hand right now.

Training conf:

# general
batch_type: numel
batch_bins: 2000000
accum_grad: 16
max_epoch: 60 # 100 produces better results.
patience: none
init: none
num_att_plot: 0

# optimizer
optim: adam
optim_conf:
    lr: 0.002
    weight_decay: 0.000001
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 15000

# criterion
val_scheduler_criterion:
    - valid
    - loss
best_model_criterion:
-   - valid
    - loss
    - min
keep_nbest_models: 10 # 20 produces slightly better results.

model_conf:
    transducer_weight: 1.0
    auxiliary_ctc_weight: 0.3
    report_cer: True
    report_wer: True

# specaug conf
specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 27
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_ratio_range:
    - 0.
    - 0.05
    num_time_mask: 5

encoder_conf:
    main_conf:
      pos_wise_act_type: swish
      conv_mod_act_type: swish
      pos_enc_dropout_rate: 0.2
      dynamic_chunk_training: True
      short_chunk_size: 25
      left_chunk_size: 4
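      # Assumed semantics (not stated in this thread): with
      # dynamic_chunk_training, the chunk size is sampled randomly during
      # training (up to short_chunk_size frames), and left_chunk_size sets
      # how many previous chunks the attention can see, so one model can
      # be decoded at several latency settings.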
    input_conf:
      vgg_like: True
    body_conf:
    - block_type: conformer
      linear_size: 1024
      hidden_size: 256
      heads: 4
      dropout_rate: 0.1
      pos_wise_dropout_rate: 0.1
      att_dropout_rate: 0.1
      conv_mod_kernel_size: 31
      num_blocks: 18
decoder: rnn
decoder_conf:
    rnn_type: lstm
    num_layers: 1
    embed_size: 256
    hidden_size: 256
    dropout_rate: 0.1
    embed_dropout_rate: 0.2
joint_network_conf:
    joint_space_size: 256

Decoding conf (offline, I'm using mAES for online decoding):

beam_size: 5 # 10 produces slightly better results.
beam_search_config:
    search_type: default

Note that it's not tuned yet. Also, don't forget to set --asr_task asr_transducer in the run script (and --inference_asr_model valid.loss.ave_10best.pth if you use model averaging).

Edit: I forgot to mention that I used a single A100 for training.

@joazoa commented Sep 6, 2022

Thanks! I will give it a try.

@duj12 commented Nov 17, 2022

> Sure but I only have Librispeech-100 configs at hand right now. [...]

I used this config for an experiment on the AISHELL dataset; I call this config Dynamic_chunked_conformer+RNNT, but when decoding I set streaming=True.
I also used a config similar to the one @jhlee9010 mentioned, which I call Contextual_conformer+RNNT.
Comparing the results, I found that Contextual_conformer+RNNT is better than Dynamic_chunked_conformer+RNNT, which does not match the result @b-flo reported on Librispeech-100. I wonder, @b-flo, if your result is an offline one (with streaming=False)?

The results are as follows (without LM):

| model                          | decoding      | valid CER | test CER |
|--------------------------------|---------------|-----------|----------|
| Contextual_conformer+RNNT      | streaming     | 8.1       | 9.7      |
| Dynamic_chunked_conformer+RNNT | non-streaming | 5.5       | 6.0      |
| Dynamic_chunked_conformer+RNNT | streaming     | 13.1      | 14.9     |
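
For context, the streaming rows above toggle chunk-by-chunk decoding in the inference conf; a minimal sketch (only the streaming flag is taken from this discussion, the other keys and values are illustrative assumptions):

streaming: true   # chunk-by-chunk (online) decoding, as used in the rows above
beam_size: 10
beam_search_config:
    search_type: maes   # online-friendly search suggested earlier in the thread
    nstep: 1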

@b-flo (Member) commented Nov 17, 2022

@duj12 Hi, sorry, I was away from the project for a bit.

> Dynamic_chunked_conformer+RNNT, streaming: 13.1 / 14.9

If you're using true streaming decoding (streaming: True), there is a training/decoding mismatch right now; chunk-by-chunk decoding will perform poorly in the current version.

I'm preparing a patch, sorry about that.

mergify bot commented Dec 17, 2023

This pull request is now in conflict :(

@mergify bot added the conflicts label on Dec 17, 2023
mergify bot commented Feb 6, 2024

This pull request is now in conflict :(

Labels: ASR (Automatic speech recognition), conflicts, ESPnet1, ESPnet2, RNNT ((RNN) transducer related issue), Streaming

5 participants