Inference for streaming transducer #4548
base: master
Conversation
Hi, thanks a lot for this contribution! At a quick glance I'm not sure this is finished, so I switched your PR to a draft for now. Also, I'm not sure if I should start reviewing, so please tell me when it's OK!
It's for another PR, but I think some tuning may be needed. It can't be directly compared (different configs and streaming approaches), but with PR #4479 I currently have the following WER:
I can share my config if you want.
It's also high but expected. I think you can safely lower beam-size (20 -> 10) and also modify the mAES parameters (the default ones are used if I'm not mistaken?) with:
Yes, please. It would be helpful.
Right, I used the default mAES parameters. I'll check with new mAES parameters soon. I added several beam search algorithms (default, tsd, nsc, maes) in the last commit.
@jhlee9010 thanks for this contribution. Could you share a link to your trained model?
Sorry, I'm a bit confused. Are you asking if we can do online decoding with a non-streaming model, or if a specific decoding strategy can be performed in an online manner? Btw, you can ask Yifan for an online Transducer model. We're working on it, but the baseline online system with the new version is on par with the old (offline) one:
@b-flo yes, I would like to know how to perform online decoding with the non-streaming Conformer https://huggingface.co/espnet/chai_librispeech_asr_train_conformer-rnn_transducer_raw_en_bpe5000_sp . Thanks for the info on the online Transducer, I will get in touch with Yifan.
@b-flo can you share the config you used? (I suppose it's for the new standalone version?)
@joazoa For Librispeech or Librispeech-100?
Thanks for the super quick reply!
Sure, but I only have Librispeech-100 configs at hand right now.

Training conf:

# general
batch_type: numel
batch_bins: 2000000
accum_grad: 16
max_epoch: 60 # 100 produces better results.
patience: none
init: none
num_att_plot: 0
# optimizer
optim: adam
optim_conf:
lr: 0.002
weight_decay: 0.000001
scheduler: warmuplr
scheduler_conf:
warmup_steps: 15000
# criterion
val_scheduler_criterion:
- valid
- loss
best_model_criterion:
- - valid
- loss
- min
keep_nbest_models: 10 # 20 produces slightly better results.
model_conf:
transducer_weight: 1.0
auxiliary_ctc_weight: 0.3
report_cer: True
report_wer: True
# specaug conf
specaug: specaug
specaug_conf:
apply_time_warp: true
time_warp_window: 5
time_warp_mode: bicubic
apply_freq_mask: true
freq_mask_width_range:
- 0
- 27
num_freq_mask: 2
apply_time_mask: true
time_mask_width_ratio_range:
- 0.
- 0.05
num_time_mask: 5
encoder_conf:
main_conf:
pos_wise_act_type: swish
conv_mod_act_type: swish
pos_enc_dropout_rate: 0.2
dynamic_chunk_training: True
short_chunk_size: 25
left_chunk_size: 4
input_conf:
vgg_like: True
body_conf:
- block_type: conformer
linear_size: 1024
hidden_size: 256
heads: 4
dropout_rate: 0.1
pos_wise_dropout_rate: 0.1
att_dropout_rate: 0.1
conv_mod_kernel_size: 31
num_blocks: 18
decoder: rnn
decoder_conf:
rnn_type: lstm
num_layers: 1
embed_size: 256
hidden_size: 256
dropout_rate: 0.1
embed_dropout_rate: 0.2
joint_network_conf:
  joint_space_size: 256

Decoding conf (offline, I'm using mAES for online decoding):

beam_size: 5 # 10 produces slightly better results.
beam_search_config:
    search_type: default

Note that it's not tuned yet. Also, don't forget to set

Edit: Forgot to mention I used a single A100 for training.
Thanks! I will give it a try.
I used this config to do an experiment with the AISHELL dataset, and I call this config Dynamic_chunked_conformer+RNNT, but when decoding I set

The results are as follows (without LM):
@duj12 Hi, sorry I was away from the project for a bit.
If you're using true streaming decoding (

I'm preparing a patch, sorry about that.
This pull request is now in conflict :(
Create a new PR as mentioned in #4530.
This is a draft version of the streaming Transducer inference code.
I trained a contextual Conformer-Transducer for Librispeech-100 with the attached yaml file.
The beam search algorithm for the streaming Transducer is mAES; other algorithms are not considered.
I also attached a sample test script which is almost the same as
https://espnet.github.io/espnet/notebook/espnet2_streaming_asr_demo.html#Prepare-for-inference.
RTF for sample speech was 0.26.
sample.zip
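The chunk-by-chunk feeding used in the linked streaming demo can be sketched with a small, self-contained helper. Hedged sketch: `Speech2TextStreaming` and its exact call signature are not shown in this thread, so a stub recognizer stands in for the real ESPnet class; only the chunking loop and the RTF arithmetic are illustrated.

```python
import time

def stream_chunks(samples, chunk_size):
    """Split a waveform into fixed-size chunks, flagging the last one.

    Mirrors the feeding loop in the streaming demo, where each chunk is
    passed to the recognizer and the final call sets is_final=True.
    """
    for start in range(0, len(samples), chunk_size):
        chunk = samples[start:start + chunk_size]
        yield chunk, start + chunk_size >= len(samples)

class StubStreamingASR:
    """Stand-in for a streaming recognizer (e.g. Speech2TextStreaming)."""
    def __init__(self):
        self.n_samples = 0

    def __call__(self, chunk, is_final=False):
        self.n_samples += len(chunk)
        # A real recognizer would return partial hypotheses here.
        return "<final>" if is_final else f"<partial over {self.n_samples} samples>"

if __name__ == "__main__":
    sr = 16000
    speech = [0.0] * (sr * 2)           # 2 s of dummy audio
    asr = StubStreamingASR()
    t0 = time.perf_counter()
    for chunk, is_final in stream_chunks(speech, chunk_size=sr // 2):
        hyp = asr(chunk, is_final=is_final)
    elapsed = time.perf_counter() - t0
    # RTF = decoding time / audio duration (0.26 was reported in this PR)
    rtf = elapsed / (len(speech) / sr)
    print(hyp, f"RTF={rtf:.4f}")
```

The reported RTF of 0.26 means decoding took about a quarter of the audio duration, i.e. comfortably faster than real time.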