
EDACC dataset automatic speech recognition #5996

Merged
34 commits merged into espnet:master on Jan 2, 2025

Conversation

uwanny
Contributor

@uwanny uwanny commented Dec 25, 2024

What?

Add an ASR recipe for the EDACC dataset, an accented-English speech dataset (website), trained using WavLM + Transformer.

Why?

There are few accented-English speech corpora in the ESPnet framework.

See also

Specification:

  1. The provided data has only a dev set and a test set, so part of the dev set is used as the train set in the data prep stage; the number of utterances assigned to the train set can be configured in data_prep.py (a sketch of this split follows this list).
  2. To make dumping data in stage 3 convenient and avoid OOM errors from loading overly large wav files, overly long wav files are split in data.sh.
  3. To make decoding the test set in stage 11 convenient and avoid OOM errors from overly long utterances, specific utterances and alignment rules (obtainable from CTC segmentation tools) can be set in truncate_test.py to split the test utterances.
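
For illustration, here is a minimal sketch of the dev-set split from item 1. The file paths and the NUM_TRAIN_UTTS knob are hypothetical stand-ins, not the actual data_prep.py interface:

```python
import os

NUM_TRAIN_UTTS = 8000  # hypothetical stand-in for the utterance count set in data_prep.py

# Read the Kaldi-style segments file of the EDACC dev set (path is illustrative).
with open("data/dev/segments") as f:
    utts = [line.rstrip("\n") for line in f]

# The first NUM_TRAIN_UTTS utterances become the train set ("dev_train");
# the remainder stays as the valid set ("dev_non_train"), matching run.sh.
splits = {
    "dev_train": utts[:NUM_TRAIN_UTTS],
    "dev_non_train": utts[NUM_TRAIN_UTTS:],
}

for name, subset in splits.items():
    os.makedirs(f"data/{name}", exist_ok=True)
    with open(f"data/{name}/segments", "w") as f:
        f.writelines(u + "\n" for u in subset)
```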

Problem:

  1. Training with WavLM + Transformer works: training accuracy increased during training and the test-set WER decreased, so the training stage clearly helps. The test-set WER is 51, which is higher than the WERs reported in the paper (18.7-36.1); however, the paper uses well-established pretrained models for inference, which improves its results. I plan to submit the PR at the current stage and to look for better methods if time permits.

Reference:

[1] Sanabria, R., Bogoychev, N., Markl, N., Carmantini, A., Klejch, O., & Bell, P. (2023). The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR. In ICASSP 2023.

@mergify mergify bot added the ESPnet2 label Dec 25, 2024

codecov bot commented Dec 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 53.16%. Comparing base (964b19e) to head (2fe91b4).
Report is 35 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #5996       +/-   ##
===========================================
+ Coverage   14.93%   53.16%   +38.22%     
===========================================
  Files         828      626      -202     
  Lines       77969    59204    -18765     
===========================================
+ Hits        11644    31475    +19831     
+ Misses      66325    27729    -38596     
|Flag|Coverage Δ|
|---|---|
|test_integration_espnet1|62.52% <ø> (?)|
|test_integration_espnet2|47.49% <ø> (?)|
|test_python_espnetez|?|
|test_utils|?|



@mergify mergify bot added the README label Dec 26, 2024
@sw005320
Contributor

Can you add some discussion of how you use the dev set (you split it into train and valid, right?)?
Also, please describe your treatment of the long segments.

@sw005320 sw005320 added Recipe ASR Automatic speech recognition labels Dec 26, 2024
@sw005320 sw005320 added this to the v.202503 milestone Dec 26, 2024
@sw005320 sw005320 requested a review from ftshijt December 26, 2024 22:40
Collaborator

@ftshijt ftshijt left a comment


Thanks! Please also add your entry to egs2/README.md for the new data.

egs2/edacc/asr1/local/data.sh (outdated; review thread resolved)
egs2/edacc/asr1/local/data_prep.py (outdated; review thread resolved)
test_set="test test_sub"
train_set="dev_train"
valid_set="dev_non_train"
nbpe=3884 # a BPE vocabulary size of 3884 covers all the sentences in the EDACC dataset
Collaborator


The number seems weird to me. Could you elaborate a bit on why 3884 was selected?

Contributor Author

@uwanny uwanny Dec 27, 2024


Because in stage 5, when trying a vocab size of 5000, the BPE training indicates that the dataset does not have sufficient diversity to support a vocabulary size of 5000 and that the maximum feasible vocabulary size for this dataset is 3884.
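
For reference, this limit can be reproduced with a small SentencePiece sketch (the input path and model prefix are placeholders): requesting a vocabulary size beyond what the corpus supports makes the trainer fail with an error naming the maximum feasible size.

```python
import sentencepiece as spm

try:
    spm.SentencePieceTrainer.train(
        input="dump/raw/dev_train/text_for_bpe.txt",  # placeholder path
        model_prefix="bpe_5000",
        vocab_size=5000,
        model_type="bpe",
    )
except RuntimeError as e:
    # SentencePiece reports the largest vocabulary size the corpus can
    # support (3884 for this data, per the discussion above).
    print(e)
```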

Collaborator


Usually, we use a smaller vocab size, since BPE is meant for word pieces.

If you use 3884, the model effectively returns to a word-based model (this could be one major cause of the poor WER you got from the model, as words are very sparse in a small dataset).

Given that the dataset is small, I would suggest you go with a smaller vocab size (e.g., 500 or even 100). I expect it could make a significant improvement to the model's performance.
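
To see the word-piece effect of a smaller vocabulary, one could tokenize a sample sentence with a 500-token model (the model path is a placeholder and the output shown is illustrative):

```python
import sentencepiece as spm

# Assumes a 500-token BPE model trained on the EDACC text, e.g. with the
# sketch above but vocab_size=500.
sp = spm.SentencePieceProcessor(model_file="bpe_500.model")
pieces = sp.encode("the edinburgh international accents of english corpus",
                   out_type=str)
print(pieces)  # most words split into subword pieces instead of whole words
```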

egs2/edacc/asr1/run.sh (outdated; review thread resolved)

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|9300|163389|56.1|31.3|12.6|7.6|51.5|87.3|
Collaborator


The WER seems much worse than what the paper reported. Is there a possible reason for that?

Contributor Author


The reason is here.

@ftshijt
Collaborator

ftshijt commented Dec 27, 2024

As I mentioned in the detailed discussion, I feel the current performance is largely degraded by the large vocab size. It is very likely that the system will get better if you use a smaller vocab size for training.

@uwanny
Contributor Author

uwanny commented Dec 27, 2024

As I mentioned in the detailed discussion, I feel the current performance is largely degraded by the large vocab size. It is very likely that the system will get better if you use a smaller vocab size for training.

I will try to retrain it. Thank you!

@uwanny
Contributor Author

uwanny commented Dec 30, 2024

After setting the BPE size to 500 and running some ablation experiments, the training stage of the model is a little better than before, as shown in the accuracy figure below (a larger BPE size only achieved lower accuracy):
image
The WER curve shows large fluctuations:
image

However, the actual test-set WER does not show a big improvement (it is comparable to the numbers reported in the article). I think there are two reasons. First, the article decodes the audio in 30-second chunks, but the released data does not provide transcriptions segmented that way, which means the provided test set is split differently from the paper's. Second, even though a direct comparison with the paper is therefore not very meaningful, the current test-set WER is still somewhat high compared with typical ASR baselines; after trying several parameter settings, the best test result is the one using BPE 500, whose training curves are shown in the two figures above (decreasing the BPE size to 100 raises training accuracy a little more, but in my tests it does not guarantee improved test performance). This model achieves:

WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|9300|163389|59.5|29.1|11.4|6.7|47.2|88.3|

CER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|9300|792343|76.0|11.0|13.0|8.3|32.3|88.3|

TER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|9300|268438|62.1|26.2|11.6|12.2|50.1|88.3|

This is better than the previous result:

WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|9300|163389|56.1|31.3|12.6|7.6|51.5|87.3|

CER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|9300|792343|68.9|13.5|17.7|8.3|39.4|87.3|

TER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|9300|206542|54.1|29.0|16.9|9.2|55.1|87.3|

Based on the previous comments and discussion, I think the model achieves the best result under the current circumstances. The training config used for the best model is below; please give some advice if you have any, otherwise I will update the recipe with my current result:

freeze_param: [
"frontend.upstream"
]

frontend: s3prl
frontend_conf:
    frontend_conf:
        upstream: wavlm_base_plus
    download_dir: ./hub
    multilayer_feature: True

preencoder: linear
preencoder_conf:
    input_size: 768  # Note: If the upstream is changed, please change this value accordingly.
    output_size: 80

encoder: transformer
encoder_conf:
    output_size: 256
    attention_heads: 4
    linear_units: 1024
    num_blocks: 6
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    input_layer: conv2d2
    normalize_before: true

decoder: transformer
decoder_conf:
    attention_heads: 4
    linear_units: 2048
    num_blocks: 4
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.1
    src_attention_dropout_rate: 0.1

model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false
    extract_feats_in_collect_stats: false

seed: 2022
log_interval: 400
num_att_plot: 0
num_workers: 4
sort_in_batch: descending
sort_batch: descending
batch_type: numel
batch_bins: 12000000
accum_grad: 4
max_epoch: 160
patience: none
init: none
best_model_criterion:
-   - valid
    - acc
    - max
keep_nbest_models: 4

use_amp: true
cudnn_deterministic: false
cudnn_benchmark: false


optim: adam
optim_conf:
    lr: 0.008
    weight_decay: 0.001
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 1000


specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 27
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_ratio_range:
    - 0.
    - 0.05
    num_time_mask: 5

@ftshijt
Collaborator

ftshijt commented Dec 30, 2024

Thanks for the update! I feel the results should be sufficient now. Going through the paper, I see that the data is designed for evaluation purposes only; that is, they do allow additional training data to be involved. Therefore, it is quite reasonable to have the current results.

Thanks for your great contribution so far. Please finish the PR by updating the config to the latest~

@ftshijt
Collaborator

ftshijt commented Jan 2, 2025

Thanks for your contribution!

@ftshijt ftshijt merged commit ef6740c into espnet:master Jan 2, 2025
40 checks passed