
EDACC dataset automatic speech recognition #5996

Merged
34 commits merged into espnet:master on Jan 2, 2025

Conversation

uwanny
Contributor

@uwanny uwanny commented Dec 25, 2024

What?

Add an ASR recipe for the EDACC dataset, an accented-English speech dataset (website), trained using WavLM + Transformer.

Why?

There are few accented-English speech corpora in the ESPnet framework.

See also

Specification:

  1. The provided data has only a dev set and a test set, so part of the dev set is used as the train set in the data prep stage; the number of utterances assigned to the train set can be configured in data_prep.py (a sketch of this split follows this list).
  2. To make dumping data in stage 3 convenient and avoid OOM errors from loading overly large wav files, overly long wav files are split in data.sh.
  3. To make decoding the test set in stage 11 convenient and avoid OOM errors from overly long utterances, specific utterances and alignment rules (obtainable from CTC segmentation tools) can be set in truncate_test.py to split the test utterances.
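
For illustration, here is a minimal sketch of the dev-set split from item 1. The file paths and the NUM_TRAIN_UTTS knob are hypothetical stand-ins, not the actual data_prep.py interface:

```python
import os

NUM_TRAIN_UTTS = 8000  # hypothetical stand-in for the utterance count set in data_prep.py

# Read the Kaldi-style segments file of the EDACC dev set (path is illustrative).
with open("data/dev/segments") as f:
    utts = [line.rstrip("\n") for line in f]

# The first NUM_TRAIN_UTTS utterances become the train set ("dev_train");
# the remainder stays as the valid set ("dev_non_train"), matching run.sh.
splits = {
    "dev_train": utts[:NUM_TRAIN_UTTS],
    "dev_non_train": utts[NUM_TRAIN_UTTS:],
}

for name, subset in splits.items():
    os.makedirs(f"data/{name}", exist_ok=True)
    with open(f"data/{name}/segments", "w") as f:
        f.writelines(u + "\n" for u in subset)
```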

Problem:

  1. Training with WavLM + Transformer works: training accuracy increased during training and the test-set WER decreased, so the training stage clearly helps. The test-set WER is 51, which is higher than the WERs reported in the paper (18.7-36.1); however, the paper uses well-established pretrained models for inference, which improves its results. I plan to submit the PR at the current stage and to look for better methods if time permits.

Reference:

[1] Sanabria, R., Bogoychev, N., Markl, N., Carmantini, A., Klejch, O., & Bell, P. (2023). The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR. In ICASSP 2023.

@mergify mergify bot added the ESPnet2 label Dec 25, 2024

codecov bot commented Dec 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 53.16%. Comparing base (964b19e) to head (2fe91b4).
Report is 35 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #5996       +/-   ##
===========================================
+ Coverage   14.93%   53.16%   +38.22%     
===========================================
  Files         828      626      -202     
  Lines       77969    59204    -18765     
===========================================
+ Hits        11644    31475    +19831     
+ Misses      66325    27729    -38596     
|Flag|Coverage Δ|
|---|---|
|test_integration_espnet1|62.52% <ø> (?)|
|test_integration_espnet2|47.49% <ø> (?)|
|test_python_espnetez|?|
|test_utils|?|



@mergify mergify bot added the README label Dec 26, 2024
@sw005320
Contributor

Can you add some discussion of how you use the dev set (you split it into train and valid, right?)?
Also, please describe your treatment of the long segments.

@sw005320 sw005320 added Recipe ASR Automatic speech recognition labels Dec 26, 2024
@sw005320 sw005320 added this to the v.202503 milestone Dec 26, 2024
@sw005320 sw005320 requested a review from ftshijt December 26, 2024 22:40
Collaborator

@ftshijt ftshijt left a comment


Thanks! Please also add your entry to egs2/README.md for the new data.

egs2/edacc/asr1/local/data.sh (outdated; review thread resolved)
egs2/edacc/asr1/local/data_prep.py (outdated; review thread resolved)
test_set="test test_sub"
train_set="dev_train"
valid_set="dev_non_train"
nbpe=3884 # a BPE vocabulary size of 3884 covers all the sentences in the EDACC dataset
Collaborator


The number seems weird to me. Could you elaborate a bit on why 3884 was selected?

Contributor Author

@uwanny uwanny Dec 27, 2024


Because in stage 5, when trying a vocab size of 5000, the BPE training indicates that the dataset does not have sufficient diversity to support a vocabulary size of 5000 and that the maximum feasible vocabulary size for this dataset is 3884.
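
For reference, this limit can be reproduced with a small SentencePiece sketch (the input path and model prefix are placeholders): requesting a vocabulary size beyond what the corpus supports makes the trainer fail with an error naming the maximum feasible size.

```python
import sentencepiece as spm

try:
    spm.SentencePieceTrainer.train(
        input="dump/raw/dev_train/text_for_bpe.txt",  # placeholder path
        model_prefix="bpe_5000",
        vocab_size=5000,
        model_type="bpe",
    )
except RuntimeError as e:
    # SentencePiece reports the largest vocabulary size the corpus can
    # support (3884 for this data, per the discussion above).
    print(e)
```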

Collaborator


Usually, we use a smaller vocab size, since BPE is meant for word pieces.

If you use 3884, the model effectively returns to a word-based model (this could be one major cause of the poor WER you got from the model, as words are very sparse in a small dataset).

Given that the dataset is small, I would suggest you go with a smaller vocab size (e.g., 500 or even 100). I expect it could make a significant improvement to the model's performance.
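
To see the word-piece effect of a smaller vocabulary, one could tokenize a sample sentence with a 500-token model (the model path is a placeholder and the output shown is illustrative):

```python
import sentencepiece as spm

# Assumes a 500-token BPE model trained on the EDACC text, e.g. with the
# sketch above but vocab_size=500.
sp = spm.SentencePieceProcessor(model_file="bpe_500.model")
pieces = sp.encode("the edinburgh international accents of english corpus",
                   out_type=str)
print(pieces)  # most words split into subword pieces instead of whole words
```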

egs2/edacc/asr1/run.sh (outdated; review thread resolved)

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|9300|163389|56.1|31.3|12.6|7.6|51.5|87.3|
Collaborator


The WER seems much worse than what the paper reported. Is there a possible reason for that?

Contributor Author


The reason is here.

@ftshijt
Collaborator

ftshijt commented Dec 27, 2024

As I mentioned in the detailed discussion, I feel the current performance is largely degraded by the large vocab size. It is very likely that the system will get better if you use a smaller vocab size for training.

@uwanny
Contributor Author

uwanny commented Dec 27, 2024

As I mentioned in the detailed discussion, I feel the current performance is largely degraded by the large vocab size. It is very likely that the system will get better if you use a smaller vocab size for training.

I will try to retrain it. Thank you!

@uwanny
Contributor Author

uwanny commented Dec 30, 2024

After setting the BPE size to 500 and running some ablation experiments, the training stage of the model is a little better than before, as shown in the accuracy figure below (a larger BPE size only achieved lower accuracy):
image
The WER curve shows large fluctuations:
image

However, the actual test-set WER does not show a big improvement (it is comparable to the numbers reported in the article). I think there are two reasons. First, the article decodes the audio in 30-second chunks, but the released data does not provide transcriptions segmented that way, which means the provided test set is split differently from the paper's. Second, even though a direct comparison with the paper is therefore not very meaningful, the current test-set WER is still somewhat high compared with typical ASR baselines; after trying several parameter settings, the best test result is the one using BPE 500, whose training curves are shown in the two figures above (decreasing the BPE size to 100 raises training accuracy a little more, but in my tests it does not guarantee improved test performance). This model achieves:

WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|9300|163389|59.5|29.1|11.4|6.7|47.2|88.3|

CER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|9300|792343|76.0|11.0|13.0|8.3|32.3|88.3|

TER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|9300|268438|62.1|26.2|11.6|12.2|50.1|88.3|

This is better than the previous result:

WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|9300|163389|56.1|31.3|12.6|7.6|51.5|87.3|

CER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|9300|792343|68.9|13.5|17.7|8.3|39.4|87.3|

TER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|9300|206542|54.1|29.0|16.9|9.2|55.1|87.3|

Based on the previous comments and discussion, I think the model achieves the best result under the current circumstances. The training config used for the best model is below; please give some advice if you have any, otherwise I will update the recipe with my current result:

freeze_param: [
"frontend.upstream"
]

frontend: s3prl
frontend_conf:
    frontend_conf:
        upstream: wavlm_base_plus
    download_dir: ./hub
    multilayer_feature: True

preencoder: linear
preencoder_conf:
    input_size: 768  # Note: If the upstream is changed, please change this value accordingly.
    output_size: 80

encoder: transformer
encoder_conf:
    output_size: 256
    attention_heads: 4
    linear_units: 1024
    num_blocks: 6
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    input_layer: conv2d2
    normalize_before: true

decoder: transformer
decoder_conf:
    attention_heads: 4
    linear_units: 2048
    num_blocks: 4
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.1
    src_attention_dropout_rate: 0.1

model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false
    extract_feats_in_collect_stats: false

seed: 2022
log_interval: 400
num_att_plot: 0
num_workers: 4
sort_in_batch: descending
sort_batch: descending
batch_type: numel
batch_bins: 12000000
accum_grad: 4
max_epoch: 160
patience: none
init: none
best_model_criterion:
-   - valid
    - acc
    - max
keep_nbest_models: 4

use_amp: true
cudnn_deterministic: false
cudnn_benchmark: false


optim: adam
optim_conf:
    lr: 0.008
    weight_decay: 0.001
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 1000


specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 27
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_ratio_range:
    - 0.
    - 0.05
    num_time_mask: 5

@ftshijt
Collaborator

ftshijt commented Dec 30, 2024

Thanks for the update! I feel the results should be sufficient now. Going through the paper, I see that the data is designed for evaluation purposes only; that is, they do allow additional training data to be involved. Therefore, it is quite reasonable to have the current results.

Thanks for your great contribution so far. Please finish the PR by updating the config to the latest~

@ftshijt
Collaborator

ftshijt commented Jan 2, 2025

Thanks for your contribution!

@ftshijt ftshijt merged commit ef6740c into espnet:master Jan 2, 2025
40 checks passed