EDACC dataset automatic speech recognition #5996
Codecov Report: All modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##           master    #5996      +/-   ##
===========================================
+ Coverage   14.93%   53.16%   +38.22%
===========================================
  Files         828      626      -202
  Lines       77969    59204    -18765
===========================================
+ Hits        11644    31475    +19831
+ Misses      66325    27729    -38596
Can you add some discussion about how you use the dev set (you split it into train and valid, right?)?
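A minimal sketch of one way the official dev set could be split using the Kaldi-style utilities that ESPnet recipes ship; the subset size and the speaker-disjoint strategy here are assumptions, not necessarily what this PR does:

```sh
# Hypothetical split of the official EDACC dev set (data/dev) into a
# training part and a held-out validation part; size and strategy assumed.
utils/subset_data_dir.sh --speakers data/dev 8000 data/dev_train
utils/copy_data_dir.sh data/dev data/dev_non_train
# Keep only the utterances NOT selected for dev_train, then re-sync
# wav.scp/text/etc. against the reduced utt2spk.
utils/filter_scp.pl --exclude data/dev_train/utt2spk data/dev/utt2spk \
    > data/dev_non_train/utt2spk
utils/fix_data_dir.sh data/dev_non_train
```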
Thanks! Please also add your entry to egs2/README.md for the new data.
egs2/edacc/asr1/run.sh
test_set="test test_sub"
train_set="dev_train"
valid_set="dev_non_train"
nbpe=3884  # a BPE vocabulary size of 3884 covers every sentence in the EDACC dataset
The number seems weird to me. Could you please elaborate a bit on why 3884 is selected?
Because in stage 5, when I tried a vocabulary size of 5000, the BPE training reported that the dataset does not have sufficient diversity to support a vocabulary of 5000; the maximum feasible vocabulary size for this dataset is 3884.
Usually we use a smaller vocab size, since BPE is meant to produce word pieces.
If you use 3884, the model effectively falls back to a word-based model (this could be one major cause of the poor WER you got, as words are very sparse in a small dataset).
Given that the corpus is small, I would suggest going with a smaller vocab size (e.g., 500 or even 100). I expect it could significantly improve the model's performance.
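To make the vocabulary cap concrete, here is a hedged sketch of probing it directly with SentencePiece, which ESPnet's stage 5 relies on; the text paths are illustrative assumptions, not the recipe's actual dump locations:

```sh
# Illustrative sketch (assumed paths): extract the transcripts and train
# a BPE model directly with SentencePiece's command-line trainer.
cut -d' ' -f2- dump/raw/dev_train/text > train_text.txt

# Requesting 5000 pieces on EDACC aborts with an error along the lines of
# "Vocabulary size too high (5000). Please set it to a value <= 3884."
# A smaller size such as 500 trains fine and yields true subword pieces.
spm_train --input=train_text.txt \
          --model_prefix=bpe500 \
          --model_type=bpe \
          --character_coverage=1.0 \
          --vocab_size=500
```

In the recipe itself, the equivalent change is simply setting nbpe=500 in run.sh.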
egs2/edacc/asr1/README.md
|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_asr_model_valid.acc.ave/test|9300|163389|56.1|31.3|12.6|7.6|51.5|87.3|
The WER seems much worse than what the paper stated. Is there a possible reason for that?
The reason is here.
As I mentioned in the detailed discussion, I feel the current performance is largely degraded by the large vocab size. It is very likely that the system will get better if you use a smaller vocabulary for training.
I will try to retrain it, thank you.
Thanks for the update! I feel the results should be sufficient now. As I go through the paper, I see that the data is designed for evaluation purposes only; that is to say, they do allow additional training data to be involved. Therefore, it is quite reasonable to have the current results. Thanks for your great contribution so far. Please finish the PR by updating the config to the latest~
Thanks for your contribution!
What?
Add an ASR recipe for the EDACC dataset, an accented English speech dataset (website), trained using WavLM + Transformer.
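For reference, a minimal sketch of how such a recipe is typically wired up in ESPnet2's egs2 layout; the config file names and option values here are assumptions, not necessarily the PR's exact settings:

```sh
#!/usr/bin/env bash
# Hypothetical egs2/edacc/asr1/run.sh sketch (names and values assumed).
set -euo pipefail

train_set="dev_train"      # training portion of the official EDACC dev set
valid_set="dev_non_train"  # held-out portion of the dev set
test_sets="test test_sub"

./asr.sh \
    --lang en \
    --ngpu 1 \
    --nbpe 500 \
    --train_set "${train_set}" \
    --valid_set "${valid_set}" \
    --test_sets "${test_sets}" \
    --asr_config conf/tuning/train_asr_wavlm_transformer.yaml \
    --inference_config conf/decode_asr.yaml \
    "$@"
```

Here conf/tuning/train_asr_wavlm_transformer.yaml stands in for whatever config pairs the WavLM frontend with a Transformer encoder-decoder.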
Why?
There are few accented English speech corpora in the ESPnet framework.
See also
Reference:
[1] Sanabria, R., Bogoychev, N., Markl, N., Carmantini, A., Klejch, O., & Bell, P. (2023). The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR. In ICASSP 2023.