add centralized data preparation for OWSM #5478
Conversation
Codecov Report: all modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff           @@
##           master    #5478   +/-  ##
=======================================
  Coverage   76.54%   76.54%
=======================================
  Files         720      720
  Lines       66599    66602      +3
=======================================
+ Hits        50975    50978      +3
  Misses      15624    15624
```
Thanks! Looking forward to v3.
egs2/mixed_v1/s2t1/local/data.sh (outdated)

```bash
train_sets="data/GigaST/XL.en-* \
            data/MLS/train.* \
            data/WenetSpeech/L"
# question (jinchuan): why not include GigaST-dev?
```
I remember GigaST does not have a dev set.
Great. I'll remove this question
How did you deal with the wide characters, including the space symbols?
A summary so far: there is a shared TODO (see the TODO list above). Answers to the questions above: (1) we try to keep the original text data as-is, so we haven't done any special operations on wide characters; (2) `string.split()` and `" ".join()` are used repeatedly, so multiple spaces and \t may be replaced by a single space.
Can you add some info to https://github.com/espnet/espnet/blob/master/egs2/mixed_v3/s2t1/README.md?

Regarding (1): this is risky, as each corpus has a different annotation policy (e.g., punctuation, special characters like noise tags). Please make sure to make it consistent by taking a look at the preprocessed data for each corpus.

Regarding (2): what happened to the wide-character space, then? Can `string.split()` deal with the wide-character space?
Some corpora indeed contain special white spaces; that's why I applied `string.split()`. I think it works: after applying it, I did not see any warnings or errors about space characters. But please correct me if I am wrong.
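A minimal Python check (not part of the PR) of the behavior described here: `str.split()` with no separator splits on any run of Unicode whitespace, including the full-width ideographic space U+3000, so `" ".join(text.split())` collapses wide spaces, tabs, and repeated spaces into single ASCII spaces.

```python
# str.split() with no argument treats any Unicode whitespace as a separator,
# including U+3000 (IDEOGRAPHIC SPACE), tabs, and runs of ASCII spaces.
text = "こんにちは\u3000世界\thello   world"
tokens = text.split()
print(tokens)            # ['こんにちは', '世界', 'hello', 'world']
print(" ".join(tokens))  # 'こんにちは 世界 hello world' -- single ASCII spaces
```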
I think you're right.

I think it's better to discuss how to do the text normalization (punctuation, wide characters, multiple spaces, etc.). We also need to take care of some edge cases for audio. After we fix the policy, we can apply it to all datasets. I think we can write the README file after we finish the scripts, to avoid further revisions.
Can you add some TODOs and discussion here?
It's at the top of this PR. Please review it.

Please let me know if this PR is ready to be merged.

After the WSJ case is solved as discussed in Slack, I think this PR is ready for merge.

Thanks a lot, @jctian98!
What?

This PR responds to issue #5469 by providing `data.sh` and related files for OWSM recipes from v1 to v3. The main modifications are under the `<espnet_path>/egs2/owsm_v*` directory.

User Guidance for Data Preparation (copied from README.md)
(1) Please work progressively from v1 to v3: this means you need to prepare the data for v1, v2, and v3 in order to obtain the full v3 data. To start the data preparation, run

```bash
bash local/data.sh --VERSION v1 # or v2, v3
```

(2) Please revise `db.sh` for all datasets before running `local/data.sh`. Some datasets cannot be downloaded and untarred automatically due to license issues; users should take care of these themselves.

(3) Due to the large volume of data, we are not confident that the scripts will run smoothly for every dataset. Please raise an issue if you believe there is a bug.

(4) This script only prepares data for the train and valid subsets. Test data should be prepared separately following the conventional ESPnet2 format.

(5) Even though we provide this centralized data preparation script and combine all datasets in it, we strongly recommend users NOT use the merged train_v* and valid_v* for feature extraction. Instead, run stages 2-4 for each dataset separately and combine all datasets under the `dump/raw` directory. This will allow you to handle all datasets simultaneously, and inspection and debugging will also be easier. This is exactly what we did in our experiments.

(6) The detailed data list is in `local/data.sh`. Also see: https://arxiv.org/pdf/2309.13876.pdf

List of datasets
V1: Aishell, CoVoST2, GigaSpeech, LibriSpeech, MuST-C, SPGISpeech, TEDLIUM3
V2: all in V1, GigaST, Multilingual LibriSpeech, WenetSpeech
V3: all in V2, AIDATATANG, AMI, Babel, CommonVoice, Fisher (SwitchBoard), Fisher Callhome Spanish, FLEURS, Googlei18n, KsponSpeech, MagicData, ReazonSpeech, Russian Open STT, VCTK, VoxForge, VoxPopuli, WSJ
TODO list (future PRs will link to this PR for continuity)
(1) Extend to v4: We intend to collect more data with community efforts.
(2) Unified data collection and processing policy for multilingual speech data (see the discussion below).

Discussion: unified data collection and processing policy

While working on this PR, we found several problems that remain unsolved at the moment. We categorize these problems into three groups below. We intend to solve or alleviate them with a unified policy and to release our solution as a public tool or script in ESPnet.
Pre-processing during data preparation
(1) Wrong language-id: some language-ids in the original datasets are incorrect. E.g., English utterances appear in non-English corpora and inherit the corpus's (non-English) language-id (seen in OpenSLR 32, 35, 52). These errors are at the utterance level. We currently don't have a good solution other than removing these datasets entirely.
(2) Language-id inclusion: some languages are considered subsets of other languages. E.g., Mandarin and Chinese-TW can both be considered Chinese; languages with different dialects also have their own language-ids. However, since each utterance has only one exclusive language-id, the current data setup cannot handle this perfectly. We mainly keep them as-is.
(3) Special symbols: some datasets have transcriptions that contain special symbols like [Breath], [Laughter], etc. We intend to remove these special symbols; however, for each new dataset, we need to find all special symbols manually.

(4) Unified transcription processing: we need a consistent text processing policy for all raw transcriptions, which should consider upper/lower case, wide characters, illegal characters, spaces, digit normalization, etc. We currently mainly keep the raw transcriptions as-is (except that we lowercase transcriptions that are entirely uppercase).
Data cleaning with extra models (force-alignment, VAD, etc.)
(1) Meaningless speech: some speech examples are very spontaneous and contain nearly no meaningful content. These examples are usually much longer than expected and will harm time-stamp prediction. This issue is observed in Babel and MagicData. A potential solution is to use force-alignment to clip them.
(2) Long-form misalignment: the current setup derives the segment boundaries as the start_time of the first utterance and the end_time of the last utterance. However, not all audio between these two time-stamps is transcribed (e.g., some pieces in the middle are too noisy or meaningless and were discarded in the original dataset), so the text and audio are sometimes not well aligned. We don't have a good policy for this yet.
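A minimal sketch (illustrative values, not the PR's code) of the segment computation just described, showing where untranscribed gaps slip in:

```python
# Each utterance: (start_sec, end_sec, text). The long-form segment spans from
# the first start time to the last end time, so the untranscribed gap between
# 4.2s and 9.8s is silently included, which can misalign text and audio.
utterances = [
    (0.0, 4.2, "first transcribed utterance"),
    (9.8, 13.5, "last transcribed utterance"),
]
seg_start = utterances[0][0]   # start_time of the first utterance
seg_end = utterances[-1][1]    # end_time of the last utterance
print(seg_start, seg_end)      # 0.0 13.5
```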
Postprocessing during scoring
(1) Test-time transcription: although the current OWSM model can output text with upper/lower-case characters and punctuation, all evaluation is conducted with normalized text without punctuation. This is common in ASR research, but it may be sub-optimal for OWSM-like models.
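A hedged sketch of the kind of text normalization meant here (a common ASR-style normalizer, not the scorer actually used in these experiments):

```python
import string

def normalize_for_scoring(text: str) -> str:
    # Lowercase and strip ASCII punctuation before computing WER/CER.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

print(normalize_for_scoring("Hello, World! It's 3 p.m."))  # -> 'hello world its 3 pm'
```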