add centralized data preparation for OWSM #5478

Merged 31 commits on Dec 5, 2023


@jctian98 jctian98 commented Oct 17, 2023

What?

This PR responds to issue 5469 by providing data.sh and related files for the OWSM recipes from v1 to v3.
The main modifications are under the <espnet_path>/egs2/owsm_v* directories.

User Guidance for Data Preparation (copy from README.md)

(1) Please work progressively from v1 to v3: to obtain the full v3 data, you need to prepare the data for v1, v2 and v3 in order. To start the data preparation, run bash local/data.sh --VERSION v1 # or v2, v3
(2) Please revise db.sh for all datasets before running local/data.sh. Some datasets cannot be downloaded and untarred automatically due to license issues; users need to obtain and unpack them manually.
(3) Due to the large volume of data, we are not confident that the scripts will run smoothly for every dataset. Please raise an issue if you believe there is a bug.
(4) This script only prepares data for the train and valid subsets. Test data should be prepared separately following the conventional ESPnet2 format.
(5) Even though we provide this centralized data preparation script and combine all datasets in it, we strongly recommend that users NOT use the merged train_v* and valid_v* for feature extraction. Instead, users may run stages 2-4 for each dataset separately and then combine all datasets under the dump/raw directory. This lets you process all datasets in parallel and makes inspection and debugging easier. This is exactly what we did in our experiments.
(6) The detailed data list is in local/data.sh. Also see: https://arxiv.org/pdf/2309.13876.pdf
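
To illustrate point (1), here is a minimal sketch that only builds the ordered list of invocations needed to reach a given version. The helper name data_prep_commands is hypothetical and is not part of the recipe; the only thing taken from the guidance above is the command bash local/data.sh --VERSION <version>.

```python
import shlex

# Versions must be prepared in this order (v1 -> v2 -> v3).
VERSIONS = ["v1", "v2", "v3"]


def data_prep_commands(target):
    """Return the local/data.sh invocations needed to reach `target`, in order.

    Hypothetical helper for illustration; it does not run anything.
    """
    if target not in VERSIONS:
        raise ValueError(f"unknown version: {target}")
    upto = VERSIONS.index(target) + 1
    return [["bash", "local/data.sh", "--VERSION", v] for v in VERSIONS[:upto]]


# Print the commands a user would run, one per line, to obtain the full v3 data.
for cmd in data_prep_commands("v3"):
    print(shlex.join(cmd))
```

Each command could then be passed to subprocess.run from the recipe directory once db.sh has been revised.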

List of datasets

V1: Aishell, CoVoST2, GigaSpeech, LibriSpeech, MuST-C, SPGISpeech, TEDLIUM3
V2: all in V1, GigaST, Multilingual Librispeech, WenetSpeech
V3: all in V2, AIDATATANG, AMI, Babel, CommonVoice, Fisher (SwitchBoard), Fisher Callhome Spanish, FLEURS, Googlei18n, KsponSpeech, MagicData, ReazonSpeech, Russian Open STT, VCTK, VoxForge, VoxPopuli, WSJ

TODO list (future PRs will link this PR for continuity)

(1) Extend to v4: We intend to collect more data with community efforts.
(2) Unified data collecting and processing policy for multilingual speech data (see more in discussion below).

Discussion: unified data collecting and processing policy

During this PR, we found the following problems, which remain unsolved at this moment. We group them into three categories. We intend to solve or alleviate them with a unified policy and to release our solution as a public tool or script in ESPnet.

Pre-processing during data preparation

(1) Wrong language-id: some language-ids in the original datasets are incorrect. E.g., English utterances are in the non-English corpus and are then labeled as English (seen in OpenSLR 32, 35, 52). These errors are at the utterance level. We currently don't have a good solution other than removing these datasets as a whole.
(2) Language-id inclusion: Some languages are considered subsets of other languages. E.g., Mandarin and Chinese-TW can both be considered Chinese; languages with different dialects also have their own language-ids. However, since each utterance has exactly one language-id, the current data setup cannot solve this problem perfectly. We mainly keep them as-is.
(3) Special symbols: Some datasets have transcriptions that contain special symbols like [Breath], [Laughter], etc. We intend to remove these special symbols. However, for each new dataset, we need to find all special symbols manually.
(4) Unified transcription processing: We need a consistent text processing policy for all raw transcriptions, which should consider upper/lower case, wide characters, illegal characters, spaces, digit normalization, etc. We currently mainly keep the raw transcriptions as-is (except that we convert transcriptions that are entirely in upper case to lower case).
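
As a starting point for the discussion on (3) and (4), here is a minimal sketch of what a unified transcript normalizer could look like. It is not what the current scripts do (they keep raw text mostly as-is). The bracket pattern and the use of NFKC to fold wide characters are assumptions for illustration, and normalize_transcript is a hypothetical name.

```python
import re
import unicodedata

# Assumption: special symbols are bracketed tags such as [Breath] or [Laughter].
# Real corpora may use other conventions that must be listed per dataset.
SPECIAL_SYMBOL = re.compile(r"\[[^\[\]]*\]")


def normalize_transcript(text):
    """Hypothetical normalizer: fold wide characters, drop tags, fix all-caps."""
    # NFKC maps full-width letters and the full-width space to ASCII equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Remove bracketed tags before the case check, since tags contain lower case.
    text = SPECIAL_SYMBOL.sub(" ", text)
    # Only transcripts that are entirely upper case are lowercased.
    if text.isupper():
        text = text.lower()
    # Collapse any run of whitespace into a single space.
    return " ".join(text.split())


print(normalize_transcript("HELLO [Laughter] WORLD"))
print(normalize_transcript("Mixed Case [Breath] stays"))
```

Note that NFKC also rewrites other compatibility characters, so whether it is acceptable for every language in the mix would need to be checked per corpus.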

Data Cleaning with extra models (force-alignment, VAD, etc)

(1) Meaningless speech: Some speech examples are very spontaneous and contain nearly no meaningful content. These examples are usually much longer than expected and will harm the time-stamp prediction. This issue is observed in Babel and MagicData. A potential solution is to use forced alignment to clip them.
(2) Long-form misalignment: the current setup derives the segment boundaries of a long-form example as the start_time of the first utterance and the end_time of the last utterance. However, not all audio between these two time-stamps is transcribed (e.g., some pieces in the middle are too noisy/meaningless and are discarded in the original dataset). Thus, text and audio are sometimes not well aligned. We don't have a very good policy here.
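
To make problem (2) concrete, here is a small sketch that computes the long-form span the same way (first start to last end) and measures how much of that span has no transcription. The function names and the (start, end) tuple representation are assumptions for illustration, not the data structures used in the scripts.

```python
from typing import List, Tuple

Utt = Tuple[float, float]  # (start_time, end_time) in seconds


def long_form_span(utts: List[Utt]) -> Utt:
    """Span from the start of the first utterance to the end of the last one."""
    return min(s for s, _ in utts), max(e for _, e in utts)


def untranscribed_ratio(utts: List[Utt]) -> float:
    """Fraction of the long-form span not covered by any utterance."""
    start, end = long_form_span(utts)
    covered = 0.0
    cursor = start
    for s, e in sorted(utts):
        s = max(s, cursor)  # do not count overlapping audio twice
        if e > s:
            covered += e - s
            cursor = e
    total = end - start
    return 0.0 if total <= 0 else 1.0 - covered / total


# Three utterances with two untranscribed gaps (4-10 s and 14-16 s).
utts = [(0.0, 4.0), (10.0, 14.0), (16.0, 20.0)]
print(long_form_span(utts), round(untranscribed_ratio(utts), 3))
```

A threshold on this ratio could be one way to flag long-form examples for inspection, although the right threshold is an open question.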

Postprocessing during scoring

(1) Test-time transcription: Although the current OWSM model can output text with upper/lower case characters and punctuation, all evaluation is conducted on normalized text without punctuation. This is common in ASR research but may be sub-optimal for OWSM-like models.
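
For reference, the kind of normalization meant here can be sketched as follows. This is only an illustration that handles ASCII punctuation; it is not the actual scoring script, and normalize_for_scoring is a hypothetical name.

```python
import string

# Translation table that deletes ASCII punctuation only (assumption: other
# scripts' punctuation would need additional handling).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_for_scoring(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    return " ".join(text.translate(_PUNCT_TABLE).lower().split())


hyp = "Hello, World! How are you?"
ref = "hello world how are you"
# After normalization the cased, punctuated hypothesis matches the reference,
# so case and punctuation quality are invisible to the metric.
print(normalize_for_scoring(hyp) == ref)
```

Scoring both the normalized and the raw text would be one way to also measure case and punctuation quality.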

@mergify mergify bot added the ESPnet2 label Oct 17, 2023
codecov bot commented Oct 17, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (4610653) 76.54% compared to head (ee00c6c) 76.54%.
Report is 60 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5478   +/-   ##
=======================================
  Coverage   76.54%   76.54%           
=======================================
  Files         720      720           
  Lines       66599    66602    +3     
=======================================
+ Hits        50975    50978    +3     
  Misses      15624    15624           
Flag Coverage Δ
test_configuration_espnet2 ∅ <ø> (∅)
test_integration_espnet1 62.92% <ø> (ø)
test_integration_espnet2 50.10% <ø> (+<0.01%) ⬆️
test_python_espnet1 19.08% <ø> (+<0.01%) ⬆️
test_python_espnet2 52.38% <ø> (-0.01%) ⬇️
test_utils 22.15% <ø> (ø)


@pyf98
Collaborator

pyf98 commented Oct 17, 2023

Thanks! Looking forward to v3

train_sets="data/GigaST/XL.en-* \
data/MLS/train.* \
data/WenetSpeech/L"
# question (jinchuan): why don't include GigaST-dev?
Collaborator

I remember GigaST does not have DEV

Contributor Author

Great. I'll remove this question

@sw005320 sw005320 added this to the v.202312 milestone Oct 17, 2023
@sw005320
Contributor

How did you deal with the wide characters, including the space symbols?
Did you deal with them as they are?

@jctian98
Contributor Author

A summary so far:

(1) There is a shared ./local/data.sh for all egs2/mixed_v* recipes, which is used to provide the combined dataset for v1, v2 and v3. E.g., for v1, this script can be used with bash local/data.sh --VERSION v1
(2) The script should be used progressively. That means the user should run v1 to v3 in order to get the full v3 data.
(3) For v3, the babel dataset is currently absent. Dan was responsible for this. I have contacted him and will update once I get the script.
(4) For each dataset, we have a ./local/prepare_<dataset>.sh to handle its preparation. Scripts for v1 and v2 were done by @pyf98 ; scripts for v3 are newly added in this PR. Specifically, most datasets in v3 are processed as (1) prepare as original espnet / kaldi format with the existing egs2/<dataset>/asr1/local/data.sh scripts and (2) transform into the OWSM data format with the script ./local/kaldi_to_whisper.py
(5) We have conducted some tests on our scripts, especially for the datasets included in v3. However, due to the large volume of data, some of these scripts have only been tested partially (e.g., only run on the dev set).
(6) The original data.sh scripts for some tasks do not run smoothly (e.g., swbd, fisher_callhome_spanish). Users also need to take care of the data sources for some datasets. So the whole process cannot be expected to be very smooth. We can make revisions if we receive user feedback.
(7) v1 and v2 adopt the original Whisper language IDs. Since we cover more languages than Whisper, our language IDs in v3 are changed to the ISO 639-3 format.
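
To illustrate point (7), here is a tiny sketch of converting Whisper-style two-letter IDs to ISO 639-3 codes. The table covers only a handful of languages and is a hypothetical example, not the mapping used in the scripts; to_iso639_3 is an illustrative name.

```python
# Hypothetical, deliberately incomplete mapping for illustration only.
WHISPER_TO_ISO639_3 = {
    "en": "eng",
    "de": "deu",
    "fr": "fra",
    "es": "spa",
    "ja": "jpn",
}


def to_iso639_3(lang_id):
    """Map a Whisper-style ID to ISO 639-3; pass through IDs already mapped."""
    if lang_id in WHISPER_TO_ISO639_3.values():
        return lang_id
    try:
        return WHISPER_TO_ISO639_3[lang_id]
    except KeyError:
        raise ValueError(f"no ISO 639-3 mapping for language id: {lang_id}")


print([to_iso639_3(x) for x in ["en", "ja", "deu"]])
```

Raising on unknown IDs, rather than passing them through, makes missing entries visible when v1/v2 data are merged into v3.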

TODO:
(1) double check the scripts and solve the CI issues
(2) update babel script

Answers to the questions above:
(1) We try to keep the original text data as-is, so we haven't done any special operations on wide characters.
(2) string.split() and " ".join() are used frequently, so multiple spaces and \t might be replaced by a single space.

@sw005320
Contributor

Can you add some info to https://github.com/espnet/espnet/blob/master/egs2/mixed_v3/s2t1/README.md?

(1) We try to keep the original text data as-is, so we haven't done any special operations on wide characters.

This is risky as each corpus has a different annotation policy (e.g., punctuations, special characters like noises).
Please make sure to make it consistent by taking a look at preprocessed data for each corpus.

(2) string.split() and " ".join() are used frequently, so multiple spaces and \t might be replaced by a single space.

What happened to the wide-character space, then? Can string.split() deal with the wide-character space?

@pyf98
Collaborator

pyf98 commented Oct 18, 2023 via email

@sw005320
Contributor

Some corpora indeed contain special white spaces. That’s why I applied string.split. I think it works. After applying it, I did not see any warning or errors about space characters. But please correct me if I am wrong

I think you're right.
string.split() seems to handle various whitespace characters, including the wide-character (full-width) space.
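
A quick check of this behaviour in Python 3: calling str.split() with no arguments splits on any Unicode whitespace, which includes the full-width (ideographic) space U+3000. The helper name collapse_whitespace is only for illustration.

```python
def collapse_whitespace(text):
    """Split on any Unicode whitespace and rejoin with single ASCII spaces."""
    return " ".join(text.split())


# U+3000 is the full-width space; the sample also mixes tabs and double spaces.
sample = "你好\u3000世界\t\tfoo  bar"
print(collapse_whitespace(sample))

# Splitting on an explicit " " does NOT treat U+3000 as a separator.
print("你好\u3000世界".split(" "))
```

So the full-width space is replaced by an ordinary space only when split() is called without an explicit separator.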

@jctian98
Contributor Author

I think it's better to discuss how to do the text normalization (punctuations, wide characters, multiple spaces etc.). We also need to take care of some edge cases for audio. After we fix the policy, we can apply it to all datasets.

I think we can write the README file after we finish the scripts, to avoid further revisions.

@mergify mergify bot added the README label Oct 23, 2023
@mergify
Contributor

mergify bot commented Oct 25, 2023

This pull request is now in conflict :(

@mergify mergify bot added the conflicts label Oct 25, 2023
@kan-bayashi kan-bayashi modified the milestones: v.202310, v.202312 Oct 25, 2023
Contributor

mergify bot commented Nov 1, 2023

This pull request is now in conflict :(

@mergify mergify bot added the conflicts label Nov 1, 2023
@mergify mergify bot removed the conflicts label Nov 10, 2023
@sw005320
Contributor

Can you add some TODO and discussions here?
Since this PR includes the directory change, I want to merge it at its current stage to promote the OWSM activities.
Then, please prepare follow-up PRs that carry over the TODO items and discussions.

@jctian98
Contributor Author

jctian98 commented Nov 11, 2023

Can you add some TODO and discussions here?

It's at the top of this PR. Please review it.

@mergify mergify bot added the ESPnet1 label Nov 26, 2023
@sw005320
Contributor

Please let me know if this PR is ready to be merged.

@jctian98
Contributor Author

Once the WSJ case is resolved as discussed on Slack, I think this PR is ready to merge.

@sw005320 sw005320 merged commit a45a53c into espnet:master Dec 5, 2023
27 checks passed
@sw005320
Contributor

sw005320 commented Dec 5, 2023

Thanks a lot, @jctian98!
I’m looking forward to the results of v3.1 and the next iteration with v4

@jctian98 jctian98 deleted the owsm_data branch May 17, 2024 22:54
5 participants