add centralized data preparation for OWSM #5478
Merged
Changes from 2 commits

Commits (31)
3be1f40  add whisper data.sh for v1 and v2 (jctian98)
37ab173  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
92bf631  add OWSM v3 data recipe (jctian98)
ac8e423  Merge commit 'FETCH_HEAD' into owsm_data (jctian98)
a3c24bd  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
063dc3f  fix ci issues (jctian98)
5e14a62  update with ci issues (jctian98)
7b707cd  change egs name from mixed_v* to owsm_v* (jctian98)
14204e2  v3 shuold be ready except wsj (jctian98)
ae05a6c  add wsj (jctian98)
c515f76  update db.sh (jctian98)
b53ce47  Merge branch 'master' into owsm_data (jctian98)
ec109e2  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
31ad173  almost finish all scripts (jctian98)
8a09625  fix small problems (jctian98)
952acf6  Merge commit 'FETCH_HEAD' into owsm_data (jctian98)
2fd2668  merge master (jctian98)
c53afd0  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
bdaf344  update the langauge mapping (jctian98)
d379fd0  Merge commit 'FETCH_HEAD' into owsm_data (jctian98)
b2cb427  Merge branch 'master' into owsm_data (jctian98)
f5e5414  Merge commit 'FETCH_HEAD' into owsm_data (jctian98)
51e3691  fix CI issue (jctian98)
7f75d15  Merge commit 'FETCH_HEAD' into owsm_data (jctian98)
66176bc  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
3d89d78  update wsj and commonvoice (jctian98)
8f1e0fa  Merge commit 'FETCH_HEAD' into owsm_data (jctian98)
77fe14b  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
c391765  update wsj text norm script (jctian98)
642fd22  update wsj text norm 2 (jctian98)
ee00c6c  revise voxpopuli (jctian98)
@@ -0,0 +1,157 @@
#!/usr/bin/env bash
# Set bash to 'debug' mode: it will exit on
# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'

# Centralized data preparation for OWSM (https://arxiv.org/abs/2309.13876)

# Note (jinchuan):
# (1) Please work progressively from v1 to v3: you need to
#     prepare data for v1, v2, and v3 in order to obtain the full v3 data.
# (2) Please revise db.sh for all datasets before running this script.
#     Some datasets cannot be downloaded and untarred automatically due to
#     license issues. Please take care of them in advance.
# (3) Due to the large volume of data, we cannot guarantee that the scripts
#     will run smoothly for every dataset. Please raise an issue if you
#     believe there is a bug.
# (4) This script only prepares data for train and valid. Test data
#     should be prepared separately following the standard ESPnet2 format.

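# Example invocation order (a sketch: the ./local/data.sh path and the option
# names, which utils/parse_options.sh maps onto the variables below, are assumptions):
#   ./local/data.sh --VERSION v1 --stage 1 --stop_stage 2
#   ./local/data.sh --VERSION v2 --stage 1 --stop_stage 2   # expects data/train_v1 and data/valid_v1
#   ./local/data.sh --VERSION v3 --stage 1 --stop_stage 2   # expects data/train_v2 and data/valid_v2
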
set -e
set -u
set -o pipefail

. ./path.sh || exit 1;
. ./db.sh || exit 1;

# Sort a data file by its first field (utterance ID) and drop duplicate keys,
# rewriting the file only if it was not already sorted and unique.
function check_sorted {
    file=$1
    sort -k1,1 -u <$file >$file.tmp
    if ! cmp -s $file $file.tmp; then
        echo "$0: file $1 is not in sorted order or not unique, sorting it"
        mv $file.tmp $file
    else
        rm $file.tmp
    fi
}

log() {
    local fname=${BASH_SOURCE[1]##*/}
    echo -e "$(date '+%Y-%m-%dT%H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
}
SECONDS=0

VERSION=v1  # specify v1, v2, or v3
stage=1
stop_stage=2

. utils/parse_options.sh

# Change accordingly if you only want to prepare a subset of the data
if [ ${VERSION} = "v1" ]; then
    datasets="aishell covost2 gigaspeech librispeech must-c spgispeech"
    train_sets="data/AISHELL-1/train \
        data/CoVoST2/train \
        data/GigaSpeech/XL \
        data/LibriSpeech/train-clean-100 \
        data/LibriSpeech/train-clean-360 \
        data/LibriSpeech/train-other-500 \
        data/MuST-C_v1.2/train \
        data/MuST-C_v2/train \
        data/MuST-C_v3/train \
        data/SPGISpeech/train \
        data/TEDLIUM3/train"
    valid_sets="data/AISHELL-1/dev \
        data/CoVoST2/dev \
        data/GigaSpeech/DEV \
        data/LibriSpeech/dev-clean \
        data/LibriSpeech/dev-other \
        data/MuST-C_v1.2/dev \
        data/MuST-C_v2/dev \
        data/MuST-C_v3/dev \
        data/SPGISpeech/val \
        data/TEDLIUM3/dev"

elif [ ${VERSION} = "v2" ]; then
    datasets="gigast multilingual_librispeech wenetspeech"
    train_sets="data/GigaST/XL.en-* \
        data/MLS/train.* \
        data/WenetSpeech/L"
    # question (jinchuan): why isn't GigaST-dev included?
    valid_sets="data/MLS/dev.* \
        data/WenetSpeech/DEV"

elif [ ${VERSION} = "v3" ]; then
    datasets="aidatatang ami babel commonvoice swbd fisher_callhome \
        fleurs ksponspeech magicdata reazonspeech ru_open_stt \
        vctk voxpopuli wsj"
    # still working on it
    train_sets="data/aidatatang/train \
        "
    valid_sets="data/aidatatang/dev \
        "
else
    echo "Invalid version argument: ${VERSION}." && exit 1;
fi
echo "Preparing data for OWSM with version ${VERSION}"
echo "Datasets to prepare: ${datasets}"

utt_extra_files="text.prev text.ctc"
train_out=data/train_${VERSION}
valid_out=data/valid_${VERSION}

# Call the data preparation script for each dataset
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    for dataset in ${datasets}; do
        if [ -f data/.${dataset}.done ]; then
            echo "${dataset} has been processed. Skip!"
        else
            echo "preparing ${dataset} dataset ..."
            ./local/prepare_${dataset}.sh && touch data/.${dataset}.done
        fi
    done
fi

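# Each local/prepare_<dataset>.sh above is assumed to leave standard data
# directories (wav.scp, text, utt2spk, ...) under data/, together with the
# extra files listed in ${utt_extra_files} (text.prev, text.ctc); this is
# inferred from the combine/fix/validate calls in stage 2 below rather than
# stated here explicitly.
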
# Combine all datasets
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then

    if [ ${VERSION} = "v2" ]; then
        if [ ! -d data/train_v1 ] || [ ! -d data/valid_v1 ]; then
            echo "Cannot find v1 data. Please link it here. Exit!" && exit 1;
        fi
        train_sets="${train_sets} data/train_v1"
        valid_sets="${valid_sets} data/valid_v1"
    fi

    if [ ${VERSION} = "v3" ]; then
        if [ ! -d data/train_v2 ] || [ ! -d data/valid_v2 ]; then
            echo "Cannot find v2 data. Please link it here. Exit!" && exit 1;
        fi
        train_sets="${train_sets} data/train_v2"
        valid_sets="${valid_sets} data/valid_v2"
    fi

    # Combine valid
    utils/combine_data.sh --skip_fix true --extra-files "${utt_extra_files}" \
        ${valid_out} ${valid_sets} || exit 1;
    # NOTE(yifan): extra text files must be sorted and unique
    for f in ${utt_extra_files}; do
        check_sorted ${valid_out}/${f}
    done
    utils/fix_data_dir.sh --utt_extra_files "${utt_extra_files}" ${valid_out} || exit 1;
    utils/validate_data_dir.sh --no-feats --non-print ${valid_out} || exit 1;

    # Combine train
    utils/combine_data.sh --skip_fix true --extra-files "${utt_extra_files}" \
        ${train_out} ${train_sets} || exit 1;
    # NOTE(yifan): extra text files must be sorted and unique
    for f in ${utt_extra_files}; do
        check_sorted ${train_out}/${f}
    done
    utils/fix_data_dir.sh --utt_extra_files "${utt_extra_files}" ${train_out} || exit 1;
    utils/validate_data_dir.sh --no-feats --non-print ${train_out} || exit 1;
fi

# TODO: some v3-specific operations

log "Successfully finished. [elapsed=${SECONDS}s]"
@@ -0,0 +1 @@
../../../mixed_v1/s2t1/local/data.sh
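This one-line file is a relative symlink, so the other recipe reuses the same data.sh rather than carrying a copy. A sketch of how such a link is created (the destination recipe directory is a placeholder, since it is not shown in this excerpt):

    cd egs2/<destination_recipe>/s2t1/local   # placeholder path
    ln -s ../../../mixed_v1/s2t1/local/data.sh data.sh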
I remember GigaST does not have DEV
Great. I'll remove this question