add centralized data preparation for OWSM #5478

Merged · 31 commits · Dec 5, 2023
Changes from 2 commits
Commits
3be1f40
add whisper data.sh for v1 and v2
jctian98 Oct 17, 2023
37ab173
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 17, 2023
92bf631
add OWSM v3 data recipe
jctian98 Oct 17, 2023
ac8e423
Merge commit 'FETCH_HEAD' into owsm_data
jctian98 Oct 17, 2023
a3c24bd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 17, 2023
063dc3f
fix ci issues
jctian98 Oct 18, 2023
5e14a62
update with ci issues
jctian98 Oct 18, 2023
7b707cd
change egs name from mixed_v* to owsm_v*
jctian98 Oct 23, 2023
14204e2
v3 should be ready except wsj
jctian98 Oct 30, 2023
ae05a6c
add wsj
jctian98 Oct 30, 2023
c515f76
update db.sh
jctian98 Oct 30, 2023
b53ce47
Merge branch 'master' into owsm_data
jctian98 Oct 30, 2023
ec109e2
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 30, 2023
31ad173
almost finish all scripts
jctian98 Nov 10, 2023
8a09625
fix small problems
jctian98 Nov 10, 2023
952acf6
Merge commit 'FETCH_HEAD' into owsm_data
jctian98 Nov 10, 2023
2fd2668
merge master
jctian98 Nov 10, 2023
c53afd0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 10, 2023
bdaf344
update the language mapping
jctian98 Nov 11, 2023
d379fd0
Merge commit 'FETCH_HEAD' into owsm_data
jctian98 Nov 11, 2023
b2cb427
Merge branch 'master' into owsm_data
jctian98 Nov 11, 2023
f5e5414
Merge commit 'FETCH_HEAD' into owsm_data
jctian98 Nov 11, 2023
51e3691
fix CI issue
jctian98 Nov 11, 2023
7f75d15
Merge commit 'FETCH_HEAD' into owsm_data
jctian98 Nov 11, 2023
66176bc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 11, 2023
3d89d78
update wsj and commonvoice
jctian98 Nov 26, 2023
8f1e0fa
Merge commit 'FETCH_HEAD' into owsm_data
jctian98 Nov 26, 2023
77fe14b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 26, 2023
c391765
update wsj text norm script
jctian98 Nov 26, 2023
642fd22
update wsj text norm 2
jctian98 Nov 26, 2023
ee00c6c
revise voxpopuli
jctian98 Nov 29, 2023
157 changes: 157 additions & 0 deletions egs2/mixed_v1/s2t1/local/data.sh
@@ -0,0 +1,157 @@
#!/usr/bin/env bash
# Set bash to 'debug' mode; it will exit on:
# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline', -x 'print commands'.

# Centralized data preparation for OWSM (https://arxiv.org/abs/2309.13876)

# Note (jinchuan):
# (1) please work progressively from v1 to v3: you need to
# prepare data for v1, v2, and v3 in order to obtain the full v3 data.
# (2) please revise db.sh for all datasets before running this script.
# Some datasets cannot be downloaded and untarred automatically due to
# license issues. Please take care of them in advance.
# (3) Due to the large volume of data, we cannot ensure the scripts
# will run smoothly for every dataset. Please raise an issue if you
# believe there is a bug.
# (4) This script only prepares data for train and valid. Test data
# should be prepared separately following the standard ESPnet2 format.
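#
# Example usage (illustrative only; options are parsed by utils/parse_options.sh below,
# so the flag names follow the variable names defined in this script):
# ./local/data.sh --VERSION v1 --stage 1 --stop_stage 2
# ./local/data.sh --VERSION v2 --stage 1 --stop_stage 2  # expects data/train_v1 and data/valid_v1
# ./local/data.sh --VERSION v3 --stage 1 --stop_stage 2  # expects data/train_v2 and data/valid_v2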

set -e
set -u
set -o pipefail

. ./path.sh || exit 1;
. ./db.sh || exit 1;

function check_sorted {
file=$1
sort -k1,1 -u <$file >$file.tmp
if ! cmp -s $file $file.tmp; then
echo "$0: file $1 is not in sorted order or not unique, sorting it"
mv $file.tmp $file
else
rm $file.tmp
fi
}

log() {
local fname=${BASH_SOURCE[1]##*/}
echo -e "$(date '+%Y-%m-%dT%H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
}
SECONDS=0

VERSION=v1 # specify v1, v2 or v3
stage=1
stop_stage=2

. utils/parse_options.sh

# Change the lists below accordingly if you only want to prepare a subset of the data
if [ ${VERSION} = "v1" ]; then
datasets="aishell covost2 gigaspeech librispeech must-c spgispeech"
train_sets="data/AISHELL-1/train \
data/CoVoST2/train \
data/GigaSpeech/XL \
data/LibriSpeech/train-clean-100 \
data/LibriSpeech/train-clean-360 \
data/LibriSpeech/train-other-500 \
data/MuST-C_v1.2/train \
data/MuST-C_v2/train \
data/MuST-C_v3/train \
data/SPGISpeech/train \
data/TEDLIUM3/train"
valid_sets="data/AISHELL-1/dev \
data/CoVoST2/dev \
data/GigaSpeech/DEV \
data/LibriSpeech/dev-clean \
data/LibriSpeech/dev-other \
data/MuST-C_v1.2/dev \
data/MuST-C_v2/dev \
data/MuST-C_v3/dev \
data/SPGISpeech/val \
data/TEDLIUM3/dev"

elif [ ${VERSION} = "v2" ]; then
datasets="gigast multilingual_librispeech wenetspeech"
train_sets="data/GigaST/XL.en-* \
data/MLS/train.* \
data/WenetSpeech/L"
# question (jinchuan): why not include GigaST-dev?
Collaborator:
I remember GigaST does not have DEV

Contributor Author:
Great. I'll remove this question

valid_sets="data/MLS/dev.* \
data/WenetSpeech/DEV"

elif [ ${VERSION} = "v3" ]; then
datasets="aidatatang ami babel commonvoice swbd fisher_callhome \
fleurs ksponspeech magicdata reazonspeech ru_open_stt \
vctk voxpopuli wsj"
# still working on it
train_sets="data/aidatatang/train \
"
valid_sets="data/aidatatang/dev \
"
else
echo "Invalid version argument ${VERSION}." && exit 1;
fi
echo "Preparing data for OSWM with version ${VERSION}"
echo "Datasets to prepare: ${datasets}"

utt_extra_files="text.prev text.ctc"
train_out=data/train_${VERSION}
valid_out=data/valid_${VERSION}

# call data preparation script for each dataset
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
for dataset in ${datasets}; do
if [ -f data/.${dataset}.done ]; then
echo ${dataset} has been processed. Skip!
else
echo preparing ${dataset} dataset ...
./local/prepare_${dataset}.sh && touch data/.${dataset}.done
fi
done
fi
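# Note (illustrative): stage 1 writes a marker file data/.<dataset>.done for each finished
# dataset; removing a marker (e.g. rm data/.librispeech.done) forces that dataset to be re-prepared.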

# combine all datasets.
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then

if [ ${VERSION} = "v2" ]; then
if [ ! -d data/train_v1 ] || [ ! -d data/valid_v1 ]; then
echo "Cannot find v1 data. Please link it here. Exit!" && exit 1;
fi
train_sets="${train_sets} data/train_v1"
valid_sets="${valid_sets} data/valid_v1"
fi
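# Illustrative only: if the earlier-version data was prepared in another recipe directory,
# symlinks such as the following (paths are placeholders) satisfy the checks above and below:
# ln -s /path/to/egs2/mixed_v1/s2t1/data/train_v1 data/train_v1
# ln -s /path/to/egs2/mixed_v1/s2t1/data/valid_v1 data/valid_v1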

if [ ${VERSION} = "v3" ]; then
if [ ! -d data/train_v2 ] || [ ! -d data/valid_v2 ]; then
echo "Cannot find v2 data. Please link it here. Exit!" && exit 1;
fi
train_sets="${train_sets} data/train_v2"
valid_sets="${valid_sets} data/valid_v2"
fi

# Combine valid
utils/combine_data.sh --skip_fix true --extra-files "${utt_extra_files}" \
${valid_out} ${valid_sets} || exit 1;
# NOTE(yifan): extra text files must be sorted and unique
for f in ${utt_extra_files}; do
check_sorted ${valid_out}/${f}
done
utils/fix_data_dir.sh --utt_extra_files "${utt_extra_files}" ${valid_out} || exit 1;
utils/validate_data_dir.sh --no-feats --non-print ${valid_out} || exit 1;

# Combine train
utils/combine_data.sh --skip_fix true --extra-files "${utt_extra_files}" \
${train_out} ${train_sets} || exit 1;
# NOTE(yifan): extra text files must be sorted and unique
for f in ${utt_extra_files}; do
check_sorted ${train_out}/${f}
done
utils/fix_data_dir.sh --utt_extra_files "${utt_extra_files}" ${train_out} || exit 1;
utils/validate_data_dir.sh --no-feats --non-print ${train_out} || exit 1;
fi

# todo: some v3-specific operations


log "Successfully finished. [elapsed=${SECONDS}s]"
3 changes: 2 additions & 1 deletion egs2/mixed_v1/s2t1/local/prepare_aishell.sh
@@ -6,6 +6,7 @@ set -u
set -o pipefail

. ./path.sh || exit 1;
. ./db.sh || exit 1;

# Copied from utils/fix_data_dir.sh
function check_sorted {
@@ -25,7 +26,7 @@ log() {
}
SECONDS=0

data_dir=/scratch/bbjs/peng6/corpora/AISHELL-1
data_dir=${AISHELL}
prefix=AISHELL-1
output_dir=data/AISHELL-1
splits="dev train"
3 changes: 2 additions & 1 deletion egs2/mixed_v1/s2t1/local/prepare_covost2.sh
@@ -6,6 +6,7 @@ set -u
set -o pipefail

. ./path.sh || exit 1;
. ./db.sh || exit 1;

# Copied from utils/fix_data_dir.sh
function check_sorted {
@@ -25,7 +26,7 @@ log() {
}
SECONDS=0

data_dir=/scratch/bbjs/peng6/espnet-whisper-public/egs2/covost2/st1/data
data_dir=${COVOST2}
prefix=CoVoST2
output_dir=data/CoVoST2
splits="dev train"
3 changes: 2 additions & 1 deletion egs2/mixed_v1/s2t1/local/prepare_gigaspeech.sh
@@ -6,6 +6,7 @@ set -u
set -o pipefail

. ./path.sh || exit 1;
. ./db.sh || exit 1;

# Copied from utils/fix_data_dir.sh
function check_sorted {
@@ -25,7 +26,7 @@ log() {
}
SECONDS=0

data_dir=/scratch/bbjs/peng6/corpora/GigaSpeech
data_dir=${GIGASPEECH}
prefix=GigaSpeech
output_dir=data/GigaSpeech
splits="DEV XL"
3 changes: 2 additions & 1 deletion egs2/mixed_v1/s2t1/local/prepare_librispeech.sh
@@ -6,6 +6,7 @@ set -u
set -o pipefail

. ./path.sh || exit 1;
. ./db.sh || exit 1;

# Copied from utils/fix_data_dir.sh
function check_sorted {
@@ -25,7 +26,7 @@ log() {
}
SECONDS=0

data_dir=/scratch/bbjs/peng6/corpora/librispeech_full/LibriSpeech
data_dir=${LIBRISPEECH}
prefix=LibriSpeech
output_dir=data/LibriSpeech
splits="dev-clean dev-other train-clean-100 train-clean-360 train-other-500"
3 changes: 2 additions & 1 deletion egs2/mixed_v1/s2t1/local/prepare_must-c.sh
@@ -6,6 +6,7 @@ set -u
set -o pipefail

. ./path.sh || exit 1;
. ./db.sh || exit 1;

# Copied from utils/fix_data_dir.sh
function check_sorted {
@@ -25,7 +26,7 @@ log() {
}
SECONDS=0

data_dir=/scratch/bbjs/peng6/corpora/MuST-C_v1.2
data_dir=${MUST_C}
prefix=$(basename ${data_dir})
output_dir=data/${prefix}
splits="dev train"
3 changes: 2 additions & 1 deletion egs2/mixed_v1/s2t1/local/prepare_spgispeech.sh
@@ -6,6 +6,7 @@ set -u
set -o pipefail

. ./path.sh || exit 1;
. ./db.sh || exit 1;

# Copied from utils/fix_data_dir.sh
function check_sorted {
@@ -25,7 +26,7 @@ log() {
}
SECONDS=0

data_dir=/scratch/bbjs/peng6/corpora/SPGISpeech
data_dir=${SPGISPEECH}
prefix=SPGISpeech
output_dir=data/SPGISpeech
splits="val train"
3 changes: 2 additions & 1 deletion egs2/mixed_v1/s2t1/local/prepare_tedlium.sh
@@ -6,6 +6,7 @@ set -u
set -o pipefail

. ./path.sh || exit 1;
. ./db.sh || exit 1;

# Copied from utils/fix_data_dir.sh
function check_sorted {
@@ -25,7 +26,7 @@ log() {
}
SECONDS=0

data_dir=/scratch/bbjs/peng6/corpora/TEDLIUM/TEDLIUM_release-3/legacy/
data_dir=${TEDLIUM3}
prefix=TEDLIUM3
output_dir=data/TEDLIUM3
splits="dev train"
1 change: 1 addition & 0 deletions egs2/mixed_v2/s2t1/local/data.sh
5 changes: 3 additions & 2 deletions egs2/mixed_v2/s2t1/local/prepare_gigast.sh
@@ -6,6 +6,7 @@ set -u
set -o pipefail

. ./path.sh || exit 1;
. ./db.sh || exit 1;

# Copied from utils/fix_data_dir.sh
function check_sorted {
@@ -25,8 +26,8 @@ log() {
}
SECONDS=0

gigaspeech_dir=/scratch/bbjs/peng6/corpora/GigaSpeech
gigast_dir=/scratch/bbjs/peng6/corpora/GigaST
gigaspeech_dir=${GIGASPEECH}
gigast_dir=${GIGAST}
prefix=GigaST
output_dir=data/GigaST
languages="de zh"
3 changes: 2 additions & 1 deletion egs2/mixed_v2/s2t1/local/prepare_multilingual_librispeech.sh
@@ -6,6 +6,7 @@ set -u
set -o pipefail

. ./path.sh || exit 1;
. ./db.sh || exit 1;

# Copied from utils/fix_data_dir.sh
function check_sorted {
@@ -25,7 +26,7 @@ log() {
}
SECONDS=0

mls_dir=/scratch/bbjs/peng6/corpora/multilingual_librispeech
mls_dir=${MLS}
prefix=MLS
output_dir=data/${prefix}
# languages="nl fr de it pl pt es en"
3 changes: 2 additions & 1 deletion egs2/mixed_v2/s2t1/local/prepare_wenetspeech.sh
@@ -6,6 +6,7 @@ set -u
set -o pipefail

. ./path.sh || exit 1;
. ./db.sh || exit 1;

# Copied from utils/fix_data_dir.sh
function check_sorted {
@@ -25,7 +26,7 @@ log() {
}
SECONDS=0

wenetspeech_dir=/scratch/bbjs/peng6/corpora_shared/WenetSpeech/untar
wenetspeech_dir=${WENETSPEECH}
prefix=WenetSpeech
output_dir=data/WenetSpeech
splits="DEV L"