The recipe template makes it easy to build new recipes. It is designed to support the common functionality and requirements that individual tasks typically share.
- Copy a template directory:

      % task=asr1  # enh1, tts1, mt1, st1
      % egs2/TEMPLATE/${task}/setup.sh egs2/foo/${task}

- Create the `egs2/foo/${task}/data` directory to hold your corpus. See https://github.com/espnet/data_example or the next section.

- Run the recipe (e.g. the `asr` case):

      cd egs2/foo/${task}  # We always assume that our scripts are executed in this directory.

      # Stage 1 is assumed to create `data`, so you can skip it if you already have `data`.
      ./asr.sh \
        --stage 2 \
        --ngpu 1 \
        --train_set train \
        --valid_set valid \
        --test_sets "test" \
        --lm_train_text "data/train/text"

      # Use CUDA_VISIBLE_DEVICES to specify a GPU device id.
      # If you hit a CUDA out-of-memory error, change `batch_bins` (or `batch_size`).

- For more detail:
- Read the config files: e.g. https://github.com/espnet/espnet/tree/master/egs2/librispeech/asr1/conf
- Read the main script: e.g. https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/asr1/asr.sh
- Documentation: https://espnet.github.io/espnet/
The directories for the training set, development set, and evaluation set all have the same structure. See also http://kaldi-asr.org/doc/data_prep.html for the Kaldi data structure.
We recommend running the `mini_an4` recipe and checking the contents of `data/` yourself.

    cd egs2/mini_an4/asr1
    ./run.sh
- Directory structure:

      data/
        train/
          - text     # The transcription
          - wav.scp  # Wave file path
          - utt2spk  # A file mapping utterance-id to speaker-id
          - spk2utt  # A file mapping speaker-id to utterance-id
          - segments # [Option] Specifying start and end time of each utterance
        dev/
          ...
        test/
          ...
- `text` format:

      uttidA <transcription>
      uttidB <transcription>
      ...

- `wav.scp` format:

      uttidA /path/to/uttidA.wav
      uttidB /path/to/uttidB.wav
      ...

- `utt2spk` format:

      uttidA speakerA
      uttidB speakerB
      uttidC speakerA
      uttidD speakerB
      ...

- `spk2utt` format:

      speakerA uttidA uttidC ...
      speakerB uttidB uttidD ...
      ...
Note that the `spk2utt` file can be generated from `utt2spk`, and `utt2spk` can be generated from `spk2utt`, so it is enough to create just one of them.

    utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
    utils/spk2utt_to_utt2spk.pl data/train/spk2utt > data/train/utt2spk
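To make the relationship between the two files concrete, here is a minimal sketch that derives `spk2utt` from `utt2spk` with plain `awk`. It is only an illustration on toy data in a temporary directory (the `workdir` variable and the sample IDs are assumptions); in a real recipe you would normally use the `utils/` scripts shown above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy data directory in a temporary location (hypothetical example data).
workdir=$(mktemp -d)
mkdir -p "${workdir}/data/train"

cat > "${workdir}/data/train/utt2spk" <<'EOF'
uttidA speakerA
uttidB speakerB
uttidC speakerA
uttidD speakerB
EOF

# Group utterance IDs by speaker: one output line per speaker,
# keeping the utterances in their original order.
awk '{
    if (!($2 in utts)) { order[++n] = $2; utts[$2] = $1 }
    else               { utts[$2] = utts[$2] " " $1 }
}
END { for (i = 1; i <= n; i++) print order[i], utts[order[i]] }' \
    "${workdir}/data/train/utt2spk" \
    | LC_ALL=C sort > "${workdir}/data/train/spk2utt"

cat "${workdir}/data/train/spk2utt"
```

Each output line starts with a speaker ID followed by all utterance IDs that belong to that speaker.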
If your corpus doesn't include speaker information, use the utterance id as the speaker id to satisfy the directory format, or alternatively give the same speaker id to all utterances (speaker information is not currently used by the ASR recipe).

    uttidA uttidA
    uttidB uttidB
    ...

OR

    uttidA dummy
    uttidB dummy
    ...
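Both variants can be produced directly from `wav.scp`. The following is a small sketch on toy data, assuming `wav.scp` already exists; the temporary `workdir` and the output name `utt2spk.dummy` are hypothetical and only used to show the two options side by side.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy wav.scp in a temporary directory (hypothetical example data).
workdir=$(mktemp -d)
cat > "${workdir}/wav.scp" <<'EOF'
uttidA /path/to/uttidA.wav
uttidB /path/to/uttidB.wav
EOF

# Option 1: use the utterance id itself as the speaker id.
awk '{ print $1, $1 }' "${workdir}/wav.scp" > "${workdir}/utt2spk"

# Option 2: give every utterance the same placeholder speaker id.
awk '{ print $1, "dummy" }' "${workdir}/wav.scp" > "${workdir}/utt2spk.dummy"

cat "${workdir}/utt2spk"
cat "${workdir}/utt2spk.dummy"
```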
- [Option] `segments` format:

  If the audio data is originally a long recording (longer than roughly one hour) and each audio file contains multiple utterances, you need to create a `segments` file to specify the start time and end time of each utterance. The format is `<utterance_id> <wav_id> <start_time> <end_time>`.

      sw02001-A_000098-001156 sw02001-A 0.98 11.56
      ...

  Note that when `segments` is used, the first column of `wav.scp` is the `<wav_id>` referenced in `segments`, not the `utterance_id`.

      sw02001-A /path/to/sw02001-A.wav
      ...
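The relationship between `segments` and `wav.scp` can be checked mechanically. The sketch below is an illustrative consistency check on toy data, not part of the recipe: the second segment line, the `workdir` variable, and the `segments_ok` flag are made up for the example. It verifies that every line has four fields, that each `<wav_id>` appears in `wav.scp`, and that the start time is earlier than the end time.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy files in a temporary directory (hypothetical example data).
workdir=$(mktemp -d)
cat > "${workdir}/wav.scp" <<'EOF'
sw02001-A /path/to/sw02001-A.wav
EOF
cat > "${workdir}/segments" <<'EOF'
sw02001-A_000098-001156 sw02001-A 0.98 11.56
sw02001-A_001980-002131 sw02001-A 19.80 21.31
EOF

# First file (wav.scp): remember every wav_id.
# Second file (segments): check field count, wav_id, and time order.
if awk '
    NR == FNR        { wav[$1] = 1; next }
    NF != 4          { print "bad field count: " $0; bad++; next }
    !($2 in wav)     { print "unknown wav_id: " $2; bad++ }
    $3 + 0 >= $4 + 0 { print "start >= end: " $0; bad++ }
    END              { exit (bad > 0) }
' "${workdir}/wav.scp" "${workdir}/segments"; then
    segments_ok=yes
else
    segments_ok=no
fi

echo "segments consistent with wav.scp: ${segments_ok}"
```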
Once you have finished creating the data directory, it is a good idea to check it with `utils/validate_data_dir.sh`.

    utils/validate_data_dir.sh --no-feats data/train
    utils/validate_data_dir.sh --no-feats data/dev
    utils/validate_data_dir.sh --no-feats data/test
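To give a feel for the kind of problems such a check catches, here is a rough stand-in that tests two common requirements of a Kaldi-style directory without `segments`: the files are sorted by utterance ID, and `text`, `wav.scp`, and `utt2spk` list the same utterance IDs. This is only a sketch on toy data; it does not reproduce everything the real script does, and the `workdir`, `d`, and `status` names are assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy data directory in a temporary location (hypothetical example data).
workdir=$(mktemp -d)
d="${workdir}/data/train"
mkdir -p "${d}"
printf 'uttidA HELLO WORLD\nuttidB GOOD MORNING\n' > "${d}/text"
printf 'uttidA /path/to/uttidA.wav\nuttidB /path/to/uttidB.wav\n' > "${d}/wav.scp"
printf 'uttidA speakerA\nuttidB speakerB\n' > "${d}/utt2spk"

status=ok

# Check 1: every file is sorted by its first column (C locale).
for f in text wav.scp utt2spk; do
    if ! LC_ALL=C sort -c -k1,1 "${d}/${f}" 2> /dev/null; then
        status="unsorted:${f}"
    fi
done

# Check 2: text and wav.scp list exactly the utterance IDs of utt2spk.
awk '{ print $1 }' "${d}/utt2spk" > "${workdir}/ids.utt2spk"
for f in text wav.scp; do
    awk '{ print $1 }' "${d}/${f}" > "${workdir}/ids.${f}"
    if ! cmp -s "${workdir}/ids.utt2spk" "${workdir}/ids.${f}"; then
        status="id-mismatch:${f}"
    fi
done

echo "validation status: ${status}"
```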
Unlike ESPnet1, ESPnet2 doesn't prepare a different recipe for each corpus. Instead, we provide a common recipe for each task, named `asr.sh`, `enh.sh`, `tts.sh`, and so on. We carefully designed these common scripts to work with any type of corpus, so ideally you can train on your own corpus with almost no modification to these recipes. All you have to do is create `local/data.sh`.
- Create a directory in `egs2/`:

      % task=asr1  # enh1, tts1, mt1, st1
      % egs2/TEMPLATE/${task}/setup.sh egs2/foo/${task}
- Create `run.sh` and `local/data.sh` somehow:

      % cd egs2/foo/${task}
      % cp ../../mini_an4/${task}/run.sh .
      % vi run.sh

  `run.sh` is a thin wrapper of a common recipe for each task as follows,

      # The contents of run.sh
      ./asr.sh \
        --train_set train \
        --valid_set dev \
        --test_sets "dev test1 test2" \
        --lm_train_text "data/train/text" "$@"
  - We use a common recipe, so you must absorb the differences between corpora through the command line options of `asr.sh`.
  - We expect `local/data.sh` to generate training data (e.g., `data/train`), validation data (e.g., `data/dev`), and (multiple) test data (e.g., `data/test1` and `data/test2`) in Kaldi style (see stage 1 of `asr.sh`).
    - Note that some corpora only provide test data and do not officially provide a development set. In this case, you can prepare the validation data yourself by extracting part of the training data and treating the rest as the new training data (e.g., check `egs2/csj/asr1/local/data.sh`).
    - Also, the validation data used during training must be a single data directory. If you have multiple validation data directories, you must combine them with `utils/combine_data.sh`.
    - On the other hand, the recipe accepts multiple test data directories during inference, so you can include the validation data among them to evaluate ASR performance on the validation data.
  - If you create your recipe from scratch, you have to understand the Kaldi data structure. See the description of the Kaldi-style data directory in this document.
  - If you port a recipe from ESPnet1 or Kaldi, you need to embed the data preparation part of the original recipe in `local/data.sh`. Note that the common steps include `Feature extraction`, `Speed Perturbation`, and `Removing long/short utterances`, so you don't need to do them in `local/data.sh`.
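As an illustration of carving a validation set out of the training data, the sketch below holds out the last utterance of a toy data directory as `dev` and keeps the rest as `train`. It is a simplified, hypothetical example with plain `head` and `tail`, not the method used by any existing recipe; the `workdir`, `all`, `n_dev`, and sample IDs are all assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy "full" training data in a temporary directory (hypothetical example data).
workdir=$(mktemp -d)
mkdir -p "${workdir}/data/all" "${workdir}/data/train" "${workdir}/data/dev"
for u in uttidA uttidB uttidC uttidD; do
    echo "${u} SOME TRANSCRIPTION" >> "${workdir}/data/all/text"
    echo "${u} /path/to/${u}.wav"  >> "${workdir}/data/all/wav.scp"
    echo "${u} ${u}"               >> "${workdir}/data/all/utt2spk"
done

# Hold out the last n_dev utterances as validation data.
n_dev=1
n_total=$(( $(wc -l < "${workdir}/data/all/utt2spk") ))
n_train=$(( n_total - n_dev ))

# All files are sorted by utterance id, so cutting at the same line
# keeps text, wav.scp, and utt2spk consistent with each other.
for f in text wav.scp utt2spk; do
    head -n "${n_train}" "${workdir}/data/all/${f}" > "${workdir}/data/train/${f}"
    tail -n "${n_dev}"   "${workdir}/data/all/${f}" > "${workdir}/data/dev/${f}"
done

echo "train: ${n_train} utterances, dev: ${n_dev} utterances"
```

Because the speaker id equals the utterance id in this toy data, no speaker is shared between the two subsets; with real speaker labels you may prefer to split by speaker, and `spk2utt` should be regenerated for each subset afterwards.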
- If the recipe uses corpora that are not listed in `db.sh`, add them:

      ...
      YOUR_CORPUS=
      ...
- If the recipe depends on some special tools, then write the requirements to `local/path.sh`.

  path.sh:

      # e.g. flac command is required
      if ! which flac &> /dev/null; then
          echo "Error: flac is not installed"
          return 1
      fi