EDACC dataset automatic speech recognition #5996

Merged
merged 34 commits into from
Jan 2, 2025
34 commits
64f9775
data prep stage for edacc
uwanny Nov 30, 2024
2488ddc
split too large audio file limited memory on PSC, and verified implem…
uwanny Dec 12, 2024
70d2c9c
Merge remote-tracking branch 'origin/master' into EdAcc-dataset
uwanny Dec 12, 2024
17f3ad6
split and truncate too long test set
uwanny Dec 25, 2024
fb887af
update the training and decode config for wavLM, update run.sh
uwanny Dec 25, 2024
647e666
Merge branch 'master' into EdAcc-dataset
uwanny Dec 25, 2024
15f8a91
Merge branch 'espnet:master' into EdAcc-dataset
uwanny Dec 25, 2024
6d2848b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 25, 2024
6a6df59
fix the too long line issue, make test set split optional
uwanny Dec 25, 2024
6930b93
Merge branch 'EdAcc-dataset' of https://github.com/uwanny/espnet into…
uwanny Dec 25, 2024
8abea69
delete useless file
uwanny Dec 25, 2024
5c4e73d
solve line too long issue
uwanny Dec 25, 2024
db2a309
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 25, 2024
bee1b67
fix line too long
uwanny Dec 26, 2024
98623d5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 26, 2024
f8d73bb
add README
uwanny Dec 26, 2024
279a697
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 26, 2024
6135bab
update README, add missing file
uwanny Dec 26, 2024
475d159
remove duplicated file
uwanny Dec 26, 2024
362cb21
test line too long error
uwanny Dec 27, 2024
9033350
fix line too long, move to README
uwanny Dec 27, 2024
fbc1ec8
make data prep to multiple stages
uwanny Dec 27, 2024
0b40d51
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 27, 2024
6949d77
Update README.md in egs2
uwanny Dec 27, 2024
998c33c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 27, 2024
e6a6f11
Merge branch 'master' into EdAcc-dataset
uwanny Dec 29, 2024
95cf86f
Update README
uwanny Dec 30, 2024
a783392
update config, update run.sh
uwanny Dec 30, 2024
b157f5a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 30, 2024
3901058
update README
uwanny Dec 30, 2024
8912268
Merge branch 'EdAcc-dataset' of https://github.com/uwanny/espnet into…
uwanny Dec 30, 2024
bd05c27
trigger CI check
uwanny Dec 30, 2024
13d58fc
update README
uwanny Dec 31, 2024
2fe91b4
Merge branch 'master' into EdAcc-dataset
uwanny Dec 31, 2024
1 change: 1 addition & 0 deletions egs2/README.md
@@ -54,6 +54,7 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
| dns_ins21 | Deep Noise Suppression Challenge – INTERSPEECH 2021 | SE | 11 languages + singing| https://www.microsoft.com/en-us/research/academic-program/deep-noise-suppression-challenge-interspeech-2021/ | |
| dsing | Automatic Lyric Transcription from Karaoke Vocal Tracks (From DAMP Sing300x30x2) | ASR (ALT) | ENG singing | https://github.com/groadabike/Kaldi-Dsing-task | |
| easycom | An Augmented Reality Dataset to Support Algorithms for Easy Communication in Noisy Classification | ASR | ENG | https://github.com/facebookresearch/EasyComDataset | |
| edacc | The Edinburgh International Accents of English Corpus | ASR | ENG | https://groups.inf.ed.ac.uk/edacc/index.html#contribute-section | |
| esc50 | Dataset for Environmental Sound Classification | Audio Classification | | https://github.com/karolpiczak/ESC-50 | |
| fisher_callhome_spanish | Fisher and CALLHOME Spanish--English Speech Translation | ASR/ST | SPA->ENG | https://catalog.ldc.upenn.edu/LDC2014T23 | |
| fleurs | Few-shot Learning Evaluation of Universal Representations of Speech | ASR/Multilingual | 102 languages | https://huggingface.co/datasets/google/fleurs | |
1 change: 1 addition & 0 deletions egs2/TEMPLATE/asr1/db.sh
@@ -224,6 +224,7 @@ HIFITTS=downloads
CLOTHO_V2=downloads
AUDIOCAPS=
CLOTHO_CHATGPT_MIXUP=
EDACC=downloads

# For only CMU TIR environment
if [[ "$(hostname)" == tir* ]]; then
178 changes: 178 additions & 0 deletions egs2/edacc/asr1/README.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions egs2/edacc/asr1/asr.sh
110 changes: 110 additions & 0 deletions egs2/edacc/asr1/cmd.sh
@@ -0,0 +1,110 @@
# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ======
# Usage: <cmd>.pl [options] JOB=1:<nj> <log> <command...>
# e.g.
# run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB
#
# Options:
# --time <time>: Limit the maximum time to execute.
# --mem <mem>: Limit the maximum memory usage.
# --max-jobs-run <njob>: Limit the number of parallel jobs. This is ignored for non-array jobs.
# --num-threads <nthreads>: Specify the number of CPU cores.
# --gpu <ngpu>: Specify the number of GPU devices.
# --config: Change the configuration file from default.
#
# "JOB=1:10" is used for "array jobs" and it can control the number of parallel jobs.
# The left string of "=", i.e. "JOB", is replaced by <N> (the Nth job) in the command and the log file name,
# e.g. "echo JOB" is changed to "echo 3" for the 3rd job and "echo 8" for the 8th job.
# Note that the job index must start from a positive number, so you can't use "JOB=0:10", for example.
#
# run.pl, queue.pl, slurm.pl, and ssh.pl share a unified interface that does not depend on the backend.
# These options are mapped to backend-specific options,
# which are configured by "conf/queue.conf" and "conf/slurm.conf" by default.
# If jobs fail, your configuration might be wrong for your environment.
#
#
# The official documentation for run.pl, queue.pl, slurm.pl, and ssh.pl:
# "Parallelization in Kaldi": http://kaldi-asr.org/doc/queue.html
# =========================================================~


# Select the backend used by run.sh from "local", "stdout", "sge", "pbs", "slurm", "ssh", or "jhu"
cmd_backend='local'

# Local machine, without any Job scheduling system
if [ "${cmd_backend}" = local ]; then

# The other usage
export train_cmd="run.pl"
# Used for "*_train.py": "--gpu" is appended optionally by run.sh
export cuda_cmd="run.pl"
# Used for "*_recog.py"
export decode_cmd="run.pl"

# Local machine logging to stdout and log file, without any Job scheduling system
elif [ "${cmd_backend}" = stdout ]; then

# The other usage
export train_cmd="stdout.pl"
# Used for "*_train.py": "--gpu" is appended optionally by run.sh
export cuda_cmd="stdout.pl"
# Used for "*_recog.py"
export decode_cmd="stdout.pl"


# "qsub" (Sun Grid Engine, or derivation of it)
elif [ "${cmd_backend}" = sge ]; then
# The default setting is written in conf/queue.conf.
# You must change "-q g.q" for the "queue" for your environment.
# To know the "queue" names, type "qhost -q"
# Note that to use "--gpu *", you have to setup "complex_value" for the system scheduler.

export train_cmd="queue.pl"
export cuda_cmd="queue.pl"
export decode_cmd="queue.pl"


# "qsub" (Torque/PBS.)
elif [ "${cmd_backend}" = pbs ]; then
# The default setting is written in conf/pbs.conf.

export train_cmd="pbs.pl"
export cuda_cmd="pbs.pl"
export decode_cmd="pbs.pl"


# "sbatch" (Slurm)
elif [ "${cmd_backend}" = slurm ]; then
# The default setting is written in conf/slurm.conf.
# You must change "-p cpu" and "-p gpu" for the "partition" for your environment.
# To know the "partition" names, type "sinfo".
# You can use "--gpu * " by default for slurm and it is interpreted as "--gres gpu:*"
# The devices are allocated exclusively using "${CUDA_VISIBLE_DEVICES}".

export train_cmd="slurm.pl"
export cuda_cmd="slurm.pl"
export decode_cmd="slurm.pl"

elif [ "${cmd_backend}" = ssh ]; then
# You have to create ".queue/machines" to specify the host to execute jobs.
# e.g. .queue/machines
# host1
# host2
# host3
# This assumes you can log in to them without a password, i.e. you have to set up ssh keys.

export train_cmd="ssh.pl"
export cuda_cmd="ssh.pl"
export decode_cmd="ssh.pl"

# This is an example of specifying several unique options in the JHU CLSP cluster setup.
# Users can modify/add their own command options according to their cluster environments.
elif [ "${cmd_backend}" = jhu ]; then

export train_cmd="queue.pl --mem 2G"
export cuda_cmd="queue-freegpu.pl --mem 2G --gpu 1 --config conf/queue.conf"
export decode_cmd="queue.pl --mem 4G"

else
echo "$0: Error: Unknown cmd_backend=${cmd_backend}" 1>&2
return 1
fi
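
The header comment in cmd.sh describes how `JOB=1:<nj>` is expanded: the string `JOB` is replaced by the job index in both the command and the log file name. A minimal sketch of that substitution is given here. It is not `run.pl` itself; `expand_jobs` is a hypothetical helper that only prints what each array job would look like, one after another, without running anything.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the JOB=<start>:<end> expansion described in cmd.sh.
# It mimics only the textual substitution; nothing is executed or parallelized.
set -euo pipefail

# expand_jobs <start> <end> <log-template> <command...>
expand_jobs() {
    local start=$1 end=$2 log_template=$3
    shift 3
    local n a
    for ((n = start; n <= end; n++)); do
        local args=()
        for a in "$@"; do
            # Replace every occurrence of the literal string JOB with the index.
            args+=("${a//JOB/${n}}")
        done
        printf '%s: %s\n' "${log_template//JOB/${n}}" "${args[*]}"
    done
}

# Mirrors the example from the comment: run.pl JOB=1:3 echo.JOB.log echo JOB
expand_jobs 1 3 echo.JOB.log echo JOB
```

The real scripts additionally handle scheduling, memory limits, and exit codes, which this sketch leaves out.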
6 changes: 6 additions & 0 deletions egs2/edacc/asr1/conf/decode_asr.yaml
@@ -0,0 +1,6 @@
beam_size: 10
ctc_weight: 0.3
lm_weight: 0.0
maxlenratio: 0.0
minlenratio: 0.0
penalty: 0.0
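
As a rough illustration of how the values in decode_asr.yaml interact, the sketch below combines per-hypothesis scores in the way ESPnet2's joint CTC/attention beam search is understood to weight them: the attention decoder gets `1 - ctc_weight`, CTC gets `ctc_weight`, and the language model gets `lm_weight`. The function name `joint_score` and the log-probabilities passed to it are made up for illustration and are not part of this PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# joint_score <ctc_weight> <lm_weight> <att_logp> <ctc_logp> <lm_logp>
# Assumed weighting: (1 - ctc_weight) * att + ctc_weight * ctc + lm_weight * lm
joint_score() {
    awk -v cw="$1" -v lw="$2" -v att="$3" -v ctc="$4" -v lm="$5" 'BEGIN {
        printf "%.2f\n", (1 - cw) * att + cw * ctc + lw * lm
    }'
}

# ctc_weight=0.3 and lm_weight=0.0 come from conf/decode_asr.yaml;
# the three log-probabilities are invented example values.
joint_score 0.3 0.0 -4.0 -6.0 -9.0
```

With `lm_weight: 0.0` the language-model term drops out entirely, so the recipe decodes without an external LM contribution.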
2 changes: 2 additions & 0 deletions egs2/edacc/asr1/conf/fbank.conf
@@ -0,0 +1,2 @@
--sample-frequency=16000
--num-mel-bins=80
11 changes: 11 additions & 0 deletions egs2/edacc/asr1/conf/pbs.conf
@@ -0,0 +1,11 @@
# Default configuration
command qsub -V -v PATH -S /bin/bash
option name=* -N $0
option mem=* -l mem=$0
option mem=0 # Do not add anything to qsub_opts
option num_threads=* -l ncpus=$0
option num_threads=1 # Do not add anything to qsub_opts
option num_nodes=* -l nodes=$0:ppn=1
default gpu=0
option gpu=0
option gpu=* -l ngpus=$0
1 change: 1 addition & 0 deletions egs2/edacc/asr1/conf/pitch.conf
@@ -0,0 +1 @@
--sample-frequency=16000
12 changes: 12 additions & 0 deletions egs2/edacc/asr1/conf/queue.conf
@@ -0,0 +1,12 @@
# Default configuration
command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
option name=* -N $0
option mem=* -l mem_free=$0,ram_free=$0
option mem=0 # Do not add anything to qsub_opts
option num_threads=* -pe smp $0
option num_threads=1 # Do not add anything to qsub_opts
option max_jobs_run=* -tc $0
option num_nodes=* -pe mpi $0 # You must set this PE as allocation_rule=1
default gpu=0
option gpu=0
option gpu=* -l gpu=$0 -q g.q
14 changes: 14 additions & 0 deletions egs2/edacc/asr1/conf/slurm.conf
@@ -0,0 +1,14 @@
# Default configuration
command sbatch --export=PATH
option name=* --job-name $0
option time=* --time $0
option mem=* --mem-per-cpu $0
option mem=0
option num_threads=* --cpus-per-task $0
option num_threads=1 --cpus-per-task 1
option num_nodes=* --nodes $0
default gpu=0
option gpu=0 -p cpu
option gpu=* -p gpu --gres=gpu:$0 -c $0 # Recommend allocating more CPU than, or equal to the number of GPU
# note: the --max-jobs-run option is supported as a special case
# by slurm.pl and you don't have to handle it in the config file.
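
The `option` lines in these conf files act as templates: an exact entry such as `gpu=0` is used when the value matches, otherwise the wildcard entry `gpu=*` applies with `$0` replaced by the requested value. The sketch below reproduces that lookup for the two `gpu` lines of slurm.conf only. `map_gpu_option` is a hypothetical helper written for this illustration; the actual parsing lives in `slurm.pl`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# map_gpu_option <ngpu>
# Prints the sbatch flags that the gpu lines in conf/slurm.conf would yield.
map_gpu_option() {
    local ngpu=$1
    if [ "${ngpu}" -eq 0 ]; then
        # Exact match: "option gpu=0 -p cpu"
        echo "-p cpu"
    else
        # Wildcard match: "option gpu=* -p gpu --gres=gpu:$0 -c $0"
        local template='-p gpu --gres=gpu:$0 -c $0'
        echo "${template//\$0/${ngpu}}"
    fi
}

map_gpu_option 0
map_gpu_option 2
```

The partition names `cpu` and `gpu` are the placeholders from the default config and usually need to be changed for a given cluster.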
91 changes: 91 additions & 0 deletions egs2/edacc/asr1/conf/train_asr_wavlm_transformer.yaml
@@ -0,0 +1,91 @@
freeze_param: [
    "frontend.upstream"
]

frontend: s3prl
frontend_conf:
    frontend_conf:
        upstream: wavlm_base_plus
    download_dir: ./hub
    multilayer_feature: True

preencoder: linear
preencoder_conf:
    input_size: 768  # Note: If the upstream is changed, please change this value accordingly.
    output_size: 80

encoder: transformer
encoder_conf:
    output_size: 256
    attention_heads: 4
    linear_units: 1024
    num_blocks: 6
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    input_layer: conv2d2
    normalize_before: true

decoder: transformer
decoder_conf:
    attention_heads: 4
    linear_units: 2048
    num_blocks: 4
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.1
    src_attention_dropout_rate: 0.1

model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false
    extract_feats_in_collect_stats: false

seed: 2022
log_interval: 400
num_att_plot: 0
num_workers: 4
sort_in_batch: descending
sort_batch: descending
batch_type: numel
batch_bins: 12000000
accum_grad: 4
max_epoch: 160
patience: none
init: none
best_model_criterion:
-   - valid
    - acc
    - max
keep_nbest_models: 4

use_amp: true
cudnn_deterministic: false
cudnn_benchmark: false


optim: adam
optim_conf:
    lr: 0.008
    weight_decay: 0.001
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 1000


specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 27
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_ratio_range:
    - 0.
    - 0.05
    num_time_mask: 5
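
With `scheduler: warmuplr`, `warmup_steps: 1000`, and `lr: 0.008`, the learning rate ramps up linearly until step 1000 and then decays with the inverse square root of the step count. The sketch below evaluates that curve, assuming the Noam-style formula used by ESPnet2's WarmupLR, `lr * warmup^0.5 * min(step^-0.5, step * warmup^-1.5)`. `warmup_lr` is a hypothetical helper for illustration and not part of the recipe.

```shell
#!/usr/bin/env bash
set -euo pipefail

# warmup_lr <peak_lr> <warmup_steps> <step>
# Assumed schedule: lr * sqrt(warmup) * min(1/sqrt(step), step / warmup^1.5)
warmup_lr() {
    awk -v lr="$1" -v warmup="$2" -v step="$3" 'BEGIN {
        decay = 1 / sqrt(step)                   # step^-0.5
        ramp = step / (warmup * sqrt(warmup))    # step * warmup^-1.5
        m = (decay < ramp) ? decay : ramp
        printf "%.6f\n", lr * sqrt(warmup) * m
    }'
}

# lr and warmup_steps are taken from the config; the steps are example points.
for step in 250 1000 4000; do
    echo "step ${step}: $(warmup_lr 0.008 1000 "${step}")"
done
```

Because `accum_grad: 4` is set, one scheduler step is expected to correspond to four forward/backward passes, so the warmup spans more mini-batches than the step count alone suggests.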
1 change: 1 addition & 0 deletions egs2/edacc/asr1/db.sh
105 changes: 105 additions & 0 deletions egs2/edacc/asr1/local/data.sh
@@ -0,0 +1,105 @@
#!/usr/bin/env bash
# Set bash to strict mode; it will exit on:
# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'
set -e
set -u
set -o pipefail

log() {
local fname=${BASH_SOURCE[1]##*/}
echo -e "$(date '+%Y-%m-%dT%H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
}
SECONDS=0


stage=1
stop_stage=100
train_set="dev_train"
valid_set="dev_non_train"
test_set="test"
sub_test_set="test_sub"

log "$0 $*"
. utils/parse_options.sh

. ./db.sh
. ./path.sh
. ./cmd.sh

if [ $# -ne 0 ]; then
log "Error: No positional arguments are required."
exit 2
fi


if [ -z "${EDACC}" ]; then
log "Fill the value of 'EDACC' of db.sh"
exit 1
fi

partitions="${train_set} ${valid_set} ${test_set} ${sub_test_set}"

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
if [ ! -e "${EDACC}/edacc_v1.0/README.txt" ]; then
echo "stage 1: Please download data from https://datashare.ed.ac.uk/handle/10283/4836 and save to ${EDACC}"
else
log "stage 1: ${EDACC}/edacc_v1.0/README.txt already exists. Skipping data download."
fi
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
log "stage 2: Data preparation -- preprocess large wav files"

# split the one wav file in the data folder that is too large to process in one piece
audio_path="${EDACC}/edacc_v1.0/data/EDACC-C30.wav"
output_dir="${EDACC}/edacc_v1.0/data/segmentation"
mkdir -p "$output_dir"

if [ -f "$audio_path" ]; then
# split the file at 1883 seconds
ffmpeg -i "$audio_path" -ss 0 -t 1883 "$output_dir/EDACC-C30_P1.wav"
ffmpeg -i "$audio_path" -ss 1883 -c copy "$output_dir/EDACC-C30_P2.wav"

echo "Audio file successfully split into:"
echo " - $output_dir/EDACC-C30_P1.wav"
echo " - $output_dir/EDACC-C30_P2.wav"
else
echo "File $audio_path not found. Please check the file path."
exit 1
fi
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
log "stage 3: Data preparation -- prepare kaldi files, generate ${train_set},
${valid_set}, ${test_set}, ${sub_test_set}"

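# prepare the data in Kaldi style; the output goes to the "dev" and "test" folders under "data"
# output_dir is only assigned in stage 2, so set it again here; otherwise "set -u" aborts when starting from --stage 3
output_dir="${EDACC}/edacc_v1.0/data/segmentation"
python3 local/data_prep.py "${EDACC}/edacc_v1.0" "data" "${output_dir}"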

# # (optional) split the too long test utterance used for decoding section if necessary,
# # the alignment is based on CTC segmentation tool
# python3 local/truncate_test.py "data/test"

# make training data from dev, as original data has no training data
utils/subset_data_dir.sh --utt-list data/train_utterlist data/dev "data/${train_set}"
utils/subset_data_dir.sh --utt-list data/valid_utterlist data/dev "data/${valid_set}"

# make a sub test set from test set
utils/subset_data_dir.sh --first data/test 500 "data/${sub_test_set}"

# sort the data, and make utt2spk to spk2utt
for x in ${partitions}; do
for f in text wav.scp utt2spk segments; do
sort data/${x}/${f} -o data/${x}/${f}
done
utils/utt2spk_to_spk2utt.pl data/${x}/utt2spk > data/${x}/spk2utt
done

# Validate data
for x in ${partitions}; do
utils/validate_data_dir.sh --no-feats "data/${x}"
done
fi


log "Successfully finished. [elapsed=${SECONDS}s]"
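
data.sh gates each step on `stage` and `stop_stage`, so a call such as `local/data.sh --stage 2 --stop_stage 2` would run only the audio split. A small stand-alone sketch of that gating follows; `run_stages` is a hypothetical function that only prints which of the three stages would execute and does not touch any data.

```shell
#!/usr/bin/env bash
set -euo pipefail

# run_stages <stage> <stop_stage>
# Uses the same condition as data.sh: stage <= n and stop_stage >= n.
run_stages() {
    local stage=$1 stop_stage=$2 n
    for n in 1 2 3; do
        if [ "${stage}" -le "${n}" ] && [ "${stop_stage}" -ge "${n}" ]; then
            echo "stage ${n}"
        fi
    done
}

# Defaults in data.sh (stage=1, stop_stage=100) select every stage.
run_stages 1 100
```

Since each stage can be entered directly, any variable a later stage needs should be defined outside the earlier stage's `if` block or re-derived inside the later one.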