AudioMNIST experiments #1

Merged: 154 commits, Oct 29, 2024
Commits
962ad50
data prep scripts update
Adel-Moumen Feb 10, 2024
39b5049
iterate over utterances
Adel-Moumen Feb 10, 2024
b313734
without parallel map
Adel-Moumen Feb 10, 2024
7bdb17f
parallel map -> so fast omfg
Adel-Moumen Feb 10, 2024
a3d2d4c
gigaspeech data prep done
Adel-Moumen Feb 10, 2024
4cb3257
speechcolab extra dep if one must download gigaspeech
Adel-Moumen Feb 10, 2024
e521cc1
create ASR CTC folder
Adel-Moumen Feb 10, 2024
92a17c1
base yaml + update data prep to better reflect potential different na…
Adel-Moumen Feb 10, 2024
4dd02a0
update recipe
Adel-Moumen Feb 10, 2024
c254085
update recipe to be compliant with gigaspeech csv
Adel-Moumen Feb 10, 2024
b4de83a
add transformers dep
Adel-Moumen Feb 10, 2024
c3afdcc
convert opus to wav
Adel-Moumen Feb 10, 2024
945b8bb
recipe --debug mode works.
Adel-Moumen Feb 10, 2024
ae91209
typo GRABAGE_UTTERANCE_TAGS -> GARBAGE_UTTERANCE_TAGS
Adel-Moumen Feb 10, 2024
28b4257
tmp DL file
Adel-Moumen Feb 11, 2024
3a6396c
update DL FILE
Adel-Moumen Feb 11, 2024
6e771d7
add DL file in ASR/CTC
Adel-Moumen Feb 11, 2024
ebfcddb
update extra_requirements.txt
Adel-Moumen Feb 11, 2024
a68d0b8
add support of savedir within Pretrained subclasses
Adel-Moumen Feb 12, 2024
b2ed2a9
add wbs requirements
Adel-Moumen Feb 12, 2024
4b8c533
webdataset
Adel-Moumen Feb 13, 2024
44785c0
remove print
Adel-Moumen Feb 13, 2024
e203d77
tmp files webdataset
Adel-Moumen Feb 13, 2024
9b44e8d
verbosity + metada.json
Adel-Moumen Feb 14, 2024
1426156
letzo now label_encoder can actually train + the recipe seems to work.
Adel-Moumen Feb 14, 2024
0786b0b
Merge branch 'develop' of https://github.com/Adel-Moumen/speechbrain …
Adel-Moumen Feb 14, 2024
99bdfb1
Merge branch 'speechbrain:develop' into gigaspeech
Adel-Moumen Feb 14, 2024
aaeee16
Merge branch 'gigaspeech' of https://github.com/Adel-Moumen/speechbra…
Adel-Moumen Feb 14, 2024
ce12662
remove wbs
Adel-Moumen Mar 18, 2024
ed3ba03
DL info
Adel-Moumen Mar 18, 2024
8ae360b
HF DL support
Adel-Moumen Mar 18, 2024
1601ddc
remove webdataset as it sucks :p
Adel-Moumen Mar 18, 2024
9531d0b
name
Adel-Moumen Mar 18, 2024
1356ff1
ngram commands
Adel-Moumen Mar 18, 2024
4fa921b
Merge branch 'speechbrain:develop' into gigaspeech
Adel-Moumen Mar 18, 2024
0485173
whisper baseline
Adel-Moumen Mar 18, 2024
b360f8b
fix HF
Adel-Moumen Mar 18, 2024
3d71a04
Merge remote-tracking branch 'speechbrain/develop' into gigaspeech
Adel-Moumen Mar 29, 2024
81884ee
pre-commit + sentencepiece char
Adel-Moumen Mar 29, 2024
0f3da32
remove csv
Adel-Moumen Mar 29, 2024
cf2507a
Add quirks.py, move global PyTorch config and GPU workarounds there
asumagic Sep 17, 2024
0ea337f
Add support for SB_DISABLE_QUIRKS environment variable
asumagic Sep 17, 2024
265aa24
Fetch rework: make savedir optional
asumagic Oct 4, 2024
10b5286
Merge branch 'develop' into gigaspeech
TParcollet Oct 8, 2024
0009cf2
bunch of updates to make it run
TParcollet Oct 8, 2024
8bdbd1e
no download script
TParcollet Oct 8, 2024
8083872
fix precommit
TParcollet Oct 8, 2024
a362bca
fix precommit
TParcollet Oct 8, 2024
603049c
readmes
TParcollet Oct 8, 2024
d4b3f0d
readmes
TParcollet Oct 8, 2024
ef87027
readmes
TParcollet Oct 8, 2024
8d53430
readmes
TParcollet Oct 8, 2024
762a7b2
doc update
TParcollet Oct 8, 2024
14a9df7
CI god not happy, make CI god happy
TParcollet Oct 8, 2024
19d4753
why you here little encoder
TParcollet Oct 8, 2024
beb2ab2
adding a tranduscer streaming recipe, because why not
TParcollet Oct 8, 2024
cde564a
add test for transducer
TParcollet Oct 8, 2024
7f1ff0e
works better when me not stupid
TParcollet Oct 8, 2024
d27e285
fix yaml
TParcollet Oct 8, 2024
800d637
update req
TParcollet Oct 8, 2024
b76911b
add warning for cache dir
TParcollet Oct 9, 2024
f1be37b
add warning for cache dir
TParcollet Oct 9, 2024
d96d2ce
enable multiprocessing
TParcollet Oct 9, 2024
2926264
Minor cleanups to fetching
pplantinga Oct 9, 2024
f87e350
Change default behavior of inference to not create savedir if not spe…
pplantinga Oct 9, 2024
5259f27
allow data prep without ddp
TParcollet Oct 10, 2024
c0ea27a
fix tests
TParcollet Oct 10, 2024
688cbe3
smoll readme update
TParcollet Oct 10, 2024
99d998e
fix review comments
TParcollet Oct 11, 2024
484e8f4
Merge branch 'develop' into gigaspeech
TParcollet Oct 11, 2024
0d77a46
fixed concat_start_index check (#2717)
gfdb Oct 11, 2024
9912b25
Ensure adapted models save their parameters (#2716)
pplantinga Oct 11, 2024
679e270
wtf
Oct 11, 2024
a33cd7b
update doc
TParcollet Oct 11, 2024
9e2af5b
more documentation on storage
Oct 11, 2024
468147d
missing arg
TParcollet Oct 11, 2024
575a55c
a bit of logs
TParcollet Oct 11, 2024
0886ec6
new schedulers
TParcollet Oct 11, 2024
e285300
new schedulers
TParcollet Oct 11, 2024
e31a066
Fixes #2656: Remove EOS from SoundChoice
flexthink Oct 11, 2024
a06221b
fix my stupidity
TParcollet Oct 11, 2024
8eb530d
Merge branch 'speechbrain:develop' into gigaspeech
Adel-Moumen Oct 11, 2024
6bd627c
Update non-HF code path for new preprocessing code in GigaSpeech
asumagic Oct 15, 2024
dd28c73
Fix CSV path for non-HF Gigaspeech
asumagic Oct 15, 2024
ab79b48
Fix formatting
asumagic Oct 15, 2024
410fe2f
Kmeans fix (#2642)
poonehmousavi Oct 15, 2024
4822cba
Merge branch 'develop' into fetch-take-two
mravanelli Oct 15, 2024
5043059
add call on start of fit_batch fn
Adel-Moumen Oct 17, 2024
cdf4860
Update core.py
Adel-Moumen Oct 17, 2024
d3599dc
Update core.py
Adel-Moumen Oct 17, 2024
c650072
Merge branch 'speechbrain:develop' into fix_call_on_start_fit_batch
Adel-Moumen Oct 17, 2024
339360a
Merge pull request #2722 from Adel-Moumen/fix_call_on_start_fit_batch
asumagic Oct 18, 2024
2d157c4
Merge pull request #2718 from flexthink/speechbrain-g2p-fix
asumagic Oct 18, 2024
4e64041
Fix preprocess_text example
asumagic Oct 18, 2024
2dd7232
Fix guess_source docstring with up-to-date info
asumagic Oct 18, 2024
2ee50fb
Also remove default savedir from Pretrained
pplantinga Oct 18, 2024
6e999fe
Merge pull request #2712 from pplantinga/fetch-take-two
asumagic Oct 18, 2024
942d5ed
Merge branch 'develop' into gpu-quirks
asumagic Oct 20, 2024
d9efa5a
Fix function name for log_applied_quirks
asumagic Oct 20, 2024
932dcde
wip audiomnist+gt
naspert Oct 21, 2024
bad5e05
Revert "fix normalization for LFB"
naspert Oct 21, 2024
32f6038
audiomnist classification setup
naspert Oct 21, 2024
24ed44e
fix config
naspert Oct 21, 2024
2535f2c
add missing file
naspert Oct 21, 2024
510396b
update dataset load/training
naspert Oct 21, 2024
3a583f7
remove unnecessary params
naspert Oct 21, 2024
fbef11a
remove sort
naspert Oct 21, 2024
52d6744
remove unnecessary code
naspert Oct 21, 2024
ccf0f89
fix paths
naspert Oct 21, 2024
cc58456
fix loss computation
naspert Oct 21, 2024
aa55757
add missing flatten
naspert Oct 21, 2024
08cc458
print summary
naspert Oct 22, 2024
c7bb76f
Explain quirks in docs/experiment.md
asumagic Oct 22, 2024
1d8074f
ok stupid linter check that hates intentional leading spaces in markdown
asumagic Oct 22, 2024
1bb368a
Merge pull request #2558 from asumagic/gpu-quirks
asumagic Oct 22, 2024
906ada0
add citing in README
Adel-Moumen Oct 22, 2024
77c089f
add code to pad all wavs to the same length
naspert Oct 22, 2024
7cb19e9
fix pad call
naspert Oct 22, 2024
ce718dc
fix error computation
naspert Oct 22, 2024
d551595
fix error computation
naspert Oct 23, 2024
0db8721
Make `collect_in` optional for `Pretrainer`, disable it by default
asumagic Oct 22, 2024
2026dfe
Change more defaults to `savedir=None` and `fetch_strategy=SYMLINK`
asumagic Oct 23, 2024
6f2b7ad
move flatten in audionet
naspert Oct 23, 2024
f3a6337
Merge remote-tracking branch 'upstream/develop' into gigaspeech
asumagic Oct 23, 2024
de7a7e8
Fix GS transducer test prediction decoding?
asumagic Oct 23, 2024
dd5f7d2
fix data prep logic and paths
naspert Oct 23, 2024
5b15078
Actually fix GS transducer test prediction decoding
asumagic Oct 23, 2024
76a803b
Remove punctuation filtering that is handled elsewhere
asumagic Oct 23, 2024
231c78a
HuggingFance
asumagic Oct 23, 2024
743f902
fix skip data prep logic
naspert Oct 23, 2024
59794d0
add original audionet feature extraction
naspert Oct 23, 2024
de3426e
fix pooling for audionet feature extraction
naspert Oct 24, 2024
1772e51
fix audionet shape + remove input norm
naspert Oct 24, 2024
be5cb6e
try data augmentation
naspert Oct 24, 2024
c0d0838
add missing refs
naspert Oct 24, 2024
f2188ad
- rework AudioNet to have optional pooling
naspert Oct 25, 2024
2ed662b
fix typo in url
naspert Oct 25, 2024
6d93019
update audionet hparams
naspert Oct 25, 2024
16e4408
update audionet custom hparams
naspert Oct 25, 2024
6c7b563
update audionet custom hparams
naspert Oct 25, 2024
7168e0c
Updated warning for load_collected
asumagic Oct 25, 2024
fd0cd20
Merge pull request #2727 from asumagic/pretrainer-no-collect-dir
asumagic Oct 25, 2024
d98e949
Add results and notices for results for GigaSpeech transducer & wavlm
asumagic Oct 25, 2024
db5b629
english hard
asumagic Oct 25, 2024
3d2eeee
Merge pull request #2405 from Adel-Moumen/gigaspeech
asumagic Oct 25, 2024
e978711
update audionet custom hparams
naspert Oct 28, 2024
ca375b3
fix doc + pre-commit clean
naspert Oct 28, 2024
4b9d9fe
fix code examples
naspert Oct 28, 2024
56812a3
Merge remote-tracking branch 'origin/develop' into gammatone_dev
naspert Oct 28, 2024
a6b47f3
fix consistency tests
naspert Oct 29, 2024
13a9cbb
fix pre commit
naspert Oct 29, 2024
7807cb4
remove config
naspert Oct 29, 2024
7a3ff84
fix docstring for LFB
naspert Oct 29, 2024
92a88a7
fix docstring for GammatoneConv1D
naspert Oct 29, 2024
8 changes: 8 additions & 0 deletions docs/experiment.md
@@ -119,6 +119,14 @@ Beyond waveforms, this format is used for any tensor in the computation pipeline
(batch_size, time_steps, n_filters)
```

## Modified PyTorch globals and GPU quirks

For various reasons, SpeechBrain modifies some PyTorch global configuration to work around issues or improve execution speed, sometimes depending on GPU configuration.
We do so when we consider that some modified defaults make more sense for our use cases than PyTorch's defaults. For instance, we very commonly encounter dynamic tensor shapes, which are at odds with certain auto-tuning methods.

These changes are applied in a standardized location, [`quirks.py`](https://github.com/speechbrain/speechbrain/tree/develop/speechbrain/utils/quirks.py). They are logged when starting an experiment.

The `SB_DISABLE_QUIRKS` environment variable lets you disable quirks easily. For instance, to disable TensorFloat32 and re-enable JIT profiling, you would use `SB_DISABLE_QUIRKS=allow_tf32,disable_jit_profiling`.
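
As an illustration of the comma-separated format described above, here is a minimal sketch of how such a variable could be set from Python and split into quirk names. The helper `parse_disabled_quirks` is hypothetical and is not taken from `quirks.py`; it is also an assumption that the variable needs to be set before SpeechBrain applies its quirks, so setting it in the shell that launches the experiment is probably the safer route.

```python
import os


def parse_disabled_quirks(env=None):
    """Split SB_DISABLE_QUIRKS into a set of quirk names.

    Hypothetical helper for illustration only; the real parsing
    in SpeechBrain's quirks.py may differ.
    """
    env = os.environ if env is None else env
    raw = env.get("SB_DISABLE_QUIRKS", "")
    # Ignore empty entries and surrounding whitespace.
    return {name.strip() for name in raw.split(",") if name.strip()}


# Same value as in the documentation example above.
os.environ["SB_DISABLE_QUIRKS"] = "allow_tf32,disable_jit_profiling"
print(sorted(parse_disabled_quirks()))
# ['allow_tf32', 'disable_jit_profiling']
```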

## Reproducibility

18 changes: 17 additions & 1 deletion recipes/AudioMNIST/audiomnist_prepare.py
@@ -16,10 +16,12 @@
import json
import math
import os
import random
from functools import partial
from glob import glob
from subprocess import list2cmdline

import torch.nn.functional as Ft
import torchaudio
from torchaudio import functional as F
from tqdm.auto import tqdm
@@ -61,6 +63,7 @@ def prepare_audiomnist(
norm=True,
highpass=True,
process_audio=None,
pad_output=None,
skip_prep=False,
):
"""Auto-downloads and prepares the AudioMNIST dataset
@@ -106,6 +109,8 @@ def prepare_audiomnist(
process_audio: callable
a custom function used to process audio files - instead of
the standard transform (resample + normalize + trim)
pad_output: int
the length in samples of the output signal. If None, no padding is applied.
skip_prep: bool
whether preparation should be skipped

@@ -174,6 +179,7 @@ def prepare_audiomnist(
trim_threshold=trim_threshold,
norm=norm,
highpass=highpass,
pad_output=pad_output,
)

# Get file lists for train/valid/test splits
@@ -224,7 +230,7 @@ def skip(json_files, save_opt, conf):
"""

# Checking csv files
- skip = any(not os.path.isfile(json_file) for json_file in json_files)
+ skip = all(os.path.isfile(json_file) for json_file in json_files.values())

# Checking saved options
if skip is True:
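
The hunk above changes the check so that preparation is skipped only when every output file already exists, and so that it inspects the dictionary values (paths) rather than its keys. A minimal sketch of that behaviour follows; the split names and the helper `all_outputs_exist` are assumptions made for illustration, not part of the recipe.

```python
import os
import tempfile


def all_outputs_exist(json_files):
    """Return True only if every path in the mapping is an existing file.

    `json_files` is assumed to map a split name to a file path,
    as suggested by the `.values()` call in the hunk above.
    """
    return all(os.path.isfile(path) for path in json_files.values())


with tempfile.TemporaryDirectory() as tmp:
    json_files = {
        split: os.path.join(tmp, f"{split}.json")
        for split in ("train", "valid", "test")
    }

    # Nothing has been written yet, so preparation must not be skipped.
    before = all_outputs_exist(json_files)

    for path in json_files.values():
        with open(path, "w") as f:
            f.write("{}")

    # All three files are present now, so skipping is allowed.
    after = all_outputs_exist(json_files)

print(before, after)
```

With the earlier `any(not ...)` form the result had the opposite meaning: it was true when a file was missing, which is the case where preparation is needed.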
@@ -739,6 +745,7 @@ def process_audio_default(
src_sample_rate=48000,
tgt_sample_rate=22050,
trim_threshold=-30.0,
pad_output=None,
):
"""Standard audio preprocessing / conversion

@@ -758,6 +765,8 @@ def process_audio_default(
the target sample rate
trim_threshold: float
the decibels threshold for trimming the file
pad_output: int
the length of the output signal (if padding is needed). If None, no padding is applied.

Returns
-------
@@ -779,6 +788,13 @@ def process_audio_default(
)
sig = sig.squeeze(0)

if pad_output is not None:
delta = pad_output - len(sig)
offset = random.randint(
0, delta
) # if padding, insert blank space of random length at start of signal
sig = Ft.pad(sig, (offset, delta - offset), "constant", 0)

# Normalize
if norm:
sig = sig / sig.abs().max()
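
One point worth noting about the padding hunk above: if a signal is longer than `pad_output`, `delta` becomes negative and `random.randint(0, delta)` raises `ValueError` for an empty range. The sketch below reproduces the same random-offset padding on plain Python lists (to stay dependency-free) and adds a guard for that case. Returning the signal unchanged when it is too long is an assumption made here, and `pad_to_length` is a hypothetical name; the recipe may prefer cropping or raising an error instead.

```python
import random


def pad_to_length(sig, pad_output, rng=random):
    """Zero-pad `sig` to `pad_output` samples with a random leading offset.

    Illustrative list-based stand-in for the tensor code in the hunk above.
    """
    delta = pad_output - len(sig)
    if delta < 0:
        # Assumption: leave over-long signals untouched instead of failing.
        return list(sig)
    # Random amount of silence before the signal, the rest after it.
    offset = rng.randint(0, delta)
    return [0.0] * offset + list(sig) + [0.0] * (delta - offset)


rng = random.Random(0)  # seeded for a reproducible offset
padded = pad_to_length([1.0, 2.0, 3.0], 8, rng)
print(len(padded))  # 8
```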
75 changes: 75 additions & 0 deletions recipes/AudioMNIST/classification/README.md
@@ -0,0 +1,75 @@
# AudioMNIST Dataset classification
This folder contains recipes for spoken digit recognition with the [AudioMNIST dataset](https://github.com/soerenab/AudioMNIST),
including sample recipes for the [learnable Gammatone filterbank audio frontend](https://septentrio.uit.no/index.php/nldl/article/view/6279).
The recipes include the original [AudioNet architecture](https://arxiv.org/abs/1807.03418) and two variants that use a learnable
Gammatone filter bank as frontend.

# How to run
To run it, please type:

```
# AudioNet
python train.py hparams/audionet.yaml --data_folder=/path_to_/AudioMNIST
# LFB frontends
# AudioNet with Gammatone LFB
python train.py hparams/audionet_lfb.yaml --data_folder=/path_to_/AudioMNIST --seed=1234
# Customized AudioNet with Gammatone LFB
python train.py hparams/audionet_custom_lfb.yaml --data_folder=/path_to_/AudioMNIST --seed=1234
```

# Performance summary

[Test accuracy on AudioMNIST split 0]
| System | Accuracy |
|---------------------- | ------------ |
| AudioNet | 94.05% |
| AudioNet + LFB | 97.03% |
| AudioNet custom + LFB | 96.70% |


# Checkpoints and Training logs

You can find the full experiment folder (i.e., checkpoints, logs, etc) here:
- AudioNet: https://os.unil.cloud.switch.ch/swift/v1/lts2-speechbrain/AudioMNIST/results/audionet
- AudioNet + LFB: https://os.unil.cloud.switch.ch/swift/v1/lts2-speechbrain/AudioMNIST/results/audionet_lfb
- AudioNet custom + LFB: https://os.unil.cloud.switch.ch/swift/v1/lts2-speechbrain/AudioMNIST/results/audionet_custom_lfb

## Notes

- The recipe automatically downloads the AudioMNIST dataset. You only need to specify the path to which you would like to download it.

- The dataset has 5 different splits for training/validation/test samples. Check the yaml recipe for more information.

---------------------------------------------------------------------------------------------------------

## Citing

If you find this recipe useful, please cite:

```bibtex
@inproceedings{learnablefb,
title = {Learnable filter-banks for CNN-based audio applications},
author = {Peic Tukuljac, Helena and Ricaud, Benjamin and Aspert, Nicolas and Colbois, Laurent},
booktitle = {Proceedings of the Northern Lights Deep Learning Workshop 2022},
series = {Proceedings of the Northern Lights Deep Learning Workshop. 3},
pages = {9},
year = {2022},
abstract = {We investigate the design of a convolutional layer where kernels are parameterized functions. This layer aims at being the input layer of convolutional neural networks for audio applications or applications involving time-series. The kernels are defined as one-dimensional functions having a band-pass filter shape, with a limited number of trainable parameters. Building on the literature on this topic, we confirm that networks having such an input layer can achieve state-of-the-art accuracy on several audio classification tasks. We explore the effect of different parameters on the network accuracy and learning ability. This approach reduces the number of weights to be trained and enables larger kernel sizes, an advantage for audio applications. Furthermore, the learned filters bring additional interpretability and a better understanding of the audio properties exploited by the network.},
url = {https://septentrio.uit.no/index.php/nldl/article/view/6279},
doi = {10.7557/18.6279},
}
```

If you use **SpeechBrain**, please cite:

```bibtex
@misc{speechbrain,
title={{SpeechBrain}: A General-Purpose Speech Toolkit},
author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
year={2021},
eprint={2106.04624},
archivePrefix={arXiv},
primaryClass={eess.AS},
note={arXiv:2106.04624}
}
```

---------------------------------------------------------------------------------------------------------

1 change: 1 addition & 0 deletions recipes/AudioMNIST/classification/audiomnist_prepare.py
110 changes: 110 additions & 0 deletions recipes/AudioMNIST/classification/hparams/audionet.yaml
@@ -0,0 +1,110 @@
# ################################
# Model: Classification with AudioNet
# Authors: Nicolas Aspert 2024
# ################################

# Basic parameters
seed: 1879
__set_seed: !apply:torch.manual_seed [!ref <seed>]

number_of_digits: 10
# use "split[0-4].zip" for the official audiomnist splits, null for default data url in audiomnist_prepare.py
metadata_repo: https://os.unil.cloud.switch.ch/swift/v1/lts2-speechbrain/AudioMNIST/metadata_split0.zip
metadata_folder: null
output_folder: !ref ./results/audionet/<seed>
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt
data_save_folder: !ref <data_folder>/audiomnist_prepared
train_json: !ref <data_save_folder>/train.json
valid_json: !ref <data_save_folder>/valid.json
test_json: !ref <data_save_folder>/test.json

# Data files
data_folder: !PLACEHOLDER # e.g. /path/to/audiomnist

skip_prep: False
ckpt_interval_minutes: 15 # save checkpoint every N min

# Preparation Parameters
data_prepare_norm: True
data_prepare_trim: False
data_prepare_trim_threshold: -30.
data_prepare_sample_rate_src: 48000
data_prepare_sample_rate_tgt: 8000
# make all samples 1s long
data_prepare_pad_output: !ref <data_prepare_sample_rate_tgt>


####################### Training Parameters ####################################
number_of_epochs: 50
batch_size: 128
lr: 0.001
shuffle: True


# Feature parameters
audionet_features: 100

# Number of classes (i.e. different digits)
out_n_neurons: !ref <number_of_digits>

num_workers: 4
dataloader_options:
batch_size: !ref <batch_size>
shuffle: !ref <shuffle>
num_workers: !ref <num_workers>

# Functions
compute_features: !new:speechbrain.lobes.models.AudioNet.AudioNetFrontend
out_channels: !ref <audionet_features>
in_channels: 1
conv_kernel_size: 3
conv_dilation: 1
activation: !name:torch.nn.ReLU

embedding_model: !new:speechbrain.lobes.models.AudioNet.AudioNet
in_channels: !ref <audionet_features>
activation: !name:torch.nn.ReLU
conv_blocks: 5
conv_channels: [64, 128, 128, 128, 128]
conv_kernel_sizes: [3, 3, 3, 3, 3]
conv_dilations: [1, 1, 1, 1, 1]
max_pooling_kernel: [2, 2, 2, 2, 2]
max_pooling_stride: [2, 2, 2, 2, 2]

classifier: !new:speechbrain.lobes.models.AudioNet.Classifier
input_shape: [null, 16000]
activation: null
lin_blocks: 2
lin_neurons: [1024, 512]
out_neurons: !ref <out_n_neurons>

epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
limit: !ref <number_of_epochs>


modules:
compute_features: !ref <compute_features>
embedding_model: !ref <embedding_model>
classifier: !ref <classifier>

# Cost + optimization
compute_cost: !name:speechbrain.nnet.losses.nll_loss

opt_class: !name:torch.optim.Adam
lr: !ref <lr>

# Logging + checkpoints
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
save_file: !ref <train_log>

error_stats: !name:speechbrain.utils.metric_stats.MetricStats
metric: !name:speechbrain.nnet.losses.classification_error
reduction: batch

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
checkpoints_dir: !ref <save_folder>
recoverables:
embedding_model: !ref <embedding_model>
classifier: !ref <classifier>
counter: !ref <epoch_counter>
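
A short arithmetic check on the sizes in this config, offered as an inference from the numbers only. Samples are padded to 8000 (one second at 8 kHz), the last convolution block has 128 channels, and the classifier expects 16000 input features. The five stride-2 poolings listed in `embedding_model` alone would leave 250 frames (32000 features); 16000 matches six halvings, which would hold if the frontend also downsamples by two. That last point is an assumption, since the AudioNet code is not shown in this diff, and `flattened_size` is a hypothetical helper.

```python
def flattened_size(n_samples, n_halvings, channels):
    """Feature count after repeated stride-2 pooling and flattening.

    Assumes each pooling stage halves the time axis exactly
    (integer division) and convolutions preserve its length.
    """
    frames = n_samples
    for _ in range(n_halvings):
        frames //= 2
    return frames * channels


# Five poolings from the config only.
print(flattened_size(8000, 5, 128))  # 32000
# One extra halving, assumed to happen in the frontend.
print(flattened_size(8000, 6, 128))  # 16000
```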