Fix wav2vec2 masking #1799
Conversation
@anautsch I will make some changes to a few recipes in this PR... It may also have side effects on HF models (performance-wise only), but it is necessary, as this is a 'major' bug for HF models... Any chance you could run the recipe testing on all concerned recipes once I am done? I am really running out of time... |
@TParcollet you mean testing them on a test partition (for which datasets)? lmk when... |
I meant applying your recipe-testing PR to those changes; it will cover all recipes that are impacted (yaml changed in this PR). I'll try to do it ASAP. Worst case, next Monday. |
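The set of recipes to re-test can be derived mechanically from the yamls touched by the PR. A small illustrative sketch (the hparams-yaml-to-training-script mapping is an assumption based on the CommonVoice/CTC layout mentioned in this thread):

```python
# Sketch (not part of the SB test harness): map changed hparams yamls in this
# PR to the recipe training scripts that need re-testing.
changed = [
    "recipes/CommonVoice/ASR/CTC/hparams/train_en_with_wav2vec.yaml",
    "recipes/CommonVoice/ASR/CTC/hparams/train_fr_with_wav2vec.yaml",
]
impacted = sorted({
    path.split("/hparams/")[0] + "/train_with_wav2vec.py"
    for path in changed
    if path.endswith(".yaml")
})
# both yamls belong to the same recipe script, so one entry remains
```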
For testing, you'll need to edit the yamls - as they are now, testing will fail: the override 'wav2vec2_folder' is missing (needed to change paths in testing/debug mode). Example of the necessary changes:

$ diff recipes/CommonVoice/ASR/CTC/hparams/train_en_with_wav2vec.yaml.orig recipes/CommonVoice/ASR/CTC/hparams/train_en_with_wav2vec.yaml
16a17
> wav2vec2_folder: !ref <save_folder>/wav2vec2_checkpoint
115c116
< save_path: !ref <save_folder>/wav2vec2_checkpoint
---
> save_path: !ref <wav2vec2_folder>

To "blend in" PR 1600:

git clone git@github.com:TParcollet/speechbrain-released.git
cd speechbrain-released
git checkout fix_wav2vec2_masking
# or another environment manager (depending on licensing)
conda create -y -n pr1799 python=3.8
conda activate pr1799
pip install -r requirements.txt
pip install transformers datasets
pip install --editable .
rsync -P ~/pr1600/speechbrain/speechbrain/core.py speechbrain/core.py
rsync -avhP ~/pr1600/speechbrain/tests/recipes tests/
rsync -avhP ~/pr1600/speechbrain/tests/samples tests/
rsync -avhP ~/pr1600/speechbrain/tests/utils tests/

Recipe flow testing:

python -c 'from tests.utils.recipe_tests import run_recipe_tests; print("TEST FAILED!") if not(run_recipe_tests(filters_fields=["Script_file"], filters=[["recipes/CommonVoice/ASR/CTC/train_with_wav2vec.py"]], do_checks=False, run_opts="--device=cuda")) else print("TEST PASSED")'
# [...]
(1/5) Running test for CommonVoice_row_2...
(2/5) Running test for CommonVoice_row_3...
(3/5) Running test for CommonVoice_row_4...
(4/5) Running test for CommonVoice_row_5...
(5/5) Running test for CommonVoice_row_6...
TEST PASSED

It works - expected, since you worked on it until it ran in your PR environment. Next: testing eval performance on the CommonVoice dataset. |
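Comparing eval performance before and after a change boils down to checking that every tracked metric matches within a tolerance. A minimal sketch, assuming a hypothetical summary layout loosely modeled on tests/tmp/refactoring_results.yaml (the real schema may differ; the numbers are dummies):

```python
# Sketch: compare eval metrics gathered BEFORE and AFTER a change.
# The dict layout is a hypothetical stand-in for the summary file's content.
before = {"asr-wav2vec2-commonvoice-en": {"WER": 15.0, "CER": 7.0}}
after = {"asr-wav2vec2-commonvoice-en": {"WER": 15.0, "CER": 7.0}}

def same_results(a, b, tol=1e-3):
    """True if every tracked metric matches within a tolerance."""
    return all(
        abs(a[repo][metric] - b[repo][metric]) <= tol
        for repo in a
        for metric in a[repo]
    )

print("TEST PASSED" if same_results(before, after) else "TEST FAILED!")
```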
The follow-up test I had in mind concerns yaml/interface updates on dependent HF repos. I'd recommend making another PR for this type of change instead of local trial & error without version control. I get the feeling you'll want to update HF repos as well... As an example, for refactoring SB's transformer interface (PR 1596), I keep track of HF yaml changes in #1623. These updates go to this folder only (there, HF yamls are stored as a copy):

There's a part of the testing refactoring (PR 1600) which is not documented as of now, since I expected it to be used and improved throughout PR 1596 - looks like we will do that already here then 😋

The testing for updating HF yamls runs in two steps:

# this one will create a summary file: tests/tmp/refactoring_results.yaml
# it will contain the performance values of a datasets test partition BEFORE refactoring
# this will also create a folder: tests/tmp/hf_interfaces
# ... and clone a SB repo into it & switch branch to one that can merge into speechbrain:testing-refactoring
PYTHONPATH=`realpath .` python tests/utils/refactoring_checks.py tests/utils/overrides.yaml --LibriSpeech_data="" --CommonVoice_EN_data="" --CommonVoice_FR_data="" --IEMOCAP_data="" --after=False
# and here for adding eval results for AFTER refactoring
# it will also perform a comparison for same results
PYTHONPATH=`realpath .` python tests/utils/refactoring_checks.py tests/utils/overrides.yaml --LibriSpeech_data="" --CommonVoice_EN_data="" --CommonVoice_FR_data="" --IEMOCAP_data="" --after=True

Here is a detailed description of how testing configs can be managed:

Example: how this applies to this PR's CommonVoice/CTC w2v2 recipes.

a) Each language partition of a dataset needs a specification for testing in

CommonVoice_EN_data: !PLACEHOLDER

Then, there is a meta descriptor for this dataset & overrides specific to it:

CommonVoice_EN:
data_folder: !ref <CommonVoice_EN_data>

Since this one has no override flags, let's look at this one in comparison:

LibriSpeech:
data_folder: !ref <LibriSpeech_data>
skip_prep: True

It should be straightforward to expand this to new datasets & other test partitions (e.g. more CommonVoice languages). Since LibriSpeech is one entity, its test partitions are likewise treated here as one entity. SB handles the CommonVoice languages differently (one does not test FR on EN, and vice versa).

b) The repo:branch tracking the HF yaml/interface updates needs a specification (also in

new_interfaces_git: https://github.com/speechbrain/speechbrain # change this to your repo
new_interfaces_branch: testing-refactoring # change it to sth that relates to this PR also; more will come
new_interfaces_local_dir: tests/tmp/hf_interfaces # can remain as-is, unless you do heavier lifting

This will be used during the BEFORE run to gather the testing specifications and store them in the

Note: another possible override is in

# Filter HF repos (will be used in a local glob dir crawling)
# glob_filter: "*wav2vec2*"
# glob_filter: "*libri*"
glob_filter: "*"

c) At the moment, eval dataset testing capacity for CommonVoice is provided only for its EN & FR partitions. The HF testing meta files for IT & RW are brief:

# https://github.com/speechbrain/speechbrain/blob/testing-refactoring/updates_pretrained_models/asr-wav2vec2-commonvoice-it/test.yaml
sample: example-it.wav
cls: EncoderDecoderASR
fnx: transcribe_batch
# https://github.com/speechbrain/speechbrain/blob/testing-refactoring/updates_pretrained_models/asr-wav2vec2-commonvoice-rw/test.yaml
sample: example.mp3
cls: EncoderASR
fnx: transcribe_batch

The ones for EN & FR are more telling:

# https://github.com/speechbrain/speechbrain/blob/testing-refactoring/updates_pretrained_models/asr-wav2vec2-commonvoice-en/test.yaml
sample: example.wav
cls: EncoderDecoderASR
fnx: transcribe_batch
dataset: CommonVoice_EN
recipe_yaml: recipes/CommonVoice/ASR/seq2seq/hparams/train_en_with_wav2vec.yaml
overrides:
output_folder: !ref tests/tmp/<dataset>
dataio: from recipes.CommonVoice.ASR.seq2seq.train_with_wav2vec import dataio_prepare
test_datasets: dataio_prepare(recipe_hparams, model.tokenizer)[2]
test_loader: test_dataloader_options
performance:
CER:
handler: cer_computer
field: error_rate
WER:
handler: error_rate_computer
field: error_rate
predicted: predictions[0]
# https://github.com/speechbrain/speechbrain/blob/testing-refactoring/updates_pretrained_models/asr-wav2vec2-commonvoice-fr/test.yaml
sample: example-fr.wav
cls: EncoderASR
fnx: transcribe_batch
dataset: CommonVoice_FR
recipe_yaml: recipes/CommonVoice/ASR/CTC/hparams/train_fr_with_wav2vec.yaml
overrides:
output_folder: !ref tests/tmp/<dataset>
dataio: from recipes.CommonVoice.ASR.CTC.train_with_wav2vec import dataio_prepare
test_datasets: dataio_prepare(recipe_hparams, model.tokenizer)[2]
test_loader: test_dataloader_options
performance:
CER:
handler: cer_computer
field: error_rate
WER:
handler: error_rate_computer
field: error_rate
predicted: predictions[0]

This part of the testing script uses the recipe's dataloader options for testing (hence it needs to know

Please take a look at the yaml/interface changes for the transformer refactoring; it will be a lot to track everything manually.

edit: For treating the HF repo yaml/interfaces, I created a new orphan branch speechbrain:hf-interface-testing, which has its own git tree. To make a new PR to that branch, I have SB as a remote in my repo:

$ git remote -v
origin git@github.com:anautsch/speechbrain.git (fetch)
origin git@github.com:anautsch/speechbrain.git (push)
sb git@github.com:speechbrain/speechbrain.git (fetch)
sb git@github.com:speechbrain/speechbrain.git (push)

so I can simply run:

git checkout -b pr1600-yaml-refactoring sb/hf-interface-testing # change pr1600-yaml-refactoring to your needs
git branch --unset-upstream # so you have to specify the upstream & direct it to your repo (to create a PR from there)

Then, implement changes & open a PR to that

So, if this repo is placed under |
@anautsch @Adel-Moumen I am bypassing branch protection on this PR as it is of critical interest for ongoing work. According to my tests, it should not fail (@anautsch I did not use your recipe testing as I don't have time to make the necessary changes right now, but they will be integrated by your PR anyway, so it's just redundant). Just keep in mind that any issue related to w2v2 and "wav_len" could come from this PR. |
Fix an issue making fairseq and HF w2v2 models differ in terms of results.
We were not passing padding masks.
This will potentially break all previous W2V2 models (results-wise).
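The essence of the fix is deriving a padding mask from the relative wav_lens so the HF model can ignore padded frames; roughly, this mask is passed to the HF wav2vec2 model as its attention_mask. A minimal pure-Python sketch of the mask construction (illustrative only; SpeechBrain builds it as a torch tensor):

```python
# Sketch: build a per-sample padding mask (1 = real audio, 0 = padding)
# from relative lengths, the kind of mask that was previously not passed
# to the HF wav2vec2 models.
def make_padding_mask(batch_len, wav_lens):
    """batch_len: padded length; wav_lens: relative lengths in [0, 1]."""
    masks = []
    for rel in wav_lens:
        n_valid = round(rel * batch_len)
        masks.append([1] * n_valid + [0] * (batch_len - n_valid))
    return masks

mask = make_padding_mask(4, [1.0, 0.5])
# mask == [[1, 1, 1, 1], [1, 1, 0, 0]]
```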