Fix wav2vec2 masking #1799

TParcollet · 2023-01-11T18:26:03Z

Fix an issue making fairseq and HF w2v2 models differ in terms of results.

We were not passing padding masks.

This will potentially break all previous W2V2 models (results-wise).
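
To illustrate what a missing padding mask means: in a padded batch, shorter utterances are filled up to the longest one, and without a mask the encoder treats that filler as real audio. The sketch below is a minimal plain-Python illustration (lists instead of tensors), assuming SpeechBrain-style relative lengths in [0, 1]. `make_padding_mask` is a hypothetical helper, not the function changed in this PR.

```python
def make_padding_mask(wav_lens, max_len):
    """Build one mask row per utterance: 1 marks a real sample, 0 marks padding.

    wav_lens: relative lengths in [0, 1] (assumed convention, 1.0 = longest utterance).
    max_len:  number of samples in the padded batch.
    """
    mask = []
    for rel_len in wav_lens:
        n_valid = int(round(rel_len * max_len))
        mask.append([1] * n_valid + [0] * (max_len - n_valid))
    return mask


# Two utterances: the second one only fills half of the padded length.
print(make_padding_mask([1.0, 0.5], max_len=4))
# → [[1, 1, 1, 1], [1, 1, 0, 0]]
```

If such a mask is not forwarded to the model, the zeros in the second row are processed like signal, which is one way two otherwise identical models can end up with different results.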

TParcollet · 2023-01-13T09:47:57Z

@anautsch I will make some changes to a few recipes on that PR... It may also create some side effects on HF models (performance-wise only), but it is necessary as this is a 'major' bug for HF models... Any chance you could run the recipe testing on all recipes concerned once I am done? I am really running out of time...

anautsch · 2023-01-13T09:53:21Z

@TParcollet you mean testing them on a test partition (for which datasets)? lmk when...

TParcollet · 2023-01-13T10:06:06Z

I meant applying your recipe testing PR to those changes, it will be for all recipes that are impacted (yaml changed in this PR). I'll try to do it asap. Worst case, next Monday.

anautsch · 2023-01-13T10:44:08Z

For testing, you'll need to edit the yamls - as they are now, testing will fail - the override 'wav2vec2_folder' is missing (needed to change paths in testing/debug mode).

Example for necessary changes

$ diff recipes/CommonVoice/ASR/CTC/hparams/train_en_with_wav2vec.yaml.orig recipes/CommonVoice/ASR/CTC/hparams/train_en_with_wav2vec.yaml
16a17
> wav2vec2_folder: !ref <save_folder>/wav2vec2_checkpoint
115c116
<     save_path: !ref <save_folder>/wav2vec2_checkpoint
---
>     save_path: !ref <wav2vec2_folder>
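
Since every affected hparams file needs this top-level key, a quick scan can flag the ones still missing it. The following is a rough sketch using only a regular expression (the files contain HyperPyYAML tags such as `!ref`, so it deliberately avoids a YAML parser). `has_top_level_key` is a made-up helper name, and the `modules:` parent key in the sample text is only a placeholder.

```python
import re


def has_top_level_key(yaml_text, key):
    """Return True if `key:` starts a line at column 0, i.e. is a top-level entry."""
    pattern = re.compile(r"^" + re.escape(key) + r"\s*:", re.MULTILINE)
    return pattern.search(yaml_text) is not None


# Sample texts mirroring the diff above ("modules:" is a placeholder parent key).
original = "modules:\n    save_path: !ref <save_folder>/wav2vec2_checkpoint\n"
patched = (
    "wav2vec2_folder: !ref <save_folder>/wav2vec2_checkpoint\n"
    "modules:\n    save_path: !ref <wav2vec2_folder>\n"
)

print(has_top_level_key(original, "wav2vec2_folder"))  # → False
print(has_top_level_key(patched, "wav2vec2_folder"))   # → True
```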

To "blend-in" PR 1600:

git clone git@github.com:TParcollet/speechbrain-released.git
cd speechbrain-released
git checkout fix_wav2vec2_masking
# or another environment manager (depending on licensing)
conda create -y -n pr1799 python=3.8
conda activate pr1799
pip install -r requirements.txt
pip install transformers datasets
pip install --editable .
rsync -P ~/pr1600/speechbrain/speechbrain/core.py speechbrain/core.py
rsync -avhP ~/pr1600/speechbrain/tests/recipes tests/
rsync -avhP ~/pr1600/speechbrain/tests/samples tests/
rsync -avhP ~/pr1600/speechbrain/tests/utils tests/

Recipe flow testing:

python -c 'from tests.utils.recipe_tests import run_recipe_tests; print("TEST FAILED!") if not(run_recipe_tests(filters_fields=["Script_file"], filters=[["recipes/CommonVoice/ASR/CTC/train_with_wav2vec.py"]], do_checks=False, run_opts="--device=cuda")) else print("TEST PASSED")'
# [...]
(1/5) Running test for CommonVoice_row_2...
(2/5) Running test for CommonVoice_row_3...
(3/5) Running test for CommonVoice_row_4...
(4/5) Running test for CommonVoice_row_5...
(5/5) Running test for CommonVoice_row_6...
TEST PASSED

works - expected as you worked on it until it was running in your PR environment.

Next: testing eval performance with the CommonVoice dataset.

anautsch · 2023-01-13T11:56:25Z

The follow-up test I had in mind concerns yaml/interface updates on depending HF repos. I'd recommend making another PR for this type of change instead of local trial & error w/o version control. I get the feeling you'll want to update the HF repos as well...

As an example, for refactoring SB's transformer interface (PR 1596), I keep track of HF yaml changes in #1623

These updates are to this folder only (there, HF yamls are stored; as a copy):
https://github.com/speechbrain/speechbrain/tree/testing-refactoring/updates_pretrained_models

There's a part of the testing refactoring (PR 1600), which is not documented as of now, since I expected it to be used and improved throughout the PR 1596—looks like we will do that already here then 😋
(Or at least, we'll have prelim documentation for it ready now.)

The testing for updating HF yamls would be in two steps:

# this one will create a summary file: tests/tmp/refactoring_results.yaml
# it will contain the performance values of a datasets test partition BEFORE refactoring
# this will also create a folder: tests/tmp/hf_interfaces
# ... and clone a SB repo into it & switch branch to one that can merge into speechbrain:testing-refactoring
PYTHONPATH=`realpath .` python tests/utils/refactoring_checks.py tests/utils/overrides.yaml --LibriSpeech_data="" --CommonVoice_EN_data="" --CommonVoice_FR_data="" --IEMOCAP_data="" --after=False

# and here for adding eval results for AFTER refactoring
# it will also perform a comparison for same results
PYTHONPATH=`realpath .` python tests/utils/refactoring_checks.py tests/utils/overrides.yaml --LibriSpeech_data="" --CommonVoice_EN_data="" --CommonVoice_FR_data="" --IEMOCAP_data="" --after=True
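
The comparison step can be pictured as follows. This is only a sketch: the layout of tests/tmp/refactoring_results.yaml is not shown in this thread, so the nested `{repo: {metric: value}}` dictionaries, the `compare_results` helper, and the numbers are all assumptions for illustration.

```python
def compare_results(before, after, tol=1e-6):
    """Return {repo: {metric: (before, after)}} for every metric that changed or vanished."""
    diffs = {}
    for repo, metrics in before.items():
        for name, old in metrics.items():
            new = after.get(repo, {}).get(name)
            if new is None or abs(new - old) > tol:
                diffs.setdefault(repo, {})[name] = (old, new)
    return diffs


# Dummy values, not measured results.
before = {"asr-wav2vec2-commonvoice-fr": {"WER": 10.0, "CER": 3.0}}
after = {"asr-wav2vec2-commonvoice-fr": {"WER": 10.5, "CER": 3.0}}

print(compare_results(before, after))
# → {'asr-wav2vec2-commonvoice-fr': {'WER': (10.0, 10.5)}}
```

An empty result means the BEFORE and AFTER runs agree within the tolerance.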

Here is a detailed description of how testing configs can be managed:
https://github.com/anautsch/speechbrain/blob/refactor-recipe-testing/tests/coverage/Refactoring.md

Example: How this applies to this PR's CommonVoice/CTC w2v2 recipes.

a) Each language partition of a dataset needs specification for testing in tests/utils/overrides.yaml
There's a PLACEHOLDER which should be set to the path (or be a "" override to ignore that dataset).

CommonVoice_EN_data: !PLACEHOLDER

Then, there is a meta descriptor for this dataset & overrides specific to it.

CommonVoice_EN:
  data_folder: !ref <CommonVoice_EN_data>

—since this one has no override flags, let's take a look at this one in comparison:

LibriSpeech:
  data_folder: !ref <LibriSpeech_data>
  skip_prep: True

It should be straightforward to expand this to new datasets & other test partitions (e.g. more CommonVoice languages). Since LibriSpeech is one entity, its test partitions are likewise treated as one entity here. SB handles the different CommonVoice languages separately (one does not test FR on EN, and vice versa).
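
The "path or empty override" convention can also be driven from a small helper that assembles the command line shown earlier. A sketch, assuming the four dataset flags used in this thread are the complete set; `build_check_command` and the example path are hypothetical.

```python
import shlex

# Dataset flags as they appear in the commands in this thread.
DATASET_KEYS = [
    "LibriSpeech_data",
    "CommonVoice_EN_data",
    "CommonVoice_FR_data",
    "IEMOCAP_data",
]


def build_check_command(data_paths, after):
    """Build the argv list; datasets without a path get an empty override (ignored)."""
    args = ["python", "tests/utils/refactoring_checks.py", "tests/utils/overrides.yaml"]
    for key in DATASET_KEYS:
        args.append(f"--{key}={data_paths.get(key, '')}")
    args.append(f"--after={after}")
    return args


# Only CommonVoice EN is available locally (hypothetical path).
cmd = build_check_command({"CommonVoice_EN_data": "/data/cv-en"}, after=False)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` with `PYTHONPATH` set to the repository root, as in the shell commands above.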

b) the repo:branch tracking HF yaml/interface updates needs specification (also in tests/utils/overrides.yaml)

new_interfaces_git: https://github.com/speechbrain/speechbrain  # change this to your repo
new_interfaces_branch: testing-refactoring  # change it to sth that relates to this PR also; more will come
new_interfaces_local_dir: tests/tmp/hf_interfaces  # can remain as-is, unless you do heavier lifting

This will be used during the BEFORE run to gather the testing specifications and store them in the new_interfaces_local_dir. The BEFORE run will use the yaml/interface from the HF repo. The AFTER run will use the yaml/interface from the new_interfaces_branch.

Note: another possible override in tests/utils/overrides.yaml

# Filter HF repos (will be used in a local glob dir crawling)
# glob_filter: "*wav2vec2*"
# glob_filter: "*libri*"
glob_filter: "*"
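
For a feel of what these patterns select, here is a small sketch using Python's `fnmatch` on a list of folder names. How the testing script applies the filter internally is not shown here, so this is an approximation; the librispeech folder name is only an illustrative entry.

```python
from fnmatch import fnmatchcase

# Folder names as they would appear under updates_pretrained_models (last one is illustrative).
repos = [
    "asr-wav2vec2-commonvoice-en",
    "asr-wav2vec2-commonvoice-fr",
    "asr-crdnn-rnnlm-librispeech",
]


def filter_repos(names, glob_filter):
    """Keep the names matching a shell-style pattern such as '*wav2vec2*'."""
    return [name for name in names if fnmatchcase(name, glob_filter)]


print(filter_repos(repos, "*wav2vec2*"))
# → ['asr-wav2vec2-commonvoice-en', 'asr-wav2vec2-commonvoice-fr']
print(filter_repos(repos, "*libri*"))
# → ['asr-crdnn-rnnlm-librispeech']
```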

c) At the moment, for CommonVoice, eval dataset testing is provided only for its EN & FR partitions.

The HF testing meta files for IT & RW are brief:

# https://github.com/speechbrain/speechbrain/blob/testing-refactoring/updates_pretrained_models/asr-wav2vec2-commonvoice-it/test.yaml

sample: example-it.wav
cls: EncoderDecoderASR
fnx: transcribe_batch

# https://github.com/speechbrain/speechbrain/blob/testing-refactoring/updates_pretrained_models/asr-wav2vec2-commonvoice-rw/test.yaml

sample: example.mp3
cls: EncoderASR
fnx: transcribe_batch

The ones for EN & FR are more telling:

# https://github.com/speechbrain/speechbrain/blob/testing-refactoring/updates_pretrained_models/asr-wav2vec2-commonvoice-en/test.yaml

sample: example.wav
cls: EncoderDecoderASR
fnx: transcribe_batch
dataset: CommonVoice_EN
recipe_yaml: recipes/CommonVoice/ASR/seq2seq/hparams/train_en_with_wav2vec.yaml
overrides:
  output_folder: !ref tests/tmp/<dataset>
dataio: from recipes.CommonVoice.ASR.seq2seq.train_with_wav2vec import dataio_prepare
test_datasets: dataio_prepare(recipe_hparams, model.tokenizer)[2]
test_loader: test_dataloader_options
performance:
  CER:
    handler: cer_computer
    field: error_rate
  WER:
    handler: error_rate_computer
    field: error_rate
predicted: predictions[0]

# https://github.com/speechbrain/speechbrain/blob/testing-refactoring/updates_pretrained_models/asr-wav2vec2-commonvoice-fr/test.yaml

sample: example-fr.wav
cls: EncoderASR
fnx: transcribe_batch
dataset: CommonVoice_FR
recipe_yaml: recipes/CommonVoice/ASR/CTC/hparams/train_fr_with_wav2vec.yaml
overrides:
  output_folder: !ref tests/tmp/<dataset>
dataio: from recipes.CommonVoice.ASR.CTC.train_with_wav2vec import dataio_prepare
test_datasets: dataio_prepare(recipe_hparams, model.tokenizer)[2]
test_loader: test_dataloader_options
performance:
  CER:
    handler: cer_computer
    field: error_rate
  WER:
    handler: error_rate_computer
    field: error_rate
predicted: predictions[0]

This part of the testing script
https://github.com/anautsch/speechbrain/blob/b7e1b02a8cb3be81640c40c23a99d5af646a24e5/tests/utils/refactoring_checks.py#L225

uses the recipe's dataloader options for testing (hence it needs to know recipe_yaml) & takes from the above yaml the Python snippets that summarise what the recipe effectively does to report eval results.
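
As a rough picture of how fields like `cls`, `fnx` and `predicted` can be turned into a call: the sketch below resolves the class and method by name and evaluates the `predicted` expression on the return value. It is not the code from refactoring_checks.py. The stub class, `REGISTRY`, and `run_spec` are stand-ins invented for this example, and `eval` is only acceptable here because the yaml is trusted.

```python
class EncoderASR:
    """Stand-in stub; the real interface class lives in SpeechBrain."""

    def transcribe_batch(self, wavs, wav_lens):
        # Returns (words, tokens), mimicking a (predictions, extra) tuple.
        return ["HELLO WORLD" for _ in wavs], [[1, 2] for _ in wavs]


# Hypothetical lookup table from the `cls` string to a class.
REGISTRY = {"EncoderASR": EncoderASR}

spec = {"cls": "EncoderASR", "fnx": "transcribe_batch", "predicted": "predictions[0]"}


def run_spec(spec, wavs, wav_lens):
    """Instantiate `cls`, call `fnx`, then pick the output named by `predicted`."""
    model = REGISTRY[spec["cls"]]()
    predictions = getattr(model, spec["fnx"])(wavs, wav_lens)
    # The `predicted` field is treated as a Python expression over `predictions`.
    return eval(spec["predicted"], {"__builtins__": {}}, {"predictions": predictions})


print(run_spec(spec, wavs=[[0.0, 0.1]], wav_lens=[1.0]))
# → ['HELLO WORLD']
```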

Please take a look at the yaml/interface changes for the transformer refactoring
https://github.com/speechbrain/speechbrain/pull/1623/files

it will be a lot to track everything manually.

edit:

For treating the HF repo yaml/interfaces, I created a new orphan branch speechbrain:hf-interface-testing, which has its own git tree.

To make a new PR to that branch, I have SB as a remote in my repo:

$ git remote -v
origin  git@github.com:anautsch/speechbrain.git (fetch)
origin  git@github.com:anautsch/speechbrain.git (push)
sb      git@github.com:speechbrain/speechbrain.git (fetch)
sb      git@github.com:speechbrain/speechbrain.git (push)

so I can simply run:

git checkout -b pr1600-yaml-refactoring sb/hf-interface-testing  # change pr1600-yaml-refactoring to your needs
git branch --unset-upstream  # so you have to specify the upstream & direct it to your repo (to create a PR from there)

implement changes & open a PR to that hf-interface-testing branch.

So, if this repo is placed under tests/tmp/hf_interfaces, a cd into that folder switches the git context (and one can work on two branches in parallel).

TParcollet · 2023-01-15T17:02:56Z

@anautsch @Adel-Moumen I am bypassing branch protection on this PR as this is of critical interest for ongoing work. According to my tests, it should not fail (@anautsch I did not use your recipe testing as I don't have the time to do the necessary changes right now, but they will be integrated by your PR anyway, so it's just redundant).

Just keep in mind that any issue related to w2v2 and "wav_len" could come from this PR.

TParcollet added 9 commits January 11, 2023 16:21

masking (5338ec0)
masking (2a8389c)
masking (82fb1f8)
add wav_lens (a685059)
changing... (db8286c)
changing... (8edcb15)
adding padding masks to pretrain (f312342)
Refactor fit (d64caa4)
add padding (cfeb901)
modify all recipes (002779c)

TParcollet merged commit 44d1316 into speechbrain:develop Jan 15, 2023
