
shuffle full Librispeech data #574

Merged
merged 8 commits on Nov 27, 2022

Conversation

huangruizhe
Contributor

@huangruizhe commented Sep 20, 2022

We noticed that when training with the librispeech tdnn_lstm_ctc recipe, there is an oscillation in the training loss curve:

[tensorboard]
Initial learning rate = 1e-4
WER: 7.04 test-clean, 18.28 test-other

With @desh2608's help and suggestions (he had this issue in the SPGI recipe), we tried two ways to deal with it.

  1. As done in this PR, we shuffle the CutSet for librispeech. Due to the lazy evaluation mechanism, an extra step has to be taken to force the CutSet to be eager ("not lazy") so that the full dataset can be shuffled properly (a Python sketch of both approaches is included after the results below).

[tensorboard]
Initial learning rate = 1e-4
WER: 6.85 test-clean, 17.74 test-other

However, I noticed a new issue with this shuffling: the running time becomes 1.5x slower even when using the same GPUs, if you compare the two tensorboards above.
It may be due to losing the benefits of reading the CutSet lazily, but this may not be the case. See below.

  2. We shuffle the jsonl files offline and load them lazily, as shown here:
cat <(gunzip -c data/fbank/librispeech_cuts_train-clean-100.jsonl.gz) \
  <(gunzip -c data/fbank/librispeech_cuts_train-clean-360.jsonl.gz) \
  <(gunzip -c data/fbank/librispeech_cuts_train-other-500.jsonl.gz) | \
  shuf | gzip -c > data/fbank/librispeech_cuts_train-all-shuf.jsonl.gz

The result is shown in:

[tensorboard]
Initial learning rate = 1e-4
WER: 6.86 test-clean, 17.8 test-other

The running time turns out to be the same as with the first method.
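
For clarity, here is a minimal Python sketch of the two approaches, assuming lhotse's load_manifest_lazy, to_eager(), and shuffle() APIs; the manifest paths simply follow the recipe's data/fbank layout and are illustrative:

from lhotse import load_manifest_lazy

parts = [
    "data/fbank/librispeech_cuts_train-clean-100.jsonl.gz",
    "data/fbank/librispeech_cuts_train-clean-360.jsonl.gz",
    "data/fbank/librispeech_cuts_train-other-500.jsonl.gz",
]

# Approach 1 (this PR): chain the lazy manifests, then force the CutSet
# into memory so that shuffle() permutes the full 960h set at once.
cuts = (
    load_manifest_lazy(parts[0])
    + load_manifest_lazy(parts[1])
    + load_manifest_lazy(parts[2])
)
cuts = cuts.to_eager().shuffle()

# Approach 2: the jsonl was already shuffled offline (see the shell
# command above), so it can keep being read lazily.
cuts_shuf = load_manifest_lazy(
    "data/fbank/librispeech_cuts_train-all-shuf.jsonl.gz"
)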

What confuses me more is that, in my previous experiment, shuffling the cuts offline did not make training slower.

[tensorboard]
Initial learning rate = 1e-4
WER: 6.83 test-clean, 17.91 test-other

However, I can no longer replicate this experiment.

I am putting this issue here for more discussion, in case you think further investigation is worthwhile. Thanks.

@pzelasko
Collaborator

Interesting... the only difference I can think of between eager and lazy CutSets is that with eager, you are holding many more objects in memory, and the cuts are being pickled and piped across processes in the DataLoader. But without proper profiling, it is very difficult for me to hypothesize why you would see such a slowdown...

@huangruizhe
Contributor Author

Thanks, Piotr, for the comments! How would you suggest setting up the profiling properly?
I have been worried that I might have done something wrong without knowing it. I am thinking of replicating the "offline shuffling" experiment first. In that case, there shouldn't be any training-time slowdown, right?

@pzelasko
Collaborator

Since it looks like the issue is related to dataloading, the first thing I'd do is remove the training code and just iterate over the dataloader for simpler profiling. Then you can do it multiple ways. One way is to measure the time it takes between every step of iteration over mini-batches. If your hypothesis is true, you should see that the average per-step time to iterate over an eager CutSet is longer than with a lazy CutSet. You can probably dig deeper using py-spy (either the top or record command) and try to identify in which function the program spends more time when using eager vs lazy. You might need to attach to different processes from multiple terminals (the main process/consumer and the dataloading process/producer).
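
For example, a minimal timing sketch along these lines (assuming train_dl is the DataLoader built by the recipe's datamodule; the names here are illustrative):

import time

# Iterate over the dataloader alone (no model, no optimizer) and record
# how long each mini-batch takes to arrive; train_dl is assumed to be
# the DataLoader produced by the recipe's datamodule. Note that the very
# first step also includes worker start-up time.
step_times = []
last = time.perf_counter()
for step, batch in enumerate(train_dl):
    now = time.perf_counter()
    step_times.append(now - last)
    last = now
    if step == 500:  # a few hundred steps is enough for a comparison
        break

print(f"mean per-step time: {sum(step_times) / len(step_times):.4f} s")

Running this once with the eager (in-memory shuffled) CutSet and once with the lazy one should show whether the gap really comes from dataloading before digging in with py-spy.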

I also realized later that whatever slowdown you observe might be related to how much load there is on your cluster. If a lot of people are using the magnetic disks for reads/writes, or overloading the CPUs, etc., you might see slowdowns. Try to monitor these things with tools like htop / iotop / nmon to confirm the slowdown is not related to external load.

@pzelasko
Collaborator

And yes, for "offline shuffling" we shouldn't expect any slowdown.

@danpovey
Collaborator

Perhaps we could change the data-prep scripts to shuffle the data offline?
We were trying to move towards lazy loading of data anyway.

@huangruizhe
Contributor Author

OK, if that's so, I will make the offline shuffling work for now.
What if people want to train on only the libri 100 + 360 parts? Then, in the data prep, we would need to provide several shuffled mixtures: 100, 100+360, 100+360+500. Is that what we hope to have?

@danpovey
Collaborator

We only have options for 100 and 960 right now, so there is no need to cover all possible amounts.

@csukuangfj
Collaborator

csukuangfj commented Sep 28, 2022

@huangruizhe

Could you please fix the code style issues?

@huangruizhe
Contributor Author

Yes, I will do that.

if [ ! -e data/fbank/.librispeech-validated.done ]; then
  log "Validating data/fbank for LibriSpeech"
  parts=(
    train-clean-100
    train-clean-360
    train-other-500
    train-all-shuf
Collaborator

Since train-all-shuf contains all the other 3 train sets, perhaps it is sufficient to only validate this.

@huangruizhe
Contributor Author

Not sure why it does not pass the automatic checks. All checks pass locally.

@csukuangfj
Collaborator

Not sure why it does not pass the automatic checks. All checks pass locally.

Are you using the same version of black as listed in
https://k2-fsa.github.io/icefall/contributing/code-style.html

@huangruizhe
Contributor Author

It works!
With the black that runs automatically with "git commit", the checks passed locally but not here.
With the suggested version of black installed, running black your_changed_file.py reveals the style inconsistency.

@csukuangfj
Collaborator

@huangruizhe

Could you also propagate the changes to other folders for librispeech?

conformer_mmi and streaming_conformer_ctc)
@wangtiance
Contributor

Is this change going to be merged?

@csukuangfj
Collaborator

Is this change going to be merged?

Sorry. Missed this PR.
