
shuffle full Librispeech data #574

Merged
merged 8 commits on Nov 27, 2022

Conversation

huangruizhe
Contributor

@huangruizhe commented Sep 20, 2022

We noticed that when training with the librispeech tdnn_lstm_ctc recipe, there is an oscillation in the training loss curve:

[tensorboard]
Initial learning rate = 1e-4
WER: 7.04 test-clean, 18.28 test-other

With @desh2608's help and suggestions (he had this issue in the SPGI recipe), we tried two ways to deal with it.

  1. As done in this PR, we shuffle the CutSet for librispeech. Due to the lazy evaluation mechanism, an extra step has to be taken to force the CutSet to be eager ("not lazy") so that the full dataset can be shuffled properly (a Python sketch of both approaches is included after the results below).

[tensorboard]
Initial learning rate = 1e-4
WER: 6.85 test-clean, 17.74 test-other

However, I noticed a new issue with this shuffling: the running time becomes 1.5x slower even when using the same GPUs, if you compare the two tensorboards above.
It may be due to losing the benefits of reading the CutSet lazily, but this may not be the case. See below.

  2. We shuffle the jsonl files offline and load them lazily, as shown here:
cat <(gunzip -c data/fbank/librispeech_cuts_train-clean-100.jsonl.gz) \
  <(gunzip -c data/fbank/librispeech_cuts_train-clean-360.jsonl.gz) \
  <(gunzip -c data/fbank/librispeech_cuts_train-other-500.jsonl.gz) | \
  shuf | gzip -c > data/fbank/librispeech_cuts_train-all-shuf.jsonl.gz

The result is shown in:

[tensorboard]
Initial learning rate = 1e-4
WER: 6.86 test-clean, 17.8 test-other

The running time turns out to be the same as with the first method.
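
For clarity, here is a minimal Python sketch of the two approaches, assuming lhotse's load_manifest_lazy, to_eager(), and shuffle() APIs; the manifest paths simply follow the recipe's data/fbank layout and are illustrative:

from lhotse import load_manifest_lazy

parts = [
    "data/fbank/librispeech_cuts_train-clean-100.jsonl.gz",
    "data/fbank/librispeech_cuts_train-clean-360.jsonl.gz",
    "data/fbank/librispeech_cuts_train-other-500.jsonl.gz",
]

# Approach 1 (this PR): chain the lazy manifests, then force the CutSet
# into memory so that shuffle() permutes the full 960h set at once.
cuts = (
    load_manifest_lazy(parts[0])
    + load_manifest_lazy(parts[1])
    + load_manifest_lazy(parts[2])
)
cuts = cuts.to_eager().shuffle()

# Approach 2: the jsonl was already shuffled offline (see the shell
# command above), so it can keep being read lazily.
cuts_shuf = load_manifest_lazy(
    "data/fbank/librispeech_cuts_train-all-shuf.jsonl.gz"
)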

What confuses me more is that, in my previous experiment, shuffling the cuts offline did not make training slower.

[tensorboard]
Initial learning rate = 1e-4
WER: 6.83 test-clean, 17.91 test-other

However, I can no longer replicate this experiment.

I am putting this issue here for more discussion, in case you think further investigation is worthwhile. Thanks.

@pzelasko
Collaborator

Interesting... the only difference I can think of between eager and lazy CutSets is that with eager, you are holding many more objects in memory, and the cuts are being pickled and piped across processes in the DataLoader. But without proper profiling, it is very difficult for me to hypothesize why you would see such a slowdown...

@huangruizhe
Contributor Author

Thanks, Piotr, for the comments! How would you suggest setting up the profiling properly?
I have been worried that I might have done something wrong without knowing it. I am thinking of replicating the "offline shuffling" experiment first. In that case, there shouldn't be any training-time slowdown, right?

@pzelasko
Collaborator

Since it looks like the issue is related to dataloading, the first thing I'd do is remove the training code and just iterate over the dataloader for simpler profiling. Then you can do it multiple ways. One way is to measure the time it takes between every step of iteration over mini-batches. If your hypothesis is true, you should see that the average per-step time to iterate over an eager CutSet is longer than with a lazy CutSet. You can probably dig deeper using py-spy (either the top or record command) and try to identify in which function the program spends more time when using eager vs lazy. You might need to attach to different processes from multiple terminals (the main process/consumer and the dataloading process/producer).
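
For example, a minimal timing sketch along these lines (assuming train_dl is the DataLoader built by the recipe's datamodule; the names here are illustrative):

import time

# Iterate over the dataloader alone (no model, no optimizer) and record
# how long each mini-batch takes to arrive; train_dl is assumed to be
# the DataLoader produced by the recipe's datamodule. Note that the very
# first step also includes worker start-up time.
step_times = []
last = time.perf_counter()
for step, batch in enumerate(train_dl):
    now = time.perf_counter()
    step_times.append(now - last)
    last = now
    if step == 500:  # a few hundred steps is enough for a comparison
        break

print(f"mean per-step time: {sum(step_times) / len(step_times):.4f} s")

Running this once with the eager (in-memory shuffled) CutSet and once with the lazy one should show whether the gap really comes from dataloading before digging in with py-spy.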

I also realized later that whatever slowdown you observe might be related to how much load there is on your cluster. If a lot of people are using the magnetic disks for reads/writes, or overloading the CPUs, etc., you might see slowdowns. Try to monitor these things with tools like htop / iotop / nmon to confirm the slowdown is not related to external load.

@pzelasko
Collaborator

And yes, for "offline shuffling" we shouldn't expect any slowdown.

@danpovey
Collaborator

Perhaps we could change the data-prep scripts to shuffle the data offline?
We were trying to move towards lazy loading of data anyway.

@huangruizhe
Contributor Author

OK, if that's so, I will make the offline shuffling work for now.
What if people want to train on only the libri 100 + 360 parts? Then, in the data prep, we would need to provide several shuffled mixtures: 100, 100+360, 100+360+500. Is that what we hope to have?

@danpovey
Collaborator

We only have options for 100 and 960 right now, so there is no need to cover all possible amounts.

@csukuangfj
Collaborator

csukuangfj commented Sep 28, 2022

@huangruizhe

Could you please fix the code style issues?

@huangruizhe
Contributor Author

Yes, I will do that.

if [ ! -e data/fbank/.librispeech-validated.done ]; then
  log "Validating data/fbank for LibriSpeech"
  parts=(
    train-clean-100
    train-clean-360
    train-other-500
    train-all-shuf
Collaborator

Since train-all-shuf contains all the other 3 train sets, perhaps it is sufficient to only validate this.

@huangruizhe
Contributor Author

Not sure why it does not pass the automatic checks. All checks pass locally.

@csukuangfj
Collaborator

Not sure why it does not pass the automatic checks. All checks pass locally.

Are you using the same version of black as listed in
https://k2-fsa.github.io/icefall/contributing/code-style.html

@huangruizhe
Contributor Author

It works!
With the black that runs automatically with "git commit", the checks passed locally but not here.
With the suggested version of black installed, running black your_changed_file.py reveals the style inconsistency.

@csukuangfj
Collaborator

@huangruizhe

Could you also propagate the changes to other folders for librispeech?

conformer_mmi and streaming_conformer_ctc)
@wangtiance
Contributor

Is this change going to be merged?

@csukuangfj
Collaborator

Is this change going to be merged?

Sorry. Missed this PR.
