shuffle full Librispeech data #574
Conversation
Interesting... the only difference I can think of when using eager vs lazy CutSets is that with eager, you are holding many more objects in memory, and the cuts are being pickled and piped across processes in the DataLoader. But without proper profiling it is very difficult for me to hypothesize why you would see such a slowdown...
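The eager-vs-lazy difference described above can be illustrated in plain Python. Toy dicts stand in for cuts here; this is only an illustration of the memory trade-off, not lhotse's actual API:

```python
import sys

# Eager: all items are materialized in memory up front; pickling the
# whole collection for DataLoader worker processes copies everything.
eager_cuts = [{"id": f"cut-{i}"} for i in range(1000)]

# Lazy: items are produced on demand; only the recipe for producing
# them is held, so memory stays small until iteration.
def lazy_cuts(n):
    for i in range(n):
        yield {"id": f"cut-{i}"}

eager_size = sys.getsizeof(eager_cuts)      # grows with the number of cuts
lazy_size = sys.getsizeof(lazy_cuts(1000))  # roughly constant
```

The generator itself is tiny regardless of how many items it will yield, which is the property a lazy CutSet exploits.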
Thanks, Piotr, for the comments! What would you suggest for setting up the profiling properly?
Since it looks like the issue is related to dataloading, the first thing I'd do is remove the training code and just iterate over the dataloader for simpler profiling. Then you can do it multiple ways. One way is to measure the time it takes between every step of iteration over mini-batches. If your hypothesis is true, you should see that the average per-step time it takes to iterate over an eager CutSet is longer than with a lazy CutSet. You can probably dig deeper from there.

I also realized later that whatever slowdown you observe might be related to how much load there is on your cluster. If a lot of people are using the magnetic disks for reads/writes, or overloading the CPUs, etc., you might see slowdowns. Try to monitor these things with tools like htop / iotop / nmon to confirm the slowdown is not related to external load.
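The per-step timing idea above can be sketched in plain Python; a stand-in iterable of lists replaces the real DataLoader (the function name and warm-up handling are illustrative assumptions):

```python
import time

def measure_step_times(batch_iter, warmup=2):
    """Return per-step wall-clock times (seconds), skipping a few
    warm-up steps whose timing is dominated by startup costs."""
    times = []
    start = time.perf_counter()
    for step, _batch in enumerate(batch_iter):
        now = time.perf_counter()
        if step >= warmup:
            times.append(now - start)
        start = now
    return times

# Stand-in for a DataLoader: any iterable of mini-batches works,
# so the same function can wrap an eager or a lazy pipeline.
dummy_loader = (list(range(i, i + 8)) for i in range(0, 80, 8))
step_times = measure_step_times(dummy_loader)
avg = sum(step_times) / len(step_times)
```

Comparing `avg` between an eager and a lazy CutSet pipeline, on an otherwise idle machine, would test the hypothesis directly.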
And yes, for "offline shuffling" we shouldn't expect any slowdown.
Perhaps we could change the data-prep scripts to shuffle the data offline? |
Ok, if that's so, I will make the offline shuffling work for now.
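The offline-shuffling step can be sketched with a toy JSONL manifest and the stdlib (the real cut manifests and any lhotse helpers are assumed, not shown); shuffling once with a fixed seed keeps the resulting data order reproducible across runs:

```python
import json
import random

def shuffle_manifest_lines(in_lines, seed=42):
    """Shuffle manifest entries once, offline, with a fixed seed so
    the training data order is reproducible."""
    lines = list(in_lines)
    random.Random(seed).shuffle(lines)
    return lines

# Toy manifest: one JSON object per line, as in a JSONL cut manifest.
entries = [json.dumps({"id": f"cut-{i}"}) for i in range(5)]
shuffled = shuffle_manifest_lines(entries)
```

Writing `shuffled` back to disk during data preparation means no shuffling (and no eager materialization) is needed at training time.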
We only have options for 100 and 960 right now, so no need to do all possible amounts.
Could you please fix the code style issues?
Yes, I will do that. |
egs/librispeech/ASR/prepare.sh
if [ ! -e data/fbank/.librispeech-validated.done ]; then
  log "Validating data/fbank for LibriSpeech"
  parts=(
    train-clean-100
    train-clean-360
    train-other-500
    train-all-shuf
Since train-all-shuf contains all the other 3 train sets, perhaps it is sufficient to only validate this.
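The suggested simplification might look like the following in prepare.sh. This is a hedged sketch: the real script calls a validation command per part, so a stand-in `echo` is used here to keep the fragment self-contained:

```shell
#!/usr/bin/env bash
# Sketch: only train-all-shuf is checked, since it is a superset of
# the three train-* parts; validating it alone covers their contents.
parts=(
  train-all-shuf
)
for part in "${parts[@]}"; do
  echo "Validating data/fbank for ${part}"
done
```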
Not sure why it does not pass the automatic checks. All checks pass locally.
Are you using the same version of black as listed in
It works!
Could you also propagate the changes to the other folders for librispeech (conformer_mmi and streaming_conformer_ctc)?
Is this change going to be merged?
Sorry. Missed this PR.
We noticed that when training with the librispeech tdnn_lstm_ctc recipe, there is an oscillation in the training loss curve:
With @desh2608 's help and suggestion (he had this issue in the SPGI recipe), we tried two ways to deal with it.
However, I noticed a new issue with the shuffling. The running time becomes 1.5x slower, even when using the same GPUs, if you compare the two tensorboards above.
It may be due to the loss of the benefits of reading the CutSet lazily, but this may not be the case. See below.
The result is in:
The running time turns out the same as the first method.
What confuses me more is that, in my previous experiment, shuffling the cuts offline did not make training slower.
However, I can no longer replicate that experiment.
I am putting this issue here for more discussion, in case you think further investigation is worthwhile. Thanks.