Description
I am facing the following error when training with multiple GPUs (on the same node). I am not sure if this is icefall related, but I thought maybe someone has seen it before? (I also tried running with CUDA_LAUNCH_BLOCKING=1 but got the same error message.)
# Running on r7n01
# Started at Fri Mar 11 13:48:01 EST 2022
# python conformer_ctc/train.py --world-size 4
free gpu: 0 1 2 3
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x2aab217dc2f2 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x2aab217d967b in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x2aab2156d1f9 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x2aab217c43a4 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2f9 (0x2aaaad8aecc9 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x26a (0x2aaaad8a3c8a in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x2aaaad8caf22 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x2aaaad207e76 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xa2121f (0x2aaaad8ce21f in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x369f80 (0x2aaaad216f80 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x36b1ee (0x2aaaad2181ee in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x10fd35 (0x555555663d35 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #12: <unknown function> + 0x1aa047 (0x5555556fe047 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #13: <unknown function> + 0x110882 (0x555555664882 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #14: <unknown function> + 0x1102a9 (0x5555556642a9 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #15: <unknown function> + 0x110293 (0x555555664293 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #16: <unknown function> + 0x1130b8 (0x5555556670b8 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #17: <unknown function> + 0x1106ff (0x5555556646ff in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #18: <unknown function> + 0x1fba33 (0x55555574fa33 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x2685 (0x55555572c0d5 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #20: _PyEval_EvalCodeWithName + 0x260 (0x5555557201f0 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #21: _PyFunction_Vectorcall + 0x534 (0x555555721754 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x4bf (0x555555729f0f in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #23: _PyFunction_Vectorcall + 0x1b7 (0x5555557213d7 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x71a (0x55555572a16a in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #25: _PyEval_EvalCodeWithName + 0x260 (0x5555557201f0 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #26: _PyFunction_Vectorcall + 0x594 (0x5555557217b4 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x1517 (0x55555572af67 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #28: _PyEval_EvalCodeWithName + 0x260 (0x5555557201f0 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #29: PyEval_EvalCode + 0x23 (0x555555721aa3 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #30: <unknown function> + 0x241382 (0x555555795382 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #31: <unknown function> + 0x252202 (0x5555557a6202 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #32: PyRun_StringFlags + 0x7a (0x5555557a8e4a in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #33: PyRun_SimpleStringFlags + 0x3c (0x5555557a8eac in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #34: Py_RunMain + 0x15b (0x5555557a981b in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #35: Py_BytesMain + 0x39 (0x5555557a9c69 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #36: __libc_start_main + 0xf5 (0x2aaaab616445 in /lib64/libc.so.6)
frame #37: <unknown function> + 0x1f7427 (0x55555574b427 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
Traceback (most recent call last):
File "conformer_ctc/train.py", line 787, in <module>
main()
File "conformer_ctc/train.py", line 775, in main
mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 3 terminated with the following error:
Traceback (most recent call last):
File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/exp/draj/mini_scale_2022/icefall/egs/spgispeech/ASR/conformer_ctc/train.py", line 701, in run
train_one_epoch(
File "/exp/draj/mini_scale_2022/icefall/egs/spgispeech/ASR/conformer_ctc/train.py", line 527, in train_one_epoch
loss.backward()
File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
Variable._execution_engine.run_backward(
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
When I train on single GPU, it seems to be working fine:
# Running on r7n07
# Started at Fri Mar 11 13:36:00 EST 2022
# python conformer_ctc/train.py --world-size 1
free gpu: 0
2022-03-11 13:36:03,704 INFO [train.py:589] Training started
2022-03-11 13:36:03,705 INFO [train.py:590] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 100, 'reset_interval': 500, 'valid_interval': 25000, 'feature_dim': 80, 'subsampling_factor': 4, 'use_feat_batchnorm': True, 'attention_dim': 512, 'nhead': 8, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'weight_decay': 1e-06, 'warm_step': 80000, 'env_info': {'k2-version': '1.13', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '5ee082ea55f50e8bd42203ba266945ea5a236ab8', 'k2-git-date': 'Sat Feb 26 20:00:48 2022', 'lhotse-version': '1.0.0.dev+git.e6e73e4.dirty', 'torch-cuda-available': True, 'torch-cuda-version': '11.1', 'python-version': '3.8', 'icefall-git-branch': 'spgi', 'icefall-git-sha1': '0c27ba4-dirty', 'icefall-git-date': 'Tue Mar 8 15:01:58 2022', 'icefall-path': '/exp/draj/mini_scale_2022/icefall', 'k2-path': '/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/k2/__init__.py', 'lhotse-path': '/exp/draj/mini_scale_2022/lhotse/lhotse/__init__.py', 'hostname': 'r7n07', 'IP address': '10.1.7.7'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 20, 'start_epoch': 0, 'exp_dir': PosixPath('conformer_ctc/exp'), 'lang_dir': PosixPath('data/lang_bpe_5000'), 'att_rate': 0.8, 'num_decoder_layers': 6, 'lr_factor': 5.0, 'seed': 42, 'manifest_dir': PosixPath('data/manifests'), 'enable_musan': True, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'max_duration': 150.0, 'num_buckets': 30, 'on_the_fly_feats': False, 'shuffle': True, 'num_workers': 8, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80}
2022-03-11 13:36:03,859 INFO [lexicon.py:176] Loading pre-compiled data/lang_bpe_5000/Linv.pt
2022-03-11 13:36:04,019 INFO [train.py:638] About to create model
2022-03-11 13:36:08,869 INFO [asr_datamodule.py:295] About to get SPGISpeech dev cuts
2022-03-11 13:36:08,874 INFO [asr_datamodule.py:243] About to create dev dataset
2022-03-11 13:36:09,048 INFO [asr_datamodule.py:258] About to create dev dataloader
2022-03-11 13:36:09,049 INFO [train.py:735] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-03-11 13:36:14,049 INFO [train.py:697] epoch 0, learning rate 5.8593749999999995e-08
2022-03-11 13:36:15,186 INFO [train.py:532] Epoch 0, batch 0, loss[ctc_loss=7.717, att_loss=1.04, loss=2.376, over 3593.00 frames.], tot_loss[ctc_loss=7.717, att_loss=1.04, loss=2.376, over 3593.00 frames.], batch size: 13
desh2608 commented on Mar 11, 2022
The error went away on reducing --max-duration in asr_datamodule.py to 100s, so it seems it was a weirdly thrown OOM issue.
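A minimal sketch of such a reduced-batch run (assuming --max-duration is also accepted as a command-line flag by train.py, rather than only edited in asr_datamodule.py):
python conformer_ctc/train.py --world-size 4 --max-duration 100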
danpovey commented on Mar 12, 2022
Hm. It might be worthwhile trying to debug that a bit, e.g. see if you can do
export K2_SYNC_KERNELS=1
export CUDA_LAUNCH_BLOCKING=1
and possibly the error might show up earlier.
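For a single run, the variables can also be set inline; a sketch using the command from the issue (the environment is inherited by the spawned worker processes):
K2_SYNC_KERNELS=1 CUDA_LAUNCH_BLOCKING=1 python conformer_ctc/train.py --world-size 4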
desh2608 commented on Mar 23, 2022
I get the same error even after adding export K2_SYNC_KERNELS=1 and export CUDA_LAUNCH_BLOCKING=1. I have k2 compiled in debug mode. Is there some flag I can change to print more information?
csukuangfj commented on Mar 24, 2022
export K2_DISABLE_CHECKS=0 can enable extra checks. You can use the steps in #142 (comment) to debug the code with gdb.
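A rough sketch of such a gdb session on a single-process run (command names only; see #142 for the exact steps):
gdb --args python conformer_ctc/train.py --world-size 1
(gdb) catch throw
(gdb) run
# when the C++ exception is thrown, inspect the native stack:
(gdb) backtrace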
ahazned commented on Apr 13, 2022
Hi, any updates on this issue?
I also get the same error on both single-GPU and multi-GPU setups unless I decrease "--max-duration" to 50.
I've also tried K2_SYNC_KERNELS=1 and CUDA_LAUNCH_BLOCKING=1, but the problem continues.
danpovey commented on Apr 13, 2022
How up-to-date is your code? We haven't seen this type of error for a while on our end.
ahazned commented on Apr 13, 2022
Hi Dan,
I cloned Icefall yesterday; my branch is up to date with 'origin/master', and the k2 details are below. By the way, I'm trying egs/librispeech/ASR/pruned_transducer_stateless2/train.py on LibriSpeech 100 hours.
Here is what I got:
danpovey commented on Apr 13, 2022
ahazned commented on Apr 13, 2022
Thanks. I tried, but unfortunately it doesn't help.
danpovey commented on Apr 13, 2022
It's supposed to make it print a more detailed error message, not fix the issue.
danpovey commented on Apr 13, 2022
Anyway I think a version of k2 from March 14th is not recent enough to run the pruned_transducer_stateless2 recipe.
You may have to compile k2 from scratch; or use a more recent version if you can find one.
csukuangfj commented on Apr 13, 2022
@ahazned
Are you able to run the unit tests of k2? You can follow https://k2-fsa.github.io/k2/installation/for_developers.html to run the tests.
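For reference, the linked page builds k2 from source and then runs the test suite with ctest, roughly along these lines (a sketch; see the page for the exact, current steps):
git clone https://github.com/k2-fsa/k2.git
cd k2
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Debug ..
make -j
# point Python at the freshly built k2 before running the tests (paths as in the docs)
export PYTHONPATH=$PWD/../k2/python:$PWD/lib:$PYTHONPATH
ctest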
desh2608 commented on Apr 14, 2022
@csukuangfj I have the most recent versions of k2 and icefall (all tests are passing), but still get this error for larger batch sizes (>100s when training with 4 GPUs with 12G mem each). I am trying to run a pruned_transducer_stateless2 model on SPGISpeech.
danpovey commented on Apr 15, 2022
@desh2608 see if you can run the training inside cuda-gdb (but I'm not sure whether cuda-gdb is able to handle multiple training processes, and also whether it will be easy for you to install). If the problem can be reproduced with 1 job that might make it easier.
Also
export K2_SYNC_KERNELS=1
export K2_DISABLE_DEBUG=0
export CUDA_LAUNCH_BLOCKING=1
may help to make the problem show up earlier.
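A sketch of reproducing with a single process under cuda-gdb, combining it with the variables above (cuda-gdb accepts the usual gdb commands; the conformer_ctc command from the issue is used here as a stand-in):
export K2_SYNC_KERNELS=1
export K2_DISABLE_DEBUG=0
export CUDA_LAUNCH_BLOCKING=1
cuda-gdb --args python conformer_ctc/train.py --world-size 1
(cuda-gdb) run
# once the illegal access is hit, inspect where it happened:
(cuda-gdb) backtrace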
ahazned commented on Apr 15, 2022
I successfully ran "pruned_transducer_stateless2/train.py" with "--max-duration=300" when using a newer k2 (1.14, git date: Wed Apr 13 00:46:49 2022). I use two GPUs with 24 GB memory each.
One interesting thing, though: I get different WERs on "egs/yesno/ASR/tdnn/train.py" with different k2/PyTorch/CUDA combinations. Not sure if this is expected.