Illegal memory error when training with multi-GPU #247

Open

Description

desh2608 (Collaborator, Author)

I am facing the following error when training with multiple GPUs (on the same node). I am not sure whether this is icefall-related, but I thought maybe someone has seen it before. (I also tried running with CUDA_LAUNCH_BLOCKING=1 but got the same error message.)

# Running on r7n01
# Started at Fri Mar 11 13:48:01 EST 2022
# python conformer_ctc/train.py --world-size 4 
free gpu: 0 1 2 3

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x2aab217dc2f2 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x2aab217d967b in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x2aab2156d1f9 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x2aab217c43a4 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2f9 (0x2aaaad8aecc9 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x26a (0x2aaaad8a3c8a in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x2aaaad8caf22 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x2aaaad207e76 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xa2121f (0x2aaaad8ce21f in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x369f80 (0x2aaaad216f80 in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x36b1ee (0x2aaaad2181ee in /home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x10fd35 (0x555555663d35 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #12: <unknown function> + 0x1aa047 (0x5555556fe047 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #13: <unknown function> + 0x110882 (0x555555664882 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #14: <unknown function> + 0x1102a9 (0x5555556642a9 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #15: <unknown function> + 0x110293 (0x555555664293 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #16: <unknown function> + 0x1130b8 (0x5555556670b8 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #17: <unknown function> + 0x1106ff (0x5555556646ff in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #18: <unknown function> + 0x1fba33 (0x55555574fa33 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #19: _PyEval_EvalFrameDefault + 0x2685 (0x55555572c0d5 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #20: _PyEval_EvalCodeWithName + 0x260 (0x5555557201f0 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #21: _PyFunction_Vectorcall + 0x534 (0x555555721754 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #22: _PyEval_EvalFrameDefault + 0x4bf (0x555555729f0f in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #23: _PyFunction_Vectorcall + 0x1b7 (0x5555557213d7 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x71a (0x55555572a16a in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #25: _PyEval_EvalCodeWithName + 0x260 (0x5555557201f0 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #26: _PyFunction_Vectorcall + 0x594 (0x5555557217b4 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x1517 (0x55555572af67 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #28: _PyEval_EvalCodeWithName + 0x260 (0x5555557201f0 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #29: PyEval_EvalCode + 0x23 (0x555555721aa3 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #30: <unknown function> + 0x241382 (0x555555795382 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #31: <unknown function> + 0x252202 (0x5555557a6202 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #32: PyRun_StringFlags + 0x7a (0x5555557a8e4a in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #33: PyRun_SimpleStringFlags + 0x3c (0x5555557a8eac in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #34: Py_RunMain + 0x15b (0x5555557a981b in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #35: Py_BytesMain + 0x39 (0x5555557a9c69 in /home/hltcoe/draj/.conda/envs/scale/bin/python)
frame #36: __libc_start_main + 0xf5 (0x2aaaab616445 in /lib64/libc.so.6)
frame #37: <unknown function> + 0x1f7427 (0x55555574b427 in /home/hltcoe/draj/.conda/envs/scale/bin/python)

Traceback (most recent call last):
  File "conformer_ctc/train.py", line 787, in <module>
    main()
  File "conformer_ctc/train.py", line 775, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/exp/draj/mini_scale_2022/icefall/egs/spgispeech/ASR/conformer_ctc/train.py", line 701, in run
    train_one_epoch(
  File "/exp/draj/mini_scale_2022/icefall/egs/spgispeech/ASR/conformer_ctc/train.py", line 527, in train_one_epoch
    loss.backward()
  File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

When I train on a single GPU, it seems to work fine:

# Running on r7n07
# Started at Fri Mar 11 13:36:00 EST 2022
# python conformer_ctc/train.py --world-size 1 
free gpu: 0

2022-03-11 13:36:03,704 INFO [train.py:589] Training started
2022-03-11 13:36:03,705 INFO [train.py:590] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 100, 'reset_interval': 500, 'valid_interval': 25000, 'feature_dim': 80, 'subsampling_factor': 4, 'use_feat_batchnorm': True, 'attention_dim': 512, 'nhead': 8, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'weight_decay': 1e-06, 'warm_step': 80000, 'env_info': {'k2-version': '1.13', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '5ee082ea55f50e8bd42203ba266945ea5a236ab8', 'k2-git-date': 'Sat Feb 26 20:00:48 2022', 'lhotse-version': '1.0.0.dev+git.e6e73e4.dirty', 'torch-cuda-available': True, 'torch-cuda-version': '11.1', 'python-version': '3.8', 'icefall-git-branch': 'spgi', 'icefall-git-sha1': '0c27ba4-dirty', 'icefall-git-date': 'Tue Mar 8 15:01:58 2022', 'icefall-path': '/exp/draj/mini_scale_2022/icefall', 'k2-path': '/home/hltcoe/draj/.conda/envs/scale/lib/python3.8/site-packages/k2/__init__.py', 'lhotse-path': '/exp/draj/mini_scale_2022/lhotse/lhotse/__init__.py', 'hostname': 'r7n07', 'IP address': '10.1.7.7'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 20, 'start_epoch': 0, 'exp_dir': PosixPath('conformer_ctc/exp'), 'lang_dir': PosixPath('data/lang_bpe_5000'), 'att_rate': 0.8, 'num_decoder_layers': 6, 'lr_factor': 5.0, 'seed': 42, 'manifest_dir': PosixPath('data/manifests'), 'enable_musan': True, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'max_duration': 150.0, 'num_buckets': 30, 'on_the_fly_feats': False, 'shuffle': True, 'num_workers': 8, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80}
2022-03-11 13:36:03,859 INFO [lexicon.py:176] Loading pre-compiled data/lang_bpe_5000/Linv.pt
2022-03-11 13:36:04,019 INFO [train.py:638] About to create model
2022-03-11 13:36:08,869 INFO [asr_datamodule.py:295] About to get SPGISpeech dev cuts
2022-03-11 13:36:08,874 INFO [asr_datamodule.py:243] About to create dev dataset
2022-03-11 13:36:09,048 INFO [asr_datamodule.py:258] About to create dev dataloader
2022-03-11 13:36:09,049 INFO [train.py:735] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-03-11 13:36:14,049 INFO [train.py:697] epoch 0, learning rate 5.8593749999999995e-08
2022-03-11 13:36:15,186 INFO [train.py:532] Epoch 0, batch 0, loss[ctc_loss=7.717, att_loss=1.04, loss=2.376, over 3593.00 frames.], tot_loss[ctc_loss=7.717, att_loss=1.04, loss=2.376, over 3593.00 frames.], batch size: 13

Activity

desh2608 (Collaborator, Author) commented on Mar 11, 2022

The error went away after reducing --max-duration in asr_datamodule.py to 100s, so it seems it was an OOM error surfacing in an unusual way.
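For reference, a minimal sketch of the lower-duration run, assuming the conformer_ctc recipe exposes --max-duration as a command-line flag (as the pruned_transducer_stateless2 recipe later in this thread does); otherwise the default in asr_datamodule.py has to be edited directly:

# same multi-GPU run as before, but capping each batch at 100 s of audio
python conformer_ctc/train.py --world-size 4 --max-duration 100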

danpovey (Collaborator) commented on Mar 12, 2022

Hm. It might be worthwhile to debug that a bit, e.g. see if you can do
export K2_SYNC_KERNELS=1
export CUDA_LAUNCH_BLOCKING=1
and the error might possibly show up earlier.
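For example, a minimal sketch of such a debug run (assuming a bash-like shell), with both variables set in the same shell that launches training so that kernel launches are serialized and the failure is reported closer to the kernel that caused it:

# serialize k2 kernels and CUDA launches so the error surfaces near its source
export K2_SYNC_KERNELS=1
export CUDA_LAUNCH_BLOCKING=1
python conformer_ctc/train.py --world-size 4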

desh2608 (Collaborator, Author) commented on Mar 23, 2022

I get the same error even after adding export K2_SYNC_KERNELS=1 and export CUDA_LAUNCH_BLOCKING=1. I have k2 compiled in debug mode. Is there some flag I can set to print more information?

csukuangfj (Collaborator) commented on Mar 24, 2022

Setting export K2_DISABLE_CHECKS=0 enables extra checks.

You can follow the steps in #142 (comment) to debug the code with gdb.
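A rough sketch of that gdb workflow (the authoritative steps are in the linked #142 comment; this assumes a single-process run so gdb can follow it, and that gdb is available on the node):

export K2_DISABLE_CHECKS=0          # enable k2's extra checks
gdb --args python3 conformer_ctc/train.py --world-size 1
(gdb) catch throw                   # break when the C++ exception is thrown
(gdb) run
(gdb) backtrace                     # see where the illegal access originated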

ahazned (Contributor) commented on Apr 13, 2022

Hi, any updates on this issue?

I also get the same error on both single-GPU and multi-GPU setups unless I decrease --max-duration to 50.

I've also tried K2_SYNC_KERNELS=1 and CUDA_LAUNCH_BLOCKING=1, but the problem persists.

danpovey (Collaborator) commented on Apr 13, 2022

How up-to-date is your code? We haven't seen this type of error for a while on our end.

ahazned (Contributor) commented on Apr 13, 2022

Hi Dan,

I cloned icefall yesterday; my branch is up to date with 'origin/master', and the k2 details are below. By the way, I'm running egs/librispeech/ASR/pruned_transducer_stateless2/train.py on LibriSpeech 100 hours.

/tmp/icefall$ git status
On branch master
Your branch is up to date with 'origin/master'.

python3 -m k2.version
Collecting environment information...

k2 version: 1.14
Build type: Release
Git SHA1: 6833270cb228aba7bf9681fccd41e2b52f7d984c
Git date: Wed Mar 16 03:16:05 2022
Cuda used to build k2: 11.1
cuDNN used to build k2: 8.0.4
Python version used to build k2: 3.8
OS used to build k2: Ubuntu 18.04.6 LTS
CMake version: 3.18.4
GCC version: 7.5.0
CMAKE_CUDA_FLAGS: --expt-extended-lambda -gencode arch=compute_35,code=sm_35 --expt-extended-lambda -gencode arch=compute_50,code=sm_50 --expt-extended-lambda -gencode arch=compute_60,code=sm_60 --expt-extended-lambda -gencode arch=compute_61,code=sm_61 --expt-extended-lambda -gencode arch=compute_70,code=sm_70 --expt-extended-lambda -gencode arch=compute_75,code=sm_75 --expt-extended-lambda -gencode arch=compute_80,code=sm_80 --expt-extended-lambda -gencode arch=compute_86,code=sm_86 -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall --compiler-options -Wno-unknown-pragmas --compiler-options -Wno-strict-overflow
CMAKE_CXX_FLAGS: -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-strict-overflow
PyTorch version used to build k2: 1.8.1
PyTorch is using Cuda: 11.1
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False

Here is what I got:

python3 pruned_transducer_stateless2/train.py --exp-dir=pruned_transducer_stateless2/exp_100h_ws1 --world-size 2 --num-epochs 40 --full-libri 0 --max-duration 300

/tmp/miniconda3/envs/k2/lib/python3.8/site-packages/lhotse/dataset/sampling/bucketing.py:96: UserWarning: Lazy CutSet detected in BucketingSampler: we will read it into memory anyway. Please use lhotse.dataset.DynamicBucketingSampler instead.
warnings.warn(
/tmp/miniconda3/envs/k2/lib/python3.8/site-packages/lhotse/dataset/sampling/bucketing.py:96: UserWarning: Lazy CutSet detected in BucketingSampler: we will read it into memory anyway. Please use lhotse.dataset.DynamicBucketingSampler instead.
warnings.warn(
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1616554793803/work/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f0c4b9b82f2 in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f0c4b9b567b in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f0c4bc11219 in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f0c4b9a03a4 in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0x6e0e5a (0x7f0ca2916e5a in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x6e0ef1 (0x7f0ca2916ef1 in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x1a974a (0x5568edb6a74a in /tmp/miniconda3/envs/k2/bin/python3)
frame #7: + 0x10f660 (0x5568edad0660 in /tmp/miniconda3/envs/k2/bin/python3)
frame #8: + 0x10f660 (0x5568edad0660 in /tmp/miniconda3/envs/k2/bin/python3)
frame #9: + 0x10faf5 (0x5568edad0af5 in /tmp/miniconda3/envs/k2/bin/python3)
frame #10: + 0x1a9727 (0x5568edb6a727 in /tmp/miniconda3/envs/k2/bin/python3)
frame #11: + 0x110632 (0x5568edad1632 in /tmp/miniconda3/envs/k2/bin/python3)
frame #12: + 0x110059 (0x5568edad1059 in /tmp/miniconda3/envs/k2/bin/python3)
frame #13: + 0x110043 (0x5568edad1043 in /tmp/miniconda3/envs/k2/bin/python3)
frame #14: + 0x112f68 (0x5568edad3f68 in /tmp/miniconda3/envs/k2/bin/python3)
frame #15: + 0x1104af (0x5568edad14af in /tmp/miniconda3/envs/k2/bin/python3)
frame #16: + 0x1fe1f3 (0x5568edbbf1f3 in /tmp/miniconda3/envs/k2/bin/python3)
frame #17: _PyEval_EvalFrameDefault + 0x2681 (0x5568edb9a021 in /tmp/miniconda3/envs/k2/bin/python3)
frame #18: _PyEval_EvalCodeWithName + 0x260 (0x5568edb8d600 in /tmp/miniconda3/envs/k2/bin/python3)
frame #19: _PyFunction_Vectorcall + 0x534 (0x5568edb8eb64 in /tmp/miniconda3/envs/k2/bin/python3)
frame #20: _PyEval_EvalFrameDefault + 0x4c0 (0x5568edb97e60 in /tmp/miniconda3/envs/k2/bin/python3)
frame #21: _PyFunction_Vectorcall + 0x1b7 (0x5568edb8e7e7 in /tmp/miniconda3/envs/k2/bin/python3)
frame #22: _PyEval_EvalFrameDefault + 0x71b (0x5568edb980bb in /tmp/miniconda3/envs/k2/bin/python3)
frame #23: _PyEval_EvalCodeWithName + 0x260 (0x5568edb8d600 in /tmp/miniconda3/envs/k2/bin/python3)
frame #24: _PyFunction_Vectorcall + 0x594 (0x5568edb8ebc4 in /tmp/miniconda3/envs/k2/bin/python3)
frame #25: _PyEval_EvalFrameDefault + 0x1510 (0x5568edb98eb0 in /tmp/miniconda3/envs/k2/bin/python3)
frame #26: _PyEval_EvalCodeWithName + 0x260 (0x5568edb8d600 in /tmp/miniconda3/envs/k2/bin/python3)
frame #27: PyEval_EvalCode + 0x23 (0x5568edb8eeb3 in /tmp/miniconda3/envs/k2/bin/python3)
frame #28: + 0x242622 (0x5568edc03622 in /tmp/miniconda3/envs/k2/bin/python3)
frame #29: + 0x2531d2 (0x5568edc141d2 in /tmp/miniconda3/envs/k2/bin/python3)
frame #30: PyRun_StringFlags + 0x7a (0x5568edc16e0a in /tmp/miniconda3/envs/k2/bin/python3)
frame #31: PyRun_SimpleStringFlags + 0x3c (0x5568edc16e6c in /tmp/miniconda3/envs/k2/bin/python3)
frame #32: Py_RunMain + 0x15b (0x5568edc177db in /tmp/miniconda3/envs/k2/bin/python3)
frame #33: Py_BytesMain + 0x39 (0x5568edc17c29 in /tmp/miniconda3/envs/k2/bin/python3)
frame #34: __libc_start_main + 0xe7 (0x7f0cd469fc87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #35: + 0x1f9ad7 (0x5568edbbaad7 in /tmp/miniconda3/envs/k2/bin/python3)

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1616554793803/work/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f27956ae2f2 in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f27956ab67b in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f2795907219 in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f27956963a4 in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: + 0x6e0e5a (0x7f27ec60ce5a in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x6e0ef1 (0x7f27ec60cef1 in /tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x1a974a (0x55953ec0f74a in /tmp/miniconda3/envs/k2/bin/python3)
frame #7: + 0x10f660 (0x55953eb75660 in /tmp/miniconda3/envs/k2/bin/python3)
frame #8: + 0x10f660 (0x55953eb75660 in /tmp/miniconda3/envs/k2/bin/python3)
frame #9: + 0x10faf5 (0x55953eb75af5 in /tmp/miniconda3/envs/k2/bin/python3)
frame #10: + 0x1a9727 (0x55953ec0f727 in /tmp/miniconda3/envs/k2/bin/python3)
frame #11: + 0x110632 (0x55953eb76632 in /tmp/miniconda3/envs/k2/bin/python3)
frame #12: + 0x110059 (0x55953eb76059 in /tmp/miniconda3/envs/k2/bin/python3)
frame #13: + 0x110043 (0x55953eb76043 in /tmp/miniconda3/envs/k2/bin/python3)
frame #14: + 0x112f68 (0x55953eb78f68 in /tmp/miniconda3/envs/k2/bin/python3)
frame #15: + 0x1104af (0x55953eb764af in /tmp/miniconda3/envs/k2/bin/python3)
frame #16: + 0x1fe1f3 (0x55953ec641f3 in /tmp/miniconda3/envs/k2/bin/python3)
frame #17: _PyEval_EvalFrameDefault + 0x2681 (0x55953ec3f021 in /tmp/miniconda3/envs/k2/bin/python3)
frame #18: _PyEval_EvalCodeWithName + 0x260 (0x55953ec32600 in /tmp/miniconda3/envs/k2/bin/python3)
frame #19: _PyFunction_Vectorcall + 0x534 (0x55953ec33b64 in /tmp/miniconda3/envs/k2/bin/python3)
frame #20: _PyEval_EvalFrameDefault + 0x4c0 (0x55953ec3ce60 in /tmp/miniconda3/envs/k2/bin/python3)
frame #21: _PyFunction_Vectorcall + 0x1b7 (0x55953ec337e7 in /tmp/miniconda3/envs/k2/bin/python3)
frame #22: _PyEval_EvalFrameDefault + 0x71b (0x55953ec3d0bb in /tmp/miniconda3/envs/k2/bin/python3)
frame #23: _PyEval_EvalCodeWithName + 0x260 (0x55953ec32600 in /tmp/miniconda3/envs/k2/bin/python3)
frame #24: _PyFunction_Vectorcall + 0x594 (0x55953ec33bc4 in /tmp/miniconda3/envs/k2/bin/python3)
frame #25: _PyEval_EvalFrameDefault + 0x1510 (0x55953ec3deb0 in /tmp/miniconda3/envs/k2/bin/python3)
frame #26: _PyEval_EvalCodeWithName + 0x260 (0x55953ec32600 in /tmp/miniconda3/envs/k2/bin/python3)
frame #27: PyEval_EvalCode + 0x23 (0x55953ec33eb3 in /tmp/miniconda3/envs/k2/bin/python3)
frame #28: + 0x242622 (0x55953eca8622 in /tmp/miniconda3/envs/k2/bin/python3)
frame #29: + 0x2531d2 (0x55953ecb91d2 in /tmp/miniconda3/envs/k2/bin/python3)
frame #30: PyRun_StringFlags + 0x7a (0x55953ecbbe0a in /tmp/miniconda3/envs/k2/bin/python3)
frame #31: PyRun_SimpleStringFlags + 0x3c (0x55953ecbbe6c in /tmp/miniconda3/envs/k2/bin/python3)
frame #32: Py_RunMain + 0x15b (0x55953ecbc7db in /tmp/miniconda3/envs/k2/bin/python3)
frame #33: Py_BytesMain + 0x39 (0x55953ecbcc29 in /tmp/miniconda3/envs/k2/bin/python3)
frame #34: __libc_start_main + 0xe7 (0x7f281e395c87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #35: + 0x1f9ad7 (0x55953ec5fad7 in /tmp/miniconda3/envs/k2/bin/python3)

Traceback (most recent call last):
  File "pruned_transducer_stateless2/train.py", line 997, in <module>
    main()
  File "pruned_transducer_stateless2/train.py", line 988, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/tmp/icefall/egs/librispeech/ASR/pruned_transducer_stateless2/train.py", line 878, in run
    scan_pessimistic_batches_for_oom(
  File "/tmp/icefall/egs/librispeech/ASR/pruned_transducer_stateless2/train.py", line 964, in scan_pessimistic_batches_for_oom
    loss.backward()
  File "/tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/tmp/miniconda3/envs/k2/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

danpovey (Collaborator) commented on Apr 13, 2022

ahazned (Contributor) commented on Apr 13, 2022

Thanks. I tried, but unfortunately it doesn't help.

danpovey (Collaborator) commented on Apr 13, 2022

It's supposed to make it print a more detailed error message, not fix the issue.

danpovey (Collaborator) commented on Apr 13, 2022

Anyway I think a version of k2 from March 14th is not recent enough to run the pruned_transducer_stateless2 recipe.
You may have to compile k2 from scratch; or use a more recent version if you can find one.
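A quick way to check which k2 the training job actually picks up before rebuilding (a sketch only; the authoritative build instructions are in the k2 installation docs):

python3 -m k2.version                        # version, git SHA1 and build date, as shown above
python3 -c "import k2; print(k2.__file__)"   # confirm which installation is on the path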

csukuangfj (Collaborator) commented on Apr 13, 2022

@ahazned
Are you able to run the unit tests of k2? You can follow https://k2-fsa.github.io/k2/installation/for_developers.html to run the tests.
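Roughly, the developer workflow from that page looks like the sketch below (the build-from-source steps are paraphrased and may differ for your CUDA/PyTorch combination; consult the linked page for the exact commands):

git clone https://github.com/k2-fsa/k2.git
cd k2
mkdir build_release && cd build_release
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j
export PYTHONPATH=$PWD/../k2/python:$PWD/lib:$PYTHONPATH
ctest    # run the unit tests from the build directory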

desh2608 (Collaborator, Author) commented on Apr 14, 2022

@csukuangfj I have the most recent versions of k2 and icefall (all tests are passing), but I still get this error for larger batch sizes (>100 s when training with 4 GPUs with 12 GB memory each). I am trying to run a pruned_transducer_stateless2 model on SPGISpeech.

danpovey (Collaborator) commented on Apr 15, 2022

@desh2608 see if you can run the training inside cuda-gdb (though I'm not sure whether cuda-gdb can handle multiple training processes, or whether it will be easy for you to install). If the problem can be reproduced with a single job, that might make it easier.
Also
export K2_SYNC_KERNELS=1
export K2_DISABLE_DEBUG=0
export CUDA_LAUNCH_BLOCKING=1
may help to surface the problem earlier.
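A rough sketch of such a single-process run under cuda-gdb (assuming cuda-gdb is installed and the problem reproduces with one process; the script path is the SPGISpeech recipe mentioned above):

export K2_SYNC_KERNELS=1
export K2_DISABLE_DEBUG=0
export CUDA_LAUNCH_BLOCKING=1
cuda-gdb --args python3 pruned_transducer_stateless2/train.py --world-size 1
(cuda-gdb) run
(cuda-gdb) backtrace    # after the illegal access is trapped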

ahazned (Contributor) commented on Apr 15, 2022

I can successfully run pruned_transducer_stateless2/train.py with --max-duration=300 when I use a newer k2 (1.14, Git date: Wed Apr 13 00:46:49 2022). I use two GPUs with 24 GB memory each.

One interesting thing, though: I get different WERs on egs/yesno/ASR/tdnn/train.py with different k2/PyTorch/CUDA combinations. Not sure if this is expected.

k2 version: 1.14 | Git date: Wed Mar 16 03:16:05 2022 | PyTorch version used to build k2: 1.8.1+cu111
%WER 0.42% [1 / 240, 0 ins, 1 del, 0 sub ]

k2 version: 1.14 | Git date: Wed Apr 13 00:46:49 2022 | PyTorch version used to build k2: 1.11.0+cu102
%WER 2.50% [6 / 240, 5 ins, 1 del, 0 sub ]

k2 version: 1.14 | Git date: Wed Apr 13 00:46:49 2022 | PyTorch version used to build k2: 1.8.1+cu102
%WER 3.33% [8 / 240, 7 ins, 1 del, 0 sub ]

k2 version: 1.14 | Git date: Wed Apr 13 00:46:49 2022 | PyTorch version used to build k2: 1.11.0+cu113
%WER 2.50% [6 / 240, 5 ins, 1 del, 0 sub ]
