CUDA error: device-side assert triggered while training Marian MT #14798
Description
Environment info
transformers
version: transformers-4.13.0.dev0- Platform: linux
- Python version: 3.8
- PyTorch version (GPU?): torch==1.11.0a0+b6df043 GPU
- Tensorflow version (GPU?):
- Using GPU in script?:
- Using distributed or parallel set-up in script?: one node multigpu
Who can help
@patrickvonplaten
@sgugger
@LysandreJik
Information
Model I am using (Bert, XLNet ...):
The problem arises when using:
- the official example scripts: (give details below)
- --> NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py
- my own modified scripts: (give details below)
The tasks I am working on is:
- an official GLUE/SQUaD task: (give the name)
- my own task or dataset: (give details below)
Training marianMT on EMEA custom dataset
To reproduce
Steps to reproduce the behavior:
- Clone the latest transformer repo
- /opt/conda/bin/python -m torch.distributed.launch --nnodes 1 --nproc_per_node 4 /data/atc_tenant/NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py --train_file /data/atc_tenant/NMT/smancha5/EMEA.en-es.train.json --model_name_or_path Helsinki-NLP/opus-mt-en-es --do_train --source_lang=en --target_lang=es --output_dir=/data/atc_tenant/NMT/model1/ --per_device_train_batch_size=8 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --cache_dir=/data/atc_tenant/NMT/cache/
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Indexing.cu:698: indexSelectLargeIndex: block: [194,0,0], thread: [30,0,0] Assertion srcIndex < srcSelectDimSize
failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Indexing.cu:698: indexSelectLargeIndex: block: [194,0,0], thread: [31,0,0] Assertion srcIndex < srcSelectDimSize
failed.
Traceback (most recent call last):
File "/data/atc_tenant/NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py", line 621, in
main()
File "/data/atc_tenant/NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py", line 538, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1471, in train
self._total_loss_scalar += tr_loss.item()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
Exception raised from query at /opt/pytorch/pytorch/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x6c (0x7fe03f1d3e1c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x125 (0x7fe042e6d345 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x78 (0x7fe042e704e8 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x158 (0x7fe042e71df8 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xcc9d4 (0x7fe0d47a29d4 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: + 0x9609 (0x7fe0d6295609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fe0d6055293 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Debugging Logs:
print(inputs['labels'].shape) : torch.Size([8, 94])
print(inputs['input_ids'].shape) : torch.Size([8, 70])
print(inputs['decoder_input_ids'].shape) : torch.Size([8, 94])
Expected behavior
Training of model complete