CUDA error: device-side assert triggered while training Marian MT  #14798



Environment info

  • transformers version: transformers-4.13.0.dev0
  • Platform: linux
  • Python version: 3.8
  • PyTorch version (GPU?): torch==1.11.0a0+b6df043 GPU
  • Tensorflow version (GPU?):
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?: one node multigpu

Who can help



Model I am using (Bert, XLNet ...):

The problem arises when using:

  • the official example scripts: (give details below)
  • --> NMT/smancha5/transformers/examples/pytorch/translation/
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)
    Training MarianMT on a custom EMEA dataset

To reproduce

Steps to reproduce the behavior:

  1. Clone the latest transformers repo
  2. /opt/conda/bin/python -m torch.distributed.launch --nnodes 1 --nproc_per_node 4 /data/atc_tenant/NMT/smancha5/transformers/examples/pytorch/translation/ --train_file /data/atc_tenant/NMT/smancha5/EMEA.en-es.train.json --model_name_or_path Helsinki-NLP/opus-mt-en-es --do_train --source_lang=en --target_lang=es --output_dir=/data/atc_tenant/NMT/model1/ --per_device_train_batch_size=8 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --cache_dir=/data/atc_tenant/NMT/cache/

/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ indexSelectLargeIndex: block: [194,0,0], thread: [30,0,0] Assertion srcIndex < srcSelectDimSize failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/ indexSelectLargeIndex: block: [194,0,0], thread: [31,0,0] Assertion srcIndex < srcSelectDimSize failed.

Traceback (most recent call last):
File "/data/atc_tenant/NMT/smancha5/transformers/examples/pytorch/translation/", line 621, in
File "/data/atc_tenant/NMT/smancha5/transformers/examples/pytorch/translation/", line 538, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/.local/lib/python3.8/site-packages/transformers/", line 1471, in train
self._total_loss_scalar += tr_loss.item()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
Exception raised from query at /opt/pytorch/pytorch/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x6c (0x7fe03f1d3e1c in /opt/conda/lib/python3.8/site-packages/torch/lib/
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x125 (0x7fe042e6d345 in /opt/conda/lib/python3.8/site-packages/torch/lib/
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x78 (0x7fe042e704e8 in /opt/conda/lib/python3.8/site-packages/torch/lib/
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x158 (0x7fe042e71df8 in /opt/conda/lib/python3.8/site-packages/torch/lib/
frame #4: + 0xcc9d4 (0x7fe0d47a29d4 in /opt/conda/bin/../lib/
frame #5: + 0x9609 (0x7fe0d6295609 in /usr/lib/x86_64-linux-gnu/
frame #6: clone + 0x43 (0x7fe0d6055293 in /usr/lib/x86_64-linux-gnu/
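
Note on reading this traceback: CUDA kernels run asynchronously, so the Python frame shown (`tr_loss.item()`) is only where the host synchronized with the GPU, not necessarily where the failing index lookup was launched. A minimal sketch for getting a traceback that points at the offending op, assuming the variable is set before anything initializes CUDA (the helper name `blocking_launch_enabled` is hypothetical and only there to make the setting checkable):

```python
import os

# Assumption: this runs at the very top of the training script, before any
# CUDA work starts. It makes kernel launches synchronous, so a device-side
# assert surfaces at the Python line that launched the failing kernel.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def blocking_launch_enabled() -> bool:
    """Hypothetical helper: report whether synchronous launches were requested."""
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"


if __name__ == "__main__":
    print("CUDA_LAUNCH_BLOCKING enabled:", blocking_launch_enabled())
```

Exporting the same variable in the shell before the launch command should have the same effect. Running a few steps on CPU is another option, since an out-of-range index there typically raises an ordinary Python exception instead of a device-side assert.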

Debugging Logs:
print(inputs['labels'].shape) : torch.Size([8, 94])
print(inputs['input_ids'].shape) : torch.Size([8, 70])
print(inputs['decoder_input_ids'].shape) : torch.Size([8, 94])
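
The shapes above look unremarkable, so the values are the next thing to check: `Assertion srcIndex < srcSelectDimSize failed` in an index-select kernel usually means an index is outside the table being indexed, for example a token id that is not smaller than the embedding size. Below is a minimal sketch of such a check. It is not part of the example script; `find_out_of_range_ids` is a hypothetical helper, it works on plain nested lists, and it assumes `-100` is the label padding value that the loss ignores.

```python
from typing import Dict, List, Sequence, Tuple

# Assumption: padded label positions are marked with -100 and skipped by the loss.
IGNORE_INDEX = -100

Hit = Tuple[int, int, int]  # (row, column, offending id)


def find_out_of_range_ids(
    batch: Dict[str, Sequence[Sequence[int]]],
    vocab_size: int,
    label_keys: Sequence[str] = ("labels",),
    ignore_index: int = IGNORE_INDEX,
) -> Dict[str, List[Hit]]:
    """Return every id outside [0, vocab_size), grouped by batch key.

    The ignore index is tolerated only for keys listed in label_keys;
    anywhere else a negative id is reported as out of range.
    """
    bad: Dict[str, List[Hit]] = {}
    for key, rows in batch.items():
        hits: List[Hit] = []
        for r, row in enumerate(rows):
            for c, tok in enumerate(row):
                if key in label_keys and tok == ignore_index:
                    continue
                if not 0 <= tok < vocab_size:
                    hits.append((r, c, tok))
        if hits:
            bad[key] = hits
    return bad


if __name__ == "__main__":
    # Toy data with a made-up vocabulary size of 10, not values from this run.
    toy_batch = {
        "input_ids": [[1, 2, 10]],    # 10 is out of range for vocab_size=10
        "labels": [[3, -100, 4]],     # -100 is accepted for labels
    }
    print(find_out_of_range_ids(toy_batch, vocab_size=10))
```

With real tensors, one way to call it would be `find_out_of_range_ids({k: v.tolist() for k, v in inputs.items()}, model.config.vocab_size)`, assuming `inputs` holds integer tensors and the model config exposes `vocab_size`. Any hit under `input_ids` or `decoder_input_ids` would point at a tokenizer/model vocabulary mismatch or bad ids in the training file.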

Expected behavior

Training completes without errors.


