
CUDA error: device-side assert triggered while training Marian MT  #14798

Closed
@manchandasahil

Description

Environment info

  • transformers version: transformers-4.13.0.dev0
  • Platform: linux
  • Python version: 3.8
  • PyTorch version (GPU?): torch==1.11.0a0+b6df043 GPU
  • Tensorflow version (GPU?):
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: one node multigpu

Who can help

@patrickvonplaten
@sgugger
@LysandreJik

Information

Model I am using (Bert, XLNet ...): MarianMT (Helsinki-NLP/opus-mt-en-es)

The problem arises when using:

  • the official example scripts: NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: training MarianMT on a custom EMEA dataset

To reproduce

Steps to reproduce the behavior:

  1. Clone the latest transformers repo
  2. /opt/conda/bin/python -m torch.distributed.launch --nnodes 1 --nproc_per_node 4 /data/atc_tenant/NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py --train_file /data/atc_tenant/NMT/smancha5/EMEA.en-es.train.json --model_name_or_path Helsinki-NLP/opus-mt-en-es --do_train --source_lang=en --target_lang=es --output_dir=/data/atc_tenant/NMT/model1/ --per_device_train_batch_size=8 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --cache_dir=/data/atc_tenant/NMT/cache/

Running the above command fails with:

/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Indexing.cu:698: indexSelectLargeIndex: block: [194,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/pytorch/pytorch/aten/src/ATen/native/cuda/Indexing.cu:698: indexSelectLargeIndex: block: [194,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

Traceback (most recent call last):
  File "/data/atc_tenant/NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py", line 621, in <module>
    main()
  File "/data/atc_tenant/NMT/smancha5/transformers/examples/pytorch/translation/run_translation.py", line 538, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1471, in train
    self._total_loss_scalar += tr_loss.item()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
Exception raised from query at /opt/pytorch/pytorch/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fe03f1d3e1c in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x125 (0x7fe042e6d345 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x78 (0x7fe042e704e8 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x158 (0x7fe042e71df8 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xcc9d4 (0x7fe0d47a29d4 in /opt/conda/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x9609 (0x7fe0d6295609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fe0d6055293 in /usr/lib/x86_64-linux-gnu/libc.so.6)
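
Because CUDA kernels launch asynchronously, the device-side assert only surfaces at the next synchronization point (here tr_loss.item() in trainer.py), so the traceback above does not name the op that actually failed. A minimal sketch for localizing it, assuming the two lines below can be added at the very top of run_translation.py before anything touches the GPU (equivalently, CUDA_LAUNCH_BLOCKING=1 can be exported in the shell before the launch command):

    # Sketch only: force synchronous CUDA kernel launches so the Python traceback
    # points at the kernel that raised "srcIndex < srcSelectDimSize" instead of a
    # later sync point such as tr_loss.item().
    import os
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the first CUDA call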

Debugging Logs:
print(inputs['labels'].shape) : torch.Size([8, 94])
print(inputs['input_ids'].shape) : torch.Size([8, 70])
print(inputs['decoder_input_ids'].shape) : torch.Size([8, 94])
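
The "srcIndex < srcSelectDimSize" assert from Indexing.cu usually means an index passed to an embedding/index-select lookup is out of range, i.e. some token id is >= the size of the embedding table. A minimal sanity check along those lines, assuming access to the failing batch (the check_batch helper and the example sentence below are illustrative, not part of run_translation.py):

    # Sketch: verify every token id fits inside the model's embedding table,
    # since an out-of-range id is the usual cause of this indexing assert.
    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-en-es"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    vocab_size = model.get_input_embeddings().num_embeddings  # rows in the embedding table

    def check_batch(batch):
        """Print the max id per tensor and whether it is a valid embedding row."""
        for key in ("input_ids", "decoder_input_ids", "labels"):
            if key not in batch:
                continue
            ids = batch[key]
            valid = ids[ids >= 0]  # drop the -100 ignore-index used in padded labels
            print(key, "max id:", int(valid.max()),
                  "within vocab:", bool((valid < vocab_size).all()))

    # Example with a tokenizer-built batch; on the failing training step this
    # would be the `inputs` dict whose shapes are logged above.
    check_batch(tokenizer(["This is a test."], return_tensors="pt"))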

Expected behavior

Training of the model completes without errors.
