
Error occurs when training Transformer-XL with DDP #8494

Closed
@ismymajia

Description

My environment is as follows:

  • transformers version: 3.4.0
  • Platform: Ubuntu 18.04
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.6.0+cu101 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

I am training Transformer-XL on one machine with multiple GPUs using DDP.

My script is as follows:

```shell
python -m torch.distributed.launch --nproc_per_node 4 run_language_modeling.py \
    --output_dir ${model_dir} \
    --tokenizer_name $data_dir/wordpiece-custom.json \
    --config_name $data_dir/$config_file \
    --train_data_files "$data_dir/train*.txt" \
    --eval_data_file $data_dir/valid.txt \
    --block_size=128 \
    --do_train \
    --do_eval \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 6e-4 \
    --weight_decay 0.01 \
    --adam_epsilon 1e-6 \
    --adam_beta1 0.9 \
    --adam_beta2 0.98 \
    --max_steps 500_000 \
    --warmup_steps 24_000 \
    --fp16 \
    --logging_dir ${model_dir}/tensorboard \
    --save_steps 5000 \
    --save_total_limit 20 \
    --seed 108 \
    --max_steps -1 \
    --num_train_epochs 20 \
    --dataloader_num_workers 0 \
    --overwrite_output_dir
```

The following error occurs:

```
[INFO|language_modeling.py:242] 2020-11-11 11:54:46,363 >> Loading features from cached file /opt/ml/input/data/training/kyzhan/huggingface/data/train40G/cached_lm_PreTrainedTokenizerFast_126_train3.txt [took 116.431 s]
Traceback (most recent call last):
    main()
  File "run_hf_train_lm_ti.py", line 338, in main
    trainer.train(model_path=model_path)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 758, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 1056, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 1082, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 511, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_transfo_xl.py", line 1056, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_transfo_xl.py", line 888, in forward
    word_emb = self.word_emb(input_ids)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_transfo_xl.py", line 448, in forward
    emb_flat.index_copy_(0, indices_i, emb_i)
RuntimeError: Expected object of scalar type Float but got scalar type Half for argument #4 'source' in call to _th_index_copy_
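The failure can be reproduced outside the Trainer: `index_copy_` requires the source tensor to have the same dtype as the destination, and under `--fp16` the embedding activations become half precision while the adaptive embedding's flat buffer stays float32. A minimal sketch (the tensor shapes and names here are illustrative, not the actual transformers variables):

```python
import torch

# fp32 destination buffer, analogous to emb_flat in the adaptive embedding
emb_flat = torch.zeros(4, 8, dtype=torch.float32)
# fp16 source, analogous to emb_i when training with --fp16
emb_i = torch.randn(2, 8, dtype=torch.float16)
indices_i = torch.tensor([0, 2])

try:
    # Raises RuntimeError: index_copy_ expects matching dtypes (Float vs Half)
    emb_flat.index_copy_(0, indices_i, emb_i)
except RuntimeError as err:
    print(err)

# Casting the source to the destination's dtype avoids the error
emb_flat.index_copy_(0, indices_i, emb_i.to(emb_flat.dtype))
```

If this is indeed the cause, one workaround is to drop `--fp16` from the launch command, since this fp32 buffer in Transformer-XL's embedding path clashes with half-precision inputs.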

@TevenLeScao
