Error when training Transformer-XL with DDP #8494
Description
My environment is as follows:

- `transformers` version: 3.4.0
- Platform: Ubuntu 18.04
- Python version: 3.6.9
- PyTorch version (GPU?): 1.6.0+cu101 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
I am training Transformer-XL on a single machine with multiple GPUs via DDP. My launch script is as follows:
```shell
python -m torch.distributed.launch --nproc_per_node 4 run_language_modeling.py \
    --output_dir ${model_dir} \
    --tokenizer_name $data_dir/wordpiece-custom.json \
    --config_name $data_dir/$config_file \
    --train_data_files "$data_dir/train*.txt" \
    --eval_data_file $data_dir/valid.txt \
    --block_size=128 \
    --do_train \
    --do_eval \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 6e-4 \
    --weight_decay 0.01 \
    --adam_epsilon 1e-6 \
    --adam_beta1 0.9 \
    --adam_beta2 0.98 \
    --max_steps 500_000 \
    --warmup_steps 24_000 \
    --fp16 \
    --logging_dir ${model_dir}/tensorboard \
    --save_steps 5000 \
    --save_total_limit 20 \
    --seed 108 \
    --max_steps -1 \
    --num_train_epochs 20 \
    --dataloader_num_workers 0 \
    --overwrite_output_dir
```
The following error occurs:
```
[INFO|language_modeling.py:242] 2020-11-11 11:54:46,363 >> Loading features from cached file /opt/ml/input/data/training/kyzhan/huggingface/data/train40G/cached_lm_PreTrainedTokenizerFast_126_train3.txt [took 116.431 s]
    main()
  File "run_hf_train_lm_ti.py", line 338, in main
    trainer.train(model_path=model_path)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 758, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 1056, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 1082, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 511, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_transfo_xl.py", line 1056, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_transfo_xl.py", line 888, in forward
    word_emb = self.word_emb(input_ids)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_transfo_xl.py", line 448, in forward
    emb_flat.index_copy_(0, indices_i, emb_i)
RuntimeError: Expected object of scalar type Float but got scalar type Half for argument #4 'source' in call to _th_index_copy_
```
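The failure can be reproduced outside the model. Below is a minimal sketch (my own illustration, not the `transformers` code itself) of the dtype mismatch the traceback points at: under `--fp16` the embedding chunk `emb_i` comes out as float16, while the destination buffer `emb_flat` built in `AdaptiveEmbedding.forward` is float32, and `index_copy_` requires both to have the same dtype.

```python
import torch

# float32 destination buffer, standing in for emb_flat in AdaptiveEmbedding
emb_flat = torch.zeros(4, 8)
# float16 source, standing in for the projected embedding chunk under fp16
emb_i = torch.randn(2, 8).half()
indices_i = torch.tensor([0, 2])

try:
    # Reproduces the reported RuntimeError: Float destination, Half source
    emb_flat.index_copy_(0, indices_i, emb_i)
except RuntimeError as e:
    print("index_copy_ failed:", e)

# One possible workaround (an assumption on my part, not an official fix):
# cast the source to the destination's dtype before copying.
emb_flat.index_copy_(0, indices_i, emb_i.to(emb_flat.dtype))
print("cast copy succeeded, destination dtype:", emb_flat.dtype)
```

Casting like this only sidesteps the type check; whether it interacts correctly with the rest of the fp16 training path in `Trainer` is a separate question.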