Description
I ran sh run_qwen.sh locally on a GPU machine and got the error below. Could someone help?
conda list |grep trl
trl 0.13.0 pypi_0 pypi
conda list |grep transformers
transformers 4.47.1 pypi_0 pypi
sh run_qwen.sh
********************
It's effective
********************
Applied Liger kernels to Qwen2
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 10.01it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/Liger-Kernel-main/examples/huggingface/training.py", line 81, in <module>
[rank0]: train()
[rank0]: File "/home/Liger-Kernel-main/examples/huggingface/training.py", line 67, in train
[rank0]: trainer = SFTTrainer(
[rank0]: File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
[rank0]: return func(*args, **kwargs)
[rank0]: TypeError: SFTTrainer.__init__() got an unexpected keyword argument 'max_seq_length'
E1218 16:54:26.878000 140467821201216 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 102105) of binary: /home/miniforge3/envs/ligerkernel/bin/python
Traceback (most recent call last):
File "/home/miniforge3/envs/ligerkernel/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
training.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-12-18_16:54:26
host : 23
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 102105)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
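A minimal sketch for narrowing this down, assuming the cause is a mismatch between the keyword arguments the example script passes and the ones the installed trl release accepts. It only uses the standard library's inspect module to ask whether a callable takes a given keyword. The helper accepts_keyword and the class StandInTrainer are hypothetical names made up for illustration; StandInTrainer does not reproduce the real SFTTrainer signature.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    for param in inspect.signature(func).parameters.values():
        # A **kwargs catch-all accepts any keyword name.
        if param.kind is inspect.Parameter.VAR_KEYWORD:
            return True
        # Otherwise the name must match a parameter that can be passed by keyword.
        if param.name == name and param.kind in keyword_kinds:
            return True
    return False


class StandInTrainer:
    """Hypothetical stand-in for illustration; NOT the real trl.SFTTrainer."""

    def __init__(self, model, args=None, train_dataset=None):
        self.model = model
        self.args = args
        self.train_dataset = train_dataset


# Check one keyword the stand-in accepts and the one from the traceback.
print(accepts_keyword(StandInTrainer.__init__, "args"))
print(accepts_keyword(StandInTrainer.__init__, "max_seq_length"))
```

In the ligerkernel environment the same check could be run as accepts_keyword(SFTTrainer.__init__, "max_seq_length") after from trl import SFTTrainer. If it reports False, the sequence-length setting probably has to be supplied some other way in this trl version, possibly through the config object passed as args; that is an assumption to confirm against the trl documentation for 0.13.0, not something the log above establishes.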