Error when running sh run_qwen.sh #487

Open
@CharlesJhonson

Description

I ran sh run_qwen.sh locally on a GPU machine and got the error below (package versions are listed first). Could someone help?

conda list |grep trl
trl                       0.13.0                   pypi_0    pypi
conda list |grep transformers
transformers              4.47.1                   pypi_0    pypi
sh run_qwen.sh
********************
It's effective
********************
Applied Liger kernels to Qwen2
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 10.01it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/Liger-Kernel-main/examples/huggingface/training.py", line 81, in <module>
[rank0]:     train()
[rank0]:   File "/home/Liger-Kernel-main/examples/huggingface/training.py", line 67, in train
[rank0]:     trainer = SFTTrainer(
[rank0]:   File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 165, in wrapped_func
[rank0]:     return func(*args, **kwargs)
[rank0]: TypeError: SFTTrainer.__init__() got an unexpected keyword argument 'max_seq_length'
E1218 16:54:26.878000 140467821201216 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 102105) of binary: /home/miniforge3/envs/ligerkernel/bin/python
Traceback (most recent call last):
  File "/home/miniforge3/envs/ligerkernel/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/miniforge3/envs/ligerkernel/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
training.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-18_16:54:26
  host      : 23
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 102105)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
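
The TypeError above says the installed SFTTrainer constructor does not accept max_seq_length. A minimal sketch of how one could check which keyword arguments a constructor accepts before calling it is below. DummyTrainer is a hypothetical stand-in and not the real trl class, and unsupported_kwargs is an illustrative helper name, not part of any library; to test a real environment, the real class would be passed in instead.

```python
import inspect


def unsupported_kwargs(func, kwargs):
    """Return the names in kwargs that func's signature does not accept."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter means any keyword name is accepted.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return [name for name in kwargs if name not in params]


# Hypothetical stand-in for a trainer whose constructor lacks max_seq_length.
class DummyTrainer:
    def __init__(self, model=None, args=None, train_dataset=None):
        pass


rejected = unsupported_kwargs(DummyTrainer, {"model": None, "max_seq_length": 512})
print(rejected)
```

Any name this reports for the real class would need to be removed from the call in training.py or passed some other way that the installed trl version supports.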
