[BUG] encounter error when running sok dlrm benchmark #461

Open
@Orca-bit

Description

Describe the bug

  1. Create train.bin and test.bin following the HugeCTR DLRM sample. The md5sum matches.
  2. Split the data using the SOK preprocessing script split_bin.py, replacing --slot_size_array with the list from train.py in the HugeCTR DLRM sample. All other arguments are left at their defaults. Do I need to change the default dtype (int32) for label_raw_type, dense_raw_type, and category_raw_type?
  3. Run: horovodrun -np 8 ./hvd_wrapper.sh python3 main.py --data_dir=./splited_dataset/ --global_batch=65536 --epochs=1 --lr=24
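
To illustrate why the raw dtype question in step 2 matters, here is a minimal sketch using only the Python standard library. It does not use split_bin.py or the real dataset; the helper name `reinterpret` and the sample values are hypothetical. It shows that decoding the same bytes with a different integer width changes both the values and the element count. Whether the int32 defaults actually match the files produced by the HugeCTR sample is not confirmed here.

```python
import struct

def reinterpret(values, write_code, read_code):
    """Pack `values` with one struct type code, then decode the same bytes with another."""
    raw = struct.pack(f"<{len(values)}{write_code}", *values)
    item_size = struct.calcsize(f"<{read_code}")
    count = len(raw) // item_size  # trailing bytes that do not fill an item are dropped
    return list(struct.unpack(f"<{count}{read_code}", raw[: count * item_size]))

categories = [1, 2, 3, 4]                        # hypothetical category ids
same_width = reinterpret(categories, "i", "i")   # written as int32, read as int32
wrong_width = reinterpret(categories, "i", "q")  # written as int32, read as int64

print(same_width)   # [1, 2, 3, 4]
print(wrong_width)  # [8589934593, 17179869187]
```

With a mismatched width, the reader sees half as many elements and unrelated values, so any per-sample counts derived from the file would no longer line up with the data.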

After running iteration 3790, the following error occurs. It looks like something is wrong with the dataset.

[1,6]<stderr>:Traceback (most recent call last):
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/main.py", line 146, in <module>
[1,6]<stderr>:    trainer.train(eval_in_last=False, early_stop=args.early_stop, epochs=args.epochs)
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/trainer.py", line 247, in train
[1,6]<stderr>:    auc = evaluate(self._model, self._test_dataset, self._auc_thresholds)
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/trainer.py", line 20, in evaluate
[1,6]<stderr>:    for idx, (samples, labels) in enumerate(dataset):
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/dataset.py", line 152, in __getitem__
[1,6]<stderr>:    return self._prefetch_queue.get().result()
[1,6]<stderr>:  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
[1,6]<stderr>:    return self.__get_result()
[1,6]<stderr>:  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
[1,6]<stderr>:    raise self._exception
[1,6]<stderr>:  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
[1,6]<stderr>:    result = self.fn(*self.args, **self.kwargs)
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/dataset.py", line 205, in _get
[1,6]<stderr>:    tf.RaggedTensor.from_row_lengths(flat_values, row_lengths[i])
[1,6]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
[1,6]<stderr>:    raise e.with_traceback(filtered_tb) from None
[1,6]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/check_ops.py", line 485, in _binary_assert
[1,6]<stderr>:    raise errors.InvalidArgumentError(
[1,6]<stderr>:tensorflow.python.framework.errors_impl.InvalidArgumentError: Arguments to _from_row_partition do not form a valid RaggedTensor
[1,6]<stderr>:Condition x == y did not hold.
[1,6]<stderr>:First 1 elements of x:
[1,6]<stderr>:[8192]
[1,6]<stderr>:First 1 elements of y:
[1,6]<stderr>:[2]
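
For context, tf.RaggedTensor.from_row_lengths requires that the row lengths sum to the number of flat values. The failed check above compares 8192 against 2, and 8192 equals the global batch of 65536 divided by the 8 workers. The sketch below mirrors that invariant in plain Python without calling TensorFlow. The helper name `split_rows` is hypothetical, and the assumption that the row lengths account for 8192 values while only 2 flat values were read is a guess from the message, not something the traceback states.

```python
def split_rows(flat_values, row_lengths):
    """Split flat_values into rows, enforcing sum(row_lengths) == len(flat_values)."""
    expected = sum(row_lengths)
    if expected != len(flat_values):
        raise ValueError(
            f"row_lengths sum to {expected}, but flat_values has {len(flat_values)} elements"
        )
    rows, start = [], 0
    for n in row_lengths:
        rows.append(flat_values[start:start + n])
        start += n
    return rows

# Consistent input: 1 + 3 == 4 flat values.
rows = split_rows([10, 11, 12, 13], [1, 3])
print(rows)  # [[10], [11, 12, 13]]

# Inconsistent input shaped like the numbers in the error (assumed mapping).
try:
    split_rows([10, 11], [1] * 8192)
    message = None
except ValueError as exc:
    message = str(exc)
print(message)
```

Running a check like this on each batch before building the ragged tensor would show which file and offset first produce fewer values than the row lengths require.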

