[BUG] encounter error when running sok dlrm benchmark #461
Open
Description
Describe the bug
- Created train.bin and test.bin following the HugeCTR DLRM sample. The md5sum is the same.
- Split the data using the SOK preprocessing script split_bin.py, replacing --slot_size_array with the list from train.py in the HugeCTR DLRM sample. All other arguments are left at their defaults. Do I need to change the default dtype (int32) for label_raw_type, dense_raw_type and category_raw_type?
- horovodrun -np 8 ./hvd_wrapper.sh python3 main.py --data_dir=./splited_dataset/ --global_batch=65536 --epochs=1 --lr=24
After running iteration 3790, the errors below occur. It looks like something is wrong with the dataset.
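One thing that may be worth ruling out is a split file whose size is not a whole number of batches for the dtype in use. The sketch below is only an illustration under assumptions: it treats a split file as a flat array of fixed-width values (4 bytes for the default int32), and the function name `check_split_file` and its parameters are hypothetical, not part of SOK or HugeCTR.

```python
import os


def check_split_file(path, items_per_sample, batch_size, itemsize=4):
    """Report how many whole samples and batches a flat binary file holds.

    Assumption: the file is a plain array of fixed-width values with
    `items_per_sample` values per sample and `itemsize` bytes per value
    (4 for int32). This layout is assumed, not taken from split_bin.py.
    """
    size = os.path.getsize(path)
    sample_bytes = items_per_sample * itemsize
    # Bytes left over after the last whole sample hint at a dtype mismatch.
    samples, stray_bytes = divmod(size, sample_bytes)
    # Samples left over after the last whole batch form a short final batch.
    full_batches, tail_samples = divmod(samples, batch_size)
    return {
        "samples": samples,
        "stray_bytes": stray_bytes,
        "full_batches": full_batches,
        "tail_samples": tail_samples,
    }
```

With 8 workers and --global_batch=65536, the per-worker batch is 8192 samples, so a call such as `check_split_file(path, 1, 8192)` on a single-column file would show whether a short final batch or stray bytes are present. A non-zero `stray_bytes` would suggest the file was written with a different item width than the one assumed.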
[1,6]<stderr>:Traceback (most recent call last):
[1,6]<stderr>: File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/main.py", line 146, in <module>
[1,6]<stderr>: trainer.train(eval_in_last=False, early_stop=args.early_stop, epochs=args.epochs)
[1,6]<stderr>: File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/trainer.py", line 247, in train
[1,6]<stderr>: auc = evaluate(self._model, self._test_dataset, self._auc_thresholds)
[1,6]<stderr>: File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/trainer.py", line 20, in evaluate
[1,6]<stderr>: for idx, (samples, labels) in enumerate(dataset):
[1,6]<stderr>: File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/dataset.py", line 152, in __getitem__
[1,6]<stderr>: return self._prefetch_queue.get().result()
[1,6]<stderr>: File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
[1,6]<stderr>: return self.__get_result()
[1,6]<stderr>: File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
[1,6]<stderr>: raise self._exception
[1,6]<stderr>: File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
[1,6]<stderr>: result = self.fn(*self.args, **self.kwargs)
[1,6]<stderr>: File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/dataset.py", line 205, in _get
[1,6]<stderr>: tf.RaggedTensor.from_row_lengths(flat_values, row_lengths[i])
[1,6]<stderr>: File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
[1,6]<stderr>: raise e.with_traceback(filtered_tb) from None
[1,6]<stderr>: File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/check_ops.py", line 485, in _binary_assert
[1,6]<stderr>: raise errors.InvalidArgumentError(
[1,6]<stderr>:tensorflow.python.framework.errors_impl.InvalidArgumentError: Arguments to _from_row_partition do not form a valid RaggedTensor
[1,6]<stderr>:Condition x == y did not hold.
[1,6]<stderr>:First 1 elements of x:
[1,6]<stderr>:[8192]
[1,6]<stderr>:First 1 elements of y:
[1,6]<stderr>:[2]
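For context, tf.RaggedTensor.from_row_lengths requires that the row lengths sum to the number of flat values, and the assertion above reports that the two counts disagree (8192 versus 2). A minimal sketch of a pre-check that could be called just before building the ragged tensor is shown below. The helper name `check_ragged_inputs` is hypothetical and not part of dataset.py; it only mirrors the invariant in plain Python so that the failing batch is easier to identify.

```python
def check_ragged_inputs(flat_values, row_lengths):
    """Verify that row_lengths partition flat_values exactly.

    Mirrors the invariant of tf.RaggedTensor.from_row_lengths:
    sum(row_lengths) must equal the number of flat values.
    Returns the number of values when the inputs are consistent.
    """
    expected = sum(int(n) for n in row_lengths)
    actual = len(flat_values)
    if expected != actual:
        raise ValueError(
            f"row_lengths sum to {expected} but flat_values has "
            f"{actual} elements"
        )
    return actual
```

Logging the batch index and file offset alongside this check would show whether the mismatch only appears on the last batch of the test split or earlier in the file.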