[BUG] encounter error when running sok dlrm benchmark

**Describe the bug**
1. Create train.bin and test.bin following the HugeCTR DLRM sample. The md5sums match the expected values.
2. Split the data with the SOK preprocessing script split_bin.py, replacing --slot_size_array with the list from train.py in the HugeCTR DLRM sample. All other arguments are left at their defaults. Do I need to change the default dtype (int32) for label_raw_type, dense_raw_type, and category_raw_type?
3. Run `horovodrun -np 8 ./hvd_wrapper.sh python3 main.py --data_dir=./splited_dataset/ --global_batch=65536 --epochs=1 --lr=24`
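
On the dtype question in step 2, one quick sanity check is to see how many samples a split file holds when interpreted with a given element size. The sketch below is not part of SOK; it assumes (hypothetically) that a split file is a flat array of fixed-width values with a fixed number of values per sample, which may not match what split_bin.py actually writes. The helper name `describe_flat_file` and its parameters are made up for illustration.

```python
import os
import tempfile


def describe_flat_file(path, itemsize, values_per_sample, batch_size):
    """Interpret a flat binary file as fixed-width samples and count batches.

    Assumption (not confirmed by the SOK docs): the file is a plain array of
    `itemsize`-byte values, `values_per_sample` values per sample.
    """
    nbytes = os.path.getsize(path)
    bytes_per_sample = itemsize * values_per_sample
    samples, stray_bytes = divmod(nbytes, bytes_per_sample)
    full_batches, leftover_samples = divmod(samples, batch_size)
    return {
        "samples": samples,
        "stray_bytes": stray_bytes,            # non-zero hints at a wrong dtype
        "full_batches": full_batches,
        "leftover_samples": leftover_samples,  # size of a partial last batch
    }


# Demo on a synthetic 80-byte file: 20 samples if the values are 4 bytes wide.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, "wb") as f:
    f.write(b"\x00" * 80)

print(describe_flat_file(demo_path, itemsize=4, values_per_sample=1, batch_size=8))
print(describe_flat_file(demo_path, itemsize=8, values_per_sample=1, batch_size=8))
os.remove(demo_path)
```

If the sample count differs between the label, dense, and category files, or does not match the count in the original train.bin/test.bin, the raw dtype arguments are a likely suspect.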

After iteration 3790, the following error occurs. The traceback shows it is raised while iterating over the test dataset in `evaluate`, so it looks like something is wrong with the dataset.
```
[1,6]<stderr>:Traceback (most recent call last):
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/main.py", line 146, in <module>
[1,6]<stderr>:    trainer.train(eval_in_last=False, early_stop=args.early_stop, epochs=args.epochs)
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/trainer.py", line 247, in train
[1,6]<stderr>:    auc = evaluate(self._model, self._test_dataset, self._auc_thresholds)
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/trainer.py", line 20, in evaluate
[1,6]<stderr>:    for idx, (samples, labels) in enumerate(dataset):
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/dataset.py", line 152, in __getitem__
[1,6]<stderr>:    return self._prefetch_queue.get().result()
[1,6]<stderr>:  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
[1,6]<stderr>:    return self.__get_result()
[1,6]<stderr>:  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
[1,6]<stderr>:    raise self._exception
[1,6]<stderr>:  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
[1,6]<stderr>:    result = self.fn(*self.args, **self.kwargs)
[1,6]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/dataset.py", line 205, in _get
[1,6]<stderr>:    tf.RaggedTensor.from_row_lengths(flat_values, row_lengths[i])
[1,6]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
[1,6]<stderr>:    raise e.with_traceback(filtered_tb) from None
[1,6]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/ops/check_ops.py", line 485, in _binary_assert
[1,6]<stderr>:    raise errors.InvalidArgumentError(
[1,6]<stderr>:tensorflow.python.framework.errors_impl.InvalidArgumentError: Arguments to _from_row_partition do not form a valid RaggedTensor
[1,6]<stderr>:Condition x == y did not hold.
[1,6]<stderr>:First 1 elements of x:
[1,6]<stderr>:[8192]
[1,6]<stderr>:First 1 elements of y:
[1,6]<stderr>:[2]
```
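
For context, `tf.RaggedTensor.from_row_lengths` requires the row lengths to sum to the number of flat values. The pure-Python sketch below mirrors that consistency check without TensorFlow. The helper `check_row_lengths` is hypothetical, and reading the numbers in the log as 8192 rows of length 1 against only 2 values is an interpretation, not something the log states. Note that 8192 equals the per-rank batch size (65536 / 8).

```python
def check_row_lengths(num_values, row_lengths):
    """Return (is_valid, expected_num_values) for a ragged row partition.

    A ragged tensor built from row lengths is only valid when the lengths
    sum to the number of flat values supplied.
    """
    expected = sum(row_lengths)
    return expected == num_values, expected


# Assumed reading of the failure: a full per-rank batch of 8192 rows
# (one value each), but only 2 flat values were actually read.
per_rank_batch = 65536 // 8
ok, expected = check_row_lengths(num_values=2, row_lengths=[1] * per_rank_batch)
print(ok, expected)  # False 8192
```

If that reading is right, one batch of the test split delivered far fewer values than the row lengths describe, which would be consistent with a short or truncated read near the end of the file.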



[BUG] encounter error when running sok dlrm benchmark #461
