add data for sfttrainer doc (huggingface#1521)

alexvishnevskiy · Apr 11, 2024 · 087fe54
1 parent ebbd37b
Showing 1 changed file with 7 additions and 1 deletion.
diff --git a/docs/source/sft_trainer.mdx b/docs/source/sft_trainer.mdx
@@ -605,6 +605,12 @@ You may experience some issues with GPTQ Quantization after completing training.
 
 [[autodoc]] SFTTrainer
 
-## ConstantLengthDataset
+## Datasets
+
+`SFTTrainer` supports `datasets.IterableDataset` in addition to other dataset types. This is useful when you are working with a large corpus that you do not want to save to disk in full. The data is tokenized and processed on the fly, even when packing is enabled.
+
+`SFTTrainer` also supports pre-tokenized datasets, provided they are `datasets.Dataset` or `datasets.IterableDataset` instances. If such a dataset has an `input_ids` column, no further processing (tokenization or packing) is done and the dataset is used as-is. This is useful if you have tokenized your dataset outside of this script and want to reuse it directly.
+
+### ConstantLengthDataset
 
 [[autodoc]] trainer.ConstantLengthDataset
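
Note on the added "Datasets" text: a pre-tokenized dataset with an `input_ids` column could be prepared roughly as sketched below. This is only an illustration and not part of the commit. `VOCAB`, `toy_tokenize`, `pretokenized_examples`, and `build_datasets` are hypothetical names, and the whitespace tokenizer stands in for a real one. `build_datasets` assumes the `datasets` library is installed and is not called here; passing its results to `SFTTrainer` is left out because the constructor arguments depend on the TRL version.

```python
from typing import Dict, Iterator, List

# Hypothetical toy vocabulary; a real setup would use a proper tokenizer.
VOCAB = {"<unk>": 0, "hello": 1, "world": 2, "sft": 3}


def toy_tokenize(text: str) -> List[int]:
    """Map whitespace-separated words to ids (illustrative stand-in only)."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def pretokenized_examples(texts: List[str]) -> Iterator[Dict[str, List[int]]]:
    """Lazily yield one record per text, each carrying an `input_ids` column."""
    for text in texts:
        yield {"input_ids": toy_tokenize(text)}


def build_datasets(texts: List[str]):
    """Wrap the records as a map-style and a streaming dataset.

    Assumes the `datasets` library is available; imported lazily so the
    rest of this sketch runs without it.
    """
    from datasets import Dataset, IterableDataset

    def gen():
        # Re-create the generator on each pass over the dataset.
        yield from pretokenized_examples(texts)

    map_style = Dataset.from_list(list(pretokenized_examples(texts)))
    streaming = IterableDataset.from_generator(gen)
    return map_style, streaming


# Pure-Python check of the record layout (no `datasets` needed).
records = list(pretokenized_examples(["hello world", "hello sft trainer"]))
```

Per the text added in this commit, either object returned by `build_datasets` would be used as-is by the trainer, since both expose an `input_ids` column.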