add data for sfttrainer doc (huggingface#1521)
BramVanroy authored Apr 11, 2024
1 parent ebbd37b commit 087fe54
Showing 1 changed file with 7 additions and 1 deletion.
8 changes: 7 additions & 1 deletion docs/source/sft_trainer.mdx
@@ -605,6 +605,12 @@ You may experience some issues with GPTQ Quantization after completing training.

[[autodoc]] SFTTrainer

## ConstantLengthDataset
## Datasets

In the SFTTrainer, we support `datasets.IterableDataset` in addition to other dataset types. This is useful if you are working with large corpora that you do not want to save entirely to disk. The data is tokenized and processed on the fly, even when packing is enabled.
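
To illustrate what "tokenized and processed on the fly, even when packing is enabled" means, here is a minimal plain-Python sketch. It is not the SFTTrainer implementation: `toy_tokenize`, `stream_texts`, `pack_on_the_fly`, and `EOS_ID` are hypothetical names introduced for this example, and the generator merely stands in for a streamed dataset.

```python
from typing import Iterable, Iterator, List

EOS_ID = 0  # hypothetical end-of-sequence token id


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in for a real tokenizer: one id per word (its length)."""
    return [len(word) for word in text.split()]


def stream_texts() -> Iterator[str]:
    """Stands in for a streamed corpus; examples are yielded one at a time."""
    yield "the quick brown fox"
    yield "jumps over"
    yield "the lazy dog"


def pack_on_the_fly(texts: Iterable[str], seq_length: int) -> Iterator[List[int]]:
    """Tokenize lazily and emit fixed-length blocks; an incomplete tail is dropped."""
    buffer: List[int] = []
    for text in texts:
        # Each example is tokenized only when it is pulled from the stream.
        buffer.extend(toy_tokenize(text) + [EOS_ID])
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]


blocks = list(pack_on_the_fly(stream_texts(), seq_length=4))
```

Because both the source and the packer are generators, only a small buffer of token ids is held in memory at any time, which is the property that makes streaming attractive for large corpora.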

Additionally, in the SFTTrainer, we support pre-tokenized datasets if they are `datasets.Dataset` or `datasets.IterableDataset`. In other words, if such a dataset has an `input_ids` column, no further processing (tokenization or packing) is done, and the dataset is used as-is. This is useful if you have pre-tokenized your dataset outside of this script and want to reuse it directly.
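
The rule described above can be sketched in plain Python as follows. This only illustrates the decision, under the assumption that raw data lives in a column named `text`; `prepare` and `toy_tokenize` are hypothetical helpers, not part of the SFTTrainer API, and plain dictionaries stand in for dataset columns.

```python
from typing import Callable, Dict, List

Columns = Dict[str, List]


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in for a real tokenizer: one id per word (its length)."""
    return [len(word) for word in text.split()]


def prepare(columns: Columns, tokenize: Callable[[str], List[int]]) -> Columns:
    """Return the data unchanged if it already has `input_ids`; otherwise tokenize `text`."""
    if "input_ids" in columns:
        return columns  # used as-is: no tokenization, no packing
    # Assumption for this sketch: raw examples are stored in a "text" column.
    return {"input_ids": [tokenize(t) for t in columns["text"]]}


pretokenized = {"input_ids": [[101, 7, 8, 102], [101, 9, 102]]}
raw = {"text": ["hello world", "hi"]}

same = prepare(pretokenized, toy_tokenize)   # passed through untouched
processed = prepare(raw, toy_tokenize)       # tokenized from the "text" column
```

Note that in this sketch the pre-tokenized data is returned as the very same object, which mirrors the "used as-is" behavior: any truncation or packing you want must already have been applied before the data is handed over.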

### ConstantLengthDataset

[[autodoc]] trainer.ConstantLengthDataset
