SFTTrainer not loading dataset correctly, expected format? #2541

degen2 · 2025-01-03T21:58:25Z

From the SFT website:

Dataset format support

The SFTTrainer supports popular dataset formats. This allows you to pass the dataset to the trainer without any pre-processing directly. The following formats are supported:

instruction format
{"prompt": "", "completion": ""}
{"prompt": "", "completion": ""}
{"prompt": "", "completion": ""}

My local dataset is exactly structured like this, however the SFTTrainer gives me errors:

ValueError: Column to remove ['train'] not in the dataset. Current columns in the dataset: ['prompt', 'completion']

and

raise ValueError("You need to specify either text or text_target.")
ValueError: You need to specify either text or text_target.

or

value = self.data[key]
        ~~~~~~~~~^^^^^

KeyError: 'text'

This happens whether I load it like this:

train_dataset = load_dataset('json', data_files=dataset_file_path)

or like this:

train_dataset = load_dataset('json', data_files=dataset_file_path, field='train')

Adding a text key to the dataset doesn't change anything either. So what is the expected dataset format, if it's not the one specified on the website?
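To rule out the file itself before involving the trainer, each line can be checked against the documented instruction format. The following is a minimal sketch that uses only the standard library; the helper name `check_jsonl` is hypothetical, and it assumes the file is JSON Lines (one object per line) with the `prompt` and `completion` keys quoted above.

```python
import json

# Keys of the instruction format quoted from the documentation above.
REQUIRED_KEYS = {"prompt", "completion"}


def check_jsonl(path):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "not a JSON object"))
                continue
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                problems.append((line_number, f"missing keys: {sorted(missing)}"))
    return problems
```

Calling `check_jsonl(dataset_file_path)` and getting an empty list back suggests the file content is fine and that the problem lies in how the dataset is loaded or passed to the trainer.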


August-murr · 2025-01-04T06:17:04Z

Here's how to fix it:
train_dataset = load_dataset('json', data_files=dataset_file_path, split="train")

For simpler issues like this, I suggest trying ChatGPT or Copilot first, as they can save you a lot of time!

degen2 · 2025-01-04T16:41:07Z

I already tried that and still get the same KeyError, even when loading a dataset from the Hub. I also tried adding a 'text' key to the data.

qgallouedec · 2025-01-04T19:09:33Z

split="train" is the solution. If you still encounter the error, please provide an MRE.

August-murr added the ❓ question, 🏋 SFT, and 🗃️ data labels Jan 4, 2025

degen2 closed this as completed Jan 4, 2025
