SFTTrainer not loading dataset correctly, expected format? #2541

Closed
degen2 opened this issue Jan 3, 2025 · 3 comments
Labels
🗃️ data (Related to data) · ❓ question (Seeking clarification or more information) · 🏋 SFT (Related to SFT)

Comments

degen2 commented Jan 3, 2025

From the SFT website:

Dataset format support

The SFTTrainer supports popular dataset formats. This allows you to pass the dataset to the trainer without any pre-processing directly. The following formats are supported:

instruction format
{"prompt": "", "completion": ""}
{"prompt": "", "completion": ""}
{"prompt": "", "completion": ""}

My local dataset is structured exactly like this; however, the SFTTrainer gives me errors:

ValueError: Column to remove ['train'] not in the dataset. Current columns in the dataset: ['prompt', 'completion']

and

raise ValueError("You need to specify either text or text_target.")
ValueError: You need to specify either text or text_target.

or

value = self.data[key]
        ~~~~~~~~~^^^^^

KeyError: 'text'

The errors occur whether I load it like this:

train_dataset = load_dataset('json', data_files=dataset_file_path)

or like this:

train_dataset = load_dataset('json', data_files=dataset_file_path, field='train')

Adding a text key in the dataset doesn't change it, either. So, what is the expected dataset format, if it's not the one specified on the website?

August-murr (Collaborator) commented

Here's how to fix it:
train_dataset = load_dataset('json', data_files=dataset_file_path, split="train")
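
A minimal sketch of why `split="train"` matters, assuming the Hugging Face `datasets` library is installed. Without `split`, `load_dataset` returns a `DatasetDict` keyed by split name (here only `train`); with `split="train"` it returns the `Dataset` itself, whose columns are `prompt` and `completion`. The file name `sample.jsonl`, the helper names, and the sample records are hypothetical and used only for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_sample_jsonl(path):
    """Write a tiny prompt/completion file (hypothetical sample data)."""
    records = [
        {"prompt": "What is 2 + 2?", "completion": "4"},
        {"prompt": "Name a primary color.", "completion": "Red"},
    ]
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return records


def load_both_ways(path):
    """Load the same file without and with split="train"."""
    # Imported here so the file-writing helper above works without `datasets`.
    from datasets import load_dataset

    # No split given: a DatasetDict that maps split names to datasets.
    as_dict = load_dataset("json", data_files=str(path))
    # split="train": the Dataset itself, which is what a trainer expects.
    as_dataset = load_dataset("json", data_files=str(path), split="train")
    return as_dict, as_dataset


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.jsonl"
        write_sample_jsonl(path)
        as_dict, as_dataset = load_both_ways(path)
        print(type(as_dict).__name__, list(as_dict.keys()))
        print(type(as_dataset).__name__, as_dataset.column_names)
```

Note that `field='train'` is a different option: it tells the JSON loader to read records nested under a top-level `"train"` key inside the file, so it does not select a split.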

For simpler issues like this, I suggest trying ChatGPT or Copilot first; they can save you a lot of time!

August-murr added the ❓ question, 🏋 SFT, and 🗃️ data labels on Jan 4, 2025
degen2 (Author) commented Jan 4, 2025

I already tried that and still get the same KeyError, even when loading a dataset from the Hub. I also tried adding a 'text' key to the data.

qgallouedec (Member) commented

split="train" is the solution. If you still encounter the error, please provide an MRE (minimal reproducible example).
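
When putting together an MRE, it can help to first rule out a malformed data file. The following is a minimal sketch using only the Python standard library. `find_format_problems` is a hypothetical helper, not part of TRL or `datasets`, and it checks only the prompt/completion layout quoted at the top of this issue, assuming one JSON object per line.

```python
import json


def find_format_problems(lines):
    """Return (line_number, message) pairs for lines that do not match
    the {"prompt": ..., "completion": ...} layout."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        for key in ("prompt", "completion"):
            if not isinstance(record.get(key), str):
                problems.append((number, f"missing or non-string '{key}'"))
    return problems


if __name__ == "__main__":
    sample = [
        '{"prompt": "Hi", "completion": "Hello"}',
        # Records nested under a top-level key do not match the flat layout.
        '{"train": [{"prompt": "Hi", "completion": "Hello"}]}',
    ]
    for number, message in find_format_problems(sample):
        print(f"line {number}: {message}")
```

An empty result means every non-blank line has string `prompt` and `completion` fields; it says nothing about how the file is later loaded or passed to the trainer.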

degen2 closed this as completed on Jan 4, 2025