
Inconsistent validation data handling in Keras 3 for Language Model fine-tuning #20748

Open
che-shr-cat opened this issue Jan 10, 2025 · 1 comment

@che-shr-cat

Issue Description

When fine-tuning language models in Keras 3, validation data is handled inconsistently. The documentation suggests validation_data should be an (x, y) tuple, but the actual requirements are unclear, and raw inputs are treated differently during training and validation.

Current Behavior & Problems

Issue 1: Raw text arrays are not accepted for validation

train_texts = ["text1", "text2", ...]
val_texts = ["val1", "val2", ...]

# This fails with ValueError:
model.fit(
    train_texts,
    validation_data=val_texts
)

# Error:
ValueError: Data is expected to be in format `x`, `(x,)`, `(x, y)`, or `(x, y, sample_weight)`, found: ("text1", "text2", ...)
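
One thing worth checking, though I have not verified it: wrapping the raw strings in a tf.data.Dataset may satisfy fit's data-format check while still letting the model's built-in preprocessor tokenize each batch. Whether that is the intended path is exactly what the documentation should spell out.

import tensorflow as tf

# Unverified sketch: batch the raw strings so they pass fit's data-format check.
# Assumes the task model's attached preprocessor also runs on validation batches.
train_ds = tf.data.Dataset.from_tensor_slices(train_texts).batch(2)
val_ds = tf.data.Dataset.from_tensor_slices(val_texts).batch(2)

model.fit(train_ds, validation_data=val_ds)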

Issue 2: Pre-tokenized validation fails

# Trying to provide tokenized data (tokenizer and pad_sequence stand in for the
# model's tokenizer and a padding helper):
val_tokenized = [tokenizer(text) for text in val_texts]
val_padded = np.array([pad_sequence(seq, max_len) for seq in val_tokenized])
val_input = val_padded[:, :-1]
val_target = val_padded[:, 1:]

model.fit(
    train_texts,
    validation_data=(val_input, val_target)
)

# Error:
TypeError: Input 'input' of 'SentencepieceTokenizeOp' Op has type int64 that does not match expected type of string.

The error suggests the tokenizer is being applied again to data that has already been tokenized. I understand there is a preprocessor=None option, but then I would have to preprocess the training data manually as well, which I would like to avoid.
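
For reference, my understanding of the preprocessor=None route is sketched below: both splits are preprocessed by hand with the task's own preprocessor, so nothing gets tokenized twice. This assumes the KerasNLP Gemma classes; the preset name and sequence length are placeholders.

import keras_nlp

# Sketch only: preprocess train and validation data the same way, then attach
# no preprocessor to the model so the tokenizer is not applied a second time.
preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
    "gemma_2b_en", sequence_length=128  # placeholder preset and length
)
model = keras_nlp.models.GemmaCausalLM.from_preset(
    "gemma_2b_en", preprocessor=None
)

train_x, train_y, train_sw = preprocessor(train_texts)
val_x, val_y, val_sw = preprocessor(val_texts)

model.fit(
    train_x, train_y, sample_weight=train_sw,
    validation_data=(val_x, val_y, val_sw),
)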

Working Solution (But Needs Documentation)

The working approach is to provide prompt-completion pairs:

# Prepare validation data as prompts and expected outputs
val_inputs = [format_prompt(text) for text in val_input_texts]
val_outputs = [format_output(text) for text in val_output_texts]
val_inputs = np.array(val_inputs)
val_outputs = np.array(val_outputs)

model.fit(
    train_texts,
    validation_data=(val_inputs, val_outputs)
)

Expected Behavior

  1. The documentation should clearly state that validation data for language models should be provided as prompt-completion pairs
  2. The validation data handling should be consistent with how training data is processed
  3. It should be clear whether token shifting is handled internally or needs to be done manually (see the illustration after this list)
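
To make point 3 concrete, here is a toy illustration of the shift-by-one convention that causal LM preprocessors typically apply internally (plain Python, not Gemma-specific code):

# Toy illustration of next-token targets for a causal language model.
token_ids = [2, 10, 11, 12, 3]   # e.g. <bos> t1 t2 t3 <eos>
x_ids = token_ids[:-1]           # model input:        [2, 10, 11, 12]
y_ids = token_ids[1:]            # next-token targets: [10, 11, 12, 3]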

Environment

  • Keras Version: 3.x
  • Python Version: 3.10
  • Model: Gemma LLM (but likely affects other LLMs too)

Additional Context

While there is a working solution using prompt-completion pairs, this differs from traditional language model training where each token predicts the next token. The documentation should clarify this architectural choice and explain the proper way to provide validation data.

@github-actions bot added the Gemma (Gemma model specific issues) label Jan 10, 2025

harshaljanjani commented Jan 12, 2025

Hello @che-shr-cat!
Thank you for pointing out these issues! I've reproduced Issue 1 and drafted a fix that makes the error message more descriptive while maintaining backward compatibility. I'm ready to raise a PR for this if needed (the broader concern is clearly the documentation, so I won't insist on the PR), but I'd need more guidance on how the documentation should be edited to bridge the gap. I'd be happy to contribute.

Previous Error:

ValueError: Data is expected to be in format `x`, `(x,)`, `(x, y)`, or `(x, y, sample_weight)`, found: ('val1', 'val2', 'val3', 'val4')

Updated Error (with changes):

ValueError: Raw text data detected. Text data must be preprocessed before training. Please use a text preprocessing pipeline such as:  
1. Tokenizer to convert text to sequences:  
   tokenizer = keras.preprocessing.text.Tokenizer()  
   tokenizer.fit_on_texts(texts)  
   sequences = tokenizer.texts_to_sequences(texts)  
2. Pad sequences to uniform length:  
   padded = keras.preprocessing.sequence.pad_sequences(sequences)  

Received raw text data: ['val1', 'val2', 'val3']... (showing first 3 items)  

This should make the need for preprocessing clear when raw text data is provided.
Interestingly, Issue 2 may not be specific to Gemma: I reproduced the double-tokenization scenario you described with a much simpler model, and it fails in essentially the same way, although there the failure surfaces as a ValueError rather than a TypeError.

Here's Issue 2 recreated with a simpler model:

train_texts = ["text1", "text2", "text3"]  
val_texts = ["val1", "val2", "val3"]  

tokenizer = Tokenizer()  
tokenizer.fit_on_texts(train_texts)  
val_tokenized = tokenizer.texts_to_sequences(val_texts)  

max_len = 5  
val_padded = pad_sequences(val_tokenized, maxlen=max_len, padding='post')  

val_input = val_padded[:, :-1]  
val_target = val_padded[:, 1:]  

vectorization = layers.TextVectorization(max_tokens=10000, output_sequence_length=max_len)  
vectorization.adapt(train_texts)  

model = models.Sequential()  
model.add(vectorization)  
model.add(layers.Embedding(input_dim=10000, output_dim=16))  
model.add(layers.Dense(1, activation='sigmoid'))  
model.compile(optimizer='adam', loss='binary_crossentropy')  

model.fit(  
    train_texts,  
    validation_data=(val_input, val_target)  
)  

Error:

ValueError: Unrecognized data type: x=['text1', 'text2', 'text3'] (of type <class 'list'>)  
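
For reference, a rearranged version of the toy example that does run end-to-end with the TF backend is sketched below. The tf.data pipelines, the dummy labels, and the pooling layer are additions purely for illustration; the point is that vectorizing inside the data pipeline and feeding a Dataset (rather than a plain Python list) avoids the "Unrecognized data type" error.

import tensorflow as tf
from keras import layers, models

train_texts = ["text1", "text2", "text3"]
train_labels = [0, 1, 0]  # dummy labels, only so the toy model has targets
val_texts = ["val1", "val2", "val3"]
val_labels = [1, 0, 1]

max_len = 5
vectorization = layers.TextVectorization(max_tokens=10000, output_sequence_length=max_len)
vectorization.adapt(train_texts)

def to_ids(text, label):
    # Vectorize inside the tf.data pipeline so the model only sees integer ids.
    return vectorization(text), label

train_ds = tf.data.Dataset.from_tensor_slices((train_texts, train_labels)).batch(2).map(to_ids)
val_ds = tf.data.Dataset.from_tensor_slices((val_texts, val_labels)).batch(2).map(to_ids)

model = models.Sequential([
    layers.Embedding(input_dim=10000, output_dim=16),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(train_ds, validation_data=val_ds, epochs=1)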

I wanted to check if I could raise a PR for Issue 1, and would love to know how we should handle the scope of Issue 2.

Environment:

  • Keras version: 3.8.0
  • TensorFlow version: 2.18.0
  • NumPy version: 2.0.2
