[BUG] EmbeddingIngestor empty segments to OpenAiEmbeddingModel #2273

Empatixx · 2024-12-13T15:37:10Z

Describe the bug

When using the EmbeddingIngestor with a custom DocumentSplitter, if all documents passed to the ingest() method result in empty segments from the splitter, the embed() method of the OpenAiEmbeddingModel will throw an exception.

Since it’s not always possible to guarantee that a document can be segmented, the EmbeddingIngestor should handle this scenario gracefully. It should check the output of the DocumentSplitter during the ingestion process. If the splitter produces empty segments, the ingestor should skip embedding without throwing exceptions.

Log and Stack trace

I will provide later
It throws an OpenAI request error ("Invalid input") with HTTP status 400:
Caused by: dev.ai4j.openai4j.OpenAiHttpException: { "error": { "message": "'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.", "type": "invalid_request_error", "param": null, "code": null } }

To Reproduce

Use a DocumentSplitter that returns empty segments together with OpenAiEmbeddingModel.

Expected behavior

It won't create any embeddings, but it also won't throw an exception.

Please complete the following information:

LangChain4j version: 0.35.0
LLM(s) used: OpenAI gpt4o-mini
Java version: 21
Spring Boot version (if applicable): 3.3.5

Additional context

I will provide later

dosubot · 2024-12-13T15:39:06Z

Hello @Empatixx! I'm here to help you with any bugs or questions you have while waiting for a human maintainer. Let's tackle this issue together!

To prevent the OpenAiEmbeddingModel from throwing an exception when the EmbeddingIngestor's custom DocumentSplitter returns empty segments, you can modify the code to check for empty segments before attempting to embed them. If the segments are empty, you can skip the embedding process for those segments. Here's a possible approach:

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.output.Response;
import java.util.List;

// Split all documents into text segments
List<TextSegment> segments = documentSplitter.splitAll(documents);

// Embed only when the splitter produced at least one segment
if (!segments.isEmpty()) {
    List<TextSegment> transformedSegments = textSegmentTransformer.transformAll(segments);
    // embedAll returns a Response wrapping the list of embeddings
    Response<List<Embedding>> response = embeddingModel.embedAll(transformedSegments);
    embeddingStore.addAll(response.content(), transformedSegments);
} else {
    // Nothing to embed: log and continue instead of calling the model
    System.out.println("No segments to embed.");
}

This code snippet checks if the list of segments is empty before proceeding with the transformation and embedding steps. If the list is empty, it skips the embedding process, thus preventing any exceptions related to empty input ^[1].

langchain4j · 2024-12-16T11:31:20Z

Hi @Empatixx, thanks a lot!

I would argue that the DocumentSplitter should be responsible for providing valid outputs. An empty TextSegment could hide a problem in a custom DocumentSplitter implementation, so if we silently skip it, we might mask a problem that will be confusing and hard to debug. I would fail fast in this case TBH.

BTW, what do you mean by "empty segments"? There is no way to create a TextSegment with an empty/blank body:

    public TextSegment(String text, Metadata metadata) {
        this.text = ensureNotBlank(text, "text");
        this.metadata = ensureNotNull(metadata, "metadata");
    }
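
To make the fail-fast behavior concrete, here is a minimal self-contained sketch. `Segment` and its local `ensureNotBlank` helper are hypothetical stand-ins written for illustration, not the real LangChain4j classes; they only mirror the validation idea shown in the constructor above.

```java
// Hypothetical stand-in for TextSegment; only the blank-text check is mirrored.
final class Segment {
    private final String text;

    Segment(String text) {
        this.text = ensureNotBlank(text, "text");
    }

    String text() {
        return text;
    }

    // Simplified local helper (an assumption, not the library's code):
    // rejects null, empty, and whitespace-only strings.
    static String ensureNotBlank(String value, String name) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException(name + " cannot be null or blank");
        }
        return value;
    }
}

class SegmentValidationDemo {
    // Returns true when constructing a Segment from the given text is rejected.
    static boolean isRejected(String text) {
        try {
            new Segment(text);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("blank rejected: " + isRejected("   "));
        System.out.println("non-blank rejected: " + isRejected("hello"));
    }
}
```

With a check like this in the constructor, a splitter cannot hand over a segment with blank text, so the remaining case is a splitter that returns no segments at all.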

Empatixx · 2024-12-18T18:31:43Z

Hi @langchain4j,

Thanks for pointing that out! What I meant by "empty segments" is actually an empty collection of segments. For instance, you might receive a LangChain Document containing HTML text along with some custom metadata. Then, a custom implementation of a DocumentSplitter, which understands the structure of our HTML, attempts to split the HTML into segments. However, things aren't always perfect, and the HTML might be malformed or incorrect.

In such cases, I’d prefer to fall back to returning an empty collection of segments, indicating that the document couldn't be segmented, rather than throwing an exception. Throwing an exception in this scenario would disrupt the entire document ingestion process, which I want to avoid.

So perhaps returning an empty collection isn't the best solution either. Maybe a better approach would be to log an error and handle the fallback outside of the DocumentSplitter, within the ingestor. What do you think?
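
To illustrate what I have in mind, here is a rough, self-contained sketch of handling the fallback in the ingestor. Everything in it is hypothetical: `GuardedIngestor` is not a LangChain4j class, and the splitter and embedder are plain `Function` stand-ins for DocumentSplitter and the embedding model. The point is only the control flow: log and skip documents that yield nothing, and never call the embedder with an empty list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical ingestor; documents and segments are plain strings for brevity.
class GuardedIngestor {
    private static final Logger LOG = Logger.getLogger(GuardedIngestor.class.getName());

    private final Function<String, List<String>> splitter;              // stand-in for a DocumentSplitter
    private final Function<List<String>, List<float[]>> embedder;       // stand-in for an embedding model

    GuardedIngestor(Function<String, List<String>> splitter,
                    Function<List<String>, List<float[]>> embedder) {
        this.splitter = splitter;
        this.embedder = embedder;
    }

    // Returns the number of segments that were embedded.
    int ingest(List<String> documents) {
        List<String> segments = new ArrayList<>();
        for (String document : documents) {
            try {
                List<String> parts = splitter.apply(document);
                if (parts == null || parts.isEmpty()) {
                    LOG.warning("Splitter produced no segments; skipping document");
                    continue;
                }
                segments.addAll(parts);
            } catch (RuntimeException e) {
                // One bad document should not abort the whole ingestion run.
                LOG.log(Level.SEVERE, "Splitter failed; skipping document", e);
            }
        }
        if (segments.isEmpty()) {
            return 0; // nothing to embed, so the embedder is never called
        }
        return embedder.apply(segments).size();
    }
}

class GuardedIngestorDemo {
    public static void main(String[] args) {
        GuardedIngestor ingestor = new GuardedIngestor(
                doc -> doc.isBlank() ? List.of() : List.of(doc.split(" ")),
                segs -> {
                    List<float[]> vectors = new ArrayList<>();
                    for (String s : segs) {
                        vectors.add(new float[] {s.length()}); // dummy one-dimensional "embedding"
                    }
                    return vectors;
                });
        System.out.println(ingestor.ingest(List.of("", "two words")));
    }
}
```

This keeps the splitter free to signal "could not segment" either way (empty list or exception), while the logging makes sure the problem is not silently hidden.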

Empatixx added the bug label Dec 13, 2024

langchain4j added RAG Error Handling labels Dec 16, 2024