
[BUG] EmbeddingIngestor empty segments to OpenAiEmbeddingModel #2273

Open
Empatixx opened this issue Dec 13, 2024 · 3 comments
Labels
bug Something isn't working Error Handling RAG

Comments

@Empatixx
Contributor

Empatixx commented Dec 13, 2024

Describe the bug

When using the EmbeddingIngestor with a custom DocumentSplitter, if all documents passed to the ingest() method result in empty segments from the splitter, the embed() method of the OpenAiEmbeddingModel will throw an exception.

Since it’s not always possible to guarantee that a document can be segmented, the EmbeddingIngestor should handle this scenario gracefully. It should check the output of the DocumentSplitter during the ingestion process. If the splitter produces empty segments, the ingestor should skip embedding without throwing exceptions.

Log and Stack trace

I will provide later
It throws an OpenAI request error ("Invalid input") with HTTP status 400:
Caused by: dev.ai4j.openai4j.OpenAiHttpException: { "error": { "message": "'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.", "type": "invalid_request_error", "param": null, "code": null } }

To Reproduce

Use a DocumentSplitter that returns empty segments together with an OpenAiEmbeddingModel.

Expected behavior

No embeddings are created, and no exception is thrown.

Please complete the following information:

  • LangChain4j version: 0.35.0
  • LLM(s) used: OpenAI gpt4o-mini
  • Java version: 21
  • Spring Boot version (if applicable): 3.3.5

Additional context

I will provide later

@Empatixx Empatixx added the bug Something isn't working label Dec 13, 2024

dosubot bot commented Dec 13, 2024

Hello @Empatixx! I'm here to help you with any bugs or questions you have while waiting for a human maintainer. Let's tackle this issue together!

To prevent the OpenAiEmbeddingModel from throwing an exception when the EmbeddingIngestor's custom DocumentSplitter returns empty segments, you can modify the code to check for empty segments before attempting to embed them. If the segments are empty, you can skip the embedding process for those segments. Here's a possible approach:

// Split all documents into text segments
List<TextSegment> segments = documentSplitter.splitAll(documents);

// Check if the segments list is empty
if (!segments.isEmpty()) {
    // Proceed with transforming and embedding only if there are segments
    List<TextSegment> transformedSegments = textSegmentTransformer.transformAll(segments);
    // embedAll returns the embeddings wrapped in a Response; content() unwraps the list
    Response<List<Embedding>> response = embeddingModel.embedAll(transformedSegments);
    embeddingStore.addAll(response.content(), transformedSegments);
} else {
    // Handle the case where there are no segments, e.g., log a message or continue
    System.out.println("No segments to embed.");
}

This code snippet checks if the list of segments is empty before proceeding with the transformation and embedding steps. If the list is empty, it skips the embedding process, thus preventing any exceptions related to empty input [1].


@langchain4j
Owner

Hi @Empatixx, thanks a lot!

I would argue that the DocumentSplitter should be responsible for providing valid outputs. An empty TextSegment could hide a problem in a custom DocumentSplitter implementation, so silently skipping it might mask a problem that would be confusing and hard to debug. I would fail fast in this case TBH.

BTW, what do you mean by "empty segments"? There is no way to create a TextSegment with an empty/blank body:

    public TextSegment(String text, Metadata metadata) {
        this.text = ensureNotBlank(text, "text");
        this.metadata = ensureNotNull(metadata, "metadata");
    }
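
The guard in this constructor applies to each individual segment, not to the collection that holds segments. The following is a minimal, self-contained sketch of that difference. `Segment` and `BlankVersusEmptyDemo` are hypothetical stand-ins written for illustration; they model only the blank-text check and are not the library's classes.

```java
import java.util.List;

final class BlankVersusEmptyDemo {

    // Hypothetical stand-in for TextSegment; only the blank-text check is modeled.
    static final class Segment {
        final String text;

        Segment(String text) {
            if (text == null || text.trim().isEmpty()) {
                throw new IllegalArgumentException("text cannot be null or blank");
            }
            this.text = text;
        }
    }

    // Returns true if a Segment can be constructed from the given text.
    static boolean canConstruct(String text) {
        try {
            new Segment(text);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A blank segment is rejected at construction time.
        System.out.println("blank text accepted: " + canConstruct("   "));

        // An empty list never invokes the constructor, so no guard is triggered.
        List<Segment> noSegments = List.of();
        System.out.println("empty list created without error: " + noSegments.isEmpty());
    }
}
```

So a splitter can never hand over a blank segment, but nothing at the segment level stops it from returning a list with no segments in it.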

@Empatixx
Contributor Author

Empatixx commented Dec 18, 2024

Hi @langchain4j,

Thanks for pointing that out! What I meant by "empty segments" is actually an empty collection of segments. For instance, you might receive a LangChain Document containing HTML text along with some custom metadata. Then, a custom implementation of a DocumentSplitter, which understands the structure of our HTML, attempts to split the HTML into segments. However, things aren't always perfect, and the HTML might be malformed or incorrect.

In such cases, I’d prefer to fall back to returning an empty collection of segments, indicating that the document couldn't be segmented, rather than throwing an exception. Throwing an exception in this scenario would disrupt the entire document ingestion process, which I want to avoid.

So perhaps returning an empty collection isn't the best solution either. Maybe a better approach would be to log an error and handle the fallback outside of the DocumentSplitter, within the ingestor. What do you think?
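
One way the "log and fall back in the ingestor" idea could look is sketched below. This is not LangChain4j code: `LenientSplitting`, `splitSkippingFailures`, and `toySplit` are hypothetical names, and the generic `Function<D, List<S>>` stands in for a per-document DocumentSplitter call. The sketch splits each document on its own, logs and skips any document whose splitter throws or yields nothing, and leaves it to the caller to embed only when something is left.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.logging.Logger;

final class LenientSplitting {

    private static final Logger LOG = Logger.getLogger(LenientSplitting.class.getName());

    // Splits each document independently. A document whose splitter throws
    // or returns no segments is logged and skipped; the others are kept.
    static <D, S> List<S> splitSkippingFailures(List<D> documents,
                                                Function<D, List<S>> splitter) {
        List<S> segments = new ArrayList<>();
        for (D document : documents) {
            List<S> parts;
            try {
                parts = splitter.apply(document);
            } catch (RuntimeException e) {
                LOG.warning("Splitter failed, skipping document: " + e.getMessage());
                continue;
            }
            if (parts == null || parts.isEmpty()) {
                LOG.warning("Splitter produced no segments, skipping document");
                continue;
            }
            segments.addAll(parts);
        }
        return segments;
    }

    // Toy splitter (assumption for the demo): text without "<p>" is unsplittable.
    static List<String> toySplit(String html) {
        List<String> parts = new ArrayList<>();
        if (html.contains("<p>")) {
            parts.add(html.replace("<p>", "").trim());
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> segments =
                splitSkippingFailures(List.of("<p>hello", "garbage"), LenientSplitting::toySplit);

        // Only call the embedding model when there is something to embed.
        if (segments.isEmpty()) {
            System.out.println("Nothing to embed");
        } else {
            System.out.println("Would embed " + segments.size() + " segment(s)");
        }
    }
}
```

Logging each skipped document keeps the failure visible, which addresses the concern about masking splitter bugs, while one malformed document no longer aborts the whole batch.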
