[BUG] EmbeddingIngestor empty segments to OpenAiEmbeddingModel #2273
Comments
Hello @Empatixx! I'm here to help you with any bugs or questions you have while waiting for a human maintainer. Let's tackle this issue together!
To prevent the exception, check that the list of segments is not empty before transforming and embedding it:
// Assuming you have a list of text segments after splitting
List<TextSegment> segments = documentSplitter.splitAll(documents);
// Check if the segments list is empty
if (!segments.isEmpty()) {
    // Proceed with transforming and embedding only if there are segments
    List<TextSegment> transformedSegments = textSegmentTransformer.transformAll(segments);
    // embedAll returns a Response wrapping the list of embeddings
    Response<List<Embedding>> response = embeddingModel.embedAll(transformedSegments);
    embeddingStore.addAll(response.content(), transformedSegments);
} else {
    // Handle the case where there are no segments, e.g., log a message or continue
    System.out.println("No segments to embed.");
}
This code snippet checks whether the list of segments is empty before proceeding with the transformation and embedding steps. If the list is empty, it skips the embedding process, which prevents the exception caused by empty input [1].
Hi @Empatixx, thanks a lot! I would argue that the ...
BTW, what do you mean by "empty segments"? There is no way to create a TextSegment with blank text:
public TextSegment(String text, Metadata metadata) {
    this.text = ensureNotBlank(text, "text");
    this.metadata = ensureNotNull(metadata, "metadata");
}
Hi @langchain4j, thanks for pointing that out! What I meant by "empty segments" is actually an empty collection of segments. For instance, you might receive a LangChain Document containing HTML text along with some custom metadata. Then, a custom implementation of a DocumentSplitter, which understands the structure of our HTML, attempts to split the HTML into segments. However, things aren't always perfect, and the HTML might be malformed or incorrect. In such cases, I’d prefer to fall back to returning an empty collection of segments, indicating that the document couldn't be segmented, rather than throwing an exception. Throwing an exception in this scenario would disrupt the entire document ingestion process, which I want to avoid.
So perhaps returning an empty collection isn't the best solution either. Maybe a better approach would be to log an error and handle the fallback outside of the DocumentSplitter, within the ingestor. What do you think?
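The fallback described in this comment could be sketched roughly as follows. This is only an illustration under loud assumptions: `FallbackIngestSketch`, `splitAllLeniently`, and `ingest` are hypothetical names, documents and segments are plain strings, and the splitter and embedder are ordinary `java.util.function.Function` stand-ins rather than the LangChain4j `DocumentSplitter` and `EmbeddingModel` types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.logging.Logger;

// Hypothetical stand-in, not a LangChain4j class: documents and segments are plain strings.
final class FallbackIngestSketch {

    private static final Logger LOG = Logger.getLogger(FallbackIngestSketch.class.getName());

    /**
     * Splits each document independently. A document whose splitter throws
     * is logged and skipped, so the remaining documents are still processed.
     */
    static List<String> splitAllLeniently(List<String> documents,
                                          Function<String, List<String>> splitter) {
        List<String> segments = new ArrayList<>();
        for (String document : documents) {
            try {
                segments.addAll(splitter.apply(document));
            } catch (RuntimeException e) {
                LOG.warning("Skipping document that could not be split: " + e.getMessage());
            }
        }
        return segments;
    }

    /** Returns the number of embeddings produced; the embedder is never called with an empty list. */
    static int ingest(List<String> documents,
                      Function<String, List<String>> splitter,
                      Function<List<String>, List<float[]>> embedder) {
        List<String> segments = splitAllLeniently(documents, splitter);
        if (segments.isEmpty()) {
            LOG.info("No segments to embed.");
            return 0;
        }
        return embedder.apply(segments).size();
    }
}
```

With this shape the splitter stays free to throw on malformed input, while the decision to log and continue lives in the ingestion loop, which is the division of responsibility the comment proposes.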
Describe the bug
When using the EmbeddingIngestor with a custom DocumentSplitter, if all documents passed to the ingest() method result in empty segments from the splitter, the embed() method of the OpenAiEmbeddingModel will throw an exception.
Since it’s not always possible to guarantee that a document can be segmented, the EmbeddingIngestor should handle this scenario gracefully. It should check the output of the DocumentSplitter during the ingestion process. If the splitter produces empty segments, the ingestor should skip embedding without throwing exceptions.
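Until the ingestor handles this itself, one possible workaround is to guard the embedding call rather than the ingestor. The sketch below is an assumption-laden illustration, not the library's API: `SegmentEmbedder` is a hypothetical stand-in interface (segments are plain strings, embeddings are `float[]`), and `emptySafe` is a hypothetical helper that wraps any such embedder so an empty list never reaches the remote API.

```java
import java.util.Collections;
import java.util.List;

final class EmptySafeEmbedderSketch {

    /** Hypothetical stand-in for an embedding model; not the LangChain4j interface. */
    interface SegmentEmbedder {
        List<float[]> embedAll(List<String> segments);
    }

    /** Wraps a delegate so that an empty input short-circuits instead of being sent to the delegate. */
    static SegmentEmbedder emptySafe(SegmentEmbedder delegate) {
        return segments -> segments.isEmpty()
                ? Collections.<float[]>emptyList()
                : delegate.embedAll(segments);
    }
}
```

The same decorator idea could be applied to a real embedding model implementation, at the cost of hiding the empty-input case from callers that might want to log it.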
Log and Stack trace
I will provide later
It throws an OpenAI request error ("Invalid input") with HTTP status 400:
Caused by: dev.ai4j.openai4j.OpenAiHttpException: { "error": { "message": "'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.", "type": "invalid_request_error", "param": null, "code": null } }
To Reproduce
Use a DocumentSplitter that returns an empty collection of segments together with OpenAiEmbeddingModel.
Expected behavior
No embeddings are created, and no exception is thrown.
Additional context
I will provide later