Ability to specify the task type for calculating embeddings #722

glaforge · 2024-03-09T16:56:49Z

This PR makes it possible to specify the task type used when calculating embeddings:

RETRIEVAL_QUERY
RETRIEVAL_DOCUMENT
SEMANTIC_SIMILARITY
CLASSIFICATION
CLUSTERING

Summary by CodeRabbit

New Features
- Introduced TaskType enum for specifying task types in embedding instances, enhancing customization for retrieval, similarity, classification, and clustering tasks.
- Added taskType parameter to the constructor for setting task type in embedding instances.
Tests
- Added tests for semantic similarity, text classification, and document retrieval embeddings using different task types.
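
As a rough illustration of the enum summarized above, a minimal sketch might look like the following. The constant names come from the task types listed in this description; the wrapper class `TaskTypeSketch` and its `names` helper are hypothetical and are not part of langchain4j.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical wrapper class, used only to keep this sketch self-contained.
class TaskTypeSketch {

    // Simplified stand-in for the TaskType enum introduced by this PR.
    enum TaskType {
        RETRIEVAL_QUERY,
        RETRIEVAL_DOCUMENT,
        SEMANTIC_SIMILARITY,
        CLASSIFICATION,
        CLUSTERING
    }

    // Returns the names of all task types in declaration order.
    static List<String> names() {
        return Arrays.stream(TaskType.values())
                .map(Enum::name)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(names());
    }
}
```

Using an enum rather than a free-form string means an unsupported task type is rejected at compile time instead of at request time.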

coderabbitai · 2024-03-09T16:57:07Z

Walkthrough

The update involves enhancing the langchain4j-vertex-ai library by introducing a new TaskType enum to categorize tasks and updating tests to cover various embedding tasks, making the library more versatile in handling different types of tasks.

Changes

| Files | Change Summary |
| --- | --- |
| `.../vertexai/VertexAiEmbeddingInstance.java` | Added `title`, `task_type` fields; added `setTitle`, `setTaskType`; removed `final` from `content` |
| `.../vertexai/VertexAiEmbeddingModel.java`, `.../vertexai/VertexAiEmbeddingModelIT.java` | Introduced `TaskType` enum; modified constructor to include `taskType`; updated imports; added tests for various task types |

Related issues

[FEATURE] GCP VertexAI Ability to use tunned models #440: The enhancements allow for more flexible use of custom and tuned models in Vertex AI, addressing the issue's objective to better integrate tuned models and improve configuration options. The addition of task types and the removal of the final keyword from the content field could indirectly facilitate the use of custom trained LLM models by enabling more dynamic instance creation and modification.

Poem

In the digital fields where code rabbits hop,
New features grow like carrots non-stop. 🥕
With tasks typed and models tuned,
Our library blooms, under the moon. 🌕
Through lines of Java, our changes weave,
Building dreams for AIs to conceive.
Cheers to progress, with leaps and bounds,
In the world of code, where innovation abounds. 🚀

((\
(-.-)
o_(")(")


coderabbitai

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits

Files that changed from the base of the PR and between 498145c and 4d74d0d.

Files selected for processing (3)

langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingInstance.java (1 hunks)
langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModel.java (7 hunks)
langchain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java (2 hunks)

Additional comments: 9

langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingInstance.java (3)

5-5: The removal of the final modifier from the content field introduces mutability to the VertexAiEmbeddingInstance objects. Ensure that this change aligns with the intended use cases and consider thread safety if instances are accessed by multiple threads.

6-15: The addition of the title field and its corresponding setter method setTitle enhances the descriptiveness of embedding instances. Ensure that the title field is appropriately used in downstream processes where necessary.

7-19: The introduction of the task_type field and its setter method setTaskType allows for specifying the task type of an embedding instance, enhancing the model's flexibility. Using an enum for task types ensures type safety and improves code readability.
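
To make the three comments above concrete, here is a simplified sketch of what such an instance class could look like. It is not the actual `VertexAiEmbeddingInstance` source: the class name `EmbeddingInstanceSketch`, the nested enum, and the getters are assumptions added for illustration, while the field names `content`, `title`, and `task_type` and the two setters are the ones named in the review.

```java
// Hypothetical, simplified version of the request instance described in the review.
class EmbeddingInstanceSketch {

    // Stand-in for the TaskType enum introduced by the PR.
    enum TaskType {
        RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING
    }

    private String content;     // no longer final, as noted in the first comment
    private String title;       // optional; stays null unless set
    private TaskType task_type; // snake_case mirrors the field name quoted in the review

    EmbeddingInstanceSketch(String content) {
        this.content = content;
    }

    void setTitle(String title) {
        this.title = title;
    }

    void setTaskType(TaskType taskType) {
        this.task_type = taskType;
    }

    // Getters are an assumption of this sketch, added so the state can be inspected.
    String getContent() { return content; }
    String getTitle() { return title; }
    TaskType getTaskType() { return task_type; }

    public static void main(String[] args) {
        EmbeddingInstanceSketch instance = new EmbeddingInstanceSketch("Some text to embed");
        instance.setTaskType(TaskType.SEMANTIC_SIMILARITY);
        System.out.println(instance.getContent() + " / " + instance.getTaskType());
    }
}
```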

langchain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java (3)

3-13: The changes to imports, including the addition of Metadata and static import for VertexAiEmbeddingModel.TaskType, are appropriate for the added tests. While wildcard imports are generally discouraged, their use in test files is acceptable. Consider using explicit imports for clarity in non-test code.

165-234: The addition of tests for semantic similarity, text classification, and document retrieval embeddings using different task types is commendable. These tests effectively validate the new functionality introduced in the PR. Ensure that these tests cover all edge cases and potential failure scenarios to maintain robustness.

1-16: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [98-98]

The use of java.util.Random for generating test data in createRandomSegments is appropriate and aligns with the intended use case. It's important to note that for security-sensitive operations requiring randomness (e.g., generating tokens or passwords), a cryptographically strong random number generator like SecureRandom should be used instead.
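
The distinction drawn in this note can be sketched as follows. This is not the `createRandomSegments` method from the test class, whose body is not shown here; `RandomnessSketch`, `testData`, and `token` are hypothetical names, and only the standard `java.util.Random` and `java.security.SecureRandom` classes are used.

```java
import java.security.SecureRandom;
import java.util.Random;

// Hypothetical helper class contrasting the two random number generators.
class RandomnessSketch {

    // A seeded java.util.Random yields the same sequence on every run,
    // which is convenient for reproducible test data.
    static int[] testData(long seed, int count) {
        Random random = new Random(seed);
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = random.nextInt(100);
        }
        return values;
    }

    // SecureRandom is the appropriate choice for security-sensitive values such as tokens.
    static String token(int numBytes) {
        byte[] bytes = new byte[numBytes];
        new SecureRandom().nextBytes(bytes);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(testData(42L, 5).length + " reproducible test values");
        System.out.println("token: " + token(16));
    }
}
```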

langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModel.java (3)

74-76: The introduction of the TaskType enum is a positive change, enhancing type safety and code readability by clearly defining the supported task types for embeddings. This approach facilitates easier maintenance and extension of task types in the future.

82-89: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [85-114]

The modification to the constructor to include a taskType parameter is a necessary change to support the new functionality of specifying task types for embeddings. This ensures that every VertexAiEmbeddingModel instance is created with a clear task type, aligning with the enhancement goals of the PR.

133-139: The use of taskType in the creation of VertexAiEmbeddingInstance objects within the embedAll method correctly applies the specified task type to each embedding calculation. This implementation ensures that the embedding process is tailored according to the specified task type, aligning with the PR's objectives.

langchain4j

@glaforge thanks!

langchain4j · 2024-03-11T06:13:41Z

langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModel.java

@@ -122,8 +130,16 @@ public Response<List<Embedding>> embedAll(List<TextSegment> segments) {

                List<Value> instances = new ArrayList<>();
                for (TextSegment segment : batch) {
+                    VertexAiEmbeddingInstance embeddingInstance = new VertexAiEmbeddingInstance(segment.text());
+                    // Title metadata is used for calculating embeddings for document retrieval
+                    embeddingInstance.setTitle(segment.metadata("title"));


The docs say title is valid only for RETRIEVAL_DOCUMENT.
It would be nice to document this in the Javadoc of this class.
I would also give users an option to define which metadata key to use instead of hardcoding "title".

How would you let the user customize the title property?

Perhaps something like titleMetadataKey("my_title") in the builder/ctor, but this is not urgent and can be added later. My main concern was that if there is a "title" metadata entry and the task type is not RETRIEVAL_DOCUMENT, the embedding call might fail.

Indeed that's a bit weird, but that's what's happening: using title without RETRIEVAL_DOCUMENT would fail.
I'll add the suggested method to customize the metadata key.

Pushed the extra builder method.

Maybe worth adding a condition here, to set the title only for RETRIEVAL_DOCUMENT?

Good point: pushed a change to set title only for RETRIEVAL_DOCUMENT.
That way, segments can still carry title metadata when a different embedding task type is used.
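
The behavior settled on in this thread can be sketched roughly as below. This is an illustration under stated assumptions, not the code that was pushed: `TitleHandlingSketch` and `titleFor` are hypothetical names, and segment metadata is modelled as a plain `Map<String, String>` rather than the library's `Metadata` class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper showing the "title only for RETRIEVAL_DOCUMENT" rule.
class TitleHandlingSketch {

    // Stand-in for the TaskType enum introduced by the PR.
    enum TaskType {
        RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING
    }

    // Returns the title to send with the embedding request, or null when none should be sent.
    // titleMetadataKey plays the role of the configurable key discussed above.
    static String titleFor(Map<String, String> metadata, TaskType taskType, String titleMetadataKey) {
        if (taskType != TaskType.RETRIEVAL_DOCUMENT) {
            // Per the discussion, a title sent with any other task type makes the call fail,
            // so it is left out even if the metadata contains one.
            return null;
        }
        return metadata.get(titleMetadataKey);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("title", "Text embeddings");

        System.out.println(titleFor(metadata, TaskType.RETRIEVAL_DOCUMENT, "title"));
        System.out.println(titleFor(metadata, TaskType.CLASSIFICATION, "title"));
    }
}
```

Keeping the check in one place means callers do not have to strip title metadata from their segments before switching task types.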

coderabbitai

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits

Files that changed from the base of the PR and between 4d74d0d and 58b16e1.

Files selected for processing (2)

langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModel.java (7 hunks)
langchain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java (2 hunks)

Files skipped from review as they are similar to previous changes (1)

langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModel.java

Additional comments: 5

langchain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java (5)

3-3: The import of Metadata is appropriate given its usage in the new tests. This aligns with the PR's objectives to enhance embedding calculations with additional metadata.

11-11: Replacing specific imports with a wildcard import from java.util can lead to less readable code by obscuring the origin of used classes. Consider reverting to specific imports if the number of imports from java.util does not significantly reduce clutter.

13-13: Static imports for TaskType and utilities like asList and assertThat improve readability by reducing verbosity in test methods. This is a good practice for test code where clarity and conciseness are valuable.

165-260: The newly added test method testEmbeddingTask effectively validates the embedding functionality across different task types (SEMANTIC_SIMILARITY, CLASSIFICATION, RETRIEVAL_DOCUMENT) as intended by the PR objectives. The use of Metadata and the titleMetadataKey customization demonstrates the enhanced flexibility introduced in the library. However, consider adding assertions for more specific properties of the embeddings (if applicable) to ensure that the task type affects the embeddings as expected.
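
One possible way to follow up on that last suggestion is to compare the vectors returned for the same text under two task types, for example with cosine similarity. The sketch below assumes the embedding vectors are available as plain `float[]` arrays; `EmbeddingAssertionSketch` and `cosineSimilarity` are hypothetical names, no model is called, and whether two task types actually produce measurably different vectors would have to be checked against the real service.

```java
// Hypothetical helper for comparing two embedding vectors in a test.
class EmbeddingAssertionSketch {

    // Cosine similarity: dot(a, b) / (|a| * |b|); 1.0 means the vectors point the same way.
    static double cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up vectors standing in for embeddings of the same text under two task types.
        float[] forSimilarity = {0.1f, 0.7f, 0.2f};
        float[] forClassification = {0.3f, 0.5f, 0.4f};
        System.out.println(cosineSimilarity(forSimilarity, forClassification));
    }
}
```

A test could then assert that the similarity between the two vectors is below 1.0 by some margin, in addition to the existing dimension checks.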

1-16: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [98-98]

The use of java.util.Random in createRandomSegments is appropriate for the context of generating test data. Since this usage does not involve security-sensitive operations like password or token generation, using java.util.Random is acceptable. However, for security-sensitive contexts, consider using java.security.SecureRandom.

coderabbitai

Review Status

Actionable comments generated: 2

Configuration used: CodeRabbit UI

Commits

Files that changed from the base of the PR and between 58b16e1 and 78ec41c.

Files selected for processing (2)

langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModel.java (7 hunks)
langchain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java (2 hunks)

Files skipped from review as they are similar to previous changes (1)

langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModel.java

Additional comments: 1

langchain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java (1)

1-16: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [98-98]

Detected use of java.util.Random in the createRandomSegments method. While this is acceptable for generating test data, be mindful that Random is not suitable for generating cryptographically secure random numbers. If the context ever shifts towards security-sensitive applications, consider using java.security.SecureRandom instead.

coderabbitai · 2024-03-11T08:32:42Z

...chain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java

+import dev.langchain4j.data.document.Metadata;
 import dev.langchain4j.data.embedding.Embedding;
 import dev.langchain4j.data.segment.TextSegment;
 import dev.langchain4j.model.embedding.EmbeddingModel;
 import dev.langchain4j.model.output.Response;
 import dev.langchain4j.model.output.TokenUsage;
 import org.junit.jupiter.api.Test;

-import java.util.Arrays;
-import java.util.Collections;
-import java.util.List;
-import java.util.Random;
+import java.util.*;

+import static dev.langchain4j.model.vertexai.VertexAiEmbeddingModel.TaskType.*;


The changes include adding a specific import for Metadata, replacing specific imports with a wildcard import for java.util, and adding a static import for VertexAiEmbeddingModel.TaskType. While wildcard imports can sometimes make the code cleaner, they can also obscure the origin of classes and lead to conflicts if classes with the same name exist in different packages. Consider reverting to specific imports for clarity and to avoid potential conflicts.

-import java.util.*;
+import java.util.List;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Random;

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change

import dev.langchain4j.data.document.Metadata;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.output.Response;
import dev.langchain4j.model.output.TokenUsage;
import org.junit.jupiter.api.Test;

import java.util.List;
import java.util.Arrays;
import java.util.Collections;
import java.util.Random;

import static dev.langchain4j.model.vertexai.VertexAiEmbeddingModel.TaskType.*;

coderabbitai · 2024-03-11T08:32:42Z

...chain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java

+            .build();
+
+        String text = "Embeddings for Text is the name for the model that supports text embeddings. " +
+            "Text embeddings are a NLP technique that converts textual data into numerical vectors " +
+            "that can be processed by machine learning algorithms, especially large models. `" +
+            "These vector representations are designed to capture the semantic meaning and context " +
+            "of the words they represent.";
+
+        Response<Embedding> embeddedText = model.embed(text);
+
+        assertThat(embeddedText.content().dimension()).isEqualTo(768);
+
+        // Text classification embedding
+
+        TextSegment segment2 = new TextSegment("Text Classification: Training a model that maps " +
+            "the text embeddings to the correct category labels (e.g., cat vs. dog, spam vs. not spam). " +
+            "Once the model is trained, it can be used to classify new text inputs into one or more " +
+            "categories based on their embeddings.",
+            new Metadata());
+
+        model = VertexAiEmbeddingModel.builder()
+            .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
+            .project(System.getenv("GCP_PROJECT_ID"))
+            .location(System.getenv("GCP_LOCATION"))
+            .publisher("google")
+            .modelName("textembedding-gecko@003")
+            .maxRetries(3)
+            .taskType(CLASSIFICATION)
+            .build();
+
+        Response<Embedding> embeddedSegForClassif = model.embed(segment2);
+
+        assertThat(embeddedSegForClassif.content().dimension()).isEqualTo(768);
+
+        // Document retrieval embedding
+
+        Metadata metadata = new Metadata();
+        metadata.add("title", "Text embeddings");
+
+        TextSegment segmentForRetrieval = new TextSegment("Text embeddings can be used to represent both the " +
+            "user's query and the universe of documents in a high-dimensional vector space. Documents " +
+            "that are more semantically similar to the user's query will have a shorter distance in the " +
+            "vector space, and can be ranked higher in the search results.", metadata);
+
+        model = VertexAiEmbeddingModel.builder()
+            .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
+            .project(System.getenv("GCP_PROJECT_ID"))
+            .location(System.getenv("GCP_LOCATION"))
+            .publisher("google")
+            .modelName("textembedding-gecko@003")
+            .maxRetries(3)
+            .taskType(RETRIEVAL_DOCUMENT)
+            .build();
+
+        Response<Embedding> embeddedSegForRetrieval = model.embed(segmentForRetrieval);
+
+        assertThat(embeddedSegForRetrieval.content().dimension()).isEqualTo(768);
+
+        // Choose a custom metadata key instead of "title"
+        // as the embedding model requires "title" to be used only for RETRIEVAL_DOCUMENT
+
+        Metadata metadataCustomTitleKey = new Metadata();
+        metadataCustomTitleKey.add("customTitle", "Text embeddings");
+
+        TextSegment segmentForRetrievalWithCustomKey = new TextSegment("Text embeddings can be used to represent both the " +
+            "user's query and the universe of documents in a high-dimensional vector space. Documents " +
+            "that are more semantically similar to the user's query will have a shorter distance in the " +
+            "vector space, and can be ranked higher in the search results.", metadataCustomTitleKey);
+
+        model = VertexAiEmbeddingModel.builder()
+            .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
+            .project(System.getenv("GCP_PROJECT_ID"))
+            .location(System.getenv("GCP_LOCATION"))
+            .publisher("google")
+            .modelName("textembedding-gecko@003")
+            .maxRetries(3)
+            .taskType(RETRIEVAL_DOCUMENT)
+            .titleMetadataKey("customTitle")
+            .build();
+
+        Response<Embedding> embeddedSegForRetrievalWithCustomKey = model.embed(segmentForRetrievalWithCustomKey);
+
+        assertThat(embeddedSegForRetrievalWithCustomKey.content().dimension()).isEqualTo(768);
+
+        // Check we can use "title" metadata when not using RETRIEVAL_DOCUMENT task
+
+        model = VertexAiEmbeddingModel.builder()
+            .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
+            .project(System.getenv("GCP_PROJECT_ID"))
+            .location(System.getenv("GCP_LOCATION"))
+            .publisher("google")
+            .modelName("textembedding-gecko@003")
+            .maxRetries(3)
+            .titleMetadataKey("customTitle")
+            .build();
+
+        Response<Embedding> embeddedSegTitleKeyNoRetrieval = model.embed(segmentForRetrieval);
+
+        assertThat(embeddedSegTitleKeyNoRetrieval.content().dimension()).isEqualTo(768);
+    }


The newly added tests demonstrate the functionality of embedding calculations with different task types (SEMANTIC_SIMILARITY, CLASSIFICATION, RETRIEVAL_DOCUMENT). These tests are well-structured and provide good coverage for the new feature. However, there's a repeated pattern of building the VertexAiEmbeddingModel with similar parameters across tests. Consider refactoring this setup into a shared method to reduce code duplication and improve maintainability.

+    private VertexAiEmbeddingModel createTestModel(TaskType taskType, String titleMetadataKey) {
+        return VertexAiEmbeddingModel.builder()
+            .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
+            .project(System.getenv("GCP_PROJECT_ID"))
+            .location(System.getenv("GCP_LOCATION"))
+            .publisher("google")
+            .modelName("textembedding-gecko@003")
+            .maxRetries(3)
+            .taskType(taskType)
+            .titleMetadataKey(titleMetadataKey)
+            .build();
+    }

Then, replace the model instantiation in each test with a call to this method.

Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change

    @Test
    void testEmbeddingTask() {
        // Semantic similarity embedding

        VertexAiEmbeddingModel model = VertexAiEmbeddingModel.builder()
            .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
            .project(System.getenv("GCP_PROJECT_ID"))
            .location(System.getenv("GCP_LOCATION"))
            .publisher("google")
            .modelName("textembedding-gecko@003")
            .maxRetries(3)
            .taskType(SEMANTIC_SIMILARITY)
            .build();

        String text = "Embeddings for Text is the name for the model that supports text embeddings. " +
            "Text embeddings are a NLP technique that converts textual data into numerical vectors " +
            "that can be processed by machine learning algorithms, especially large models. `" +
            "These vector representations are designed to capture the semantic meaning and context " +
            "of the words they represent.";

        Response<Embedding> embeddedText = model.embed(text);

        assertThat(embeddedText.content().dimension()).isEqualTo(768);

        // Text classification embedding

        TextSegment segment2 = new TextSegment("Text Classification: Training a model that maps " +
            "the text embeddings to the correct category labels (e.g., cat vs. dog, spam vs. not spam). " +
            "Once the model is trained, it can be used to classify new text inputs into one or more " +
            "categories based on their embeddings.",
            new Metadata());

        model = VertexAiEmbeddingModel.builder()
            .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
            .project(System.getenv("GCP_PROJECT_ID"))
            .location(System.getenv("GCP_LOCATION"))
            .publisher("google")
            .modelName("textembedding-gecko@003")
            .maxRetries(3)
            .taskType(CLASSIFICATION)
            .build();

        Response<Embedding> embeddedSegForClassif = model.embed(segment2);

        assertThat(embeddedSegForClassif.content().dimension()).isEqualTo(768);

        // Document retrieval embedding

        Metadata metadata = new Metadata();
        metadata.add("title", "Text embeddings");

        TextSegment segmentForRetrieval = new TextSegment("Text embeddings can be used to represent both the " +
            "user's query and the universe of documents in a high-dimensional vector space. Documents " +
            "that are more semantically similar to the user's query will have a shorter distance in the " +
            "vector space, and can be ranked higher in the search results.", metadata);

        model = VertexAiEmbeddingModel.builder()
            .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
            .project(System.getenv("GCP_PROJECT_ID"))
            .location(System.getenv("GCP_LOCATION"))
            .publisher("google")
            .modelName("textembedding-gecko@003")
            .maxRetries(3)
            .taskType(RETRIEVAL_DOCUMENT)
            .build();

        Response<Embedding> embeddedSegForRetrieval = model.embed(segmentForRetrieval);

        assertThat(embeddedSegForRetrieval.content().dimension()).isEqualTo(768);

        // Choose a custom metadata key instead of "title"
        // as the embedding model requires "title" to be used only for RETRIEVAL_DOCUMENT

        Metadata metadataCustomTitleKey = new Metadata();
        metadataCustomTitleKey.add("customTitle", "Text embeddings");

        TextSegment segmentForRetrievalWithCustomKey = new TextSegment("Text embeddings can be used to represent both the " +
            "user's query and the universe of documents in a high-dimensional vector space. Documents " +
            "that are more semantically similar to the user's query will have a shorter distance in the " +
            "vector space, and can be ranked higher in the search results.", metadataCustomTitleKey);

        model = VertexAiEmbeddingModel.builder()
            .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
            .project(System.getenv("GCP_PROJECT_ID"))
            .location(System.getenv("GCP_LOCATION"))
            .publisher("google")
            .modelName("textembedding-gecko@003")
            .maxRetries(3)
            .taskType(RETRIEVAL_DOCUMENT)
            .titleMetadataKey("customTitle")
            .build();

        Response<Embedding> embeddedSegForRetrievalWithCustomKey = model.embed(segmentForRetrievalWithCustomKey);

        assertThat(embeddedSegForRetrievalWithCustomKey.content().dimension()).isEqualTo(768);

        // Check we can use "title" metadata when not using RETRIEVAL_DOCUMENT task

        model = VertexAiEmbeddingModel.builder()
            .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
            .project(System.getenv("GCP_PROJECT_ID"))
            .location(System.getenv("GCP_LOCATION"))
            .publisher("google")
            .modelName("textembedding-gecko@003")
            .maxRetries(3)
            .titleMetadataKey("customTitle")
            .build();

        Response<Embedding> embeddedSegTitleKeyNoRetrieval = model.embed(segmentForRetrieval);

        assertThat(embeddedSegTitleKeyNoRetrieval.content().dimension()).isEqualTo(768);
    }

    @Test
    void testEmbeddingTask() {
        // Semantic similarity embedding

        VertexAiEmbeddingModel model = createTestModel(SEMANTIC_SIMILARITY, null);

        String text = "Embeddings for Text is the name for the model that supports text embeddings. " +
            "Text embeddings are a NLP technique that converts textual data into numerical vectors " +
            "that can be processed by machine learning algorithms, especially large models. `" +
            "These vector representations are designed to capture the semantic meaning and context " +
            "of the words they represent.";

        Response<Embedding> embeddedText = model.embed(text);

        assertThat(embeddedText.content().dimension()).isEqualTo(768);

        // Text classification embedding

        TextSegment segment2 = new TextSegment("Text Classification: Training a model that maps " +
            "the text embeddings to the correct category labels (e.g., cat vs. dog, spam vs. not spam). " +
            "Once the model is trained, it can be used to classify new text inputs into one or more " +
            "categories based on their embeddings.",
            new Metadata());

        model = createTestModel(CLASSIFICATION, null);

        Response<Embedding> embeddedSegForClassif = model.embed(segment2);

        assertThat(embeddedSegForClassif.content().dimension()).isEqualTo(768);

        // Document retrieval embedding

        Metadata metadata = new Metadata();
        metadata.add("title", "Text embeddings");

        TextSegment segmentForRetrieval = new TextSegment("Text embeddings can be used to represent both the " +
            "user's query and the universe of documents in a high-dimensional vector space. Documents " +
            "that are more semantically similar to the user's query will have a shorter distance in the " +
            "vector space, and can be ranked higher in the search results.", metadata);

        model = createTestModel(RETRIEVAL_DOCUMENT, null);

        Response<Embedding> embeddedSegForRetrieval = model.embed(segmentForRetrieval);

        assertThat(embeddedSegForRetrieval.content().dimension()).isEqualTo(768);

        // Choose a custom metadata key instead of "title"
        // as the embedding model requires "title" to be used only for RETRIEVAL_DOCUMENT

        Metadata metadataCustomTitleKey = new Metadata();
        metadataCustomTitleKey.add("customTitle", "Text embeddings");

        TextSegment segmentForRetrievalWithCustomKey = new TextSegment("Text embeddings can be used to represent both the " +
            "user's query and the universe of documents in a high-dimensional vector space. Documents " +
            "that are more semantically similar to the user's query will have a shorter distance in the " +
            "vector space, and can be ranked higher in the search results.", metadataCustomTitleKey);

        model = createTestModel(RETRIEVAL_DOCUMENT, "customTitle");

        Response<Embedding> embeddedSegForRetrievalWithCustomKey = model.embed(segmentForRetrievalWithCustomKey);

        assertThat(embeddedSegForRetrievalWithCustomKey.content().dimension()).isEqualTo(768);

        // Check we can use "title" metadata when not using RETRIEVAL_DOCUMENT task

        model = createTestModel(null, "customTitle");

        Response<Embedding> embeddedSegTitleKeyNoRetrieval = model.embed(segmentForRetrieval);

        assertThat(embeddedSegTitleKeyNoRetrieval.content().dimension()).isEqualTo(768);
    }

    private VertexAiEmbeddingModel createTestModel(TaskType taskType, String titleMetadataKey) {
        return VertexAiEmbeddingModel.builder()
            .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
            .project(System.getenv("GCP_PROJECT_ID"))
            .location(System.getenv("GCP_LOCATION"))
            .publisher("google")
            .modelName("textembedding-gecko@003")
            .maxRetries(3)
            .taskType(taskType)
            .titleMetadataKey(titleMetadataKey)
            .build();
    }

langchain4j

@glaforge thanks a lot!

glaforge added 2 commits March 9, 2024 17:50

Specify the different types of tasks for calculating embeddings

7c75eac

Unused import

4d74d0d

coderabbitai bot reviewed Mar 9, 2024

langchain4j previously approved these changes Mar 11, 2024

glaforge added 2 commits March 11, 2024 08:55

Merge branch 'langchain4j:main' into vertexai-embed-conf

3ad339b

Allow setting a custom metadata title for document retrieval

58b16e1

glaforge dismissed langchain4j’s stale review via 58b16e1 March 11, 2024 07:58

coderabbitai bot reviewed Mar 11, 2024

Allow using title metadata when not using document retrieval

78ec41c

coderabbitai bot reviewed Mar 11, 2024

langchain4j approved these changes Mar 11, 2024

langchain4j merged commit f565efc into langchain4j:main Mar 11, 2024
6 checks passed
