Ability to specify the task type for calculating embeddings #722

Merged
merged 5 commits into from
Mar 11, 2024

Conversation

glaforge
Collaborator

@glaforge glaforge commented Mar 9, 2024

It's possible to customize different kinds of tasks for embedding:

  • RETRIEVAL_QUERY
  • RETRIEVAL_DOCUMENT
  • SEMANTIC_SIMILARITY
  • CLASSIFICATION
  • CLUSTERING
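
As a rough illustration of what selecting a task type amounts to, the sketch below models one embedding request instance as a plain map carrying `content` and `task_type`, the field names this PR mentions for VertexAiEmbeddingInstance. `TaskTypeSketch` and `EmbeddingRequestSketch` are hypothetical stand-ins written for this example, not the library's classes, and omitting `task_type` when none is chosen is an assumption of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in mirroring the task types listed above.
enum TaskTypeSketch {
    RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING
}

// Hypothetical helper: builds the fields of one request instance.
class EmbeddingRequestSketch {
    static Map<String, Object> instance(String content, TaskTypeSketch taskType) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("content", content);
        if (taskType != null) {
            // Assumption: task_type is only sent when the caller picked one.
            fields.put("task_type", taskType.name());
        }
        return fields;
    }
}
```

With the real model, the equivalent choice is made once on the builder, as the tests in this PR do with `.taskType(CLASSIFICATION)`.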

Summary by CodeRabbit

  • New Features

    • Introduced TaskType enum for specifying task types in embedding instances, enhancing customization for retrieval, similarity, classification, and clustering tasks.
    • Added taskType parameter to the constructor for setting task type in embedding instances.
  • Tests

    • Added tests for semantic similarity, text classification, and document retrieval embeddings using different task types.

coderabbitai bot commented Mar 9, 2024

Walkthrough

The update involves enhancing the langchain4j-vertex-ai library by introducing a new TaskType enum to categorize tasks and updating tests to cover various embedding tasks, making the library more versatile in handling different types of tasks.

Changes

Files | Change Summary
.../vertexai/VertexAiEmbeddingInstance.java | Added title, task_type fields; Added setTitle, setTaskType; Removed final from content
.../vertexai/VertexAiEmbeddingModel.java | Introduced TaskType enum; Modified constructor to include taskType
.../vertexai/VertexAiEmbeddingModelIT.java | Updated imports; Added tests for various task types

Related issues

  • [FEATURE] GCP VertexAI Ability to use tunned models #440: The enhancements allow for more flexible use of custom and tuned models in Vertex AI, addressing the issue's objective to better integrate tuned models and improve configuration options. The addition of task types and the removal of the final keyword from the content field could indirectly facilitate the use of custom trained LLM models by enabling more dynamic instance creation and modification.

Poem

In the digital fields where code rabbits hop,
New features grow like carrots non-stop. 🥕
With tasks typed and models tuned,
Our library blooms, under the moon. 🌕
Through lines of Java, our changes weave,
Building dreams for AIs to conceive.
Cheers to progress, with leaps and bounds,
In the world of code, where innovation abounds. 🚀

((\
(-.-)
o_(")(")

@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 498145c and 4d74d0d.
Files selected for processing (3)
  • langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingInstance.java (1 hunks)
  • langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModel.java (7 hunks)
  • langchain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java (2 hunks)
Additional comments: 9
langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingInstance.java (3)
  • 5-5: The removal of the final modifier from the content field introduces mutability to the VertexAiEmbeddingInstance objects. Ensure that this change aligns with the intended use cases and consider thread safety if instances are accessed by multiple threads.
  • 6-15: The addition of the title field and its corresponding setter method setTitle enhances the descriptiveness of embedding instances. Ensure that the title field is appropriately used in downstream processes where necessary.
  • 7-19: The introduction of the task_type field and its setter method setTaskType allows for specifying the task type of an embedding instance, enhancing the model's flexibility. Using an enum for task types ensures type safety and improves code readability.
langchain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java (3)
  • 3-13: The changes to imports, including the addition of Metadata and static import for VertexAiEmbeddingModel.TaskType, are appropriate for the added tests. While wildcard imports are generally discouraged, their use in test files is acceptable. Consider using explicit imports for clarity in non-test code.
  • 165-234: The addition of tests for semantic similarity, text classification, and document retrieval embeddings using different task types is commendable. These tests effectively validate the new functionality introduced in the PR. Ensure that these tests cover all edge cases and potential failure scenarios to maintain robustness.
  • 1-16: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [98-98]

The use of java.util.Random for generating test data in createRandomSegments is appropriate and aligns with the intended use case. It's important to note that for security-sensitive operations requiring randomness (e.g., generating tokens or passwords), a cryptographically strong random number generator like SecureRandom should be used instead.
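
The distinction drawn in the note above can be sketched with the two standard JDK classes. `RandomnessSketch` and its method names are hypothetical, chosen only for this example: a seeded `java.util.Random` yields reproducible test data, while `java.security.SecureRandom` is the one to reach for when the value is security-sensitive.

```java
import java.security.SecureRandom;
import java.util.Random;

class RandomnessSketch {
    // Test data: a seeded java.util.Random produces the same sequence on every run.
    static int[] testData(long seed, int count) {
        Random random = new Random(seed);
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = random.nextInt(1000);
        }
        return values;
    }

    // Security-sensitive value: SecureRandom fills the bytes, which are then hex-encoded.
    static String token(int numBytes) {
        byte[] bytes = new byte[numBytes];
        new SecureRandom().nextBytes(bytes);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```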

langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModel.java (3)
  • 74-76: The introduction of the TaskType enum is a positive change, enhancing type safety and code readability by clearly defining the supported task types for embeddings. This approach facilitates easier maintenance and extension of task types in the future.
  • 82-89: > 📝 NOTE

This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [85-114]

The modification to the constructor to include a taskType parameter is a necessary change to support the new functionality of specifying task types for embeddings. This ensures that every VertexAiEmbeddingModel instance is created with a clear task type, aligning with the enhancement goals of the PR.

  • 133-139: The use of taskType in the creation of VertexAiEmbeddingInstance objects within the embedAll method correctly applies the specified task type to each embedding calculation. This implementation ensures that the embedding process is tailored according to the specified task type, aligning with the PR's objectives.

langchain4j
langchain4j previously approved these changes Mar 11, 2024
Owner

@langchain4j langchain4j left a comment

@glaforge thanks!

@@ -122,8 +130,16 @@ public Response<List<Embedding>> embedAll(List<TextSegment> segments) {

List<Value> instances = new ArrayList<>();
for (TextSegment segment : batch) {
    VertexAiEmbeddingInstance embeddingInstance = new VertexAiEmbeddingInstance(segment.text());
    // Title metadata is used for calculating embeddings for document retrieval
    embeddingInstance.setTitle(segment.metadata("title"));
Owner

The docs say title is valid only for RETRIEVAL_DOCUMENT.
It would be nice to document this in the javadoc of this class.
I would also give the user an option to define the metadata key to use instead of hardcoding "title".

Collaborator Author

How would you let the user customize the title property?

Owner

@langchain4j langchain4j Mar 11, 2024

Perhaps something like titleMetadataKey("my_title") in the builder/ctor, but this is not urgent and can be added later. My main concern was that if there is a "title" metadata entry and the task type is not RETRIEVAL_DOCUMENT, the embedding call might fail?

Collaborator Author

Indeed that's a bit weird, but that's what's happening: using title without RETRIEVAL_DOCUMENT would fail.
I'll add the suggested method to customize the metadata key.

Collaborator Author

Pushed the extra builder method.

Owner

Maybe worth adding a condition here? Set title only for RETRIEVAL_DOCUMENT?

Collaborator Author

Good point: pushed a change to set title only for RETRIEVAL_DOCUMENT.
That way, a segment can still carry "title" metadata when a different task type is used, without the embedding call failing.
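
The rule agreed on in this thread can be sketched roughly as follows. This is not the code that was pushed: `TaskKind` and `TitleRuleSketch` are hypothetical stand-ins, metadata is modeled as a plain map, and the fallback to "title" when no custom key is configured is an assumption based on the discussion above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the task types discussed in this thread.
enum TaskKind {
    RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING
}

class TitleRuleSketch {
    // Builds the request fields for one segment; all names here are illustrative.
    static Map<String, Object> instance(String text, Map<String, String> metadata,
                                        TaskKind taskType, String titleMetadataKey) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("content", text);
        if (taskType != null) {
            fields.put("task_type", taskType.name());
        }
        // Assumption: fall back to "title" when no custom key was configured.
        String key = titleMetadataKey != null ? titleMetadataKey : "title";
        String title = metadata.get(key);
        // The title is forwarded only for RETRIEVAL_DOCUMENT, so segments that
        // happen to carry a title do not affect requests for other task types.
        if (taskType == TaskKind.RETRIEVAL_DOCUMENT && title != null) {
            fields.put("title", title);
        }
        return fields;
    }
}
```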

@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 4d74d0d and 58b16e1.
Files selected for processing (2)
  • langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModel.java (7 hunks)
  • langchain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java (2 hunks)
Files skipped from review as they are similar to previous changes (1)
  • langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModel.java
Additional comments: 5
langchain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java (5)
  • 3-3: The import of Metadata is appropriate given its usage in the new tests. This aligns with the PR's objectives to enhance embedding calculations with additional metadata.
  • 11-11: Replacing specific imports with a wildcard import from java.util can lead to less readable code by obscuring the origin of used classes. Consider reverting to specific imports if the number of imports from java.util does not significantly reduce clutter.
  • 13-13: Static imports for TaskType and utilities like asList and assertThat improve readability by reducing verbosity in test methods. This is a good practice for test code where clarity and conciseness are valuable.
  • 165-260: The newly added test method testEmbeddingTask effectively validates the embedding functionality across different task types (SEMANTIC_SIMILARITY, CLASSIFICATION, RETRIEVAL_DOCUMENT) as intended by the PR objectives. The use of Metadata and the titleMetadataKey customization demonstrates the enhanced flexibility introduced in the library. However, consider adding assertions for more specific properties of the embeddings (if applicable) to ensure that the task type affects the embeddings as expected.
  • 1-16: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [98-98]

The use of java.util.Random in createRandomSegments is appropriate for the context of generating test data. Since this usage does not involve security-sensitive operations like password or token generation, using java.util.Random is acceptable. However, for security-sensitive contexts, consider using java.security.SecureRandom.
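
The suggestion above to assert more than the vector dimension in testEmbeddingTask could be approached by comparing embeddings directly. The sketch below is a plain cosine-similarity helper, assuming the raw vectors can be read from the embeddings as float arrays; `EmbeddingAssertSketch` is a hypothetical name and not part of the library.

```java
class EmbeddingAssertSketch {
    // Cosine similarity between two vectors of equal length.
    // Returns NaN if either vector is all zeros, since the norm is then 0.
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

A test could then embed the same text under two task types and check that the similarity is below 1. Whether the vectors actually differ depends on the model, so such an assertion would need to be confirmed against real responses first.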

@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 2

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 58b16e1 and 78ec41c.
Files selected for processing (2)
  • langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModel.java (7 hunks)
  • langchain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java (2 hunks)
Files skipped from review as they are similar to previous changes (1)
  • langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModel.java
Additional comments: 1
langchain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java (1)
  • 1-16: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [98-98]

Detected use of java.util.Random in the createRandomSegments method. While this is acceptable for generating test data, be mindful that Random is not suitable for generating cryptographically secure random numbers. If the context ever shifts towards security-sensitive applications, consider using java.security.SecureRandom instead.

Comment on lines +3 to +13
+import dev.langchain4j.data.document.Metadata;
 import dev.langchain4j.data.embedding.Embedding;
 import dev.langchain4j.data.segment.TextSegment;
 import dev.langchain4j.model.embedding.EmbeddingModel;
 import dev.langchain4j.model.output.Response;
 import dev.langchain4j.model.output.TokenUsage;
 import org.junit.jupiter.api.Test;

-import java.util.Arrays;
-import java.util.Collections;
-import java.util.List;
-import java.util.Random;
+import java.util.*;

+import static dev.langchain4j.model.vertexai.VertexAiEmbeddingModel.TaskType.*;

The changes include adding a specific import for Metadata, replacing specific imports with a wildcard import for java.util, and adding a static import for VertexAiEmbeddingModel.TaskType. While wildcard imports can sometimes make the code cleaner, they can also obscure the origin of classes and lead to conflicts if classes with the same name exist in different packages. Consider reverting to specific imports for clarity and to avoid potential conflicts.

-import java.util.*;
+import java.util.List;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Random;

Comment on lines +165 to +276
.build();

String text = "Embeddings for Text is the name for the model that supports text embeddings. " +
"Text embeddings are a NLP technique that converts textual data into numerical vectors " +
"that can be processed by machine learning algorithms, especially large models. `" +
"These vector representations are designed to capture the semantic meaning and context " +
"of the words they represent.";

Response<Embedding> embeddedText = model.embed(text);

assertThat(embeddedText.content().dimension()).isEqualTo(768);

// Text classification embedding

TextSegment segment2 = new TextSegment("Text Classification: Training a model that maps " +
"the text embeddings to the correct category labels (e.g., cat vs. dog, spam vs. not spam). " +
"Once the model is trained, it can be used to classify new text inputs into one or more " +
"categories based on their embeddings.",
new Metadata());

model = VertexAiEmbeddingModel.builder()
.endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
.project(System.getenv("GCP_PROJECT_ID"))
.location(System.getenv("GCP_LOCATION"))
.publisher("google")
.modelName("textembedding-gecko@003")
.maxRetries(3)
.taskType(CLASSIFICATION)
.build();

Response<Embedding> embeddedSegForClassif = model.embed(segment2);

assertThat(embeddedSegForClassif.content().dimension()).isEqualTo(768);

// Document retrieval embedding

Metadata metadata = new Metadata();
metadata.add("title", "Text embeddings");

TextSegment segmentForRetrieval = new TextSegment("Text embeddings can be used to represent both the " +
"user's query and the universe of documents in a high-dimensional vector space. Documents " +
"that are more semantically similar to the user's query will have a shorter distance in the " +
"vector space, and can be ranked higher in the search results.", metadata);

model = VertexAiEmbeddingModel.builder()
.endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
.project(System.getenv("GCP_PROJECT_ID"))
.location(System.getenv("GCP_LOCATION"))
.publisher("google")
.modelName("textembedding-gecko@003")
.maxRetries(3)
.taskType(RETRIEVAL_DOCUMENT)
.build();

Response<Embedding> embeddedSegForRetrieval = model.embed(segmentForRetrieval);

assertThat(embeddedSegForRetrieval.content().dimension()).isEqualTo(768);

// Choose a custom metadata key instead of "title"
// as the embedding model requires "title" to be used only for RETRIEVAL_DOCUMENT

Metadata metadataCustomTitleKey = new Metadata();
metadataCustomTitleKey.add("customTitle", "Text embeddings");

TextSegment segmentForRetrievalWithCustomKey = new TextSegment("Text embeddings can be used to represent both the " +
"user's query and the universe of documents in a high-dimensional vector space. Documents " +
"that are more semantically similar to the user's query will have a shorter distance in the " +
"vector space, and can be ranked higher in the search results.", metadataCustomTitleKey);

model = VertexAiEmbeddingModel.builder()
.endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
.project(System.getenv("GCP_PROJECT_ID"))
.location(System.getenv("GCP_LOCATION"))
.publisher("google")
.modelName("textembedding-gecko@003")
.maxRetries(3)
.taskType(RETRIEVAL_DOCUMENT)
.titleMetadataKey("customTitle")
.build();

Response<Embedding> embeddedSegForRetrievalWithCustomKey = model.embed(segmentForRetrievalWithCustomKey);

assertThat(embeddedSegForRetrievalWithCustomKey.content().dimension()).isEqualTo(768);

// Check we can use "title" metadata when not using RETRIEVAL_DOCUMENT task

model = VertexAiEmbeddingModel.builder()
.endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
.project(System.getenv("GCP_PROJECT_ID"))
.location(System.getenv("GCP_LOCATION"))
.publisher("google")
.modelName("textembedding-gecko@003")
.maxRetries(3)
.titleMetadataKey("customTitle")
.build();

Response<Embedding> embeddedSegTitleKeyNoRetrieval = model.embed(segmentForRetrieval);

assertThat(embeddedSegTitleKeyNoRetrieval.content().dimension()).isEqualTo(768);
}

The newly added tests demonstrate the functionality of embedding calculations with different task types (SEMANTIC_SIMILARITY, CLASSIFICATION, RETRIEVAL_DOCUMENT). These tests are well-structured and provide good coverage for the new feature. However, there's a repeated pattern of building the VertexAiEmbeddingModel with similar parameters across tests. Consider refactoring this setup into a shared method to reduce code duplication and improve maintainability.

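+ // TaskType is qualified here because the test only statically imports its constants.
+ private VertexAiEmbeddingModel createTestModel(VertexAiEmbeddingModel.TaskType taskType, String titleMetadataKey) {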
+     return VertexAiEmbeddingModel.builder()
+         .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
+         .project(System.getenv("GCP_PROJECT_ID"))
+         .location(System.getenv("GCP_LOCATION"))
+         .publisher("google")
+         .modelName("textembedding-gecko@003")
+         .maxRetries(3)
+         .taskType(taskType)
+         .titleMetadataKey(titleMetadataKey)
+         .build();
+ }

Then, replace the model instantiation in each test with a call to this method.


Committable suggestion

Suggested change
@Test
void testEmbeddingTask() {
    // Semantic similarity embedding
    VertexAiEmbeddingModel model = createTestModel(SEMANTIC_SIMILARITY, null);

    String text = "Embeddings for Text is the name for the model that supports text embeddings. " +
        "Text embeddings are a NLP technique that converts textual data into numerical vectors " +
        "that can be processed by machine learning algorithms, especially large models. `" +
        "These vector representations are designed to capture the semantic meaning and context " +
        "of the words they represent.";

    Response<Embedding> embeddedText = model.embed(text);
    assertThat(embeddedText.content().dimension()).isEqualTo(768);

    // Text classification embedding
    TextSegment segment2 = new TextSegment("Text Classification: Training a model that maps " +
        "the text embeddings to the correct category labels (e.g., cat vs. dog, spam vs. not spam). " +
        "Once the model is trained, it can be used to classify new text inputs into one or more " +
        "categories based on their embeddings.",
        new Metadata());

    model = createTestModel(CLASSIFICATION, null);
    Response<Embedding> embeddedSegForClassif = model.embed(segment2);
    assertThat(embeddedSegForClassif.content().dimension()).isEqualTo(768);

    // Document retrieval embedding
    Metadata metadata = new Metadata();
    metadata.add("title", "Text embeddings");

    TextSegment segmentForRetrieval = new TextSegment("Text embeddings can be used to represent both the " +
        "user's query and the universe of documents in a high-dimensional vector space. Documents " +
        "that are more semantically similar to the user's query will have a shorter distance in the " +
        "vector space, and can be ranked higher in the search results.", metadata);

    model = createTestModel(RETRIEVAL_DOCUMENT, null);
    Response<Embedding> embeddedSegForRetrieval = model.embed(segmentForRetrieval);
    assertThat(embeddedSegForRetrieval.content().dimension()).isEqualTo(768);

    // Choose a custom metadata key instead of "title"
    // as the embedding model requires "title" to be used only for RETRIEVAL_DOCUMENT
    Metadata metadataCustomTitleKey = new Metadata();
    metadataCustomTitleKey.add("customTitle", "Text embeddings");

    TextSegment segmentForRetrievalWithCustomKey = new TextSegment("Text embeddings can be used to represent both the " +
        "user's query and the universe of documents in a high-dimensional vector space. Documents " +
        "that are more semantically similar to the user's query will have a shorter distance in the " +
        "vector space, and can be ranked higher in the search results.", metadataCustomTitleKey);

    model = createTestModel(RETRIEVAL_DOCUMENT, "customTitle");
    Response<Embedding> embeddedSegForRetrievalWithCustomKey = model.embed(segmentForRetrievalWithCustomKey);
    assertThat(embeddedSegForRetrievalWithCustomKey.content().dimension()).isEqualTo(768);

    // Check we can use "title" metadata when not using RETRIEVAL_DOCUMENT task
    model = createTestModel(null, "customTitle");
    Response<Embedding> embeddedSegTitleKeyNoRetrieval = model.embed(segmentForRetrieval);
    assertThat(embeddedSegTitleKeyNoRetrieval.content().dimension()).isEqualTo(768);
}

// TaskType is qualified here because the test only statically imports its constants.
private VertexAiEmbeddingModel createTestModel(VertexAiEmbeddingModel.TaskType taskType, String titleMetadataKey) {
    return VertexAiEmbeddingModel.builder()
        .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
        .project(System.getenv("GCP_PROJECT_ID"))
        .location(System.getenv("GCP_LOCATION"))
        .publisher("google")
        .modelName("textembedding-gecko@003")
        .maxRetries(3)
        .taskType(taskType)
        .titleMetadataKey(titleMetadataKey)
        .build();
}

Owner

@langchain4j langchain4j left a comment

@glaforge thanks a lot!

@langchain4j langchain4j merged commit f565efc into langchain4j:main Mar 11, 2024
6 checks passed