Ability to specify the task type for calculating embeddings #722
Review Status
Actionable comments generated: 0
Configuration used: CodeRabbit UI
Files selected for processing (3)
- langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingInstance.java (1 hunks)
- langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModel.java (7 hunks)
- langchain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java (2 hunks)
Additional comments: 9
langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingInstance.java (3)
- 5-5: The removal of the `final` modifier from the `content` field introduces mutability to the `VertexAiEmbeddingInstance` objects. Ensure that this change aligns with the intended use cases and consider thread safety if instances are accessed by multiple threads.
- 6-15: The addition of the `title` field and its corresponding setter method `setTitle` enhances the descriptiveness of embedding instances. Ensure that the `title` field is appropriately used in downstream processes where necessary.
- 7-19: The introduction of the `task_type` field and its setter method `setTaskType` allows for specifying the task type of an embedding instance, enhancing the model's flexibility. Using an enum for task types ensures type safety and improves code readability.
langchain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java (3)
- 3-13: The changes to imports, including the addition of `Metadata` and static import for `VertexAiEmbeddingModel.TaskType`, are appropriate for the added tests. While wildcard imports are generally discouraged, their use in test files is acceptable. Consider using explicit imports for clarity in non-test code.
- 165-234: The addition of tests for semantic similarity, text classification, and document retrieval embeddings using different task types is commendable. These tests effectively validate the new functionality introduced in the PR. Ensure that these tests cover all edge cases and potential failure scenarios to maintain robustness.
- 1-16: > 📝 NOTE
This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [98-98]
The use of `java.util.Random` for generating test data in `createRandomSegments` is appropriate and aligns with the intended use case. It's important to note that for security-sensitive operations requiring randomness (e.g., generating tokens or passwords), a cryptographically strong random number generator like `SecureRandom` should be used instead.
langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModel.java (3)
- 74-76: The introduction of the `TaskType` enum is a positive change, enhancing type safety and code readability by clearly defining the supported task types for embeddings. This approach facilitates easier maintenance and extension of task types in the future.
- 82-89: > 📝 NOTE
This review was outside the diff hunks and was mapped to the diff hunk with the greatest overlap. Original lines [85-114]
The modification to the constructor to include a `taskType` parameter is a necessary change to support the new functionality of specifying task types for embeddings. This ensures that every `VertexAiEmbeddingModel` instance is created with a clear task type, aligning with the enhancement goals of the PR.
- 133-139: The use of `taskType` in the creation of `VertexAiEmbeddingInstance` objects within the `embedAll` method correctly applies the specified task type to each embedding calculation. This implementation ensures that the embedding process is tailored according to the specified task type, aligning with the PR's objectives.
@glaforge thanks!
@@ -122,8 +130,16 @@ public Response<List<Embedding>> embedAll(List<TextSegment> segments) {

List<Value> instances = new ArrayList<>();
for (TextSegment segment : batch) {
    VertexAiEmbeddingInstance embeddingInstance = new VertexAiEmbeddingInstance(segment.text());
    // Title metadata is used for calculating embeddings for document retrieval
    embeddingInstance.setTitle(segment.metadata("title"));
The documentation says `title` is valid only for `RETRIEVAL_DOCUMENT`.
It would be nice to document this in the Javadoc of this class.
I would also give the user an option to define the metadata key to use instead of hardcoding "title".
How would you let the user customize the `title` property?
Perhaps something like `titleMetadataKey("my_title")` in the builder/ctor, but this is not urgent and can be added later. My main concern was that if there is a "title" metadata entry and the task type is not `RETRIEVAL_DOCUMENT`, the embedding call might fail?
Indeed that's a bit weird, but that's what's happening: using `title` without `RETRIEVAL_DOCUMENT` would fail.
I'll add the suggested method to customize the metadata key.
Pushed the extra builder method.
Maybe worth adding a condition here? Set `title` only for `RETRIEVAL_DOCUMENT`?
Good point: pushed a change to set `title` only for `RETRIEVAL_DOCUMENT`.
That way, it's still possible to use `title` when not using that embedding task type.
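The behaviour agreed on in this thread can be sketched in isolation. This is a minimal, self-contained illustration and not the actual langchain4j code: `TitleSketch`, `toInstance`, the plain `Map` used for both metadata and the request instance, and the fallback to "title" when no key is configured are all assumptions made for the example. Only the field names `content`, `title`, `task_type`, the `titleMetadataKey` option, and the task type names come from the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the PR's logic, for illustration only.
final class TitleSketch {

    // Task type names as listed in the PR description.
    enum TaskType { RETRIEVAL_QUERY, RETRIEVAL_DOCUMENT, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING }

    // Builds the fields of one embedding instance. The title is copied from
    // the metadata only when the task type is RETRIEVAL_DOCUMENT, so a title
    // entry in the metadata is ignored for every other task type.
    static Map<String, String> toInstance(String text,
                                          Map<String, String> metadata,
                                          TaskType taskType,
                                          String titleMetadataKey) {
        // Assumed default: fall back to "title" when no custom key is given.
        String key = titleMetadataKey != null ? titleMetadataKey : "title";

        Map<String, String> instance = new LinkedHashMap<>();
        instance.put("content", text);
        if (taskType != null) {
            instance.put("task_type", taskType.name());
        }
        if (taskType == TaskType.RETRIEVAL_DOCUMENT) {
            String title = metadata.get(key);
            if (title != null) {
                instance.put("title", title);
            }
        }
        return instance;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = Map.of("title", "Text embeddings");
        // Title is included for document retrieval...
        System.out.println(toInstance("some text", metadata, TaskType.RETRIEVAL_DOCUMENT, "title"));
        // ...and left out for any other task type.
        System.out.println(toInstance("some text", metadata, TaskType.CLASSIFICATION, "title"));
    }
}
```

With a guard like this, a segment that carries a "title" metadata entry can be embedded under any task type without the title being sent along.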
Review Status
Actionable comments generated: 0
Configuration used: CodeRabbit UI
Files selected for processing (2)
- langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModel.java (7 hunks)
- langchain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java (2 hunks)
Files skipped from review as they are similar to previous changes (1)
- langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModel.java
Additional comments: 5
langchain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java (5)
- 3-3: The import of `Metadata` is appropriate given its usage in the new tests. This aligns with the PR's objectives to enhance embedding calculations with additional metadata.
- 11-11: Replacing specific imports with a wildcard import from `java.util` can lead to less readable code by obscuring the origin of used classes. Consider reverting to specific imports if the number of imports from `java.util` does not significantly reduce clutter.
- 13-13: Static imports for `TaskType` and utilities like `asList` and `assertThat` improve readability by reducing verbosity in test methods. This is a good practice for test code where clarity and conciseness are valuable.
- 165-260: The newly added test method `testEmbeddingTask` effectively validates the embedding functionality across different task types (`SEMANTIC_SIMILARITY`, `CLASSIFICATION`, `RETRIEVAL_DOCUMENT`) as intended by the PR objectives. The use of `Metadata` and the `titleMetadataKey` customization demonstrates the enhanced flexibility introduced in the library. However, consider adding assertions for more specific properties of the embeddings (if applicable) to ensure that the task type affects the embeddings as expected.
- 1-16: > 📝 NOTE
This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [98-98]
The use of `java.util.Random` in `createRandomSegments` is appropriate for the context of generating test data. Since this usage does not involve security-sensitive operations like password or token generation, using `java.util.Random` is acceptable. However, for security-sensitive contexts, consider using `java.security.SecureRandom`.
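The distinction drawn in this note can be illustrated with a short sketch. The class and method names below (`RandomSketch`, `randomTestText`, `randomToken`) are hypothetical and are not taken from the PR; the body of `createRandomSegments` is not shown in this thread, so nothing here claims to reproduce it.

```java
import java.security.SecureRandom;
import java.util.Random;

// Hypothetical illustration of when each generator is appropriate.
final class RandomSketch {

    // Test data: a seeded java.util.Random yields the same text for the
    // same seed, which keeps test inputs reproducible.
    static String randomTestText(long seed, int length) {
        Random random = new Random(seed);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('a' + random.nextInt(26)));
        }
        return sb.toString();
    }

    // Security-sensitive values (tokens, passwords): SecureRandom draws
    // from a cryptographically strong source and cannot be seeded into
    // a predictable sequence the way Random can.
    static String randomToken(int numBytes) {
        byte[] bytes = new byte[numBytes];
        new SecureRandom().nextBytes(bytes);
        StringBuilder sb = new StringBuilder(numBytes * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(randomTestText(42L, 16));
        System.out.println(randomToken(16));
    }
}
```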
Review Status
Actionable comments generated: 2
Configuration used: CodeRabbit UI
Files selected for processing (2)
- langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModel.java (7 hunks)
- langchain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java (2 hunks)
Files skipped from review as they are similar to previous changes (1)
- langchain4j-vertex-ai/src/main/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModel.java
Additional comments: 1
langchain4j-vertex-ai/src/test/java/dev/langchain4j/model/vertexai/VertexAiEmbeddingModelIT.java (1)
- 1-16: > 📝 NOTE
This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [98-98]
Detected use of `java.util.Random` in the `createRandomSegments` method. While this is acceptable for generating test data, be mindful that `Random` is not suitable for generating cryptographically secure random numbers. If the context ever shifts towards security-sensitive applications, consider using `java.security.SecureRandom` instead.
+import dev.langchain4j.data.document.Metadata;
 import dev.langchain4j.data.embedding.Embedding;
 import dev.langchain4j.data.segment.TextSegment;
 import dev.langchain4j.model.embedding.EmbeddingModel;
 import dev.langchain4j.model.output.Response;
 import dev.langchain4j.model.output.TokenUsage;
 import org.junit.jupiter.api.Test;

-import java.util.Arrays;
-import java.util.Collections;
-import java.util.List;
-import java.util.Random;
+import java.util.*;

+import static dev.langchain4j.model.vertexai.VertexAiEmbeddingModel.TaskType.*;
The changes include adding a specific import for `Metadata`, replacing specific imports with a wildcard import for `java.util`, and adding a static import for `VertexAiEmbeddingModel.TaskType`. While wildcard imports can sometimes make the code cleaner, they can also obscure the origin of classes and lead to conflicts if classes with the same name exist in different packages. Consider reverting to specific imports for clarity and to avoid potential conflicts.
-import java.util.*;
+import java.util.List;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Random;
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
-import dev.langchain4j.data.document.Metadata;
-import dev.langchain4j.data.embedding.Embedding;
-import dev.langchain4j.data.segment.TextSegment;
-import dev.langchain4j.model.embedding.EmbeddingModel;
-import dev.langchain4j.model.output.Response;
-import dev.langchain4j.model.output.TokenUsage;
-import org.junit.jupiter.api.Test;
-import java.util.Arrays;
-import java.util.Collections;
-import java.util.List;
-import java.util.Random;
-import java.util.*;
-import static dev.langchain4j.model.vertexai.VertexAiEmbeddingModel.TaskType.*;
+import dev.langchain4j.data.document.Metadata;
+import dev.langchain4j.data.embedding.Embedding;
+import dev.langchain4j.data.segment.TextSegment;
+import dev.langchain4j.model.embedding.EmbeddingModel;
+import dev.langchain4j.model.output.Response;
+import dev.langchain4j.model.output.TokenUsage;
+import org.junit.jupiter.api.Test;
+import java.util.List;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Random;
+import static dev.langchain4j.model.vertexai.VertexAiEmbeddingModel.TaskType.*;
                .build();

        String text = "Embeddings for Text is the name for the model that supports text embeddings. " +
                "Text embeddings are a NLP technique that converts textual data into numerical vectors " +
                "that can be processed by machine learning algorithms, especially large models. `" +
                "These vector representations are designed to capture the semantic meaning and context " +
                "of the words they represent.";

        Response<Embedding> embeddedText = model.embed(text);

        assertThat(embeddedText.content().dimension()).isEqualTo(768);

        // Text classification embedding

        TextSegment segment2 = new TextSegment("Text Classification: Training a model that maps " +
                "the text embeddings to the correct category labels (e.g., cat vs. dog, spam vs. not spam). " +
                "Once the model is trained, it can be used to classify new text inputs into one or more " +
                "categories based on their embeddings.",
                new Metadata());

        model = VertexAiEmbeddingModel.builder()
                .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
                .project(System.getenv("GCP_PROJECT_ID"))
                .location(System.getenv("GCP_LOCATION"))
                .publisher("google")
                .modelName("textembedding-gecko@003")
                .maxRetries(3)
                .taskType(CLASSIFICATION)
                .build();

        Response<Embedding> embeddedSegForClassif = model.embed(segment2);

        assertThat(embeddedSegForClassif.content().dimension()).isEqualTo(768);

        // Document retrieval embedding

        Metadata metadata = new Metadata();
        metadata.add("title", "Text embeddings");

        TextSegment segmentForRetrieval = new TextSegment("Text embeddings can be used to represent both the " +
                "user's query and the universe of documents in a high-dimensional vector space. Documents " +
                "that are more semantically similar to the user's query will have a shorter distance in the " +
                "vector space, and can be ranked higher in the search results.", metadata);

        model = VertexAiEmbeddingModel.builder()
                .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
                .project(System.getenv("GCP_PROJECT_ID"))
                .location(System.getenv("GCP_LOCATION"))
                .publisher("google")
                .modelName("textembedding-gecko@003")
                .maxRetries(3)
                .taskType(RETRIEVAL_DOCUMENT)
                .build();

        Response<Embedding> embeddedSegForRetrieval = model.embed(segmentForRetrieval);

        assertThat(embeddedSegForRetrieval.content().dimension()).isEqualTo(768);

        // Choose a custom metadata key instead of "title"
        // as the embedding model requires "title" to be used only for RETRIEVAL_DOCUMENT

        Metadata metadataCustomTitleKey = new Metadata();
        metadataCustomTitleKey.add("customTitle", "Text embeddings");

        TextSegment segmentForRetrievalWithCustomKey = new TextSegment("Text embeddings can be used to represent both the " +
                "user's query and the universe of documents in a high-dimensional vector space. Documents " +
                "that are more semantically similar to the user's query will have a shorter distance in the " +
                "vector space, and can be ranked higher in the search results.", metadataCustomTitleKey);

        model = VertexAiEmbeddingModel.builder()
                .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
                .project(System.getenv("GCP_PROJECT_ID"))
                .location(System.getenv("GCP_LOCATION"))
                .publisher("google")
                .modelName("textembedding-gecko@003")
                .maxRetries(3)
                .taskType(RETRIEVAL_DOCUMENT)
                .titleMetadataKey("customTitle")
                .build();

        Response<Embedding> embeddedSegForRetrievalWithCustomKey = model.embed(segmentForRetrievalWithCustomKey);

        assertThat(embeddedSegForRetrievalWithCustomKey.content().dimension()).isEqualTo(768);

        // Check we can use "title" metadata when not using RETRIEVAL_DOCUMENT task

        model = VertexAiEmbeddingModel.builder()
                .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
                .project(System.getenv("GCP_PROJECT_ID"))
                .location(System.getenv("GCP_LOCATION"))
                .publisher("google")
                .modelName("textembedding-gecko@003")
                .maxRetries(3)
                .titleMetadataKey("customTitle")
                .build();

        Response<Embedding> embeddedSegTitleKeyNoRetrieval = model.embed(segmentForRetrieval);

        assertThat(embeddedSegTitleKeyNoRetrieval.content().dimension()).isEqualTo(768);
    }
The newly added tests demonstrate the functionality of embedding calculations with different task types (`SEMANTIC_SIMILARITY`, `CLASSIFICATION`, `RETRIEVAL_DOCUMENT`). These tests are well-structured and provide good coverage for the new feature. However, there's a repeated pattern of building the `VertexAiEmbeddingModel` with similar parameters across tests. Consider refactoring this setup into a shared method to reduce code duplication and improve maintainability.
+ private VertexAiEmbeddingModel createTestModel(TaskType taskType, String titleMetadataKey) {
+ return VertexAiEmbeddingModel.builder()
+ .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
+ .project(System.getenv("GCP_PROJECT_ID"))
+ .location(System.getenv("GCP_LOCATION"))
+ .publisher("google")
+ .modelName("textembedding-gecko@003")
+ .maxRetries(3)
+ .taskType(taskType)
+ .titleMetadataKey(titleMetadataKey)
+ .build();
+ }
Then, replace the model instantiation in each test with a call to this method.
Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
-    @Test
-    void testEmbeddingTask() {
-        // Semantic similarity embedding
-        VertexAiEmbeddingModel model = VertexAiEmbeddingModel.builder()
-                .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
-                .project(System.getenv("GCP_PROJECT_ID"))
-                .location(System.getenv("GCP_LOCATION"))
-                .publisher("google")
-                .modelName("textembedding-gecko@003")
-                .maxRetries(3)
-                .taskType(SEMANTIC_SIMILARITY)
-                .build();
-        String text = "Embeddings for Text is the name for the model that supports text embeddings. " +
-                "Text embeddings are a NLP technique that converts textual data into numerical vectors " +
-                "that can be processed by machine learning algorithms, especially large models. `" +
-                "These vector representations are designed to capture the semantic meaning and context " +
-                "of the words they represent.";
-        Response<Embedding> embeddedText = model.embed(text);
-        assertThat(embeddedText.content().dimension()).isEqualTo(768);
-        // Text classification embedding
-        TextSegment segment2 = new TextSegment("Text Classification: Training a model that maps " +
-                "the text embeddings to the correct category labels (e.g., cat vs. dog, spam vs. not spam). " +
-                "Once the model is trained, it can be used to classify new text inputs into one or more " +
-                "categories based on their embeddings.",
-                new Metadata());
-        model = VertexAiEmbeddingModel.builder()
-                .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
-                .project(System.getenv("GCP_PROJECT_ID"))
-                .location(System.getenv("GCP_LOCATION"))
-                .publisher("google")
-                .modelName("textembedding-gecko@003")
-                .maxRetries(3)
-                .taskType(CLASSIFICATION)
-                .build();
-        Response<Embedding> embeddedSegForClassif = model.embed(segment2);
-        assertThat(embeddedSegForClassif.content().dimension()).isEqualTo(768);
-        // Document retrieval embedding
-        Metadata metadata = new Metadata();
-        metadata.add("title", "Text embeddings");
-        TextSegment segmentForRetrieval = new TextSegment("Text embeddings can be used to represent both the " +
-                "user's query and the universe of documents in a high-dimensional vector space. Documents " +
-                "that are more semantically similar to the user's query will have a shorter distance in the " +
-                "vector space, and can be ranked higher in the search results.", metadata);
-        model = VertexAiEmbeddingModel.builder()
-                .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
-                .project(System.getenv("GCP_PROJECT_ID"))
-                .location(System.getenv("GCP_LOCATION"))
-                .publisher("google")
-                .modelName("textembedding-gecko@003")
-                .maxRetries(3)
-                .taskType(RETRIEVAL_DOCUMENT)
-                .build();
-        Response<Embedding> embeddedSegForRetrieval = model.embed(segmentForRetrieval);
-        assertThat(embeddedSegForRetrieval.content().dimension()).isEqualTo(768);
-        // Choose a custom metadata key instead of "title"
-        // as the embedding model requires "title" to be used only for RETRIEVAL_DOCUMENT
-        Metadata metadataCustomTitleKey = new Metadata();
-        metadataCustomTitleKey.add("customTitle", "Text embeddings");
-        TextSegment segmentForRetrievalWithCustomKey = new TextSegment("Text embeddings can be used to represent both the " +
-                "user's query and the universe of documents in a high-dimensional vector space. Documents " +
-                "that are more semantically similar to the user's query will have a shorter distance in the " +
-                "vector space, and can be ranked higher in the search results.", metadataCustomTitleKey);
-        model = VertexAiEmbeddingModel.builder()
-                .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
-                .project(System.getenv("GCP_PROJECT_ID"))
-                .location(System.getenv("GCP_LOCATION"))
-                .publisher("google")
-                .modelName("textembedding-gecko@003")
-                .maxRetries(3)
-                .taskType(RETRIEVAL_DOCUMENT)
-                .titleMetadataKey("customTitle")
-                .build();
-        Response<Embedding> embeddedSegForRetrievalWithCustomKey = model.embed(segmentForRetrievalWithCustomKey);
-        assertThat(embeddedSegForRetrievalWithCustomKey.content().dimension()).isEqualTo(768);
-        // Check we can use "title" metadata when not using RETRIEVAL_DOCUMENT task
-        model = VertexAiEmbeddingModel.builder()
-                .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
-                .project(System.getenv("GCP_PROJECT_ID"))
-                .location(System.getenv("GCP_LOCATION"))
-                .publisher("google")
-                .modelName("textembedding-gecko@003")
-                .maxRetries(3)
-                .titleMetadataKey("customTitle")
-                .build();
-        Response<Embedding> embeddedSegTitleKeyNoRetrieval = model.embed(segmentForRetrieval);
-        assertThat(embeddedSegTitleKeyNoRetrieval.content().dimension()).isEqualTo(768);
-    }
+    @Test
+    void testEmbeddingTask() {
+        // Semantic similarity embedding
+        VertexAiEmbeddingModel model = createTestModel(SEMANTIC_SIMILARITY, null);
+        String text = "Embeddings for Text is the name for the model that supports text embeddings. " +
+                "Text embeddings are a NLP technique that converts textual data into numerical vectors " +
+                "that can be processed by machine learning algorithms, especially large models. `" +
+                "These vector representations are designed to capture the semantic meaning and context " +
+                "of the words they represent.";
+        Response<Embedding> embeddedText = model.embed(text);
+        assertThat(embeddedText.content().dimension()).isEqualTo(768);
+        // Text classification embedding
+        TextSegment segment2 = new TextSegment("Text Classification: Training a model that maps " +
+                "the text embeddings to the correct category labels (e.g., cat vs. dog, spam vs. not spam). " +
+                "Once the model is trained, it can be used to classify new text inputs into one or more " +
+                "categories based on their embeddings.",
+                new Metadata());
+        model = createTestModel(CLASSIFICATION, null);
+        Response<Embedding> embeddedSegForClassif = model.embed(segment2);
+        assertThat(embeddedSegForClassif.content().dimension()).isEqualTo(768);
+        // Document retrieval embedding
+        Metadata metadata = new Metadata();
+        metadata.add("title", "Text embeddings");
+        TextSegment segmentForRetrieval = new TextSegment("Text embeddings can be used to represent both the " +
+                "user's query and the universe of documents in a high-dimensional vector space. Documents " +
+                "that are more semantically similar to the user's query will have a shorter distance in the " +
+                "vector space, and can be ranked higher in the search results.", metadata);
+        model = createTestModel(RETRIEVAL_DOCUMENT, null);
+        Response<Embedding> embeddedSegForRetrieval = model.embed(segmentForRetrieval);
+        assertThat(embeddedSegForRetrieval.content().dimension()).isEqualTo(768);
+        // Choose a custom metadata key instead of "title"
+        // as the embedding model requires "title" to be used only for RETRIEVAL_DOCUMENT
+        Metadata metadataCustomTitleKey = new Metadata();
+        metadataCustomTitleKey.add("customTitle", "Text embeddings");
+        TextSegment segmentForRetrievalWithCustomKey = new TextSegment("Text embeddings can be used to represent both the " +
+                "user's query and the universe of documents in a high-dimensional vector space. Documents " +
+                "that are more semantically similar to the user's query will have a shorter distance in the " +
+                "vector space, and can be ranked higher in the search results.", metadataCustomTitleKey);
+        model = createTestModel(RETRIEVAL_DOCUMENT, "customTitle");
+        Response<Embedding> embeddedSegForRetrievalWithCustomKey = model.embed(segmentForRetrievalWithCustomKey);
+        assertThat(embeddedSegForRetrievalWithCustomKey.content().dimension()).isEqualTo(768);
+        // Check we can use "title" metadata when not using RETRIEVAL_DOCUMENT task
+        model = createTestModel(null, "customTitle");
+        Response<Embedding> embeddedSegTitleKeyNoRetrieval = model.embed(segmentForRetrieval);
+        assertThat(embeddedSegTitleKeyNoRetrieval.content().dimension()).isEqualTo(768);
+    }
+    private VertexAiEmbeddingModel createTestModel(TaskType taskType, String titleMetadataKey) {
+        return VertexAiEmbeddingModel.builder()
+                .endpoint(System.getenv("GCP_VERTEXAI_ENDPOINT"))
+                .project(System.getenv("GCP_PROJECT_ID"))
+                .location(System.getenv("GCP_LOCATION"))
+                .publisher("google")
+                .modelName("textembedding-gecko@003")
+                .maxRetries(3)
+                .taskType(taskType)
+                .titleMetadataKey(titleMetadataKey)
+                .build();
+    }
@glaforge thanks a lot!
It's possible to customize different kinds of tasks for embedding:
- RETRIEVAL_QUERY
- RETRIEVAL_DOCUMENT
- SEMANTIC_SIMILARITY
- CLASSIFICATION
- CLUSTERING
Summary by CodeRabbit
New Features
- `TaskType` enum for specifying task types in embedding instances, enhancing customization for retrieval, similarity, classification, and clustering tasks.
- `taskType` parameter to the constructor for setting task type in embedding instances.
Tests