Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Elasticsearch: Add approximative kNN search in addition to Cosine similarity #712

Merged
merged 46 commits into from
Aug 20, 2024

Conversation

dadoonet
Copy link
Contributor

@dadoonet dadoonet commented Mar 7, 2024

The current implementation uses a slow function score query which is known to be slow because you have to iterate over all the hits to compute the score.

Elasticsearch in the recent versions, now also allow using knn search.

This PR now allows to change the configuration:

// Use the Knn implementation
ElasticsearchEmbeddingStore store = ElasticsearchEmbeddingStore.builder()
  .configuration(ElasticsearchConfigurationKnn.builder().build())
  .restClient(restClient)
  .build();

// Note that this configuration is optional as it's the default behavior. So this code is equivalent:
ElasticsearchEmbeddingStore store = ElasticsearchEmbeddingStore.builder()
  .restClient(restClient)
  .build();

// Use the Scripting implementation (slower)
ElasticsearchEmbeddingStore store = ElasticsearchEmbeddingStore.builder()
  .configuration(ElasticsearchConfigurationScript.builder().build())
  .restClient(restClient)
  .build();

In more details, this is what this PR is bringing:

  • Add a configuration new option in the builder. This helps to choose the implementation to use. It could be ElasticsearchConfigurationScript (the behavior before this change) or ElasticsearchConfigurationKnn for the new Knn implementation
  • The Knn implementation (ElasticsearchConfigurationKnn) is now the default implementation
  • Removes the dimension parameter as Elasticsearch can guess automatically this value based on the first document indexed. And anyway, it's better to let the user create an index template or the index mapping he wants.
  • Modify the IT. In which case if a local instance of Elasticsearch is yet running on https://localhost:9200, we will use it by default and start Testcontainers only if needed.
  • Make Elasticsearch container secured (which is closer to production use case)
  • We now use the elastic.version pom.xml property to launch the right Elasticsearch TestContainer.
  • Replaced the removeAll() implementation from a DeleteByQuery(match all query) to a Delete index API call which is always the recommended way to remove all documents. Note that this does not recreate an index.
  • ElasticsearchEmbeddingStoreRemoveIT now extends the common removal test class EmbeddingStoreWithRemovalIT. But we need to implement in the future a EmbeddingStoreWithRemovalIT.wait_for_ready() method to allow even more common tests.
  • Add more logs in implementation and tests

@langchain4j
Copy link
Collaborator

Hi @dadoonet, thanks a lot for these improvements! I will take a closer look at this PR a bit later, but please note that elastic search implementation was changed in #610 and is planned to be merged tomorrow. So there could be some conflicts. Also, could you please take a look at it, as it seems that you have a lot of experience with elasticsearch? :) Thanks!

@dadoonet
Copy link
Contributor Author

dadoonet commented Mar 7, 2024

I gave a quick look at #610 and that looks ok to me. The biggest conflict I'll have to resolve will be about IT with the new cloud tests. But I will figure this out.
So let's wait for #610 to be merged and then I will rebase my code and fix.

Note that the test part needs Java9+ but I can probably fix that in another way...
@dadoonet
Copy link
Contributor Author

dadoonet commented Apr 18, 2024

I updated my branch so it's up to date with the main branch.
But note that I modified the Java version to 9. I can change that but I'd not recommend anyone nowadays starting a new project with Java8 anyway. That's probably another discussion but I'd encourage the project to move to at least Java17 (or Java11) as a target...

Also it's not complete yet as I need to revisit the ElasticsearchEmbeddingStoreCloudIT. The goal is to run all the Knn and "legacy" tests whatever the env is.

@langchain4j
Copy link
Collaborator

Hi @dadoonet migration to java 17 is in plans, but for now please keep it java 8 compatible, thank you!

@dadoonet
Copy link
Contributor Author

I have a question about the design. The current state of the PR introduces a new class named ElasticsearchKnnEmbeddingStore. With the same "semantic" as the ElasticsearchEmbeddingStore class.

So it's a "drop-in" replacement. From:

ElasticsearchEmbeddingStore.builder()
                .restClient(restClient)
                .indexName(indexName)
                .build();

To:

ElasticsearchKnnEmbeddingStore.builder()
                .restClient(restClient)
                .indexName(indexName)
                .build();

I'm wondering if @langchain4j would prefer an option within the builder. Which mean one "single class" ElasticsearchEmbeddingStore and then just call something like:

ElasticsearchEmbeddingStore.builder()
                .restClient(restClient)
                .indexName(indexName)
                .useKnn(true)
                .build();

And then internally we choose the right implementation.

My guts feeling goes for the former, but I prefer to ask.

@dadoonet
Copy link
Contributor Author

I spoke with @swallez and he proposed another alternative which is the best world of the 2:

ElasticsearchEmbeddingStore.builder()
                .restClient(restClient)
                .indexName(indexName)
                .config(XXX)
                .build();

Where XXX is a configuration class. Like ElasticsearchKnnConfiguration or ElasticsearchScriptConfiguration.

@dadoonet
Copy link
Contributor Author

dadoonet commented Apr 19, 2024

When #610 has been added, it assumed some scores in tests when doing filtering.
The problem I have yet with the Elasticsearch Knn implementation is that Elasticsearch is adding the score of the query "match_all" (1.0) and the knn score (1.0). That gives a total score of 2.0 whereas the tests are expecting a score of 1.0.

java.lang.AssertionError: 
Expecting actual:
  2.0
to be close to:
  1.0
by less than 1% but difference was 100.0%.
(a difference of exactly 1% being considered valid)

I think we should not try to rely on the score but more on the ranking of the results.

@langchain4j
Copy link
Collaborator

@dadoonet from the user perspective, I guess a single ElasticsearchEmbeddingStore would probably be more convenient.
I guess I like the idea with separate configurations in the builder.

Regarding scores: current API expects EmbeddingMatch.score to always be in [0.0, 1.0] range, all implementations follow this.
All implementations are using cosine similarity ([0.0, 2.0]) as a metric. It can be then converted into score using RelevanceScore.fromCosineSimilarity(). It basically just squishes [0.0, 2.0] into [0.0, 1.0] range.

Elasticsearch is adding the score of the query "match_all" (1.0) and the knn score (1.0)

I am not sure I understand how/why this works like this. Is it true that match_all gives all documents the constant score of 1.0? In this case you could just subtract 1.0 from the final score?

Does this problem arise only for the should_filter_by_metadata test or for others as well?

@langchain4j langchain4j added the P2 High priority label Apr 23, 2024
@swallez
Copy link

swallez commented Apr 25, 2024

(disclosure: I'm a co-maintainer of the Elasticsearch Java client at Elastic)

I guess I like the idea with separate configurations in the builder.

The idea behind a config object is that ultimately we want to expose the full range of vector search features of Elasticsearch. So having a single ElastisearchEmbeddingStore is a way to provide a single entry point (with sane defaults), and a configuration (or a set of configuration classes) allows to provide the flexibility needed for more specific user needs.

I am not sure I understand how/why this works like this.

About scoring, there are multiple aspects:

  • approximate search is not limited to cosine distance, as there are various ways to measure vector similarity. The cosine, euclidian, dot product distances are in known ranges and can be cerced to [-1, 1]. The max_inner_product distance however doesn't have a defined range: it depends whether the vectors are normalized or not (on this topic I haven't found information about whether EmbeddingModel should return normalized vectors). Other distance measurements are in the works with hamming distance for 1 bit quantization.

  • adding a filter can also affect the score. In Elasticsearch the filter can be used either as a filter that just restricts the search space and doesn't affect the score, or a query, which then results in hybrid scoring (see RRF) and then the score isn't in a well-defined range. Maybe we should consider LangChain4J's filter as just a filter (doesn't affect the score), but ultimately we would like to have ElasticsearchEmbeddingStore allow hybrid search, and in that case the score value range is undefined.

@dadoonet
Copy link
Contributor Author

So I moved the filters within the knn query (as pre filters) and I hope that they won't affect the score anymore.
But I need elastic/elasticsearch-java#774 to be fixed first ;)

@dadoonet
Copy link
Contributor Author

And I think that for the integration tests, we should either rely on the rank of each document but not on the score, or may be have a way for each IT to overload the expected value if needed as the behavior might be different depending on the embedding store implementation.

@langchain4j
Copy link
Collaborator

langchain4j commented Apr 26, 2024

Hi @swallez, thanks a lot for your comments!

The idea behind a config object is that ultimately we want to expose the full range of vector search features of Elasticsearch. So having a single ElastisearchEmbeddingStore is a way to provide a single entry point (with sane defaults), and a configuration (or a set of configuration classes) allows to provide the flexibility needed for more specific user needs.

Single entry point sounds good!

approximate search is not limited to cosine distance, as there are various ways to measure vector similarity.

Disclosure: I am by no means an expert in this field, so with a high probability my reasoning can be wrong here.
I am aware of other similarity/distance measures, but at the time of designing this API cosine similarity and dot product with normalized vectors were considered the most popular and meaningful options for "semantic search" use cases. At that point of time I wanted to:

  1. keep the API simple and hide the complexity of various measures from the user
  2. make the implementations of EmbeddingStore interchangeable

So as of now, the EmbeddingStore API contract is such that it always returns (hopefully intuitive) "relevance score" in the range of 0..1. This means that it also (implicitly) assumes that integrations use cosine similarity/distance or dot product under the hood, which can be converted from/to "relevance score". This was simply the common ground between all the implementations of this API. There are now 19 integrations that all follow this contract. I know this is limiting for some advanced use cases, so I am open to discussions/suggestions on how to improve here. But I did not hear any complains in this area yet.
In your practical experience, which measures are the most used now for "semantic search" use cases?

on this topic I haven't found information about whether EmbeddingModel should return normalized vectors

Most of the integrations (if not all) return normalized vectors, but API does not guarantee this. The only way to guarantee it is to check and normalize them (if not normalized) before returning them to the user, but:

  1. EmbeddingModel is a low-level API and ideally should not modify outputs of the underlying integration
  2. Normalizing vectors in java 8 is probably not the best idea
  3. AFAIK many vector stores can normalize embeddings on their side, so I think there is no need to handle this in LangChain4j

Maybe we should consider LangChain4J's filter as just a filter (doesn't affect the score)

I think this is a good idea, Filter in LangChain4j assumes strict filtering, not "penalize the score if this condition is not met".

we would like to have ElasticsearchEmbeddingStore allow hybrid search, and in that case the score value range is undefined

EmbeddingStore API does not assume hybrid search (but maybe we should revisit this, a bit later). For hybrid search I would recommend to consider implementing ContentRetriever instead which does not put any constraints on inputs (e.g. query should be a vector and/or a string) and outputs (result should include scores). Please see AzureAiSearchContentRetriever as an example.

@langchain4j
Copy link
Collaborator

Hi @dadoonet, please see my response to @swallez as well.

And I think that for the integration tests, we should either rely on the rank of each document but not on the score, or may be have a way for each IT to overload the expected value if needed as the behavior might be different depending on the embedding store implementation.

It would be great if we could keep ElasticsearchEmbeddingStore in line with other 18 integrations, which all implement the same contract.

The problem I have yet with the Elasticsearch Knn implementation is that Elasticsearch is adding the score of the query "match_all" (1.0) and the knn score (1.0). That gives a total score of 2.0 whereas the tests are expecting a score of 1.0.

I have commented on this above:

Is it true that match_all gives all documents the constant score of 1.0? In this case you could just subtract 1.0 from the final score?

Will "always substracting 1 from the score" work and return score in 0..1 range?

@langchain4j langchain4j changed the title Add Knn search in addition to Cosine similarity Elasticsearch: Add Knn search in addition to Cosine similarity Apr 29, 2024
Copy link
Collaborator

@langchain4j langchain4j left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dadoonet thanks a lot!

@langchain4j
Copy link
Collaborator

Hey @langchain4j. I think this PR is now ready for review. I updated the original description.

In the future, we would also like to add more flexibility for the field names used behind the scene. With something like:

ElasticsearchConfigurationKnn.builder()
   .vectorField("my_vector")
   .textField("my_text")
   .metadataField("my_metadata")
   .build()

But I did not want to add this in this PR.

Also, I added restClient(restClient) to provide a restClient instead of providing a URL, login/pwd... It's much more flexible and allows using self-signed certificates instead of disabling security for tests.

I kept the settings in the ElasticsearchEmbeddingStore class instead of moving everything to the new Configuration classes. I think it's ok as I'd prefer having specific settings only in ElasticsearchConfigurationXXX classes if any.

But I could definitely move everything to ElasticsearchConfigurationXXX:

ElasticsearchEmbeddingStore store = ElasticsearchEmbeddingStore.builder()
  .configuration(ElasticsearchConfigurationKnn.builder().build())
  .restClient(restClient)
  .build();

would become:

ElasticsearchEmbeddingStore store = ElasticsearchEmbeddingStore.builder()
  .configuration(ElasticsearchConfigurationKnn.builder()
      .restClient(restClient)
      .build())
  .build();

Let me know what you think ;)

This looks good!

ElasticsearchEmbeddingStore store = ElasticsearchEmbeddingStore.builder()
  .configuration(ElasticsearchConfigurationKnn.builder().build())
  .restClient(restClient)
  .build();

* Put `dimension` back but as `@Deprecated`
* Throw `ElasticsearchRequestFailedException` in case of any Elasticsearch exception and remove the call to `log.error()` in that case
* Remove the `deleteIndex()` which was one single line and called only once
@dadoonet
Copy link
Contributor Author

dadoonet commented Aug 2, 2024

While working on the documentation, I realized that I will need to change in a follow up PR the code in - ElasticsearchEmbeddingStoreExample

Instead of using username/password/apikey, we should have the user providing a rest client. It's much more flexible as we might have to deal with:

* provided SSL certificates
* self-signed certificates
* apiKey
* username and password although not recommended)
@dadoonet
Copy link
Contributor Author

dadoonet commented Aug 2, 2024

@langchain4j I added also some documentation to make the Elasticsearch page a bit less "empty" ;)

@dadoonet dadoonet changed the title Elasticsearch: Add Knn search in addition to Cosine similarity Elasticsearch: Add approximative kNN search in addition to Cosine similarity Aug 2, 2024
@dadoonet
Copy link
Contributor Author

dadoonet commented Aug 5, 2024

@langchain4j Could you approve the workflow so we can test the code on the CI?

@langchain4j
Copy link
Collaborator

@dadoonet this CI workflow does not run integration tests (yet), so please test it locally.

@dadoonet
Copy link
Contributor Author

dadoonet commented Aug 5, 2024

I guess we are all good ;)

[INFO] --- failsafe:3.1.2:integration-test (default) @ langchain4j-elasticsearch ---
[INFO] Using auto detected provider org.apache.maven.surefire.junitplatform.JUnitPlatformProvider
[INFO] 
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running dev.langchain4j.store.embedding.elasticsearch.ElasticsearchKnnEmbeddingStoreIT
15:42:49,567 INFO  [d.l.s.e.e.ElasticsearchClientHelper] Starting testcontainers with Elasticsearch [8.14.3].
15:43:03,399 INFO  [d.l.s.e.e.ElasticsearchClientHelper] Found Elasticsearch cluster version [8.14.3] running at [https://localhost:62145].
[INFO] Tests run: 103, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 30.46 s -- in dev.langchain4j.store.embedding.elasticsearch.ElasticsearchKnnEmbeddingStoreIT
[INFO] Running dev.langchain4j.store.embedding.elasticsearch.ElasticsearchEmbeddingStoreIT
15:43:19,826 INFO  [d.l.s.e.e.ElasticsearchClientHelper] Starting testcontainers with Elasticsearch [8.14.3].
15:43:31,803 INFO  [d.l.s.e.e.ElasticsearchClientHelper] Found Elasticsearch cluster version [8.14.3] running at [https://localhost:62175].
[INFO] Tests run: 103, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 23.36 s -- in dev.langchain4j.store.embedding.elasticsearch.ElasticsearchEmbeddingStoreIT
[INFO] Running dev.langchain4j.store.embedding.elasticsearch.ElasticsearchEmbeddingStoreRemoveIT
15:43:43,185 INFO  [d.l.s.e.e.ElasticsearchClientHelper] Starting testcontainers with Elasticsearch [8.14.3].
15:43:53,924 INFO  [d.l.s.e.e.ElasticsearchClientHelper] Found Elasticsearch cluster version [8.14.3] running at [https://localhost:62179].
[INFO] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 21.03 s -- in dev.langchain4j.store.embedding.elasticsearch.ElasticsearchEmbeddingStoreRemoveIT
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 218, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] 
[INFO] --- failsafe:3.1.2:verify (default) @ langchain4j-elasticsearch ---
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:18 min
[INFO] Finished at: 2024-08-05T15:44:04+02:00
[INFO] ------------------------------------------------------------------------

@langchain4j
Copy link
Collaborator

@dadoonet awesome, thanks! I will check the PR once again asap

Copy link
Collaborator

@langchain4j langchain4j left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dadoonet good job, thank you!

@langchain4j langchain4j merged commit 8f6a140 into langchain4j:main Aug 20, 2024
1 of 6 checks passed
langchain4j pushed a commit that referenced this pull request Aug 20, 2024
…ilarity (#712)

Fixed the following version conflict:
Rule 0: org.apache.maven.enforcer.rules.dependency.DependencyConvergence failed with message:
Failed while enforcing releasability.

Dependency convergence error for jakarta.json:jakarta.json-api:jar:2.0.1 paths to dependency are:
+-dev.langchain4j:langchain4j-elasticsearch:jar:0.34.0-SNAPSHOT
  +-co.elastic.clients:elasticsearch-java:jar:8.14.3:compile
    +-jakarta.json:jakarta.json-api:jar:2.0.1:compile
and
+-dev.langchain4j:langchain4j-elasticsearch:jar:0.34.0-SNAPSHOT
  +-co.elastic.clients:elasticsearch-java:jar:8.14.3:compile
    +-org.eclipse.parsson:parsson:jar:1.0.5:compile
      +-jakarta.json:jakarta.json-api:jar:2.0.2:compile
@dadoonet
Copy link
Contributor Author

Amazing! Thanks for your patience and for reviewing! :)

@dadoonet dadoonet deleted the use-knn branch August 20, 2024 09:59
dadoonet added a commit to dadoonet/langchain4j-spring that referenced this pull request Aug 29, 2024
This new implementation expects to have an Elasticsearch Rest Client bean (recommended) or it will create it otherwise.

This updates brings:

* The support for the new way to implement the Elasticsearch embbeding store as we now pass a rest client instead of "just" properties (deprecated in langchain4j/langchain4j#712)
* The removal of previously needed number of dimensions (deprecated in langchain4j/langchain4j#712)
* We add a `checkSslCertificates` property which should be only used in tests.

About tests:

* We don't disable security anymore as it's not a good practice. We should educate users instead.
* We don't wait anymore for 1s for documents to be available, but instead we call the refresh API to make the documents immediately visible. So we change the `awaitUntilPersisted()` method to `awaitUntilPersisted(ApplicationContext)` so we can fetch the Elasticsearch Rest Client bean.
* We allow testing against a running cluster (local/cloud) and set from the CLI the URL (`ELASTICSEARCH_URL`), api key (`ELASTICSEARCH_API_KEY`), username (`ELASTICSEARCH_USERNAME`) and password (`ELASTICSEARCH_PASSWORD`). Note that when running with a self-signed certificate, the BASE64 version of the CA should be provided (`ELASTICSEARCH_CA_CERTIFICATE`).
* We fetch the version of the cluster used to run the tests from the `pom.xml` file using `elastic.version` property.
* We upgrade testcontainers to 1.20.1
langchain4j pushed a commit to langchain4j/langchain4j-spring that referenced this pull request Sep 2, 2024
This new implementation expects to have an Elasticsearch Rest Client
bean (recommended) or it will create it otherwise.

This updates brings:

* The support for the new way to implement the Elasticsearch embbeding
store as we now pass a rest client instead of "just" properties
(deprecated in langchain4j/langchain4j#712)
* The removal of previously needed number of dimensions (deprecated in
langchain4j/langchain4j#712)
* We add a `checkSslCertificates` property which should be only used in
tests.
* We add a `caCertificateAsBase64String` property which can be used in
tests when using a self-signed certificate.

About tests:

* We don't disable security anymore as it's not a good practice. We
should educate users instead.
* We don't wait anymore for 1s for documents to be available, but
instead we call the refresh API to make the documents immediately
visible. So we change the `awaitUntilPersisted()` method to
`awaitUntilPersisted(ApplicationContext)` so we can fetch the
Elasticsearch Rest Client bean.
* We allow testing against a running cluster (local/cloud) and set from
the CLI the URL (`ELASTICSEARCH_URL`), api key
(`ELASTICSEARCH_API_KEY`), username (`ELASTICSEARCH_USERNAME`) and
password (`ELASTICSEARCH_PASSWORD`). Note that when running with a
self-signed certificate, the BASE64 version of the CA should be provided
(`ELASTICSEARCH_CA_CERTIFICATE`).
* We fetch the version of the cluster used to run the tests from the
`pom.xml` file using `elastic.version` property.
* We upgrade testcontainers to 1.20.1
@dadoonet
Copy link
Contributor Author

@langchain4j Could you add the Elasticsearch label on this one? ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Elasticsearch P2 High priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants