-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Elasticsearch: Add approximative kNN search in addition to Cosine similarity #712
Conversation
Hi @dadoonet, thanks a lot for these improvements! I will take a closer look at this PR a bit later, but please note that elastic search implementation was changed in #610 and is planned to be merged tomorrow. So there could be some conflicts. Also, could you please take a look at it, as it seems that you have a lot of experience with elasticsearch? :) Thanks! |
Note that the test part needs Java9+ but I can probably fix that in another way...
I updated my branch so it's up to date with the main branch. Also it's not complete yet as I need to revisit the |
Hi @dadoonet migration to java 17 is in plans, but for now please keep it java 8 compatible, thank you! |
This is failing and needs to be fixed. Also move the cloud tests to the abstract class as we should test everything wherever Elasticsearch is running.
I have a question about the design. The current state of the PR introduces a new class named So it's a "drop-in" replacement. From: ElasticsearchEmbeddingStore.builder()
.restClient(restClient)
.indexName(indexName)
.build(); To: ElasticsearchKnnEmbeddingStore.builder()
.restClient(restClient)
.indexName(indexName)
.build(); I'm wondering if @langchain4j would prefer an option within the builder. Which mean one "single class" ElasticsearchEmbeddingStore.builder()
.restClient(restClient)
.indexName(indexName)
.useKnn(true)
.build(); And then internally we choose the right implementation. My guts feeling goes for the former, but I prefer to ask. |
I spoke with @swallez and he proposed another alternative which is the best world of the 2: ElasticsearchEmbeddingStore.builder()
.restClient(restClient)
.indexName(indexName)
.config(XXX)
.build(); Where |
When #610 has been added, it assumed some scores in tests when doing filtering.
I think we should not try to rely on the score but more on the ranking of the results. |
This is making the tests running much faster!
@dadoonet from the user perspective, I guess a single Regarding scores: current API expects
I am not sure I understand how/why this works like this. Is it true that Does this problem arise only for the |
(disclosure: I'm a co-maintainer of the Elasticsearch Java client at Elastic)
The idea behind a config object is that ultimately we want to expose the full range of vector search features of Elasticsearch. So having a single
About scoring, there are multiple aspects:
|
So I moved the filters within the knn query (as pre filters) and I hope that they won't affect the score anymore. |
And I think that for the integration tests, we should either rely on the rank of each document but not on the score, or may be have a way for each IT to overload the expected value if needed as the behavior might be different depending on the embedding store implementation. |
Hi @swallez, thanks a lot for your comments!
Single entry point sounds good!
Disclosure: I am by no means an expert in this field, so with a high probability my reasoning can be wrong here.
So as of now, the
Most of the integrations (if not all) return normalized vectors, but API does not guarantee this. The only way to guarantee it is to check and normalize them (if not normalized) before returning them to the user, but:
I think this is a good idea,
|
Hi @dadoonet, please see my response to @swallez as well.
It would be great if we could keep
I have commented on this above:
Will "always substracting 1 from the score" work and return score in 0..1 range? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dadoonet thanks a lot!
...src/main/java/dev/langchain4j/store/embedding/elasticsearch/ElasticsearchEmbeddingStore.java
Show resolved
Hide resolved
...src/main/java/dev/langchain4j/store/embedding/elasticsearch/ElasticsearchEmbeddingStore.java
Show resolved
Hide resolved
...src/main/java/dev/langchain4j/store/embedding/elasticsearch/ElasticsearchEmbeddingStore.java
Outdated
Show resolved
Hide resolved
...src/main/java/dev/langchain4j/store/embedding/elasticsearch/ElasticsearchEmbeddingStore.java
Outdated
Show resolved
Hide resolved
...src/main/java/dev/langchain4j/store/embedding/elasticsearch/ElasticsearchEmbeddingStore.java
Outdated
Show resolved
Hide resolved
...src/main/java/dev/langchain4j/store/embedding/elasticsearch/ElasticsearchEmbeddingStore.java
Outdated
Show resolved
Hide resolved
...src/main/java/dev/langchain4j/store/embedding/elasticsearch/ElasticsearchEmbeddingStore.java
Outdated
Show resolved
Hide resolved
...src/main/java/dev/langchain4j/store/embedding/elasticsearch/ElasticsearchEmbeddingStore.java
Show resolved
Hide resolved
...src/main/java/dev/langchain4j/store/embedding/elasticsearch/ElasticsearchEmbeddingStore.java
Show resolved
Hide resolved
.../java/dev/langchain4j/store/embedding/elasticsearch/ElasticsearchEmbeddingStoreRemoveIT.java
Show resolved
Hide resolved
This looks good! ElasticsearchEmbeddingStore store = ElasticsearchEmbeddingStore.builder()
.configuration(ElasticsearchConfigurationKnn.builder().build())
.restClient(restClient)
.build(); |
* Put `dimension` back but as `@Deprecated` * Throw `ElasticsearchRequestFailedException` in case of any Elasticsearch exception and remove the call to `log.error()` in that case * Remove the `deleteIndex()` which was one single line and called only once
While working on the documentation, I realized that I will need to change in a follow up PR the code in - ElasticsearchEmbeddingStoreExample |
Instead of using username/password/apikey, we should have the user providing a rest client. It's much more flexible as we might have to deal with: * provided SSL certificates * self-signed certificates * apiKey * username and password although not recommended)
@langchain4j I added also some documentation to make the Elasticsearch page a bit less "empty" ;) |
@langchain4j Could you approve the workflow so we can test the code on the CI? |
@dadoonet this CI workflow does not run integration tests (yet), so please test it locally. |
I guess we are all good ;)
|
@dadoonet awesome, thanks! I will check the PR once again asap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dadoonet good job, thank you!
…ilarity (#712) Fixed the following version conflict: Rule 0: org.apache.maven.enforcer.rules.dependency.DependencyConvergence failed with message: Failed while enforcing releasability. Dependency convergence error for jakarta.json:jakarta.json-api:jar:2.0.1 paths to dependency are: +-dev.langchain4j:langchain4j-elasticsearch:jar:0.34.0-SNAPSHOT +-co.elastic.clients:elasticsearch-java:jar:8.14.3:compile +-jakarta.json:jakarta.json-api:jar:2.0.1:compile and +-dev.langchain4j:langchain4j-elasticsearch:jar:0.34.0-SNAPSHOT +-co.elastic.clients:elasticsearch-java:jar:8.14.3:compile +-org.eclipse.parsson:parsson:jar:1.0.5:compile +-jakarta.json:jakarta.json-api:jar:2.0.2:compile
Amazing! Thanks for your patience and for reviewing! :) |
This new implementation expects to have an Elasticsearch Rest Client bean (recommended) or it will create it otherwise. This updates brings: * The support for the new way to implement the Elasticsearch embbeding store as we now pass a rest client instead of "just" properties (deprecated in langchain4j/langchain4j#712) * The removal of previously needed number of dimensions (deprecated in langchain4j/langchain4j#712) * We add a `checkSslCertificates` property which should be only used in tests. About tests: * We don't disable security anymore as it's not a good practice. We should educate users instead. * We don't wait anymore for 1s for documents to be available, but instead we call the refresh API to make the documents immediately visible. So we change the `awaitUntilPersisted()` method to `awaitUntilPersisted(ApplicationContext)` so we can fetch the Elasticsearch Rest Client bean. * We allow testing against a running cluster (local/cloud) and set from the CLI the URL (`ELASTICSEARCH_URL`), api key (`ELASTICSEARCH_API_KEY`), username (`ELASTICSEARCH_USERNAME`) and password (`ELASTICSEARCH_PASSWORD`). Note that when running with a self-signed certificate, the BASE64 version of the CA should be provided (`ELASTICSEARCH_CA_CERTIFICATE`). * We fetch the version of the cluster used to run the tests from the `pom.xml` file using `elastic.version` property. * We upgrade testcontainers to 1.20.1
This new implementation expects to have an Elasticsearch Rest Client bean (recommended) or it will create it otherwise. This updates brings: * The support for the new way to implement the Elasticsearch embbeding store as we now pass a rest client instead of "just" properties (deprecated in langchain4j/langchain4j#712) * The removal of previously needed number of dimensions (deprecated in langchain4j/langchain4j#712) * We add a `checkSslCertificates` property which should be only used in tests. * We add a `caCertificateAsBase64String` property which can be used in tests when using a self-signed certificate. About tests: * We don't disable security anymore as it's not a good practice. We should educate users instead. * We don't wait anymore for 1s for documents to be available, but instead we call the refresh API to make the documents immediately visible. So we change the `awaitUntilPersisted()` method to `awaitUntilPersisted(ApplicationContext)` so we can fetch the Elasticsearch Rest Client bean. * We allow testing against a running cluster (local/cloud) and set from the CLI the URL (`ELASTICSEARCH_URL`), api key (`ELASTICSEARCH_API_KEY`), username (`ELASTICSEARCH_USERNAME`) and password (`ELASTICSEARCH_PASSWORD`). Note that when running with a self-signed certificate, the BASE64 version of the CA should be provided (`ELASTICSEARCH_CA_CERTIFICATE`). * We fetch the version of the cluster used to run the tests from the `pom.xml` file using `elastic.version` property. * We upgrade testcontainers to 1.20.1
@langchain4j Could you add the |
The current implementation uses a slow function score query which is known to be slow because you have to iterate over all the hits to compute the score.
Elasticsearch in the recent versions, now also allow using knn search.
This PR now allows to change the configuration:
In more details, this is what this PR is bringing:
configuration
new option in the builder. This helps to choose the implementation to use. It could beElasticsearchConfigurationScript
(the behavior before this change) orElasticsearchConfigurationKnn
for the new Knn implementationElasticsearchConfigurationKnn
) is now the default implementationdimension
parameter as Elasticsearch can guess automatically this value based on the first document indexed. And anyway, it's better to let the user create an index template or the index mapping he wants.elastic.version
pom.xml
property to launch the right Elasticsearch TestContainer.removeAll()
implementation from aDeleteByQuery(match all query)
to a Delete index API call which is always the recommended way to remove all documents. Note that this does not recreate an index.ElasticsearchEmbeddingStoreRemoveIT
now extends the common removal test classEmbeddingStoreWithRemovalIT
. But we need to implement in the future aEmbeddingStoreWithRemovalIT.wait_for_ready()
method to allow even more common tests.