[AL-2275] DeepLakeVector module #2313
Conversation
- need better typing across classes and in the documentation
- added a few comments
- generic question: is there any chance we can add a quick index (FAISS or HNSW)?
```python
        self.dataset.summary()
        return ids

    def query(
```
I think this is less of a query and more of a search functionality.
```python
        texts: Iterable[str],
        metadatas: Optional[List[dict]] = None,
        ids: Optional[List[str]] = None,
        embeddings: Optional[Any] = None,
```
IMO we need to make this extensible so we can add more tensors later, maybe via kwargs that translate into custom tensors.
In none of LangChain, LlamaIndex, or AutoGPT does the user have control over the tensors they create. So I think adding this is not urgent, and I will do it first thing in the next deeplake vector search module release.
Sounds good. Keep in mind our key differentiator is that we support more than a single modality, so our API should reflect that while staying compliant with the existing interface.
```python
            }
        )

    def _create_elements(self, ids, texts, metadatas, embeddings):
```
This should also take kwargs for custom tensors; maybe add types as well?
Will add types, but as I mentioned above I think this functionality can wait until the next release.
kk
```python
            batched,
            self.dataset,
            num_workers=min(self.num_workers, len(batched) // max(self.num_workers, 1)),
            _embedding_function=self._embedding_function,
```
eval needs fault-tolerance parameters that are missing here. In LangChain I pass kwargs to let customers specify them, but IMO we should handle it ourselves.
deeplake/core/vectorstore/utils.py (Outdated)
```python
    return dataset


def create_deeplake_dataset(dataset_path, token, **kwargs):
```
I think we should not cement ourselves into this fixed structure of tensors; users should be able to dynamically add arbitrary datasets later.
```python
        view = self.dataset.filter(lambda x: x["ids"].data()["value"] in ids)
        ids = list(view.sample_indices)

        if filter:
```
How does TQL handle filters? This should not be here if TQL is running it.
```python
            ids (List[str]): List of document IDs
        """
        elements = self._create_elements(ids, texts, metadatas, embeddings)
        self._run_data_injestion(elements)
```
IMO run ingestion would also have remote or local variants later, though no need to worry about this now.
```python
        embedding_function: Optional[callable] = None,
        read_only: Optional[bool] = False,
        ingestion_batch_size: int = 1024,
        num_workers: int = 4,
```
Let's set the default to 0, because the ingest transform fails if the embedding is not an API call but e.g. a HuggingFace model.
```python
        embeds = [None] * len(text_list)
        if _embedding_function is not None:
            embeddings = _embedding_function.embed_documents(text_list)
```
Still debating whether this function should live inside the transform; sometimes it is very unreliable and causes the rest of the transform to fail. Maybe we should wrap the transform in a try/except to ensure smooth execution and fail only when necessary.
Should we add a try/except to the transform, or move ingestion into the transform?
I think so
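One possible shape for that try/except, sketched without deeplake itself. `run_with_retries`, `run_transform`, and the retry bound are assumptions for illustration, loosely mirroring the retry constant added elsewhere in this PR:

```python
def run_with_retries(run_transform, max_retries: int = 5):
    """Hypothetical wrapper: retry the (sometimes flaky) embedding
    transform a bounded number of times, then surface the last failure."""
    last_error = None
    for _ in range(max_retries):
        try:
            return run_transform()
        except Exception as exc:  # broad on purpose: any transform failure
            last_error = exc
    raise RuntimeError("ingestion failed after retries") from last_error

# Toy stand-in for the transform: fails twice before succeeding.
calls = {"n": 0}
def flaky_transform():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("embedding API hiccup")
    return "ok"

assert run_with_retries(flaky_transform) == "ok"
assert calls["n"] == 3
```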
```python
        elements = self._create_elements(ids, texts, metadatas, embeddings)
        self._run_data_injestion(elements)
        self.dataset.commit(allow_empty=True)
        self.dataset.summary()
```
Maybe it isn't worth it for us to print this ourselves?
```python
def cosine_similarity(embeddings, query_embedding, limit):
    return f"select * order by COSINE_SIMILARITY({embeddings}, ARRAY[{query_embedding}]) DESC LIMIT {limit}"
```
are you sure COSINE_SIMILARITY is DESC?
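For reference: cosine similarity grows as vectors align, so `DESC` does put the closest matches first. A quick self-contained numpy check (vector values here are illustrative):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Plain cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 0.0])
close = np.array([0.9, 0.1])   # nearly parallel to the query
far = np.array([-1.0, 0.0])    # opposite direction

# Higher similarity means a closer match, so DESC surfaces the best hits first.
assert cosine_sim(query, close) > cosine_sim(query, far)
```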
deeplake/core/vectorstore/utils.py (Outdated)
```python
    )
    dataset.create_tensor(
        "embedding",
        htype="generic",
```
Are we adding an embedding htype?
yup
```python
            return []
        return view

    def _run_data_injestion(self, elements):
```
ingestion*
```python
class DeepLakeVectorStore:
    """Base class for DeepLakeVectorStore"""

    _DEFAULT_DEEPLAKE_PATH = "./deeplake_vector_store"
```
This should be in constants.py
```python
        elements = self._create_elements(ids, texts, metadatas, embeddings)
        self._run_data_injestion(elements)
        self.dataset.commit(allow_empty=True)
        self.dataset.summary()
```
No prints that cannot be disabled by the user (e.g. via a verbose arg).
```python
    def query(
        self,
        query: Optional[str] = None,
        embedding: Optional[float] = None,
```
The embedding type doesn't look right. We should define a custom type like `Vector = Union[List[float], np.ndarray]` and use it everywhere.
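A minimal sketch of the suggested alias; the `normalize_embedding` helper below is only illustrative, not part of the PR:

```python
from typing import List, Union

import numpy as np

# Proposed shared alias so every embedding-accepting signature agrees
# on what a vector may be.
Vector = Union[List[float], np.ndarray]

def normalize_embedding(embedding: Vector) -> np.ndarray:
    """Hypothetical helper: accept either form, return a float32 ndarray."""
    return np.asarray(embedding, dtype=np.float32)

assert normalize_embedding([0.1, 0.2]).dtype == np.float32
assert normalize_embedding(np.zeros(3)).shape == (3,)
```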
```python
        emb = embedding or self._embedding_function.embed_query(
            query
        )  # type: ignore
        query_emb = np.array(emb, dtype=np.float32)
```
Don't do this if emb is already a numpy array (it causes an unnecessary copy plus up/down casting).
Will add an explicit check on the dtype, because otherwise LangChain and LlamaIndex error out if the dtype is not np.float32.
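One way to get both behaviors at once is `np.asarray`, which is a no-op when the input is already an ndarray of the target dtype and converts otherwise. The helper name below is an assumption for illustration:

```python
import numpy as np

def as_float32(emb) -> np.ndarray:
    """np.asarray returns the input unchanged (no copy) when it is
    already a float32 ndarray, and converts lists or other dtypes."""
    return np.asarray(emb, dtype=np.float32)

a = np.zeros(4, dtype=np.float32)
assert as_float32(a) is a                                 # no copy made
assert as_float32([1.0, 2.0]).dtype == np.float32          # lists converted
assert as_float32(np.zeros(2, dtype=np.float64)).dtype == np.float32
```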
```python
        if delete_all:
            self.dataset.delete(large_ok=True)
```
This should probably be a different method called `drop`.
It also makes sense to have a `clear` method which deletes all samples without dropping the dataset.
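To make the proposed split concrete, a toy sketch (the class and its list-backed storage are invented; a real store would wrap the deeplake dataset):

```python
class VectorStoreSketch:
    """Illustrates the naming proposal: `clear` empties the dataset,
    `drop` deletes it entirely."""

    def __init__(self):
        self._rows = []
        self.exists = True

    def add(self, item):
        self._rows.append(item)

    def clear(self):
        """Delete all samples but keep the dataset itself."""
        self._rows = []

    def drop(self):
        """Delete the dataset entirely."""
        self._rows = []
        self.exists = False

store = VectorStoreSketch()
store.add("doc")
store.clear()
assert store.exists and store._rows == []
store.drop()
assert not store.exists
```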
I think `delete` is a good name for this method.
@adolkhan, most importantly, tests are missing.
```
# Conflicts:
#   deeplake/constants.py
#   deeplake/core/dataset/dataset.py
```
deeplake/client/config.py (Outdated)
```diff
@@ -10,7 +10,7 @@
 HUB_REST_ENDPOINT_LOCAL = "http://localhost:7777"
 USE_LOCAL_HOST = False
 USE_DEV_ENVIRONMENT = False
-USE_STAGING_ENVIRONMENT = False
+USE_STAGING_ENVIRONMENT = True
```
??
deeplake/constants.py (Outdated)
```diff
@@ -189,3 +189,7 @@
 # Transform cache sizes
 DEFAULT_TRANSFORM_SAMPLE_CACHE_SIZE = 16
 TRANSFORM_CHUNK_CACHE_SIZE = 64 * MB
+
+DEFAULT_DEEPLAKE_PATH = "./deeplake_vector_store"
+MAX_RETRY_ATTEMPTS = 5
```
This variable name should be more descriptive; max retry attempts for what?
```python
        read_only: Optional[bool] = False,
        ingestion_batch_size: int = 1024,
        num_workers: int = 0,
        exec_option: str = "python",
```
"python" doesn't sound like a good name for an exec option.
I think `python` is fine; we discussed this part with @istranic.
```python
                "metadata": processed_tensors["metadatas"][i],
                "embedding": processed_tensors["embeddings"][i],
            }
            for i in range(0, len(processed_tensors["texts"]))
```
Suggested change:

```diff
-            for i in range(0, len(processed_tensors["texts"]))
+            for i in range(len(processed_tensors["texts"]))
```
Codecov Report — patch coverage:

```
@@            Coverage Diff             @@
##             main    #2313      +/-   ##
==========================================
+ Coverage   84.53%   84.70%   +0.16%
==========================================
  Files         298      323      +25
  Lines       35805    36531     +726
==========================================
+ Hits        30268    30942     +674
- Misses       5537     5589      +52
```
deeplake/client/client.py (Outdated)
```python
    indicies = [int(i) for i in indicies.split(",")]
    return indicies
    return indicies, scores
```
It should be `indices`, not `indicies`.
```python
        tensor_lengthes += (
            f"length of {tensor_name} = {len(tensors[f'{tensor_name}'])}\n"
        )
        return tensor_lengthes
```
It should be `tensor_lengths`, not `tensor_lengthes`.
```python
            distance_metric (str, optional): Type of distance metric to use for sorting the data. Avaliable options are: "L1", "L2", "COS", "MAX".
            exec_option (str, optional): Type of query execution. It could be either "python", "compute_engine" or "db_engine". Defaults to "python".
                - `python` - runs on the client and can be used for any data stored anywhere. WARNING: using this option with big datasets is discouraged, because it can lead to some memory issues.
                - `compute_engine` - runs on the client and can be used for any data stored in or connected to Deep Lake.
```
A single backtick (`) will render the text in italics; we should use double backticks (``) for this kind of literal text.
deeplake/constants.py (Outdated)
```python
DEFAULT_VECTORSTORE_DEEPLAKE_PATH = "./deeplake_vector_store"
MAX_VECTORSTORE_INGESTION_RETRY_ATTEMPTS = 5
MAX_CHECKPOINTING_INTERVAL = 100000
MAX_DATSET_LENGTH_FOR_CACHING = 100000
```
Typo: it should be `MAX_DATASET_LENGTH_FOR_CACHING`, and updated everywhere.
```python
        self,
        query_string: str,
        runtime: Optional[Dict] = None,
        return_indices_and_scores: bool = False,
```
Does this resolve #2350?
```python
            return_indices_and_scores (bool): by default False. Whether to return indices and scores.

        Raises:
            ValueError: if return_indices_and_scores is True and runtime is not {"db_engine": true}
```
I believe this naming has been changed for db_engine; please talk to @istranic for the final decision on naming strategies.
```python
        query_string: str,
        runtime: Optional[Dict] = None,
        return_indices_and_scores: bool = False,
    ):
```
I would add examples inside the docstrings so that they are copy-able and usable.
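Copy-able docstring examples can even be checked mechanically with doctest. A toy illustration (the `add_texts_preview` function is invented for the demo, not part of the PR):

```python
import doctest

def add_texts_preview(texts):
    """Return the number of texts that would be ingested.

    Examples:
        >>> add_texts_preview(["hello", "world"])
        2
    """
    return len(texts)

# The example in the docstring actually runs: doctest finds and executes it.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner()
for test in finder.find(add_texts_preview, name="add_texts_preview"):
    runner.run(test)
assert runner.failures == 0 and runner.tries >= 1
```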
```python
            num_workers (int): The number of workers to use for ingesting data in parallel. Defaults to 0.
            exec_option (str): Type of query execution. It could be either "python", "compute_engine" or "tensor_db". Defaults to "python".
                - ``python`` - Pure-python implementation that runs on the client and can be used for data stored anywhere. WARNING: using this option with big datasets is discouraged because it can lead to memory issues.
                - ``compute_engine`` - C++ implementation of the Deep Lake Compute Engine that runs on the client and can be used for any data stored in or connected to Deep Lake. It cannot be used with in-memory or local data.
```
Are we sure that `compute_engine` is the local version of `tensordb`?
```python
            exec_option (str): Type of query execution. It could be either "python", "compute_engine" or "tensor_db". Defaults to "python".
                - ``python`` - Pure-python implementation that runs on the client and can be used for data stored anywhere. WARNING: using this option with big datasets is discouraged because it can lead to memory issues.
                - ``compute_engine`` - C++ implementation of the Deep Lake Compute Engine that runs on the client and can be used for any data stored in or connected to Deep Lake. It cannot be used with in-memory or local data.
                - ``tensor_db`` - Fully-hosted Managed Database that is responsible for storage and query execution. Only available for data stored in the Deep Lake Managed Database. This is achieved by specifying runtime = {"tensor_db": True} during dataset creation.
```
Are you sure it is `tensor_db`, not `tensordb`?
```python
def create_tql_string(metric_function, order="ASC"):
    return f"select *, {METRIC_FUNC_TO_METRIC_STRING[metric_function]} as score ORDER BY {METRIC_FUNC_TO_METRIC_STRING[metric_function]} {order} LIMIT 10"
```
This is missing attribute-based search with a `where` condition. You can likely skip this for this release, but please add it to your todo list.
Please talk to @istranic as early as you can.
```python
@requires_libdeeplake
@pytest.mark.parametrize("distance_metric", ["L1", "L2", "COS", "MAX", "DOT"])
```
If you are removing `DOT`, then I would remove it from the tests as well.
```python
    "l1": "ASC",
    "l2": "ASC",
    "cos": "DESC",
    "max": "ASC",
```
kk
```python
    Returns:
        str: TQL representation of the query string.
    """
    return f"select *, {distance_metric} as score ORDER BY {distance_metric} {order} LIMIT {limit}"
```
Is there any problem with `distance_metric` appearing twice? Why don't you `ORDER BY score`, reusing the alias?
Here is a complete example:

```python
query = f"select * from (select text, l2_norm(embedding - ARRAY[{embedding_search}]) as score from \"{dataset_path}\") order by score desc limit 5"
```
There is no `where` condition for hybrid search.
```python
    dataset.create_tensor(
        "embedding",
        htype="embedding",
        dtype=np.float32,
```
np.float32 and the corresponding casting should be parameterized; I can easily see users specifying float16 or going even lower.
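A sketch of what parameterizing the dtype could look like. The helper names are assumptions; only the `htype="embedding"` / `dtype` pairing mirrors the snippet under review:

```python
import numpy as np

def embedding_tensor_config(dtype=np.float32):
    """Hypothetical config builder: keep the embedding htype from the
    snippet above, but make the dtype a caller-supplied parameter."""
    return {"htype": "embedding", "dtype": np.dtype(dtype)}

def cast_embeddings(embeddings, dtype=np.float32):
    """Cast incoming embeddings to whatever dtype the store was built with."""
    return np.asarray(embeddings, dtype=dtype)

# Users could then go down to float16 (or lower) for smaller stores.
assert embedding_tensor_config(np.float16)["dtype"] == np.float16
assert cast_embeddings([[0.1, 0.2]], dtype=np.float16).dtype == np.float16
```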
🚀 🚀 Pull Request

Checklist:
- coverage-rate up

Changes:
- Added support for running the DeepLakeVectorStore with native Python and Indra search.