
[AL-2275] DeepLakeVector module #2313

Merged 76 commits into main on May 17, 2023

Conversation

adolkhan
Contributor

🚀 🚀 Pull Request

Checklist:

  • My code follows the style guidelines of this project and the Contributing document
  • I have commented my code, particularly in hard-to-understand areas
  • I have kept the coverage rate up
  • I have performed a self-review of my own code and resolved any problems
  • I have checked to ensure there aren't any other open Pull Requests for the same change
  • I have described and made corresponding changes to the relevant documentation
  • New and existing unit tests pass locally with my changes

Changes

Added support for running the DeepLakeVector store with native Python and Indra search

Member

@davidbuniat davidbuniat left a comment

  • need better typing across classes and documentation
  • added a few comments
  • generic question: is there any chance we can add a quick index (FAISS or HNSW)?

self.dataset.summary()
return ids

def query(
Member

I think this is less a query and more a search functionality

texts: Iterable[str],
metadatas: Optional[List[dict]] = None,
ids: Optional[List[str]] = None,
embeddings: Optional[Any] = None,
Member

IMO we need to make this as extensible as possible so we can add more tensors later; maybe kwargs that translate into custom tensors.

Contributor Author

In none of LangChain, LlamaIndex, or AutoGPT does the user have control over the tensors they create. So I don't think adding this is urgent, and I will do it first thing in the next Deep Lake vector search module release.

Member

sounds good; keep in mind our key point is that we support more than a single modality, so our API should reflect that while staying compliant with the existing interface.
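As a sketch, the kwargs-to-custom-tensors idea from this exchange could look like the following. Everything here is hypothetical: `add_texts` and the dict-backed `dataset` are illustrative stand-ins, not the actual deeplake API.

```python
# Hypothetical sketch: extra keyword arguments become custom tensors at add time.
# The dict-backed dataset stands in for a real Deep Lake dataset; metadatas and
# embeddings would be handled by the existing fixed-tensor path.
def add_texts(dataset, texts, metadatas=None, embeddings=None, **custom_tensors):
    for name, values in custom_tensors.items():
        if len(values) != len(texts):
            raise ValueError(f"custom tensor {name!r} needs {len(texts)} rows")
        # stand-in for dataset.create_tensor(name) followed by an extend
        dataset.setdefault(name, []).extend(values)
    dataset.setdefault("text", []).extend(texts)
    return dataset
```

The point of the design is that the core tensor set stays fixed while anything extra arrives through `**custom_tensors`, keeping the signature compatible with the existing interface.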

}
)

def _create_elements(self, ids, texts, metadatas, embeddings):
Member

this should also take kwargs for custom tensors; maybe add types as well?

Contributor Author

Will add types, but as I mentioned above, I think this functionality can wait until the next release

Member

kk

batched,
self.dataset,
num_workers=min(self.num_workers, len(batched) // max(self.num_workers, 1)),
_embedding_function=self._embedding_function,
Member

eval needs fault-tolerance parameters that are missing here; in LangChain I pass kwargs to let customers specify them, but IMO we should handle it ourselves.
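Built-in fault tolerance around the ingestion call could be sketched like this. The parameter names (`max_retries`, `ignore_errors`) are assumptions, not the deeplake `eval` signature:

```python
# Hypothetical sketch: retry a flaky per-batch transform a few times before
# giving up, optionally skipping the bad batch instead of failing ingestion.
def run_with_fault_tolerance(fn, batch, max_retries=3, ignore_errors=False):
    last_err = None
    for _attempt in range(max_retries):
        try:
            return fn(batch)
        except Exception as err:  # broad catch is deliberate in this sketch
            last_err = err
    if ignore_errors:
        return None  # skip the bad batch, keep ingesting the rest
    raise last_err
```

The design choice being debated is whether these knobs are baked into the store (as here) or forwarded by the caller via kwargs.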

return dataset


def create_deeplake_dataset(dataset_path, token, **kwargs):
Member

I think we should not cement ourselves into this fixed structure of tensors, and should let users dynamically add arbitrary tensors later.

view = self.dataset.filter(lambda x: x["ids"].data()["value"] in ids)
ids = list(view.sample_indices)

if filter:
Member

How is TQL handling filters? This should not be here if TQL is running it.
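The concern is that a client-side `dataset.filter(...)` defeats the point of running the query in TQL; pushing the filter down into the query string could look roughly like this (a sketch, with illustrative column names and an assumed TQL `where` syntax):

```python
# Hypothetical sketch: compile a metadata filter dict into a TQL where clause
# instead of filtering the dataset in Python on the client.
def attach_filter(tql, filter=None):
    if not filter:
        return tql
    conditions = " and ".join(
        f"metadata['{key}'] == '{value}'" for key, value in filter.items()
    )
    return f"{tql} where {conditions}"
```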

ids (List[str]): List of document IDs
"""
elements = self._create_elements(ids, texts, metadatas, embeddings)
self._run_data_injestion(elements)
Member

IMO data ingestion will also have remote or local variants later, though there is no need to worry about this now.

embedding_function: Optional[callable] = None,
read_only: Optional[bool] = False,
ingestion_batch_size: int = 1024,
num_workers: int = 4,
Member

let's set the default to 0, because the ingest transform fails if the embedding function is not an API but e.g. a HuggingFace model.


embeds = [None] * len(text_list)
if _embedding_function is not None:
embeddings = _embedding_function.embed_documents(text_list)
Member

still debating whether this function should be inside the transform; sometimes it is very unreliable and causes the rest of the transform to fail. Maybe we should wrap the transform in try/except to ensure smooth execution and fail only when necessary.

Contributor Author

should I add try/except to the transform, or move ingestion into the transform?

Member

I think so
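The guarded-embedding idea being discussed could be sketched as follows. `embed_documents` is the LangChain-style method used in the snippet above; the fallback behavior is an assumption:

```python
# Hypothetical sketch: guard the embedding call so one unreliable embedding
# backend does not abort the whole transform run.
def safe_embed(embedding_function, text_list):
    try:
        return embedding_function.embed_documents(text_list)
    except Exception:
        # fall back to "no embedding" rather than failing the transform
        return [None] * len(text_list)
```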

elements = self._create_elements(ids, texts, metadatas, embeddings)
self._run_data_injestion(elements)
self.dataset.commit(allow_empty=True)
self.dataset.summary()
Member

maybe it isn't worth it for us to print this ourselves?



def cosine_similarity(embeddings, query_embedding, limit):
return f"select * order by COSINE_SIMILARITY({embeddings}, ARRAY[{query_embedding}]) DESC LIMIT {limit}"
Member

are you sure COSINE_SIMILARITY is DESC?
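DESC is in fact the right direction here: cosine similarity grows as vectors become more alike, so the best matches come first only under descending order, unlike L1/L2 distances where lower is better. A quick check, assuming only plain NumPy:

```python
import numpy as np

# Cosine similarity of two vectors: 1.0 for parallel, 0.0 for orthogonal.
def cosine_sim(a, b):
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```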

)
dataset.create_tensor(
"embedding",
htype="generic",
Member

are we adding embedding htype?

Contributor Author

yup

@adolkhan adolkhan marked this pull request as draft April 26, 2023 05:51
@adolkhan adolkhan changed the title Created DeepLakeVectore store module DeepLakeVectore store module Apr 26, 2023
@adolkhan adolkhan changed the title DeepLakeVectore store module DeepLakeVector module Apr 26, 2023
return []
return view

def _run_data_injestion(self, elements):
Contributor

ingestion*

class DeepLakeVectorStore:
"""Base class for DeepLakeVectorStore"""

_DEFAULT_DEEPLAKE_PATH = "./deeplake_vector_store"
Contributor

This should be in constants.py

elements = self._create_elements(ids, texts, metadatas, embeddings)
self._run_data_injestion(elements)
self.dataset.commit(allow_empty=True)
self.dataset.summary()
Contributor

No prints that cannot be disabled by the user (e.g. via a verbose arg)
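A `verbose` flag of the kind requested could be sketched as follows (the class and attribute names are hypothetical, only the gating pattern matters):

```python
# Hypothetical sketch: gate the post-ingestion summary behind a verbose flag so
# the user can silence it.
class VectorStoreSketch:
    def __init__(self, verbose=False):
        self.verbose = verbose
        self.summaries_printed = 0

    def _maybe_summary(self):
        if self.verbose:
            self.summaries_printed += 1  # stand-in for self.dataset.summary()
```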

def query(
self,
query: Optional[str] = None,
embedding: Optional[float] = None,
Contributor

The embedding type doesn't look right. We should define a custom type like Vector = Union[List[float], np.ndarray] and use it everywhere.
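The suggested shared alias, as a sketch (the `accepts_vector` function is just an illustration of using it in a signature):

```python
from typing import List, Union

import numpy as np

# Shared alias for "anything that can be treated as an embedding vector".
Vector = Union[List[float], np.ndarray]

def accepts_vector(embedding: Vector) -> int:
    return len(embedding)
```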

emb = embedding or self._embedding_function.embed_query(
query
) # type: ignore
query_emb = np.array(emb, dtype=np.float32)
Contributor

Don't do this if emb is already a numpy array (unnecessary copy plus up/down casting)

Contributor Author

will add an explicit check on the dtype, because otherwise LangChain and LlamaIndex error out if the dtype is not np.float32
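The explicit dtype check being agreed on here could look like this sketch, which only converts when needed and so avoids the copy for arrays that are already float32:

```python
import numpy as np

# Convert to float32 only when necessary; float32 ndarrays pass through uncopied.
def as_float32(emb):
    if isinstance(emb, np.ndarray) and emb.dtype == np.float32:
        return emb  # already the right dtype: no copy
    return np.asarray(emb, dtype=np.float32)
```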

Comment on lines 109 to 110
if delete_all:
self.dataset.delete(large_ok=True)
Contributor

This should probably be a different method called drop

Contributor

Also makes sense to have a clear method that deletes all samples without dropping the dataset.

Contributor Author

I think delete is a good name for this method
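The three operations being debated in this thread can be sketched side by side; the method names are the ones proposed above, the bodies are purely illustrative:

```python
# Hypothetical sketch of the API distinction under discussion:
#   delete(ids) - remove specific samples
#   clear()     - remove all samples, keep the dataset
#   drop()      - remove the dataset itself
class StoreSketch:
    def __init__(self):
        self.samples = ["a", "b"]
        self.exists = True

    def delete(self, ids):
        self.samples = [s for s in self.samples if s not in ids]

    def clear(self):
        self.samples = []  # dataset survives, samples are gone

    def drop(self):
        self.exists = False  # the dataset itself is removed
```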

@davidbuniat
Member

@adolkhan most importantly, tests are missing.

  1. You need to make sure it works smoothly with integrations such as LangChain and LlamaIndex, and that their Deep Lake related tests don't break.
  2. You need to ensure that the three implementations (python/indra/serverless) always return the same results. Fixtures would be helpful.
  3. 100% test coverage is a must-have on this one.

@tatevikh tatevikh changed the title DeepLakeVector module [AL-2275] DeepLakeVector module May 2, 2023
@adolkhan adolkhan requested a review from farizrahman4u May 9, 2023 15:12
@adolkhan adolkhan marked this pull request as ready for review May 9, 2023 15:12
@@ -10,7 +10,7 @@
HUB_REST_ENDPOINT_LOCAL = "http://localhost:7777"
USE_LOCAL_HOST = False
USE_DEV_ENVIRONMENT = False
USE_STAGING_ENVIRONMENT = False
USE_STAGING_ENVIRONMENT = True
Contributor

??

@@ -189,3 +189,7 @@
# Transform cache sizes
DEFAULT_TRANSFORM_SAMPLE_CACHE_SIZE = 16
TRANSFORM_CHUNK_CACHE_SIZE = 64 * MB

DEFAULT_DEEPLAKE_PATH = "./deeplake_vector_store"
MAX_RETRY_ATTEMPTS = 5
Contributor

This variable name should be more descriptive: max retry attempts for what?

read_only: Optional[bool] = False,
ingestion_batch_size: int = 1024,
num_workers: int = 0,
exec_option: str = "python",
Contributor

"python" doesn't sound like a good name for an exec option.

Contributor Author

I think python is fine, we discussed this part with @istranic

"metadata": processed_tensors["metadatas"][i],
"embedding": processed_tensors["embeddings"][i],
}
for i in range(0, len(processed_tensors["texts"]))
Contributor

Suggested change:
- for i in range(0, len(processed_tensors["texts"]))
+ for i in range(len(processed_tensors["texts"]))

@adolkhan adolkhan requested a review from farizrahman4u May 16, 2023 08:37
@codecov

codecov bot commented May 16, 2023

Codecov Report

Patch coverage: 91.65% and project coverage change: +0.16% 🎉

Comparison is base (9d2d497) 84.53% compared to head (79f1c72) 84.70%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2313      +/-   ##
==========================================
+ Coverage   84.53%   84.70%   +0.16%     
==========================================
  Files         298      323      +25     
  Lines       35805    36531     +726     
==========================================
+ Hits        30268    30942     +674     
- Misses       5537     5589      +52     
Flag Coverage Δ
unittests 84.70% <91.65%> (+0.16%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
deeplake/htype.py 100.00% <ø> (ø)
deeplake/client/client.py 85.16% <25.00%> (-0.48%) ⬇️
deeplake/core/dataset/dataset.py 91.48% <25.00%> (-0.24%) ⬇️
...rstore/vector_search/indra/remote_engine_search.py 40.00% <40.00%> (ø)
...orstore/vector_search/indra/indra_vector_search.py 46.15% <46.15%> (ø)
.../vectorstore/vector_search/python/vector_search.py 60.00% <60.00%> (ø)
...lake/core/vectorstore/test_deeplake_vectorstore.py 73.33% <73.33%> (ø)
deeplake/core/vectorstore/deeplake_vectorstore.py 77.96% <77.96%> (ø)
...ke/core/vectorstore/vector_search/vector_search.py 87.50% <87.50%> (ø)
.../vectorstore/vector_search/dataset/test_dataset.py 94.11% <94.11%> (ø)
... and 24 more

... and 12 files with indirect coverage changes



indicies = [int(i) for i in indicies.split(",")]
return indicies
return indicies, scores
Contributor

should be indices not indicies

tensor_lengthes += (
f"length of {tensor_name} = {len(tensors[f'{tensor_name}'])}\n"
)
return tensor_lengthes
Contributor

tensor_lengths not tensor_lengthes

distance_metric (str, optional): Type of distance metric to use for sorting the data. Available options are: "L1", "L2", "COS", "MAX".
exec_option (str, optional): Type of query execution. It could be either "python", "compute_engine" or "db_engine". Defaults to "python".
- `python` - runs on the client and can be used for any data stored anywhere. WARNING: using this option with big datasets is discouraged, because it can lead to some memory issues.
- `compute_engine` - runs on the client and can be used for any data stored in or connected to Deep Lake.
Contributor

a single backtick (`) will make the text italic; we should use double backticks (``) for this kind of text.

@adolkhan adolkhan requested a review from FayazRahman May 16, 2023 15:56
DEFAULT_VECTORSTORE_DEEPLAKE_PATH = "./deeplake_vector_store"
MAX_VECTORSTORE_INGESTION_RETRY_ATTEMPTS = 5
MAX_CHECKPOINTING_INTERVAL = 100000
MAX_DATSET_LENGTH_FOR_CACHING = 100000
Member

Typo: should be MAX_DATASET_LENGTH_FOR_CACHING, and updated everywhere

@adolkhan adolkhan requested a review from davidbuniat May 17, 2023 10:49
self,
query_string: str,
runtime: Optional[Dict] = None,
return_indices_and_scores: bool = False,
Member

Does this resolve #2350?

return_indices_and_scores (bool): by default False. Whether to return indices and scores.

Raises:
ValueError: if return_indices_and_scores is True and runtime is not {"db_engine": true}
Member

I believe this naming has been changed from db_engine; please talk to @istranic for the final decision on naming strategies.

query_string: str,
runtime: Optional[Dict] = None,
return_indices_and_scores: bool = False,
):
Member

I would add examples inside the docstrings, so that they are copy-able and usable

num_workers (int): The number of workers to use for ingesting data in parallel. Defaults to 0.
exec_option (str): Type of query execution. It could be either "python", "compute_engine" or "tensor_db". Defaults to "python".
- ``python`` - Pure-python implementation that runs on the client and can be used for data stored anywhere. WARNING: using this option with big datasets is discouraged because it can lead to memory issues.
- ``compute_engine`` - C++ implementation of the Deep Lake Compute Engine that runs on the client and can be used for any data stored in or connected to Deep Lake. It cannot be used with in-memory or local data.
Member

are we sure that compute_engine is the local version of tensordb?

exec_option (str): Type of query execution. It could be either "python", "compute_engine" or "tensor_db". Defaults to "python".
- ``python`` - Pure-python implementation that runs on the client and can be used for data stored anywhere. WARNING: using this option with big datasets is discouraged because it can lead to memory issues.
- ``compute_engine`` - C++ implementation of the Deep Lake Compute Engine that runs on the client and can be used for any data stored in or connected to Deep Lake. It cannot be used with in-memory or local data.
- ``tensor_db`` - Fully-hosted Managed Database that is responsible for storage and query execution. Only available for data stored in the Deep Lake Managed Database. This is achieved by specifying runtime = {"tensor_db": True} during dataset creation.
Member

are you sure it is tensor_db not tensordb?



def create_tql_string(metric_function, order="ASC"):
return f"select *, {METRIC_FUNC_TO_METRIC_STRING[metric_function]} as score ORDER BY {METRIC_FUNC_TO_METRIC_STRING[metric_function]} {order} LIMIT 10"
Member

this is missing attribute-based search with a where condition; you can likely skip it for this release, but please add it to your todo list.

Member

please talk to @istranic as early as you can



@requires_libdeeplake
@pytest.mark.parametrize("distance_metric", ["L1", "L2", "COS", "MAX", "DOT"])
Member

if you are removing DOT then I would remove it from testing as well.

"l1": "ASC",
"l2": "ASC",
"cos": "DESC",
"max": "ASC",
Member

kk

Returns:
str: TQL representation of the query string.
"""
return f"select *, {distance_metric} as score ORDER BY {distance_metric} {order} LIMIT {limit}"
Member

is there any problem with distance_metric appearing twice? why don't you then ORDER BY score, so the variable is reused?

here is a complete example

query = f"select * from (select text, l2_norm(embedding - ARRAY[{embedding_search}]) as score from \"{dataset_path}\") order by score desc limit 5"
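The reviewer's suggestion of ordering by the `score` alias could be sketched as follows (assuming the TQL dialect supports referencing a select-list alias in ORDER BY, which is what the complete example above does):

```python
# Hypothetical sketch: compute the metric expression once, alias it as `score`,
# and reuse the alias in ORDER BY instead of repeating the expression.
def create_tql_string(metric_expr, order="ASC", limit=10):
    return f"select *, {metric_expr} as score ORDER BY score {order} LIMIT {limit}"
```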

Member

There is no where condition for hybrid search

dataset.create_tensor(
"embedding",
htype="embedding",
dtype=np.float32,
Member

np.float32 with corresponding casting should be parameterized; I can easily see users specifying float16 or going even lower.
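Parameterizing the dtype could be sketched like this; the helper name and the config-dict shape are assumptions, only the `create_tensor` kwargs mirror the snippet above:

```python
import numpy as np

# Hypothetical sketch: let the caller pick the embedding dtype instead of
# hard-coding np.float32 at tensor-creation time.
def embedding_tensor_config(dtype=np.float32):
    return {"name": "embedding", "htype": "embedding", "dtype": np.dtype(dtype)}
```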

@adolkhan adolkhan merged commit 9fba417 into main May 17, 2023
@adolkhan adolkhan deleted the DeepLakeVectorStore branch May 17, 2023 15:40
4 participants