[AL-2275] DeepLakeVector module #2313
Conversation
- need better typing across classes and in the documentation
- added a few comments
- generic question: is there any chance we can add a quick index (FAISS or HNSW)?
```python
        self.dataset.summary()
        return ids

    def query(
```
I think this is less of a query and more of a search functionality.
```python
        texts: Iterable[str],
        metadatas: Optional[List[dict]] = None,
        ids: Optional[List[str]] = None,
        embeddings: Optional[Any] = None,
```
IMO we need to make this extensible so we can add more tensors later, maybe via kwargs that translate into custom tensors.
In none of LangChain, LlamaIndex, or AutoGPT does the user have control over the tensors they create. So I think adding this is not urgent, and I will do it first thing in the next deeplake vector search module release.
Sounds good. Keep in mind our key differentiator is that we support more than a single modality, so our API should reflect that while staying compliant with the existing interface.
```python
            }
        )

    def _create_elements(self, ids, texts, metadatas, embeddings):
```
This should also take kwargs for custom tensors; maybe add types as well?
Will add types, but as I mentioned above I think this functionality can wait until the next release.
kk
```python
            batched,
            self.dataset,
            num_workers=min(self.num_workers, len(batched) // max(self.num_workers, 1)),
            _embedding_function=self._embedding_function,
```
eval needs fault-tolerance parameters that are missing here. In LangChain I pass kwargs to let customers specify them, but IMO we should handle it ourselves.
deeplake/core/vectorstore/utils.py (Outdated)
```python
    return dataset


def create_deeplake_dataset(dataset_path, token, **kwargs):
```
I think we should not cement ourselves into this fixed structure of tensors; users should be able to dynamically add arbitrary datasets later.
```python
        view = self.dataset.filter(lambda x: x["ids"].data()["value"] in ids)
        ids = list(view.sample_indices)

        if filter:
```
How does TQL handle filters? This should not be here if TQL is running it.
```python
            ids (List[str]): List of document IDs
        """
        elements = self._create_elements(ids, texts, metadatas, embeddings)
        self._run_data_injestion(elements)
```
IMO run ingestion would also have remote or local variants later, though no need to worry about this now.
```python
        embedding_function: Optional[callable] = None,
        read_only: Optional[bool] = False,
        ingestion_batch_size: int = 1024,
        num_workers: int = 4,
```
Let's set the default to 0, because the ingest transform fails if the embedding is not an API call but e.g. a HuggingFace model.
```python
        embeds = [None] * len(text_list)
        if _embedding_function is not None:
            embeddings = _embedding_function.embed_documents(text_list)
```
Still debating whether this function should live inside the transform; sometimes it is very unreliable and causes the rest of the transform to fail. Maybe we should wrap the transform in a try/except to ensure smooth execution and fail only when necessary.
Should we add a try/except to the transform, or move ingestion into the transform?
I think so
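One possible shape for that try/except, sketched without deeplake itself. `run_with_retries`, `run_transform`, and the retry bound are assumptions for illustration, loosely mirroring the retry constant added elsewhere in this PR:

```python
def run_with_retries(run_transform, max_retries: int = 5):
    """Hypothetical wrapper: retry the (sometimes flaky) embedding
    transform a bounded number of times, then surface the last failure."""
    last_error = None
    for _ in range(max_retries):
        try:
            return run_transform()
        except Exception as exc:  # broad on purpose: any transform failure
            last_error = exc
    raise RuntimeError("ingestion failed after retries") from last_error

# Toy stand-in for the transform: fails twice before succeeding.
calls = {"n": 0}
def flaky_transform():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("embedding API hiccup")
    return "ok"

assert run_with_retries(flaky_transform) == "ok"
assert calls["n"] == 3
```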
```python
        elements = self._create_elements(ids, texts, metadatas, embeddings)
        self._run_data_injestion(elements)
        self.dataset.commit(allow_empty=True)
        self.dataset.summary()
```
Maybe it isn't worth it for us to print this ourselves?
```python
def cosine_similarity(embeddings, query_embedding, limit):
    return f"select * order by COSINE_SIMILARITY({embeddings}, ARRAY[{query_embedding}]) DESC LIMIT {limit}"
```
are you sure COSINE_SIMILARITY is DESC?
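For reference: cosine similarity grows as vectors align, so `DESC` does put the closest matches first. A quick self-contained numpy check (vector values here are illustrative):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Plain cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 0.0])
close = np.array([0.9, 0.1])   # nearly parallel to the query
far = np.array([-1.0, 0.0])    # opposite direction

# Higher similarity means a closer match, so DESC surfaces the best hits first.
assert cosine_sim(query, close) > cosine_sim(query, far)
```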
deeplake/core/vectorstore/utils.py (Outdated)
```python
    )
    dataset.create_tensor(
        "embedding",
        htype="generic",
```
Are we adding an embedding htype?
yup
```python
            return []
        return view

    def _run_data_injestion(self, elements):
```
ingestion*
```python
class DeepLakeVectorStore:
    """Base class for DeepLakeVectorStore"""

    _DEFAULT_DEEPLAKE_PATH = "./deeplake_vector_store"
```
This should be in constants.py
```python
        elements = self._create_elements(ids, texts, metadatas, embeddings)
        self._run_data_injestion(elements)
        self.dataset.commit(allow_empty=True)
        self.dataset.summary()
```
No prints that cannot be disabled by the user (e.g. via a verbose arg).
```python
    def query(
        self,
        query: Optional[str] = None,
        embedding: Optional[float] = None,
```
The embedding type doesn't look right. We should define a custom type like `Vector = Union[List[float], np.ndarray]` and use it everywhere.
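A minimal sketch of the suggested alias; the `normalize_embedding` helper below is only illustrative, not part of the PR:

```python
from typing import List, Union

import numpy as np

# Proposed shared alias so every embedding-accepting signature agrees
# on what a vector may be.
Vector = Union[List[float], np.ndarray]

def normalize_embedding(embedding: Vector) -> np.ndarray:
    """Hypothetical helper: accept either form, return a float32 ndarray."""
    return np.asarray(embedding, dtype=np.float32)

assert normalize_embedding([0.1, 0.2]).dtype == np.float32
assert normalize_embedding(np.zeros(3)).shape == (3,)
```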
```python
        emb = embedding or self._embedding_function.embed_query(
            query
        )  # type: ignore
        query_emb = np.array(emb, dtype=np.float32)
```
Don't do this if emb is already a numpy array (it causes an unnecessary copy plus up/down casting).
Will add an explicit check on the dtype, because otherwise LangChain and LlamaIndex error out if the dtype is not np.float32.
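One way to get both behaviors at once is `np.asarray`, which is a no-op when the input is already an ndarray of the target dtype and converts otherwise. The helper name below is an assumption for illustration:

```python
import numpy as np

def as_float32(emb) -> np.ndarray:
    """np.asarray returns the input unchanged (no copy) when it is
    already a float32 ndarray, and converts lists or other dtypes."""
    return np.asarray(emb, dtype=np.float32)

a = np.zeros(4, dtype=np.float32)
assert as_float32(a) is a                                 # no copy made
assert as_float32([1.0, 2.0]).dtype == np.float32          # lists converted
assert as_float32(np.zeros(2, dtype=np.float64)).dtype == np.float32
```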
```python
        if delete_all:
            self.dataset.delete(large_ok=True)
```
This should probably be a different method called `drop`.
It also makes sense to have a `clear` method which deletes all samples without dropping the dataset.
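To make the proposed split concrete, a toy sketch (the class and its list-backed storage are invented; a real store would wrap the deeplake dataset):

```python
class VectorStoreSketch:
    """Illustrates the naming proposal: `clear` empties the dataset,
    `drop` deletes it entirely."""

    def __init__(self):
        self._rows = []
        self.exists = True

    def add(self, item):
        self._rows.append(item)

    def clear(self):
        """Delete all samples but keep the dataset itself."""
        self._rows = []

    def drop(self):
        """Delete the dataset entirely."""
        self._rows = []
        self.exists = False

store = VectorStoreSketch()
store.add("doc")
store.clear()
assert store.exists and store._rows == []
store.drop()
assert not store.exists
```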
I think `delete` is a good name for this method.
@adolkhan, most importantly, tests are missing.
```
# Conflicts:
#   deeplake/constants.py
#   deeplake/core/dataset/dataset.py
```
deeplake/client/config.py (Outdated)
```diff
@@ -10,7 +10,7 @@
 HUB_REST_ENDPOINT_LOCAL = "http://localhost:7777"
 USE_LOCAL_HOST = False
 USE_DEV_ENVIRONMENT = False
-USE_STAGING_ENVIRONMENT = False
+USE_STAGING_ENVIRONMENT = True
```
??
deeplake/constants.py (Outdated)
```diff
@@ -189,3 +189,7 @@
 # Transform cache sizes
 DEFAULT_TRANSFORM_SAMPLE_CACHE_SIZE = 16
 TRANSFORM_CHUNK_CACHE_SIZE = 64 * MB
+
+DEFAULT_DEEPLAKE_PATH = "./deeplake_vector_store"
+MAX_RETRY_ATTEMPTS = 5
```
This variable name should be more descriptive; max retry attempts for what?
```python
        read_only: Optional[bool] = False,
        ingestion_batch_size: int = 1024,
        num_workers: int = 0,
        exec_option: str = "python",
```
"python" doesn't sound like a good name for an exec option.
I think `python` is fine; we discussed this part with @istranic.
```python
                "metadata": processed_tensors["metadatas"][i],
                "embedding": processed_tensors["embeddings"][i],
            }
            for i in range(0, len(processed_tensors["texts"]))
```
Suggested change:

```diff
-            for i in range(0, len(processed_tensors["texts"]))
+            for i in range(len(processed_tensors["texts"]))
```
Codecov Report — patch coverage:

```
@@            Coverage Diff             @@
##             main    #2313      +/-   ##
==========================================
+ Coverage   84.53%   84.70%   +0.16%
==========================================
  Files         298      323      +25
  Lines       35805    36531     +726
==========================================
+ Hits        30268    30942     +674
- Misses       5537     5589      +52
```
deeplake/client/client.py (Outdated)
```python
    indicies = [int(i) for i in indicies.split(",")]
    return indicies
    return indicies, scores
```
It should be `indices`, not `indicies`.
```python
        tensor_lengthes += (
            f"length of {tensor_name} = {len(tensors[f'{tensor_name}'])}\n"
        )
        return tensor_lengthes
```
It should be `tensor_lengths`, not `tensor_lengthes`.
```python
            distance_metric (str, optional): Type of distance metric to use for sorting the data. Avaliable options are: "L1", "L2", "COS", "MAX".
            exec_option (str, optional): Type of query execution. It could be either "python", "compute_engine" or "db_engine". Defaults to "python".
                - `python` - runs on the client and can be used for any data stored anywhere. WARNING: using this option with big datasets is discouraged, because it can lead to some memory issues.
                - `compute_engine` - runs on the client and can be used for any data stored in or connected to Deep Lake.
```
A single backtick (`) will render the text in italics; we should use double backticks (``) for this kind of literal text.
deeplake/constants.py (Outdated)
```python
DEFAULT_VECTORSTORE_DEEPLAKE_PATH = "./deeplake_vector_store"
MAX_VECTORSTORE_INGESTION_RETRY_ATTEMPTS = 5
MAX_CHECKPOINTING_INTERVAL = 100000
MAX_DATSET_LENGTH_FOR_CACHING = 100000
```
Typo: it should be `MAX_DATASET_LENGTH_FOR_CACHING`, and updated everywhere.
```python
        self,
        query_string: str,
        runtime: Optional[Dict] = None,
        return_indices_and_scores: bool = False,
```
Does this resolve #2350?
```python
            return_indices_and_scores (bool): by default False. Whether to return indices and scores.

        Raises:
            ValueError: if return_indices_and_scores is True and runtime is not {"db_engine": true}
```
I believe this naming has been changed for db_engine; please talk to @istranic for the final decision on naming strategies.
```python
        query_string: str,
        runtime: Optional[Dict] = None,
        return_indices_and_scores: bool = False,
    ):
```
I would add examples inside the docstrings so that they are copy-able and usable.
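Copy-able docstring examples can even be checked mechanically with doctest. A toy illustration (the `add_texts_preview` function is invented for the demo, not part of the PR):

```python
import doctest

def add_texts_preview(texts):
    """Return the number of texts that would be ingested.

    Examples:
        >>> add_texts_preview(["hello", "world"])
        2
    """
    return len(texts)

# The example in the docstring actually runs: doctest finds and executes it.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner()
for test in finder.find(add_texts_preview, name="add_texts_preview"):
    runner.run(test)
assert runner.failures == 0 and runner.tries >= 1
```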
```python
            num_workers (int): The number of workers to use for ingesting data in parallel. Defaults to 0.
            exec_option (str): Type of query execution. It could be either "python", "compute_engine" or "tensor_db". Defaults to "python".
                - ``python`` - Pure-python implementation that runs on the client and can be used for data stored anywhere. WARNING: using this option with big datasets is discouraged because it can lead to memory issues.
                - ``compute_engine`` - C++ implementation of the Deep Lake Compute Engine that runs on the client and can be used for any data stored in or connected to Deep Lake. It cannot be used with in-memory or local data.
```
Are we sure that `compute_engine` is the local version of `tensordb`?
```python
            exec_option (str): Type of query execution. It could be either "python", "compute_engine" or "tensor_db". Defaults to "python".
                - ``python`` - Pure-python implementation that runs on the client and can be used for data stored anywhere. WARNING: using this option with big datasets is discouraged because it can lead to memory issues.
                - ``compute_engine`` - C++ implementation of the Deep Lake Compute Engine that runs on the client and can be used for any data stored in or connected to Deep Lake. It cannot be used with in-memory or local data.
                - ``tensor_db`` - Fully-hosted Managed Database that is responsible for storage and query execution. Only available for data stored in the Deep Lake Managed Database. This is achieved by specifying runtime = {"tensor_db": True} during dataset creation.
```
Are you sure it is `tensor_db`, not `tensordb`?
```python
def create_tql_string(metric_function, order="ASC"):
    return f"select *, {METRIC_FUNC_TO_METRIC_STRING[metric_function]} as score ORDER BY {METRIC_FUNC_TO_METRIC_STRING[metric_function]} {order} LIMIT 10"
```
This is missing attribute-based search with a `where` condition. You can likely skip this for this release, but please add it to your todo list.
Please talk to @istranic as early as you can.
```python
@requires_libdeeplake
@pytest.mark.parametrize("distance_metric", ["L1", "L2", "COS", "MAX", "DOT"])
```
If you are removing `DOT`, then I would remove it from the tests as well.
```python
    "l1": "ASC",
    "l2": "ASC",
    "cos": "DESC",
    "max": "ASC",
```
kk
```python
    Returns:
        str: TQL representation of the query string.
    """
    return f"select *, {distance_metric} as score ORDER BY {distance_metric} {order} LIMIT {limit}"
```
Is there any problem with `distance_metric` appearing twice? Why don't you `ORDER BY score`, reusing the alias?
Here is a complete example:

```python
query = f"select * from (select text, l2_norm(embedding - ARRAY[{embedding_search}]) as score from \"{dataset_path}\") order by score desc limit 5"
```
There is no `where` condition for hybrid search.
```python
    dataset.create_tensor(
        "embedding",
        htype="embedding",
        dtype=np.float32,
```
np.float32 and the corresponding casting should be parameterized; I can easily see users specifying float16 or going even lower.
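A sketch of what parameterizing the dtype could look like. The helper names are assumptions; only the `htype="embedding"` / `dtype` pairing mirrors the snippet under review:

```python
import numpy as np

def embedding_tensor_config(dtype=np.float32):
    """Hypothetical config builder: keep the embedding htype from the
    snippet above, but make the dtype a caller-supplied parameter."""
    return {"htype": "embedding", "dtype": np.dtype(dtype)}

def cast_embeddings(embeddings, dtype=np.float32):
    """Cast incoming embeddings to whatever dtype the store was built with."""
    return np.asarray(embeddings, dtype=dtype)

# Users could then go down to float16 (or lower) for smaller stores.
assert embedding_tensor_config(np.float16)["dtype"] == np.float16
assert cast_embeddings([[0.1, 0.2]], dtype=np.float16).dtype == np.float16
```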
🚀 🚀 Pull Request

Checklist:
- coverage-rate up

Changes:
- Added support for running the DeepLakeVectorStore with native Python and Indra search.