Byte storage integration into segment #4049

Merged: 16 commits, Apr 17, 2024

@IvanPleshkov IvanPleshkov commented Apr 16, 2024

This PR adds byte vector storage support to the segment.

Main changes needed for the integration:

  1. Add a datatype field to the segment config (in VectorDataConfig). It records the storage element type so that the segment can be loaded properly.
  2. Add new variants to VectorStorageEnum: DenseSimpleByte, DenseMemmapByte and DenseAppendableMemmapByte.
  3. Support the changed config and the new vector storage types in the segment constructor.
  4. Remove Distance::preprocess_vector and Distance::similarity. Distance does not describe which vector element type is being processed, so it cannot call functions of the generic Metric trait.
  5. Removing Distance::preprocess_vector triggers a large refactor of the quantization scorers.
  6. Change the raw scorer constructors to add byte type support.
  7. Add a unit test that checks plain search, HNSW with filters, discovery search and recommendations, comparing search results over a byte-storage segment with those of the original one.
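
As a rough illustration of items 1 to 3, the sketch below shows how a datatype stored in the config could select a storage variant when a segment is opened. Everything here is a simplified stand-in, not the PR's code: the Datatype and StorageKind enums, the select_storage function, the float variant names and the fallback to Float32 for configs without the field are all assumptions. Only VectorDataConfig, VectorStorageEnum and the three byte variant names come from the description above.

```rust
// Hypothetical, simplified stand-ins for the segment types named in the PR.

/// Element type of stored vectors (assumed name and variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum Datatype {
    #[default]
    Float32,
    Uint8,
}

/// Where the vectors live (assumed; the real config is richer).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StorageKind {
    InMemory,
    Memmap,
    AppendableMemmap,
}

struct VectorDataConfig {
    storage_kind: StorageKind,
    /// `None` models a config written before the field existed (assumption).
    datatype: Option<Datatype>,
}

/// Variants carry no data here; the byte variants use the names from the PR,
/// the float variant names are assumed.
#[derive(Debug, PartialEq, Eq)]
enum VectorStorageEnum {
    DenseSimple,
    DenseSimpleByte,
    DenseMemmap,
    DenseMemmapByte,
    DenseAppendableMemmap,
    DenseAppendableMemmapByte,
}

/// Picks the storage variant from the config, treating a missing datatype
/// as Float32 so that older configs keep loading (assumption).
fn select_storage(config: &VectorDataConfig) -> VectorStorageEnum {
    let datatype = config.datatype.unwrap_or_default();
    match (config.storage_kind, datatype) {
        (StorageKind::InMemory, Datatype::Float32) => VectorStorageEnum::DenseSimple,
        (StorageKind::InMemory, Datatype::Uint8) => VectorStorageEnum::DenseSimpleByte,
        (StorageKind::Memmap, Datatype::Float32) => VectorStorageEnum::DenseMemmap,
        (StorageKind::Memmap, Datatype::Uint8) => VectorStorageEnum::DenseMemmapByte,
        (StorageKind::AppendableMemmap, Datatype::Float32) => {
            VectorStorageEnum::DenseAppendableMemmap
        }
        (StorageKind::AppendableMemmap, Datatype::Uint8) => {
            VectorStorageEnum::DenseAppendableMemmapByte
        }
    }
}
```

Matching on the (storage kind, datatype) pair keeps the dispatch exhaustive, so adding another datatype later would fail to compile until every storage kind handles it.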

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --all --all-features command?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?

@IvanPleshkov IvanPleshkov force-pushed the byte-storage-integration branch from 55e2deb to da88a49 Compare April 17, 2024 10:12
@IvanPleshkov IvanPleshkov force-pushed the byte-storage-integration branch from c8a8468 to 686cf90 Compare April 17, 2024 18:26
@IvanPleshkov IvanPleshkov marked this pull request as ready for review April 17, 2024 18:45
@IvanPleshkov IvanPleshkov requested a review from generall April 17, 2024 18:45
Self {
-    query: TMetric::preprocess(query),
+    query: TElement::slice_from_float_cow(&preprocessed_vector).to_vec(),
Member

It is good to know that Cow won't actually invoke a copy for owned vectors.

Contributor Author

I did it this way here because slice_from_float_cow is the only function that converts floats into bytes. I find Cow helpful when upserting into storages.
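
For readers unfamiliar with the pattern, here is a minimal sketch of what a Cow-based float conversion could look like. The trait name VectorElement and the exact signature are assumptions for illustration and may differ from the PR; only the function name slice_from_float_cow comes from the thread. The idea is that f32 storage passes the input through untouched, while byte storage has to build a new buffer.

```rust
use std::borrow::Cow;

/// Hypothetical element trait; the real trait name and signature may differ.
trait VectorElement: Copy {
    /// Converts a float vector into this element type, borrowing when possible.
    fn slice_from_float_cow(vector: Cow<'_, [f32]>) -> Cow<'_, [Self]>;
}

impl VectorElement for f32 {
    // Floats need no conversion: the input Cow is returned as-is,
    // so a borrowed slice stays borrowed and an owned Vec is only moved.
    fn slice_from_float_cow(vector: Cow<'_, [f32]>) -> Cow<'_, [f32]> {
        vector
    }
}

impl VectorElement for u8 {
    // Bytes always need a new buffer. `as u8` truncates toward zero and
    // saturates at 0 and 255, which is an assumed conversion policy here.
    fn slice_from_float_cow(vector: Cow<'_, [f32]>) -> Cow<'_, [u8]> {
        Cow::Owned(vector.iter().map(|&x| x as u8).collect())
    }
}
```

With this shape, a caller generic over the element type can write the conversion once, and only the byte path pays for an allocation.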

Contributor Author

Fixed: 0edd630
Now we don't do any unnecessary copying. For float vectors with a non-cosine distance, there is no longer any reallocation here.
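
The effect described in this comment can be sketched as follows. This is an illustration of the general Cow technique, not the code from commit 0edd630: the functions preprocess and into_query and the is_cosine flag are hypothetical. Preprocessing returns the input Cow unchanged unless the metric needs normalization, and calling into_owned on an owned Cow moves the Vec out without copying its buffer.

```rust
use std::borrow::Cow;

/// Hypothetical preprocessing step: only the cosine metric normalizes,
/// every other metric hands the input back untouched.
fn preprocess(vector: Cow<'_, [f32]>, is_cosine: bool) -> Cow<'_, [f32]> {
    if !is_cosine {
        return vector;
    }
    let norm = vector.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        // A zero vector cannot be normalized; leave it as it is.
        return vector;
    }
    Cow::Owned(vector.iter().map(|x| x / norm).collect())
}

/// Builds the query buffer a scorer would keep (hypothetical helper).
/// For an owned input and a non-cosine metric, `into_owned` simply moves
/// the original Vec out of the Cow, so no new allocation is made.
fn into_query(vector: Vec<f32>, is_cosine: bool) -> Vec<f32> {
    preprocess(Cow::Owned(vector), is_cosine).into_owned()
}
```

Comparing the buffer pointer of the Vec before and after into_query with is_cosine set to false is a simple way to confirm that the allocation is reused.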

@IvanPleshkov IvanPleshkov merged commit 224e4f6 into dev Apr 17, 2024
17 checks passed
@IvanPleshkov IvanPleshkov deleted the byte-storage-integration branch April 17, 2024 22:42
timvisee pushed a commit that referenced this pull request Apr 22, 2024
* byte storage with quantization

raw scorer integration

config and test

are you happy fmt

fn renamings

cow refactor

use quantization branch

quantization update

* are you happy clippy

* don't use distance in quantized scorers

* fix build

* add fn quantization_preprocess

* apply preprocessing for only cosine float metric

* fix sparse vectors tests

* update openapi

* more complicated integration test

* update openapi comment

* mmap byte storages support

* fix async test

* move .unwrap closer to the actual check of the vector presence

* fmt

* remove distance similarity function

* avoid copying data while working with cow

---------

Co-authored-by: generall <andrey@vasnetsov.com>