Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compatibility for mmap sparse vectors #5454

Merged
merged 22 commits into from
Nov 27, 2024
Merged

Conversation

coszio
Copy link
Contributor

@coszio coszio commented Nov 15, 2024

Adds the infrastructure to integrate blob_store as a new mmap storage for sparse vectors.

This PR does not yet create segments with this storage, but should be able to read it.

There are some additional changes to the simple storage, specifically:

  • better calculation for total_sparse_vector_size:
    • when deleting, the old vector gets subtracted before adding the new vector size
  • deleted vectors are considered None when reading the storage

Edits for blob_store:

  • Setting a trailing point id to a None value (deleting a point above the range of the storage) now updates the next_pointer_offset
  • new for_each_unfiltered function which iterates through the ALL the storage with a callback, including deleted points
  • new open_or_create function which does a better check to choose between opening or creating a storage

TODO

  • test implementation
  • hidden setting to enable mmap sparse vector storage
  • read segment config for mmap
    • fix reading old config
  • compatibility test

});
}

#[test]
Copy link
Contributor Author

@coszio coszio Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also tests the simple storage for better coverage 🤓

pub enum SparseVectorStorageType {
/// Storage on disk
// (rocksdb storage)
#[default]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changing this default in the future will help us make the switch

Copy link
Member

@agourlay agourlay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking great already 👍

lib/blob_store/src/blob_store.rs Outdated Show resolved Hide resolved
lib/blob_store/src/blob_store.rs Outdated Show resolved Hide resolved
@coszio coszio force-pushed the blob_store-for-sparse-vectors branch from e5671d7 to d4d9f15 Compare November 22, 2024 16:59
@@ -79,7 +79,7 @@ impl SegmentBuilder {
};

let payload_storage =
create_payload_storage(database.clone(), segment_config, segment_path)?;
create_payload_storage(database.clone(), segment_config, temp_dir.path())?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also fixes the mmap payload storage to be stored in the right place

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@generall generall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ToDo: avoid iterating over vector storage on load in another PR

@coszio coszio force-pushed the blob_store-for-sparse-vectors branch from d4d9f15 to 4835944 Compare November 27, 2024 16:24
@coszio coszio merged commit 2de122a into dev Nov 27, 2024
17 checks passed
@coszio coszio deleted the blob_store-for-sparse-vectors branch November 27, 2024 17:45
coszio added a commit that referenced this pull request Dec 2, 2024
* implement mmap sparse vector storage

* add to VectorStorageEnum

* clippy

* add tests, fix both simple and mmap storages

* smol correction on total_vector_count

* add sparse storage type to config

* fix reading config without storage type

* generate openapi

* use blob_store by path

* hidden setting to enable new storage

* validate existing path in `BlobStore::open()`

* use new dir for each sparse vector name

* fix and rename `max_point_offset`

Plus some extra refactors

* add storage compat test, to always check both storages work

* fix opening of storage + other misc fixes

* FIX!!!

`Unset` operations in the Tracker weren't updating the
`next_pointer_id`. So, when reopening the storage, those points wouldn't
get marked as deleted in the bitslice, thus creating the illusion that
they should exist, when they did not.

* refactor naming from `iter_*` to `for_each_*`

* fix checking for BlobStore existance

* fix typo

* fix error message

* better docs for open_or_create

* fix after rebase
timvisee pushed a commit that referenced this pull request Dec 9, 2024
* implement mmap sparse vector storage

* add to VectorStorageEnum

* clippy

* add tests, fix both simple and mmap storages

* smol correction on total_vector_count

* add sparse storage type to config

* fix reading config without storage type

* generate openapi

* use blob_store by path

* hidden setting to enable new storage

* validate existing path in `BlobStore::open()`

* use new dir for each sparse vector name

* fix and rename `max_point_offset`

Plus some extra refactors

* add storage compat test, to always check both storages work

* fix opening of storage + other misc fixes

* FIX!!!

`Unset` operations in the Tracker weren't updating the
`next_pointer_id`. So, when reopening the storage, those points wouldn't
get marked as deleted in the bitslice, thus creating the illusion that
they should exist, when they did not.

* refactor naming from `iter_*` to `for_each_*`

* fix checking for BlobStore existance

* fix typo

* fix error message

* better docs for open_or_create

* fix after rebase
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants