SparseIndex

Introduction

SparseIndex is a Rust library designed for sparse vector retrieval, indexing, and storage.

Developers can easily integrate this project into Rust or C++ projects. Here are some use cases:

Quickly implement a Rust-based sparse vector search engine using this library along with actix-web.
Integrate this library into ClickHouse (written in C++) to support sparse vector search in ClickHouse.

Features Provided by SparseIndex

Sparse vector search
- Optimized sparse vector search
- Unoptimized brute-force sparse vector search
Index file storage and loading
- Index data is stored on disk using an inverted index data structure
- Index files are loaded from disk using mmap
- Optionally, mmap files can be compressed or stored with u8 quantization to save space
Multi-threaded index building support
Automatic background merging of Segment-level indexes
Supports f32, half::f16, and u8 floating-point types for storage
Supports C++ FFI (Foreign Function Interface) for cross-language invocation
TODO: Support for deleting SparseVectors from the index

Project Structure

Let's break down the project structure of SparseIndex from bottom to top:

1. Segment (Inverted Index)

A Segment is the smallest unit of an index file. It uses the inverted index as the underlying data structure, which is a classic data structure widely used in text search.

We have implemented two types of inverted indices: InvertedIndex and CompressedInvertedIndex. Both types support enabling u8 quantization storage.

InvertedIndex is a simple implementation of the inverted index, which typically offers better search performance.
CompressedInvertedIndex uses the BitPacking compression library to compress RowIds data of type u32. As a result, sparse vector search may take longer.

2. Storage

In the process of adding a large number of sparse vectors to SparseIndex, we use multi-threading to perform concurrent index building. Typically, four threads are used to build the index, with each thread responsible for constructing one Segment. When a Segment is complete (i.e., its resources reach a threshold), it participates in the merge operation with other Segments. The automatic background Segment merge operation ensures that SparseIndex maintains an optimal number of Segments under high data volumes, which in turn guarantees search performance.

3. Sparse Index API (Rust)

This is the Index-level Rust interface. Each Index is bound to a directory on disk, where all Segment file building, merging, and destruction operations are carried out. Using the Add and Search interfaces provided by the Index, sparse vectors can be stored and searched. There are also other interfaces for index state management, such as the Commit interface, which represents committing the index state.

4. Cross-Language API

To enable cross-language integration with SparseIndex, we provide rich FFI function interfaces for C++. This helps with cross-language function calls. In the future, we may consider supporting additional language types.

Development Instructions

1. Library Compilation and Testing

a. Compiling the Library

The toolchain's version:

❯ cargo version                          
cargo 1.79.0 (ffa9cf99a 2024-06-03)

Execute the following command to compile the SparseIndex library in release mode:

cargo build --release

After the compilation is complete, you can find the .a file (compiled product) in the target directory. To integrate this into a C++ project, you will need to link this .a file and use the C++ header files in the includes folder to call the SparseIndex FFI functions.

b. Testing the Library

Execute the following command to perform unit tests and integration tests for the library:

cargo test

2. FFI Interface Development Process

If the existing FFI interfaces do not meet your C++ project needs, you will need to learn how to add new FFI interfaces. Below is the process for adding new FFI function interfaces:

Step 1: Fork the Repository

If you wish to contribute your custom FFI interfaces to this repository, please fork the repository, modify the code, and submit the changes.

Step 2: Declare the New FFI Function Interface

FFI function interfaces are declared in the src/lib.rs file, which also indicates that this Rust project is a library.

You can declare your custom FFI function interfaces within the ffi module scope:

#[cxx::bridge(namespace = "SPARSE")]
pub mod ffi {
    /* ... */
    extern "Rust" {
        /* ... */
        // Your custom FFI function
        pub fn ffi_test_func();
    }
}

Step 3: Implement the FFI Function Interface

The actual implementation of FFI functions is located in the src/api/cxx_ffi directory. Developers can refer to existing FFI interface implementations and add new FFI function implementations here, along with relevant unit tests and integration tests.

Step 4: Update the C++ Header File

Once the FFI function is developed, you need to run the following command to generate the header file for C++ linkage:

cxxbridge src/lib.rs --header > include/sparse_index_cxx.h

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
.cargo		.cargo
.vscode		.vscode
bitpacker		bitpacker
common		common
contrib		contrib
datasets		datasets
include		include
ownedbytes		ownedbytes
resources		resources
src		src
.cursorignore		.cursorignore
.gitignore		.gitignore
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
README.zh-CN.md		README.zh-CN.md
build.rs.in		build.rs.in
cbindgen.toml		cbindgen.toml
rustfmt.toml		rustfmt.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SparseIndex

Introduction

Features Provided by SparseIndex

Project Structure

1. Segment (Inverted Index)

2. Storage

3. Sparse Index API (Rust)

4. Cross-Language API

Development Instructions

1. Library Compilation and Testing

a. Compiling the Library

b. Testing the Library

2. FFI Interface Development Process

Step 1: Fork the Repository

Step 2: Declare the New FFI Function Interface

Step 3: Implement the FFI Function Interface

Step 4: Update the C++ Header File

About

Releases

Packages

Languages

License

myscale/sparse-index

Folders and files

Latest commit

History

Repository files navigation

SparseIndex

Introduction

Features Provided by SparseIndex

Project Structure

1. Segment (Inverted Index)

2. Storage

3. Sparse Index API (Rust)

4. Cross-Language API

Development Instructions

1. Library Compilation and Testing

a. Compiling the Library

b. Testing the Library

2. FFI Interface Development Process

Step 1: Fork the Repository

Step 2: Declare the New FFI Function Interface

Step 3: Implement the FFI Function Interface

Step 4: Update the C++ Header File

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages