Sparse idf dot #4126

Merged
merged 14 commits into from
Apr 29, 2024

@generall generall commented Apr 26, 2024

This PR introduces a modifier parameter, which can be configured for individual sparse vectors.

It also introduces a single non-empty value for the modifier, called idf.

Usage example:

PATCH collections/test_sparse_idf
{
    "sparse_vectors": {
        "text": {
            "modifier": "idf"
        }
    }
}

The modifier alters the weights of sparse vectors, so the resulting scores (and ordering) are different.

The idf modifier multiplies the weights of the sparse vectors to account for the "rarity" of the vector elements in the current collection. The less frequent an element is, the more impact it has on the scoring.

The exact formula looks like this:

((n - df + 0.5) / (df + 0.5) + 1.).ln()

where n is the number of points in the shard and df is the number of points that contain the given element.
This component is computed at query time and multiplied into the query weights before the actual search is performed.
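
As an illustration only, the query-time re-weighting described above could be sketched as follows. This is not the code from this PR: `reweight_query` and `df_by_index` are hypothetical names, and treating an element missing from the statistics as df = 0 is an assumption made for the example.

```rust
use std::collections::HashMap;

/// IDF weight for one element, using the formula quoted above.
/// `n` is the number of points in the shard, `df` the number of points
/// that contain the element.
fn idf(n: usize, df: usize) -> f32 {
    let n = n as f32;
    let df = df as f32;
    ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
}

/// Hypothetical helper: multiply each query weight by the IDF of its index.
/// Indices missing from `df_by_index` are treated as df = 0 (an assumption).
fn reweight_query(
    indices: &[u32],
    values: &[f32],
    n: usize,
    df_by_index: &HashMap<u32, usize>,
) -> Vec<f32> {
    indices
        .iter()
        .zip(values)
        .map(|(idx, v)| {
            let df = df_by_index.get(idx).copied().unwrap_or(0);
            v * idf(n, df)
        })
        .collect()
}

fn main() {
    // Element 1 is common (900 of 1000 points), element 7 is rare (3 of 1000).
    let df_by_index: HashMap<u32, usize> = HashMap::from([(1, 900), (7, 3)]);
    let weights = reweight_query(&[1, 7], &[1.0, 1.0], 1000, &df_by_index);
    // The rare element ends up with the larger query weight.
    println!("{:?}", weights);
}
```

Because only the query is re-weighted, the stored vectors stay untouched and the statistics can follow the current state of the shard.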

A similar scoring component is used in Okapi BM25, which allows sparse vectors to be used as a replacement for BM25 text search without the need to pre-compute corpus statistics.
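
For comparison, the textbook Okapi BM25 score is written out below, with the idf factor in the same "+1" form as the formula in this PR. The symbols f, k_1, b, |D| and avgdl are the usual BM25 notation and do not appear in this PR; n and df are as defined above.

```latex
\mathrm{score}(D, Q) = \sum_{i} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(q_i) = \ln\!\left(\frac{n - \mathrm{df}(q_i) + 0.5}{\mathrm{df}(q_i) + 0.5} + 1\right)
```

One possible reading, stated here as an assumption rather than something this PR prescribes: the term-frequency fraction can be stored as the weight of each element in the document's sparse vector, while the idf modifier supplies the IDF factor on the query side at search time.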

@generall changed the title from "[idf] Sparse idf dot" to "[WIP] Sparse idf dot" Apr 26, 2024
@generall changed the title from "[WIP] Sparse idf dot" to "Sparse idf dot" Apr 27, 2024
@generall generall marked this pull request as ready for review April 28, 2024 12:08
@generall generall requested a review from agourlay April 29, 2024 07:28
lib/segment/src/data_types/query_context.rs (outdated, resolved)
tests/openapi/openapi_integration/test_sparse_idf.py (outdated, resolved)
lib/sparse/src/common/sparse_vector.rs (resolved)
@@ -254,10 +307,9 @@ impl SegmentsSearcher {
search_in_segment(
segment,
partial_batch_request,
0,
Member

We used to explicitly pass zero in this case, versus available_point_count for the query above.

Now both search_in_segment calls will use the same value from the query_context.

I'm not sure of the consequences yet.

Member Author

None. If the heuristic is not used, this parameter is not used either.

Co-authored-by: Arnaud Gourlay <arnaud.gourlay@gmail.com>
@agourlay
Member

FYI the sparse vector congruence tests are still green with this branch 👌

@generall generall merged commit c173a9f into dev Apr 29, 2024
17 checks passed
@generall generall deleted the sparse-idf-dot branch April 29, 2024 12:54
generall added a commit that referenced this pull request May 2, 2024
* introduce QueryContext, which accumulates runtime info needed for executing search

* fmt

* propagate query context into segment internals

* [WIP] prepare idf stats for search query context

* Split SparseVector and RemappedSparseVector to guarantee we will not mix them up on the type level

* implement filling of the query context with IDF statistics

* implement re-weighting of the sparse query with idf

* fmt

* update idf param only if explicitly specified (more consistent with diff param update)

* replace idf bool with modifier enum, improve further extensibility

* test and fixes

* Update lib/collection/src/operations/types.rs

Co-authored-by: Arnaud Gourlay <arnaud.gourlay@gmail.com>

* review fixes

* fmt

---------

Co-authored-by: Arnaud Gourlay <arnaud.gourlay@gmail.com>
3 participants