Sparse idf dot #4126
@@ -254,10 +307,9 @@ impl SegmentsSearcher {
    search_in_segment(
        segment,
        partial_batch_request,
        0,
We used to explicitly pass `0` in that case, versus `available_point_count` for the query above.
Now both `search_in_segment` calls will use the same value from the `query_context`.
Not sure of the consequences yet.
None. If the heuristic is not used, this parameter is not used either.
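To illustrate the point of this reply, here is a minimal sketch of a search helper that reads the point count only when a heuristic is enabled. Everything in it is hypothetical: the `QueryContext` fields, the `segment_fetch_limit` function, and the scaling rule are stand-ins for illustration, not the actual Qdrant code or its sampling logic.

```rust
/// Hypothetical stand-in for the query context discussed above.
struct QueryContext {
    available_point_count: usize,
}

/// Hypothetical helper: how many results to fetch from one segment.
/// The point count is consulted only when the heuristic is enabled.
fn segment_fetch_limit(
    segment_points: usize,
    requested_limit: usize,
    use_heuristic: bool,
    ctx: &QueryContext,
) -> usize {
    if use_heuristic && ctx.available_point_count > 0 {
        // Illustrative rule: scale the limit by this segment's share
        // of all available points, rounding up.
        let scaled = (requested_limit * segment_points + ctx.available_point_count - 1)
            / ctx.available_point_count;
        scaled.max(1).min(requested_limit)
    } else {
        // Heuristic off: `available_point_count` is never read.
        requested_limit
    }
}
```

With the heuristic disabled, a context holding `0` and a context holding the real count produce the same limit, which is why replacing the explicit `0` with the shared context value should not change behavior on that path.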
Co-authored-by: Arnaud Gourlay <arnaud.gourlay@gmail.com>
FYI the sparse vector congruence tests are still green with this branch 👌
* introduce QueryContext, which accumulates runtime info needed for executing search
* fmt
* propagate query context into segment internals
* [WIP] prepare idf stats for search query context
* Split SparseVector and RemmapedSparseVector to guarantee we will not mix them up on the type level
* implement filling of the query context with IDF statistics
* implement re-weighting of the sparse query with idf
* fmt
* update idf param only if explicitly specified (more consistent with diff param update
* replace idf bool with modifier enum, improve further extensibility
* test and fixes
* Update lib/collection/src/operations/types.rs
  Co-authored-by: Arnaud Gourlay <arnaud.gourlay@gmail.com>
* review fixes
* fmt
---------
Co-authored-by: Arnaud Gourlay <arnaud.gourlay@gmail.com>
This PR introduces a `modifier` param, which can be configured for individual sparse vectors. It also introduces a single non-empty value for the modifier, called `idf`.
Usage example:
The modifier alters the weights of sparse vectors, so the resulting scores (and their order) are different.
The `idf` modifier multiplies the weights of the sparse vectors to account for the "rarity" of the vector elements in the current collection. The less frequent an element is, the more impact it has on the scoring.
The exact formula looks like this:
where `n` is the number of points in the shard and `df` is the number of points with the given element.
This component is computed at query time and multiplied with the query before performing the actual search.
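The query-time re-weighting described above can be sketched as follows. This is a sketch under stated assumptions: it uses the IDF form from Okapi BM25, `ln((n - df + 0.5) / (df + 0.5) + 1)`, which may differ from the formula used in this PR, and the `SparseQuery` type, the `reweight` function, and the `doc_freq` map are hypothetical names introduced only for illustration.

```rust
use std::collections::HashMap;

/// BM25-style inverse document frequency (assumed form, see the note above).
/// `n`: number of points in the shard.
/// `df`: number of points that contain the given element.
fn idf(n: usize, df: usize) -> f32 {
    let (n, df) = (n as f32, df as f32);
    ((n - df + 0.5) / (df + 0.5) + 1.0).ln()
}

/// Hypothetical sparse query: parallel arrays of dimension ids and weights.
struct SparseQuery {
    indices: Vec<u32>,
    values: Vec<f32>,
}

/// Multiply each query weight by the IDF of its dimension before searching.
/// Dimensions missing from `doc_freq` are treated as having df = 0.
fn reweight(query: &mut SparseQuery, n: usize, doc_freq: &HashMap<u32, usize>) {
    for (dim, value) in query.indices.iter().zip(query.values.iter_mut()) {
        let df = doc_freq.get(dim).copied().unwrap_or(0);
        *value *= idf(n, df);
    }
}
```

Because only the query is re-weighted, the stored vectors stay untouched, and the statistics (`n` and `df`) can be read from the index at query time instead of being baked into the vectors at upload time.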
A similar scoring component is used in Okapi BM25, so sparse vectors with this modifier can replace BM25 text search without pre-computing the corpus statistics.