-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Segment batch search #813
Segment batch search #813
Conversation
4f9ee69
to
13c04ff
Compare
rebased & fixed conflicts ✔️ |
0e9e454
to
91bdead
Compare
91bdead
to
0c8b4aa
Compare
We can start an official round of reviews here to get things started 🔨 |
with_payload.is_required() | ||
} else { | ||
false | ||
// search is a special case of search_batch with a single batch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@generall Can you confirm that the dedicated search path needs to be re-instated here?
@@ -117,6 +117,11 @@ message SearchPoints { | |||
optional uint64 offset = 9; // Offset of the result | |||
} | |||
|
|||
message SearchBatchPoints { | |||
string collection_name = 1; // Name of the collection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of the collection is passed explicitly although the search_points
contain a dedicated collection_name
field.
The service validates that the collection_name
exists and assumes all points in the batch targets the same collection.
This is not extremely clean for the time being.
@@ -0,0 +1,89 @@ | |||
use std::fs::File; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to share a single prof.rs
file between all the benches
across crates?
) -> CollectionResult<Vec<Vec<ScoredPoint>>> { | ||
// A factor which determines if we need to use the 2-step search or not | ||
// Should be adjusted based on usage statistics. | ||
const PAYLOAD_TRANSFERS_FACTOR_THRESHOLD: usize = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the logic is adapted from the regular search
taking in consideration the batching aspects
}; | ||
// Remove offset from top result. | ||
if request.offset > 0 { | ||
if top_res.len() >= request.offset { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a fix after testing the usage of offsets.
The bug must be present in our regular search already.
LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to approve my own work on this one.
Another attempt to benchmark batching query.
Main question: at what level should we split batch into individual function calls?
There are 3 main options:
On collection level - where we can simply have batch request equivalent to the list of
SearchRequest
On local shard level
On segment level
This PR introduces benchmark to compare scenario 2 and 3.
It makes 100 single queries to 2000 stored vectors vs single batch query with 100 vectors. Dim = 100
No indexes used.
Results:
No filters:
Filter which matches 1/5 of the points:
Filter which matches all data points (to estimate filter overhead):