
Segment batch search #813

Merged: 23 commits from segment-batch-search into master on Aug 18, 2022
Conversation

@generall (Member) commented Jul 13, 2022

Another attempt at benchmarking batched queries.

Main question: at what level should we split a batch into individual function calls?

There are 3 main options:

  • On the collection level - where a batch request is simply equivalent to a list of SearchRequest

    • pros: Simplest to implement - no changes even in the internal API
    • cons: Requires multiple internal calls to the remote shard
  • On the local shard level

    • pros:
      • No need to change code on the segment level and the segment holder level
      • A single request to the remote shard
    • cons:
      • The same filter is calculated multiple times
  • On the segment level (see the sketch after this list)

    • pros:
      • Presumably the smallest overhead on function calls
      • Can reuse the point selection from filters
    • cons:
      • Requires the most changes and risks code duplication
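
To make the segment-level trade-off concrete, here is a minimal, self-contained sketch with toy types (not qdrant's actual segment API): the batch variant resolves the filter once and reuses the selected point ids for every query in the batch, which is exactly the "reuse point selection" advantage named above.

```rust
/// Toy scored point; the real type carries payload, version, etc.
struct ScoredPoint {
    id: usize,
    score: f32,
}

/// Toy segment holding dense vectors; scoring is a plain dot product.
struct Segment {
    vectors: Vec<Vec<f32>>,
}

impl Segment {
    /// Resolve a filter into the set of matching point ids.
    fn filtered_ids(&self, filter: &dyn Fn(usize) -> bool) -> Vec<usize> {
        (0..self.vectors.len()).filter(|&id| filter(id)).collect()
    }

    /// Single-query search: the filter is resolved on every call.
    fn search(&self, query: &[f32], filter: &dyn Fn(usize) -> bool, top: usize) -> Vec<ScoredPoint> {
        let ids = self.filtered_ids(filter);
        self.score(query, &ids, top)
    }

    /// Batch search: the filter is resolved once, then reused for every query.
    fn search_batch(
        &self,
        queries: &[Vec<f32>],
        filter: &dyn Fn(usize) -> bool,
        top: usize,
    ) -> Vec<Vec<ScoredPoint>> {
        let ids = self.filtered_ids(filter);
        queries.iter().map(|q| self.score(q, &ids, top)).collect()
    }

    /// Score the selected ids against one query and keep the `top` best hits.
    fn score(&self, query: &[f32], ids: &[usize], top: usize) -> Vec<ScoredPoint> {
        let mut res: Vec<ScoredPoint> = ids
            .iter()
            .map(|&id| ScoredPoint {
                id,
                score: self.vectors[id].iter().zip(query).map(|(a, b)| a * b).sum(),
            })
            .collect();
        res.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap());
        res.truncate(top);
        res
    }
}
```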

This PR introduces a benchmark to compare scenarios 2 and 3.
It runs 100 single queries against 2000 stored vectors vs. a single batch query with 100 vectors. Dim = 100.
No indexes are used.
Results:

No filters:

batch-search-bench/search-0                                                                             
                        time:   [4.3911 ms 4.4788 ms 4.5933 ms]
batch-search-bench/search-batch-0                                                                             
                        time:   [3.7360 ms 3.8431 ms 3.9657 ms]

Filter which matches 1/5 of the points:

batch-search-bench/search-1                                                                            
                        time:   [13.916 ms 14.146 ms 14.397 ms]
batch-search-bench/search-batch-1                                                                             
                        time:   [1.1711 ms 1.1870 ms 1.2080 ms]

Filter which matches all data points (to estimate filter overhead):

batch-search-bench/search-2                                                                            
                        time:   [19.268 ms 19.852 ms 20.570 ms]
batch-search-bench/search-batch-2                                                                            
                        time:   [5.4323 ms 5.4829 ms 5.5404 ms]
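
For context, a Criterion harness of this shape could look roughly like the sketch below. It is illustrative only: it reuses the toy `Segment` from the sketch in the options list above rather than the PR's actual batch-search-bench file, and the group/function names are just mirrored from the output.

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use rand::Rng;

const NUM_VECTORS: usize = 2000;
const DIM: usize = 100;
const NUM_QUERIES: usize = 100;
const TOP: usize = 10;

fn random_vector(dim: usize) -> Vec<f32> {
    let mut rng = rand::thread_rng();
    (0..dim).map(|_| rng.gen()).collect()
}

fn batch_search_bench(c: &mut Criterion) {
    // 2000 stored vectors, 100 query vectors, dim 100, no index.
    let segment = Segment {
        vectors: (0..NUM_VECTORS).map(|_| random_vector(DIM)).collect(),
    };
    let queries: Vec<Vec<f32>> = (0..NUM_QUERIES).map(|_| random_vector(DIM)).collect();

    let mut group = c.benchmark_group("batch-search-bench");
    // 100 independent single-query searches.
    group.bench_function("search-0", |b| {
        b.iter(|| {
            for query in &queries {
                segment.search(query, &|_| true, TOP);
            }
        })
    });
    // One batch search carrying all 100 queries.
    group.bench_function("search-batch-0", |b| {
        b.iter(|| segment.search_batch(&queries, &|_| true, TOP))
    });
    group.finish();
}

criterion_group!(benches, batch_search_bench);
criterion_main!(benches);
```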

@agourlay force-pushed the segment-batch-search branch from 4f9ee69 to 13c04ff on August 8, 2022 18:50
@agourlay (Member) commented Aug 8, 2022

rebased & fixed conflicts ✔️

@agourlay force-pushed the segment-batch-search branch 3 times, most recently from 0e9e454 to 91bdead on August 15, 2022 19:23
@agourlay force-pushed the segment-batch-search branch from 91bdead to 0c8b4aa on August 16, 2022 09:42
@agourlay marked this pull request as ready for review on August 17, 2022 14:47
@agourlay (Member) commented:

We can start an official round of reviews here to get things started 🔨

with_payload.is_required()
} else {
false
// search is a special case of search_batch with a single batch

@generall Can you confirm that the dedicated search path needs to be re-instated here?
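
For context, the single-batch special case mentioned in the inline comment can be pictured roughly like this, with simplified, hypothetical types rather than the actual qdrant code path:

```rust
struct SearchRequest {
    vector: Vec<f32>,
    limit: usize,
}

struct ScoredPoint {
    id: u64,
    score: f32,
}

/// Batch entry point: each request yields its own list of scored points.
fn search_batch(requests: &[SearchRequest]) -> Vec<Vec<ScoredPoint>> {
    // Placeholder body; the real implementation fans the batch out to shards/segments.
    requests.iter().map(|_| Vec::new()).collect()
}

/// Single search expressed as a batch of size one, unwrapping its only result.
fn search(request: SearchRequest) -> Vec<ScoredPoint> {
    search_batch(&[request]).pop().unwrap_or_default()
}
```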

@@ -117,6 +117,11 @@ message SearchPoints {
optional uint64 offset = 9; // Offset of the result
}

message SearchBatchPoints {
string collection_name = 1; // Name of the collection

The name of the collection is passed explicitly, although each search_points entry already contains a dedicated collection_name field.

The service validates that the collection_name exists and assumes all requests in the batch target the same collection.

This is not extremely clean for the time being.
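
A hypothetical sketch of the assumption described above (struct and field names only mirror the proto message; the real handler differs):

```rust
struct SearchPoints {
    collection_name: String,
    // vector, filter, limit, ... omitted
}

struct SearchBatchPoints {
    collection_name: String,
    search_points: Vec<SearchPoints>,
}

/// Only the top-level collection_name is checked; per-request collection_name
/// values are assumed to target the same collection and are not cross-checked.
fn validate_batch(batch: &SearchBatchPoints, existing_collections: &[String]) -> Result<(), String> {
    if !existing_collections.iter().any(|name| name == &batch.collection_name) {
        return Err(format!("Collection `{}` does not exist", batch.collection_name));
    }
    Ok(())
}
```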

@@ -0,0 +1,89 @@
use std::fs::File;

Is it possible to share a single prof.rs file between all the benches across crates?
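
For readers unfamiliar with the file in question: a prof.rs of this kind typically wires pprof (with its flamegraph feature) into Criterion as a custom profiler, along the lines of the standard pprof/Criterion integration sketched below. This is a generic example, not necessarily the exact file added in this PR.

```rust
use std::{fs::File, os::raw::c_int, path::Path};

use criterion::profiler::Profiler;
use pprof::ProfilerGuard;

/// Criterion profiler hook that samples the benchmark with pprof and writes a
/// flamegraph into the benchmark's output directory.
pub struct FlamegraphProfiler<'a> {
    frequency: c_int,
    active_profiler: Option<ProfilerGuard<'a>>,
}

impl<'a> FlamegraphProfiler<'a> {
    pub fn new(frequency: c_int) -> Self {
        FlamegraphProfiler {
            frequency,
            active_profiler: None,
        }
    }
}

impl<'a> Profiler for FlamegraphProfiler<'a> {
    fn start_profiling(&mut self, _benchmark_id: &str, _benchmark_dir: &Path) {
        self.active_profiler = Some(ProfilerGuard::new(self.frequency).unwrap());
    }

    fn stop_profiling(&mut self, _benchmark_id: &str, benchmark_dir: &Path) {
        std::fs::create_dir_all(benchmark_dir).unwrap();
        let flamegraph_file = File::create(benchmark_dir.join("flamegraph.svg")).unwrap();
        if let Some(profiler) = self.active_profiler.take() {
            profiler
                .report()
                .build()
                .unwrap()
                .flamegraph(flamegraph_file)
                .unwrap();
        }
    }
}
```

A bench then opts in via `Criterion::default().with_profiler(FlamegraphProfiler::new(100))`, which is why each bench crate wants access to the same small helper.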

) -> CollectionResult<Vec<Vec<ScoredPoint>>> {
// A factor which determines if we need to use the 2-step search or not
// Should be adjusted based on usage statistics.
const PAYLOAD_TRANSFERS_FACTOR_THRESHOLD: usize = 10;

The logic is adapted from the regular search, taking into consideration the batching aspects.
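
A purely illustrative sketch of what such a threshold-based decision could look like for a batch; the actual condition and use of the constant in the PR may differ:

```rust
// Decide whether to run a 2-step search: first search for ids/scores only,
// then fetch payloads for the final top hits, instead of transferring
// payloads for every candidate during the search itself.
const PAYLOAD_TRANSFERS_FACTOR_THRESHOLD: usize = 10;

struct BatchEntry {
    limit: usize,
    with_payload: bool,
}

fn use_two_step_search(batch: &[BatchEntry]) -> bool {
    // Rough estimate of how many payloads would be transferred eagerly.
    let payload_transfers: usize = batch
        .iter()
        .filter(|entry| entry.with_payload)
        .map(|entry| entry.limit)
        .sum();
    payload_transfers > PAYLOAD_TRANSFERS_FACTOR_THRESHOLD
}
```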

};
// Remove offset from top result.
if request.offset > 0 {
if top_res.len() >= request.offset {

Added a fix after testing the usage of offsets.
The bug must be present in our regular search already.
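
The offset handling being fixed here can be sketched roughly like this (simplified and generic over the scored type; the PR's actual code differs in types and surrounding logic):

```rust
/// Drop the requested offset from an already ranked result list.
fn apply_offset<T>(mut top_res: Vec<T>, offset: usize) -> Vec<T> {
    if offset > 0 {
        if top_res.len() >= offset {
            // Drop the first `offset` hits, keep the remainder.
            top_res.drain(..offset);
        } else {
            // Fewer hits than the requested offset: nothing is left to return.
            top_res.clear();
        }
    }
    top_res
}
```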

@generall (Member, Author) commented:

LGTM

@agourlay (Member) left a comment:

Need to approve my own work on this one.

@agourlay merged commit a2acca0 into master on Aug 18, 2022
@agourlay linked an issue on Aug 18, 2022 that may be closed by this pull request
@agourlay deleted the segment-batch-search branch on September 9, 2022 08:40
Successfully merging this pull request may close these issues.

batch search queries