Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Median based TopK for sparse vectors scoring #4037

Merged
merged 2 commits into from
Apr 19, 2024
Merged

Conversation

agourlay
Copy link
Member

@agourlay agourlay commented Apr 15, 2024

Experiment median TopK based on the article from the nice folks at Quickwit https://quickwit.io/blog/top-k-complexity.

Performance

Criterion

sparse-vector-search-group/inverted-index-search
                        time:   [566.19 µs 575.56 µs 584.71 µs]
                        change: [-9.0722% -7.3381% -5.5563%] (p = 0.00 < 0.05)
                        Performance has improved.

End to end

Running https://github.com/qdrant/sparse-vectors-benchmark with full dataset

DEV
min: 72.6 millis
50p: 157.3 millis
95p: 210.72 millis
99p: 235.89 millis
999p: 261.57 millis
max: 338.19 millis
PR
min: 62.32 millis
50p: 148.66 millis
95p: 197.56 millis
99p: 222.33 millis
999p: 246.36 millis
max: 273.89 millis

Validation

  • checked congruence tests
  • checked ground truth values in benchmarks

Concerns

We might want to constraint the limit argument in search to avoid having someone reserving a huge amount of memory?

@agourlay agourlay force-pushed the topk-sparse-vectors branch from 8be2c80 to 665e82b Compare April 16, 2024 06:22
@KShivendu
Copy link
Member

Nice!

check perf.

Hoping that qdrant/vector-db-benchmark#114 will be merged today. So we see a bump in CI benchmarks too =)

@@ -123,7 +123,7 @@ fn compare_sparse_vectors_search_with_without_filter(full_scan_threshold: usize)
.filter(|s| s.score != 0.0)
.zip(no_filter_result.iter().filter(|s| s.score != 0.0))
{
assert_eq!(filter_result, no_filter_result);
assert_eq!(filter_result.score, no_filter_result.score);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getting assertion error due to different ids on equal score

@agourlay agourlay marked this pull request as ready for review April 16, 2024 18:10
@agourlay agourlay force-pushed the topk-sparse-vectors branch 3 times, most recently from 210b584 to 3d723b6 Compare April 18, 2024 13:25
@agourlay agourlay merged commit abcd2af into dev Apr 19, 2024
17 checks passed
@agourlay agourlay deleted the topk-sparse-vectors branch April 19, 2024 11:27
timvisee pushed a commit that referenced this pull request Apr 22, 2024
* Median based TopK for sparse vectors scoring

* add test with identical scores
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants