-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feature request] Implement BMX algorithm #40
Comments
I think the mixedbread.ai team did a great job with implementing bmx in https://github.com/mixedbread-ai/baguetter/ I'm not sure if it makes sense to copy their code into this repository, considering they are actively maintaining their new library. They might also make changes to their algorithms in the future too. |
That's fair. Just thought it might be nice to have a single place for all things bm25 😁 |
I think if it is as easy as changing the term frequency component or the idf function, it would definitely be great to add! Im not sure if BMX is as simple as that yet though, since I haven't had the chance to read the paper closely. At the moment, bm25+, BM25L, etc. are all very similar to the base implementation, with only small changes, so it was easy to add those variants. |
Was reading more about this paper yesterday. It looks like it requires calculating two additional parameters -- term entropies, and "similarity" i.e. in rust syntax, let idf = ((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0).ln();
let tf = (term_freq * (self.alpha + 1.0)) / (term_freq + (self.k * ((1.0 - self.b) + self.b * doc_length / avg_doc_length)) + self.alpha * term_entropies.avg);
score += idf * (tf + self.beta * term_entropies.normalized[i] * similarity); Where similarity is just counting the number of common terms, and term entropies is something like // Calculate probabilities and entropy
let mut entropy = 0.0;
for freq in doc_term_freqs {
let p = freq as f64 / term_freq_sum as f64;
entropy -= p * p.log2();
} |
I see, where you are going. python-binding -> Rust hahaha |
Do you know how the |
Recently introduced here
https://www.mixedbread.ai/blog/intro-bmx
Seems like there's enough info + another open source implementation that this algorithm could make it's way here
The text was updated successfully, but these errors were encountered: