
[feature request] Implement BMX algorithm #40

Open
logan-markewich opened this issue Aug 14, 2024 · 6 comments

Comments

@logan-markewich

Recently introduced here
https://www.mixedbread.ai/blog/intro-bmx

Seems like there's enough info + another open source implementation that this algorithm could make its way here

@xhluca
Owner

xhluca commented Aug 15, 2024

I think the mixedbread.ai team did a great job implementing BMX in https://github.com/mixedbread-ai/baguetter/

I'm not sure it makes sense to copy their code into this repository, considering they are actively maintaining their new library. They might also make changes to their algorithms in the future.

@logan-markewich
Author

That's fair. Just thought it might be nice to have a single place for all things bm25 😁

@xhluca
Owner

xhluca commented Aug 15, 2024

I think if it is as easy as changing the term frequency component or the idf function, it would definitely be great to add! I'm not sure if BMX is as simple as that yet though, since I haven't had the chance to read the paper closely.

At the moment, BM25+, BM25L, etc. are all very similar to the base implementation, with only small changes, so it was easy to add those variants.

@logan-markewich
Author

logan-markewich commented Sep 16, 2024

Was reading more about this paper yesterday. It looks like it requires calculating two additional quantities -- term entropies, and a "similarity" term

i.e., in Rust syntax:

let idf = ((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0).ln();
let tf = (term_freq * (self.alpha + 1.0))
    / (term_freq
        + self.k * ((1.0 - self.b) + self.b * doc_length / avg_doc_length)
        + self.alpha * term_entropies.avg);
score += idf * (tf + self.beta * term_entropies.normalized[i] * similarity);
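To make that formula concrete, here is a self-contained sketch of the per-term score as a free function. The parameter names (`k`, `b`, `alpha`, `beta`) follow the snippet above; the struct, function signature, and the example values in `main` are illustrative assumptions, not the paper's defaults.

```rust
// Hypothetical parameter bundle; names mirror the snippet, values are illustrative.
struct BmxParams {
    k: f64,
    b: f64,
    alpha: f64,
    beta: f64,
}

// One term's contribution to a document's score, per the formula above.
// `avg_entropy` is the average term entropy over the query terms, and
// `normalized_entropy` is this term's normalized entropy (both assumed precomputed).
fn bmx_term_score(
    params: &BmxParams,
    num_docs: f64,
    doc_freq: f64,
    term_freq: f64,
    doc_length: f64,
    avg_doc_length: f64,
    avg_entropy: f64,
    normalized_entropy: f64,
    similarity: f64,
) -> f64 {
    let idf = ((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0).ln();
    let tf = (term_freq * (params.alpha + 1.0))
        / (term_freq
            + params.k * ((1.0 - params.b) + params.b * doc_length / avg_doc_length)
            + params.alpha * avg_entropy);
    idf * (tf + params.beta * normalized_entropy * similarity)
}

fn main() {
    let params = BmxParams { k: 1.5, b: 0.75, alpha: 0.5, beta: 0.2 };
    let score = bmx_term_score(&params, 1000.0, 10.0, 3.0, 120.0, 100.0, 0.8, 0.6, 2.0);
    println!("score = {score:.4}");
}
```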

Where similarity is just a count of the terms common to the query and document, and term entropy is something like

// Calculate probabilities and entropy for one term
// (term_freq_sum is the sum of this term's frequencies across all documents)
let mut entropy = 0.0;
for freq in doc_term_freqs {
    let p = freq as f64 / term_freq_sum as f64;
    entropy -= p * p.log2();
}
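The loop above as a runnable function, treating a term's per-document frequencies as a probability distribution and computing its Shannon entropy. The zero-count guard and the function name are my additions; how BMX normalizes these entropies afterward is not shown here.

```rust
// Shannon entropy of one term's frequency distribution across the
// documents it appears in (a sketch of the loop in the comment above).
fn term_entropy(doc_term_freqs: &[u32]) -> f64 {
    let term_freq_sum: u32 = doc_term_freqs.iter().sum();
    let mut entropy = 0.0;
    for &freq in doc_term_freqs {
        if freq == 0 {
            continue; // treat 0 * log2(0) as 0
        }
        let p = f64::from(freq) / f64::from(term_freq_sum);
        entropy -= p * p.log2();
    }
    entropy
}

fn main() {
    // Uniform over 4 documents -> log2(4) = 2 bits.
    println!("{}", term_entropy(&[1, 1, 1, 1]));
    // Concentrated in a single document -> 0 bits.
    println!("{}", term_entropy(&[5]));
}
```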

@bm777
Contributor

bm777 commented Sep 24, 2024

I see where you are going: Python bindings -> Rust, hahaha
@logan-markewich

@xhluca
Owner

xhluca commented Sep 24, 2024

Do you know how term_entropies is computed?
