
[feature request] Implement BMX algorithm #40

Open
logan-markewich opened this issue Aug 14, 2024 · 6 comments

Comments

@logan-markewich

Recently introduced here
https://www.mixedbread.ai/blog/intro-bmx

Seems like there's enough info + another open source implementation that this algorithm could make its way here

@xhluca
Owner

xhluca commented Aug 15, 2024

I think the mixedbread.ai team did a great job implementing BMX in https://github.com/mixedbread-ai/baguetter/

I'm not sure it makes sense to copy their code into this repository, considering they are actively maintaining their new library. They might also make changes to their algorithms in the future.

@logan-markewich
Author

That's fair. Just thought it might be nice to have a single place for all things bm25 😁

@xhluca
Owner

xhluca commented Aug 15, 2024

I think if it is as easy as changing the term frequency component or the idf function, it would definitely be great to add! I'm not sure if BMX is as simple as that yet though, since I haven't had the chance to read the paper closely.

At the moment, BM25+, BM25L, etc. are all very similar to the base implementation, with only small changes, so it was easy to add those variants.

@logan-markewich
Author

logan-markewich commented Sep 16, 2024

Was reading more about this paper yesterday. It looks like it requires calculating two additional quantities -- term entropies, and a "similarity" term

i.e., in Rust syntax:

let idf = ((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0).ln();
let tf = (term_freq * (self.alpha + 1.0))
    / (term_freq
        + self.k * ((1.0 - self.b) + self.b * doc_length / avg_doc_length)
        + self.alpha * term_entropies.avg);
score += idf * (tf + self.beta * term_entropies.normalized[i] * similarity);
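To make that formula concrete, here is a self-contained sketch of the per-term score as a free function. The parameter names (`k`, `b`, `alpha`, `beta`) follow the snippet above; the struct, function signature, and the example values in `main` are illustrative assumptions, not the paper's defaults.

```rust
// Hypothetical parameter bundle; names mirror the snippet, values are illustrative.
struct BmxParams {
    k: f64,
    b: f64,
    alpha: f64,
    beta: f64,
}

// One term's contribution to a document's score, per the formula above.
// `avg_entropy` is the average term entropy over the query terms, and
// `normalized_entropy` is this term's normalized entropy (both assumed precomputed).
fn bmx_term_score(
    params: &BmxParams,
    num_docs: f64,
    doc_freq: f64,
    term_freq: f64,
    doc_length: f64,
    avg_doc_length: f64,
    avg_entropy: f64,
    normalized_entropy: f64,
    similarity: f64,
) -> f64 {
    let idf = ((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0).ln();
    let tf = (term_freq * (params.alpha + 1.0))
        / (term_freq
            + params.k * ((1.0 - params.b) + params.b * doc_length / avg_doc_length)
            + params.alpha * avg_entropy);
    idf * (tf + params.beta * normalized_entropy * similarity)
}

fn main() {
    let params = BmxParams { k: 1.5, b: 0.75, alpha: 0.5, beta: 0.2 };
    let score = bmx_term_score(&params, 1000.0, 10.0, 3.0, 120.0, 100.0, 0.8, 0.6, 2.0);
    println!("score = {score:.4}");
}
```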

Where similarity is just a count of the terms common to the query and document, and term entropy is something like

// Calculate probabilities and entropy for one term
// (term_freq_sum is the sum of this term's frequencies across all documents)
let mut entropy = 0.0;
for freq in doc_term_freqs {
    let p = freq as f64 / term_freq_sum as f64;
    entropy -= p * p.log2();
}
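The loop above as a runnable function, treating a term's per-document frequencies as a probability distribution and computing its Shannon entropy. The zero-count guard and the function name are my additions; how BMX normalizes these entropies afterward is not shown here.

```rust
// Shannon entropy of one term's frequency distribution across the
// documents it appears in (a sketch of the loop in the comment above).
fn term_entropy(doc_term_freqs: &[u32]) -> f64 {
    let term_freq_sum: u32 = doc_term_freqs.iter().sum();
    let mut entropy = 0.0;
    for &freq in doc_term_freqs {
        if freq == 0 {
            continue; // treat 0 * log2(0) as 0
        }
        let p = f64::from(freq) / f64::from(term_freq_sum);
        entropy -= p * p.log2();
    }
    entropy
}

fn main() {
    // Uniform over 4 documents -> log2(4) = 2 bits.
    println!("{}", term_entropy(&[1, 1, 1, 1]));
    // Concentrated in a single document -> 0 bits.
    println!("{}", term_entropy(&[5]));
}
```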

@bm777
Contributor

bm777 commented Sep 24, 2024

I see where you are going: Python bindings -> Rust, hahaha
@logan-markewich

@xhluca
Owner

xhluca commented Sep 24, 2024

Do you know how term_entropies is computed?
