Releases: xhluca/bm25s
0.2.7pre1
What's Changed
Notes
- The behavior of tokenizers has changed with respect to the null token: it is now added first to the vocabulary rather than at the end, since the previous approach was inconsistent with the general convention that the empty string `""` maps to 0. This is a backward compatible change: tokenizers work the same way as before, and both old and new tokenizers work with the retriever object, but expect tokenizers created before 0.2.7 to differ from those created in 0.2.7 and beyond in this behavior.
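Conceptually, the change looks like this. Below is a minimal sketch in plain Python; the dict names are illustrative, not bm25s internals:

```python
# Sketch of the null-token placement change (illustrative names, not bm25s code).
tokens = ["cat", "dog", "fish"]

# 0.2.7 and beyond: the null token "" is added first, so it maps to 0
vocab_new = {"": 0}
for tok in tokens:
    vocab_new[tok] = len(vocab_new)

# Before 0.2.7: the null token was appended at the end of the vocabulary
vocab_old = {tok: i for i, tok in enumerate(tokens)}
vocab_old[""] = len(vocab_old)

print(vocab_new[""])  # 0
print(vocab_old[""])  # 3
```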
Full Changelog: 0.2.6...0.2.7
0.2.6
0.2.5
0.2.4
What's Changed
Fix crash tokenizing with empty word_to_id by @mgraczyk in #72
Create nltk_stemmer.py by @aflip in #77
aa31a23: This commit focuses on improving the handling of unknown tokens during tokenization and retrieval, enhancing error handling, and improving logging for easier debugging.
- `bm25s/__init__.py`: Added a check in the `get_scores_from_ids` method that raises a `ValueError` if `max_token_id` exceeds the number of tokens in the index. Enhanced handling of empty queries in the `_get_top_k_results` method by returning zero scores for all documents.
- `bm25s/tokenization.py`: Fixed the behavior of `streaming_tokenize` to correctly handle the addition of new tokens and the updating of `word_to_id`, `word_to_stem`, and `stem_to_sid`.
New Contributors
Full Changelog: 0.2.3...0.2.4
0.2.3
0.2.2
- Improve README with an example of memory usage optimization
- Add a `Results.merge` method allowing merging of lists of results
- Make `get_max_memory_usage` compatible with macOS
- Add `BM25.load_scores`, which allows loading only the scores of the object
- Add a `load_vocab` parameter, set to `True` by default in `BM25.load`, allowing the vocabulary to not always be loaded

PR: #63
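For intuition, merging results is conceptually just concatenating the per-query rows of documents and scores from separate retrieval calls. Here is a hedged sketch using plain lists; the real `Results.merge` operates on `Results` objects:

```python
# Illustrative sketch of merging retrieval results from two query batches.
# The real Results.merge works on bm25s Results objects; plain dicts/lists here.
batch1 = {"documents": [[4, 2, 9]], "scores": [[3.1, 2.0, 1.4]]}  # first batch
batch2 = {"documents": [[7, 1, 3]], "scores": [[2.8, 2.2, 0.9]]}  # second batch

merged = {
    "documents": batch1["documents"] + batch2["documents"],
    "scores": batch1["scores"] + batch2["scores"],
}
print(len(merged["documents"]))  # 2 -- one row of top-k results per query
```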
Full Changelog: 0.2.1...0.2.2
v0.2.1
- Add `Tokenizer.save_vocab` and `Tokenizer.load_vocab` methods to save/load the vocabulary to a JSON file, called `vocab.tokenizer.json` by default
- Add `Tokenizer.save_stopwords` and `Tokenizer.load_stopwords` methods to save/load stopwords to a JSON file, called `stopwords.tokenizer.json` by default
- Add a `TokenizerHF` class to allow saving/loading from the Hugging Face Hub
- New functions: `load_vocab_from_hub`, `save_vocab_to_hub`, `load_stopwords_from_hub`, `save_stopwords_to_hub`

New tests and examples were added (see `examples/index_to_hf.py` and `examples/tokenizer_class.py`).
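Under the hood, a vocabulary is a plain word-to-id mapping serialized to JSON. Here is a minimal sketch of the save/load round trip; the file name comes from the release notes above, everything else is illustrative rather than the actual `Tokenizer` implementation:

```python
import json
import os
import tempfile

# Sketch of the round trip that save_vocab/load_vocab perform: a word -> id
# mapping written to a JSON file (vocab.tokenizer.json by default).
word_to_id = {"": 0, "cat": 1, "dog": 2}

save_dir = tempfile.mkdtemp()
path = os.path.join(save_dir, "vocab.tokenizer.json")
with open(path, "w") as f:
    json.dump(word_to_id, f)

with open(path) as f:
    restored = json.load(f)

print(restored == word_to_id)  # True
```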
v0.2.0: Numba support, new `Tokenizer` class, more stopwords
Version 0.2.0 is an exciting release! This brings a lot of new features, including numba support (over 2x faster in many cases), stopwords for 10 new languages (thank you @bm777), a new Tokenizer class (faster and more flexible), document weighting at retrieval time, a new JSON backend (orjson), improvements to utils for using BEIR, and many new examples! Hope you enjoy this new release!
Numba JIT support
See discussion here: #46
The most important new feature of v0.2.0 is the addition of numba support, which only requires you to install the core requirements (with `pip install "bm25s[core]"`) or to install numba directly with `pip install numba`.
Using numba will result in a substantial speedup, so it is highly recommended if numba is available on your system (which it should be in most cases). You can find a benchmark here.
Notably, by combining numba JIT-based scoring, numba-based top-k selection (which no longer relies on jax; see the discussion thread), and the new, faster `bm25s.tokenization.Tokenizer` (see below), we observe the following speedups on a few benchmarks, in a single-threaded setting on Kaggle CPUs:
- MSMarco: 12.2 --> 39.18
- HotpotQA: 20.88 --> 47.16
- Fever: 20.19 --> 53.84
- NQ: 41.85 --> 109.47
- Quora: 272.04 --> 479.71
- NFCorpus: 1196.16 --> 5696.21
To enable it, simply do:

```python
import bm25s

# load corpus
# ...

retriever = bm25s.BM25(backend="numba")

# index and run retrieval
```

This is all you need to use numba JIT when calling the `retriever.retrieve` method. Note, however, that the first run might be slower, so you can warm up by passing a small query first. More examples can be found in the examples directory of the repository.
New `bm25s.tokenization.Tokenizer` class
With v0.2.0, we are adding the `Tokenizer` class, which enhances the existing features of `bm25s.tokenize` and makes them more flexible. Notably, it enables generator mode (streaming with `yield`) and is much faster when tokenizing queries against an existing vocabulary. You can also specify your own splitter function, which is no longer locked to a regex pattern.
You can find more information here:
- Readme section
- `examples/tokenizer_class.py`
- Read the docstring with `help(bm25s.tokenization.Tokenizer)`
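To illustrate the two ideas (a custom splitter function and generator mode), here is a minimal sketch in plain Python; the function names and vocabulary handling are illustrative, not the actual `Tokenizer` implementation:

```python
import re

def splitter(text):
    # custom splitter function; in bm25s the Tokenizer lets you supply your
    # own instead of being locked to a regex pattern
    return re.findall(r"[a-z0-9]+", text.lower())

def streaming_tokenize(texts, word_to_id):
    # generator mode: yields one list of token ids per document, lazily,
    # growing the vocabulary on the fly
    for text in texts:
        ids = []
        for tok in splitter(text):
            if tok not in word_to_id:
                word_to_id[tok] = len(word_to_id)
            ids.append(word_to_id[tok])
        yield ids

vocab = {"": 0}
stream = streaming_tokenize(["The cat sat.", "The dog ran!"], vocab)
print(next(stream))  # [1, 2, 3] -- first document only, rest is not tokenized yet
```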
New stopwords
Stopwords for 10 new languages (from NLTK) were added by @bm777 in #33, joining the existing English stopwords:
- English
- German
- Dutch
- French
- Spanish
- Portuguese
- Italian
- Russian
- Swedish
- Norwegian
- Chinese
New JSON backend
`orjson` is now supported as a JSON backend, as it is faster than `ujson` and actively maintained.
Weight mask
`BM25.retrieve` now supports a `weight_mask` array, which applies a weight (binary or float) to each of the retrieved documents. This is useful, for example, if you want to use a binary mask to hide certain documents deemed irrelevant.
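The effect can be sketched in a few lines of plain Python, illustrative of the idea rather than the bm25s implementation:

```python
# Sketch: a binary weight mask zeroes out the scores of masked documents
# before top-k selection, so they never rank highly in the results.
scores = [2.5, 1.0, 3.2, 0.7]        # BM25 scores for documents 0..3
weight_mask = [1.0, 0.0, 1.0, 1.0]   # binary mask: hide document 1

masked = [s * w for s, w in zip(scores, weight_mask)]
top3 = sorted(range(len(masked)), key=masked.__getitem__, reverse=True)[:3]
print(top3)  # [2, 0, 3] -- document 1 is excluded
```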
Dependency Notes
- `orjson` replaces `ujson` as a `core` dependency
- `jax[cpu]` is no longer a `core` dependency, but a `selection` dependency. Be careful not to use `backend_selection='jax'` if you don't have it installed!
- `numba` is a new `core` dependency, allowing you to directly use `backend='numba'` when initializing a retriever
- `pytrec_eval` is a new `evaluation` dependency, which is useful if you want to use the evaluation function in `bm25s.utils.beir`, which is copied from the BEIR dataset
Advanced Numba
Alternative Usage (advanced)
Here's an example of how to leverage numba speedups using the alternative method of activating the numba scorer and choosing the `backend_selection` manually. This method is not recommended unless you specifically want more control over how the backend is activated.
```python
import os

import Stemmer

import bm25s.hf


def main(repo_name="xhluca/bm25s-fiqa-index"):
    queries = [
        "Is chemotherapy effective for treating cancer?",
        "Is Cardiac injury is common in critical cases of COVID-19?",
    ]

    retriever = bm25s.hf.BM25HF.load_from_hub(
        repo_name, load_corpus=False, mmap=False
    )

    # Tokenize the queries
    stemmer = Stemmer.Stemmer("english")
    queries_tokenized = bm25s.tokenize(queries, stemmer=stemmer)

    # Retrieve the top-k results
    retriever.activate_numba_scorer()
    results = retriever.retrieve(queries_tokenized, k=3, backend_selection="numba")

    # show first results
    result = results.documents[0]
    print(f"First score (# 1 result): {results.scores[0, 0]}")
    print(f"First result (# 1 result):\n{result[0]}")


if __name__ == "__main__":
    main()
```
Again, this method is only recommended if you want more control.
WARNING: it will not work well with multithreading. For the full example, see `retrieve_with_numba_advanced.py`
Add tokenizer
In this release, we add the `Tokenizer` class. Please see the readme section on tokenization and `examples/tokenizer_class.py` for more details.
0.2.0rc7: Speeding up retrieval with numba, and new stopwords
This is the final version of the numba improvements:
Full Changelog: data...0.2.0rc7