v0.2.0: Numba support, new `Tokenizer` class, more stopwords
#58, announced by xhluca in Announcements
Version 0.2.0 is an exciting release! This brings a lot of new features, including numba support (over 2x faster in many cases), stopwords for 10 new languages (thank you @bm777), a new Tokenizer class (faster and more flexible), document weighting at retrieval time, a new JSON backend (orjson), improvements to utils for using BEIR, and many new examples! Hope you enjoy this new release!
Numba JIT support
The most important new feature of v0.2.0 is numba support, which only requires you to install the core requirements (with `pip install "bm25s[core]"`) or numba on its own (with `pip install numba`).
Using numba results in a substantial speedup, so it is highly recommended if numba is available on your system (which should be the case for most setups). You can find a benchmark here.
Notably, by combining numba JIT-based scoring, numba-based top-k selection (which no longer relies on jax, see discussion thread) and the new and faster `bm25s.tokenization.Tokenizer` (see below), we observe the following speedup on a few benchmarks, in a single-threaded setting with Kaggle CPUs:
To enable it, simply do:
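In sketch form, relying on the `backend='numba'` option described under Dependency Notes: the corpus and query are made up for illustration, the helper name `numba_retrieval_demo` is hypothetical, and the import guard is only there so the snippet degrades gracefully when `bm25s` or `numba` is missing.

```python
import importlib.util

# True only if both packages can be imported on this machine.
HAVE_DEPS = all(
    importlib.util.find_spec(name) is not None for name in ("bm25s", "numba")
)


def numba_retrieval_demo():
    """Index a toy corpus and retrieve with the numba backend."""
    import bm25s

    # Toy corpus, invented for this sketch.
    corpus = [
        "a cat is a feline and likes to purr",
        "a dog is the human's best friend and loves to play",
        "a bird is a beautiful animal that can fly",
        "a fish is a creature that lives in water and swims",
    ]
    corpus_tokens = bm25s.tokenize(corpus, stopwords="en")

    # The only change compared to the default setup: backend="numba".
    retriever = bm25s.BM25(backend="numba")
    retriever.index(corpus_tokens)

    query_tokens = bm25s.tokenize("does the fish purr like a cat?", stopwords="en")

    # Warm-up call: the first retrieval triggers JIT compilation.
    retriever.retrieve(query_tokens, k=1)

    # Later calls reuse the compiled code.
    results, scores = retriever.retrieve(query_tokens, k=2)
    return results, scores


if HAVE_DEPS:
    results, scores = numba_retrieval_demo()
    print(results.shape, scores.shape)
else:
    print('bm25s and numba are required: pip install "bm25s[core]"')
```

Because no corpus is passed to the retriever here, `results` holds document indices rather than the documents themselves.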
This is all you need to use numba JIT when calling the `retriever.retrieve` method. Note, however, that the first run might be slower because of JIT compilation, so you can warm up by passing a small query first. Here are more examples:
New `bm25s.tokenization.Tokenizer` class
With v0.2.0, we are adding the `Tokenizer` class, which enhances the existing features of `bm25s.tokenize` and makes it more flexible. Notably, it enables generator mode (streaming with `yield`), and is much faster when tokenizing queries if you have an existing vocabulary. You can also specify your own splitter function, so splitting is no longer locked to a regex pattern.
You can find more information here:
examples/tokenizer_class.py
help(bm25s.tokenization.Tokenizer)
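As a rough illustration of the three points above (custom splitter, generator mode, reusing an existing vocabulary for queries), here is a minimal sketch. The parameter and method names (`splitter`, `stopwords`, `return_as`, `update_vocab`, `streaming_tokenize`) are written from memory of the library and should be treated as assumptions; check them against `help(bm25s.tokenization.Tokenizer)` and `examples/tokenizer_class.py`. The helper name `tokenizer_demo` and the toy texts are hypothetical.

```python
import importlib.util

HAVE_BM25S = importlib.util.find_spec("bm25s") is not None


def tokenizer_demo():
    """Tokenize a toy corpus, stream it, then tokenize a query."""
    from bm25s.tokenization import Tokenizer

    corpus = [
        "a cat is a feline and likes to purr",
        "a dog is the human's best friend and loves to play",
    ]

    # Custom splitter: a plain callable instead of a regex pattern.
    tokenizer = Tokenizer(stopwords="en", splitter=lambda text: text.split())

    # Regular mode: one list of token ids per document; this builds the vocabulary.
    corpus_ids = tokenizer.tokenize(corpus, return_as="ids")

    # Generator mode: documents are tokenized lazily, one at a time.
    streamed_ids = [doc_ids for doc_ids in tokenizer.streaming_tokenize(corpus)]

    # Queries reuse the existing vocabulary instead of extending it.
    query_ids = tokenizer.tokenize(["cat purr"], update_vocab=False, return_as="ids")
    return corpus_ids, streamed_ids, query_ids


if HAVE_BM25S:
    corpus_ids, streamed_ids, query_ids = tokenizer_demo()
    print(len(corpus_ids), len(streamed_ids), len(query_ids))
else:
    print("bm25s is required: pip install bm25s")
```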
New stopwords
Stopwords for 10 languages (from NLTK) were added by @bm777 in #33
New JSON backend
`orjson` is now supported as a JSON backend, as it is faster than `ujson` and is actively maintained.
Weight mask
`BM25.retrieve` now supports a `weight_mask` array, which applies a weight (binary or float) to each retrieved document. This is useful, for example, if you want to use a binary mask to hide certain documents deemed irrelevant.
Dependency Notes
`orjson` replaces `ujson` as a `core` dependency.
`jax[cpu]` is no longer a `core` dependency, but a `selection` dependency now. Be careful not to use `backend_selection='jax'` if you don't have it installed!
`numba` is a new `core` dependency, allowing you to directly use `backend='numba'` when initializing a retriever.
`pytrec_eval` is a new `evaluation` dependency, which is useful if you want to use the evaluation function in `bm25s.utils.beir`, which is copied from the BEIR dataset.
Advanced Numba
Alternative Usage (advanced)
Here's an example of how to leverage numba speedups using the alternative method of activating the numba scorer and choosing the `backend_selection` manually. It is not recommended to use this method unless you specifically want more control over how the backend is activated.
Again, this method is only recommended if you want to have more control.
WARNING: it will not do well with multithreading. For the full example, see retrieve_with_numba_advanced.py
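The alternative method described in this section might look roughly like the sketch below. It assumes the retriever exposes an `activate_numba_scorer()` method and that `retrieve` accepts `backend_selection="numba"`. Both are recalled from the library rather than confirmed here, so defer to `retrieve_with_numba_advanced.py` if they differ. The helper name `advanced_numba_demo` and the toy corpus are hypothetical.

```python
import importlib.util

HAVE_DEPS = all(
    importlib.util.find_spec(name) is not None for name in ("bm25s", "numba")
)


def advanced_numba_demo():
    """Activate the numba scorer by hand and pick the selection backend."""
    import bm25s

    corpus = [
        "a cat is a feline and likes to purr",
        "a dog is the human's best friend and loves to play",
        "a bird is a beautiful animal that can fly",
    ]

    # Plain retriever: no backend="numba" at construction time.
    retriever = bm25s.BM25()
    retriever.index(bm25s.tokenize(corpus, stopwords="en"))

    # Assumed method name: swaps in the JIT-compiled scoring function.
    retriever.activate_numba_scorer()

    query_tokens = bm25s.tokenize("cat purr", stopwords="en")

    # Top-k selection backend is chosen explicitly per call.
    # n_threads is left at its default because of the multithreading warning.
    results, scores = retriever.retrieve(
        query_tokens, k=2, backend_selection="numba"
    )
    return results, scores


if HAVE_DEPS:
    results, scores = advanced_numba_demo()
    print(results.shape, scores.shape)
else:
    print('bm25s and numba are required: pip install "bm25s[core]"')
```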
This discussion was created from the release v0.2.0: Numba support, new `Tokenizer` class, more stopwords.