Tags: xhluca/bm25s

0.2.7pre1

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
Fix query filtering and vocabulary dict (#96)

* update readme

* fix: token ID in the query higher than the number of tokens in the index (#92)

* Fix query filtering by using a set of vocab dict

* add edge case when all tokens are integers

* fix allow true

* update tests to match new changes

* Fix changes to test

* Fix error during yield

---------

Co-authored-by: Nguyễn Hoàng Nhật <69780142+mossbee@users.noreply.github.com>
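
The fix above concerns query token IDs that exceed the vocabulary of the index, filtered via a set built from the vocab dict. A minimal sketch of that kind of filtering, with a hypothetical helper name (not bm25s's actual code):

```python
def filter_query_ids(query_ids, vocab):
    """Drop token IDs that do not exist in the index vocabulary.

    Membership is checked against a set of the vocab dict's IDs,
    so each token is filtered in O(1) instead of scanning the vocab.
    """
    valid_ids = set(vocab.values())  # vocab maps token string -> integer ID
    return [tid for tid in query_ids if tid in valid_ids]

vocab = {"hello": 0, "world": 1}
print(filter_query_ids([0, 1, 99], vocab))  # -> [0, 1]
```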

0.2.6

Added changes to load/save corpora with non-ASCII characters, with a unit test case (#93)

0.2.5

Update README.md (#87)

Clarify that the example returns docs, while the comment said it returned IDs

0.2.4

Update README.md

0.2.3

Make empty strings an acceptable token (#67)

* Update tokenization.py

* Update tokenization.py

* Remove ValueError; instead, make the empty string an accepted vocab item

* Allow empty strings to suppress errors

* Add tests for empty tokens

* Remove integer tokens that are not in the vocabulary during the retrieve call, to avoid running into an error

* Update retriever.retrieve to ensure lists of lists of ints do not contain ints missing from the vocabulary, while still allowing the empty string

* Consolidate tests into a single file
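
The 0.2.3 changes above combine two behaviors: the empty string becomes an accepted vocab item, and integer token IDs missing from the vocabulary are dropped before retrieval rather than raising an error. A minimal sketch of that sanitization step, with a hypothetical helper name (not bm25s's actual implementation):

```python
def sanitize_queries(queries, vocab):
    """Keep only integer token IDs that exist in the vocabulary,
    and treat the empty string as a valid token instead of erroring."""
    vocab.setdefault("", len(vocab))  # register "" as an accepted vocab item
    known_ids = set(vocab.values())
    # Non-integer tokens (strings, including "") pass through untouched;
    # unknown integer IDs are silently dropped.
    return [[t for t in q if not isinstance(t, int) or t in known_ids]
            for q in queries]

vocab = {"a": 0}
print(sanitize_queries([[0, 5, ""]], vocab))  # -> [[0, '']]
```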

0.2.2

Add features to improve memory usage (#63)

* Add `Results.merge` and `len(results)` methods

* Add a `BM25.load_scores` method for loading only the scores, and update `BM25.load` to use that function

* Refactor index/retrieve examples to remove dependencies and improve readability

* Add an example script that uses BM25.load_scores and JsonlCorpus.load to reload the mmapped scores/corpus, allowing lower memory usage

* Update readme to discuss memory usage optimization with mmap and mmap+reload

* Update query loading to only focus on specific qrels

* Add note about msmarco usage

* clarify post-retrieve scenario for readme

* Update main function to use a split arg

* Add note about msmarco

* Fix: issue where the max memory usage on mac is mislabeled

* Add a parameter to bm25.load that allows skipping loading of the vocabulary

* Update index_dir to depend on dataset

* Update indexing step to use tokenizer
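
The memory savings described in this release come from memory-mapping score files so the OS pages data in on access instead of loading everything up front. A small stdlib sketch of that general idea (the file layout here is invented for illustration and is not bm25s's storage format):

```python
import mmap
import os
import struct
import tempfile

# Write a toy score array to disk, one little-endian float64 per cell.
scores = [0.5, 1.25, 0.0, 2.0]
path = os.path.join(tempfile.mkdtemp(), "scores.bin")
with open(path, "wb") as f:
    f.write(struct.pack(f"<{len(scores)}d", *scores))

# Memory-map the file: bytes are paged in lazily as they are touched,
# so peak memory stays low even for large score matrices.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    # Read only the second score (offset = index * 8 bytes) without
    # materializing the whole file in memory.
    (second,) = struct.unpack_from("<d", mm, 1 * 8)

print(second)  # -> 1.25
```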

0.2.1

Add saving and loading corpus/stopwords to `Tokenizer` and add integration to HF Hub via `bm25s.hf.TokenizerHF` (save/load) (#59)

* Add save_vocab, load_vocab, save_stopwords, load_stopwords

* Add support to saving/loading vocabulary and stopwords to hub

* Improve auto-generated readme with section on tokenizer, fix error in example

0.2.0

This commit was signed with the committer's verified signature (xhluca, Xing Han Lu).
Add new examples using numba

0.2.0rc8

Improve tokenizer (#51)

* Add a tokenizer class (WIP)

* fix word_to_wid logic and add example for using tokenizer class

* WIP changes to make token ids valid inputs of retrieve, still need to be thoroughly tested

* add todo to example

* Major refactoring of the tokenizer class

* Minor QOL improvements

* Refactor streaming_tokenize to be faster by reducing unnecessary set checks

* Remove _word_to_wid to simplify vocab design. Now, word_to_id is updated when stemmer is not used

* Update beir.py utils to use new URL

* Remove unused function, lint code, add test cases

* Add example of using the tokenizer

* Rename example

* Add details about the new tokenizer class in readme

* Update class to test tokenize in int, ids, strings, tuple

0.2.0rc7

Refactor retrieval to make it faster to run in numba mode (#47)

* WIP

* Still WIP, seems to be working, cleanup TODO

* Add _tokenize_with_vocab (WIP)

* Rename to tokenize_with_vocab_exp

* simplify _tokenize_with_vocab_exp

* update _tokenize_with_vocab_exp

* Delete unnecessary comment and rename retrieve utils function

* Clean up retrieve_numba to make it compatible with retrieve when the BM25 object is initialized with backend="numba"

* Add backend saving

* Add tests for numba backend, including mmap

* Move test file to numba section