Tags: xhluca/bm25s
Fix query filtering and vocabulary dict (#96)
* Update readme
* Fix: token ID in the query higher than the number of tokens in the index (#92)
* Fix query filtering by using a set built from the vocab dict
* Add edge case when all tokens are integers
* Fix allow true
* Update tests to match the new changes
* Fix changes to test
* Fix error during yield

Co-authored-by: Nguyễn Hoàng Nhật <69780142+mossbee@users.noreply.github.com>
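Below is a minimal sketch of the behavior this release fixes, using the README-style `bm25s` API; the concrete ID values are illustrative only.

```python
import bm25s

corpus = ["a cat is a feline", "a dog is a canine", "birds can fly"]
retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(corpus))

# Queries may be passed as lists of token IDs. An ID beyond the index
# vocabulary (9999 here) previously caused an error; it is now filtered
# out against a set built from the vocab dict before scoring.
results, scores = retriever.retrieve([[0, 1, 9999]], k=2)
```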
Make empty strings an acceptable token (#67)
* Update tokenization.py
* Update tokenization.py
* Remove the ValueError; instead, make the empty string an accepted vocab item
* Allow empty strings to suppress errors
* Add tests for empty tokens
* Remove integer tokens that are not in the vocabulary during the retrieve call, to avoid running into an error
* Update retriever.retrieve to ensure a list of lists of ints contains no ints missing from the vocabulary, while still allowing the empty string
* Consolidate tests into a single file
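A sketch of the edge case this entry covers, assuming the standard `bm25s.tokenize` entry point: a query that reduces to nothing after stopword removal now maps to an empty-string token instead of raising.

```python
import bm25s

corpus = ["the quick brown fox", "lazy dogs sleep all day"]
retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(corpus, stopwords="en"))

# Every query word is an English stopword, so the token list is empty.
# This used to raise a ValueError during tokenization; the empty string
# is now an accepted vocab item and the query simply matches nothing.
query_tokens = bm25s.tokenize("the of and", stopwords="en")
results, scores = retriever.retrieve(query_tokens, k=1)
```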
Add features to improve memory usage (#63)
* Add `Results.merge` and `len(results)` methods
* Add a `BM25.load_scores` method for loading only the scores; update `BM25.load` to use it
* Refactor index/retrieve examples to remove dependencies and improve readability
* Add an example script that uses `BM25.load_scores` and `JsonlCorpus.load` to reload the mmapped scores/corpus, allowing lower memory usage
* Update readme to discuss memory usage optimization with mmap and mmap+reload
* Update query loading to only focus on specific qrels
* Add note about MS MARCO usage
* Clarify the post-retrieve scenario in the readme
* Update the main function to use a split arg
* Add note about MS MARCO
* Fix: issue where the max memory usage on Mac is mislabeled
* Add a parameter to `BM25.load` that allows skipping the vocabulary load
* Update index_dir to depend on the dataset
* Update the indexing step to use the tokenizer
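A sketch of the lower-memory reload path this release adds. `BM25.load_scores` and `JsonlCorpus` are named above; the exact parameter names (`mmap`, `load_corpus`, `load_vocab`), the import path, and the corpus file path are assumptions based on the project's README examples.

```python
import bm25s
from bm25s.utils.corpus import JsonlCorpus  # assumed import path

# Reload a saved index with memory-mapped score arrays, skipping the
# in-memory corpus and vocabulary to keep resident memory low.
retriever = bm25s.BM25.load(
    "index_dir", mmap=True, load_corpus=False, load_vocab=False
)

# JsonlCorpus reads documents from disk on demand, so only the
# retrieved hits are materialized after a retrieve call.
corpus = JsonlCorpus("index_dir/corpus.jsonl")  # hypothetical path
```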
Add saving and loading corpus/stopwords to `Tokenizer` and add integration to HF Hub via `bm25s.hf.TokenizerHF` (save/load) (#59)
* Add save_vocab, load_vocab, save_stopwords, load_stopwords
* Add support for saving/loading the vocabulary and stopwords to the hub
* Improve the auto-generated readme with a section on the tokenizer; fix an error in the example
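The four persistence methods below are named verbatim in the commit list; the directory argument and the overall flow are assumptions, sketched here to show the intent.

```python
from bm25s.tokenization import Tokenizer

tokenizer = Tokenizer(stopwords="en")
tokenizer.tokenize(["a cat is a feline", "a dog is a canine"])

# Persist the learned vocabulary and stopwords next to the index.
tokenizer.save_vocab("my_index")       # directory argument assumed
tokenizer.save_stopwords("my_index")

# Restore them later so queries are tokenized against the same vocab.
fresh = Tokenizer()
fresh.load_vocab("my_index")
fresh.load_stopwords("my_index")

# bm25s.hf.TokenizerHF mirrors these save/load methods against a
# Hugging Face Hub repo; its exact method names are not in this log.
```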
Improve tokenizer (#51)
* Add a tokenizer class (WIP)
* Fix word_to_wid logic and add an example of using the tokenizer class
* WIP changes to make token IDs valid inputs of retrieve; still needs to be thoroughly tested
* Add a todo to the example
* Major refactoring of the tokenizer class
* Minor QOL improvements
* Refactor streaming_tokenize to be faster by reducing unnecessary set checks
* Remove _word_to_wid to simplify the vocab design; word_to_id is now updated when no stemmer is used
* Update beir.py utils to use the new URL
* Remove an unused function, lint the code, add test cases
* Add an example of using the tokenizer
* Rename the example
* Add details about the new tokenizer class to the readme
* Update the test class to cover tokenize outputs as ints, IDs, strings, and tuples
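A sketch of the new `Tokenizer` class in use, following the README's PyStemmer pattern; the `return_as` values mirror the output kinds listed above (ids, strings, tuple) and should be treated as assumptions.

```python
import Stemmer  # PyStemmer, used in the project's examples
from bm25s.tokenization import Tokenizer

tokenizer = Tokenizer(stemmer=Stemmer.Stemmer("english"), stopwords="en")
corpus = ["a cat is a feline", "a dog is a canine"]

ids = tokenizer.tokenize(corpus, return_as="ids")          # token IDs
strings = tokenizer.tokenize(corpus, return_as="strings")  # token strings

# streaming_tokenize (named above) processes one document at a time,
# which is what the set-check reduction speeds up.
for doc_ids in tokenizer.streaming_tokenize(corpus):
    pass
```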
Refactor retrieval to make it faster to run in numba mode (#47)
* WIP
* Still WIP; seems to be working, cleanup TODO
* Add _tokenize_with_vocab (WIP)
* Rename to tokenize_with_vocab_exp
* Simplify _tokenize_with_vocab_exp
* Update _tokenize_with_vocab_exp
* Delete an unnecessary comment and rename the retrieve utils function
* Clean up retrieve_numba to make it compatible with retrieve when the BM25 object is initialized with backend="numba"
* Add backend saving
* Add tests for the numba backend, including mmap
* Move the test file to the numba section
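A sketch of opting into the numba path; `backend="numba"` is named in this PR, while the rest follows the standard index/retrieve flow (numba must be installed, and the first call pays a JIT compilation cost).

```python
import bm25s

corpus = ["a cat is a feline", "a dog is a canine", "birds can fly"]

# backend="numba" routes retrieval through the numba-compiled kernels
# introduced in this refactor.
retriever = bm25s.BM25(backend="numba")
retriever.index(bm25s.tokenize(corpus))

results, scores = retriever.retrieve(bm25s.tokenize("can cats fly?"), k=2)

# The backend choice is saved with the index ("backend saving" above),
# so a later BM25.load restores the numba-mode retriever.
retriever.save("animal_index")
```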