Tags: xhluca/bm25s
Fix query filtering and vocabulary dict (#96)
* Update readme
* Fix: token ID in the query higher than the number of tokens in the index (#92)
* Fix query filtering by using a set built from the vocab dict
* Add edge case when all tokens are integers
* Fix allow true
* Update tests to match the new changes
* Fix changes to test
* Fix error during yield

Co-authored-by: Nguyễn Hoàng Nhật <69780142+mossbee@users.noreply.github.com>
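Below is a minimal sketch of the behavior this release fixes, using the README-style `bm25s` API; the concrete ID values are illustrative only.

```python
import bm25s

corpus = ["a cat is a feline", "a dog is a canine", "birds can fly"]
retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(corpus))

# Queries may be passed as lists of token IDs. An ID beyond the index
# vocabulary (9999 here) previously caused an error; it is now filtered
# out against a set built from the vocab dict before scoring.
results, scores = retriever.retrieve([[0, 1, 9999]], k=2)
```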
Make empty strings an acceptable token (#67)
* Update tokenization.py
* Update tokenization.py
* Remove the ValueError; instead, make the empty string an accepted vocab item
* Allow empty strings to suppress errors
* Add tests for empty tokens
* Remove integer tokens that are not in the vocabulary during the retrieve call, to avoid running into an error
* Update retriever.retrieve to ensure a list of lists of ints contains no ints missing from the vocabulary, while still allowing the empty string
* Consolidate tests into a single file
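A sketch of the edge case this entry covers, assuming the standard `bm25s.tokenize` entry point: a query that reduces to nothing after stopword removal now maps to an empty-string token instead of raising.

```python
import bm25s

corpus = ["the quick brown fox", "lazy dogs sleep all day"]
retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(corpus, stopwords="en"))

# Every query word is an English stopword, so the token list is empty.
# This used to raise a ValueError during tokenization; the empty string
# is now an accepted vocab item and the query simply matches nothing.
query_tokens = bm25s.tokenize("the of and", stopwords="en")
results, scores = retriever.retrieve(query_tokens, k=1)
```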
Add features to improve memory usage (#63)
* Add `Results.merge` and `len(results)` methods
* Add a `BM25.load_scores` method for loading only the scores; update `BM25.load` to use it
* Refactor index/retrieve examples to remove dependencies and improve readability
* Add an example script that uses `BM25.load_scores` and `JsonlCorpus.load` to reload the mmapped scores/corpus, allowing lower memory usage
* Update readme to discuss memory usage optimization with mmap and mmap+reload
* Update query loading to only focus on specific qrels
* Add note about MS MARCO usage
* Clarify the post-retrieve scenario in the readme
* Update the main function to use a split arg
* Add note about MS MARCO
* Fix: issue where the max memory usage on Mac is mislabeled
* Add a parameter to `BM25.load` that allows skipping the vocabulary load
* Update index_dir to depend on the dataset
* Update the indexing step to use the tokenizer
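A sketch of the lower-memory reload path this release adds. `BM25.load_scores` and `JsonlCorpus` are named above; the exact parameter names (`mmap`, `load_corpus`, `load_vocab`), the import path, and the corpus file path are assumptions based on the project's README examples.

```python
import bm25s
from bm25s.utils.corpus import JsonlCorpus  # assumed import path

# Reload a saved index with memory-mapped score arrays, skipping the
# in-memory corpus and vocabulary to keep resident memory low.
retriever = bm25s.BM25.load(
    "index_dir", mmap=True, load_corpus=False, load_vocab=False
)

# JsonlCorpus reads documents from disk on demand, so only the
# retrieved hits are materialized after a retrieve call.
corpus = JsonlCorpus("index_dir/corpus.jsonl")  # hypothetical path
```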
Add saving and loading corpus/stopwords to `Tokenizer` and add integration to HF Hub via `bm25s.hf.TokenizerHF` (save/load) (#59)
* Add save_vocab, load_vocab, save_stopwords, load_stopwords
* Add support for saving/loading the vocabulary and stopwords to the hub
* Improve the auto-generated readme with a section on the tokenizer; fix an error in the example
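The four persistence methods below are named verbatim in the commit list; the directory argument and the overall flow are assumptions, sketched here to show the intent.

```python
from bm25s.tokenization import Tokenizer

tokenizer = Tokenizer(stopwords="en")
tokenizer.tokenize(["a cat is a feline", "a dog is a canine"])

# Persist the learned vocabulary and stopwords next to the index.
tokenizer.save_vocab("my_index")       # directory argument assumed
tokenizer.save_stopwords("my_index")

# Restore them later so queries are tokenized against the same vocab.
fresh = Tokenizer()
fresh.load_vocab("my_index")
fresh.load_stopwords("my_index")

# bm25s.hf.TokenizerHF mirrors these save/load methods against a
# Hugging Face Hub repo; its exact method names are not in this log.
```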
Improve tokenizer (#51)
* Add a tokenizer class (WIP)
* Fix word_to_wid logic and add an example of using the tokenizer class
* WIP changes to make token IDs valid inputs of retrieve; still needs to be thoroughly tested
* Add a todo to the example
* Major refactoring of the tokenizer class
* Minor QOL improvements
* Refactor streaming_tokenize to be faster by reducing unnecessary set checks
* Remove _word_to_wid to simplify the vocab design; word_to_id is now updated when no stemmer is used
* Update beir.py utils to use the new URL
* Remove an unused function, lint the code, add test cases
* Add an example of using the tokenizer
* Rename the example
* Add details about the new tokenizer class to the readme
* Update the test class to cover tokenize outputs as ints, IDs, strings, and tuples
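A sketch of the new `Tokenizer` class in use, following the README's PyStemmer pattern; the `return_as` values mirror the output kinds listed above (ids, strings, tuple) and should be treated as assumptions.

```python
import Stemmer  # PyStemmer, used in the project's examples
from bm25s.tokenization import Tokenizer

tokenizer = Tokenizer(stemmer=Stemmer.Stemmer("english"), stopwords="en")
corpus = ["a cat is a feline", "a dog is a canine"]

ids = tokenizer.tokenize(corpus, return_as="ids")          # token IDs
strings = tokenizer.tokenize(corpus, return_as="strings")  # token strings

# streaming_tokenize (named above) processes one document at a time,
# which is what the set-check reduction speeds up.
for doc_ids in tokenizer.streaming_tokenize(corpus):
    pass
```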
Refactor retrieval to make it faster to run in numba mode (#47)
* WIP
* Still WIP; seems to be working, cleanup TODO
* Add _tokenize_with_vocab (WIP)
* Rename to tokenize_with_vocab_exp
* Simplify _tokenize_with_vocab_exp
* Update _tokenize_with_vocab_exp
* Delete an unnecessary comment and rename the retrieve utils function
* Clean up retrieve_numba to make it compatible with retrieve when the BM25 object is initialized with backend="numba"
* Add backend saving
* Add tests for the numba backend, including mmap
* Move the test file to the numba section
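A sketch of opting into the numba path; `backend="numba"` is named in this PR, while the rest follows the standard index/retrieve flow (numba must be installed, and the first call pays a JIT compilation cost).

```python
import bm25s

corpus = ["a cat is a feline", "a dog is a canine", "birds can fly"]

# backend="numba" routes retrieval through the numba-compiled kernels
# introduced in this refactor.
retriever = bm25s.BM25(backend="numba")
retriever.index(bm25s.tokenize(corpus))

results, scores = retriever.retrieve(bm25s.tokenize("can cats fly?"), k=2)

# The backend choice is saved with the index ("backend saving" above),
# so a later BM25.load restores the numba-mode retriever.
retriever.save("animal_index")
```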