Add saving and loading corpus/stopwords to `Tokenizer` and add integration to HF Hub via `bm25s.hf.TokenizerHF` (save/load) #59

xhluca · 2024-09-22T18:02:30Z

- Add `Tokenizer.save_vocab` and `Tokenizer.load_vocab` methods to save/load the vocabulary to a JSON file, named `vocab.tokenizer.json` by default
- Add `Tokenizer.save_stopwords` and `Tokenizer.load_stopwords` methods to save/load the stopwords to a JSON file, named `stopwords.tokenizer.json` by default
- Add `TokenizerHF` class to allow saving to and loading from the Hugging Face Hub
  - New methods: `load_vocab_from_hub`, `save_vocab_to_hub`, `load_stopwords_from_hub`, `save_stopwords_to_hub`
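
For readers skimming this PR, the JSON round trip described above can be sketched as follows. This is a minimal illustration, not the code added by these commits: `SketchTokenizer`, its `add_to_vocab` helper, and the `word_to_id` layout are hypothetical stand-ins, and only the four method names and the two default file names come from the description. It uses the standard library only, so it does not show the Hub upload/download part.

```python
import json
import tempfile
from pathlib import Path


class SketchTokenizer:
    """Hypothetical stand-in; NOT the bm25s Tokenizer implementation."""

    def __init__(self, stopwords=()):
        self.stopwords = list(stopwords)
        # Assumed vocabulary layout: token -> integer id.
        self.word_to_id = {}

    def add_to_vocab(self, text):
        # Hypothetical helper: lowercase whitespace split, skipping stopwords.
        for word in text.lower().split():
            if word not in self.stopwords and word not in self.word_to_id:
                self.word_to_id[word] = len(self.word_to_id)

    def save_vocab(self, save_dir, vocab_name="vocab.tokenizer.json"):
        path = Path(save_dir)
        path.mkdir(parents=True, exist_ok=True)
        (path / vocab_name).write_text(json.dumps(self.word_to_id), encoding="utf-8")

    def load_vocab(self, save_dir, vocab_name="vocab.tokenizer.json"):
        raw = (Path(save_dir) / vocab_name).read_text(encoding="utf-8")
        self.word_to_id = json.loads(raw)

    def save_stopwords(self, save_dir, stopwords_name="stopwords.tokenizer.json"):
        path = Path(save_dir)
        path.mkdir(parents=True, exist_ok=True)
        (path / stopwords_name).write_text(json.dumps(self.stopwords), encoding="utf-8")

    def load_stopwords(self, save_dir, stopwords_name="stopwords.tokenizer.json"):
        raw = (Path(save_dir) / stopwords_name).read_text(encoding="utf-8")
        self.stopwords = json.loads(raw)


# Round trip: save from one instance, restore into a fresh one.
with tempfile.TemporaryDirectory() as tmp:
    original = SketchTokenizer(stopwords=["the", "a"])
    original.add_to_vocab("the cat sat on a mat")
    original.save_vocab(tmp)
    original.save_stopwords(tmp)

    restored = SketchTokenizer()
    restored.load_vocab(tmp)
    restored.load_stopwords(tmp)
    saved_files = sorted(p.name for p in Path(tmp).iterdir())
```

Keeping the vocabulary and the stopwords in separate files, as the description does, means either one can be restored without the other.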

xhluca added 3 commits September 22, 2024 10:56

Add save_vocab, load_vocab, save_stopwords, load_stopwords

d48e0c2

Add support to saving/loading vocabulary and stopwords to hub

8f1ca84

Improve auto-generated readme with section on tokenizer, fix error in example

e58ae2d

xhluca merged commit 1e636a9 into main Sep 22, 2024
2 checks passed

xhluca deleted the add-saving-loading-to-tokenizer branch September 22, 2024 18:05