Index out of bounds errors in 0.2.0 and 0.2.1 #60
Can you share reproducible code to run this (including dependencies)? I'd be happy to add it to tests/ (which I extensively updated for 0.2) so this doesn't happen in the future.
I've also encountered the same issue, but it only occurs under very specific circumstances: when all the strings in the corpus are empty. However, I think this situation should not result in an error, but should rather return an empty result.
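A minimal sketch of that all-empty-corpus scenario (assuming the top-level `bm25s.tokenize` API; not the commenter's original snippet):

```python
import bm25s

# Every document in the corpus is an empty string.
corpus = ["", "", ""]
corpus_tokens = bm25s.tokenize(corpus)

retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# In 0.2.0/0.2.1 this flow ended in an index-out-of-bounds error
# instead of an empty result.
results, scores = retriever.retrieve(bm25s.tokenize("hello"), k=1)
```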
Thank you! It seems like we should raise an exception when an empty string is encountered, as it should not be a valid input to the tokenizer. Returning an empty result is worse than failing, as that would mislead the API user into thinking the corpus is valid, when a corpus containing empty strings should not be.
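As a hypothetical user-side guard reflecting that contract (not part of the bm25s API), one could validate the corpus before tokenizing:

```python
from typing import List


def validate_corpus(corpus: List[str]) -> None:
    # Reject empty or whitespace-only documents up front, since the
    # maintainer considers them invalid tokenizer input.
    for i, doc in enumerate(corpus):
        if not doc.strip():
            raise ValueError(f"Document {i} is empty; remove it before indexing.")
```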
I'm getting this error when the tokens from my query are not in the vocabulary that I got when tokenizing and indexing the corpus. I'm new to this library and tried to use a custom tokenizer for my language, so maybe I'm doing something wrong...

```python
from typing import List, Union

import nltk
from pymystem3 import Mystem


class MystemTokenizer:
    def __init__(self, stopwords: Union[List[str], str] = "ru"):
        if stopwords == "ru":
            try:
                self.stopwords = nltk.corpus.stopwords.words("russian")
            except LookupError:
                nltk.download("stopwords")
                self.stopwords = nltk.corpus.stopwords.words("russian")
        else:
            # Store a custom stopword list as-is
            self.stopwords = stopwords
        self.mystem = Mystem()

    def tokenize(self, text: str) -> List[str]:
        # Lemmatize and tokenize using Mystem
        lemmas = self.mystem.lemmatize(text.lower())
        # Filter out non-letter tokens and stopwords
        tokens = [lemma for lemma in lemmas
                  if lemma.isalpha() and lemma not in self.stopwords]
        return tokens
```

```python
import os
from typing import List, Callable

from bm25s import BM25
from bm25s.tokenization import Tokenizer as BM25STokenizer


class BM25SIndexer:
    def __init__(self, corpus: List[str],
                 tokenizer: Callable[[str], List[str]]):
        self.corpus = corpus
        self.custom_tokenizer = tokenizer
        self.bm25s_tokenizer = BM25STokenizer(
            stemmer=None,
            stopwords=None,
            splitter=self.custom_tokenizer
        )
        self.bm25 = None

    def create_index(
            self,
            save_dir: str,
            save_corpus: bool = True,
            save_vocab: bool = True) -> None:
        # Ensure the save directory exists
        os.makedirs(save_dir, exist_ok=True)
        # Tokenize the corpus
        corpus_tokens = self.bm25s_tokenizer.tokenize(self.corpus)
        # Create and index the BM25 model
        self.bm25 = BM25()
        self.bm25.index(corpus_tokens)
        # Save the BM25 index
        self.bm25.save(save_dir, corpus=self.corpus if save_corpus else None)
        # Save vocabulary if needed
        if save_vocab:
            self.bm25s_tokenizer.save_vocab(save_dir=save_dir)
```

```python
import time
from typing import List, Dict, Any, Callable

import numpy as np
from sqlalchemy.orm import Session

import bm25s
from src.db import crud


class BM25Search:
    def __init__(self, db: Session, index_folder: str,
                 tokenizer: Callable[[str], List[str]]):
        self.db = db
        self.bm25s_tokenizer = bm25s.tokenization.Tokenizer(
            stemmer=None,
            stopwords=None,
            splitter=tokenizer
        )
        self.bm25s_tokenizer.load_vocab(index_folder)
        self.bm25 = bm25s.BM25.load(index_folder, load_corpus=False)

    def search(self, query: str, n: int = 3) -> Dict[str, Any]:
        start_time = time.time()
        query_tokens = self.bm25s_tokenizer.tokenize([query])
        doc_ids, scores = self.bm25.retrieve(query_tokens, k=n)
        ...
```

And I'm getting the index-out-of-bounds error described in this issue.
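For reference, a hypothetical way these classes would be wired together (corpus, paths, and query invented for illustration; `db` is stubbed with `None` since only the BM25 side matters for this issue):

```python
tokenizer = MystemTokenizer()

indexer = BM25SIndexer(
    corpus=["кошка сидит на ковре", "собака бежит по парку"],
    tokenizer=tokenizer.tokenize,
)
indexer.create_index(save_dir="bm25_index")

# A query whose lemmas are absent from the indexed vocabulary; this is
# the situation that triggered the error reported in this comment.
searcher = BM25Search(db=None, index_folder="bm25_index",
                      tokenizer=tokenizer.tokenize)
searcher.search("самолет летит", n=1)
```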
@Futyn-Maker thanks for expanding on this; it is not clear to me how the error happens. Could you share a minimally reproducible example in a single-file gist, without the use of databases (as they are hard for me to set up)?
@xhluca Yes, here it is: https://gist.github.com/Futyn-Maker/d3817aa9d0d5b89afd543387622612fb It contains a dummy custom tokenizer, a corpus, and a query containing tokens that do not appear in the corpus. It results in the error described in this issue.
@Futyn-Maker the reason there are errors is that (1) the corpus size is smaller than the default value of k=10, and (2) you are setting update_vocab=True, which creates new tokens that were never seen during indexing. For (1), there is already an error message that indicates why it fails (it is better to raise an error here and let the user correct it than to return an array shorter than k, which would be a silent failure). For (2), no error message should appear if the tokens are new; this should be corrected. Thank you for raising this; it should be added to PR #67.
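A sketch of the two conditions, mirroring the API usage from the comments above (not the gist itself):

```python
import bm25s
from bm25s.tokenization import Tokenizer

corpus = ["a cat sat on a mat", "a dog ran in the park"]  # only two documents

tokenizer = Tokenizer(stemmer=None, stopwords=None)
corpus_tokens = tokenizer.tokenize(corpus)

retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# (1) k larger than the corpus: bm25s raises a descriptive error instead
# of silently returning fewer than k results.
# retriever.retrieve(tokenizer.tokenize(["cat"]), k=10)

# (2) update_vocab=True assigns new IDs to query tokens never seen during
# indexing; before the fix, retrieving with them went out of bounds.
query_tokens = tokenizer.tokenize(["unseen words"], update_vocab=True)
results, scores = retriever.retrieve(query_tokens, k=1)
```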
@xhluca Thanks for the clarification! If I don't use `update_vocab=True`, how should queries with unseen tokens behave? Basically I'm talking about a real user scenario where a user might enter a query that has tokens that are not in the corpus/vocab; I would like some kind of result to be returned.
@Futyn-Maker the error should be fixed in this commit: 02fc3df. I also added your example as a test case, even with `update_vocab=True`. I will go ahead and merge the PR now, so people can benefit from the patch. Although I'm happy to fix issues when I have time on my hands, it is hard for me to handle all edge cases.
Thanks for making this library! With both versions 0.2.0 and 0.2.1 I get out-of-bounds errors for some of my queries. Here's the full stack trace:
I've got the following versions of the relevant packages installed:
I also tried with numpy 2.1.1. Downgrading to bm25s 0.1.10 makes everything work again.