Index out of bounds errors in 0.2.0 and 0.2.1 #60
Can you share reproducible code to run this (including dependencies)? I'd be happy to add it to tests/ (which I extensively updated for 0.2) so this doesn't happen in the future.
I've also encountered the same issue, but it only occurs under very specific circumstances: when all the strings in the corpus are empty. However, I think this situation should not result in an error, but should rather return an empty result.
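A minimal sketch of that all-empty-corpus scenario (assuming the top-level `bm25s.tokenize` API; not the commenter's original snippet):

```python
import bm25s

# Every document in the corpus is an empty string.
corpus = ["", "", ""]
corpus_tokens = bm25s.tokenize(corpus)

retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# In 0.2.0/0.2.1 this flow ended in an index-out-of-bounds error
# instead of an empty result.
results, scores = retriever.retrieve(bm25s.tokenize("hello"), k=1)
```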
Thank you! It seems like we should raise an exception when an empty string is encountered, as it should not be a valid input to the tokenizer. Returning an empty result is worse than failing, as that would mislead the API user into thinking the corpus is valid, when a corpus containing empty strings should not be.
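As a hypothetical user-side guard reflecting that contract (not part of the bm25s API), one could validate the corpus before tokenizing:

```python
from typing import List


def validate_corpus(corpus: List[str]) -> None:
    # Reject empty or whitespace-only documents up front, since the
    # maintainer considers them invalid tokenizer input.
    for i, doc in enumerate(corpus):
        if not doc.strip():
            raise ValueError(f"Document {i} is empty; remove it before indexing.")
```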
I'm getting this error when the tokens from my query are not in the vocabulary that I got when tokenizing and indexing the corpus. I'm new to this library and tried to use a custom tokenizer for my language, so maybe I'm doing something wrong...

```python
from typing import List, Union

import nltk
from pymystem3 import Mystem


class MystemTokenizer:
    def __init__(self, stopwords: Union[List[str], str] = "ru"):
        if stopwords == "ru":
            try:
                self.stopwords = nltk.corpus.stopwords.words("russian")
            except LookupError:
                nltk.download("stopwords")
                self.stopwords = nltk.corpus.stopwords.words("russian")
        else:
            # Store a custom stopword list as-is
            self.stopwords = stopwords
        self.mystem = Mystem()

    def tokenize(self, text: str) -> List[str]:
        # Lemmatize and tokenize using Mystem
        lemmas = self.mystem.lemmatize(text.lower())
        # Filter out non-letter tokens and stopwords
        tokens = [lemma for lemma in lemmas
                  if lemma.isalpha() and lemma not in self.stopwords]
        return tokens
```

```python
import os
from typing import List, Callable

from bm25s import BM25
from bm25s.tokenization import Tokenizer as BM25STokenizer


class BM25SIndexer:
    def __init__(self, corpus: List[str],
                 tokenizer: Callable[[str], List[str]]):
        self.corpus = corpus
        self.custom_tokenizer = tokenizer
        self.bm25s_tokenizer = BM25STokenizer(
            stemmer=None,
            stopwords=None,
            splitter=self.custom_tokenizer
        )
        self.bm25 = None

    def create_index(
            self,
            save_dir: str,
            save_corpus: bool = True,
            save_vocab: bool = True) -> None:
        # Ensure the save directory exists
        os.makedirs(save_dir, exist_ok=True)
        # Tokenize the corpus
        corpus_tokens = self.bm25s_tokenizer.tokenize(self.corpus)
        # Create and index the BM25 model
        self.bm25 = BM25()
        self.bm25.index(corpus_tokens)
        # Save the BM25 index
        self.bm25.save(save_dir, corpus=self.corpus if save_corpus else None)
        # Save vocabulary if needed
        if save_vocab:
            self.bm25s_tokenizer.save_vocab(save_dir=save_dir)
```

```python
import time
from typing import List, Dict, Any, Callable

import numpy as np
from sqlalchemy.orm import Session

import bm25s
from src.db import crud


class BM25Search:
    def __init__(self, db: Session, index_folder: str,
                 tokenizer: Callable[[str], List[str]]):
        self.db = db
        self.bm25s_tokenizer = bm25s.tokenization.Tokenizer(
            stemmer=None,
            stopwords=None,
            splitter=tokenizer
        )
        self.bm25s_tokenizer.load_vocab(index_folder)
        self.bm25 = bm25s.BM25.load(index_folder, load_corpus=False)

    def search(self, query: str, n: int = 3) -> Dict[str, Any]:
        start_time = time.time()
        query_tokens = self.bm25s_tokenizer.tokenize([query])
        doc_ids, scores = self.bm25.retrieve(query_tokens, k=n)
        ...
```

And I'm getting the index-out-of-bounds error described in this issue.
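For reference, a hypothetical way these classes would be wired together (corpus, paths, and query invented for illustration; `db` is stubbed with `None` since only the BM25 side matters for this issue):

```python
tokenizer = MystemTokenizer()

indexer = BM25SIndexer(
    corpus=["кошка сидит на ковре", "собака бежит по парку"],
    tokenizer=tokenizer.tokenize,
)
indexer.create_index(save_dir="bm25_index")

# A query whose lemmas are absent from the indexed vocabulary; this is
# the situation that triggered the error reported in this comment.
searcher = BM25Search(db=None, index_folder="bm25_index",
                      tokenizer=tokenizer.tokenize)
searcher.search("самолет летит", n=1)
```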
@Futyn-Maker thanks for expanding on this; it is not clear to me how the error happens. Could you share a minimally reproducible example in a single-file gist, without the use of databases (as they are hard for me to set up)?
@xhluca Yes, here it is: https://gist.github.com/Futyn-Maker/d3817aa9d0d5b89afd543387622612fb It contains a dummy custom tokenizer, a corpus, and a query containing tokens that do not appear in the corpus. It results in the error described in this issue.
@Futyn-Maker the reason there are errors is that (1) the corpus size is smaller than the default value of k=10, and (2) you are setting update_vocab=True, which creates new tokens that were never seen during indexing. For (1), there is already an error message that indicates why it fails (it is better to raise an error here and let the user correct it than to return an array shorter than k, which would be a silent failure). For (2), no error message should appear if the tokens are new; this should be corrected. Thank you for raising this; it should be added to PR #67.
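A sketch of the two conditions, mirroring the API usage from the comments above (not the gist itself):

```python
import bm25s
from bm25s.tokenization import Tokenizer

corpus = ["a cat sat on a mat", "a dog ran in the park"]  # only two documents

tokenizer = Tokenizer(stemmer=None, stopwords=None)
corpus_tokens = tokenizer.tokenize(corpus)

retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# (1) k larger than the corpus: bm25s raises a descriptive error instead
# of silently returning fewer than k results.
# retriever.retrieve(tokenizer.tokenize(["cat"]), k=10)

# (2) update_vocab=True assigns new IDs to query tokens never seen during
# indexing; before the fix, retrieving with them went out of bounds.
query_tokens = tokenizer.tokenize(["unseen words"], update_vocab=True)
results, scores = retriever.retrieve(query_tokens, k=1)
```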
@xhluca Thanks for the clarification! If I don't use `update_vocab=True`, how should queries with unseen tokens behave? Basically I'm talking about a real user scenario where a user might enter a query that has tokens that are not in the corpus/vocab; I would like some kind of result to be returned.
@Futyn-Maker the error should be fixed in this commit: 02fc3df. I also added your example as a test case, even with `update_vocab=True`. I will go ahead and merge the PR now, so people can benefit from the patch. Although I'm happy to fix issues when I have time on my hands, it is hard for me to handle all edge cases.
Thanks for making this library! With both versions 0.2.0 and 0.2.1 I get out-of-bounds errors for some of my queries. Here's the full stack trace:
I've got the following versions of the relevant packages installed:
I also tried with numpy 2.1.1. Downgrading to bm25s 0.1.10 makes everything work again.