Index out of bounds errors in 0.2.0 and 0.2.1 #60

Closed
hantusk opened this issue Sep 27, 2024 · 9 comments · Fixed by #67

Comments

@hantusk

hantusk commented Sep 27, 2024

Thanks for making this library! With both versions 0.2.0 and 0.2.1, I get out-of-bounds errors for some of my queries. Here's a full stack trace:

Traceback (most recent call last):
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Documents/kode/venv/venv.com/backend/demo/test/search/search.py", line 177, in root
    top_results, secondary_results = search(query)
                                     ^^^^^^^^^^^^^
  File "/Users/user/Documents/kode/venv/venv.com/backend/demo/test/search/search.py", line 297, in search
    tiktoken_results, tiktoken_bm25_scores = tiktoken_retriever.retrieve(
                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/bm25s/__init__.py", line 710, in retrieve
    scores, indices = zip(*out)
                      ^^^^^^^^^
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/bm25s/__init__.py", line 517, in _get_top_k_results
    scores_q = self.get_scores(query_tokens_single, weight_mask=weight_mask)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/bm25s/__init__.py", line 502, in get_scores
    return self.get_scores_from_ids(query_tokens_ids, weight_mask=weight_mask)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/bm25s/__init__.py", line 469, in get_scores_from_ids
    scores = self._compute_relevance_from_scores(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/micromamba/envs/venv/lib/python3.11/site-packages/bm25s/__init__.py", line 249, in _compute_relevance_from_scores
    indptr_starts = indptr[query_tokens_ids]
                    ~~~~~~^^^^^^^^^^^^^^^^^^
IndexError: index 32680 is out of bounds for axis 0 with size 8163

I have the following versions of the relevant packages installed:

scipy                             1.14.1
numpy                             2.0.2

I also tried with numpy 2.1.1. Downgrading to bm25s 0.1.10 makes everything work again.
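For context on what the traceback shows: the index stores one `indptr` entry per vocabulary term, so a query token ID larger than the vocabulary triggers the `IndexError` above. A minimal sketch of that mismatch (`filter_known_ids` is a hypothetical helper, not part of bm25s):

```python
def filter_known_ids(query_token_ids, vocab_size):
    """Drop token IDs that were never seen during indexing, so no
    out-of-range lookup into the per-term `indptr` array can occur."""
    return [tid for tid in query_token_ids if 0 <= tid < vocab_size]

# With the sizes from the traceback (vocabulary of 8163 terms,
# offending query ID 32680), the bad ID is simply dropped:
print(filter_known_ids([12, 32680, 5], vocab_size=8163))  # [12, 5]
```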

@xhluca
Owner

xhluca commented Sep 27, 2024

Can you share reproducible code to run this (including dependencies)? I'd be happy to add it to tests/ (which I extensively updated for 0.2) so this doesn't happen in the future.

@zhuwenxing

zhuwenxing commented Oct 12, 2024

I've also encountered the same issue, but it only occurs in a very specific circumstance: when all the strings in the corpus are empty.

However, I think this situation should not result in an error, but rather return an empty result.

Reproduction code:

import bm25s
from bm25s.tokenization import Tokenizer
import jieba
import re
jieba.initialize()
# Create your corpus here
corpus = [
    "",
    "",
    "",
    # "蛋白质设计方法发明者大卫·贝克、蛋白质结构预测方法发明者杰米斯·哈萨比斯和约翰·M·江珀获得诺贝尔化学奖。",
    # "在机器学习及类神经网络领域有重大贡献的约翰·霍普菲尔德和杰弗里·辛顿(图)获得诺贝尔物理学奖。",
]
queries = [
    "诺贝尔奖得主",
]
top_k = 2
for c in corpus:
    print(list(jieba.cut(c, cut_all=True)))

def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

# Tokenize the corpus
def jieba_split(text):
    text_without_punctuation = remove_punctuation(text)
    return jieba.cut(text_without_punctuation, cut_all=True)

splitter = jieba_split
stemmer = None
tokenizer = Tokenizer(
    stemmer=stemmer, splitter=splitter
)
corpus_tokens = tokenizer.tokenize(corpus, return_as="tuple")
print(corpus_tokens)
language = "zh"
splitter = jieba_split if language in ["zh", "cn", "chinese"] else None
stopwords = "english" if language in ["en", "english"] else [" "]
stemmer = None
tokenizer = Tokenizer(
    stemmer=stemmer, splitter=splitter, stopwords=stopwords
)
corpus_tokens = tokenizer.tokenize(corpus, return_as="tuple")
retriever = bm25s.BM25()
retriever.index(corpus_tokens)
query_tokens = tokenizer.tokenize(queries, return_as="tuple")
print(f"query_tokens: {query_tokens}")
results, scores = retriever.retrieve(query_tokens, corpus=corpus, k=top_k)
print(results)

Output:

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/lp/c86mh1zd5ps_gfp93p9k69sc0000gp/T/jieba.cache
Loading model cost 0.769 seconds.
Prefix dict has been built successfully.
[]
[]
[]
Tokenized(ids=[[], [], []], vocab={})
BM25S Compute Scores:   0%|          | 0/3 [00:00<?, ?it/s]/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/bm25s/scoring.py:106: RuntimeWarning: invalid value encountered in scalar divide
  return tf_array / (k1 * ((1 - b) + b * l_d / l_avg) + tf_array)
Traceback (most recent call last):
  File "/Users/zilliz/workspace/milvus/tests/python_client/bm25sdemo.py", line 50, in <module>
    results, scores = retriever.retrieve(query_tokens, corpus=corpus, k=top_k)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/bm25s/__init__.py", line 710, in retrieve
    scores, indices = zip(*out)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/bm25s/__init__.py", line 517, in _get_top_k_results
    scores_q = self.get_scores(query_tokens_single, weight_mask=weight_mask)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/bm25s/__init__.py", line 502, in get_scores
    return self.get_scores_from_ids(query_tokens_ids, weight_mask=weight_mask)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/bm25s/__init__.py", line 469, in get_scores_from_ids
    scores = self._compute_relevance_from_scores(
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/bm25s/__init__.py", line 249, in _compute_relevance_from_scores
    indptr_starts = indptr[query_tokens_ids]
IndexError: index 1 is out of bounds for axis 0 with size 1
query_tokens: Tokenized(ids=[[0, 1, 2, 3, 4, 5]], vocab={'诺贝': 0, '诺贝尔': 1, '诺贝尔奖': 2, '贝尔': 3, '奖得主': 4, '得主': 5})

@xhluca
Owner

xhluca commented Oct 14, 2024

Thank you! It seems we should raise an exception when an empty string is encountered, since it is not valid input for the tokenizer. Returning an empty result is worse than failing, as that would mislead the API user into thinking the corpus is valid, when a corpus containing empty strings is not.
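A minimal sketch of the validation suggested here (a hypothetical helper, not part of the bm25s API): fail fast on empty documents instead of crashing later inside scoring with a divide-by-zero warning or an `IndexError`.

```python
def validate_corpus(corpus):
    """Raise early if any document is empty or whitespace-only."""
    empty = [i for i, doc in enumerate(corpus) if not doc.strip()]
    if empty:
        raise ValueError(f"Corpus contains empty documents at indices {empty}")
    return corpus

validate_corpus(["some text", "more text"])  # passes unchanged
# validate_corpus(["", "", ""]) would raise ValueError listing indices [0, 1, 2]
```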

@Futyn-Maker

Futyn-Maker commented Oct 18, 2024

I'm getting this error when the tokens from my query are not in the vocabulary that I got when tokenizing and indexing the corpus. I'm new to this library and tried to use a custom tokenizer for my language, so maybe I'm doing something wrong...

from typing import List, Union

import nltk
from pymystem3 import Mystem


class MystemTokenizer():
    def __init__(self, stopwords: Union[List[str], str] = "ru"):
        if stopwords == "ru":
            try:
                self.stopwords = nltk.corpus.stopwords.words("russian")
            except LookupError:
                nltk.download("stopwords")
                self.stopwords = nltk.corpus.stopwords.words("russian")

        self.mystem = Mystem()

    def tokenize(self, text: str) -> List[str]:
        # Lemmatize and tokenize using Mystem
        lemmas = self.mystem.lemmatize(text.lower())

        # Filter out non-letter tokens and stopwords
        tokens = [lemma for lemma in lemmas if lemma.isalpha()
                  and lemma not in self.stopwords]

        return tokens

import os
from typing import List, Callable

from bm25s import BM25
from bm25s.tokenization import Tokenizer as BM25STokenizer


class BM25SIndexer:
    def __init__(self, corpus: List[str],
                 tokenizer: Callable[[str], List[str]]):
        self.corpus = corpus
        self.custom_tokenizer = tokenizer
        self.bm25s_tokenizer = BM25STokenizer(
            stemmer=None,
            stopwords=None,
            splitter=self.custom_tokenizer
        )
        self.bm25 = None

    def create_index(
            self,
            save_dir: str,
            save_corpus: bool = True,
            save_vocab: bool = True) -> None:
        # Ensure the save directory exists
        os.makedirs(save_dir, exist_ok=True)

        # Tokenize the corpus
        corpus_tokens = self.bm25s_tokenizer.tokenize(self.corpus)

        # Create and index the BM25 model
        self.bm25 = BM25()
        self.bm25.index(corpus_tokens)

        # Save the BM25 index
        self.bm25.save(save_dir, corpus=self.corpus if save_corpus else None)

        # Save vocabulary if needed
        if save_vocab:
            self.bm25s_tokenizer.save_vocab(save_dir=save_dir)

import time
from typing import List, Dict, Any, Callable

import numpy as np
from sqlalchemy.orm import Session
import bm25s
from src.db import crud


class BM25Search:
    def __init__(self, db: Session, index_folder: str,
                 tokenizer: Callable[[str], List[str]]):
        self.db = db
        self.bm25s_tokenizer = bm25s.tokenization.Tokenizer(
            stemmer=None,
            stopwords=None,
            splitter=tokenizer
        )
        self.bm25s_tokenizer.load_vocab(index_folder)
        self.bm25 = bm25s.BM25.load(index_folder, load_corpus=False)

    def search(self, query: str, n: int = 3) -> Dict[str, Any]:
        start_time = time.time()

        query_tokens = self.bm25s_tokenizer.tokenize([query])
        doc_ids, scores = self.bm25.retrieve(query_tokens, k=n)
...

This way I get a list index out of range error, and I get the error described in this issue if I use update_vocab=True, whenever my query contains unknown words.
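A possible caller-side workaround for this scenario (an illustrative sketch, not part of the library): filter query tokens against the loaded vocabulary before retrieval, so unknown words are dropped instead of producing out-of-range IDs. Here `vocab` stands in for the token-to-ID mapping loaded from the index folder.

```python
def keep_known_tokens(query_tokens, vocab):
    """Keep only query tokens that exist in the indexed vocabulary."""
    return [tok for tok in query_tokens if tok in vocab]

vocab = {"cat": 0, "dog": 1}
print(keep_known_tokens(["cat", "unseen_word"], vocab))  # ['cat']
```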

@xhluca
Owner

xhluca commented Oct 18, 2024

@Futyn-Maker Thanks for expanding on this; it is not clear to me how the error happens. Could you share a minimally reproducible example in a single-file gist, without the use of databases (as it is hard for me to set those up)?

@Futyn-Maker

Futyn-Maker commented Oct 18, 2024

@xhluca Yes, here it is: https://gist.github.com/Futyn-Maker/d3817aa9d0d5b89afd543387622612fb

It contains a dummy custom tokenizer, a corpus, and a query containing tokens that do not appear in the corpus. It results in the error described in this issue.

@xhluca
Owner

xhluca commented Oct 18, 2024

@Futyn-Maker The errors happen because (1) the corpus size is smaller than the default value of k=10, and (2) you are setting update_vocab=True, which creates new tokens that were never seen during indexing.

For 1, there is already an error message that indicates why it fails (it's better here to raise an error and let the user correct it than to return an array shorter than k, which would be a silent failure).

For 2, no error should be raised if the tokens are new. This should be corrected. Thank you for raising this; it should be added to PR #67.
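Point 1 above can be sketched as follows (illustrative only, not the library's code): with a 3-document corpus and the default k=10, there are fewer candidates than requested, so a caller-side fix is to clamp k to the corpus size.

```python
def effective_k(requested_k, corpus_size):
    """A retriever can return at most corpus_size results."""
    return min(requested_k, corpus_size)

print(effective_k(10, 3))  # a 3-document corpus yields at most 3 results
```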

@Futyn-Maker

@xhluca Thanks for the clarification! If I don't use update_vocab I get another error, list index out of range, because empty arrays appear in place of unknown tokens. As for corpus size, the problem with unknown tokens seems to persist even on a large corpus; I've seen this in my actual use of the library.

Basically, I'm describing a real user scenario: a user might enter a query containing tokens that are not in the corpus/vocab, and I'd like some kind of result to be returned.

@xhluca
Owner

xhluca commented Oct 18, 2024

@Futyn-Maker the error should be fixed in this commit: 02fc3df

I also added your example as a test case; even with update_vocab=True it should now be robust against this failure (although I could technically raise an error when unknown tokens are given, I want to avoid doing that for greater robustness).

I will go ahead and merge the PR now so people can benefit from the patch. Although I'm happy to fix issues when I have time on my hands, it is hard for me to handle every edge case (e.g. update_vocab=True in return_as="ids" mode, which leads to unseen token IDs when calling retrieve, a use case I did not originally intend to handle). Unlike commercial products with dedicated staff, this is a personal project. So if you run into further problems, feel free to contribute fixes and add test cases (see 02fc3df as an example), or consider using alternative implementations such as rank-bm25.
