
error occurs when retrieving from a different corpus when using a customized tokenizer #91

Open
mossbee opened this issue Dec 15, 2024 · 2 comments

Comments

@mossbee
Contributor

mossbee commented Dec 15, 2024

When I use a customized tokenizer, the vocab is saved by the retriever into vocab.index.json in a format like, for example:

{"0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5, "6": 6, "7": 7, "8": 8, "9": 9, "10": 10, "11": 11}

When I retrieve by IDs of query tokens, e.g. results, scores = reloaded_retriever.retrieve(query_tokens, corpus=titles, k=2), it leads to an error when querying a different corpus:

File "env\Lib\site-packages\bm25s_init_.py", line 486, in get_scores_from_ids
raise ValueError(
ValueError: The maximum token ID in the query (12) is higher than the number of tokens in the index.This likely means that the query contains tokens that are not in the index.
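
To illustrate the scenario, here is a minimal reproduction sketch; the corpus, the index directory name, and the query token IDs below are made up for illustration, and the built-in tokenizer is used just to build the index:

    import bm25s

    corpus = ["a cat is a feline", "a dog is a canine", "a bird can fly"]

    # Build and save an index (here with the built-in tokenizer).
    retriever = bm25s.BM25()
    retriever.index(bm25s.tokenize(corpus))
    retriever.save("index_dir")

    # Reload and query with raw token IDs, as a custom tokenizer would produce.
    # Any ID outside the saved vocab (e.g. 12 here) triggers the ValueError above.
    reloaded_retriever = bm25s.BM25.load("index_dir", load_corpus=True)
    query_tokens = [[0, 3, 12]]  # hypothetical token IDs
    results, scores = reloaded_retriever.retrieve(query_tokens, corpus=corpus, k=2)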

In the retrieve function of the BM25 class in bm25s/__init__.py, the retriever filters query token IDs against the vocab this way:

        # if it's a list of list of tokens ids (int), we remove any integer not in the vocab_dict
        if is_list_of_list_of_type(query_tokens, type_=int):
            query_tokens_filtered = []
            for query in query_tokens:
                query_filtered = [
                    token_id for token_id in query if token_id in self.vocab_dict
                ]
                if len(query_filtered) == 0:
                    if "" not in self.vocab_dict:
                        self.vocab_dict[""] = max(self.vocab_dict.values()) + 1
                    query_filtered = [self.vocab_dict[""]]

                query_tokens_filtered.append(query_filtered)

            query_tokens = query_tokens_filtered

I think a quick fix is to add .values() to the membership check in query_filtered, and it works:

    query_filtered = [
        token_id for token_id in query if token_id in self.vocab_dict.values()
    ]

Is this correct?

@xhluca
Owner

xhluca commented Dec 16, 2024

I'm concerned about the time complexity of in self.vocab_dict.values(), which is O(N), versus in self.vocab_dict, which is O(1). This should definitely be addressed in the PR.

You also mention that this is an issue with the custom tokenizer. What's the default tokenizer's output? Can you change your custom tokenizer to match the behavior of the default one?
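
For illustration, a constant-time variant of the check above could build the set of valid token IDs once and then filter against it; this is only a sketch of the idea inside the same method body, not the library's actual fix:

    # Hypothetical rewrite of the filtering loop quoted above.
    valid_ids = set(self.vocab_dict.values())  # one O(N) pass, done once

    query_tokens_filtered = []
    for query in query_tokens:
        # O(1) membership test per token ID
        query_filtered = [
            token_id for token_id in query if token_id in valid_ids
        ]
        if len(query_filtered) == 0:
            if "" not in self.vocab_dict:
                self.vocab_dict[""] = max(self.vocab_dict.values()) + 1
            query_filtered = [self.vocab_dict[""]]

        query_tokens_filtered.append(query_filtered)

    query_tokens = query_tokens_filtered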

@xhluca
Owner

xhluca commented Dec 29, 2024

@mossbee can you let me know if the new release 0.2.7pre1 works for you? I made substantial changes in #96, and want to make sure there are no side effects or regressions before releasing officially as 0.2.7.
