
error occurs when retrieving from a different corpus when using a customized tokenizer #91

Open
mossbee opened this issue Dec 15, 2024 · 2 comments

Comments

@mossbee
Contributor

mossbee commented Dec 15, 2024

When I use a customized tokenizer, the vocab is saved by the retriever into vocab.index.json in a format like, for example:

{"0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5, "6": 6, "7": 7, "8": 8, "9": 9, "10": 10, "11": 11}

When I retrieve by IDs of query tokens, e.g. results, scores = reloaded_retriever.retrieve(query_tokens, corpus=titles, k=2), it leads to an error when querying a different corpus:

File "env\Lib\site-packages\bm25s_init_.py", line 486, in get_scores_from_ids
raise ValueError(
ValueError: The maximum token ID in the query (12) is higher than the number of tokens in the index.This likely means that the query contains tokens that are not in the index.
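
To illustrate the scenario, here is a minimal reproduction sketch; the corpus, the index directory name, and the query token IDs below are made up for illustration, and the built-in tokenizer is used just to build the index:

    import bm25s

    corpus = ["a cat is a feline", "a dog is a canine", "a bird can fly"]

    # Build and save an index (here with the built-in tokenizer).
    retriever = bm25s.BM25()
    retriever.index(bm25s.tokenize(corpus))
    retriever.save("index_dir")

    # Reload and query with raw token IDs, as a custom tokenizer would produce.
    # Any ID outside the saved vocab (e.g. 12 here) triggers the ValueError above.
    reloaded_retriever = bm25s.BM25.load("index_dir", load_corpus=True)
    query_tokens = [[0, 3, 12]]  # hypothetical token IDs
    results, scores = reloaded_retriever.retrieve(query_tokens, corpus=corpus, k=2)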

In the retrieve function of the BM25 class in bm25s/__init__.py, the retriever filters query token IDs against the vocab this way:

        # if it's a list of list of tokens ids (int), we remove any integer not in the vocab_dict
        if is_list_of_list_of_type(query_tokens, type_=int):
            query_tokens_filtered = []
            for query in query_tokens:
                query_filtered = [
                    token_id for token_id in query if token_id in self.vocab_dict
                ]
                if len(query_filtered) == 0:
                    if "" not in self.vocab_dict:
                        self.vocab_dict[""] = max(self.vocab_dict.values()) + 1
                    query_filtered = [self.vocab_dict[""]]

                query_tokens_filtered.append(query_filtered)

            query_tokens = query_tokens_filtered

I think a quick fix is to add .values() to the membership check in query_filtered, and it works:

    query_filtered = [
        token_id for token_id in query if token_id in self.vocab_dict.values()
    ]

Is this correct?

@xhluca
Owner

xhluca commented Dec 16, 2024

I'm concerned about the time complexity of in self.vocab_dict.values(), which is O(N), versus in self.vocab_dict, which is O(1). This should definitely be addressed in the PR.

You also mention that this is an issue with the custom tokenizer. What's the default tokenizer's output? Can you change your custom tokenizer to match the behavior of the default one?
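
For illustration, a constant-time variant of the check above could build the set of valid token IDs once and then filter against it; this is only a sketch of the idea inside the same method body, not the library's actual fix:

    # Hypothetical rewrite of the filtering loop quoted above.
    valid_ids = set(self.vocab_dict.values())  # one O(N) pass, done once

    query_tokens_filtered = []
    for query in query_tokens:
        # O(1) membership test per token ID
        query_filtered = [
            token_id for token_id in query if token_id in valid_ids
        ]
        if len(query_filtered) == 0:
            if "" not in self.vocab_dict:
                self.vocab_dict[""] = max(self.vocab_dict.values()) + 1
            query_filtered = [self.vocab_dict[""]]

        query_tokens_filtered.append(query_filtered)

    query_tokens = query_tokens_filtered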

@xhluca
Owner

xhluca commented Dec 29, 2024

@mossbee can you let me know if the new release 0.2.7pre1 works for you? I made substantial changes in #96, and want to make sure there are no side effects or regressions before releasing officially as 0.2.7.
