When I retrieve using query token IDs, like
    results, scores = reloaded_retriever.retrieve(query_tokens, corpus=titles, k=2)
it raises an error when querying against a different corpus:
    File "env\Lib\site-packages\bm25s\__init__.py", line 486, in get_scores_from_ids
        raise ValueError(
    ValueError: The maximum token ID in the query (12) is higher than the number of tokens in the index. This likely means that the query contains tokens that are not in the index.
In the retrieve function of the BM25 class in bm25s\__init__.py, the retriever checks query token IDs against the vocab this way:
    # if it's a list of list of tokens ids (int), we remove any integer not in the vocab_dict
    if is_list_of_list_of_type(query_tokens, type_=int):
        query_tokens_filtered = []
        for query in query_tokens:
            query_filtered = [
                token_id for token_id in query if token_id in self.vocab_dict
            ]
            if len(query_filtered) == 0:
                if "" not in self.vocab_dict:
                    self.vocab_dict[""] = max(self.vocab_dict.values()) + 1
                query_filtered = [self.vocab_dict[""]]
            query_tokens_filtered.append(query_filtered)
        query_tokens = query_tokens_filtered
I think a quick fix is to check token_id in self.vocab_dict.values() when building query_filtered, and it works.
I'm concerned by the time complexity of in self.vocab_dict.values(), which is O(N), vs calling in self.vocab_dict which is O(1). This should definitely be addressed in the PR.
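To make the complexity concern concrete, here is a minimal sketch comparing the quick fix with a set-based lookup. The helper names (filter_query_linear, filter_query_set) and the sample vocab are hypothetical and are not part of bm25s. This is one possible way to keep O(1) membership checks, not necessarily what the eventual PR does.

```python
# Hypothetical vocab in the saved format: string keys, integer token IDs as values.
vocab_dict = {str(i): i for i in range(12)}


def filter_query_linear(query, vocab_dict):
    """Quick fix from the issue: each `in` scans every value, O(N) per token."""
    return [token_id for token_id in query if token_id in vocab_dict.values()]


def filter_query_set(query, vocab_ids):
    """Alternative: membership test against a precomputed set, O(1) on average."""
    return [token_id for token_id in query if token_id in vocab_ids]


# Build the set of known IDs once (O(N)), then reuse it for every query.
vocab_ids = set(vocab_dict.values())

query = [3, 11, 12, 40]  # 12 and 40 are not in the vocab
linear_result = filter_query_linear(query, vocab_dict)
set_result = filter_query_set(query, vocab_ids)

print(linear_result)  # [3, 11]
print(set_result)     # [3, 11]
```

Both versions return the same filtered IDs. The set version pays the O(N) cost once instead of once per token.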
You also mention that's an issue with the custom tokenizer. What's the default tokenizer's output? Can you change your custom tokenizer to match the behavior of the default one?
@mossbee can you let me know if the new release 0.2.7pre1 works for you? I made substantial changes in #96, and want to make sure there's no side effect and regression before releasing officially as 0.2.7.
When I use a custom tokenizer, the retriever saves the vocab to vocab.index.json in a format like this:
{"0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5, "6": 6, "7": 7, "8": 8, "9": 9, "10": 10, "11": 11}
Is this correct?