Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Document Similarity Check with passed Threshold #6845

Merged

Conversation

Sidchat95
Copy link
Contributor

@Sidchat95 Sidchat95 commented Jun 27, 2023

Converting the Similarity obtained in the similarity_search_with_score_by_vector method whilst comparing to the passed
threshold. This is because the passed threshold is a number between 0 to 1 and is already in the relevance_score_fn format.
As of now, the function is comparing two different scoring parameters and that wouldn't work.

Dependencies
None

Issue:
Different scores being compared in similarity_search_with_score_by_vector method in FAISS.

Tag maintainer
@hwchase17

… score threshold passed is the relevant score
@vercel
Copy link

vercel bot commented Jun 27, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Jul 13, 2023 4:01am

@Sidchat95
Copy link
Contributor Author

@pablocastro @farzad528 Kindly request you to please take a look.

Copy link
Contributor

@hwchase17 hwchase17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.relevance_score_fn is relevant for _similarity_search_with_relevance_scores

@pablocastro
Copy link

@Sidchat95 happy to help but you may have the wrong folks, @farzad528 and I work on Azure Cognitive Search, not on FAISS.

@Sidchat95
Copy link
Contributor Author

self.relevance_score_fn is relevant for _similarity_search_with_relevance_scores

Hi @hwchase17 , as I agree that self.relevance_score_fn is majorly used for _similarity_search_with_relevance_scores function but it is also effecting this stage as well.

Consider the following scenario where we have a vectorstore built out of FAISS vectorstore:

  1. We get the retriever as retriever = vectorstore.as_retriever() which is mostly used for conversational chains.
  2. Given that the search type is set to "similarity_score_threshold", when we call retriever.get_relevant_documents(query) by setting the score threshold to say 0.8, the following workflow is triggered chronologically. (Let us keep in mind that 0.8 threshold set here is the threshold set on the basis of a score of 0 to 1 which ideally is the output of relevance_score_fn.

The steps called are:

  •  self.vectorstore.similarity_search_with_relevance_scores is triggered with the threshold of 0.8 (Line 410 File langchain.vectorstore.base)
    
  •  The next step involves calling self._similarity_search_with_relevance_scores (Line 156 File langchain.vectorstore.base)
    
  •  Function _similarity_search_with_relevance_scores is called from file langchain.vectorstore.faiss which in turn calls self.similarity_search_with_score function with the same query and the same threshold of 0.8
    
  • Function similarity_search_with_score is called which in turn calls self.similarity_search_with_score_by_vector with the same query and threshold of 0.8
    
  • self.similarity_search_with_score_by_vector is called with the threshold of 0.8 and the query. This function first fetches the docs till line 229 of langchain.vectorstore.faiss and then gets the score_threshold set which is still 0.8 in this case. It then matches and filters according to the threshold which is 0.8. 
    
  • The score returned in the list docs with the similarities is not between 0 and 1 and thus comparing it to 0.8 will essentially not fetch any results. For example a score of 0.312 in the docs list similarity is actually 0.779 by using the function converter 1.0 - 0.312 / math.sqrt(2) which is why we need to first convert the similarity score fetched by docs and then compare it to the incoming threshold.

@Sidchat95
Copy link
Contributor Author

@Sidchat95 happy to help but you may have the wrong folks, @farzad528 and I work on Azure Cognitive Search, not on FAISS.

Thanks no problem. My bad !

@Sidchat95 Sidchat95 requested a review from hwchase17 June 29, 2023 18:10
@Sidchat95
Copy link
Contributor Author

@hwchase17 Request you to please look into this and approve the workflows

@dev2049 dev2049 force-pushed the Fix-FAISS-Score-Threshold-Comparision branch from 8c047c8 to d5f4f92 Compare July 13, 2023 00:01
@baskaryan
Copy link
Collaborator

updated so only affects similarity_search_with_relevance_score

@baskaryan baskaryan added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Jul 13, 2023
@Sidchat95
Copy link
Contributor Author

updated so only affects similarity_search_with_relevance_score

Thanks @baskaryan for the collaboration but changing it in similarity_search_with_relevance_score in FAISS renders Line 227 to 238 in method similarity_search_with_score_by_vector completely useless as this would always pop the threshold from the arguments. Should we remove the redundant code then ?

@baskaryan
Copy link
Collaborator

baskaryan commented Jul 13, 2023

updated so only affects similarity_search_with_relevance_score

Thanks @baskaryan for the collaboration but changing it in similarity_search_with_relevance_score in FAISS renders Line 227 to 238 in method similarity_search_with_score_by_vector completely useless as this would always pop the threshold from the arguments. Should we remove the redundant code then ?

it's useless if someone calls similarity_search_with_relevance_score but not useless if someones calls similarity_search_with_score_by_vector directly

@Sidchat95
Copy link
Contributor Author

updated so only affects similarity_search_with_relevance_score

Thanks @baskaryan for the collaboration but changing it in similarity_search_with_relevance_score in FAISS renders Line 227 to 238 in method similarity_search_with_score_by_vector completely useless as this would always pop the threshold from the arguments. Should we remove the redundant code then ?

it's useless if someone calls similarity_search_with_relevance_score but not useless if someones calls similarity_search_with_score_by_vector directly

Agreed. Thank you @baskaryan . Can we merge this now? I do not have the permissions to merge this actually and seems all checks have passed?

@baskaryan baskaryan merged commit c5e50c4 into langchain-ai:master Jul 13, 2023
@baskaryan
Copy link
Collaborator

done, thanks @Sidchat95!

@Sidchat95
Copy link
Contributor Author

done, thanks @Sidchat95!

Thank you @baskaryan as a part of this PR, can I get the recognition for this twitter account if possible for this? @procoretech

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm PR looks good. Use to confirm that a PR is ready for merging.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants