
Fix flaky test_scores_for_identical_examples unit test #1079

Merged

Conversation

@gogetron (Contributor) commented Apr 1, 2024

Summary

🎯 Purpose: Fixes #1074

This test was failing with the default parameters of the np.isclose function, which is called inside transform_distances_to_scores. Even when the input array is filled with identical values, the computed distances can exceed 1e-8, the default absolute tolerance of np.isclose. Empirically, 5e-6 seems to be a reasonable tolerance for an input array filled with a single repeated value anywhere between 1e-15 and 10. Moreover, the function should return all ones when given the same value repeated across the whole array, so this change fixes both the function's behavior and the flaky test (illustrated below).
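A minimal sketch of the tolerance issue (the nonzero residuals below are hypothetical stand-ins for the platform-dependent floating-point error, not measured values):

import numpy as np

# Average kNN distances for identical examples should be exactly 0, but
# floating-point error can leave tiny nonzero residuals on some platforms.
avg_distances = np.array([0.0, 2.4e-7, 5.0e-7])  # hypothetical residuals

# The default atol=1e-8 misses the nonzero residuals...
print(np.isclose(avg_distances, 0))              # [ True False False]
# ...while atol=5e-6 treats all of them as zero.
print(np.isclose(avg_distances, 0, atol=5e-6))   # [ True  True  True]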

Impact

🌐 Areas Affected: Out-of-distribution scores for examples whose average distances fall between 1e-8 and 5e-6. In the past these were given scores different from 1.0; after this PR they are exactly 1.0.

Testing

🔍 Testing Done: Existing test suite.

Unaddressed Cases


Under some conditions, arrays filled with a repeated value greater than 10 might still produce scores that are not equal to 1.0. There is a trade-off in how high we set the tolerance for np.isclose. I calibrated the new value with repeated values up to 10 because that was the limit established in the hypothesis test, so it should fix the current flaky test. A sketch of the trade-off follows.
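A minimal sketch of that trade-off, assuming the residual error grows with feature magnitude (the residual values below are hypothetical):

import numpy as np

atol = 5e-6

# Hypothetical floating-point residuals for distances between identical
# points: the residual tends to grow with the magnitude of the features.
residual_up_to_10 = 1e-7    # plausible for repeated values up to ~10
residual_above_10 = 2e-5    # plausible for much larger repeated values

print(np.isclose(residual_up_to_10, 0, atol=atol))  # True  -> score forced to 1.0
print(np.isclose(residual_above_10, 0, atol=atol))  # False -> score stays below 1.0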

Links to Relevant Issues or Conversations


Fixes #1074



codecov bot commented Apr 1, 2024

Codecov Report

Attention: Patch coverage is 91.30435% with 2 lines in your changes missing coverage. Please review.

Project coverage is 96.12%. Comparing base (abd0924) to head (1ca5ef6).
Report is 2 commits behind head on master.

Files Patch % Lines
cleanlab/internal/outlier.py 85.71% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1079      +/-   ##
==========================================
- Coverage   96.15%   96.12%   -0.03%     
==========================================
  Files          74       75       +1     
  Lines        5850     5885      +35     
  Branches     1044     1050       +6     
==========================================
+ Hits         5625     5657      +32     
- Misses        134      135       +1     
- Partials       91       93       +2     

☔ View full report in Codecov by Sentry.

@@ -70,6 +70,6 @@ def transform_distances_to_scores(
     ood_features_scores: np.ndarray = np.exp(-1 * avg_distances / scaling_factor * t)

     # Set scores to 1 if the average distance is close to 0
-    inds = np.isclose(avg_distances, 0)
+    inds = np.isclose(avg_distances, 0, atol=5e-6)
Member commented on the diff:
I don't think we should change our source code just to fix unit tests; we should fix the unit test instead. This atol value was added here for a reason, but I'll let @elisno share how best to proceed.

Member commented:

It turns out that the default atol was a bit too small for the euclidean distance, so we explicitly set the atol parameter based on how large the floating-point error can get for the distance computations.

It's still not 100% verified why this happens only on a few platforms for euclidean distances, but we're confident that we can use a MUCH lower tolerance for the cosine metric.
The takeaway: a single atol value won't work for different types of datasets; we need to set it carefully when comparing values against 0.
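To illustrate, a sketch of the kind of computation where the error shows up (whether the residual is exactly 0 depends on the platform and BLAS build, so the printed value may legitimately be 0.0 on your machine):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# 20 identical rows; all pairwise euclidean distances should be exactly 0.
features = np.full((20, 3), fill_value=0.6)

knn = NearestNeighbors(n_neighbors=3, metric="euclidean").fit(features)
distances, _ = knn.kneighbors(features)

# On some platforms these come back as tiny nonzero values (roughly
# 1e-8 to 1e-7) rather than 0, breaking np.isclose with atol=1e-8.
print(distances.max())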

Member commented:

It's not within the scope of this PR, but we may want to look at similar things for the near-duplicate check.

@gogetron (Contributor, Author) commented Apr 2, 2024

I completely agree with your comment. Maybe I should have changed the title of this PR, because I think the test is actually surfacing behavior that we do not expect. Here is one example:

import numpy as np

from cleanlab.outlier import OutOfDistribution

np.random.seed(0)

# 20 identical examples with 3 features each
N = 20
K = 3
fill_value = 0.6

features = np.full((N, K), fill_value=fill_value)

# Identical examples should be maximally in-distribution, yet every
# score comes back as exp(-1) ≈ 0.36787944 instead of 1.0.
OutOfDistribution().fit_score(features=features, verbose=False)
#     array([0.36787944, 0.36787944, 0.36787944, 0.36787944, 0.36787944,
#         0.36787944, 0.36787944, 0.36787944, 0.36787944, 0.36787944,
#         0.36787944, 0.36787944, 0.36787944, 0.36787944, 0.36787944,
#         0.36787944, 0.36787944, 0.36787944, 0.36787944, 0.36787944])

I would expect the scores to be much higher for this input data. With this PR, the code above produces an array of ones.

@gogetron (Contributor, Author) commented Apr 4, 2024

After an enriching discussion with @elisno, we noticed that on some platforms there are numerical issues caused by floating-point representation when calculating the distances: distances that should be exactly 0 come back as very small values that are not exactly 0.

This last push calculates a tolerance value that depends on the metric used for the calculation. When the avg_knn_distances fall below this tolerance, they are assumed to be zero and the score of that point is set to 1. This functionality is provided by the newly added function correct_precision_errors, sketched below.
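A minimal sketch of the idea behind correct_precision_errors (not the exact implementation merged into cleanlab/internal/outlier.py; the tolerance constants below are illustrative assumptions):

import numpy as np

def correct_precision_errors_sketch(
    scores: np.ndarray,
    avg_distances: np.ndarray,
    metric: str,
) -> np.ndarray:
    # Pick a tolerance based on the metric: euclidean distance computations
    # accumulate far more rounding error than cosine distances do.
    if metric == "euclidean":
        atol = 1e-6  # illustrative; the real value depends on feature magnitudes
    elif metric == "cosine":
        atol = 1e-14  # cosine tolerates a much tighter threshold
    else:
        atol = 1e-8  # np.isclose's default
    corrected = scores.copy()
    # Distances within floating-point noise of 0 indicate identical examples,
    # which should receive the maximum score of 1.0.
    corrected[np.isclose(avg_distances, 0, atol=atol)] = 1.0
    return corrected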

The flaky test now produces stable results, and an error_message was added to the assert for future debugging: if the test ever fails again, it prints the calculated distances to better show why.

Future work: The newly added test includes some minor changes for the euclidean metric, verifying that almost-identical features still get scores of 1. It would be great to add similar coverage for the cosine distance; a sketch of the kind of check is shown below.
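A hedged sketch of that kind of check (the test name, perturbation scale, and seed are hypothetical, not the merged test):

import numpy as np
from cleanlab.outlier import OutOfDistribution

def test_scores_for_nearly_identical_examples_sketch():
    # Identical rows plus noise far below the precision-error tolerance.
    rng = np.random.default_rng(0)
    features = np.full((20, 3), fill_value=0.6)
    features = features + rng.normal(scale=1e-12, size=features.shape)

    scores = OutOfDistribution().fit_score(features=features, verbose=False)

    # Nearly identical examples should all score 1.0; include the scores
    # in the failure message to aid future debugging.
    assert np.all(scores == 1.0), f"unexpected scores: {scores}"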

@elisno (Member) left a comment:

Thank you so much for working on this @gogetron!

Dealing with these kinds of numerical issues was not easy, as they weren't easily reproducible on all platforms, and debugging was far from straightforward.

But this PR LGTM! Seems like the affected tests should behave now.


EDIT: It's not within the scope of this PR, but we may want to look at similar changes for the near-duplicate check in Datalab.

@elisno merged commit 76dfdd2 into cleanlab:master on Apr 6, 2024
19 of 21 checks passed