Fix flaky test_scores_for_identical_examples unit test fails #1079
Conversation
Codecov Report

```text
@@            Coverage Diff             @@
##           master    #1079      +/-   ##
==========================================
- Coverage   96.15%   96.12%   -0.03%
==========================================
  Files          74       75       +1
  Lines        5850     5885      +35
  Branches     1044     1050       +6
==========================================
+ Hits         5625     5657      +32
- Misses        134      135       +1
- Partials       91       93       +2
```

View full report in Codecov by Sentry.
cleanlab/internal/outlier.py (Outdated)

```diff
@@ -70,6 +70,6 @@ def transform_distances_to_scores(
     ood_features_scores: np.ndarray = np.exp(-1 * avg_distances / scaling_factor * t)

     # Set scores to 1 if the average distance is close to 0
-    inds = np.isclose(avg_distances, 0)
+    inds = np.isclose(avg_distances, 0, atol=5e-6)
```
I don't think we should be changing our source code to fix unit tests; we should instead fix the unit test. This atol value was added here for a reason, but I'll let @elisno share how to best proceed.
It turns out that the default `atol` was a bit too small for the euclidean distance, so we explicitly set the `atol` parameter based on how large the floating-point error can get for the distance computations.

It's still not 100% verified why this happens on only a few platforms for the euclidean distances, but we're confident that we can use a MUCH lower tolerance for the cosine metric.

The takeaway is: a single `atol` value won't work for different types of datasets; we need to set it carefully when comparing values against 0.
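
To illustrate the underlying issue, here is a minimal sketch (not the PR's code; the use of scikit-learn's `NearestNeighbors`, the 5-neighbor setup, and the fill value are illustration choices): on some platforms, kNN distances between identical rows come back as tiny nonzero values that the default `atol` of `np.isclose` does not absorb.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Sketch: identical rows should have distance exactly 0, but on some
# platforms floating-point rounding yields tiny nonzero euclidean distances.
features = np.full((20, 3), 0.6)
knn = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(features)
distances, _ = knn.kneighbors(features)
avg_distances = distances.mean(axis=1)

print(avg_distances.max())                            # may be ~1e-7 instead of exactly 0.0
print(np.isclose(avg_distances, 0).all())             # can be False (default atol=1e-8)
print(np.isclose(avg_distances, 0, atol=5e-6).all())  # True with the looser tolerance
```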
It's not within the scope of this PR, but we may want to look at similar things for the near-duplicate check.
I completely agree with your comment. Maybe I should have changed the title of this PR, because I actually think the test is surfacing behavior that we do not expect. Here is one example:

```python
import numpy as np
from cleanlab.outlier import OutOfDistribution

np.random.seed(0)
N = 20
K = 3
fill_value = 0.6
features = np.full((N, K), fill_value=fill_value)
OutOfDistribution().fit_score(features=features, verbose=False)
# array([0.36787944, 0.36787944, 0.36787944, 0.36787944, 0.36787944,
#        0.36787944, 0.36787944, 0.36787944, 0.36787944, 0.36787944,
#        0.36787944, 0.36787944, 0.36787944, 0.36787944, 0.36787944,
#        0.36787944, 0.36787944, 0.36787944, 0.36787944, 0.36787944])
```

I would expect the score to be much higher for this input data. With this new PR this code produces an array of ones.
After an enriching discussion with @elisno, we have noticed that on some platforms there are numerical issues caused by floating-point representation when calculating the distances. The issue is that distances that should be exactly 0 are given very low values, but not exactly 0.

This last push calculates a tolerance value depending on the metric used for the calculation. When the avg_knn_distances are below this tolerance, they are assumed to be zero and we can set the score of those points to 1. This functionality is provided by a newly added helper function.

The flaky test now produces stable results, and an error message was added to the assert function for future debugging: if the test fails again, it prints the calculated distances to better understand why.

Future work: the newly added test includes some minor changes when the metric is euclidean, to verify that almost-identical features still get scores of 1. It would be great to add similar behavior for the cosine distance.
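
A minimal sketch of the metric-dependent tolerance idea described above (the helper name `metric_tolerance`, the error model, and the constants are assumptions for illustration, not the PR's actual implementation):

```python
import numpy as np

def metric_tolerance(features: np.ndarray, metric: str) -> float:
    """Hypothetical helper: bound the floating-point error of distance
    computations so near-zero distances can be treated as exact zeros."""
    eps = np.finfo(features.dtype).eps
    if metric == "cosine":
        # Cosine distances are scale-invariant, so a very small bound suffices.
        return 100 * eps
    # Euclidean error grows with the magnitude of the feature values:
    # squared distances are often computed from dot products, whose error is
    # on the order of eps * ||x||^2, giving a distance error ~ sqrt(eps) * ||x||.
    max_norm = np.linalg.norm(features, axis=1).max()
    return np.sqrt(eps) * max_norm

# Usage sketch: treat average kNN distances below the tolerance as zero.
features = np.full((20, 3), 0.6)
avg_distances = np.full(20, 1e-8)   # pretend distances carrying float error
scores = np.exp(-avg_distances)     # simplified scoring for illustration
scores[avg_distances < metric_tolerance(features, "euclidean")] = 1.0
print(scores)  # all 1.0: the noisy near-zero distances were treated as zeros
```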
Thank you so much for working on this @gogetron!
Dealing with these kinds of numerical issues was not easy as they weren't easily reproducible on all platforms, and debugging was far from straightforward.
But this PR LGTM! Seems like the affected tests should behave now.
EDIT: It's not within the scope of this PR, but we may want to look at similar changes for the near-duplicate check in Datalab.
Summary
This test was failing with default parameters for the `np.isclose` function, which is called in the `transform_distances_to_scores` function. However, it seems that even when the array contains the same value repeated, the computed distances can be above 1e-8, which is the default `atol` for `np.isclose`. I have found empirically that 5e-6 seems to be a reasonable value for an input array filled with a repeated value anywhere between 1e-15 and 10. In addition, I would expect the function to return ones when given the same value repeated across the whole array, so this change fixes both the function and the flaky test.
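
For reference, a small demonstration of the default tolerance behavior (the example distances are made up to illustrate the point):

```python
import numpy as np

# np.isclose(a, 0) uses atol=1e-8 by default, so distances with
# floating-point error around 1e-7 are NOT treated as zero.
avg_distances = np.array([0.0, 3e-7, 1.2])      # hypothetical avg kNN distances
print(np.isclose(avg_distances, 0))             # [ True False False]
print(np.isclose(avg_distances, 0, atol=5e-6))  # [ True  True False]
```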
Impact

Testing
Unaddressed Cases
Under some conditions, values higher than 10 repeated over the whole array might still result in scores that are not equal to 1.0. There is a trade-off in how high we want to set the tolerance for `np.isclose`. I have calibrated the new value with values up to 10 because that was the limit established in the hypothesis test, so it should resolve the current flaky test.
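
A rough back-of-envelope sketch of why the tolerance depends on the magnitude of the values (the eps-based error model is an assumption, not a measured bound):

```python
import numpy as np

# Squared euclidean distances are often computed from dot products
# (x.x - 2*x.y + y.y), so the error in the squared distance is roughly
# eps * ||x||^2, giving a distance error around sqrt(eps) * ||x||.
for fill in (1.0, 10.0, 1e3):
    x = np.full(3, fill)
    err = np.sqrt(np.finfo(np.float64).eps) * np.linalg.norm(x)
    print(fill, err)  # ~2.6e-08, ~2.6e-07, ~2.6e-05; the last exceeds atol=5e-6
```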
Links to Relevant Issues or Conversations

Fixes #1074