
Fix flaky test_scores_for_identical_examples unit test #1079

Merged

Conversation

@gogetron (Contributor) commented Apr 1, 2024

Summary

🎯 Purpose: Fixes #1074

This test was failing with the default parameters of the np.isclose function, which is called inside transform_distances_to_scores. Even when the input array is filled with identical values, the computed distances can exceed 1e-8, the default absolute tolerance of np.isclose. Empirically, 5e-6 seems to be a reasonable tolerance for an input array filled with a single repeated value anywhere between 1e-15 and 10. Moreover, the function should return all ones when given the same value repeated across the whole array, so this change fixes both the function's behavior and the flaky test (illustrated below).
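A minimal sketch of the tolerance issue (the nonzero residuals below are hypothetical stand-ins for the platform-dependent floating-point error, not measured values):

import numpy as np

# Average kNN distances for identical examples should be exactly 0, but
# floating-point error can leave tiny nonzero residuals on some platforms.
avg_distances = np.array([0.0, 2.4e-7, 5.0e-7])  # hypothetical residuals

# The default atol=1e-8 misses the nonzero residuals...
print(np.isclose(avg_distances, 0))              # [ True False False]
# ...while atol=5e-6 treats all of them as zero.
print(np.isclose(avg_distances, 0, atol=5e-6))   # [ True  True  True]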

Impact

🌐 Areas Affected: Out-of-distribution scores for examples whose average distances fall between 1e-8 and 5e-6. In the past these were given scores different from 1.0; after this PR they are exactly 1.0.

Testing

🔍 Testing Done: Existing test suite.

Unaddressed Cases


Under some conditions, arrays filled with a repeated value greater than 10 might still produce scores that are not equal to 1.0. There is a trade-off in how high we set the tolerance for np.isclose. I calibrated the new value with repeated values up to 10 because that was the limit established in the hypothesis test, so it should fix the current flaky test. A sketch of the trade-off follows.
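A minimal sketch of that trade-off, assuming the residual error grows with feature magnitude (the residual values below are hypothetical):

import numpy as np

atol = 5e-6

# Hypothetical floating-point residuals for distances between identical
# points: the residual tends to grow with the magnitude of the features.
residual_up_to_10 = 1e-7    # plausible for repeated values up to ~10
residual_above_10 = 2e-5    # plausible for much larger repeated values

print(np.isclose(residual_up_to_10, 0, atol=atol))  # True  -> score forced to 1.0
print(np.isclose(residual_above_10, 0, atol=atol))  # False -> score stays below 1.0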

Links to Relevant Issues or Conversations


Fixes #1074



codecov bot commented Apr 1, 2024

Codecov Report

Attention: Patch coverage is 91.30435% with 2 lines in your changes missing coverage. Please review.

Project coverage is 96.12%. Comparing base (abd0924) to head (1ca5ef6).
Report is 2 commits behind head on master.

Files Patch % Lines
cleanlab/internal/outlier.py 85.71% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1079      +/-   ##
==========================================
- Coverage   96.15%   96.12%   -0.03%     
==========================================
  Files          74       75       +1     
  Lines        5850     5885      +35     
  Branches     1044     1050       +6     
==========================================
+ Hits         5625     5657      +32     
- Misses        134      135       +1     
- Partials       91       93       +2     

☔ View full report in Codecov by Sentry.

@@ -70,6 +70,6 @@ def transform_distances_to_scores(
     ood_features_scores: np.ndarray = np.exp(-1 * avg_distances / scaling_factor * t)

     # Set scores to 1 if the average distance is close to 0
-    inds = np.isclose(avg_distances, 0)
+    inds = np.isclose(avg_distances, 0, atol=5e-6)
Member commented on the diff:
I don't think we should change our source code just to fix unit tests; we should fix the unit test instead. This atol value was added here for a reason, but I'll let @elisno share how best to proceed.

Member commented:

It turns out that the default atol was a bit too small for the euclidean distance, so we explicitly set the atol parameter based on how large the floating-point error can get for the distance computations.

It's still not 100% verified why this happens only on a few platforms for euclidean distances, but we're confident that we can use a MUCH lower tolerance for the cosine metric.
The takeaway: a single atol value won't work for different types of datasets; we need to set it carefully when comparing values against 0.
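To illustrate, a sketch of the kind of computation where the error shows up (whether the residual is exactly 0 depends on the platform and BLAS build, so the printed value may legitimately be 0.0 on your machine):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# 20 identical rows; all pairwise euclidean distances should be exactly 0.
features = np.full((20, 3), fill_value=0.6)

knn = NearestNeighbors(n_neighbors=3, metric="euclidean").fit(features)
distances, _ = knn.kneighbors(features)

# On some platforms these come back as tiny nonzero values (roughly
# 1e-8 to 1e-7) rather than 0, breaking np.isclose with atol=1e-8.
print(distances.max())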

Member commented:

It's not within the scope of this PR, but we may want to look at similar things for the near-duplicate check.

@gogetron (Contributor, Author) commented Apr 2, 2024

I completely agree with your comment. Maybe I should have changed the title of this PR, because I think the test is actually surfacing behavior that we do not expect. Here is one example:

import numpy as np

from cleanlab.outlier import OutOfDistribution

np.random.seed(0)

# 20 identical examples with 3 features each
N = 20
K = 3
fill_value = 0.6

features = np.full((N, K), fill_value=fill_value)

# Identical examples should be maximally in-distribution, yet every
# score comes back as exp(-1) ≈ 0.36787944 instead of 1.0.
OutOfDistribution().fit_score(features=features, verbose=False)
#     array([0.36787944, 0.36787944, 0.36787944, 0.36787944, 0.36787944,
#         0.36787944, 0.36787944, 0.36787944, 0.36787944, 0.36787944,
#         0.36787944, 0.36787944, 0.36787944, 0.36787944, 0.36787944,
#         0.36787944, 0.36787944, 0.36787944, 0.36787944, 0.36787944])

I would expect the scores to be much higher for this input data. With this PR, the code above produces an array of ones.

@gogetron (Contributor, Author) commented Apr 4, 2024

After an enriching discussion with @elisno, we noticed that on some platforms there are numerical issues caused by floating-point representation when calculating the distances: distances that should be exactly 0 come back as very small values that are not exactly 0.

This last push calculates a tolerance value that depends on the metric used for the calculation. When the avg_knn_distances fall below this tolerance, they are assumed to be zero and the score of that point is set to 1. This functionality is provided by the newly added function correct_precision_errors, sketched below.
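A minimal sketch of the idea behind correct_precision_errors (not the exact implementation merged into cleanlab/internal/outlier.py; the tolerance constants below are illustrative assumptions):

import numpy as np

def correct_precision_errors_sketch(
    scores: np.ndarray,
    avg_distances: np.ndarray,
    metric: str,
) -> np.ndarray:
    # Pick a tolerance based on the metric: euclidean distance computations
    # accumulate far more rounding error than cosine distances do.
    if metric == "euclidean":
        atol = 1e-6  # illustrative; the real value depends on feature magnitudes
    elif metric == "cosine":
        atol = 1e-14  # cosine tolerates a much tighter threshold
    else:
        atol = 1e-8  # np.isclose's default
    corrected = scores.copy()
    # Distances within floating-point noise of 0 indicate identical examples,
    # which should receive the maximum score of 1.0.
    corrected[np.isclose(avg_distances, 0, atol=atol)] = 1.0
    return corrected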

The flaky test now produces stable results, and an error_message was added to the assert for future debugging: if the test ever fails again, it prints the calculated distances to better show why.

Future work: The newly added test includes some minor changes for the euclidean metric, verifying that almost-identical features still get scores of 1. It would be great to add similar coverage for the cosine distance; a sketch of the kind of check is shown below.
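A hedged sketch of that kind of check (the test name, perturbation scale, and seed are hypothetical, not the merged test):

import numpy as np
from cleanlab.outlier import OutOfDistribution

def test_scores_for_nearly_identical_examples_sketch():
    # Identical rows plus noise far below the precision-error tolerance.
    rng = np.random.default_rng(0)
    features = np.full((20, 3), fill_value=0.6)
    features = features + rng.normal(scale=1e-12, size=features.shape)

    scores = OutOfDistribution().fit_score(features=features, verbose=False)

    # Nearly identical examples should all score 1.0; include the scores
    # in the failure message to aid future debugging.
    assert np.all(scores == 1.0), f"unexpected scores: {scores}"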

@elisno (Member) left a comment:

Thank you so much for working on this @gogetron!

Dealing with these kinds of numerical issues was not easy, as they weren't easily reproducible on all platforms, and debugging was far from straightforward.

But this PR LGTM! Seems like the affected tests should behave now.


EDIT: It's not within the scope of this PR, but we may want to look at similar changes for the near-duplicate check in Datalab.

@elisno merged commit 76dfdd2 into cleanlab:master on Apr 6, 2024
19 of 21 checks passed