Refine Scoring and Enhance Stability for Datasets with Identical Examples #1056

elisno · 2024-03-19T04:07:49Z

This PR ensures that in an edge case where all examples in a dataset have identical features, they have the expected outlier- and near-duplicate scores.

It also raises the floor applied to median values to slightly above machine epsilon, which avoids numerically unstable operations caused by floating-point round-off error.
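To illustrate why a floor on the median matters, here is a minimal NumPy-only sketch. It does not use cleanlab's internals: the brute-force distance computation and the `FLOOR` constant are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.tile(rng.random(5), (20, 1))  # 20 identical rows

# Brute-force pairwise Euclidean distances, ignoring self-distances.
diffs = X[:, None, :] - X[None, :, :]
dists = np.sqrt((diffs**2).sum(axis=-1))
np.fill_diagonal(dists, np.inf)
nn_dist = dists.min(axis=1)   # distance to the closest neighbor: all zeros
median = np.median(nn_dist)   # also zero

# Scaling by the raw median divides 0 by 0 and yields NaN everywhere.
with np.errstate(divide="ignore", invalid="ignore"):
    naive = nn_dist / median

# Hypothetical floor, a small multiple of machine epsilon (not cleanlab's actual constant).
FLOOR = 100 * np.finfo(np.float64).eps
floored = nn_dist / max(median, FLOOR)  # well-defined: all zeros
```

With the floor in place the scaled distances stay finite, so any score derived from them is well-defined even when every example is identical.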

MWE

A small dataset of 20 examples with identical features contains only (exact) duplicates and no outliers.
The expected overall scores are therefore 0.0 for near-duplicates and 1.0 for outliers.

import numpy as np
from cleanlab import Datalab

# Creating a dataset with all identical examples
X = np.full((20, 5), fill_value=np.random.rand(5))
lab = Datalab(data={"X": X})
lab.find_issues(features=X)

lab.report(show_all_issues=True)

# ...
# ------------------ near_duplicate issues -------------------
# ...
# Overall dataset quality in terms of this issue: 0.0000
# ...
# ---------------------- outlier issues ----------------------
# Overall dataset quality in terms of this issue: 1.0000
# ...

Not covered in this PR

Consider label issues for such a dataset.
Special considerations for datasets with a single, large set of exact duplicates (much larger than what may be found in a typical knn-graph).
- The near-duplicate sets belonging to two exact duplicate examples should be equivalent (up to the indices of the exact duplicates themselves). The transitive relation among these sets must be addressed in a separate PR.

Cases in modules like cleanlab.object_detection.filter where machine epsilon appears in the denominator of a division.

cleanlab/cleanlab/object_detection/filter.py

Lines 305 to 307 in 589edd6

    
    eps = np.finfo(np.float32).eps
    recalls = true_positives / np.maximum(num_gts[:, np.newaxis], eps)
    precisions = true_positives / np.maximum((true_positives + false_positives), eps)
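A small sketch of the pattern quoted above, with made-up input arrays, shows what the epsilon floor does and one possible alternative. The `np.divide(..., where=...)` variant is only a suggestion for comparison and is not something this PR or cleanlab implements.

```python
import numpy as np

eps = np.finfo(np.float32).eps

# Illustrative values only: the first class has no ground-truth boxes at all.
true_positives = np.array([[0.0, 0.0], [1.0, 2.0]])
num_gts = np.array([0.0, 2.0])

# Pattern from the quoted snippet: clamp the denominator to eps.
recalls_eps = true_positives / np.maximum(num_gts[:, np.newaxis], eps)

# Possible alternative: skip the division where the denominator is zero.
denom = np.broadcast_to(num_gts[:, np.newaxis], true_positives.shape)
recalls_where = np.divide(
    true_positives,
    denom,
    out=np.zeros_like(true_positives),
    where=denom > 0,
)

# If a numerator were ever nonzero while the denominator is clamped,
# the quotient would be scaled by 1 / eps, which is 2**23 for float32.
blown_up = np.float32(1.0) / eps
```

Both variants agree as long as a zero denominator always comes with a zero numerator. The clamped form relies on that invariant silently, while the masked form makes it explicit.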

Other tiny values appear in similar places, for example:

cleanlab/cleanlab/internal/constants.py

Line 23 in 61e9d6d

TINY_VALUE = 1e-100 # very tiny value for clipping

cleanlab/cleanlab/internal/latent_algebra.py

Line 127 in 61e9d6d

inverse_noise_matrix = joint.T / np.clip(ps, a_min=TINY_VALUE, a_max=None)
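The clipping pattern above can be tried on a toy input. The `joint` matrix below is invented for illustration, and `ps` is assumed here to be its row sums; only `TINY_VALUE` and the division line are taken from the quoted code.

```python
import numpy as np

TINY_VALUE = 1e-100  # very tiny value for clipping (as in the quoted constants module)

# Toy joint matrix whose second row is entirely zero (assumption for illustration).
joint = np.array([[0.6, 0.4],
                  [0.0, 0.0]])
ps = joint.sum(axis=1)  # assumed definition: row sums, here [1.0, 0.0]

# Clipping replaces the zero denominator with TINY_VALUE, so 0 / 0 becomes 0 / 1e-100.
inverse_noise_matrix = joint.T / np.clip(ps, a_min=TINY_VALUE, a_max=None)

# The same floor would magnify any nonzero numerator by roughly 1e100.
magnification = 1.0 / TINY_VALUE
```

In this toy case the result is finite because the zero denominator is paired with zero numerators. The magnification factor indicates how large an entry could become if that pairing did not hold.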

Related Issues / PRs

This PR partially addresses #1055.

Increase minimum scaling factor to avoid numerical instability when using it as a denominator in a division. Manually give score 0.0 to examples whose distance to their closest neighbor is approximately equal to zero (up to floating-point round-off error).
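The two steps described in this commit message can be sketched roughly as follows. This is not cleanlab's implementation: the function name, the `tanh` squashing, and the tolerance of 100 machine epsilons are all assumptions chosen to make the idea concrete.

```python
import numpy as np

def near_duplicate_scores(nn_distances, min_scale=None):
    """Hypothetical scoring sketch: 0.0 means exact duplicate, values near 1.0 mean far from any neighbor."""
    nn_distances = np.asarray(nn_distances, dtype=np.float64)
    eps = np.finfo(np.float64).eps
    if min_scale is None:
        min_scale = 100 * eps  # assumed floor, slightly above machine epsilon

    # Step 1: never divide by a scaling factor smaller than the floor.
    scale = max(float(np.median(nn_distances)), min_scale)
    scores = np.tanh(nn_distances / scale)

    # Step 2: distances that are zero up to round-off get a score of exactly 0.0.
    approx_zero = np.isclose(nn_distances, 0.0, rtol=0.0, atol=100 * eps)
    scores[approx_zero] = 0.0
    return scores

all_identical = near_duplicate_scores(np.zeros(20))
with_one_distinct = near_duplicate_scores([0.0, 0.0, 0.0, 1e-17, 0.5])
```

In the second call, the distance `1e-17` is treated as round-off and scored 0.0, while the genuinely distinct example at distance 0.5 saturates to 1.0 because the median (and hence the scale) is at the floor.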


codecov · 2024-03-19T04:52:32Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.20%. Comparing base (589edd6) to head (3330daf).

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1056      +/-   ##
==========================================
- Coverage   96.25%   96.20%   -0.05%     
==========================================
  Files          74       74              
  Lines        5845     5849       +4     
  Branches     1044     1044              
==========================================
+ Hits         5626     5627       +1     
- Misses        130      132       +2     
- Partials       89       90       +1


elisno added 3 commits March 19, 2024 03:02

Fix near duplicate issue scoring for examples only with identical features

f3c1c60

Add end-to-end tests for all identical examples dataset (without labels)

61e9d6d

Includes a separate test that adds another random example (different from all the identical ones).

elisno changed the title ~~Set up~~ Set up unit tests for datasets of examples with only identical features Mar 19, 2024

elisno requested a review from aditya1503 March 19, 2024 04:15

reduce number of examples in hypothesis test for ood scores

3330daf

elisno changed the title ~~Set up unit tests for datasets of examples with only identical features~~ Refine Scoring and Enhance Stability for Datasets with Identical Examples Mar 19, 2024

elisno merged commit 2f2bc1f into cleanlab:master Mar 19, 2024
20 of 21 checks passed

elisno added the usability-improvement Making features more intuitive and accessible for users. label Mar 19, 2024
