Refine Scoring and Enhance Stability for Datasets with Identical Examples #1056

Merged (4 commits) on Mar 19, 2024

elisno
Member

@elisno elisno commented Mar 19, 2024

This PR ensures that, in the edge case where all examples in a dataset have identical features, the examples receive the expected outlier and near-duplicate scores.

It also raises the minimum value allowed for the median (which is used as a scaling factor) slightly above machine epsilon, to avoid numerically unstable divisions caused by floating-point round-off error.
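
To illustrate why a floor of exactly machine epsilon can be too small, here is a minimal sketch using only the standard library. The helper `scaled_distances`, the sample distances, and the `100 * EPS` floor are assumptions made for this illustration; they are not taken from the cleanlab source.

```python
import statistics
import sys

EPS = sys.float_info.epsilon  # machine epsilon for Python floats (float64)


def scaled_distances(distances, min_scale):
    """Divide each distance by the median distance, floored at `min_scale`.

    Hypothetical helper for illustration only.
    """
    scale = max(statistics.median(distances), min_scale)
    return [d / scale for d in distances]


# Distances between "identical" examples that are nonzero only due to round-off.
roundoff_distances = [0.0, 1e-16, 2e-16, 1e-16, 0.0]

# Floor of exactly eps: the round-off noise is scaled up to order 1.
with_eps_floor = max(scaled_distances(roundoff_distances, min_scale=EPS))

# Larger (assumed) floor: the same noise stays close to zero after scaling.
with_larger_floor = max(scaled_distances(roundoff_distances, min_scale=100 * EPS))

print(with_eps_floor, with_larger_floor)
```

With the smaller floor, values that should be treated as zero end up comparable in size to genuine scaled distances, so any score computed from them can look like a real signal.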

MWE

A small dataset of 20 examples with identical features contains only (exact) duplicates and no outliers.
The dataset-level scores for these issue types will therefore be 0.0 for near-duplicates and 1.0 for outliers.

import numpy as np
from cleanlab import Datalab

# Creating a dataset with all identical examples
X = np.full((20, 5), fill_value=np.random.rand(5))
lab = Datalab(data={"X": X})
lab.find_issues(features=X)

lab.report(show_all_issues=True)

# ...
# ------------------ near_duplicate issues -------------------
# ...
# Overall dataset quality in terms of this issue: 0.0000
# ...
# ---------------------- outlier issues ----------------------
# Overall dataset quality in terms of this issue: 1.0000
# ...

Not covered in this PR

  • Consider label issues for such a dataset.
  • Special considerations for datasets with a single, large set of exact duplicates (much larger than what may be found in a typical knn-graph).
    • The near-duplicate sets belonging to two exact duplicate examples should be equivalent (up to the indices of the exact duplicates themselves). The transitive relation among these sets must be addressed in a separate PR.
  • Cases in modules like cleanlab.object_detection.filter.py where a machine epsilon appears in the denominator of a division:
    eps = np.finfo(np.float32).eps
    recalls = true_positives / np.maximum(num_gts[:, np.newaxis], eps)
    precisions = true_positives / np.maximum((true_positives + false_positives), eps)
  • Other tiny values that appear in similar places, such as:
    TINY_VALUE = 1e-100 # very tiny value for clipping

    inverse_noise_matrix = joint.T / np.clip(ps, a_min=TINY_VALUE, a_max=None)
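
As a rough illustration of why these clamped denominators may deserve a closer look, the sketch below mimics the pattern with plain Python floats. The helper `safe_divide` is hypothetical and exists only for this example; `EPS32` equals the value of `np.finfo(np.float32).eps`.

```python
EPS32 = 2.0 ** -23   # same value as np.finfo(np.float32).eps
TINY_VALUE = 1e-100  # very tiny value for clipping, as in the snippet above


def safe_divide(numerator, denominator, floor):
    """Divide after clamping the denominator from below (illustration only)."""
    return numerator / max(denominator, floor)


# A zero numerator over a zero denominator is harmless: the result is 0.0.
print(safe_divide(0.0, 0.0, EPS32))

# A tiny, noise-sized numerator over a zero denominator is not:
# the result depends entirely on which floor was chosen.
noise = 1e-9
print(safe_divide(noise, 0.0, EPS32))
print(safe_divide(noise, 0.0, TINY_VALUE))
```

The clamp prevents a division by zero, but the magnitude of the result is then set by the floor rather than by the data, which is the same kind of instability this PR addresses for the scaling factor.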

Related Issues / PRs

This PR partially addresses #1055.

elisno added 3 commits March 19, 2024 03:02
Increase minimum scaling factor to avoid numerical instability when using it as a denominator in a division.
Manually give score 0.0 to examples whose distance to their closest neighbor is approximately equal to zero (up to floating-point round-off error).
Includes a separate test that adds another random example (different from all the identical ones).
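
The second commit's idea can be sketched as follows. This is a simplified stand-in, not the code from the PR: the function name `near_duplicate_score`, the `tanh` mapping, and the tolerance `atol=1e-12` are all assumptions chosen for illustration.

```python
import math


def near_duplicate_score(nn_distance, scale, atol=1e-12):
    """Map a nearest-neighbor distance to a score in [0, 1).

    A score of 0.0 marks an exact duplicate. Hypothetical helper.
    """
    # Distances that are zero up to round-off are treated as exactly zero,
    # so the score does not depend on how the noise is scaled.
    if math.isclose(nn_distance, 0.0, abs_tol=atol):
        return 0.0
    return math.tanh(nn_distance / scale)


# Round-off-sized distance: forced to exactly 0.0 regardless of the scale.
print(near_duplicate_score(3e-17, scale=1e-14))

# Genuinely distinct neighbor: scored by the usual mapping.
print(near_duplicate_score(0.5, scale=1.0))
```

Without the explicit check, the first call would return a small but nonzero score, and exact duplicates would no longer be guaranteed the minimum score.
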
@elisno elisno changed the title from "Set up" to "Set up unit tests for datasets of examples with only identical features" Mar 19, 2024
@elisno elisno requested a review from aditya1503 March 19, 2024 04:15

codecov bot commented Mar 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.20%. Comparing base (589edd6) to head (3330daf).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1056      +/-   ##
==========================================
- Coverage   96.25%   96.20%   -0.05%     
==========================================
  Files          74       74              
  Lines        5845     5849       +4     
  Branches     1044     1044              
==========================================
+ Hits         5626     5627       +1     
- Misses        130      132       +2     
- Partials       89       90       +1     


@elisno elisno changed the title from "Set up unit tests for datasets of examples with only identical features" to "Refine Scoring and Enhance Stability for Datasets with Identical Examples" Mar 19, 2024
@elisno elisno merged commit 2f2bc1f into cleanlab:master Mar 19, 2024
20 of 21 checks passed
@elisno elisno added the usability-improvement Making features more intuitive and accessible for users. label Mar 19, 2024