Refine Scoring and Enhance Stability for Datasets with Identical Examples #1056
This PR ensures that in the edge case where all examples in a dataset have identical features, the examples receive the expected outlier and near-duplicate scores.
It raises the lower bound applied to the median values to slightly above machine epsilon, which avoids numerically unstable operations caused by floating-point round-off errors.
MWE
A small dataset of 20 examples with identical features contains only (exact) duplicates and no outliers.
The expected scores are therefore 0.0 for the near-duplicate issue type and 1.0 for the outlier issue type (lower scores indicate more severe issues).
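The edge case can be illustrated with a minimal NumPy-only sketch. It is not cleanlab's implementation: the helper `toy_scores`, the constant `MIN_SCALE`, and the `exp`/`tanh` score transforms are assumptions chosen only to show why a median of zero needs a lower bound before it is used as a divisor.

```python
import numpy as np

# Hypothetical floor for illustration only; the constant used in the PR may differ.
MIN_SCALE = 10 * np.finfo(np.float64).eps


def toy_scores(features, floor=MIN_SCALE):
    """Toy outlier and near-duplicate scores from nearest-neighbor distances."""
    # Pairwise Euclidean distances between all examples.
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Mask the diagonal so an example is never its own nearest neighbor.
    np.fill_diagonal(dists, np.inf)
    nn_dist = dists.min(axis=1)

    # The median distance is the scale; clamp it so it can never be zero.
    scale = max(np.median(nn_dist), floor)

    outlier_scores = np.exp(-nn_dist / scale)         # 1.0 means "not an outlier"
    near_duplicate_scores = np.tanh(nn_dist / scale)  # 0.0 means "exact duplicate"
    return outlier_scores, near_duplicate_scores


# 20 examples with identical features.
features = np.ones((20, 4))
outlier_scores, near_duplicate_scores = toy_scores(features)

# Without the floor, the same division is 0/0 and produces NaN.
with np.errstate(invalid="ignore"):
    unstable = np.zeros(20) / np.median(np.zeros(20))
```

With the floor in place, every nearest-neighbor distance of zero maps to an outlier score of 1.0 and a near-duplicate score of 0.0. Without it, the division yields NaN.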
Not covered in this PR
cleanlab/cleanlab/object_detection/filter.py, lines 305 to 307 in 589edd6
cleanlab/cleanlab/internal/constants.py, line 23 in 61e9d6d
cleanlab/cleanlab/internal/latent_algebra.py, line 127 in 61e9d6d
Related Issues / PRs
This PR partially addresses #1055.