
Predictions for regression in Datalab #902

Merged: 38 commits merged into master on Dec 7, 2023

Conversation


@elisno (Member) commented Nov 22, 2023

Summary

Introduce the predictions argument to RegressionLabelIssueManager.find_issues.
The pred_probs argument will still be used at the Datalab level. Internally, the predictions key picks up the same value as the pred_probs key when setting up issue managers (for regression).

# Example code snippet

from cleanlab import Datalab
from sklearn.datasets import make_regression

# Generate a toy regression dataset
N = 100
X, y, W = make_regression(n_samples=N, n_features=3, noise=0.1, random_state=0, coef=True)

# Get predictions from a regression model (simplified for illustration;
# ideally these would be out-of-sample predictions, e.g. from cross-validation).
y_pred = X @ W

# The regression `predictions` are passed to find_issues via the `pred_probs` argument
lab = Datalab(data={"y": y}, label_name="y", task="regression")
lab.find_issues(pred_probs=y_pred, issue_types={"label": {}})

## Outputs
# Finding label issues ...
#
# Audit complete. 7 issues found in the dataset.

Testing

🔍 Testing Done: Added specific tests for the RegressionLabelIssueManager. Wrote additional tests that run Datalab for regression with this new argument.

Links to Relevant Issues or Conversations

# How to set a threshold for flagging issues based on predictions
lab.find_issues(pred_probs=..., issue_types={"label": {"threshold": 1.0}})
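
For context, here is a minimal sketch of how a threshold like this can turn per-example label quality scores into issue flags, reusing y and y_pred from the example above. The call to cleanlab.regression.rank.get_label_quality_scores and the simple score-below-cutoff rule are illustrative assumptions, not necessarily the issue manager's internal logic.

from cleanlab.regression.rank import get_label_quality_scores

# Per-example label quality scores in [0, 1]; lower means more likely mislabeled.
scores = get_label_quality_scores(labels=y, predictions=y_pred)

# Illustrative flagging rule (assumption): flag examples whose score falls below a cutoff.
cutoff = 0.05
is_label_issue = scores < cutoff
print(f"{is_label_issue.sum()} examples flagged as potential label issues")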

…eManager

Move feature-based and prediction-based strategies to different helper functions within the module.

Add several tests for the issue_manager and the find_issues method.
Datalab resolves conflicts between the pred_probs and the predictions arguments, depending on the given task.
Update the **kwargs during the call to the helper function for finding issues to include the issue manager's threshold.
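
At the user level, these two strategies correspond roughly to passing precomputed predictions versus passing raw features and letting cleanlab fit a model internally. A minimal sketch, reusing X, y, and y_pred from the summary example (the feature-based call assumes the regression label check accepts a plain feature matrix):

# Prediction-based strategy: supply precomputed (ideally out-of-sample) predictions.
lab_pred = Datalab(data={"y": y}, label_name="y", task="regression")
lab_pred.find_issues(pred_probs=y_pred, issue_types={"label": {}})

# Feature-based strategy: pass the feature matrix and let cleanlab fit its own model.
lab_feat = Datalab(data={"y": y}, label_name="y", task="regression")
lab_feat.find_issues(features=X, issue_types={"label": {}})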

codecov bot commented Nov 22, 2023

Codecov Report

Attention: 8 lines in your changes are missing coverage. Please review.

Comparison: base (4ac43c0) at 96.89% vs. head (5165eb4) at 96.76%.

❗ Current head 5165eb4 differs from pull request most recent head 41caca5. Consider uploading reports for the commit 41caca5 to get more accurate results

Files | Patch % | Lines
cleanlab/datalab/internal/model_outputs.py | 72.41% | 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #902      +/-   ##
==========================================
- Coverage   96.89%   96.76%   -0.13%     
==========================================
  Files          70       71       +1     
  Lines        5544     5601      +57     
  Branches      945      951       +6     
==========================================
+ Hits         5372     5420      +48     
- Misses         85       94       +9     
  Partials       87       87              


cleanlab/datalab/datalab.py — 3 review threads (outdated, resolved)
elisno and others added 12 commits December 1, 2023 15:35
used as inputs to Datalab.find_issues

Instead of introducing new arguments that behave
identically to existing arguments and are mutually
exclusive with them, we use a model-output hierarchy
that allows us to validate and process inputs to
Datalab based on the provided task.

This is now only focused on unifying the `pred_probs`
and `predictions` arguments at the
user-facing level (Datalab, specifically the IssueFinder level),
without affecting the signatures of lower-level classes
(like the IssueManagers).

This is strategically designed to accommodate a wider
range of model outputs in the future. This includes
extending these validation techniques to handle
complex outputs like 'features', 'pred_probs' in
datasets with higher dimensionality (e.g.,
segmentation) or datasets featuring non-normalized
rows (e.g., multi-label), thereby broadening the
applicability of Datalab in diverse machine learning
tasks.
With this change, Datalab/IssueFinder considers all
pred_probs arguments as valid input for any task.
The pred_probs argument is cast into a task-specific
ModelOutput validation class. This
intermediate representation ensures compatibility with
the intended IssueManagers (by validating the data
values and assigning the input to the correct
argument).

One note: the issue_finder.get_available_issue_types(**kwargs)
test needs to change.
It will replace a "pred_probs" key with a "predictions"
key for regression tasks.
---------

Co-authored-by: Hui Wen <45724323+huiwengoh@users.noreply.github.com>
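
A simplified sketch of what such a model-output hierarchy could look like; class and method names here are illustrative, not necessarily those in cleanlab/datalab/internal/model_outputs.py.

from abc import ABC, abstractmethod
import numpy as np

class ModelOutput(ABC):
    """Wraps a raw model-output array and knows which issue-manager argument it maps to."""

    def __init__(self, data):
        self.data = np.asarray(data)

    @abstractmethod
    def validate(self) -> None:
        """Check that the array has the shape/values expected for the task."""

    @abstractmethod
    def argument(self) -> str:
        """Name of the issue-manager argument this output should be passed to."""

class MultiClassPredProbs(ModelOutput):
    def validate(self) -> None:
        assert self.data.ndim == 2, "classification pred_probs must be 2D (n_examples, n_classes)"
        assert np.allclose(self.data.sum(axis=1), 1.0), "each row must sum to 1"

    def argument(self) -> str:
        return "pred_probs"

class RegressionPredictions(ModelOutput):
    def validate(self) -> None:
        assert self.data.ndim == 1, "regression predictions must be a 1D array"

    def argument(self) -> str:
        return "predictions"  # regression issue managers consume `predictions`, not `pred_probs`

The IssueFinder can then validate whatever the user passed as pred_probs and route it to whichever argument the task-specific subclass reports.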
add tests about issue flags
elisno and others added 2 commits December 5, 2023 23:14
Co-authored-by: Hui Wen <45724323+huiwengoh@users.noreply.github.com>
Co-authored-by: Hui Wen <45724323+huiwengoh@users.noreply.github.com>

@jwmueller (Member) left a comment


still had a couple of questions about the tests

elisno and others added 11 commits December 6, 2023 17:24
only apply label noise to examples selected as having errors

adjust the assertions about flagged issues
also switch model from KNeighborsRegressor to HuberRegressor to improve robustness against outliers.

Add sanity-checks to dataset generator (an OLS regressor is sensitive to outliers).
only rely on jaccard similarity and fpr
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
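
The commits above mention relying only on jaccard similarity and FPR in the end-to-end assertions. A self-contained sketch of these two metrics as they might apply to the flagged ids (the actual test helpers may differ):

def jaccard_similarity(flagged_ids, expected_ids):
    # Overlap between flagged and expected issue ids: |A & B| / |A | B|.
    a, b = set(flagged_ids), set(expected_ids)
    return len(a & b) / len(a | b)

def false_positive_rate(flagged_ids, expected_ids):
    # Fraction of flagged ids that are not among the injected label errors.
    a, b = set(flagged_ids), set(expected_ids)
    return len(a - b) / len(a)

# e.g. assert false_positive_rate(issue_ids, expected_issue_ids) < 0.3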
assert (summary[summary["issue_type"] == "label"]["num_issues"] == 40).all()
# FPR
fpr = len(list(set(issue_ids).difference(set(expected_issue_ids)))) / len(issue_ids)
assert fpr < 0.3

@jwmueller (Member) commented Dec 7, 2023


open GH issue to improve this fpr in the future, to below 0.2

@elisno (Member Author) replied:
  • Improve the stability of results from end-to-end tests of Datalab with label error-detection for regression tasks

@jwmueller self-requested a review on December 7, 2023 03:49

@jwmueller (Member) left a comment


LGTM. Go ahead and merge after addressing my remaining minor comments

Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>