
Predictions for regression in Datalab #902

Merged: 38 commits merged into master on Dec 7, 2023

Conversation


@elisno (Member) commented Nov 22, 2023

Summary

Introduce the predictions argument to RegressionLabelIssueManager.find_issues.
The pred_probs argument will still be used at the Datalab level. Internally, the predictions key picks up the same value as the pred_probs key when setting up issue managers (for regression).

# Example code snippet

from cleanlab import Datalab
from sklearn.datasets import make_regression

# Generate a toy regression dataset
N = 100
X, y, W = make_regression(n_samples=N, n_features=3, noise=0.1, random_state=0, coef=True)

# Get predictions from a regression model (simplified for illustration;
# ideally these would be out-of-sample predictions, e.g. from cross-validation).
y_pred = X @ W

# The regression `predictions` are passed to find_issues via the `pred_probs` argument
lab = Datalab(data={"y": y}, label_name="y", task="regression")
lab.find_issues(pred_probs=y_pred, issue_types={"label": {}})

## Outputs
# Finding label issues ...
#
# Audit complete. 7 issues found in the dataset.

Testing

🔍 Testing Done: Added specific tests for the RegressionLabelIssueManager. Wrote additional tests that run Datalab for regression with this new argument.

Links to Relevant Issues or Conversations

# How to set a threshold for flagging issues based on predictions
lab.find_issues(pred_probs=..., issue_types={"label": {"threshold": 1.0}})
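
For context, here is a minimal sketch of how a threshold like this can turn per-example label quality scores into issue flags, reusing y and y_pred from the example above. The call to cleanlab.regression.rank.get_label_quality_scores and the simple score-below-cutoff rule are illustrative assumptions, not necessarily the issue manager's internal logic.

from cleanlab.regression.rank import get_label_quality_scores

# Per-example label quality scores in [0, 1]; lower means more likely mislabeled.
scores = get_label_quality_scores(labels=y, predictions=y_pred)

# Illustrative flagging rule (assumption): flag examples whose score falls below a cutoff.
cutoff = 0.05
is_label_issue = scores < cutoff
print(f"{is_label_issue.sum()} examples flagged as potential label issues")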

…eManager

Move feature-based and prediction-based strategies to different helper functions within the module.

Add several tests for the issue_manager and the find_issues method.
Datalab resolves conflicts between the pred_probs and the predictions arguments, depending on the given task.
Update the **kwargs during the call to the helper function for finding issues to include the issue manager's threshold.
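
At the user level, these two strategies correspond roughly to passing precomputed predictions versus passing raw features and letting cleanlab fit a model internally. A minimal sketch, reusing X, y, and y_pred from the summary example (the feature-based call assumes the regression label check accepts a plain feature matrix):

# Prediction-based strategy: supply precomputed (ideally out-of-sample) predictions.
lab_pred = Datalab(data={"y": y}, label_name="y", task="regression")
lab_pred.find_issues(pred_probs=y_pred, issue_types={"label": {}})

# Feature-based strategy: pass the feature matrix and let cleanlab fit its own model.
lab_feat = Datalab(data={"y": y}, label_name="y", task="regression")
lab_feat.find_issues(features=X, issue_types={"label": {}})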

codecov bot commented Nov 22, 2023

Codecov Report

Attention: 8 lines in your changes are missing coverage. Please review.

Comparison: base (4ac43c0) at 96.89% vs. head (5165eb4) at 96.76%.

❗ Current head 5165eb4 differs from pull request most recent head 41caca5. Consider uploading reports for the commit 41caca5 to get more accurate results

Files | Patch % | Lines
cleanlab/datalab/internal/model_outputs.py | 72.41% | 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #902      +/-   ##
==========================================
- Coverage   96.89%   96.76%   -0.13%     
==========================================
  Files          70       71       +1     
  Lines        5544     5601      +57     
  Branches      945      951       +6     
==========================================
+ Hits         5372     5420      +48     
- Misses         85       94       +9     
  Partials       87       87              


cleanlab/datalab/datalab.py — 3 review threads (outdated, resolved)
elisno and others added 12 commits December 1, 2023 15:35
used as inputs to Datalab.find_issues

Instead of introducing new arguments that behave
identically to existing arguments and are mutually
exclusive with them, we use a model-output hierarchy
that allows us to validate and process inputs to
Datalab based on the provided task.

This is now only focused on unifying the `pred_probs`
and `predictions` arguments at the
user-facing level (Datalab, specifically the IssueFinder level),
without affecting the signatures of lower-level classes
(like the IssueManagers).

This is strategically designed to accommodate a wider
range of model outputs in the future. This includes
extending these validation techniques to handle
complex outputs like 'features', 'pred_probs' in
datasets with higher dimensionality (e.g.,
segmentation) or datasets featuring non-normalized
rows (e.g., multi-label), thereby broadening the
applicability of Datalab in diverse machine learning
tasks.
With this change, Datalab/IssueFinder considers all
pred_probs arguments as valid input for any task.
The pred_probs argument is cast into a task-specific
ModelOutput validation class. This
intermediate representation ensures compatibility with
the intended IssueManagers (by validating the data
values and assigning the input to the correct
argument).

One note: the issue_finder.get_available_issue_types(**kwargs)
test needs to change.
It will replace a "pred_probs" key with a "predictions"
key for regression tasks.
---------

Co-authored-by: Hui Wen <45724323+huiwengoh@users.noreply.github.com>
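
A simplified sketch of what such a model-output hierarchy could look like; class and method names here are illustrative, not necessarily those in cleanlab/datalab/internal/model_outputs.py.

from abc import ABC, abstractmethod
import numpy as np

class ModelOutput(ABC):
    """Wraps a raw model-output array and knows which issue-manager argument it maps to."""

    def __init__(self, data):
        self.data = np.asarray(data)

    @abstractmethod
    def validate(self) -> None:
        """Check that the array has the shape/values expected for the task."""

    @abstractmethod
    def argument(self) -> str:
        """Name of the issue-manager argument this output should be passed to."""

class MultiClassPredProbs(ModelOutput):
    def validate(self) -> None:
        assert self.data.ndim == 2, "classification pred_probs must be 2D (n_examples, n_classes)"
        assert np.allclose(self.data.sum(axis=1), 1.0), "each row must sum to 1"

    def argument(self) -> str:
        return "pred_probs"

class RegressionPredictions(ModelOutput):
    def validate(self) -> None:
        assert self.data.ndim == 1, "regression predictions must be a 1D array"

    def argument(self) -> str:
        return "predictions"  # regression issue managers consume `predictions`, not `pred_probs`

The IssueFinder can then validate whatever the user passed as pred_probs and route it to whichever argument the task-specific subclass reports.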
add tests about issue flags
elisno and others added 2 commits December 5, 2023 23:14
Co-authored-by: Hui Wen <45724323+huiwengoh@users.noreply.github.com>
Co-authored-by: Hui Wen <45724323+huiwengoh@users.noreply.github.com>

@jwmueller (Member) left a comment


still had a couple of questions about the tests

elisno and others added 11 commits December 6, 2023 17:24
only apply label noise to examples selected as having errors

adjust the assertions about flagged issues
also switch model from KNeighborsRegressor to HuberRegressor to improve robustness against outliers.

Add sanity-checks to dataset generator (an OLS regressor is sensitive to outliers).
only rely on jaccard similarity and fpr
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
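
The commits above mention relying only on jaccard similarity and FPR in the end-to-end assertions. A self-contained sketch of these two metrics as they might apply to the flagged ids (the actual test helpers may differ):

def jaccard_similarity(flagged_ids, expected_ids):
    # Overlap between flagged and expected issue ids: |A & B| / |A | B|.
    a, b = set(flagged_ids), set(expected_ids)
    return len(a & b) / len(a | b)

def false_positive_rate(flagged_ids, expected_ids):
    # Fraction of flagged ids that are not among the injected label errors.
    a, b = set(flagged_ids), set(expected_ids)
    return len(a - b) / len(a)

# e.g. assert false_positive_rate(issue_ids, expected_issue_ids) < 0.3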
assert (summary[summary["issue_type"] == "label"]["num_issues"] == 40).all()
# FPR
fpr = len(list(set(issue_ids).difference(set(expected_issue_ids)))) / len(issue_ids)
assert fpr < 0.3

@jwmueller (Member) commented Dec 7, 2023


open GH issue to improve this fpr in the future, to below 0.2

@elisno (Member Author) replied:
  • Improve the stability of results from end-to-end tests of Datalab with label error-detection for regression tasks

@jwmueller self-requested a review on December 7, 2023 03:49

@jwmueller (Member) left a comment


LGTM. Go ahead and merge after addressing my remaining minor comments

Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>