Predictions for regression in Datalab #902
Conversation
…eManager. Move feature-based and prediction-based strategies to separate helper functions within the module. Add several tests for the issue_manager and the find_issues method.
Datalab resolves conflicts between the pred_probs and the predictions arguments, depending on the given task.
… in regression datasets
Update the **kwargs passed to the issue-finding helper function to include the issue manager's threshold.
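As a rough illustration of that change (the helper name and attribute access here are hypothetical, not the actual cleanlab internals), the threshold can be merged into the forwarded keyword arguments like so:

```python
# Hypothetical sketch only: merge the issue manager's own threshold into the
# **kwargs forwarded to the issue-finding helper (names are illustrative).
def _find_issues_with_threshold(issue_manager, predictions, **kwargs):
    # Respect an explicitly passed threshold; otherwise fall back to the
    # issue manager's default threshold attribute.
    kwargs.setdefault("threshold", issue_manager.threshold)
    return issue_manager.find_issues(predictions=predictions, **kwargs)
```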
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##           master     #902      +/-   ##
==========================================
- Coverage   96.89%   96.76%   -0.13%
==========================================
  Files          70       71       +1
  Lines        5544     5601      +57
  Branches      945      951       +6
==========================================
+ Hits         5372     5420      +48
- Misses         85       94       +9
  Partials       87       87

☔ View full report in Codecov by Sentry.
…used as inputs to Datalab.find_issues. Instead of introducing new arguments that behave identically to existing, mutually exclusive arguments, we use a model-output hierarchy that lets us validate and process inputs to Datalab based on the provided task. This change is focused only on unifying the `pred_probs` and `predictions` arguments at the user-facing level (Datalab, specifically the IssueFinder level), without affecting the signatures of lower-level classes (like the IssueManagers). It is strategically designed to accommodate a wider range of model outputs in the future, including extending these validation techniques to handle complex outputs like `features`, or `pred_probs` in datasets with higher dimensionality (e.g., segmentation) or with non-normalized rows (e.g., multi-label), thereby broadening the applicability of Datalab across diverse machine learning tasks.
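For illustration, here is a minimal, hedged sketch of the user-facing call this unification enables. The toy data and model are assumptions, and it presumes a cleanlab version that includes this change:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import HuberRegressor
from sklearn.model_selection import cross_val_predict
from cleanlab import Datalab  # assumes a cleanlab version with regression support in Datalab

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=200)
df = pd.DataFrame({"x1": X[:, 0], "x2": X[:, 1], "y": y})

# Out-of-sample predictions from any regression model.
predictions = cross_val_predict(HuberRegressor(), X, y, cv=5)

# With task="regression", the model output is still passed as `pred_probs`;
# Datalab/IssueFinder maps it to the `predictions` argument internally.
lab = Datalab(data=df, label_name="y", task="regression")
lab.find_issues(features=X, pred_probs=predictions)
lab.report()
```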
With this change, Datalab/IssueFinder treats the pred_probs argument as valid input for any task. The pred_probs value is cast into a task-specific ModelOutput validation class. This intermediate representation ensures compatibility with the intended IssueManagers (by validating the data values and assigning the input to the correct argument). One note: the issue_finder.get_available_issue_types(**kwargs) test needs to change; it will replace the "pred_probs" key with a "predictions" key for regression tasks.
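To make that idea concrete, here is an illustrative-only sketch of such a hierarchy. The class names, the `argument` attribute, and the `resolve_model_output` helper are hypothetical and not the actual cleanlab internals:

```python
import numpy as np

class ModelOutput:
    """Base wrapper around a user-supplied model output (illustrative only)."""
    def __init__(self, data):
        self.data = np.asarray(data)

class MultiClassPredProbs(ModelOutput):
    argument = "pred_probs"  # forwarded to classification issue managers
    def validate(self):
        assert self.data.ndim == 2 and np.allclose(self.data.sum(axis=1), 1.0)
        return self.data

class RegressionPredictions(ModelOutput):
    argument = "predictions"  # forwarded to the regression label issue manager
    def validate(self):
        assert self.data.ndim == 1
        return self.data

def resolve_model_output(task, pred_probs):
    """Cast user-supplied `pred_probs` into a task-specific representation."""
    cls = {"classification": MultiClassPredProbs,
           "regression": RegressionPredictions}[task]
    output = cls(pred_probs)
    # Validate the values and assign them to the argument name the
    # corresponding issue manager expects.
    return {cls.argument: output.validate()}
```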
--------- Co-authored-by: Hui Wen <45724323+huiwengoh@users.noreply.github.com>
add tests about issue flags
Co-authored-by: Hui Wen <45724323+huiwengoh@users.noreply.github.com>
still had a couple of questions about the tests
Only apply label noise to examples selected as having errors; adjust the assertions about flagged issues accordingly.
Also switch the model from KNeighborsRegressor to HuberRegressor to improve robustness against outliers, and add sanity checks to the dataset generator (an OLS regressor is sensitive to outliers). See the sketch below.
Only rely on Jaccard similarity and FPR.
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
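A hedged sketch of that test setup, where the dataset sizes, noise magnitude, and variable names are assumptions rather than the actual test code:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(42)
n, n_errors = 200, 40
X = rng.normal(size=(n, 2))
y_true = X @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=n)

# Apply label noise only to examples explicitly selected as having errors.
expected_issue_ids = rng.choice(n, size=n_errors, replace=False)
y_noisy = y_true.copy()
y_noisy[expected_issue_ids] += rng.choice([-1, 1], size=n_errors) * 5 * y_true.std()

# HuberRegressor is less sensitive to the injected outliers than an OLS fit
# or KNeighborsRegressor, which keeps the end-to-end test more stable.
predictions = cross_val_predict(HuberRegressor(), X, y_noisy, cv=5)
```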
assert (summary[summary["issue_type"] == "label"]["num_issues"] == 40).all()
# FPR
fpr = len(list(set(issue_ids).difference(set(expected_issue_ids)))) / len(issue_ids)
assert fpr < 0.3
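For reference, a minimal sketch of the two metrics the test relies on, assuming `issue_ids` are the indices flagged by Datalab and `expected_issue_ids` are the indices that received injected noise:

```python
def jaccard_similarity(issue_ids, expected_issue_ids):
    flagged, expected = set(issue_ids), set(expected_issue_ids)
    return len(flagged & expected) / len(flagged | expected)

def false_positive_rate(issue_ids, expected_issue_ids):
    # Matches the `fpr` computed in the snippet above: the fraction of
    # flagged examples that were not injected with label noise.
    flagged, expected = set(issue_ids), set(expected_issue_ids)
    return len(flagged - expected) / len(flagged)
```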
Open a GH issue to improve this FPR in the future, to below 0.2.
- Improve the stability of results from end-to-end tests of Datalab with label error detection for regression tasks
LGTM. Go ahead and merge after addressing my remaining minor comments
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
Summary
Introduce the `predictions` argument to `RegressionLabelIssueManager.find_issues`. The `pred_probs` argument will still be used at the Datalab level. Internally, the `predictions` key will pick up the same value as the `pred_probs` key when setting up issue managers (for regression).
Testing
Links to Relevant Issues or Conversations