-
-
Notifications
You must be signed in to change notification settings - Fork 25.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MAINT Common parameter validation #22722
MAINT Common parameter validation #22722
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are some comments for possible extensions of this work.
sklearn/cluster/_kmeans.py
Outdated
"n_clusters": [(numbers.Integral, Interval(1, None, closed="left"))], | ||
"init": [ | ||
(str, {"k-means++", "random"}), | ||
(callable,), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Future work: we can imagine defining the subset of callables with a specific signature to pass here as valid values
@@ -1166,7 +1190,8 @@ def fit(self, X, y=None, sample_weight=None): | |||
accept_large_sparse=False, | |||
) | |||
|
|||
self._check_params(X) | |||
self._check_params_vs_input(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a second round of checks after data validation that deals with valid values that depend on the data or on other parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for opening the PR on this topic!
The way the specification is defined is a dictionary of list of tuples where each tuple is: (valid_type, constraint)
. I like thinking of everything as a constraint.
As for the developer API, I see two parts:
- Defining the constraints
- Actually performing the validation.
In this PR, item 1 is a dictionary, and item 2 is a function call to validate_param
. There is also another API for validate_params
that combines item 1 and item 2 that is used in function calls. My preference is to have one API instead of two.
As we are already defining a Interval
object, I think it's okay to go straight to defining a Validator object:
validator = Validator(
n_clusters=[Interval(Integral, 1, None, closed="left")],
init=[Options(["k-means++", "random"]), callable, "array-like")],
tol=[Interval(Real, 0, None, closed="left")],
algorithm=[Options(["lloyd", "elkan", "auto", "full"], deprecated={"auto", "full"})],
max_no_improvement=[None, Interval(Integral, 0, None, closed="left")]
)
validator.validate(n_clusters=2, ...)
The above can be used directly in functions.
For estimators:
class MyEstimator:
_validator = Validator(...)
def fit(self, X, y):
self._validator.validate(self.get_params())
I think the dictionary of lists of tuples has semantics that makes it harder to parse and a validator object makes the semantics clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the dictionary of constraints idea!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for tackling this, @jeremiedbb.
I think this very valuable for maintenance on the long term.
Here is a first review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks quite interesting, and I'm quite happy with it. But I really would like to see what @jnothman thinks about it. I don't think this adds too much complexity and it's not a required API for developers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really nice. I'm the second approver here, but since it's quite major, I'd like another set of eyes giving a thumb up before merging.
Seems like there hasn't been any objections since I left my last comment a month ago. Will update the branch, and merge after if CI passes. |
@jeremiedbb CI fails here. |
+1 for this change. It is a step in the right direction! Thank @jeremiedbb for your effort! |
29c12fe resolves the failing tests. Feel free to cherry-pick (I've tried to do a PR on top of the branch of this PR but I can't). Thank you once again, @jeremiedbb. I am looking forward to the merge. |
thanks @jjerphan. I was also looking at this :) |
I would wait for another 3rd approval before merging this one. What do you think, @adrinjalali? |
We also have @lorentzenchr 's +1 here. I think we can merge. I think there's been enough time to object if there were concerns. |
How about opening a follow-up issue to track progress on the modules (making |
* common parameter validation * black * cln * wip * wip * rework * renaming and cleaning * lint * re lint * cln * add tests * lint * make random_state constraint * lint * closed positional * increase coverage + validate constraints * exp typing * trigger ci ? * lint * cln * rev type hints * cln * interval closed kwarg only * address comments * address comments + more tests + cln + improve err msg * lint * cln * cln * address comments * address comments * lint * adapt or skip new estimators * lint Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
…cSVM scikit-learn#24001) finish v1.2 deprecation of params kwargs in `.fit` of SVDD (similar to ocSVM scikit-learn#20843) removed SVDD param-validation exception from test_common.py since scikit-learn#23462 is go (scikit-learn#22722)
…cSVM scikit-learn#24001) finish v1.2 deprecation of params kwargs in `.fit` of SVDD (similar to ocSVM scikit-learn#20843) TST ensure SVDD passes param-validation test_common.py due to scikit-learn#23462 (scikit-learn#22722)
This PR proposes a unified design for parameter validation across estimators, classes and functions.
The goal is to have a consistent way to raise an informative error message when a parameter does not have a valid type/value. Here's an example:
It's also meant to centralize all these checks in one place, i.e. being the first instruction of fit or of a function. Currently they can be spread throughout fit making it hard to follow and slow to fail. I also find that having all this boilerplate inside fit makes the actual interesting code of the algorithm hard to find and mixed up with non-relevant code.
In addition, these checks are currently often done for a small subset of the parameters and often not tested. And when tested, it's often spread inside several tests.
This PR only deals with non co-dependent types and values between parameters. For instance if a value of a parameter is valid only if some value of another parameter is set.
I propose to add to
BaseEstimator
a method_validate_params
that performs validation for all parameters of estimators and a decoratorvalidate_params
for public functions. Validation is made against a dictparam_name: constraint
where constraint is a list of valid types/values.I also propose to add a new common test that makes sure this is done for all estimators (almost all of them being skipped right now).
closes #14721