[MRG+1] add option to cross_validate to return estimators fitted on each split #9686
Conversation
it's always a good idea to include a test to show that the implementation
does what you think it does.
On 5 Sep 2017 2:41 am, "Aurélien Bellet" <notifications@github.com> wrote:
Reference Issues
Fixes #6827 (alternative to #9496)
What does this implement/fix? Explain your changes.
This PR adds an option to cross_validate so that it returns a list of the
estimators fitted on each split. This list is an additional entry to the
returned dictionary.
Any other comments?
I simply wrote some basic code implementing the change. If there is some
consensus in favor of this solution, I will go ahead and improve the code,
add tests and update the doc.
I'm fine with this kind of change... But I'm not sure if we will find consensus, and @GaelVaroquaux has generally been against giving the user the kitchen sink (to use a strange English expression for "everything and more"). It also has the potential to snowball and become a feature request in grid search. Even if it does not take excessive memory here, it may there. And someone will complain that it fails to return some un-picklable object they have created.
Thanks for making this concrete, but let's consider an alternative: Is there a way to instead allow the user to specify a filename format and dump the estimator to disk?
train_scores, test_scores, fit_times, score_times = zip(*scores)
if return_estimator:
    (train_scores, test_scores, fit_times, score_times,
     fitted_est) = zip(*scores)
For good or bad, all the other return values here are named in plural. This should be consistent unless there's a very good reason not to be.
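For context, here is a toy, self-contained sketch of what the zip(*scores) transposition in the snippet above produces (the values are made up; in the real code each score entry is a dict):

scores = [
    [0.96, 0.011, 0.002, "est_0"],  # per split: test_score, fit_time, score_time, estimator
    [0.93, 0.010, 0.002, "est_1"],
]
test_scores, fit_times, score_times, fitted_ests = zip(*scores)
print(fitted_ests)  # ('est_0', 'est_1'): one fitted estimator per split, hence the plural naming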
for train, test in cv.split(X, y, groups))

if return_train_score:
    train_scores, test_scores, fit_times, score_times = zip(*scores)
    if return_estimator:
There must be a neater way to write this without nested ifs. But alas I'm not sure what it is.
Or a callback taking a fitted estimator as input?
I'm afraid the callback approach is plagued by the potential for other parameters: passing the training data, the test data, the fold number, etc. And then we become a framework.
Fair enough. I can implement the option to dump the estimator to disk if there is some consensus that this is the way to go?
> and @GaelVaroquaux has generally been against giving the user the
> kitchen sink (to use a strange English expression for "everything and
> more"). It also has the potential to snowball and become a feature
> request in grid search.

I am OK with this, as long as we give limitations in the docstring (i.e.
beware that not all models are picklable) and as long as we really don't get
this in GridSearch.
My logic is that if it's a contained kitchen sink, it's OK. I am worried
about the overflowing kitchen sink :).
@GaelVaroquaux are you OK with the solution provided by this PR, or with the idea of adding an option to dump to file? In the first case we do not need to pickle anything, do we?
would it be more useful, to avoid keeping it all in memory, to be dumping
models to disk?
…On 8 Sep 2017 2:33 am, "Gael Varoquaux" ***@***.***> wrote:
@bellet: I am OK with the solution provided by this PR. I would say that the issues raised by @jnothman need to be addressed, and a test needs to be added, and I would be +1 for merging.
> would it be more useful, to avoid keeping it all in memory, to be dumping models to disk?

We really don't want to be coupling algorithms with persistence. It opens
a bag of problems (such as file collisions across several code paths /
VMs).
If people really want to do that (and I sometimes want to), they should
write the for loop themselves. It's not that hard.
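For reference, the "write the for loop yourself" pattern mentioned here might look roughly like this (a sketch, not part of this PR; the output directory, filename pattern and choice of estimator are purely illustrative):

import os
import joblib
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
base_est = LogisticRegression(max_iter=1000)
out_dir = "cv_models"  # illustrative location
os.makedirs(out_dir, exist_ok=True)
for i, (train, test) in enumerate(KFold(n_splits=5).split(X, y)):
    est = clone(base_est).fit(X[train], y[train])  # fit a fresh clone per split
    joblib.dump(est, os.path.join(out_dir, "fold_%d.joblib" % i))  # persist instead of keeping in memory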
I would have just had users prescribe a filename format where the iteration
index would be substituted in.
…On 8 September 2017 at 08:44, Gael Varoquaux ***@***.***> wrote:
> would it be more useful, to avoid keeping it all in memory, to be
> dumping models to disk?
> We really don't want to be coupling algorithms with persistence. It opens
> a bag of problems (such as file collisions across several code paths /
> VMs).
> If people really want to do that (and I sometimes want to), they should
> write the for loop themselves. It's not that hard.
OK, I will go ahead with this solution and fix the issues you raised, and we can then see who is OK with merging.
I have addressed the issues raised by @jnothman (not sure whether my alternative to the nested ifs is neater...) and added a test. Comments welcome.
I do not really understand why the code coverage check fails. Unless I misunderstand the report, it seems that all of my changes are covered...
@jnothman can you see what's wrong?
I agree it's surprising that there's no coverage... perhaps that test is somehow not being run. Put an assert False in with your assertions just to check...?
@@ -489,6 +511,8 @@ def _fit_and_score(estimator, X, y, scorer, train, test, verbose,
    ret.extend([fit_time, score_time])
    if return_parameters:
        ret.append(parameters)
    if return_estimator:
        ret.append(estimator)
Apparently coverage is missing from this line!
I have added an assert False along with my assert_almost_equal on the estimator parameters, and this makes the test fail, so the assertions are run. Hence it doesn't make sense that this line is not covered; otherwise how could the test pass?
perhaps put the assert False here just to prove the coverage tool wrong? If it does not fail, you've got some investigation to do...
Thanks, just did this and the test now fails. Should I commit this for the sake of it?
How strange...
Do you believe there's more to do here before full review? If not, relabel WIP -> MRG please.
I actually need to add the new option to the doc.
Otherwise LGTM
@@ -140,6 +144,8 @@ def cross_validate(estimator, X, y=None, groups=None, scoring=None, cv=None,
        The time for scoring the estimator on the test set for each
        cv split. (Note time for scoring on the train set is not
        included even if ``return_train_score`` is set to ``True``
    ``estimator``
        The list of estimator objects for each cv split.
Drop "list of" for consistency
if return_train_score:
    train_scores, test_scores, fit_times, score_times = zip(*scores)
    train_scores = zipped_scores[0]
I think = zipped_scores.pop(0) and similar below will help simplify this logic.
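A self-contained toy illustration of the pop-based idea (made-up values, not the exact PR code; it assumes the train scores come first and the fitted estimator last in each per-split list):

return_train_score, return_estimator = True, True
scores = [
    # per split: [train_score, test_score, fit_time, score_time, estimator]
    [0.99, 0.96, 0.011, 0.002, "est_0"],
    [0.98, 0.93, 0.010, 0.002, "est_1"],
]
zipped_scores = list(zip(*scores))
if return_train_score:
    train_scores = zipped_scores.pop(0)      # consume the leading entry only when present
if return_estimator:
    fitted_estimators = zipped_scores.pop()  # estimators were appended last
test_scores, fit_times, score_times = zipped_scores  # the remaining unpacking needs no nesting
print(train_scores, fitted_estimators)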
Thanks for the comment, this indeed simplifies things a bit. I also briefly mentioned the new option in the doc.
doc/modules/cross_validation.rst (outdated)
@@ -196,6 +196,8 @@ following keys -
    for all the scorers. If train scores are not needed, this should be set to
    ``False`` explicitly.

``return_estimator`` is set to ``False`` by default. When set to ``True``, it adds an ``estimator`` key containing the estimators fitted on each split.
Please limit lines to under 80 characters
I also think this is somewhat unnecessary. Or I'd state it more in terms of use case rather than API: "you can also retain the estimator fitted for each training set with return_estimator=True"
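In the doctest style that section already uses, the use-case phrasing might be illustrated along these lines (a sketch only; clf and iris are the objects defined earlier in that doc section, and cv is fixed here so the output is deterministic):

    >>> cv_results = cross_validate(clf, iris.data, iris.target, cv=5,
    ...                             return_estimator=True)
    >>> len(cv_results['estimator'])
    5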
train_scores = _aggregate_score_dicts(train_scores)
if return_estimator:
    (test_scores, fit_times, score_times,
     fitted_estimators) = zipped_scores
I'd use pop here too and remove the else
@GaelVaroquaux do you have additional comments? Thanks!
Travis is failing.
LGTM apart from minor nitpicks and doctest failures.
@@ -182,7 +182,7 @@ The ``cross_validate`` function differs from ``cross_val_score`` in two ways -

- It allows specifying multiple metrics for evaluation.

- It returns a dict containing training scores, fit-times and score-times in
optionally?
Done
@@ -149,6 +153,8 @@ def cross_validate(estimator, X, y=None, groups=None, scoring=None, cv=None,
        The time for scoring the estimator on the test set for each
        cv split. (Note time for scoring on the train set is not
        included even if ``return_train_score`` is set to ``True``
    ``estimator``
        The estimator objects for each cv split.
if return_estimator is True.
Done
@amueller for the Travis failure, see the discussion above (#9686 (review)) with @jnothman.
doc/modules/cross_validation.rst (outdated)
@@ -227,8 +231,9 @@ Here is an example of ``cross_validate`` using a single metric::

    >>> scores = cross_validate(clf, iris.data, iris.target,
    ...     scoring='precision_macro')
this should have a comma at the end
Oops, nice catch!
@amueller do you have other requests? Thanks!
Travis is failing because of doctests.
How do I locate which doctests fail? All tests pass when I run nosetests on sklearn/model_selection/_validation.py (except for a warning which is not related to this PR).
OK, it seems it was only the documentation issue. All checks pass now @amueller.
@amueller @jnothman @GaelVaroquaux is there anything else that should be done here?
Please add an Enhancements entry to the change log at
or maybe this is a New Feature. Otherwise, this is good to merge.
@jnothman done, thanks!
Thanks @bellet!
Reference Issues
Fixes #6827 (alternative to #9496)
What does this implement/fix? Explain your changes.
This PR adds an option to cross_validate so that it returns a list of the estimators fitted on each split. This list is an additional entry to the returned dictionary.
Any other comments?
I simply wrote some basic code implementing the change. If core devs reach some consensus in favor of this solution, I will go ahead and improve the code, add tests and update the doc.
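A minimal usage sketch of the option this PR adds (the dataset and classifier are arbitrary choices for illustration; only return_estimator and the resulting 'estimator' entry are the point):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
cv_results = cross_validate(LogisticRegression(max_iter=1000), X, y,
                            cv=5, return_estimator=True)
# The returned dict gains an 'estimator' entry with one fitted estimator per split.
print(len(cv_results['estimator']))            # 5
print(cv_results['estimator'][0].coef_.shape)  # coefficients of the model fitted on split 0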