Fix prediction fails with MOO ensemble and dummy is best #1518
Conversation
Codecov Report
@@             Coverage Diff              @@
##           development    #1518      +/-   ##
===============================================
+ Coverage       83.94%     84.19%    +0.25%
===============================================
  Files             153        153
  Lines           11654      11663        +9
  Branches         2031       2033        +2
===============================================
+ Hits             9783       9820       +37
+ Misses           1326       1295       -31
- Partials          545        548        +3
Reproducing script; make the data random and stop after one model:

import numpy as np
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection  # needed for train_test_split below
import autosklearn.classification

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)

# Overwrite with random data so the dummy beats the one real model we allow
rand = np.random.RandomState(2)
X = rand.random((100, 50))
y = rand.randint(0, 2, (100,))

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

def callback(smbo, run_info, result, time_left):
    # Stop SMAC after the first evaluated model
    if int(result.additional_info["num_run"]) > 0:
        return False

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    get_trials_callback=callback,
    include={"classifier": ["bernoulli_nb"]},
    delete_tmp_folder_after_terminate=False,
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")

for ens in automl.get_pareto_set():
    ens.predict(X_test)

Error:

Traceback (most recent call last):
File "test.py", line 41, in <module>
ens.predict(X_test)
File "/home/skantify/code/asklearn/dev/.venv/lib/python3.8/site-packages/sklearn/ensemble/_voting.py", line 309, in predict
maj = np.argmax(self.predict_proba(X), axis=1)
File "/home/skantify/code/asklearn/dev/.venv/lib/python3.8/site-packages/sklearn/ensemble/_voting.py", line 330, in _predict_proba
weights=self._weights_not_none)
File "/home/skantify/code/asklearn/dev/.venv/lib/python3.8/site-packages/sklearn/ensemble/_voting.py", line 56, in _weights_not_none
return [w for est, w in zip(self.estimators, self.weights)
File "/home/skantify/code/asklearn/dev/.venv/lib/python3.8/site-packages/sklearn/ensemble/_voting.py", line 57, in <listcomp>
if est[1] != 'drop']
TypeError: 'MyDummyClassifier' object is not subscriptable |
The culprit is that all estimators are normally wrapped in a Pipeline, while the dummy is not.

auto-sklearn/autosklearn/evaluation/abstract_evaluator.py, lines 238 to 252 in 9d63cb5
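To see the mismatch concretely, here is a minimal sketch using sklearn's own DummyClassifier as a stand-in for auto-sklearn's MyDummyClassifier. Pipeline implements __getitem__, so the voting code's `est[1] != 'drop'` check happens to work on wrapped members and fails on the bare dummy:

from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A regular ensemble member is a multi-step Pipeline; pipeline[1]
# returns the estimator of the step at index 1, so the check passes.
pipeline = Pipeline([("scaler", StandardScaler()), ("classifier", BernoulliNB())])
print(pipeline[1] != "drop")  # True

# The dummy is stored bare, so the very same expression raises.
dummy = DummyClassifier(strategy="uniform")
try:
    dummy[1] != "drop"
except TypeError as err:
    print(err)  # 'DummyClassifier' object is not subscriptable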
I opted to solve this by just directly modifying `_load_pareto_set`.
The first of those three points is not so straightforward to fix, I think. The second could be achieved with a callback to SMAC. The third I'm not sure we can directly enforce: will SMAC choose the configuration again if it's already been evaluated and there is only one valid configuration? I think this might also solve some other failures that occur when it says "data_preprocessor" not available in
If it is easier, the code for wrapping the Dummy into a Pipeline could also go directly into the abstract evaluator. I'm not sure that adding the dummy to the search space will really help us, because it could then be used together with preprocessing algorithms, which would enlarge the search space (or the number of forbidden configurations).
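As a hedged illustration of that wrapping idea (the helper name and the identity preprocessing step are assumptions, not the PR's actual code), giving the dummy the same subscriptable Pipeline shape as regular members restores the invariant the voting code relies on:

from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def wrap_if_dummy(model):
    """Hypothetical helper: make a bare dummy look like the other members.

    Regular members are multi-step Pipelines, so `model[1]` is valid for
    them; giving the dummy a no-op preprocessing step restores that shape.
    """
    if isinstance(model, Pipeline):
        return model
    return Pipeline([
        ("data_preprocessor", FunctionTransformer()),  # identity no-op
        ("classifier", model),
    ])

wrapped = wrap_if_dummy(DummyClassifier(strategy="most_frequent"))
assert wrapped[1] != "drop"  # the voting code's check now works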
I pushed some cases that are AutoML instances with only a dummy in them. They should hopefully pass all the tests since they're fairly broad cases; I validated they are used and run with no issues now. If they pass the online tests, I'll see about moving them into the abstract evaluator.

@case(tags=["classifier", "fitted"])
def case_classifier_fitted_only_dummy(
    make_cache: Callable[[str], Cache],
    make_backend: Callable[..., Backend],
    make_automl_classifier: Callable[..., AutoMLClassifier],
) -> AutoMLClassifier:
    """Case of a fitted classifier but only dummy was found"""
    key = "case_classifier_fitted_only_dummy"

    # This locks the cache for this item while we check, required for pytest-xdist
    with make_cache(key) as cache:
        if "model" not in cache:
            model = make_automl_classifier(
                temporary_directory=cache.path("backend"),
                delete_tmp_folder_after_terminate=False,
                include={"classifier": ["bernoulli_nb"]},  # Just a meh model
                get_trials_callback=stop_at_first,
            )

            rand = np.random.RandomState(2)
            _X = rand.random((100, 50))
            _y = rand.randint(0, 2, (100,))
            X, Xt, y, yt = sklearn.model_selection.train_test_split(
                _X, _y, random_state=1  # Required to ensure dummy is best
            )

            model.fit(X, y, dataset_name="random")

            # We now validate that indeed, the only model is the Dummy
            members = list(model.models_.values())
            if len(members) != 1 or not isinstance(members[0], MyDummyClassifier):
                raise ValueError("Should only have one model, dummy\n", members)

            cache.save(model, "model")

    model = cache.load("model")
    model._backend = copy_backend(old=model._backend, new=make_backend())

    return model
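stop_at_first is not defined in this snippet; presumably it is a get_trials_callback that ends the search after the first evaluated model, like the callback in the reproducing script above. A minimal sketch under that assumption:

def stop_at_first(smbo, run_info, result, time_left):
    # Ask SMAC to stop once the first model has been evaluated, so the
    # fitted AutoML instance ends up with only the dummy as its best model.
    if int(result.additional_info["num_run"]) > 0:
        return False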
Force-pushed from fde9940 to d8d3a4f.
I tried to move the Pipeline step into the abstract evaluator.
* Init commit
* Fix DummyClassifiers in `_load_pareto_set`
* Add test for dummy only in classifiers
* Update no ensemble docstring
* Add automl case where automl only has dummy
* Remove tmp file
* Fix `include` statement to be regressor
Related issue: `predict` can fail when dummy model is best with new Moo updates #1495