Closed
Description
In predict_proba of LGBMClassifier (at least), if the input is a pandas DataFrame with a categorical column, and that column contains a value not seen during fitting, the entire column in the caller's DataFrame becomes corrupted.
Some might argue this is not important, but the behaviour is undocumented and unexpected, and it took me a lot of time to detect. It led to nulls appearing out of nowhere in a chain of models making predictions on the same data. IMHO no model should change its inputs (and if there are performance reasons for doing so, then at least not without a special flag being set explicitly).
Reproducible example
import lightgbm
import pandas as pd, numpy as np
from lightgbm import LGBMClassifier
nsamples=50
X_train = pd.DataFrame(np.random.random(size=(nsamples, 4)))
X_train["cat"] = np.random.choice(["a", "b"], size=nsamples, replace=True)
X_train['cat']=X_train['cat'].astype('category')
est=LGBMClassifier(verbose=0)
est.fit(X_train, np.random.randint(0, 2, size=nsamples))
X_test = pd.DataFrame(np.random.random(size=(nsamples, 4)))
X_test["cat"] = np.random.choice(["a", "c"], size=nsamples, replace=True) # note that c is unseen before
X_test["cat"] = X_test["cat"].astype("category")
print(X_test["cat"].value_counts(dropna=False)) # and it's retained
#cat
#a 30
#c 20
#Name: count, dtype: int64
est.predict_proba(X_test) # but not after predict_proba
print(X_test["cat"].value_counts(dropna=False)) # note that the values in the cat column have been corrupted altogether: "c" became NaN and an unused "b" category appeared
#cat
#a 25
#NaN 25
#b 0
#Name: count, dtype: int64
Environment info
print(np.version.version, pd.__version__, lightgbm.__version__)
1.24.4 2.0.3 4.1.0
OS=Windows
Command(s) you used to install LightGBM
pip install lightgbm
Activity
jmoralez commented on Nov 15, 2023
Hey @fingoldo, thanks for using LightGBM and sorry for the troubles. We used to take a shallow copy there but it wasn't obvious that the predict step depended on that and a recent refactor removed it. We'll work on a fix.
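To illustrate why a shallow copy is enough here, the following is a minimal sketch using pandas only. It assumes the in-place step is something like re-setting the column's categories to those seen during training; that is a guess based on the output in the report, not LightGBM's actual code, and the train_categories list is illustrative.

```python
import pandas as pd

# Categories assumed to have been seen during training (illustrative values).
train_categories = ["a", "b"]

# Test frame with a category ("c") that was not seen during training.
X_test = pd.DataFrame({"cat": pd.Categorical(["a", "c", "a", "c"])})

# A shallow copy is a new DataFrame object that initially shares the column data.
shallow = X_test.copy(deep=False)

# Re-assigning a column on the copy replaces that column in the copy only,
# so the caller's frame keeps its original categories.
shallow["cat"] = shallow["cat"].cat.set_categories(train_categories)

nan_in_copy = int(shallow["cat"].isna().sum())
nan_in_original = int(X_test["cat"].isna().sum())
print(nan_in_copy, nan_in_original)
```

The unseen category turns into NaN only in the copy, while the frame passed in by the caller is left as it was.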
fingoldo commented on Nov 15, 2023
Thank you so much Jose, that's what I call a fast turnaround! ;-) For now I just pass a .copy() of a dataframe to LightGBM, then other models of the ensemble are not affected.
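The workaround above could be wrapped roughly as follows. This is only a sketch: MutatingModel is a stand-in that imitates the side effect (it is not LightGBM), and predict_proba_safely is a hypothetical helper name, not part of any library.

```python
import pandas as pd


class MutatingModel:
    """Stand-in estimator that rewrites its input frame (not LightGBM itself)."""

    def predict_proba(self, X):
        # Imitate the side effect: categories outside ["a", "b"] become NaN
        # in the frame that was passed in.
        X["cat"] = X["cat"].cat.set_categories(["a", "b"])
        return [[0.5, 0.5] for _ in range(len(X))]


def predict_proba_safely(model, X):
    """Hypothetical helper: give the model its own copy of the frame."""
    return model.predict_proba(X.copy())


X_test = pd.DataFrame({"cat": pd.Categorical(["a", "c", "a", "c"])})

proba = predict_proba_safely(MutatingModel(), X_test)

# The caller's frame still holds the unseen category "c".
values_after = list(X_test["cat"])
print(values_after)
```

Other models in the ensemble can then keep using X_test, at the cost of one extra copy per prediction call.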
[python-package] take shallow copy of dataframe in predict (fixes #6195…
github-actions commented on Dec 25, 2024
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.