[python-package] LightGBM predict_proba() corrupts pandas categorical columns with unseen values #6195

Closed
@fingoldo

Description

In predict_proba of LGBMClassifier (at least), if the input is a pandas DataFrame with a categorical column, and that column contains a value that was not seen during fitting, the entire column becomes corrupted.

Some might argue it's not important, but this behaviour is undocumented and unexpected, and it took me a lot of time to detect. It led to nulls appearing out of nowhere in a chain of models making predictions on the same data. IMHO no model should change its inputs (if there are performance reasons, then at least not without some special flag explicitly set).

Reproducible example

import lightgbm
import pandas as pd, numpy as np
from lightgbm import LGBMClassifier

nsamples=50

X_train = pd.DataFrame(np.random.random(size=(nsamples, 4)))
X_train["cat"] = np.random.choice(["a", "b"], size=nsamples, replace=True)
X_train['cat']=X_train['cat'].astype('category')

est=LGBMClassifier(verbose=0)
est.fit(X_train, np.random.randint(0, 2, size=nsamples))

X_test = pd.DataFrame(np.random.random(size=(nsamples, 4)))
X_test["cat"] = np.random.choice(["a", "c"], size=nsamples, replace=True) # note that c is unseen before
X_test["cat"] = X_test["cat"].astype("category")
print(X_test["cat"].value_counts(dropna=False)) # and it's retained

#cat
#a    30
#c    20
#Name: count, dtype: int64

est.predict_proba(X_test) # but not after predict_proba
print(X_test["cat"].value_counts(dropna=False)) # note that values in the cat column have been corrupted altogether

#cat
#a      25
#NaN    25
#b       0
#Name: count, dtype: int64
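
The reproduction above hinges on "c" being absent from the training categories. As a rough sketch (not part of LightGBM; `unseen_categories` is a hypothetical helper name introduced here), a pandas-only check like the following could flag such values before a frame is handed to any model:

```python
import pandas as pd


def unseen_categories(train_col: pd.Series, test_col: pd.Series) -> list:
    """Return values present in test_col that are not categories of train_col.

    Hypothetical helper for illustration; both columns are assumed to have
    the pandas "category" dtype.
    """
    seen = set(train_col.cat.categories)
    present = set(test_col.dropna().unique())
    return sorted(present - seen)


train_cat = pd.Series(["a", "b", "a"], dtype="category")
test_cat = pd.Series(["a", "c", "c"], dtype="category")

missing = unseen_categories(train_cat, test_cat)
print(missing)
```

A non-empty result means the test frame contains values that the fitted model has no category code for, which is the situation that triggers the corruption reported here.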

Environment info

print(np.__version__, pd.__version__, lightgbm.__version__)

1.24.4 2.0.3 4.1.0
OS=Windows

Command(s) you used to install LightGBM

pip install lightgbm

Additional Comments

Activity

changed the title from "LightGBM corrupts categorical columns with unseen values on prediction" to "[python-package] LightGBM `predict_proba()` corrupts `pandas` categorical columns with unseen values" on Nov 15, 2023
jmoralez (Collaborator) commented on Nov 15, 2023

Hey @fingoldo, thanks for using LightGBM and sorry for the troubles. We used to take a shallow copy there but it wasn't obvious that the predict step depended on that and a recent refactor removed it. We'll work on a fix.
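
To illustrate why the shallow copy matters, here is a minimal pandas-only sketch. It is an assumption about the general mechanism, not LightGBM's actual code: `align_categories` is a hypothetical helper that re-codes a categorical column against the training categories, with and without first taking a shallow copy of the frame.

```python
import pandas as pd


def align_categories(df: pd.DataFrame, train_categories: dict, copy: bool) -> pd.DataFrame:
    """Hypothetical helper: re-code categorical columns against training categories.

    Values outside the training categories become NaN.
    """
    if copy:
        # Shallow copy: a new frame object, so column assignment below
        # rebinds columns on the copy and leaves the caller's frame alone.
        df = df.copy(deep=False)
    for col, cats in train_categories.items():
        df[col] = df[col].cat.set_categories(cats)
    return df


test = pd.DataFrame({"cat": pd.Categorical(["a", "c", "a"])})
train_categories = {"cat": ["a", "b"]}

safe = test.copy()
align_categories(safe, train_categories, copy=True)
safe_nulls = int(safe["cat"].isna().sum())      # caller's frame keeps "c"

unsafe = test.copy()
align_categories(unsafe, train_categories, copy=False)
unsafe_nulls = int(unsafe["cat"].isna().sum())  # "c" was replaced by NaN in the caller's frame

print(safe_nulls, unsafe_nulls)
```

Without the copy, the column assignment happens on the caller's own DataFrame, which matches the symptom in the report: the unseen value turns into NaN and the category list changes to the training one.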

fingoldo (Author) commented on Nov 15, 2023

> Hey @fingoldo, thanks for using LightGBM and sorry for the troubles. We used to take a shallow copy there but it wasn't obvious that the predict step depended on that and a recent refactor removed it. We'll work on a fix.

Thank you so much Jose, that's what I call a fast turnaround! ;-) For now I just pass a .copy() of a dataframe to LightGBM, then other models of the ensemble are not affected.
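
The workaround can be sketched as a small wrapper. Everything here is illustrative: `predict_proba_safely` is a hypothetical helper name, and `MutatingModel` is a stand-in that imitates an estimator altering its input, so the snippet runs without LightGBM installed. With a real fitted `LGBMClassifier`, the same wrapper would be called in the same way.

```python
import pandas as pd


def predict_proba_safely(model, X: pd.DataFrame):
    """Call predict_proba on a copy so the caller's frame cannot be altered."""
    return model.predict_proba(X.copy())


class MutatingModel:
    """Hypothetical stand-in for an estimator that re-codes its input in place."""

    def predict_proba(self, X: pd.DataFrame):
        # Imitates the reported behaviour: unseen values are turned into NaN.
        X["cat"] = X["cat"].cat.set_categories(["a", "b"])
        return [[0.5, 0.5]] * len(X)


X_test = pd.DataFrame({"cat": pd.Categorical(["a", "c", "a", "c"])})

before = X_test["cat"].astype(object).tolist()
proba = predict_proba_safely(MutatingModel(), X_test)
after = X_test["cat"].astype(object).tolist()

print(before == after)
```

The copy costs extra memory for every call, so this is a stopgap for affected versions rather than something to keep once the fix is available.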

added a commit that references this issue on Dec 7, 2023
e797985
github-actions commented on Dec 25, 2024

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

locked as resolved and limited conversation to collaborators on Dec 25, 2024