Description
Describe the bug
In check_array, dtype_orig is determined for array objects that are pandas DataFrames by checking all(isinstance(dtype_iter, np.dtype) for dtype_iter in dtypes_orig). This excludes the pandas nullable extension types such as boolean, Int64, and Float64, resulting in a dtype_orig of None.
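A minimal sketch of that check in isolation, to show why a nullable column fails it. The helper name all_numpy_dtypes is hypothetical and only mirrors the expression quoted above; it is not scikit-learn's actual code path.

```python
import numpy as np
import pandas as pd


def all_numpy_dtypes(df):
    """Hypothetical helper: True only if every column dtype is an np.dtype."""
    return all(isinstance(dtype_iter, np.dtype) for dtype_iter in df.dtypes)


# Same values, once with a NumPy dtype and once with the nullable extension dtype.
plain = pd.DataFrame({"a": pd.Series([1, 0, 1, 0], dtype="int64")})
nullable = pd.DataFrame({"a": pd.Series([1, 0, 1, 0], dtype="Int64")})

plain_ok = all_numpy_dtypes(plain)
nullable_ok = all_numpy_dtypes(nullable)
print(plain_ok, nullable_ok)
```

The nullable column's dtype is a pandas extension dtype rather than an np.dtype instance, so the check fails for the whole frame.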
If pandas_requires_conversion is true, check_array then ends up calling array = array.astype(None), which pandas interprets as a request to convert to float64. If non-numeric, non-boolean data is present in array, this raises ValueError: could not convert string to float: when the data has the object dtype with string data, or ValueError: Cannot cast object dtype to float64 when the data has the category dtype with object categories.
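The float64 fallback can be seen without scikit-learn. This sketch assumes the conversion behaves like an explicit cast to NumPy's default dtype; it covers only the object-dtype case, since the exact message for the category case may vary by pandas version.

```python
import numpy as np
import pandas as pd

# NumPy resolves a dtype of None to its default, float64.
default_dtype = np.dtype(None)

# A string column like "b" in the reproduction below.
strings = pd.Series(["a", "b", "c", "d"], dtype="object")

try:
    strings.astype(default_dtype)
    error_message = None
except ValueError as exc:
    # Casting strings to float64 fails on the first non-numeric value.
    error_message = str(exc)

print(default_dtype, error_message)
```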
I first found this while using the imblearn SMOTEN and SMOTENC oversamplers, but it could happen with any other use of check_array.
Steps/Code to Reproduce
Reproduction via oversamplers
import pandas as pd
import pytest
from imblearn import over_sampling as im

# Each nullable dtype in column "a" triggers the error during resampling.
for dtype in ["boolean", "Int64", "Float64"]:
    X = pd.DataFrame(
        {
            "a": pd.Series([1, 0, 1, 0], dtype=dtype),
            "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
            "c": pd.Series([9, 8, 7, 6], dtype="int64"),
        }
    )
    y = pd.Series([0, 1, 1, 0], dtype="int64")
    for oversampler in [im.SMOTENC(categorical_features=[0, 1]), im.SMOTEN()]:
        with pytest.raises(ValueError):
            oversampler.fit_resample(X, y)
Reproduction via check_array directly
import pandas as pd
import pytest
from sklearn.utils.validation import check_array

# The same frames fail when passed to check_array directly.
for dtype in ["boolean", "Int64", "Float64"]:
    X = pd.DataFrame(
        {
            "a": pd.Series([1, 0, 1, 0], dtype=dtype),
            "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
            "c": pd.Series([9, 8, 7, 6], dtype="int64"),
        }
    )
    with pytest.raises(ValueError):
        check_array(X, dtype=None)
Expected Results
We should get the same behavior that's seen with the non-nullable equivalents ["bool", "int64", "float64"], which is no error.
import pandas as pd
from sklearn.utils.validation import check_array

# With NumPy dtypes in column "a", check_array succeeds.
for dtype in ["bool", "int64", "float64"]:
    X = pd.DataFrame(
        {
            "a": pd.Series([1, 0, 1, 0], dtype=dtype),
            "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
            "c": pd.Series([9, 8, 7, 6], dtype="int64"),
        }
    )
    check_array(X, dtype=None)
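As a possible stopgap, the nullable columns could be cast to their non-nullable equivalents before validation. This is only a sketch: downcast_nullable and NULLABLE_TO_NUMPY are hypothetical names, not part of pandas or scikit-learn, and the cast assumes the nullable columns contain no pd.NA values.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from nullable dtype names to NumPy equivalents.
NULLABLE_TO_NUMPY = {"boolean": "bool", "Int64": "int64", "Float64": "float64"}


def downcast_nullable(df):
    """Return a copy with nullable columns cast to NumPy dtypes.

    Assumes those columns hold no missing values (pd.NA).
    """
    mapping = {
        col: NULLABLE_TO_NUMPY[str(dtype)]
        for col, dtype in df.dtypes.items()
        if str(dtype) in NULLABLE_TO_NUMPY
    }
    return df.astype(mapping)


converted = {}
for dtype in ["boolean", "Int64", "Float64"]:
    X = pd.DataFrame(
        {
            "a": pd.Series([1, 0, 1, 0], dtype=dtype),
            "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
            "c": pd.Series([9, 8, 7, 6], dtype="int64"),
        }
    )
    converted[dtype] = downcast_nullable(X)
```

After the cast every column dtype is an np.dtype again, so the frame matches the non-nullable inputs above.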
Actual Results
The actual result is a ValueError: could not convert string to float: when the data has the object dtype with string data, or a ValueError: Cannot cast object dtype to float64 when the data has the category dtype with object categories.
Versions
System:
python: 3.8.2 (default, May 21 2021, 12:12:59) [Clang 11.0.3 (clang-1103.0.32.62)]
executable: /Users/tamar.grey/.pyenv/versions/3.8.2/envs/evalml-dev/bin/python
machine: macOS-10.16-x86_64-i386-64bit
Python dependencies:
sklearn: 1.2.2
pip: 22.2.2
setuptools: 59.8.0
numpy: 1.22.4
scipy: 1.8.1
Cython: 0.29.32
pandas: 1.5.3
matplotlib: 3.5.3
joblib: 1.2.0
threadpoolctl: 3.1.0
Built with OpenMP: True
Activity
tamargrey commented on Mar 9, 2023
This also seems to be the case when any category dtype is present with non-numeric categories.
thomasjpfan commented on Mar 10, 2023
Thank you for opening the issue! I opened #25814 to fix it.