check_array can call array.astype(None), raising ValueError if pandas extension types are present in a pd.DataFrame array  #25798

Closed
@tamargrey

Description

Describe the bug

In check_array, when array is a pandas DataFrame, dtype_orig is determined by checking all(isinstance(dtype_iter, np.dtype) for dtype_iter in dtypes_orig). The pandas nullable extension types, such as boolean, Int64, and Float64, are not np.dtype instances, so when any of them is present the check fails and dtype_orig is None.

If pandas_requires_conversion is true, check_array then calls array = array.astype(None), which pandas takes to mean that a conversion to float64 should be attempted. If array contains data that is neither numeric nor boolean, this raises ValueError: could not convert string to float: when the data has the object dtype with string data, or ValueError: Cannot cast object dtype to float64 when the data has the category dtype with object categories.
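The two ingredients of the failure can be checked outside of scikit-learn. The following is a minimal sketch, not the check_array source: it reuses the isinstance check quoted above on a small DataFrame, shows that NumPy resolves a dtype of None to float64, and then casts the string column to that dtype explicitly. The variable names all_numpy, default_dtype, and error are illustrative only.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(
    {
        "a": pd.Series([1, 0, 1, 0], dtype="Int64"),
        "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
    }
)

# Same check as quoted in the description. Int64 is a pandas extension
# dtype, not an np.dtype instance, so the check is False.
dtypes_orig = list(X.dtypes)
all_numpy = all(isinstance(dtype_iter, np.dtype) for dtype_iter in dtypes_orig)

# NumPy treats a dtype of None as its default dtype, float64.
default_dtype = np.dtype(None)

# Casting the string column to that dtype fails with a ValueError.
try:
    X["b"].astype(default_dtype)
    error = None
except ValueError as exc:
    error = exc

print(all_numpy, default_dtype, type(error).__name__)
```

The explicit cast to np.dtype(None) stands in for the astype(None) call described above, so the sketch does not depend on how a particular pandas version handles a None argument.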

I first ran into this while using the imblearn SMOTEN and SMOTENC oversamplers, but it could happen with any other use of check_array.

Steps/Code to Reproduce

Reproduction via oversamplers

    import pandas as pd
    import pytest
    from imblearn import over_sampling as im

    for dtype in ["boolean", "Int64", "Float64"]:
        X = pd.DataFrame(
            {
                "a": pd.Series([1, 0, 1, 0], dtype=dtype),
                "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
                "c": pd.Series([9, 8, 7, 6], dtype="int64"),
            }
        )
        y = pd.Series([0, 1, 1, 0], dtype="int64")

        for oversampler in [im.SMOTENC(categorical_features=[0, 1]), im.SMOTEN()]:
            with pytest.raises(ValueError):
                oversampler.fit_resample(X, y)

Reproduction via check_array directly

    import pandas as pd
    import pytest
    from sklearn.utils.validation import check_array

    for dtype in ["boolean", "Int64", "Float64"]:
        X = pd.DataFrame(
            {
                "a": pd.Series([1, 0, 1, 0], dtype=dtype),
                "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
                "c": pd.Series([9, 8, 7, 6], dtype="int64"),
            }
        )

        with pytest.raises(ValueError):
            check_array(X, dtype=None)

Expected Results

We should get the same behavior as with the non-nullable equivalents ["bool", "int64", "float64"], which raise no error.

    import pandas as pd
    from sklearn.utils.validation import check_array

    for dtype in ["bool", "int64", "float64"]:
        X = pd.DataFrame(
            {
                "a": pd.Series([1, 0, 1, 0], dtype=dtype),
                "b": pd.Series(["a", "b", "c", "d"], dtype="object"),
                "c": pd.Series([9, 8, 7, 6], dtype="int64"),
            }
        )

        check_array(X, dtype=None)

Actual Results

The actual result is ValueError: could not convert string to float: when the data has the object dtype with string data, or ValueError: Cannot cast object dtype to float64 when the data has the category dtype with object categories.

Versions

System:
    python: 3.8.2 (default, May 21 2021, 12:12:59)  [Clang 11.0.3 (clang-1103.0.32.62)]
executable: /Users/tamar.grey/.pyenv/versions/3.8.2/envs/evalml-dev/bin/python
   machine: macOS-10.16-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.2.2
          pip: 22.2.2
   setuptools: 59.8.0
        numpy: 1.22.4
        scipy: 1.8.1
       Cython: 0.29.32
       pandas: 1.5.3
   matplotlib: 3.5.3
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

Activity

changed the title from `check_array` can call `array.astype(None)` if pandas extension types are present in a pd.DataFrame `array` to `check_array` can call `array.astype(None)`, raising ValueError if pandas extension types are present in a pd.DataFrame `array` on Mar 9, 2023

tamargrey (Author) commented on Mar 9, 2023

This also seems to be the case when any category dtype with non-numeric categories is present.
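A minimal sketch of the category case at the pandas level, assuming the failure comes from the same float64 cast described in the issue. It does not call check_array; it only shows that casting a categorical Series with string categories to float64 raises a ValueError. The names s and error are illustrative.

```python
import pandas as pd

# Categorical column whose categories are strings (non-numeric).
s = pd.Series(["a", "b", "c", "d"], dtype="category")

# The float64 cast cannot convert the string categories.
try:
    s.astype("float64")
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__, error)
```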

thomasjpfan (Member) commented on Mar 10, 2023

Thank you for opening the issue! I opened #25814 to fix it.

Issue #25798 · scikit-learn/scikit-learn