ENH: Annotate the first array_split argument with TypeVar #23217

Open
FactorizeD opened this issue Feb 15, 2023 · 1 comment

@FactorizeD

Proposed new feature or change:

Hi all,

Consider the following code (numpy v1.23.0, mypy v0.982; I checked, and the signature of the function hasn't changed in numpy v1.24.0):

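import numpy as np
import pandas as pd

df = pd.DataFrame({'some_column': [0, 1, 2, 3]})
# Split the DataFrame into two batches and inspect the runtime type of each
for batch in np.array_split(
    df,
    2,
):
    print(type(batch))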

# <class 'pandas.core.frame.DataFrame'>
# <class 'pandas.core.frame.DataFrame'>

However, when mypy is run on this code with reveal_type(batch) instead of print(...), we get: Revealed type is "numpy.ndarray[Any, numpy.dtype[Any]]".

Has this been considered before? I guess the best solution would be to introduce a TypeVar somewhere, but it might not be so obvious?
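To make the TypeVar idea concrete, here is a minimal sketch of the kind of signature being asked for. It is not NumPy's stub for array_split: `split_preserving_type` and `SupportsSlicing` are hypothetical names, and the body is a pure-Python stand-in that only mimics array_split's chunk sizes for objects that support `len()` and slicing. Whether a given type checker accepts this exact protocol has not been verified here.

```python
from typing import List, Protocol, TypeVar

# Hypothetical names, for illustration only; none of this is NumPy API.
T = TypeVar("T", bound="SupportsSlicing")


class SupportsSlicing(Protocol):
    """Anything with a length whose slices have the same type as the object."""

    def __len__(self) -> int: ...

    def __getitem__(self: T, key: slice) -> T: ...


def split_preserving_type(obj: T, sections: int) -> List[T]:
    """Split obj into `sections` chunks; the annotation ties output to input type."""
    if sections <= 0:
        raise ValueError("sections must be a positive integer")
    # Like array_split, the first `extra` chunks get one additional element.
    size, extra = divmod(len(obj), sections)
    chunks: List[T] = []
    start = 0
    for i in range(sections):
        stop = start + size + (1 if i < extra else 0)
        chunks.append(obj[start:stop])
        start = stop
    return chunks
```

With a signature of this shape, passing a list yields a list of lists and passing a tuple yields a list of tuples, both at runtime and in the annotation. The open question in this issue is whether the same can be expressed for array_split itself.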

@BvB93
Member

BvB93 commented Sep 6, 2023

I'm afraid that use of a typevar here wouldn't quite solve the issue:

What's going on here is that np.array_split() calls np.swapaxes() under the hood, which in turn calls pd.DataFrame.swapaxes() via an intermediate getattr() call, and it is that method which is ultimately responsible for returning the dataframe. Properly typing this entire protocol chain would, at best, be non-trivial. My memory is a bit vague on the subject, but I recall some past issues in mypy related to overloaded protocols and typevars; I'm not sure to what degree they are still relevant.
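The getattr()-based hand-off described above can be modelled in a few lines. This is a simplified sketch, not NumPy's actual implementation: `PlainArray`, `Frame`, and the module-level `swapaxes` are hypothetical stand-ins for an ndarray, a DataFrame, and np.swapaxes, and they only handle 2-D nested lists.

```python
from typing import Any, Iterable, List


def _transpose(rows: List[list]) -> List[list]:
    """Swap the two axes of a 2-D nested list."""
    return [list(column) for column in zip(*rows)]


class PlainArray:
    """Hypothetical stand-in for an ndarray (2-D only)."""

    def __init__(self, rows: Iterable[Iterable[Any]]) -> None:
        self.rows = [list(row) for row in rows]

    def swapaxes(self, axis1: int, axis2: int) -> "PlainArray":
        rows = self.rows if axis1 == axis2 else _transpose(self.rows)
        return PlainArray(rows)


class Frame:
    """Hypothetical stand-in for a DataFrame: its own swapaxes returns a Frame."""

    def __init__(self, rows: Iterable[Iterable[Any]]) -> None:
        self.rows = [list(row) for row in rows]

    def swapaxes(self, axis1: int, axis2: int) -> "Frame":
        rows = self.rows if axis1 == axis2 else _transpose(self.rows)
        return Frame(rows)


def swapaxes(obj: Any, axis1: int, axis2: int) -> Any:
    """Model of the dispatch: prefer the object's own method if it has one."""
    method = getattr(obj, "swapaxes", None)
    if method is None:
        # No method on the object: convert to the array type and use that.
        return PlainArray(obj).swapaxes(axis1, axis2)
    # The return type is whatever the object's method decides to return.
    return method(axis1, axis2)
```

Here `swapaxes(Frame(...), 0, 1)` returns a `Frame`, while a plain nested list comes back as a `PlainArray`. The return type depends on a method looked up by name at runtime, which is why the honest annotation in this sketch is `Any`, and why a single TypeVar on the outer function does not capture it.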

To make things even more "interesting", pd.DataFrame.swapaxes() seems to have been deprecated since pandas 2.1.0, so even at runtime the code snippet you posted above will return two numpy arrays in the future, unless pandas implements something like the __array_function__ protocol.
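For reference, this is roughly what opting in to the __array_function__ protocol looks like for a container type. It is a minimal sketch under stated assumptions: `Table` is a hypothetical class and says nothing about how pandas would or should implement the protocol. It only shows that an object defining `__array_function__` gets to decide what np.array_split returns.

```python
import numpy as np


class Table:
    """Hypothetical container that keeps its own type through np.array_split."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __array_function__(self, func, types, args, kwargs):
        # NumPy calls this hook instead of its own implementation whenever
        # a Table is passed to a function that supports the protocol.
        if func is np.array_split:
            table = args[0]
            sections = args[1] if len(args) > 1 else kwargs["indices_or_sections"]
            # Let NumPy work out the chunk boundaries on a plain index array.
            index_chunks = np.array_split(np.arange(len(table.rows)), sections)
            return [Table(table.rows[i] for i in chunk) for chunk in index_chunks]
        # Signal that every other function is unsupported for this type.
        return NotImplemented


parts = np.array_split(Table([0, 1, 2, 3]), 2)
```

Each element of `parts` is a `Table` rather than an ndarray. Note that this changes runtime behaviour only; the static return type reported by mypy would still come from NumPy's stubs.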
