Slicing problem that can occur when shuffle algorithm for 'take' is applied #11669

Open
davidhassell opened this issue Jan 14, 2025 · 0 comments
Labels
needs triage Needs a response from a contributor

@davidhassell
Contributor

Hi,

The shuffle algorithm for take in slicing (introduced by PR #11267 in 2024.08.2, no associated issue) seems like a useful optimisation - thanks. However, I'd really like to have the option of turning it off!

The problem I have is with the concatenation step in concatenate_arrays (https://github.com/dask/dask/blob/main/dask/array/_shuffle.py#L318), which is called by _shuffle (https://github.com/dask/dask/blob/main/dask/array/_shuffle.py#L289). This causes problems with objects that implement the __array_function__ protocol but do not provide an implementation for np.concatenate. For these we get (where NetCDF4Array is my array-like data object being sliced):

File ~/dask/dask/array/_shuffle.py:323, in concatenate_arrays(arrs, sorter, axis)
    318 def concatenate_arrays(arrs, sorter, axis):
    319    return take_lookup(
--> 320         concatenate_lookup.dispatch(type(arrs[0]))(arrs, axis=axis),
    321         np.argsort(sorter[1]),
    322         axis=axis,
    323     )

TypeError: no implementation found for 'numpy.concatenate' on types that implement __array_function__: [<class 'NetCDF4Array'>]
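The same TypeError can be reproduced with NumPy alone, without Dask. The following is a minimal sketch: `OpaqueArray` is a hypothetical stand-in for NetCDF4Array (not the real class), and it is assumed to decline every NumPy function by returning `NotImplemented` from `__array_function__`.

```python
import numpy as np


class OpaqueArray:
    """Hypothetical stand-in for NetCDF4Array: opts in to the
    __array_function__ protocol but implements no NumPy functions."""

    def __init__(self, shape):
        self.shape = shape
        self.ndim = len(shape)

    def __array_function__(self, func, types, args, kwargs):
        # Decline every NumPy function, including np.concatenate.
        return NotImplemented


def concatenate_error_message():
    """Return the message of the TypeError raised by np.concatenate."""
    chunks = [OpaqueArray((4,)), OpaqueArray((4,))]
    try:
        np.concatenate(chunks, axis=0)
    except TypeError as exc:
        return str(exc)
    return None


message = concatenate_error_message()
print(message)
```

Because every participating type declines, NumPy has nothing to fall back on and raises rather than coercing the objects to numpy arrays.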

It is essential in one of my workflows [*] that the NetCDF4Array array-like objects in the graph do not get converted to numpy arrays through the slicing process - i.e. after an arbitrary number of computed __getitem__s, I still need a NetCDF4Array object in the graph, and not a numpy array. (NetCDF4Array.__getitem__ does not return a numpy array, but rather another NetCDF4Array object with a lazy slice, and I manually add a "convert to numpy" layer to the graph if I need to.)
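To make the lazy-slice behaviour concrete, here is a rough sketch of the pattern. `LazySlicedArray` and its `to_numpy` method are hypothetical names chosen for illustration; the real NetCDF4Array may work differently, and an in-memory numpy array plays the role of the remote file here.

```python
import numpy as np


class LazySlicedArray:
    """Hypothetical virtual array: records slices instead of applying them."""

    def __init__(self, source, indices=()):
        self._source = source            # stands in for the remote file's data
        self._indices = tuple(indices)   # slices recorded so far, in order

    def __getitem__(self, index):
        # Return another lazy object of the same type; no data is read.
        return type(self)(self._source, self._indices + (index,))

    def to_numpy(self):
        # The explicit "convert to numpy" step: replay the recorded slices.
        data = self._source
        for index in self._indices:
            data = data[index]
        return np.asarray(data)


x = LazySlicedArray(np.arange(10))
y = x[2:8][[0, 2]]           # two chained slices, still lazy
print(type(y).__name__)      # LazySlicedArray
print(y.to_numpy())          # [2 4]
```

The point of the design is that only `to_numpy` touches the data, so any step that calls np.concatenate on the chunks would break the chain of lazy objects.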

I can't implement np.concatenate via the __array_function__ protocol in a way that returns a new NetCDF4Array object, because that would be like re-implementing Dask (!).

Un-implementing the __array_function__ protocol is not an option because, even though it makes the error go away and the graph executes, I end up with numpy arrays in my chunks, rather than the NetCDF4Array objects that I need.
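The coercion can be seen at the NumPy level in isolation. This is a sketch under assumptions: `CoercibleArray` is a hypothetical class that defines only `__array__` (no `__array_function__`), and it says nothing about what Dask's own dispatch does internally.

```python
import numpy as np


class CoercibleArray:
    """Hypothetical array-like with __array__ but no __array_function__."""

    def __init__(self, data):
        self._data = list(data)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this to obtain a real ndarray from the object.
        return np.asarray(self._data, dtype=dtype)


chunks = [CoercibleArray([0, 1, 2]), CoercibleArray([0, 1, 2])]
result = np.concatenate(chunks, axis=0)

# No error this time, but the custom type has been lost.
print(type(result).__name__)
print(result)
```

Without the protocol, np.concatenate silently converts each chunk through `__array__`, so the result is a plain numpy array.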

I have pinned to 2024.07.1 to get round this, but that's not going to be sustainable forever. I can't see how the shuffle algorithm can avoid doing a concatenation, so the only solution I can see is an option to disable shuffling during take and revert to the older behaviour ... Am I missing a trick, though?

Many thanks for your help,
David


[*] The workflow is using Dask to manage "active storage reductions": essentially getting the remote server to carry out reduction calculations where the data actually is and returning the small reduced result, rather than bringing the data across the network and performing the reductions locally. To do this, the array-like data in the Dask graph must, up until the reduction graph layer, remain as a virtual array-like object that knows how to get the data (i.e. my NetCDF4Array), rather than as an object that actually contains the data locally (i.e. a numpy array). Prior to the reduction layer, we don't allow operations that would change the data values relative to those in the remote file (policed in our library outside of Dask) - but we must allow slicing, since real-life reductions are so often not required on the whole of a file's data.

@github-actions github-actions bot added the needs triage Needs a response from a contributor label Jan 14, 2025
@davidhassell davidhassell changed the title Slicing problem when that can occur when shuffle algorthm for 'take' is applied Slicing problem that can occur when shuffle algorthm for 'take' is applied Jan 15, 2025