Hi,

The shuffle algorithm for `take` in slicing (introduced by PR #11267 in `2024.08.2`, with no associated issue) seems like a useful optimisation - thanks! However, I'd really like to have the option of turning it off.

The problem I have is with the concatenation step in `concatenate_arrays` (https://github.com/dask/dask/blob/main/dask/array/_shuffle.py#L318), which is called by `_shuffle` (https://github.com/dask/dask/blob/main/dask/array/_shuffle.py#L289). This step causes problems for objects that implement the `__array_function__` protocol but do not handle `np.concatenate`: for these, the concatenation raises an error (in my case the array-like object being sliced is a `NetCDF4Array`).
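To make the failure mode concrete, here is a minimal, self-contained reproduction with a toy class (not the real `NetCDF4Array`): an object that defines `__array_function__` but does not handle `np.concatenate` makes numpy refuse to fall back to its own implementation, which is what the shuffle's concatenation step then hits.

```python
import numpy as np

class LazySlice:
    """Toy stand-in for an array-like object that implements __array_function__."""

    def __array_function__(self, func, types, args, kwargs):
        # The real class would dispatch the functions it supports here;
        # np.concatenate is not one of them, so signal "not handled".
        return NotImplemented

np.concatenate([LazySlice(), LazySlice()])
# TypeError: no implementation found for 'numpy.concatenate' on types that
# implement __array_function__: [<class '__main__.LazySlice'>]
```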
It is essential in one of my workflows [*] that my `NetCDFArray` array-like objects in the graph do not get converted to numpy arrays by the slicing process - i.e. after an arbitrary number of computed `__getitem__`s, I still need a `NetCDF4Array` object in the graph, not a numpy array. (`NetCDFArray.__getitem__` does not return a numpy array, but rather another `NetCDF4Array` object carrying a lazy slice, and I manually add a "convert to numpy" layer to the graph if I need one.)
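For illustration, a minimal sketch of that lazy-slicing behaviour - the class name, constructor arguments and the `__array__`-based conversion are assumptions, and index composition and shape bookkeeping are elided; the real classes are more involved:

```python
import numpy as np

class LazyNetCDFArray:
    """Illustrative stand-in for the NetCDFArray/NetCDF4Array described above."""

    def __init__(self, filename, variable, shape, dtype="f8", index=Ellipsis):
        self.filename, self.variable = filename, variable
        self.shape, self.dtype = shape, np.dtype(dtype)
        self.index = index                        # applied only when the data is read

    @property
    def ndim(self):
        return len(self.shape)

    def __getitem__(self, index):
        # No file access here: return another lazy object carrying the new index.
        return LazyNetCDFArray(self.filename, self.variable, self.shape,
                               self.dtype, index)

    def __array__(self, dtype=None):
        # The explicit "convert to numpy" step: only now is the file opened and read.
        import netCDF4
        with netCDF4.Dataset(self.filename) as nc:
            return np.asarray(nc.variables[self.variable][self.index], dtype=dtype)
```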
I can't implement the `__array_function__` protocol for `np.concatenate` in a way that returns a new `NetCDFArray` object, because that would be like re-implementing Dask (!).
Un-implementing the `__array_function__` protocol is not an option either: even though it makes the error go away and the graph executes, I end up with numpy arrays in my chunks rather than the `NetCDF4Array` objects that I need.
I have pinned to `2024.07.1` to get round this, but that's not sustainable forever. I can't see how the shuffle algorithm could avoid doing a concatenation, so the only option I can see is a way to disable shuffling during `take` and revert to the older behaviour ... Am I missing a trick, though?
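To be concrete about what I'm asking for, something along these lines would be enough. Note that the configuration key below is hypothetical - it does not exist in any Dask release - and `sliced` just stands for any dask array built with the kind of indexing that currently triggers the shuffle:

```python
import dask

# Hypothetical opt-out: fall back to the pre-2024.08.2 'take' behaviour.
with dask.config.set({"array.slicing.use-take-shuffle": False}):
    result = sliced.compute()
```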
Many thanks for your help,
David
[*] The workflow uses Dask to manage "active storage reductions": essentially getting the remote server to carry out reduction calculations where the data actually is and return the small reduced result, rather than bringing the data across the network and performing the reductions locally. To do this, the array-like data in the Dask graph must, up until the reduction graph layer, remain as a virtual array-like object that knows how to get the data (i.e. my `NetCDFArray`), rather than an object that actually contains the data locally (i.e. a numpy array). Prior to the reduction layer we don't allow operations that would change the data values relative to those in the remote file (this is policed in our library, outside of Dask) - but we must allow slicing, since real-life reductions are so often not required on the whole of a file's data.
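As a rough sketch of that constraint, using the illustrative `LazyNetCDFArray` class from further up (the file path, variable name and shape are made up): slicing must leave the lazy objects in the graph, and materialisation happens only in an explicit layer added just before the reduction.

```python
import numpy as np
import dask.array as da

lazy = LazyNetCDFArray("/data/tas.nc", "tas", shape=(360, 720))   # reads nothing yet
x = da.from_array(lazy, chunks=(180, 360), asarray=False,         # keep the object per chunk
                  meta=np.empty((0, 0), dtype=lazy.dtype))
y = x[[5, 1, 300], :]            # fancy indexing ('take'), which may take the shuffle path
y = y.map_blocks(np.asarray)     # the manual "convert to numpy" layer before reducing
result = y.mean().compute()      # (the active-storage version replaces this local mean)

# With the shuffle path enabled, _shuffle's np.concatenate either materialises the chunks
# early (this toy class) or fails outright (the real NetCDF4Array, which implements
# __array_function__) - which is the problem described above.
```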