Using nested keys in array graphs creates large number of unnecessary tasks for higher-dimensional arrays #11619

Open
hendrikmakait opened this issue Dec 20, 2024 · 0 comments

@hendrikmakait

While investigating dask/distributed#8958, I noticed this:

<Task None concrete(<Task None _identity_cast(<Task None _identity_cast(<Task None _identity_cast(<Task None _identity_cast(<Task None _identity_cast(Alias(('getitem-f7fd4f245dfedafeb33a2841a9c414ca', 2, 3, 19, 5, 0)), typ=<class 'list'>)>, <Task None _identity_cast(Alias(('getitem-f7fd4f245dfedafeb33a2841a9c414ca', 2, 3, 19, 5.9, 0)), typ=<class 'list'>)>, typ=<class 'list'>)>, typ=<class 'list'>)>, typ=<class 'list'>)>, typ=<class 'list'>)>)>=

Basically, embedding keys in nested data structures creates a large overhead of task objects: every level of nesting gets wrapped in its own _identity_cast task. For the workload I investigated, this appears to have accounted for up to 50% of all tasks. (Take that number with a grain of salt.)
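
To make the overhead concrete, here is a toy model in plain Python. It is not dask's actual graph construction. It assumes that each list in a nested key structure costs one wrapper task, which is what the repr above suggests. The helper names (nested_keys, count_lists, flat_keys) and the key name "getitem" are made up for illustration.

```python
from itertools import product


def nested_keys(name, shape):
    """Build a nested list of chunk keys, one nesting level per dimension."""
    def build(prefix, dims):
        if not dims:
            return (name, *prefix)
        return [build(prefix + (i,), dims[1:]) for i in range(dims[0])]
    return build((), tuple(shape))


def count_lists(obj):
    """Count every list in the structure (toy assumption: one wrapper task each)."""
    if isinstance(obj, list):
        return 1 + sum(count_lists(item) for item in obj)
    return 0


def flat_keys(name, shape):
    """The same keys as a flat list, with no nesting at all."""
    return [(name, *idx) for idx in product(*(range(n) for n in shape))]


# Chunk layout mirroring the repr above: 5 dimensions, 2 keys in total.
shape = (1, 1, 1, 2, 1)
nested = nested_keys("getitem", shape)
print(count_lists(nested))               # 6 wrapper lists
print(len(flat_keys("getitem", shape)))  # 2 actual keys
```

In this model, two keys in a 5-dimensional layout pull in six wrappers, matching the six _identity_cast entries in the repr. The wrapper count grows with the number of dimensions even when most dimensions hold a single chunk.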

We should avoid using these nested data structures for keys entirely. In the array code, I've identified the usage of concrete (the example above) as a culprit that can be trivially removed. However, concatenate3 and related functions are other culprits that require a bit more rewriting.
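
One possible direction, sketched purely as an illustration and not as the planned rewrite: hand the consumer a flat, row-major list of chunks plus the chunk-grid shape, and rebuild the nesting inside the consuming function, so the graph itself never contains nested lists. The helper nest below is hypothetical and assumes every dimension has at least one chunk.

```python
def nest(flat, shape):
    """Rebuild a nested list from a flat, row-major sequence.

    Hypothetical helper: `shape` is the number of chunks per dimension,
    and every entry is assumed to be >= 1.
    """
    if len(shape) <= 1:
        return list(flat)
    step = len(flat) // shape[0]  # number of items under each outer index
    return [
        nest(flat[i * step:(i + 1) * step], shape[1:])
        for i in range(shape[0])
    ]


# Two chunks in a 5-dimensional grid, passed as one flat argument list.
print(nest(["a", "b"], (1, 1, 1, 2, 1)))  # [[[[['a'], ['b']]]]]
print(nest(list(range(6)), (2, 3)))       # [[0, 1, 2], [3, 4, 5]]
```

With this shape of interface, the nesting exists only as a transient value inside a single task rather than as one graph-level wrapper per list.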

@github-actions github-actions bot added the needs triage Needs a response from a contributor label Dec 20, 2024
@phofl phofl added array and removed needs triage Needs a response from a contributor labels Dec 23, 2024