I wrote a new algorithm for reduction in wgpu.
Unlike the existing one, it can use many threads to compute a single output element, leveraging shared memory.
Therefore, it performs better on unbalanced tensors with shapes like [50, 10000, 50] when the reduced dimension is the large one (here, dim 1).
However, it is slower on balanced tensors like [512, 512, 512], because the existing algorithm was already very well parallelized in those cases.
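To make the idea concrete, here is a minimal CPU-side sketch in Rust (not the actual WGSL kernel in this PR) of how a workgroup can reduce one output element: each thread first accumulates a strided partial sum into a shared buffer, then the partials are combined with a tree reduction. The function name and the workgroup size of 256 are illustrative assumptions.

```rust
// Simulated shared-memory reduction of one output element by one workgroup.
// On the GPU, each "t" below is a thread and `shared` lives in workgroup memory,
// with barriers between the two phases and between tree-reduction steps.
fn reduce_dim_shared(input: &[f32], workgroup_size: usize) -> f32 {
    // Phase 1: each thread accumulates a strided partial sum into its slot.
    let mut shared = vec![0.0f32; workgroup_size];
    for t in 0..workgroup_size {
        let mut acc = 0.0;
        let mut i = t;
        while i < input.len() {
            acc += input[i];
            i += workgroup_size;
        }
        shared[t] = acc;
    }

    // Phase 2: tree reduction over the shared buffer; each step halves the
    // number of active threads until slot 0 holds the full sum.
    let mut stride = workgroup_size / 2;
    while stride > 0 {
        for t in 0..stride {
            shared[t] += shared[t + stride];
        }
        stride /= 2;
    }

    shared[0]
}

fn main() {
    // One "row" of the reduced dimension, e.g. the 10000-long dim of [50, 10000, 50].
    let row = vec![1.0f32; 10_000];
    let sum = reduce_dim_shared(&row, 256);
    assert_eq!(sum, 10_000.0);
    println!("sum = {sum}");
}
```

With this scheme, a shape like [50, 10000, 50] reduced over dim 1 gives 2500 output elements, each backed by a full workgroup of cooperating threads, instead of 2500 lone threads each looping over 10000 values.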
This makes it a very good case for autotune, which I have implemented as well.
For now, sum_dim and mean_dim use the new algorithm with autotune; ArgMax, ArgMin and Sum haven't been ported yet.
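As a rough illustration of the autotune idea (this is a hypothetical sketch, not Burn's actual autotune API or the code in this PR): time both reduction strategies once per shape key and cache the winner, so balanced shapes keep the old kernel and unbalanced ones pick the new one. All names here (`ReduceStrategy`, `autotune_reduce`) are invented for the example.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Candidate reduction strategies to choose between.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ReduceStrategy {
    OneThreadPerOutput, // previous algorithm: one thread per output element
    SharedMemory,       // new algorithm: many threads + workgroup memory
}

// Benchmark both strategies the first time a shape is seen, cache the fastest.
fn autotune_reduce(
    cache: &mut HashMap<Vec<usize>, ReduceStrategy>,
    shape: &[usize],
    mut run_kernel: impl FnMut(ReduceStrategy),
) -> ReduceStrategy {
    *cache.entry(shape.to_vec()).or_insert_with(|| {
        let mut best = (ReduceStrategy::OneThreadPerOutput, Duration::MAX);
        for strategy in [ReduceStrategy::OneThreadPerOutput, ReduceStrategy::SharedMemory] {
            let start = Instant::now();
            run_kernel(strategy); // launch the candidate kernel on this shape
            let elapsed = start.elapsed();
            if elapsed < best.1 {
                best = (strategy, elapsed);
            }
        }
        best.0
    })
}

fn main() {
    let mut cache = HashMap::new();
    // Stand-in for dispatching the real kernels; here we just pretend to run them.
    let chosen = autotune_reduce(&mut cache, &[50, 10_000, 50], |_strategy| {});
    println!("chosen strategy: {chosen:?}");
}
```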