I wrote a new algorithm for reduction in wgpu.
Unlike the existing one, it can use many threads to compute a single output element, leveraging shared memory.
Therefore, it performs better on unbalanced tensors with shapes like [50, 10000, 50] when the reduced dimension is the large one (here, dim 1).
However, it is slower on balanced tensors like [512, 512, 512], because the existing algorithm was already very well parallelized in those cases.
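To make the idea concrete, here is a minimal CPU-side sketch in Rust (not the actual WGSL kernel in this PR) of how a workgroup can reduce one output element: each thread first accumulates a strided partial sum into a shared buffer, then the partials are combined with a tree reduction. The function name and the workgroup size of 256 are illustrative assumptions.

```rust
// Simulated shared-memory reduction of one output element by one workgroup.
// On the GPU, each "t" below is a thread and `shared` lives in workgroup memory,
// with barriers between the two phases and between tree-reduction steps.
fn reduce_dim_shared(input: &[f32], workgroup_size: usize) -> f32 {
    // Phase 1: each thread accumulates a strided partial sum into its slot.
    let mut shared = vec![0.0f32; workgroup_size];
    for t in 0..workgroup_size {
        let mut acc = 0.0;
        let mut i = t;
        while i < input.len() {
            acc += input[i];
            i += workgroup_size;
        }
        shared[t] = acc;
    }

    // Phase 2: tree reduction over the shared buffer; each step halves the
    // number of active threads until slot 0 holds the full sum.
    let mut stride = workgroup_size / 2;
    while stride > 0 {
        for t in 0..stride {
            shared[t] += shared[t + stride];
        }
        stride /= 2;
    }

    shared[0]
}

fn main() {
    // One "row" of the reduced dimension, e.g. the 10000-long dim of [50, 10000, 50].
    let row = vec![1.0f32; 10_000];
    let sum = reduce_dim_shared(&row, 256);
    assert_eq!(sum, 10_000.0);
    println!("sum = {sum}");
}
```

With this scheme, a shape like [50, 10000, 50] reduced over dim 1 gives 2500 output elements, each backed by a full workgroup of cooperating threads, instead of 2500 lone threads each looping over 10000 values.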
This makes it a very good case for autotune, which I have implemented as well.
For now, sum_dim and mean_dim use the new algorithm with autotune; ArgMax, ArgMin and Sum haven't been ported yet.
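As a rough illustration of the autotune idea (this is a hypothetical sketch, not Burn's actual autotune API or the code in this PR): time both reduction strategies once per shape key and cache the winner, so balanced shapes keep the old kernel and unbalanced ones pick the new one. All names here (`ReduceStrategy`, `autotune_reduce`) are invented for the example.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Candidate reduction strategies to choose between.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ReduceStrategy {
    OneThreadPerOutput, // previous algorithm: one thread per output element
    SharedMemory,       // new algorithm: many threads + workgroup memory
}

// Benchmark both strategies the first time a shape is seen, cache the fastest.
fn autotune_reduce(
    cache: &mut HashMap<Vec<usize>, ReduceStrategy>,
    shape: &[usize],
    mut run_kernel: impl FnMut(ReduceStrategy),
) -> ReduceStrategy {
    *cache.entry(shape.to_vec()).or_insert_with(|| {
        let mut best = (ReduceStrategy::OneThreadPerOutput, Duration::MAX);
        for strategy in [ReduceStrategy::OneThreadPerOutput, ReduceStrategy::SharedMemory] {
            let start = Instant::now();
            run_kernel(strategy); // launch the candidate kernel on this shape
            let elapsed = start.elapsed();
            if elapsed < best.1 {
                best = (strategy, elapsed);
            }
        }
        best.0
    })
}

fn main() {
    let mut cache = HashMap::new();
    // Stand-in for dispatching the real kernels; here we just pretend to run them.
    let chosen = autotune_reduce(&mut cache, &[50, 10_000, 50], |_strategy| {});
    println!("chosen strategy: {chosen:?}");
}
```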