
Differentiation of matrix-matrix product with CuArrays unexpectedly slow. #486

Closed
CarpeNecopinum opened this issue Nov 11, 2018 · 3 comments


@CarpeNecopinum

using CuArrays
using Flux

m1 = Tracker.param(rand(3, 100000) |> gpu)
m2 = Tracker.param(rand(100000, 3) |> gpu)

Tracker.forward((x,y)->x*y, m1, m2) # takes ~15 seconds even after repeated execution
Tracker.forward((x,y)->x*y, cpu(m1), cpu(m2)) # returns almost immediately

AD of a matrix-matrix product is needed e.g. for Neural Style Transfer (I used https://pytorch.org/tutorials/advanced/neural_style_tutorial.html as a rough guideline; the Gram matrix requires a matrix-matrix product). I noticed that training on the GPU is far slower than on the CPU, and I narrowed it down to the issue shown by the code above.
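
For reference, this is the kind of Gram-matrix computation that exercises this code path (a minimal sketch; the layer sizes and the features array are made up for illustration):

using CuArrays
using Flux

# Stand-in for a layer's feature map, reshaped to channels x (height*width).
features = Tracker.param(rand(Float32, 64, 128*128) |> gpu)

# The Gram matrix is a matrix-matrix product, the step that is slow on the GPU here.
gram = features * features'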

@MikeInnes
Member

I can't reproduce this on stable versions:

julia> @time Tracker.forward((x,y)->x*y, m1, m2);
  0.013572 seconds (2.29 k allocations: 111.667 KiB)

julia> @time Tracker.forward((x,y)->x*y, cpu(m1), cpu(m2));
  0.018829 seconds (2.07 k allocations: 6.967 MiB)

Are you on stable versions or master branches? It might be worth sending a manifest.
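
(If it helps, one quick way to capture it, assuming a Julia 1.x environment, is:

julia> using Pkg; Pkg.status()

or just attaching the Manifest.toml of the active environment.)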

@CarpeNecopinum
Author

CarpeNecopinum commented Nov 14, 2018

Here is the manifest of the environment I tested this in.

(Created by ] add CuArrays Flux in a new environment)

For some reason, the @time macro is misleading here: it displays timings similar to yours above, but the first call blocks further computation (and the REPL) for much longer. That's why I put the timings in comments rather than using @time in the opening post.
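
A rough way to make @time reflect the real cost (assuming the extra time is asynchronous GPU work that finishes after @time returns) is to force the result back to the host inside the timed block:

julia> @time begin
           y, back = Tracker.forward((x,y)->x*y, m1, m2)
           Array(Tracker.data(y))  # copying to the CPU blocks until the GPU result is ready
       end;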

@MikeInnes
Member

You could try:

julia> CuArrays.@time CuArrays.@sync Tracker.forward((x,y)->x*y, m1, m2);
  0.011956 seconds (2.29 k CPU allocations: 111.901 KiB) (1 GPU allocation: 36 bytes, 0.65% gc time of which 100.00% spent allocating)

That should capture the full computation time (if it doesn't, this is a very strange issue). Assuming it does, post the stats here and they'll show whether it's a memory issue. Beyond that, it might be good to try running under the profiling tools (both Julia's profiler and CUDAnative/nvprof).
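
For the Julia side, a minimal sketch with the standard Profile stdlib would be (the GPU side can be captured separately, e.g. by launching the script under nvprof):

using Profile

Tracker.forward((x,y)->x*y, m1, m2)            # warm up so compilation isn't profiled
@profile Tracker.forward((x,y)->x*y, m1, m2)   # profile the second call
Profile.print()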
