GPU Matrices
BIDMat currently has GPU versions of FMat, IMat and SMat, which are respectively GMat, GIMat and GSMat. Most operators and matrix functions defined for CPU matrices also work on GPU matrices, and these operations are also defined on the Mat superclass. This allows generic code to be written and run on either the CPU host or the GPU, and supports both sparse and dense data.
For example, the inner loop of LDA (Latent Dirichlet Allocation) looks like this:
def uupdate(sdata:Mat, user:Mat, ipass:Int):Unit = {
  for (i <- 0 until opts.uiter) {
    val preds = DDS(mm, user, sdata)
    val dc = sdata.contents
    val pc = preds.contents
    max(opts.weps, pc, pc)
    pc ~ dc / pc
    val unew = user ∘ (mm * preds) + opts.alpha
    if (opts.exppsi) exppsi(unew, unew)
    user <-- unew
  }
}
The matrix sdata can be either sparse or dense, and CPU- or GPU-based. DDS returns the product of mm and user at the non-zeros of sdata, which for a dense sdata is just the full product.
You can convert to GPU matrices with constructors for each type, e.g. GMat(a) constructs a GMat from an FMat source (and returns its argument if a is already a GMat). Similarly, GIMat(mi) and GSMat(s) construct GIMats and GSMats from IMat and SMat arguments respectively. GPU matrices have the same toString output as the corresponding CPU matrices, so they look the same when returned from interactive commands, e.g.:
> val a = rand(4,6)
a: BIDMat.FMat =
   0.67636    0.15675    0.43748   0.081511    0.46293   0.097704
   0.31279    0.69485    0.91233    0.87120    0.12652    0.71330
   0.10547    0.88596    0.58793    0.90858    0.45308    0.45136
   0.83065    0.84817   0.080891   0.022294    0.73676    0.14168

> GMat(a)
res12: BIDMat.GMat =
   0.67636    0.15675    0.43748   0.081511    0.46293   0.097704
   0.31279    0.69485    0.91233    0.87120    0.12652    0.71330
   0.10547    0.88596    0.58793    0.90858    0.45308    0.45136
   0.83065    0.84817   0.080891   0.022294    0.73676    0.14168
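The other constructors follow the same pattern. Here is a short sketch; it assumes (beyond what is shown above) that the CPU-side constructors FMat(g) and IMat(f) perform the reverse copy and the float-to-int conversion respectively, which is the usual BIDMat convention but may vary by version:

> val mi = IMat(10 * rand(4,6))     // an integer matrix on the CPU
> val gi = GIMat(mi)                // copy it to the GPU
> val s  = sprand(100, 100, 0.05)   // a sparse CPU matrix
> val gs = GSMat(s)                 // copy it to the GPU
> val a2 = FMat(GMat(a))            // round trip: a2 is a CPU copy with the same values as a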
You can access elements of GPU matrices with indices, e.g. a(0,0), but element access is normally only useful for debugging or interactive exploration. Pulling single elements across the CPU/GPU boundary is very expensive, and will normally nullify any benefit from computing on the GPU.
GIMats support block indexing, e.g. for integer (GIMat) matrices ii and jj, you can access a block of a GMat aa as
aa(ii,jj)
or set the contents of a GMat from a GMat right-hand side bb as:
aa(ii,jj) = bb
Block indexing can be combined with integer arguments for single row or column access:
aa(5,jj)
or
aa(ii,0) = bb(?,2)
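Putting the pieces together, here is a minimal sketch of block indexing; the sizes and index values are arbitrary, and the indices are placed on the GPU since GPU operations normally require all arguments in GPU memory (see below):

> val aa = GMat(rand(8,8))          // dense GPU matrix
> val bb = GMat(rand(3,8))          // GPU right-hand side
> val ii = GIMat(irow(0,2,5))       // GPU row indices (irow builds an IMat row vector)
> val jj = GIMat(irow(1,3,4,6))     // GPU column indices
> val block = aa(ii,jj)             // 3x4 sub-block of aa
> aa(5,jj) = bb(0,jj)               // integer row index combined with block column indices
> aa(ii,0) = bb(?,2)                // ? selects all rows of column 2 of bb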
Calculations on the GPU have to work on GPU memory to avoid transit across the PCI bus. Operators, functions and block access on GPU matrices normally require all arguments to be in GPU memory. It would be possible to automatically transfer data between CPU and GPU for mixed-locus operations, but this is not implemented in the library, for a few reasons. First, it's not clear which location to prefer for mixed two-argument operations: the GPU is typically faster, but transiting the bus would remove that benefit much of the time, while the CPU has much more memory (and a garbage collector), so memory problems are much easier to avoid by computing on the CPU host. Second, many operations need to allocate storage for the returned matrix, and the same trade-offs recur. Third, when used with BIDMach, the Learner class normally sites calculations either entirely on the GPU or entirely on the CPU, so mixed-locus operations usually indicate an error. By omitting implicit cast operators that move arguments between CPU and GPU, we force exceptions on mixed-locus operations, which helps identify and fix such problems earlier.
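In practice this means you convert explicitly at the CPU/GPU boundary. A small sketch of the intended pattern (the variable names are illustrative only):

> val cpuData = rand(1000, 100)     // data prepared on the CPU host
> val gmodel  = grand(100, 20)      // model already resident on the GPU
// cpuData * gmodel would mix CPU and GPU arguments and fail rather than copy implicitly.
// Convert explicitly so the whole product runs in GPU memory:
> val gdata = GMat(cpuData)
> val gpred = gdata * gmodel        // 1000 x 20, computed entirely on the GPU
> val pred  = FMat(gpred)           // copy the result back to the host only when needed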
When you create a GMat, GIMat or GSMat (say from an FMat a with GMat(a)), storage is automatically allocated on the GPU. There is no garbage collector in the CUDA runtime, so that storage must be explicitly freed, or the GPU and runtime must be reset. To free a GPU matrix mat, do:
mat.free
There is no checking for other references to that matrix, and those references will lead to errors if used. But this approach is quite handy for, e.g., functions that copy some matrices from CPU to GPU, process them, and copy the results back. The GPU matrices allocated for that calculation can safely be freed after use. This approach is used in several sorting routines. In any case, you are responsible for checking for dangling references if you use free.
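For example, here is a hypothetical helper that follows that copy-in / compute / copy-out pattern; the GPU temporaries are private to the function, so freeing them before returning is safe:

// Multiply two CPU matrices on the GPU and return a CPU result.
def gpuProduct(a:FMat, b:FMat):FMat = {
  val ga = GMat(a)             // copy inputs to the GPU
  val gb = GMat(b)
  val gc = ga * gb             // all work happens in GPU memory
  val c  = FMat(gc)            // copy the result back to the host
  ga.free; gb.free; gc.free    // no other references exist, so free is safe here
  c
}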
The second way to clear all GPU matrices on a particular GPU is to call resetGPU. If you do this and have been using caching (next section), then the cache will probably contain references to GPU matrices that are no longer valid. Therefore you should clear the caches after a GPU reset. Do this with:
> resetGPU
> Mat.clearCaches
You can also call resetGPUs to reset all available GPUs.
Explicit storage management is very tricky for complex calculations. Luckily, a simple caching scheme works very well for storage re-use in minibatch algorithms. Matrix caching applies to both CPU and GPU matrices, but it was developed specifically for GPUs, whose C++ runtime lacks memory management. Caching is described in detail in the next chapter, but there are a couple of subtleties involved in using it without "leaking" uncached matrices, which we review here.
When evaluating operations with constants like a+2, the cache key is built from the GUID of a and the value 2. This is to allow the use of different constant sums like a+3, a+4 etc. without aliasing. But the consequence is that iterative calculations with changing constants will not be cached, e.g. in a moving average expression:
alpha = 1f/n
newB = alpha * A + (1 - alpha) * B
alpha varies from one iteration to the next, and each iteration will produce different cached copies. One solution is to use a 1x1 matrix container for alpha. The locus of this container (CPU or GPU) should match that of A and B. The value can be updated using the set method:
alpha.set(1f/n)
and then alpha * A and (1-alpha) will both use the same cached container from one iteration to the next.
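A sketch of that pattern for the moving average above; it assumes A.zeros(1,1) allocates a 1x1 matrix with the same type and locus as A (the usual BIDMat idiom), and nbatches stands in for the real minibatch loop:

val alpha = A.zeros(1,1)                  // 1x1 container, same locus (CPU or GPU) as A and B
for (n <- 1 to nbatches) {
  alpha.set(1f/n)                         // update in place; alpha keeps the same GUID every iteration
  val newB = alpha * A + (1 - alpha) * B  // cache keys are stable, so the temporaries are reused
  B <-- newB                              // copy into B's existing storage
}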
Not every binary operator that works for CPU matrices is implemented for GPU matrices. We are working to add more, but sometimes they are missing by choice. The workhorse of many gradient algorithms is dense-sparse matrix multiply, and dense-sparse multiply with a transposed data matrix, i.e.:
> val a:GMat = grand(4096,4096)
> val b:GSMat = GSMat(sprand(4096,4096,0.1))
> val x = a * b
> val y = a *^ b
Both of these operators are defined and implemented very efficiently. But the sparse-dense forms
> val u = b * a
> val v = b *^ a
are not. Algebraically, they are equivalent modulo transposes, and we could do:
> val u = (a.t * b.t).t
> val u = (a.t *^ b).t
and
> val v = (a * b.t).t
> val v = (a *^ b).t
The problem is that we don't have a fast way to implement
b * a
directly, and because the dense-sparse form has much more coherent memory accesses, it's always likely to be faster. Transpose is quite a fast operation, so the best way to implement these operations is with explicit transposes as suggested above.
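If you need the sparse-dense forms often, you can wrap the identities above in small helpers of your own (the names here are hypothetical; each call still pays for the transposed temporaries it creates, which is exactly the cost the library keeps explicit):

def spDenseMult(b:GSMat, a:GMat)  = (a.t * b.t).t   // computes b * a
def spDenseMultT(b:GSMat, a:GMat) = (a *^ b).t      // computes b *^ a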
We could have implemented the sparse-dense operators with explicit transposes internally, but that would require allocating temporary matrices somehow, which is a challenge on GPUs: the allocation itself can easily dominate the calculation. Relying on matrix caching is another option, but it's not a good one in this case, because the intermediate results are hidden from the user. A user who writes a series of apparently non-allocating operations (e.g. using the ~ operator) without caching would then find that a significant amount of GPU memory was being allocated anyway. By using explicit transposes as above, it is always clear what temporary matrices are being created.
For now these operators are not included. This may change in the future if we find a better way to implement them.