some kind of ML library in CUDA
rough timeline:
- feedforward neural network
- manual gradient calculation for backward pass
- automatic gradient calculation
- add convolution expression
- convolutional neural network
...some intermediate steps, then eventually:
- transformer
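
sketch for the "manual gradient calculation" step above, assuming a single fully connected layer y = W x + b with W stored row-major (out x in) and one sample at a time. given the upstream gradient dy, the hand-derived gradients are dW = dy x^T, db = dy, dx = W^T dy. the kernel names, the LinearGrads struct and the linear_backward wrapper are all hypothetical, and error checking is left out to keep it short:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Parameter gradients: dW[i*in + j] = dy[i] * x[j], db[i] = dy[i].
// One thread per weight; the thread in column 0 also writes the bias gradient.
__global__ void linear_backward_params(const float* dy, const float* x,
                                       float* dW, float* db, int in, int out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= in * out) return;
    int i = idx / in;  // output index (row of W)
    int j = idx % in;  // input index (column of W)
    dW[idx] = dy[i] * x[j];
    if (j == 0) db[i] = dy[i];
}

// Input gradient: dx[j] = sum_i W[i*in + j] * dy[i]. One thread per input.
__global__ void linear_backward_input(const float* W, const float* dy,
                                      float* dx, int in, int out) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= in) return;
    float acc = 0.0f;
    for (int i = 0; i < out; ++i) acc += W[i * in + j] * dy[i];
    dx[j] = acc;
}

struct LinearGrads { std::vector<float> dW, db, dx; };

// Hypothetical host wrapper: copies inputs to the device, launches both
// kernels, copies the gradients back. No error checking (assumption: a
// working CUDA device is present).
LinearGrads linear_backward(const std::vector<float>& W,
                            const std::vector<float>& x,
                            const std::vector<float>& dy, int in, int out) {
    size_t wB = sizeof(float) * in * out;
    size_t iB = sizeof(float) * in;
    size_t oB = sizeof(float) * out;
    float *W_d, *x_d, *dy_d, *dW_d, *db_d, *dx_d;
    cudaMalloc(&W_d, wB);  cudaMalloc(&x_d, iB);  cudaMalloc(&dy_d, oB);
    cudaMalloc(&dW_d, wB); cudaMalloc(&db_d, oB); cudaMalloc(&dx_d, iB);
    cudaMemcpy(W_d, W.data(), wB, cudaMemcpyHostToDevice);
    cudaMemcpy(x_d, x.data(), iB, cudaMemcpyHostToDevice);
    cudaMemcpy(dy_d, dy.data(), oB, cudaMemcpyHostToDevice);

    const int threads = 256;
    linear_backward_params<<<(in * out + threads - 1) / threads, threads>>>(
        dy_d, x_d, dW_d, db_d, in, out);
    linear_backward_input<<<(in + threads - 1) / threads, threads>>>(
        W_d, dy_d, dx_d, in, out);

    LinearGrads g;
    g.dW.resize(in * out); g.db.resize(out); g.dx.resize(in);
    // cudaMemcpy on the default stream waits for the kernels above to finish.
    cudaMemcpy(g.dW.data(), dW_d, wB, cudaMemcpyDeviceToHost);
    cudaMemcpy(g.db.data(), db_d, oB, cudaMemcpyDeviceToHost);
    cudaMemcpy(g.dx.data(), dx_d, iB, cudaMemcpyDeviceToHost);

    cudaFree(W_d);  cudaFree(x_d);  cudaFree(dy_d);
    cudaFree(dW_d); cudaFree(db_d); cudaFree(dx_d);
    return g;
}
```

hand-written kernels like these could later serve as a reference to check the automatic gradient calculation against.
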
somewhere along the line:
- get close to cuBLAS performance on SGEMM
- decide how to benchmark / compare speed
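
one possible starting point for the two items above: a naive SGEMM kernel (one thread per output element, row-major, C = A * B) timed with CUDA events and converted to GFLOP/s. this is only a baseline sketch, nowhere near cuBLAS speed. the names sgemm_naive, time_sgemm_ms and gflops are hypothetical, the 16x16 block size is an arbitrary assumption, and error checking is omitted:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Naive SGEMM: C = A * B with A (M x K), B (K x N), C (M x N), all row-major.
__global__ void sgemm_naive(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k) acc += A[row * K + k] * B[k * N + col];
    C[row * N + col] = acc;
}

// Hypothetical helper: one warm-up launch, then `iters` timed launches.
// Returns the average time per launch in milliseconds and writes the
// result matrix into C.
float time_sgemm_ms(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int M, int N, int K, int iters) {
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * M * K);
    cudaMalloc(&dB, sizeof(float) * K * N);
    cudaMalloc(&dC, sizeof(float) * M * N);
    cudaMemcpy(dA, A.data(), sizeof(float) * M * K, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), sizeof(float) * K * N, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    sgemm_naive<<<grid, block>>>(dA, dB, dC, M, N, K);  // warm-up, not timed

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int it = 0; it < iters; ++it)
        sgemm_naive<<<grid, block>>>(dA, dB, dC, M, N, K);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches are done
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    C.resize(static_cast<size_t>(M) * N);
    cudaMemcpy(C.data(), dC, sizeof(float) * M * N, cudaMemcpyDeviceToHost);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return ms / iters;
}

// SGEMM does 2*M*N*K floating-point operations (one multiply + one add
// per inner-loop step), so GFLOP/s = 2*M*N*K / (ms * 1e6).
double gflops(int M, int N, int K, float ms) {
    return 2.0 * M * N * K / (ms * 1e6);
}
```

timing a cuBLAS SGEMM call with the same event pattern and the same matrix sizes would give a like-for-like number to compare against.
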
task for initial steps: MNIST dataset
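
for loading MNIST, a minimal host-side sketch of parsing the IDX image-file header, assuming the standard uncompressed IDX files (big-endian 32-bit fields: magic 0x00000803, image count, rows, cols, followed by one unsigned byte per pixel). the struct and function names are hypothetical:

```cuda
#include <cstddef>
#include <cstdint>

struct IdxImageHeader {
    uint32_t count, rows, cols;
    bool ok;  // false if the buffer is too short or the magic does not match
};

// IDX integers are stored big-endian (most significant byte first).
inline uint32_t read_be32(const unsigned char* p) {
    return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
           (uint32_t(p[2]) << 8)  |  uint32_t(p[3]);
}

// Parses the 16-byte header of an IDX image file held in memory.
IdxImageHeader parse_idx_image_header(const unsigned char* buf, size_t len) {
    IdxImageHeader h = {0, 0, 0, false};
    if (len < 16) return h;                     // header is 4 x 32-bit fields
    if (read_be32(buf) != 0x00000803u) return h; // unsigned-byte data, 3 dims
    h.count = read_be32(buf + 4);
    h.rows  = read_be32(buf + 8);
    h.cols  = read_be32(buf + 12);
    h.ok = true;
    return h;
}
```

the pixel data that follows the header (count * rows * cols bytes) could then be converted to float and copied to the device in one cudaMemcpy.
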