LeiWang1999/Stream-k.tvm

This repository reproduces the PPoPP'23 paper "Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU" with TVM TIR and TL. The technique could help us optimize performance for small matrix shapes.
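To illustrate the work-centric idea, here is a minimal pure-Python sketch of how the MAC-loop iterations of a tiled GEMM might be split evenly across a fixed number of thread blocks. It is not code from this repository: the function name `stream_k_partition`, its parameters, and the way the remainder is distributed are assumptions made for illustration, and the paper and the TIR/TL kernels here may differ in detail.

```python
import math


def stream_k_partition(m, n, k, block_m, block_n, block_k, num_ctas):
    """Hypothetical helper: assign each CTA a contiguous range of MAC-loop iterations.

    Returns one list per CTA; each entry is (tile_index, k_iter_begin, k_iter_end),
    where the k-iteration bounds are local to that output tile.
    """
    tiles = math.ceil(m / block_m) * math.ceil(n / block_n)
    iters_per_tile = math.ceil(k / block_k)
    total_iters = tiles * iters_per_tile

    # Assumption: spread the remainder over the first `extra` CTAs.
    base, extra = divmod(total_iters, num_ctas)

    schedule = []
    start = 0
    for cta in range(num_ctas):
        end = start + base + (1 if cta < extra else 0)
        segments = []
        it = start
        while it < end:
            tile = it // iters_per_tile
            tile_first = tile * iters_per_tile
            # A CTA's range may cross tile boundaries, so cut it per tile.
            seg_end = min(end, tile_first + iters_per_tile)
            segments.append((tile, it - tile_first, seg_end - tile_first))
            it = seg_end
        schedule.append(segments)
        start = end
    return schedule


if __name__ == "__main__":
    # 256x256x512 GEMM with 128x128x32 blocks on 3 CTAs (illustrative numbers).
    for cta, segments in enumerate(stream_k_partition(256, 256, 512, 128, 128, 32, 3)):
        print(cta, segments)
```

A tile whose k-range is covered by more than one CTA receives partial sums from each of them, so those partial results have to be combined, for example with an atomic add into the output.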

example

Dependencies:

pip install git+https://github.com/microsoft/BitBLAS.git

TODO Items:

  • Implement Float16 Tensor Core.
  • With Vectorized AtomicAdd.
  • Implement Auto Tune.
  • With BlockReduction.
  • With Ladder Layout Propagation.
  • Apply Padding when the tile shape is not aligned for Swizzling.
  • Implement Q4.
  • Implement with BitBLAS Fast Dequantize.
  • Enhance TL Thread Level mma Abstraction.
  • Implement Dequantize template and integrate with BitBLAS.