## Notes

I do not agree with the argument made in https://semianalysis.com/2024/12/22/mi300x-vs-h100-vs-h200-benchmark-part-1-training/#popular-gemm-benchmark-isnt-accurate that the cache should be flushed prior to each iteration. In actual models, the input tensor is the output of the previous layer, and it is reasonable to expect it to already be in the cache. As for the weight tensors, they can be prefetched since their values are known before the computation occurs.

Because of this, the throughput measurements for both NVIDIA GPUs and Gaudi HPUs are made without flushing the cache.

Note that for small matrices below 4K×4K, the Python host overhead becomes visible relative to the GEMM itself. Because of this, the throughput measurements run repeated GEMMs inside a single function, as sketched below.
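A minimal sketch of this measurement pattern, assuming PyTorch on a CUDA device (the function name and iteration count are illustrative, not the repository's actual benchmark code; the Gaudi path would use the HPU backend instead of CUDA events):

```python
import torch

def bench_bf16_gemm(M: int, K: int, N: int, iters: int = 100) -> float:
    """Time `iters` back-to-back BF16 GEMMs in one call and return mean TFLOPS."""
    a = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(K, N, device="cuda", dtype=torch.bfloat16)

    # Warm up so kernel selection is done and the inputs are resident in cache.
    for _ in range(10):
        torch.matmul(a, b)

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    # Repeated GEMMs inside one function so Python host overhead is amortized;
    # the cache is intentionally not flushed between iterations.
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()

    mean_seconds = start.elapsed_time(end) / 1e3 / iters  # elapsed_time is in ms
    return 2 * M * K * N / mean_seconds / 1e12  # 2*M*K*N FLOPs per GEMM
```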

## Results for BF16 matrix multiplication

| M | K | N | Device | Mean TFLOPS | Peak TFLOPS | MFU |
|---|---|---|--------|-------------|-------------|-----|
| 16384 | 8192 | 1280 | Gaudi v2 | 395.4 | 432 | 91.5% |
| | | | A100 | 264.9 | 312 | 84.9% |
| | | | H100 | 754.0 | 989.4 | 76.2% |
| 16384 | 1024 | 8192 | Gaudi v2 | 417.9 | 432 | 96.7% |
| | | | A100 | 235.9 | 312 | 75.6% |
| | | | H100 | 657.1 | 989.4 | 66.4% |
| 16384 | 8192 | 7168 | Gaudi v2 | 425.2 | 432 | 98.4% |
| | | | A100 | 261.7 | 312 | 83.9% |
| | | | H100 | 700.7 | 989.4 | 70.8% |
| 16384 | 3584 | 8192 | Gaudi v2 | 422.9 | 432 | 97.9% |
| | | | A100 | 257.1 | 312 | 82.4% |
| | | | H100 | 665.5 | 989.4 | 67.3% |
| 4096 | 4096 | 4096 | Gaudi v2 | 408.2 | 432 | 94.5% |
| | | | A100 | 252.4 | 312 | 80.9% |
| | | | H100 | 748.4 | 989.4 | 75.7% |
| 8192 | 8192 | 8192 | Gaudi v2 | 423.9 | 432 | 98.1% |
| | | | A100 | 263.6 | 312 | 84.5% |
| | | | H100 | 693.4 | 989.4 | 70.1% |
| 16384 | 16384 | 16384 | Gaudi v2 | 379.8 | 432 | 87.9% |
| | | | A100 | 267.1 | 312 | 85.6% |
| | | | H100 | 688.9 | 989.4 | 69.7% |