I do not agree with the argument made in https://semianalysis.com/2024/12/22/mi300x-vs-h100-vs-h200-benchmark-part-1-training/#popular-gemm-benchmark-isnt-accurate that the cache should be flushed before each iteration. In an actual model, the input tensor is the output of the previous layer, so it is reasonable to expect it to already be in the cache. As for the weight tensors, they can be prefetched because their values are known before the computation begins.
For this reason, the throughput measurements for both NVIDIA GPUs and Gaudi HPUs are taken without flushing the cache.
Note that for small matrices below 4Kx4K, Python host overhead becomes visible relative to the GEMM itself.
To keep this overhead out of the measurement, the benchmark performs many GEMMs' worth of work inside a single function call.
The implementation is slightly suboptimal: ideally it would compute einsum("bmk,bnk->bmn"),
but the code computes einsum("bmk,bnk->mn"), which also reduces over the batch dimension.
This was the best that could be done without complicating the code too much; it may be fixed later.
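For reference, here is a minimal sketch of this kind of measurement, assuming PyTorch on a CUDA device; the function name, shapes, and warm-up are illustrative rather than the actual benchmark code:

```python
import torch


def measure_tflops(b: int, m: int, k: int, n: int) -> float:
    """Time a single einsum call that performs b GEMMs' worth of work."""
    x = torch.randn(b, m, k, device="cuda", dtype=torch.bfloat16)
    w = torch.randn(b, n, k, device="cuda", dtype=torch.bfloat16)

    # Warm-up so one-time setup costs do not pollute the measurement.
    torch.einsum("bmk,bnk->mn", x, w)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    # The batch dimension is reduced away, so the output stays (m, n).
    # The FLOP count is still 2*b*m*n*k, but this is not exactly the same
    # workload as b independent (m, k) x (k, n) GEMMs, hence "suboptimal".
    torch.einsum("bmk,bnk->mn", x, w)
    end.record()
    torch.cuda.synchronize()

    seconds = start.elapsed_time(end) / 1e3  # elapsed_time is in milliseconds
    return 2 * b * m * n * k / seconds / 1e12
```

Because all of the work is launched from one call and only device events are timed, the Python host overhead is paid once per measurement instead of once per GEMM; the ideal einsum("bmk,bnk->bmn") would be an ordinary batched matmul, at the cost of materializing a (b, m, n) output tensor.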

M | K | N | Scaling Granularity | Device | Peak TFLOPS | TFLOPS | MFU (%) |
---|---|---|---|---|---|---|---|
16384 | 8192 | 1280 | Per Tensor | Gaudi v2 | 865 | 648.6 | 75.0% |
16384 | 8192 | 1280 | Per Tensor | H100 | 1978.9 | 1532.3 | 77.4% |
16384 | 8192 | 1280 | Rowwise | Gaudi v2 | 865 | 687.0 | 79.4% |
16384 | 8192 | 1280 | Rowwise | H100 | 1978.9 | 1258.3 | 63.6% |
16384 | 1024 | 8192 | Per Tensor | Gaudi v2 | 865 | 787.9 | 91.1% |
16384 | 1024 | 8192 | Per Tensor | H100 | 1978.9 | 1190.6 | 60.2% |
16384 | 1024 | 8192 | Rowwise | Gaudi v2 | 865 | 697.4 | 80.6% |
16384 | 1024 | 8192 | Rowwise | H100 | 1978.9 | 706.0 | 35.7% |
16384 | 8192 | 7168 | Per Tensor | Gaudi v2 | 865 | 799.7 | 92.5% |
16384 | 8192 | 7168 | Per Tensor | H100 | 1978.9 | 1361.9 | 68.8% |
16384 | 8192 | 7168 | Rowwise | Gaudi v2 | 865 | 799.4 | 92.4% |
16384 | 8192 | 7168 | Rowwise | H100 | 1978.9 | 1185.6 | 59.9% |
16384 | 3584 | 8192 | Per Tensor | Gaudi v2 | 865 | 801.4 | 92.6% |
16384 | 3584 | 8192 | Per Tensor | H100 | 1978.9 | 1321.6 | 66.8% |
16384 | 3584 | 8192 | Rowwise | Gaudi v2 | 865 | 790.3 | 91.4% |
16384 | 3584 | 8192 | Rowwise | H100 | 1978.9 | 1259.7 | 63.7% |
4096 | 4096 | 4096 | Per Tensor | Gaudi v2 | 865 | 751.7 | 86.9% |
4096 | 4096 | 4096 | Per Tensor | H100 | 1978.9 | 1312.4 | 66.3% |
4096 | 4096 | 4096 | Rowwise | Gaudi v2 | 865 | 730.8 | 84.5% |
4096 | 4096 | 4096 | Rowwise | H100 | 1978.9 | 1377.7 | 69.6% |
8192 | 8192 | 8192 | Per Tensor | Gaudi v2 | 865 | 811.0 | 93.8% |
8192 | 8192 | 8192 | Per Tensor | H100 | 1978.9 | 1363.6 | 68.9% |
8192 | 8192 | 8192 | Rowwise | Gaudi v2 | 865 | 684.7 | 79.2% |
8192 | 8192 | 8192 | Rowwise | H100 | 1978.9 | 1179.2 | 59.6% |
16384 | 16384 | 16384 | Per Tensor | Gaudi v2 | 865 | 842.7 | 97.4% |
16384 | 16384 | 16384 | Per Tensor | H100 | 1978.9 | 1342.6 | 67.8% |
16384 | 16384 | 16384 | Rowwise | Gaudi v2 | 865 | 820.4 | 94.8% |
16384 | 16384 | 16384 | Rowwise | H100 | 1978.9 | 1122.0 | 56.7% |
32768 | 32768 | 32768 | Per Tensor | Gaudi v2 | 865 | 739.7 | 85.5% |
32768 | 32768 | 32768 | Per Tensor | H100 | 1978.9 | 1337.9 | 67.6% |
32768 | 32768 | 32768 | Rowwise | Gaudi v2 | 865 | 708.2 | 81.9% |
32768 | 32768 | 32768 | Rowwise | H100 | 1978.9 | 1024.5 | 51.8% |

M | K | N | Device | Mean TFLOPS | Peak TFLOPS | MFU (%) |
---|---|---|---|---|---|---|
16384 | 8192 | 1280 | Gaudi v2 | 395.4 | 432 | 91.5% |
16384 | 8192 | 1280 | A100 | 264.9 | 312 | 84.9% |
16384 | 8192 | 1280 | H100 | 754.0 | 989.4 | 76.2% |
16384 | 1024 | 8192 | Gaudi v2 | 417.9 | 432 | 96.7% |
16384 | 1024 | 8192 | A100 | 235.9 | 312 | 75.6% |
16384 | 1024 | 8192 | H100 | 657.1 | 989.4 | 66.4% |
16384 | 8192 | 7168 | Gaudi v2 | 425.2 | 432 | 98.4% |
16384 | 8192 | 7168 | A100 | 261.7 | 312 | 83.9% |
16384 | 8192 | 7168 | H100 | 700.7 | 989.4 | 70.8% |
16384 | 3584 | 8192 | Gaudi v2 | 422.9 | 432 | 97.9% |
16384 | 3584 | 8192 | A100 | 257.1 | 312 | 82.4% |
16384 | 3584 | 8192 | H100 | 665.5 | 989.4 | 67.3% |
4096 | 4096 | 4096 | Gaudi v2 | 408.2 | 432 | 94.5% |
4096 | 4096 | 4096 | A100 | 252.4 | 312 | 80.9% |
4096 | 4096 | 4096 | H100 | 748.4 | 989.4 | 75.7% |
8192 | 8192 | 8192 | Gaudi v2 | 423.9 | 432 | 98.1% |
8192 | 8192 | 8192 | A100 | 263.6 | 312 | 84.5% |
8192 | 8192 | 8192 | H100 | 693.4 | 989.4 | 70.1% |
16384 | 16384 | 16384 | Gaudi v2 | 379.8 | 432 | 87.9% |
16384 | 16384 | 16384 | A100 | 267.1 | 312 | 85.6% |
16384 | 16384 | 16384 | H100 | 688.9 | 989.4 | 69.7% |
32768 | 32768 | 32768 | Gaudi v2 | 363.8 | 432 | 84.2% |
32768 | 32768 | 32768 | A100 | 250.2 | 312 | 80.2% |
32768 | 32768 | 32768 | H100 | 671.3 | 989.4 | 67.9% |