
Rocm warp size fix #5402

Merged
merged 9 commits into microsoft:master from rocm_warp_size_fix on May 17, 2024

Conversation

@rraminen rraminen (Contributor) commented Apr 11, 2024

This PR enables building the following extensions for AMD GPUs with a warp size of 32:

  • transformer_inference
  • quantizer
  • random_ltd

This PR works stand-alone for torch versions <= 2.0. For later versions, #5401 must also be merged in addition to this PR. (A sketch of the warp-size selection this fix revolves around is shown below.)
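For context, here is a minimal sketch of how a HIP kernel can pick up the hardware wavefront size at compile time instead of hard-coding it, which is the class of bug this PR addresses. It assumes the ROCm toolchain's `__AMDGCN_WAVEFRONT_SIZE` macro (32 on RDNA GPUs such as NAVI3x, 64 on CDNA GPUs such as MI200); the `WARP_SIZE` name and the reduction kernel are illustrative, not DeepSpeed's actual code.

```cpp
#include <hip/hip_runtime.h>

// Sketch only (HIP path): select the wavefront size at compile time
// rather than hard-coding it. __AMDGCN_WAVEFRONT_SIZE is provided by
// the ROCm toolchain: 32 on RDNA (NAVI3x), 64 on CDNA (MI200). The
// WARP_SIZE name is illustrative; a CUDA build would use 32 and
// __shfl_xor_sync instead.
#if defined(__AMDGCN_WAVEFRONT_SIZE)
#define WARP_SIZE __AMDGCN_WAVEFRONT_SIZE
#else
#define WARP_SIZE 64  // historical ROCm assumption that breaks on RDNA
#endif

// Butterfly (XOR-shuffle) sum across one wavefront. If WARP_SIZE does
// not match the hardware lane count, lanes exchange with out-of-range
// partners and the reduction silently produces wrong values: the
// failure mode behind the "before" test results below.
__device__ float wavefront_reduce_sum(float v)
{
    for (int offset = WARP_SIZE / 2; offset > 0; offset /= 2)
        v += __shfl_xor(v, offset, WARP_SIZE);
    return v;
}
```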

Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2) on NAVI3x:

transformer_inference:
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/transformer/inference

Before this PR:
===== 674 failed, 622 skipped, 8 warnings, 1728 errors in 69.37s (0:01:09) =====

After this PR:
========== 476 failed, 1062 passed, 1486 skipped, 8 warnings in 9.31s ==========

quantizer:
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/quantizer

Before this PR:
==== 244 failed, 8 warnings in 30.53s ====

After this PR:
====== 186 failed, 58 passed, 8 warnings in 8.89s ======

I could not find any random_ltd-related unit tests to run.

Fixes:
#4753
#5474
ROCm#68

cc: @jithunnair-amd

@rraminen rraminen marked this pull request as ready for review April 17, 2024 19:31
@rraminen rraminen force-pushed the rocm_warp_size_fix branch from 295b743 to afaee86 on April 29, 2024 18:27
@rraminen rraminen force-pushed the rocm_warp_size_fix branch from afaee86 to fb5ad02 on May 10, 2024 17:24
@rraminen rraminen force-pushed the rocm_warp_size_fix branch from fb5ad02 to c87c4f3 on May 14, 2024 15:01
github-merge-queue bot pushed a commit that referenced this pull request May 17, 2024
Fixes #4989

In addition to this PR, the changes listed below are required to build
the following extensions successfully:
- transformer_inference
- quantizer
- random_ltd

Required companion changes:
- pytorch/pytorch#121030
- #5402

Note that not all unit tests for these extensions pass with this PR;
detailed results are below. These unit tests are skipped in CI anyway,
so they will not break the CI.


Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2) on MI200:

**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/transformer/inference

Before this PR:
===== 674 failed, 622 skipped, 8 warnings, 1728 errors in 123.66s (0:02:03) =====

After this PR:
========== 555 failed, 983 passed, 1486 skipped, 8 warnings in 14.35s ==========

**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/quantizer

Before this PR:
==== 244 failed, 8 warnings in 48.02s ====

After this PR:
===== 187 failed, 57 passed, 8 warnings in 14.74s =====

I could not find any random_ltd-related unit tests to run.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
@loadams loadams added this pull request to the merge queue May 17, 2024
Merged via the queue into microsoft:master with commit 76c9c69 May 17, 2024
13 checks passed
sfc-gh-reyazda pushed a commit to Snowflake-Labs/DeepSpeed that referenced this pull request Jun 10, 2024
sfc-gh-reyazda pushed a commit to Snowflake-Labs/DeepSpeed that referenced this pull request Jun 10, 2024
github-merge-queue bot pushed a commit that referenced this pull request Nov 4, 2024
When launching the apply_rotary_pos_half kernel, only a threads_per_head
of 64 is supported for a wavefront size of 64. This change adds support
for threads_per_head < 64, such as 4, 8, and 16 (a sketch of the
sub-wavefront pattern follows this commit note).

Fixes the issue introduced in #5402

---------

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
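For illustration, a minimal sketch of the sub-wavefront reduction the commit above describes, assuming HIP's width-limited `__shfl_xor`; `head_group_sum` is a hypothetical name, simplified from the actual apply_rotary_pos_half kernel.

```cpp
#include <hip/hip_runtime.h>

// Sketch only: reduce within groups of threads_per_head lanes, where
// threads_per_head (e.g. 4, 8, or 16) may be smaller than the 64-lane
// wavefront. The width argument of __shfl_xor confines the butterfly
// exchange to each sub-group, so the reduction no longer requires
// threads_per_head == wavefront size.
template <int threads_per_head>
__device__ float head_group_sum(float v)
{
    static_assert((threads_per_head & (threads_per_head - 1)) == 0,
                  "threads_per_head must be a power of two");
    for (int offset = threads_per_head / 2; offset > 0; offset /= 2)
        v += __shfl_xor(v, offset, threads_per_head);
    return v;  // every lane in the sub-group now holds its group's sum
}
```

With a 64-lane wavefront and threads_per_head = 16, each of the four 16-lane sub-groups reduces independently; with threads_per_head = 64 the behavior matches the original kernel.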