[CUDA] Fix MultiHeadAttention thread safe and bias support #21498
Merged
Conversation
wangyems reviewed on Jul 25, 2024
cloudhan reviewed on Jul 26, 2024
tianleiwu force-pushed the tlwu/fix_dmmha_input_check branch from 98dea5e to 9d34245 on July 26, 2024 at 06:38
tianleiwu force-pushed the tlwu/fix_dmmha_input_check branch from d726987 to f4b85fe on July 26, 2024 at 07:38
tianleiwu force-pushed the tlwu/fix_dmmha_input_check branch from 322f6ec to bcb25bf on July 29, 2024 at 21:15
tianleiwu changed the title from "[CUDA] Fix DecoderMaskedMultiHeadAttention bias input check" to "[CUDA] Fix MultiHeadAttention thread safe and bias support" on Jul 29, 2024
tianleiwu force-pushed the tlwu/fix_dmmha_input_check branch from bcb25bf to cc97579 on July 30, 2024 at 00:02
wangyems previously approved these changes on Jul 30, 2024
wangyems approved these changes on Jul 31, 2024
prathikr pushed a commit that referenced this pull request on Aug 3, 2024
prathikr pushed a commit that referenced this pull request on Aug 5, 2024
Description
Issues Fixed
(1) TRT cross attention not thread safe. Core changes like this (commit 6fd7aba) are used to make it thread-safe:
* Add a once_flag to CumulatedSequenceLengthCache to make sure it is only initialized once, and make the cache read-only after initialization. Previously, the content was not read-only, so it could be changed by another thread and potentially cause a buffer overrun.
* The kernel initialization is not guarded (although the kernel-loading factory has a static mutex to guard multiple threads), so the mutable variables could be set by two different threads at the same time. Add a once_flag to avoid that.
This also requires some workspace computation changes, which is why I did not create a separate pull request. A minimal sketch of the initialize-once pattern follows this list.
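A minimal C++ sketch of that initialize-once pattern, assuming illustrative type and member names (not the actual ONNX Runtime definitions):
```
#include <mutex>
#include <vector>

// Hypothetical cache: initialized exactly once, then treated as read-only.
struct CumulatedSequenceLengthCacheSketch {
  std::once_flag init_flag;             // guards one-time initialization
  std::vector<int> cumulated_lengths;   // read-only after initialization

  // Safe to call concurrently: only the first caller fills the cache;
  // later callers just read it.
  const std::vector<int>& GetOrInit(int batch_size, int sequence_length) {
    std::call_once(init_flag, [&]() {
      cumulated_lengths.resize(batch_size + 1);
      for (int i = 0; i <= batch_size; ++i) {
        cumulated_lengths[i] = i * sequence_length;
      }
    });
    return cumulated_lengths;
  }
};
```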
(2) Bias for cross attention
That scenario assumes that only the query has bias, not the key and value. However, this assumption was not verified at runtime, there was no comment documenting it, and there was no test case, so support for the scenario was disabled by mistake. The scenario is actually used in the Whisper model (TODO: we shall add Whisper tests to the CI pipeline, and also update the fusion script to verify such assumptions if needed).
The CUDA/CPU kernels support bias for cross attention as long as the bias is zero for key and value. I updated the check to support this scenario and added comments wherever there is such an assumption.
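For illustration only, a hedged sketch of the kind of shape check this implies; the function and parameter names are hypothetical, not the actual CheckInputs code:
```
// Hypothetical check for cross attention with bias: the bias tensor covers
// Q, K and V, but only the Q part is expected to be non-zero. The K/V parts
// are assumed to be zero (documented, not verified element-wise).
bool IsCrossAttentionBiasShapeValid(bool has_bias, int64_t bias_length,
                                    int64_t hidden_size, int64_t v_hidden_size) {
  if (!has_bias) {
    return true;  // bias is optional
  }
  // Q bias (hidden_size) + K bias (hidden_size) + V bias (v_hidden_size).
  return bias_length == 2 * hidden_size + v_hidden_size;
}
```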
(3) Fallback support
Previously, the unfused kernel did not support packed QKV and packed KV formats, which means some cases could fail because there was no fallback. Example error messages:
packed QKV format is not implemented for current GPU. Please disable it in fusion options.
or
packed KV format is not implemented for current GPU. Please disable packed kv in fusion options.
I added new AddBiasTranspose CUDA kernels for these formats to support fallback, so that all supported cases will not fail. A sketch of what such an unpack-and-add-bias kernel does is shown below.
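For illustration, a simplified CUDA sketch of unpacking packed QKV while adding bias. The BxSxNx3xH packing, float element type, and all names here are assumptions for the sketch, not the actual AddBiasTranspose kernels:
```
// Hypothetical kernel: unpack a packed QKV buffer (B x S x N x 3 x H) into
// separate Q, K, V buffers (each B x S x N x H) while adding the bias
// (laid out as 3 x N x H). Launch with grid (N, S, B) and blockDim.x >= H.
__global__ void UnpackQkvAndAddBias(const float* packed_qkv, const float* bias,
                                    float* q, float* k, float* v,
                                    int num_heads, int head_size, int sequence_length) {
  int h = threadIdx.x;   // offset within a head
  int n = blockIdx.x;    // head index
  int s = blockIdx.y;    // sequence index
  int b = blockIdx.z;    // batch index
  if (h >= head_size) return;

  int64_t in_base = (((int64_t)b * sequence_length + s) * num_heads + n) * 3 * head_size + h;
  int64_t out_idx = (((int64_t)b * sequence_length + s) * num_heads + n) * head_size + h;
  int64_t bias_idx = (int64_t)n * head_size + h;
  int64_t bias_stride = (int64_t)num_heads * head_size;

  q[out_idx] = packed_qkv[in_base] + bias[bias_idx];
  k[out_idx] = packed_qkv[in_base + head_size] + bias[bias_idx + bias_stride];
  v[out_idx] = packed_qkv[in_base + 2 * head_size] + bias[bias_idx + 2 * bias_stride];
}
```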
Improvements
(4) QKV workspace size.
The logic for no_qkv_workspace could easily get out of sync because the related code was scattered across different source files. I refactored the code to move all related logic into one file (attention_prepare_qkv.cu) and added asserts, so that the logic stays in sync.
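A hedged sketch of this kind of consistency assert; the rule shown and all names are assumptions for illustration, not the actual attention_prepare_qkv.cu logic:
```
#include <cassert>

// Hypothetical helper: the decision of whether QKV needs a workspace is made
// in one place, and the QKV preparation path asserts against it.
inline bool NoQkvWorkspace(bool use_fused_kernel, bool has_bias, bool is_packed_qkv) {
  // Example rule (assumption for this sketch): a fused kernel consuming the
  // original packed input without bias needs no extra QKV workspace.
  return use_fused_kernel && is_packed_qkv && !has_bias;
}

inline void PrepareQkvSketch(bool use_fused_kernel, bool has_bias, bool is_packed_qkv,
                             const void* q_workspace) {
  // The preparation path must agree with the workspace-size computation.
  assert(NoQkvWorkspace(use_fused_kernel, has_bias, is_packed_qkv) == (q_workspace == nullptr));
}
```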
(5) Remove the confusing concept of pass past in kv
parameters.pass_past_in_kv is confusing since the k/v in cross attention are not past state. Remove it and use parameters.qkv_format == Q_K_V_BSNH_BNSH_BNSH instead.
The new code does not use past_key/past_value for cross attention, so the logic is clearer.
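A small sketch of the replacement check; only the Q_K_V_BSNH_BNSH_BNSH value comes from this description, the enum shape and helper name are illustrative:
```
// Illustrative enum; only Q_K_V_BSNH_BNSH_BNSH is taken from the description.
enum AttentionQkvFormat {
  Q_K_V_BSNH,            // Q, K, V all in BxSxNxH
  Q_K_V_BSNH_BNSH_BNSH,  // cross attention: Q in BxSxNxH, K/V in BxNxSxH
};

// Replaces the old parameters.pass_past_in_kv flag: cross-attention K/V are
// identified by their layout, not treated as "past" state.
bool IsCrossAttentionKv(AttentionQkvFormat qkv_format) {
  return qkv_format == Q_K_V_BSNH_BNSH_BNSH;
}
```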
(6) More coverage, less workspace and fewer transposes for flash and efficient attention
Previously, there was one condition under which flash or efficient attention did not run:
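```
bool past_no_bias = (pass_key_value_as_past || past_key != nullptr || present_key != nullptr) && bias == nullptr;
```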
After this change, we can use flash and efficient attention for that case, and also use less workspace.
For example, for cross attention with bias, the original code used two additional workspaces:
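```
transpose: past_key (BxNxSxH) => temp_k_workspace (BxSxNxH),
           past_value (BxNxSxH_v) => temp_v_workspace (BxSxNxH_v)
add bias:  query => q, temp_k_workspace => k, temp_v_workspace => v
```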
The new logic is like:
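```
if (has bias)
  Add bias to query, key, value, and store in q, k, v workspace
else
  Use query, key and value directly as q, k and v in kernel
```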
We can see that we no longer need to allocate temp_k_workspace and temp_v_workspace, so less memory is used. The new code also saves two transposes in this case.
Flash and efficient attention support BSNH or BNSH formats for k and v. In the old code, k/v were always converted to BSNH format, which is not always necessary. I changed the code to convert k/v to BSNH or BNSH case by case, so that more cases can be covered by flash or efficient attention to improve performance.
(7) Debugging support
Previously, there was little debug info. In this change, I add a flag for debug info in AttentionData so that we can output debug info during processing.
I also add functions to consolidate the dumping of inputs, QKV processing and outputs, and add an environment variable ORT_ENABLE_GPU_DUMP to allow disabling dumping from the CUDA kernel.
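As an illustration, a minimal sketch of reading such an environment variable; the exact flag handling in ONNX Runtime may differ:
```
#include <cstdlib>
#include <cstring>

// Hypothetical reader for the ORT_ENABLE_GPU_DUMP environment variable
// mentioned above. Assumption for this sketch: unset or "0" disables dumping.
inline bool IsGpuDumpEnabled() {
  const char* value = std::getenv("ORT_ENABLE_GPU_DUMP");
  return value != nullptr && std::strcmp(value, "0") != 0;
}
```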
Summary of changes
(1) Refactor CheckInputs and pass in the operator type.
(2) Refactor PrepareQKV to support fallback for packed QKV or packed KV inputs.
(3) Change a few cases in PrepareQKV so that more cases are covered by flash and efficient attention.
(4) Use parameters.qkv_format == Q_K_V_BSNH_BNSH_BNSH to replace parameters.pass_past_in_kv.
(5) Allow bias input for Q_K_V_BSNH_BNSH_BNSH, and add comments on the assumption that key/value have no bias in this case.
(6) Fix a thread-safety issue in CumulatedSequenceLengthCache handling.
(7) Add test cases to cover all supported scenarios.
Current supported scenarios for MultiHeadAttention on CUDA/CPU:

| Q | K | V | pastK | pastV | presentK | presentV | Bias | Op desc |
| ---- | ---- | ---- | ----- | ----- | -------- | -------- | ---- | ------- |
| BSNH | BLNH | BLNH | - | - | - | - | QKV | not packed |
| BLN3H | - | - | - | - | - | - | QKV | qkv packed (not supported on CPU) |
| BSNH | BLN2H | - | - | - | - | - | --- | kv packed (not supported on CPU) |
| BSNH | BNLH | BNLH | - | - | - | - | Q-- | cross attention (bias for Q only) |
| BSNH | BLNH | BLNH | - | - | BNTH | BNTH | QKV | no past, only present |
| BSNH | BLNH | BLNH | BNPH | BNPH | BNTH | BNTH | QKV | past and present (not share buffer) |
Motivation and Context
#18854