Fix illegal memory access with multi_tensor_apply size above INT_MAX #1825

Merged 1 commit into NVIDIA:master on Aug 17, 2024

Conversation

@gdb (Contributor) commented Aug 13, 2024

Currently, multi_tensor_apply causes an illegal memory access due to an overflow in the `sizes` field of `TensorListMetadata`. This can be reproduced using the following standalone script:

```python
import torch, amp_C
from apex.multi_tensor_apply import multi_tensor_applier

multi_tensor_adam = amp_C.multi_tensor_adam

size = 2**32 + 1
g_32 = [torch.zeros(size, dtype=torch.float32, device='cuda')]
p_32 = [torch.zeros(size, dtype=torch.float32, device='cuda')]
m_32 = [torch.zeros(size, dtype=torch.float32, device='cuda')]
v_32 = [torch.zeros(size, dtype=torch.float32, device='cuda')]
_dummy_overflow_buf = torch.zeros(1, dtype=torch.int32, device='cuda')

multi_tensor_applier(multi_tensor_adam, _dummy_overflow_buf, [g_32, p_32, m_32, v_32], 0.0, 0.9, 0.95, 1e-08, 1, 1, 1, 0.1)
print(g_32)
```
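
For context, a minimal standalone sketch (not part of the PR or of apex) of the narrowing that the repro above triggers: a numel of 2**32 + 1 does not fit in 32-bit bookkeeping such as an `int` sizes field, so the stored value no longer describes the tensor.

```cpp
// Standalone illustration (not from the PR) of why a numel above INT_MAX
// breaks 32-bit bookkeeping like the `sizes` field in TensorListMetadata.
#include <climits>
#include <cstdint>
#include <cstdio>

int main() {
  const int64_t numel = (int64_t{1} << 32) + 1;   // 2**32 + 1, as in the repro above
  const int as_int32  = static_cast<int>(numel);  // narrowing conversion to 32 bits

  std::printf("numel           = %lld\n", static_cast<long long>(numel));
  std::printf("stored as int32 = %d (INT_MAX = %d)\n", as_int32, INT_MAX);

  // The stored size no longer describes the real tensor, and 32-bit offset
  // arithmetic derived from such values can overflow, which surfaces as the
  // illegal memory access; widening the bookkeeping to int64_t avoids this.
  return 0;
}
```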
@awgu commented Aug 13, 2024

cc @crcrpar: are the following out of date?

```cpp
// TODO: Kernel arg size limit may be <4KB for some other cards (ie Jetson)
constexpr int depth_to_max_tensors[6] = {110, 64, 48, 36, 30, 24};
constexpr int depth_to_max_blocks[6] = {320, 320, 320, 320, 320, 320};
```

I see the same limits in PyTorch, where you already updated to use `int64_t` in pytorch/pytorch#101760. Otherwise, I would expect that changing to `int64_t` increases the `TensorListMetadata` struct size and hence the kernel arg size.

(Though it seems that CUDA 12.1 on Volta+ increased the kernel arg size limit from 4 KB to 32 KB.)

@crcrpar (Collaborator) commented Aug 17, 2024

> I would expect that changing to use int64_t increases the TensorListMetadata struct size and hence the kernel arg size.

Yes, but apex does not have a multi-tensor-apply with a list of scalars, so we might be able to dodge a tweak of `depth_to_max_tensors` and `depth_to_max_blocks`.
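
To put rough numbers on the kernel-argument concern, here is a back-of-the-envelope sketch (an illustration, not apex's actual header; the field layout is an assumption based on the limits quoted earlier) comparing the struct size with `int` versus `int64_t` per-tensor sizes.

```cpp
// Rough size check (illustration only): how much does widening the per-tensor
// size field grow the kernel-argument struct? Limits follow the snippet quoted
// above; field layout is assumed, and padding is whatever the compiler inserts.
#include <cstdint>
#include <cstdio>

constexpr int depth_to_max_tensors[6] = {110, 64, 48, 36, 30, 24};
constexpr int depth_to_max_blocks[6]  = {320, 320, 320, 320, 320, 320};

template <typename SizeT, int n>
struct TensorListMetadataSketch {
  void* addresses[n][depth_to_max_tensors[n - 1]];
  SizeT sizes[depth_to_max_tensors[n - 1]];
  unsigned char block_to_tensor[depth_to_max_blocks[n - 1]];
  int block_to_chunk[depth_to_max_blocks[n - 1]];
  int start_tensor_this_launch;
};

int main() {
  // multi_tensor_adam uses depth 4 (grads, params, exp_avg, exp_avg_sq).
  std::printf("depth 4, int sizes:     %zu bytes\n", sizeof(TensorListMetadataSketch<int, 4>));
  std::printf("depth 4, int64_t sizes: %zu bytes\n", sizeof(TensorListMetadataSketch<int64_t, 4>));
  // Depth 1 packs the most tensors (110), so widening grows it the most.
  std::printf("depth 1, int sizes:     %zu bytes\n", sizeof(TensorListMetadataSketch<int, 1>));
  std::printf("depth 1, int64_t sizes: %zu bytes\n", sizeof(TensorListMetadataSketch<int64_t, 1>));
  // Other kernel arguments (chunk_size, the noop-flag pointer, the functor's
  // scalars) also count toward the kernel-argument limit.
  return 0;
}
```

Under these assumptions, both variants stay below the historical 4 KB limit, which is consistent with leaving `depth_to_max_tensors` and `depth_to_max_blocks` untouched.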

@crcrpar (Collaborator) left a comment

excuse my delay, thank you

crcrpar merged commit 79e3dc4 into NVIDIA:master on Aug 17, 2024
@firoj0 commented Sep 10, 2024

C/C++ CI
