Fix illegal memory access with multi_tensor_apply size above INT_MAX #1825

Merged 1 commit into NVIDIA:master on Aug 17, 2024

Conversation

@gdb (Contributor) commented Aug 13, 2024

Currently, multi_tensor_apply causes an illegal memory access due to an overflow in the `sizes` field of `TensorListMetadata`. This can be reproduced using the following standalone script:

```python
import torch, amp_C
from apex.multi_tensor_apply import multi_tensor_applier

multi_tensor_adam = amp_C.multi_tensor_adam

size = 2**32 + 1
g_32 = [torch.zeros(size, dtype=torch.float32, device='cuda')]
p_32 = [torch.zeros(size, dtype=torch.float32, device='cuda')]
m_32 = [torch.zeros(size, dtype=torch.float32, device='cuda')]
v_32 = [torch.zeros(size, dtype=torch.float32, device='cuda')]
_dummy_overflow_buf = torch.zeros(1, dtype=torch.int32, device='cuda')

multi_tensor_applier(multi_tensor_adam, _dummy_overflow_buf, [g_32, p_32, m_32, v_32], 0.0, 0.9, 0.95, 1e-08, 1, 1, 1, 0.1)
print(g_32)
```
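
For context, a minimal standalone sketch (not part of the PR or of apex) of the narrowing that the repro above triggers: a numel of 2**32 + 1 does not fit in 32-bit bookkeeping such as an `int` sizes field, so the stored value no longer describes the tensor.

```cpp
// Standalone illustration (not from the PR) of why a numel above INT_MAX
// breaks 32-bit bookkeeping like the `sizes` field in TensorListMetadata.
#include <climits>
#include <cstdint>
#include <cstdio>

int main() {
  const int64_t numel = (int64_t{1} << 32) + 1;   // 2**32 + 1, as in the repro above
  const int as_int32  = static_cast<int>(numel);  // narrowing conversion to 32 bits

  std::printf("numel           = %lld\n", static_cast<long long>(numel));
  std::printf("stored as int32 = %d (INT_MAX = %d)\n", as_int32, INT_MAX);

  // The stored size no longer describes the real tensor, and 32-bit offset
  // arithmetic derived from such values can overflow, which surfaces as the
  // illegal memory access; widening the bookkeeping to int64_t avoids this.
  return 0;
}
```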
@awgu commented Aug 13, 2024

cc @crcrpar: are the following out of date?

```cpp
// TODO: Kernel arg size limit may be <4KB for some other cards (ie Jetson)
constexpr int depth_to_max_tensors[6] = {110, 64, 48, 36, 30, 24};
constexpr int depth_to_max_blocks[6] = {320, 320, 320, 320, 320, 320};
```

I see the same limits in PyTorch, where you already updated to use `int64_t` in pytorch/pytorch#101760. Otherwise, I would expect that changing to `int64_t` increases the `TensorListMetadata` struct size and hence the kernel arg size.

(Though it seems that CUDA 12.1 on Volta+ increased the kernel arg size limit from 4 KB to 32 KB.)

@crcrpar (Collaborator) commented Aug 17, 2024

> I would expect that changing to use int64_t increases the TensorListMetadata struct size and hence the kernel arg size.

Yes, but apex does not have a multi-tensor-apply with a list of scalars, so we might be able to dodge a tweak of `depth_to_max_tensors` and `depth_to_max_blocks`.
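
To put rough numbers on the kernel-argument concern, here is a back-of-the-envelope sketch (an illustration, not apex's actual header; the field layout is an assumption based on the limits quoted earlier) comparing the struct size with `int` versus `int64_t` per-tensor sizes.

```cpp
// Rough size check (illustration only): how much does widening the per-tensor
// size field grow the kernel-argument struct? Limits follow the snippet quoted
// above; field layout is assumed, and padding is whatever the compiler inserts.
#include <cstdint>
#include <cstdio>

constexpr int depth_to_max_tensors[6] = {110, 64, 48, 36, 30, 24};
constexpr int depth_to_max_blocks[6]  = {320, 320, 320, 320, 320, 320};

template <typename SizeT, int n>
struct TensorListMetadataSketch {
  void* addresses[n][depth_to_max_tensors[n - 1]];
  SizeT sizes[depth_to_max_tensors[n - 1]];
  unsigned char block_to_tensor[depth_to_max_blocks[n - 1]];
  int block_to_chunk[depth_to_max_blocks[n - 1]];
  int start_tensor_this_launch;
};

int main() {
  // multi_tensor_adam uses depth 4 (grads, params, exp_avg, exp_avg_sq).
  std::printf("depth 4, int sizes:     %zu bytes\n", sizeof(TensorListMetadataSketch<int, 4>));
  std::printf("depth 4, int64_t sizes: %zu bytes\n", sizeof(TensorListMetadataSketch<int64_t, 4>));
  // Depth 1 packs the most tensors (110), so widening grows it the most.
  std::printf("depth 1, int sizes:     %zu bytes\n", sizeof(TensorListMetadataSketch<int, 1>));
  std::printf("depth 1, int64_t sizes: %zu bytes\n", sizeof(TensorListMetadataSketch<int64_t, 1>));
  // Other kernel arguments (chunk_size, the noop-flag pointer, the functor's
  // scalars) also count toward the kernel-argument limit.
  return 0;
}
```

Under these assumptions, both variants stay below the historical 4 KB limit, which is consistent with leaving `depth_to_max_tensors` and `depth_to_max_blocks` untouched.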

@crcrpar (Collaborator) left a comment

excuse my delay, thank you

crcrpar merged commit 79e3dc4 into NVIDIA:master on Aug 17, 2024
@firoj0 commented Sep 10, 2024

C/C++ CI
