Description
Describe the issue
I am sharing below my `OrtCUDAProviderOptions`, which I use to set the GPU device used for computation on a server with multiple GPUs. When setting the `deviceId`, I encounter buggy memory allocations.
- For example, setting the following `OrtCUDAProviderOptions`:

  ```java
  OrtCUDAProviderOptions cudaOptions = new OrtCUDAProviderOptions(6);
  cudaOptions.add("cudnn_conv_algo_search", "DEFAULT");
  options.addCUDA(cudaOptions);
  ```

  results in:

  ```
  GPU0: 1545MiB / 24576MiB
  GPU6:    3MiB / 24576MiB
  ```

  It discards the `deviceId` being set to 6 and uses 0 instead.
- Whereas commenting out `cudaOptions.add("cudnn_conv_algo_search", "DEFAULT");`:

  ```java
  OrtCUDAProviderOptions cudaOptions = new OrtCUDAProviderOptions(6);
  // cudaOptions.add("cudnn_conv_algo_search", "DEFAULT");
  options.addCUDA(cudaOptions);
  ```

  results in:

  ```
  GPU0:  545MiB / 24576MiB (0% Util)
  GPU6: 1545MiB / 24576MiB
  ```

  Here the correct GPU is selected, but 545MiB is still allocated on device 0 without the device being utilized.
- Finally, keeping `cudaOptions.add("cudnn_conv_algo_search", "DEFAULT");` but adding `cudaOptions.add("device_id", String.valueOf(6));` to select the device instead of specifying it directly in the constructor:

  ```java
  OrtCUDAProviderOptions cudaOptions = new OrtCUDAProviderOptions();
  cudaOptions.add("cudnn_conv_algo_search", "DEFAULT"); // this has no effect anymore! The flag is not considered
  cudaOptions.add("device_id", String.valueOf(6));
  options.addCUDA(cudaOptions);
  ```

  results in:

  ```
  GPU0:  545MiB / 24576MiB (0% Util)
  GPU6: 1545MiB / 24576MiB
  ```

  which is the same as example 2.
To confirm whether `cudaOptions.add("cudnn_conv_algo_search", "DEFAULT");` is being executed or ignored in example 3, I ran some experiments; it turned out the option is no longer considered and is shadowed by `cudaOptions.add("device_id", String.valueOf(6));`, which is added afterwards.
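The shadowing described above behaves as if each `add` call replaced the entire native option set with just the newly supplied key. A minimal plain-Java model of that behaviour (an illustration for reasoning about the symptom, not the actual ORT code):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of last-write-wins option updates: every update
// clears the existing option struct and keeps only the keys passed
// in the current call.
final class UpdateModel {
    static final Map<String, String> struct = new HashMap<>();

    static void update(Map<String, String> newOpts) {
        struct.clear();          // previously set options are discarded
        struct.putAll(newOpts);  // only this call's options survive
    }
}
```

Under this model, applying one update per `add` call leaves only the last option (`device_id`) in effect, which matches example 3, where `cudnn_conv_algo_search` ends up ignored.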
There are two problems here:

- Using `cudaOptions.add("cudnn_conv_algo_search", "DEFAULT");` results in selecting the wrong device, in this case device 0, every time.
- Some initial memory is allocated on device 0, even though the correct `deviceId` has been used for the computation.
As a workaround, I am exporting only one visible CUDA device to avoid this problem.
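The workaround above can be expressed as a shell snippet (the launch command is a placeholder, not part of the report):

```shell
# Expose only physical GPU 6 to the process; ORT then sees a single
# device and must allocate on it. Inside the process that GPU is
# enumerated as device 0, so the options become OrtCUDAProviderOptions(0).
export CUDA_VISIBLE_DEVICES=6
# java -jar my-ort-app.jar   # placeholder launch command
```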
To reproduce
Unfortunately the model cannot be provided, but I can write a toy example + model and supply it if needed.
Urgency
No response
Platform
Linux
OS Version
Ubuntu 22.04.4 LTS
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
v1.17.3
ONNX Runtime API
Java
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
cuda: 11.2, cudnn: 8.1.1
Model File
No response
Is this a quantized model?
No
Activity
Craigacp commented on May 2, 2024
Ok, I think I understand what's going on there. I had expected the C API's `UpdateCUDAProviderOptions` function to be something I could use to append options to a CUDA options struct, but it looks like what it actually does is delete the old one and only set the options specified in the update call (https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/cuda/cuda_provider_factory.cc#L236). The Java code calls `UpdateCUDAProviderOptions` each time the `add` method is called, so the new option overwrites all the old ones. That's pretty annoying, but I can fix it on the Java side. I'll put a fix together next week.

hashJoe commented on May 2, 2024
Thanks for the clarification!
Regarding the 2nd problem, does the call to `UpdateCUDAProviderOptions` allocate some memory on device 0 (maybe for initialization) before setting the main device to 6 and proceeding with the main computation? And is it possible to pass multiple options at once to avoid calling the `add` method several times?

Craigacp commented on May 2, 2024
It is possible in the native code to pass multiple options at once, but not how I've written the Java binding to that native code. The Java object tracks all the options that are set, so I need to modify the `SessionOptions.addCUDA` call to call a new method on `OrtCUDAProviderOptions` which calls update once with the aggregated parameters before it's passed in.

WRT the memory allocation on GPU zero, that might be an artifact of how CUDA & ORT work; I think the primary GPU tends to end up with some driver & code related stuff in general, but someone with more CUDA expertise might be able to help there.
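The buffering fix described in this comment could look roughly like the sketch below (class and method names are hypothetical, not the actual onnxruntime Java source): options are only collected on the Java side, and a single aggregated set is handed to the native layer when the options object is attached to a `SessionOptions`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: add() only buffers key/value pairs; nothing
// reaches the native layer until the options are attached, at which
// point one update carries device_id AND every other option together.
final class BufferedCudaOptions {
    private final Map<String, String> pending = new LinkedHashMap<>();

    BufferedCudaOptions(int deviceId) {
        pending.put("device_id", String.valueOf(deviceId));
    }

    void add(String key, String value) {
        pending.put(key, value);   // buffered only — no native call here
    }

    // Invoked once (e.g. by addCUDA): the single native update now
    // sees the full, aggregated option set.
    Map<String, String> aggregated() {
        return new LinkedHashMap<>(pending);
    }
}
```

With this shape, the `device_id` from the constructor and later `add` calls such as `cudnn_conv_algo_search` can no longer overwrite each other.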
hashJoe commented on May 3, 2024
I did additional tests regarding the memory allocation problem on gpu 0, where I simply inserted code using the CUDA API before calling anything that is ORT related. The ORT code still allocates on `deviceId=6`, and the problem above is solved: nothing is allocated on gpu 0 anymore. However, the same memory size is still allocated on gpu 6 --> 1545MiB. I would have expected the memory now to be aggregated with what had been allocated on gpu 0, becoming maybe something like ~2000MiB, if it were some CUDA-driver-related allocation.

In other words, I assume there is also a bug where ORT starts allocating on gpu 0 for the main computation before detecting the `deviceId` set by the user?

hashJoe commented on May 3, 2024
Additionally, without the CUDA code: when gpu 0 is free, 545MiB is allocated on it; however, when it is occupied by another process and not much space is left, a smaller portion of memory (~100MiB) is allocated instead.
[java] CUDA & TensorRT options fix (#20549)