[BUG] Invalid CUDA ID error when loading Bounded variables across devices #2420

Closed
3 tasks done
cbhua opened this issue Sep 4, 2024 · 0 comments · Fixed by #2421
Labels
bug Something isn't working

Comments

cbhua (Contributor) commented Sep 4, 2024

Describe the bug

This issue reports an invalid CUDA ID error that occurs when transferring a Bounded variable between servers with different numbers of GPUs.

In the current implementation, changing the device of a Bounded variable executes:

low=self.space.low.to(dest),
high=self.space.high.to(dest),

This operation first moves self.space._low to the stored self.space.device before transferring it to the target device (dest):

@property
def low(self):
    return self._low.to(self.device)

@property
def high(self):
    return self._high.to(self.device)

This fails when a variable previously on cuda:7 (on an 8-GPU server) is loaded on a server with only one GPU, because the access incorrectly attempts to reach cuda:7.

The issue was identified when a model was trained and saved on a multi-GPU cluster and subsequently loaded on a local server equipped with fewer GPUs. The model’s saved state includes device information specific to the original multi-GPU environment. When attempting to assign the model to a device available on the current server, the discrepancy in device IDs between the environments leads to this bug.
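A minimal sketch (using a hypothetical Box class rather than the actual torchrl types) shows why the stale device survives the save/load round trip:

import torch

class Box:
    """Hypothetical stand-in for a spec that stores a tensor plus a device attribute."""
    def __init__(self):
        self._low = torch.zeros(())           # CPU tensor, nothing GPU-specific here
        self.device = torch.device("cuda:7")  # device recorded on the original server

torch.save(Box(), "box.pt")
loaded = torch.load("box.pt")
print(loaded.device)  # cuda:7 -- the pickled device attribute ignores the GPUs actually present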

To Reproduce

Run the following code on a server with 8 graphics cards:

import torch
from torchrl.data.tensor_specs import Bounded

spec = Bounded(low=0, high=1, shape=(), device="cuda:7")
torch.save(spec, "spec.pt")

Then copy spec.pt to another server with only one graphics card and run the following code:

import torch

spec_load = torch.load("spec.pt")
print(spec_load.device) # Expected output: "cuda:7"

spec_load.to("cuda:0")

Expected behavior

The spec should simply move to cuda:0; instead, we receive the following CUDA error:

RuntimeError                              Traceback (most recent call last)
Cell In[1], line 6
----> 6 spec_load.to("cuda:0")

File ~/github/rl/torchrl/data/tensor_specs.py:2273, in Bounded.to(self, dest)
   2270 if dest_device == self.device and dest_dtype == self.dtype:
   2271     return self
   2272 return Bounded(
-> 2273     low=self.space.low.to(dest),
   2274     high=self.space.high.to(dest),
   2275     shape=self.shape,
   2276     device=dest_device,
   2277     dtype=dest_dtype,
   2278 )

File ~/github/rl/torchrl/data/tensor_specs.py:378, in ContinuousBox.low(self)
    376 @property
    377 def low(self):
--> 378     return self._low.to(self.device)

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

This happens because, in

@property
def low(self):
    return self._low.to(self.device)

@property
def high(self):
    return self._high.to(self.device)

self._low is first moved to cuda:7, which no longer exists on the target server.
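Expanded, the failing call chain looks roughly like this (an illustration of the two-hop transfer, not actual torchrl source):

# What Bounded.to effectively runs through the .low property (illustration only):
#   self.space.low.to("cuda:0")
#   == self.space._low.to(self.space.device).to("cuda:0")
#   == self.space._low.to(torch.device("cuda:7")).to("cuda:0")
# The first .to() already raises "invalid device ordinal" on the 1-GPU server,
# so the transfer to "cuda:0" is never reached.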

System info

>>> print(torchrl.__version__, numpy.__version__, sys.version, sys.platform)
0.5.0+df4fa78 2.1.1 3.11.0 (main, Mar  1 2023, 18:26:19) [GCC 11.2.0] linux

Reason and Possible fixes

I created a PR to fix this bug.
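The actual change lives in #2421; one possible shape for it, sketched here under the assumption that Bounded.to can read the raw _low/_high tensors directly, is to bypass the device-routing properties:

# Sketch only, not necessarily the merged code of #2421: transfer the raw
# tensors straight to dest so .low/.high never hop through the stale device.
return Bounded(
    low=self.space._low.to(dest),
    high=self.space._high.to(dest),
    shape=self.shape,
    device=dest_device,
    dtype=dest_dtype,
)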

Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have read the documentation (required)
  • I have provided a minimal working example to reproduce the bug (required)