[BUG] Invalid CUDA ID error when loading Bounded variables across devices #2420

Closed
3 tasks done
cbhua opened this issue Sep 4, 2024 · 0 comments · Fixed by #2421
Labels
bug Something isn't working

Comments

cbhua (Contributor) commented Sep 4, 2024

Describe the bug

This issue reports an invalid CUDA ID error that occurs when transferring a Bounded variable between servers with different numbers of GPUs.

In the current implementation, changing the device of a Bounded variable executes:

low=self.space.low.to(dest),
high=self.space.high.to(dest),

This operation first moves self.space._low to the stored self.space.device before transferring it to the target device (dest):

@property
def low(self):
    return self._low.to(self.device)

@property
def high(self):
    return self._high.to(self.device)

This fails when a variable previously on cuda:7 (on an 8-GPU server) is loaded on a server with only one GPU, because the access incorrectly attempts to reach cuda:7.

The issue was identified when a model was trained and saved on a multi-GPU cluster and subsequently loaded on a local server equipped with fewer GPUs. The model’s saved state includes device information specific to the original multi-GPU environment. When attempting to assign the model to a device available on the current server, the discrepancy in device IDs between the environments leads to this bug.
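A minimal sketch (using a hypothetical Box class rather than the actual torchrl types) shows why the stale device survives the save/load round trip:

import torch

class Box:
    """Hypothetical stand-in for a spec that stores a tensor plus a device attribute."""
    def __init__(self):
        self._low = torch.zeros(())           # CPU tensor, nothing GPU-specific here
        self.device = torch.device("cuda:7")  # device recorded on the original server

torch.save(Box(), "box.pt")
loaded = torch.load("box.pt")
print(loaded.device)  # cuda:7 -- the pickled device attribute ignores the GPUs actually present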

To Reproduce

Run the following code on a server with 8 graphics cards:

import torch
from torchrl.data.tensor_specs import Bounded

spec = Bounded(low=0, high=1, shape=(), device="cuda:7")
torch.save(spec, "spec.pt")

Then copy spec.pt to another server with only one graphics card and run the following code:

import torch

spec_load = torch.load("spec.pt")
print(spec_load.device) # Expected output: "cuda:7"

spec_load.to("cuda:0")

Expected behavior

The spec should simply move to cuda:0; instead, we receive the following CUDA error:

RuntimeError                              Traceback (most recent call last)
Cell In[1], line 6
----> 6 spec_load.to("cuda:0")

File ~/github/rl/torchrl/data/tensor_specs.py:2273, in Bounded.to(self, dest)
   2270 if dest_device == self.device and dest_dtype == self.dtype:
   2271     return self
   2272 return Bounded(
-> 2273     low=self.space.low.to(dest),
   2274     high=self.space.high.to(dest),
   2275     shape=self.shape,
   2276     device=dest_device,
   2277     dtype=dest_dtype,
   2278 )

File ~/github/rl/torchrl/data/tensor_specs.py:378, in ContinuousBox.low(self)
    376 @property
    377 def low(self):
--> 378     return self._low.to(self.device)

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

This happens because, in

@property
def low(self):
    return self._low.to(self.device)

@property
def high(self):
    return self._high.to(self.device)

self._low is first moved to cuda:7, which no longer exists on the target server.
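Expanded, the failing call chain looks roughly like this (an illustration of the two-hop transfer, not actual torchrl source):

# What Bounded.to effectively runs through the .low property (illustration only):
#   self.space.low.to("cuda:0")
#   == self.space._low.to(self.space.device).to("cuda:0")
#   == self.space._low.to(torch.device("cuda:7")).to("cuda:0")
# The first .to() already raises "invalid device ordinal" on the 1-GPU server,
# so the transfer to "cuda:0" is never reached.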

System info

>>> print(torchrl.__version__, numpy.__version__, sys.version, sys.platform)
0.5.0+df4fa78 2.1.1 3.11.0 (main, Mar  1 2023, 18:26:19) [GCC 11.2.0] linux

Reason and Possible fixes

I created a PR to fix this bug.
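The actual change lives in #2421; one possible shape for it, sketched here under the assumption that Bounded.to can read the raw _low/_high tensors directly, is to bypass the device-routing properties:

# Sketch only, not necessarily the merged code of #2421: transfer the raw
# tensors straight to dest so .low/.high never hop through the stale device.
return Bounded(
    low=self.space._low.to(dest),
    high=self.space._high.to(dest),
    shape=self.shape,
    device=dest_device,
    dtype=dest_dtype,
)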

Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have read the documentation (required)
  • I have provided a minimal working example to reproduce the bug (required)