Describe the bug
There is an invalid CUDA ID error when a `Bounded` variable is transferred between servers with different numbers of GPUs.
In the current implementation, when changing the device of a `Bounded` variable, there is:

rl/torchrl/data/tensor_specs.py, lines 2273 to 2274 in df4fa78

This operation first attempts to move `self.space._low` to its `self.space.device` before transferring to the target device (`dest`):

rl/torchrl/data/tensor_specs.py, lines 376 to 382 in df4fa78
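In rough terms, the pattern looks like the sketch below (a simplified illustration of the behavior described above, not the literal torchrl source): the box's `low`/`high` accessors re-materialize the stored tensors on `self.device` before any move to `dest` happens.

```python
import torch

# Simplified sketch of the pattern described above (illustration only;
# the real implementation lives in torchrl/data/tensor_specs.py).
class ContinuousBoxSketch:
    def __init__(self, low: torch.Tensor, high: torch.Tensor, device: torch.device):
        self._low = low
        self._high = high
        # After deserialization this may still be cuda:7 from the original server.
        self.device = device

    @property
    def low(self) -> torch.Tensor:
        # The stored tensor is first moved to self.device -- this is the access
        # that fails when self.device no longer exists on the current machine.
        return self._low.to(self.device)

    @property
    def high(self) -> torch.Tensor:
        return self._high.to(self.device)
```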
This process leads to errors when a variable previously on `cuda:7` (in an 8-GPU server) is loaded on a server with only one GPU, as it incorrectly attempts to access `cuda:7`.

The issue was identified when a model was trained and saved on a multi-GPU cluster and subsequently loaded on a local server equipped with fewer GPUs. The model's saved state includes device information specific to the original multi-GPU environment. When attempting to assign the model to a device available on the current server, the discrepancy in device IDs between the environments leads to this bug.
To Reproduce
Run the following code on a server with 8 graphics cards:
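The original snippet is not preserved in this copy of the issue; a minimal sketch of the first step, assuming the spec class is exposed as `torchrl.data.Bounded`, would be:

```python
# Minimal sketch of step 1, assuming torchrl.data.Bounded is the spec class in use.
import torch
from torchrl.data import Bounded

# Create a Bounded spec that lives on the last GPU of the 8-GPU server.
spec = Bounded(low=-1.0, high=1.0, shape=(3,), device="cuda:7", dtype=torch.float32)

# Serialize it; the saved state keeps the cuda:7 device information.
torch.save(spec, "spec.pt")
```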
Then copy the `spec.pt` file to another server with only one graphics card and run the following code:
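Again as a sketch of the second step (the exact snippet is not preserved here), assuming `map_location` is used so that deserialization itself succeeds:

```python
# Minimal sketch of step 2 on the single-GPU server.
import torch

# map_location brings the stored tensors onto the available GPU, but the
# spec's internal device metadata may still point at cuda:7.
spec = torch.load("spec.pt", map_location="cuda:0")

# Moving the spec triggers the internal "_low -> self.space.device" hop,
# which tries to access cuda:7 and raises a CUDA error on this machine.
spec = spec.to("cuda:0")
```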
Expected behavior
Instead of the spec being moved to the available device, we receive a CUDA error complaining about an invalid device ID.
This happens because, at

rl/torchrl/data/tensor_specs.py, lines 376 to 382 in df4fa78

`self._low` will be passed to `cuda:7`.

System info
Reason and Possible fixes
I created a PR to fix this bug.
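The PR itself is not reproduced here. One possible direction, sketched under the assumption that the box stores raw `_low`/`_high` tensors and writable `device` attributes, is to move those tensors straight to the destination device instead of routing them through the stale `self.space.device`:

```python
import torch

def move_bounded_spec(spec, dest):
    """Hedged sketch of a possible fix (not necessarily the PR's exact change):
    relocate the raw tensors directly, bypassing the stale device hop."""
    dest = torch.device(dest)
    spec.space._low = spec.space._low.to(dest)    # go straight to dest
    spec.space._high = spec.space._high.to(dest)  # never touch the old cuda:7
    spec.space.device = dest                      # refresh stored metadata (assumed writable)
    spec.device = dest
    return spec
```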
Checklist