[Feature] Allow usage of a different device on main and sub-envs in ParallelEnv and SerialEnv #1626

Merged: vmoens merged 71 commits into main from parallel_cuda_refactor on Nov 30, 2023

@vmoens (Contributor) commented Oct 11, 2023

Allows sub-envs to live on a different device than the main env in batched environments.

The idea is to send data to CUDA only once, and only if needed, rather than using CUDA as a shared-memory container.
This gives fine-grained control over where the data is placed, as in the following script:
https://gist.github.com/vmoens/d49e733c37356e4465bc3fe85d499db1
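
For readers who do not want to open the gist, here is a minimal sketch of the intended usage. It is not the gist's content: it assumes that `ParallelEnv` accepts a `device` keyword that may differ from the sub-env device (the behaviour this PR describes), that `GymEnv` and gymnasium are installed, and it uses `"Pendulum-v1"` purely as a placeholder task. The helper names `choose_devices`, `make_env` and `run_rollout` are hypothetical.

```python
from functools import partial


def choose_devices(cuda_available: bool) -> tuple:
    """Return (sub_env_device, main_device).

    Sub-envs stay on CPU; the batched env uses CUDA when it is available.
    """
    return "cpu", ("cuda" if cuda_available else "cpu")


def make_env(device: str = "cpu"):
    # Imported lazily so this module loads even without torchrl installed.
    from torchrl.envs import GymEnv

    # "Pendulum-v1" is only a placeholder task for illustration.
    return GymEnv("Pendulum-v1", device=device)


def run_rollout(num_workers: int = 4, steps: int = 10):
    import torch
    from torchrl.envs import ParallelEnv

    sub_device, main_device = choose_devices(torch.cuda.is_available())
    # Each worker builds its env on `sub_device`; `device=` is the device
    # on which the batched env exposes its data (assumed per this PR).
    env = ParallelEnv(num_workers, partial(make_env, sub_device), device=main_device)
    try:
        # No policy is passed, so random actions are used.
        return env.rollout(steps)
    finally:
        env.close()
```

Because `ParallelEnv` starts worker processes, `run_rollout()` should be called from under an `if __name__ == "__main__":` guard. Swapping `ParallelEnv` for `SerialEnv` should give the single-process equivalent.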

Results with a CNN policy (the policy is always on the same device as the main env):
- Sub-envs on CPU, main env on CPU: 789 it/sec (8 procs), 2431.1049 it/sec (32 procs)
- Sub-envs on CPU, main env on CUDA: 2697 it/sec (8 procs), 5074.3945 it/sec (32 procs)
- Sub-envs on CUDA, main env on CUDA: 840 it/sec (8 procs), OOM (32 procs)
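
To put these figures in perspective, the short calculation below derives the relative speedups from the throughputs reported above. Only the numbers come from the PR; the variable names are illustrative.

```python
# Throughputs (it/sec) as reported above, keyed by number of processes.
cpu_cpu = {8: 789.0, 32: 2431.1049}    # sub-envs on CPU, main env on CPU
cpu_cuda = {8: 2697.0, 32: 5074.3945}  # sub-envs on CPU, main env on CUDA
cuda_cuda_8 = 840.0                    # sub-envs on CUDA, main env on CUDA (OOM at 32 procs)

# Speedup of the mixed-device setup over the all-CPU baseline.
speedup_mixed = {n: cpu_cuda[n] / cpu_cpu[n] for n in cpu_cpu}
# Speedup of the all-CUDA setup over the all-CPU baseline (8 procs only).
speedup_all_cuda_8 = cuda_cuda_8 / cpu_cpu[8]

for n, s in speedup_mixed.items():
    print(f"{n} procs: CPU sub-envs + CUDA main env is {s:.2f}x the all-CPU rate")
print(f"8 procs: all-CUDA is {speedup_all_cuda_8:.2f}x the all-CPU rate")
```

With these numbers, keeping the sub-envs on CPU and the main env on CUDA is roughly 3.4x faster than all-CPU at 8 procs and about 2.1x at 32 procs, while the all-CUDA setup is only about 1.06x faster at 8 procs.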

@skandermoalla I believe this will solve your perf problem

github-actions bot commented Nov 10, 2023

$\color{#D29922}\textsf{\Large⚠\kern{0.2cm}\normalsize Warning}$ Result of GPU Benchmark Tests

Total Benchmarks: 92. Improved: $\large\color{#35bf28}9$. Worsened: $\large\color{#d91a1a}4$.

| Name | Max | Mean | Ops | Ops on Repo HEAD | Change |
| --- | --- | --- | --- | --- | --- |
| test_single | 0.1259s | 0.1254s | 7.9738 Ops/s | 8.1205 Ops/s | $\color{#d91a1a}-1.81\%$ |
| test_sync | 0.1033s | 0.1024s | 9.7614 Ops/s | 9.7601 Ops/s | $\color{#35bf28}+0.01\%$ |
| test_async | 0.2832s | 99.8089ms | 10.0191 Ops/s | 10.0066 Ops/s | $\color{#35bf28}+0.13\%$ |
| test_single_pixels | 0.1311s | 0.1307s | 7.6528 Ops/s | 6.8536 Ops/s | $\textbf{\color{#35bf28}+11.66\%}$ |
| test_sync_pixels | 0.1034s | 0.1016s | 9.8435 Ops/s | 10.4872 Ops/s | $\textbf{\color{#d91a1a}-6.14\%}$ |
| test_async_pixels | 0.2442s | 90.6410ms | 11.0325 Ops/s | 11.2103 Ops/s | $\color{#d91a1a}-1.59\%$ |
| test_simple | 0.9814s | 0.9133s | 1.0949 Ops/s | 1.1044 Ops/s | $\color{#d91a1a}-0.86\%$ |
| test_transformed | 1.1872s | 1.1306s | 0.8845 Ops/s | 0.8649 Ops/s | $\color{#35bf28}+2.27\%$ |
| test_serial | 2.5515s | 2.4805s | 0.4031 Ops/s | 0.4011 Ops/s | $\color{#35bf28}+0.50\%$ |
| test_parallel | 2.5597s | 2.4722s | 0.4045 Ops/s | 0.3955 Ops/s | $\color{#35bf28}+2.28\%$ |
| test_step_mdp_speed[True-True-True-True-True] | 97.7920μs | 34.2620μs | 29.1869 KOps/s | 27.3521 KOps/s | $\textbf{\color{#35bf28}+6.71\%}$ |
| test_step_mdp_speed[True-True-True-True-False] | 45.4700μs | 20.2812μs | 49.3069 KOps/s | 47.8085 KOps/s | $\color{#35bf28}+3.13\%$ |
| test_step_mdp_speed[True-True-True-False-True] | 54.9010μs | 20.0092μs | 49.9769 KOps/s | 47.6925 KOps/s | $\color{#35bf28}+4.79\%$ |
| test_step_mdp_speed[True-True-True-False-False] | 33.8310μs | 11.8995μs | 84.0369 KOps/s | 80.7285 KOps/s | $\color{#35bf28}+4.10\%$ |
| test_step_mdp_speed[True-True-False-True-True] | 69.4610μs | 35.6676μs | 28.0367 KOps/s | 26.8591 KOps/s | $\color{#35bf28}+4.38\%$ |
| test_step_mdp_speed[True-True-False-True-False] | 47.1410μs | 21.8432μs | 45.7809 KOps/s | 43.5075 KOps/s | $\textbf{\color{#35bf28}+5.23\%}$ |
| test_step_mdp_speed[True-True-False-False-True] | 49.8810μs | 21.9846μs | 45.4865 KOps/s | 43.1915 KOps/s | $\textbf{\color{#35bf28}+5.31\%}$ |
| test_step_mdp_speed[True-True-False-False-False] | 45.3010μs | 13.9256μs | 71.8103 KOps/s | 70.8904 KOps/s | $\color{#35bf28}+1.30\%$ |
| test_step_mdp_speed[True-False-True-True-True] | 71.4210μs | 38.1126μs | 26.2381 KOps/s | 25.0615 KOps/s | $\color{#35bf28}+4.69\%$ |
| test_step_mdp_speed[True-False-True-True-False] | 50.5610μs | 24.0378μs | 41.6012 KOps/s | 39.5686 KOps/s | $\textbf{\color{#35bf28}+5.14\%}$ |
| test_step_mdp_speed[True-False-True-False-True] | 47.9700μs | 22.1030μs | 45.2427 KOps/s | 43.2266 KOps/s | $\color{#35bf28}+4.66\%$ |
| test_step_mdp_speed[True-False-True-False-False] | 70.4010μs | 13.9248μs | 71.8141 KOps/s | 70.4499 KOps/s | $\color{#35bf28}+1.94\%$ |
| test_step_mdp_speed[True-False-False-True-True] | 0.1155ms | 39.7618μs | 25.1498 KOps/s | 24.1644 KOps/s | $\color{#35bf28}+4.08\%$ |
| test_step_mdp_speed[True-False-False-True-False] | 58.7510μs | 25.5535μs | 39.1336 KOps/s | 37.5049 KOps/s | $\color{#35bf28}+4.34\%$ |
| test_step_mdp_speed[True-False-False-False-True] | 51.0210μs | 23.6545μs | 42.2753 KOps/s | 41.2821 KOps/s | $\color{#35bf28}+2.41\%$ |
| test_step_mdp_speed[True-False-False-False-False] | 48.6800μs | 15.8653μs | 63.0305 KOps/s | 62.8093 KOps/s | $\color{#35bf28}+0.35\%$ |
| test_step_mdp_speed[False-True-True-True-True] | 74.2710μs | 38.2859μs | 26.1193 KOps/s | 24.9645 KOps/s | $\color{#35bf28}+4.63\%$ |
| test_step_mdp_speed[False-True-True-True-False] | 51.9010μs | 23.7856μs | 42.0422 KOps/s | 41.0036 KOps/s | $\color{#35bf28}+2.53\%$ |
| test_step_mdp_speed[False-True-True-False-True] | 91.0510μs | 26.0157μs | 38.4383 KOps/s | 37.1147 KOps/s | $\color{#35bf28}+3.57\%$ |
| test_step_mdp_speed[False-True-True-False-False] | 40.7000μs | 15.6278μs | 63.9884 KOps/s | 63.3255 KOps/s | $\color{#35bf28}+1.05\%$ |
| test_step_mdp_speed[False-True-False-True-True] | 89.0410μs | 39.4363μs | 25.3574 KOps/s | 23.8670 KOps/s | $\textbf{\color{#35bf28}+6.24\%}$ |
| test_step_mdp_speed[False-True-False-True-False] | 57.5210μs | 25.7972μs | 38.7639 KOps/s | 37.6892 KOps/s | $\color{#35bf28}+2.85\%$ |
| test_step_mdp_speed[False-True-False-False-True] | 63.2600μs | 27.6742μs | 36.1347 KOps/s | 34.8663 KOps/s | $\color{#35bf28}+3.64\%$ |
| test_step_mdp_speed[False-True-False-False-False] | 43.6600μs | 17.5097μs | 57.1112 KOps/s | 55.4806 KOps/s | $\color{#35bf28}+2.94\%$ |
| test_step_mdp_speed[False-False-True-True-True] | 76.5210μs | 42.6108μs | 23.4682 KOps/s | 23.7306 KOps/s | $\color{#d91a1a}-1.11\%$ |
| test_step_mdp_speed[False-False-True-True-False] | 55.4810μs | 27.7874μs | 35.9875 KOps/s | 35.4762 KOps/s | $\color{#35bf28}+1.44\%$ |
| test_step_mdp_speed[False-False-True-False-True] | 66.1210μs | 28.1492μs | 35.5250 KOps/s | 35.0571 KOps/s | $\color{#35bf28}+1.33\%$ |
| test_step_mdp_speed[False-False-True-False-False] | 43.7610μs | 17.4181μs | 57.4116 KOps/s | 54.3673 KOps/s | $\textbf{\color{#35bf28}+5.60\%}$ |
| test_step_mdp_speed[False-False-False-True-True] | 81.7210μs | 43.5889μs | 22.9416 KOps/s | 22.3720 KOps/s | $\color{#35bf28}+2.55\%$ |
| test_step_mdp_speed[False-False-False-True-False] | 52.1510μs | 30.3179μs | 32.9838 KOps/s | 32.6216 KOps/s | $\color{#35bf28}+1.11\%$ |
| test_step_mdp_speed[False-False-False-False-True] | 58.2710μs | 30.3180μs | 32.9838 KOps/s | 32.0458 KOps/s | $\color{#35bf28}+2.93\%$ |
| test_step_mdp_speed[False-False-False-False-False] | 46.0310μs | 19.5450μs | 51.1641 KOps/s | 51.0555 KOps/s | $\color{#35bf28}+0.21\%$ |
| test_values[generalized_advantage_estimate-True-True] | 27.5589ms | 26.9326ms | 37.1297 Ops/s | 36.7502 Ops/s | $\color{#35bf28}+1.03\%$ |
| test_values[vec_generalized_advantage_estimate-True-True] | 86.9972ms | 3.3028ms | 302.7778 Ops/s | 301.1197 Ops/s | $\color{#35bf28}+0.55\%$ |
| test_values[td0_return_estimate-False-False] | 0.1009ms | 65.3388μs | 15.3049 KOps/s | 15.1746 KOps/s | $\color{#35bf28}+0.86\%$ |
| test_values[td1_return_estimate-False-False] | 59.3002ms | 57.4174ms | 17.4163 Ops/s | 16.6958 Ops/s | $\color{#35bf28}+4.32\%$ |
| test_values[vec_td1_return_estimate-False-False] | 2.0711ms | 1.7368ms | 575.7777 Ops/s | 570.1424 Ops/s | $\color{#35bf28}+0.99\%$ |
| test_values[td_lambda_return_estimate-True-False] | 95.8203ms | 94.7855ms | 10.5501 Ops/s | 10.4972 Ops/s | $\color{#35bf28}+0.50\%$ |
| test_values[vec_td_lambda_return_estimate-True-False] | 2.0440ms | 1.7628ms | 567.2661 Ops/s | 563.8115 Ops/s | $\color{#35bf28}+0.61\%$ |
| test_gae_speed[generalized_advantage_estimate-False-1-512] | 26.7482ms | 26.5598ms | 37.6508 Ops/s | 39.6980 Ops/s | $\textbf{\color{#d91a1a}-5.16\%}$ |
| test_gae_speed[vec_generalized_advantage_estimate-True-1-512] | 0.8786ms | 0.7274ms | 1.3747 KOps/s | 1.3747 KOps/s | $+0.00\%$ |
| test_gae_speed[vec_generalized_advantage_estimate-False-1-512] | 0.8027ms | 0.6835ms | 1.4631 KOps/s | 1.4670 KOps/s | $\color{#d91a1a}-0.27\%$ |
| test_gae_speed[vec_generalized_advantage_estimate-True-32-512] | 1.5458ms | 1.4776ms | 676.7716 Ops/s | 676.1352 Ops/s | $\color{#35bf28}+0.09\%$ |
| test_gae_speed[vec_generalized_advantage_estimate-False-32-512] | 0.9803ms | 0.7111ms | 1.4063 KOps/s | 1.4065 KOps/s | $\color{#d91a1a}-0.02\%$ |
| test_dqn_speed | 7.8719ms | 1.4535ms | 688.0111 Ops/s | 684.1205 Ops/s | $\color{#35bf28}+0.57\%$ |
| test_ddpg_speed | 4.6926ms | 3.2781ms | 305.0508 Ops/s | 302.7283 Ops/s | $\color{#35bf28}+0.77\%$ |
| test_sac_speed | 94.5158ms | 9.9228ms | 100.7775 Ops/s | 109.0498 Ops/s | $\textbf{\color{#d91a1a}-7.59\%}$ |
| test_redq_speed | 17.1540ms | 16.6117ms | 60.1986 Ops/s | 59.9874 Ops/s | $\color{#35bf28}+0.35\%$ |
| test_redq_deprec_speed | 14.1292ms | 13.0337ms | 76.7243 Ops/s | 78.2171 Ops/s | $\color{#d91a1a}-1.91\%$ |
| test_td3_speed | 19.1671ms | 9.4240ms | 106.1124 Ops/s | 105.8476 Ops/s | $\color{#35bf28}+0.25\%$ |
| test_cql_speed | 32.9136ms | 31.6257ms | 31.6199 Ops/s | 31.4242 Ops/s | $\color{#35bf28}+0.62\%$ |
| test_a2c_speed | 8.5479ms | 7.1826ms | 139.2262 Ops/s | 141.3987 Ops/s | $\color{#d91a1a}-1.54\%$ |
| test_ppo_speed | 9.2579ms | 7.5562ms | 132.3423 Ops/s | 136.0287 Ops/s | $\color{#d91a1a}-2.71\%$ |
| test_reinforce_speed | 7.6660ms | 6.1963ms | 161.3857 Ops/s | 164.5100 Ops/s | $\color{#d91a1a}-1.90\%$ |
| test_iql_speed | 28.9764ms | 27.2077ms | 36.7543 Ops/s | 36.7289 Ops/s | $\color{#35bf28}+0.07\%$ |
| test_sample_rb[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 3.0587ms | 2.4499ms | 408.1863 Ops/s | 400.3058 Ops/s | $\color{#35bf28}+1.97\%$ |
| test_sample_rb[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 3.8814ms | 2.6376ms | 379.1297 Ops/s | 331.0611 Ops/s | $\textbf{\color{#35bf28}+14.52\%}$ |
| test_sample_rb[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 3.7665ms | 2.6415ms | 378.5725 Ops/s | 375.3239 Ops/s | $\color{#35bf28}+0.87\%$ |
| test_sample_rb[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 3.1386ms | 2.4742ms | 404.1647 Ops/s | 402.3102 Ops/s | $\color{#35bf28}+0.46\%$ |
| test_sample_rb[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 3.7675ms | 2.6431ms | 378.3495 Ops/s | 329.1144 Ops/s | $\textbf{\color{#35bf28}+14.96\%}$ |
| test_sample_rb[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 3.8913ms | 2.6618ms | 375.6811 Ops/s | 372.8908 Ops/s | $\color{#35bf28}+0.75\%$ |
| test_sample_rb[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 2.9061ms | 2.4585ms | 406.7515 Ops/s | 401.8351 Ops/s | $\color{#35bf28}+1.22\%$ |
| test_sample_rb[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 3.8980ms | 2.6384ms | 379.0131 Ops/s | 375.3138 Ops/s | $\color{#35bf28}+0.99\%$ |
| test_sample_rb[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 3.5356ms | 2.6478ms | 377.6717 Ops/s | 370.6337 Ops/s | $\color{#35bf28}+1.90\%$ |
| test_iterate_rb[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 2.9303ms | 2.4705ms | 404.7807 Ops/s | 400.3605 Ops/s | $\color{#35bf28}+1.10\%$ |
| test_iterate_rb[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 4.3535ms | 2.6664ms | 375.0384 Ops/s | 371.3615 Ops/s | $\color{#35bf28}+0.99\%$ |
| test_iterate_rb[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 3.8083ms | 2.6577ms | 376.2662 Ops/s | 373.9163 Ops/s | $\color{#35bf28}+0.63\%$ |
| test_iterate_rb[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 3.0302ms | 2.4652ms | 405.6468 Ops/s | 399.6933 Ops/s | $\color{#35bf28}+1.49\%$ |
| test_iterate_rb[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 3.4469ms | 2.6506ms | 377.2739 Ops/s | 372.4766 Ops/s | $\color{#35bf28}+1.29\%$ |
| test_iterate_rb[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 4.1462ms | 2.6680ms | 374.8173 Ops/s | 373.2655 Ops/s | $\color{#35bf28}+0.42\%$ |
| test_iterate_rb[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 3.0964ms | 2.4755ms | 403.9538 Ops/s | 401.2249 Ops/s | $\color{#35bf28}+0.68\%$ |
| test_iterate_rb[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 3.9595ms | 2.6614ms | 375.7460 Ops/s | 371.3602 Ops/s | $\color{#35bf28}+1.18\%$ |
| test_iterate_rb[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 4.4477ms | 2.6770ms | 373.5523 Ops/s | 373.3958 Ops/s | $\color{#35bf28}+0.04\%$ |
| test_populate_rb[TensorDictReplayBuffer-ListStorage-RandomSampler-400] | 0.2177s | 19.7264ms | 50.6934 Ops/s | 50.8581 Ops/s | $\color{#d91a1a}-0.32\%$ |
| test_populate_rb[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] | 0.1336s | 17.9541ms | 55.6975 Ops/s | 55.8224 Ops/s | $\color{#d91a1a}-0.22\%$ |
| test_populate_rb[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] | 0.1288s | 15.4323ms | 64.7993 Ops/s | 64.1679 Ops/s | $\color{#35bf28}+0.98\%$ |
| test_populate_rb[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] | 0.1303s | 17.7642ms | 56.2930 Ops/s | 56.1044 Ops/s | $\color{#35bf28}+0.34\%$ |
| test_populate_rb[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] | 0.1265s | 17.7066ms | 56.4761 Ops/s | 54.9705 Ops/s | $\color{#35bf28}+2.74\%$ |
| test_populate_rb[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] | 0.1450s | 18.1534ms | 55.0860 Ops/s | 61.6577 Ops/s | $\textbf{\color{#d91a1a}-10.66\%}$ |
| test_populate_rb[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] | 0.1356s | 17.9631ms | 55.6698 Ops/s | 54.5770 Ops/s | $\color{#35bf28}+2.00\%$ |
| test_populate_rb[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] | 0.1352s | 17.8547ms | 56.0075 Ops/s | 55.2671 Ops/s | $\color{#35bf28}+1.34\%$ |
| test_populate_rb[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] | 0.1378s | 15.7640ms | 63.4357 Ops/s | 64.0702 Ops/s | $\color{#d91a1a}-0.99\%$ |
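
As a reading aid, the Change column appears to be the relative difference between Ops and Ops on Repo HEAD. The sketch below reproduces a few rows under that assumption. The ±5% threshold for counting a benchmark as improved or worsened is inferred from which rows are bold and from the 9/4 totals above; the bot does not state it, and the function names are hypothetical.

```python
def percent_change(ops: float, ops_head: float) -> float:
    """Relative change of this PR's throughput versus repo HEAD, in percent."""
    return (ops - ops_head) / ops_head * 100.0


def classify(change: float, threshold: float = 5.0) -> str:
    """Label a change; the 5% threshold is an assumption inferred from the table."""
    if change > threshold:
        return "improved"
    if change < -threshold:
        return "worsened"
    return "unchanged"


# (Ops, Ops on Repo HEAD) copied from three rows of the table above.
rows = {
    "test_single": (7.9738, 8.1205),
    "test_single_pixels": (7.6528, 6.8536),
    "test_sac_speed": (100.7775, 109.0498),
}

for name, (ops, head) in rows.items():
    change = percent_change(ops, head)
    print(f"{name}: {change:+.2f}% ({classify(change)})")
```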

@vmoens changed the title from "[Feature] Use a different device on sub-envs in ParallelEnv and SerialEnv" to "[Feature] Allow usage of a different device on main and sub-envs in ParallelEnv and SerialEnv" on Nov 27, 2023
@vmoens merged commit 6c27bdb into main on Nov 30, 2023
59 of 61 checks passed
@vmoens deleted the parallel_cuda_refactor branch on November 30, 2023 08:02
Labels: CLA Signed, enhancement