[Feature] Allow usage of a different device on main and sub-envs in ParallelEnv and SerialEnv #1626
Merged
# Conflicts: # benchmarks/ecosystem/gym_env_throughput.py
Name | Max | Mean | Ops | Ops on Repo HEAD | Change |
---|---|---|---|---|---|
test_single | 0.1259s | 0.1254s | 7.9738 Ops/s | 8.1205 Ops/s | |
test_sync | 0.1033s | 0.1024s | 9.7614 Ops/s | 9.7601 Ops/s | |
test_async | 0.2832s | 99.8089ms | 10.0191 Ops/s | 10.0066 Ops/s | |
test_single_pixels | 0.1311s | 0.1307s | 7.6528 Ops/s | 6.8536 Ops/s | |
test_sync_pixels | 0.1034s | 0.1016s | 9.8435 Ops/s | 10.4872 Ops/s | |
test_async_pixels | 0.2442s | 90.6410ms | 11.0325 Ops/s | 11.2103 Ops/s | |
test_simple | 0.9814s | 0.9133s | 1.0949 Ops/s | 1.1044 Ops/s | |
test_transformed | 1.1872s | 1.1306s | 0.8845 Ops/s | 0.8649 Ops/s | |
test_serial | 2.5515s | 2.4805s | 0.4031 Ops/s | 0.4011 Ops/s | |
test_parallel | 2.5597s | 2.4722s | 0.4045 Ops/s | 0.3955 Ops/s | |
test_step_mdp_speed[True-True-True-True-True] | 97.7920μs | 34.2620μs | 29.1869 KOps/s | 27.3521 KOps/s | |
test_step_mdp_speed[True-True-True-True-False] | 45.4700μs | 20.2812μs | 49.3069 KOps/s | 47.8085 KOps/s | |
test_step_mdp_speed[True-True-True-False-True] | 54.9010μs | 20.0092μs | 49.9769 KOps/s | 47.6925 KOps/s | |
test_step_mdp_speed[True-True-True-False-False] | 33.8310μs | 11.8995μs | 84.0369 KOps/s | 80.7285 KOps/s | |
test_step_mdp_speed[True-True-False-True-True] | 69.4610μs | 35.6676μs | 28.0367 KOps/s | 26.8591 KOps/s | |
test_step_mdp_speed[True-True-False-True-False] | 47.1410μs | 21.8432μs | 45.7809 KOps/s | 43.5075 KOps/s | |
test_step_mdp_speed[True-True-False-False-True] | 49.8810μs | 21.9846μs | 45.4865 KOps/s | 43.1915 KOps/s | |
test_step_mdp_speed[True-True-False-False-False] | 45.3010μs | 13.9256μs | 71.8103 KOps/s | 70.8904 KOps/s | |
test_step_mdp_speed[True-False-True-True-True] | 71.4210μs | 38.1126μs | 26.2381 KOps/s | 25.0615 KOps/s | |
test_step_mdp_speed[True-False-True-True-False] | 50.5610μs | 24.0378μs | 41.6012 KOps/s | 39.5686 KOps/s | |
test_step_mdp_speed[True-False-True-False-True] | 47.9700μs | 22.1030μs | 45.2427 KOps/s | 43.2266 KOps/s | |
test_step_mdp_speed[True-False-True-False-False] | 70.4010μs | 13.9248μs | 71.8141 KOps/s | 70.4499 KOps/s | |
test_step_mdp_speed[True-False-False-True-True] | 0.1155ms | 39.7618μs | 25.1498 KOps/s | 24.1644 KOps/s | |
test_step_mdp_speed[True-False-False-True-False] | 58.7510μs | 25.5535μs | 39.1336 KOps/s | 37.5049 KOps/s | |
test_step_mdp_speed[True-False-False-False-True] | 51.0210μs | 23.6545μs | 42.2753 KOps/s | 41.2821 KOps/s | |
test_step_mdp_speed[True-False-False-False-False] | 48.6800μs | 15.8653μs | 63.0305 KOps/s | 62.8093 KOps/s | |
test_step_mdp_speed[False-True-True-True-True] | 74.2710μs | 38.2859μs | 26.1193 KOps/s | 24.9645 KOps/s | |
test_step_mdp_speed[False-True-True-True-False] | 51.9010μs | 23.7856μs | 42.0422 KOps/s | 41.0036 KOps/s | |
test_step_mdp_speed[False-True-True-False-True] | 91.0510μs | 26.0157μs | 38.4383 KOps/s | 37.1147 KOps/s | |
test_step_mdp_speed[False-True-True-False-False] | 40.7000μs | 15.6278μs | 63.9884 KOps/s | 63.3255 KOps/s | |
test_step_mdp_speed[False-True-False-True-True] | 89.0410μs | 39.4363μs | 25.3574 KOps/s | 23.8670 KOps/s | |
test_step_mdp_speed[False-True-False-True-False] | 57.5210μs | 25.7972μs | 38.7639 KOps/s | 37.6892 KOps/s | |
test_step_mdp_speed[False-True-False-False-True] | 63.2600μs | 27.6742μs | 36.1347 KOps/s | 34.8663 KOps/s | |
test_step_mdp_speed[False-True-False-False-False] | 43.6600μs | 17.5097μs | 57.1112 KOps/s | 55.4806 KOps/s | |
test_step_mdp_speed[False-False-True-True-True] | 76.5210μs | 42.6108μs | 23.4682 KOps/s | 23.7306 KOps/s | |
test_step_mdp_speed[False-False-True-True-False] | 55.4810μs | 27.7874μs | 35.9875 KOps/s | 35.4762 KOps/s | |
test_step_mdp_speed[False-False-True-False-True] | 66.1210μs | 28.1492μs | 35.5250 KOps/s | 35.0571 KOps/s | |
test_step_mdp_speed[False-False-True-False-False] | 43.7610μs | 17.4181μs | 57.4116 KOps/s | 54.3673 KOps/s | |
test_step_mdp_speed[False-False-False-True-True] | 81.7210μs | 43.5889μs | 22.9416 KOps/s | 22.3720 KOps/s | |
test_step_mdp_speed[False-False-False-True-False] | 52.1510μs | 30.3179μs | 32.9838 KOps/s | 32.6216 KOps/s | |
test_step_mdp_speed[False-False-False-False-True] | 58.2710μs | 30.3180μs | 32.9838 KOps/s | 32.0458 KOps/s | |
test_step_mdp_speed[False-False-False-False-False] | 46.0310μs | 19.5450μs | 51.1641 KOps/s | 51.0555 KOps/s | |
test_values[generalized_advantage_estimate-True-True] | 27.5589ms | 26.9326ms | 37.1297 Ops/s | 36.7502 Ops/s | |
test_values[vec_generalized_advantage_estimate-True-True] | 86.9972ms | 3.3028ms | 302.7778 Ops/s | 301.1197 Ops/s | |
test_values[td0_return_estimate-False-False] | 0.1009ms | 65.3388μs | 15.3049 KOps/s | 15.1746 KOps/s | |
test_values[td1_return_estimate-False-False] | 59.3002ms | 57.4174ms | 17.4163 Ops/s | 16.6958 Ops/s | |
test_values[vec_td1_return_estimate-False-False] | 2.0711ms | 1.7368ms | 575.7777 Ops/s | 570.1424 Ops/s | |
test_values[td_lambda_return_estimate-True-False] | 95.8203ms | 94.7855ms | 10.5501 Ops/s | 10.4972 Ops/s | |
test_values[vec_td_lambda_return_estimate-True-False] | 2.0440ms | 1.7628ms | 567.2661 Ops/s | 563.8115 Ops/s | |
test_gae_speed[generalized_advantage_estimate-False-1-512] | 26.7482ms | 26.5598ms | 37.6508 Ops/s | 39.6980 Ops/s | |
test_gae_speed[vec_generalized_advantage_estimate-True-1-512] | 0.8786ms | 0.7274ms | 1.3747 KOps/s | 1.3747 KOps/s | |
test_gae_speed[vec_generalized_advantage_estimate-False-1-512] | 0.8027ms | 0.6835ms | 1.4631 KOps/s | 1.4670 KOps/s | |
test_gae_speed[vec_generalized_advantage_estimate-True-32-512] | 1.5458ms | 1.4776ms | 676.7716 Ops/s | 676.1352 Ops/s | |
test_gae_speed[vec_generalized_advantage_estimate-False-32-512] | 0.9803ms | 0.7111ms | 1.4063 KOps/s | 1.4065 KOps/s | |
test_dqn_speed | 7.8719ms | 1.4535ms | 688.0111 Ops/s | 684.1205 Ops/s | |
test_ddpg_speed | 4.6926ms | 3.2781ms | 305.0508 Ops/s | 302.7283 Ops/s | |
test_sac_speed | 94.5158ms | 9.9228ms | 100.7775 Ops/s | 109.0498 Ops/s | |
test_redq_speed | 17.1540ms | 16.6117ms | 60.1986 Ops/s | 59.9874 Ops/s | |
test_redq_deprec_speed | 14.1292ms | 13.0337ms | 76.7243 Ops/s | 78.2171 Ops/s | |
test_td3_speed | 19.1671ms | 9.4240ms | 106.1124 Ops/s | 105.8476 Ops/s | |
test_cql_speed | 32.9136ms | 31.6257ms | 31.6199 Ops/s | 31.4242 Ops/s | |
test_a2c_speed | 8.5479ms | 7.1826ms | 139.2262 Ops/s | 141.3987 Ops/s | |
test_ppo_speed | 9.2579ms | 7.5562ms | 132.3423 Ops/s | 136.0287 Ops/s | |
test_reinforce_speed | 7.6660ms | 6.1963ms | 161.3857 Ops/s | 164.5100 Ops/s | |
test_iql_speed | 28.9764ms | 27.2077ms | 36.7543 Ops/s | 36.7289 Ops/s | |
test_sample_rb[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 3.0587ms | 2.4499ms | 408.1863 Ops/s | 400.3058 Ops/s | |
test_sample_rb[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 3.8814ms | 2.6376ms | 379.1297 Ops/s | 331.0611 Ops/s | |
test_sample_rb[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 3.7665ms | 2.6415ms | 378.5725 Ops/s | 375.3239 Ops/s | |
test_sample_rb[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 3.1386ms | 2.4742ms | 404.1647 Ops/s | 402.3102 Ops/s | |
test_sample_rb[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 3.7675ms | 2.6431ms | 378.3495 Ops/s | 329.1144 Ops/s | |
test_sample_rb[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 3.8913ms | 2.6618ms | 375.6811 Ops/s | 372.8908 Ops/s | |
test_sample_rb[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 2.9061ms | 2.4585ms | 406.7515 Ops/s | 401.8351 Ops/s | |
test_sample_rb[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 3.8980ms | 2.6384ms | 379.0131 Ops/s | 375.3138 Ops/s | |
test_sample_rb[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 3.5356ms | 2.6478ms | 377.6717 Ops/s | 370.6337 Ops/s | |
test_iterate_rb[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 2.9303ms | 2.4705ms | 404.7807 Ops/s | 400.3605 Ops/s | |
test_iterate_rb[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 4.3535ms | 2.6664ms | 375.0384 Ops/s | 371.3615 Ops/s | |
test_iterate_rb[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 3.8083ms | 2.6577ms | 376.2662 Ops/s | 373.9163 Ops/s | |
test_iterate_rb[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 3.0302ms | 2.4652ms | 405.6468 Ops/s | 399.6933 Ops/s | |
test_iterate_rb[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 3.4469ms | 2.6506ms | 377.2739 Ops/s | 372.4766 Ops/s | |
test_iterate_rb[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 4.1462ms | 2.6680ms | 374.8173 Ops/s | 373.2655 Ops/s | |
test_iterate_rb[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 3.0964ms | 2.4755ms | 403.9538 Ops/s | 401.2249 Ops/s | |
test_iterate_rb[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 3.9595ms | 2.6614ms | 375.7460 Ops/s | 371.3602 Ops/s | |
test_iterate_rb[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 4.4477ms | 2.6770ms | 373.5523 Ops/s | 373.3958 Ops/s | |
test_populate_rb[TensorDictReplayBuffer-ListStorage-RandomSampler-400] | 0.2177s | 19.7264ms | 50.6934 Ops/s | 50.8581 Ops/s | |
test_populate_rb[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] | 0.1336s | 17.9541ms | 55.6975 Ops/s | 55.8224 Ops/s | |
test_populate_rb[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] | 0.1288s | 15.4323ms | 64.7993 Ops/s | 64.1679 Ops/s | |
test_populate_rb[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] | 0.1303s | 17.7642ms | 56.2930 Ops/s | 56.1044 Ops/s | |
test_populate_rb[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] | 0.1265s | 17.7066ms | 56.4761 Ops/s | 54.9705 Ops/s | |
test_populate_rb[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] | 0.1450s | 18.1534ms | 55.0860 Ops/s | 61.6577 Ops/s | |
test_populate_rb[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] | 0.1356s | 17.9631ms | 55.6698 Ops/s | 54.5770 Ops/s | |
test_populate_rb[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] | 0.1352s | 17.8547ms | 56.0075 Ops/s | 55.2671 Ops/s | |
test_populate_rb[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] | 0.1378s | 15.7640ms | 63.4357 Ops/s | 64.0702 Ops/s |
vmoens changed the title from "[Feature] Use a different device on sub-envs in ParallelEnv and SerialEnv" to "[Feature] Allow usage of a different device on main and sub-envs in ParallelEnv and SerialEnv" on Nov 27, 2023
Labels
CLA Signed
This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
enhancement
New feature or request
Allows sub-envs to run on a different device than the main env in batched environments (ParallelEnv and SerialEnv).
The idea is to send data to CUDA only once, if needed, rather than using CUDA as a shared-memory container.
The following script shows how to control precisely where the data is passed:
https://gist.github.com/vmoens/d49e733c37356e4465bc3fe85d499db1
Here are the results with a CNN policy:
Sub-envs on CPU, main env and policy on CPU: 789 it/sec (8 procs), 2431.1049 it/sec (32 procs)
Sub-envs on CPU, main env and policy on CUDA: 2697 it/sec (8 procs), 5074.3945 it/sec (32 procs)
Sub-envs on CUDA, main env and policy on CUDA: 840 it/sec (8 procs), OOM (32 procs)
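To put the figures above in relative terms, here is a small sketch that computes each configuration's throughput ratio against the all-CPU baseline. The dictionary layout and the `speedup` helper are illustrative names only; they are not part of TorchRL or of the linked gist. The numbers are copied from the results above.

```python
# Throughput figures (it/sec) copied from the results reported above.
# Keys are (sub-env device, main-env device); values map process count to it/sec.
# This layout and the helper below are hypothetical, for illustration only.
results = {
    ("cpu", "cpu"): {8: 789.0, 32: 2431.1049},
    ("cpu", "cuda"): {8: 2697.0, 32: 5074.3945},
    ("cuda", "cuda"): {8: 840.0, 32: None},  # None marks the run that hit OOM
}


def speedup(config, baseline=("cpu", "cpu")):
    """Return {num_procs: throughput ratio of `config` over `baseline`}.

    A ratio is None when either run has no measurement (e.g. OOM).
    """
    ratios = {}
    for procs, base in results[baseline].items():
        value = results[config][procs]
        if value is None or base is None:
            ratios[procs] = None
        else:
            ratios[procs] = value / base
    return ratios


if __name__ == "__main__":
    for config in results:
        print(config, speedup(config))
```

With these numbers, keeping sub-envs on CPU and moving only the main env and policy to CUDA gives roughly 3.4x the all-CPU throughput at 8 processes and roughly 2.1x at 32 processes, while the all-CUDA setup is only about 1.06x at 8 processes.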
@skandermoalla I believe this will solve your perf problem