[Feature] Threaded collection and parallel envs #1559
Name | Max | Mean | Ops | Ops on Repo HEAD | Change |
---|---|---|---|---|---|
test_single | 75.7331ms | 75.3728ms | 13.2674 Ops/s | 12.9456 Ops/s | |
test_sync | 0.1234s | 43.9157ms | 22.7709 Ops/s | 22.1297 Ops/s | |
test_async | 57.3507ms | 39.6664ms | 25.2103 Ops/s | 24.5132 Ops/s | |
test_simple | 0.7161s | 0.6433s | 1.5546 Ops/s | 1.5449 Ops/s | |
test_transformed | 0.8805s | 0.8366s | 1.1954 Ops/s | 1.2142 Ops/s | |
test_serial | 1.8527s | 1.7887s | 0.5591 Ops/s | 0.5515 Ops/s | |
test_parallel | 1.6683s | 1.5869s | 0.6301 Ops/s | 0.6054 Ops/s | |
test_step_mdp_speed[True-True-True-True-True] | 0.2046ms | 43.8864μs | 22.7861 KOps/s | 22.4219 KOps/s | |
test_step_mdp_speed[True-True-True-True-False] | 0.1660ms | 25.0369μs | 39.9411 KOps/s | 39.3735 KOps/s | |
test_step_mdp_speed[True-True-True-False-True] | 0.1878ms | 31.1343μs | 32.1189 KOps/s | 31.8413 KOps/s | |
test_step_mdp_speed[True-True-True-False-False] | 0.1818ms | 17.2341μs | 58.0246 KOps/s | 56.9421 KOps/s | |
test_step_mdp_speed[True-True-False-True-True] | 0.2078ms | 45.5779μs | 21.9405 KOps/s | 21.6470 KOps/s | |
test_step_mdp_speed[True-True-False-True-False] | 0.1911ms | 26.6725μs | 37.4917 KOps/s | 36.7634 KOps/s | |
test_step_mdp_speed[True-True-False-False-True] | 0.1888ms | 33.0425μs | 30.2640 KOps/s | 28.5368 KOps/s | |
test_step_mdp_speed[True-True-False-False-False] | 0.1784ms | 19.4700μs | 51.3610 KOps/s | 50.8005 KOps/s | |
test_step_mdp_speed[True-False-True-True-True] | 0.2058ms | 47.6401μs | 20.9907 KOps/s | 20.5899 KOps/s | |
test_step_mdp_speed[True-False-True-True-False] | 0.3140ms | 28.9532μs | 34.5385 KOps/s | 34.6469 KOps/s | |
test_step_mdp_speed[True-False-True-False-True] | 0.1015ms | 33.1660μs | 30.1514 KOps/s | 29.7298 KOps/s | |
test_step_mdp_speed[True-False-True-False-False] | 57.8000μs | 19.5152μs | 51.2422 KOps/s | 51.2973 KOps/s | |
test_step_mdp_speed[True-False-False-True-True] | 0.1416ms | 49.1496μs | 20.3460 KOps/s | 20.1609 KOps/s | |
test_step_mdp_speed[True-False-False-True-False] | 0.1450ms | 30.5690μs | 32.7129 KOps/s | 32.9290 KOps/s | |
test_step_mdp_speed[True-False-False-False-True] | 65.2010μs | 34.6842μs | 28.8316 KOps/s | 28.7955 KOps/s | |
test_step_mdp_speed[True-False-False-False-False] | 58.3010μs | 21.0598μs | 47.4839 KOps/s | 47.3671 KOps/s | |
test_step_mdp_speed[False-True-True-True-True] | 0.1669ms | 47.2083μs | 21.1827 KOps/s | 20.9366 KOps/s | |
test_step_mdp_speed[False-True-True-True-False] | 0.1249ms | 28.6167μs | 34.9447 KOps/s | 34.3321 KOps/s | |
test_step_mdp_speed[False-True-True-False-True] | 69.7010μs | 36.9550μs | 27.0599 KOps/s | 27.0891 KOps/s | |
test_step_mdp_speed[False-True-True-False-False] | 0.1084ms | 21.2751μs | 47.0032 KOps/s | 45.5881 KOps/s | |
test_step_mdp_speed[False-True-False-True-True] | 0.1387ms | 49.4872μs | 20.2072 KOps/s | 20.1971 KOps/s | |
test_step_mdp_speed[False-True-False-True-False] | 56.3000μs | 30.4492μs | 32.8416 KOps/s | 32.5476 KOps/s | |
test_step_mdp_speed[False-True-False-False-True] | 0.1121ms | 38.9208μs | 25.6932 KOps/s | 25.9564 KOps/s | |
test_step_mdp_speed[False-True-False-False-False] | 0.1345ms | 23.3311μs | 42.8612 KOps/s | 42.6911 KOps/s | |
test_step_mdp_speed[False-False-True-True-True] | 0.1495ms | 51.1633μs | 19.5452 KOps/s | 19.4567 KOps/s | |
test_step_mdp_speed[False-False-True-True-False] | 0.1351ms | 32.2953μs | 30.9643 KOps/s | 30.8688 KOps/s | |
test_step_mdp_speed[False-False-True-False-True] | 89.4010μs | 38.5784μs | 25.9212 KOps/s | 25.8017 KOps/s | |
test_step_mdp_speed[False-False-True-False-False] | 0.1245ms | 23.2085μs | 43.0877 KOps/s | 43.0673 KOps/s | |
test_step_mdp_speed[False-False-False-True-True] | 87.0010μs | 51.7244μs | 19.3332 KOps/s | 19.2550 KOps/s | |
test_step_mdp_speed[False-False-False-True-False] | 0.1200ms | 33.6658μs | 29.7037 KOps/s | 29.5347 KOps/s | |
test_step_mdp_speed[False-False-False-False-True] | 0.1318ms | 39.5864μs | 25.2612 KOps/s | 24.9561 KOps/s | |
test_step_mdp_speed[False-False-False-False-False] | 57.7010μs | 24.6965μs | 40.4916 KOps/s | 40.6593 KOps/s | |
test_values[generalized_advantage_estimate-True-True] | 15.2543ms | 14.6749ms | 68.1434 Ops/s | 70.7416 Ops/s | |
test_values[vec_generalized_advantage_estimate-True-True] | 49.7779ms | 43.9560ms | 22.7500 Ops/s | 22.9874 Ops/s | |
test_values[td0_return_estimate-False-False] | 0.7981ms | 0.5115ms | 1.9551 KOps/s | 4.0092 KOps/s | |
test_values[td1_return_estimate-False-False] | 14.6661ms | 14.0187ms | 71.3331 Ops/s | 72.5173 Ops/s | |
test_values[vec_td1_return_estimate-False-False] | 52.9204ms | 44.5584ms | 22.4425 Ops/s | 23.5957 Ops/s | |
test_values[td_lambda_return_estimate-True-False] | 54.5282ms | 33.6613ms | 29.7077 Ops/s | 30.8829 Ops/s | |
test_values[vec_td_lambda_return_estimate-True-False] | 54.7905ms | 45.4336ms | 22.0102 Ops/s | 23.9138 Ops/s | |
test_gae_speed[generalized_advantage_estimate-False-1-512] | 12.0347ms | 11.8689ms | 84.2535 Ops/s | 84.3864 Ops/s | |
test_gae_speed[vec_generalized_advantage_estimate-True-1-512] | 6.4456ms | 4.2425ms | 235.7124 Ops/s | 263.8863 Ops/s | |
test_gae_speed[vec_generalized_advantage_estimate-False-1-512] | 3.9257ms | 0.6279ms | 1.5927 KOps/s | 2.0992 KOps/s | |
test_gae_speed[vec_generalized_advantage_estimate-True-32-512] | 62.0475ms | 58.0644ms | 17.2222 Ops/s | 17.0793 Ops/s | |
test_gae_speed[vec_generalized_advantage_estimate-False-32-512] | 8.9486ms | 3.9702ms | 251.8737 Ops/s | 331.9010 Ops/s | |
test_dqn_speed | 12.7651ms | 2.7257ms | 366.8807 Ops/s | 504.6677 Ops/s | |
test_ddpg_speed | 13.3077ms | 4.9427ms | 202.3179 Ops/s | 336.1731 Ops/s | |
test_sac_speed | 18.6421ms | 13.1271ms | 76.1784 Ops/s | 106.9725 Ops/s | |
test_redq_speed | 26.6502ms | 21.5058ms | 46.4990 Ops/s | 58.1203 Ops/s | |
test_redq_deprec_speed | 25.7620ms | 20.2092ms | 49.4824 Ops/s | 71.0173 Ops/s | |
test_td3_speed | 16.3880ms | 14.2218ms | 70.3147 Ops/s | 87.7933 Ops/s | |
test_cql_speed | 48.0665ms | 42.1752ms | 23.7106 Ops/s | 35.6136 Ops/s | |
test_a2c_speed | 15.4078ms | 8.9058ms | 112.2861 Ops/s | 168.7596 Ops/s | |
test_ppo_speed | 20.0720ms | 9.2973ms | 107.5582 Ops/s | 154.1854 Ops/s | |
test_reinforce_speed | 13.2211ms | 6.8383ms | 146.2350 Ops/s | 215.1402 Ops/s | |
test_iql_speed | 36.4883ms | 30.2844ms | 33.0203 Ops/s | 43.4531 Ops/s | |
test_sample_rb[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 3.5782ms | 2.7729ms | 360.6281 Ops/s | 355.6967 Ops/s | |
test_sample_rb[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 4.9697ms | 2.9464ms | 339.3997 Ops/s | 345.4073 Ops/s | |
test_sample_rb[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 4.9432ms | 2.9342ms | 340.8068 Ops/s | 340.4641 Ops/s | |
test_sample_rb[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 3.7508ms | 2.7612ms | 362.1630 Ops/s | 362.1479 Ops/s | |
test_sample_rb[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 5.7200ms | 2.9392ms | 340.2327 Ops/s | 343.5287 Ops/s | |
test_sample_rb[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 6.0146ms | 2.9977ms | 333.5840 Ops/s | 338.8151 Ops/s | |
test_sample_rb[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 3.6116ms | 2.8204ms | 354.5613 Ops/s | 366.2153 Ops/s | |
test_sample_rb[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 4.6606ms | 2.8800ms | 347.2220 Ops/s | 345.3985 Ops/s | |
test_sample_rb[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 5.3652ms | 2.9248ms | 341.9059 Ops/s | 339.0270 Ops/s | |
test_iterate_rb[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 3.9663ms | 2.7999ms | 357.1511 Ops/s | 357.9983 Ops/s | |
test_iterate_rb[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 4.9399ms | 2.9634ms | 337.4489 Ops/s | 335.6260 Ops/s | |
test_iterate_rb[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 6.0812ms | 3.0040ms | 332.8844 Ops/s | 337.8363 Ops/s | |
test_iterate_rb[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 3.9558ms | 2.8681ms | 348.6609 Ops/s | 286.9213 Ops/s | |
test_iterate_rb[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 5.4354ms | 2.9740ms | 336.2436 Ops/s | 341.3412 Ops/s | |
test_iterate_rb[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 4.9843ms | 2.9554ms | 338.3689 Ops/s | 338.9144 Ops/s | |
test_iterate_rb[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 3.8418ms | 2.8131ms | 355.4757 Ops/s | 367.4240 Ops/s | |
test_iterate_rb[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 6.1892ms | 2.9862ms | 334.8747 Ops/s | 335.5980 Ops/s | |
test_iterate_rb[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 4.9501ms | 2.9053ms | 344.2040 Ops/s | 340.0718 Ops/s | |
test_populate_rb[TensorDictReplayBuffer-ListStorage-RandomSampler-400] | 0.2939s | 31.1425ms | 32.1105 Ops/s | 32.2071 Ops/s | |
test_populate_rb[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] | 0.1666s | 31.3369ms | 31.9113 Ops/s | 32.8082 Ops/s | |
test_populate_rb[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] | 0.1593s | 28.4534ms | 35.1452 Ops/s | 35.7610 Ops/s | |
test_populate_rb[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] | 0.1530s | 30.4768ms | 32.8118 Ops/s | 32.8460 Ops/s | |
test_populate_rb[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] | 0.1548s | 28.0847ms | 35.6066 Ops/s | 35.7792 Ops/s | |
test_populate_rb[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] | 0.1669s | 31.2931ms | 31.9559 Ops/s | 32.2206 Ops/s | |
test_populate_rb[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] | 0.1678s | 29.1352ms | 34.3227 Ops/s | 35.3259 Ops/s | |
test_populate_rb[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] | 0.1649s | 31.4949ms | 31.7511 Ops/s | 32.8513 Ops/s | |
test_populate_rb[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] | 0.1659s | 31.0163ms | 32.2411 Ops/s | 33.0951 Ops/s |
Quick check: what would be the default behavior of a |
Co-authored-by: Matteo Bettini <55539777+matteobettini@users.noreply.github.com>
Good question. Open your script, import torch -> the process has X threads (X = number of CPUs).
@matteobettini @skandermoalla I don't really see a proper way of testing this, except with (pointless?) mocks |
LGTM
Can't we launch parallel processes and assert on `torch.get_num_threads`?
Ok, I implemented some tests! |
Thanks so much for this. Will test with my setup and give feedback! |
@skandermoalla @matteobettini @Vittorio-Caggiano Atari (Pong-v5): Myo: I can (and will) run some more extensive benchmarks. To me the most instructive take is that (beyond the fact that we're now way more competitive) the sync data collector is slower than plain Anyhow, good progress, team! |
Thanks a lot for this! It'd be great if the benchmark code were also made available so that we can compare.
Description
This PR proposes to automatically adapt the number of threads of parallel envs and collectors based on the number of workers required.
For each, the number of threads is set to `num_workers` on the parent env and to `1` (if not specified otherwise) on the children.
It solves the problem that all workers were asking for as many threads as the number of CPUs on the machine, thereby maxing out the CPU workload.
For a parallel env (N workers) within a collector (M workers), the main process will have `M` threads and create workers that all have 1 thread. Each worker will then reset its number of threads to `N` and create sub-processes that each have 1 thread only.
cc @skandermoalla
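To make the arithmetic above concrete, here is a minimal sketch of the thread bookkeeping for a parallel env (N workers) inside a collector (M workers). It is only an illustration of the counting, not the PR's implementation: `thread_plan` and its argument names are hypothetical, no torch calls are made, and the "before" figure assumes every process requests one thread per CPU, as the description states.

```python
import os


def thread_plan(num_collector_workers, num_env_workers, num_cpus=None):
    """Return the per-process thread counts described in the PR text.

    Hypothetical helper for illustration only; it does not touch torch.
    """
    if num_cpus is None:
        num_cpus = os.cpu_count() or 1
    m, n = num_collector_workers, num_env_workers
    num_subprocesses = m * n  # each collector worker spawns N env sub-processes
    return {
        # Main process: as many threads as collector workers.
        "main": m,
        # Each collector worker starts with 1 thread, then resets to N.
        "collector_worker_initial": 1,
        "collector_worker_after_reset": n,
        # Each env sub-process keeps a single thread.
        "env_subprocess": 1,
        # Adaptive scheme: main + M workers with N threads + M*N single-threaded sub-processes.
        "total_adaptive": m + m * n + num_subprocesses,
        # Previous behaviour: every process asks for one thread per CPU.
        "total_if_every_process_uses_all_cpus": (1 + m + num_subprocesses) * num_cpus,
    }


if __name__ == "__main__":
    # Example: a collector with 2 workers, each running a 4-worker parallel env, on 8 CPUs.
    for key, value in thread_plan(2, 4, num_cpus=8).items():
        print(f"{key}: {value}")
```

With 2 collector workers, 4 env workers each, and 8 CPUs, this counts 18 threads under the adaptive scheme versus 88 when all 11 processes each request 8 threads.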