[Feature] Threaded collection and parallel envs #1559

Merged 10 commits on Sep 22, 2023

vmoens
Contributor

@vmoens vmoens commented Sep 21, 2023

Description

This PR proposes to automatically adapt the number of threads of parallel envs and collectors based on the number of workers that are required.
For each, the number of threads is set to num_workers on the parent and to 1 (if not specified otherwise) on the children.

It solves the problem that all workers were asking for as many threads as there are CPUs on the machine, thereby maxing out the CPU workload.

For a parallel env (N workers) within a collector (M workers), the main process will have M threads and create workers that each have 1 thread. Each worker will then reset its number of threads to N and create sub-processes that each have only 1 thread.
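The thread arithmetic described above can be sketched as follows. This is only a back-of-the-envelope illustration, not TorchRL code: the function names are hypothetical, the values M=8, N=4 and 96 CPUs are illustrative, and the "before" figure assumes that every process requests one thread per CPU, as the description suggests.

```python
def threads_after(num_collectors: int, envs_per_collector: int) -> int:
    """Total threads requested under the scheme in the description (assumed model)."""
    main = num_collectors                             # main process: M threads
    workers = num_collectors * envs_per_collector     # M workers, each reset to N threads
    env_procs = num_collectors * envs_per_collector   # M*N sub-processes, 1 thread each
    return main + workers + env_procs


def threads_before(num_cpus: int, num_collectors: int, envs_per_collector: int) -> int:
    """Total threads requested if every process asks for one thread per CPU."""
    num_processes = 1 + num_collectors + num_collectors * envs_per_collector
    return num_processes * num_cpus


if __name__ == "__main__":
    M, N, CPUS = 8, 4, 96  # illustrative values only
    print("before:", threads_before(CPUS, M, N))
    print("after: ", threads_after(M, N))
```

Under these assumptions the requested thread count drops from thousands to well under the number of CPUs, which is the oversubscription the PR aims to remove.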

cc @skandermoalla

# Conflicts:
#	test/test_specs.py
#	torchrl/_utils.py
@facebook-github-bot added the CLA Signed label Sep 21, 2023
@vmoens added the enhancement and performance labels Sep 21, 2023
@github-actions

github-actions bot commented Sep 21, 2023

⚠ Warning: Result of CPU Benchmark Tests

Total Benchmarks: 89. Improved: 2. Worsened: 16.

| Name | Max | Mean | Ops | Ops on Repo HEAD | Change |
| --- | --- | --- | --- | --- | --- |
| test_single | 75.7331ms | 75.3728ms | 13.2674 Ops/s | 12.9456 Ops/s | +2.49% |
| test_sync | 0.1234s | 43.9157ms | 22.7709 Ops/s | 22.1297 Ops/s | +2.90% |
| test_async | 57.3507ms | 39.6664ms | 25.2103 Ops/s | 24.5132 Ops/s | +2.84% |
| test_simple | 0.7161s | 0.6433s | 1.5546 Ops/s | 1.5449 Ops/s | +0.62% |
| test_transformed | 0.8805s | 0.8366s | 1.1954 Ops/s | 1.2142 Ops/s | -1.55% |
| test_serial | 1.8527s | 1.7887s | 0.5591 Ops/s | 0.5515 Ops/s | +1.37% |
| test_parallel | 1.6683s | 1.5869s | 0.6301 Ops/s | 0.6054 Ops/s | +4.08% |
| test_step_mdp_speed[True-True-True-True-True] | 0.2046ms | 43.8864μs | 22.7861 KOps/s | 22.4219 KOps/s | +1.62% |
| test_step_mdp_speed[True-True-True-True-False] | 0.1660ms | 25.0369μs | 39.9411 KOps/s | 39.3735 KOps/s | +1.44% |
| test_step_mdp_speed[True-True-True-False-True] | 0.1878ms | 31.1343μs | 32.1189 KOps/s | 31.8413 KOps/s | +0.87% |
| test_step_mdp_speed[True-True-True-False-False] | 0.1818ms | 17.2341μs | 58.0246 KOps/s | 56.9421 KOps/s | +1.90% |
| test_step_mdp_speed[True-True-False-True-True] | 0.2078ms | 45.5779μs | 21.9405 KOps/s | 21.6470 KOps/s | +1.36% |
| test_step_mdp_speed[True-True-False-True-False] | 0.1911ms | 26.6725μs | 37.4917 KOps/s | 36.7634 KOps/s | +1.98% |
| test_step_mdp_speed[True-True-False-False-True] | 0.1888ms | 33.0425μs | 30.2640 KOps/s | 28.5368 KOps/s | **+6.05%** |
| test_step_mdp_speed[True-True-False-False-False] | 0.1784ms | 19.4700μs | 51.3610 KOps/s | 50.8005 KOps/s | +1.10% |
| test_step_mdp_speed[True-False-True-True-True] | 0.2058ms | 47.6401μs | 20.9907 KOps/s | 20.5899 KOps/s | +1.95% |
| test_step_mdp_speed[True-False-True-True-False] | 0.3140ms | 28.9532μs | 34.5385 KOps/s | 34.6469 KOps/s | -0.31% |
| test_step_mdp_speed[True-False-True-False-True] | 0.1015ms | 33.1660μs | 30.1514 KOps/s | 29.7298 KOps/s | +1.42% |
| test_step_mdp_speed[True-False-True-False-False] | 57.8000μs | 19.5152μs | 51.2422 KOps/s | 51.2973 KOps/s | -0.11% |
| test_step_mdp_speed[True-False-False-True-True] | 0.1416ms | 49.1496μs | 20.3460 KOps/s | 20.1609 KOps/s | +0.92% |
| test_step_mdp_speed[True-False-False-True-False] | 0.1450ms | 30.5690μs | 32.7129 KOps/s | 32.9290 KOps/s | -0.66% |
| test_step_mdp_speed[True-False-False-False-True] | 65.2010μs | 34.6842μs | 28.8316 KOps/s | 28.7955 KOps/s | +0.13% |
| test_step_mdp_speed[True-False-False-False-False] | 58.3010μs | 21.0598μs | 47.4839 KOps/s | 47.3671 KOps/s | +0.25% |
| test_step_mdp_speed[False-True-True-True-True] | 0.1669ms | 47.2083μs | 21.1827 KOps/s | 20.9366 KOps/s | +1.18% |
| test_step_mdp_speed[False-True-True-True-False] | 0.1249ms | 28.6167μs | 34.9447 KOps/s | 34.3321 KOps/s | +1.78% |
| test_step_mdp_speed[False-True-True-False-True] | 69.7010μs | 36.9550μs | 27.0599 KOps/s | 27.0891 KOps/s | -0.11% |
| test_step_mdp_speed[False-True-True-False-False] | 0.1084ms | 21.2751μs | 47.0032 KOps/s | 45.5881 KOps/s | +3.10% |
| test_step_mdp_speed[False-True-False-True-True] | 0.1387ms | 49.4872μs | 20.2072 KOps/s | 20.1971 KOps/s | +0.05% |
| test_step_mdp_speed[False-True-False-True-False] | 56.3000μs | 30.4492μs | 32.8416 KOps/s | 32.5476 KOps/s | +0.90% |
| test_step_mdp_speed[False-True-False-False-True] | 0.1121ms | 38.9208μs | 25.6932 KOps/s | 25.9564 KOps/s | -1.01% |
| test_step_mdp_speed[False-True-False-False-False] | 0.1345ms | 23.3311μs | 42.8612 KOps/s | 42.6911 KOps/s | +0.40% |
| test_step_mdp_speed[False-False-True-True-True] | 0.1495ms | 51.1633μs | 19.5452 KOps/s | 19.4567 KOps/s | +0.46% |
| test_step_mdp_speed[False-False-True-True-False] | 0.1351ms | 32.2953μs | 30.9643 KOps/s | 30.8688 KOps/s | +0.31% |
| test_step_mdp_speed[False-False-True-False-True] | 89.4010μs | 38.5784μs | 25.9212 KOps/s | 25.8017 KOps/s | +0.46% |
| test_step_mdp_speed[False-False-True-False-False] | 0.1245ms | 23.2085μs | 43.0877 KOps/s | 43.0673 KOps/s | +0.05% |
| test_step_mdp_speed[False-False-False-True-True] | 87.0010μs | 51.7244μs | 19.3332 KOps/s | 19.2550 KOps/s | +0.41% |
| test_step_mdp_speed[False-False-False-True-False] | 0.1200ms | 33.6658μs | 29.7037 KOps/s | 29.5347 KOps/s | +0.57% |
| test_step_mdp_speed[False-False-False-False-True] | 0.1318ms | 39.5864μs | 25.2612 KOps/s | 24.9561 KOps/s | +1.22% |
| test_step_mdp_speed[False-False-False-False-False] | 57.7010μs | 24.6965μs | 40.4916 KOps/s | 40.6593 KOps/s | -0.41% |
| test_values[generalized_advantage_estimate-True-True] | 15.2543ms | 14.6749ms | 68.1434 Ops/s | 70.7416 Ops/s | -3.67% |
| test_values[vec_generalized_advantage_estimate-True-True] | 49.7779ms | 43.9560ms | 22.7500 Ops/s | 22.9874 Ops/s | -1.03% |
| test_values[td0_return_estimate-False-False] | 0.7981ms | 0.5115ms | 1.9551 KOps/s | 4.0092 KOps/s | **-51.23%** |
| test_values[td1_return_estimate-False-False] | 14.6661ms | 14.0187ms | 71.3331 Ops/s | 72.5173 Ops/s | -1.63% |
| test_values[vec_td1_return_estimate-False-False] | 52.9204ms | 44.5584ms | 22.4425 Ops/s | 23.5957 Ops/s | -4.89% |
| test_values[td_lambda_return_estimate-True-False] | 54.5282ms | 33.6613ms | 29.7077 Ops/s | 30.8829 Ops/s | -3.81% |
| test_values[vec_td_lambda_return_estimate-True-False] | 54.7905ms | 45.4336ms | 22.0102 Ops/s | 23.9138 Ops/s | **-7.96%** |
| test_gae_speed[generalized_advantage_estimate-False-1-512] | 12.0347ms | 11.8689ms | 84.2535 Ops/s | 84.3864 Ops/s | -0.16% |
| test_gae_speed[vec_generalized_advantage_estimate-True-1-512] | 6.4456ms | 4.2425ms | 235.7124 Ops/s | 263.8863 Ops/s | **-10.68%** |
| test_gae_speed[vec_generalized_advantage_estimate-False-1-512] | 3.9257ms | 0.6279ms | 1.5927 KOps/s | 2.0992 KOps/s | **-24.13%** |
| test_gae_speed[vec_generalized_advantage_estimate-True-32-512] | 62.0475ms | 58.0644ms | 17.2222 Ops/s | 17.0793 Ops/s | +0.84% |
| test_gae_speed[vec_generalized_advantage_estimate-False-32-512] | 8.9486ms | 3.9702ms | 251.8737 Ops/s | 331.9010 Ops/s | **-24.11%** |
| test_dqn_speed | 12.7651ms | 2.7257ms | 366.8807 Ops/s | 504.6677 Ops/s | **-27.30%** |
| test_ddpg_speed | 13.3077ms | 4.9427ms | 202.3179 Ops/s | 336.1731 Ops/s | **-39.82%** |
| test_sac_speed | 18.6421ms | 13.1271ms | 76.1784 Ops/s | 106.9725 Ops/s | **-28.79%** |
| test_redq_speed | 26.6502ms | 21.5058ms | 46.4990 Ops/s | 58.1203 Ops/s | **-20.00%** |
| test_redq_deprec_speed | 25.7620ms | 20.2092ms | 49.4824 Ops/s | 71.0173 Ops/s | **-30.32%** |
| test_td3_speed | 16.3880ms | 14.2218ms | 70.3147 Ops/s | 87.7933 Ops/s | **-19.91%** |
| test_cql_speed | 48.0665ms | 42.1752ms | 23.7106 Ops/s | 35.6136 Ops/s | **-33.42%** |
| test_a2c_speed | 15.4078ms | 8.9058ms | 112.2861 Ops/s | 168.7596 Ops/s | **-33.46%** |
| test_ppo_speed | 20.0720ms | 9.2973ms | 107.5582 Ops/s | 154.1854 Ops/s | **-30.24%** |
| test_reinforce_speed | 13.2211ms | 6.8383ms | 146.2350 Ops/s | 215.1402 Ops/s | **-32.03%** |
| test_iql_speed | 36.4883ms | 30.2844ms | 33.0203 Ops/s | 43.4531 Ops/s | **-24.01%** |
| test_sample_rb[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 3.5782ms | 2.7729ms | 360.6281 Ops/s | 355.6967 Ops/s | +1.39% |
| test_sample_rb[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 4.9697ms | 2.9464ms | 339.3997 Ops/s | 345.4073 Ops/s | -1.74% |
| test_sample_rb[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 4.9432ms | 2.9342ms | 340.8068 Ops/s | 340.4641 Ops/s | +0.10% |
| test_sample_rb[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 3.7508ms | 2.7612ms | 362.1630 Ops/s | 362.1479 Ops/s | +0.00% |
| test_sample_rb[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 5.7200ms | 2.9392ms | 340.2327 Ops/s | 343.5287 Ops/s | -0.96% |
| test_sample_rb[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 6.0146ms | 2.9977ms | 333.5840 Ops/s | 338.8151 Ops/s | -1.54% |
| test_sample_rb[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 3.6116ms | 2.8204ms | 354.5613 Ops/s | 366.2153 Ops/s | -3.18% |
| test_sample_rb[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 4.6606ms | 2.8800ms | 347.2220 Ops/s | 345.3985 Ops/s | +0.53% |
| test_sample_rb[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 5.3652ms | 2.9248ms | 341.9059 Ops/s | 339.0270 Ops/s | +0.85% |
| test_iterate_rb[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] | 3.9663ms | 2.7999ms | 357.1511 Ops/s | 357.9983 Ops/s | -0.24% |
| test_iterate_rb[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 4.9399ms | 2.9634ms | 337.4489 Ops/s | 335.6260 Ops/s | +0.54% |
| test_iterate_rb[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 6.0812ms | 3.0040ms | 332.8844 Ops/s | 337.8363 Ops/s | -1.47% |
| test_iterate_rb[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 3.9558ms | 2.8681ms | 348.6609 Ops/s | 286.9213 Ops/s | **+21.52%** |
| test_iterate_rb[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 5.4354ms | 2.9740ms | 336.2436 Ops/s | 341.3412 Ops/s | -1.49% |
| test_iterate_rb[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 4.9843ms | 2.9554ms | 338.3689 Ops/s | 338.9144 Ops/s | -0.16% |
| test_iterate_rb[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 3.8418ms | 2.8131ms | 355.4757 Ops/s | 367.4240 Ops/s | -3.25% |
| test_iterate_rb[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 6.1892ms | 2.9862ms | 334.8747 Ops/s | 335.5980 Ops/s | -0.22% |
| test_iterate_rb[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 4.9501ms | 2.9053ms | 344.2040 Ops/s | 340.0718 Ops/s | +1.22% |
| test_populate_rb[TensorDictReplayBuffer-ListStorage-RandomSampler-400] | 0.2939s | 31.1425ms | 32.1105 Ops/s | 32.2071 Ops/s | -0.30% |
| test_populate_rb[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] | 0.1666s | 31.3369ms | 31.9113 Ops/s | 32.8082 Ops/s | -2.73% |
| test_populate_rb[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] | 0.1593s | 28.4534ms | 35.1452 Ops/s | 35.7610 Ops/s | -1.72% |
| test_populate_rb[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] | 0.1530s | 30.4768ms | 32.8118 Ops/s | 32.8460 Ops/s | -0.10% |
| test_populate_rb[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] | 0.1548s | 28.0847ms | 35.6066 Ops/s | 35.7792 Ops/s | -0.48% |
| test_populate_rb[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] | 0.1669s | 31.2931ms | 31.9559 Ops/s | 32.2206 Ops/s | -0.82% |
| test_populate_rb[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] | 0.1678s | 29.1352ms | 34.3227 Ops/s | 35.3259 Ops/s | -2.84% |
| test_populate_rb[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] | 0.1649s | 31.4949ms | 31.7511 Ops/s | 32.8513 Ops/s | -3.35% |
| test_populate_rb[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] | 0.1659s | 31.0163ms | 32.2411 Ops/s | 33.0951 Ops/s | -2.58% |
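
For readers checking the table, the Change column appears to be the relative difference between the Ops value and the Ops on Repo HEAD value. The sketch below recomputes it for three rows copied from the table; the helper name `relative_change` is hypothetical, and the formula is inferred from the numbers rather than taken from the benchmark tooling.

```python
def relative_change(ops: float, ops_head: float) -> float:
    """Percentage change in throughput relative to the repo HEAD (inferred formula)."""
    return (ops - ops_head) / ops_head * 100.0


# (Ops, Ops on Repo HEAD) pairs copied from the table; units cancel out per row.
rows = {
    "test_single": (13.2674, 12.9456),
    "test_values[td0_return_estimate-False-False]": (1.9551, 4.0092),
    "test_dqn_speed": (366.8807, 504.6677),
}

for name, (ops, ops_head) in rows.items():
    print(f"{name}: {relative_change(ops, ops_head):+.2f}%")
```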

@skandermoalla
Contributor

Quick check: what would be the default behavior of a SyncDataCollector with a ParallelEnv passed to its constructor?

Co-authored-by: Matteo Bettini <55539777+matteobettini@users.noreply.github.com>
@vmoens
Contributor Author

vmoens commented Sep 21, 2023

> Quick check: what would be the default behavior of a SyncDataCollector with a ParallelEnv passed to its constructor?

Good question.

1. Open your script and import torch: the process has X threads (X = number of CPUs).
2. Create the parallel env: the process's threads are reduced to N (N = number of workers of the parallel env).
3. Create the sync data collector: no effect, still N threads.
4. The parallel env launches N processes, each with 1 thread.

@vmoens vmoens marked this pull request as ready for review September 21, 2023 18:12
@vmoens
Contributor Author

vmoens commented Sep 21, 2023

@matteobettini @skandermoalla I don't really see a proper way of testing this, except with (pointless?) mocks
Agree to leave it as it is?

Contributor

@matteobettini matteobettini left a comment

LGTM

can't we launch parallel processes and assert on torch.get_num_threads?

@vmoens
Contributor Author

vmoens commented Sep 22, 2023

Ok, I implemented some tests!

@vmoens vmoens merged commit 09e148b into main Sep 22, 2023
@vmoens vmoens deleted the threads_mp branch September 22, 2023 10:41
@skandermoalla
Contributor

Thanks so much for this. Will test with my setup and give feedback!

@vmoens
Contributor Author

vmoens commented Sep 22, 2023

@skandermoalla @matteobettini @Vittorio-Caggiano
With my setup, I get roughly these numbers in terms of throughput (on #1539 + this)
Always 32 procs, a machine with 96 cpus and 1 A100

Atari (Pong-v5):
TorchRL ParallelEnv.rollout: 10281.7697
TorchRL ParallelEnv within SyncDataCollector (to test collector overhead): 8217.6762
TorchRL GymEnv + gym async env + env.rollout: 9554.7727
TorchRL ParallelEnv (N=4/worker) across 8 collectors, async: 22023.7621
TorchRL GymEnv + gym async env (N=4/worker) across 8 collectors, async: 17997.9470
TorchRL ParallelEnv (N=4/worker) across 8 collectors, async: 10806.8827
TorchRL GymEnv + gym async env (N=4/worker) across 8 collectors, async: 10141.7132

Myo:
TorchRL ParallelEnv.rollout: 12346.7140
TorchRL ParallelEnv within SyncDataCollector (to test collector overhead): 10127.3826
TorchRL GymEnv + gym async env + env.rollout: 7444.5473
TorchRL ParallelEnv (N=4/worker) across 8 collectors, async: 18115.6079
TorchRL GymEnv + gym async env (N=4/worker) across 8 collectors, async: 14534.9367
TorchRL ParallelEnv (N=4/worker) across 8 collectors, async: 12997.6912
TorchRL GymEnv + gym async env (N=4/worker) across 8 collectors, async: 11727.7253

I can (and will) run some more extensive benchmarks.

To me, the most instructive takeaway (beyond the fact that we're now much more competitive) is that the sync data collector is slower than a plain env.rollout, so there is some overhead there that needs to be taken care of. The other takeaway is that sync data collectors don't really help here.
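
To put rough numbers on that overhead, here is a small sketch that compares the figures quoted above. The dictionary keys and the helper `collector_overhead` are hypothetical names for illustration; the values are copied from the comment, whose units are not stated.

```python
# Throughput figures copied from the comment above (units as reported there).
THROUGHPUT = {
    "Atari (Pong-v5)": {
        "rollout": 10281.7697,             # ParallelEnv.rollout
        "sync_collector": 8217.6762,       # ParallelEnv within SyncDataCollector
        "async_collectors": 22023.7621,    # ParallelEnv (N=4/worker), 8 collectors, async
    },
    "Myo": {
        "rollout": 12346.7140,
        "sync_collector": 10127.3826,
        "async_collectors": 18115.6079,
    },
}


def collector_overhead(rollout: float, in_collector: float) -> float:
    """Fraction of rollout throughput lost when the env runs inside a collector."""
    return 1.0 - in_collector / rollout


for env_name, t in THROUGHPUT.items():
    overhead = collector_overhead(t["rollout"], t["sync_collector"])
    speedup = t["async_collectors"] / t["rollout"]
    print(f"{env_name}: collector overhead {overhead:.1%}, async speedup {speedup:.2f}x")
```

On these figures the SyncDataCollector costs roughly 20% of the plain rollout throughput on Atari and roughly 18% on Myo.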

Anyhow, good progress, team!

@skandermoalla
Contributor

Thanks a lot for this! It'd be great if the code of the benchmarks is also made available so that we can compare.

vmoens added a commit to hyerra/rl that referenced this pull request Oct 10, 2023
Co-authored-by: Matteo Bettini <55539777+matteobettini@users.noreply.github.com>
4 participants