[Feature] RB MultiStep transform #2008

Merged
merged 10 commits into from
Mar 18, 2024

Conversation

vmoens
Contributor

@vmoens vmoens commented Mar 11, 2024

  • Allow replay buffers to handle extend with no data (this may happen during the first iterations when calling extend / add)
  • Design class
    • Handle change of horizon correctly
  • Test equivalence with MultiStep in collectors
  • Document feature + compare with collector version

Test script

import torch

from torchrl.envs import GymEnv, TransformedEnv, StepCounter, SerialEnv
from torchrl.envs.transforms.rb_transforms import MultiStepTransform
from tensordict.utils import assert_allclose_td

# Batched variant:
# env = TransformedEnv(SerialEnv(2, lambda: GymEnv("CartPole-v1")), StepCounter())
env = TransformedEnv(GymEnv("CartPole-v1"), StepCounter())

env.set_seed(0)
torch.manual_seed(0)

# Reference run: transform a single 250-step rollout in one call.
t = MultiStepTransform(3, 0.98)

outs_2 = []
td = env.reset()
for _ in range(1):
    rollout = env.rollout(250, auto_reset=False, tensordict=td, break_when_any_done=False)
    out = t._inv_call(rollout)
    # Carry the last "next" state forward, as in the chunked loop below.
    td = rollout[..., -1]["next"]
    outs_2.append(out)

# 247 = 250 - 3 transitions come out: the transform holds back the trailing
# steps until more data arrives, so the first chunk has 47 entries.
outs_2 = torch.cat(outs_2, -1).split([47, 50, 50, 50, 50], -1)

# Chunked run: same seed, fresh transform, fed five 50-step rollouts.
t = MultiStepTransform(3, 0.98)

env.set_seed(0)
torch.manual_seed(0)

outs = []
td = env.reset()
for i in range(5):
    rollout = env.rollout(50, auto_reset=False, tensordict=td, break_when_any_done=False)
    out = t._inv_call(rollout)
    # Each chunked output must match the corresponding slice of the reference.
    assert_allclose_td(out, outs_2[i])
    td = rollout[..., -1]["next"]
    outs.append(out)

outs = torch.cat(outs, -1)
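
The chunking check above can be pictured with a toy stand-in. This is only a sketch under stated assumptions: `ToyMultiStep` is a hypothetical class, not torchrl's implementation, and it pairs plain integers rather than tensordicts. It shows why feeding the data in five chunks of 50 should give the same result as one call on 250 steps, and why the first chunk yields 47 items when the shift is 3.

```python
class ToyMultiStep:
    """Toy stand-in (not torchrl code): pairs each step with the one `shift` ahead."""

    def __init__(self, shift):
        self.shift = shift
        self.pending = []  # steps still waiting for their future partner

    def push(self, chunk):
        data = self.pending + list(chunk)
        ready = len(data) - self.shift  # steps whose partner is already available
        if ready <= 0:
            self.pending = data
            return []
        out = [(data[i], data[i + self.shift]) for i in range(ready)]
        self.pending = data[ready:]  # keep the trailing steps for the next call
        return out


steps = list(range(250))

# One call on the whole trajectory.
one_shot = ToyMultiStep(3).push(steps)

# Five calls of 50 steps each on a fresh instance.
chunked_transform = ToyMultiStep(3)
chunk_sizes = []
chunked = []
for start in range(0, 250, 50):
    out = chunked_transform.push(steps[start:start + 50])
    chunk_sizes.append(len(out))
    chunked += out

print(chunk_sizes)          # sizes of the five outputs
print(chunked == one_shot)  # equivalence of the two runs
```

The internal `pending` list plays the role of the state that lets the horizon span two consecutive calls.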

cc @AechPro

pytorch-bot bot commented Mar 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/2008

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures, 1 Unrelated Failure

As of commit ddfce88 with merge base 2b8450c (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 11, 2024

github-actions bot commented Mar 11, 2024

$\color{#D29922}\textsf{\Large⚠\kern{0.2cm}\normalsize Warning}$ Result of CPU Benchmark Tests

Total Benchmarks: 91. Improved: $\large\color{#35bf28}9$. Worsened: $\large\color{#d91a1a}9$.

Name Max Mean Ops Ops on Repo HEAD Change
test_single 55.2397ms 54.3096ms 18.4129 Ops/s 17.0154 Ops/s $\textbf{\color{#35bf28}+8.21\%}$
test_sync 49.1180ms 30.5735ms 32.7081 Ops/s 33.3896 Ops/s $\color{#d91a1a}-2.04\%$
test_async 55.8512ms 28.8483ms 34.6641 Ops/s 34.8361 Ops/s $\color{#d91a1a}-0.49\%$
test_simple 0.4021s 0.3431s 2.9150 Ops/s 2.8794 Ops/s $\color{#35bf28}+1.24\%$
test_transformed 0.5228s 0.4774s 2.0945 Ops/s 2.0813 Ops/s $\color{#35bf28}+0.64\%$
test_serial 1.2587s 1.2006s 0.8329 Ops/s 0.8088 Ops/s $\color{#35bf28}+2.98\%$
test_parallel 1.0810s 1.0446s 0.9573 Ops/s 0.9474 Ops/s $\color{#35bf28}+1.04\%$
test_step_mdp_speed[True-True-True-True-True] 0.1357ms 21.1236μs 47.3404 KOps/s 47.9664 KOps/s $\color{#d91a1a}-1.30\%$
test_step_mdp_speed[True-True-True-True-False] 39.6240μs 12.8604μs 77.7584 KOps/s 78.8775 KOps/s $\color{#d91a1a}-1.42\%$
test_step_mdp_speed[True-True-True-False-True] 67.7060μs 12.5542μs 79.6543 KOps/s 82.1345 KOps/s $\color{#d91a1a}-3.02\%$
test_step_mdp_speed[True-True-True-False-False] 28.2420μs 7.5815μs 131.9006 KOps/s 134.5477 KOps/s $\color{#d91a1a}-1.97\%$
test_step_mdp_speed[True-True-False-True-True] 64.2100μs 22.5734μs 44.2999 KOps/s 45.3420 KOps/s $\color{#d91a1a}-2.30\%$
test_step_mdp_speed[True-True-False-True-False] 51.1950μs 13.9565μs 71.6514 KOps/s 72.0603 KOps/s $\color{#d91a1a}-0.57\%$
test_step_mdp_speed[True-True-False-False-True] 47.4380μs 13.7273μs 72.8475 KOps/s 75.0534 KOps/s $\color{#d91a1a}-2.94\%$
test_step_mdp_speed[True-True-False-False-False] 22.7920μs 8.8217μs 113.3565 KOps/s 115.2522 KOps/s $\color{#d91a1a}-1.64\%$
test_step_mdp_speed[True-False-True-True-True] 84.8280μs 23.5563μs 42.4515 KOps/s 42.7198 KOps/s $\color{#d91a1a}-0.63\%$
test_step_mdp_speed[True-False-True-True-False] 58.5290μs 15.3836μs 65.0044 KOps/s 66.1872 KOps/s $\color{#d91a1a}-1.79\%$
test_step_mdp_speed[True-False-True-False-True] 66.8340μs 13.5837μs 73.6178 KOps/s 70.8827 KOps/s $\color{#35bf28}+3.86\%$
test_step_mdp_speed[True-False-True-False-False] 27.8610μs 8.8230μs 113.3406 KOps/s 117.4087 KOps/s $\color{#d91a1a}-3.46\%$
test_step_mdp_speed[True-False-False-True-True] 71.4330μs 24.9371μs 40.1009 KOps/s 40.9825 KOps/s $\color{#d91a1a}-2.15\%$
test_step_mdp_speed[True-False-False-True-False] 60.6630μs 16.4484μs 60.7963 KOps/s 62.0824 KOps/s $\color{#d91a1a}-2.07\%$
test_step_mdp_speed[True-False-False-False-True] 34.2440μs 14.7981μs 67.5762 KOps/s 69.2625 KOps/s $\color{#d91a1a}-2.43\%$
test_step_mdp_speed[True-False-False-False-False] 54.1510μs 9.8606μs 101.4142 KOps/s 103.0649 KOps/s $\color{#d91a1a}-1.60\%$
test_step_mdp_speed[False-True-True-True-True] 56.9060μs 23.7541μs 42.0979 KOps/s 42.9149 KOps/s $\color{#d91a1a}-1.90\%$
test_step_mdp_speed[False-True-True-True-False] 59.9410μs 15.3455μs 65.1656 KOps/s 66.1454 KOps/s $\color{#d91a1a}-1.48\%$
test_step_mdp_speed[False-True-True-False-True] 41.1060μs 15.7380μs 63.5403 KOps/s 65.1623 KOps/s $\color{#d91a1a}-2.49\%$
test_step_mdp_speed[False-True-True-False-False] 58.3980μs 9.9359μs 100.6455 KOps/s 102.7644 KOps/s $\color{#d91a1a}-2.06\%$
test_step_mdp_speed[False-True-False-True-True] 37.1890μs 25.2497μs 39.6045 KOps/s 40.7799 KOps/s $\color{#d91a1a}-2.88\%$
test_step_mdp_speed[False-True-False-True-False] 62.0860μs 16.2656μs 61.4794 KOps/s 61.6155 KOps/s $\color{#d91a1a}-0.22\%$
test_step_mdp_speed[False-True-False-False-True] 38.1710μs 16.8938μs 59.1935 KOps/s 59.9536 KOps/s $\color{#d91a1a}-1.27\%$
test_step_mdp_speed[False-True-False-False-False] 53.5790μs 11.0696μs 90.3374 KOps/s 91.5644 KOps/s $\color{#d91a1a}-1.34\%$
test_step_mdp_speed[False-False-True-True-True] 80.1690μs 25.9467μs 38.5406 KOps/s 39.1314 KOps/s $\color{#d91a1a}-1.51\%$
test_step_mdp_speed[False-False-True-True-False] 57.9980μs 17.7593μs 56.3086 KOps/s 57.0719 KOps/s $\color{#d91a1a}-1.34\%$
test_step_mdp_speed[False-False-True-False-True] 68.8580μs 16.9200μs 59.1016 KOps/s 59.9579 KOps/s $\color{#d91a1a}-1.43\%$
test_step_mdp_speed[False-False-True-False-False] 56.2440μs 11.1648μs 89.5676 KOps/s 91.9210 KOps/s $\color{#d91a1a}-2.56\%$
test_step_mdp_speed[False-False-False-True-True] 73.6370μs 27.0882μs 36.9164 KOps/s 37.2905 KOps/s $\color{#d91a1a}-1.00\%$
test_step_mdp_speed[False-False-False-True-False] 56.3040μs 18.7844μs 53.2357 KOps/s 53.9330 KOps/s $\color{#d91a1a}-1.29\%$
test_step_mdp_speed[False-False-False-False-True] 63.5980μs 17.9066μs 55.8453 KOps/s 56.4411 KOps/s $\color{#d91a1a}-1.06\%$
test_step_mdp_speed[False-False-False-False-False] 60.0710μs 12.1551μs 82.2698 KOps/s 83.7082 KOps/s $\color{#d91a1a}-1.72\%$
test_values[generalized_advantage_estimate-True-True] 9.6162ms 9.3432ms 107.0297 Ops/s 101.7809 Ops/s $\textbf{\color{#35bf28}+5.16\%}$
test_values[vec_generalized_advantage_estimate-True-True] 38.9748ms 35.1340ms 28.4625 Ops/s 29.7484 Ops/s $\color{#d91a1a}-4.32\%$
test_values[td0_return_estimate-False-False] 0.1956ms 0.1675ms 5.9710 KOps/s 5.1201 KOps/s $\textbf{\color{#35bf28}+16.62\%}$
test_values[td1_return_estimate-False-False] 25.9116ms 23.5836ms 42.4023 Ops/s 41.9541 Ops/s $\color{#35bf28}+1.07\%$
test_values[vec_td1_return_estimate-False-False] 36.8319ms 35.6383ms 28.0597 Ops/s 29.6684 Ops/s $\textbf{\color{#d91a1a}-5.42\%}$
test_values[td_lambda_return_estimate-True-False] 36.5092ms 33.6616ms 29.7075 Ops/s 29.5712 Ops/s $\color{#35bf28}+0.46\%$
test_values[vec_td_lambda_return_estimate-True-False] 36.7253ms 35.5446ms 28.1337 Ops/s 29.8620 Ops/s $\textbf{\color{#d91a1a}-5.79\%}$
test_gae_speed[generalized_advantage_estimate-False-1-512] 8.2880ms 8.1641ms 122.4876 Ops/s 121.1500 Ops/s $\color{#35bf28}+1.10\%$
test_gae_speed[vec_generalized_advantage_estimate-True-1-512] 2.1138ms 1.8609ms 537.3641 Ops/s 492.3930 Ops/s $\textbf{\color{#35bf28}+9.13\%}$
test_gae_speed[vec_generalized_advantage_estimate-False-1-512] 0.4161ms 0.3398ms 2.9430 KOps/s 2.8458 KOps/s $\color{#35bf28}+3.42\%$
test_gae_speed[vec_generalized_advantage_estimate-True-32-512] 48.4963ms 46.7232ms 21.4026 Ops/s 24.5081 Ops/s $\textbf{\color{#d91a1a}-12.67\%}$
test_gae_speed[vec_generalized_advantage_estimate-False-32-512] 3.7285ms 3.0080ms 332.4431 Ops/s 327.7110 Ops/s $\color{#35bf28}+1.44\%$
test_dqn_speed 6.9942ms 1.3600ms 735.3133 Ops/s 723.2078 Ops/s $\color{#35bf28}+1.67\%$
test_ddpg_speed 3.3529ms 2.6785ms 373.3496 Ops/s 368.9682 Ops/s $\color{#35bf28}+1.19\%$
test_sac_speed 9.2023ms 8.1403ms 122.8459 Ops/s 120.2305 Ops/s $\color{#35bf28}+2.18\%$
test_redq_speed 14.1571ms 12.9521ms 77.2076 Ops/s 75.4805 Ops/s $\color{#35bf28}+2.29\%$
test_redq_deprec_speed 14.2136ms 13.0098ms 76.8653 Ops/s 75.8306 Ops/s $\color{#35bf28}+1.36\%$
test_td3_speed 9.6185ms 8.0905ms 123.6025 Ops/s 121.5871 Ops/s $\color{#35bf28}+1.66\%$
test_cql_speed 36.8606ms 35.7516ms 27.9708 Ops/s 27.5889 Ops/s $\color{#35bf28}+1.38\%$
test_a2c_speed 77.8576ms 7.8578ms 127.2617 Ops/s 134.9268 Ops/s $\textbf{\color{#d91a1a}-5.68\%}$
test_ppo_speed 9.0401ms 7.6721ms 130.3429 Ops/s 131.0389 Ops/s $\color{#d91a1a}-0.53\%$
test_reinforce_speed 7.3551ms 6.6135ms 151.2060 Ops/s 152.7481 Ops/s $\color{#d91a1a}-1.01\%$
test_iql_speed 33.3002ms 32.0147ms 31.2356 Ops/s 30.4422 Ops/s $\color{#35bf28}+2.61\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 2.5435ms 2.2603ms 442.4180 Ops/s 438.6506 Ops/s $\color{#35bf28}+0.86\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 0.9111ms 0.5026ms 1.9896 KOps/s 1.8097 KOps/s $\textbf{\color{#35bf28}+9.94\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.6353ms 0.4724ms 2.1168 KOps/s 1.9480 KOps/s $\textbf{\color{#35bf28}+8.67\%}$
test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 2.5590ms 2.2097ms 452.5507 Ops/s 449.6242 Ops/s $\color{#35bf28}+0.65\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 0.9793ms 0.4897ms 2.0421 KOps/s 2.0491 KOps/s $\color{#d91a1a}-0.35\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.6426ms 0.4669ms 2.1418 KOps/s 2.1273 KOps/s $\color{#35bf28}+0.68\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] 1.5997ms 1.2705ms 787.0846 Ops/s 748.4306 Ops/s $\textbf{\color{#35bf28}+5.16\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] 1.4226ms 1.2057ms 829.3599 Ops/s 802.6966 Ops/s $\color{#35bf28}+3.32\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 3.4830ms 2.3388ms 427.5640 Ops/s 437.2774 Ops/s $\color{#d91a1a}-2.22\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 95.1741ms 0.6856ms 1.4585 KOps/s 1.6173 KOps/s $\textbf{\color{#d91a1a}-9.82\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.8731ms 0.5811ms 1.7210 KOps/s 1.7004 KOps/s $\color{#35bf28}+1.21\%$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 5.8809ms 2.2455ms 445.3319 Ops/s 449.0400 Ops/s $\color{#d91a1a}-0.83\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 0.5973ms 0.5003ms 1.9987 KOps/s 2.0056 KOps/s $\color{#d91a1a}-0.34\%$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 3.7021ms 0.4820ms 2.0749 KOps/s 2.1148 KOps/s $\color{#d91a1a}-1.89\%$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 3.5995ms 2.3980ms 417.0226 Ops/s 443.2812 Ops/s $\textbf{\color{#d91a1a}-5.92\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 0.6752ms 0.4897ms 2.0423 KOps/s 2.0323 KOps/s $\color{#35bf28}+0.49\%$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.6713ms 0.4674ms 2.1396 KOps/s 2.1184 KOps/s $\color{#35bf28}+1.00\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 3.5745ms 2.3818ms 419.8487 Ops/s 414.8421 Ops/s $\color{#35bf28}+1.21\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 1.1136ms 0.6114ms 1.6356 KOps/s 1.6219 KOps/s $\color{#35bf28}+0.84\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.7501ms 0.5818ms 1.7188 KOps/s 1.6755 KOps/s $\color{#35bf28}+2.58\%$
test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 0.1024s 7.5131ms 133.1017 Ops/s 137.2360 Ops/s $\color{#d91a1a}-3.01\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] 14.4365ms 12.0554ms 82.9502 Ops/s 83.6324 Ops/s $\color{#d91a1a}-0.82\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 3.7645ms 1.1416ms 875.9338 Ops/s 949.1960 Ops/s $\textbf{\color{#d91a1a}-7.72\%}$
test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] 87.3050ms 5.3301ms 187.6152 Ops/s 135.7376 Ops/s $\textbf{\color{#35bf28}+38.22\%}$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] 14.4403ms 12.0072ms 83.2834 Ops/s 83.9775 Ops/s $\color{#d91a1a}-0.83\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] 3.6615ms 1.1192ms 893.4643 Ops/s 956.7210 Ops/s $\textbf{\color{#d91a1a}-6.61\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] 90.9132ms 7.5499ms 132.4516 Ops/s 163.7701 Ops/s $\textbf{\color{#d91a1a}-19.12\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 14.5945ms 12.2862ms 81.3922 Ops/s 68.5295 Ops/s $\textbf{\color{#35bf28}+18.77\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] 2.2413ms 1.3760ms 726.7642 Ops/s 732.4263 Ops/s $\color{#d91a1a}-0.77\%$

github-actions bot commented Mar 11, 2024

$\color{#D29922}\textsf{\Large⚠\kern{0.2cm}\normalsize Warning}$ Result of GPU Benchmark Tests

Total Benchmarks: 94. Improved: $\large\color{#35bf28}7$. Worsened: $\large\color{#d91a1a}2$.

Name Max Mean Ops Ops on Repo HEAD Change
test_single 0.1026s 0.1008s 9.9224 Ops/s 8.9794 Ops/s $\textbf{\color{#35bf28}+10.50\%}$
test_sync 93.2983ms 90.4722ms 11.0531 Ops/s 10.8826 Ops/s $\color{#35bf28}+1.57\%$
test_async 0.1762s 89.0927ms 11.2243 Ops/s 11.2765 Ops/s $\color{#d91a1a}-0.46\%$
test_single_pixels 0.1816s 0.1181s 8.4683 Ops/s 8.8167 Ops/s $\color{#d91a1a}-3.95\%$
test_sync_pixels 69.0961ms 67.7455ms 14.7611 Ops/s 14.7793 Ops/s $\color{#d91a1a}-0.12\%$
test_async_pixels 0.1225s 55.4467ms 18.0353 Ops/s 17.5950 Ops/s $\color{#35bf28}+2.50\%$
test_simple 0.7276s 0.6723s 1.4875 Ops/s 1.4588 Ops/s $\color{#35bf28}+1.97\%$
test_transformed 0.9257s 0.8713s 1.1477 Ops/s 1.1124 Ops/s $\color{#35bf28}+3.17\%$
test_serial 2.1422s 2.0915s 0.4781 Ops/s 0.4664 Ops/s $\color{#35bf28}+2.51\%$
test_parallel 1.8980s 1.8659s 0.5359 Ops/s 0.5502 Ops/s $\color{#d91a1a}-2.59\%$
test_step_mdp_speed[True-True-True-True-True] 87.2150μs 32.6232μs 30.6530 KOps/s 30.5787 KOps/s $\color{#35bf28}+0.24\%$
test_step_mdp_speed[True-True-True-True-False] 38.5830μs 19.6904μs 50.7861 KOps/s 51.1338 KOps/s $\color{#d91a1a}-0.68\%$
test_step_mdp_speed[True-True-True-False-True] 46.9420μs 18.7218μs 53.4136 KOps/s 53.2623 KOps/s $\color{#35bf28}+0.28\%$
test_step_mdp_speed[True-True-True-False-False] 37.3310μs 11.1527μs 89.6646 KOps/s 89.0231 KOps/s $\color{#35bf28}+0.72\%$
test_step_mdp_speed[True-True-False-True-True] 58.4630μs 34.4433μs 29.0332 KOps/s 28.9692 KOps/s $\color{#35bf28}+0.22\%$
test_step_mdp_speed[True-True-False-True-False] 38.5420μs 21.2697μs 47.0152 KOps/s 47.1266 KOps/s $\color{#d91a1a}-0.24\%$
test_step_mdp_speed[True-True-False-False-True] 36.7630μs 20.4306μs 48.9462 KOps/s 49.8005 KOps/s $\color{#d91a1a}-1.72\%$
test_step_mdp_speed[True-True-False-False-False] 36.7520μs 13.0397μs 76.6891 KOps/s 76.4129 KOps/s $\color{#35bf28}+0.36\%$
test_step_mdp_speed[True-False-True-True-True] 58.0240μs 36.4431μs 27.4400 KOps/s 27.7327 KOps/s $\color{#d91a1a}-1.06\%$
test_step_mdp_speed[True-False-True-True-False] 42.9730μs 23.2366μs 43.0356 KOps/s 42.8265 KOps/s $\color{#35bf28}+0.49\%$
test_step_mdp_speed[True-False-True-False-True] 45.1030μs 20.2512μs 49.3798 KOps/s 49.2630 KOps/s $\color{#35bf28}+0.24\%$
test_step_mdp_speed[True-False-True-False-False] 27.7520μs 12.9574μs 77.1762 KOps/s 77.1979 KOps/s $\color{#d91a1a}-0.03\%$
test_step_mdp_speed[True-False-False-True-True] 64.2140μs 38.3790μs 26.0559 KOps/s 26.6114 KOps/s $\color{#d91a1a}-2.09\%$
test_step_mdp_speed[True-False-False-True-False] 49.4130μs 25.2262μs 39.6413 KOps/s 40.2207 KOps/s $\color{#d91a1a}-1.44\%$
test_step_mdp_speed[True-False-False-False-True] 47.3130μs 21.9061μs 45.6495 KOps/s 45.6322 KOps/s $\color{#35bf28}+0.04\%$
test_step_mdp_speed[True-False-False-False-False] 38.4620μs 14.6481μs 68.2681 KOps/s 67.9560 KOps/s $\color{#35bf28}+0.46\%$
test_step_mdp_speed[False-True-True-True-True] 60.2940μs 36.3306μs 27.5250 KOps/s 27.7301 KOps/s $\color{#d91a1a}-0.74\%$
test_step_mdp_speed[False-True-True-True-False] 55.1330μs 23.1913μs 43.1196 KOps/s 43.0816 KOps/s $\color{#35bf28}+0.09\%$
test_step_mdp_speed[False-True-True-False-True] 53.9630μs 24.5690μs 40.7016 KOps/s 42.2944 KOps/s $\color{#d91a1a}-3.77\%$
test_step_mdp_speed[False-True-True-False-False] 37.7930μs 14.7929μs 67.6001 KOps/s 67.3720 KOps/s $\color{#35bf28}+0.34\%$
test_step_mdp_speed[False-True-False-True-True] 69.6240μs 38.6807μs 25.8527 KOps/s 26.0104 KOps/s $\color{#d91a1a}-0.61\%$
test_step_mdp_speed[False-True-False-True-False] 46.9130μs 25.3339μs 39.4728 KOps/s 39.6026 KOps/s $\color{#d91a1a}-0.33\%$
test_step_mdp_speed[False-True-False-False-True] 43.2810μs 25.8333μs 38.7098 KOps/s 38.2899 KOps/s $\color{#35bf28}+1.10\%$
test_step_mdp_speed[False-True-False-False-False] 37.5420μs 16.7718μs 59.6239 KOps/s 60.0938 KOps/s $\color{#d91a1a}-0.78\%$
test_step_mdp_speed[False-False-True-True-True] 64.4130μs 39.7703μs 25.1444 KOps/s 24.8837 KOps/s $\color{#35bf28}+1.05\%$
test_step_mdp_speed[False-False-True-True-False] 49.8730μs 27.0405μs 36.9815 KOps/s 36.8751 KOps/s $\color{#35bf28}+0.29\%$
test_step_mdp_speed[False-False-True-False-True] 52.8720μs 26.1020μs 38.3112 KOps/s 38.3862 KOps/s $\color{#d91a1a}-0.20\%$
test_step_mdp_speed[False-False-True-False-False] 35.0010μs 16.7248μs 59.7915 KOps/s 60.0534 KOps/s $\color{#d91a1a}-0.44\%$
test_step_mdp_speed[False-False-False-True-True] 58.8740μs 41.8937μs 23.8699 KOps/s 24.2384 KOps/s $\color{#d91a1a}-1.52\%$
test_step_mdp_speed[False-False-False-True-False] 52.0730μs 28.7207μs 34.8181 KOps/s 34.5716 KOps/s $\color{#35bf28}+0.71\%$
test_step_mdp_speed[False-False-False-False-True] 48.7430μs 27.6634μs 36.1488 KOps/s 36.2752 KOps/s $\color{#d91a1a}-0.35\%$
test_step_mdp_speed[False-False-False-False-False] 37.7720μs 18.2018μs 54.9396 KOps/s 54.3757 KOps/s $\color{#35bf28}+1.04\%$
test_values[generalized_advantage_estimate-True-True] 25.8758ms 24.6231ms 40.6123 Ops/s 39.2225 Ops/s $\color{#35bf28}+3.54\%$
test_values[vec_generalized_advantage_estimate-True-True] 95.8533ms 3.4766ms 287.6374 Ops/s 307.2527 Ops/s $\textbf{\color{#d91a1a}-6.38\%}$
test_values[td0_return_estimate-False-False] 93.2740μs 64.1303μs 15.5932 KOps/s 15.0885 KOps/s $\color{#35bf28}+3.34\%$
test_values[td1_return_estimate-False-False] 52.2964ms 51.6854ms 19.3478 Ops/s 18.1430 Ops/s $\textbf{\color{#35bf28}+6.64\%}$
test_values[vec_td1_return_estimate-False-False] 1.9428ms 1.7481ms 572.0559 Ops/s 566.5981 Ops/s $\color{#35bf28}+0.96\%$
test_values[td_lambda_return_estimate-True-False] 83.1762ms 82.5093ms 12.1198 Ops/s 11.4121 Ops/s $\textbf{\color{#35bf28}+6.20\%}$
test_values[vec_td_lambda_return_estimate-True-False] 2.0434ms 1.7486ms 571.8803 Ops/s 567.6234 Ops/s $\color{#35bf28}+0.75\%$
test_gae_speed[generalized_advantage_estimate-False-1-512] 23.0065ms 22.7172ms 44.0195 Ops/s 42.6969 Ops/s $\color{#35bf28}+3.10\%$
test_gae_speed[vec_generalized_advantage_estimate-True-1-512] 0.8722ms 0.6897ms 1.4498 KOps/s 1.4416 KOps/s $\color{#35bf28}+0.57\%$
test_gae_speed[vec_generalized_advantage_estimate-False-1-512] 0.7120ms 0.6399ms 1.5628 KOps/s 1.5468 KOps/s $\color{#35bf28}+1.04\%$
test_gae_speed[vec_generalized_advantage_estimate-True-32-512] 1.4954ms 1.4436ms 692.7072 Ops/s 688.9805 Ops/s $\color{#35bf28}+0.54\%$
test_gae_speed[vec_generalized_advantage_estimate-False-32-512] 0.9430ms 0.6635ms 1.5071 KOps/s 1.4986 KOps/s $\color{#35bf28}+0.56\%$
test_dqn_speed 7.9962ms 1.4425ms 693.2542 Ops/s 683.3785 Ops/s $\color{#35bf28}+1.45\%$
test_ddpg_speed 2.9893ms 2.7240ms 367.1091 Ops/s 364.3629 Ops/s $\color{#35bf28}+0.75\%$
test_sac_speed 8.4200ms 8.0084ms 124.8693 Ops/s 121.5691 Ops/s $\color{#35bf28}+2.71\%$
test_redq_speed 11.1589ms 10.0636ms 99.3679 Ops/s 95.9891 Ops/s $\color{#35bf28}+3.52\%$
test_redq_deprec_speed 11.4745ms 10.7043ms 93.4203 Ops/s 88.2921 Ops/s $\textbf{\color{#35bf28}+5.81\%}$
test_td3_speed 8.2623ms 7.9278ms 126.1380 Ops/s 122.4805 Ops/s $\color{#35bf28}+2.99\%$
test_cql_speed 26.3237ms 25.2289ms 39.6370 Ops/s 39.0922 Ops/s $\color{#35bf28}+1.39\%$
test_a2c_speed 6.7543ms 5.4871ms 182.2458 Ops/s 180.2229 Ops/s $\color{#35bf28}+1.12\%$
test_ppo_speed 6.7604ms 5.7417ms 174.1650 Ops/s 169.8639 Ops/s $\color{#35bf28}+2.53\%$
test_reinforce_speed 4.6484ms 4.4443ms 225.0073 Ops/s 222.2596 Ops/s $\color{#35bf28}+1.24\%$
test_iql_speed 19.7389ms 19.1285ms 52.2780 Ops/s 51.2922 Ops/s $\color{#35bf28}+1.92\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 3.0869ms 2.8796ms 347.2729 Ops/s 344.2655 Ops/s $\color{#35bf28}+0.87\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 0.6603ms 0.5376ms 1.8603 KOps/s 1.8548 KOps/s $\color{#35bf28}+0.30\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 4.3175ms 0.5163ms 1.9368 KOps/s 1.9228 KOps/s $\color{#35bf28}+0.73\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 3.1610ms 2.8885ms 346.2012 Ops/s 346.8901 Ops/s $\color{#d91a1a}-0.20\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 0.6728ms 0.5284ms 1.8926 KOps/s 1.8805 KOps/s $\color{#35bf28}+0.64\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 4.5575ms 0.5161ms 1.9375 KOps/s 1.9579 KOps/s $\color{#d91a1a}-1.04\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] 1.6204ms 1.5150ms 660.0794 Ops/s 661.7739 Ops/s $\color{#d91a1a}-0.26\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] 5.3627ms 1.4471ms 691.0345 Ops/s 693.8739 Ops/s $\color{#d91a1a}-0.41\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 3.0901ms 3.0071ms 332.5457 Ops/s 332.0259 Ops/s $\color{#35bf28}+0.16\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 1.4649ms 0.6579ms 1.5199 KOps/s 1.3228 KOps/s $\textbf{\color{#35bf28}+14.90\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.8365ms 0.6360ms 1.5723 KOps/s 1.5642 KOps/s $\color{#35bf28}+0.52\%$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 2.9517ms 2.8778ms 347.4826 Ops/s 345.1936 Ops/s $\color{#35bf28}+0.66\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 1.2415ms 0.5431ms 1.8413 KOps/s 1.8469 KOps/s $\color{#d91a1a}-0.30\%$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.7203ms 0.5170ms 1.9343 KOps/s 1.9338 KOps/s $\color{#35bf28}+0.02\%$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 3.2246ms 2.9180ms 342.7000 Ops/s 345.6547 Ops/s $\color{#d91a1a}-0.85\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 0.6193ms 0.5298ms 1.8875 KOps/s 1.8886 KOps/s $\color{#d91a1a}-0.06\%$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 4.5273ms 0.5130ms 1.9494 KOps/s 1.9484 KOps/s $\color{#35bf28}+0.05\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 3.1921ms 3.0077ms 332.4791 Ops/s 330.8049 Ops/s $\color{#35bf28}+0.51\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 0.8840ms 0.6633ms 1.5075 KOps/s 1.4995 KOps/s $\color{#35bf28}+0.54\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.7743ms 0.6394ms 1.5640 KOps/s 1.5560 KOps/s $\color{#35bf28}+0.51\%$
test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 0.1058s 8.7414ms 114.3980 Ops/s 111.9774 Ops/s $\color{#35bf28}+2.16\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] 16.6530ms 14.3538ms 69.6681 Ops/s 68.1723 Ops/s $\color{#35bf28}+2.19\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 1.1402ms 1.0397ms 961.7768 Ops/s 834.3765 Ops/s $\textbf{\color{#35bf28}+15.27\%}$
test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] 0.1001s 6.7293ms 148.6036 Ops/s 149.0107 Ops/s $\color{#d91a1a}-0.27\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] 16.6246ms 14.3152ms 69.8556 Ops/s 68.3636 Ops/s $\color{#35bf28}+2.18\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] 2.4998ms 1.1988ms 834.1687 Ops/s 760.5105 Ops/s $\textbf{\color{#35bf28}+9.69\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] 0.1031s 9.0474ms 110.5291 Ops/s 111.6162 Ops/s $\color{#d91a1a}-0.97\%$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 17.2234ms 14.6359ms 68.3250 Ops/s 67.0633 Ops/s $\color{#35bf28}+1.88\%$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] 7.8760ms 1.6455ms 607.7174 Ops/s 657.8848 Ops/s $\textbf{\color{#d91a1a}-7.63\%}$

@vmoens vmoens added the enhancement New feature or request label Mar 12, 2024
Contributor Author

vmoens commented Mar 12, 2024

@AechPro I did not change the n_steps argument as you suggested, for consistency within torchrl.
I'm open to changing both, but I'm not sure how to proceed without breaking backward compatibility.

I'll ping you when the preview of the doc is built for you to check if things make sense, in the meantime you can look at the diff


AechPro commented Mar 12, 2024

@vmoens I understand that backwards compatibility is important, but I think if we expect torchrl to be largely used by a research community it seems a bit strange for the n_steps argument to behave in a slightly unexpected way.

Consider the case where a user of torchrl is re-implementing an algorithm like RAINBOW in good faith, and they naturally set the n_steps argument to 3 as suggested by the paper. With the current behavior this would actually produce a multi-step return estimate with n=4 according to the algorithm described in the paper, which would meaningfully change the performance of the algorithm. I can imagine our hypothetical torchrl user becoming quite confused when they cannot replicate the results of RAINBOW despite having (seemingly) implemented everything correctly.

Further, if we expect new algorithms to be written using torchrl, this sort of discrepancy in the meaning of the multi-step parameter could introduce a point of confusion in the opposite direction: if a person not using torchrl is attempting to implement a paper which describes results from an algorithm implemented with torchrl using the multi-step return object, that person may find it difficult to replicate the results of the torchrl algorithm for the same reason as our hypothetical RAINBOW user.

In my opinion preserving the meaning of the parameter in the multi-step return algorithm is more important than backwards compatibility because of the two examples I presented above, but I understand others may feel differently. At the very least I would strongly advocate for some easy to see documentation about this behavior whenever there is a tutorial using the MultiStep object and in the docstring for the object itself so we can mitigate these two potential issues.
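
To make the size of that discrepancy concrete, here is a minimal sketch of an n-step return evaluated for n=3 and n=4 on the same data. The helper `n_step_return`, the constant rewards and the zero value estimates are all assumptions chosen for illustration; none of this is torchrl code.

```python
def n_step_return(rewards, values, t, n, gamma):
    """Sum n discounted rewards from step t, then bootstrap from the value at t + n.

    rewards[k] is the reward collected when leaving state k and values[k] is a
    (hypothetical) value estimate for state k. Terminal states are not handled.
    """
    discounted = sum(gamma**k * rewards[t + k] for k in range(n))
    return discounted + gamma**n * values[t + n]


rewards = [1.0] * 6  # constant reward of 1 per step
values = [0.0] * 7   # zero value estimates isolate the reward terms
gamma = 0.5

g3 = n_step_return(rewards, values, t=0, n=3, gamma=gamma)  # 1 + 0.5 + 0.25
g4 = n_step_return(rewards, values, t=0, n=4, gamma=gamma)  # one extra term: + 0.125
print(g3, g4)
```

An off-by-one in the meaning of n therefore adds one more reward term to every target and moves the bootstrap state one step further into the future.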

Contributor Author

vmoens commented Mar 12, 2024

Makes sense! Then let's make n_steps consistent with what the community expects.


AechPro commented Mar 12, 2024

I'm having a little trouble understanding the output of the example. We have the MultiStepTransform with n_steps=3, then we sample some timesteps and slice the first 5 entries from our replay buffer. The first entry in that slice reports a step count of 9, which I assume is supposed to be equal to 5+n; that would be 5 + 3 = 8, but even that would be incorrect behavior. The first timestep emitted by the transform should be timestep 1 (or zero if you start counting at zero), because that contains the state from which we're computing the n-step return. So I would expect rb[:]["step_count"][:, 0] to be 1 (or zero) and the final entry at rb[:]["step_count"][:, -1] to be T-n+1, because the internal buffers should only need to contain n-1 waiting timesteps. The exception is a terminal state at the end of the internal buffer, in which case the transform should just compute all of the remaining possible returns at each of those timesteps.

I'm sure I must be misunderstanding what's happening in the example. Could you clarify?

Contributor Author

vmoens commented Mar 12, 2024

Let me clarify:
Here we look at the step count at the root:

>>> print("step_count", rb[:]["step_count"][:, :5])
step_count tensor([[[ 9],
         [10],
         [11],
         [12],
         [13]],

        [[12],
         [13],
         [14],
         [15],
         [16]]])

Env 0 has steps [9, 10, 11...] and env 1 [12, 13, 14,...]

Then we look at the "next" entry to see the shift.
Without MultiStep, we would have [10, 11, 12, ...] for env 0 and [13, 14, 15, ...] for env 1
Because we use multi-step with a shift of 3 we have [13, 14, 15, ...] and [16, 17, 18, ...] resp.

>>> print("next step_count", rb[:]["next", "step_count"][:, :5])
next step_count tensor([[[13],
         [14],
         [15],
         [16],
         [17]],

        [[16],
         [17],
         [18],
         [19],
         [20]]])

Note that we're looking at the replay buffer content, so it doesn't really matter what those values are (i.e. it's expected that it doesn't start at 0).

For a single env and n=3, you would have this data structure accessible in the buffer

done:                        [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
step_count:                  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
next, step_count_orig:       [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
next, step_count:            [4, 5, 6, 7, 8, 9, 10, 10, 10, 10]

If you think this isn't the desired behaviour I'd be happy to read what you think would be expected, but to me it looks like what multi-step requires.


AechPro commented Mar 12, 2024

Ah-hah, my apologies for misunderstanding. Thanks for the clarification, this looks good!


AechPro commented Mar 13, 2024

Although, if you are looking to implement the change to the n_steps parameter we spoke about earlier right now, then I believe the example with 1 env should be as follows:

done:                        [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
step_count:                  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
next, step_count_orig:       [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
next, step_count:            [3, 4, 5, 6, 7, 8, 9, 10, 10, 10]

Because starting from state 0 we will compute the return estimate as the sum of rewards starting with timestep 0, then 1, then 2, and finally the learning algorithm will need to bootstrap from the state encountered at timestep 3.
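
The rows of the table above can be reproduced with a few lines of plain Python. This is a sketch, assuming a single episode that ends at the first done flag; `next_step_count` is a hypothetical helper and `shift` is the number of steps between a state and the state it bootstraps from.

```python
def next_step_count(step_count, done, shift):
    """Step count of the bootstrap state, `shift` steps ahead, clipped at episode end."""
    last = done.index(1)                 # index of the final transition
    final_next = step_count[last] + 1    # step count reached after that transition
    return [min(t + shift, final_next) for t in step_count]


done = [0] * 9 + [1]
step_count = list(range(10))

standard = next_step_count(step_count, done, shift=1)  # the "step_count_orig" row
proposed = next_step_count(step_count, done, shift=3)  # n_steps=3, proposed meaning
previous = next_step_count(step_count, done, shift=4)  # n_steps=3, earlier meaning
print(standard)
print(proposed)
print(previous)
```

With `shift=4` the same helper reproduces the table posted earlier in the thread, which makes the one-step difference between the two conventions easy to see.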

Contributor Author

vmoens commented Mar 13, 2024

So given what you're saying, what it should be is:
n_steps=0 => error
n_steps=1 => no transform
n_steps=2 => 1 shift in the future
etc

I will change that thx
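
The mapping listed above could be written down as follows. This is a sketch only: `extra_shift` is a hypothetical helper for illustration, not a function in torchrl, and the error type is an assumption.

```python
def extra_shift(n_steps: int) -> int:
    """Number of additional steps by which "next" moves into the future.

    Hypothetical helper for illustration only; it is not part of torchrl.
    """
    if n_steps < 1:
        raise ValueError(f"n_steps must be at least 1, got {n_steps}")
    # n_steps=1 leaves the data untouched, n_steps=2 shifts by one step, etc.
    return n_steps - 1


for n in (1, 2, 3):
    print(n, extra_shift(n))
```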


AechPro commented Mar 13, 2024

I'm not sure what the right outcome for n_steps=0 is. It seems reasonable to set all of the rewards to zero and all of the next states to the current states (i.e. step_count = t and next, step_count = t), because this would turn the equation return_estimates = batch["rewards"] + batch["next"]["gammas"] * value_estimator(batch["next"]["observation"]) into return_estimates = 0 + 1 * value_estimator(batch["next"]["observation"]) = value_estimator(batch["observation"]), provided we also set the gamma values to 1. This makes some sense because n_steps can be taken to mean the number of reward terms we incorporate before using an estimator of the value function in our return estimate, so zero would lead us to rely entirely on the value estimator, with no reward terms at all.

With that said I'm not aware of anything in the literature using n=0, and I wonder if it would be confusing or if it's even necessary.
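
Written as a formula, the reading of n as "number of reward terms before bootstrapping" looks like this. The notation ($\hat{G}$, $V$, $r$, $s$) is assumed here for illustration and does not come from the torchrl docs.

```latex
\hat{G}_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k}\, r_{t+k} + \gamma^{n}\, V(s_{t+n}),
\qquad
\hat{G}_t^{(0)} = V(s_t)
```

For $n = 0$ the sum is empty, the discount factor is $\gamma^{0} = 1$ and the bootstrap state is $s_t$ itself, which matches the zero rewards, unit gammas and unchanged next state described above.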

Contributor Author

vmoens commented Mar 13, 2024

The way I see it n=0 is equivalent to "doing nothing" which can be achieved by... doing nothing haha


AechPro commented Mar 13, 2024

LOL 😅 yeah it makes sense to do it that way. I was just imagining a scenario where maybe a user is doing some sort of hyper-parameter investigation and they wanted to vary the value of n from 0 to some number without changing anything in the underlying learning algorithm, which is maybe a realistic thing to want if someone is interested in measuring the impact of rewards on the return estimate.

Contributor Author

vmoens commented Mar 13, 2024

I guess they will have to start from 1 :)

I think accounting for those edge cases causes more hassle and bloats the doc while bringing little value in practice (and requires proper testing of a behaviour that is poorly defined even on our end!)


AechPro commented Mar 13, 2024

Yeah, fair enough. Let's keep it as you suggested then!

@vmoens vmoens marked this pull request as ready for review March 18, 2024 08:51
Contributor Author

vmoens commented Mar 18, 2024

cc @agarwl this is an implementation of multi-step that allows the horizon to be changed dynamically during training

@vmoens vmoens merged commit e3b66bb into main Mar 18, 2024
45 of 52 checks passed
@vmoens vmoens deleted the void-add-extend branch March 18, 2024 08:52
SandishKumarHN pushed a commit to SandishKumarHN/rl that referenced this pull request Mar 18, 2024
3 participants