
Refactor value based methods #102

Merged: 16 commits into master, Feb 3, 2022
Conversation

@vwxyzjn (Owner) commented Jan 28, 2022

Continuing from #79, this PR refactors the value-based methods. Specifically, we

  1. Adopted the replay buffer from Stable-Baselines3. SB3's replay buffer is well implemented and its behavior is intuitive: we can imagine what happens when we call buffer.sample(), whereas something like agent.learn() is much more abstract. For this reason, I think it is worth adopting SB3's replay buffer. In the future, when SB3 introduces a prioritized replay buffer ([question] HER and prioritized experience replay hill-a/stable-baselines#751), this will also make it easier for us to adopt.
  2. General format refactoring.
  • match DDPG performance
  • match TD3 performance
  • match C51 performance
  • match DQN performance
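To illustrate the interface being adopted, here is a minimal numpy sketch of a circular replay buffer with SB3-style add/sample semantics. This is a simplified stand-in for illustration, not SB3's actual implementation (which also handles devices, dict observations, and vectorized environments):

```python
import numpy as np

class ReplayBuffer:
    """Minimal circular replay buffer sketch with an SB3-like interface."""

    def __init__(self, buffer_size, obs_dim, act_dim):
        self.buffer_size = buffer_size
        self.pos = 0          # next write position
        self.full = False     # becomes True once the buffer has wrapped around
        self.obs = np.zeros((buffer_size, obs_dim), dtype=np.float32)
        self.next_obs = np.zeros((buffer_size, obs_dim), dtype=np.float32)
        self.actions = np.zeros((buffer_size, act_dim), dtype=np.float32)
        self.rewards = np.zeros(buffer_size, dtype=np.float32)
        self.dones = np.zeros(buffer_size, dtype=np.float32)

    def add(self, obs, next_obs, action, reward, done):
        # Overwrite the oldest transition once the buffer is full.
        self.obs[self.pos] = obs
        self.next_obs[self.pos] = next_obs
        self.actions[self.pos] = action
        self.rewards[self.pos] = reward
        self.dones[self.pos] = done
        self.pos = (self.pos + 1) % self.buffer_size
        if self.pos == 0:
            self.full = True

    def sample(self, batch_size):
        # Sample uniformly from the filled portion of the buffer.
        upper = self.buffer_size if self.full else self.pos
        idx = np.random.randint(0, upper, size=batch_size)
        return (self.obs[idx], self.actions[idx], self.next_obs[idx],
                self.dones[idx], self.rewards[idx])
```

The appeal is exactly the intuitiveness argued above: `add` stores one transition, `sample` returns a random batch, and nothing else is hidden behind the call.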

@vwxyzjn vwxyzjn requested a review from dosssman January 28, 2022 21:03
@vwxyzjn (Owner, Author) commented Jan 29, 2022

@vwxyzjn (Owner, Author) commented Jan 30, 2022

Found a regression with C51, probably due to a different learning rate and target-network frequency.

[image: C51 regression comparison]

@vwxyzjn (Owner, Author) commented Jan 30, 2022

Found a regression with DQN, probably due to a different learning rate.

[image: DQN regression comparison]

@vwxyzjn (Owner, Author) commented Jan 31, 2022

The regression report checks out.

[image: regression report]

@dosssman all good on your end?

@dosssman (Collaborator) commented Feb 1, 2022

I have been looking at the continuous action space methods, which I am more familiar with.
I think there was some superfluous code in the TD3 scripts, which is addressed in #106.
I also added a refactored SAC with some additional tweaks that could be incorporated while we are at it.

Will further check the DQN / C51-style discrete action space methods later in the week.

In any case, great work as always.

* Fixed 'optimize the midel' typo in all files

* Fixed 'optimize the midel' typo in offline scripts too

* TD3: removed DDPG's update code from the training loop

* Refactored sac_continuous, with preliminary tests working

@vwxyzjn (Owner, Author) commented Feb 1, 2022

Thanks @dosssman for the detailed check on TD3. After this fix, TD3's performance (green line on the right) is even better.

[image: TD3 before/after comparison]
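For context on why a DDPG-style per-step actor update is superfluous in TD3: TD3's delayed policy update trains the twin critics every step but updates the actor (and the target networks) only every `policy_frequency` steps. A minimal sketch of the schedule, counting updates rather than actually training networks (the function name and counters here are illustrative, not CleanRL's actual code):

```python
def td3_update_schedule(total_steps, policy_frequency=2):
    """Count component updates under TD3's delayed-policy-update schedule.

    A real implementation would run gradient steps where the counters
    are incremented; this sketch only shows the cadence.
    """
    critic_updates = 0
    actor_updates = 0
    for global_step in range(total_steps):
        critic_updates += 1  # twin Q-networks: updated every step
        if global_step % policy_frequency == 0:
            actor_updates += 1  # actor + target networks: delayed update
    return critic_updates, actor_updates
```

With `total_steps=10` and `policy_frequency=2`, this yields 10 critic updates but only 5 actor updates; a leftover DDPG-style loop would instead update the actor all 10 times.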

@dosssman (Collaborator) commented Feb 1, 2022

Great. Thanks for running the experiments to check those.
On my side I have only tested the refactored SAC (orange line) against the previous version (blue line), albeit with only one seed.

[image: SAC comparison]

By the way, is the wandb report that the screenshot above comes from public? I could not find it in the cleanrl project so far.

@vwxyzjn (Owner, Author) commented Feb 1, 2022

It's here: https://wandb.ai/costa-huang/cleanRL/reports/Regression-Report--VmlldzoxNDI1MTE4

@vwxyzjn (Owner, Author) commented Feb 3, 2022

Merging as is so that I can introduce formatting pipelines after discussing with @dosssman (great work, btw). If there are remaining issues, we can open new PRs.

@vwxyzjn merged commit cec84cc into master on Feb 3, 2022