
Refactor value based methods #102

Merged: 16 commits into master, Feb 3, 2022
Conversation

@vwxyzjn (Owner) commented Jan 28, 2022

Continuing from #79, this PR refactors the value-based methods. Specifically, we

  1. Adopted the replay buffer from Stable-Baselines3. SB3's replay buffer is well implemented and its behavior is intuitive: we can imagine what happens when we call buffer.sample(), whereas something like agent.learn() is much more abstract. For this reason, I think it is worth adopting SB3's replay buffer. In the future, when SB3 introduces a prioritized replay buffer ([question] HER and prioritized experience replay hill-a/stable-baselines#751), this will also make it easier for us to adopt.
  2. General format refactoring.
  • match DDPG performance
  • match TD3 performance
  • match C51 performance
  • match DQN performance
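To illustrate the interface being adopted, here is a minimal numpy sketch of a circular replay buffer with SB3-style add/sample semantics. This is a simplified stand-in for illustration, not SB3's actual implementation (which also handles devices, dict observations, and vectorized environments):

```python
import numpy as np

class ReplayBuffer:
    """Minimal circular replay buffer sketch with an SB3-like interface."""

    def __init__(self, buffer_size, obs_dim, act_dim):
        self.buffer_size = buffer_size
        self.pos = 0          # next write position
        self.full = False     # becomes True once the buffer has wrapped around
        self.obs = np.zeros((buffer_size, obs_dim), dtype=np.float32)
        self.next_obs = np.zeros((buffer_size, obs_dim), dtype=np.float32)
        self.actions = np.zeros((buffer_size, act_dim), dtype=np.float32)
        self.rewards = np.zeros(buffer_size, dtype=np.float32)
        self.dones = np.zeros(buffer_size, dtype=np.float32)

    def add(self, obs, next_obs, action, reward, done):
        # Overwrite the oldest transition once the buffer is full.
        self.obs[self.pos] = obs
        self.next_obs[self.pos] = next_obs
        self.actions[self.pos] = action
        self.rewards[self.pos] = reward
        self.dones[self.pos] = done
        self.pos = (self.pos + 1) % self.buffer_size
        if self.pos == 0:
            self.full = True

    def sample(self, batch_size):
        # Sample uniformly from the filled portion of the buffer.
        upper = self.buffer_size if self.full else self.pos
        idx = np.random.randint(0, upper, size=batch_size)
        return (self.obs[idx], self.actions[idx], self.next_obs[idx],
                self.dones[idx], self.rewards[idx])
```

The appeal is exactly the intuitiveness argued above: `add` stores one transition, `sample` returns a random batch, and nothing else is hidden behind the call.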

@vwxyzjn vwxyzjn requested a review from dosssman January 28, 2022 21:03
@vwxyzjn (Owner, Author) commented Jan 29, 2022

@vwxyzjn (Owner, Author) commented Jan 30, 2022

Found a regression with C51, probably due to a different learning rate and target-network frequency.

[image: C51 regression comparison]

@vwxyzjn (Owner, Author) commented Jan 30, 2022

Found a regression with DQN, probably due to a different learning rate.

[image: DQN regression comparison]

@vwxyzjn (Owner, Author) commented Jan 31, 2022

The regression report checks out.

[image: regression report]

@dosssman all good on your end?

@dosssman (Collaborator) commented Feb 1, 2022

I have been looking at the continuous action space methods, which I am more familiar with.
I think there was some superfluous code in the TD3 scripts, which is addressed in #106.
I also added a refactored SAC with some additional tweaks that could be incorporated while we are at it.

Will further check the DQN / C51-style discrete action space methods later in the week.

In any case, great work as always.

* Fixed 'optimize the midel' typo in all files

* Fixed 'optimize the midel' typo in offline scripts too

* TD3: removed DDPG's update code from the training loop

* Refactored sac_continuous, with preliminary tests working

@vwxyzjn (Owner, Author) commented Feb 1, 2022

Thanks @dosssman for the detailed check on TD3. After this fix, TD3's performance (green line on the right) is even better.

[image: TD3 before/after comparison]
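For context on why a DDPG-style per-step actor update is superfluous in TD3: TD3's delayed policy update trains the twin critics every step but updates the actor (and the target networks) only every `policy_frequency` steps. A minimal sketch of the schedule, counting updates rather than actually training networks (the function name and counters here are illustrative, not CleanRL's actual code):

```python
def td3_update_schedule(total_steps, policy_frequency=2):
    """Count component updates under TD3's delayed-policy-update schedule.

    A real implementation would run gradient steps where the counters
    are incremented; this sketch only shows the cadence.
    """
    critic_updates = 0
    actor_updates = 0
    for global_step in range(total_steps):
        critic_updates += 1  # twin Q-networks: updated every step
        if global_step % policy_frequency == 0:
            actor_updates += 1  # actor + target networks: delayed update
    return critic_updates, actor_updates
```

With `total_steps=10` and `policy_frequency=2`, this yields 10 critic updates but only 5 actor updates; a leftover DDPG-style loop would instead update the actor all 10 times.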

@dosssman (Collaborator) commented Feb 1, 2022

Great. Thanks for running the experiments to check those.
On my side I have only tested the refactored SAC (orange line) against the previous version (blue line), albeit with only one seed.

[image: SAC comparison]

By the way, is the wandb report that the screenshot above comes from public? I could not find it in the cleanrl project so far.

@vwxyzjn (Owner, Author) commented Feb 1, 2022

It's here: https://wandb.ai/costa-huang/cleanRL/reports/Regression-Report--VmlldzoxNDI1MTE4

@vwxyzjn (Owner, Author) commented Feb 3, 2022

Merging as is so that I can introduce formatting pipelines after discussing with @dosssman (great work, btw). If there are remaining issues, we can open new PRs.

@vwxyzjn merged commit cec84cc into master on Feb 3, 2022