Add ppo.py documentation (vwxyzjn#120)
* Add PPO documentation
* Add test
* Update docs
* refactor
* Update documentation
* update docs
* Quick fix
Showing 16 changed files with 283 additions and 30 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
# Proximal Policy Optimization Benchmark | ||
|
||
This repository contains instructions to reproduce our PPO experiments done with CleanRL and `openai/baselines`. | ||
|
||
## Install CleanRL | ||
|
||
Prerequisites: | ||
* Python 3.8+ | ||
* [Poetry](https://python-poetry.org) | ||
|
||
Install dependencies: | ||
|
||
```bash | ||
git clone https://github.com/vwxyzjn/cleanrl.git && cd cleanrl | ||
poetry install | ||
``` | ||
|
||
## Reproduce CleanRL's PPO Benchmark | ||
|
||
Run the scripts in the `cleanrl` sub-folder. Note that you may need to change `--wandb-entity cleanrl` to your own W&B entity. | ||
|
||
```bash | ||
# reproduce the classic control experiments | ||
bash cleanrl/classic_control.sh | ||
``` | ||
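One way to swap the W&B entity without editing the scripts by hand is sketched below. This is only an illustration: `rewrite_entity` is a hypothetical helper that is not part of CleanRL, and it assumes the entity appears either as `--wandb-entity cleanrl` or as `WANDB_ENTITY=cleanrl`, and that the new entity name contains no `/` characters.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of CleanRL): print a copy of a benchmark
# script with the W&B entity replaced. The original file is left untouched.
rewrite_entity() {
    local script_path="$1" new_entity="$2"
    # Cover both spellings used by the benchmark scripts:
    # the CLI flag (CleanRL) and the environment variable (baselines).
    sed -e "s/--wandb-entity cleanrl/--wandb-entity ${new_entity}/g" \
        -e "s/WANDB_ENTITY=cleanrl/WANDB_ENTITY=${new_entity}/g" \
        "${script_path}"
}
```

For example, `rewrite_entity cleanrl/classic_control.sh my-team` prints the rewritten script to standard output (here `my-team` is a placeholder), so it can be inspected before being piped to `bash`.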
|
||
## Install `openai/baselines` | ||
|
||
To install, follow the instructions in our fork: https://github.com/vwxyzjn/baselines | ||
|
||
## Reproduce `openai/baselines`' PPO Benchmark | ||
|
||
Run the scripts in the `baselines` sub-folder. Note that you may need to change `WANDB_ENTITY=cleanrl` to your own W&B entity. | ||
|
||
```bash | ||
# reproduce the classic control experiments | ||
bash baselines/classic_control_separate_networks.sh | ||
``` |
benchmark/ppo/baselines/classic_control_separate_networks.sh (84 changes: 84 additions & 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,84 @@ | ||
# CartPole-v1 | ||
CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \ | ||
--alg=ppo2 \ | ||
--num_timesteps=500000 \ | ||
--num_env 4 \ | ||
--env=CartPole-v1 \ | ||
--network mlp \ | ||
--value_network='copy' \ | ||
--track \ | ||
--seed 1 | ||
|
||
CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \ | ||
--alg=ppo2 \ | ||
--num_timesteps=500000 \ | ||
--env=CartPole-v1 \ | ||
--network mlp \ | ||
--value_network='copy' \ | ||
--track \ | ||
--seed 2 | ||
|
||
CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \ | ||
--alg=ppo2 \ | ||
--num_timesteps=500000 \ | ||
--env=CartPole-v1 \ | ||
--network mlp \ | ||
--value_network='copy' \ | ||
--track \ | ||
--seed 3 | ||
|
||
# Acrobot-v1 | ||
CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \ | ||
--alg=ppo2 \ | ||
--num_timesteps=500000 \ | ||
--env=Acrobot-v1 \ | ||
--network mlp \ | ||
--value_network='copy' \ | ||
--track \ | ||
--seed 1 | ||
|
||
CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \ | ||
--alg=ppo2 \ | ||
--num_timesteps=500000 \ | ||
--env=Acrobot-v1 \ | ||
--network mlp \ | ||
--value_network='copy' \ | ||
--track \ | ||
--seed 2 | ||
|
||
CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \ | ||
--alg=ppo2 \ | ||
--num_timesteps=500000 \ | ||
--env=Acrobot-v1 \ | ||
--network mlp \ | ||
--value_network='copy' \ | ||
--track \ | ||
--seed 3 | ||
|
||
# MountainCar-v0 | ||
CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \ | ||
--alg=ppo2 \ | ||
--num_timesteps=500000 \ | ||
--env=MountainCar-v0 \ | ||
--network mlp \ | ||
--value_network='copy' \ | ||
--track \ | ||
--seed 1 | ||
|
||
CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \ | ||
--alg=ppo2 \ | ||
--num_timesteps=500000 \ | ||
--env=MountainCar-v0 \ | ||
--network mlp \ | ||
--value_network='copy' \ | ||
--track \ | ||
--seed 2 | ||
|
||
CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \ | ||
--alg=ppo2 \ | ||
--num_timesteps=500000 \ | ||
--env=MountainCar-v0 \ | ||
--network mlp \ | ||
--value_network='copy' \ | ||
--track \ | ||
--seed 3 |
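The nine commands above differ only in the environment and the seed, so the shared part can be factored into a function. The following is a rough sketch, not the script from this commit: `print_baselines_command` is a hypothetical name, the commands are printed rather than executed, and the sketch omits `--num_env 4`, which only the first CartPole-v1 command sets.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) one ppo2 command for a given
# environment and seed, using the settings shared by the commands above.
print_baselines_command() {
    local env_id="$1" seed="$2"
    # \$PWD is escaped so the printed command expands it when it is run.
    echo "CUDA_VISIBLE_DEVICES=\"-1\" WANDB_PROJECT=openai-baselines-benchmark" \
        "WANDB_ENTITY=cleanrl OPENAI_LOGDIR=\$PWD/runs OPENAI_LOG_FORMAT=tensorboard" \
        "poetry run python -m baselines.run_separate_networks" \
        "--alg=ppo2 --num_timesteps=500000 --env=${env_id}" \
        "--network mlp --value_network='copy' --track --seed ${seed}"
}

# 3 environments x 3 seeds = 9 commands.
for env_id in CartPole-v1 Acrobot-v1 MountainCar-v0; do
    for seed in 1 2 3; do
        print_baselines_command "${env_id}" "${seed}"
    done
done
```

Printing first makes it easy to review the generated commands before deciding to execute them.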
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id CartPole-v1 --track --capture-video --seed 1 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000 | ||
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id CartPole-v1 --track --capture-video --seed 2 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000 | ||
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id CartPole-v1 --track --capture-video --seed 3 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000 | ||
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id Acrobot-v1 --track --capture-video --seed 1 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000 | ||
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id Acrobot-v1 --track --capture-video --seed 2 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000 | ||
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id Acrobot-v1 --track --capture-video --seed 3 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000 | ||
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id MountainCar-v0 --track --capture-video --seed 1 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000 | ||
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id MountainCar-v0 --track --capture-video --seed 2 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000 | ||
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id MountainCar-v0 --track --capture-video --seed 3 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000 |
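The same nine runs can be expressed as a nested loop over environments and seeds. The sketch below only prints the commands so they can be checked against the list above; `print_cleanrl_commands` and the `WANDB_ENTITY_NAME` variable are hypothetical names introduced for this illustration, not part of CleanRL.

```bash
#!/usr/bin/env bash
# WANDB_ENTITY_NAME is a hypothetical variable for this sketch;
# it defaults to the entity used in the commands above.
WANDB_ENTITY_NAME="${WANDB_ENTITY_NAME:-cleanrl}"

# Print (do not run) the benchmark commands: 3 environments x 3 seeds.
print_cleanrl_commands() {
    local env_id seed
    for env_id in CartPole-v1 Acrobot-v1 MountainCar-v0; do
        for seed in 1 2 3; do
            echo "OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py" \
                "--env-id ${env_id} --track --capture-video --seed ${seed}" \
                "--wandb-entity ${WANDB_ENTITY_NAME} --wandb-project-name benchmark" \
                "--cuda False --total-timesteps 500000"
        done
    done
}

print_cleanrl_commands
```

Setting `WANDB_ENTITY_NAME` before running the sketch changes the entity in every printed command at once.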
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
# Overview | ||
|
||
| Algorithm | Variants Implemented | | ||
| ----------- | ----------- | | ||
| ✅ [Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf) | :material-github: [`ppo.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo.py), :material-file-document: [docs](/rl-algorithms/ppo/#ppopy) | | ||
| | :material-github: [`ppo_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari.py), :material-file-document: [docs](/rl-algorithms/ppo/#ppo_ataripy) | | ||
| | :material-github: [`ppo_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_continuous_action.py), :material-file-document: [docs](/rl-algorithms/ppo/#ppo_continuous_actionpy) | | ||
| | :material-github: [`ppo_atari_lstm.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari_lstm.py) | | ||
| | :material-github: [`ppo_procgen.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_procgen.py) | | ||
| ✅ [Deep Q-Learning (DQN)](https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf) | :material-github: [`dqn.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn.py) | | ||
| | :material-github: [`dqn_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari.py) | | ||
| ✅ [Categorical DQN (C51)](https://arxiv.org/pdf/1707.06887.pdf) | :material-github: [`c51.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/c51.py) | | ||
| | :material-github: [`c51_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/c51_atari.py) | | ||
| ✅ [Apex Deep Q-Learning (Apex-DQN)](https://arxiv.org/pdf/1803.00933.pdf) | :material-github: [`apex_dqn_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/apex_dqn_atari.py) | | ||
| ✅ [Soft Actor-Critic (SAC)](https://arxiv.org/pdf/1812.05905.pdf) | :material-github: [`sac_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/sac_continuous_action.py) | | ||
| ✅ [Deep Deterministic Policy Gradient (DDPG)](https://arxiv.org/pdf/1509.02971.pdf) | :material-github: [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py) | | ||
| ✅ [Twin Delayed Deep Deterministic Policy Gradient (TD3)](https://arxiv.org/pdf/1802.09477.pdf) | :material-github: [`td3_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/td3_continuous_action.py) | | ||
|