
Commit

Add ppo.py documentation (vwxyzjn#120)
* Add PPO documentation

* Add test

* Update docs

* refactor

* Update documentation

* update docs

* Quick fix
vwxyzjn authored Feb 26, 2022
1 parent 3437186 commit b63315b
Showing 16 changed files with 283 additions and 30 deletions.
24 changes: 17 additions & 7 deletions README.md
@@ -108,13 +108,23 @@ You may also use a prebuilt development environment hosted in Gitpod:
[![Open in Gitpod](https://gitpod.io/button/open-in-gitpod.svg)](https://gitpod.io/#https://github.com/vwxyzjn/cleanrl)

## Algorithms Implemented
- [x] Deep Q-Learning (DQN)
- [x] Categorical DQN (C51)
- [x] Proximal Policy Gradient (PPO)
- [x] Soft Actor Critic (SAC)
- [x] Deep Deterministic Policy Gradient (DDPG)
- [x] Twin Delayed Deep Deterministic Policy Gradient (TD3)
- [x] Apex Deep Q-Learning (Apex-DQN)

| Algorithm | Variants Implemented |
| ----------- | ----------- |
|[Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf) | [`ppo.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo.py), [docs](https://docs.cleanrl.dev/rl-algorithms/ppo/#ppopy) |
| | [`ppo_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari.py), [docs](https://docs.cleanrl.dev/rl-algorithms/ppo/#ppo_ataripy) |
| | [`ppo_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_continuous_action.py), [docs](https://docs.cleanrl.dev/rl-algorithms/ppo/#ppo_continuous_actionpy) |
| | [`ppo_atari_lstm.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari_lstm.py) |
| | [`ppo_procgen.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_procgen.py) |
|[Deep Q-Learning (DQN)](https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf) | [`dqn.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn.py) |
| | [`dqn_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari.py) |
|[Categorical DQN (C51)](https://arxiv.org/pdf/1707.06887.pdf) | [`c51.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/c51.py) |
| | [`c51_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/c51_atari.py) |
|[Apex Deep Q-Learning (Apex-DQN)](https://arxiv.org/pdf/1803.00933.pdf) | [`apex_dqn_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/apex_dqn_atari.py) |
|[Soft Actor-Critic (SAC)](https://arxiv.org/pdf/1812.05905.pdf) | [`sac_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/sac_continuous_action.py) |
|[Deep Deterministic Policy Gradient (DDPG)](https://arxiv.org/pdf/1509.02971.pdf) | [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py) |
|[Twin Delayed Deep Deterministic Policy Gradient (TD3)](https://arxiv.org/pdf/1802.09477.pdf) | [`td3_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/td3_continuous_action.py) |


## Open RL Benchmark

38 changes: 38 additions & 0 deletions benchmark/ppo/README.md
@@ -0,0 +1,38 @@
# Proximal Policy Optimization Benchmark

This directory contains instructions for reproducing our PPO experiments with CleanRL and `openai/baselines`.

## Install CleanRL

Prerequisites:
* Python 3.8+
* [Poetry](https://python-poetry.org)

Install dependencies:

```bash
git clone https://github.com/vwxyzjn/cleanrl.git && cd cleanrl
poetry install
```

## Reproduce CleanRL's PPO Benchmark

Follow the scripts in the `cleanrl` sub-folder. Note that you may need to replace `--wandb-entity cleanrl` with your own W&B entity.

```bash
# reproduce the classic control experiments
bash cleanrl/classic_control.sh
```

## Install `openai/baselines`

Follow the instructions at our fork https://github.com/vwxyzjn/baselines to install.
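
If you just want a sense of what that involves, a minimal sketch is shown below. It assumes the usual editable-install workflow for `baselines`; prerequisites (for example the TensorFlow version) and any fork-specific steps are covered by the fork's own README, which remains the authoritative reference.

```bash
# Sketch only; follow https://github.com/vwxyzjn/baselines for the authoritative steps.
git clone https://github.com/vwxyzjn/baselines.git && cd baselines
pip install -e .
```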

## Reproduce the `openai/baselines` PPO Benchmark

Follow the scripts in the `baselines` sub-folder. Note that you may need to replace `WANDB_ENTITY=cleanrl` with your own W&B entity.

```bash
# reproduce the classic control experiments
bash baselines/classic_control_separate_networks.sh
```
84 changes: 84 additions & 0 deletions benchmark/ppo/baselines/classic_control_separate_networks.sh
@@ -0,0 +1,84 @@
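# PPO2 from the vwxyzjn/baselines fork on the classic-control envs (CartPole-v1,
# Acrobot-v1, MountainCar-v0), 3 seeds each, 500k timesteps, CPU only
# (CUDA_VISIBLE_DEVICES="-1"). --value_network='copy' gives the value function its
# own copy of the MLP; logs go to TensorBoard under ./runs and are tracked on W&B.
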
# CartPole-v1
CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \
--alg=ppo2 \
--num_timesteps=500000 \
--num_env 4 \
--env=CartPole-v1 \
--network mlp \
--value_network='copy' \
--track \
--seed 1

CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \
--alg=ppo2 \
--num_timesteps=500000 \
--env=CartPole-v1 \
--network mlp \
--value_network='copy' \
--track \
--seed 2

CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \
--alg=ppo2 \
--num_timesteps=500000 \
--env=CartPole-v1 \
--network mlp \
--value_network='copy' \
--track \
--seed 3

# Acrobot-v1
CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \
--alg=ppo2 \
--num_timesteps=500000 \
--env=Acrobot-v1 \
--network mlp \
--value_network='copy' \
--track \
--seed 1

CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \
--alg=ppo2 \
--num_timesteps=500000 \
--env=Acrobot-v1 \
--network mlp \
--value_network='copy' \
--track \
--seed 2

CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \
--alg=ppo2 \
--num_timesteps=500000 \
--env=Acrobot-v1 \
--network mlp \
--value_network='copy' \
--track \
--seed 3

# MountainCar-v0
CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \
--alg=ppo2 \
--num_timesteps=500000 \
--env=MountainCar-v0 \
--network mlp \
--value_network='copy' \
--track \
--seed 1

CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \
--alg=ppo2 \
--num_timesteps=500000 \
--env=MountainCar-v0 \
--network mlp \
--value_network='copy' \
--track \
--seed 2

CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \
--alg=ppo2 \
--num_timesteps=500000 \
--env=MountainCar-v0 \
--network mlp \
--value_network='copy' \
--track \
--seed 3
9 changes: 9 additions & 0 deletions benchmark/ppo/cleanrl/classic_control.sh
@@ -0,0 +1,9 @@
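# CleanRL ppo.py on the classic-control envs: 3 seeds each for CartPole-v1,
# Acrobot-v1, and MountainCar-v0, 500k timesteps, CPU only (--cuda False), with W&B
# tracking and video capture. OMP_NUM_THREADS=1 pins each run to a single OpenMP thread.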
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id CartPole-v1 --track --capture-video --seed 1 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id CartPole-v1 --track --capture-video --seed 2 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id CartPole-v1 --track --capture-video --seed 3 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id Acrobot-v1 --track --capture-video --seed 1 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id Acrobot-v1 --track --capture-video --seed 2 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id Acrobot-v1 --track --capture-video --seed 3 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id MountainCar-v0 --track --capture-video --seed 1 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id MountainCar-v0 --track --capture-video --seed 2 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id MountainCar-v0 --track --capture-video --seed 3 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
3 changes: 3 additions & 0 deletions cleanrl/ppo.py
@@ -1,3 +1,6 @@
# docs and experiment results can be found at
# https://docs.cleanrl.dev/rl-algorithms/ppo/#ppopy

import argparse
import os
import random
2 changes: 1 addition & 1 deletion docs/index.md
@@ -26,7 +26,7 @@ CleanRL is a Deep Reinforcement Learning library that provides high-quality sing
* Experiment Management with [Weights and Biases](https://wandb.ai/site)
* Cloud Integration with Docker and AWS

You can read more about CleanRL in our [technical paper]((https://arxiv.org/abs/2111.08819)) and [documentation](https://docs.cleanrl.dev/).
You can read more about CleanRL in our [technical paper](https://arxiv.org/abs/2111.08819) and [documentation](https://docs.cleanrl.dev/).

Good luck have fun 🚀

18 changes: 18 additions & 0 deletions docs/rl-algorithms/overview.md
@@ -0,0 +1,18 @@
# Overview

| Algorithm | Variants Implemented |
| ----------- | ----------- |
|[Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf) | :material-github: [`ppo.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo.py), :material-file-document: [docs](/rl-algorithms/ppo/#ppopy) |
| | :material-github: [`ppo_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari.py), :material-file-document: [docs](/rl-algorithms/ppo/#ppo_ataripy) |
| | :material-github: [`ppo_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_continuous_action.py), :material-file-document: [docs](/rl-algorithms/ppo/#ppo_continuous_actionpy) |
| | :material-github: [`ppo_atari_lstm.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari_lstm.py) |
| | :material-github: [`ppo_procgen.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_procgen.py) |
|[Deep Q-Learning (DQN)](https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf) | :material-github: [`dqn.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn.py) |
| | :material-github: [`dqn_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari.py) |
|[Categorical DQN (C51)](https://arxiv.org/pdf/1707.06887.pdf) | :material-github: [`c51.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/c51.py) |
| | :material-github: [`c51_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/c51_atari.py) |
|[Apex Deep Q-Learning (Apex-DQN)](https://arxiv.org/pdf/1803.00933.pdf) | :material-github: [`apex_dqn_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/apex_dqn_atari.py) |
|[Soft Actor-Critic (SAC)](https://arxiv.org/pdf/1812.05905.pdf) | :material-github: [`sac_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/sac_continuous_action.py) |
|[Deep Deterministic Policy Gradient (DDPG)](https://arxiv.org/pdf/1509.02971.pdf) | :material-github: [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py) |
|[Twin Delayed Deep Deterministic Policy Gradient (TD3)](https://arxiv.org/pdf/1802.09477.pdf) | :material-github: [`td3_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/td3_continuous_action.py) |
