
Commit

Add ppo.py documentation (vwxyzjn#120)
* Add PPO documentation

* Add test

* Update docs

* refactor

* Update documentation

* update docs

* Quick fix
vwxyzjn authored Feb 26, 2022
1 parent 3437186 commit b63315b
Showing 16 changed files with 283 additions and 30 deletions.
24 changes: 17 additions & 7 deletions README.md
@@ -108,13 +108,23 @@ You may also use a prebuilt development environment hosted in Gitpod:
[![Open in Gitpod](https://gitpod.io/button/open-in-gitpod.svg)](https://gitpod.io/#https://github.com/vwxyzjn/cleanrl)

## Algorithms Implemented
- [x] Deep Q-Learning (DQN)
- [x] Categorical DQN (C51)
- [x] Proximal Policy Gradient (PPO)
- [x] Soft Actor Critic (SAC)
- [x] Deep Deterministic Policy Gradient (DDPG)
- [x] Twin Delayed Deep Deterministic Policy Gradient (TD3)
- [x] Apex Deep Q-Learning (Apex-DQN)

| Algorithm | Variants Implemented |
| ----------- | ----------- |
|[Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf) | [`ppo.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo.py), [docs](https://docs.cleanrl.dev/rl-algorithms/ppo/#ppopy) |
| | [`ppo_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari.py), [docs](https://docs.cleanrl.dev/rl-algorithms/ppo/#ppo_ataripy) |
| | [`ppo_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_continuous_action.py), [docs](https://docs.cleanrl.dev/rl-algorithms/ppo/#ppo_continuous_actionpy) |
| | [`ppo_atari_lstm.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari_lstm.py) |
| | [`ppo_procgen.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_procgen.py) |
|[Deep Q-Learning (DQN)](https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf) | [`dqn.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn.py) |
| | [`dqn_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari.py) |
|[Categorical DQN (C51)](https://arxiv.org/pdf/1707.06887.pdf) | [`c51.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/c51.py) |
| | [`c51_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/c51_atari.py) |
|[Apex Deep Q-Learning (Apex-DQN)](https://arxiv.org/pdf/1803.00933.pdf) | [`apex_dqn_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/apex_dqn_atari.py) |
|[Soft Actor-Critic (SAC)](https://arxiv.org/pdf/1812.05905.pdf) | [`sac_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/sac_continuous_action.py) |
|[Deep Deterministic Policy Gradient (DDPG)](https://arxiv.org/pdf/1509.02971.pdf) | [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py) |
|[Twin Delayed Deep Deterministic Policy Gradient (TD3)](https://arxiv.org/pdf/1802.09477.pdf) | [`td3_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/td3_continuous_action.py) |


## Open RL Benchmark

38 changes: 38 additions & 0 deletions benchmark/ppo/README.md
@@ -0,0 +1,38 @@
# Proximal Policy Optimization Benchmark

This directory contains instructions for reproducing our PPO experiments with CleanRL and `openai/baselines`.

## Install CleanRL

Prerequisites:
* Python 3.8+
* [Poetry](https://python-poetry.org)

Install dependencies:

```bash
git clone https://github.com/vwxyzjn/cleanrl.git && cd cleanrl
poetry install
```

## Reproduce CleanRL's PPO Benchmark

Follow the scripts in the `cleanrl` sub-folder. Note that you may need to replace `--wandb-entity cleanrl` with your own W&B entity.

```bash
# reproduce the classic control experiments
bash cleanrl/classic_control.sh
```

## Install `openai/baselines`

Follow the instructions at our fork https://github.com/vwxyzjn/baselines to install.
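
If you just want a sense of what that involves, a minimal sketch is shown below. It assumes the usual editable-install workflow for `baselines`; prerequisites (for example the TensorFlow version) and any fork-specific steps are covered by the fork's own README, which remains the authoritative reference.

```bash
# Sketch only; follow https://github.com/vwxyzjn/baselines for the authoritative steps.
git clone https://github.com/vwxyzjn/baselines.git && cd baselines
pip install -e .
```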

## Reproduce the `openai/baselines` PPO Benchmark

Follow the scripts in the `baselines` sub-folder. Note that you may need to replace `WANDB_ENTITY=cleanrl` with your own W&B entity.

```bash
# reproduce the classic control experiments
bash baselines/classic_control_separate_networks.sh
```
84 changes: 84 additions & 0 deletions benchmark/ppo/baselines/classic_control_separate_networks.sh
@@ -0,0 +1,84 @@
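# PPO2 from the vwxyzjn/baselines fork on the classic-control envs (CartPole-v1,
# Acrobot-v1, MountainCar-v0), 3 seeds each, 500k timesteps, CPU only
# (CUDA_VISIBLE_DEVICES="-1"). --value_network='copy' gives the value function its
# own copy of the MLP; logs go to TensorBoard under ./runs and are tracked on W&B.
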
# CartPole-v1
CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \
--alg=ppo2 \
--num_timesteps=500000 \
--num_env 4 \
--env=CartPole-v1 \
--network mlp \
--value_network='copy' \
--track \
--seed 1

CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \
--alg=ppo2 \
--num_timesteps=500000 \
--env=CartPole-v1 \
--network mlp \
--value_network='copy' \
--track \
--seed 2

CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \
--alg=ppo2 \
--num_timesteps=500000 \
--env=CartPole-v1 \
--network mlp \
--value_network='copy' \
--track \
--seed 3

# Acrobot-v1
CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \
--alg=ppo2 \
--num_timesteps=500000 \
--env=Acrobot-v1 \
--network mlp \
--value_network='copy' \
--track \
--seed 1

CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \
--alg=ppo2 \
--num_timesteps=500000 \
--env=Acrobot-v1 \
--network mlp \
--value_network='copy' \
--track \
--seed 2

CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \
--alg=ppo2 \
--num_timesteps=500000 \
--env=Acrobot-v1 \
--network mlp \
--value_network='copy' \
--track \
--seed 3

# MountainCar-v0
CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \
--alg=ppo2 \
--num_timesteps=500000 \
--env=MountainCar-v0 \
--network mlp \
--value_network='copy' \
--track \
--seed 1

CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \
--alg=ppo2 \
--num_timesteps=500000 \
--env=MountainCar-v0 \
--network mlp \
--value_network='copy' \
--track \
--seed 2

CUDA_VISIBLE_DEVICES="-1" WANDB_PROJECT=openai-baselines-benchmark WANDB_ENTITY=cleanrl OPENAI_LOGDIR=$PWD/runs OPENAI_LOG_FORMAT=tensorboard poetry run python -m baselines.run_separate_networks \
--alg=ppo2 \
--num_timesteps=500000 \
--env=MountainCar-v0 \
--network mlp \
--value_network='copy' \
--track \
--seed 3
9 changes: 9 additions & 0 deletions benchmark/ppo/cleanrl/classic_control.sh
@@ -0,0 +1,9 @@
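# CleanRL ppo.py on the classic-control envs: 3 seeds each for CartPole-v1,
# Acrobot-v1, and MountainCar-v0, 500k timesteps, CPU only (--cuda False), with W&B
# tracking and video capture. OMP_NUM_THREADS=1 pins each run to a single OpenMP thread.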
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id CartPole-v1 --track --capture-video --seed 1 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id CartPole-v1 --track --capture-video --seed 2 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id CartPole-v1 --track --capture-video --seed 3 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id Acrobot-v1 --track --capture-video --seed 1 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id Acrobot-v1 --track --capture-video --seed 2 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id Acrobot-v1 --track --capture-video --seed 3 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id MountainCar-v0 --track --capture-video --seed 1 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id MountainCar-v0 --track --capture-video --seed 2 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id MountainCar-v0 --track --capture-video --seed 3 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
3 changes: 3 additions & 0 deletions cleanrl/ppo.py
@@ -1,3 +1,6 @@
# docs and experiment results can be found at
# https://docs.cleanrl.dev/rl-algorithms/ppo/#ppopy

import argparse
import os
import random
2 changes: 1 addition & 1 deletion docs/index.md
@@ -26,7 +26,7 @@ CleanRL is a Deep Reinforcement Learning library that provides high-quality sing
* Experiment Management with [Weights and Biases](https://wandb.ai/site)
* Cloud Integration with Docker and AWS

You can read more about CleanRL in our [technical paper]((https://arxiv.org/abs/2111.08819)) and [documentation](https://docs.cleanrl.dev/).
You can read more about CleanRL in our [technical paper](https://arxiv.org/abs/2111.08819) and [documentation](https://docs.cleanrl.dev/).

Good luck have fun 🚀

18 changes: 18 additions & 0 deletions docs/rl-algorithms/overview.md
@@ -0,0 +1,18 @@
# Overview

| Algorithm | Variants Implemented |
| ----------- | ----------- |
|[Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf) | :material-github: [`ppo.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo.py), :material-file-document: [docs](/rl-algorithms/ppo/#ppopy) |
| | :material-github: [`ppo_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari.py), :material-file-document: [docs](/rl-algorithms/ppo/#ppo_ataripy) |
| | :material-github: [`ppo_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_continuous_action.py), :material-file-document: [docs](/rl-algorithms/ppo/#ppo_continuous_actionpy) |
| | :material-github: [`ppo_atari_lstm.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari_lstm.py) |
| | :material-github: [`ppo_procgen.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_procgen.py) |
|[Deep Q-Learning (DQN)](https://web.stanford.edu/class/psych209/Readings/MnihEtAlHassibis15NatureControlDeepRL.pdf) | :material-github: [`dqn.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn.py) |
| | :material-github: [`dqn_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari.py) |
|[Categorical DQN (C51)](https://arxiv.org/pdf/1707.06887.pdf) | :material-github: [`c51.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/c51.py) |
| | :material-github: [`c51_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/c51_atari.py) |
|[Apex Deep Q-Learning (Apex-DQN)](https://arxiv.org/pdf/1803.00933.pdf) | :material-github: [`apex_dqn_atari.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/apex_dqn_atari.py) |
|[Soft Actor-Critic (SAC)](https://arxiv.org/pdf/1812.05905.pdf) | :material-github: [`sac_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/sac_continuous_action.py) |
|[Deep Deterministic Policy Gradient (DDPG)](https://arxiv.org/pdf/1509.02971.pdf) | :material-github: [`ddpg_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ddpg_continuous_action.py) |
|[Twin Delayed Deep Deterministic Policy Gradient (TD3)](https://arxiv.org/pdf/1802.09477.pdf) | :material-github: [`td3_continuous_action.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/td3_continuous_action.py) |
