Add ppo.py documentation #120

Merged · 7 commits · Feb 26, 2022
Update docs
vwxyzjn committed Feb 22, 2022
commit 774f6e3a945d91e0bac21a652975d1844faf36a8
25 changes: 25 additions & 0 deletions benchmark/ppo/README.md
# Proximal Policy Optimization Benchmark

This repository contains instructions to reproduce our PPO benchmark.

## Get started

Prerequisites:
* Python 3.8+
* [Poetry](https://python-poetry.org)

Install dependencies:

```bash
git clone https://github.com/vwxyzjn/cleanrl.git && cd cleanrl
poetry install
```

## Reproduce CleanRL's PPO Benchmark

```bash
# reproduce the classic control experiments
bash classic_control.sh

# reproduce the Atari experiments
```
9 changes: 9 additions & 0 deletions benchmark/ppo/classic_control.sh
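# Benchmark ppo.py on three classic-control environments (CartPole-v1,
# Acrobot-v1, MountainCar-v0): seeds 1-3 per environment, CPU-only
# (--cuda False), 500k timesteps, logged to the `benchmark` W&B project.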
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id CartPole-v1 --track --capture-video --seed 1 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id CartPole-v1 --track --capture-video --seed 2 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id CartPole-v1 --track --capture-video --seed 3 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id Acrobot-v1 --track --capture-video --seed 1 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id Acrobot-v1 --track --capture-video --seed 2 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id Acrobot-v1 --track --capture-video --seed 3 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id MountainCar-v0 --track --capture-video --seed 1 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id MountainCar-v0 --track --capture-video --seed 2 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
OMP_NUM_THREADS=1 poetry run python cleanrl/ppo.py --env-id MountainCar-v0 --track --capture-video --seed 3 --wandb-entity cleanrl --wandb-project-name benchmark --cuda False --total-timesteps 500000
53 changes: 27 additions & 26 deletions docs/rl-algorithms/ppo.md
# Proximal Policy Optimization (PPO)

## Overview

The [ppo.py](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo.py) has the following features:

* Works with the `Discrete` action space
* Works with envs like `CartPole-v1`

### Usage

```bash
poetry install
python cleanrl/ppo.py --help
python cleanrl/ppo.py --env-id CartPole-v1
```

### Implementation details

[ppo.py](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo.py) includes the 11 core implementation details:

1. Vectorized architecture (:material-github: [common/cmd_util.py#L22](https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/common/cmd_util.py#L22))
1. Orthogonal Initialization of Weights and Constant Initialization of biases (:material-github: [a2c/utils.py#L58](https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/a2c/utils.py#L58))
1. The Adam Optimizer's Epsilon Parameter (:material-github: [ppo2/model.py#L100](https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/model.py#L100))
1. Adam Learning Rate Annealing (:material-github: [ppo2/ppo2.py#L133-L135](https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/ppo2.py#L133-L135))
1. Generalized Advantage Estimation (:material-github: [ppo2/runner.py#L56-L65](https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/runner.py#L56-L65)); see the sketch after this list
1. Mini-batch Updates (:material-github: [ppo2/ppo2.py#L157-L166](https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/ppo2.py#L157-L166))
1. Normalization of Advantages (:material-github: [ppo2/model.py#L139](https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/model.py#L139))
1. Clipped surrogate objective (:material-github: [ppo2/model.py#L81-L86](https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/model.py#L81-L86)); see the sketch after this list
1. Value Function Loss Clipping (:material-github: [ppo2/model.py#L68-L75](https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/model.py#L68-L75))
1. Overall Loss and Entropy Bonus (:material-github: [ppo2/model.py#L91](https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/model.py#L91))
1. Global Gradient Clipping (:material-github: [ppo2/model.py#L102-L108](https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/model.py#L102-L108))
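
To make two of these details concrete, below is a minimal, self-contained sketch of detail 5 (GAE) and detail 8 (the clipped surrogate objective). It is illustrative rather than the verbatim `ppo.py` code: the function names are ours, and it assumes single-environment rollouts stored as 1-D tensors.

```python
import torch


def compute_gae(rewards, values, dones, next_value, gamma=0.99, gae_lambda=0.95):
    # Detail 5: Generalized Advantage Estimation over a rollout of T steps.
    # Assumes dones[t] == 1.0 means the episode ended at step t.
    T = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    lastgaelam = 0.0
    for t in reversed(range(T)):
        nextvalue = next_value if t == T - 1 else values[t + 1]
        nextnonterminal = 1.0 - dones[t]  # zero the bootstrap at episode ends
        delta = rewards[t] + gamma * nextvalue * nextnonterminal - values[t]
        lastgaelam = delta + gamma * gae_lambda * nextnonterminal * lastgaelam
        advantages[t] = lastgaelam
    return advantages


def ppo_clip_loss(newlogprob, oldlogprob, advantages, clip_coef=0.2):
    # Detail 8: clipped surrogate objective, returned as a loss to minimize.
    ratio = (newlogprob - oldlogprob).exp()  # pi_new(a|s) / pi_old(a|s)
    pg_loss1 = -advantages * ratio
    pg_loss2 = -advantages * torch.clamp(ratio, 1.0 - clip_coef, 1.0 + clip_coef)
    # elementwise max of the negated terms == negated min of the surrogates
    return torch.max(pg_loss1, pg_loss2).mean()
```

In `ppo.py` the same logic runs over batched rollouts from vectorized environments (detail 1), and the advantages are normalized per mini-batch (detail 7) before this loss is computed.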


### Experiment results

We conducted experiments based on :material-github: [3437186](https://github.com/vwxyzjn/cleanrl/commit/343718608eb27787039e0187322ffa4b7220e1da), and below are the average episodic returns for `ppo.py`. We compared the results against `openai/baselines`' PPO.

| Environment | `ppo.py` | `openai/baselines`' PPO |
| ----------- | ----------- | ----------- |
| CartPole-v1 | 488.75 ± 18.40 | 497.54 ± 4.02 |
| Acrobot-v1 | -82.48 ± 5.93 | -81.82 ± 5.58 |
| MountainCar-v0 | -200.00 ± 0.00 | -200.00 ± 0.00 |

<!-- <iframe src="https://wandb.ai/vwxyzjn/ppo-details/reports/ppo-py-benchmark--VmlldzoxNTkxOTc4#benchmark-results:" style="width:100%; height:500px" title="CleanRL CartPole-v1 Example"></iframe> -->

Learning curves:

<div class="grid-container">
<img src="../ppo/CartPole-v1.png">

<img src="../ppo/Acrobot-v1.png">

<img src="../ppo/MountainCar-v0.png">
</div>


Tracked experiments and game play videos:

<iframe src="https://wandb.ai/cleanrl/benchmark/reports/ppo-py-v1-Classic-Control---VmlldzoxNTk2NjE4" style="width:100%; height:500px" title="CleanRL CartPole-v1 Example"></iframe>

### Video tutorial

If you'd like to learn `ppo.py` in-depth, consider checking out the following video tutorial:
Binary file added docs/rl-algorithms/ppo/Acrobot-v1.png
Binary file added docs/rl-algorithms/ppo/CartPole-v1.png
Binary file added docs/rl-algorithms/ppo/MountainCar-v0.png
6 changes: 6 additions & 0 deletions docs/stylesheets/extra.css
.grid-container {
  display: grid;
  grid-template-columns: repeat(auto-fit, minmax(240px, 1fr));
  image-rendering: -webkit-optimize-contrast;
  grid-gap: 50px;
}
3 changes: 3 additions & 0 deletions mkdocs.yml
markdown_extensions:
  - pymdownx.superfences
  - attr_list
  - md_in_html
  - pymdownx.emoji:
      emoji_index: !!python/name:materialx.emoji.twemoji
      emoji_generator: !!python/name:materialx.emoji.to_svg
  # - toc:
  #     permalink: true
  # - markdown.extensions.codehilite: