This is standard vanilla policy gradients with stochastic policies, either continuous or discrete. I based it on the CS 294-112 starter code.
I'm using Python 3.5.2 and TensorFlow 1.2.0. This code will not work with Python 2.7.x. Note to self: when running bash scripts inside a GNU screen session, be sure to source my Python 3 conda environment first.
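Since "vanilla policy gradients" can be implemented a few ways, here's a minimal sketch of the surrogate loss I have in mind for the discrete case, in TensorFlow 1.x. The placeholder names, sizes, and learning rate are illustrative, not the starter code's:

```python
import tensorflow as tf

# Sketch of the VPG surrogate loss: minimize -E[log pi(a|s) * A], whose
# gradient is the policy gradient estimator. Names/sizes are illustrative.
obs_dim, n_actions = 4, 2
obs_ph = tf.placeholder(tf.float32, [None, obs_dim], name="obs")
act_ph = tf.placeholder(tf.int32, [None], name="act")
adv_ph = tf.placeholder(tf.float32, [None], name="adv")  # advantage estimates

hidden = tf.layers.dense(obs_ph, 50, activation=tf.nn.tanh)
logits = tf.layers.dense(hidden, n_actions)
# sparse_softmax_cross_entropy_with_logits returns exactly -log pi(a|s).
neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=act_ph, logits=logits)
surrogate_loss = tf.reduce_mean(neg_log_prob * adv_ph)
train_op = tf.train.AdamOptimizer(5e-3).minimize(surrogate_loss)
```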
Based on `bash_scripts/CartPole-v0.sh`:
Architectures:
- Policy: (input) - 50 - (output), tanh
- NN vf: (input) - 50 - 50 - (output), tanh (see the sketch below)
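Here's a rough sketch of how that NN value function could be wired up, fit by regression onto empirical returns (my own variable names, not the starter code's):

```python
import tensorflow as tf

# NN value function, (input) - 50 - 50 - (output) with tanh activations,
# trained by minimizing squared error against Monte Carlo returns.
obs_dim = 4  # CartPole-v0 observation dimension
vf_obs_ph = tf.placeholder(tf.float32, [None, obs_dim])
vf_target_ph = tf.placeholder(tf.float32, [None])  # empirical returns

h1 = tf.layers.dense(vf_obs_ph, 50, activation=tf.nn.tanh)
h2 = tf.layers.dense(h1, 50, activation=tf.nn.tanh)
value = tf.squeeze(tf.layers.dense(h2, 1), axis=1)  # scalar V(s) per state
vf_loss = tf.reduce_mean(tf.square(value - vf_target_ph))
vf_train_op = tf.train.AdamOptimizer(1e-3).minimize(vf_loss)
```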
Based on `bash_scripts/Pendulum-v0.sh`:
Architectures:
- Policy: (input) - 32 - 32 - (output), relu (see the sketch below)
- NN vf: (input) - 50 - 50 - (output), tanh
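For the continuous case, the policy above can be a Gaussian whose mean comes from the network, with a learned state-independent log-std. A sketch under those assumptions (the starter code may parameterize it differently):

```python
import tensorflow as tf

# Pendulum-v0 policy: (input) - 32 - 32 - (output) with relu, producing the
# Gaussian mean; the log-std is a free variable shared across all states.
obs_dim, act_dim = 3, 1
obs_ph = tf.placeholder(tf.float32, [None, obs_dim])
h1 = tf.layers.dense(obs_ph, 32, activation=tf.nn.relu)
h2 = tf.layers.dense(h1, 32, activation=tf.nn.relu)
mean = tf.layers.dense(h2, act_dim)
log_std = tf.get_variable("log_std", [act_dim],
                          initializer=tf.zeros_initializer())
# Reparameterized sample: a = mean + std * eps, with eps ~ N(0, I).
sampled_action = mean + tf.exp(log_std) * tf.random_normal(tf.shape(mean))
```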
I think it looks OK. Pendulum is a bit tricky to solve because it requires an adaptive learning rate, but I still get close to about -100 or so. I'm not sure what the theoretical best is; maybe zero, but that seems impossible. The NN value function is only slightly better than the linear one here, I'm guessing because the problem is so simple; the action dimension is just one.
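By "adaptive learning rate" I mean something in the spirit of the following KL-based heuristic; this is just a sketch of one reasonable scheme, not necessarily what this code does:

```python
# Adjust the step size based on the KL divergence between the pre- and
# post-update policies. (A sketch of one common heuristic, not necessarily
# what this repo implements; the target and bounds are made up.)
def adapt_stepsize(stepsize, kl, desired_kl=2e-3):
    if kl > desired_kl * 2:
        return max(stepsize / 1.5, 1e-6)  # update too aggressive, shrink
    elif kl < desired_kl / 2:
        return min(stepsize * 1.5, 1e-1)  # update too timid, grow
    return stepsize
```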
TODO: I haven't tested these with the new API ...
Tested on, in alphabetical order:
- HalfCheetah-v1
- Hopper-v1
- Walker2d-v1
The raw runs, based on `bash_scripts/halfcheetah.sh`:
And the smoothed runs:
The GAIL paper reports that HalfCheetah-v1 should get around 4463.46 ± 105.83, and we are almost reaching that level. That's interesting.
What's confusing is that the explained variance for the linear case seems to be terrible. Then why is the linear value function working at all, and why is it only barely worse than the NN value function? Hmmm ... I may want to record a video of this in action.
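For reference, explained variance here is the usual 1 - Var(returns - predictions) / Var(returns): 1 means perfect value predictions, 0 means no better than predicting a constant, and negative means worse than a constant. A tiny helper (my own, not from the starter code):

```python
import numpy as np

# Explained variance of value-function predictions vs. empirical returns.
def explained_variance(predictions, returns):
    var_returns = np.var(returns)
    if var_returns == 0:
        return np.nan  # degenerate batch; EV is undefined
    return 1.0 - np.var(returns - predictions) / var_returns
```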
I used the script in `bash_scripts/hopper.sh`. Here are the raw results:
And now the smoothed versions:
(Ho & Ermon, 2016) report in the GAIL paper that Hopper-v1 should get 3571.38 with a standard deviation of 184.20, so ... yeah, these results are a bit sub-par! But at least they're learning something. Maybe my version of TRPO will do better.
Next, Walker2d-v1. The raw runs, based on `bash_scripts/walker.sh`:
And the smoothed runs:
The GAIL paper said Walker-v1 should get around 6717.08 ± 845.62, but that might not be the same environment as Walker2d-v1. I'm not sure ... and Jonathan Ho's imitation learning code doesn't do as well either.