PPO-Algorithm

I implemented three versions of the PPO algorithm proposed by John Schulman et al. in 'Proximal Policy Optimization Algorithms' (https://arxiv.org/abs/1707.06347).

PPO without clipping or penalty (color: red)
PPO with clipped objective (color: orange)
PPO with adaptive Kullback-Leibler penalty (color: blue)
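
For reference, the three variants differ only in the surrogate objective that is maximized. The following is a minimal NumPy sketch of those objectives as they are defined in the paper, not the code from this repository. The function names, the default `eps=0.2`, and the array-based interface are assumptions made for illustration; `ratio` stands for the probability ratio between the new and old policy, `adv` for the advantage estimates, and `kl` for the per-sample KL divergence between the old and new policy.

```python
import numpy as np

def surrogate_unclipped(ratio, adv):
    """No clipping or penalty: mean of r_t * A_t."""
    return np.mean(ratio * adv)

def surrogate_clipped(ratio, adv, eps=0.2):
    """Clipped objective: mean of min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * adv, clipped * adv))

def surrogate_kl_penalty(ratio, adv, kl, beta):
    """KL-penalized objective: mean of r_t * A_t - beta * KL."""
    return np.mean(ratio * adv - beta * kl)

def update_beta(beta, mean_kl, kl_target):
    """Adaptive rule from the paper: halve beta if the measured KL is well
    below the target, double it if the KL is well above, else keep it."""
    if mean_kl < kl_target / 1.5:
        return beta / 2.0
    if mean_kl > kl_target * 1.5:
        return beta * 2.0
    return beta

# Hypothetical toy batch, for illustration only.
ratio = np.array([0.5, 1.0, 1.5])
adv = np.array([1.0, -1.0, 2.0])
kl = np.array([0.1, 0.0, 0.2])

print(surrogate_unclipped(ratio, adv))
print(surrogate_clipped(ratio, adv))
print(surrogate_kl_penalty(ratio, adv, kl, beta=1.0))
```

In a training loop these values would be negated and minimized by a gradient-based optimizer, and `update_beta` would be called once per policy update with the measured mean KL.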

We test these three versions on the 'CartPole-v1' environment.

We see that PPO with the adaptive KL penalty outperforms the other two algorithms in this example. The second plot shows that this algorithm also takes the longest to run, yet it still achieves the best reward relative to time.
PPO with the adaptive KL penalty also performs best during testing.

Note that the first two plots are smoothed.
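
The README does not say which smoothing method was used. As an illustration only, a simple moving average is one common way to smooth per-episode reward curves; the function name and the `window` parameter below are assumptions, not taken from this repository.

```python
import numpy as np

def moving_average(values, window=10):
    """Smooth a 1-D sequence with an unweighted moving average.

    Returns len(values) - window + 1 points, one per full window.
    """
    values = np.asarray(values, dtype=float)
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Hypothetical episode rewards, for illustration only.
rewards = [1, 2, 3, 4, 5]
print(moving_average(rewards, window=2))
```

A larger window gives a smoother curve but hides short-term fluctuations and shortens the plotted series.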

Reward per episode:

Reward relative to time:

Reward per test episode:
