I’ve been familiarizing myself with OpenAI’s Spinning Up and Gym repositories over the past several days, reading over the documentation and working to get my machine configured correctly.

Today, I wanted to play around with setting up an experiment, so I tried running some of the included algorithm implementations to see how things worked.

The following worked out of the box to train an agent on the CartPole environment. It uses the PPO algorithm and specifies a few hyperparameters, such as the hidden layer dimensions and the activation function:

python -m spinup.run ppo --env CartPole-v0 --hid[h] [128,128] --act tf.nn.relu --exp_name cartpole-ppo
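
As an aside, the docs suggest the same experiment can be launched from a Python script instead of the CLI. A minimal sketch, assuming the ppo function signature from the Spinning Up documentation (the logger settings here are my own):

from spinup import ppo
import tensorflow as tf
import gym

# Same experiment as the CLI invocation above (sketch)
env_fn = lambda: gym.make('CartPole-v0')
ac_kwargs = dict(hidden_sizes=[128, 128], activation=tf.nn.relu)
logger_kwargs = dict(exp_name='cartpole-ppo')

ppo(env_fn=env_fn, ac_kwargs=ac_kwargs, steps_per_epoch=4000, epochs=50,
    logger_kwargs=logger_kwargs)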

Either way, launching prints handy console output showing exactly which experiments are about to be run:

================================================================================
ExperimentGrid [cartpole-ppo] runs over parameters:

 env_name                                 [env] 

	CartPole-v0

 ac_kwargs:hidden_sizes                   [h] 

	[128, 128]

 ac_kwargs:activation                     [ac-act] 

	relu

 Variants, counting seeds:               1
 Variants, not counting seeds:           1

================================================================================

Preparing to run the following experiments...

cartpole-ppo

================================================================================

Launch delayed to give you a few seconds to review your experiments.

To customize or disable this behavior, change WAIT_BEFORE_LAUNCH in
spinup/user_config.py.

================================================================================
Running experiment:                                                             

cartpole-ppo

with kwargs:

{
    "ac_kwargs":	{
        "activation":	"relu",
        "hidden_sizes":	[
            128,
            128
        ]
    },
    "env_name":	"CartPole-v0",
    "seed":	0
}
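
The ExperimentGrid named in that output is also usable directly from Python for hyperparameter sweeps. Based on the examples in the repo, a grid over a few seeds and hidden-layer sizes would look roughly like this (the grid values are my own):

from spinup.utils.run_utils import ExperimentGrid
from spinup import ppo
import tensorflow as tf

# Sweep over 3 seeds x 2 network sizes = 6 variants
eg = ExperimentGrid(name='cartpole-ppo')
eg.add('env_name', 'CartPole-v0', '', True)  # in_name=True puts the env in the exp name
eg.add('seed', [0, 10, 20])
eg.add('ac_kwargs:hidden_sizes', [(64, 64), (128, 128)], 'h')
eg.add('ac_kwargs:activation', [tf.nn.relu], '')
eg.run(ppo)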

By default, it trains for 50 epochs of 4,000 environment steps each, which lines up with the TotalEnvInteracts values below. I’ve pasted the first and last epochs of training here:

---------------------------------------
|             Epoch |               0 |
|      AverageEpRet |            22.7 |
|          StdEpRet |            11.5 |
|          MaxEpRet |              68 |
|          MinEpRet |               8 |
|             EpLen |            22.7 |
|      AverageVVals |         -0.0188 |
|          StdVVals |          0.0517 |
|          MaxVVals |            0.12 |
|          MinVVals |          -0.229 |
| TotalEnvInteracts |           4e+03 |
|            LossPi |        1.14e-07 |
|             LossV |             283 |
|       DeltaLossPi |         -0.0268 |
|        DeltaLossV |            -172 |
|           Entropy |            0.68 |
|                KL |          0.0152 |
|          ClipFrac |           0.223 |
|          StopIter |              64 |
|              Time |            1.97 |
---------------------------------------

...

---------------------------------------
|             Epoch |              49 |
|      AverageEpRet |             198 |
|          StdEpRet |            3.87 |
|          MaxEpRet |             200 |
|          MinEpRet |             187 |
|             EpLen |             198 |
|      AverageVVals |            55.9 |
|          StdVVals |            16.8 |
|          MaxVVals |            72.5 |
|          MinVVals |           -15.1 |
| TotalEnvInteracts |           2e+05 |
|            LossPi |        1.26e-08 |
|             LossV |             318 |
|       DeltaLossPi |        -0.00653 |
|        DeltaLossV |           -38.7 |
|           Entropy |            0.53 |
|                KL |         0.00412 |
|          ClipFrac |          0.0342 |
|          StopIter |              79 |
|              Time |            84.1 |
---------------------------------------
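
A few of these metrics make more sense with PPO’s clipped surrogate objective in mind. Writing r_t(θ) for the probability ratio between the new and old policies, the policy loss being minimized is (roughly):

L_CLIP(θ) = −E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1−ε, 1+ε) Â_t ) ],  where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)

ClipFrac is the fraction of samples where the ratio hit the clipping boundary, and StopIter shows where the policy update loop cut off: if I’m reading the implementation correctly, Spinning Up stops the update early once the approximate KL divergence from the old policy exceeds a target threshold.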

You can see the average episode return starts off at 22.7 and climbs to 198 by the end of training, close to the maximum of 200 (CartPole-v0 episodes are capped at 200 steps).
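
Spinning Up also ships a plotting utility that graphs the learning curve from the saved progress file. Assuming the default data directory, something like:

python -m spinup.run plot data/cartpole-ppo/cartpole-ppo_s0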

You can then easily test the trained policy like so:

python -m spinup.run test_policy /Users/Kasim/Projects/ml/spinningup/data/cartpole-ppo/cartpole-ppo_s0

You should see something like the following:

...
Episode 0 	 EpRet 200.000 	 EpLen 200
Episode 1 	 EpRet 200.000 	 EpLen 200
Episode 2 	 EpRet 200.000 	 EpLen 200
Episode 3 	 EpRet 200.000 	 EpLen 200
Episode 4 	 EpRet 200.000 	 EpLen 200
Episode 5 	 EpRet 200.000 	 EpLen 200
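
By default, test_policy renders the environment and runs a large number of episodes. If I’m reading the utility’s flags right, something like this should do a quick, render-free evaluation (treat the flag usage as a sketch):

python -m spinup.run test_policy /Users/Kasim/Projects/ml/spinningup/data/cartpole-ppo/cartpole-ppo_s0 --episodes 10 --norender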
