This document describes how `pg_methods` is organized:
Contains implementations of common algorithms. Right now the following are implemented:

- `VanillaPolicyGradient`: contains the implementation of the REINFORCE (vanilla) policy gradient. Baselines are optional and supported.
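To give a rough idea of what a vanilla policy gradient update does, here is a minimal plain-PyTorch sketch. This is generic illustration code, not `pg_methods`' actual API; the network shape and the data are placeholders.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Illustrative REINFORCE-style update in plain PyTorch (not pg_methods' API).
policy_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-2)

states = torch.randn(10, 4)   # placeholder for states collected in a rollout
returns = torch.randn(10)     # placeholder for (baseline-corrected) returns

dist = Categorical(logits=policy_net(states))
actions = dist.sample()
loss = -(dist.log_prob(actions) * returns).mean()   # REINFORCE surrogate loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
```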
Contains various utilities to handle data collection and storage from environments. This should be the future home of experience replay and similar tools.

- `obtain_trajectories`: conducts a rollout in the environment.
- `MultiTrajectory`: stores rollouts from the environment. Has a `.torchify()` method to quickly convert the stored data for use with PyTorch (see the sketch below).
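Conceptually, a `.torchify()`-style conversion just stacks the per-step records of a rollout into tensors. A hypothetical sketch of that idea, not the actual `MultiTrajectory` implementation:

```python
import torch

# Hypothetical sketch: stack per-step rollout records into PyTorch tensors.
rollout = {"states": [], "actions": [], "rewards": []}
for _ in range(5):                      # stand-in for environment interaction
    rollout["states"].append([0.0, 1.0])
    rollout["actions"].append(1)
    rollout["rewards"].append(0.5)

states = torch.tensor(rollout["states"])    # shape (T, state_dim)
actions = torch.tensor(rollout["actions"])  # shape (T,)
rewards = torch.tensor(rollout["rewards"])  # shape (T,)
```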
This contains some interfaces to go between PyTorch and OpenAI Gym. Gym has a few data objects (`Box`, `Discrete`, etc.), and there are some utilities to automatically convert between these types and PyTorch tensors. They contain functions like `gym2pytorch` and `pytorch2gym` that allow them to work with the `PyTorchWrap` object.
- `ContinuousProcessor`: converts between the `Box` datatype and PyTorch tensors.
- `SimpleDiscreteProcessor`: converts a sample from `Discrete` into a float that can be fed into PyTorch.
- `OneHotProcessor`: converts a sample from `Discrete` into a one-hot vector.
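The conversions themselves are simple. Here is a generic sketch using standard NumPy/PyTorch calls (not the processors' exact API) of turning a `Box` sample into a float tensor and a `Discrete` sample into either a float or a one-hot vector:

```python
import numpy as np
import torch

# Generic sketch of the kinds of conversions the processors perform.
box_sample = np.array([0.1, -0.3, 0.7], dtype=np.float32)  # e.g. drawn from a Box space
box_tensor = torch.from_numpy(box_sample)                  # float tensor for PyTorch

discrete_sample, n_actions = 2, 4                          # e.g. drawn from Discrete(4)
as_float = torch.tensor([float(discrete_sample)])          # scalar float representation
one_hot = torch.zeros(n_actions)                           # one-hot representation
one_hot[discrete_sample] = 1.0
```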
There are some wrappers and parallelized Gym interfaces:

- `PyTorchWrap`: interface between a single Gym instance and PyTorch.
- `make_parallelized_gym_env`: interface to multiple Gym environments running in parallel in PyTorch.
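What the parallel interface buys you, conceptually, is a batched view of several environments: observations from all workers are stacked into one tensor so a single forward pass of the policy serves every environment. A generic sketch of that batching, not the wrapper's actual code:

```python
import numpy as np
import torch

# Generic sketch: observations from several environments batched into one tensor.
n_envs, obs_dim = 4, 3
per_env_obs = [np.random.randn(obs_dim).astype(np.float32) for _ in range(n_envs)]
obs_batch = torch.from_numpy(np.stack(per_env_obs))   # shape (n_envs, obs_dim)
# A single policy forward pass on obs_batch now produces actions for all envs.
```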
This contains some common neural networks often used as function approximators for policies. Examples are:

- `MLP_factory`: creates a simple MLP.
- `MLP_factory_two_heads`: used to create networks with a shared body and two heads with different parameters (sketched below).
- `SharedActorCritic`: (WIP) used for creating actor-critic algorithms with shared heads and bodies.
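As an illustration of the shared-body / two-head pattern that `MLP_factory_two_heads` is meant to produce, here is a plain-PyTorch sketch (not the factory's actual output):

```python
import torch
import torch.nn as nn

# Plain-PyTorch sketch of a shared body feeding two separate heads.
class TwoHeadMLP(nn.Module):
    def __init__(self, in_dim=4, hidden=32, out_dim=2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.head_a = nn.Linear(hidden, out_dim)  # e.g. policy logits or mu
        self.head_b = nn.Linear(hidden, out_dim)  # e.g. sigma or a value estimate

    def forward(self, x):
        features = self.body(x)                   # shared computation
        return self.head_a(features), self.head_b(features)

out_a, out_b = TwoHeadMLP()(torch.randn(1, 4))
```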
Contains `PolicyGradientObjective`, which is really the REINFORCE objective (maybe we should consider renaming it in a future release?), and `NaturalPolicyGradientObjective`, which is not yet implemented.
Right now this contains two baseline functions: `MovingAverageBaseline` and `FunctionApproximatorBaseline`.
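For intuition, a moving-average baseline keeps a running estimate of the average return and subtracts it from new returns to reduce the variance of the gradient estimate. A generic sketch of that idea, not the actual class:

```python
# Generic sketch of a moving-average baseline (not the actual implementation).
beta = 0.9          # how much of the old estimate to keep
baseline = 0.0
for episode_return in [1.0, 0.5, 2.0]:      # stand-in episode returns
    advantage = episode_return - baseline   # multiplies the log-probabilities
    baseline = beta * baseline + (1 - beta) * episode_return
```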
Functions to help calculate gradients for the policy gradient objectives. These are all found in `PolicyGradientObjective`, but a few things that are useful to play with are:

- `calculate_returns(rewards, discount, masks)`: calculates returns given rewards, a discount factor, and masks (see the sketch below). The arguments are usually obtained by using `MultiTrajectory`.
- `calculate_policy_gradient_terms(log_probs, advantage)`: calculates the policy gradient terms `log_prob * advantage` (no averaging happens here).
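To make the return calculation concrete, here is a generic sketch of masked, discounted returns; the masks zero out the recursion across episode boundaries. This illustrates the idea only, not the exact implementation or argument conventions:

```python
import torch

# Generic sketch of masked discounted returns: R_t = r_t + discount * mask_t * R_{t+1}.
def discounted_returns(rewards, discount, masks):
    returns = torch.zeros_like(rewards)
    running = torch.zeros(rewards.size(1))   # one running return per trajectory
    for t in reversed(range(rewards.size(0))):
        running = rewards[t] + discount * masks[t] * running
        returns[t] = running
    return returns

rewards = torch.ones(3, 2)   # (time_steps, n_trajectories)
masks = torch.ones(3, 2)     # 1 while an episode is still running, 0 after it ends
print(discounted_returns(rewards, 0.99, masks))
```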
Includes common policies used in reinforcement learning. All policies take a function approximator as the first argument; this is a torch module such as a neural network.

- Categorical policy: agent that acts randomly by sampling actions from a categorical distribution.
- Gaussian policy: actions are picked according to a Gaussian distribution parameterized by `mu` and `sigma`. Note that in this case the function approximator should return two outputs, corresponding to `mu` and `sigma` (see the sketch below).
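A generic sketch of how such a Gaussian policy can be put together in plain PyTorch (not the library's actual class; the two-headed network here is a stand-in for the function approximator):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

# Generic sketch: the approximator returns mu and log(sigma); the policy samples.
class GaussianHead(nn.Module):
    def __init__(self, in_dim=4, action_dim=1):
        super().__init__()
        self.mu = nn.Linear(in_dim, action_dim)
        self.log_sigma = nn.Linear(in_dim, action_dim)

    def forward(self, x):
        return self.mu(x), self.log_sigma(x)

mu, log_sigma = GaussianHead()(torch.randn(1, 4))
dist = Normal(mu, log_sigma.exp())
action = dist.sample()
log_prob = dist.log_prob(action)   # used in the policy gradient term
```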
- `pg_methods.utils.experiment`: contains some tools to handle experiments and set up policies quickly.
- `pg_methods.utils.logger`: should contain things to log data. Will be the future home of the TensorBoard logger, etc.
- `pg_methods.utils.plotting`: tools for plotting the results of a run.