
pg_methods organization

This document describes how pg_methods is organized:

pg_methods.algorithms

Contains implementations of common algorithms. Right now the following are implemented:

  1. VanillaPolicyGradient contains an implementation of REINFORCE, the vanilla policy gradient. Baselines are optional and supported. (A sketch of the underlying update follows below.)
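For reference, the update that a vanilla policy gradient algorithm performs looks roughly like the following, written in plain PyTorch with the classic Gym API. This is a minimal sketch of the REINFORCE idea, not pg_methods' own API; the network sizes, learning rate, and discount factor are illustrative.

```python
import gym
import torch

# Illustrative REINFORCE step; NOT pg_methods' actual interface.
env = gym.make("CartPole-v1")
policy_net = torch.nn.Sequential(
    torch.nn.Linear(4, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2)
)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-2)

# Collect one episode.
state, done = env.reset(), False
log_probs, rewards = [], []
while not done:
    logits = policy_net(torch.as_tensor(state, dtype=torch.float32))
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    state, reward, done, _ = env.step(action.item())
    log_probs.append(dist.log_prob(action))
    rewards.append(reward)

# Discounted returns G_t for every step of the episode.
returns, G = [], 0.0
for r in reversed(rewards):
    G = r + 0.99 * G
    returns.insert(0, G)
returns = torch.tensor(returns)

# REINFORCE loss: -sum_t log pi(a_t | s_t) * G_t (a baseline could be subtracted from G_t).
loss = -(torch.stack(log_probs) * returns).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```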

pg_methods.data

Contains various utilities to handle data collection and storage from environments. This should be the future home of experience replay and similar functionality.

  1. obtain_trajectories conducts a rollout in the environment.
  2. MultiTrajectory stores rollouts from the environment. It has a .torchify() method to quickly convert its internal storage into PyTorch tensors (a minimal sketch of this pattern follows).
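To illustrate the pattern MultiTrajectory follows, here is a minimal, self-contained rollout container with a torchify-style conversion. This is a sketch of the idea only, not the library's implementation.

```python
import torch

class MiniTrajectory:
    """Toy rollout container illustrating the MultiTrajectory idea (not the real class)."""
    def __init__(self):
        self.states, self.actions, self.rewards = [], [], []

    def append(self, state, action, reward):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)

    def torchify(self):
        # Convert the Python lists collected during the rollout into PyTorch tensors.
        self.states = torch.as_tensor(self.states, dtype=torch.float32)
        self.actions = torch.as_tensor(self.actions)
        self.rewards = torch.as_tensor(self.rewards, dtype=torch.float32)
```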

pg_methods.interfaces

Contains interfaces to go between PyTorch and OpenAI Gym.

Gym has a few space types: Box, Discrete, etc. There are utilities to automatically convert between these types and PyTorch tensors. They provide functions like gym2pytorch and pytorch2gym that allow them to work with the PyTorchWrap object.

  1. ContinuousProcessor: converts between the Box datatype and PyTorch tensors.
  2. SimpleDiscreteProcessor: converts a sample from Discrete into a float that can be fed into PyTorch.
  3. OneHotProcessor: converts a sample from Discrete into a one-hot vector (see the sketch below).
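To illustrate what these converters do, the same conversions can be written by hand with plain Gym spaces and PyTorch. This is a sketch of the idea, not the library code; the space shapes here are arbitrary.

```python
import numpy as np
import torch
from gym.spaces import Box, Discrete

action_space = Discrete(4)
observation_space = Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)

# Discrete sample -> one-hot PyTorch tensor (the OneHotProcessor-style conversion).
a = action_space.sample()
a_one_hot = torch.nn.functional.one_hot(torch.tensor(a), num_classes=action_space.n).float()

# Box observation -> float tensor, and back to a NumPy array the environment can consume.
obs_torch = torch.as_tensor(observation_space.sample(), dtype=torch.float32)
obs_gym = obs_torch.numpy()
```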

There are also some wrappers and parallelized Gym interfaces:

  1. PyTorchWrap: an interface between a single Gym instance and PyTorch.
  2. make_parallelized_gym_env: an interface between multiple Gym environments running in parallel and PyTorch.

pg_methods.networks

Contains common neural networks that are often used as function approximators for policies. Examples are:

  1. MLP_factory: creates a simple MLP (see the sketch below).
  2. MLP_factory_two_heads: creates networks with a shared body and two heads with different parameters.
  3. SharedActorCritic: (WIP) used for creating actor-critic algorithms with shared heads and bodies.
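For context, a factory like MLP_factory builds something equivalent to a plain torch.nn.Sequential. The helper below is a sketch of that pattern; the layer sizes and activation are assumptions, not the factory's actual defaults.

```python
import torch.nn as nn

def simple_mlp(input_size, hidden_sizes, output_size):
    """Build a plain MLP; a sketch of what an MLP_factory-style helper might return."""
    layers, in_size = [], input_size
    for hidden in hidden_sizes:
        layers += [nn.Linear(in_size, hidden), nn.ReLU()]
        in_size = hidden
    layers.append(nn.Linear(in_size, output_size))
    return nn.Sequential(*layers)

# e.g. a policy network for CartPole: 4 observation dimensions in, 2 action logits out.
policy_body = simple_mlp(input_size=4, hidden_sizes=[64, 64], output_size=2)
```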

pg_methods.objectives

Contains PolicyGradientObjective, which is really the REINFORCE objective (perhaps it should be renamed in a future release), and NaturalPolicyGradientObjective, which is not yet implemented.

pg_methods.baselines

Right now this contains two baseline functions: MovingAverageBaseline and FunctionApproximatorBaseline.
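As a rough picture of what a moving-average baseline does, here is a sketch of a state-independent baseline that tracks an exponential moving average of observed returns. The decay parameter and update rule are assumptions for illustration, not the library's exact implementation.

```python
class ExponentialMovingAverageBaseline:
    """Sketch of a moving-average baseline: a running mean of observed returns."""
    def __init__(self, decay=0.99):
        self.decay = decay
        self.value = 0.0

    def update(self, mean_return):
        # Blend the newest batch's mean return into the running average.
        self.value = self.decay * self.value + (1 - self.decay) * mean_return

    def __call__(self, state=None):
        # A constant (state-independent) value to subtract from the returns.
        return self.value
```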

pg_methods.gradients

Contains functions to help calculate gradients for the policy gradient objectives. These are all used inside PolicyGradientObjective, but a few are useful to play with on their own:

  1. calculate_returns(rewards, discount, masks): calculates returns given rewards, a discount factor, and masks. The arguments are usually obtained from MultiTrajectory.
  2. calculate_policy_gradient_terms(log_probs, advantage): calculates the per-step policy gradient terms log_prob * advantage (no averaging happens here). See the sketch below.
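The math behind these helpers is standard. A hand-rolled version, assuming rewards and masks are laid out as (time, ...) tensors, might look like the following sketch; the function names here are illustrative and are not the library's implementation.

```python
import torch

def discounted_returns(rewards, discount, masks):
    """Compute G_t = r_t + discount * mask_t * G_{t+1} along the time dimension."""
    returns = torch.zeros_like(rewards)
    running = torch.zeros_like(rewards[0])
    for t in reversed(range(rewards.size(0))):
        running = rewards[t] + discount * masks[t] * running
        returns[t] = running
    return returns

def policy_gradient_terms(log_probs, advantages):
    """Per-step terms log pi(a_t | s_t) * A_t; the negative mean is typically minimized."""
    return log_probs * advantages.detach()
```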

pg_methods.policies

Includes common policies used in reinforcement learning.

All policies take a function approximator as their first argument. This is a torch module, such as a neural network.

RandomPolicy

A categorical policy that acts randomly.

CategoricalPolicy

Actions are sampled from a categorical distribution over a discrete action set.

BernoulliPolicy

Actions are sampled from a Bernoulli distribution.

GaussianPolicy

Actions are picked according to a Gaussian distribution parameterized by mu and sigma. Note that in this case the function approximator should return two outputs, corresponding to mu and sigma (see the sketch below).
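A two-output function approximator of the kind GaussianPolicy expects might look like the following sketch; MLP_factory_two_heads presumably builds something similar, and the exact output format here is an assumption.

```python
import torch
import torch.nn as nn

class TwoHeadedGaussianNet(nn.Module):
    """Sketch of a network returning (mu, sigma) for a Gaussian policy."""
    def __init__(self, obs_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mu_head = nn.Linear(hidden, action_dim)
        self.log_sigma_head = nn.Linear(hidden, action_dim)

    def forward(self, obs):
        features = self.body(obs)
        mu = self.mu_head(features)
        sigma = self.log_sigma_head(features).exp()  # exponentiate to keep sigma positive
        return mu, sigma

net = TwoHeadedGaussianNet(obs_dim=3, action_dim=1)
mu, sigma = net(torch.randn(1, 3))
action = torch.distributions.Normal(mu, sigma).sample()
```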

pg_methods.utils

  1. pg_methods.utils.experiment: contains tools to handle experiments and set up policies quickly.
  2. pg_methods.utils.logger: should contain things to log data. Will be the future home of the TensorBoard logger, etc.
  3. pg_methods.utils.plotting: tools for plotting the results of a run.