The goal is to train an RL system that learns a difference of value functions in order to perform effectively in the presence of simulation and approximation errors, i.e., when there is a mismatch between the simulated and target domains. This addresses the OpenAI Request for Research problem "Difference of Value Functions".
The main idea comes from the 1997 paper *Differential Training of Rollout Policies* by Bertsekas. The paper introduces a technique called differential training and argues that, under simulation and approximation error, learning a difference of value functions can do better than learning vanilla value functions.
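Roughly speaking, where standard training fits a value function J(x), differential training fits a two-argument function G(x, y) from paired rollouts, so that noise common to both rollouts cancels. The notation below paraphrases the idea rather than quoting the paper:

```latex
% Differential training learns a function of a pair of states that
% approximates a difference of values rather than the values themselves:
\[
  G(x, y) \approx J(x) - J(y)
\]
```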
Instead of learning a difference of state-value functions as suggested by Bertsekas, in this work I introduce a variant of DDPG (Deep Deterministic Policy Gradients) that, rather than learning a Q(state, action) function, learns a difference-of-Q function Q(state1, action1, state2, action2), which approximates the difference of expected Q-values between two (state, action) pairs under the current policy. We use the gradient from this function to train the policy network in DDPG.
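The following is a minimal TensorFlow Eager sketch of what such a difference-of-Q critic and its use in the actor update could look like. The layer sizes, the TD-style target, and the choice of comparison pair in the actor loss are illustrative assumptions, not the exact implementation in this repository.

```python
# Sketch of a difference-of-Q critic for DDPG (illustrative only).
# Assumes eager execution is enabled (default in TF 2.x; tf.enable_eager_execution()
# under TF 1.x nightlies).
import tensorflow as tf

class DiffQCritic(tf.keras.Model):
    """Q(s1, a1, s2, a2): approximates Q(s1, a1) - Q(s2, a2) under the current policy."""
    def __init__(self, hidden=64):
        super(DiffQCritic, self).__init__()
        self.h1 = tf.keras.layers.Dense(hidden, activation='relu')
        self.h2 = tf.keras.layers.Dense(hidden, activation='relu')
        self.out = tf.keras.layers.Dense(1)

    def call(self, s1, a1, s2, a2):
        # Concatenate both (state, action) pairs and regress a single scalar difference.
        x = tf.concat([s1, a1, s2, a2], axis=-1)
        return self.out(self.h2(self.h1(x)))

def critic_loss(critic, target_critic, actor, batch, gamma=0.99):
    """TD-style loss on pairs of transitions (an assumed target, not from the paper):
       Q_diff(s1,a1,s2,a2) ~ (r1 - r2) + gamma * Q_diff(s1', mu(s1'), s2', mu(s2'))."""
    s1, a1, r1, s1n, s2, a2, r2, s2n = batch
    target = (r1 - r2) + gamma * target_critic(s1n, actor(s1n), s2n, actor(s2n))
    return tf.reduce_mean(tf.square(critic(s1, a1, s2, a2) - tf.stop_gradient(target)))

def actor_loss(critic, actor, batch):
    """Push the policy to increase Q(s1, mu(s1)) relative to a fixed comparison pair."""
    s1, a1, r1, s1n, s2, a2, r2, s2n = batch
    return -tf.reduce_mean(critic(s1, actor(s1), s2, a2))
```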
The mismatch between simulated and target domains is modeled using Mujoco agents with varying torso masses, similar to EPOpt. As in EPOpt, we train on an ensemble of robot models. We use the Mujoco physics simulator and train on the HalfCheetah-v1 environment.
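A sketch of how such an ensemble could be generated by scaling the torso mass of a Gym/Mujoco environment. The access via `env.unwrapped.model.body_mass`, the torso body index, and the sampling range are assumptions for illustration; the exact mechanism depends on the gym/mujoco-py versions installed.

```python
# Illustrative sketch: create HalfCheetah environments with scaled torso mass
# (assumes a gym/mujoco-py combination where model.body_mass can be written in place).
import gym
import numpy as np

def make_perturbed_env(mass_scale, env_id='HalfCheetah-v1'):
    env = gym.make(env_id)
    model = env.unwrapped.model
    body_mass = np.array(model.body_mass)   # copy of per-body masses
    torso_idx = 1                            # assumed torso index; check model.body_names
    body_mass[torso_idx] *= mass_scale
    model.body_mass[:] = body_mass           # write the perturbed masses back
    return env

# Ensemble of models with torso masses scaled by factors drawn from [0.5, 1.5].
ensemble = [make_perturbed_env(s) for s in np.random.uniform(0.5, 1.5, size=5)]
```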
We use a TensorFlow Eager adaptation of the OpenAI Baselines implementation of Deep Deterministic Policy Gradients (DDPG) as the baseline.
Porting the model to TensorFlow Eager gives a more Pythonic expression of the model (define-by-run as opposed to define-and-run) and makes it easier to debug in many cases.
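As a small illustration of the define-by-run style (not code from the baseline itself): operations execute immediately, so intermediate tensors can be inspected inside ordinary Python control flow instead of through `Session.run`.

```python
# Tiny illustration of define-by-run debugging in eager mode (not baseline code).
# Assumes eager execution is enabled (tf.enable_eager_execution() on TF 1.x nightlies).
import tensorflow as tf

q_values = tf.constant([[0.3], [1.2], [-0.5]])
td_error = q_values - tf.reduce_mean(q_values)
print(td_error.numpy())   # tensors have concrete values right away, no Session.run needed
```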
- Install OpenAI Gym and Mujoco (Mujoco requires a license).
- Install TensorFlow from a nightly build (nightly builds are needed for TF Eager unless you have TensorFlow >= 1.5); a post-install sanity check is sketched after this list.
- Install pybullet
- Install numpy
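After installing the dependencies, a quick sanity check along these lines can confirm that the environment and eager execution are available (the exact eager-enabling call depends on the TensorFlow version):

```python
# Quick post-install sanity check (illustrative; adjust to your TF version).
import numpy as np
import gym
import tensorflow as tf

tf.enable_eager_execution()              # TF 1.x nightlies; eager is on by default in TF 2.x

env = gym.make('HalfCheetah-v1')         # requires Mujoco and a valid license
obs = env.reset()
print('observation shape:', np.shape(obs))
print('eager enabled:', tf.executing_eagerly())
```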
Apply the concept of differential training to other Deep RL methods and see if this gives us benefits in the presence of simulation error.