The goal is to train an RL system that learns a difference of value functions in order to perform effectively in the presence of simulation and approximation errors, i.e., when there is a mismatch between the simulated and target domains. This addresses the OpenAI Request for Research problem "Difference of Value Functions".
The main idea comes from the 1997 paper *Differential Training of Rollout Policies* by Bertsekas. The paper introduces a technique called differential training and argues that, under simulation and approximation error, learning a difference of value functions can do better than learning vanilla value functions.
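Roughly speaking, where standard training fits a value function J(x), differential training fits a two-argument function G(x, y) from paired rollouts, so that noise common to both rollouts cancels. The notation below paraphrases the idea rather than quoting the paper:

```latex
% Differential training learns a function of a pair of states that
% approximates a difference of values rather than the values themselves:
\[
  G(x, y) \approx J(x) - J(y)
\]
```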
Instead of learning a difference of state-value functions as suggested by Bertsekas, in this work I introduce a variant of DDPG (Deep Deterministic Policy Gradients) that, rather than learning a Q(state, action) function, learns a difference-of-Q function Q(state1, action1, state2, action2), which approximates the difference of expected Q-values between two (state, action) pairs under the current policy. We use the gradient from this function to train the policy network in DDPG.
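The following is a minimal TensorFlow Eager sketch of what such a difference-of-Q critic and its use in the actor update could look like. The layer sizes, the TD-style target, and the choice of comparison pair in the actor loss are illustrative assumptions, not the exact implementation in this repository.

```python
# Sketch of a difference-of-Q critic for DDPG (illustrative only).
# Assumes eager execution is enabled (default in TF 2.x; tf.enable_eager_execution()
# under TF 1.x nightlies).
import tensorflow as tf

class DiffQCritic(tf.keras.Model):
    """Q(s1, a1, s2, a2): approximates Q(s1, a1) - Q(s2, a2) under the current policy."""
    def __init__(self, hidden=64):
        super(DiffQCritic, self).__init__()
        self.h1 = tf.keras.layers.Dense(hidden, activation='relu')
        self.h2 = tf.keras.layers.Dense(hidden, activation='relu')
        self.out = tf.keras.layers.Dense(1)

    def call(self, s1, a1, s2, a2):
        # Concatenate both (state, action) pairs and regress a single scalar difference.
        x = tf.concat([s1, a1, s2, a2], axis=-1)
        return self.out(self.h2(self.h1(x)))

def critic_loss(critic, target_critic, actor, batch, gamma=0.99):
    """TD-style loss on pairs of transitions (an assumed target, not from the paper):
       Q_diff(s1,a1,s2,a2) ~ (r1 - r2) + gamma * Q_diff(s1', mu(s1'), s2', mu(s2'))."""
    s1, a1, r1, s1n, s2, a2, r2, s2n = batch
    target = (r1 - r2) + gamma * target_critic(s1n, actor(s1n), s2n, actor(s2n))
    return tf.reduce_mean(tf.square(critic(s1, a1, s2, a2) - tf.stop_gradient(target)))

def actor_loss(critic, actor, batch):
    """Push the policy to increase Q(s1, mu(s1)) relative to a fixed comparison pair."""
    s1, a1, r1, s1n, s2, a2, r2, s2n = batch
    return -tf.reduce_mean(critic(s1, actor(s1), s2, a2))
```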
The mismatch between simulated and target domains is modeled using Mujoco agents with varying torso masses, similar to EPOpt. As in EPOpt, we train on an ensemble of robot models. We use the Mujoco physics simulator and train on the HalfCheetah-v1 environment.
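A sketch of how such an ensemble could be generated by scaling the torso mass of a Gym/Mujoco environment. The access via `env.unwrapped.model.body_mass`, the torso body index, and the sampling range are assumptions for illustration; the exact mechanism depends on the gym/mujoco-py versions installed.

```python
# Illustrative sketch: create HalfCheetah environments with scaled torso mass
# (assumes a gym/mujoco-py combination where model.body_mass can be written in place).
import gym
import numpy as np

def make_perturbed_env(mass_scale, env_id='HalfCheetah-v1'):
    env = gym.make(env_id)
    model = env.unwrapped.model
    body_mass = np.array(model.body_mass)   # copy of per-body masses
    torso_idx = 1                            # assumed torso index; check model.body_names
    body_mass[torso_idx] *= mass_scale
    model.body_mass[:] = body_mass           # write the perturbed masses back
    return env

# Ensemble of models with torso masses scaled by factors drawn from [0.5, 1.5].
ensemble = [make_perturbed_env(s) for s in np.random.uniform(0.5, 1.5, size=5)]
```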
We use a TensorFlow Eager adaptation of the OpenAI Baselines implementation of Deep Deterministic Policy Gradients (DDPG) as the baseline.
Porting the model to TensorFlow Eager gives a more Pythonic expression of the model (define-by-run as opposed to define-and-run) and makes it easier to debug in many cases.
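As a small illustration of the define-by-run style (not code from the baseline itself): operations execute immediately, so intermediate tensors can be inspected inside ordinary Python control flow instead of through `Session.run`.

```python
# Tiny illustration of define-by-run debugging in eager mode (not baseline code).
# Assumes eager execution is enabled (tf.enable_eager_execution() on TF 1.x nightlies).
import tensorflow as tf

q_values = tf.constant([[0.3], [1.2], [-0.5]])
td_error = q_values - tf.reduce_mean(q_values)
print(td_error.numpy())   # tensors have concrete values right away, no Session.run needed
```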
- Install OpenAI Gym and Mujoco (Mujoco requires a license).
- Install TensorFlow from a nightly build (nightly builds are needed for TF Eager unless you have TensorFlow >= 1.5); a post-install sanity check is sketched after this list.
- Install pybullet
- Install numpy
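After installing the dependencies, a quick sanity check along these lines can confirm that the environment and eager execution are available (the exact eager-enabling call depends on the TensorFlow version):

```python
# Quick post-install sanity check (illustrative; adjust to your TF version).
import numpy as np
import gym
import tensorflow as tf

tf.enable_eager_execution()              # TF 1.x nightlies; eager is on by default in TF 2.x

env = gym.make('HalfCheetah-v1')         # requires Mujoco and a valid license
obs = env.reset()
print('observation shape:', np.shape(obs))
print('eager enabled:', tf.executing_eagerly())
```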
Apply the concept of differential training to other Deep RL methods and see if this gives us benefits in the presence of simulation error.