This is a PyTorch implementation of TSDA from
An Investigation of Time Reversal Symmetry in Reinforcement Learning by
Brett Barkley, Amy Zhang, and David Fridovich-Keil.
This repository is built as an extension of the PyTorch implementation of
Improving Sample Efficiency in Model-Free Reinforcement Learning from Images by
Denis Yarats, Amy Zhang, Ilya Kostrikov, Brandon Amos, Joelle Pineau, Rob Fergus.
If you use this repo in your research, please consider citing the paper as follows:
@article{barkley2023TSDA,
title={An Investigation of Time Reversal Symmetry in Reinforcement Learning},
author={Brett Barkley and Amy Zhang and David Fridovich-Keil},
year={2023},
eprint={2311.17008},
archivePrefix={arXiv}
}
We assume you have access to a GPU that can run CUDA 9.2. Then, the simplest way to install all required dependencies is to create an anaconda environment by running the following in the top level directory:
conda env create -f conda_env.yml
After the installation ends you can activate your environment with:
source activate tsda
To train an SAC+AE agent on the cheetah run task from image-based observations, run:
python train.py \
--domain_name cheetah \
--task_name run \
--encoder_type pixel \
--decoder_type pixel \
--action_repeat 4 \
--save_video \
--save_tb \
--work_dir ./log \
--seed 1
This will produce a 'log' folder, where all the outputs are going to be stored, including train/eval logs, tensorboard blobs, and evaluation episode videos. You can attach tensorboard to monitor training by running:
tensorboard --logdir log
and opening up tensorboard in your browser.
The console output is also available in the form:
| train | E: 1 | S: 1000 | D: 0.8 s | R: 0.0000 | BR: 0.0000 | ALOSS: 0.0000 | CLOSS: 0.0000 | RLOSS: 0.0000
a training entry decodes as:
train - training episode
E - total number of episodes
S - total number of environment steps
D - duration in seconds to train 1 episode
R - episode reward
BR - average reward of sampled batch
ALOSS - average loss of actor
CLOSS - average loss of critic
RLOSS - average reconstruction loss (only present when training from pixels with a decoder)
while an evaluation entry:
| eval | S: 0 | ER: 21.1676
tells the expected reward ER obtained by evaluating the current policy after S environment steps. Note that ER is the average evaluation performance over num_eval_episodes episodes (usually 10).
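If you want to post-process these console logs (for quick plotting or sanity checks), the entries can be parsed mechanically. The snippet below is a minimal sketch based only on the sample lines above; the field layout is an assumption inferred from those samples rather than a documented format:

# Minimal sketch: parse a console log line of the form shown above into a dict.
# Field names and the trailing unit in 'D: 0.8 s' are assumptions based on the
# sample entries, not a documented interface of this repo.
def parse_log_line(line):
    """Split '| train | E: 1 | S: 1000 | ...' into ('train', {'E': 1.0, ...})."""
    fields = [f.strip() for f in line.strip().strip('|').split('|')]
    mode = fields[0]  # 'train' or 'eval'
    values = {}
    for field in fields[1:]:
        key, _, value = field.partition(':')
        # Drop trailing unit tokens such as the 's' in 'D: 0.8 s'.
        values[key.strip()] = float(value.split()[0])
    return mode, values

mode, stats = parse_log_line(
    '| train | E: 1 | S: 1000 | D: 0.8 s | R: 0.0000 | BR: 0.0000 '
    '| ALOSS: 0.0000 | CLOSS: 0.0000 | RLOSS: 0.0000')
print(mode, stats['S'], stats['R'])  # train 1000.0 0.0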
Empirical evaluations showcase how the synthetic transitions provided by TSDA enhance the sample efficiency of RL agents in time-reversible scenarios without friction or contact. In environments where the assumptions of TSDA are not globally satisfied, we find that TSDA can significantly degrade sample efficiency and policy performance, but it can also improve sample efficiency under the right conditions. Ultimately, we conclude that time symmetry shows promise in enhancing the sample efficiency of reinforcement learning if the environment and reward structures are of an appropriate form for TSDA to be employed effectively.
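As a rough illustration of the idea (not the exact augmentation implemented in this repo), a time-symmetric augmentation of a replay-buffer transition might look like the sketch below, where reverse_action and reverse_reward are hypothetical placeholders for an environment-specific reverse-action map and the reward under time reversal:

import numpy as np

def reverse_action(action):
    # Hypothetical reverse-action map: for many frictionless systems the
    # time-reversed control is simply the negated action. This is an
    # assumption for illustration, not the paper's construction.
    return -np.asarray(action)

def reverse_reward(reward):
    # Hypothetical placeholder: assumes the reward is invariant under time
    # reversal, which only holds for suitably symmetric reward structures.
    return reward

def augment_with_time_reversal(transition):
    """Given (obs, action, reward, next_obs), also return a synthetic
    time-reversed transition (next_obs, reversed action, reward, obs)."""
    obs, action, reward, next_obs = transition
    reversed_transition = (next_obs, reverse_action(action),
                           reverse_reward(reward), obs)
    return [transition, reversed_transition]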
One hypothesis of particular note is that TSDA can lead agents to favor early exploitation over exploration in time-symmetric environments, e.g., when important exploratory actions lie at the edge of the action space. The two videos below depict typical training progress in evaluation episodes with and without TSDA.