Simple demo implementation of meta-RL in PyTorch.
- utils.py defines the agent and the task environment
- main.ipynb has a simple training loop and an evaluation of the learnt agent
- Three variants of meta-learning on bandits are implemented:
  - easy: the bandit switches only between episodes, i.e. within a given episode the arm with the highest reward probability is fixed
  - medium: the bandit switches once, at a given step of the episode
  - difficult: the bandit switches within an episode with probability p
- Policy gradient method:
  - the target can be REINFORCE (MC) or ActorCritic (TD)
- Forward layers are computed in parallel by folding the time dimension into the batch dimension
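
The easy, medium, and difficult bandit variants listed above could be sketched roughly as follows. This is not the code in utils.py: the class name `SwitchingBandit`, the two-arm Bernoulli setup, and the parameters `p_high`, `switch_step`, and `p_switch` are all hypothetical, and the difficult variant is assumed to flip the best arm independently at every step with probability p.

```python
import random


class SwitchingBandit:
    """Two-armed Bernoulli bandit whose better arm can change (illustrative sketch)."""

    def __init__(self, mode="easy", p_high=0.9, switch_step=50, p_switch=0.05, seed=None):
        if mode not in ("easy", "medium", "difficult"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode
        self.p_high = p_high            # reward probability of the better arm (assumed)
        self.switch_step = switch_step  # step at which "medium" flips the arms (assumed)
        self.p_switch = p_switch        # per-step flip probability for "difficult" (assumed)
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # A new episode draws a new best arm; in "easy" this is the only switch.
        self.best_arm = self.rng.randrange(2)
        self.t = 0

    def step(self, action):
        # Decide whether the best arm flips before this pull.
        if self.mode == "medium" and self.t == self.switch_step:
            self.best_arm = 1 - self.best_arm
        elif self.mode == "difficult" and self.rng.random() < self.p_switch:
            self.best_arm = 1 - self.best_arm
        p_reward = self.p_high if action == self.best_arm else 1.0 - self.p_high
        reward = 1.0 if self.rng.random() < p_reward else 0.0
        self.t += 1
        return reward
```

A meta-learning agent would call `reset()` at the start of each episode and has to infer the current best arm from the reward history alone, since `best_arm` is never part of the observation.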
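
To make the difference between the two policy gradient targets concrete, here is a minimal sketch in plain Python. The function names `mc_targets` and `td_targets`, the discount `gamma`, and the use of a one-step TD target are assumptions for illustration; the repository may compute its targets differently. The `values` argument stands in for the critic's value estimates at each step.

```python
def mc_targets(rewards, gamma=0.9):
    """Monte Carlo returns G_t = r_t + gamma * G_{t+1}, the REINFORCE-style target."""
    returns = []
    g = 0.0
    # Accumulate the discounted return backwards from the last step.
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns


def td_targets(rewards, values, gamma=0.9):
    """One-step TD targets r_t + gamma * V(s_{t+1}), taking V = 0 after the final step."""
    next_values = list(values[1:]) + [0.0]
    return [r + gamma * v for r, v in zip(rewards, next_values)]
```

The MC target uses only observed rewards, so it is unbiased but noisy over long episodes. The TD target bootstraps from the critic, which lowers variance at the cost of bias while the critic is still inaccurate.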
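
Folding time into the batch dimension can be illustrated with a small PyTorch sketch. It assumes the agent has a recurrent core (here an `nn.LSTM`) followed by a non-recurrent linear policy head; the layer choices, the variable names, and all sizes are hypothetical and not taken from utils.py.

```python
import torch
import torch.nn as nn

# Illustrative sizes: time steps, batch size, input features, hidden units, actions.
T, B, F, H, A = 5, 3, 4, 8, 2

rnn = nn.LSTM(input_size=F, hidden_size=H)  # recurrent core, inherently sequential in time
policy_head = nn.Linear(H, A)               # non-recurrent layer, independent across time steps

x = torch.randn(T, B, F)                    # (time, batch, features)
h, _ = rnn(x)                               # hidden states for every step: (T, B, H)

# Fold time into the batch dimension: one (T*B, H) pass instead of T passes of (B, H).
logits_folded = policy_head(h.reshape(T * B, H)).reshape(T, B, A)

# Reference computation: apply the same layer one time step at a time.
logits_loop = torch.stack([policy_head(h[t]) for t in range(T)])
```

Both paths give the same logits, but the folded version issues a single large matrix multiply. Note that `nn.Linear` already accepts inputs with extra leading dimensions, so the explicit reshape matters mostly for layers that expect a fixed input rank.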