
How to use the configuration for workflows


Motivation

Different algorithms have different requirements on how often to collect experiences and train a model. For example, the ARS algorithm has a rollout requirement (i.e. how many episodes it should run before training). In addition, learning algorithms can be defined as on- or off-policy. While Q-learning (e.g. DQN) is always considered off-policy, other algorithms such as SARSA, V-Trace, and PPO are considered on-policy. In these cases the relationship between when a model is generated and when it is used matters. To this end, we have created several variables that can be used to configure how and when data is collected from agents, passed to the learner, and models are trained.

Batch and Model Configuration Variables

The following variables are configurable within the JSON files.

  • batch_step_frequency : Int - This scales how often an agent sends its experiences back to the learner. When set to 1, a batch is sent to the learner at each environment step. When the learner receives a batch, it immediately trains on that batch of data.
      • As an example, with the sync learner and a batch_step_frequency of 1, the total number of training calls should be episodes * steps (assuming no convergence cutoff).
      • When set > 1, the agent will send a batch every batch_step_frequency steps until the episode is finished. An agent will always send its data when the episode finishes, even when batch_step_frequency > 1.
      • To set batch_step_frequency to the total number of steps per episode, set it to -1.
  • batch_episode_frequency : Int - This is used to create rollouts. In this case, we scale how often an agent sends data back to the learner when an episode is complete. Each agent will run batch_episode_frequency episodes before sending data back to the learner. The learner trains on each batch as it receives it.
      • When batch_episode_frequency > 1, batch_step_frequency is ignored.
  • episode_block : Bool - This is an attempt to simulate on-policy learning. When set to true, the learner will block until it has received a batch of data from each agent. The learner calls train on each batch individually, but the model is not propagated until all batches have been processed.

Historically, over the course of ExaRL, the default values have been batch_step_frequency = 1, batch_episode_frequency = 1, and episode_block = False.
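
As a minimal sketch of how these variables might appear in a configuration file (the surrounding structure and file name are assumptions, not ExaRL's exact layout), the historical defaults could be written out like this:

```python
import json

# Hypothetical sketch: the key names below match the variables described
# above; the file name and overall structure are assumptions and may
# differ from your actual ExaRL configuration.
workflow_config = {
    "batch_step_frequency": 1,     # send a batch to the learner every step
    "batch_episode_frequency": 1,  # send after every completed episode (no rollouts)
    "episode_block": False,        # learner does not block waiting on all agents
}

with open("workflow_config.json", "w") as f:
    json.dump(workflow_config, f, indent=2)
```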

Convergence Cutoff

We currently have two conditions for ending training of an ExaRL environment.

  1. The number of completed episodes the learner observes >= n_episodes (JSON configuration). We currently round the number of episodes to an even number per rank.

  2. We can also end training based on learning convergence, using the rolling_reward_length and cutoff config variables. We determine whether we have converged by taking the rolling average of the absolute value of the episode reward differences across the last rolling_reward_length episodes. If this value is <= the cutoff value, we terminate execution. To turn the cutoff off, set the cutoff configuration to -1. A sketch of this check appears after the variable summary below.

To summarize the config variables:

  • n_episodes : Int - The maximum number of episodes to run.

  • rolling_reward_length : Int - The number of episodes over which to take a rolling average of the absolute difference from the previous episode's reward. We also use this value as the rolling window length when generating plots.

  • cutoff : Float - The value to compare the rolling average of absolute differences against.
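
As a rough sketch of the convergence check described above (the function and variable names here are illustrative and are not ExaRL's internal API), the cutoff test could be computed as follows:

```python
def has_converged(episode_rewards, rolling_reward_length, cutoff):
    """Sketch of the cutoff test: average the absolute reward differences
    over the last rolling_reward_length episodes and compare to cutoff.
    A cutoff of -1 disables the check."""
    if cutoff == -1:
        return False
    if len(episode_rewards) < rolling_reward_length + 1:
        return False  # not enough history yet
    recent = episode_rewards[-(rolling_reward_length + 1):]
    diffs = [abs(b - a) for a, b in zip(recent[:-1], recent[1:])]
    return sum(diffs) / len(diffs) <= cutoff
```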

Additional New Configuration Variables

  • clip_rewards : Bool - When set to true, rewards are clipped to the range [-1, 1].
  • train_frequency : Int - This flag must be supported by a particular learning algorithm/agent. It changes how often the learner updates the model once it receives a batch. Again, this MUST BE IMPLEMENTED IN THE AGENT (see the sketch below).
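
Since train_frequency must be implemented by the agent itself, the following sketch only illustrates one way an agent might honor clip_rewards and train_frequency; the class and method names are hypothetical and do not come from ExaRL's agent API.

```python
import numpy as np

class SketchAgent:
    """Illustrative only: shows how clip_rewards and train_frequency
    could be honored inside an agent's update path."""

    def __init__(self, clip_rewards=True, train_frequency=1):
        self.clip_rewards = clip_rewards
        self.train_frequency = train_frequency
        self._batches_received = 0

    def remember(self, state, action, reward, next_state, done):
        if self.clip_rewards:
            reward = float(np.clip(reward, -1.0, 1.0))  # clip to [-1, 1]
        # ... store the transition in the replay buffer ...

    def on_batch(self, batch):
        self._batches_received += 1
        # Only update the model every train_frequency batches received.
        if self._batches_received % self.train_frequency == 0:
            self.train(batch)

    def train(self, batch):
        pass  # model update would go here
```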

Performance Implications

There are some interesting trade-offs to explore between how often to train and the total number of episodes. Since training is our current bottleneck, changing how often we train will affect the total runtime. At the same time, reducing the number of models we generate also increases the number of episodes required to run. We explore this at a high level by looking at several configurations, comparing the training time versus the number of episodes run.

The following exploration uses CartPole and three versions of DQN with the async workflow. We vary the following parameters:

  • cutoff 0.00001
  • rolling_reward_length 25
  • steps 100
  • learners 1
  • batch_step_frequency [1 2 5 10 50 100]
  • train_frequency [1 2 5 10 100]
  • batch_size [32 128 256 512]
  • ranks [2 5 9 17 24]
  • episode_block [False True]

All of the results presented converged with a rolling reward of 100. We present a subset of the configurations; the plotted configurations include all of the batch sizes and ranks. These results were collected on PNNL's Bluesky system.

(Figures: training time versus number of episodes for the explored configurations.)

An interesting conclusion we see from the data is that in some cases training less often can increase the number of episodes while reducing overall training time. This trade-off, however, bottoms out at a certain point, where the increased number of episodes raises the overall time or the run is even unable to converge to the maximum reward. The opposite is also visible when train_frequency and batch_step_frequency are set to 1. In this case, the total number of models generated is equal to episodes * steps, which requires the most synchronization and the most total time.

We parse this data a little further to explore how this behavior occurs when we change the number of actors (while maintaining a single learner).

(Figure: training time versus number of episodes as the number of actors is scaled.)

Here, we see this trend continues as we scale the number of actors.