Lasertag self play #26
base: main
Conversation
Looks great! I left a couple of comments and suggestions; I'll message you on Discord about the FSP/PFSP API.
lasertag_ppo.py
Outdated
self.task = None
self.episode_return = 0
self.task_space = TaskSpace(spaces.MultiDiscrete(np.array([[2], [5]])))
self.possible_agents = np.arange(self.n_agents)
Ideally this should be a list of agent names
- self.possible_agents = np.arange(self.n_agents)
+ self.possible_agents = [f"agent_{i}" for i in range(self.n_agents)]
lasertag_ppo.py
Outdated
out = {}
for idx, i in enumerate(array):
    out[str(idx)] = i
return out
Could these variable names be more descriptive? Also, I assume `str(idx)` should instead be `self.possible_agents[idx]` if you change that to strings.
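For reference, a minimal sketch of what that suggestion might look like, assuming this helper turns a stacked per-agent array into a dict and that `possible_agents` is a list of agent name strings (the function name `unbatchify` here is just illustrative):

```python
def unbatchify(self, array):
    """Convert a stacked per-agent array into a dict keyed by agent name."""
    # assumes self.possible_agents is a list of names, e.g. ["agent_0", "agent_1"]
    return {agent_name: value
            for agent_name, value in zip(self.possible_agents, array)}
```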
lasertag_ppo.py
Outdated
""" | ||
Broadcasts the `done` and `trunc` flags to dictionaries keyed by agent id. | ||
""" | ||
return {str(idx): value for idx in range(self.n_agents)} |
Same thing here; maybe just renaming `idx` to `agent_idx` is enough.
lasertag_ppo.py
Outdated
action = batchify(action, device)
obs, rew, done, info = self.env.step(action)
obs = obs["image"]
trunc = 0 # there is no `truncated` flag in this environment
- trunc = 0 # there is no `truncated` flag in this environment
+ trunc = False # there is no `truncated` flag in this environment
lasertag_ppo.py
Outdated
probs = Categorical(logits=logits)
if action is None:
    action = probs.sample()
return action, probs.log_prob(action), probs.entropy(), self.critic(hidden)
Didn't review the agent code super closely but it looks good
lasertag_ppo.py
Outdated
# convert to torch
obs = torch.tensor(obs).to(device)

return obs
Is there any reason not to just use `batchify`? They do the exact same thing, right?
That's right, the original PettingZoo code had an added `obs = obs.transpose(0, -1, 1, 2)` in `batchify_obs`, which I removed. So there's no need for this function anymore.
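For context, `batchify` in the CleanRL-style PettingZoo scripts is roughly the following (a sketch; the version in this PR may differ slightly):

```python
import numpy as np
import torch

def batchify(x, device):
    """Stack a dict of per-agent values into a single torch tensor."""
    # build an (n_agents, ...) array from the per-agent entries
    x = np.stack([x[agent] for agent in x], axis=0)
    return torch.tensor(x).to(device)
```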
lasertag_ppo.py
Outdated
x = x.cpu().numpy()
x = {a: x[i] for i, a in enumerate(env.possible_agents)}

return x
I assume this duplicated code isn't intentional
lasertag_ppo.py
Outdated
print(f"Approx KL: {approx_kl.item()}") | ||
print(f"Clip Fraction: {np.mean(clip_fracs)}") | ||
print(f"Explained Variance: {explained_var.item()}") | ||
print("\n-------------------------------------------\n") |
You could consider adding Weights & Biases (wandb) integration (it's only about 20 lines of code; check out cleanrl_procgen_plr.py). It's a good tool that generates plots for you automatically, and most RL tools integrate with it. You'll need to learn it eventually for this project, but if you're happy with how you're doing things you can try wandb out later.
Sounds like a good improvement. I just used whatever logging the CleanRL script used, but wandb should provide a clearer overview of training progress.
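A minimal sketch of the kind of wandb integration being suggested, assuming CleanRL-style variable names (`args`, `run_name`, `global_step`, and the project name are placeholders for whatever the script actually uses):

```python
import numpy as np
import wandb

wandb.init(
    project="syllabus-lasertag",  # assumed project name
    name=run_name,                # e.g. f"{env_id}__{seed}__{int(time.time())}"
    config=vars(args),            # log hyperparameters alongside metrics
)

# inside the training loop, next to the existing print statements
wandb.log(
    {
        "losses/approx_kl": approx_kl.item(),
        "losses/clip_fraction": np.mean(clip_fracs),
        "losses/explained_variance": explained_var.item(),
    },
    step=global_step,
)
```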
Could you put the agent curricula into a file `syllabus/curricula/selfplay.py`? Also please move the `DualCurriculumWrapper` to `syllabus/core/dual_curriculum_wrapper.py`.
Thanks for moving everything into Syllabus, it looks great! Could you also add some test cases to the multi-agent smoke tests (maybe using PettingZoo's chess environment) to test the self-play algorithms? We just want something quick that runs through the code to make sure there are no basic runtime errors.
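A rough sketch of what such a smoke test could look like, using PettingZoo's chess environment with random actions; the curriculum calls at the end (`sample`, `update_agent`) are placeholders for whatever the self-play API ends up being:

```python
import numpy as np
from pettingzoo.classic import chess_v6

def run_chess_smoke_test(curriculum, agent_checkpoint, episodes=2, seed=42):
    """Play a few random chess games and exercise the curriculum, just to
    surface basic runtime errors (not a correctness test)."""
    env = chess_v6.env()
    for episode in range(episodes):
        env.reset(seed=seed + episode)
        for agent in env.agent_iter():
            obs, reward, termination, truncation, info = env.last()
            if termination or truncation:
                action = None
            else:
                # pick a random legal move from the action mask
                legal_actions = np.flatnonzero(obs["action_mask"])
                action = int(np.random.choice(legal_actions))
            env.step(action)
        # placeholder calls into the self-play curriculum (names illustrative):
        # sample an opponent, then store the newest checkpoint
        curriculum.sample()
        curriculum.update_agent(agent_checkpoint)
    env.close()
```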
self.agent_mp_curriculum, self.agent_task_queue, self.agent_update_queue = (
    make_multiprocessing_curriculum(agent_curriculum)
)
self.sample() # initializes env_task and agent_task
This initializer seems hacky. I don't think we should be internally making multiprocessing curricula. Instead, let the user create two curricula, pass them to this, and then in their code wrap the dual curriculum in a multiprocessing wrapper. This wrapper should implement update functions that pass updates to both the agent and environment curricula. For example, the `update_on_step` function should call `self.env_curriculum.update_on_step` and `self.agent_curriculum.update_on_step`.
I also don't see why we need to call `sample` here.
The initializer might also need to update the task space, so the task space of this curriculum should be `Tuple(env_curriculum.task_space, agent_curriculum.task_space)`.
I'll try to merge Nistha's task space updates in today so that you can pull them in here.
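Putting both comments together, a rough sketch of the proposed shape. The `sample`/`update_on_step` signatures, the `gym_space` attribute, and the import paths are assumptions based on how Syllabus is used elsewhere in the PR, not the final API:

```python
from gymnasium import spaces
from syllabus.task_space import TaskSpace  # import path assumed

class DualCurriculumWrapper:
    """Pairs an environment curriculum with an agent (self-play) curriculum.
    The user constructs both curricula, wraps them with this class, and then
    wraps the result in the multiprocessing curriculum wrapper themselves."""

    def __init__(self, env_curriculum, agent_curriculum):
        self.env_curriculum = env_curriculum
        self.agent_curriculum = agent_curriculum
        # combined task space, as suggested above (attribute names assumed)
        self.task_space = TaskSpace(
            spaces.Tuple((
                env_curriculum.task_space.gym_space,
                agent_curriculum.task_space.gym_space,
            ))
        )

    def sample(self, k=1):
        # pair each environment task with an opponent
        env_tasks = self.env_curriculum.sample(k=k)
        agent_tasks = self.agent_curriculum.sample(k=k)
        return list(zip(env_tasks, agent_tasks))

    def update_on_step(self, *args, **kwargs):
        # forward step updates to both the env and agent curricula
        self.env_curriculum.update_on_step(*args, **kwargs)
        self.agent_curriculum.update_on_step(*args, **kwargs)
```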
syllabus/curricula/selfplay.py
Outdated
its priority.
"""
if self.n_stored_agents < self.max_agents:
    # TODO: define the expected behaviour when the limit is exceeded
We probably should delete or overwrite agents when this happens.
Then I suggest this:
joblib.dump(
agent,
filename=f"{self.storage_path}/{self.name}_agent_checkpoint_{self.current_agent_index % self.max_agents}.pkl",
)
self.current_agent_index += 1
If we go this route, you should save the most recent agent to a file somewhere, since it won't be obvious from the filenames. It might be better to just delete a file and write a new one, but this is fine for now.
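A small sketch of that bookkeeping, building on the snippet above (the "latest" file name is purely illustrative):

```python
import joblib

# overwrite the oldest checkpoint slot, as suggested above
slot = self.current_agent_index % self.max_agents
joblib.dump(
    agent,
    filename=f"{self.storage_path}/{self.name}_agent_checkpoint_{slot}.pkl",
)

# also record which index/slot is the most recent, since the slot number
# alone no longer makes that obvious (file name illustrative)
joblib.dump(
    {"latest_index": self.current_agent_index, "slot": slot},
    filename=f"{self.storage_path}/{self.name}_latest_agent.pkl",
)
self.current_agent_index += 1
```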
…wrapper API changes (still ongoing)
…lasertag_self_play
…ort, added Dict task space to dual curriculum
…pt to automate slurm jobs
…ments, refactoring
…lasertag_self_play
New features: