Lasertag self play #26

Open · RPegoud wants to merge 57 commits into main

Conversation

RPegoud commented on Mar 28, 2024

New features:

  • Parallel PettingZoo (PZ) wrapper for Lasertag
  • Self-play Curriculum
  • PPO training script using self-play on Lasertag
class SelfPlay(Curriculum):
    def __init__(self, agent, device: str, store_agents_on_cpu: bool = False):
        self.store_agents_on_cpu = store_agents_on_cpu
        self.storage_device = "cpu" if self.store_agents_on_cpu else device
        self.agent = deepcopy(agent).to(self.storage_device)

    def update_agent(self, agent):
        self.agent = deepcopy(agent).to(self.storage_device)

    def get_opponent(self):
        # Always return the most recent agent
        return self.agent

    def sample(self, k=1):
        return 0
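
For context, a minimal sketch of how this curriculum could be driven from a training loop (not code from the PR; `agent`, `env`, `num_iterations`, and `train_one_iteration` are placeholders):

# Hypothetical driver loop illustrating the SelfPlay API above.
curriculum = SelfPlay(agent=agent, device="cpu", store_agents_on_cpu=True)

for iteration in range(num_iterations):
    opponent = curriculum.get_opponent()        # frozen snapshot used as the opponent
    train_one_iteration(agent, opponent, env)   # PPO update for the learning agent (placeholder)
    curriculum.update_agent(agent)              # store the latest weights for future self-play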

RPegoud marked this pull request as draft on March 28, 2024.

RyanNavillus (Owner) left a comment:
Looks great! I left a couple of comments and suggestions; I'll message you on Discord about the FSP/PFSP API.

lasertag/register.py Outdated (resolved)
lasertag_ppo.py Outdated
self.task = None
self.episode_return = 0
self.task_space = TaskSpace(spaces.MultiDiscrete(np.array([[2], [5]])))
self.possible_agents = np.arange(self.n_agents)

RyanNavillus (Owner):

Ideally this should be a list of agent names

Suggested change:
- self.possible_agents = np.arange(self.n_agents)
+ self.possible_agents = [f"agent_{i}" for i in range(self.n_agents)]

lasertag_ppo.py Outdated
out = {}
for idx, i in enumerate(array):
    out[str(idx)] = i
return out

RyanNavillus (Owner):

Could these variable names be more descriptive? Also, I assume `str(idx)` should instead be `self.possible_agents[idx]` if you change that to strings.
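
A sketch of how the snippet might look with both suggestions applied (assuming `possible_agents` is switched to string names; the helper name is illustrative, not from the PR):

def to_agent_dict(self, values):
    # Map the i-th entry of a batched array back to the i-th agent's name.
    return {
        self.possible_agents[agent_idx]: value
        for agent_idx, value in enumerate(values)
    }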

lasertag_ppo.py Outdated
"""
Broadcasts the `done` and `trunc` flags to dictionaries keyed by agent id.
"""
return {str(idx): value for idx in range(self.n_agents)}

RyanNavillus (Owner):

Same thing here; maybe just renaming idx to agent_idx is enough.

lasertag_ppo.py Outdated
action = batchify(action, device)
obs, rew, done, info = self.env.step(action)
obs = obs["image"]
trunc = 0 # there is no `truncated` flag in this environment

RyanNavillus (Owner):

Suggested change:
- trunc = 0  # there is no `truncated` flag in this environment
+ trunc = False  # there is no `truncated` flag in this environment

lasertag_ppo.py Outdated
probs = Categorical(logits=logits)
if action is None:
    action = probs.sample()
return action, probs.log_prob(action), probs.entropy(), self.critic(hidden)

RyanNavillus (Owner):

I didn't review the agent code super closely, but it looks good.

lasertag_ppo.py Outdated
# convert to torch
obs = torch.tensor(obs).to(device)

return obs

RyanNavillus (Owner):

Is there any reason not to just use batchify? They do the exact same thing, right?


RPegoud (Author):

That's right. The original PettingZoo code had an extra `obs = obs.transpose(0, -1, 1, 2)` in `batchify_obs`, which I removed, so there's no need for this function anymore.
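
For reference, a sketch of the single helper the thread converges on (reconstructed from the snippets quoted above; the exact signature in the PR may differ):

import numpy as np
import torch

def batchify(x, device):
    # Stack a dict of per-agent arrays into a single batched tensor (no transpose needed).
    x = np.stack([x[agent] for agent in x], axis=0)
    return torch.tensor(x).to(device)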

lasertag_ppo.py Outdated
x = x.cpu().numpy()
x = {a: x[i] for i, a in enumerate(env.possible_agents)}

return x

RyanNavillus (Owner):

I assume this duplicated code isn't intentional

lasertag_ppo.py Outdated
print(f"Approx KL: {approx_kl.item()}")
print(f"Clip Fraction: {np.mean(clip_fracs)}")
print(f"Explained Variance: {explained_var.item()}")
print("\n-------------------------------------------\n")

RyanNavillus (Owner):

You could consider adding Weights & Biases integration (it's only about 20 lines of code; check out cleanrl_procgen_plr.py). It's a good tool that generates plots for you automatically, and most RL tools integrate with it. You'll need to learn it eventually for this project, but if you're happy with how you're doing things, you can try wandb out later.


RPegoud (Author):

Sounds like a good improvement. I just used the logging from the CleanRL script, but wandb should provide a clearer overview of training progress.
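
For reference, a minimal sketch of the kind of wandb logging being suggested (metric names follow the print statements above; the project name, `args`, and `global_step` are placeholders):

import wandb

wandb.init(project="syllabus-lasertag-selfplay", config=vars(args))  # args: the script's hyperparameters

# Inside the update loop, instead of (or alongside) the prints:
wandb.log(
    {
        "losses/approx_kl": approx_kl.item(),
        "losses/clip_fraction": np.mean(clip_fracs),
        "losses/explained_variance": explained_var.item(),
    },
    step=global_step,
)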

syllabus/task_space/task_space.py (resolved)

RyanNavillus (Owner) left a comment:

Could you put the agent curricula into a file syllabus/curricula/selfplay.py? Also please move the DualCurriculumWrapper to syllabus/core/dual_curriculum_wrapper.py.

RPegoud marked this pull request as ready for review on April 18, 2024.

RyanNavillus (Owner) left a comment:

Thanks for moving everything into Syllabus, it looks great! Could you also add some test cases to the multiagent smoke tests (maybe using PettingZoo's chess environment) to exercise the self-play algorithms? We just want something quick that runs through the code to make sure there are no basic runtime errors.
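
A rough sketch of the kind of smoke test being requested (the PettingZoo chess module version and the SelfPlay import path are assumptions):

import torch
from pettingzoo.classic import chess_v6  # any small multi-agent env would do
from syllabus.curricula import SelfPlay  # assumed import path after the move into Syllabus

def test_selfplay_smoke():
    # A tiny stand-in policy; it only needs to survive deepcopy(...).to(device).
    agent = torch.nn.Linear(4, 4)
    curriculum = SelfPlay(agent=agent, device="cpu")
    curriculum.update_agent(agent)
    assert curriculum.get_opponent() is not None
    assert curriculum.sample() == 0

    # Make sure the environment itself also constructs and resets cleanly.
    env = chess_v6.env()
    env.reset(seed=0)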

self.agent_mp_curriculum, self.agent_task_queue, self.agent_update_queue = (
    make_multiprocessing_curriculum(agent_curriculum)
)
self.sample()  # initializes env_task and agent_task

RyanNavillus (Owner):

This initializer seems hacky. I don't think we should be internally making multiprocessing curricula. Instead, let the user create two curricula, pass them to this wrapper, and then wrap the dual curriculum in a multiprocessing wrapper in their own code. This wrapper should implement update functions that pass updates to both the agent and environment curriculum. For example, the update_on_step function should call both self.env_curriculum.update_on_step and self.agent_curriculum.update_on_step.

I also don't see why we need to call sample here.


RyanNavillus (Owner):

The initializer might also need to update the task space, so the task space of this curriculum should be Tuple(env_curriculum.task_space, agent_curriculum.task_space).

I'll try to merge Nistha's task space updates in today so that you can pull them in here.
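
Putting the two comments above together, a rough sketch of the structure being suggested (the exact Syllabus Curriculum/TaskSpace interfaces may differ, so the update arguments are left generic and the task-space line is only indicated):

class DualCurriculumWrapper:
    """Compose a user-created environment curriculum and agent curriculum."""

    def __init__(self, env_curriculum, agent_curriculum):
        self.env_curriculum = env_curriculum
        self.agent_curriculum = agent_curriculum
        # Combined task space, as described above (exact TaskSpace/Tuple construction assumed):
        # self.task_space = Tuple(env_curriculum.task_space, agent_curriculum.task_space)

    def update_on_step(self, *args, **kwargs):
        # Forward step updates to both curricula instead of handling them internally.
        self.env_curriculum.update_on_step(*args, **kwargs)
        self.agent_curriculum.update_on_step(*args, **kwargs)

    def sample(self, k=1):
        # An (environment task, opponent) pair.
        return self.env_curriculum.sample(k=k), self.agent_curriculum.sample(k=k)

# The user then applies the multiprocessing wrapper once, in their own code:
# curriculum, task_queue, update_queue = make_multiprocessing_curriculum(
#     DualCurriculumWrapper(env_curriculum, agent_curriculum)
# )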

syllabus/curricula/selfplay.py Outdated (resolved)
its priority.
"""
if self.n_stored_agents < self.max_agents:
    # TODO: define the expected behaviour when the limit is exceeded

RyanNavillus (Owner):

We probably should delete or overwrite agents when this happens.


RPegoud (Author):

Then I suggest this:

joblib.dump(
    agent,
    filename=f"{self.storage_path}/{self.name}_agent_checkpoint_{self.current_agent_index % self.max_agents}.pkl",
)
self.current_agent_index += 1


RyanNavillus (Owner):

If we go this route, you should save the most recent agent to a file somewhere, since it won't be obvious from the filenames which one it is. It might be better to just delete a file and write a new one, but this is fine for now.
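
One way to make that concrete (a method sketch only; the attribute names follow the snippet above, and the separate "latest" marker file is an added assumption, not something from the PR):

import json
import joblib

def store_agent(self, agent):
    # Overwrite checkpoint slots in a ring buffer of size max_agents.
    slot = self.current_agent_index % self.max_agents
    checkpoint_path = f"{self.storage_path}/{self.name}_agent_checkpoint_{slot}.pkl"
    joblib.dump(agent, filename=checkpoint_path)
    # Record which slot is the most recent, since the filenames alone won't show it.
    with open(f"{self.storage_path}/{self.name}_latest.json", "w") as f:
        json.dump({"latest_checkpoint": checkpoint_path, "agent_index": self.current_agent_index}, f)
    self.current_agent_index += 1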

RyanNavillus changed the base branch from nmmo to lasertag on May 21, 2024.
RyanNavillus changed the base branch from lasertag to main on November 24, 2024.