Robust PLR #29

AmeenUrRehman · 2024-03-31T12:36:08Z

update the seed function

…v_evals

RyanNavillus

This looks good so far, make sure to test your code because I noticed a lot of errors that the Python runtime would have caught.

RyanNavillus · 2024-04-16T05:54:21Z

syllabus/curricula/plr/task_sampler.py


        elif self.replay_schedule == "proportionate":
            if proportion_seen >= self.rho and np.random.rand() < proportion_seen:
                return self._sample_replay_level()
            else:
-                return self._sample_unseen_level()
+                if self.robust_plr:
+                    return self._evaluate_unseen_level()


Suggested change

return self._evaluate_unseen_level()

self.update_with_episode_data(self._evaluate_unseen_level())

return self.sample(strategy=strategy)

I think this needs to be the same as above, evaluate instead of returning

RyanNavillus · 2024-04-16T06:03:16Z

syllabus/curricula/plr/task_sampler.py

+    def evaluate_task(self, task, env, get_action_and_value_fn, gamma, gae_lambda):
+
+        if env is None:
+            raise ValueError("Environment object is None. Please ensure it is properly initialized.")
+        obs = env.reset(next_task=task)
+        done = False
+        episode_data = {
+            'tasks': [],
+            'masks': [],
+            'rewards': [],
+            'returns': [],
+            'value_preds': [],
+            'policy_logits': []
+        }
+
+        while not done:
+            action, value = get_action_and_value_fn(obs)
+            obs, rew, done, info = env.step(action)
+
+            episode_data['tasks'].append(task)
+            episode_data['masks'].append(not done)
+            episode_data['rewards'].append(rew)
+            episode_data['value_preds'].append(value)
+            episode_data['policy_logits'].append(info['policy_logits'])
+
+        episode_data['returns'] = self.compute_returns(gamma, gae_lambda, episode_data['rewards'],
+                                                       episode_data['value_preds'])
+
+        return episode_data


Suggested change

def evaluate_task(self, task, env, get_action_and_value_fn, gamma, gae_lambda):

if env is None:

raise ValueError("Environment object is None. Please ensure it is properly initialized.")

obs = env.reset(next_task=task)

done = False

episode_data = {

'tasks': [],

'masks': [],

'rewards': [],

'returns': [],

'value_preds': [],

'policy_logits': []

}

while not done:

action, value = get_action_and_value_fn(obs)

obs, rew, done, info = env.step(action)

episode_data['tasks'].append(task)

episode_data['masks'].append(not done)

episode_data['rewards'].append(rew)

episode_data['value_preds'].append(value)

episode_data['policy_logits'].append(info['policy_logits'])

episode_data['returns'] = self.compute_returns(gamma, gae_lambda, episode_data['rewards'],

episode_data['value_preds'])

return episode_data

def evaluate_task(self, task, env, get_action_and_value_fn, gamma, gae_lambda):

if env is None:

raise ValueError("Environment object is None. Please ensure it is properly initialized.")

obs = env.reset(next_task=task)

done = False

rewards = []

masks = []

values = []

while not done:

action, value = get_action_and_value_fn(obs)

obs, rew, term, trunc, info = env.step(action)

rewards.append(rew)

masks.append(not (term or trunc))

values.append(value)

return = self.compute_returns(gamma, gae_lambda, rewards, values, masks)

return {

"tasks": task,

"masks": masks,

"rewards": rewards,

"value_preds": values,

"returns": return

}

I think you can simplify the code here. Also you'll need to be careful, the environments we're using are probably Gymnasium environments not Gym environments, meaning they'll return obs, rew, term, trunc, info rather than obs, rew, done, info.

RyanNavillus · 2024-04-16T06:10:30Z

syllabus/curricula/plr/task_sampler.py

+            raise NotImplementedError(
+                f"Unsupported replay schedule: {self.replay_schedule}. Must be 'fixed' or 'proportionate'.")
+
+    def update_with_episode_data(self, episode_data, score_function):


Make sure whenever you call this function, you're passing in the score_function. It looks like you're passing in different values each time you call it.

RyanNavillus · 2024-04-16T06:13:22Z

syllabus/curricula/plr/task_sampler.py

+            else:
+                # Otherwise, sample a new level
+                return self._sample_unseen_level()
+


This is a good start, but it's going to be very inefficient to do these evaluations in the main process. We'll probably want to batch and multiprocess them in the future, but for now this is good as a proof of concept.

RyanNavillus · 2024-04-17T18:24:37Z

syllabus/curricula/plr/plr_wrapper.py

+        return action, state_value.item()
+
+    return action_value_fn
+


Sorry for not noticing this earlier, but this doesn't make sense here. The user needs to provide an action_value_fn and we call it with the observations. We should assume they return the action and value in a good format (maybe with some asserts to check if its the correct shape). The user should pass it to the initializer of PrioritizedLevelReplay whenever they enable robust_plr.

RyanNavillus · 2024-04-17T18:25:30Z

syllabus/curricula/plr/plr_wrapper.py

-            return [self._task_sampler.sample() for _ in range(k)]
+            if self._robust_plr:
+                if self._eval_envs is None:
+                    raise ValueError("When robust_plr is enabled, eval_envs must not be None.")


Move this check to the initializer, we don't need to wait to check this.

RyanNavillus · 2024-04-17T18:26:33Z

syllabus/curricula/plr/plr_wrapper.py

+            if self._robust_plr:
+                if self._eval_envs is None:
+                    raise ValueError("When robust_plr is enabled, eval_envs must not be None.")
+                return [self._evaluate_task_and_update_score() for _ in range(k)]


Where is this function?

Also why are you defining the robust plr behavior here when you already have it defined in the task sampler?

RyanNavillus · 2024-04-19T05:18:29Z

syllabus/curricula/plr/plr_wrapper.py

+        if robust_plr:
+            self._task_sampler = TaskSampler(self.tasks, action_space=action_space, robust_plr=robust_plr, eval_envs=eval_envs, action_value_fn = action_value_fn, **task_sampler_kwargs_dict)
+        else:
+            self._task_sampler = TaskSampler(self.tasks, action_space=action_space, **task_sampler_kwargs_dict)


Suggested change

if robust_plr:

self._task_sampler = TaskSampler(self.tasks, action_space=action_space, robust_plr=robust_plr, eval_envs=eval_envs, action_value_fn = action_value_fn, **task_sampler_kwargs_dict)

else:

self._task_sampler = TaskSampler(self.tasks, action_space=action_space, **task_sampler_kwargs_dict)

self._task_sampler = TaskSampler(self.tasks, action_space=action_space, robust_plr=robust_plr, eval_envs=eval_envs, action_value_fn = action_value_fn, **task_sampler_kwargs_dict)

You don't need to check for robust_plr here, just pass the None values through

RyanNavillus · 2024-04-19T05:20:57Z

syllabus/curricula/plr/task_sampler.py

+
+        self._update_staleness(task_idx)
+
+        return task


Isn't this function already defined above?

RyanNavillus · 2024-04-19T05:23:36Z

syllabus/curricula/plr/task_sampler.py

+                    return task
+                else:
+                    return self._sample_unseen_level()
+


Why do you need to do this? If you just call sample again recursively, won't it recalculate num_seen and proportion seen for you? Let me know if I'm missing something that prevents you from doing that.

RyanNavillus · 2024-04-19T05:26:02Z

syllabus/curricula/plr/task_sampler.py

+                score = score_function(**score_function_kwargs)
+                self._last_score = score
+                num_steps = len(episode_data['tasks'][start_t:, actor_index])
+                self._partial_update_task_score(actor_index, task_idx_t, score, num_steps)


Could you reduce the code duplication between update_with_rollouts and this? Maybe create an inner function that does most of the computation, and a helper function that converts rollouts to an episode_data dictionary.

So update_with_rollouts will first move data into the episode_data dictionary and then call update_with_episode_data

RyanNavillus · 2024-04-19T05:28:21Z

syllabus/curricula/plr/task_sampler.py

+            "rewards": rewards,
+            "value_preds": values,
+            "returns": returns
+        }


Once you clean up everything else, we should see how fast this is and come up with some ideas to optimize it

RyanNavillus · 2024-04-22T00:08:34Z

syllabus/curricula/plr/plr_wrapper.py

@@ -206,7 +206,7 @@ def __init__(
        get_value=null,
        get_action_log_dist=null,
        robust_plr: bool = False,  # Option to use RobustPLR
-        eval_envs=None,
+        eval_envs: List[gym.Env] = None,


This should probably be a gym.VectorEnv

RyanNavillus · 2024-04-25T04:40:36Z

syllabus/curricula/plr/task_sampler.py

-                            break
-
-                    return task
+                    self.update_with_episode_data(self._evaluate_unseen_level())


You call self.update_with_episode_data here and within self._evaluate_unseen_level. You should only keep one of them.

Suggested change

self.update_with_episode_data(self._evaluate_unseen_level())

self._evaluate_unseen_level()

RyanNavillus · 2024-04-25T04:41:19Z

syllabus/curricula/plr/task_sampler.py

@@ -363,44 +401,32 @@ def sample(self, strategy=None):
                if np.random.rand() > self.nu or not proportion_seen < 1.0:
                    return self._sample_replay_level()

-            # Otherwise, evaluate a new level
+            # Otherwise, sample a new level
            if self.robust_plr:
                self.update_with_episode_data(self._evaluate_unseen_level())


You call self.update_with_episode_data here and within self._evaluate_unseen_level. You should only keep one of them.

Suggested change

self.update_with_episode_data(self._evaluate_unseen_level())

self._evaluate_unseen_level()

RyanNavillus · 2024-04-25T04:42:35Z

syllabus/curricula/plr/task_sampler.py

        done = False
        rewards = []
        masks = []
        values = []

        while not done:
            action, value = action_value_fn(obs)
+
+            if isinstance(action, np.ndarray):
+                action = int(action[0])


You shouldn't do this, instead make sure that action_value_fn returns a single action rather than an np.ndarray

RyanNavillus · 2024-04-25T04:43:55Z

syllabus/curricula/plr/task_sampler.py

+            gae_scal = gae[0] if isinstance(gae, np.ndarray) else gae
+            value_scal = values[step][0] if isinstance(values[step], np.ndarray) else values[step]
+
+            returns[step] = gae_scal + value_scal


Why are you changing this function? You should not need to modify the GAE code, it's already correct.

RyanNavillus · 2024-04-25T04:44:42Z

syllabus/curricula/plr/task_sampler.py

+
+        self._update_staleness(task_idx)
+
+        return task


Suggested change

return task

No need to return the task here, see my other comments

RyanNavillus · 2024-04-25T04:47:32Z

tests/multiprocessing_smoke_tests.py

+           "num_steps": 2048,
+           "robust_plr": True,
+           "eval_envs": create_nethack_env(),
+           "action_value_fn": get_action_value


If you need this to output a single action and value for now, just do something simple for testing

def get_action_value(obs): return 0, 0

…robust_plr

…obust_plr

RyanNavillus

Just some cleanup and minor errors to correct

RyanNavillus · 2024-05-07T21:23:28Z

syllabus/examples/training_scripts/cleanrl_procgen_plr.py

@@ -282,6 +307,7 @@ def get_value(obs):
    )
    envs = wrap_vecenv(envs)

+
    assert isinstance(envs.single_action_space, gym.spaces.Discrete), "only discrete action space is supported"
    print("Creating agent")
    agent = ProcgenAgent(


We shouldn't be creating the agent twice, please remove this

RyanNavillus · 2024-05-07T21:23:45Z

syllabus/examples/training_scripts/cleanrl_procgen_plr.py

@@ -311,6 +337,7 @@ def get_value(obs):
    completed_episodes = 0

    for update in range(1, num_updates + 1):
+        print("Update", update)


Remove print statements throughout code please

RyanNavillus · 2024-05-07T21:23:53Z

syllabus/examples/training_scripts/cleanrl_procgen_plr.py

    def get_value(obs):
        obs = np.array(obs)
+        print(obs.shape)


RyanNavillus · 2024-05-07T21:23:58Z

syllabus/curricula/plr/task_sampler.py

+                self._last_score = score
+                num_steps = len(episode_data["tasks"][start_t:, actor_index])
+                self._partial_update_task_score(actor_index, task_idx_t, score, num_steps)
+        print("Updated")


RyanNavillus · 2024-05-07T21:26:58Z

syllabus/curricula/plr/task_sampler.py

+                f"Unsupported replay schedule: {self.replay_schedule}. Must be 'fixed' or 'proportionate'.")
+
+    def _update_with_episode_data(self, episode_data, score_function):
+        print("Updating")


Remove print

RyanNavillus · 2024-05-07T21:31:53Z

syllabus/curricula/plr/task_sampler.py

+    def evaluate_task(self, task, env, action_value_fn):
+        if env is None:
+            raise ValueError("Environment object is None. Please ensure it is properly initialized.")
+        print("Evaluating")


Remove print

RyanNavillus · 2024-05-07T21:33:03Z

syllabus/curricula/plr/task_sampler.py

+            action, value = action_value_fn(obs)
+
+            obs, rew, term, trunc, _ = env.step(action)
+
+            task_encoded = self.task_space.encode(task)
+
+            mask = torch.FloatTensor([0.0] if term or trunc else [1.0])
+            self._robust_rollouts.insert(mask, value_preds=value, rewards=torch.Tensor([rew]), tasks=torch.Tensor([task_encoded]))
+
+
+            # Check if the episode is done
+            if term or trunc:
+                done = True
+


Suggested change

action, value = action_value_fn(obs)

obs, rew, term, trunc, _ = env.step(action)

task_encoded = self.task_space.encode(task)

mask = torch.FloatTensor([0.0] if term or trunc else [1.0])

self._robust_rollouts.insert(mask, value_preds=value, rewards=torch.Tensor([rew]), tasks=torch.Tensor([task_encoded]))

# Check if the episode is done

if term or trunc:

done = True

action, value = action_value_fn(obs)

obs, rew, term, trunc, _ = env.step(action)

task_encoded = self.task_space.encode(task)

mask = torch.FloatTensor([0.0] if term or trunc else [1.0])

self._robust_rollouts.insert(mask, value_preds=value, rewards=torch.Tensor([rew]), tasks=torch.Tensor([task_encoded]))

# Check if the episode is done

if term or trunc:

done = True

Remove spaces

RyanNavillus · 2024-05-07T21:33:20Z

syllabus/curricula/plr/task_sampler.py

+
+        _, next_value = action_value_fn(obs)
+        self._robust_rollouts.compute_returns(next_value, self.gamma, self.gae_lambda)
+        print("Evaluated")


Remove print

RyanNavillus · 2024-05-07T21:35:01Z

syllabus/curricula/plr/task_sampler.py

+                if self.robust_plr:
+                    self._evaluate_unseen_level()
+                    return self.sample(strategy=strategy)


Suggested change

if self.robust_plr:

self._evaluate_unseen_level()

return self.sample(strategy=strategy)

if self.robust_plr:

self._evaluate_unseen_level()

self._robust_rollouts.after_update()

self.after_update()

return self.sample(strategy=strategy)

RyanNavillus · 2024-05-07T21:35:39Z

syllabus/tests/utils.py

    env = NethackTaskWrapper(env)

+


Remove space

Suggested change

RyanNavillus

Looks great. 1 Minor change you missed and this should be good!

RyanNavillus · 2024-05-09T03:18:03Z

syllabus/curricula/plr/task_sampler.py

+        advantages = np.asarray(np.abs(returns - value_preds))

-        advantages = returns - value_preds
-
-        return advantages.abs().mean().item()
+        return advantages.mean().item()


This code should be

advantages = returns - value_preds return advantages.abs().mean().item()

minor update to accept _venv

2b7e080

AmeenUrRehman changed the title ~~minor update to accept _venv~~ Robust PLR Apr 7, 2024

AmeenUrRehman added 2 commits April 8, 2024 00:28

updated the task_sampler to accept evaluation task ,robust_plr and en…

d098b40

…v_evals

revert the changes for seed function

3b3cea1

RyanNavillus reviewed Apr 16, 2024

View reviewed changes

AmeenUrRehman added 2 commits April 17, 2024 21:36

fixs the issues and update the functions

39aa1bc

update the plr_wrapper to accept the robust_plr option

992bdab

RyanNavillus reviewed Apr 18, 2024

View reviewed changes

AmeenUrRehman added 2 commits April 18, 2024 21:59

clean-up code and initialize the action_value_fn and callables

d92f64c

optimize the sample_fn and update tp accept the action_value_fn

d24434b

RyanNavillus reviewed Apr 19, 2024

View reviewed changes

AmeenUrRehman added 7 commits April 21, 2024 09:36

add the robust_plr option for existence plr

03b6b77

clean-up code for plr_wrapper and fixes minor issues

cda5ddb

optimize code based on the suggestions

908688d

add user defined get_action_value func

8364ff9

fixes the eval_envs parameters

cb71cb8

updating the task_sampler to fix issues

75697e5

fix errors and optimize codes

a6e2b31

RyanNavillus reviewed Apr 25, 2024

View reviewed changes

AmeenUrRehman and others added 11 commits April 26, 2024 01:39

added other parameters for Task Sampler initialization

c6b7ade

initialisation for get_value_fn

16fd2dc

added rollout class for task_sampler and fixed the GAE.

91d7060

updated the task_Sampler to fix errors.

e40c3ad

Fix Robust PLR

4a1d9dc

Merge branch 'main' of https://github.com/RyanNavillus/Syllabus into …

adbf1c5

…robust_plr

Merge branch 'main' of https://github.com/RyanNavillus/Syllabus into …

2ed4865

…robust_plr

Merge branch 'main' of https://github.com/RyanNavillus/Syllabus into …

b3afe49

…robust_plr

Fix procgen script for robust plr

85c7504

pulled everything and continue on robust_plr

2b281e6

Add Storage file

1cc51c2

RyanNavillus and others added 3 commits May 2, 2024 14:45

Merge branch 'robust_plr' of github.com:AmeenUrRehman/Syllabus into r…

be3753d

…obust_plr

updated the curricula to accept robust_plr

81fa16e

fixes errors, minor changes and updated the task_sampler

2ea5979

RyanNavillus reviewed May 7, 2024

View reviewed changes

minor cleanup

31c63fe

RyanNavillus reviewed May 9, 2024

View reviewed changes

minor cleanup final

b96686d

	return self._evaluate_unseen_level()
	self.update_with_episode_data(self._evaluate_unseen_level())
	return self.sample(strategy=strategy)

	self.update_with_episode_data(self._evaluate_unseen_level())
	self._evaluate_unseen_level()

Robust PLR #29

Are you sure you want to change the base?

Robust PLR #29

Conversation

AmeenUrRehman commented Mar 31, 2024 • edited Loading

RyanNavillus left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RyanNavillus left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

RyanNavillus left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AmeenUrRehman commented Mar 31, 2024 •

edited

Loading