
better manually clean up gpu memory when loading motions #98

Open
luoye2333 opened this issue Nov 13, 2024 · 2 comments

Comments

@luoye2333

I often hit CUDA out of memory during model evaluation (which is called periodically, after every 1500 training iterations).

In motion_lib_real.py line 199 we load the motions into memory, transfer them into GPU tensors, and assign them to class variables (e.g. self.gts). Perhaps the tensors previously held in self.gts are not cleaned up automatically:

self.gts = torch.cat([m.global_translation for m in motions], dim=0).float().to(self._device)
self.grs = torch.cat([m.global_rotation for m in motions], dim=0).float().to(self._device)
self.lrs = torch.cat([m.local_rotation for m in motions], dim=0).float().to(self._device)
self.grvs = torch.cat([m.global_root_velocity for m in motions], dim=0).float().to(self._device)
self.gravs = torch.cat([m.global_root_angular_velocity for m in motions], dim=0).float().to(self._device)
self.gavs = torch.cat([m.global_angular_velocity for m in motions], dim=0).float().to(self._device)
self.gvs = torch.cat([m.global_velocity for m in motions], dim=0).float().to(self._device)
self.dvs = torch.cat([m.dof_vels for m in motions], dim=0).float().to(self._device)

So it is better to manually clear the cache before loading:

self.gts, self.grs, self.lrs, self.grvs, self.gravs, self.gavs, self.gvs, self.dvs = None, None, None, None, None, None, None, None
gc.collect(); torch.cuda.empty_cache()

The same applies at line 208:

self.gts_t, self.grs_t, self.gvs_t, self.gavs_t = None, None, None, None
gc.collect(); torch.cuda.empty_cache()

and at line 214:

self.dof_pos = None
gc.collect(); torch.cuda.empty_cache()

This lets me train on a single RTX 4090. But I'm not sure this is the whole story; it is weird that the memory is not cleaned up automatically after assigning new data to the old variables.
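For what it's worth, a minimal sketch (plain Python objects standing in for GPU tensors; `FakeTensor`, `FakeMotionLib`, and its `load_motions` are made-up stand-ins, not the real classes) suggests plain reassignment does release the old tensor once nothing else references it. The catch is that the right-hand side of `self.gts = torch.cat(...)` is fully built while the old `self.gts` is still alive, so both copies coexist at the peak, and PyTorch's caching allocator keeps freed blocks reserved for reuse rather than returning them to the driver until `torch.cuda.empty_cache()`:

```python
import gc
import weakref

class FakeTensor:
    """Stand-in for a large GPU tensor (supports weak references)."""
    def __init__(self, nbytes):
        self.buf = bytearray(nbytes)

class FakeMotionLib:
    def load_motions(self, nbytes):
        # The new buffer is constructed BEFORE the assignment rebinds
        # self.gts, so old and new buffers briefly coexist (the peak).
        self.gts = FakeTensor(nbytes)

lib = FakeMotionLib()
lib.load_motions(1_000_000)
old = weakref.ref(lib.gts)     # watch the first buffer

lib.load_motions(2_000_000)    # rebinding drops the last reference
gc.collect()
assert old() is None           # the old buffer really was freed
```

So the OOM is plausibly caused by the transient old+new peak (plus allocator fragmentation) rather than tensors leaking outright, which would be consistent with the explicit `None` + `gc.collect()` + `empty_cache()` pattern helping.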

@luoye2333
Author

luoye2333 commented Nov 13, 2024

I found something strange: there is already cleanup code in motion_lib_real.py at line 77, but it is commented out.

# if "gts" in self.__dict__:
#     del self.gts, self.grs, self.lrs, self.grvs, self.gravs, self.gavs, self.gvs, self.dvs
#     del self._motion_lengths, self._motion_fps, self._motion_dt, self._motion_num_frames, self._motion_bodies, self._motion_aa
#     if "gts_t" in self.__dict__:
#         self.gts_t, self.grs_t, self.gvs_t
#     if flags.real_traj:
#         del self.q_gts, self.q_grs, self.q_gavs, self.q_gvs

I changed it to this (note that the commented-out gts_t line above is a bare expression with no del, so it would have been a no-op anyway):

if "gts" in self.__dict__:
    del self.gts, self.grs, self.lrs, self.grvs, self.gravs, self.gavs, self.gvs, self.dvs
    del self._motion_lengths, self._motion_fps, self._motion_dt, self._motion_num_frames, self._motion_bodies, self._motion_aa
if "gts_t" in self.__dict__:
    del self.gts_t, self.grs_t, self.gvs_t, self.gavs_t
if "dof_pos" in self.__dict__:
    del self.dof_pos
if flags.real_traj:
    del self.q_gts, self.q_grs, self.q_gavs, self.q_gvs

@luoye2333
Author

GPU memory usage can be cut down further by clearing variables after the last evaluation. The variables in env._motion_eval_lib are not cleared when _motion_lib is switched back to _motion_train_lib after evaluation finishes, and we don't need them during training. They would be reloaded at the next evaluation anyway, after another 1500 training epochs.

In phc/learning/im_amp.py at line 227:

humanoid_env._motion_eval_lib.clear_cache() # add this
humanoid_env._motion_lib = humanoid_env._motion_train_lib

In phc/utils/motion_lib_real.py, add this function:

def clear_cache(self):
    if "gts" in self.__dict__:
        del self.gts, self.grs, self.lrs, self.grvs, self.gravs, self.gavs, self.gvs, self.dvs
        del self._motion_lengths, self._motion_fps, self._motion_dt, self._motion_num_frames, self._motion_bodies, self._motion_aa
    if "gts_t" in self.__dict__:
        del self.gts_t, self.grs_t, self.gvs_t, self.gavs_t
    if "dof_pos" in self.__dict__:
        del self.dof_pos
    if flags.real_traj:
        del self.q_gts, self.q_grs, self.q_gavs, self.q_gvs

It would also be possible to clear _motion_train_lib when entering evaluation, around line 178 of im_amp.py, but memory usage seems fine without that.

A typical GPU memory usage timeline with num_envs=2048:
5 GB is allocated by the gym simulation
12.5 GB to load the training variables
5 GB (at peak) to load the evaluation variables
After evaluation, usage comes back down to 5 + 12.5 GB.
(screenshot: gpu0_memory_usage)
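Summing the numbers above (my arithmetic; 24 GB is the RTX 4090's memory size), the evaluation peak only just fits, which is why an extra stale copy of the eval variables can tip it into OOM:

```python
sim, train, eval_peak = 5.0, 12.5, 5.0  # GB, from the timeline above

peak = sim + train + eval_peak   # GB during evaluation
steady = sim + train             # GB between evaluations
print(peak, steady)              # 22.5 17.5 -- vs the 4090's 24 GB
```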
