
better manually clean up gpu memory when loading motions #98

Open
luoye2333 opened this issue Nov 13, 2024 · 2 comments

Comments

@luoye2333

I often hit CUDA out of memory during model evaluation (which is called periodically, after every 1500 training iterations).

In motion_lib_real.py line 199 we load the motions into memory, transfer them into GPU tensors, and assign them to class variables (e.g. self.gts). Perhaps the tensors previously held in self.gts are not cleaned up automatically:

self.gts = torch.cat([m.global_translation for m in motions], dim=0).float().to(self._device)
self.grs = torch.cat([m.global_rotation for m in motions], dim=0).float().to(self._device)
self.lrs = torch.cat([m.local_rotation for m in motions], dim=0).float().to(self._device)
self.grvs = torch.cat([m.global_root_velocity for m in motions], dim=0).float().to(self._device)
self.gravs = torch.cat([m.global_root_angular_velocity for m in motions], dim=0).float().to(self._device)
self.gavs = torch.cat([m.global_angular_velocity for m in motions], dim=0).float().to(self._device)
self.gvs = torch.cat([m.global_velocity for m in motions], dim=0).float().to(self._device)
self.dvs = torch.cat([m.dof_vels for m in motions], dim=0).float().to(self._device)

So it is better to manually clear the cache before loading:

self.gts, self.grs, self.lrs, self.grvs, self.gravs, self.gavs, self.gvs, self.dvs = None, None, None, None, None, None, None, None
gc.collect(); torch.cuda.empty_cache()

The same applies at line 208:

self.gts_t, self.grs_t, self.gvs_t, self.gavs_t = None, None, None, None
gc.collect(); torch.cuda.empty_cache()

and at line 214:

self.dof_pos = None
gc.collect(); torch.cuda.empty_cache()

This lets me train on a single RTX 4090. But I'm not sure this is the whole story; it is weird that the memory is not cleaned up automatically after assigning new data to the old variables.
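For what it's worth, a minimal sketch (plain Python objects standing in for GPU tensors; `FakeTensor`, `FakeMotionLib`, and its `load_motions` are made-up stand-ins, not the real classes) suggests plain reassignment does release the old tensor once nothing else references it. The catch is that the right-hand side of `self.gts = torch.cat(...)` is fully built while the old `self.gts` is still alive, so both copies coexist at the peak, and PyTorch's caching allocator keeps freed blocks reserved for reuse rather than returning them to the driver until `torch.cuda.empty_cache()`:

```python
import gc
import weakref

class FakeTensor:
    """Stand-in for a large GPU tensor (supports weak references)."""
    def __init__(self, nbytes):
        self.buf = bytearray(nbytes)

class FakeMotionLib:
    def load_motions(self, nbytes):
        # The new buffer is constructed BEFORE the assignment rebinds
        # self.gts, so old and new buffers briefly coexist (the peak).
        self.gts = FakeTensor(nbytes)

lib = FakeMotionLib()
lib.load_motions(1_000_000)
old = weakref.ref(lib.gts)     # watch the first buffer

lib.load_motions(2_000_000)    # rebinding drops the last reference
gc.collect()
assert old() is None           # the old buffer really was freed
```

So the OOM is plausibly caused by the transient old+new peak (plus allocator fragmentation) rather than tensors leaking outright, which would be consistent with the explicit `None` + `gc.collect()` + `empty_cache()` pattern helping.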

@luoye2333
Author

luoye2333 commented Nov 13, 2024

I found something strange: there is already cleanup code in motion_lib_real.py at line 77, but it is commented out.

# if "gts" in self.__dict__:
#     del self.gts, self.grs, self.lrs, self.grvs, self.gravs, self.gavs, self.gvs, self.dvs
#     del self._motion_lengths, self._motion_fps, self._motion_dt, self._motion_num_frames, self._motion_bodies, self._motion_aa
#     if "gts_t" in self.__dict__:
#         self.gts_t, self.grs_t, self.gvs_t
#     if flags.real_traj:
#         del self.q_gts, self.q_grs, self.q_gavs, self.q_gvs

I changed it to this (note that the commented-out gts_t line above is a bare expression with no del, so it would have been a no-op anyway):

if "gts" in self.__dict__:
    del self.gts, self.grs, self.lrs, self.grvs, self.gravs, self.gavs, self.gvs, self.dvs
    del self._motion_lengths, self._motion_fps, self._motion_dt, self._motion_num_frames, self._motion_bodies, self._motion_aa
if "gts_t" in self.__dict__:
    del self.gts_t, self.grs_t, self.gvs_t, self.gavs_t
if "dof_pos" in self.__dict__:
    del self.dof_pos
if flags.real_traj:
    del self.q_gts, self.q_grs, self.q_gavs, self.q_gvs

@luoye2333
Author

GPU memory usage can be cut down further by clearing variables after the last evaluation. The variables in env._motion_eval_lib are not cleared when _motion_lib is switched back to _motion_train_lib after evaluation finishes, and we don't need them during training. They would be reloaded at the next evaluation anyway, after another 1500 training epochs.

In phc/learning/im_amp.py at line 227:

humanoid_env._motion_eval_lib.clear_cache() # add this
humanoid_env._motion_lib = humanoid_env._motion_train_lib

In phc/utils/motion_lib_real.py, add this function:

def clear_cache(self):
    if "gts" in self.__dict__:
        del self.gts, self.grs, self.lrs, self.grvs, self.gravs, self.gavs, self.gvs, self.dvs
        del self._motion_lengths, self._motion_fps, self._motion_dt, self._motion_num_frames, self._motion_bodies, self._motion_aa
    if "gts_t" in self.__dict__:
        del self.gts_t, self.grs_t, self.gvs_t, self.gavs_t
    if "dof_pos" in self.__dict__:
        del self.dof_pos
    if flags.real_traj:
        del self.q_gts, self.q_grs, self.q_gavs, self.q_gvs

It would also be possible to clear _motion_train_lib when entering evaluation, around line 178 of im_amp.py, but memory usage seems fine without that.

A typical GPU memory usage timeline with num_envs=2048:
5 GB is allocated by the gym simulation
12.5 GB to load the training variables
5 GB (at peak) to load the evaluation variables
After evaluation, usage comes back down to 5 + 12.5 GB.
(screenshot: gpu0_memory_usage)
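Summing the numbers above (my arithmetic; 24 GB is the RTX 4090's memory size), the evaluation peak only just fits, which is why an extra stale copy of the eval variables can tip it into OOM:

```python
sim, train, eval_peak = 5.0, 12.5, 5.0  # GB, from the timeline above

peak = sim + train + eval_peak   # GB during evaluation
steady = sim + train             # GB between evaluations
print(peak, steady)              # 22.5 17.5 -- vs the 4090's 24 GB
```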
