Distributed training with PyTorch 2+ results in a memory leak #262

Open
YJonmo opened this issue Feb 22, 2024 · 4 comments

Comments

@YJonmo

YJonmo commented Feb 22, 2024

Thanks for this work.

I was trying to train the model in this conda environment:

pytorch                   2.1.2           py3.11_cuda11.8_cudnn8.7.0_0    pytorch
pytorch-cuda              11.8                 h7e8668a_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch

But it ended up consuming all of the system RAM after a few hours.

When I downgraded torch, it worked:

pytorch                   1.13.1          py3.10_cuda11.7_cudnn8.5.0_0    pytorch
pytorch-cuda              11.7                 h778d358_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch
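
For reference, the downgrade can be pinned with a conda command along these lines (the exact build strings are whatever the pytorch and nvidia channels resolve on your system):

conda install pytorch==1.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia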

Here is the error message from the PyTorch 2 run once the memory filled up:

...
[GPU0] Training epoch: 14
  0%|                                                                                                                  | 0/2948 [00:00<?, ?it/s][GPU1] Training epoch: 14
  8%|████████                                                                                                | 228/2948 [01:49<12:51,  3.53it/s][GPU0] Model saved
 25%|█████████████████████████▋                                                                              | 728/2948 [04:39<16:02,  2.31it/s][GPU0] Model saved
 42%|██████████████████████████████████████████▉                                                            | 1228/2948 [07:49<10:03,  2.85it/s][GPU0] Model saved
 59%|████████████████████████████████████████████████████████████▎                                          | 1728/2948 [10:39<06:38,  3.06it/s][GPU0] Model saved
 76%|█████████████████████████████████████████████████████████████████████████████▊                         | 2228/2948 [13:54<04:46,  2.51it/s][GPU0] Model saved
 93%|███████████████████████████████████████████████████████████████████████████████████████████████▎       | 2728/2948 [17:08<01:15,  2.92it/s][GPU0] Model saved
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2948/2948 [18:27<00:00,  2.66it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2948/2948 [18:27<00:00,  2.66it/s]
[GPU0] Validating at the start of epoch: 15
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2032/2032 [02:59<00:00, 11.34it/s]
[GPU0] Validation set average loss: 0.2255345582962036
[GPU0] Training epoch: 15
[GPU1] Training epoch: 15
  0%|                                                                                                                  | 0/2948 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/xx/DL/Repos/RobustVideoMatting/train_xx.py", line 501, in <module>
    mp.spawn(
  File "/home/xx/mambaforge/envs/pytorch-pip/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/xx/mambaforge/envs/pytorch-pip/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/xx/mambaforge/envs/pytorch-pip/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGKILL
 /home/xx/mambaforge/envs/pytorch-pip/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 138 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Is there any known reason for PyTorch 2 to leak memory when combined with distributed training?
I have disabled distributed training by modifying the code (roughly as sketched below), which avoids the leak under PyTorch 2. PyTorch 2 trains a lot faster than 1.x.
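
In case it is useful to anyone else, this is roughly the shape of the change I mean; the function and argument names are placeholders rather than the actual RobustVideoMatting training code:

import torch
import torch.multiprocessing as mp

def train(rank, world_size):
    # existing per-process training loop, unchanged
    ...

if __name__ == "__main__":
    use_ddp = False  # set back to True to restore the original mp.spawn + DDP path
    if use_ddp:
        # original path: one worker process per GPU
        world_size = torch.cuda.device_count()
        mp.spawn(train, args=(world_size,), nprocs=world_size)
    else:
        # workaround: a single process on a single GPU, no process group
        train(rank=0, world_size=1)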

@onmygame2

I tried training with a single GPU, but it still has the memory leak. I've tried a lot of methods but still can't fix it.
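
For what it's worth, one way to at least watch the growth is to log the host RSS every few hundred steps; this is just a generic snippet (psutil is a third-party package, not part of this repo):

import os
import psutil  # pip install psutil

proc = psutil.Process(os.getpid())

def log_rss(step):
    # resident set size of the training process, in GiB
    print(f"step {step}: host RSS {proc.memory_info().rss / 1024**3:.2f} GiB")

# inside the training loop:
# for step, batch in enumerate(dataloader):
#     ...
#     if step % 200 == 0:
#         log_rss(step)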

@SuyueLiu

Have you solved this problem?

@YJonmo
Author

YJonmo commented Jun 27, 2024

I am not using DDP anymore.

@SuyueLiu

Got it, thanks.
